A primer on augmented data preparation

Speed your company’s time to insight with machine learning and other augmented analytics.

What is augmented data preparation?

Stated simply, augmented data preparation empowers businesspeople and other workers who lack deep expertise in data science and analytics to create rich, reliable data sets for analysis. Powered by machine learning (ML) and artificial intelligence (AI)—and delivered on an automated, self-service platform—augmented data preparation tools transform the process of finding and examining raw data and converting it into consumable forms. They don’t replace human intelligence and contextual awareness; they enhance it.

To gain competitive advantage, leaders, line-of-business managers, partners, and others rely on business intelligence (BI) and business analytics to provide them with accurate, timely, and relevant insights. Using augmented data preparation, your company can help decentralize and democratize data preparation so that more employees can help create those insights.

How are augmented data preparation tools used?

Augmented data preparation tools streamline the first and perhaps most important step in data processing—creating data sets needed to build, test, and train analytics models.

Traditionally, data preparation fell into the domain of technical teams that wrote code and used specialized software to extract data from internal operational systems, clean and structure it, and load it into data warehouses. Known as extract, transform, and load (ETL), these processes could be complex, time-consuming, and error-prone.

Most business users didn’t have the skills or time to carry out ETL work themselves. Even citizen data scientists (business analysts, developers, and others who lack formal data science training but perform some advanced analytics work) found themselves relying on data engineers and other data professionals to decide which data to analyze and how.

Times have changed. Now, organizations store huge volumes of structured, semi-structured, and unstructured data, including text and images, in multiple siloed applications and systems. Rarely do centralized IT and data management teams have the time and resources to gather and prepare data, much less model and study it, to support all of a company’s varied analytics initiatives.

Thanks to augmented data preparation tools, more people can step up and help. Featuring point-and-click, conversational interfaces, the tools steadily guide users through data-driven decisions related to data preparation.

What are the data preparation steps?

Also known as data wrangling or munging, the data preparation process comprises a series of sequential activities for integrating, structuring, and organizing data. The data preparation steps, outlined below in commonly used categories, culminate in the creation of a single, trusted data set to inform one or more specific use cases:

  1. Collection. Guided by the objectives of its intended analysis, the analytics team identifies and pulls relevant data from internal and external data sources. For example, if the goal is to shed light on customer product preferences, the team can draw quantitative and qualitative data from CRM and sales applications, customer surveys, and social media feedback. During this phase, the team should consult with all stakeholders and use reliable data sets, or it risks biased or otherwise skewed results.
  2. Discovery and profiling. Through iterative stages of exploration and analysis, the team examines the raw data it collected to better understand the overall structure of and individual content within each data set. It also studies relationships across data sets. Through data profiling, the team collects and summarizes statistics on anomalies, inconsistencies, gaps, and other issues that must be addressed before the data is used to develop and train analytics models. For example, customer, patient, and other data sets containing names and addresses stored across systems often vary in spelling and other ways.
  3. Cleansing. At this stage, the team must meticulously correct all data quality issues. Cleansing involves activities such as filling in missing values, correcting or removing defective data, filtering out irrelevant data, and masking sensitive data. Time-consuming and tedious, this data preparation step is critical to ensuring data accuracy and consistency. Cleansing is particularly important when working with big data because of the sheer data volumes that must be harmonized.
  4. Structuring. This step entails developing a database schema that describes how to organize the data into tables to enable smooth access by modeling tools. The schema can be considered a permanent structure that will house constantly changing data in a unified manner. At this stage, the team defines all schema components, such as tables, fields, data types, and the relationships among them.
  5. Transformation and enrichment. Once the schema is set, the team must make sure all data conforms. Some existing data formats will need to be altered, such as by adjusting hierarchies and adding, merging, or deleting columns and fields. The team can also enhance the data with behavioral, demographic, geographic, and other contextual information pulled from sources within and outside the organization. An enriched data set enables analytics models to be trained with more comprehensive data sets and hence deliver more precise, valuable insights.
  6. Validation. Now, the team must use written scripts or tools to verify the quality and accuracy of its data set. Also, it confirms that the data structure and formatting align with project requirements so that users and project modeling tools can easily access the data. Depending on the size of the data set, the team might choose to test a data sample rather than the full data set. It should resolve any issues before moving on to the final step of the data preparation process.
  7. Publishing. When the team is confident its data is of high quality, it transfers it to the targeted data warehouse, data lake, or another repository. Here, the team and others within the organization can access it to develop and test analytics models.
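
The profiling, cleansing, and validation steps above can be sketched in a few lines of Python. This is a minimal illustration that uses only the standard library, not a description of how any particular product works. The sample records, the field names (name, email, city), and the helper functions (profile, mask_email, cleanse, validate) are all hypothetical, and matching duplicates on the normalized name alone is a deliberate simplification.

```python
from collections import Counter

# Hypothetical raw customer records; None marks a missing value.
RAW = [
    {"name": " Ana Diaz ", "email": "ANA@EXAMPLE.COM", "city": "Lima"},
    {"name": "ana diaz", "email": "ana@example.com", "city": None},
    {"name": "Bo Chen", "email": None, "city": "Austin"},
]


def profile(rows):
    """Discovery and profiling: count missing values per field."""
    missing = Counter()
    for row in rows:
        for field, value in row.items():
            if value is None:
                missing[field] += 1
    return dict(missing)


def mask_email(email):
    """Masking: keep the first character and the domain, hide the rest."""
    if "@" not in email:
        return ""
    local, domain = email.split("@", 1)
    return local[:1] + "***@" + domain


def cleanse(rows):
    """Cleansing: normalize text, fill gaps, drop duplicates, mask emails."""
    cleaned, seen = [], set()
    for row in rows:
        # Collapse stray whitespace and standardize capitalization.
        name = " ".join(row["name"].split()).title()
        email = (row["email"] or "").lower()
        city = row["city"] or "Unknown"
        # Simplified duplicate check: same normalized name means same person.
        key = name.lower()
        if key in seen:
            continue
        seen.add(key)
        cleaned.append({"name": name, "email": mask_email(email), "city": city})
    return cleaned


def validate(rows):
    """Validation: return a list of problems; an empty list means none found."""
    issues = []
    for i, row in enumerate(rows):
        if not row["name"]:
            issues.append(f"row {i}: empty name")
        if row["email"] and "***" not in row["email"]:
            issues.append(f"row {i}: unmasked email")
    return issues


print(profile(RAW))
cleaned = cleanse(RAW)
print(cleaned)
print(validate(cleaned))
```

In practice, an augmented tool would propose fixes like these and leave the decision to the user; here each rule is hard-coded to keep the example short.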

How does machine learning enhance data preparation and modeling?

Augmented data preparation is made possible by augmented analytics technologies, including ML, automation, natural-language generation (NLG), and data visualization. For example, augmented data discovery relies heavily on ML, a type of AI that uses algorithms and statistical models to learn from data and adapt without human assistance.

Using ML, discovery tools apply learned knowledge to consider what kinds of data sets are needed given the problem the model must solve and the hypothesis to be tested. They also consider the context in which the data sets were gathered. Then, the tools quickly analyze and draw inferences from patterns in the data sets and intelligently suggest which ones to combine.
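
One way to picture how a tool might suggest which data sets to combine is to score how much the values in two columns overlap. The sketch below uses Jaccard similarity, a simple set-overlap heuristic swapped in purely for illustration. It is not ML, and it is not the method any specific discovery tool is known to use. The data set names, column names, threshold, and the suggest_joins helper are hypothetical.

```python
from itertools import combinations

# Hypothetical data sets, each a mapping of column name to values.
DATASETS = {
    "crm": {"customer_id": [1, 2, 3, 4], "region": ["N", "S", "S", "E"]},
    "sales": {"cust": [2, 3, 4, 5], "amount": [10.0, 25.5, 8.0, 40.0]},
    "survey": {"respondent": [3, 4, 9], "score": [7, 9, 6]},
}


def jaccard(a, b):
    """Share of distinct values that two columns have in common (0.0 to 1.0)."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def suggest_joins(datasets, threshold=0.5):
    """Return (score, left_column, right_column) for column pairs that overlap enough."""
    suggestions = []
    # Compare every column in one data set with every column in another.
    for (name_a, cols_a), (name_b, cols_b) in combinations(datasets.items(), 2):
        for col_a, vals_a in cols_a.items():
            for col_b, vals_b in cols_b.items():
                score = jaccard(vals_a, vals_b)
                if score >= threshold:
                    suggestions.append(
                        (round(score, 2), f"{name_a}.{col_a}", f"{name_b}.{col_b}")
                    )
    # Highest-scoring candidates first.
    return sorted(suggestions, reverse=True)


for score, left, right in suggest_joins(DATASETS):
    print(f"Consider joining {left} with {right} (overlap {score})")
```

A real tool would weigh far more context, such as column names, data types, and how the data sets were gathered, but the idea of ranking candidate combinations and presenting them to the user is the same.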

Augmented data discovery not only uses ML but also helps ensure effective data preparation for machine learning models. For instance, the discovery tools use ML algorithms to generate recommendations for users on how to cleanse and enrich data and transform it into an appropriate format for ML model analysis.

How can your company benefit from augmented data preparation?

Every day, business leaders and teams across industries identify new, strategic ways to capitalize on data. With augmented data preparation, they can act on innovative ideas for analytics projects without the help of IT professionals.

The benefits of augmented data preparation can reach across your organization:

  • Boosts productivity: Using intuitive, graphical user interfaces with automated, self-service tools, skilled business users can quickly collect data from multiple, disparate sources and run it through profiling, cleansing, and other key data preparation functions. Augmented data preparation also helps reduce or eliminate time-consuming tasks for IT and data professionals.
  • Delivers higher-quality data: When preparing data manually, even experienced data scientists can accidentally introduce inaccurate or irrelevant data, or fail to include important data. Augmented data preparation can automatically locate and correct quality issues, helping ensure your data set produces valid results.
  • Accelerates ROI: Greater productivity at the front end of analytics projects leaves more time and resources for data modeling, mining, and analysis. Rather than get caught up in manual data preparation chores, users can focus on studying insights and applying them to transform business operations and address business challenges. Once built, a data set can serve several applications, further optimizing your investments.
  • Drives data democratization: Equipped to help prepare and publish data for analysis, nonspecialized users can become more comfortable working with raw data. In addition, users most familiar with the analytics problem can draw on their business knowledge and expertise to select the most relevant data sets and help structure and enrich data to support project goals. As data literacy grows in your organization, people gain more confidence in data-driven decisions and strategies.
  • Improves business agility: Able to rapidly prepare comprehensive data sets, users can quickly launch new analytics projects in support of changing business and marketplace conditions. The faster the time to insight, the faster your company can apply those insights to gain competitive advantage.

How are companies applying augmented data preparation?

Across industries, companies use business intelligence and business analytics tools to derive greater value from data. For example, having incorporated augmented data preparation into their workflows, the following organizations efficiently gathered and processed data to fuel their analytics:

  • To better understand which customers are most likely to use wealth investment services, and then target them with personalized promotions, a large bank quickly gathered and consolidated account, deposit, withdrawal, and credit card data from across its branch and ATM network. It also pulled demographic, socioeconomic, and other contextual data from external sources.
  • An international pharmacy chain sought to know why its brand-name makeup underperformed in some locations but not others. It combined point-of-sale, product category, customer loyalty, net promoter score, and pricing data from its internal systems with external geographic data to build a rich data set for analysis.
  • A small agricultural technology company wanted to use its proprietary algorithms to study crop yield trends in drought-ridden areas so it could advise small-scale farmers on what crops to plant and when. Capitalizing on big data pools maintained by public and private organizations, it sourced and combined data pertaining to multiple variables, including weather conditions, soil temperatures, moisture content, water use, and crop status.
  • A legal firm defending a corporate client in a large litigation case analyzed millions of client emails and other unstructured documents for pertinent history. By dramatically reducing manual, repetitive data discovery activities, the firm had more time to review and analyze relevant findings.
  • A US state government wanted to employ predictive maintenance practices to help cut fuel, maintenance, and services costs for its fleet of automobiles and heavy machinery. To better determine which vehicles needed servicing and when, as well as each vehicle’s real-time proximity to a service facility, the asset management team integrated information from vehicle maintenance records and performance sensors with external GPS data.

How can your company implement an augmented data preparation solution?

Before introducing augmented data preparation to employees, your company should gain their trust. Some individuals could be concerned that the new technologies will change or even eliminate their roles. To promote acceptance, managers can invite affected teams to help define new data preparation processes and discuss how their roles might evolve. Also, proactively fostering data literacy across the organization, especially among teams that aren’t familiar with augmented data analytics, helps boost trust in the resulting insights.

When choosing a self-service data preparation solution, ask the following questions:

  • Will the solution connect to a variety of data sources, either on-premises or in the cloud?
  • Can it work with semi-structured and unstructured raw data?
  • To what extent does it automate the data preparation process?
  • Does it have robust, intuitive tools?
  • Does the solution support cross-organizational collaboration and data sharing?
  • Can it scale to handle big data?
  • Will it support cloud-based analytics platforms? If so, which ones?
  • Will it enable data security and privacy and support regulatory compliance?
  • What will it cost, considering software licenses, processing and storage requirements, and employee onboarding and training?

Once you’ve decided on a solution, start small with the implementation. Ask data science, business, and other stakeholders to select a few data-literate teams with use cases that lend themselves to augmented data preparation. Based on your company objectives for augmented data analysis, gradually roll out the solution to other teams.

Wrangle more value from your data with Microsoft Power BI

Microsoft Power BI can help your company make augmented data analytics a simpler, faster, and more inclusive process. Prompted by NLG queries and recommendations and aided by data visualizations, business teams can more quickly and confidently prepare accurate, comprehensive data sets that generate quality insights.

Frequently asked questions

What is data preparation?

Data preparation involves all stages of creating quality, accurate, and comprehensive data sets for business intelligence and business analytics. It helps ensure an organization can generate insights needed to gain competitive advantage.

What are data preparation tools?

Data preparation tools facilitate data collection, discovery and profiling, cleansing, structuring, transformation and enrichment, validation, and publishing.

What is the augmented data preparation process?

The augmented data preparation process uses augmented analytics—including ML, NLG, and data visualization—to transform traditionally tedious, time-consuming activities into automated, more intelligent workflows.

Why is augmented data preparation important?

Augmented data preparation can deliver several benefits. It can increase productivity, deliver higher-quality data for analysis, accelerate ROI on analytics projects, democratize data, and improve business agility.

What is data preparation for machine learning?

Effective data preparation for machine learning applications provides quality data sets for building and testing ML models. For example, many augmented data preparation tools employ ML algorithms to make recommendations to users on how to cleanse and enrich data and transform it into an appropriate format for ML model analysis.