Michael Stonebraker
Co-founder and Turing Award Winner
January 29, 2019

Three Generations of AI: From Data Warehouses to Machine Learning

For years, data unification has been a roadblock to effective data analytics. Data scientists report spending 80% of their time locating data, unifying it from multiple sources, and cleaning it before they can begin their analytics work in earnest. As a result, enterprises are looking for better ways to complete the data unification process. Over the years, organizations have made several advancements, including applying artificial intelligence (AI) to address these challenges. Here is how enterprises have leveraged three generations of AI to solve the data unification problem.

Generation 1

In the mid-to-late 90s, enterprises began adopting data warehouses, and needed a way to load data from disparate sources into these warehouses. A generation of Extract, Transform and Load (ETL) tools came into existence to help organizations with this problem.

Early ETL tools offered enterprises do-it-yourself scripting languages to create static rules for specifying transformation tasks. Users could program tasks—data cleaning, preparation, and consolidation—using the scripting facility. Many early ETL tools, however, had limitations. For example, some did not have dedicated features for data cleaning, making it difficult to detect and remove errors and inconsistencies to improve data quality.
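To make this concrete, here is a minimal sketch, in Python, of the kind of static, hand-written transformation logic these tools encoded. The file names, field names, and rules are hypothetical, and real Generation 1 tools used their own proprietary scripting languages rather than Python:

```python
import csv

# Hypothetical source and target files; the rules below stand in for the
# static transformation scripts of early ETL tools.
SOURCE = "crm_export.csv"
TARGET = "warehouse_load.csv"

def transform(row):
    """Apply static, hand-written transformation rules to one record."""
    # Rule: normalize state names to two-letter codes.
    states = {"california": "CA", "new york": "NY", "texas": "TX"}
    row["state"] = states.get(row["state"].strip().lower(), row["state"])
    # Rule: unify date format from MM/DD/YYYY to ISO 8601.
    month, day, year = row["order_date"].split("/")
    row["order_date"] = f"{year}-{month.zfill(2)}-{day.zfill(2)}"
    # Rule: drop records with no customer ID (crude "cleaning").
    return row if row["customer_id"] else None

with open(SOURCE, newline="") as src, open(TARGET, "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
    writer.writeheader()
    for record in reader:
        cleaned = transform(record)
        if cleaned:  # silently discard records that fail the cleaning rule
            writer.writerow(cleaned)
```

Every new source system meant another script like this, written and maintained by hand.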

ETL scripting tools were a good first step toward more effective data unification, cleaning, and loading. However, as the volume and variety of data continued to expand, time-consuming manual scripting began impeding organizations’ ability to quickly derive critical insights needed to drive business decisions. By the time data was ready for analysis, the business had often moved on to new issues that needed resolving.

Generation 2

To hasten data analytics, many enterprises adopted ETL products that included more sophisticated rules. These took two different forms:

  1. The first was data cleaning tools bundled into ETL suites. These were usually rule-based systems, the preferred AI technology of the time, though some data cleaning was still performed with hard-coded lookup tables.
  2. The second application of rule-based systems was Master Data Management (MDM). MDM was used to remove duplicates, a necessary step so that analytics do not generate incorrect results. However, apparent duplicates are rarely exact duplicates, so enterprises had to decide on a single representation, or golden record, for each entity (a sketch of this process follows the list).
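Here is a toy, rule-based illustration of that MDM process: matching duplicate company records and constructing a golden record. The sample records, matching rules, and survivorship rule are hypothetical simplifications; production MDM tools encode hundreds of such rules:

```python
# Three records that refer to the same real-world entity.
records = [
    {"name": "IBM",                      "city": "Armonk", "revenue": 79_000},
    {"name": "I.B.M.",                   "city": "Armonk", "revenue": None},
    {"name": "Intl. Business Machines",  "city": "Armonk", "revenue": 79_139},
]

def normalize(name):
    # Rule: strip punctuation, lowercase, and map known aliases.
    cleaned = name.replace(".", "").replace(",", "").lower()
    aliases = {"intl business machines": "ibm"}
    return aliases.get(cleaned, cleaned)

def same_entity(a, b):
    # Rule: identical normalized name AND identical city => duplicate.
    return normalize(a["name"]) == normalize(b["name"]) and a["city"] == b["city"]

def golden_record(cluster):
    # Survivorship rule: for each field, keep the first non-null value.
    return {field: next((r[field] for r in cluster if r[field] is not None), None)
            for field in cluster[0]}

# Naive clustering: group each record with the first cluster it matches.
clusters = []
for rec in records:
    for cluster in clusters:
        if same_entity(rec, cluster[0]):
            cluster.append(rec)
            break
    else:
        clusters.append([rec])

print([golden_record(c) for c in clusters])
# [{'name': 'IBM', 'city': 'Armonk', 'revenue': 79000}]
```

The catch, as discussed next, is that every normalization and matching rule here is written and maintained by a human.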

In Generation 2, organizations used AI-based rule technology to solve a variety of data unification problems, and several vendors still provide MDM tools today. Although these tools represent an advance over first-generation AI, at their core they are collections of human-generated deterministic rules, which cannot scale to the data variety of today’s enterprise unification requirements. The number of rules a human can comprehend varies with the application and its complexity, but most humans cannot understand more than about 500 rules.

Generation 3

Rule-based systems fall short on problems involving significant data variety. To overcome their limitations, Tamr Unify uses machine learning (ML). With Tamr, a customer with, say, 18 million transactions can apply 500 manually created rules as training data to construct a classification model; the model then classifies the remaining transactions. Tamr uses ML to solve all of the problems that 2nd generation systems attacked with rules: schema integration, classification, entity consolidation, and golden record construction.
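As a minimal sketch of this idea, the snippet below treats rule-labeled records as training data for a text classifier that then labels the long tail. The field values, categories, and the use of scikit-learn are illustrative assumptions, not Tamr’s actual implementation:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Records already labeled by the ~500 hand-written rules (hypothetical data).
rule_labeled = [
    ("laptop docking station model dx100", "IT Hardware"),
    ("annual saas license renewal",        "Software"),
    ("office chair ergonomic black",       "Furniture"),
    ("usb-c cable 2m",                     "IT Hardware"),
]
texts, labels = zip(*rule_labeled)

# Train a classifier on the rule output instead of writing more rules.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(texts, labels)

# The model now classifies the millions of records the rules never covered.
unlabeled = ["wireless keyboard and mouse bundle", "crm subscription fee"]
print(dict(zip(unlabeled, model.predict(unlabeled))))
```

The key shift is that human effort goes into producing training examples, which a model can generalize, rather than into rules, which cannot.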

As long as the training data is reasonably accurate and covers the data set well, an ML model will achieve good results. However, every Tamr customer relies on humans to check a sample of ML output for accuracy. When humans discover an error, they correct it and add the correction to the training set, improving the accuracy of the model through active learning.  
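A simplified version of that human-in-the-loop cycle might look like the following, reusing the model, texts, labels, and unlabeled names from the previous sketch; the sampling and review interface are hypothetical:

```python
import random

texts, labels = list(texts), list(labels)

def review_sample(predictions, sample_size=2):
    """Show a random sample of model output to a human reviewer and
    collect corrections as (text, true_label) pairs."""
    sample = random.sample(predictions, min(sample_size, len(predictions)))
    corrections = []
    for text, predicted in sample:
        answer = input(f"'{text}' -> {predicted}. Correct label (blank = ok): ")
        if answer.strip():
            corrections.append((text, answer.strip()))
    return corrections

predictions = list(zip(unlabeled, model.predict(unlabeled)))
for text, true_label in review_sample(predictions):
    texts.append(text)          # each correction becomes new training data
    labels.append(true_label)
model.fit(texts, labels)        # retrain with the enlarged training set
```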

If enterprises have a “small” problem, then they can safely use a 1st or 2nd generation system. However, if scalability is needed (either now or in the future), then a 3rd generation system is a necessity. The innovations presented by 3rd generation data unification enable organizations to harness big data to solve their most pressing issues and drive significant business impact.