Three Generations of AI: From Data Warehouses to Machine Learning

For years, data unification has been a roadblock to effective data analytics. Data scientists report spending 80% of their time locating data, unifying it from multiple sources, and cleaning it before they can begin their analytics work in earnest. As a result, enterprises are looking for better ways to complete the data unification process. Over the years, organizations have made several advancements, including applying artificial intelligence (AI) to address these challenges. Here is how enterprises have leveraged three generations of AI to solve the data unification problem.

Generation 1

In the mid-to-late 1990s, enterprises began adopting data warehouses and needed a way to load data from disparate sources into them. A generation of Extract, Transform, and Load (ETL) tools came into existence to help organizations with this problem.

Early ETL tools offered enterprises do-it-yourself scripting languages to create static rules for specifying transformation tasks. Users could program tasks—data cleaning, preparation, and consolidation—using the scripting facility. Many early ETL tools, however, had limitations. For example, some did not have dedicated features for data cleaning, making it difficult to detect and remove errors and inconsistencies to improve data quality.
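To make the idea of static, hand-written transformation rules concrete, here is a minimal sketch in Python of what such a script might look like. The field names, the sample rows, and the `transform` and `run_etl` functions are hypothetical illustrations and are not taken from any particular ETL product.

```python
import csv
import io

# Hypothetical source extract; a real ETL job would read from files or databases.
SOURCE_CSV = """customer,state,amount
 Acme Corp ,ma,1200.50
Globex,NY,300
initech,ca,
"""

def transform(row):
    """Apply fixed, hand-written rules to one source row."""
    amount = row["amount"].strip()
    return {
        "customer": row["customer"].strip().title(),  # rule: trim and normalize case
        "state": row["state"].strip().upper(),        # rule: state codes in upper case
        "amount": float(amount) if amount else 0.0,   # rule: missing amounts become 0.0
    }

def run_etl(source_text):
    """Extract rows from CSV text, transform each one, and return load-ready rows."""
    reader = csv.DictReader(io.StringIO(source_text))
    return [transform(row) for row in reader]

warehouse_rows = run_etl(SOURCE_CSV)
for row in warehouse_rows:
    print(row)
```

Every rule here is fixed in code, so each new source format or data quirk means another manual edit to the script, which is the maintenance burden that grew as data volume and variety increased.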

ETL scripting tools were a good first step toward more effective data unification, cleaning, and loading. However, as the volume and variety of data continued to expand, time-consuming manual scripting began impeding organizations’ ability to quickly derive critical insights needed to drive business decisions. By the time data was ready for analysis, the business had often moved on to new issues that needed resolving.

Generation 2

To hasten data analytics, many enterprises adopted ETL products that included more sophisticated rules. These took two different forms: 

  1. Cleaning tools were included in ETL suites. These were usually rule-based systems, the preferred AI technology of the time. Still, some data cleaning was performed using hard-coded data lookup tables.
  2. The second application of rule-based systems was Master Data Management (MDM). MDM was used to remove duplicates, a necessary step so that analytics don’t generate incorrect results. However, apparent duplicates are rarely exact duplicates, so enterprises had to decide on a single representation, or golden record, for each entity.
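The two forms described above can be sketched together in a few lines of Python. This is only an illustration under stated assumptions: the lookup table, the sample records, the matching rule (same first word of the name and same state), and the survivorship rule (keep the longest value) are all hypothetical, and real MDM products use far richer rule sets.

```python
from collections import defaultdict

# Hard-coded lookup table of the kind used for cleaning (hypothetical values).
STATE_LOOKUP = {"mass.": "MA", "massachusetts": "MA", "ma": "MA", "n.y.": "NY", "ny": "NY"}

# Hypothetical customer records drawn from different source systems.
RECORDS = [
    {"name": "Acme Corp.", "state": "Mass.", "phone": "617-555-0100"},
    {"name": "ACME Corporation", "state": "MA", "phone": ""},
    {"name": "Globex", "state": "N.Y.", "phone": "212-555-0199"},
]

def clean_state(value):
    """Cleaning rule: map known variants to a canonical code via the lookup table."""
    return STATE_LOOKUP.get(value.strip().lower(), value.strip().upper())

def match_key(record):
    """Matching rule: same first word of the name plus same state means duplicate."""
    first_word = record["name"].lower().replace(".", "").split()[0]
    return (first_word, clean_state(record["state"]))

def golden_record(cluster):
    """Survivorship rule: keep the longest value seen for each field."""
    merged = {}
    for field in ("name", "phone"):
        merged[field] = max((r[field] for r in cluster), key=len)
    merged["state"] = clean_state(cluster[0]["state"])
    return merged

# Group records that the deterministic matching rule treats as the same entity.
clusters = defaultdict(list)
for record in RECORDS:
    clusters[match_key(record)].append(record)

golden_records = [golden_record(cluster) for cluster in clusters.values()]
for record in golden_records:
    print(record)
```

The two Acme records are not exact duplicates, yet the rule merges them into one golden record. Each new spelling variant or edge case needs another lookup entry or rule, which is why this approach becomes hard to maintain as data variety grows.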

In Generation 2, organizations used rule-based AI technology to solve a variety of data unification problems, and several vendors still provide MDM tools today. Although these tools are an advancement over the first generation, at their core they rely on human-written deterministic rules, which cannot scale to meet the data variety of today’s enterprise data unification requirements. The number of rules a person can comprehend varies with the application and its complexity, but most people cannot understand more than 500 rules.

Generation 3

Rule-based systems fall short on problems that involve significant data variety. To overcome these limitations, Tamr Unify uses machine learning (ML). With Tamr, customers can use their manually created rules, for example a set of 500, as training data to construct a classification model. The model then classifies the records that the rules do not cover, for example the remaining 18 million transactions. Tamr uses ML to solve all of the problems that 2nd generation systems attacked with rules: schema integration, classification, entity consolidation, and golden record construction.
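The idea of turning rules into training data can be illustrated with a small, self-contained Python sketch. It is not Tamr's implementation: the keyword rules, the transaction descriptions, and the choice of a simple multinomial naive Bayes classifier are all assumptions made for illustration. The rules label the rows they match, and a model trained on those rows then classifies the rows that no rule covers.

```python
import math
from collections import Counter, defaultdict

# Hypothetical hand-written rules: keyword -> spend category.
RULES = {"laptop": "IT", "toner": "Office", "airfare": "Travel"}

# Hypothetical transaction descriptions.
TRANSACTIONS = [
    "dell laptop 15 inch", "laptop docking station", "printer toner black",
    "toner cartridge color", "airfare boston to chicago", "airfare chicago return",
    "docking station usb", "cartridge color refill", "boston hotel",
]

def apply_rules(text):
    """Return the category of the first matching rule, or None if no rule fires."""
    for keyword, category in RULES.items():
        if keyword in text.split():
            return category
    return None

def train(labeled):
    """Fit a multinomial naive Bayes model from (text, category) pairs."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, category in labeled:
        class_counts[category] += 1
        word_counts[category].update(text.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab

def classify(model, text):
    """Pick the category with the highest Laplace-smoothed log-probability."""
    word_counts, class_counts, vocab = model
    total = sum(class_counts.values())
    best, best_score = None, -math.inf
    for category, n in class_counts.items():
        denom = sum(word_counts[category].values()) + len(vocab)
        score = math.log(n / total)
        for word in text.split():
            score += math.log((word_counts[category][word] + 1) / denom)
        if score > best_score:
            best, best_score = category, score
    return best

# Step 1: the rules label whatever they can; those labels become training data.
labeled = [(t, apply_rules(t)) for t in TRANSACTIONS if apply_rules(t)]
unlabeled = [t for t in TRANSACTIONS if apply_rules(t) is None]

# Step 2: the model generalizes from the rule-labeled rows to the rest.
model = train(labeled)
predictions = {t: classify(model, t) for t in unlabeled}
for text, category in predictions.items():
    print(f"{text!r} -> {category}")
```

None of the three unlabeled rows contains a rule keyword, but each shares other words with rule-labeled rows, so the model can still assign a category. That generalization is what lets a modest rule set extend to a much larger body of records.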

As long as the training data is reasonably accurate and covers the data set well, an ML model will achieve good results. Even so, every Tamr customer relies on humans to check a sample of the ML output for accuracy. When reviewers discover an error, they correct it and add the correction to the training set, which improves the accuracy of the model through active learning.
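The review-and-correct loop described above might look roughly like the following Python sketch. Everything in it is a simplifying assumption: the word-vote classifier, the sample records, the `human_review` stand-in for an expert, and the choice to review the record with the fewest supporting votes (a basic form of uncertainty sampling). It shows the shape of the loop, not how Tamr selects samples or trains models.

```python
from collections import Counter, defaultdict

def train(labeled):
    """Count how often each word appears under each category."""
    counts = defaultdict(Counter)
    for text, category in labeled:
        for word in text.split():
            counts[word][category] += 1
    return counts

def predict(model, text):
    """Sum per-word votes; return (category, votes), or (None, 0) if no word is known."""
    votes = Counter()
    for word in text.split():
        votes.update(model.get(word, {}))
    if not votes:
        return None, 0
    return votes.most_common(1)[0]

# Hypothetical seed training data and unreviewed records.
training_set = [("laptop charger", "IT"), ("printer toner", "Office")]
pool = ["laptop sleeve", "toner refill", "hotel boston", "hotel chicago"]

def human_review(text):
    """Stand-in for a human expert; in practice a person checks the sampled output."""
    truth = {"laptop sleeve": "IT", "toner refill": "Office",
             "hotel boston": "Travel", "hotel chicago": "Travel"}
    return truth[text]

for round_number in range(2):
    model = train(training_set)
    # Review the record the model is least sure about (fewest supporting votes).
    sample = min(pool, key=lambda text: predict(model, text)[1])
    predicted, _ = predict(model, sample)
    correct = human_review(sample)
    if predicted != correct:
        training_set.append((sample, correct))  # the correction becomes training data
    pool.remove(sample)

# Retrain once more so the final predictions reflect every correction.
model = train(training_set)
final_predictions = {text: predict(model, text)[0] for text in pool}
print(final_predictions)
```

In the first round the model has never seen the word "hotel", so the reviewer's correction adds a new category to the training set. After retraining, the model classifies "hotel chicago" without further human input.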

If enterprises have a “small” problem, then they can safely use a 1st or 2nd generation system. However, if scalability is needed (either now or in the future), then a 3rd generation system is a necessity. The innovations presented by 3rd generation data unification enable organizations to harness big data to solve their most pressing issues and drive significant business impact.

To learn more about the three generations of AI and determine your organization’s AI and data analytics progress, download Tamr’s white paper: Three Generations of AI for Data Unification.

Three Generations of AI for Data Unification – Dr. Michael Stonebraker

Michael Stonebraker is an adjunct professor at MIT CSAIL and a database pioneer who specializes in database management systems and data integration. He was awarded the 2014 A.M. Turing Award (known as the “Nobel Prize of computing”) by the Association for Computing Machinery. Professor Stonebraker has been a pioneer of database research and technology for more than 40 years, and is the author of scores of papers in this area. Before joining CSAIL in 2001, he was a professor of computer science at the University of California, Berkeley, for 29 years. While at Berkeley, he was the main architect of the INGRES relational DBMS, the object-relational DBMS POSTGRES, and the federated data system Mariposa. After joining MIT, he was the principal architect of C-Store (a column store commercialized by Vertica), H-Store (a main-memory OLTP engine commercialized by VoltDB), and SciDB (an array engine commercialized by Paradigm4). In addition, he has started three other companies in the big data space, including Tamr. He also co-founded the Intel Science and Technology Center for Big Data, based at MIT CSAIL.