Data Mastering at Scale

Data mastering (sometimes called Master Data Management, or MDM) is now 15 years old. It arose because enterprises have long created independent business units (IBUs) with substantial freedom of action. This autonomy lets IBUs be as agile as possible; otherwise, every decision would have to go through headquarters. As a result, IBUs have created “silos” of information, relevant to their specific needs.

There is considerable business value in integrating certain entities from these silos. For example, one IBU might have Company X as a sales prospect. That IBU would want to know whether any other IBU had sold to X, to facilitate cross-selling. Other examples include removing duplication of effort, providing better customer service, and saving money through better contract negotiation. To realize this value, the common entities of the IBUs must be normalized and shared across the enterprise. Enter data mastering, which is used to create common definitions of customers, products, parts, suppliers, etc. During the last 15 years data mastering has spread to most enterprises, and usually means the union of two different capabilities:

Extract, transform and load (ETL). ETL moves data to a common place and transforms it into a normalized form. This includes converting units (e.g., Euros to dollars), reformatting values (e.g., European dates to US dates), and other housekeeping chores. The data must then be loaded into a predefined common schema, typically in a data warehouse or a data lake. Because it prepares and normalizes the data, ETL is a precondition for data mastering, which comes next.
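These transformations can be sketched in a few lines of Python. The field names, exchange rate, and target schema below are illustrative assumptions, not a prescription; a real pipeline would use a dated rate table and a schema agreed upfront:

```python
from datetime import datetime

# Hypothetical exchange rate; real pipelines look up a rate for the transaction date.
EUR_TO_USD = 1.08

def normalize_record(raw):
    """Map one raw source record onto a shared target schema."""
    return {
        "customer": raw["cust_name"].strip().title(),
        # Unit conversion: Euros to dollars.
        "amount_usd": round(raw["amount_eur"] * EUR_TO_USD, 2),
        # Format conversion: European DD/MM/YYYY to ISO 8601.
        "order_date": datetime.strptime(raw["date"], "%d/%m/%Y").date().isoformat(),
    }

record = {"cust_name": "  acme gmbh ", "amount_eur": 100.0, "date": "31/01/2020"}
print(normalize_record(record))
# {'customer': 'Acme Gmbh', 'amount_usd': 108.0, 'order_date': '2020-01-31'}
```

The normalized records can then be loaded into the common schema of a warehouse or lake.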

Data Mastering. Consider an entity, Mike Stonebraker. I may exist multiple times in IBU data silos. Moreover, there are no global keys in the silos, so I could be M.R. Stonebraker in one place, Mike Stonebraker in a second, and Michael Stonebreaker in a third. Data mastering first performs so-called “match/merge” to create clusters of records that correspond to the same entity. The second step is to create a “golden record” for each entity by selecting, from the available attribute values, the ones that best represent it. In other words, one must collapse a collection of records representing an entity into a single “golden record” with trusted values.

ETL software is very mature, and products share a common thread. A global schema is constructed upfront. Then, for each data source, a programmer interviews the business owner to ascertain the format of the raw records and writes scripts to convert them to the global schema. This process is largely manual.

Traditional data mastering software is similarly mature. One creates clusters of records using a collection of rules (for example, if the edit distance between the various renditions of my name is below a threshold, then the records represent me). Golden record construction is likewise handled by a set of rules (e.g., majority consensus wins, or the most recent value wins).
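A minimal Python sketch of both kinds of rules, assuming a hypothetical edit-distance threshold and a majority-consensus rule for the surname attribute:

```python
from collections import Counter

def edit_distance(a, b):
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

records = ["M.R. Stonebraker", "Mike Stonebraker", "Michael Stonebreaker"]

# Matching rule: two renditions refer to the same entity if their edit
# distance is below a threshold (the value 8 is an illustrative choice).
THRESHOLD = 8
matches = [(a, b) for i, a in enumerate(records) for b in records[i + 1:]
           if edit_distance(a, b) < THRESHOLD]

# Golden record rule: majority consensus wins on each attribute.
surnames = ["Stonebraker", "Stonebraker", "Stonebreaker"]
golden_surname = Counter(surnames).most_common(1)[0][0]
print(golden_surname)  # Stonebraker
```

Real rule bases combine many such predicates per attribute, which is exactly what becomes unmanageable at scale.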

ETL works well on small problems. A typical data warehouse is loaded from a dozen or fewer sources. Also, the sources are usually semantically similar (for example, point-of-sale data from multiple locations). Similarly, current data mastering software works well on simple problems that can be addressed by the rule system technology in current products.

In the last 15 years these technologies have been very successful. They are in front of essentially all data warehouses and power many repositories of “golden records” for entities.

However, in the last few years the scope of ETL/data mastering projects has expanded dramatically. For example: 

  • Small businesses are a key customer for Groupon. Therefore, it has mastered 10,000 sources of small business information, in many cases from the public web, to form a global data set of small businesses. Not only is the number of sources two orders of magnitude greater than a typical data warehouse project, but also the diversity of the public web (dirty data, many languages) must be dealt with. 
  • Toyota Motor Europe is mastering Toyota customers in all of Europe. This is 30M+ raw records in 250 databases in 40 languages; well past the scope of typical data warehouse projects. 
  • Novartis is mastering experimental data from wet chemistry and biology experiments. This data comes from 10,000+ scientists’ electronic lab notebooks. 

In summary, the SCALE of ETL/data mastering projects has increased dramatically over the last 15 years. This has resulted from:

  • The business value of data mastering at scale is now known to be enormous. For example, GE estimated they could save $100M/year by mastering their suppliers and then demanding “most favored nation” status when contracts were renewed.
  • There are now example use cases of successful data mastering at scale.  Previously, it was usually thought to be too expensive.
  • The real-time enterprise. Business decisions increasingly must be made quickly and on current data.

Traditional ETL is too programmer-intensive to operate at scale. In addition, it requires a global schema upfront. It is unimaginable to have a programmer deal with 10,000 Novartis data sources. Also, not even the Chief Scientist of the company has insight into the composition of a global schema. 

Traditional rules-based mastering similarly does not scale. It is well known that rule systems work only as long as the rule base is small (say, 500 rules). When a mastering project requires substantially more rules than this, it tends to fail. This leads to the conclusion that traditional data mastering is fine for small, simple problems; however, a new approach is needed for bigger, more complex projects. Such a new approach must have four characteristics:

1 – Data Mastering at Scale requires machine learning

Manual analysis of the similarities and differences between source systems, manual ETL coding to combine data, and manual definition of rules to match source records will be replaced by machine learning (ML) models that perform these steps more accurately and at a fraction of the time and cost.
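As a toy illustration of the idea (not any vendor's actual models), a match decision can be learned from labeled pairs instead of hand-written rules. The training pairs and the single string-similarity feature below are invented for the example; real systems train classifiers over many such signals:

```python
from difflib import SequenceMatcher

# Hypothetical labeled pairs a data steward might supply:
# (record_a, record_b, 1 if same entity else 0).
training = [
    ("Mike Stonebraker", "M.R. Stonebraker", 1),
    ("Mike Stonebraker", "Michael Stonebreaker", 1),
    ("Mike Stonebraker", "Jane Doe", 0),
    ("IBM Corp.", "International Business Machines", 0),
]

def similarity(a, b):
    """One cheap similarity feature; real models combine many signals."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fit_threshold(pairs):
    """Learn the decision threshold that best separates the labeled pairs --
    a stand-in for training a real pairwise match classifier."""
    best_t, best_acc = 0.5, 0.0
    for t, _ in sorted((similarity(a, b), y) for a, b, y in pairs):
        acc = sum((similarity(a, b) >= t) == bool(y) for a, b, y in pairs) / len(pairs)
        if acc > best_acc:
            best_t, best_acc = t, acc
    return best_t

threshold = fit_threshold(training)
# New pairs are now classified by the learned model, not hand-written rules.
is_match = similarity("Mike Stonebraker", "M. Stonebraker") >= threshold
```

The point is the workflow: labeled examples in, decision function out, with no rule base to maintain.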

2 – Data Mastering at Scale ML requires a human-in-the-loop

Data mastering models require training data and active learning to correct errors and retrain models. Data stewards will no longer define thousands of rules; instead, they will become experts in ML models, training data sets, and model versioning and management.
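The core of the active-learning loop can be sketched simply: route the pair the current model is least sure about to a human steward for labeling. The similarity feature, decision boundary, and record pairs below are illustrative stand-ins:

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Stand-in match score in [0, 1]; real systems use trained models."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.8  # hypothetical current decision boundary of the model

unlabeled = [
    ("Mike Stonebraker", "Michael Stonebreaker"),
    ("Mike Stonebraker", "Jane Doe"),
    ("Tamr Inc.", "Tamr, Incorporated"),
]

# Uncertainty sampling: the pair closest to the decision boundary is the one
# whose human label teaches the model the most.
most_uncertain = min(unlabeled, key=lambda p: abs(similarity(*p) - THRESHOLD))
print("Ask a steward about:", most_uncertain)
```

The steward's answer becomes new training data, the model is retrained, and the loop repeats.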

3 – Low latency matching

Data operations teams will need to integrate their data mastering systems back into operational systems. To work in real time, those systems will need access to current mastered data as users enter or update operational data in source systems. This will keep source data better synchronized across the enterprise and give real-time access to mastered data.
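A minimal sketch of what low-latency matching might look like at write time: an in-memory index from a normalized key to a master ID, consulted before each insert. The keys, IDs, and normalization rule are all illustrative assumptions; a production system would match far more robustly:

```python
# Hypothetical mastered store: normalized key -> master entity ID.
master_index = {
    "michael stonebraker": "CUST-001",
    "acme gmbh": "CUST-002",
}

def normalize(name):
    """Cheap normalization: lowercase and collapse whitespace."""
    return " ".join(name.lower().split())

def resolve(name):
    """Return the master ID for a record, minting a new one on a miss."""
    key = normalize(name)
    if key not in master_index:
        master_index[key] = f"CUST-{len(master_index) + 1:03d}"
    return master_index[key]

print(resolve("Michael  Stonebraker"))  # CUST-001
print(resolve("Jane Doe"))              # CUST-003
```

Because the lookup happens as the operational record is written, source systems stay synchronized with the mastered view instead of drifting until the next batch run.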

4 – Future data mastering innovations will come to market as ML models

Breakthroughs in data integration will come to market as machine learning models. Interesting research on the brink of commercialization includes:

  • Data error detection (see the HoloDetect project) and correction (see the HoloClean project). The resulting ML models can detect and correct errors in data sets with high accuracy.
  • Model management – collaborative tools that help teams debug models and manage versions of models and training data to increase model accuracy (see the Data Civilizer project).

We call products that embrace these four tenets Data Mastering at Scale. If you have a small problem without real-time requirements, then use a traditional product. If your problem involves many sources with large numbers of complex records, and possibly real-time requirements, then look for a product with the above tenets, i.e., Data Mastering at Scale!


To learn more about how Tamr’s solutions solve the data mastering at scale challenge, schedule a demo.



Michael Stonebraker is an adjunct professor at MIT CSAIL and a database pioneer who specializes in database management systems and data integration. He was awarded the 2014 A.M. Turing Award (known as the “Nobel Prize of computing”) by the Association for Computing Machinery. Professor Stonebraker has been a pioneer of database research and technology for more than 40 years, and is the author of scores of papers in this area. Before joining CSAIL in 2001, he was a professor of computer science at the University of California Berkeley for 29 years. While at Berkeley, he was the main architect of the INGRES relational DBMS; the object-relational DBMS POSTGRES; and the federated data system Mariposa. After joining MIT, he was the principal architect of C-Store (a column store commercialized by Vertica), H-Store, a main memory OLTP engine (commercialized by VoltDB), and SciDB (an array engine commercialized by Paradigm4). In addition, he has started three other companies in the big data space, including Tamr. He also co-founded the Intel Science and Technology Center for Big Data, based at MIT CSAIL.