Written by Michael Stonebraker
Suppose you are a data steward, responsible for integrating a collection of data sources, S1, …, Sn. Historically, you would perform the following steps:
- Have your best programmer define a global schema GS, to which the various sources will be mapped.
- Have n programmers go out and interview each of the n dataset owners to discover what information they have and how it is actually structured.
- The programmer then writes a transformation script (think of this as Python, but it is likely to be in a proprietary scripting language). This script transforms his/her assigned data source, Sj, into the global schema, GS.
- The programmer is also responsible for cleaning up the data, typically by writing a collection of cleaning rules, again in a proprietary rule language. A popular rule would be that -999 actually means null.
- The programmer loads the result into a DBMS, which supports GS.
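The transformation-plus-cleaning step above can be sketched in a few lines of Python. The source rows and the target schema fields ("name", "annual_spend") are hypothetical; only the -999-means-null cleaning rule comes from the text.

```python
# Sketch of a transformation script plus one cleaning rule. The source rows
# and the target schema ("name", "annual_spend") are hypothetical; the
# cleaning rule (-999 actually means null) comes from the text.

RAW_ROWS = [
    {"supplier": "Acme Corp", "spend": 12000},
    {"supplier": "Beta LLC", "spend": -999},  # -999 is a sentinel for null
]

def clean_row(row):
    """Cleaning rule: map the -999 sentinel to None (i.e., null)."""
    return {k: (None if v == -999 else v) for k, v in row.items()}

def transform(rows):
    """Transform cleaned source rows into the hypothetical global schema GS."""
    return [
        {"name": r["supplier"], "annual_spend": r["spend"]}
        for r in (clean_row(row) for row in rows)
    ]

print(transform(RAW_ROWS))
```

In a real ETL tool both steps would be expressed in that tool's scripting and rule languages, but the shape of the work is the same.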
Historically, these steps would be performed by an Extract, Transform and Load (ETL) system available from IBM, Informatica, Talend, and Knime, among others.
Since the n data sources may contain duplicates, they must now be consolidated or erroneous results will be produced. For example, one recent Tamr customer found that they were overcounting suppliers by a factor of two. The difference between 100K suppliers and 200K suppliers is very significant to this particular enterprise. As such, our data steward must remove duplicates.
Traditionally, our data steward would use a Master Data Management (MDM) system, available from “the usual suspects.” The MDM system supports a “match/merge” process, whereby our data steward writes rules to determine when two records match (i.e., correspond to the same entity). For example, if one is consolidating data about ski areas, a reasonable rule is that two ski areas are the same if they have the same “vertical drop.” This set of rules determines a collection of data clusters, each of which hopefully represents a single entity.
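A toy Python sketch makes the “match” step concrete. The ski-area records below are hypothetical; the match rule (same vertical drop) is the one from the text, and pairwise matches are grouped into clusters with a simple union-find.

```python
from itertools import combinations

# Toy sketch of the rule-based "match" step. The ski-area records are
# hypothetical; the match rule (same vertical drop) is the one from the text.

records = [
    {"id": 1, "name": "Alta", "vertical_drop": 2538},
    {"id": 2, "name": "ALTA SKI AREA", "vertical_drop": 2538},
    {"id": 3, "name": "Stowe", "vertical_drop": 2360},
]

def matches(a, b):
    """Match rule: two ski areas are the same if they share a vertical drop."""
    return a["vertical_drop"] == b["vertical_drop"]

# Group matching records into clusters with a minimal union-find.
parent = {r["id"]: r["id"] for r in records}

def find(x):
    while parent[x] != x:
        x = parent[x]
    return x

for a, b in combinations(records, 2):
    if matches(a, b):
        parent[find(b["id"])] = find(a["id"])

clusters = {}
for r in records:
    clusters.setdefault(find(r["id"]), []).append(r["name"])
print(clusters)  # records 1 and 2 end up in one cluster, Stowe in another
```

The point is that every such rule has to be conceived of and written by a human, which is exactly what stops scaling.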
Now for each cluster, the data steward must find a “golden record”, which distills the collection of attribute values in a cluster down to a single one. Again, this is done through a collection of rules.
- As such, the steward writes a collection of match rules that perform this entity-consolidation task, followed by
- a collection of merge rules that consolidate attribute values. In our ski-area example, a simple rule is “majority consensus,” whereby the value chosen for each attribute is the one that occurs most often.
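The “majority consensus” merge rule can be sketched directly: for each attribute, the golden record takes the value that occurs most often in the cluster. The cluster below is hypothetical.

```python
from collections import Counter

# Sketch of the "majority consensus" merge rule: for each attribute, the
# golden record takes the most frequent value in the cluster (hypothetical data).

cluster = [
    {"name": "Alta", "state": "UT", "vertical_drop": 2538},
    {"name": "Alta Ski Area", "state": "UT", "vertical_drop": 2538},
    {"name": "Alta", "state": "Utah", "vertical_drop": 2538},
]

def golden_record(cluster):
    """Distill a cluster down to a single record by majority consensus."""
    return {
        attr: Counter(rec[attr] for rec in cluster).most_common(1)[0][0]
        for attr in cluster[0]
    }

print(golden_record(cluster))  # -> {'name': 'Alta', 'state': 'UT', 'vertical_drop': 2538}
```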
The result is (hopefully) a cleaned, unified data set. However, our data steward’s technology has three fundamental problems:
- The ETL piece of the process is too human intensive. Hence, it is typically limited to a dozen or so data sources. Toyota Motor Europe wishes to consolidate 250 customer data sets (i.e., n = 250). There is no possibility that the traditional technology will scale to these kinds of numbers.
- The rule system piece also fails to scale. Again, Toyota Motor Europe wishes to consolidate 30+ million customer records in 40 languages. They determined it was infeasible to apply rule system technology at this scale.
- The ETL/MDM process is oriented toward programmers, not domain experts. For example, does “Merck” in Europe correspond to “Merck” in the US? A programmer has no clue, and it takes a domain expert to figure this out. (By the way, these happen to be unrelated companies.)
In summary, our data steward’s technology fails badly on integration challenges at scale. In other words, use ETL/MDM only on small problems.
Enter machine learning for MDM at scale
When “heavy lifting” is in order, machine learning must enter the picture:
- Data clusters are found through machine learning, not rules. There is no possibility of writing enough rules to cluster data at scale. Machine learning models use training data (prepared by our data steward) to construct clusters for the whole data set.
- Golden records are found using machine learning. Building a GS up front is no longer necessary; indeed, at scale there is no possibility of doing so. For example, Novartis wants to integrate the electronic lab notebooks of around 10,000 chemists and biologists. Nobody is knowledgeable enough to define a GS in advance. Instead, the schema is pieced together by machine learning as the integration process progresses. (See Tamr for an example of an ML-driven system.)
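To make the contrast with hand-written rules concrete, here is a deliberately tiny, stdlib-only sketch in which the “model” (just a string-similarity threshold) is fit from training pairs labeled by the data steward. All names and pairs below are invented, and real systems such as Tamr use far richer features and learners; the point is only that the match criterion is learned from examples rather than written as rules.

```python
from difflib import SequenceMatcher

# Stdlib-only sketch: a "model" (a string-similarity threshold) is fit from
# steward-labeled training pairs instead of hand-written match rules.
# All names below are invented; real systems use far richer features.

def similarity(a, b):
    """Crude pairwise feature: normalized string similarity."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Training data prepared by the steward: (record_a, record_b, is_match).
training = [
    ("Acme Corp", "ACME Corporation", True),
    ("Acme Corp", "Apex Industries", False),
    ("Globex Ltd", "Globex Limited", True),
    ("Globex Ltd", "Initech", False),
]

def fit_threshold(pairs):
    """Pick the similarity threshold that classifies the training pairs best."""
    best_t, best_correct = 0.5, -1
    for t in sorted(similarity(a, b) for a, b, _ in pairs):
        correct = sum((similarity(a, b) >= t) == label for a, b, label in pairs)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

t = fit_threshold(training)
print(similarity("Acme Corp", "Acme Corporation") >= t)  # -> True
```

More labeled pairs improve the fitted model without anyone writing another rule, which is why this approach scales where rule systems do not.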
As the MDM technology evolves, so must our data steward. To succeed in the age of machine learning, our data steward will need to become expert in:
- Data science & ML models – Our data steward will evaluate and operate ML models. This will require knowledge of AI and of how ML models for data integration and data quality work.
- Training data – Our data steward will create and refine training data for the models, and evaluate the impact of training data changes on ML model effectiveness. This will require the ability to conceive of, generate, test, and refine training data sets.
- Model management – Models change, training data changes, and different combinations of the two produce different outcomes. Our data steward will need to implement and operate model change management systems that document changes, manage versions, and record results (with different training data sets). I’ll write more about model management in the near future…we live this daily at Tamr.
- Feedback management – The number of people and applications consuming data is growing exponentially. A data steward can’t expect to identify all quality issues during a data integration process. Instead, data stewards need to get very good at collecting and using feedback from all corners of the enterprise to improve data quality.
To learn more about Tamr’s human-guided, machine learning based approach to data unification, schedule a demo. Or to read more about machine learning for data stewardship, download our white paper below.