Scalable Data Integration: Five Tenets for Success

By Michael Stonebraker, Tamr Co-Founder and CTO



Data curation involves:

  • ingesting data sources,
  • cleaning errors from the data (-99 often means null),
  • transforming attributes (for example, converting Euros to dollars),
  • performing schema integration to connect up disparate data sources, and
  • performing entity consolidation to remove duplicates.
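The first three tasks above can be sketched in a few lines. This is a minimal illustration, not Tamr's implementation; the field names, the -99 sentinel convention, and the exchange rate are assumptions for the example.

```python
# Sketch of two curation steps: cleaning a sentinel value (-99 means null)
# and transforming an attribute (Euros to dollars). Field names and the
# exchange rate are illustrative assumptions.

EUR_TO_USD = 1.10  # assumed exchange rate for illustration only

def clean_record(record):
    """Replace the -99 sentinel with None in every field."""
    return {k: (None if v == -99 else v) for k, v in record.items()}

def transform_record(record):
    """Convert a hypothetical price_eur attribute into price_usd."""
    out = dict(record)
    if out.get("price_eur") is not None:
        out["price_usd"] = round(out.pop("price_eur") * EUR_TO_USD, 2)
    return out

raw = {"id": 1, "price_eur": 100.0, "quantity": -99}
curated = transform_record(clean_record(raw))
print(curated)  # {'id': 1, 'quantity': None, 'price_usd': 110.0}
```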

Any data curation product should be capable of performing these tasks. However, first- and second-generation extract, transform, and load (ETL) products scale to only a small number of data sources because of the amount of human intervention required. To scale to hundreds or even thousands of data sources, a new approach is needed. Tamr is an exemplar of this new, third-generation approach and is guided by two principles.

  1. Use statistics and machine learning to make automatic decisions wherever possible. Hence, “pick all the low-hanging fruit” without human intervention.
  2. Ask a human expert for help only when necessary. Instead of an architecture in which a human controls the process with computer assistance, move to one in which the computer runs an automatic process, asking a human for help only when required.
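The two principles amount to a confidence-threshold routing rule: accept or reject candidate matches the model is sure about, and queue the ambiguous middle for a human. The 0.9 threshold and the toy scores below are assumptions for illustration.

```python
# Route candidate record pairs: decide automatically when the model is
# confident, queue the rest for a human expert. Threshold is an assumption.

AUTO_THRESHOLD = 0.9

def route(candidates, match_probability):
    """Split candidate pairs into automatic decisions and a human queue."""
    automatic, needs_expert = [], []
    for pair in candidates:
        p = match_probability(pair)
        if p >= AUTO_THRESHOLD or p <= 1 - AUTO_THRESHOLD:
            automatic.append((pair, p >= AUTO_THRESHOLD))  # low-hanging fruit
        else:
            needs_expert.append(pair)                      # ask a human
    return automatic, needs_expert

# Toy model: pretend match probabilities are precomputed per pair.
scores = {("a1", "a2"): 0.97, ("b1", "b2"): 0.55, ("c1", "c2"): 0.03}
auto, queue = route(list(scores), scores.get)
print(queue)  # [('b1', 'b2')] -- only the ambiguous pair goes to a human
```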

This blog post examines five tenets that are desirable in third-generation data curation systems at scale.

Tenet 1: Data curation is never done

Business analysts and data scientists have an insatiable appetite for more data. This was brought home to me about a decade ago during a visit to a beer company in Milwaukee. They had a fairly standard data warehouse of sales of beer by distributor, time period, brand, and so on. I visited during a year when El Niño was forecast to disrupt winter weather in the US. Specifically, it was forecast to be wetter than normal on the west coast and warmer than normal in New England. I asked the business analysts: “Are beer sales correlated with either temperature or precipitation?” They replied, “We don’t know, but that is a question we would like to ask.” However, temperature and precipitation were not in the data warehouse, so asking was not an option.

The demand from warehouse users to correlate more and more data elements for business value leads to additional data curation tasks. Moreover, whenever a company acquires another, it creates a data curation problem: dealing with the acquiree’s data. Lastly, the treasure trove of public data on the web (such as temperature and precipitation) is largely untapped, leading to still more curation challenges.

Even without new data sources, the collection of existing data sources is rarely static. Hence, inserts and deletes to these sources generate a pipeline of incremental updates to a data curation system. Between the requirements of new data sources and updates to existing ones, it is obvious that a customer’s data curation problem is never done. Any project in this area will therefore effectively continue indefinitely. Any enterprise should simply realize this and plan accordingly.
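Incremental updating can be sketched as folding a stream of inserts and deletes into an already-curated table, rather than recurating from scratch. The record layout and keys below are assumptions for illustration.

```python
# Sketch of incremental curation: apply a batch of inserts and deletes to
# a curated table keyed by record id. Record shapes are illustrative.

def apply_updates(curated, updates):
    """Fold ('insert', key, record) / ('delete', key, None) operations
    into the curated table in order."""
    for op, key, record in updates:
        if op == "insert":
            curated[key] = record    # new or revised source record
        elif op == "delete":
            curated.pop(key, None)   # source record went away
    return curated

table = {1: {"name": "Acme"}, 2: {"name": "Globex"}}
batch = [("insert", 3, {"name": "Initech"}), ("delete", 2, None)]
table = apply_updates(table, batch)
print(sorted(table))  # [1, 3]
```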

One obvious consequence of this tenet concerns consultants. If you hire an outside service to perform data curation for you, then you will have to rehire them for each additional task. This will give the consultant a guided tour through your wallet over time. In my opinion, you are much better off developing in-house curation competence over time.

Tenet 2: A PhD in AI cannot be a requirement for success

Any third-generation system will use statistics and machine learning to make automatic or semi-automatic curation decisions. Inevitably, it will use sophisticated techniques such as T-tests, regression, predictive modeling, data clustering, and classification. Many of these techniques will entail training data to set internal parameters. Several will also generate recall and/or precision estimates.
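The recall and precision estimates mentioned above are straightforward to compute from a labeled sample of the system's match decisions. The decision lists here are assumptions for illustration, not output from any real system.

```python
# Sketch of the precision/recall estimates a curation system might report
# for its automatic match decisions, computed against a labeled sample.

def precision_recall(predicted, actual):
    """Precision and recall of predicted matches against ground truth."""
    predicted, actual = set(predicted), set(actual)
    true_positives = len(predicted & actual)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(actual) if actual else 0.0
    return precision, recall

# Model declared these pairs duplicates; the labeled sample disagrees on one.
predicted = [("r1", "r2"), ("r3", "r4"), ("r5", "r6")]
actual = [("r1", "r2"), ("r3", "r4"), ("r7", "r8")]
p, r = precision_recall(predicted, actual)  # p = 2/3, r = 2/3
```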

These are all techniques understood by data scientists. However, there will be a shortage of such people for the foreseeable future, until colleges and universities produce substantially more than at present. Also, it is not obvious that one can “retread” a business analyst into a data scientist. A business analyst only needs to understand the output of SQL aggregates; in contrast, a data scientist is typically knowledgeable in statistics and various modeling techniques.

As a result, most enterprises will be lacking in data science expertise. Therefore, any third-generation data curation product must use these techniques internally, but not expose them in the user interface. Mere mortals must be able to use data curation products, and a PhD in analysis techniques cannot be a requirement for a user of a curation tool.

Tenet 3: Fully automatic data curation is not likely to be successful

Some data curation products expect to run fully automatically; in other words, they translate input data sets into output without human intervention. Fully automatic operation is very unlikely to succeed in an enterprise, for a variety of reasons. First, some curation decisions simply cannot be made automatically. For example, consider two records: one states that restaurant X is at location Y, while the other states that restaurant Z is at location Y. This could be a case where one restaurant went out of business and was replaced by another, or it could be a food court. There is no good way to answer this question without human guidance.

Second, there are cases where data curation must be highly reliable; consolidating medical records, for instance, should not introduce errors. In such cases, one wants a human to check all (or at least some) of the automatic decisions. Third, there are situations where specialized knowledge is required for data curation. For example, in a genomics application one might have two terms: ICU50 and ICE50. An automatic system might suggest that these are the same thing, since the lexical distance between the terms is low. However, only a human genomics specialist can decide this question.
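The "lexical distance" in the genomics example can be made concrete with Levenshtein (edit) distance: ICU50 and ICE50 differ by a single character, so a purely automatic matcher would likely flag them as duplicates. This is a standard textbook implementation, shown only to illustrate why the pair looks so similar to a machine.

```python
# Levenshtein distance: the minimum number of single-character insertions,
# deletions, or substitutions turning string a into string b.

def levenshtein(a, b):
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("ICU50", "ICE50"))  # 1 -- one substitution apart
```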

For these reasons, any third-generation curation system must be able to ask a human expert when it is unsure of the answer. Moreover, the human expertise required is quite different from the model assumed by Amazon’s Mechanical Turk. Specifically, there must be multiple domains in which a human can be an expert. Within a single domain, humans have a variable amount of expertise, from novice to enterprise expert. Lastly, an expert-sourcing system must avoid overloading the humans it schedules. When considering a third-generation data curation system, look for an embedded expert-sourcing component with levels of expertise, load balancing, and multiple expert domains.
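The three expert-sourcing features named above (multiple domains, skill levels, load balancing) can be sketched in one routing function. The expert roster and the capacity limit are assumptions for illustration.

```python
# Route a question to the most skilled available expert in its domain,
# preferring the least-loaded expert among equals, and refusing to
# overload anyone. Roster and capacity limit are illustrative assumptions.

def assign(question_domain, experts, max_open_tasks=5):
    """Return the chosen expert's name, or None if no capacity exists."""
    pool = [e for e in experts
            if e["domain"] == question_domain and e["open_tasks"] < max_open_tasks]
    if not pool:
        return None  # hold the question rather than overload someone
    best = max(pool, key=lambda e: (e["level"], -e["open_tasks"]))
    best["open_tasks"] += 1
    return best["name"]

experts = [
    {"name": "Ana",  "domain": "genomics", "level": 3, "open_tasks": 4},
    {"name": "Bo",   "domain": "genomics", "level": 3, "open_tasks": 1},
    {"name": "Cruz", "domain": "finance",  "level": 2, "open_tasks": 0},
]
print(assign("genomics", experts))  # Bo: same level as Ana, lighter load
```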

Tenet 4: Data curation must fit into the enterprise ecosystem

Every enterprise has a computing infrastructure in place. This includes a collection of DBMSs storing enterprise data, a collection of application servers and networking systems, and a set of installed tools and applications. Any new data curation system must fit into this existing infrastructure. For example, it must be able to extract from corporate databases, use legacy data cleaning tools, and export data to legacy data systems. Hence, an open environment is required, whereby callouts are available to existing systems. In addition, adaptors to common input and export formats are a requirement. Do not use a curation system that is a closed “black box.”
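The open, adaptor-based design argued for here boils down to a common record interface with pluggable readers and writers per format. The CSV adaptor below is one assumed example; a database or JSON adaptor would implement the same interface.

```python
# Sketch of a pluggable adaptor interface for input/export formats.
# The interface and the CSV example are illustrative assumptions.

import csv
import io

class Adaptor:
    """Common interface every format adaptor implements."""
    def read(self, source):
        raise NotImplementedError
    def write(self, records):
        raise NotImplementedError

class CsvAdaptor(Adaptor):
    def read(self, source):
        # Parse CSV text into a list of dict records.
        return list(csv.DictReader(io.StringIO(source)))
    def write(self, records):
        # Serialize dict records back to CSV text.
        out = io.StringIO()
        writer = csv.DictWriter(out, fieldnames=records[0].keys())
        writer.writeheader()
        writer.writerows(records)
        return out.getvalue()

rows = CsvAdaptor().read("id,name\n1,Acme\n2,Globex\n")
print(rows[0])  # {'id': '1', 'name': 'Acme'}
```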

Tenet 5: A scheme for “finding” data sources must be present

A typical question to ask CIOs is “how many operational data systems do you have?” In all likelihood, they do not know. The enterprise is a sea of such data systems connected by a hodgepodge of connectors. Moreover, there are all sorts of personal data sets, spreadsheets, and databases. Furthermore, many enterprises import data sets from public web-oriented sources. Clearly, CIOs should have a mechanism for identifying data resources that they wish to have curated. Such a system must contain a data source catalog with information on a CIO’s data resources, as well as a query system for accessing this catalog. Lastly, an “enterprise crawler” is required to search a corporate intranet to locate relevant data sources. Collectively, this represents a scheme for “finding” enterprise data sources.
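The catalog-plus-query half of this scheme can be sketched in a few lines: a crawler registers each source it discovers, and a query retrieves sources by topic. The catalog entries below are assumptions for illustration.

```python
# Sketch of a data source catalog with a simple tag query. In practice the
# register() calls would be made by an enterprise crawler as it discovers
# sources; the entries here are illustrative assumptions.

catalog = []

def register(name, owner, tags):
    """Record a discovered data source in the catalog."""
    catalog.append({"name": name, "owner": owner, "tags": set(tags)})

def find(tag):
    """Query the catalog for sources carrying a given tag."""
    return [s["name"] for s in catalog if tag in s["tags"]]

register("sales_dw", "BI team", ["sales", "warehouse"])
register("weather_feed", "public web", ["temperature", "precipitation"])
register("rep_spreadsheet", "sales rep", ["sales", "personal"])

print(find("sales"))  # ['sales_dw', 'rep_spreadsheet']
```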

Together, these five tenets indicate the characteristics of a good third-generation data curation system. If you are in the market for such a product, look for systems with these characteristics.

To learn more about Tamr’s “third-generation” characteristics, please watch Dr. Stonebraker’s presentation from Strata + Hadoop World.

Michael Stonebraker is an adjunct professor at MIT CSAIL and a database pioneer who specializes in database management systems and data integration. He was awarded the 2014 A.M. Turing Award (known as the “Nobel Prize of computing”) by the Association for Computing Machinery. Professor Stonebraker has been a pioneer of database research and technology for more than 40 years, and is the author of scores of papers in this area. Before joining CSAIL in 2001, he was a professor of computer science at the University of California Berkeley for 29 years. While at Berkeley, he was the main architect of the INGRES relational DBMS; the object-relational DBMS POSTGRES; and the federated data system Mariposa. After joining MIT, he was the principal architect of C-Store (a column store commercialized by Vertica), H-Store, a main memory OLTP engine (commercialized by VoltDB), and SciDB (an array engine commercialized by Paradigm4). In addition, he has started three other companies in the big data space, including Tamr. He also co-founded the Intel Science and Technology Center for Big Data, based at MIT CSAIL.