Written by Michael Stonebraker
(In this entry, I explain the three generations of data integration products and note what appears to have caused the transitions between the product families.)
In the 1990s, data warehouses arrived on the scene. Led by the major retailers, customer-facing data (item sales, products, customers) was assembled in a data store and used by retail buyers to make better purchase decisions. For example, pet rocks might be out-of-favor and Barbie dolls might be “in.” As a result, retailers could discount the pet rocks and tie up the Barbie doll factory with a big order. Data warehouses typically paid for themselves within a year through better buying decisions. First-generation data integration systems were termed ETL (Extract, Transform and Load) products and were used to assemble the data from various sources (usually fewer than 20) into the warehouse. The typical data warehouse project was substantially over budget and late because of the difficulty of the data integration issues faced in early systems. In effect, first-generation ETL products solved only the transformation portion of the warehouse puzzle.
This led to a second generation of ETL systems, whereby the major ETL products were extended with data cleaning modules and additional adaptors to ingest other kinds of data. In effect, the ETL tools were extended in the second generation to become data curation tools. In general, data curation systems followed the architecture of earlier first-generation systems, whereby they were tool kits oriented toward use by professional programmers. In other words, they were programmer productivity tools.
There are two substantial weaknesses in second-generation tools. First, scalability is often a big issue. Specifically, enterprises want to curate “the long tail” of enterprise data. In other words, enterprises have several thousand data sources, everything from company budgets in the CFO’s spreadsheets to operational peripheral systems. There is “business intelligence gold” in the “long tail,” and enterprises wish to capture it. Cross-selling of enterprise products is just one use for these additional data sources. Furthermore, the rise of public data on the web leads business analysts to want to curate additional data sources. Everything from weather data to customs records to real estate transactions to political campaign contributions is readily available on the web. In order to capture the long tail as well as public data, curation tools must be extended to deal with hundreds to thousands of data sources rather than the few tens in earlier generations.
A second issue concerns architecture. A professional programmer (usually reporting to central IT) does not know the answers to many of the data curation questions that arise. For example, are “rubber gloves” the same thing as “latex hand protectors”? Is an “ICU50” the same kind of object as an “ICU”? Only business people in line-of-business organizations can answer these kinds of questions. However, they are usually not in the same organization as the programmers running data curation projects. As a result, second-generation systems are not architected to take advantage of the humans best able to provide curation help.
These issues have led to a third generation of data curation products, which we term scalable data curation systems, that are designed to scale to thousands of data sources and leverage business experts to assist in curation decisions. To scale, a third-generation system must “pick the low-hanging fruit” automatically using a combination of statistics and machine learning. Hence, they are architected as automated systems which ask a human for help only when necessary. Obviously, enterprises differ in the required accuracy of curation, so third-generation systems must allow an enterprise to trade off accuracy and the amount of human involvement.
In addition, they must contain a crowd sourcing component, so business experts can assist with curation decisions. Unlike Amazon’s Mechanical Turk, such a module must be able to deal with a hierarchy of experts inside an enterprise. It also must be able to cope with various kinds of expertise. As such, we call this component an expert sourcing system to distinguish it from the more primitive crowd sourcing systems in the marketplace. In short, a third-generation data curation product must be an automated system with an expert sourcing component. Tamr is an early example of this third generation of systems.
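To make the third-generation architecture concrete, here is a minimal sketch of the two ideas described above: automatic decisions on the easy cases, and routing of the uncertain ones to an expert with the right expertise. It is an illustration under stated assumptions, not a description of how Tamr or any other product works. The match scores, the thresholds `accept_at` and `reject_at`, the domain labels, and the expert names are all hypothetical; in a real system the scores would come from statistical or machine-learned models.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Expert:
    """A business expert; `manager` is the next level up in the hierarchy."""
    name: str
    domains: set[str]
    manager: Optional[Expert] = None


def triage(score: float, accept_at: float, reject_at: float) -> str:
    """Decide automatically when the score is decisive, otherwise ask a human."""
    if score >= accept_at:
        return "auto_match"
    if score <= reject_at:
        return "auto_non_match"
    return "ask_expert"


def route(domain: str, experts: list[Expert]) -> Optional[Expert]:
    """Return the first listed expert whose expertise covers the domain."""
    for expert in experts:
        if domain in expert.domains:
            return expert
    return None


def escalation_chain(expert: Expert) -> list[Expert]:
    """The expert followed by each successive manager, for escalating a question."""
    chain = []
    current: Optional[Expert] = expert
    while current is not None:
        chain.append(current)
        current = current.manager
    return chain


def plan(candidates, experts, accept_at=0.9, reject_at=0.1):
    """Map (left, right, domain, score) tuples to (left, right, action, expert name)."""
    decisions = []
    for left, right, domain, score in candidates:
        action = triage(score, accept_at, reject_at)
        expert = route(domain, experts) if action == "ask_expert" else None
        decisions.append((left, right, action, expert.name if expert else None))
    return decisions


# Hypothetical experts and candidate pairs; all scores are invented.
lead = Expert("procurement_lead", {"medical supplies", "hospital equipment"})
analyst = Expert("supply_analyst", {"medical supplies"}, manager=lead)
experts = [analyst, lead]

candidates = [
    ("rubber gloves", "latex hand protectors", "medical supplies", 0.62),
    ("ICU50", "ICU", "hospital equipment", 0.48),
    ("Barbie doll", "Barbie doll", "toys", 0.99),
    ("pet rock", "Barbie doll", "toys", 0.02),
]

for decision in plan(candidates, experts):
    print(decision)
```

The gap between the two thresholds is the accuracy knob: a wide gap sends more questions to experts, while a narrow gap lets the system decide more pairs on its own at the risk of more mistakes. With `accept_at=0.6` and `reject_at=0.5`, for instance, none of the four sample pairs would reach a human.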
In summary, ETL systems arose to deal with the transformation issues in early data warehouses, evolving into second-generation data curation systems as they expanded the scope of their offerings. Third-generation systems, which have a very different architecture, came into existence to address the enterprise need for data source scalability. Third-generation systems can also co-exist with currently-in-place second-generation systems, which can curate the first tens of data sources to generate a composite result that in turn can be curated with the “long tail” by third-generation systems.