Written by Tamr
In his presentation — “Tackling Data Curation” — Mike dives deep into the current and ideal state of data integration technology. (Access his presentation slides below.)
- Mike starts by outlining the current state of affairs in the enterprise: silos everywhere, with an average of 5,000 already and a desire to add public data into the mix.
- He then walks through how the “global data model” approach to integration (which began in the 1990s) is insufficient for the reality of today’s enterprise data: it cannot deal fully with anything more than 25-50 sources. Using a Big Pharma example of a need to integrate 10,000+ data sources, Mike discusses how putting silos into a data lake doesn’t solve the data integration issue … but results in a data swamp.
- Instead, Mike calls for managing enterprise schema proliferation through a Data Curation approach with the following components: Ingest, Validate, Transform, Match Schemas, Consolidate. To achieve scalability in this approach, he recommends “picking the low-hanging fruit” automatically, using machine learning and statistics. This means the system needs to be built from the bottom up (instead of through an upfront global schema) and involve human experts (who aren’t programmers) to help with the cleaning.
- As an example of this approach, Mike points to Tamr and walks through the key elements of the Tamr platform, illustrates some customer success stories and looks to future development.
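To make the “low-hanging fruit” idea from the Match Schemas step concrete, here is a minimal sketch in Python. It is an illustration only, not Tamr’s implementation or anything shown in the keynote: the function names, the 0.85 threshold, and the use of simple string similarity (standing in for the machine learning and statistics Mike describes) are all assumptions. Confident matches are accepted automatically, and everything else is queued for a human expert.

```python
from difflib import SequenceMatcher

# Hypothetical cutoff: matches at or above this score are accepted automatically.
AUTO_ACCEPT = 0.85


def normalize(name):
    """Lowercase a column name and drop punctuation, spaces and underscores."""
    return "".join(ch for ch in name.lower() if ch.isalnum())


def similarity(a, b):
    """String similarity in [0, 1]; a stand-in for a learned matching model."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_schemas(source_columns, known_attributes, threshold=AUTO_ACCEPT):
    """Map source columns onto attributes already seen in other sources.

    Returns (auto, review): `auto` maps confidently matched columns to their
    attribute; `review` lists (column, best_guess, score) tuples that a human
    expert should confirm or correct.
    """
    auto, review = {}, []
    for col in source_columns:
        # Pick the closest known attribute for this column.
        best = max(known_attributes, key=lambda attr: similarity(col, attr))
        score = similarity(col, best)
        if score >= threshold:
            auto[col] = best  # low-hanging fruit: no human needed
        else:
            review.append((col, best, round(score, 2)))  # route to an expert
    return auto, review


if __name__ == "__main__":
    auto, review = match_schemas(
        ["customer-name", "Cust_Name", "zip"],
        ["customer_name", "postal_code"],
    )
    print("auto-accepted:", auto)
    print("needs expert review:", review)
```

In a bottom-up setup, attributes confirmed by the expert would be added back to `known_attributes`, so the set of known attributes grows as each new source is ingested instead of being fixed by an upfront global schema.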
Mike’s keynote prompted some very thoughtful coverage and commentary worth checking out, including:
Forbes (7/29/15) – Gil Press #MITCDOIQ
Turing Award-Winner Stonebraker on the Future of Taming Big Data
SiliconANGLE/TheCUBE (7/22/15) – Paul Gillin, Amber Johnson
The data integration debacle; beyond ‘nirvana’ solutions – #MITCDOIQ (video & post)
SearchCIO/TechTarget – Nicole Laskowski #MITCDOIQ
How big data broke the back of ETL and the rise of data curation (Stonebraker Q&A)
About MIT CDOIQ: In its 9th year, the MIT Chief Data Officer & Information Quality Symposium is jointly hosted by the MIT Information Quality Program at SSRC, the MIT Sloan School of Management, and the International Conference on Information Quality.
To download Mike’s presentation, register below.