MIT Group Cites “Data Prep” as a Data Science Bottleneck

MIT News


We saw this terrific piece in MIT News last week on “Better Machine Learning,” and it warmed our hearts. Specifically, the quotes from Kalyan Veeramachaneni, a research scientist at MIT CSAIL’s Any Scale Learning for All (ALFA) group:

“Machine-learning systems usually require a covariates table in a column-wise format, as well as a response variable that we try to predict,” says Veeramachaneni. “The process to get these from raw data involves curation, syncing and linking of data, and even generating ideas for variables that we can then operationalize and form.”
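To picture what Veeramachaneni means, here is a minimal sketch, assuming pandas and an entirely made-up customer-churn example (the column names and the task are ours, not from the article): raw transaction rows get curated into a column-wise covariates table, one engineered variable per column, plus an aligned response variable.

```python
import pandas as pd

# Hypothetical raw event log: one row per transaction, not yet model-ready.
raw = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2, 3],
    "amount":      [20.0, 35.0, 5.0, 12.0, 8.0, 50.0],
    "churned":     [0, 0, 1, 1, 1, 0],   # repeated per row in the raw data
})

# Curate into a column-wise covariates table: one row per entity,
# one engineered variable per column.
covariates = raw.groupby("customer_id").agg(
    n_transactions=("amount", "size"),
    total_spend=("amount", "sum"),
    avg_spend=("amount", "mean"),
)

# Response variable, aligned to the same entities.
response = raw.groupby("customer_id")["churned"].first()
```

Getting from `raw` to `covariates` and `response` is exactly the curation, syncing, and variable-generation work the quote describes, and on real data it is rarely this tidy.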

Much of Veeramachaneni’s recent research has focused on how to automate this lengthy data prep process. “Data scientists go to all these boot camps in Silicon Valley to learn open source big data software like Hadoop, and they come back, and say ‘Great, but we’re still stuck with the problem of getting the raw data to a place where we can use all these tools,’” Veeramachaneni says.

The proliferation of data sources and the time it takes to prepare these massive reserves of data are the core problems Tamr is attacking. The knee-jerk reaction to this next-gen integration and preparation problem tends to be “Machine Learning” — a cure for all ills. But as Veeramachaneni points out, machine learning can’t resolve all data inconsistencies:

Veeramachaneni and his team are also exploring how to efficiently integrate the expertise of domain experts, “so it won’t take up too much of their time,” he says. “Our biggest challenge is how to use human input efficiently, and how to make the interactions seamless and efficient. What sort of collaborative frameworks and mechanisms can we build to increase the pool of people who participate?”

Tamr has built the very sort of collaborative framework Veeramachaneni mentions, drawing from the best of machine and human learning to connect hundreds or thousands of data sources.

Top-down, deterministic data unification approaches (such as ETL, ELT and MDM) were not designed to scale to the variety of hundreds, thousands, or even tens of thousands of perpetually proliferating data silos. Traditional deterministic systems depend on a highly trained architect developing a “master” schema, “the one schema to rule them all,” which we believe is a red herring. Embracing the fundamental diversity and ever-changing nature of enterprise data and semantics leads you toward a bottom-up, probabilistic approach to connecting data sources across enterprise silos.
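To make “bottom-up and probabilistic” concrete, here is a minimal sketch of ours (not Tamr’s actual method): instead of declaring a master schema up front, score every candidate attribute pair across two silos and keep only the likely mappings. The column names, the similarity measure, and the 0.5 cutoff are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Hypothetical column names from two enterprise silos; purely illustrative.
silo_a = ["cust_name", "phone_num", "zip"]
silo_b = ["customer_name", "telephone", "postal_code"]

def score(a: str, b: str) -> float:
    """Attribute-name similarity as one crude probabilistic signal."""
    return SequenceMatcher(None, a, b).ratio()

# Bottom-up: score every candidate pair and keep the likeliest mappings,
# rather than hand-designing one schema to rule them all.
mapping = {}
for a in silo_a:
    best_b = max(silo_b, key=lambda b: score(a, b))
    if score(a, best_b) >= 0.5:   # assumed confidence cutoff
        mapping[a] = best_b
```

Note that `zip` finds no confident counterpart and is left unmapped rather than forced into a schema, which is the point of the probabilistic approach: uncertainty is surfaced instead of papered over.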

You also have to engage the source owners collaboratively to curate the variety of data at scale, which is Tamr’s core design pattern. Advanced algorithms automatically connect the vast majority of the sources while resolving duplications, errors and inconsistencies across sources, attributes and records, a bottom-up, probabilistic solution reminiscent of Google’s full-scale approach to web search and connection. When the Tamr system can’t resolve connections automatically, it calls for human expert guidance, asking people in the organization who are familiar with the data to weigh in on the mapping and improve its quality and integrity.
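The machine-plus-expert pattern can be sketched in a few lines: connect the record pairs the model is confident about, and queue the ambiguous middle band for human review. Everything below (the records, the similarity measure, the thresholds) is a made-up illustration, not Tamr’s actual model or values.

```python
from difflib import SequenceMatcher

# Hypothetical customer records from two silos.
records_a = ["Acme Corporation", "Globex Inc.", "Initech LLC"]
records_b = ["ACME Corp", "Globex Incorporated", "Initrode"]

AUTO_MATCH = 0.70    # confident enough to connect automatically (assumed)
AUTO_REJECT = 0.45   # below this, treat as no match at all (assumed)

def similarity(a: str, b: str) -> float:
    """Crude string similarity standing in for richer matching features."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

auto_matches, needs_review = [], []
for a in records_a:
    best = max(records_b, key=lambda b: similarity(a, b))
    s = similarity(a, best)
    if s >= AUTO_MATCH:
        auto_matches.append((a, best))       # the machine resolves it
    elif s > AUTO_REJECT:
        needs_review.append((a, best, s))    # queued for a human expert
```

In this toy run only the clearest pair is connected automatically; the rest land in `needs_review`, which is where a collaborative framework asks domain experts to weigh in, efficiently and only on the hard cases.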

For more on Tamr’s machine-learning, human-guided approach, watch our video here.