Andy Palmer MIT CDOIQ Talk Slides: Data Quality Through Curation at Big Data Scale

In his talk at the MIT CDO IQ Symposium on July 23, Tamr Co-Founder and CEO Andy Palmer shared his vision for “Data Quality Through Curation at Big Data Scale.”

Click here to view the slides from this talk

Palmer starts by urging CDOs to “always start” big data analytics with “questions and context,” and describes the three flavors of analytics context as 1) Prescriptive, 2) Predictive, 3) Descriptive.

After outlining the big data opportunity for enterprises, Palmer defines their “data source problem”: the number and diversity of data sources (private/public, structured/semi-structured, etc.) are exploding.

Multiple approaches have emerged to deal with this Data Variety problem, with the current state dominated by extreme top-down management (95% deterministic to 5% probabilistic). Palmer predicts that the sheer number of data sources and the complexity of change will drive us toward a bottom-up approach (80% probabilistic to 20% deterministic).

“The only viable way to tame enterprise data variety,” he argues, “is through bottom-up, collaborative data curation” that complements traditional MDM, ETL, data profiling and data quality methods.

This is the approach that Tamr is taking in building next-generation data curation technology that creates “rich context” for enterprise data variety:

  • Identify relationships across your sources using a machine learning “bottom up” approach
  • Continuous active learning combining machine/human insight
  • Cost effective as you unify more sources – marginal cost of new source = at least linear
  • Deploy context-rich sources for the different lines of business (LOBs) across the enterprise
  • Enterprise metadata catalog – all your attributes, all your sources
  • Services (e.g. APIs) can also be directly deployed in data warehouses/lakes and operational workflows
  • In-situ curation of large sources