In his talk at the MIT CDO IQ Symposium on July 23, Tamr Co-Founder and CEO Andy Palmer shared his vision for “Data Quality Through Curation at Big Data Scale.”
Palmer starts by urging CDOs to “always start” big data analytics with “questions and context,” describing the three flavors of analytics context as 1) Prescriptive, 2) Predictive, 3) Descriptive.
After outlining the big data opportunity for enterprises, Palmer defines their “data source problem”: the number and diversity of data sources (private/public, structured/semi-structured, etc.) are exploding.
Multiple approaches have emerged to deal with this Data Variety problem, with the current state dominated by extreme top-down management (95% deterministic to 5% probabilistic). Palmer predicts that the sheer number of data sources and the complexity of change are going to drive us toward a bottom-up approach (80% probabilistic to 20% deterministic).
“The only viable way to tame enterprise data variety,” he argues, is through “bottom-up, collaborative data curation” that complements traditional MDM, ETL, data profiling and data quality methods.
This is the approach that Tamr is taking in building next-generation data curation technology that creates “rich context” for enterprise data variety:
- Identify relationships across your sources using a machine learning “bottom up” approach
- Continuous active learning combining machine/human insight
- Cost effective as you unify more sources – marginal cost of new source = at least linear
- Deploy context-rich sources for the different LOBs across the enterprise
- Enterprise metadata catalog – all your attributes, all your sources
- Services (e.g. APIs) can also be directly deployed in data warehouses/lakes and operational workflows
- In-situ curation of large sources
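
To make the “bottom-up” idea with combined machine/human insight more concrete, here is a minimal Python sketch. It is not Tamr's implementation: the function names, the thresholds, the sample attribute names, and the use of plain string similarity in place of a trained model are all assumptions made for illustration. The machine scores candidate attribute pairs across two sources, accepts the confident matches on its own, and queues the uncertain ones for a human expert.

```python
from difflib import SequenceMatcher
from itertools import product


def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(source_a, source_b, accept=0.85, reject=0.40):
    """Split candidate attribute pairs into auto-matched and needs-review.

    The accept/reject thresholds are hypothetical values for this sketch.
    """
    matched, review = [], []
    for a, b in product(source_a, source_b):
        score = similarity(a, b)
        if score >= accept:
            # Confident enough for the machine to decide alone.
            matched.append((a, b, score))
        elif score > reject:
            # Uncertain: route to a human expert for a yes/no answer.
            review.append((a, b, score))
    # Ask about the most ambiguous pairs first (closest to the midpoint).
    midpoint = (accept + reject) / 2
    review.sort(key=lambda pair: abs(pair[2] - midpoint))
    return matched, review


# Hypothetical attribute names from two enterprise sources.
crm = ["customer_name", "email", "phone"]
erp = ["Customer_Name", "e_mail", "telephone"]

matched, review = triage(crm, erp)
for a, b, score in matched:
    print(f"auto-match   {a!r} ~ {b!r} ({score:.2f})")
for a, b, score in review:
    print(f"ask a human  {a!r} ~ {b!r} ({score:.2f})")
```

In a real active-learning loop, the expert's answers would be fed back as training data so that fewer pairs need review over time; this sketch only shows the triage step.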