How to Clean Noisy and Erroneous Big Data Using Machine Learning

Data unification/deduplication and repair are proving to be difficult for many organizations. In fact, data unification and cleaning account for about 60% to 70% of the work of data scientists. It’s the most time-consuming, least rewarding data science task. Organizations often put their expensive data scientists to work looking at ugly datasets and running random transformations for months before they can even put machine learning into practice and visualize the data.

Done manually, data unification and curation are also becoming untenable. Why?

When companies only dealt with a few thousand tables or records, it was tedious and extremely slow for humans to, for example, pinpoint all the clusters of records representing the same real-world entity. It could take months or even years. By the time humans finished massaging the data, getting it ready to mine or model, the data and the question at hand were often out of date.

But 2019 is around the corner, and most enterprises have hundreds of thousands or millions of records, making it impossible to unify and fix data manually. The only way today’s organizations can solve data curation problems and answer important business questions in a timely, accurate, and scalable way is using machine learning (ML).

While the principles are well understood in academia, the engineering details in configuring and deploying ML techniques are the biggest hurdle. At the Strata conference in 2018, I provided insights into various techniques and discussed how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.

The first step is understanding that data preparation tasks fall into two major classes: 1) data unification, which covers schema mapping, deduplication, and so on; and 2) data cleaning, which covers spotting errors and repairing data.

In this talk, I focus primarily on scalability, especially when it comes to schema mapping, deduplication, and record linking, as well as other cleaning challenges such as data repair and value imputation. Several solutions exist to help solve these problems automatically and at scale. Among them are:

  • Tamr, a system that makes it easy to use machine learning to unify data silos;
  • HoloClean, a framework for holistic data repairing driven by probabilistic inference; and
  • Trifacta Wrangler, a tool for accelerating data preparation.

Also included in the presentation are practical tips about:

  • Why building an engineering pipeline is the most crucial step in building a machine learning model.
  • How blocking can assist with data deduplication and help avoid n² problems, where every record is compared against every other record.
  • How much training data is required, and where to get more data.
  • Why it is vital to avoid the “cold start” problem.
  • How to use weak supervision (see Snorkel as an example of a way to build a functional pipeline to obtain training data using weak supervision).
  • Which model to select (e.g. Naive Bayes, SVM, Logistic Regression, Deep Learning, or any of a host of others) for implementing classifiers, and the tradeoffs of each.
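
To make the blocking tip above concrete, here is a minimal sketch in Python. It is not taken from the talk or from any of the tools named earlier; the blocking key (first three letters of the name plus the ZIP code), the field names, and the sample records are all assumptions chosen for illustration. The idea is to group records by a cheap key and compare only records that share a block, rather than all n(n-1)/2 pairs.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    """Hypothetical blocking key: first 3 letters of the lowercased name plus ZIP code."""
    return (record["name"].strip().lower()[:3], record["zip"])


def candidate_pairs(records):
    """Return index pairs (i, j), i < j, for records that share a blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    pairs = set()
    for ids in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.update(combinations(ids, 2))
    return pairs


# Illustrative records (made up for this sketch).
records = [
    {"name": "Acme Corp.", "zip": "02139"},
    {"name": "ACME Corporation", "zip": "02139"},
    {"name": "Globex Inc", "zip": "10001"},
    {"name": "Globex Incorporated", "zip": "10001"},
    {"name": "Initech", "zip": "73301"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2

print(f"Pairs to compare with blocking: {len(pairs)} of {all_pairs}")
```

The surviving candidate pairs would then be passed to a classifier that decides whether each pair is a true duplicate. The tradeoff is that a key that is too strict can separate real duplicates into different blocks, so in practice several keys are often combined.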

If your organization can benefit from scaling data unification and curation efforts, I offer hands-on advice and expertise in this video of my presentation from the Strata conference, which draws on some of the best work from both academic and commercial applications. Check it out, and feel free to contact me or the other data unification experts at Tamr if you have questions or need clarification.



Ihab is a co-founder of Tamr and a professor in the Cheriton School of Computer Science at the University of Waterloo, where his main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, elected SIGMOD vice chair, and an associate editor of ACM Transactions on Database Systems (TODS). He holds a Ph.D. in Computer Science from Purdue University and a B.Sc. and an M.Sc. from Alexandria University.