How to Clean Noisy and Erroneous Big Data Using Machine Learning

Practical Advice from Both Academic and Commercial Applications


Data unification/deduplication and repair are proving to be difficult for many organizations. In fact, data unification and cleaning account for about 60% to 70% of the work of data scientists. It’s the most time-consuming, least rewarding data science task. Organizations often put their expensive data scientists to work looking at ugly datasets and running random transformations for months before they can even put machine learning into practice and visualize the data.

Done manually, data unification and curation are also becoming untenable. Why?

When companies only dealt with a few thousand tables or records, it was tedious and extremely slow for humans to, for example, pinpoint all the clusters of records representing the same real-world entity. It could take months or even years. By the time humans finished massaging the data, getting it ready to mine or model, the data and the question at hand were often out of date.

But 2019 is around the corner, and most enterprises have hundreds of thousands or millions of records, making it impossible to unify and fix data manually. The only way today’s organizations can solve data curation problems and answer important business questions in a timely, accurate, and scalable way is using machine learning (ML).

While the principles are well understood in academia, the engineering details in configuring and deploying ML techniques are the biggest hurdle. At the Strata conference in 2018, I provided insights into various techniques and discussed how machine learning, human expertise, and problem semantics collectively can deliver a scalable, high-accuracy solution.

The first step is understanding that data preparation tasks fall into two major classes: 1) data unification, which covers schema mapping, deduplication, and so on; and 2) data cleaning, which covers spotting errors and repairing data.

In this talk, I focus primarily on scalability, especially for schema mapping, deduplication, and record linkage, along with other cleaning challenges such as data repair and value imputation. Several solutions exist to help solve these problems automatically and at scale. Among them are:

  • Tamr, a system that makes it easy to use machine learning to unify data silos;
  • HoloClean, a framework for holistic data repairing driven by probabilistic inference; and
  • Trifacta Wrangler, a tool for accelerating data preparation.

Also included in the presentation are practical tips about:

  • Why building an engineering pipeline is the most crucial step in building a machine learning model.
  • How blocking can assist with data deduplication and help avoid n² problems, where every record must be compared against every other record.
  • How much training data is required, and where to get more data.
  • Why it is vital to avoid the “cold start” problem.
  • How to use weak supervision (see Snorkel as an example of a way to build a functional pipeline to obtain training data using weak supervision).
  • Which model to select (e.g. Naive Bayes, SVM, Logistic Regression, Deep Learning, or any of a host of others) for implementing classifiers, and the tradeoffs of each.
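
To make the blocking tip above concrete, here is a minimal sketch in Python. It is not taken from the talk or from any of the tools mentioned; the blocking key (first three letters of the name plus the ZIP code), the field names, and the sample records are all assumptions chosen for illustration. The idea is that only records sharing a blocking key are compared, so the number of candidate pairs stays far below the n(n-1)/2 pairs of a full comparison.

```python
from collections import defaultdict
from itertools import combinations


def blocking_key(record):
    # Hypothetical key: first three letters of the lowercased name plus the ZIP code.
    # Real systems usually tune or learn keys so that true duplicates rarely land
    # in different blocks.
    return (record["name"].strip().lower()[:3], record["zip"])


def candidate_pairs(records):
    # Group record indices by blocking key.
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        blocks[blocking_key(record)].append(idx)

    # Generate pairs only within each block, never across blocks.
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(ids, 2))
    return pairs


# Made-up sample records for illustration only.
records = [
    {"name": "Acme Corp", "zip": "02139"},
    {"name": "ACME Corporation", "zip": "02139"},
    {"name": "Globex Inc", "zip": "10001"},
    {"name": "Globex", "zip": "10001"},
    {"name": "Initech", "zip": "73301"},
]

pairs = candidate_pairs(records)
all_pairs = len(records) * (len(records) - 1) // 2

print(f"Candidate pairs after blocking: {len(pairs)} of {all_pairs} possible")
```

In a full pipeline, each surviving candidate pair would then be scored by a classifier such as one of the models listed above. The tradeoff is recall: a key that is too strict can separate true duplicates into different blocks, so they are never compared.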

If your organization can benefit from scaling data unification and curation efforts, I offer hands-on advice and expertise in this video of my presentation from the Strata conference, which draws on some of the best work from both academic and commercial applications. Check it out, and feel free to contact me or the other data unification experts at Tamr if you have questions or need clarification.