Rise of the [Data Preparation] Machines

Can We Trust Probabilistic Machines to Prepare Our Data?

— By Daniel Bruckner

Whether you’re en route to a 360-degree view of your customers, mapping your data lake, or tuning up your supply chain analytics, you need to have confidence that your unified data meets stringent standards of quality and integrity.

Traditional tools for data preparation — ETL, MDM, and so on — guarantee quality by putting highly skilled people in the driver’s seat. IT professionals gather requirements from business users, then translate those requirements into code and complex systems of rules. The results are accurate, to be sure, but complex and costly to implement, especially at scale.

Tamr, by contrast, minimizes complexity and cost by supplementing human input with machine learning algorithms. Business users can tell Tamr directly about their data, and Tamr will learn their requirements.

But with algorithms and machines contributing so much to the process, can we still trust the results?

Rise of the Machines

True story. The other day, I pulled up to a stop light in Mountain View and two things happened. First, a stream of people flowed into the intersection to cross the street. Second, a self-driving car moved confidently into the intersection to make a left turn, veered towards the pedestrians, then stopped. It began to wait.

While the robot idled in front of me, I naturally considered all the ways it might run amok. It could fail to see the road and drive into a telephone pole. It could fail to see the people in the road and squash them. It could see the people, but think it found a path to squeeze between them and floor the accelerator. It could become self-aware and launch some missiles. In any scenario, though, the machine was using advanced statistical algorithms for computer vision to make life-or-death decisions in real time. How on earth are we supposed to trust it to do that?

No Fate But What We Make

Trust, of course, must be earned. Self-driving cars are earning it by protecting those on the road with a simple safety net: every self-driving car has a human in the driver’s seat (well, for now). That control allows them to log thousands of miles of accident-free driving without the risk of machine error.

Tamr earns trust in the same way. On top of state-of-the-art algorithms that automatically unify hundreds or thousands of data sources, Tamr provides a simple but potent user interface for matching, cleaning and understanding disparate data sets. When the machine can’t resolve connections automatically, it calls on “curators” — experts in the organization familiar with the data — to weigh in on the mapping and improve its quality and integrity. In the driver’s seat, a data curator can steer Tamr’s machine intelligence through any danger.

Three core capabilities put you in control as a data curator. Search-and-browse interfaces allow you to explore data and metadata to find exactly what you want or, when cleaning, exactly what you want to eliminate. Rule-based facets let you combine data sets and merge objects according to fixed, well-understood rules, or test hypotheses against the data (for example, “If two company records have the same address, does that mean they’re the same company?”). And metrics and benchmarks let you compare different approaches (hard-coded rules, output from legacy systems, manual clean-up by human experts, learned probabilistic models) to see which give the best data quality, and to find the fastest way to boost that quality.

Let’s look at a specific use case. Suppose you’re building a unified reference data set of, say, people of interest, from a variety of constituent data sets: user account databases, CRM systems and enrichment sources like the US census (for demographics), or social media (for contact info and activity data). You need to find ways to match records across these sources that describe the same person.

The natural way to approach a complex integration is to start with simple hypotheses. For example, when connecting records about people you might start with the hypothesis that matches have the same first_name and last_name.
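To make the same-name hypothesis concrete, here is a minimal Python sketch of what such a rule amounts to. It is an illustration only, not Tamr's API: the sample records, the `id`, `source`, and `middle_name` fields, and the helper names `name_key` and `group_by_name` are all assumptions made for this example.

```python
from collections import defaultdict

# Hypothetical sample records pulled from different sources.
# Only first_name and last_name come from the hypothesis in the text;
# every other field and value here is invented for illustration.
records = [
    {"id": 1, "source": "crm", "first_name": "Sarah", "middle_name": "Ann", "last_name": "Connor"},
    {"id": 2, "source": "accounts", "first_name": "sarah", "middle_name": "Jeanette", "last_name": "Connor "},
    {"id": 3, "source": "social", "first_name": "Sarah", "middle_name": "J", "last_name": "Connor"},
    {"id": 4, "source": "crm", "first_name": "Kyle", "middle_name": "", "last_name": "Reese"},
]


def name_key(record):
    """Normalize the two name fields so case and stray whitespace don't split groups."""
    return (record["first_name"].strip().lower(), record["last_name"].strip().lower())


def group_by_name(records):
    """Group record ids by (first_name, last_name) and keep groups with 2+ records."""
    groups = defaultdict(list)
    for record in records:
        groups[name_key(record)].append(record["id"])
    # Groups of size one propose no match, so they are dropped.
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


candidate_groups = group_by_name(records)
for key, ids in candidate_groups.items():
    print(key, ids)
```

Each group is a set of records the rule claims describe one person. Whether that claim holds is exactly what needs inspecting.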

In Tamr, you can easily test this hypothesis: a rule-based facet will surface records grouped by name for inspection. Then you can search the results for familiar examples — maybe “Sarah Connor” — to verify whether the rule makes sense. Usually, our first assumptions about data aren’t perfect — in this case, Tamr may reveal records like “Sarah Ann Connor,” “Sarah Louise Connor,” and “Sarah Jeanette Connor” — so we need to revise them as we learn. Similarly, as you explore data in Tamr and enrich it — say, by marking that “Sarah Ann” is not the same as “Sarah Jeanette,” but “Sarah J” is — then the system will use that input to benchmark the accuracy of its algorithms.
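The benchmarking idea can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not a description of how Tamr computes its metrics: the record ids, the curator labels, and the functions `predicted_pairs` and `precision_recall` are hypothetical, and pairwise precision and recall stand in for whatever measures a real system would use.

```python
from itertools import combinations

# Hypothetical curator feedback on pairs of record ids:
# True means "same person", False means "different people".
curator_labels = {
    frozenset({1, 2}): False,  # "Sarah Ann" vs. "Sarah Jeanette"
    frozenset({2, 3}): True,   # "Sarah Jeanette" vs. "Sarah J"
    frozenset({1, 3}): False,  # "Sarah Ann" vs. "Sarah J"
}


def predicted_pairs(groups):
    """Expand groups of record ids into the set of pairs a rule claims are matches."""
    pairs = set()
    for ids in groups:
        pairs.update(frozenset(pair) for pair in combinations(ids, 2))
    return pairs


def precision_recall(predicted, labels):
    """Score predicted pairs against curator labels, ignoring unlabeled pairs."""
    labeled_predictions = [pair for pair in predicted if pair in labels]
    true_positives = sum(1 for pair in labeled_predictions if labels[pair])
    actual_matches = sum(1 for is_match in labels.values() if is_match)
    precision = true_positives / len(labeled_predictions) if labeled_predictions else 0.0
    recall = true_positives / actual_matches if actual_matches else 0.0
    return precision, recall


# The same-name rule lumps records 1, 2, and 3 into one group.
same_name_rule = predicted_pairs([[1, 2, 3]])
precision, recall = precision_recall(same_name_rule, curator_labels)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

On these made-up labels, the same-name rule finds the one true match but also proposes two false ones, which is the kind of signal that tells a curator the hypothesis needs revising.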

Anything we do with data — integration, cleaning, analytics — begins with exploration. And when you get down to it, Tamr is a tool to help data scientists, analysts and curators explore data with full control and eyes wide open. That open, directed exploration is how people discover insights, and how technology earns trust.

Daniel Bruckner is a co-founder @ Tamr. A graduate of the University of Chicago (physics), Daniel was also a Senior Fellow at CERN and studied computer science at MIT and UC Berkeley. Several members of the Tamr Field Engineering team contributed to this post.

To learn more about Tamr, explore our video.