Written by Gideon Goldin
Organizations tend to collect more data than they can process. The challenge lies in finding and converting raw data into actionable information. To produce complete, accurate, and up-to-date lists of customers, for example, highly concerted curation is needed to locate, clean and organize heterogeneous records from what might be thousands upon thousands of disparate sources of data.
The conventional approach to data curation, with historical ties to information management, has been to employ curators who author “rules.” Rules, like “Customers are the same if their social security numbers match,” are logical assertions that software applications evaluate as they analyze records. The approach works at small scale but fails as the number of rules needed to maintain high quality over more and more records grows out of control. This is, at its heart, a design issue, since even the most seasoned experts cannot reason over what becomes a tangled web of hundreds if not thousands of rules. Even if we could design interfaces that met people within their cognitive limits, the financial and temporal investments in building such specialized teams are prohibitive in themselves.
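To make the idea of a rule concrete, here is a minimal sketch in Python of the social security number rule quoted above. The record fields (`id`, `name`, `ssn`) and the helper names are assumptions made for illustration, not any particular product's schema or API.

```python
from itertools import combinations

# Hypothetical customer records; the field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Ann Lee", "ssn": "123-45-6789"},
    {"id": 2, "name": "Ann B. Lee", "ssn": "123-45-6789"},
    {"id": 3, "name": "Bob Roy", "ssn": "987-65-4321"},
]

def same_ssn(a, b):
    """Rule: customers are the same if their social security numbers match."""
    return a["ssn"] is not None and a["ssn"] == b["ssn"]

def matching_pairs(records, rule):
    """Evaluate one rule over every pair of records and return the matching id pairs."""
    return [(a["id"], b["id"]) for a, b in combinations(records, 2) if rule(a, b)]

print(matching_pairs(records, same_ssn))  # [(1, 2)]
```

A single rule like this is easy to read. The difficulty described above appears when hundreds of such functions interact, each with its own exceptions for missing or malformed values.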
The Rise of Machine Learning
A smarter design would be to automate most of the work, leaving people to chime in only when necessary so that they can spend the bulk of their time on more creative tasks. Automation is at the heart of industrialization and, in recent years, machine learning has been its driver. But automating rule authorship is difficult because computers do not think like we do. What’s needed is not the automation of existing processes, but rather a new approach. The solution is a simple but clever combination of machine learning (ML) and expert guidance. This was the realization of Turing Award winner Michael Stonebraker in his MIT project, “Data-Tamer,” the precursor to Tamr.
Tamr’s patented approach has since moved out of academia and into production with a variety of large customers. Of course, enterprise software adoption presents its own challenges, the most formidable of which are rarely technical but bureaucratic. In the case of machine learning, one of the biggest challenges is designing for trust. Discussions in ML are rife with questions of trust, from institutions questioning the ethics of inferring from historical data to individuals questioning the recommendations software produces without explanation. With rules, the latter was never a problem because rules, almost by definition, tend to explain themselves. The user experience of writing a rule is often little more than the verbalization of an expert’s hypotheses. For better or worse, most off-the-shelf ML lacks the ability to explain itself. How can organizations trust the black-box promise of machine learning?
Facilitating Trust in Machine Learning
In ML, the most advanced computations are often the least explainable. (One can say the same about human cognition.) Since its inception, Tamr’s design team has been prototyping around this complexity, not just from a technical perspective but from a holistic one informed by psychological and cultural considerations. We’ve learned a few things about how trust is earned and lost. Ultimately, we’ve realized the necessity of a multi-pronged approach:
- Present summary metrics: Without a high-level, statistical overview of metrics (such as precision and recall for clustering duplicate records), users are unable to get a general sense of quality.
- Help people find instances: While summary metrics prevent losing sight of the forest for the trees, it’s still easier to think about individual records than a group of them. Showing stats isn’t enough: you must provide users with flexible search capabilities so they can find and validate the records they care about.
- Display change over time: Whether reasoning about a record or a group of records, users need to compare the results of ML before and after providing input to the system. Without such a view, there is no way to infer the effects of the work humans do.
- Show how different approaches compare: Perhaps what builds the most trust in ML is allowing people to compare it against other approaches. Easily letting users compare the results of an ML system against those produced by rules, experts, or even other ML systems helps users build a more intuitive sense of the nuances of ML.
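The summary metrics and before-and-after comparisons described above can be sketched as follows. This is a minimal illustration that assumes the common pairwise definition of precision and recall for clustering (a pair of records counts as a positive when both records share a cluster). The record ids and cluster labels are invented, and this is not necessarily the metric any specific product reports.

```python
from itertools import combinations

def co_clustered_pairs(assignment):
    """Return the set of record pairs that share a cluster label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }

def pairwise_precision_recall(predicted, truth):
    """Pairwise precision and recall of a predicted clustering against a reference."""
    pred_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(truth)
    correct = len(pred_pairs & true_pairs)
    # Treat an empty set of pairs as a perfect score to avoid dividing by zero.
    precision = correct / len(pred_pairs) if pred_pairs else 1.0
    recall = correct / len(true_pairs) if true_pairs else 1.0
    return precision, recall

# Hypothetical data: expert-verified clusters, plus model output
# before and after the expert supplies feedback.
truth = {"r1": "A", "r2": "A", "r3": "A", "r4": "B"}
before = {"r1": "x", "r2": "x", "r3": "y", "r4": "y"}
after = {"r1": "x", "r2": "x", "r3": "x", "r4": "y"}

for label, predicted in (("before", before), ("after", after)):
    precision, recall = pairwise_precision_recall(predicted, truth)
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Running the same function over the output of a rule set, or of a second model, gives the side-by-side comparison described in the last bullet.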
ML and, Not Versus, Other Technologies
If organizations hope to tame their data, they’ll need to augment familiar technologies with those built upon machine learning. In doing so, they’ll also need to augment familiar modes of thinking, the common sense accrued over years in a career, with an open-mindedness toward the naive power of statistics. In our own experience, we’ve observed that while most organizations are eager to modernize their data operations, some are more reluctant than others to embrace ML. These organizations tend to believe that adopting ML requires a leap of faith. But in time they, like organizations that have already implemented such technologies, realize that machine learning in data unification can not only outperform rules in most scenarios but also, and often, reveal errors and biases in the rules people have expressed, helping experts mature their understanding of the data. At Tamr, we’ve discovered that so long as ML-based tools are designed around trust, users come to wonder how they ever unified data without it.
To learn more about Tamr’s machine learning approach to data unification, schedule a demo today.