Michael Stonebraker’s work and contributions to the concepts and practices underlying modern database systems are truly remarkable. I’ve followed Mike’s accomplishments for decades, and have worked closely with him for the past six years—one of the highlights of my career. I’m honored to have contributed a chapter to the new book surveying his career: Making Databases Work: The Pragmatic Wisdom of Michael Stonebraker.
“The Tamr Codeline” chapter describes our early journey at Tamr as we turned the academic work that Mike led at MIT into a commercial reality. The goal of the MIT Data-Tamer project was to figure out a practical, scalable approach to taming data variety, which consumes some 80% of data scientists’ time and effort.
Through a decades-long career, I’ve worked with big data at companies such as Thinking Machines, Torrent Systems, and Endeca, and I’ve seen messy data foil far too many projects. I knew that successfully tackling this problem would have a huge impact across many industries. As a result, I leapt at the opportunity to join Tamr as a founding member in April 2013 to build the development team and deliver the first commercial version of the technology.
The academic project at MIT highlighted that success in taming data variety requires technology and people, so the Tamr solution needed to address both data unification and the people and skills required to accomplish it. From the outset, we realized that the closest domains to data unification are MDM (master data management), with a history of massive budget overruns and too little value delivered; and ETL (extract, transform, load), a generic data processing toolkit with a bottomless backlog and endless appetite for consulting.
From the people standpoint, the user experiences of traditional MDM and ETL tools are firmly rooted in the role of the data engineer. The Data-Tamer project showed that the people required to make data unification truly effective are not data engineers but highly contextual recommenders (subject-matter experts) who are directly engaged in the unification process and can enable a new level of productivity in data delivery.
This set up a long-standing tension within Tamr. On the one hand, much of what needs to happen in a data unification project is pretty standard data engineering. Mike, in particular, advocated that we have a “boxes and arrows” interface typical of ETL tools to support data engineers in defining the data engineering workflows. On the other hand, moving data from point A to point B is not what makes Tamr special. There was an argument that we should focus on our core innovations—schema mapping, record matching, and classification—and leave the boxes and arrows to some other tool. Mike’s recommendation for resolving this kind of tension is to use the crucible of commercial viability, so we eagerly sought out commercial opportunities for our early system.
Over the course of our early deployments, we saw that an enormous portion of data unification projects could be distilled down to a few simple activities centered around a dashboard for managing the assignment and review of expert feedback. This could be managed by a data curator, a non-technical user who can assess the quality of the data and oversee a project to improve data integrity. We created a simple, pre-canned workflow that addressed these core use cases, but many projects required additional workflows that extended beyond this core. To ensure that the needs of these deployments would also be met, we built good APIs so that our core capabilities could be integrated into other workflow systems.
Focusing on our core capabilities while providing good integration with other tools enabled many of our projects to deliver initial results in under a month, with incremental, weekly deliveries after that. Delivering results quickly built demand within the customer organization, helping to motivate further investment in the project and integration into standard ETL and other IT infrastructures. This integration became another project deliverable along the way, rather than a barrier to delivering useful results.
We learned that focusing on our core differentiators and satisfying the needs of data consumers was the best way to define a new product in an emerging market.
“The Tamr Codeline” chapter goes on to look at many other vexing challenges we have encountered at Tamr, how we overcame them, the lessons we learned, and some surprising opportunities that emerged as a result. As we continue to work closely with Mike to realize his vision under his guidance, the ultimate validation is in the vast savings our customers attribute to our projects and the testimonials they give to their peers, describing how what they have long known to be impossible has suddenly become possible.