Tamr’s Take: On Andrew McAfee’s “Big Data’s Biggest Challenge”


Here’s a simple rule for the second machine age we’re in now: as the amount of data goes up, the importance of human judgment should go down.

So says Andrew McAfee, co-director of the Initiative on the Digital Economy in the MIT Sloan School of Management, in a provocative Harvard Business Review piece this week entitled “Big Data’s Biggest Challenge? Convincing People NOT to Trust Their Judgment.” The gist of McAfee’s argument? While much contemporary big data analytics has experts applying their “judgment to the output of a data-driven algorithm or mathematical model … Things get a lot better when we flip this sequence around and have the expert provide input to the model, instead of vice versa.”

Tamr’s Take: As it relates to data integration, we agree with McAfee … to a certain level.

That level being the point at which our “machine driven, human guided” system determines that its advanced algorithms have done as much as they confidently can in the automated connection of sources, attributes and records. This is when the system calls data source experts into the curation process to provide input that is fed back into the system in a virtuous “machine-human learning” cycle.

It’s the very definition of human-machine collaboration that McAfee celebrates and that has been lacking historically in the field of data integration, as Tamr co-founder Ihab Ilyas detailed in an earlier post, “From Data Variety to Data Opportunity”:

Unfortunately, historically humans and machines have had only limited success in working together to curate and integrate data at scale; human experts are scattered across the enterprise with very fragmented expertise and with even lower capacity to deal with large volumes.  And while machines can munch on large amounts of data, they have limited insight into data semantics and cannot make final decisions on updating mission-critical data.  Past hybrid human-machines solutions have had programmers writing, for example, data transformation scripts or performing parameter tuning of automated solutions.  While these IT personnel best understand the machine, they have very little exposure to the data and are definitely the wrong people to update enterprise data.

Tamr’s approach to collaborative curation combines powerful machine learning algorithms with collective human insight to identify sources, understand relationships and connect the massive variety of siloed data. Tamr’s system employs machines to consume a large number of data signals from all available sources across silos to come up with the best curation suggestions possible. Data experts are then repeatedly and asynchronously pulled into the curation process to make final update decisions, to provide hints and semantics, or to request more evidence from the machine.
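To make the loop described above concrete, here is a minimal Python sketch of confidence-based triage. It is an illustration only, not Tamr’s actual algorithm: the function names (`similarity`, `triage`, `apply_feedback`), the two thresholds, and the use of a plain string-similarity score in place of a learned model are all assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in match score in [0, 1]; a real system would use a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage(pairs, auto_accept=0.9, auto_reject=0.3):
    """Split candidate record pairs into machine decisions and expert questions.

    The thresholds are hypothetical values chosen for illustration.
    """
    decided, needs_expert = [], []
    for a, b in pairs:
        score = similarity(a, b)
        if score >= auto_accept:
            decided.append((a, b, True))    # confident match
        elif score <= auto_reject:
            decided.append((a, b, False))   # confident non-match
        else:
            needs_expert.append((a, b))     # too uncertain: ask a human
    return decided, needs_expert


def apply_feedback(labels, needs_expert, ask_expert):
    """Record each expert answer so it can be reused as training data."""
    for a, b in needs_expert:
        labels[(a, b)] = ask_expert(a, b)
    return labels
```

In this sketch the machine settles the clear cases on its own and sends only the ambiguous pairs to an expert. The answers collected by `apply_feedback` are what a real system would feed back into its model, so that fewer pairs need human review on the next pass.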

As Andy Palmer wrote in his introductory Tamr post, “the thoughtful connection of machine learning with expert human guidance is the real trick to solving th(e) problem of data variety. Making the interaction between experts and the machine both simple and scaleable has been a fantastic challenge and our most significant achievement over the past few years.” An achievement that one customer, a multibillion-dollar pharmaceutical company, is already using to connect research data from thousands of scientists around the world and gain valuable aggregate research insight.

In sum, Tamr’s approach directly answers the challenge McAfee cites from Ian Ayres’ book Super Crunchers: “Instead of having the statistics as a servant to expert choice, the expert becomes a servant of the statistical machine.”