Death by Data Variety: Two Decades Working On The Cure

Data variety has been the Achilles’ heel of enterprise BI and analytics projects since long before the three-V model of Big Data brought it into the spotlight (along with its siblings ‘volume’ and ‘velocity’). Over the past 20 years, my career in technology has spanned data mining, ETL, Agile BI, and other data-intensive application areas. Throughout those two decades, the challenge of data variety has remained stubbornly entrenched. Corporate visions of data-driven digital transformation still get brought back down to earth by the reality of data integration challenges, and the only way out seems to be through the interminable IT backlog, which, for practical purposes, often means no way out.

Before the dot-com crash of 2000, I was with Torrent Systems, working with a customer that was an internet search pioneer to apply market basket analysis to improve internet search suggestions. The premise was straightforward: associate a user profile and a narrow window of search history with click-through, and cross-sell the high click-through items for related profiles and searches. There were the normal challenges of reconstructing sessions from web server logs and identifying click-through (“conversion”). But the client had several web properties, and needed to connect users and sessions across those properties. Of course, the web logs were not synchronized, nor were user accounts (when we had them); and the data was dirty, since sessionization and conversion detection are imperfect. As the project gained momentum, there was one data mining expert building the model, and three of us working full-time to get all of the data to line up. After an enormous investment of time and energy, we did get a system working, but it wasn’t enough to weather the dot-com crash. Data variety should have been listed as a contributing cause on the death certificate.
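To make the premise concrete, here is a minimal Python sketch of the two steps described above: rebuilding sessions from log events, then counting which searches share a session with which clicked items. It is not the system we built. The event layout, the 30-minute inactivity cutoff, and the function names are all assumptions chosen for illustration, and it presumes user IDs have already been reconciled across properties, which was exactly the part that consumed most of our effort.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Assumed inactivity cutoff; a gap longer than this starts a new session.
SESSION_GAP = timedelta(minutes=30)


def sessionize(events, gap=SESSION_GAP):
    """Split (user_id, timestamp, action, value) events into per-user sessions."""
    by_user = defaultdict(list)
    for event in events:
        by_user[event[0]].append(event)

    sessions = []
    for user_events in by_user.values():
        user_events.sort(key=lambda e: e[1])  # order each user's events by time
        current = [user_events[0]]
        for prev, event in zip(user_events, user_events[1:]):
            if event[1] - prev[1] > gap:
                sessions.append(current)
                current = []
            current.append(event)
        sessions.append(current)
    return sessions


def click_counts(sessions):
    """Count sessions in which a search query co-occurs with a clicked item."""
    counts = Counter()
    for session in sessions:
        queries = {value for _, _, action, value in session if action == "search"}
        clicks = {value for _, _, action, value in session if action == "click"}
        for query in queries:
            for item in clicks:
                counts[(query, item)] += 1
    return counts


# Hypothetical log events, already merged from several web properties.
events = [
    ("u1", datetime(2000, 1, 5, 10, 0), "search", "red shoes"),
    ("u1", datetime(2000, 1, 5, 10, 2), "click", "item-17"),
    ("u1", datetime(2000, 1, 5, 11, 30), "search", "boots"),
    ("u2", datetime(2000, 1, 5, 10, 5), "search", "red shoes"),
    ("u2", datetime(2000, 1, 5, 10, 6), "click", "item-17"),
]

sessions = sessionize(events)
counts = click_counts(sessions)
```

Even in this toy form, the result depends entirely on the timestamps and user IDs being comparable across sources. With unsynchronized logs and unreconciled accounts, both the session boundaries and the co-occurrence counts become unreliable.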

Ten years later, I was with Endeca, working on an Agile BI proof-of-concept with a large automobile manufacturer. The idea was to show ad-hoc analytics across manufacturing, marketing, sales, and service, without IT having to build OLAP cubes and hand-tune a BI system for a carefully crafted set of queries. The technology worked really well, and we had very talented pre-sales engineers, so we were able to get from data to visualizations within a week or so. For the first time, the customer was able to visualize their data without it first going through the hands of IT. What they saw was that, across organizations, the dimensions didn’t line up, identities weren’t reconciled, and values weren’t conformed. The resulting charts served primarily to illustrate the diversity of their data, rather than a direction for their business. So their first reaction was, ‘Wow, my data is a mess! I need to kick off a year-long data cleaning project so I can do Agile BI!’

Which kind of flies in the face of the whole notion of ‘Agile’.

In early 2013, when I started talking to Mike Stonebraker and Andy Palmer about their vision for machine-assisted, bulk data unification, my immediate response was, “If this technology can actually do what you say it does, I want in.” As soon as I saw it in action, I became a believer. Whether the challenge is to bridge operational systems like ERP or CRM, or to rationalize data dimensions from multiple departments or divisions, this approach of bringing the data, the subject matter experts, and machine learning into direct collaboration enables rapid insight in a manner that is both value-driven and data-driven. I wish I could bring this approach back to tame the data variety challenges that made those old projects limp along, so that rather than focusing so much on the mechanics of data processing, we could instead have seen rapid progress towards the insights we so urgently needed.


Nik is a technology leader with over two decades of experience building data engineering and machine learning technology for early-stage companies. As a technical lead at Tamr, he leads data engineering, machine learning, and implementation efforts. Prior to Tamr, he was director of engineering and lead architect at Endeca, where he was instrumental in the development of the search pioneer that Oracle acquired for $1.1B. Previously, he delivered machine learning and data integration platforms with Torrent Systems, Thinking Machines Corp., and Philips Research North America. He has a master’s degree in computer science from Columbia University.