Stop Putting the AI Cart Before the Data Horse

Flush with Big Data and an accelerated way to capitalize on it (AI), many large enterprises are making a classic mistake. They’re assuming that their data is good enough to leverage as an asset. It is not.  


In a recent article in The Wall Street Journal, IBM executive Arvind Krishna said that data-related challenges were a top reason for IBM clients halting or cancelling AI projects. Further, he said that about 80% of the work with an AI project is collecting and preparing data. This number jibes with long-standing industry estimates for the amount of time data scientists and analysts typically spend on finding and preparing data before they get to the actual science/analytic part. 

While the admission may have been startling, the problem is far from new. I started working in the AI industry in the 1980s, and the same rule applied then as now to any algorithm, AI or otherwise: “Garbage In, Garbage Out.” Without great data, the math isn’t very useful.

In a June 18, 2019 article by my colleague/co-founder Ihab Ilyas (of the University of Waterloo) and O’Reilly Publishing’s Ben Lorica, the authors noted that a recent O’Reilly survey on the adoption of AI by enterprises “found that those with mature AI practices (as measured by how long they’ve had models in production) cited ‘Lack of data or data quality issues’ as the main bottleneck holding back further adoption of AI technologies.” Meanwhile, worldwide spending on artificial intelligence (AI) systems is predicted to hit $35.8 billion in 2019, a 44% increase over 2018 spending, according to IDC.

Translation: The AI cart is getting loaded up pretty full: we’ve gotta make sure that the enterprise data horse has the legs–and is actually in front of the cart 😉

If traditional methods of data preparation don’t work for AI, they really won’t work for today’s data scientists and analytically empowered front-line workers. The number of data scientists has exploded in the last decade, all of them wanting to do advanced modeling and analytics, many of which also involve AI techniques. At the risk of mixing transportation metaphors: given the high cost of data scientists–with entry-level salaries of ~$100k–ramping up these teams before having your data in order is like buying a Ferrari then putting raw crude oil in the gas tank. Enterprise data needs to be “refined” to make your data science, AI and next-gen analytics purr like a kitten.

Delivering Better Data for AI by Using AI

If you’re betting the future of your company on the predictive-analytics power provided by AI and your analytically empowered workforce, you need great data. Period.

By “great data,” I mean data that’s been rationally organized, unified and curated; is continuously updated; and is readily accessible by everyone who needs it, including data scientists and their models as well as the average person on the front line of your business. This is not so easy, however. The core systems we build reflect the idiosyncrasies of the time and context in which they are built: the more systems, the more idiosyncrasies. We’re coming off of 50+ years of building systems in the enterprise–that’s five decades of idiosyncrasy. Most large companies have thousands of systems that generate data that could be an asset, if it weren’t for the data’s radical heterogeneity. 

As large companies change, their business processes and systems change MUCH more slowly, in effect creating a “data-consistency drag coefficient” that results in database decay. M&A, changing corporate strategies, and other facts of corporate life make this decay even worse. The quantity and rate of decay vastly outstrip humans’ ability to organize it for strategic uses like predictive analytics. 

Traditional data integration/unification methods (Data Warehousing, Data Lakes, Master Data Management) have not kept pace with the growing variety of data in the enterprise. This is because humans are still in the critical path, at multiple levels. Someone (really, many people) must gather requirements from data consumers, build complex rule sets to implement data mastering logic, validate these rules by applying them to real data, and modify the whole pipeline as new data arrive and rules must change.

Popular of late are systems that make individuals more productive with data in the “last mile” of analytics; Alteryx and Trifacta are the most notable. Their popularity reflects the tremendous need for more and better preparation of data, but individuals preparing data downstream is only a small part of the problem. In addition to self-service data prep tools, the enterprise also needs “data refineries.”

The best data refineries use the machine to augment humans to clean/organize/prepare data so that it can be consumed broadly in large companies: by data scientists and developers building AI-enabled applications as well as the average citizen in a big company who is just looking for simple, accurate, up-to-date information to help do his/her job. 

With good human/machine interfaces, machine learning models can recognize relationships among data via simple examples provided by end users, and the incremental effort of incorporating new sources is nominal compared to traditional techniques. The economics of working with varied and voluminous data sets, without compromising either accuracy or speed, change dramatically.
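As a rough illustration of learning from examples rather than hand-written rules, here is a minimal Python sketch. It is not how any particular product works: the company names are made up, `similarity`, `learn_threshold` and `is_same_entity` are hypothetical helpers, and a single string-similarity score from the standard library's `difflib` stands in for a real machine learning model.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] after simple normalization."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def learn_threshold(labeled_pairs):
    """Pick the similarity cutoff that classifies the labeled examples best.

    labeled_pairs: iterable of (name_a, name_b, is_match) tuples supplied
    by an end user. This replaces a hand-tuned matching rule.
    """
    scored = [(similarity(a, b), is_match) for a, b, is_match in labeled_pairs]
    best_cutoff, best_correct = 1.0, -1
    # Try each observed score as a cutoff; keep the first one with the most hits.
    for cutoff in sorted({score for score, _ in scored}):
        correct = sum((score >= cutoff) == is_match for score, is_match in scored)
        if correct > best_correct:
            best_cutoff, best_correct = cutoff, correct
    return best_cutoff


def is_same_entity(a: str, b: str, cutoff: float) -> bool:
    """Apply the learned cutoff to a pair the user never labeled."""
    return similarity(a, b) >= cutoff


# Hypothetical examples an end user might label in a few minutes.
labeled_examples = [
    ("Acme Corp", "ACME Corp.", True),
    ("Acme Corp", "Acme Corporation", True),
    ("Globex Inc", "Globex, Inc.", True),
    ("Acme Corp", "Globex Inc", False),
    ("Initech LLC", "Acme Corp", False),
    ("Globex Inc", "Initech LLC", False),
]

cutoff = learn_threshold(labeled_examples)
print(is_same_entity("Initech LLC", "initech llc ", cutoff))
print(is_same_entity("Acme Corp", "Initech LLC", cutoff))
```

In this toy setup, adding a new source means supplying a few more labeled pairs and re-running `learn_threshold`, not rewriting a rule set. That is the economic shift the paragraph above describes, in miniature.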

(My co-founder Ihab Ilyas really nailed the different requirements of AI for data cleaning/organization/preparation vs. traditional analytics in this excellent early article: scale, legacy and trust.)

How Businesses Benefit

Forward-looking companies in competitive industries are already using AI this way to clean, organize and prepare data, with impressive business results.

A leader in the oil and gas industry uses AI to track its wells and those of its competitors across the entire lifecycle, from preparation to completion and first production, to ensure it is investing in the right assets at the right time. The company has a large variety and volume of data (internally and externally sourced) arriving at high velocity, and it had no easy way to incorporate new data sources (its traditional, rules-based system took six+ months to add a single new data source). By taking an AI-enabled approach to data unification/integration, it has since been able to create a Golden Record of trusted data needed for analytics and decision-making. Unified data allows for “mastering” wells across multiple vendor datasets to identify high-interest wells to add to the company’s watch list, reducing the chance that it will invest in the wrong wells. Further, the company has been able to move away from operating based on forecasts, or assumed results, to real data and actual results.
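To make the Golden Record idea concrete, here is a minimal Python sketch under stated assumptions. It is not the company's actual pipeline: the field names and well data are invented, the `cluster_id` linking records for the same well is assumed to come from an upstream matching step, and "newest non-empty value wins" is just one simple survivorship rule among many.

```python
from collections import defaultdict


def build_golden_records(records, key="cluster_id", recency="updated"):
    """Merge records that share a cluster id into one record per cluster.

    For each field, keep the non-empty value from the most recently
    updated source record (a simple, assumed survivorship rule).
    """
    clusters = defaultdict(list)
    for rec in records:
        clusters[rec[key]].append(rec)

    golden = {}
    for cluster_id, members in clusters.items():
        merged = {}
        # Oldest first, so newer non-empty values overwrite older ones.
        # ISO dates (YYYY-MM-DD) sort correctly as plain strings.
        for rec in sorted(members, key=lambda r: r[recency]):
            for field, value in rec.items():
                if value not in (None, ""):
                    merged[field] = value
        golden[cluster_id] = merged
    return golden


# Hypothetical rows from two vendor datasets; W1 appears in both.
vendor_records = [
    {"cluster_id": "W1", "operator": "Acme Energy", "status": "drilling",
     "first_production": None, "updated": "2019-01-10"},
    {"cluster_id": "W1", "operator": "", "status": "completed",
     "first_production": "2019-03-01", "updated": "2019-03-15"},
    {"cluster_id": "W2", "operator": "Globex Oil", "status": "permitted",
     "first_production": None, "updated": "2019-02-01"},
]

golden = build_golden_records(vendor_records)
print(golden["W1"])
```

For well W1, the merged record keeps the operator name from the older row (the newer row left it blank) and takes the status and first-production date from the newer one.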

A fast-growing bank operating in multiple jurisdictions struggled to conduct compliance reporting for Anti-Money Laundering (AML) regulations that require it to provide unified views of its customers to regulators. The risk of steep fines, in the tens to hundreds of millions of dollars, pushed the bank to look for new ways to unify petabytes of customer data residing in 2,500+ data sources spread throughout the world and across multiple business units. By using an AI-enabled approach to data unification/integration, it created within a few months a unified customer database that contains deduplicated records and clear legal holding relationships. The bank now has an automated data pipeline that matches new customer records to the appropriate holding company, minimizing its exposure to new AML risk as it continues to grow and rapidly onboard new customers.


The simple act of starting with data engineering first (organizing your data and cleaning it using machine-driven, human-guided processes and technology) will set you up to pursue the coolest AI methods. Not only that, it will ensure that all of the data scientists, developers and citizens who consume data inside your company have the most up-to-date, comprehensive and well-curated data available to them at any given time.

There are a few safe assumptions that one can make about the future of enterprise data:

  • There is only going to be a LOT more of it
  • It’s only going to get more heterogeneous and unpredictable
  • Control over the sources of data is decreasing
  • AI will be applied broadly in the enterprise but will only be successful if it’s served with high-quality data

Right now, data engineering’s ability to meet the expectations set by AI is like that of a high school sprinter competing in the Olympics. Yeah, he/she might be able to compete but would be wiser to build the required muscle mass first.

Regardless, great AI will always depend on great data.
