Michael Stonebraker
Co-founder and Turing Award Winner
August 27, 2020

How to Avoid the 10 Big Data Analytics Blunders

Leading organizations are leveraging an analytics-driven approach, fueled and informed by data, to achieve marketplace advantages and create entirely new business models. However, even the savviest companies are repeating common missteps. I recently gave a presentation on this very topic at the IQPC Chief Data and Analytics Officer Exchange, which you can watch here.

Here are the top 10 blunders we see in working with our customers, plus insights into how you can work to overcome them.

Blunder #1: Not moving to the cloud.

If your organization isn’t planning to become cloud-exclusive, you could be backing losing technology. The cloud is more elastic than your in-house solution and more cost-effective in the long run.

The cloud will save your organization a raft of money, allow your business to take advantage of new technologies with elastic compute, and open your organization to new geographies. Take action now and look into what the cloud offers.

Blunder #2: Not planning for AI/ML to be disruptive.

Make no mistake: AI will displace some of your workers and has the potential to upend how you handle your operations. But there is only one choice: you can be a disruptor, or you can be disrupted.

If you want to lead, you must be willing to pay for talent and act quickly, because the best talent is being snapped up fast. HR won’t like what you need to pay for machine learning (ML) experts, but spending money now on experts nets you a much greater return in the long run. And don’t make the mistake of contracting this essential skill out.

Blunder #3: Not solving your real data science problem: dirty data.

You’ve hired data scientists, so you think you’ve got big data analytics covered. However, it’s crucial to look at how they are spending their time. Unfortunately, most of their time is spent analyzing and cleaning data and integrating it with other sources. A machine learning expert at iRobot told me that she spent 90% of her time working on data discovery, integration, and cleaning. Of the 10% left of her time, she spent 90% of that fixing data cleaning errors, which left about 1% of her time for the job she was hired for.

Without clean data, your data science is worthless. So, have a clear strategy for dealing with data cleaning and integration, and have a Chief Data Officer on staff.
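The time breakdown in that anecdote can be checked with a few lines of arithmetic. This is only a sketch of the calculation; the function name and its parameters are made up for illustration, and exact fractions are used to avoid floating-point rounding.

```python
from fractions import Fraction


def share_left_for_core_work(prep_share, rework_share_of_rest):
    """Return the fraction of total time left after data prep and rework.

    prep_share: share of all time spent on discovery, integration, cleaning.
    rework_share_of_rest: share of the REMAINING time spent fixing
    data cleaning errors.
    """
    rest = 1 - prep_share
    return rest * (1 - rework_share_of_rest)


# Figures from the anecdote: 90% on data prep, then 90% of what is left
# on fixing cleaning errors.
core = share_left_for_core_work(Fraction(90, 100), Fraction(90, 100))
print(f"{float(core):.0%}")  # prints 1%
```

In other words, 10% of 10% of her time, or one hour in every hundred, went to actual machine learning work.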

Blunder #4: Believing that traditional data integration techniques will solve issue #3.

Clean, integrated data at scale has become nearly impossible to achieve with traditional techniques and technologies. Extract, transform, load (ETL) processes require intensive human effort and take a lot of time. Every time a new data source is added, a human’s capacity to manage that additional information is diminished. I’ve never seen this human-first technique work with more than 20 data sources, and most enterprises need to integrate far more than that. Additionally, once you’ve run ETL processes, you need to match records to find out which ones go together and remove duplicates. Traditionally this is done with rules-based Master Data Management (MDM) systems, which also don’t scale. Rules can work to generate training data, but they don’t work for solving big problems.

Blunder #5: Believing that data warehouses will solve all your problems.

Data warehouses are great for structured data from around 10 data sources, but they don’t work for things like text, images, and video. Many companies have bought into traditional data warehouse technology that costs up to seven figures a year, but it is only useful in a limited way. If you have a data warehouse, don’t try to shoehorn unstructured data into it.

Blunder #6: Believing that Hadoop/Spark will solve all your problems.

Many companies have invested in Hadoop, the open-source software collection from Apache, or Spark, Apache’s analytics engine for big data processing. They have their place, but they are not the answer to everything. Would you use a “lowest common denominator” solution for your company’s “secret sauce,” or the best the industry has to offer?

Also, keep in mind that Hadoop and Spark won’t solve your data integration problems, where data scientists spend the bulk of their time (see Blunder #3).

Blunder #7: Believing that data lakes will solve all your problems.

Many people assume that if a company loads all its data into a data lake (a centralized repository for all data), they’ll be able to correlate all their data sets. But they often end up with data swamps, not data lakes.

It’s a problem of garbage in, garbage out. Let’s say HR databases need to account for employees working in two different locations. If two records are simply added together, staff will be over-counted by the number of duplicates. The net result is your analytics will be garbage, and your machine learning models will fail. Companies need to clean their lake data with a data curation system that will solve these problems.
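The HR over-counting example can be sketched in a few lines of Python. Everything here is invented for illustration: the records, the field names, and especially the matching rule (normalized name plus birth date), which is an assumption and not how any particular curation system works. A hand-written rule like this is exactly the kind of thing that stops scaling once there are many sources (see Blunder #4).

```python
def normalize_name(name):
    """Lowercase a name and collapse repeated whitespace."""
    return " ".join(name.lower().split())


def count_staff(records):
    """Return (naive_count, deduplicated_count) for a list of HR records.

    The naive count simply adds up the rows. The deduplicated count treats
    rows with the same normalized name and birth date as one person
    (a hypothetical matching rule chosen for this sketch).
    """
    naive = len(records)
    unique_people = {
        (normalize_name(r["name"]), r["birth_date"]) for r in records
    }
    return naive, len(unique_people)


# Hypothetical rows loaded into a lake from two location-specific HR systems.
records = [
    {"name": "Ana Ruiz", "birth_date": "1985-03-02", "location": "Boston"},
    {"name": "ana  ruiz", "birth_date": "1985-03-02", "location": "Cambridge"},
    {"name": "Li Wei", "birth_date": "1990-11-17", "location": "Boston"},
]

naive, deduplicated = count_staff(records)
print(naive, deduplicated)  # prints 3 2
```

The naive count is off by one, which is the single duplicate; every headcount-based metric computed on the raw lake data inherits that error.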

Blunder #8: Outsourcing your new stuff to big data analytics services firms.

Typical enterprises spend about 95% of the IT budget on running legacy code, and they often have their best people doing things like maintenance. The most exciting stuff gets outsourced, often because there is no appropriate talent internally, or because the best people are stuck keeping existing systems running.

This is a losing strategy. The “new stuff” is what will propel the business forward and also keep your best, most creative people engaged. Instead, companies should outsource mundane things like maintenance, email systems, and such, not the promising new technologies.

Blunder #9: Succumbing to the Innovator’s Dilemma.

In his classic book The Innovator’s Dilemma, Harvard Business School professor Clayton Christensen suggests that when technology changes and you are a vendor that is selling the “old stuff,” it is very difficult to pivot to the new stuff without losing significant market share in the process.

As a business, you have to be willing to change and evolve when it is needed. It’s possible, and even likely, that a reinvention will hurt your business in the short term, but it’s absolutely necessary to stay in business for the long run. There are plenty of examples of this in practice. One that most people are familiar with is the emergence of ridesharing companies like Lyft and Uber, and the negative consequences for legacy taxi companies. Today the cost of a taxi license in the City of Cambridge has dropped from $700K to $10K.

Blunder #10: Not paying up for a few “rocket scientists.”

To address all of the above issues, and the hundreds of others you will inevitably face, companies need to invest in a few highly skilled employees. The new hires are not going to wear suits, but they will be your guiding lights.

(Bonus) Blunder #11: Working for a company that is not trying to do something about the “sins of the past.”

If you work for a company that’s falling into any of the above blunders, figure out how to fix it, or start looking for a new job.

If you enjoyed this presentation about the ten common big data blunders, you might like my presentation on DataMastering at scale.