Written by Michael Stonebraker
As Big Data continues to evolve, we see companies working to address the amount of data that’s coming in, the rate at which it’s coming in, and the variation in the data itself. There are a number of companies out there already addressing some of these disruptions, but there’s still an 800-pound gorilla in the corner of the Big Data room that needs to be addressed. First, let’s start at the top.
When we discuss challenges with Big Data, we’re often referring to one of the three Vs:
- Big Volume: This is when you’ve got too much data, and you can’t manage the petabytes.
- Big Velocity: The data is coming at you too fast, and you can’t keep up. It’s most analogous to the idea of trying to drink from a fire hose.
- Big Variety: If data comes at you from too many places, there’s a good chance you’ll have a data integration problem.
Big Volume: SQL, Complex Analytics and Disruptors
Big Volume is simply recognizing you have too much data, and you can’t get your hands around all those petabytes. Most people are doing either SQL analytics or more complex analytics. If you want to do SQL analytics, there are (as of today) 20 or so data warehouses running multi-petabyte production systems day in, day out, on products from multiple vendors. By and large, these major vendors are pretty good at running gigantic clusters with hundreds of nodes and petabytes of data. If you want to do SQL analytics on petabytes, the best advice I can give you is to purchase one of these commercial warehouse products.
Note that there’s a fly in the ointment here: the Cloud. Companies will move their data warehouses to the Cloud sooner or later. The reason is that Microsoft Azure, for one, is putting up data centers as fast as it can where power is cheap. Your own data center cannot possibly be competitive against what the Cloud companies are doing. This presents its own challenges: companies will be subject to the Cloud vendors, who play by their own rules. For example, Amazon Web Services (AWS) has two storage options: S3 and the Elastic Block Store (EBS). S3 is cheap, and EBS is quite expensive. Because of this, everyone will run on S3 for cost reasons. This presents a huge challenge to Cloud architecture. All of the data warehouse vendors will have to re-architect their products to run efficiently on S3 and other Cloud file systems, and they are currently performing major surgery on their engines to do so.
But warehouses are yesterday’s problem.
Data science is going to supersede business intelligence. If business intelligence is SQL analytics, data science is complex operations, including machine learning, non-SQL data analytics, etc. This is a more complex solution to a Big Volume problem, but data science, aka Big Analytics, is the future of technology and absolutely necessary. The problem? Data warehouses aren’t prepared for that future or for complex analysis, and we don’t have enough data scientists ready to do this level of work. Read on for more on our data scientist challenge.
Big Velocity and Its Disruptors
When we think about Big Velocity, we think about an unbelievable amount of data coming at us every millisecond. This is only increasing. In the US, a perfect example is car insurance. Companies are placing devices in vehicles to measure how people drive and base their insurance rates on that information. In the communications industry, we see 5G on the horizon, increasing the flow of data immensely. The same is true of the multiplayer online games that many of our kids love to play.
Velocity is going to shoot through the roof. So what do we do about it? There are two kinds of commercial solutions:
- Complex Event Processing (CEP): Think of this as big pattern, little state. This is best exemplified in electronic trading. We look for patterns in the firehose that goes by on CNBC and elsewhere. We’re looking for instances such as IBM going up and, within 10 seconds, Oracle going down. Then, we trade based on those instances. There are a couple of CEP products available, including Kafka and, to a lesser degree, Storm, that can handle this.
- Online Transaction Processing (OLTP): This solution is big state, little pattern. Continuing with the electronic trading example, let’s say there’s an electronic trading company with data centers all over the world. That company is trading at millisecond granularity–but there’s one catch. What happens if all systems decide to short IBM at the same time? It generates a huge amount of risk, and the company needs to be able to pull the plug if the risk gets too high.
Because of this, companies are evaluating data in real time and coding alerts. This requires being able to update a database at very high speed. Now, we have a database problem. How do we solve it? You can opt for the old SQL vendors, whose products are far too slow for the velocity of data we’re talking about. Another option is the NoSQL vendors, who suggest giving up SQL and ACID. This means no standards, which is not a good option for most transactions. The third option is the NewSQL vendors, who suggest retaining SQL and ACID but dumping the architecture used by the old vendors.
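To make "big pattern, little state" concrete, here is a minimal sketch of the CEP rule described above: IBM goes up and, within 10 seconds, Oracle goes down. It is an illustration only, not how Kafka, Storm, or any trading system implements it. The `Tick` record, the `ORCL` ticker symbol, the function name, and the assumption that ticks arrive in timestamp order are all hypothetical choices made for the example.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Tick:
    """One price update from the feed (hypothetical schema)."""
    symbol: str
    price: float
    ts: float  # timestamp in seconds


def detect_up_then_down(ticks, up_symbol="IBM", down_symbol="ORCL", window_s=10.0):
    """Yield (up_tick, down_tick) when up_symbol rises and down_symbol
    falls no more than window_s seconds later.

    Assumes ticks arrive in timestamp order.
    """
    last_price = {}       # little state: last seen price per symbol
    recent_ups = deque()  # up-moves of up_symbol still inside the window

    for tick in ticks:
        prev = last_price.get(tick.symbol)
        last_price[tick.symbol] = tick.price

        # Drop up-moves that have aged out of the time window.
        while recent_ups and tick.ts - recent_ups[0].ts > window_s:
            recent_ups.popleft()

        if prev is None:
            continue  # first tick for this symbol: no direction yet

        if tick.symbol == up_symbol and tick.price > prev:
            recent_ups.append(tick)
        elif tick.symbol == down_symbol and tick.price < prev and recent_ups:
            # Pattern matched: report the oldest pending up-move, then reset.
            yield recent_ups[0], tick
            recent_ups.clear()
```

The only state kept is the last price per symbol and a short queue of recent up-moves, which is why this kind of rule can keep pace with a fast feed. The OLTP case is the opposite: the pattern is trivial, but the state (every open position across every data center) is large and must be updated transactionally.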
Meet the 800-Pound Gorilla: Big Variety
Just a recap: Big Variety is having too much data coming from too many places. And it’s solved with data scientists. If you look at what a data scientist actually does, many data scientists say they first have to find the relevant data that answers a particular question. Data scientists hit the databases, the data lake, data files, and the relevant information available to them on the web. Once they get the data, they have to perform data integration on the resulting data sets. Huge companies are saying their data scientists spend 98% of their time doing data discovery and data integration. The remaining time? Spent fixing cleaning errors. This is a disservice to data scientists, but it has to be done because dirty data cannot be analyzed. This is our 800-pound gorilla.
The other version of this problem is an enterprise one. Enterprises should, ideally, run on a single procurement system. But some companies are running on more than fifty systems. This requires integrating multiple supplier databases along with parts, customers, lab data, etc. It’s the same data integration issue–spread across multiple systems within one company. And it’s an even bigger challenge to do it at scale.
There are a few solutions available:
- Extract, Transform and Load (ETL) packages plus Master Data Management (MDM) tools–These require too much manual effort, and MDM depends on users writing rules. Neither approach scales.
- Data preparation–These are easy-to-use solutions for simple problems. But is your problem simple?
- Machine learning and statistics–These overcome ETL limitations and can be used for the most complex problems.
There are a lot of startups like Tamr solving these complex problems through machine learning, statistics and more. Some startups focus on data preparation, and a few focus on enterprise data integration. Some focus on text and others focus on deep learning. The future belongs to these more complex analyses, which aren’t held back by the complexity of the problem or by ETL packages.
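As a rough illustration of the statistical end of this spectrum, the sketch below matches supplier names across two procurement systems by similarity score instead of hand-written rules. It is not Tamr’s method or any vendor’s: a plain string-similarity ratio from Python’s standard `difflib` stands in for a learned model, and the suffix list, the 0.85 threshold, and the function names are assumptions made for the example. In a learned approach, a score like this would typically be just one input feature among many.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of suffixes to ignore when comparing company names.
COMPANY_SUFFIXES = {"inc", "corp", "corporation", "co", "ltd", "llc"}


def normalize(name):
    """Lowercase, replace punctuation with spaces, and drop company suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    tokens = [t for t in cleaned.split() if t not in COMPANY_SUFFIXES]
    return " ".join(tokens)


def similarity(a, b):
    """Return a score in [0, 1]; 1.0 means identical after normalization."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return 0.0  # nothing left to compare
    return SequenceMatcher(None, na, nb).ratio()


def match_suppliers(system_a, system_b, threshold=0.85):
    """For each name in system_a, return its best match in system_b
    as (name_a, name_b, score) when the score reaches the threshold."""
    matches = []
    for name_a in system_a:
        best = max(system_b, key=lambda name_b: similarity(name_a, name_b), default=None)
        if best is None:
            continue
        score = similarity(name_a, best)
        if score >= threshold:
            matches.append((name_a, best, score))
    return matches
```

Comparing every name in one system against every name in the other is quadratic, so this only works for small lists. Doing the same thing across fifty systems and millions of records is exactly the scale problem described above.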
The Future of Big Data and the 3 Vs
As we look to the future of Big Data, it’s clear that machine learning is going to be omnipresent, including deep learning and conventional ML. Complex analytics are on the horizon no matter what–and both of these will go nowhere without clean data. This requires data integration at scale, the 800-pound gorilla in the corner. It’s time for companies to start worrying about data integration to ensure that volume and velocity are not compromised by the challenges faced with data variety.