As Big Data continues to evolve, we see companies working to address the amount of data that’s coming in, the rate at which it’s coming in, and the variation in the data itself. A number of companies are already addressing some of these disruptions, but there’s still an 800-pound gorilla in the corner of the Big Data room that we need to address. Before we get to that, though, let’s start at the top.
When we discuss challenges with Big Data, we’re often referring to one of the three Vs of Big Data:
Big Volume: you have too much data and you can’t manage the petabytes.
Big Velocity: the data is coming at you too fast, and you can’t keep up. It’s most analogous to the idea of trying to drink from a fire hose.
Big Variety: if data comes at you from too many places, there’s a good chance you’ll have a data integration problem.
Big Volume: SQL, Complex Analytics, and Disruptors
Big Volume is simply recognizing that you have too much data and can’t get your hands around all those petabytes. Most people are doing either SQL analytics or more complex analytics. If you want to do SQL analytics, there are (as of today) 20 or so data warehouses running multi-petabyte production systems day in, day out, on products from multiple vendors. By and large, these major vendors are pretty good at running gigantic clusters with hundreds of nodes and petabytes of data. If you want to do SQL analytics on petabytes, the best advice I can give you is to run one of these data warehouse platforms.
There are no gorillas here, but there are two disruptors you could encounter.
First, the Cloud. Companies will move their data warehouses to the Cloud sooner or later. The reason is that Cloud providers such as Microsoft Azure are putting up data centers as fast as they can wherever power is cheap. Your own data center cannot possibly compete with what the Cloud companies are doing. The move presents its own challenges: companies will be subject to the Cloud vendors, who play by their own rules. For example, Amazon Web Services (AWS) has two storage options: S3 and the Elastic Block Store (EBS). S3 is cheap, and EBS is quite expensive, so everyone will run on S3 because it’s more cost-effective. This presents a huge challenge to Cloud architecture. All of the data warehouse vendors will have to re-architect their products to run efficiently on S3 and similar Cloud storage, and they are currently performing major surgery on their engines to do so.
The second disruptor is that warehouses are yesterday’s problem.
Data science is going to supersede business intelligence. If business intelligence is SQL analytics, data science is complex operations, including machine learning and non-SQL analytics. This is a more complex solution to a Big Volume problem, but data science, aka Big Analytics, is the future of technology and absolutely necessary. The problem? Data warehouses aren’t prepared for that future or for complex analysis, and we don’t have enough data scientists ready to do this level of work. But more on our data scientist challenge in a moment.
Big Velocity and Its Disruptors
When we think about Big Velocity, we think about an unbelievable amount of data coming at us every millisecond. And it’s only increasing. In the US, a perfect example is car insurance. Companies are placing devices in vehicles to measure how people drive and base their insurance rates on that information. In the communications industry, the advent of 5G is increasing the flow of data immensely. The same is true of the multiplayer online video games that so many of our kids love to play.
Clearly velocity is shooting through the roof. So what do we do about it? There are two kinds of commercial solutions:
1. Complex Event Processing (CEP): Think of this as “big pattern, little state.” This is best exemplified in electronic trading. We look for patterns in the fire hose that goes by on CNBC and elsewhere. We’re looking for instances such as IBM going up and, within 10 seconds, Oracle going down. Then, we trade based on those instances. There are a couple of stream-processing products available, including Kafka and, to a lesser degree, Storm, that can handle this.
2. Online Transaction Processing (OLTP): This solution is “big state, little pattern.” Continuing with the electronic trading example, let’s say there’s an electronic trading company with data centers all over the world. That company is trading at millisecond granularity, but there’s one catch. What happens if all systems decide to short IBM at the same time? It generates a huge amount of risk, and the company needs to be able to pull the plug if the risk gets too high.
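The “big pattern, little state” idea in item 1 can be sketched in a few lines. This is only an illustration, not how any CEP product works internally: the `Tick` type, the `detect_up_then_down` function, the ticker symbols, and the 10-second default are all assumptions made up for the example. The only state it keeps is the last price per symbol and a short queue of recent upticks.

```python
from collections import deque
from dataclasses import dataclass


@dataclass(frozen=True)
class Tick:
    ts: float      # event time in seconds (hypothetical feed format)
    symbol: str
    price: float


def detect_up_then_down(ticks, up="IBM", down="ORCL", window=10.0):
    """Yield (up_ts, down_ts) when `up` rises and `down` then falls within `window` seconds."""
    last_price = {}        # little state: last seen price per symbol
    recent_ups = deque()   # timestamps of upticks in `up` still inside the window
    for tick in ticks:
        # Forget upticks that are too old to pair with this tick.
        while recent_ups and tick.ts - recent_ups[0] > window:
            recent_ups.popleft()
        prev = last_price.get(tick.symbol)
        last_price[tick.symbol] = tick.price
        if prev is None:
            continue  # first tick for this symbol: no direction yet
        if tick.symbol == up and tick.price > prev:
            recent_ups.append(tick.ts)
        elif tick.symbol == down and tick.price < prev and recent_ups:
            yield (recent_ups[0], tick.ts)
            recent_ups.clear()  # one signal per matched pattern
```

Feeding it a stream where IBM ticks up at t=2 and Oracle ticks down at t=5 produces one signal; an Oracle downtick 15 seconds after an IBM uptick produces none.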
To manage that kind of risk, companies are evaluating data in real time and coding alerts. This requires being able to update a database at very high speed. Now, we have a database problem. How do we solve it? You can opt for standard old SQL systems, which are far too slow for the velocity of data we’re talking about. Another option is the NoSQL vendors, who suggest giving up SQL and ACID. This means no standards and no transactional guarantees, which is not a good option for most transactions. Or the third option: NewSQL vendors. These vendors suggest retaining SQL and ACID but dumping the architecture used by the old vendors.
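To make the ACID requirement concrete, here is a minimal sketch of the “pull the plug” check as a single transaction. It uses Python’s built-in `sqlite3` module purely as a stand-in; it says nothing about how a NewSQL engine achieves its speed. The `exposure` table, the `record_short` function, and the `RISK_LIMIT` value are hypothetical names chosen for the example.

```python
import sqlite3

RISK_LIMIT = 1_000_000.0  # hypothetical firm-wide cap on short exposure per symbol


def open_book():
    # isolation_level=None: we issue BEGIN/COMMIT/ROLLBACK ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE exposure (symbol TEXT PRIMARY KEY, short_value REAL NOT NULL)"
    )
    return conn


def exposure(conn, symbol):
    row = conn.execute(
        "SELECT short_value FROM exposure WHERE symbol = ?", (symbol,)
    ).fetchone()
    return row[0] if row else 0.0


def record_short(conn, symbol, value):
    """Add `value` to the short exposure atomically; reject it if the cap would be breached."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock up front
    try:
        conn.execute(
            "INSERT OR IGNORE INTO exposure (symbol, short_value) VALUES (?, 0.0)",
            (symbol,),
        )
        conn.execute(
            "UPDATE exposure SET short_value = short_value + ? WHERE symbol = ?",
            (value, symbol),
        )
        ok = exposure(conn, symbol) <= RISK_LIMIT
        # Either the whole update becomes visible, or none of it does.
        conn.execute("COMMIT" if ok else "ROLLBACK")
        return ok
    except Exception:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise
```

The point of the sketch is the shape of the operation: the update and the risk check happen inside one transaction, so no trading system ever observes an exposure total that is over the limit.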
Meet the 800-Pound Gorilla: Big Variety
Big Variety occurs when you have too much data coming from too many places. The usual answer is to hire data scientists. But if you look at what data scientists actually do, many say they first have to find the data relevant to a particular question. They hit the databases, the data lake, data files, and whatever other information is available to them. Once they have the data, they perform data integration on the resulting data sets. Huge companies are saying their data scientists spend 98% of their time doing data discovery and data integration. The remaining time? Spent fixing cleaning errors. This is a disservice to data scientists, but it must be done because you can’t analyze dirty data. This is our 800-pound gorilla.
The other version of this problem is an enterprise one. Take procurement as an example. Enterprises should, ideally, run a single procurement system. But the reality is that some companies are running more than fifty systems. This type of multi-system environment requires integrating multiple supplier databases along with parts, customers, lab data, etc. It’s the same data integration issue, but spread across multiple systems within one company. And it’s an even bigger challenge to do it at scale.
There are a few solutions available:
Extract, Transform, and Load (ETL) packages plus traditional Master Data Management (MDM) tools: unfortunately, ETL requires too much manual effort, and traditional MDM depends on a user writing rules. Neither scales.
Data preparation: these are easy-to-use solutions for simple problems. But let me ask you – is your problem simple?
Machine learning and statistics: these solutions overcome ETL limitations and can be used for the most complex problems.
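As a rough illustration of the statistical approach in the last option, the sketch below matches supplier names across two systems by string similarity instead of hand-written rules. It is a toy stand-in, not any vendor’s method: the function names, the list of legal suffixes, and the 0.85 threshold are all assumptions, and it relies only on `difflib.SequenceMatcher` from the Python standard library.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes that carry no identifying information.
NOISE_WORDS = {"inc", "corp", "corporation", "co", "company", "ltd", "llc"}


def normalize(name):
    """Lowercase, strip punctuation, and drop legal suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in NOISE_WORDS)


def similarity(a, b):
    """Similarity score in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_suppliers(left, right, threshold=0.85):
    """Pair each name in `left` with its most similar name in `right`, if similar enough."""
    matches = []
    for a in left:
        best = max(right, key=lambda b: similarity(a, b), default=None)
        if best is not None:
            score = similarity(a, best)
            if score >= threshold:
                matches.append((a, best, score))
    return matches
```

Comparing every pair like this is quadratic in the number of records, which is exactly why doing it at enterprise scale is hard; the sketch only shows what “matching by score instead of by rule” means.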
There are a lot of companies like Tamr solving these complex problems through machine learning, statistics, and more. Some focus on data preparation, and a few focus on enterprise data integration. Others focus on text or deep learning. The future is complex analyses that aren’t hindered by the complexity of the problem or ETL packages.
The Future of the 3 Vs of Big Data
As we look to the future of Big Data, it’s clear that machine learning is going to be omnipresent, including deep learning and conventional ML. Complex analytics are on the horizon no matter what. And both of these will go nowhere without clean data. But this requires data integration at scale, the 800-pound gorilla in the corner. I believe it’s time for companies to start worrying about data integration to ensure that volume and velocity are not compromised by the challenges faced with data variety.