We all know by now that most “big data” challenges fall into one of the “3 V’s”:
Volume: You have too much data
Velocity: It is coming at you too fast
Variety: It is coming at you from too many places/silos
Over the past 5 years, my partner Mike Stonebraker and I have been on a mission to raise awareness of the problems associated with Variety in enterprise data. Two of our earlier projects — Vertica and VoltDB — were designed to address data Volume and Velocity, respectively. Our most recent project — Tamr — is designed to address data Variety for the enterprise. (Shame on us … we couldn’t come up with another company name that started with “V.”)
We started by doing some research at MIT/CSAIL, working closely with Trifacta co-founder Joe Hellerstein. During this time, we constructed an early prototype for what has become the Tamr product. After validating our ideas with three commercial collaborations, we eventually started a new company — initially called Data-Tamer and then renamed Tamr (adopting the Silicon Valley cliché of “when looking for an available domain, remove the vowel”).
Now that we’re three years into Tamr’s commercial efforts, it’s worth sharing a few thoughts about the state of enterprise data and the potential for large enterprises to compete on integrated analytics. Specifically, we’re seeing massive enterprise investments in “big data” focused on data systems that:
- are broad in scope across an entire large enterprise (across all silos)
- run on distributed, shared-nothing, multi-purpose hardware infrastructure and/or in the cloud
- have a foundation of open source software at their core
- embrace access to data via both JSON AND declarative language/SQL
Let me describe why investment in these areas supports the long-term value of enterprise data and will help companies compete on integrated analytics in a sustainable way:
I. Breadth Across All Enterprise Silos
One of the clearest problems that Mike and I have seen is that enterprise IT organizations are fiercely application-centric and siloed in their organization, incentives and mindset. This works against line-of-business owners looking to use broad analytics as a competitive advantage — a goal that requires a broader enterprise view of data across application, organizational or geographic silos.
Enterprise data warehouses — which aspired to become this broad resource in the enterprise — suffer from being constrained to a small number of sources from a limited number of applications. Since most of these warehouses are built on traditional “scale up” infrastructure — often with proprietary software — almost all enterprise warehouses end up being:
- incredibly expensive to build and maintain (think Teradata, Oracle Exadata)
- limited in the number of sources feeding them data through highly engineered Extract, Transform, Load (ETL) systems such as Informatica, IBM Ascential, Talend, Pentaho, etc.
- rigid and slow to change: when your business changes, it takes months, quarters or even years for these elephant-like systems to adjust, which is most obvious when companies need that long to integrate systems after a merger or acquisition
The result? These architectures, technologies — and the people/vendors who have a vested interest in them — end up restricting the liquidity of data required for companies to “compete on analytics” as my friend Tom Davenport likes to say.
To realize the analytical potential of the enterprise, systems instead need:
- to be much less expensive;
- to embrace ALL the data in an enterprise, not just the highly engineered sources; and
- to be fed by radically more agile (ETL) processes.
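To make the third requirement concrete, here is a minimal sketch of what a more agile ETL step can look like: mapping heterogeneous records from two hypothetical silos (a CSV export and a JSON feed, with different field names) onto one common schema. All field and source names here are illustrative assumptions, not anything from a real Tamr pipeline.

```python
import csv
import io
import json

def normalize(record: dict) -> dict:
    """Map a source record onto a minimal common schema.

    The field names ("supplier", "product", "unit_price") and their
    per-source aliases are hypothetical; a production pipeline would
    learn or curate these mappings per source.
    """
    return {
        "supplier": record.get("supplier") or record.get("vendor_name"),
        "product": record.get("product") or record.get("item"),
        "unit_price": float(record.get("unit_price") or record.get("price")),
    }

# Source A: a CSV export from one purchasing silo.
csv_rows = list(csv.DictReader(io.StringIO(
    "vendor_name,item,price\nAcme,widget,9.50\n")))

# Source B: a JSON feed from another silo, with different field names.
json_rows = json.loads(
    '[{"supplier": "Acme", "product": "widget", "unit_price": 9.25}]')

# One unified view, regardless of which silo a record came from.
unified = [normalize(r) for r in csv_rows + json_rows]
```

The point is not the ten lines of Python but the posture: tolerate each source's idiosyncrasies at ingest time instead of forcing every source through a months-long schema-engineering project first.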
In search of more data liquidity, many companies have turned, naturally, to “data lakes” as holding areas for information from many enterprise sources. This makes terrific sense in theory: data lakes are inexpensive and can hold a ton of data in diverse formats from disparate sources. In practice, lakes can get so large and loose that they risk being unusable.
It’s now time to embrace the broad adoption of data lakes by implementing infrastructures that make them a viable alternative to traditional enterprise data warehouses.
With the right infrastructure, the volume and variety of data within a data lake can help us answer simple, broad — and very powerful — analytical questions such as: “Are we getting the best price for any given product we purchase at our company?”
Simple enough to ask, with enormous financial upside in optimizing spending across every purchase order in a large multinational company. But very difficult to answer without a view and an accounting of all the “long tail” purchasing data in the system. Every large organization comprises many smaller groups — often with their own systems creating idiosyncratic spending patterns. Integrating the many disparate purchasing systems in a way that optimizes spending on every transaction without threatening local purchasing control is precisely the kind of problem we’re tackling at Tamr.
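Once the long-tail purchasing data is integrated, the question itself really is simple. As a sketch, assuming a toy in-memory table (the table and column names are hypothetical), the best negotiated price per product is one GROUP BY away:

```python
import sqlite3

# A toy stand-in for integrated purchasing data from several silos;
# the schema is an assumption for illustration only.
con = sqlite3.connect(":memory:")
con.execute(
    "CREATE TABLE purchases (business_unit TEXT, product TEXT, unit_price REAL)")
con.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("EMEA", "widget", 9.50),
        ("APAC", "widget", 8.75),
        ("AMER", "widget", 10.10),
    ],
)

# For each product: the best price any business unit negotiated,
# and how many orders were placed at any price.
rows = con.execute("""
    SELECT product, MIN(unit_price) AS best_price, COUNT(*) AS orders
    FROM purchases
    GROUP BY product
""").fetchall()
# rows -> [("widget", 8.75, 3)]
```

The hard part is everything upstream of this query: getting the many idiosyncratic purchasing systems to agree that these three rows describe the same product at all.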
II. Distributed Infrastructure
We think that the traditional “scale up” infrastructure that characterized enterprise infrastructure over the past 30 years has run its course. It’s not that it’ll go away. Quite the opposite. Much like mainframe systems, the scale up systems implemented in the modern enterprise will have a long legacy that stretches across several decades to come.
However, these scale-up, client-server systems were designed and implemented to automate enterprise processes, which makes them essentially “data producing” systems. As the enterprise begins to focus on “data consuming” analytic systems, two things become clear:
- Traditional “scale up” enterprise infrastructures don’t really work for analytic applications. The big internet companies realized this 15-20 years ago, which is why they began building their own infrastructure: mostly software built from the bottom up to run on scale-out, shared-nothing, multi-purpose hardware. As these systems became successful, it was natural for these companies to share the design patterns openly. This is where the Hadoop and NoSQL movements started.
- If large enterprises discipline themselves into building their new analytical systems on distributed, scale-out infrastructure, they will gain tremendous flexibility in multiple dimensions. First, they will draw a line in the sand for the historically slow and greedy enterprise software vendors (Oracle, SAP, EMC, etc.), creating alternatives to vendor lock-in. Second, they will find it easier to leverage hosted multi-tenant cloud infrastructures: moving a system designed for an internal cluster onto a hosted cluster is far more achievable than moving a system designed for on-premises scale-up hardware into the cloud.
In short, the transition of ALL large enterprises to large-scale distributed, shared-nothing, multi-purpose hardware infrastructure has begun. All new products designed to create value for enterprises should embrace this new hardware design pattern.
III. Open Source Software
A combination of open source software — and best-of-breed proprietary software designed to leverage it — is the future. Most large vendors are prone to overvalue their proprietary code base, restricting access to it through pricing and, as a result, preventing broad adoption. The most effective vendors will recognize that open source communities create incredibly powerful breadth of adoption, while still enabling software vendors to augment open source with best-of-breed proprietary software.
This has the added benefit of yet again drawing a line in the sand to discourage potentially greedy software vendors who chronically overestimate the value of their own technology in attempting to lock customers into their proprietary products. This is the primary source of the “data silos” and the root cause of the Variety problem in most enterprises.
IV. “Native” AND Declarative Languages/SQL Access
By now, most of the NoSQL community has embraced the value of — and need for — a declarative language, and in the process most have acknowledged that SQL is that language. The result is that most modern data storage systems will offer access methods that include BOTH a declarative language (SQL) and more direct “native” methods, chief among them JSON.
All the energy that has gone into bashing SQL — and the resulting amplification in the tech press — has been pretty much a waste of everyone’s time. That effort would have been better spent bashing the proprietary and closed nature of the large database engines that happen to use SQL. All of the big database vendors have pursued technical strategies designed to lock customers into their products by encouraging users to create radically complex spiderwebs of code mixed with their data. Oracle epitomized this approach with its implementations of materialized views and PL/SQL, both of which I consider bad practice for most customers. Materialized views just masked the underlying performance problems of the Oracle engine/optimizer for read-oriented applications. PL/SQL leads most customers into creating a hairy mixture of code and data that is the antithesis of loose coupling and service-oriented architectures.
Additionally, while the optimizers in these most popular engines were significant advancements in the 1980s, they are now a primary bottleneck due to the fundamental failure of database engine companies to abandon their “one size fits all” strategy for database engines. The REAL beef of the NoSQL movement was with the existing database engine vendors. SQL was the proverbial baby that the new generation of database vendors and customers were throwing out with the bathwater. SQL AND JSON are the future.
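The “SQL AND JSON” point can be sketched in a few lines: the same records served both ways, with the document form walked natively and the relational form queried declaratively. The order documents and field names below are hypothetical, and SQLite stands in for whatever engine an enterprise actually runs.

```python
import json
import sqlite3

# Hypothetical order documents, as a JSON feed might deliver them.
docs = json.loads("""[
  {"id": 1, "customer": "Acme", "total": 120.0},
  {"id": 2, "customer": "Bolt", "total": 75.5}
]""")

# "Native" access: walk the JSON documents directly.
acme_total = sum(d["total"] for d in docs if d["customer"] == "Acme")

# Declarative access: project the same documents into a table and ask in SQL.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (id INTEGER, customer TEXT, total REAL)")
con.executemany("INSERT INTO orders VALUES (:id, :customer, :total)", docs)
(sql_total,) = con.execute(
    "SELECT SUM(total) FROM orders WHERE customer = 'Acme'").fetchone()

# Both access paths agree on the answer.
assert acme_total == sql_total == 120.0
```

Neither path is the “right” one: the native path suits application code working document-by-document, while the declarative path lets analysts ask set-oriented questions without writing traversal code.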
In summary: companies that draw a line in the sand for existing database system vendors and build their new systems on these four principles (breadth across all enterprise data sources; distributed, shared-nothing, multi-purpose hardware; open source software; and BOTH JSON and SQL access) are setting themselves up for improved productivity and radically better data systems in the future.