Written by Nik Bates-Haus
There is a long-standing debate about appropriateness of traditional RDBMS and NoSQL systems for different applications. One motivation for a non-RDBMS storage tier for an application is data flexibility. The conventional wisdom about applications like Tamr’s that work with a variety of data is that:
1. RDBMSs require up-front schema definition;
2. Changing the schema later is very expensive;
3. It is impossible to anticipate the schema of all the data they will eventually need;
therefore, RDBMS back-ends are not appropriate for applications that work with a variety of data.
The conventional wisdom could use a refresher.Document stores and their kindred allow you to pour your data in without much regard to its structure, then figure out the structure later, as part of analysis. Such a system is called late-binding (or schema-on-read, or schema-on-need), so named after the behavior of type systems in some programming languages. One of the challenges with such late-binding systems is that, typically, the effort spent to figure out the structure of data in the context of one analysis is not re-used for other analyses. Another is that the performance of high-volume data analysis on such a system can be abysmal when compared to that of an early-binding system. Neither of these shortcomings is necessary, and, furthermore, one can have one’s cake (flexibility when loading data) and eat it too (efficiency).
There are several ways to get late-binding semantics in an RDBMS. The most common is to store data as EAV. In this format, each entity (transaction, customer, whatever) gets smeared across multiple rows, where each row has only one attribute (A), and one value (V). To use this data, you need to “pivot” it into a traditional multi-column representation, which also involves parsing the string representation of the value. This transformation is so common that major RDBMSs have built-in support for it, and it actually runs reasonably fast. You create a view representing the query-time structure you want, and run your analysis on that.
The major shortcoming of EAV is that it is hard to represent hierarchical data. In the example above, it is difficult to find a representation of an article with multiple authors that keeps author first- and surnames together, while simultaneously being simple to query. For this, major RDBMSs now support documents – not just as a CLOB, but as JSON or XML. These provide the advantage of storing an entire entity in a single row, and of correctly representing nested and array data structures. The downside is that the entire document now needs to be parsed at query time, which is slow. RDBMSs make up for this by allowing indexes to be defined for particular filters, and some go further, allowing those indexes to be used in aggregates.
Not content to rely on the RDBMSs’ capabilities to support efficient filtering and aggregation on documents, some deployments make use of an additional technique: once an attribute meets some threshold of utility, it is promoted into a new top-level column. This enables the fast filter and aggregation queries that RDBMSs are so good at, while also supporting the schema flexibility of a document store. In most databases, maintaining consistency between the documents and the columns is left as an exercise for the DBA; Vertica, in particular, supports this kind of promotion as a first class operation.
At Tamr, we have found that an RDBMS can be an entirely appropriate back-end for an application that requires data flexibility. Using the techniques outlined here, arbitrary data can be loaded into the RDBMS with no DBA intervention. As the structure of the data becomes apparent, that structure can be persisted in the RDBMS and shared with all users, benefiting future work. And those aspects of the data most important for efficient operation can be promoted into top-level columns. In this way, we get to have our cake and eat it too, and it is delicious!