Written by Michael Stonebraker
Every decade there are three or four new systems that federate data across disparate systems. This has been a regular occurrence for the last forty or so years. In fact, I am responsible for proposing one in the 1970s (Distributed Ingres) and one in the 1990s (Mariposa). None of these projects has succeeded, and there is a good reason why not, as explained in this blog post. However, they may well play a role as a piece of a future data integration suite, as subsequently explained. First, let’s look at the common architecture of such federation systems.
Imagine your enterprise has a collection of DBMSs. These days, they range from RDBMSs to NoSQL DBMSs to file system software supporting “data lakes” to array DBMSs to graph DBMSs to … Obviously, one would like to query across these disparate systems with a common query notation. In all proposals I know of, this notation is some form of extended SQL. At the bottom tier there is a collection of DBMSs holding data. Each system includes an adaptor that translates between the local query notation and the global notation (SQL); these adaptors form the middle tier. The top tier is a federated DBMS, which can run cross-system SQL queries.
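The three tiers can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the class names (`SqliteAdaptor`, `DocumentAdaptor`, `Federator`) are hypothetical, an in-memory SQLite table and a list of dictionaries stand in for real DBMSs, and a Python predicate stands in for the global SQL notation.

```python
import sqlite3
from typing import Callable, Dict, Iterable, List

Row = Dict[str, object]


class SqliteAdaptor:
    """Middle tier: exposes a relational table (bottom tier) as plain rows."""

    def __init__(self, conn: sqlite3.Connection, table: str):
        self.conn = conn
        self.table = table

    def scan(self) -> Iterable[Row]:
        cur = self.conn.execute(f"SELECT * FROM {self.table}")
        cols = [d[0] for d in cur.description]
        for values in cur:
            yield dict(zip(cols, values))


class DocumentAdaptor:
    """Middle tier: exposes JSON-like documents (a stand-in for a NoSQL store)."""

    def __init__(self, docs: List[Row]):
        self.docs = docs

    def scan(self) -> Iterable[Row]:
        for doc in self.docs:
            yield dict(doc)


class Federator:
    """Top tier: runs one query across every registered adaptor."""

    def __init__(self, adaptors):
        self.adaptors = list(adaptors)

    def select(self, predicate: Callable[[Row], bool]) -> List[Row]:
        return [row for a in self.adaptors for row in a.scan() if predicate(row)]


# Hypothetical data: one relational source, one document source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE worker (name TEXT, salary REAL)")
conn.execute("INSERT INTO worker VALUES ('Alice', 5000.0)")

fed = Federator([
    SqliteAdaptor(conn, "worker"),
    DocumentAdaptor([{"name": "Bob", "wage": 6000.0}]),
])
rows = fed.select(lambda r: True)
```

The plumbing works: `rows` holds one record from each source. Note, however, that one record carries a `salary` field and the other a `wage` field. The adaptors translate query notation, not meaning.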
So what is wrong with this picture? Let me explain with an example. For a while I was the Chief Technology Officer for Informix, which had operations in 58 countries at the time. A new CEO arrived, turned to the VP of Human Resources in his first staff meeting, and asked an obvious question: “How many employees do we have?” The HR person said, “I don’t know, but I will find out.” At the next staff meeting, the CEO asked the same question but got a different answer: “I don’t know, and there is no way to find out.” It was not because Informix lacked a federated DBMS, although that would have helped in obtaining data from the 58 subsidiaries. The problem was that there was no uniform definition of an employee. Some subsidiaries counted contractors as employees and some did not; some counted temporary workers as employees and some did not; and so on. Hence, the CEO would have to decide what exactly he meant by an employee, and then the various data sets would have to be “normalized” to this definition. This would take weeks to months. In summary, the problem was that there was no common “schema” across the 58 data sets. Moreover, there was no automatic transformation that could produce a common schema.
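A toy illustration of why the headcount question has no single answer: the same records yield different totals depending on which definition of “employee” is applied. The records, the `kind` labels, and the three definitions below are invented for illustration and are not Informix data.

```python
# Hypothetical worker records; the "kind" labels are assumptions.
workers = [
    {"name": "A", "kind": "full_time"},
    {"name": "B", "kind": "contractor"},
    {"name": "C", "kind": "temporary"},
    {"name": "D", "kind": "full_time"},
]

# Three plausible definitions of "employee", each a set of counted kinds.
DEFINITIONS = {
    "strict": {"full_time"},
    "with_contractors": {"full_time", "contractor"},
    "everyone": {"full_time", "contractor", "temporary"},
}


def headcount(records, counted_kinds):
    """Count the records whose kind is included in the chosen definition."""
    return sum(1 for r in records if r["kind"] in counted_kinds)


counts = {name: headcount(workers, kinds) for name, kinds in DEFINITIONS.items()}
```

Here `counts` contains three different totals for the same four people. Someone has to choose the definition; no query engine can infer it.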
In my experience, independently constructed data sets NEVER have “plug compatible” schemas.
Here are some of the problems that arise, using a simple human resources example. You are the HR person in Paris; I am the one in New York. You have workers; I have workers. You have a worker table, as do I, though the two will rarely have the same name. Your workers have salaries; mine have wages. Your salaries are in Euros; mine are in dollars. So far, it is conceivable that a smart federated DBMS can figure all this out.
However, your salaries are net after taxes and include a government-mandated lunch allowance. Mine are gross with no lunch allowance. You pay your employees every two weeks; I pay monthly. Figuring this out automatically will take a smarter federation system than I can conceive of. Lastly, suppose I am trying to find the average salary across the federation. In this case, how do I treat part-time employees? Worse yet, there may be employees who work part time in both organizations. Finding them will require some sort of “fuzzy” deduplication system. Doing this on the fly at query time is simply fantasy.
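To give a feel for what “fuzzy” deduplication involves, here is a minimal sketch using the standard-library `difflib.SequenceMatcher`. It is not how Tamr or any production system does it; the names, the `normalize` rules, and the 0.85 threshold are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase, drop commas and periods, and collapse whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def likely_duplicates(left, right, threshold=0.85):
    """Compare every pair across the two lists; keep pairs above the threshold."""
    pairs = []
    for l in left:
        for r in right:
            score = similarity(l, r)
            if score >= threshold:
                pairs.append((l, r, score))
    return pairs


# Hypothetical part-time workers in two independently built data sets.
paris = ["Marie Dupont", "Jean Martin"]
new_york = ["Marie Dupond", "John Smith"]

matches = likely_duplicates(paris, new_york)
```

Even this toy version compares every record on one side with every record on the other, and its threshold only yields candidates that a person (or a trained model) must still confirm. Neither property fits comfortably inside a query that is expected to return in seconds.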
Tamr is in the business of unifying such disparate data sets, and sees these problems day in and day out. How bad can things be? Very bad, indeed!
One Tamr customer thought he had 200K suppliers. Because of duplicates across the various divisions, he had only 100K. Figuring this out using Tamr software and services caused the CEO to rethink his whole business model.
Another Tamr customer has distribution in Europe at the country level (or finer granularity). When a customer moved from (say) Spain to France, the company developed amnesia, because there was an independently constructed customer data set in each country. Avoiding this amnesia required unifying 250 customer data sets in 40 languages with 30+ million input records, a daunting task with Tamr data integration software, but inconceivable with data federation software.
In summary, there are without doubt simple applications where there is a global schema. E-mail is a conceivable example. Even here, fields other than “to”, “from” and “payload” will be difficult to federate on the fly. Also, “payload” will be in many, many languages.
Let me repeat: In my experience I have never seen independently constructed data sets with schemas that can be assembled by data federation software. So what can future data federation software be used for? Here is one possible answer.
Imagine a large enterprise, for example Merck. It has roughly 4,000 Oracle databases, a large data lake, and uncountable data files, and it is also interested in public data sets from the web.
A Merck data scientist might have a hypothesis, for example, “Does Ritalin cause weight gain in mice?” Somewhere in the data assets of the enterprise there are data sets relevant to this hypothesis. Somehow the scientist must find them and then go through the exercise of data integration described above. “Finding them” is a data discovery problem. Imagine that Merck had a global “data catalog” of data assets. This catalog would record (say) the name and attributes of all tables plus a text description. Further imagine that Merck (somehow) empowered data scientists with read access to all data assets. Then, a simple data federation system might help the scientist by allowing him to browse such data assets to discover data relevant to his hypothesis. The adaptors, as mentioned above, and a common query notation would be very helpful. Less helpful, but conceivably useful, would be cross-asset queries. Also helpful would be a powerful visualization system that would render data assets in something other than tabular form.
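A data catalog of the kind described (table name, attributes, text description) could be as simple as the following sketch. Everything here is hypothetical: the `CatalogEntry` fields, the source names, and the keyword-matching `search` function are illustrative assumptions, not a description of any Merck or Tamr system.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CatalogEntry:
    source: str                  # which system holds the table
    table: str                   # table (or file) name
    attributes: Tuple[str, ...]  # column names
    description: str             # free-text description


def search(catalog: List[CatalogEntry], keywords: List[str]) -> List[CatalogEntry]:
    """Return entries whose name, attributes, or description mention every keyword."""
    wanted = [k.lower() for k in keywords]
    hits = []
    for entry in catalog:
        text = " ".join([entry.table, *entry.attributes, entry.description]).lower()
        if all(k in text for k in wanted):
            hits.append(entry)
    return hits


# Hypothetical catalog contents.
catalog = [
    CatalogEntry("oracle_017", "mouse_trials",
                 ("subject_id", "compound", "weight_g"),
                 "Weekly weights of mice in compound trials"),
    CatalogEntry("data_lake", "hr_payroll",
                 ("employee_id", "salary"),
                 "Payroll extracts"),
]

hits = search(catalog, ["mice", "weight"])
```

A search like this only tells the scientist where to look. The data sets it finds still have to be integrated by the process described earlier in this post.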
In summary, a future data integration system would include:
- Tamr-style data integration software
- A data catalog
- Data federation software
- Data visualization software
And, most importantly, an enlightened Chief Data Officer (CDO) who could facilitate read access to all data assets.
In the meantime, contact Tamr with your data integration needs, contact other startups for other pieces, and work on enlightening your CDO.