Data Masters Podcast
July 1, 2020

Getting Data Mastering at Scale Right

Mike Stonebraker
Adjunct Professor of Computer Science, MIT

What's required to master large numbers of data sources? First, avoid approaches that require writing rules. Then use machine learning and cloud computing to efficiently handle the workload. That advice comes from Mike Stonebraker, a database pioneer who helped create the INGRES relational database system, won the 2014 A.M. Turing Award, and has co-founded several data management startups, including Tamr. Mike talks about common data mastering mistakes, explains why traditional tools aren't right for the task, and shares examples of companies that have successfully mastered data at scale.

Nate Nelson: Hey, everyone. Welcome to the DataMasters podcast. My name is Nate Nelson. I'm here with Mark Marinelli from Tamr, who's going to introduce the subject and the guest of this episode of our show. Mark, how are you doing today?

Mark Marinelli: Hey, Nate. Good to talk to you again. So Mike Stonebraker has been a pioneer in database research for more than 30 years. He helped create the first working relational database system, INGRES, back in the seventies. And he's kept up with the changes, and driven some of them, in data management ever since. He's co-founded several startups, including Tamr, aimed at helping companies solve complex data challenges. In 2014, he received the Turing Award, which is the most prestigious award in computer science, for his contributions to the concepts and practices underlying modern database systems. Mike is currently an adjunct professor at MIT in computer science, and he's the co-director of the Intel Science and Technology Center that's focused on big data. In this episode, Mike's going to look at the problems companies face around mastering data at scale and why traditional database approaches, rules-based approaches, fall short. And he's going to name some organizations that have successfully used data to drive their business.

Nate Nelson: Okay, let's listen into Mike Stonebraker. Mike, how do you define data mastering at scale?

Mike Stonebraker: First of all, data mastering is a huge deal. And the reason for that is enterprises decompose themselves into business units so that they can get stuff done and business units are relatively independent. So enterprises create lots and lots and lots of semi autonomous business units. And the reason to do that is so that they can get stuff done and every decision doesn't have to go up to God. The problem is that when you create an independent business unit, you inevitably create a data silo. So information in the enterprise exists in silos all over the enterprise, and there's huge business value in integrating these data silos. So a lot of enterprises want to do cross selling between business units so that if you're a customer of the transportation business unit, the refrigerator business unit may want to try and sell stuff to you. That requires integrating customer databases from multiple silos and et cetera, et cetera, et cetera.

Mike Stonebraker: So integrating products, integrating customers, integrating suppliers, all of that has huge business value. And that's come to be called data mastering, which is finding entities that are shared across these business units so that you can obtain business value. So that's data mastering. And at scale, that means, if you have a hundred records and I have a hundred records, master them however you want, use your wristwatch, do it on a piece of paper, but at scale you've got to really think about things. And at scale usually means lots of records, lots of data sources, lots of entities and so forth. So just for example, Toyota Motor Europe wants to master customers across all of Europe. Right now, Toyota has distribution units in every country, and in many countries their distribution is at finer granularity. It's at Cantons in Germany, for example. So if you buy a Toyota in Spain and you move to France, Toyota develops amnesia about you.

Mike Stonebraker: So they want to correct that and that requires mastering all European customers. There are 30 million plus of them in 250 databases in 40 languages. So this cannot be done on your wristwatch. So at scale means lots of records, lots of entities, complicated problem.

Nate Nelson: What we're discussing today are the biggest blunders that companies make when data mastering at scale. So let's get into it. MDM, it's a pretty popular data management solution. Why isn't it sufficient to solve the problem of data mastering?

Mike Stonebraker: That's a great question. So MDM stands for Master Data Management. It's the traditional solution peddled by a bunch of companies, including Informatica and IBM. I'll call them the elephants. The problem with MDM is this. Let's go back to the customer in Spain who moves to France. So basically, you want to master people, and MDM says what you want to do is use a rule system. So you write a bunch of rules that say things like, well, if age is the same and social security number is the same, then they're probably the same person. Or if the edit distance between two names is less than a certain amount, then they're probably the same person. So you start writing rules. That's the solution of all the elephants' MDM products. The problem with rule systems is they don't scale. So the traditional wisdom is that a single person can write and grok about 500 rules. And after that, you just can't keep this all in your head.

Mike Stonebraker: So if you can define a solution in 500 rules, God bless you. So MDM simply does not scale. So by all means, if you're confident, you can do the problem in 500 rules, the technology will be fine. But if you have a problem that cannot be done in 500 rules, you are heading into quicksand if you use MDM.
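
The two rules Mike describes can be sketched in a few lines of Python. This is only an illustration of the rule-writing style, not any vendor's MDM product: the field names (`age`, `national_id`, `name`) and the distance threshold are assumptions made up for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def same_person(rec_a: dict, rec_b: dict, max_name_distance: int = 3) -> bool:
    """Two hand-written rules; the threshold of 3 is a hypothetical choice."""
    # Rule 1: same age and same national ID number.
    if rec_a["age"] == rec_b["age"] and rec_a["national_id"] == rec_b["national_id"]:
        return True
    # Rule 2: names within a small edit distance of each other.
    distance = edit_distance(rec_a["name"].lower(), rec_b["name"].lower())
    return distance <= max_name_distance
```

With this threshold, "Jon Smith" and "John Smith" match (distance 1), but "Mike" and "Michael" are 4 edits apart, so that pair needs yet another rule. That is how rule sets grow toward the limit Mike mentions.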

Nate Nelson: What about ETL?

Mike Stonebraker: So, ETL: extract, transform, and load. If I have a bunch of data sets and I want to do data mastering, well, the first thing I have to do is put these data sets into a common form. So I have to get them to a single place. I need to change attribute names to match, like I have to figure out that salary is the same thing as wages. I have to transform euros to dollars or dollars to euros so that I get salaries into common units. I've got to deal with dirty data. So that's all putting the data sets together so that you can then apply master data management. So the problem with ETL is that the traditional elephants, the same cast of characters that we talked about previously, all have ETL solutions. And they all work in the following way, which is, I take my smartest team or my smartest person, and I create this global schema to which I'm going to map everything.

Mike Stonebraker: And then I send a programmer out to each data source and he talks to the business unit owner, figures out what's in that data source and figures out how to do all this extract, transform and load stuff. So this is a manual process per data source. There are two problems with ETL. Number one, no one is smart enough to do a global schema up front at any scale. So if you have a whole bunch of data sources, no one is smart enough to understand the global schema up front. Number two, if you have a lot of data sources, there aren't enough programmers in the world to go out and do this manual ETL process. So again, ETL works fine if you have two or three data sources or maybe even 10, twist my arm and I'll give you 25, but it isn't going to scale to large numbers of data sources. So again, the technology is brittle. As long as you don't need to deal with very many data sources, God bless you. But if you do need to deal with a lot of data sources, for goodness' sake, don't use this.

Mike Stonebraker: So an example is GlaxoSmithKline, the big drug manufacturer. They want to integrate all their research data, which is coming from thousands and thousands of data sources. And they estimate that in aggregate, they have more than 10 million attributes that they've got to put together into a global schema. So with the technology of picking up an attribute and mapping it to a global schema, you're going to get carpal tunnel syndrome long before you map 10 million attributes. So the technology just doesn't scale. So the traditional solutions just plain don't scale.
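
The per-source work described above (renaming attributes such as salary and wages to one global name, and converting currencies to a common unit) looks roughly like the sketch below. Every name in it is hypothetical, including the source names, the global attribute `annual_pay`, and the fixed exchange rate, which is illustrative rather than a real quote.

```python
# Hypothetical hand-built mapping: local attribute name -> global attribute name.
ATTRIBUTE_MAP = {
    "source_a": {"emp_name": "name", "salary": "annual_pay"},
    "source_b": {"full_name": "name", "wages": "annual_pay"},
}

# Hypothetical currency of each source, plus a made-up fixed conversion rate.
SOURCE_CURRENCY = {"source_a": "USD", "source_b": "EUR"}
EUR_TO_USD = 1.10  # illustrative only


def to_global(source: str, record: dict) -> dict:
    """Rename attributes to the global schema and express pay in USD."""
    mapping = ATTRIBUTE_MAP[source]
    out = {mapping[key]: value for key, value in record.items() if key in mapping}
    if SOURCE_CURRENCY[source] == "EUR":
        out["annual_pay"] = round(out["annual_pay"] * EUR_TO_USD, 2)
    out["source"] = source  # keep provenance for later mastering
    return out
```

Each new data source needs its own entry in both dictionaries, written by someone who understands that source. That manual step is the part that does not stretch to thousands of sources or millions of attributes.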

Nate Nelson: Mark, can you explain why rules writing is so cumbersome?

Mark Marinelli: You look at what you're trying to accomplish when you're building rules. You've got a technically sophisticated and skilled person building these rules. And then you've got business users who do not think in terms of conditional logic or SQL or anything like that, who know the most about the data. And neither party knows an awful lot about what the other one is doing. So they have to collaborate. When you look at what you're trying to accomplish, capturing an expert's knowledge of the data in code that they wouldn't understand, and then keeping up with all the changes in both their requirements and the data that they're using, you see lots of problems. There's a translation issue, where some of that business knowledge isn't accurately codified in the rule set: my subject matter expert wasn't available to me this afternoon and I think this is what they meant, so here's how I'm going to code it.

Mark Marinelli: You've got a conflict resolution issue, where two of our experts in the data have divergent views of what the actual rules are. And somebody's got to figure out who is right, or even worse, maybe they're both right, and we've got to figure out how to instrument that in our rules. You've got a maintenance issue, where new data arrive, which break all of the rules, or the business rules change because the business has changed and you need to retrofit them accordingly. With a handful of rules, this is no big deal, but at the scale and scope that we're talking about, this can be hundreds or thousands of rules. They're interdependent, they're volatile, and they need to be rationalized to produce accurate data at the end. So you end up with teams of people doing this, and they're actually never done, because constantly, day to day, new data or new requirements are going to make for different rules.

Nate Nelson: What are some of the other more common mistakes that you've seen in your time when companies are mastering their data?

Mike Stonebraker: The biggest problem I see is companies try the traditional solutions because that's what the elephants are marketing. And as we've already talked about, that just doesn't work at scale. The second blunder that people make is to say, well, how hard can this stuff be? All I've got to do is map Mike Stonebraker and Michael Stonebraker to the same place. So the second biggest blunder is that people who are not sophisticated at mastering think it's easy, and they dent their pick on how hard it actually is. Also, to do it at scale, you've got to use machine learning. No other technology is going to work, and machine learning is not for the faint of heart. At some point, machine learning will get easier to use, but right now it's fairly challenging. So the first thing any company interested in data mastering should do is hire a couple of people who are very sophisticated in this technology.

Mike Stonebraker: The third error companies tend to make is political. Mark Ramsey is the Chief Data Officer of GSK, or he was at the time a couple of years ago. And I was talking to him and he said, "When I was approached to become the CDO of GSK, I said I will only take this job if you, meaning the company, the president, give me read access to all data in GSK." So if you're the CDO who's responsible for doing mastering projects and you don't have read access to everything, you have no chance of succeeding. So politically, first of all, you've got to have a CDO whose scope spans whatever mastering you want to do. And you've got to empower him to be able to see everything so he can do his job.

Nate Nelson: Now that we've established some of the pitfalls, let's talk about the good ideas out there. So other guests of this podcast have mentioned that machine learning and automation are critical in data mastering. Mike, would you agree? Where does it fit in the industry moving forward?

Mike Stonebraker: Absolutely. So we can go back to GSK. So GSK wants to master research data and its 10 million plus attributes from thousands of data sources. You cannot do this manually. I mean, hell will freeze over and your carpal tunnel syndrome will get bad before you ever get close to these kinds of numbers. So it's got to be an automated solution. Anything manual cannot scale. And the only technology that's automatic that will scale is machine learning. The problem with ML is, first of all, you need training data. So the idea is you have a big problem, you take a small piece of it and you tag matching attributes or matching entities, and then machine learning will take the tagging that you did and extend it to everything. In simplistic terms, that's what ML does. Well, getting training data is often a problem. Getting training data that's not skewed is often a problem. Correcting errors that come from your training data is a problem. Figuring out whether you've got enough training data is a problem. All of these things are things you have to think about.
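
The "tag a small piece, then extend it to everything" idea can be shown with a deliberately tiny stand-in for a real model. The sketch below learns a single similarity cutoff from tagged pairs and applies it to untagged ones. It is an assumption-laden toy, not how Tamr or any production system works: real mastering systems use many features and far richer models, and the function names here are invented for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1], using the standard library's SequenceMatcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def learn_threshold(tagged_pairs):
    """tagged_pairs: list of (a, b, is_match). Return the cutoff that
    classifies the most tagged pairs correctly."""
    scored = [(similarity(a, b), is_match) for a, b, is_match in tagged_pairs]
    candidates = sorted({score for score, _ in scored})

    def accuracy(cutoff):
        return sum((score >= cutoff) == is_match for score, is_match in scored)

    return max(candidates, key=accuracy)


def predict(cutoff: float, a: str, b: str) -> bool:
    """Extend the learned cutoff to a pair nobody tagged."""
    return similarity(a, b) >= cutoff


# A small, hand-tagged sample (hypothetical records).
tagged = [
    ("Mike Stonebraker", "Michael Stonebraker", True),
    ("Jon Smith", "John Smith", True),
    ("Jon Smith", "Mary Jones", False),
    ("Acme Corp", "Zenith Ltd", False),
]
cutoff = learn_threshold(tagged)
```

Even this toy shows the concerns Mike lists: if the tagged sample is skewed or too small, the learned cutoff moves, and every prediction built on it moves with it.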

Nate Nelson: You already mentioned maybe some of the concerns around ML. One of the most notable is that ML will one day replace data scientists and other people who handle data. Mike, do you see that as being the case in the future? Can ML and data scientists work side by side?

Mike Stonebraker: The answer is absolutely. So let me just give you a quick example. So Carnival Cruise Lines is actually nine different cruise companies. So Carnival is one of those cruise companies, but so is Holland America, and in Europe, a company called Costa. So there are nine of them that have been aggregated into a single holding company. So Carnival wants to share spare parts among all of these cruise lines. So for example, Carnival has its own spare parts system, Holland America has a different one, Costa has a third one and so forth. So there's at least nine of these spare parts systems. You can keep some spare parts on the boat, you can keep some of them on the dock, you can keep some of them in a warehouse and so forth. Well, it turns out everybody uses the same straws in their dining rooms. So if you run out of straws on a Holland America boat, it may be that you can get them from a Carnival boat. So they want to share spare parts across the traditional cruise lines.

Mike Stonebraker: So that requires you to figure out, among at least nine spare parts systems, which parts actually are the same thing. So you have to do parts mastering. And the trouble with parts mastering is every cruise line identifies their parts differently. So you have to figure out that an XY6Z67 pump of Carnival is the same thing as a ZYQ42 pump in Holland America. That kind of stuff. So you have to do parts mastering. So how are you going to do that? Well, option A: in aggregate across these cruise lines, there are about 10 million total parts. So you can just line up all the parts in a single place and then master 10 million things. So that's option A. Option B is that every one of these cruise lines has a parts classification system. So in Holland America, they have parts, a subclass of the parts is computers, a subclass of computers is memory and so forth.

Mike Stonebraker: So they have a parts hierarchy and Holland America parts are in their parts hierarchy, ditto for all the other cruise companies. So they have nine parts hierarchies in which all the parts are classified. So the second option, option B, for doing parts mastering is to say, what I'm going to do is take these nine parts hierarchies and master them. In other words, I'm going to figure out that computers to Holland America are the same thing as automatic machines to Carnival. So I will try and master the categories. And once I do that, I will have a global classification system. And then I will take every classification bucket from every cruise line, put it into the correct bucket in the global hierarchy, and then I'm going to master those buckets one by one by one. That's the second way to do it. So a smart human has to figure out the algorithm that you're going to attack the overall problem with.

Mike Stonebraker: It's also going to take a sophisticated data scientist to figure out whether you have enough training data. So all the issues that come up in ML ultimately have to get decided by a human.
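
Option B from the Carnival example amounts to mastering the categories first and then comparing parts only within each global bucket. The sketch below assumes the category mastering has already produced a mapping. The two pump IDs and the "computers" and "automatic machines" categories come from the conversation; the other category names and part IDs are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Assumed result of mastering the category hierarchies:
# (cruise line, local category) -> global category.
CATEGORY_MAP = {
    ("Carnival", "automatic machines"): "computers",
    ("Holland America", "computers"): "computers",
    ("Carnival", "pumps"): "pumps",
    ("Holland America", "pumping equipment"): "pumps",
}


def bucket_parts(parts):
    """Group parts by global category.
    parts: iterable of (cruise_line, local_category, part_id)."""
    buckets = defaultdict(list)
    for line, local_category, part_id in parts:
        buckets[CATEGORY_MAP[(line, local_category)]].append((line, part_id))
    return dict(buckets)


def candidate_pairs(buckets):
    """Option B: only compare parts that landed in the same global bucket."""
    for members in buckets.values():
        yield from combinations(members, 2)


parts = [
    ("Carnival", "pumps", "XY6Z67"),
    ("Holland America", "pumping equipment", "ZYQ42"),
    ("Carnival", "automatic machines", "C-100"),   # hypothetical part ID
    ("Holland America", "computers", "HA-77"),     # hypothetical part ID
]
pairs = list(candidate_pairs(bucket_parts(parts)))
```

With these four parts, option A would compare all 6 possible pairs, while option B compares only 2. Which approach gives better results on 10 million real parts is the kind of judgment Mike says a human still has to make.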

Nate Nelson: And companies are also pushing more workloads to the Cloud, especially ones that are more compute intensive. Mike, do you see this happening for data mastering as well, or do organizations generally prefer on premises deployments?

Mike Stonebraker: I think that the Cloud in general is going to take over. I mean, it's in the process of taking over and there are two very good reasons for that, which I can illustrate with a couple of vignettes. So the first one is from James Hamilton, a very well known computer scientist who works for Amazon. He claims that Amazon can stand up a server for 25% of your costs. The reason is they put servers in the Columbia River Valley, where power is cheap. You put them on raised flooring in Cambridge, Massachusetts. Obviously they're going to be cheaper. And in summary, they're standing up servers by the millions, you're standing them up by the thousands, so they're going to be cheaper. And the second vignette comes from Dave DeWitt, who until a couple of years ago was the head of the Jim Gray Systems Lab for Microsoft. He said that at the time, Azure data centers used the following technology, and I suspect it's probably the same today. Azure data centers were shipping containers in parking lots: chilled water in, power in, internet in, otherwise sealed. Roof and walls are optional, only needed for security.

Mike Stonebraker: Compare how that cost is going to look relative to your raised flooring in Cambridge. So they're just going to be a lot cheaper, and sooner or later prices will reflect their costs and they will be a great deal under yours. So it will make economic sense to move everything you can to the Cloud. And the data warehouse is moving very aggressively to the Cloud. So anything that's decision support is going to move aggressively to the Cloud. I expect mastering will also move aggressively to the Cloud because it can. So anything you can move to the Cloud, you should. The things that aren't going to move are things like legacy OLTP systems coded in COBOL, running on peculiar hardware.

Nate Nelson: It's clear why ML and Cloud computing are useful, but why are they essential to the process of data mastering?

Mark Marinelli: So machine learning is enormously powerful in its ability to automate away a lot of the work that was previously done by hand. The way we did it before, with teams of people, was slow, expensive, and often inaccurate. There's definitionally no other way to do this with computers than through enhanced algorithmic approaches, machine learning, AI, et cetera, if we want to solve that problem. However, machine learning is also very computationally intensive and expensive, especially when working at the data scale that we're trying to work at. So in order to match the ML to the problem, we need enormous resources. That's where Cloud computing comes in. It's usually leverageable, it's really the exclusive modality for making all of this work economically. Machine learning workloads are not just big, but they're also sporadic, especially for mastering. I need a lot of compute for the next few hours, then I don't need any for the next few days. So that's another important part where Cloud computing comes in: I can get these burst-mode, ephemeral but large-scale computing resources, and then shrink them down to almost nothing and only pay for what I'm using.

Mark Marinelli: That's really the only way to go about these machine learning workloads and do so at a total cost of ownership that is acceptable.
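
The cost argument for sporadic workloads is simple arithmetic. The sketch below compares keeping a cluster running for a whole cycle with paying only for the busy hours. All figures are hypothetical (100 nodes, 1 currency unit per node-hour, 3 busy hours in every 72), and it assumes the same hourly rate in both cases, which real pricing will not match exactly.

```python
def always_on_cost(nodes: int, rate_per_node_hour: int, hours_in_cycle: int) -> int:
    """Cost of keeping every node running for the whole cycle."""
    return nodes * rate_per_node_hour * hours_in_cycle


def burst_cost(nodes: int, rate_per_node_hour: int, busy_hours: int) -> int:
    """Cost when nodes exist only while the mastering job is running."""
    return nodes * rate_per_node_hour * busy_hours


# Hypothetical workload: busy for 3 hours out of every 72.
print(always_on_cost(100, 1, 72))  # 7200
print(burst_cost(100, 1, 3))       # 300
```

Under these assumed numbers the burst approach costs one twenty-fourth as much, which is the shape of the total-cost-of-ownership point Mark is making.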

Nate Nelson: So I asked you earlier about the common mistakes in data mastering that companies make. Now, let's do the reverse. So aside from utilizing these more forward technologies like ML and Cloud, what are some proven methods that companies can use to improve their data mastery?

Mike Stonebraker: There's often a legacy mastering solution that's high cost, high touch and doesn't work well, but politically there is an enterprise that is committed to that elderly technology. So I think the best thing that enterprises can do is organize such that they can neutralize these political antibodies that are committed to obsolete technology. So I think it's hiring a CDO, empowering a CDO, and then figuring out how to empower forward-looking technical solutions in the enterprise. And that's all nontechnical stuff.

Nate Nelson: Mike, if you've got a parting word to leave with our listeners.

Mike Stonebraker: Sure. I like to quote a Google vice president named Alfred Spector. He said the biggest problem is stamping out antibodies that are going to look at change and say, that's going to hurt me politically, so I'm going to stamp it out. So figuring out how to neutralize political antibodies is, I think, the biggest thing that enterprises can focus on. It's the biggest thing that's slowing down the change that is coming, not only in data mastering, but in all kinds of other areas.

Nate Nelson: All right. Well, that was my conversation with Mike Stonebraker. I'm back here with Mark Marinelli. Mark, you are no novice in what Mike is talking about here. Can you talk about some of the examples you've seen in your work around why some companies maybe get data mastering a little bit wrong?

Mark Marinelli: Yeah, sure. I think one aspect of modern mastering that I've seen companies get wrong is going out and getting the latest and greatest, applying new technologies without applying a modern approach to execution of the projects. It's not just about the tech, new tech like machine learning, artificial intelligence, gives you automation, it removes a lot of the labor intensive work that you've had to do before. And so you can end up with taking months long data modeling and rule writing approaches and distilling them down now to things that can be accomplished in weeks, not months or quarters, all the while still incorporating all of the data or at least the most relevant data. So teams need to reorganize themselves. They need to be thinking in terms of agile, short duration projects that are going to produce quick wins. As opposed to sticking with that traditional waterfall approach, which may not bear fruit for months or quarters after which point maybe the stakeholders have lost interest in the project and continuing to work on it, or the problem has just passed you by.

Mark Marinelli: So if we've got all of these wonderful new technologies, we need to just change the way that we organize our teams and adopt a different approach that's enabled by these technologies. I'm not going to name names, but I've definitely seen this at some of our customers. They aspire to transform their data, but they're not really thinking about how they're going to transform their teams and the project management around all of these appropriately. So that they are delivering new value every couple of weeks or every month, rather than getting the big steering committee together and essentially using agile tools to do waterfall work.

Nate Nelson: Interesting. Okay. So that should just about close up our discussion today. Thanks a lot to Mike Stonebraker for speaking with me and thank you, Mark, for speaking with me.

Mark Marinelli: Sure thing, Nate, always a pleasure.

Nate Nelson: This has been the DataMasters podcast from Tamr. Thanks to everybody listening.