DataMasters Summit 2020

Changing the Rules on MDM: Data Mastering at Scale

 

Michael Stonebraker

Co-Founder at Tamr and Turing Award Winner

Today’s data challenges need new solutions. MDM needs machine learning. A.M. Turing award winner Michael Stonebraker offers a thought-provoking session on the next generation of MDM and why human-guided machine learning is the critical advancement in big data that unlocks data to drive analytic insights.

Transcript

Speaker 1:
DataMasters Summit, 2020, presented by Tamr.

Mike Stonebraker:
Hi, I’m Mike Stonebraker. I’m the original founder of Tamr. Marketing said the name of this talk is Changing the Rules on MDM. What I’m really going to talk about is data mastering at scale. Moreover, marketing said I had to wear a dress shirt, and I actually do own one. So you get to see a very rare event, which is me in a dress shirt. Also, we’re here in an empty Fenway Park, and the people who are filming are about 10 feet away. And in the interest of being able to talk and not fog up my glasses, I’m going to take my mask off. So, data mastering at scale. First, I’ll briefly remind you what data mastering is. Then I’ll tell you why scale is important. And then the meat of the talk will be why the traditional solutions for mastering fail.

Mike Stonebraker:
And then, of course, I have to have a solution, which is what works. All of this in half an hour. So, enterprises are full of data silos. Why is that? Well, CEOs empower independent business units so they can get stuff done. Otherwise, all decisions have to go through God and business agility goes out the window. So, enterprises are full of independent business units with their own data, which generates data silos. There is huge, huge business advantage in integrating such silos. I’ll just give you one quick example: cross-selling between IBUs. I’m the refrigerator guy and you’re the air conditioning division. I have customers, you have customers. If you want to sell to a new customer, say IBM, you’d love to know if I have IBM as a customer so that you can use cross-selling, use me as a reference, dot, dot, dot. Huge value in integrating such business silos.

Mike Stonebraker:
And remember that this occurs after the fact: these are independent business units with independent databases who want to combine their databases for business value. So, tattoo this on your brain: independently constructed schemas for these independently constructed databases are never plug-compatible. I have never seen two independently constructed databases with the same schema. It just doesn’t happen. So, your problem is, you’ve got to integrate silos. Well, what does that entail? The easy part is moving the two (or N) data sets to a common place. Think of that as a data warehouse or a data lake. And once you get the data to a common place, then you can do the following steps.

Mike Stonebraker:
You’ve got to perform transformations to get data into common units and common meaning. You’re the HR person in Paris, I’m the HR person in New York. You have employees, I have employees. You have salaries, I have salaries. Your salaries are in euros, mine are in dollars. Yours are after tax with a lunch allowance, mine are gross before taxes. We’ve got to get this stuff into common meaning. Then you’ve got to line up the various columns that mean the same thing: you call it wages, I call it salary. And then a step that most people often overlook is that it’s guaranteed that both of our data sets are dirty. Think of it as 10% is missing or wrong.
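
To make “common units and common meaning” concrete, here is a minimal Python sketch of the kind of transformation involved. The exchange rate, the lunch-allowance figure, the tax handling, and the column names are all assumptions invented for illustration; they are not anything from the talk.

```python
# Minimal sketch: normalizing two HR extracts into common units and meaning.
# The exchange rate, lunch-allowance figure, tax handling, and column names
# are assumptions invented for illustration.

EUR_TO_USD = 1.08            # assumed exchange rate
LUNCH_ALLOWANCE_EUR = 1200   # assumed annual allowance baked into Paris salaries

def normalize_paris(record):
    """Paris extract: after-tax euros including a lunch allowance, column 'wages'."""
    gross_eur = (record["wages"] - LUNCH_ALLOWANCE_EUR) / (1 - record["tax_rate"])
    return {"employee": record["employee"],
            "salary_usd_gross": round(gross_eur * EUR_TO_USD, 2)}

def normalize_new_york(record):
    """New York extract: gross pre-tax dollars, column 'salary'."""
    return {"employee": record["employee"],
            "salary_usd_gross": float(record["salary"])}

paris = [{"employee": "A. Dupont", "wages": 61200.0, "tax_rate": 0.25}]
new_york = [{"employee": "B. Smith", "salary": 90000.0}]

# Both extracts now share one schema: employee, salary_usd_gross.
unified = [normalize_paris(r) for r in paris] + [normalize_new_york(r) for r in new_york]
print(unified)
```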

Mike Stonebraker:
In scientific domains, scientists often use minus 99. What they really mean is null; you cannot do aggregation on minus 99, you’ll get the wrong answer. So you’ve got to clean the data. As often as not, when you try and put data sets together, you’d love to get some more attributes. You’d love to get the Dow Jones code for each of the entities, and so forth. So, add more attributes to make the next step easier. The hardest part, the heavy lifting, is to perform entity consolidation, where the whole idea is, you want to find IBM in the two data sets on refrigerators and air conditioners. So, you’ve got to be able to consolidate duplicates. They are fuzzy, with missing values, and often have errors in them. So, that’s called entity consolidation.
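
As a tiny illustration of the minus-99 point: if the sentinel is left in place, an aggregate silently goes wrong, so it has to be treated as missing before any aggregation. This is a generic sketch, not Tamr’s cleaning logic.

```python
# Sketch: a -99 sentinel silently corrupts aggregates unless treated as missing.
readings = [12.5, 14.1, -99, 13.8, -99]      # -99 really means "no value"

naive_mean = sum(readings) / len(readings)   # wrong: -31.52
clean = [x for x in readings if x != -99]    # treat the sentinel as null and drop it
correct_mean = sum(clean) / len(clean)       # right: about 13.47

print(naive_mean, correct_mean)
```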

Mike Stonebraker:
You’ve got to often find golden values for clusters of records. If Mike Stonebraker is one record and Mr. Stonebraker is the other one, I do have a common age. I’m only one age. And so if you record multiple values for age, you’d like to try and find the golden value, which is my real age. After the fact, if you’re consolidating suppliers, you might want to classify them as international or local. And then once you get done with all this, then you’ve got to do this on an ongoing basis as both of these source data sets get updated over time. So that’s what silo integration is all about. And it involves a pipeline of some number of the previous actions in some order.

Mike Stonebraker:
And it’s often called data mastering, master data management, data curation, or data unification. We’ll just use the term data mastering to mean this integration task. So that’s what data mastering is all about: integrate existing data silos with a pipeline of some number of the previous actions. In this talk, we’ll worry about three components of the pipeline, the three hard ones: schema integration, entity consolidation, and classification. Just because I only have half an hour. So that’s what data mastering is all about. That’s what we’re going to talk about.

Mike Stonebraker:
What does it mean to do data mastering at scale? It means hundreds of entities, millions of records, hundreds to thousands of data sets, and it’s not one-and-done: there’s ongoing stewardship. So do it at scale, meaning lots of entities, lots of records, lots of data sets. Well, who’s got this problem? Let’s just quickly look at GlaxoSmithKline, GSK. They’re a Tamr customer. They’re initially focused on integrating clinical trial data over lots of data sets. However, the gleam in their eye is that they want to master research data worldwide so that they can facilitate scientists interacting with each other. So a scientist in London is making a particular kind of compound. A scientist in New York is making the same compound out of different reagents. You want to hook these two scientists up so they can collaborate.

Mike Stonebraker:
So GSK has many, many, many such tables. They figure they’ve got more than 10 million column names. So, do schema integration at scale: 10 million plus. What does it mean to do entity consolidation at scale? Well, Toyota Motor Europe is another Tamr customer. The gleam in their eye is that they want to master customer data across all of Europe. The problem they face is that if you buy a Toyota in Spain, you’re in the Spanish distributor’s database. It has all your repair records. You then move to France, because that’s perfectly okay. You take your car, your Toyota, in for service in Paris, and the Paris dealer has no knowledge of you, because the Paris distributor is totally independent of the Spanish distributor. So they want to put these various customer databases together, and in aggregate that’s 30 million plus customers in 250 databases, most of which are run by dealers or service centers. Toyota doesn’t control many of these.

Mike Stonebraker:
And by the way, just to make the problem more interesting, they’re in 40 languages. So, do this at scale: 30 million customers from 250 source data sets. Here’s another entity consolidation example of data mastering at scale. There’s a media company whose name I wish I could tell you, but I can’t. They are mastering titles of content. Think of that as Star Wars, Star Wars 7, Luke Skywalker, film, all that stuff. So they’re mastering titles from lots and lots of different sources, and they want to de-duplicate them for legal royalty reasons, so they can charge the correct royalty. And they want to understand what content is being consumed. They have almost 8 million records that they are consolidating into about one and a half million golden records. So again, do entity consolidation at scale: 8 million.

Mike Stonebraker:
The last example is General Electric. They are classifying spend transactions worldwide. So what does that mean? Well, I go to the next conference, when I can have one in person. I take a taxi from the airport, and an item on my expense report is “taxi from Logan to Fenway Park,” with an amount and the taxi company. That’s a detail in the spend report. And they want to classify that spend into the right category, which is local transportation. So they’ve got 20 million such spend transactions that the CFO wants to classify into a preexisting classification hierarchy. So, do classification at scale: 20 million.

Mike Stonebraker:
Okay. So that’s the problem: do entity consolidation, classification, and schema integration at scale, in the millions. So, why don’t the traditional solutions work? Let’s look first at schema integration. The traditional elephants who are in this data mastering space, almost all of them, put a table up on the left-hand side of the screen, a table up on the right-hand side of the screen, and let you draw lines in the GUI between pairs of matching attributes. Notice that this is human-powered. And imagine doing this at a scale of 10 million: carpal tunnel syndrome for sure. And it’s going to take absolutely forever. So anything that is human-powered is guaranteed not to scale to the kind of numbers that GSK has. So anything that’s human-powered, you should reject out of hand.

Mike Stonebraker:
So, what about entity consolidation? Well again, what do the elephants do? The traditional solution is to use a rule system to do what’s called match-merge. Match means find two source records and try and decide if they’re the same. And a particular rule might be: if the edit distance between names of content items is less than a certain amount, then they’re the same title. So, that’s one rule, and you write as many of those rules as you need to. And then, once you’ve got a collection of things that match, if you want to find golden records, one such rule is to take the most frequent value that you see in a column when you’re merging down clusters of like records. So, use a rule system to do entity consolidation and golden record construction.
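
To make the match-merge idea concrete, here is a minimal Python sketch of the two kinds of rules just described: an edit-distance-style match rule and a most-frequent-value merge rule for golden records. The fields, thresholds, and example records are illustrative assumptions, not any vendor’s actual rules.

```python
# Sketch of rule-based match/merge; fields, thresholds, and records are illustrative.
from collections import Counter
from difflib import SequenceMatcher

def edit_similarity(a, b):
    """Cheap stand-in for edit distance: ratio in [0, 1], higher means more similar."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_rule(rec1, rec2, threshold=0.85):
    """One match rule: two titles are 'the same' if their similarity clears a threshold."""
    return edit_similarity(rec1["title"], rec2["title"]) >= threshold

def golden_value(cluster, field):
    """One merge rule: the golden value is the most frequent value in the cluster."""
    return Counter(r[field] for r in cluster).most_common(1)[0][0]

records = [
    {"title": "Star Wars: Episode VII", "studio": "Lucasfilm"},
    {"title": "Star Wars Episode 7",    "studio": "Lucasfilm"},
    {"title": "star wars: episode vii", "studio": "LucasFilm Ltd"},
]

print(match_rule(records[0], records[1]))   # True: similarity ~0.88 clears the threshold
print(match_rule(records[0], records[2]))   # True: identical up to casing
print(golden_value(records, "studio"))      # 'Lucasfilm' wins by frequency
```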

Mike Stonebraker:
So, why does this fail? Well, I’ll just point to this unnamed media company. They wrote, believe it or not, hold onto your seat belt, 200,000 such rules in a homebrew rule system that they’ve been working on for 13 years times two people. So think of them as into this for at least $5 million over a long period of time. They’re finding this totally unmaintainable. What does this mean? Well, if they have to add a new data source, it takes forever. If they have to change anything, it takes forever. To make this discussion very concrete, I just want to do a very quick example. Here’s a screen in Tamr’s clustering visualization tool. This turns out to be a collection of records that Tamr thinks correspond to B&G Foods Incorporated. No idea who they are. But take a quick look at the five raw data records that we’re putting together.

Mike Stonebraker:
Notice that there’s caps and smalls. Some are B&G, some are B and G without the ampersand. Some of them have incorporated, some of them have inc. You can imagine writing a bunch of rules that put that stuff together. But now look over at the city. One record says, Parsippany-Troy Hills, whatever that is. Second one says, Parsippany. Third one says, Troy Hills. So this looks like it’s a little more challenging to put together. And then you’ve got the same issue with address and state. So you could imagine doing rules that would determine that these five records all belong to B&G.

Mike Stonebraker:
But now let’s just quickly switch to another entity, Deutsche Bank Capital Funding Trust VIII, whatever that is. Notice that this one is going to require some more rules, because one record says capital funding, the other says cap funding. Some records leave out the trust, some don’t, and so forth. Some say the address is Wall Street, some say it’s 60 Wall Street, some say it’s 60 Wall St. Some say NYC, some say New York, some say New York City. So you can imagine that to deal with a new entity, you’ve got to write some more rules. And you can readily imagine that you’d have to write a lot of rules to do this sort of entity consolidation at scale. So, what’s wrong with traditional rule systems? The answer is, they don’t scale. The general wisdom, tattoo this on your brain, is that you and I can understand around 500 rules. After that, it gets really hard and it gets exponentially more complicated.
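
Here is a hedged sketch of why the rule count explodes: each new entity tends to need its own normalization entries. The substitution tables below are invented for illustration (they are not the media company’s or Tamr’s rules), and a real system would need vastly more of them.

```python
# Sketch: hand-written normalization rules tend to grow with every new entity.
# These substitution tables are illustrative only.

COMPANY_RULES = {
    " and ": " & ",            # "B and G" -> "B&G" (after the cleanup below)
    "incorporated": "inc",
    "capital": "cap",          # added only after Deutsche Bank showed up
    "corporation": "corp",
}
CITY_RULES = {
    "parsippany-troy hills": "parsippany",
    "troy hills": "parsippany",
    "new york city": "new york",
    "nyc": "new york",
}

def normalize(value, rules):
    v = " ".join(value.lower().split())          # collapse whitespace, lowercase
    for old, new in rules.items():
        v = v.replace(old, new)
    return v.replace(" & ", "&").strip()

print(normalize("B and G Foods Incorporated", COMPANY_RULES))   # 'b&g foods inc'
print(normalize("B&G FOODS, INC.", COMPANY_RULES))              # close, but punctuation still differs
print(normalize("Parsippany-Troy Hills", CITY_RULES))           # 'parsippany'
```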

Mike Stonebraker:
That’s why it took forever for two engineers to write 200,000 rules: it gets exponentially harder to do this. So, the minute you go over about 500 rules, life gets really, really, really hard. Just remember that if you have a problem that’s going to require more than 500 rules, you’re in deep yogurt. So, let’s just quickly look at the General Electric classification problem. They said, “Heck, we can write this in a rule system.” So they wrote about 500 rules in a rule notation. And those 500 rules turned out to classify 2 million out of their 20 million transactions. Once they did this, they said, “Oh my God, we are in deep quicksand here, because we only have 10% of the problem done, and we’ve already written the maximum number of rules that a human can understand. We’d have to write 5,000-plus rules to get the rest of the problem done, and that would be really daunting.”
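
A toy version of the GE exercise shows what “the rules only covered 10%” means operationally: write keyword rules, apply them, and measure how much of the data they actually label. The categories and keyword rules here are assumptions made up for illustration.

```python
# Sketch: rule-based spend classification plus a coverage check.
# Categories and keyword rules are invented for illustration.
from typing import Optional

CLASSIFICATION_RULES = [
    (("taxi", "uber", "shuttle"), "Local transportation"),
    (("hotel", "marriott"),       "Lodging"),
    (("laptop", "monitor"),       "IT hardware"),
]

def classify(description: str) -> Optional[str]:
    d = description.lower()
    for keywords, category in CLASSIFICATION_RULES:
        if any(k in d for k in keywords):
            return category
    return None                                  # no rule fired

transactions = [
    "Taxi from Logan to Fenway Park",
    "Dell 27-inch monitor",
    "Solvent drums for plant 7",                 # no rule covers this
    "Turbine blade recoating service",           # no rule covers this either
]

labels = [classify(t) for t in transactions]
coverage = sum(l is not None for l in labels) / len(transactions)
print(labels)
print(f"rule coverage: {coverage:.0%}")          # coverage stalls long before 100%
```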

Mike Stonebraker:
So they realized that their solution won’t scale. Rule systems just plain don’t scale. So, tattoo that on your brain: if you get nothing else out of this talk, rule systems don’t scale. So, what did they do instead? Well, they came to talk to us, and Tamr used its machine learning system. An ML system, in general, says: give me some training data; in this case, when you do classification, that’s pairs of records and their classification bucket. And your problem is to take the training data and extend it to the rest of the data you have at hand. So, the training data is used to fit an ML model. The ML model then predicts the desired value for the rest of the data. And Tamr uses active learning to correct mistakes: look at a sample of the model output, fix the mistakes, and that becomes a feedback loop that gets you more training data. So ML can get fairly sophisticated. Tamr’s process is very sophisticated.
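
Here is a minimal, generic sketch of the fit / predict / review loop just described: train on labeled examples, score the unlabeled pool, send the least-confident items to an expert, and fold the corrections back into the training data. It is not Tamr’s actual algorithm; the scikit-learn model, the features, and the simulated expert are illustrative assumptions, and an off-the-shelf package like this is a toy-scale illustration only.

```python
# Generic sketch of an active-learning loop for classification (not Tamr's algorithm).
# Model choice, features, and the simulated "expert" are illustrative assumptions.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Seed training data (in practice: rule-generated or expert-labeled examples).
train_texts = ["taxi from airport", "uber to client site", "hotel two nights",
               "conference hotel", "new laptop", "replacement monitor"]
train_labels = ["transport", "transport", "lodging", "lodging", "it", "it"]

# Unlabeled pool that the model has to extend the labels to.
pool = ["shuttle to fenway park", "marriott downtown", "27 inch monitor", "airport parking"]

# Stand-in for a human expert reviewing low-confidence predictions.
expert = {"shuttle to fenway park": "transport", "marriott downtown": "lodging",
          "27 inch monitor": "it", "airport parking": "transport"}

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

for _ in range(2):                                  # a couple of active-learning rounds
    model.fit(train_texts, train_labels)
    confidence = model.predict_proba(pool).max(axis=1)
    worst = int(np.argmin(confidence))              # least confident item goes to the expert
    item = pool.pop(worst)
    train_texts.append(item)                        # the expert's answer becomes new training data
    train_labels.append(expert[item])

model.fit(train_texts, train_labels)
print(dict(zip(pool, model.predict(pool))))         # model labels whatever is left
```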

Mike Stonebraker:
So where does the training data come from? Well, if you think you can manually generate it, think again. At scale, getting manual training data is probably a non-starter, and by and large, you can’t delegate this to unskilled personnel. So if you’re trying to figure out whether, for example, Merck with an address in Germany is the same thing as Merck with an address in the US, a housewife in Cedar Rapids, Iowa, using Mechanical Turk is going to have no idea. It’s going to take a sophisticated financial services person to know that the answer is no, those are independent companies. So, training data cannot be generated manually at scale; it tends to require sophisticated personnel. Some enterprises have it lying around. If they do, they’re really lucky. If they don’t, they need to get it.

Mike Stonebraker:
As for constructing training data: it’s fine to use rules to construct it. And that’s exactly what GE did. They used their 500 rules to classify 2 million records, and they used those 2 million records as training data to classify the rest of their data set. So it’s fine to use rules to construct training data. You don’t need to tag everything; you just need to get a training data sample. And, of course, Tamr used active learning to fix any errors that came to light. So, that’s exactly what GE did. They fit… well, we and they fit an ML model, which classified the remaining 18 million records. Worked like a charm. So rule systems don’t scale; find training data, or use a rule system to construct it, and then use that training data to fit an ML model. ML will work at scale, rule systems will not.
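
Putting the pieces together, here is a hedged sketch of the GE pattern: let a small rule set label whatever it can, treat those rule-labeled records as training data, fit a model, and let the model classify the records the rules could not reach. The rules, categories, transactions, and model choice are illustrative assumptions, not GE’s or Tamr’s actual pipeline, and on a toy data set the model’s guesses are only as good as the tiny training slice.

```python
# Sketch of the "rules bootstrap the training data, ML does the rest" pattern.
# Rules, categories, transactions, and model choice are invented for illustration.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

RULES = [(("taxi", "uber", "shuttle"), "Local transportation"),
         (("hotel", "marriott"), "Lodging"),
         (("laptop", "monitor"), "IT hardware")]

def rule_classify(text):
    t = text.lower()
    for keywords, category in RULES:
        if any(k in t for k in keywords):
            return category
    return None

transactions = ["Taxi from Logan to Fenway Park", "Uber to supplier visit",
                "Marriott Copley Place, 2 nights", "Dell 27-inch monitor",
                "Lyft from the airport",           # no rule fires for these two,
                "Hilton Garden Inn, 3 nights"]     # so the model has to decide

rule_labels = [(t, rule_classify(t)) for t in transactions]
labeled     = [(t, c) for t, c in rule_labels if c is not None]
unlabeled   = [t for t, c in rule_labels if c is None]

# Fit a model on the rule-labeled slice, then extend labels to the rest.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit([t for t, _ in labeled], [c for _, c in labeled])
print(dict(zip(unlabeled, model.predict(unlabeled))))
```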

Mike Stonebraker:
Some people that I talk to say, “Well, why can’t I use an off-the-shelf Python package to do this stuff?” The trouble is that will work fine for a thousand records; it will not work fine for 30 million. So here are some things that you have to think about at scale. First of all, ongoing stewardship. Toyota continues to sell cars, and those 250 databases are getting updated, so you’ve got ongoing stewardship. At scale, you’re going to have feedback from scores of experts. Think of those 250 databases each coming with their own expert who is tagging or supervising the tagging of data. You’ve got to worry about drift among the experts. You’ve got to worry about model classification errors. You’ve got to worry about humans who think about things differently. And then you’ve got to think about new data sources. So ongoing stewardship is a problem.

Mike Stonebraker:
You’ve got to be careful that the training data is not skewed, because skewed training data creates a skewed model, and that means your ML is going to produce lousy results. And then you’ve got to worry about resolving errors in training data. Tamr finds that the training data is always dirty, whether it’s human-constructed or rule-constructed. And then you need a scalable parallel architecture to do this at scale, in the millions. So it gets messy at scale, and you have to think carefully about how to do it. And, of course, the algorithm is important. So what algorithm are you actually going to use? Well, let’s back up and look at schema integration just for a sec. If you have GSK’s schema integration problem, you could build up a training set of pairs of matching attributes using rules or some other means, and then fit an ML model to that training data. So that will directly take pairs of attributes and build an ML model to find the rest of the pairs of attributes that match.

Mike Stonebraker:
That’s one way to do it. A second way is to say: well, a lot of these tables in the GSK database turn out to be about drug trials, and drug trial data looks fairly similar. So, why don’t I classify every table into some bucket? I can do that as a classification problem, without having the result I’m looking for fixed in advance. That turns this into a classification problem: I get a bunch of buckets of tables that the classifier thinks are the same kind of table, and then I can de-duplicate those by running an ML model, or by a manual process. So that’s another way to do exactly the same problem, and at scale it makes sense to think carefully about which way you’re going to do things. The at-scale algorithm deserves careful thought. At scale, things get kind of complicated; make sure that you know what you’re doing, or get some expert help.
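
As a rough sketch of that second approach: classify each table into a bucket using its column names, then propose column matches only within a bucket. The tables, bucket rules, and similarity threshold below are invented for illustration; this is not GSK’s data or Tamr’s algorithm.

```python
# Sketch of "classify tables into buckets, then match columns within a bucket".
# Tables, bucket rules, and the threshold are invented for illustration.
from difflib import SequenceMatcher

tables = {
    "trial_uk_2019": ["subject_id", "dose_mg", "adverse_event", "visit_dt"],
    "us_study_044":  ["subj_id", "dosage_mg", "adverse_evt", "visit_date"],
    "supplier_list": ["vendor_name", "duns", "country"],
}

# Step 1: a crude classifier that buckets tables by their column-name vocabulary.
def bucket(columns):
    text = " ".join(columns)
    trial_words = ("dose", "dosage", "subject", "subj")
    return "drug_trial" if any(w in text for w in trial_words) else "other"

buckets = {}
for name, cols in tables.items():
    buckets.setdefault(bucket(cols), []).append(name)

# Step 2: within one bucket, propose column matches across tables by string similarity.
def similar(a, b, threshold=0.6):
    return SequenceMatcher(None, a, b).ratio() >= threshold

first, second = buckets["drug_trial"][:2]
matches = [(x, y) for x in tables[first] for y in tables[second] if similar(x, y)]
print(buckets)
print(matches)   # proposed pairs such as ('subject_id', 'subj_id'), ('dose_mg', 'dosage_mg')
```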

Mike Stonebraker:
So, what have I talked about today? Well, for data mastering at small scale, use whatever you want. Mastering a thousand records on your wristwatch is just fine; it’s not very hard. Run any of the traditional data mastering products; at small scale, it doesn’t matter. At large scale, you’ve got to use ML, and it’s often an engineering challenge. Now, many of you say, “Well, I’ve got small scale for now, but I can see big scale coming.” If you use a solution that isn’t going to scale, then you’re going to have to redo it when you have to scale up your problem. So if you see scale in your future, you’d better start off with an ML solution. And if your chosen solution provider does not use ML for the three things we’ve talked about, schema integration, deduplication, and classification, and by the way, that’s a hundred percent of the traditional data mastering vendors that we know about.

Mike Stonebraker:
They all have their marketing people tout ML, but it’s for something else; it’s not for schema integration, not for deduplication, and not for classification. So if your chosen solution provider is not using ML for these problems, then you are in deep quicksand now, or you will be when you try and scale. And if you see quicksand in your future, then come and see us. This is what we do: deal with mastering at scale. Thank you very much. I look forward to your questions. If you’re not a Tamr customer, go check us out at tamr.com. Schedule a demo. Talk to us. Get some details from one of our experts. And, thank you very much. I’m done. I think there’s a question and answer session after this talk, which will be when you guys are actually listening to this. Thank you very much. I’m done.