datamaster summit 2020

How to Avoid the 10 Big Data Analytics Blunders — Best Practices for Success in 2021

 

Michael Stonebraker & Anthony Deighton

Michael Stonebraker, Co-Founder, Tamr
Anthony Deighton, Chief Product Officer, Tamr

 

As a steward for your enterprise’s data and digital transformation initiatives, you’re tasked with making the right choice. But before you can make those decisions, it’s important to understand what not to do when planning for your organization’s big data initiatives.

Michael Stonebraker shares the top 10 big data blunders that he has witnessed in the last decade or so. As a pioneer of database research and technology for more than 40 years, Michael understands the mistakes enterprises often make and knows how to correct and avoid them. By learning about the major blunders, you’ll know how best to future-proof your big data management and digital transformation needs. Common blunders range from not planning on moving everything to the cloud, to believing that a data warehouse will solve all your problems, to succumbing to the “innovator’s dilemma.” To illustrate the blunders, he shares a variety of corrective tips, strategies, and real-world examples.

 

Transcript

Shannon Kempe:
Hello, and welcome. My name is Shannon Kempe, and I’m the chief digital manager of DATAVERSITY. We’d like to thank you for joining this DATAVERSITY webinar, How to Avoid the 10 Big Data Analytics Blunders: Best Practices for Success in 2021, brought to you today by Tamr.

Shannon Kempe:
Just a couple of points to get us started. Because of the large number of people attending these sessions, you will be muted during the webinar. For questions, we will be collecting them via the Q&A at the bottom right-hand corner of your screen. Or if you like to tweet, we encourage you to [inaudible 00:00:28] your questions using #Dataversity. And if you’d like to chat with us or with each other, we certainly encourage you to do so; just click [inaudible 00:00:36] on the bottom right-hand corner of your screen for that feature. And as always, we will send out a follow-up email within two business days containing the slides, the recording of this session, and additional information [inaudible 00:00:46] throughout the webinar.

Shannon Kempe:
Now, let me introduce to you our speakers for today, Dr. Michael Stonebraker and Anthony Deighton. Michael is an adjunct professor at the MIT Computer Science and Artificial Intelligence Laboratory, and a database pioneer who specializes in database management systems and data integration. He has been a pioneer of database research and technology for more than 40 years, and is the author of scores of papers in this area. In addition, he has started three other companies in the big data space, including Tamr. He also co-founded CSAIL’s Intel Science and Technology Center for Big Data, based at MIT.

Shannon Kempe:
Anthony is the chief product officer at Tamr, overseeing product and solution strategy for Tamr’s growing data mastering solution. Anthony was most recently CMO at [inaudible 00:01:29] and senior vice president [inaudible 00:01:30], and has over 20 years of experience building and scaling enterprise software companies. He has a bachelor’s degree from Northwestern University, and an MBA with high distinction from Harvard Business School. With that, let’s get today’s webinar with our distinguished speakers started. Hello, and welcome.

Michael Stonebraker:
Hi, this is Mike Stonebraker speaking. Anthony and I are going to sort of ping pong a little bit. So he’ll be breaking in to tell me whenever I make a mistake. I also want to mention that Anthony and I are probably the only two people on this call who have never watched Breaking Bad. So you’ll have to excuse our ignorance in this area. Also a bunch of these slides are not mine. They come to you from Tamr Marketing [inaudible 00:02:30] here on the screen.

Michael Stonebraker:
Anyway, I’m going to go through what I consider the 10 biggest blunders that I’ve seen enterprises committing in this data space. And I’ll just do them one by one. The first one is not planning to move most everything to the cloud. Next slide.

Michael Stonebraker:
Now, it may take a while. This is not something you’re going to accomplish this year. It may take a decade, but it’s the right thing to do. Why do I say that? Let me give you two quick vignettes. The first one is from Dave DeWitt, who until recently was the head of the Microsoft Jim Gray Systems Lab in Madison, Wisconsin. He said, “Here’s the technology that [inaudible 00:03:36] is using in its data centers. They are shipping containers in a parking lot. Chilled water in, internet in, power in. Otherwise, sealed. [inaudible 00:03:50] optional, are only there if you need them for security. Now [inaudible 00:03:56] that with whatever you guys are doing, with raised flooring, in Boston or New York.” You’ve just got to believe that the cloud guys are going to do this better than you do.

Michael Stonebraker:
Another vignette comes from James Hamilton, who works for AWS. And I have no reason to disbelieve him, but he claims that Amazon [inaudible 00:04:26] at 25% of your cost. Now, prices may not accurately track costs in the short run. But in the long run, they will. And if they’re a factor of four cheaper, then I don’t see how you’re going to continue on-prem.

Michael Stonebraker:
Moreover, if you move to the cloud, the big deal these days is elasticity. You use one node for end of month processing. You use 20 nodes for the day before that, when you’re getting ready for end of month processing. You use three nodes on the first day of the next month, and so forth. So you can scale your resources with your load. You can’t do that with data centers.
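Mike’s elasticity argument can be sketched with some back-of-the-envelope arithmetic. The numbers below are illustrative (not from the talk): a month where demand is mostly one node, spikes to 20 the day before the end-of-month close, and runs at three on the first day.

```python
# Hypothetical daily node demand over a 30-day month: 3 nodes on day
# one, 1 node most days, 20 nodes the day before month-end close.
demand = [3] + [1] * 27 + [20, 1]

# A fixed on-prem cluster must be sized for the peak all month long.
fixed_node_days = max(demand) * len(demand)

# Elastic cloud capacity pays only for what each day actually uses.
elastic_node_days = sum(demand)

print(fixed_node_days, elastic_node_days)  # 600 vs 51
```

The point of the sketch is the ratio, not the exact figures: the spikier the load, the more an always-on cluster overpays relative to scaling resources with demand.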

Michael Stonebraker:
So of course what everybody asks is, “How do I manage to move stuff [inaudible 00:05:34] to the cloud?” Well, data will move easily. Just move the data. So decision support will be the first to move. And if you’re not well along with moving all of your decision support to the cloud, then I think you’re making a big mistake. Next slide.

Anthony Deighton:
Before we jump to the next slide, I’ll just chime in. I think we’ve talked a lot about this idea of moving data to the cloud, and I think partly that’s because the cloud [inaudible 00:06:06] who are aiming to capture that data care deeply about getting your data in the cloud, because it creates lock-in. But I think there’s a dual point here. And Mike, you make this point about elasticity; this is the point I think we need to underscore. Not only is data moving to the cloud, but compute is moving to the cloud as well.

Anthony Deighton:
And so you’re right that you can scale up and scale down. But you can also do things on the cloud that would be just impossible to do in your own data center. Running 1,000 [inaudible 00:06:43]? Completely reasonable. We could be doing it five minutes from now on any of the three major clouds. Running a 1,000-[inaudible 00:06:51] cluster on premise may be impossible, or certainly beyond the scope of most IT departments.

Anthony Deighton:
So the change in capability that [inaudible 00:07:00] the cloud enables is actually dwarfed, I think, by the change in capability that having compute on the cloud brings, in particular what you can do with machine learning, which I know is a theme that’ll come up in future slides. And I know your next slides because I’ve seen them before. I hear a lot of angst from customers when it comes to moving stuff to the cloud, a lot of objections. Maybe you can knock off some of those objections, Mike.

Michael Stonebraker:
Okay, well you did say one thing which I want to underscore, which is, if you move your applications to the cloud, you can either do that in a cloud-independent way, or an AWS-dependent way. So you have to decide very quickly whether you’re going to avoid lock-in or not. And that’s just a decision you have to make. You can make it either way. Just knowingly make that choice.

Michael Stonebraker:
I hear lots of people say, “I can’t move to the cloud because …” And I’ll just mention a couple of these, and then we’ll go on. It turns out that I hang out most of the time at MIT in the computer science and artificial intelligence lab. We run a data center [inaudible 00:08:45] in Cambridge, Massachusetts. Our data center guys claim that they are cheaper than the cloud, which is to say they claim that the cloud is not less expensive than running their data center in Cambridge.

Michael Stonebraker:
And the answer is that’s technically correct, because they are not paying for square footage. And they’re not paying for power or air conditioning. So they’re cheating, in the sense that they are taking advantage of externalities that should not be there. If it’s apples to apples, chances are, you’re going to be more expensive than the cloud.

Michael Stonebraker:
People often mention security. The cloud’s security is likely better than yours. We hear enough horror stories about misconfigurations and rogue employees [inaudible 00:09:49] prem stuff. So chances are, their security is better than yours. And maybe your CEO, or other restrictions, doesn’t like the idea. [inaudible 00:10:05] I’m going to talk about that again in item 11, which is your bonus blunder to come. Anyway, I think the cloud is in your future. The sooner you get going on it, the better.

Michael Stonebraker:
Of course, as Anthony pointed out, it depends where your application is. If you’re running decision support, just move whatever your decision support is to the cloud. Other stuff, like legacy OLTP, may well be mired in sins of the past, [inaudible 00:10:54] harder to move. And so do it gingerly, and it may take you a decade or more. But sooner or later, you are not going to run a data center on prem. I just don’t think that’s going to happen. Next slide.

Anthony Deighton:
Before we leave this one, I would just also add a theme I’ve seen with customers: as they move data to the cloud, they also use that as an opportunity to think about consolidating data, especially from the perspective of these decision support or analytic applications [inaudible 00:11:35] return to later. The analogy I always draw here is you don’t move a dirty house. If you’re going to move houses, use the opportunity to also go through your stuff, throw things out, and consolidate before you pay for movers to come move it. Similarly, moving data to the cloud is also an opportunity to take a look at those sources, discard ones that are no longer relevant, consolidate around key entities that matter to you, et cetera.

Michael Stonebraker:
Okay, now there’s just lots and lots of talk about machine learning, and more generally about AI. Expect ML to be disruptive in just about all businesses. Next slide.

Michael Stonebraker:
So ML, whether it’s deep learning with deep neural networks, or conventional machine learning, which has been around for about three decades: there’s an enormous amount of research in both conventional and deep learning, and it’s getting much, much, much better. And it’s guaranteed to displace workers in easy-to-explain jobs. So think autonomous vehicles, think automatic checkout in the grocery store. Think flower delivery, think getting your taxes done, think actuarial calculations. All of that is going to be replaced by computer programs.

Michael Stonebraker:
Your choice in looking at machine learning is that you are either going to be a disruptor, meaning somebody pushing ML, or you will be a disrupt-ee, meaning somebody else, one of your competitors, is going to disrupt you. Your choice. So you have a choice in the matter: number one or number two. You can either be a taxi cab owner, or you can be like Uber and Lyft. One or the other. And in my opinion, it’s going to be much more fun, in the future, to be a disruptor than to be a disrupt-ee. Next slide.

Michael Stonebraker:
So what do you do? The answer is ML is fairly arcane. You’re not going to hire Aunt Maude from Cedar Rapids, Iowa to be your ML expert. So you’re going to have to pay up to get some ML expertise. They’re in short supply and very expensive. Don’t hire [inaudible 00:14:45] to do that. We’ll come back to that in a bit. And get going on the coming arms race by hiring expertise. Pay whatever it takes to get world-class talent. Next slide.

Anthony Deighton:
So I would add here that the first blunder and the second are kind of related. Prior to having data in the cloud and elastic compute in the cloud, the idea of machine learning and AI as a disruptive platform shift may have been true, but it would have felt a bit out of reach, out of touch, and really only the purview of companies that were capable of standing up large infrastructures: think Google, Uber, et cetera.

Anthony Deighton:
Taken together, what blunder one and blunder two are really saying is that everybody, the entire world, any company, now has access to the kind of platform that, even as recently as a few years ago, would’ve required quite a bit of expertise to stand up. And so doing things the old way, doing things the way we’ve always done them, doing things the way that we’ve been successful in the past, is a surefire way to extinction. So the idea here is the playing field has changed, and you either change with it and take an ML/AI-based approach, or you’re roadkill. Again, a theme we’ll come back to in a future blunder.

Michael Stonebraker:
Next slide. Okay, here is my favorite blunder. A lot of you say, “I’ve got to get going on data science,” ML-driven data science, more generally. And so, “I’m going to empower a data science group, and they’re going to change the world.” Well, your real data science problem is not ML expertise, as I’ll explain right now on the next slide.

Michael Stonebraker:
I talk to a lot of data scientists. And no one claims they spend less than 80% of their time finding the data they want to analyze, doing data integration to put it together, and cleaning up the mess that that data may well be. Most people say 90-plus percent. So for example, the chief data scientist at iRobot (they’re the folks that bring you the vacuum cleaners that run around the floor) says, “I spend 90% of my time doing data discovery, data integration, and data cleaning, leaving me 10% of my time to do the job for which I was hired. However, of that 10%, I spend 90% fixing my data cleaning errors.” Meaning she spends 99% of her time on data discovery, data integration, and data cleaning.
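The iRobot chief data scientist’s arithmetic compounds like this:

```python
# 90% of her time goes to discovery, integration, and cleaning; of the
# remaining 10%, 90% goes to fixing data cleaning errors.
prep = 0.90
fixing = 0.90 * (1 - prep)      # 9% of total time
total_overhead = prep + fixing  # 99% of total time

print(round(total_overhead, 2))  # 0.99 -> only 1% left for actual data science
```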

Michael Stonebraker:
So she does not do data science or machine learning for a living. She does data integration, data cleaning, and data discovery. The chief data scientist at Merck, which has around 1,000 data scientists, says exactly the same thing: 95-plus percent of his data scientists’ time goes to data integration. So Anthony said it really clearly a couple of minutes ago: without clean data, or clean-enough data, your machine learning is worthless. Garbage in, garbage out. And so ML is not going to pay off unless you solve the data integration problems that are in front of you.

Michael Stonebraker:
So what should you do? Well, obviously, stop viewing data integration as a piecemeal thing to be solved by each individual data scientist on his or her project. Getting your data scientists good data is an enterprise-wide problem. And start by making sure that your chief data officer has read access to everything that your enterprise has. If he doesn’t have read access to all enterprise data, then you’re working for the wrong company. So, next slide.

Anthony Deighton:
Mike, I suspect that’s a very scary statement for many people on the call. Can you share a little bit more about what you mean by read access to the enterprise data, and why that’s so important?

Michael Stonebraker:
Sure. So the chief data scientist at [inaudible 00:20:28], when he was hired, made a deal with the CEO: “I’m not going to take this job unless I have read access to everything.” And the CEO of course said, “Why do you need read access to everything?” “Because everything is what my data scientists are going to want to get access to. And if it’s [inaudible 00:20:58], I want to know. If something doesn’t exist, I want to know.” So somebody has to be able to figure this out. [inaudible 00:21:10] data scientists to spend all their time trying to get access to corporate information. That’s wasting your time, and you might as well solve this at an executive level.

Anthony Deighton:
If cleaning up data is annoying, then getting to the data to clean it up is even more annoying. The other thing to think about is one common theme I hear from customers I speak to: at its core, every one of our businesses is fundamentally a data business. You may think you’re in the business of making drugs, or in the business of logistics, or whatever. But at its core, the real asset you sit on is that data. And the chief data officer, by nature, needs to have access to that core asset. [inaudible 00:22:09]

Michael Stonebraker:
Blunder number four, and I hear this just all the time. So you say, “Okay, I understand. Data integration is a problem. I’ve got that solved. I have ETL in place. I have a master data management system from one of the major elephants in place. I’m all set.” Unfortunately, the answer is you’re not. Next slide.

Michael Stonebraker:
The real blunder is the belief that traditional data integration is going to solve this issue. So traditional data integration means extract, transform, and load (ETL), [inaudible 00:23:08] variety of vendors [inaudible 00:23:10], dot, dot, dot. Or a belief that master data management (MDM), also available from the usual suspects, has solved your data integration challenge. Why is that? Next slide.

Michael Stonebraker:
What is ETL all about? What is extract, transform, and load? Well, here’s the way it’s sold. You decide what data sources you want to integrate. That comes down from God, or somehow you decide. You build a global data model up front from these data sources; get your best person on it. And that will get you a global data [inaudible 00:24:01]. And then for every individual data source, you send a programmer out to interview the data source owner. Figure out what he’s got, how it’s formatted, figure out how to extract it. [inaudible 00:24:15], typically in a proprietary scripting language, and load the data into this [inaudible 00:24:25], typically a data warehouse.

Michael Stonebraker:
[inaudible 00:24:31] vendors. And I can just tell you from 30 years of experience, I’ve never seen this technique work for more than 20 data sources. Why is that? Well, it’s too human-intensive. Number one, you’ve got to build a global schema up front. And that’s way too [inaudible 00:24:58] a statement. You guys all tried this 20 years ago, building enterprise-wide data models. They all failed. They all failed because you sent a team off to do it, it took two years, and by then the whole business had changed to something else. So I’ve never seen this technique work at scale.

Michael Stonebraker:
So if you have 20 data sources, and that’s all you ever want to integrate, [inaudible 00:25:30]. But most enterprises I know have way more than 20 data sources. For example, [inaudible 00:25:40] has 4,000, plus or minus, [inaudible 00:25:45] databases. They don’t even know how many they have. [inaudible 00:25:48] and data lake results are important. The scope of possible integration is all this stuff, way more than 20 data sources. So ETL simply doesn’t work at scale. Next slide.

Michael Stonebraker:
Once you manage to do ETL, however you do it, [inaudible 00:26:19]. So if you want to be able to find out [inaudible 00:26:29] from multiple data sources, you need to match up source records, a process of consolidating entities. That’s typically called match. So let’s put together all the [inaudible 00:26:53] that correspond to a single entity. And then you typically want to merge those into what’s called a golden record, which has the definitive spelling, [inaudible 00:27:05] definitive address, and so forth.

Michael Stonebraker:
So the MDM vendors all suggest doing match-merge by using a [inaudible 00:27:18] system. So implement [inaudible 00:27:21], for example: two entities are the same if they have the same address. And using rules, you merge: take the most recent value, [inaudible 00:27:36] and so forth. So the MDM guys all suggest rule systems to solve match-merge. Now, the general thinking that I have is that you can build about 500 rules. And rules, by the way, are “if X, then Y”. They’re not ordered into a program; they’re just a bunch of rules. So you can build about 500 of them. You sort of stare at it really hard, and okay, I’ll give you 1,000. I’ll give you 2,000. But no one I’ve seen has been able to build and manage a rule [inaudible 00:28:28] with 20,000 rules.

Michael Stonebraker:
So just for example, if you require more than 500 rules to solve your problem, then you’re in trouble with an MDM system. Who needs more rules than this? Well, GE, the conglomerate. They have about 20 million spend transactions that they want to classify. And a spend transaction is, say, you spend 50 bucks taking a cab from the airport to your home. So they have built a classification hierarchy for spend. You can spend on everything, a subset of everything, whether it’s travel, a subset of travel [inaudible 00:29:30]

Michael Stonebraker:
Well, they started writing rules in an MDM system. They wrote 500 rules, which is what you can reasonably expect to build all by yourself. And that classified 10% of their spend. What about the other 90%? They would have to write at least 5,000 more rules. And they quickly realized that there was no way they could write and maintain a rule base of 5,000 rules. So MDM just doesn’t scale to large numbers of rules. Next slide.
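The “if X, then Y” rule approach Mike describes can be sketched in a few lines. The rules and transactions below are toy illustrations, not GE’s actual system; the point is that every uncovered record demands yet another hand-written rule.

```python
# A toy MDM-style rule base: each rule says "if the description
# contains this keyword, then classify it as this category".
rules = [
    ("taxi", "travel"),
    ("hotel", "travel"),
    ("laptop", "it-equipment"),
]

def classify(description):
    for keyword, category in rules:
        if keyword in description.lower():
            return category
    return None  # uncovered: would need another hand-written rule

transactions = ["Taxi from airport", "Hotel in Boston",
                "Wind turbine bearings", "Chemical reagents"]
covered = [t for t in transactions if classify(t) is not None]
print(len(covered) / len(transactions))  # 0.5: half the spend is uncovered
```

Coverage grows only as fast as people can write and maintain rules, which is why it stalled at 10% for GE.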

Anthony Deighton:
Yeah, so let me just quickly add here, and maybe link together the first five blunders slightly. One question you might reasonably have in your mind is, “Are the ETL and MDM vendors full of bad engineers who just built really bad software?” And I would argue that they’re not. Obviously they’re full of very smart people. It’s that they architected the approach 10, 15, 20 years ago, in an environment where the only reasonable mechanism of achieving the outcome was to attempt a rule-based system. Processing was relatively slow. The data was stuck in databases, and relatively hard to get access to. And frankly, the overall IT strategy at the time was to see if you could get all of your data into the world’s biggest data warehouse, or God forbid, into the world’s largest SAP implementation, again a theme we’ll come back to, I think, in a moment.

Anthony Deighton:
In any case, in that environment, this idea of a rules approach was a reasonable way to attack the problem. What we’ve seen since is a platform shift, which has enabled a disruption in the market. And that platform shift is number one: this move of data to the cloud and compute to the cloud, which has opened up new possibilities. If you were to start a company today and attack this problem, you wouldn’t architect it with a rules-based approach. And the elephants in this industry are saddled with the decisions of their past on how they’ve architected.

Michael Stonebraker:
Okay. You’ve pretty much [inaudible 00:32:17] to the next slide, which is: if traditional ETL and MDM don’t scale, then what do you do instead? At scale, you need to run ML. You have no [inaudible 00:32:32]. This is an ML problem at scale. You cannot [inaudible 00:32:37] traditional techniques. And, as Anthony said, an easy path to ML is just to run ML [inaudible 00:32:46]. We took GE’s 500 rules, which classified 10% of their data, and used them as training data for an ML system. And it classified the remaining 90%.
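The bootstrap Mike describes, using the output of the existing rules as training data for a model that generalizes beyond them, can be sketched with toy data. This is a deliberately simple token-overlap model for illustration, not Tamr’s actual system.

```python
from collections import Counter, defaultdict

# Records the rules could label become training data (toy examples).
labeled = [
    ("taxi ride downtown", "travel"),
    ("hotel night boston", "travel"),
    ("dell laptop charger", "it-equipment"),
    ("lenovo laptop dock", "it-equipment"),
]

# Build a token-frequency profile per category from the labeled data.
profiles = defaultdict(Counter)
for text, label in labeled:
    profiles[label].update(text.split())

def predict(text):
    # Score each category by how many training tokens it shares.
    tokens = text.split()
    return max(profiles, key=lambda c: sum(profiles[c][t] for t in tokens))

# The model now labels records the rules never covered.
print(predict("airport taxi fare"))  # travel
print(predict("laptop battery"))     # it-equipment
```

The design point: the 500 rules did double duty as cheap labeling, and the learned model extends their judgment to the 90% of records no rule matched.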

Michael Stonebraker:
So that’s [inaudible 00:33:11]. Data integration is an ML problem. [inaudible 00:33:16], that’s what you need to do. Traditional solutions just don’t scale. Now, if you have a small problem now but you expect a big problem later, then you’re heading toward deep quicksand if you use a traditional solution these days. So ML is the answer for at-scale ETL and at-scale MDM. And the traditional vendors don’t do it, because they’re sitting on a legacy at this point. Next slide.

Michael Stonebraker:
Okay, Anthony already covered this. I hear a lot, though not so much anymore: “Well, I have the world’s best data warehouse. I’ve got everything in order. My data warehouse guys are solving my analysts’ needs. Life is good.” Next slide.

Michael Stonebraker:
Well, data warehouses are good for some things. They are good for putting together structured data, lots and lots of it, from a few data sources, not from thousands. And they’re good at customer-facing structured data. That’s what they were built for in the nineties, and that’s what they’re really good at. They’re not good at text, they’re not good at images, they’re not good at video. And they don’t do anything about data integration.

Michael Stonebraker:
So use the technology for what it’s good for. Don’t try and make your data warehouse do unnatural acts. And by the way, [inaudible 00:35:19] support, which is what we’re talking about, is going to move to the cloud if it hasn’t already. So if you’re moving to the cloud, you get to change vendors. And so get rid of the high-priced proprietary stuff, if you bought into it already. That’s people like Teradata. So move to the cloud, and remember that your warehouse data and your apps are moving to the cloud, and you get to choose a new vendor. Make that decision carefully. Next slide.

Anthony Deighton:
I would add here that the blunder is that your data warehouse isn’t going to solve all your problems. I would equally add that your major ERP vendor is also not going to solve all of your problems. In fact, quite the opposite: what we see in customers is a move to a best-of-breed approach, taking advantage of and utilizing the best application, the best operational application, for the task at hand, and running those in the cloud. And I think it’s a really important strategy and shift for organizations today to think about optimizing their business processes, optimizing the way they work, the way they engage customers, the way they engage their employees, how they do business, by taking advantage of best-of-breed applications. That strategy is, I think, one that creates tremendous business value. And it’s quite at odds with the strategy of making do with the operational application from one of the big vendors.

Michael Stonebraker:
That’s a lead-in to the next slide, which is to say, “Well, maybe my warehouse isn’t going to solve all my problems. But five years ago, I was told that Spark is the answer. And so I set up a Spark cluster and/or a [inaudible 00:37:48] cluster. And that’s going to solve all my problems.” That’s simply not going to happen. Next slide.

Michael Stonebraker:
So Hadoop, especially, is not good for [inaudible 00:38:03]. It isn’t very good at anything. Best-of-breed solutions [inaudible 00:38:11]. Spark is newer technology, and it’s better. But it’s still not that terrific at stuff. Spark [inaudible 00:38:23] is not competitive against the best [inaudible 00:38:28], not competitive against the best streaming solutions. So, as Anthony just said, you should use best of breed, not the lowest common denominator, at least for your secret sauce. That’s the stuff that’s going to differentiate you from your competitors.

Michael Stonebraker:
This is a universal blunder: preparing to use only one vendor. That means you’re on the lowest common denominator, and the lowest common denominator is not that good at anything. And for your secret sauce, that’s just not a good idea. Also, Spark is useless for data integration, which is one of your biggest problems. Next slide.

Michael Stonebraker:
So I hear lots of stories from people who say, “Well, I’m running a big cluster, and it’s empty. No one’s using it. So what do I do?” Well, repurpose it to be a data lake. That’s the way [inaudible 00:39:50]. Repurpose it to be your compute engine for data integration. Or better yet, throw it away. After all, hardware lifetime is three years, and you probably bought that cluster five years ago. And the thing to always remember is that you’ve got to move with the times, and being stuck in a legacy world is not a great idea.

Anthony Deighton:
I would add here that the idea of distributed compute, orchestrating distributed compute, is at the core of both technologies. And I agree with Mike that Spark is a more recent implementation. Those are good ideas, good design principles. And in particular, they’re good ideas in the context of data sitting in the cloud next to highly elastic compute. And it turns out to be a good foundation for thinking about machine learning. But that’s a minor point.

Michael Stonebraker:
Okay. So about four years ago, Cloudera realized that Hadoop was not good for anything. And that’s a big problem for Hadoop vendors who make most of their money off of Hadoop. But they’re a superb marketing company, and they said, “Well, what we want to do is switch to telling people to use their Hadoop cluster as a data lake.” So basically, they switched to marketing data lakes. And therefore, data lakes are the solution to all your problems. Next slide.

Michael Stonebraker:
What does that blunder really mean? Just load all your data into a data lake, and you’ll be able to correlate anything within it. Well, more recently, Amazon and others have said, “Well, let’s start calling it a lakehouse.” A data lake and a lakehouse, to me, are synonymous. And the thing you should tattoo on your brain is that independently constructed data sets are never, ever plug-compatible. They are just not. So you are not going to be able to take two independently constructed databases, load them into your data lake, and [inaudible 00:42:47]. That just is not going to happen. Why is that not going to happen? Well, I’ll tell you on the next slide.

Michael Stonebraker:
Well, first of all, your schemas don’t match. If you’re the human resources guy in Paris and I’m the human resources guy in New York, you call it salary and I call it wages. Units don’t match: you use euros, and I use dollars. The semantics of salaries don’t match: in New York, my salary is the gross before taxes; in Paris, your salary is your net after taxes in euros. [inaudible 00:43:28] Time granularities often don’t match: you have annual data, I have [inaudible 00:43:36] data.
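The Paris/New York example implies a normalization step before the two sources can be compared at all. Here is a minimal sketch; the field names, exchange rate, and net-to-gross factor are all assumptions for illustration.

```python
# Assumed conversion factors, for illustration only.
EUR_TO_USD = 1.1          # hypothetical exchange rate
FR_NET_TO_GROSS = 1.25    # hypothetical back-out of French payroll taxes

def normalize(record):
    """Map both sources to the same unit: gross annual salary in USD."""
    if record["source"] == "paris":
        # Paris 'wages' is net, annual, in euros.
        return record["wages"] * FR_NET_TO_GROSS * EUR_TO_USD
    # New York 'salary' is already gross annual USD.
    return record["salary"]

ny = {"source": "new_york", "salary": 100_000}
fr = {"source": "paris", "wages": 80_000}
print(normalize(ny), round(normalize(fr)))  # 100000 110000
```

Every schema, unit, and semantic mismatch in the list above needs a mapping like this, which is exactly the work a data lake alone does not do.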

Michael Stonebraker:
The killer is, of course, that data is dirty. Sometimes numeric data, 99 or minus 99, turns out to mean null. If I’m using a system with no nulls, and you’re using a system where a specific value, like minus 99, is null, then if I average your numbers with my numbers, I’m going to get garbage. So the data has varied meaning, it’s missing, or it’s wrong. Figure that on average 10% of your data is missing alone. And your data is dirty. Therefore you can’t just correlate it.
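The minus-99 sentinel problem is easy to demonstrate with made-up numbers:

```python
# Their system encodes "missing" as -99; mine simply omits the value.
theirs = [52, 48, -99, 50]
mine = [51, 49]

# Naive average across both systems silently mixes in the sentinel.
naive = sum(theirs + mine) / len(theirs + mine)

# Correct average strips the sentinel before combining.
clean_vals = [v for v in theirs if v != -99] + mine
clean = sum(clean_vals) / len(clean_vals)

print(naive, clean)  # ~25.2 vs 50.0: the sentinel halves the average
```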

Michael Stonebraker:
Also, duplicates must be removed. If you’re the HR guy in Paris and I’m the HR guy in New York, my Stonebraker record could be in both subsidiaries, and my name could be misspelled in one data set and not the other. And therefore, there are no keys, and I’ve got to do entity consolidation. And entity consolidation is just not trivial.

Michael Stonebraker:
So my favorite example is a Tamr customer who asked the question, “How many suppliers do I have?” And so he added up all the suppliers from all the various data sets. And he got a number. And after Tamr got done removing the duplicates, he had one-fourth that number. So there are often a large number of duplicates. And if you’re counting customers or counting suppliers, those duplicates may make a huge difference. So you’ve got to remove duplicates. The data’s dirty. You’ve got to do [inaudible 00:45:42] integration, and so forth. Next slide.
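To make the supplier-counting point concrete, here is a deliberately crude sketch of duplicate consolidation using string similarity from the standard library. The supplier names are invented, and the greedy string-matching approach is a toy, not Tamr’s actual method, which, as Mike notes, uses machine learning precisely because simple rules like this break down at scale.

```python
import difflib

suppliers = [
    "Acme Tool & Die",
    "ACME Tool and Die",    # same supplier, different spelling
    "Acme Tool & Die Inc",  # same supplier again
    "Baker Metals",
]

def same_entity(a, b, threshold=0.8):
    """Crude similarity test; real entity consolidation needs ML, not this."""
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Greedy consolidation: keep a name only if it matches no already-kept name.
distinct = []
for name in suppliers:
    if not any(same_entity(name, kept) for kept in distinct):
        distinct.append(name)

print(len(suppliers), "raw records ->", len(distinct), "distinct suppliers")
```

Even this toy cuts the raw count in half; in real supplier data, where the same entity shows up under many spellings across many systems, the shrinkage can be far larger, as in the one-fourth example above.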

Michael Stonebraker:
So the net result is, if you just put your data in a data lake and start doing correlations, your analytics will be garbage. And so what happens is that your analysts spend 95% of their time, or 99% of their time, finding, fixing, and integrating their data. And your [inaudible 00:46:11] models will fail if you don’t do this. Next slide.

Michael Stonebraker:
So what do you do? I am a huge fan of data lakes. If you want to correlate your data, you’ve got to put it somewhere. But without integration, you don’t have a data lake, you have a swamp. You need a data integration system which will deal with all the aforementioned problems. And they are not trivial. Do not think that you can put a junior programmer on this problem. And so the traditional technology [inaudible 00:46:55] is likely to fail. So this, in my opinion, is one of your major 800-pound gorillas: how to organize data integration. You’ve got to put your best people on it. At Tamr, we see a lot of in-house solutions that we’re being brought in to replace. Chances are, they’re crap. So chances are, whatever you built in-house is crap. And you’ve got to use modern technology. And that does not come from the legacy MDM and ETL vendors, and it certainly does not come with your hardware system. So if you want the best technology, you’ve got to deal with smallish companies like Tamr. Next slide.

Anthony Deighton:
I would just add here that the data lake blunder is really the blunder of assuming that this is a data storage problem. And the swamp is a good analogy, in that just consolidating your data into one environment does not solve the problem. It’s a good start. There’s nothing wrong with it. As Mike said, you need to have that. It’s a necessary but not sufficient condition for success.

Anthony Deighton:
And then his latter point, that this is an incredibly difficult technical problem, means that it’s also ripe for technical innovation. And, in a sense, if Tamr can take credit for anything, it’s for working hard, from its academic roots at MIT through to today, on working out the math of solving this problem, which turns out to be nontrivial. So it’s a great example of where, when you build on a new architecture, you end up with a radically different solution.

Michael Stonebraker:
Okay, blunder number eight. Lots of you outsource your shiny new stuff to consultants. In my opinion, this is a likely company-ending blunder. Why is that? Next slide. If you’re a typical enterprise, you spend 95% of your IT budget just keeping the lights on. Most of you are dug in pretty deep. [inaudible 00:49:49] And so the shiny new stuff gets outsourced, often because there’s no one available internally who can deal with it. Next slide.

Michael Stonebraker:
In my opinion, this is company ending. Number one, maintenance is boring. So creative people quit. And then you have no good talent to work on your new stuff. And you have a hard time hiring great talent, even if you try. It takes great people to hire great people. So your new stuff is your secret sauce over the next decade or so. As you move to an ML-powered world, please don’t outsource it. This is long-term sort of stuff. Instead, outsource the daily crap, the stuff that you’ve got to do to keep your lights on. For sure, outsource email and anything else that you can outsource. Software is your secret sauce. [inaudible 00:50:54] Hire a few wizards. They can hire some other wizards. And that is going to be your differentiator against your competitors 10 years from now. Next slide, unless Anthony wants to add something. So what should you do? Well, start by hiring some ML expertise. Outsource the boring maintenance and cancel your [inaudible 00:51:26]. Okay. Now we’re [inaudible 00:51:31]

Anthony Deighton:
I’ll just briefly add here, this is a tale as old as time: every time we see a platform disruption in the technology market, the big consultancies custom-build solutions on top of it. There was a time, when I began my career at Siebel Systems, when in order to build essentially a CRM system, the strategy was to hire a big consultancy and build it from scratch, on first principles. And along came Siebel, in that case, and said, “Actually, you know what? This is a solvable software problem. We can build a standard solution to this problem.” And I think that’s exactly what we’re seeing in this market as well. So custom-building AI and ML solutions is not the strategy.

Michael Stonebraker:
Okay, next slide. All of you should read a book by Clayton Christensen called The Innovator’s Dilemma. Next slide. So basically, a lot of you are mired in the past because you simply say, “I can’t move on. I can’t deal with disruption.” And Clayton Christensen analyzes this in some detail. And he calls it the innovator’s dilemma: you’re selling the traditional stuff, along comes some innovation, and it threatens to disrupt your market. And I don’t have time, since we’re almost out of time, to go through an example. But he goes through a whole bunch of examples in his book that show it’s a real dilemma, because it’s very difficult for you to hold on to your customer base, or all of your customer base, while you’re moving from the old stuff to the new stuff.

Michael Stonebraker:
So if you succumb to the innovator’s dilemma, and you say, “Therefore I can’t move to the new stuff,” next slide, then in my opinion, you’re dead in the long run. You’ve got to be willing to give up your current business model and reinvent yourself. Otherwise you’re going to go out of business in the long run. You’re going to get disrupted. So you might as well make the best of it. So read Christensen’s book, and then act on it. Realize that you may well lose some of your current customers in the process. It’s the only way to avoid going out of business in the long run. I don’t have to remind you, if you’re a taxi driver in Cambridge: medallions in Cambridge were worth 700K five years ago. Now they’re worth 10K, going on zero. So you need to be able to reinvent yourself. Next slide.

Michael Stonebraker:
And who’s going to help you reinvent yourself? You’ve got to pay up for a few rocket scientists. Next slide. So who’s going to help you avoid these blunders, reinvent yourself, make data integration more than the purview of individual data scientists, et cetera? So pay up for a few rocket scientists, people who are way off scale. Your HR folks won’t like it. Chances are, they will be weird. I know a bunch of them. They tend not to wear shoes. They certainly don’t wear a tie. They put their feet on the table. [inaudible 00:55:45] in the way. Instead, you’ve got to nurture these rocket scientists, because they’re the ones with good ideas who are going to be your salvation long term. Next slide.

Michael Stonebraker:
Okay. Now, you might say, “Gee, Mike, I work for a company, and we’re succumbing to blunders two, four, six, and seven.” So, next slide, if you’re working for a company that succumbs to any of these blunders, then you should be part of the solution, not part of the problem. So you should be fixing it. If you’re not fixing it, chances are your company’s going to go out of business long term. And so you should be looking for a new employer. And of course, Tamr’s hiring, if you’re looking for work. That’s the end of my slides. We are four minutes from the end. Maybe Anthony has some [inaudible 00:56:55], or Shannon wants to see if there are some questions.

Anthony Deighton:
Yeah, in the remaining time, I’d love to take any questions. And yeah, what’s the best way to do that?

Shannon Kempe:
There are lots of questions coming in. So let me get to as many as I can here, and just try to answer the most commonly asked questions. Just a reminder, I will send a follow-up email to all registrants by end of day Thursday for this webinar. Great presentation as always, you guys. Diving in here: so you suggest moving most, if not all, of our data to the cloud. What types of data should remain on premise?

Anthony Deighton:
On-prem data is going to be data that is buried in [inaudible 00:57:39] legacy systems from 1969, for which you have lost the source code. And so there will be silos of data that you just can’t realistically move. It’s just too expensive. And if it’s too expensive to move, then put a big box around it, tie it up in a bow, and leave it running wherever it is. And your successor in your job will hate you, but if it’s not economically viable to move it, then you’re stuck. Move anything for which you [inaudible 00:58:27] make a return on investment case for moving.

Shannon Kempe:
And when you say cloud, could that also be Docker or Kubernetes?

Michael Stonebraker:
Kubernetes and Docker are container technologies in which you can put your application. They can run on prem. They can run in the cloud. They are simply enablers. They’re simply [inaudible 00:59:00] technology that, generally speaking, you should embrace. But that doesn’t depend on whether you want to move to the cloud or not. Kubernetes and Docker run in the cloud and run on prem. They run everywhere.

Anthony Deighton:
It’s a good example of a design decision you would make differently knowing that you intend to run on a highly elastic compute infrastructure. Knowing that, design principles like leveraging Kubernetes would be worth the investment, versus other approaches you might take, for example.

Shannon Kempe:
Well, Mike and Anthony, thank you so much for this great presentation. But I’m afraid that is all the time we have scheduled for this. Again, thanks to all of our attendees for being the foundation of everything that we do. I’ll get the questions over to Tamr to help get the rest of those answered for you. And thanks for being so engaged. Just, again, a reminder, I will send a follow-up email by end of day Thursday for this webinar with [inaudible 01:00:11] slides and the recording. Thank you both so much. Thank you all. Hope everybody has a great day. Stay safe out there. Thank you.

Anthony Deighton:
Thank you everyone for joining. It was a lot of fun.

Shannon Kempe:
Thanks all.