datamaster summit 2020

Fireside Chat with Turing Award Winner, Dr. Michael Stonebraker

 

Michael Stonebraker

Dr. Michael Stonebraker, Adjunct Professor at MIT and Co-Founder of Tamr

In this session, you’ll get straight talk from Mike Stonebraker on why you need to plan for internal IT to be dramatically restructured, which areas of data mastering are key in the future (spoiler alert: most everything), and how you can establish a data organization with the chops to lead your organization in building a broadly data-driven culture.

Transcript

00:00 – 00:43

Andy Palmer

DataMasters Summit 2021, presented to you by Tamr. It’s a pleasure to be here today with my good friend and great partner for the last, what, 17 years now, through many startups and a number of academic projects: Mike Stonebraker. As many of you know, Mike is a tremendously accomplished academic computer scientist as well as a serial entrepreneur, and most notably, here in the last few years, Mike had the privilege of winning the Turing Award, which is a huge, huge accomplishment. Congrats, Mike.

00:43 – 01:02

Andy Palmer

Oh, thanks. So it’s fun to be here. And, you know, I figured we’d just spend the time talking about what you’ve been seeing, both from an academic and a commercial standpoint in terms of what’s happening in the enterprise data landscape. And maybe look forward a little bit as to what you see is happening next.

01:03 – 01:50

Andy Palmer

I remember very distinctly back in 2007, when you were very vocal about the potential for HDFS and data lakes, or the lack of potential, I should say. And so, you know, I’m anxious to hear, and I bet everybody would be really curious to hear, if you’ve got strong feelings about other technologies and trends that you see that either have a bright future or, you know, could be the next dismal failure. But maybe, you know, we’ll start off and sort of talk about what are the things that are on your mind. I know cloud, you know, is important. What do you think is happening with the cloud?

04:04 – 04:05

Andy Palmer

No more rack and stack. That’s right.

04:13 – 04:39

Andy Palmer

It’s an amazing change, right? Yeah, when we were working on the early academic project, Data Tamer, that became the company Tamr, and I was over at Novartis, we were trying to move a bunch of systems over onto AWS, but the general feeling in the I.T. organization was they wanted to maintain the status quo. They weren’t really anxious for those workloads to move out.

04:45 – 04:53

Andy Palmer

Right. But you think that we’ve passed some tipping point and now it’s just got to happen and the I.T. organizations have to embrace it.

05:51 – 06:22

Andy Palmer

Well, it’s amazing. You describe that, and I’ve experienced the same thing in many cases, where when the I.T. people do this comparison of costs with the cloud, they leave out all these massive costs that they’re just assuming as shared resources, real expenses that they just never account for. And especially when it comes to power, I’m sure that the big cloud providers are much better and more efficient at consuming power than any local I.T. shop.

06:39 – 06:47

Andy Palmer

It’s a little disappointing for me because I always thought of myself as kind of an IT guy. So no more racking and stacking. You’re going to learn DevOps.

06:58 – 07:01

Andy Palmer

Yeah, nice and cool. Yeah, exactly. That’s great.

07:02 – 07:15

Andy Palmer

Well, let’s switch a little bit and talk a little more about data integration and data science. So what have you seen happening in the, you know, sort of maturing of data science and data engineering?

07:59 – 08:00

Andy Palmer

Yeah. Roomba.

09:00 – 09:03

Andy Palmer

You’ve called this like the 800 pound gorilla in the corner.

09:18 – 09:19

Andy Palmer

Wow.

09:24 – 09:24

Andy Palmer

It’s amazing.

09:33 – 10:05

Andy Palmer

So amazing. Well, it’s consistent with what one of our mutual friends who’s over at Databricks, you know, told me a few months ago. He said that something like 80 percent of the workloads on Databricks are actually ETL: extract, transform, load. I mean, theoretically, Databricks is supposed to be like this massive Spark-enabled data science platform. But most of what it’s used for seems to be kind of classic, you know, data management and data integration.

10:23 – 10:27

Andy Palmer

Yeah. So to move data science forward, we have to figure out data integration. Yes.

10:38 – 10:38

Andy Palmer

Incredible.

10:52 – 11:08

Andy Palmer

It’s amazing. So how do you think about that relative to everything we’ve been through in data integration, from, you know, the data warehouses in the ’90s and the master data management tools, the rules-based tools, to kind of where we are now and where we might go in the future?

11:30 – 11:33

Andy Palmer

All right. I like that excuse to go to Tahiti. That’s good.

11:39 – 11:43

Andy Palmer

Okay. I’m not as interested in going to see the Eskimos.

13:14 – 13:15

Andy Palmer

Yeah, yeah.

13:36 – 13:42

Andy Palmer

Yeah, for people doing it and using the machine to do a lot of the heavy lifting. Absolutely. Yeah. Yeah.

13:43 – 14:24

Andy Palmer

Well, it’s fascinating because there are so many ways to attack these data silos, but they’re going to persist no matter what, and they kind of get worse. How do you think about, you know, all these data silos that exist, and now federated methods have become popular again? The Facebook guys built the Presto thing, and now there’s companies that are doing Presto and arguing for data fabrics or data meshes or whatever you want to call it. Like, do you think that data fabrics and data meshes are like the next data lakes? Are we going to look back and regret this stuff ten years from now? Or is it like finally time for data federation to kind of do its thing?

14:56 – 14:59

Andy Palmer

So you still need these join keys across all these different.

16:25 – 17:00

Andy Palmer

It’s incredible. So, like, what else do you think that these, you know, people that are doing the actual work on the ground are going to have to worry and think about? So, you know, the traditional kind of person that people think about when they think about data is the DBA, you know, SQL and managing, you know, using products like Oracle. What do you think are the skills and the things that matter for these data integration folks? Like, what kind of stuff matters to them?

17:20 – 17:21

Andy Palmer

Yes. Yeah.

19:04 – 19:07

Andy Palmer

So it’s like embracing the fact that it’s going to change. Yeah.

19:44 – 19:47

Andy Palmer

And but they don’t care, but they don’t care.

19:57 – 20:26

Andy Palmer

So I’ve heard you call this, like, database decay. And yes, like, these things, they’re just constantly changing and evolving in all these ways. Do you think that there’s a set of technologies that are going to help people do schema design more automatically? I know we played around with this a bit at Vertica back in the day, with tuning, or more automated database design.

20:49 – 21:08

Andy Palmer

But I think that that’s not going to get automated anytime soon. A lot of the stuff we do at Tamr is trying to sort of link up, you know, all the data, however it’s physically designed, with sort of the way that people think of it logically, kind of dynamically putting those two together and trying to make it as easy as possible.

21:30 – 21:38

Andy Palmer

But they have all this legacy, right? Like, it’s just a long, long tail of junk that they all have to deal with every day.

22:19 – 22:39

Andy Palmer

And so you’re kind of advocating, maybe, that if you had no choice, right, you couldn’t quit your job and you had to figure out how to deal with it, then maybe you’d try and start with more of a greenfield kind of approach rather than trying to change some of the stuff that’s going on. Just kind of cut bait and run on the legacy and start as much as you can from scratch.

23:02 – 23:04

Andy Palmer

Yeah. No money for that. No way.

23:13 – 23:36

Andy Palmer

Yeah. Yeah. I think we see that a lot where if anything, the price is going to be in business for a long, long time. Yeah, yeah. We see a lot of our customers where the pressure on their I.T. budgets is extremely downward. I mean, they’re really forcing them to spend less money, and yet they want to become data driven and do more.

23:36 – 23:51

Andy Palmer

So let’s switch a little bit, talk a little bit more about the people side of, you know, a data-driven culture and what’s involved in building that kind of a culture. What have you seen that’s worked? And what are the common pitfalls?

24:17 – 24:17

Andy Palmer

OK.

24:33 – 24:34

Andy Palmer

Yeah, yeah.

24:49 – 24:51

Andy Palmer

And and not wearing suits.

24:57 – 24:58

Andy Palmer

Yeah. Got it.

25:50 – 25:51

Andy Palmer

Wow, that’s a bold statement.

26:03 – 26:15

Andy Palmer

It’s kind of like if the chief financial officer didn’t have access to where all the cash was and wasn’t able to move it around. Like, that would be crazy, to expect them to do their job.

26:32 – 26:46

Andy Palmer

Yeah, I bet you that there isn’t a CDO out there in the world that’s done that proactively. Maybe our friend Mark Ramsey did that at GSK back in the day. But yeah, not many people do it.

27:20 – 27:30

Andy Palmer

Yeah, yeah. Then he didn’t have to negotiate with all these other people that thought they owned the data, because otherwise you spend all your life trying to find out what you’ve got.

27:41 – 27:43

Andy Palmer

Yeah, I own my silos. That’s right.

27:50 – 28:05

Andy Palmer

Right, right. Well, it’s amazing. It sounds like, when we talk about this and trying to make this a priority, that it’s going to take not only a huge amount of technology and great technical people, but also a lot of cultural change. Yeah.

28:39 – 28:56

Andy Palmer

Well, this seems like what Abby did over at Fidelity when she hired Mona Vernon, who we both respect so deeply. And it sounds like Mona’s been doing some amazing work over at Fidelity, which obviously has incredible data assets, but also a lot of legacy.

29:03 – 29:48

Andy Palmer

Incredible. That’s really great. Well, Mike, it’s been awesome connecting, especially post-COVID. I know we’ve both been isolated in all kinds of ways. It’s good to be back together again and hopefully we’ll be able to do more of this over the coming months. And thanks for being here with us at DataMasters. We’re really excited to wrap things up here over the coming session or two. But thanks for the advice. It’s been great. Great to see you. And thanks to all of you for joining us. I think I’m going to pass it off to Larry and Melissa, who are going to run a final session to kind of get things completely wrapped up, and really appreciate your time. Thank you.

01:50 – 02:15

Michael Stonebraker

I think the major trend these days is the cloud. And I think, you know, realistically, anybody who isn’t trying to move everything they possibly can to the cloud is making a huge mistake. And I think that’s going to have just gigantic implications.

02:16 – 02:56

Michael Stonebraker

So I think first of all, the cloud. Why should you move to the cloud? Well, my favorite vignette is from Dave DeWitt, who used to be the head of the Microsoft Jim Gray Systems Lab in Wisconsin. And he said as of three years ago, the technology being used by Azure for their data centers was shipping containers in a parking lot: power in, chilled water in, internet in, otherwise sealed. Roof and walls optional, only there if you need security.

02:56 – 03:28

Michael Stonebraker

Put them in low-rent places with cheap power like the Columbia River valley, and compare that with what you’re currently doing, which is raised flooring in Cambridge. I mean, you just cannot possibly compete. And what’s more, if you need 30 servers on the first day of the month and three servers during the rest of the month, you know, on-prem you have to provision for 30; in the cloud, you do it dynamically.
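
The provisioning arithmetic in this example can be sketched in a few lines of Python. This is only an illustration: the 30-day month and the idea of counting "server-days" are assumptions added here, not figures from the conversation.

```python
# Illustrative sketch: compare server-days for fixed (on-prem) versus
# dynamic (cloud) provisioning. The 30-day month is an assumption.
DAYS_IN_MONTH = 30
PEAK_SERVERS = 30      # needed on the first day of the month
BASELINE_SERVERS = 3   # needed on every other day


def daily_demand(days=DAYS_IN_MONTH):
    """Servers needed on each day: peak on day 1, baseline afterwards."""
    return [PEAK_SERVERS] + [BASELINE_SERVERS] * (days - 1)


def on_prem_server_days(demand):
    """On-prem capacity must cover the peak on every day of the month."""
    return max(demand) * len(demand)


def cloud_server_days(demand):
    """Dynamic provisioning pays only for what each day actually needs."""
    return sum(demand)


demand = daily_demand()
fixed = on_prem_server_days(demand)    # 30 * 30
elastic = cloud_server_days(demand)    # 30 + 3 * 29
print(f"on-prem: {fixed} server-days, cloud: {elastic} server-days")
```

Under these assumptions the fixed setup carries 900 server-days against 117 for the dynamic one, which is the gap the example points at.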

03:28 – 04:04

Michael Stonebraker

So I think the advantages are just overwhelming, and I think you ignore the cloud at your peril. And I think you should assume the only things that won’t move are things that are tied to, like, legacy COBOL applications, which won’t run on the cloud, but plan on moving everything you possibly can to the cloud. And that’s going to change dramatically what your IT staff does because they’re no longer going to be running around with screwdrivers and, you know, picking up the raised flooring.

04:06 – 04:12

Michael Stonebraker

And so to the extent you can get out of the data center business, God bless you.

04:39 – 04:45

Michael Stonebraker

Of course. I mean, it’s called “your job security is going away,” right?

04:54 – 05:50

Michael Stonebraker

Well, my favorite example is, you know, I work in a laboratory at MIT. It runs its own data center, and the head of that data center claims that he is cheaper than the cloud. And when you sit him down and say, well, explain that to me, he says, Well, I get free rent and I get free power, free air conditioning. Yeah, so under any, you know, apples-to-apples comparison, yeah, it’s just wildly more expensive. And so I think, you know, the people who are listening, you know, you should distrust your IT organization because they are, of course, interested in job security. And there’s going to be a huge transition away from what you call rack and stack to doing other things, probably with half as many people. Yeah.

06:22 – 06:39

Michael Stonebraker

Well, I mean, your enterprise builds a data center every decade or so. Yeah. And the elephants are building as many as they can every year. And so they’re just really good at it.

06:48 – 06:58

Michael Stonebraker

And, you know, I think if it was technically feasible, they would put data centers in the middle of the water in Hudson Bay at the Arctic Circle.

07:15 – 07:59

Michael Stonebraker

Well, my favorite vignette is that, you know, lots and lots of C-level people say, Well, you know, we got to get into this data science stuff. Mm hmm. And so I’m going to go out and hire a bunch of data scientists. And I say, OK, that sounds like a good idea. Do you realize that your data scientists don’t do data science? And they kind of look at me with glazed-over eyes and say, Huh? So my favorite vignette is from one of the main data science people at iRobot. Those are the guys that have the vacuum cleaner that runs around your floor.

08:00 – 08:31

Michael Stonebraker

Yeah. Yeah. So she said, I spend 90 percent of my time finding data sets that I’m interested in analyzing, integrating them together, because any multiple data sets are never plug-compatible. Mm hmm. And then cleaning the data so that I don’t get garbage in, garbage out. Incredible. So 90 percent of my time I do, let’s call it, data munging and data integration.

08:31 – 09:00

Michael Stonebraker

Yeah. So that leaves 10 percent to do the job for which I was hired. Of that 10 percent, I spend 90 percent fixing my data cleaning errors so that my models produce something interesting. Wow. So 99 percent of the time she does data integration and data cleaning. Wow. One percent of the time she does the job for which she was hired.
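
As a quick check of the arithmetic in this anecdote, the split can be worked through with exact fractions. The variable names are illustrative only; the percentages are the ones quoted above.

```python
# Check of the time split described in the anecdote, using exact fractions
# so that no floating-point rounding creeps into the comparison.
from fractions import Fraction

integration_share = Fraction(90, 100)          # finding, integrating, cleaning data
remaining = 1 - integration_share              # the 10 percent that is left
fixing_share = remaining * Fraction(90, 100)   # 90 percent of that 10 percent

non_modeling = integration_share + fixing_share  # total time on data work
modeling = 1 - non_modeling                      # time left for the actual job

print(f"data work: {float(non_modeling):.0%}, modeling: {float(modeling):.0%}")
```

The two shares add up to 99/100, leaving 1/100 for modeling, which matches the figures in the story.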

09:03 – 09:18

Michael Stonebraker

Yes. So I think people just need to realize that moving to a data-driven culture, moving to data science, is a big thing, and that the 800-pound gorilla in the corner is data integration.

09:19 – 09:23

Michael Stonebraker

Because you always have to do it and it’s always a big deal.

09:25 – 09:33

Michael Stonebraker

So just tattoo on your brain, yeah, that independently constructed data sets are never plug-compatible.

10:06 – 10:23

Michael Stonebraker

No, I’ve never met a data scientist who claimed he spent more than 20 percent of his working life doing data science. So, so it’s 80 percent or more other stuff. So actually, data integration?

10:28 – 10:37

Michael Stonebraker

And so. And so if you have a 20 person data science group, you should have 16 of them being experts at doing data integration.

10:38 – 10:52

Michael Stonebraker

And if anything, it’s 20 people who are data science trained, who say, Oh crap, I have this data munging that I need to do, and what a pain in the butt it is.

11:09 – 11:30

Michael Stonebraker

Yeah. Well, I think there’s every reason why people build data silos. And it’s very simple, because let’s say your enterprise currently sells plumbing supplies to the Tahitians.

11:33 – 11:39

Michael Stonebraker

And you decide, I’m going to set up a thing to try and sell igloos to the Eskimos.

11:43 – 12:34

Michael Stonebraker

But OK, so what you do is you hire somebody or empower one of your current people and say, here’s a budget. Go see if you can sell igloos. And so that person says, OK, I can calculate my runway, which is I’ve got to deliver within x months. And so time’s a-wasting. Let’s get at it. So they cobble together whatever they need in the way of stuff to go try and sell igloos to the Eskimos. If they succeed, they have a data silo; if they fail, of course, it all goes away. Right? So you have that going on. Mm hmm. And what’s more, your CEO decides he’s going to buy a company that’s selling river barges on the Mississippi.

12:34 – 13:14

Michael Stonebraker

Yeah. So you buy the company, it’s now in your stable. It comes with all kinds of IT stuff, and almost nobody is willing to stop the world, take all of their data and integrate it with the stuff you have. So you say, get going, and you have another data silo. Yeah. So the average enterprise has tens to hundreds of data silos. And the thing that I find astonishing is that the business value of integrating the silos is just gigantic.

13:15 – 13:35

Michael Stonebraker

And so the CFO says, Oh crap, I’ve got all these silos, there’s huge business value in integrating them, but it’s too expensive. Hmm. And so I think what Tamr is doing is fabulous, which is driving down the cost of doing this data integration so that there’s an ROI.

14:25 – 14:56

Michael Stonebraker

I mean, I think some companies believe in centralization, you know, sort of put all your data in one place, and some people don’t believe in that and they say, leave the data where it is. Mm hmm. In either case, you have a data integration problem. Right? Constructing a data lake doesn’t solve your data integration problem. Yeah, it gives you a swamp. Yeah.

14:59 – 15:33

Michael Stonebraker

And there is no such thing as a global key. As I say, independently constructed data sets are never plug-compatible; they never have a global key. Yeah. And so you get to deal with that no matter what. And I think the thing everybody should realize is that the business value of doing this integration is high for suppliers because it means that you can demand most favored nation status.

15:34 – 16:25

Michael Stonebraker

You know, if you have 75 procurement systems, as General Electric has, then when your contract with Staples comes up for renewal, if you can find what the other 74 guys (or, sorry, guys or women) managed to negotiate and demand most favored nation status, then you save a huge amount of money. But that means you have to do supplier integration now. And the upside to doing customer integration, parts integration, is just huge. Yeah. So I think what everyone should realize is they’re going to be in the data integration business for a long time, so get really good at it, because you’re going to do it for a lot of your main entities.
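
A minimal sketch of the supplier example: several procurement systems record the same supplier under different spellings, so the names need a common key before the best negotiated price can be found. All records, system names, prices, and the normalization rule below are hypothetical and chosen only for illustration.

```python
# Illustrative sketch: find the best negotiated price for one supplier across
# several procurement systems that spell the supplier's name differently.
# All records, system names, and the normalization rule are hypothetical.
import re
from collections import defaultdict

# (procurement system, supplier name as recorded, negotiated unit price)
records = [
    ("system_a", "Staples, Inc.", 4.10),
    ("system_b", "STAPLES INC", 3.75),
    ("system_c", "Staples", 4.40),
    ("system_c", "Acme Corp.", 9.00),
]

LEGAL_SUFFIXES = {"inc", "corp", "llc", "ltd", "co"}


def normalize(name):
    """Crude matching key: lowercase, drop punctuation and legal suffixes."""
    words = re.sub(r"[^a-z0-9 ]", "", name.lower()).split()
    return " ".join(w for w in words if w not in LEGAL_SUFFIXES)


def best_prices(rows):
    """Group rows by normalized supplier and keep the lowest price seen."""
    grouped = defaultdict(list)
    for system, supplier, price in rows:
        grouped[normalize(supplier)].append((price, system))
    return {key: min(offers) for key, offers in grouped.items()}


best = best_prices(records)
print(best["staples"])  # lowest Staples price and the system that negotiated it
```

Real data rarely yields to a rule this simple, since there is no global key to lean on, and that is where the cost of integration comes from.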

17:01 – 17:20

Michael Stonebraker

I think, to me, if you look at DBAs. Mm hmm. So, uh, they get very good at tweaking the hundred or so tuning parameters that are in your favorite database, like our friend Neville.

17:21 – 17:40

Michael Stonebraker

Okay. So I think over time, there’s a bunch of research projects that are trying to automate the knob tuning. Mm hmm. And I think that will probably succeed over the course of a decade or so. Mm hmm. So I think the DBAs won’t have to do that anymore. OK.

17:42 – 18:11

Michael Stonebraker

That will leave them with basically doing schema design. Mm hmm. And I think the thing I find fascinating about schema design is that there are all these tools that are really good at building your first schema, like Toad. Toad is the one that everybody, yeah. If you have a green field and you want to do a new application, well, then tools are good at helping you do that.

18:11 – 19:03

Michael Stonebraker

Mm hmm. That isn’t the real problem. The real problem is that no more than three to six months later, things change and you have to do schema evolution. And what you really would like to do is go back and run Toad again. Mm hmm. Build a clean schema and then convert all your data from, you know, the schema you have to a clean new one. Mm hmm. No one does that. Incredible. And so your schema gets dirtier and dirtier and dirtier with every passing evolution until you finally get something that is just not maintainable. And then you have to throw it away and do it all again. Gotcha. And there’s got to be a better way.

19:07 – 19:43

Michael Stonebraker

And yeah, instead of doing the minimum-cost incremental thing, which is to make your schema dirty. Hmm. Take a longer view of the world and realize that that’s long-term suicide. Mm hmm. And I think the trouble is enterprises in general are too focused, you know, on the local short run. Mm hmm. So I’ll be out of this job and my successor will inherit a mess.

19:48 – 19:56

Michael Stonebraker

But I think C-level people have got to get much smarter. Yeah. About these strategic technical decisions. Yeah.

20:26 – 20:49

Michael Stonebraker

I think physical database design will get automated. Really, logical database design, I think, won’t, because that’s, you know, it’s in the mind of the beholder. What’s an entity? Yeah, yeah. Yeah. And what’s the relationship and what’s an attribute? Yeah. So I think that, I mean, there are tools that help you do that.

21:08 – 21:29

Michael Stonebraker

Yeah, I mean, it seems to me you inherit decayed data. Yeah, yeah. And it seems to me that ideally enterprises should desperately attempt to put Tamr out of business. Yeah, but do it by just not creating such horrible messes. Yeah.

21:38 – 22:18

Michael Stonebraker

Well, what I want to say: I get asked all the time, Yeah, my company has this mess. What do I do? Right? And I sort of look at the mess and say, Go work for a different company. Just quit, because the wonderful thing about startups is they don’t have this legacy that they’re dragging around. Yeah, yeah, yeah. So I think it gives a huge advantage to startups. And so I think it’s so difficult for legacy enterprises to move forward aggressively. Wow.

22:40 – 23:02

Michael Stonebraker

But you have a CEO. And so you say, I’d like to rewrite everything over 15 years. Yeah. So let’s suppose that’s $5 million a year. Yeah. So it’s a $75 million project, right? And you go pitch that to your CEO. Yeah. What do you think you’re going to get? No money for that, right?

23:04 – 23:13

Michael Stonebraker

Because your CEO says, Crap, the competition is breathing down my neck. Yeah, I’ve got much more immediate problems.

23:53 – 24:16

Michael Stonebraker

Well, I can tell you one vignette of what doesn’t work. OK, yeah. So for a while, I was on the Technical Advisory Committee for a major investment bank, you know, a Manhattan investment bank. So they said, we want to, you know, we want to embrace the new stuff. And way back when, it was get going on Unix and Linux.

24:17 – 24:33

Michael Stonebraker

And so I said, Well, you know, I have this student who wants to come live in New York for a while. Mm hmm. And by the way, you know, this is a Berkeley student, which means ponytail, no shoes, you know, Birkenstocks.

24:34 – 24:48

Michael Stonebraker

And so he goes to interview and they won’t hire him because he looks too weird. Really? And so first of all, you’ve got to embrace weirdness because, yeah, all the really good people are just plain weird. Yeah, yeah.

24:51 – 24:57

Michael Stonebraker

And yeah, I mean, you know, like, casual Fridays aren’t going to cut it.

24:58 – 25:50

Michael Stonebraker

Got it. And so you’ve got to switch the culture and it’s got to come from the top. Yeah. And so start with the CEO not wearing a suit. That would be a good place to start. Yeah. Another thing that would be a good place to start is, enterprises have all hired a chief data officer. Hmm. But most of the CDOs I know are struggling to get access to all the enterprise data assets. Hmm. So like, you’re supposed to be this person in charge of data, but you can’t access your own data. Hmm. So do not take a job as a CDO unless you can get access to every piece of data the enterprise has.

25:53 – 26:03

Michael Stonebraker

You can’t be effective otherwise. I mean, I think you’ve got to embrace, you know, dramatic change. That’s incredible.

26:16 – 26:31

Michael Stonebraker

Same thing is true of the CDO. And make sure that your CFO doesn’t do the budget on his spreadsheet. Yeah, right. Make it data that’s accessible. Yeah. So at least other people can look at it. Wow.

26:46 – 27:20

Michael Stonebraker

Yeah, Mark’s great. I had dinner with him one night. He used to be CDO of GlaxoSmithKline. Yeah. And he said, I was talking to the CEO and I said, I will not take this job unless I get access to every piece of data in all of GSK. And the CEO said no way. And Mark said, Well, I had said. That’s awesome. And so he got it, and it helped him a lot.

27:30 – 27:41

Michael Stonebraker

Yeah. Yeah, yeah. Because because all of your operational data managers, when you go calling and say, What do you got? They say, go away. You know, I don’t report to you.

27:43 – 27:50

Michael Stonebraker

Yeah, I own my silo and get out of my hair because my silo is my job security.

28:06 – 28:39

Michael Stonebraker

Well, and I think the thing that any executive could start by doing is hiring five of the smartest people they can find and paying way off scale to get them, so that you have some smart people who can tell you what you ought to be doing. Kind of build the muscle. Yeah, because right now, if you don’t have visionary people, then you’re blind and you have no one. Yeah, no one leading the blind.

28:57 – 29:03

Michael Stonebraker

Huge number of silos. Yeah, yeah. And running one of every kind of piece of software you can imagine.