Datamaster Summit 2020

Build a scalable data culture with cloud-native solutions


Andy Palmer & Roman Stanek

Andy Palmer, CEO and Co-Founder, Tamr
Roman Stanek, CEO and Co-founder, GoodData

Every business understands the urgency to scale business insights, build predictive models, and implement better data-driven decision making — all faster and more efficiently than ever before. But what exactly does it take to reach this data-driven holy grail?

Join Roman Stanek, CEO and Founder of GoodData, and Andy Palmer, CEO and Co-Founder of Tamr, as they tackle the hot topic during a 30-minute fireside chat to learn how to leverage your data as an asset and successfully build a scalable data culture.

Transcript

Valerie Chan:

Welcome to GoodData’s fireside chat webinar, Build A Scalable Data Culture with Cloud Native Solutions. I’m Valerie Chan, marketing manager at GoodData. Let’s go over some logistics. All attendees are muted upon entry. As you come up with questions, please submit them in the Q&A box within Zoom. We’ll address them at the end of the webinar. After the webinar, we will share a follow-up email with the recording. Our webinar will examine how organizations can effectively leverage data as an asset to build a scalable data culture. Fortunately for all of us, we have two industry veterans here to share their expertise.

Valerie Chan:

Roman Stanek is the CEO and founder of GoodData. GoodData delivers growth across the globe through analytics, helping more than 140,000 of the world’s top [inaudible 00:00:51] deliver on their analytics goals. GoodData’s data-as-a-service infrastructure is the future of analytics: real-time, open, secure, and scalable. Andy Palmer, our esteemed guest today, is the CEO and co-founder of Tamr. Tamr masters data at enterprise scale to drive timely analytics projects and deliver successful business outcomes. Previously, Andy was the co-founder and founding CEO of Vertica Systems, a pioneering big data analytics company. Roman and Andy, I’ll let you take it from here.

Roman Stanek:

Hello, everyone. Andy, it’s good to see you again, and everyone, welcome to the webinar. So, let’s kick it off. Let’s start with Tamr. So, what is data mastering and why is it an essential step in data prep? And is it the same as data literacy? I’m not fully familiar with that term, so what is data mastering?

Andy Palmer:

Yeah, it’s a great question. The incredible thing about data over the last 20 years is that, as people have begun to realize how important and critical data is to their businesses, we’ve started to pull apart the various components of a next-gen, modern data ecosystem. And one of the things that I believe is really important in building a modern data engineering infrastructure is to have a data mastering function which is relatively independent from the other components in your system, whether it’s a raw data source catalog or how you store the data. And mastering is really the process of taking many different tables of tabular data, processing those tables using the machine as much as possible, and adding human expertise to curate how the data is connected.

Andy Palmer:

And the result of data mastering is a set of tabular data endpoints that represent the best data across an entire organization, in tables that are versioned and organized around logical data entity types. So, instead of the names of physical tables, cust_5_72, blah, blah, blah… Really, when we master data with our customers, the result is a clean table of customer data from hundreds or even thousands of different sources. It’s a simple table that has all the customer data that you could get from all these different places around the organization. Like I said, it’s a relatively independent function that co-exists along with all the data pipelining that you have, all of the data governance that you might have, all the data storage, and all of these other things. And the goal of this very simple function is to deliver clean, crisp, curated, comprehensive data to the many different consumers that want the best place to go to find high quality data.

Roman Stanek:

Okay. That’s helpful. You mentioned that you use machines as much as possible, so how do you guys do it? Is that manual? Is it semi-automated? Is it automated? How do you do it?

Andy Palmer:

Yeah. So, we spent three years working at MIT building the math that enables us to use the machine as much as possible to organize the data. At its core, it really has three functions. The first is figuring out whether two attributes are the same. So, I’ve got an attribute called First Name over in one table and another attribute in a different table called First_Name; do those two things mean the same thing, or are they different? The second function is record matching. I’ve got a record for Roman Stanek in this database. Is that the same as Roman Stanek 2 in this other database? And then third is classification. I’ve got Roman classified as a friend in one table and I’ve got him classified as a colleague in another table. Does that mean the same thing? In our case, it absolutely does, of course.
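To make those three functions concrete, here is a minimal Python sketch of schema mapping, record matching, and classification built on a simple string-similarity score. It only illustrates the idea; Tamr’s actual probabilistic models are far more sophisticated, and the helper and data here are hypothetical.

```python
# A minimal, hypothetical sketch of the three mastering functions:
# schema mapping, record matching, and classification, using a simple
# string-similarity score. Tamr's actual probabilistic models are far
# more sophisticated; this only illustrates the idea.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Normalized similarity between two strings, 0.0 to 1.0."""
    def norm(s: str) -> str:
        return s.lower().replace("_", " ").strip()
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

# 1. Schema mapping: do two attributes mean the same thing?
print(similarity("First Name", "First_Name"))   # ~1.0 -> likely the same attribute

# 2. Record matching: are two records the same real-world entity?
rec_a = {"name": "Roman Stanek",   "company": "GoodData"}
rec_b = {"name": "Roman Stanek 2", "company": "GoodData Corp."}
score = sum(similarity(rec_a[k], rec_b[k]) for k in rec_a) / len(rec_a)
print(score)                                    # high score -> recommend a match

# 3. Classification: do two labels mean the same category?
print(similarity("friend", "colleague"))        # low score -> route to a human
```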

Andy Palmer:

So, these three core functions of schema mapping, record matching, and classification are the core elements of large-scale data mastering. And the math that we built enables us to use the machine to process hundreds, thousands, even millions of tables, and the machine, using the probabilistic models that have been developed, will give recommendations as to whether or not two attributes map, two records match, or two things are classified the same or different. And that’s augmented by humans, who then validate that, yes, that’s a good mapping or not. Or they say, “Well, the confidence level of that mapping isn’t quite high enough. I need to try harder. And, oh, by the way, here are the three or five people we want to ask about their opinion as to whether or not the confidence is high or low.”
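The human-in-the-loop step might look something like the following sketch: the machine attaches a confidence to each candidate, and only the uncertain middle band is escalated to expert reviewers. The thresholds, the data, and the triage helper are assumptions for illustration, not Tamr’s actual API.

```python
# A hedged sketch of the human-in-the-loop step: the machine emits a
# confidence for each candidate mapping/match, and only the uncertain
# middle band is escalated to human experts. Thresholds, data, and the
# triage() helper are illustrative assumptions, not Tamr's actual API.
ACCEPT_ABOVE = 0.90   # auto-accept recommendations at or above this confidence
REJECT_BELOW = 0.40   # auto-reject recommendations below this confidence

def triage(candidates, experts):
    accepted, rejected, review_queue = [], [], []
    for pair, confidence in candidates:
        if confidence >= ACCEPT_ABOVE:
            accepted.append(pair)                 # machine is sure enough
        elif confidence < REJECT_BELOW:
            rejected.append(pair)                 # clearly not a match
        else:
            # "here are the three or five people we want to ask"
            review_queue.append((pair, experts))
    return accepted, rejected, review_queue

candidates = [(("First Name", "First_Name"), 0.97),
              (("friend", "colleague"), 0.55),
              (("zip_code", "revenue"), 0.10)]
print(triage(candidates, experts=["domain expert A", "data steward B"]))
```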

Andy Palmer:

And so, at its core, using the probabilistic method and a series of models to do this kind of mapping, matching, and classification is a fundamentally different approach from the traditional one, which, as you and I have both done for a long time, was to use rules-based systems to set up rules that literally map tables, match records, or classify things. Yeah.
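For contrast, a hand-written rule of the kind rules-based systems rely on is binary and brittle, where the probabilistic score in the sketches above is graded. This toy example is purely illustrative.

```python
# A literal rule someone might write: strip underscores, compare
# lowercase. It answers yes/no with no notion of confidence, and it
# silently misses near-matches a probabilistic model would flag.
def rule_based_match(a: str, b: str) -> bool:
    return a.replace("_", " ").lower() == b.replace("_", " ").lower()

print(rule_based_match("First Name", "First_Name"))  # True
print(rule_based_match("First Name", "Fst Name"))    # False: the rule misses it
```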

Roman Stanek:

That’s cool. It’s very clear that data is such a complicated topic that no single rule, or even set of rules, can ever capture it in a company. And so, when you actually look at it… I want to talk about cloud native and how we can actually build more scalable solutions, but before we go to cloud native, there’s this whole idea, our idea, of data as a service as a fundamental building block of what Gartner is calling composable analytics. So, it sounds to me, and I hope I’m right, that it’s almost like a value chain between what you guys do in getting these tables ready and what we do, turning it into a single source of metrics. So, we can add a semantic model and metrics and metrics [inaudible 00:08:01] calculations and aggregations and predictive metrics and so on. And the last step is then composable analytics, where customers can compose new applications and there’s [inaudible 00:08:12] data discovery. Is that how we would actually present it to someone who is looking at Tamr and GoodData?

Andy Palmer:

Yes. Yeah, I think so. My experience over the last 20 years has been that we’re re-engineering data in the enterprise from the outside in. For years, I spent a lot of time working on core database systems that would help improve the performance of data and how you can store data efficiently and all these kinds of things. And many people, including yourself, have been working from the data consumer back into the infrastructure. Ultimately, these things come together, where you have this next-gen core data infrastructure with much higher quality data that’s much more integrated, and you have a set of infrastructure for consumers that makes that infrastructure much more accessible and much more dynamic, and composable, I think, is a good word.

Andy Palmer:

And so, I think it’s this intersection between next generation data engineering and next generation, call it, business intelligence and analytics and even operational ways that people consume data. It’s like all this stuff coming together and meeting in the middle, and it’s amazing to me how fast this is happening. And I think a huge accelerator in the last five years has been the large enterprise moving over to the cloud as their primary deployment modality, moving the center of gravity for their data over to the cloud.

Roman Stanek:

I agree. I know that these things exist like… It’s like sediments. You find analytics processing from the ’70s and ’80s and ’90s, and now, for the first time potentially, we have to [inaudible 00:10:18] modernize. And cloud, as you said, is a good [forcing 00:10:21] function. Yeah, I agree, and I actually think that, as you said, we are coming at it from two different angles, opposite angles, actually. You’re coming to it from the data: how do you actually provide consolidated, clean, and categorized data? And we look at it as: how do we actually enable these data sets to be available in a unified and real-time way to as many consumers as possible, but also in as many ways of [inaudible 00:10:52] consumption, like KPIs and notebooks and dashboards and so on? So, as you said, it’s kind of a new model for how to do it, and I agree. It’s actually happening very quickly.

Roman Stanek:

Let’s talk about cloud native, and why and how cloud native can actually help with all of that. If you look at how people do it today, I would say that most of it is actually happening in staging areas. I copy data to a staging area, that’s where it gets cleansed, and someone copies it to another staging area, where it’s picked up by some data analytics tool. And every analytics tool has a different set of rules and semantic models and meanings and presentations and so on. So, I do believe in this kind of vision, that there needs to be a consolidated set of data that provides real-time data [inaudible 00:11:51] that anyone can combine and [inaudible 00:11:54]. But it comes with a different set of requirements. We are not talking about an analyst or two, we are talking about thousands and tens of thousands of people, and that’s where the cloud and cloud native and scale and SLAs come into play.

Andy Palmer:

Yes. Yeah. And I think one of the things that’s amazing, and that I’m excited to hear you talk more about, is this term that you’ve really redefined: data as a service. Because I think it’s very accurate as a bundle, in terms of describing what organizations should be striving for. I think that most organizations need to figure out how to get out of the details of the physics of managing the data and think more about how they deliver data as a service, and the tooling and the infrastructure that’s required to do that. What you guys are doing in terms of leading that charge at GoodData, I think, is really essential, because there’s so much complexity and there are so many distractions that large organizations can find themselves in. They really need to figure out how to minimize the number of distractions. Just the decision of whether they’re going to run some set of infrastructure on-prem versus on the cloud can either create a lot of distractions, if they decide to do it on-prem, or eliminate a huge number of distractions, if they decide to go cloud native.

Andy Palmer:

And so, I think if they start from the mission, the goal of trying to deliver data as a service to as many consumers in the organization as possible and work back from there, some of these core infrastructure decisions become very, very simple and kind of a no-brainer. Yeah.

Roman Stanek:

Yeah. I agree. And I actually think that the biggest problem or challenge we are jointly solving, and that’s kind of a fundamental problem of data as a service, and I would say a fundamental problem of data in general, is the lack of trust and governance. Why do people copy data again and again? Because they don’t trust the people who manage the data next to them. They don’t trust that manager. So, everyone wants to get access to their own data and do it again and again and again. So, I see data as a service as, clearly, a different way to deal with trusted and governed data in the enterprise. And the emphasis is on governance and trust, and that’s where GoodData and Tamr come into play, because, again, you guys are [inaudible 00:14:53] that trusted, governed data set that has been mastered and so on. And we look at it as: how do we add real-time governance? Who can see what? And what is available? And what it means, and the semantic model, and so on.

Roman Stanek:

So, I do believe that we have three issues here. We have the technical infrastructure, cloud, no cloud, on-prem, and so on, but then trust and governance are two big issues. It’s good to see that this is actually being solved as well.

Andy Palmer:

Yeah. Mastering is one component in overall data governance and trust, but I’m really… One of the things that frustrates me right now, when it comes to governance and trust of data, is that there are a lot of people focused on source-based data governance, which I think is a bit of a red herring. The reality is that if you try to govern data based on where it comes from in an organization, and inherit all the access controls from wherever it was created all the way through, it’s too complicated, it’s too heterogeneous. It never actually gets done. The only really healthy way to govern data in a large enterprise, and the only viable way, is to start with how the data is being consumed and work back from who’s consuming what data and whether or not they’re consuming it appropriately.

Andy Palmer:

And then, all the provenance and lineage that you can provide as context for whether that person is consuming that data appropriately or not is very, very valuable. But I really get worried when people… I have a number of colleagues in large enterprises, and I was tempted to do this when I was a chief data officer, who get sucked into these large source-based data governance projects that never finish; you run out of resources before you get halfway through, and it doesn’t really matter. What really matters is how people are consuming the data and whether they’re consuming it appropriately. And then use as much of the lineage and provenance as you can to inform those decisions about who gets to see what data. So, it’s a big hot button for me, in part because I see so many resources going into projects with companies like [Collibra 00:17:23]. Yeah.
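A minimal sketch of what consumption-based governance could look like in practice, assuming a simple role-based policy: the gate sits at the point of consumption, and lineage travels along as audit context rather than as the gating mechanism. The policy structure and all names here are hypothetical.

```python
# A hypothetical sketch of consumption-based governance: the policy
# decision happens where data is consumed, and lineage is attached as
# context for the audit trail rather than being what gates access.
from datetime import datetime, timezone

POLICY = {  # mastered entity type -> roles allowed to consume it
    "customer_master": {"sales_ops", "finance"},
    "supplier_master": {"procurement"},
}
AUDIT_LOG = []

def consume(user, role, entity_type, lineage):
    allowed = role in POLICY.get(entity_type, set())
    AUDIT_LOG.append({
        "ts": datetime.now(timezone.utc).isoformat(),
        "user": user, "role": role, "entity": entity_type,
        "allowed": allowed,
        "lineage": lineage,  # provenance as context, not as the gate
    })
    if not allowed:
        raise PermissionError(f"{user} ({role}) may not consume {entity_type}")
    return f"rows of {entity_type}"  # stand-in for the actual data

print(consume("roman", "sales_ops", "customer_master",
              lineage=["crm_eu.accounts", "erp_us.customers"]))
```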

Roman Stanek:

Yeah. No. And again, as you said, it’s very modern these days. My perspective on that: I would call it passive governance, unlike what I call active governance. Active governance actually governs the access, rather than passive governance that governs who can see what. Or not even who can see what, but where the data is coming from. I see it almost like… like you go to the library, have you seen that index [inaudible 00:17:50]? That’s how I see it. I can still go and grab that book if I want to. So, it’s kind of nice to have, but again, it’s a good step. Yeah, for us, governance is really more about: how do we actually make sure that people trust the data and have access to as much data as possible? [inaudible 00:18:16] actually need to interface with another system that’s acting like [inaudible 00:18:20].

Andy Palmer:

I think that’s right. Data mastering is very closely related to data preparation, right? Which a lot of people associate with companies like Alteryx or Trifacta: this last-mile data prep where you can combine a couple of tables and do some data cleaning before you actually use it. I’ll never forget this experience. We had one of our larger customers, one of the top five biopharmaceutical companies in the world, and we were building this next-gen data infrastructure for them, and they were using a data prep tool. And the folks from the data prep company were giving their pitch as to how every single person in the company, the tens of thousands of people, were each going to be able to take the data and tune it and tweak it to suit their own needs. I’ll never forget, somebody at the company, from the operating side, held up their hand and said, “Well, if everybody’s doing that, how do I know that their data is consistent with mine? And doesn’t that really cause me not to trust anything [inaudible 00:19:24]?”

Andy Palmer:

I think that, in many cases, we’ve relied on end users to do a lot of their own preparation of the data, which is empowering and important, but we have to be able to resolve the idiosyncrasies they create against the core data itself, and whether or not they’re creating massive inconsistencies. Because sooner or later people are going to take action on this stuff, and if the data’s not lined up appropriately, then you can get people doing the wrong things really easily.

Roman Stanek:

Yeah. And that’s kind of the whole idea, again, for us, behind data as a service and composable analytics: that you’re building it from composable blocks that are validated and tested and governed and so on. The funny story is that on the very first GoodData website, the headline was, “Now anyone can be an analyst.” If I did a new website today, I would say, “No one wants to be an analyst.” People want to get data as a service. People want to get data in the way they get [inaudible 00:20:31] and use it and make decisions and so on. So, the idea that every single person in the company will have to go and clean their own data set and so on, again, that’s too much.

Roman Stanek:

But that goes back to where we started, almost, and that’s the cloud infrastructure, because for the first time ever, we have an infrastructure that can power the real-time analytics that can help drive the decisions of tens of thousands of people in the field, on mobile, and so on. So, I do believe that, and I would like to hear, and we have maybe a few minutes and then maybe some questions: what are your plans for cloud native, hosted? How do you get to the scale to support some of the largest companies in the world?

Andy Palmer:

Yeah. We’ve really embraced our customers’ schizophrenia with regards to cloud. We still have a bunch of customers that are deployed on-prem, but increasingly, people are deploying cloud native. So, Tamr runs natively in GCP, AWS, and Azure. You can buy Tamr directly through any one of them. And very soon, we’re going to also be launching Tamr as a service, what we call Tamr Cloud, where you won’t have to set up your own instances or any of that stuff. But I think the real accelerator for us… we did this big project for SocGen, the big French bank, at one point, and they made a decision really early on to run natively, hosted on the cloud. And they estimated that to run the same project on-prem would have taken six months just to provision the hardware. And ultimately, we ended up modeling everything that SocGen buys and doing spend optimization on that, from ATMs all the way down to pens and pencils. And they took out tens of millions of dollars by cleaning up their data across SocGen and doing simple cost optimization.

Andy Palmer:

But they were able to deploy the project in less than a month and a half, primarily because they didn’t wait six months to provision a bunch of on-prem hardware. And when we had to scale up to support all the diversity of the sources and the data that they had, we were able to elastically expand all of the compute that was required to serve the models they were running in Tamr, and then, when we were finished training the models, we gave all those resources back. And so, there were literally thousands of servers that were spun up for four or five hours, and then given back. The alternative is, well, you have to have a big cluster that you can go use. And so, it’s so much better to run natively on the cloud and use these cloud native services that are elastic by design and ephemeral. It’s just massively better in a whole bunch of ways. Yeah.
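As a back-of-the-envelope illustration of that elasticity point, here is a toy cost comparison between a short burst of thousands of servers and a standing cluster. The hourly rate and cluster sizes are made-up numbers, not SocGen’s actuals.

```python
# Back-of-the-envelope arithmetic for the elasticity point: thousands of
# servers for a few hours versus an always-on cluster. All numbers are
# illustrative assumptions.
hourly_rate = 0.50                        # assumed $/server-hour
burst = 2_000 * 5 * hourly_rate           # 2,000 servers for 5 hours
standing = 50 * 24 * 365 * hourly_rate    # a 50-node always-on cluster, per year

print(f"burst training run:  ${burst:,.0f}")      # $5,000
print(f"standing cluster/yr: ${standing:,.0f}")   # $219,000
```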

Roman Stanek:

I agree. We actually went the opposite way. We started in the cloud, as a cloud SaaS solution for mostly SMBs and companies that had no ability to manage their own data, and now, as we are going more and more into the enterprise and we get more sensitive data, we are cloud native, so we shouldn’t have to [inaudible 00:24:02] deploy behind a firewall. It’s so much more flexible. So, I see cloud native as the best of both worlds [inaudible 00:24:12]; actually, it feels almost like on-prem in terms of control and so on and sensitivity of data, but it has the availability and flexibility and elasticity of the cloud.

Andy Palmer:

Yeah. I wish I was in your spot. I wish I was going from multi-tenant SaaS to cloud native, because it’s much more efficient. For a lot of our customers that run cloud native, one of the biggest challenges they have is the people on their staffs figuring out how to run these cloud native services. Even though the enterprise has come a long way in terms of understanding DevOps, there are still a lot of folks in these large enterprises that don’t understand how to use the basic scale-out services. And so, it can be almost worse than deploying on-prem sometimes, because they know how to do on-prem stuff. You put them into a bunch of instances in AWS and a bunch of services; maybe they’ve got the skills, hopefully they do, but oftentimes, we end up running those things as a managed service for our customers.

Roman Stanek:

Yeah. Yeah, no, I agree. This will take a decade for everyone [inaudible 00:25:34] the same, at least a full decade for everyone to build the same level [inaudible 00:25:39] on-prem and trust and so on. Before we go, we have some questions here from the audience in the Q&A. What is the highest return on investment for what you do? Where do you see these kinds of examples, where everyone should be doing it? [inaudible 00:26:00] for data mastering or master data management.

Andy Palmer:

Yeah. Well, that’s a great question. We did a study with Forrester across all of our existing… or a large quantity of our existing customers, and they came back with an ROI that was more than 700%. It was just the raw data from our customers, so it was pretty compelling. And most of that return comes from the use of the clean, curated, mastered data in applications that are either spend optimization, direct or indirect spend optimization, making sure they’re getting the best terms every time the company buys something, or cross-selling and upselling, where they’re able to sell more to their existing customers by simply knowing what those customers have bought in the past, making sure they’ve got a good customer master, having a solid master of all the products, and cross-selling effectively.

Andy Palmer:

And then the third area where people are able to get value from Tamr is in reducing risk. So, there are many cases, especially in financial services, where these organizations are obligated to know who their customers are and, in many cases, have thousands of reference systems about who their customers are. And if they can’t definitively say that some new customer who just signed up is who they think it is, and is not some bad actor, they risk many, many tens of millions of dollars in fines. And so, that’s another key source of ROI from cleaning and mastering data.

Roman Stanek:

That’s actually very cool, because that essentially means that whatever we do together in making this available as a service to wider audiences will only increase that return on investment. If the main return is from the use of the data, and you increase the use 10 times or 100 times because more people can consume it as composable analytics, then the investment doesn’t increase that much, but the return can actually go up 100 times.
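A toy calculation of that point, with entirely hypothetical numbers: if the mastering investment is roughly fixed and returns scale with the number of consumers, ROI swings from negative to strongly positive as consumption grows. The 1,000-consumer row lands on the 700% figure Andy cited purely by construction of these numbers, not by derivation from the Forrester study.

```python
# Hypothetical illustration: a roughly fixed mastering investment and a
# return that scales with the number of data consumers.
investment = 1_000_000            # assumed one-time data mastering cost
return_per_consumer = 8_000       # assumed annual value per data consumer

for consumers in (10, 100, 1_000):
    roi = (consumers * return_per_consumer - investment) / investment
    print(f"{consumers:>5} consumers -> ROI {roi:+.0%}")
# prints: -92% at 10 consumers, -20% at 100, +700% at 1,000
```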

Andy Palmer:

Exactly. Well said. This is also true: I started studying AI back in the 1980s with this guy Marvin Minsky, and Marvin taught me two things. One is that no algorithm is useful without enough great data. And the second was that it’s always about the human and the machine working together. And as large companies have adopted AI, I think they’re realizing, now more than ever, that without great data, it’s garbage in, garbage out. And so, cleaning up and mastering your data not only has this incredibly high return on very tactical spend optimization, cross-selling, upselling, and compliance initiatives, but it also enables you to do all kinds of AI projects that, without great data, you couldn’t even attempt.

Roman Stanek:

Yeah. Absolutely. That’s well said: garbage in, garbage out. This was excellent. Thank you. And I hope that everyone learned more about data mastering and data as a service and how they fit together. So, if anyone has any questions, please reach out to the Tamr and GoodData teams. It was good to catch up. Thank you.

Andy Palmer:

It’s great to see you, Roman, and thanks for all your leadership at GoodData. I know we at Tamr are huge fans and we’re following your lead on data as a service. Thanks for everything you do. And by the way, we miss you here in Boston quite a bit.

Roman Stanek:

Yeah. Likewise. Yeah. And everything we do depends on trusted and governed data, so it’s very mutual. Excellent.

Andy Palmer:

It’s great to see you.

Roman Stanek:

Thank you. Thank you all. Likewise. Thank you.

Andy Palmer:

Thank you. Cheers.

Roman Stanek:

Thanks.