datamaster summit 2020

Data Mastering is the Key to the Data-driven Enterprise


Andy Palmer

CEO @ Tamr

Every business wants to become data driven – and the pandemic has made it an urgent necessity. But the expectations of the modern enterprise data and the drag coefficient of legacy IT make it difficult. Learn why data mastering is the key to priming the pump of your modern data driven enterprise and the catalyst for broad digital transformation.



Speaker 2: … Welcome to the 2021 DataMasters Summit, presented to you by Tamr.

Andy Palmer: Hi, my name is Andy Palmer. I’m the chairman, CEO and co-founder at Tamr. I’m really excited to be coming to you from Cambridge, Massachusetts and welcome you to DataMasters. One of the first things that we’d like to talk about today is the need for every single business to become data-driven. Many businesses are beginning to invest in their data initiatives, but only one quarter said that they’ve created an actual data organization.
The human part of the data equation is still a missing piece in the enterprise. A lot of people are increasing their investments in data-driven initiatives and the investments are enormous. We’re talking trillions of dollars that people are putting into their data initiatives as they try to become data-driven and transform their enterprise into a digital native company.
One of my favorite people is Andrew Ng. Andrew has been talking a lot lately about the value of small data. I think what Andrew is trying to get to is that clean, curated, versioned data is really what people operate on and use in the enterprise. We’re not talking about big data any more, we’re not talking about this massive collection of all the data in the enterprise. We’re talking about which data people can use.
In Andrew’s case, he’s interested in using it for AI applications, but that same data that you clean up and use for AI is also incredibly useful to every data citizen across your company. Becoming data-driven is an urgent necessity. We’re seeing this play out in financial services, as new entrants like Upstart have changed the lending and credit approval process dramatically, by being data-driven from the very beginning, in comparison to the traditional banks and credit institutions that use FICO scores, which are a very, very small, limited amount of data.
Over the last two years with the pandemic, this need to be data-driven has just accelerated the pace with which new companies like Upstart are able to compete, head-to-head, with some of the largest banks in the world. Upstart is now worth many, many tens of billions of dollars. Whereas 10 years ago, the company didn’t even exist.
One of the key opportunities that exists for new data-driven organizations is to use the new mastered internal structured data, as the key to unlock all their enterprise data. For many decades, people have been using their intranet search engines, but they really don’t create a lot of value. The clean, curated data organized by logical entity type, that you create when you master your internal structured data, can be used as context to make internal enterprise search, like FAST or SharePoint, much, much more useful.
Also, you can use this same clean, curated, organized data and the entities associated, to organize external data as it’s coming into your organization. Finally, you can actually go out and harvest data from the modern web and bring it in, in the context of these key logical entity types, to make it useful for people inside of your organization. So, this clean, curated internal structured data is a real cornerstone for your entire data strategy.
One of the largest problems and the biggest unknown realities of modern enterprise data is the data silos that exist across most companies. The hundreds of operational systems, all the new SaaS applications that people have adopted, the dozens of data warehouses and the hundreds of data marts result in many, many tens or hundreds of thousands of tables of data that exist everywhere within your organization.
This is a really, really challenging problem, to bring all this data together and resolve it into that small amount of crisp, clean, curated data that your data consumers actually need. Creating clean, curated and versioned data for efficient operations and effective decision-making really starts by using the power of the machine. To take this insane quantity of data that exists across all the data silos and organize that data using machine-driven, human-guided approaches into these key logical entity types that people can relate to.
Customers, suppliers, products, employees. Sometimes, also, there are very industry-specific entity types that people need to use. In the oil and gas industry, oil wells are a classic example. Or, in financial services, securities. There are many, many different types of industry-specific as well as common entity types. At Tamr, we help use the power of the machine to get all of the company’s data organized into these common entity types, so that the data can be monetized productively at scale.
There’s always four sort of key ways that people monetize their data. First is they use it to accelerate growth, grow the top line. Secondly, they use it to optimize spend, figure out how they can spend less, every single year. Third is how to reduce risks. Usually, this is in compliance applications and making sure they know who they’re doing business with. With both their customers and their suppliers and make sure they’re conforming to all their government regulations.
The final way that people monetize data productively at scale, is by improving the efficient operations of their companies. Constantly improving and making sure that they’re running as efficiently and effectively as possible.
After we had a great quantity of systems in place and operating, people started to realize the benefit of aggregating the data and bringing the data together from all the different operational systems.
This is when big data infrastructure emerged in the 1990s. The first data warehouses came out from companies like Teradata and Oracle. Then, an amazing thing happened in the 2000s. There were a whole series of companies, starting with Tableau and Qlik and Spotfire, that began the process of democratizing the access to analytics. Creating lots of analytical data citizens and large companies that wanted to consume data. Oftentimes looking to this next-gen big data infrastructure that had been created a decade earlier.
Over the past 10 or 15 years, what happened is we made a little bit of a mistake. We went down this path of trying to implement data lakes and moving all of our data into HDFS. I think it’s a foregone conclusion now that that was a massive distraction. My partner, Mike Stonebraker and I tried to point this out, back in the mid-2000s and we feel like we’ve kind of been vindicated. Now, we’re at a point where people are moving their data directly to the cloud and not into their data lakes.
The next step could be a swing from aggregated data methods to federated data methods with data fabrics and data mesh. But, one of the most important things to keep in mind throughout this entire history, is we’ve never actually closed the loop and actually delivered clean, high-quality versioned datasets, in context of key logical data entities that matter to data consumers in the enterprise. The time has come for us to do that.
There are four methods for delivering this clean, curated data. In spite of all the data variety and the data silos that exist out there, the first two approaches, rationalization and standardization are great, but they’re sort of a constant drum beat.
You’re always trying to minimize the number of systems you have and you’re always trying to make your systems as consistent as they can possibly be. The place where you can really move the needle is in aggregation and federation. This is where most of the action is taking place, as people move their data out to the cloud.
Let me talk about the first one for a second. Aggregating your data is a necessary thing, but not sufficient in order to solve the problem. We saw this with data warehouses and data lakes. Even when you aggregate your data and you put it all in one place, you still end up, oftentimes, with messy data. It’s not enough to just put the data together in one place.
It’s even worse when you take a federated approach and you leave the data where it is and try and query it where it is, because the data is going to remain in its current state. You don’t have a chance to transform it. So, data federation approaches, data fabrics and data meshes have an inherent problem, in terms of delivering messy data to data consumers.
Our approach, at Tamr, after working with dozens and dozens of large Global 2000 companies has really been to start by making sure you can keep all your source data connected. The new data catalogs that are on the market, from companies like Alation or from all the big cloud vendors, are very, very useful and important tools in order to do this.
The second thing is matching records and mapping schemas across all these different domains. This is where Tamr comes in, this is really where mastering starts. Then, rectifying all the data quality issues, again, another core component of a mastering infrastructure that you need and that companies like Tamr can provide.
The next phase is to have persistent IDs across all the different sources, regardless of your topology. Regardless of how aggregated or federated you are. If you don’t have these persistent IDs, these join keys, across many sources that are consistent and persistent, it’s difficult, if not impossible, to keep the data mastered.
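[Editor's illustration] The matching and persistent-ID steps described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions, not Tamr's method: the talk describes machine-driven, human-guided matching, whereas this example substitutes a simple name-normalization rule as a stand-in matcher. The names `normalize`, `assign_persistent_ids` and `registry` are hypothetical, introduced purely for illustration.

```python
import re
import uuid
from collections import defaultdict

# Hypothetical list of legal suffixes to ignore when comparing company names.
LEGAL_SUFFIXES = {"inc", "corp", "corporation", "llc", "ltd"}


def normalize(name: str) -> str:
    """Lowercase, strip punctuation and drop legal suffixes (illustrative rule only)."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_SUFFIXES)


def assign_persistent_ids(records, registry):
    """Group records that normalize to the same name and give each group one master ID.

    records:  iterable of (source, local_id, name) tuples from different systems.
    registry: dict mapping (source, local_id) -> master ID issued in earlier runs;
              it is updated in place and returned.
    """
    clusters = defaultdict(list)
    for source, local_id, name in records:
        clusters[normalize(name)].append((source, local_id))

    for members in clusters.values():
        # Reuse an ID already issued to any member, so join keys stay stable
        # across runs; mint a new one only for a cluster never seen before.
        # (If members carry different old IDs, this sketch keeps the smallest.)
        known = sorted(registry[m] for m in members if m in registry)
        master_id = known[0] if known else uuid.uuid4().hex
        for member in members:
            registry[member] = master_id
    return registry
```

Running the function a second time with a new source record such as `("billing", "x1", "Acme Inc")` attaches it to the master ID already issued for `"Acme, Inc."`, which is the property that lets the ID serve as a join key regardless of where the data lives.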
Finally, you want to close the loop in curating and stewarding your data, to make sure you’re getting feedback from all your data consumers, as to which data is right, which is wrong. Understanding as much as you can as to why it’s wrong, so that you can improve the infrastructure over time.
This bi-directional feedback for data in the enterprise is something that really doesn’t exist right now. It tends to flow from the sources and out to the consumers and there’s no method for people to feed back on the quality of the data, how it’s organized and whether it’s useful or not. We at Tamr are really committed to enabling large companies to create this bi-directional flow of feedback about their data.
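[Editor's illustration] One way to picture the consumer-to-source half of this feedback loop is a small record of who flagged which mastered record and why, rolled up into a queue for data stewards. The sketch below is an assumption-laden illustration, not a description of any Tamr feature; `Feedback`, `steward_queue` and the `min_reports` threshold are hypothetical names chosen for the example.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Feedback:
    master_id: str    # persistent ID of the mastered record being judged
    consumer: str     # who reported it
    is_correct: bool  # True = "this record is right", False = "this record is wrong"
    reason: str = ""  # optional explanation of why the record is wrong


def steward_queue(feedback, min_reports=2):
    """Return (master_id, reason_tally) pairs for records reported wrong at least
    `min_reports` times, ordered from most-reported to least-reported."""
    reasons = defaultdict(Counter)
    for fb in feedback:
        if not fb.is_correct:
            reasons[fb.master_id][fb.reason or "unspecified"] += 1

    flagged = [
        (master_id, tally)
        for master_id, tally in reasons.items()
        if sum(tally.values()) >= min_reports
    ]
    return sorted(flagged, key=lambda item: -sum(item[1].values()))
```

Keeping the reasons alongside the counts reflects the point made above: knowing that a record is wrong is useful, but knowing why it is wrong is what lets the infrastructure improve over time.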
One of the reasons this is happening now, is the opportunity presented by the cloud and the democratization of AI and ML techniques. These two things together, are creating the opportunity to break all the data silos down, by using machine-driven, human-guided techniques that run natively on the cloud and use Elastic Compute and Persistence in order to organize lots and lots of data very quickly. Get it down into the clean, curated, versioned datasets, relatively small data sets, that are used by the average data citizen in the large company.
The key to making data-driven decisions and to enabling your organization to be data-driven is data mastering. There are many other things that are required and for large-scale transformations, you need something like dbt. For large-scale movement of data from your on-prem systems into the cloud, you need something like Fivetran.
There are many different components to your modern ecosystem. We love this term DataOps, because it very deliberately refers to DevOps. When DevOps emerged back in the early 2000s, to enable software developers to build, test and release high-quality software on a regular basis, it was a completely different approach. They used lots of tools from many different vendors in a best-of-breed approach, in order to make sure that they could continuously build, test and release software.
The same thing is kind of happening in data right now. There’s a new set of vendors that are emerging, that all run natively on the cloud, Tamr being one of them, that enable large organizations to build, test and release data continuously, out to all their data citizens, to be used as a strategic competitive asset. We believe in this best-of-breed approach and bringing together the best tools, rather than buying into a single-vendor, single-platform approach.
We think that you have to, as an organization, get sophisticated enough to be discerning as to which tools are most important and powerful, and to combine them together. We’re committed to working with our partners, Fivetran being a great example, to make sure that all the end points are lined up technically, so it’s as easy as possible to connect these best-of-breed tools.
There are five different stages that we see our customers going through, in terms of their digital transformation. The journey is a long one and it starts with understanding where you are. Oftentimes, many of the companies that we work with are at the very beginning of the process. Really just understanding what they have in terms of assets and kind of where they are. Trying to figure out what it’s going to take to get through the entire journey.
The second type of companies that we see, we call them Explorers, are folks like one of our customers, Newmont, that are figuring out exactly what they want to do for their first initiatives. How it’s going to function and what are the core capabilities that they might be missing? The third stage is when the companies are actually trying to do real projects and real activities. A great example in our customer base is the Department of Homeland Security. They’re really running great projects that are being successful, but it’s still very early.
The fourth group of customers that we see are those that are transforming their enterprise using data and digital techniques. Johnson & Johnson is a great example of this. There’s many projects going on at J & J that are working really, really well. They’re beginning to do these kinds of large data projects at extreme scale.
Finally, you have the Disrupters. A great example, in our customer base, of one of these, are the folks at Capital One. Capital One has been using data as a strategic competitive asset for many decades. They are truly leading the industry, in terms of their ability to use data as a strategic competitive weapon.
A key thing we see across all of the projects that we run, is you have to focus on delivering real business outcomes. You don’t want this to devolve into a boil-the-ocean IT project. At many of our customers, whether it’s the Department of Homeland Security, Santander, ThermoFisher Scientific or Maersk, we’ve worked very closely with the project teams, to make sure that we were delivering real business value in weeks and months. Avoiding projects that had life cycles that were measured in quarters or years.
At Tamr, we work with lots of different companies, across many different industries. But, what we see as a common theme, is that all these companies are trying to become data-driven. We’re committed to doing whatever it takes to support those companies in their data-driven journey. At Tamr, our vision is a world where every business is data-driven. This is the way.
Our mission at Tamr is to accelerate our customers’ digital journey, by enabling them to continually curate and consume clean, curated, versioned data. It’s a very challenging mission and it requires lots of human and technical innovation, but we work hard, every day, to make sure that we can do that with our customers. We’ll do whatever it takes to make them successful.
I really want to thank all of you for joining us at DataMasters. We’re thrilled that you’ve decided to spend some time with us and attend so many great sessions. All of the people involved in these sessions care about data deeply. Many of them specifically focused on the large enterprise. I hope that you get as much out of these sessions as we will. Also, please don’t hesitate to engage with us at Tamr and/or any of the participants.
We all care about data. We’re all interested in the same things. Facing a lot of the same challenges. At the core, that’s what DataMasters is really about, trying to share the experiences that we’ve had and learn from each other’s mistakes and successes over the last 10 years. Really, really looking forward to the upcoming sessions and thank you again for joining us …