Big Data London: Getting Started with Data Products

Rather read the transcript? Dive right in.

Welcome, everyone. My name is Anthony Deighton. I am with Tamr, where I run our data products business. In the next half hour or so I'm going to take you through our introduction to data products and what the term means, and if the internet gods look favorably upon me, we'll also show you a live demo. So we'll see how that goes.

We'll start with why we're all here. Ultimately, we're all here to drive business performance for our organizations. And the good news is that when we invest in data and in making data broadly available within our organization, it helps drive business performance. McKinsey did a study where they looked at all sorts of organizations around the world and divided them into two groups: high performing organizations that had high growth rates in EBITDA and revenue, and those that did not. Then they looked at their data practices, how they invested in data and how they used data. There were two key findings that I pulled out of this report that I thought were particularly interesting.

The first is that they invested in a C level executive in their organization that cared about data and decision making. And the second is that they made data broadly available in their organizations to frontline employees who needed to use that data to make better decisions. Two big ideas here, invest in a C level executive and make data broadly available.

‍

Taking the first one for a moment: in a sense, this is the easiest piece of the story. The role of the chief data officer is much more prevalent today, and there are many more CDOs in C level positions around the world. That's even true in Europe, where almost 50 percent of organizations have invested in a CDO.

But more important, the role of the CDO has changed fairly significantly. We used to see the CDO as somebody who worked in IT and whose job was to lock down data; we'd make the joke that they were a bailiff or a lawmaker, that their role was really one of control. Increasingly, we see CDOs inside these organizations as enablers: people whose job it is to build data driven decision making capability in the organization, and who are really there to drive strategy. For a lot of organizations that have invested in a C level executive, the idea really is: how can we enable this organization with data so that we can actually drive business performance?

Again, two ideas. The first is to invest in a C level executive, and the good news is that we have that. So let's talk about the second big idea: how can we make data broadly accessible inside our organizations? This is where we turn to the idea of data products.

So I thought it would be incumbent on me to give you a definition of what a data product is. If there's one thing that's true about Big Data London this year, it's that walking the halls, we hear a lot of conversations about data products. In fact, I believe there are even books being written on the topic.

So I think it's fair that I take a moment and give you at least my definition, or I can say Tamr's definition, of a data product. From my perspective, a data product is clean, curated and continuously updated data that's consumption ready. There are a couple of important ideas here. The first, and this is a theme we'll come back to in the course of this presentation, is the idea of having consumption ready data.

So this isn't about source data governance and source data management. Not that there's anything wrong with that; there are many companies out there that are thinking a lot about how to manage sources. But we're thinking about the other side of that conversation, which is how we bring this data together into consumption ready data sets.

I will talk more later about what I mean by consumption ready. The second idea is that this is clean, curated data, so it's the best data your organization has about a business topic area, and that it's continuously updated. The one thing we can say confidently about data is that it's not static; it's always changing.

So this is our definition of a data product. I'm going to spend a lot of time on this topic, and we'll come back to this definition in a moment. But let's step back from the idea of a data product as a consumption ready, clean, curated data set and talk about the data itself. My view is that cost effective management and storage of enterprise data is a solved problem.

This was certainly not true ten years ago, and probably not even true five years ago. But today, if your business challenge is how to cost effectively manage large volumes of data, out there on the floor are a myriad of really fantastic solutions for doing this inexpensively in the cloud, with all kinds of magical bells and whistles built in.

And I've worked in data and analytics for many years. And every year somebody comes to me with a sales pitch that goes something like, we're going to take all of your enterprise data and put it into brand X, whatever that thing is. And I'm here to tell you that your enterprise data is never consolidated.

‍

You are never going to get all of your data in one spot. And it's probably not even a good strategy, because as soon as you put your data in one spot, you do an acquisition or you change your data strategy, and now you have another data silo that you need to put into this central warehouse or this central place. So managing data cost effectively is a solved problem, but the data is never consolidated. You're not going to end up with a single data warehouse. You're always going to have lots of sources of data. And that's a good thing, because you want to enable and empower your organization with the tools and technologies they need for managing data.

And lastly, I throw out this idea of what I call Conway's Law for data, which I'm trying to brand Deighton's Law. So if you guys could help me with that, that'd be great. I'm trying to get everyone to call it Deighton's Law because I've always wanted my own law. Conway's Law was an idea that came out of the 1960s in software development.

And it's a very simple idea, which is software reflects the organization that built the software. So if you have a front end team and a back end team building software, Conway's law says you will end up with a front end piece of software and a back end piece of software and an API that communicates between them, right?

‍

And for those of you in the software development industry, you'll see Conway's law all over the place. So you'll see a piece of software, you'll see how it's architected. And then you look at the organization that wrote the software and shockingly they're organized the same way. Deighton's law takes that simple idea and applies it to data.

‍

And my point is that the data in your organization reflects your organizational model. If you've organized your business by product line, you will have databases organized by product line. If you're organized geographically, with a European organization and a U.S. organization, you will have databases that are organized geographically.

And, oftentimes, you have both. You'll have geographic constructs and product constructs and many others, and you end up with this proliferation of data. Storing data is a solved problem, but you're going to end up with lots and lots of databases and data sources.

Let's turn our attention to the other side of the coin: the consumption endpoints. Here again, we see that data consumption is a solved problem. If you want to create bar charts and line charts, there are a hundred different tools for doing that, and they're fantastic. It's never been easier to create bar charts and line charts.

I worked in this space many years ago, and the tooling here is fantastic. The stuff really works. And there are also new endpoints: new ways of creating machine learning models that are really easy and accessible and fun to work with. Consumption endpoints have never been more important, and enabling your organization with a great tool for creating dashboards and analysis and reports is a fantastic and useful investment.

So we have this really interesting problem. Storing data: cheap and easy. Using data: fantastic, great tools for that. Then how come nobody trusts the data in their dashboards? How come, after having spent all the time and energy on managing data cost effectively and creating great consumption endpoints, you click on your dashboard, look at it, and go, this data is obviously wrong?

I can see they have 13 copies of Microsoft, and I know there's only one Microsoft. Or I look at this and I see we're missing basic information like the address or the phone number. Or you look at your dashboard and ask, did you include the data from this division over here, or from our European subsidiaries? Oh, no, we didn't do that.

So people don't believe and trust the data in their dashboards. And it's not for lack of tooling. This show floor is full of companies that have little pieces of software designed to solve small portions of this problem: to move data from here to there, or to put a metadata layer over there. So there's a lot of data management tooling.

And I would suggest to you that this is exactly the state of the world in the 1990s in the application software space. I began my career at a company called Siebel Systems. Siebel was a CRM application, not unlike salesforce.com. And today, if I told you that you needed a system for managing contacts, opportunities, and accounts, you would say, we'll buy Salesforce, right?

What you would not say is I'm going to buy a whole range of different tools and knit together my own CRM application. And yet when we're faced with this challenge of how to get people to believe and trust in our data, what do people do today? They buy a range of different tools and they try to knit them together to build a solution from scratch.

‍

It doesn't work in enterprise software, in the operational application space. Why should we think it works in the data space? That's why our view is that what the world needs is an application approach to managing data, and that is what we call a data product. A data product is organized around the key entities that matter to your business.

Who are my customers? Who are my suppliers? What parts do we make? What products do we use and ship to our customers? What suppliers do we have? Who are our partners? Et cetera. Your data is organized into a series of data products, and this enables a range of different user behaviors and outcomes that come from organizing your data around these key entities.

It allows you, for example, to have insights about how that data changes and moves over time. It allows you to curate that data, so that users can provide feedback and tell you what's right and what's wrong. It allows you to use machine learning to bring that data together and match it across these different entities, to apply data quality rules, and even to link in third party outside data.

So to connect our messy sources, which are never going to get clean and never going to be in one place but are cheap and easy to use, with our beautiful, fantastic, easy to use consumption endpoints, the way we solve the challenge in between is with a data product approach.

So, coming back to the definition I started with: a data product is a clean, curated, continuously updated set of data that's consumption ready. That's how we bridge the divide between these sources and these consumption endpoints. There are a couple of key capabilities in data products that are really necessary if you're going to have a data product strategy.

The first is that data products need to be discoverable and usable by end users, with a mechanism for users to provide feedback into the data. Users should be able to go into the data and say, this is right, this is wrong, I have feedback on this, etc. We also need the data to be accessible, both to users and to machines, so we need to publish APIs for that data.

It needs to be aligned to domain specific schemas, organized around the key entities that matter; I'll come back to this in a moment. It needs to bring in third party sources, because the best version of your data likely doesn't exist inside your enterprise. It's actually coming from third party data.

We need to be able to validate this data against third party standards. When we talk about something like an email address or a physical mailing address, we need to be able to validate and standardize it. And it needs to be version controlled. Just like software products are version controlled, a data product needs to be version controlled, so you can ship new versions over time.
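
As a rough illustration of the "validate and standardize" capability for a field like an email address, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the function name `standardize_email`, the record layout, and the deliberately simple pattern are hypothetical, and a real data product would validate against an external standard or service rather than a short regular expression.

```python
import re

# Deliberately simple pattern, for illustration only: one "@", no spaces,
# and at least one dot in the domain. Real-world validation is stricter.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def standardize_email(raw):
    """Return a trimmed, lowercased email, or None if it fails the check."""
    if raw is None:
        return None
    # Assumption: lowercasing the whole address is acceptable for matching.
    candidate = raw.strip().lower()
    return candidate if EMAIL_PATTERN.match(candidate) else None


# Hypothetical source records with messy email values.
records = [
    {"id": 1, "email": "  Jane.Doe@Example.COM "},
    {"id": 2, "email": "not-an-email"},
    {"id": 3, "email": None},
]

# Keep every record, but replace the email with its standardized form.
cleaned = [{**r, "email": standardize_email(r["email"])} for r in records]
```

Invalid values become `None` rather than being dropped, so the record still counts toward the data product and the gap stays visible to the people giving feedback on it.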

‍

And we need to use machine learning as the underlying platform for bringing data together across sources and linking those sources. These, in our view, are the key capabilities of a data product. So how does this manifest itself? I'm going to show you a live demo in a moment, but here are a couple of screenshots.

I'm going to walk you through, conceptually, what a data product should look like in your organization. First of all, your goal should ultimately be that it's comprehensive: it should reflect all of the data in your enterprise. It should be dynamic, changing so that every day you come in there's new data showing up, new sources being added, and potentially new entities being added. And it should be consumption oriented, focused on how you intend to use the data, not where it's stored and managed. So when you come into a data product environment, you would start with all of the key entities that matter to you as an organization.

As you can see in the screenshot, each of these tiles represents an entity that matters to your company. So it could be as simple as something like your customers, but it could be more industry specific. If you're a global shipper, it could be the ports that ships come into and out of or containers that you put items in.

‍

And this is really your jumping off point for how your users would engage with the data. If they wanted to know where the best data about your customers is, they would know that there's always a tile here, which allows them to see that data and then provide feedback. When they take a specific data product and drill down on it, they should get a consolidated list of all of the best data behind that entity.

So this is almost like a spreadsheet of the best data. This is not where you're going to be doing your analytics; you're going to do that in your consumption endpoint of choice. But this should be a view where you can see all of that data and provide feedback on it. And then when you drill down on a specific entity, you should have a 360 view of that entity, which links it across all of your different data products.

So these three key ideas are the central point of a data product: the tiles showing all the data products in your organization, a comprehensive list of the data, and a 360 view. Now, talking about organizing by consumption endpoint, a point I've made a couple of times now: in a way, this is one of the key things to take away from this presentation. When you think about data products, don't think about sources. Think about consumption. Think about the endpoint data that you care about as an organization. Again, something simple like customers makes a lot of sense.

But if you're in the financial services space, you might be thinking about market data linkage. Or if you're in the health care space, you might be thinking about providers or patients. So think industry specific here. And when you do think about sources, on the other side of the coin, think about bringing these sources together and linking them through a common ID.

In my simple example here, you see we have three different business divisions, and we've linked all of the data in those divisions through to a Tamr ID. You can do the same thing for external sources, so imagine bringing in data from CapIQ, FactSet, whatever, and linking those through. And you could even imagine linking in data that's unstructured, so internal documents or unstructured data on the Internet. The idea here is that the linchpin of your data product strategy is a reference, an ID, that links these sources through to the consumption endpoint, because this allows you to bring together this view of your data.
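
To make the idea of a common linking ID concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the talk describes machine learning doing the matching, whereas this sketch swaps in a crude name normalization, and the names `match_key`, `assign_entity_ids`, the `entity_id` field, and the sample records are all hypothetical and not part of any Tamr API.

```python
import re
from itertools import count

# Assumption: a small, hand-picked list of company suffixes to ignore.
SUFFIXES = {"inc", "corp", "corporation", "ltd", "llc", "gmbh", "co"}


def match_key(name):
    """Crude stand-in for ML matching: lowercase, strip punctuation and suffixes."""
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(w for w in words if w not in SUFFIXES)


def assign_entity_ids(records):
    """Give each record an entity_id shared by all records with the same key."""
    next_id = count(1)
    ids_by_key = {}
    linked = []
    for record in records:
        key = match_key(record["name"])
        if key not in ids_by_key:
            # Mint a new ID the first time this entity is seen.
            ids_by_key[key] = f"E{next(next_id):04d}"
        # Source records are left untouched; the ID is added alongside them.
        linked.append({**record, "entity_id": ids_by_key[key]})
    return linked


# Hypothetical records from three business divisions.
records = [
    {"source": "division_a", "name": "Microsoft Corp."},
    {"source": "division_b", "name": "MICROSOFT"},
    {"source": "division_c", "name": "Microsoft Corporation"},
    {"source": "division_a", "name": "DHL Express"},
]
linked = assign_entity_ids(records)
```

The design point mirrors the talk: nothing in the sources is deleted or merged. Each record simply gains a reference that downstream consumption endpoints can join on.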

‍

In this example, we put customer at the center, and it becomes the reference point for all the data in your organization. This is something Tamr has done across a number of different customer examples. Without going through this in great detail, I would point you to tomorrow at 12 o'clock in the modern data stack theater, which is over there.

Santander will be coming and sharing how they've built a data product strategy around their customer data. So if this is interesting, I would encourage you to go hear how they've done that. This approach, looking at data from a data product perspective and thinking about consumption first, is a very different way of coming at the problem of sharing data with your organization.

Traditionally, we've thought about this as a very top down, IT led, governance based, rules based approach to managing data. This is equivalent to saying we're going to lock down our data, not let anyone touch it, and that's the mechanism by which we're going to create excitement and engagement in our organization around data.

It never works. Typically, you focus on trying to manage tons and tons of sources. You get IT to lock everything down. And the result is, just like in Jurassic Park where life finds a way, data finds a way: people go outside the system to be able to use the data. When you take a data product approach, it's an empowering and enabling strategy of bringing people to the data and focusing on these consumption endpoints, which is much more effective.

So with that, let's take a look at what this would look like in real life. As I indicated, when you come into a data product, what you should see is a series of tiles that represent all of the data products in your organization. This is obviously a demo environment. You can see customer data, provider and patient data, householding data, healthcare data, etc.

When I click on a specific entity, I see a table with all of the best data that we have around, in this example, customers. One of the really valuable things about organizing your data as a product is that it allows you to watch that data over time. You can see in this simple demo example that we started with 2000 records, and we've discovered that there's only roughly half that number of companies in that data.

You can see all of the different sources. We've brought together six different sources, and you can actually watch and monitor the fill rate across the key fields that matter to you as an organization.
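
A fill rate like the one mentioned here can be computed very simply. The Python sketch below assumes that fill rate means the fraction of records with a non-empty value for a field; the function name `fill_rate`, the field names, and the sample records are hypothetical.

```python
def fill_rate(records, fields):
    """Fraction of records with a non-empty value, reported per field."""
    total = len(records)
    rates = {}
    for field in fields:
        # Assumption: None and the empty string both count as "not filled".
        filled = sum(1 for r in records if r.get(field) not in (None, ""))
        rates[field] = filled / total if total else 0.0
    return rates


# Hypothetical consolidated customer records.
customers = [
    {"name": "DHL Express", "address": "1 Example Way", "phone": ""},
    {"name": "Microsoft", "address": None, "phone": "555-0100"},
    {"name": "Acme", "address": "2 Sample St", "phone": None},
    {"name": "Globex", "address": "3 Demo Rd", "phone": None},
]
rates = fill_rate(customers, ["name", "address", "phone"])
```

Tracking these numbers each time the data product is refreshed is one way to watch whether the key fields are getting more or less complete over time.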

‍

Again, you can also see how that data links. Here you can see all of the different sources of information, and you can see, for example, that there are some sources which are completely unique, only adding to the data we have about, in our example, customers. And there are some that are complete overlaps.
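
One way to picture this unique-versus-overlap view is to count, for each source, how many entities only it provides and how many it shares with other sources. The Python sketch below assumes records that already carry a shared `entity_id`; the function name `source_contribution` and the sample source names are hypothetical.

```python
from collections import defaultdict


def source_contribution(linked_records):
    """Per source: entities only that source provides vs. entities it shares."""
    # Collect the set of sources that mention each entity.
    sources_by_entity = defaultdict(set)
    for record in linked_records:
        sources_by_entity[record["entity_id"]].add(record["source"])

    summary = defaultdict(lambda: {"unique": 0, "shared": 0})
    for sources in sources_by_entity.values():
        kind = "unique" if len(sources) == 1 else "shared"
        for source in sources:
            summary[source][kind] += 1
    return dict(summary)


# Hypothetical linked records from three operational systems.
linked_records = [
    {"source": "crm", "entity_id": "E1"},
    {"source": "erp", "entity_id": "E1"},
    {"source": "crm", "entity_id": "E2"},
    {"source": "billing", "entity_id": "E1"},
    {"source": "erp", "entity_id": "E3"},
]
summary = source_contribution(linked_records)
```

In this made-up data, `billing` contributes no unique entities, which is the "complete overlap" case: every entity it knows about is already covered by another source.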

‍

So if you're thinking about removing a system from your environment, this would allow you to see how much each of your operational systems is contributing to your view of your customers. If I drill into a specific example, in this case a customer, you can see the 360 view that brings together all of the information we know about, in this case, DHL Express.

We assigned this customer a Tamr ID, which becomes a referential and linking point for our relationship with, in this case, DHL. That links to each of the underlying sources that provide information about DHL. And we've enriched our relationship with DHL with external third party data, in this example from Boldata, so we can get its global ultimate parent, we can get SIC codes, we can get all this external information.

So the data's actually better than what we shipped in. That's a quick overview. I am almost out of time, so I'll just leave you with where we're focused. Tamr as an organization is focused on building a framework that allows organizations to manage data as a product and to manage data products.

If this is interesting and it aligns with your company strategy, we're at booth 348, which is in the back left corner from where I'm facing, back right corner from where you're facing. I appreciate everybody's time and attention. If there are any questions, there's a person with a microphone, and I'm happy to take questions in the remaining time.

Question & Answer Session

‍

Audience Question:

Analytics engineer from Adoptivist. I just wonder how you implement data contracts. Are they visible in the tool? How do you do that?

Anthony Deighton:

Yeah. So the question is around data contracts. The way to think about that is this linkage point at the center. Under the covers, we use machine learning to discover the entity relationships in the underlying sources and then mint this ID, which becomes the central linking point across all of that data. What we're not doing is telling you to go back to a source and delete records or merge records.

Obviously, many people will do that. But the key point here is that now we have this reference point where we can all agree that these 1, 2, 5, 100, whatever, these underlying sources contain the same piece of information about, in this example, a customer. I think we're almost out of time. Maybe one more question, if there are any.

Thank you everyone for your time, and enjoy the show.

‍
