Our big inaugural event, DataMasters Summit, wrapped this month. Hundreds of people registered to listen to more than 25 original presentations by some of the most important and influential voices in data. We featured speakers from Government, Financial Services, Life Sciences, and other industries, as well as Tamr team members, company founders, cloud partners, and investors.
We have hours of original content for you to dig into and enjoy, all on demand.
But if you’re looking for a good place to start, we’re highlighting some of the most important questions asked and answered at DataMasters in this two-part series. The following remarks have been edited for content, clarity, and brevity. To watch every video from DataMasters, you only need to enter your information once.
Why don’t traditional MDM solutions work?
Michael Stonebraker, Turing Award winner and Co-founder, Tamr: Let’s look first at schema integration. The traditional elephants in the data mastering space almost all put a table up on the left-hand side of the screen and a table up on the right-hand side of the screen, and allow you to draw lines in the space between pairs of matching attributes. Notice that this is all human powered. So imagine doing this at a scale of many millions…carpal tunnel syndrome for sure. And it’s going to take absolutely forever. So anything that is human powered is guaranteed not to scale, and you should reject it out of hand.
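To make the line-drawing task concrete, here is a toy sketch of naive schema matching by attribute-name similarity. The column names and the use of Python’s difflib are illustrative assumptions, not any vendor’s method:

```python
from difflib import SequenceMatcher

# Two source schemas to align -- the mapping humans produce by
# drawing lines between pairs of matching attributes.
left = ["customer_name", "phone_number", "zip_code"]
right = ["cust_name", "telephone", "postal_code", "zip"]

def best_match(attr: str, candidates: list[str]) -> str:
    """Pick the candidate attribute whose name is most similar."""
    return max(candidates,
               key=lambda c: SequenceMatcher(None, attr, c).ratio())

for a in left:
    print(a, "->", best_match(a, right))
```

A similarity metric this naive is easily fooled (ambiguous names like `zip_code` vs. `postal_code` vs. `zip` can score in surprising orders), and a human still has to verify every proposed line, which is exactly the scaling problem Stonebraker describes.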
What about entity consolidation? Well again, what do the elephants do? The traditional solution is to use a rule system to do what’s called match merge. Match means taking two source records and trying to decide if they’re the same. For example, a particular rule might be: if the edit distance between the names of two content items is less than a certain amount, then they’re the same title. So, that’s one rule. Do this over and over, and write as many of those rules as you need to.
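An edit-distance rule like the one Stonebraker describes can be written down in a few lines. This is a generic illustration, with a made-up threshold and field, not how any particular vendor implements match merge:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

def same_title(a: str, b: str, threshold: int = 2) -> bool:
    """One hand-written match rule: two titles match if their
    edit distance is at most a fixed threshold."""
    return edit_distance(a.lower(), b.lower()) <= threshold

print(same_title("The Matrix", "The Matrxi"))  # transposed letters -> True
print(same_title("The Matrix", "Inception"))   # -> False
```

The point of the example is the failure mode: every new data quirk (abbreviations, missing words, swapped fields) needs another such rule, which is how an organization ends up with 200,000 of them.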
So, why does this fail? Well, I’ll point to an unnamed media company. They wrote, believe it or not, 200,000 such rules in a homegrown rule system that two people have been working on for 13 years. So think of them as into this for at least $5 million over a long period of time. They’re finding this totally unmaintainable. What does this mean? Well, if they have to add a new data source, it takes forever. If they have to change anything, it takes forever.
Rule systems just plain don’t scale. This is why it took two engineers 13 years to write 200,000 rules. The minute you go over about 500 rules, life gets really, really, really hard. So, just remember: if you have a problem that’s going to require more than 500 rules, you’re in deep yogurt.
Why is Tamr’s machine learning approach to data mastering a better solution than a traditional rules-first approach?
Anthony Deighton, Chief Product Officer, Tamr: At its core, the idea of Tamr is to use machine learning for a specific purpose: to apply it to the difficult and challenging tasks of data mastering. Our mission is simple and clear: to enable organizations to quickly and easily bring together their siloed, disparate data and interactively deliver tangible business value as they become data-driven businesses.
The challenge with the traditional rules-based approach to data mastering is that it’s rules first, value later. Not only is it incredibly time consuming, but it relies on the dark art of a few developers to generate and create these rules. As a result, the rules are extremely brittle, and they tend not to cover the full scope of the data. So as data comes into the system, we’re forced to create a series of rules to try to cover every possible corner case in that data. And heaven forbid new data shows up, or data changes or shifts over time: the rules break and have to be recoded. So what we find is that traditional rules-based projects take a long time and typically aren’t very accurate. It’s no surprise that, in general, these approaches are not particularly successful and people don’t particularly enjoy working on them.
With a machine learning based approach, we turn that equation on its head. We allow you to deliver value extremely quickly with high accuracy. We do this by training the computer to do the hard work: building a machine learning model that looks at the data and figures out the best way to bring that data together, to categorize it, and to figure out what data belongs with what data. So this machine learning based approach offloads the hard, time-consuming work to the computer. In turn, this frees the smart people in your organization, the ones who know and understand the data and want to answer the important business questions, to focus their unique talents on the corner and edge cases where machine learning may have trouble.
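As a toy-scale sketch of the idea (not Tamr’s actual model; the fields, the labeled examples, and the hand-rolled logistic regression are all illustrative assumptions): instead of hand-writing rules, train a classifier on a few expert-labeled record pairs, then let it score new pairs.

```python
import math
from difflib import SequenceMatcher

def features(rec_a: dict, rec_b: dict) -> list[float]:
    """One name-similarity feature per shared field."""
    return [SequenceMatcher(None, rec_a[f].lower(), rec_b[f].lower()).ratio()
            for f in ("name", "city")]

# A few expert-labeled pairs (1 = same entity, 0 = different).
training = [
    (({"name": "Acme Corp", "city": "Boston"},
      {"name": "ACME Corporation", "city": "Boston"}), 1),
    (({"name": "Globex Inc", "city": "New York"},
      {"name": "Globex, Inc.", "city": "New York"}), 1),
    (({"name": "Acme Corp", "city": "Boston"},
      {"name": "Zenith Ltd", "city": "Denver"}), 0),
    (({"name": "Globex Inc", "city": "New York"},
      {"name": "Initech", "city": "Austin"}), 0),
]

def train(pairs, epochs=500, lr=0.5):
    """Fit a tiny logistic-regression matcher by gradient descent."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for (ra, rb), label in pairs:
            x = features(ra, rb)
            p = 1 / (1 + math.exp(-(sum(wi * xi for wi, xi in zip(w, x)) + b)))
            g = p - label                     # gradient of the log loss
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b

def match_prob(model, ra, rb):
    """Probability that two records refer to the same entity."""
    w, b = model
    z = sum(wi * xi for wi, xi in zip(w, features(ra, rb))) + b
    return 1 / (1 + math.exp(-z))

model = train(training)
print(match_prob(model,
                 {"name": "Acme Co.", "city": "Boston"},
                 {"name": "Acme Corp", "city": "Boston"}))
```

When the model is uncertain (a probability near 0.5), that pair is exactly the kind of edge case the text says should go back to a human expert; that feedback loop is what distinguishes a learned matcher from a static rule set.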
What are some key components of a strategy to create and push an “MDM as an asset” program across industries and organizations?
Young Kim, Group Chief Digital Officer, SK Holdings: We (SK Holdings) deal with companies that have different maturity levels for looking at data. I say “you must build the horse before you can ride into the sunset.” We start with the process before we go into each holding company.
SK Holdings has built a technical arm that helps and enables companies to drive data transformations. We use a set of principles to handle different phases. We need shared technology to help each company be successful, and a consistent process that is flexible enough to meet each company where they are.
Relationship building is key between the (data) strategy group and business leaders. To help, we install a dedicated team onsite to make sure all groups are in-sync and collaborating. We begin by asking each company how they want to start this journey, and then we overlay that with our framework so that they have a say and can apply our best practices.
Some people are skeptical, while others truly believe data is the most important asset. If they don’t have the right data, we help them put it together to move their business forward. Value comes when you weave it all together.
Who should be the main sponsor for the “data as an asset” paradigm shift in the organization?
Barkha Saxena, Chief Data Officer, Poshmark: Everyone wants to make decisions based on data assets. But there’s a gap between what the business wants to do and what the data can support. Data leaders take a leading role in inspiring the organization to close that gap. The relationship between data and the business is a partnership, not a support or service function.
Vertical data teams within the business should look like they’re more a part of that (business) team than the data team. Feel their pain. Know the business: not just the department leader’s questions, but the questions of everyone on that team. When business teams see this model of behavior work, they are very open and sharing. Whether you’re building data products that range from analysis to sophisticated ML models, it all starts with the top-level business challenge.
Kathleen Maley, Former Head of Consumer & Digital Analytics, KeyBank: Data leaders take a forward role in the business discussions. Necessity is the mother of invention. Those who were studying the data could find leakage in business processes that others wouldn’t find. When revenue is impacted, those are easy things to get behind and measure, so bring these ideas to the business.
Over the next three to five years, what do you think the biggest data opportunities are, both on the business side and the technical side?
Salema Rice, Chief Data Officer, Geometric Results: Well, I think that we’re going to see more big data. One thing that COVID showed us is, when my governor got on TV and started talking about big data and data sets and how much data he had, I was like, “Look at this!” I think that people are really looking at data as the differentiator in every industry there is. So, I think that we’re going to see data embedded everywhere. One of the things that really fascinates me about the people, process, and technology in our industry is that we’ve been in business for probably 25 years. And in those 25 years, it really was about putting the right processes in place, putting the right expertise in place to drive those processes. And then, the end result was getting data out that we could do something with.
But for the first 24 years (we’ll say 23 years), that’s where it stopped. And so for us, it was taking all that data and actually turning it into something. The question is now, with this data, what information are you making out of it? And that’s really how we’ve been able to turn the corner on the future of this industry: by taking the right process, the right talent, and the data, and building data science and data mastering on top of that data to ultimately get to the best results.