DataMasters Summit 2020

The Future of Data Mastering At Scale


Anthony Deighton

Chief Product Officer @ Tamr

Creating cloud-native data mastering solutions that scale. Hear from Anthony Deighton, Chief Product Officer of Tamr, on the product innovations from this past year that accelerate business outcomes, and on the future of data mastering at Tamr.

Transcript

Speaker 1:
DataMasters Summit 2020, presented by Tamr.

Anthony:
Welcome to beautiful and historic Fenway Park here in Boston. I’m Anthony Deighton and I am so excited to share with you some of the amazing work that Tamr has done innovating and driving our product agenda. So let’s head inside and get into it. Good morning, good afternoon, good evening, depending on where you’re joining us from, of course. We are so pleased that you’re able to join us today. I’m Anthony Deighton, Chief Product Officer here at Tamr, and on behalf of the entire organization, I am so excited to share an update on our product direction and strategy. Now, there are three things I want to cover today. First, I want to give you a view into how our industry is changing and why these changes are moving our unique approach to data mastering at scale into the mainstream. Second, I want to share with you the amazing innovation we’ve delivered this year that’s helping our customers drive business outcomes more quickly and easily than ever before.

Anthony:
And lastly, I want to share a view into the future investments that Tamr is making to keep our lead in the market. Now, I wanted to start with my personal journey to joining Tamr. As many of you may know, I spent many years at Qlik helping to disrupt the analytics market, and I learned two things in that time. The first: as we worked with customers and gave them new ways of visualizing and analyzing their data, every time we took the covers off and looked at their enterprise data, it was a disaster. There needed to be a new approach to helping people bring their data together, understand it, and get value out of it. The second thing I learned was about the nature of disruptive technologies.

Anthony:
When Qlik started in the early 2000s, what we were really disrupting was traditional thinking. There was a prevailing wisdom about the way you analyze data: it involved ETL, building cubes, and storing cubes on disk. Qlik started with a new approach and we disrupted that market. And when we launched QlikSense, we took that disruptive desktop approach and brought it to the enterprise and to the cloud. Very much the same thing is going on in the data management space. There’s a traditional wisdom about the way you master data, and we’ve already gone through the cycle of a desktop approach to this challenge. Today, there’s the opportunity to bring a cloud-first enterprise approach to data mastering at scale. When we see a disruption in the market like this, there’s always an exogenous platform shift which drives the innovation. In the case of Qlik, it was 64-bit technology. In the case of Tamr, as we will see soon, it’s really about the cloud and cloud scale. The second is that we move from the impossible to the possible: things which previously seemed impossible become possible.

Anthony:
And lastly, we move from IT science projects to business-driven outcomes. Now, this fits into a broader trend of the consumerization of enterprise software. In our personal lives, technology works for us. We’re in control. Innovation constantly makes our lives better. But when we come to work, we suffer through broken systems, siloed data, and inaccurate decision-making. Take the example here on the screen: photos. This is a very personal example, because five or ten years ago I had a highly manual process for organizing my photos. I used to take photos on this thing called a digital camera, which was purpose-built for taking pictures. And I would take the memory cards out, stick them in the computer, and try to organize my pictures manually into folders.

Anthony:
And inevitably it was a disaster. I would forget when we took the pictures; there’d be multiple memory cards, multiple cameras. But today, whether I use Google Photos or Apple’s iPhoto, all of my photos are brought together. They’re stored in the cloud, they’re automagically organized, and I can share them easily. And all of this is done with the underlying power of machine learning, so that when I search these photos, as I’ve done in the screenshot here for New Zealand, magically photos of, or with text about, New Zealand show up. And the question is, is this true for our enterprise data? Of course not. And think about the volume of data: my photo collection is a lot, but it’s absolutely nothing compared to what we see in the enterprise. As organizations move to best-of-breed SaaS solutions, what we see is that not only is there a huge volume of data, but every source is unique.

Anthony:
And so the data has tremendous variety, and it’s arriving and changing at unprecedented velocity. But I’m probably not telling you, as an audience, anything new here. You’re all well aware of the challenges of data in the enterprise. What is new is that we are at a unique moment in time as it relates to the opportunity to change and get value out of the data we’re collecting as a business. First, the primacy of business transformation as a key business strategy: at its core, every business is a data business. The second is the movement of data and compute to the cloud, which changes the economics of data-driven businesses. Not only is every business a data business, but every business can afford to get value out of its data. And third is the invention of machine learning algorithms applied specifically to the challenge of mastering data.

Anthony:
At its core, the idea of Tamr is machine learning for a purpose. There are many companies and technologies out there working on general-purpose machine learning algorithms. At Tamr, our aim is to apply machine learning to the difficult and challenging task of data mastering. And when these three trends come together, it unlocks a massive opportunity in the data management space. It allows every organization to move from big-bang waterfall projects to an agile and iterative approach to getting value out of your data. It allows us to move from top-down, developer-led projects to ones which engage the entire organization, and from handcrafted, human-written rules to elegant, speedy, scalable, machine-executed models. And the result is that the market is much bigger: just as user-driven BI opened up a whole series of new use cases, so too, as we move to a machine learning-based approach, we will find that the market is actually much bigger.

Anthony:
This allows us to create focus and clarity of purpose. Our mission is simple and clear: to enable organizations to quickly and easily bring together their siloed, disparate data and quickly and interactively deliver tangible business value as they become data-driven businesses. And we are doing just this at our customers around the world, in a wide variety of industries and use cases, from financial services and regulatory use cases to accelerating research and development in the pharma industry. For example, at Scotiabank, we’re consolidating over 40 data sources with millions of data records to comply with know your customer (KYC) and anti-money laundering (AML) regulations and get a single view of their trading customers. At DHS, we’re securing America’s borders by bringing together huge volumes of data into GTAS, the Global Travel Assessment System, to make sure that criminals and terrorists are identified when they travel.

Anthony:
Or at GSK, where we are literally saving lives by helping develop new drugs, bringing together study data from thousands of clinical trials for researchers. Or at Hess, where we’re driving operational efficiency by bringing together data on thousands of oil and gas wells from many different data sources. And the speed and agile approach of Tamr is critical, because those wells and the data coming from them are changing every single day. At Toyota Europe, we bring together data from all of their dealers across 30 different geographies in Europe to get a single view of their customer, sales, and service relationships. Or at CAA, which matches artists, creators, athletes, and other innovators on the cutting edge of popular culture with opportunities to showcase and share their unique talents. The common thread across all of these customers is getting tangible business value out of their data. How do we do this? What’s unique about Tamr’s approach? We start with data on the left and analytics-ready data sets on the right. We’ve already talked about the fact that the data is a mess, and we meet your data where it is.

Anthony:
We acknowledge the challenges, the changes, and the variety of data that’s in your enterprise. We apply a probabilistic, machine learning-based approach that allows us to do data mastering at scale for the attributes, records, and labels in that data. Attributes: asking the question, are these two columns the same column? Records: asking the question, are these two records the same record? And labels: asking the question, what category does this data fit into? And when you apply these techniques to the massive volume of data inside the enterprise, it brings together curated data sets for business-critical data and allows you to answer key business questions: who are my customers? What did they buy from me? Who are my suppliers? And what did I buy from them? Basic questions that drive the heart of every single data-driven business. And the key to our unique approach is this probabilistic machine learning approach.
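
To make the record question concrete, here is a minimal sketch of pairwise record matching in the spirit described above. The fields, the string-similarity feature, and the naive averaging stand-in for a trained classifier are all illustrative assumptions, not Tamr’s actual model.

```python
# A minimal sketch of probabilistic record matching: turn a pair of records
# into similarity features and let a model score the match. The features,
# fields, and scoring here are illustrative assumptions only.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def pair_features(rec_a: dict, rec_b: dict) -> list[float]:
    """One similarity feature per attribute the two records share."""
    return [similarity(str(rec_a[f]), str(rec_b[f]))
            for f in rec_a.keys() & rec_b.keys()]

a = {"name": "Acme Corp.", "city": "Boston"}
b = {"name": "ACME Corporation", "city": "Boston, MA"}
features = pair_features(a, b)

# A trained classifier (e.g. logistic regression fit on labeled pairs) would
# map these features to a match probability; a naive average stands in here.
match_probability = sum(features) / len(features)
print(match_probability)
```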

Anthony:
So let’s talk about that in a bit more detail. The challenge with the traditional rules-based approach to data mastering is that it’s rules first, value later. Not only is it incredibly time-consuming, but it relies on the dark art of a few developers to generate and create these rules. And the rules are extremely brittle, and they tend not to cover the full scope of the data. So as data comes into the system, we’re forced to create a series of rules to try to cover every possible corner case in that data. And heaven forbid new data shows up, or the data changes or shifts over time: the rules break, and they have to be recoded again and again. So what we find is that traditional rules-based projects take a long time and typically aren’t very accurate. And so it’s no surprise that, in general, they’re not particularly successful and people don’t particularly enjoy working on them.

Anthony:
With a machine learning-based approach, we turn that equation on its head. We allow you to deliver value extremely quickly, with high accuracy. We do this by training the computer to do the hard work. We build a machine learning model that looks at the data and figures out the best way to bring that data together, to categorize it, and to figure out what data belongs with what data. So this machine learning-based approach offloads the hard, time-consuming work to the computer. And what’s important is that this frees the smart people in your organization, who know and understand the data and want to answer these business questions, to focus their unique talents on the places where the machine learning is having trouble: the corner and edge cases.
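
One common way to realize this division of labor is confidence-based routing: the model decides the easy pairs itself and sends only the uncertain ones to subject matter experts. A minimal sketch, with assumed thresholds and a generic scoring function:

```python
# A sketch of confidence-based routing. Thresholds and the scoring function
# are illustrative assumptions, not Tamr's actual values.
AUTO_MATCH = 0.95      # above this, accept the match automatically
AUTO_NON_MATCH = 0.05  # below this, reject automatically

def route(pairs, score):
    """Split scored pairs into auto-decided and human-review queues."""
    auto, review = [], []
    for pair in pairs:
        p = score(pair)
        if p >= AUTO_MATCH or p <= AUTO_NON_MATCH:
            auto.append((pair, p >= AUTO_MATCH))
        else:
            review.append(pair)  # experts label these; labels retrain the model
    return auto, review
```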

Anthony:
This concentrates human energy on the hardest, most challenging problems in the data, where it derives the most value. And the result is highly accurate results delivered quickly. So now let’s shift gears a little bit and talk about what we’re doing at Tamr. We’ve talked about how the data mastering space is ripe for disruption; let’s talk about how Tamr is investing to deliver these capabilities to you. I wanted to start with the machine learning algorithms at the core of Tamr. Without a doubt, since the founding of the company out of academic work at MIT, machine learning has been the core of our competitive advantage, and we’re doubling down on this capability. We’ve focused on three major areas: performance, accuracy, and model explainability. On the performance side, we focused energy on two really important capabilities: getting the machine learning to skip the obviously wrong answers and the obviously right answers.

Anthony:
We do this through two capabilities called binning and pre-grouping. Binning eliminates the obvious non-matches, and pre-grouping eliminates the obvious matches. And the value here is performance: because we can focus human effort on the most difficult questions, we boost performance by not computing the most obvious comparisons, and we shorten the workflow by focusing on the most valuable places where subject matter expert opinion helps. Simply put, humans have less work to do, the machines have less work to do, and the overall system runs significantly faster. The second major investment is around accuracy. Now, at its core, the idea behind Tamr is to present examples to the user. We know that people are much better at providing feedback on examples than at providing feedback on answers. This is why pairs matching is such an important capability for Tamr.
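
Binning corresponds to what the entity-resolution literature calls blocking: records are only ever compared within a shared bin, so obvious non-matches are never computed at all. A minimal sketch, where the blocking key (first letter of a normalized name plus zip code) is an illustrative assumption, not Tamr’s actual key:

```python
# A sketch of binning (blocking) for pairwise matching.
from collections import defaultdict
from itertools import combinations

def blocking_key(record: dict) -> str:
    # Cheap, assumed key: first letter of normalized name plus zip code.
    return record["name"].strip().lower()[:1] + "|" + record.get("zip", "")

def candidate_pairs(records):
    bins = defaultdict(list)
    for r in records:
        bins[blocking_key(r)].append(r)
    # Compare records only within a bin, never across bins.
    for group in bins.values():
        yield from combinations(group, 2)
```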

Anthony:
However, we’ve added the ability to operate on clusters, on the answers themselves. So in addition to providing feedback on examples, we also allow you to provide feedback on the end answers that come out of the machine learning algorithms, and we’ve added the ability to prioritize that feedback toward high-impact answers, or high-impact clusters. Coming out of the machine learning, when we show clusters of data, we tag certain ones as high impact, acknowledging that if you provide feedback on these clusters, it will have the biggest impact on improving the machine learning model. In addition, we’ve added other ways to provide feedback to improve the accuracy of the machine learning. You can think of these as heuristics: ways to add signal into the model through simple rules of thumb that give human input to the machine learning. And here’s a key piece of why Tamr’s approach is so unique.
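
One plausible way to rank clusters for review is to assume impact grows with cluster size and with model uncertainty. The scoring heuristic below is an illustration of the idea, not Tamr’s published formula:

```python
# A sketch of prioritizing feedback: rank clusters so the ones whose labels
# would most improve the model come first. The heuristic (size times
# uncertainty, where uncertainty peaks at probability 0.5) is an assumption.
def impact(cluster_size: int, mean_match_probability: float) -> float:
    uncertainty = 1.0 - abs(mean_match_probability - 0.5) * 2.0
    return cluster_size * uncertainty

clusters = [
    {"id": "c1", "size": 40, "p": 0.55},  # large and uncertain: high impact
    {"id": "c2", "size": 3,  "p": 0.99},  # small and confident: low impact
]
clusters.sort(key=lambda c: impact(c["size"], c["p"]), reverse=True)
print([c["id"] for c in clusters])  # ['c1', 'c2']
```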

Anthony:
These are not rules. They’re not applied blindly. The heuristics we add into the system provide structural expertise, and they can often be in conflict. The machine learning uses these heuristics and evaluates them in the context of the results they provide, so even when they’re not perfect, they still help improve the model. And so we’re driving significant increases in the accuracy of the machine learning. Last but not least, around model validation and explainability, we want to be able to answer the question: when has there been enough training? And when should you update or touch up the model? As everybody knows, especially with high-velocity data, there are times when the model isn’t working optimally and you can touch it up. And this is precisely what Tamr now provides. We give feedback when the model could use some additional input, and we provide the examples where the model is having trouble, allowing you to address and touch up those corner cases so the model keeps moving.
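
As a sketch of one way to answer “has there been enough training?”: watch the share of low-confidence predictions and surface the worst examples for touch-up when it drifts too high. The thresholds here are assumptions, not Tamr’s actual criteria:

```python
# A sketch of a "model needs touch-up" signal: too many uncertain predictions
# suggests the model would benefit from more expert labels.
def needs_touch_up(match_probabilities, max_uncertain_share=0.10):
    uncertain = [p for p in match_probabilities if 0.2 < p < 0.8]
    return len(uncertain) / len(match_probabilities) > max_uncertain_share

scores = [0.99, 0.01, 0.55, 0.97, 0.45, 0.03]
print(needs_touch_up(scores))  # True: 2 of 6 predictions are uncertain
```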

Anthony:
So we are doubling down on the power of our machine learning to drive significant enhancements in performance, accuracy, and model validation and explainability. We’ve also added an entirely new capability in our matching algorithm around location, to provide the world’s first location-based mastering capability. The key innovation here is that we’ve figured out how to do geospatial feature matching at large scale with distributed compute. Traditional approaches to location mastering use a central index; they are therefore single-threaded and suffer significant performance challenges. At Tamr, we’ve built the ability to use locality information into the core matching algorithm and, as with everything in Tamr, we calculate it in an entirely distributed fashion. This allows you to incorporate geospatial features like points, lines, and polygons into the matching process, and it integrates seamlessly into your existing mastering workflow. So as you can see in the screenshot, when presented with geospatial information, we give you a map, we show you an overlay, we’ll display two points on a map, and we’ll allow you to use that signal as part of your mastering workflow.
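
To illustrate how a geospatial signal can enter pairwise matching, here is a sketch of a haversine-distance feature between two points; the 100-meter threshold is an assumption, and the distributed plumbing Tamr uses is omitted:

```python
# A sketch of a geospatial matching feature: the haversine distance between
# two points, usable alongside string similarities. Distributing this
# per-pair computation (e.g. across Spark partitions) avoids a central
# index; that plumbing is not shown here.
from math import radians, sin, cos, asin, sqrt

def haversine_km(lat1, lon1, lat2, lon2) -> float:
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 \
        + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))  # Earth radius ~6371 km

# Two well records within 100 m of each other are strong match candidates.
print(haversine_km(42.3467, -71.0972, 42.3470, -71.0975) < 0.1)  # True
```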

Anthony:
This opens up a whole series of new opportunities to look at location information in the context of your business data, because ultimately everything in your business happened somewhere. Now you can actually do mastering with even that kind of data. We also believe strongly that data mastering is a team sport, and we need to engage everyone in the enterprise with the challenge of data mastery. So we’ve added to the product the ability to plug in feedback on data right where the user is engaging with that data. That might be a plug-in for your business intelligence or analytics system, or plug-ins for your operational systems, such as salesforce.com. When you’re using those systems as an end user and you engage with data that’s not what you expected, you can provide feedback into the data curation process so we understand how, where, and exactly what data needs to be improved. But here’s where this gets really exciting. By engaging the entire organization in the challenge of data mastering, we allow everybody to provide feedback into the system.

Anthony:
So not only do we see where the data challenges are and where we can improve the data, but we also learn where your best data is and who your best data experts are, because data doesn’t answer business questions, smart people do. And with this capability, we build your organization’s data knowledge graph, so that you know where your best, smartest people and data come together. Now, the other place we’ve been working is around boosting the signal in your machine learning using enrichment sources. What we’ve seen across all of our customers is that almost every single customer uses third-party data as part of their machine learning models. It’s an incredibly effective way to boost signal in those models, because sometimes the best data is not the data inside your organization, but data you can bring in from outside. One simple example to illustrate this is address validation. You may have customers or suppliers or partners with addresses, and if you can have a third party improve those addresses and make them really accurate, that can be a really valuable signal in your machine learning when looking for matches across those entities.
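
A sketch of the address-validation idea: standardize addresses before computing match features, so the model sees a stronger signal. The validate_address function here is a hypothetical stand-in for a real third-party enrichment provider:

```python
# A sketch of enrichment-before-matching. `validate_address` is hypothetical;
# a real provider would also correct, complete, and standardize addresses.
def validate_address(raw: str) -> str:
    return " ".join(raw.upper().replace(".", "").split())

suppliers = [
    {"id": 1, "address": "4  Jersey St., Boston"},
    {"id": 2, "address": "4 jersey street, boston"},
]
for s in suppliers:
    s["address_std"] = validate_address(s["address"])

# After enrichment the two addresses agree far more closely, so the matching
# model sees a much stronger signal.
print(suppliers[0]["address_std"])  # 4 JERSEY ST, BOSTON
print(suppliers[1]["address_std"])  # 4 JERSEY STREET, BOSTON
```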

Anthony:
What we’ve found through practical work with our customers is that when you do enrichment on the data, you can get a dramatic improvement in the efficiency of your machine learning models, sometimes as much as 50% right out of the gate. We’ve burned this directly into Tamr’s core, so you can simply add an enrichment source and get these significant improvements in your machine learning models. We’re also focused on making data movement into and out of Tamr significantly easier and faster. We’ve added a fast load capability, backed by Spark, to make it really easy and fast to get data into Tamr, specifically data stored in file-based locations like AWS S3, Azure ADLS, and GCP GCS. With these external storage providers, we’ve added support for reading and writing Parquet files directly. This significantly increases the speed and throughput with which we can bring data into Tamr’s machine learning algorithms, and therefore how quickly you can get value out of that mastered data.
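
As a minimal sketch of what a Spark-backed load looks like against cloud object storage (the bucket and paths are hypothetical, and Tamr’s actual loader is not shown):

```python
# A sketch of a Spark-backed fast load from cloud object storage.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("fast-load-sketch").getOrCreate()

# Read Parquet directly from S3 (s3a://), ADLS (abfss://), or GCS (gs://).
records = spark.read.parquet("s3a://example-bucket/raw/customers/")

# Light normalization before the data enters a mastering pipeline.
records = records.dropDuplicates().na.fill("")

# Write the prepared data back out as Parquet for downstream consumption.
records.write.mode("overwrite").parquet("s3a://example-bucket/prepared/customers/")
```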

Anthony:
Speaking of getting data out, we want to make it really easy to publish the high-quality business data you’ve mastered in Tamr right where you need it: into downstream data sources, data warehouses, and operational systems. So we’ve added the ability to publish directly from Tamr into the sources you use every day. And we do this with full support for versioning of entity types and the underlying entity data, so that we move just the data necessary into these systems. This means you can now pull data into Tamr extremely quickly and, having run the machine learning and mastered the data, move it into the operational and analytical stores where your users can consume it and get business value out of it. I wanted to wrap our new product announcements with what I think is the most important work that we’ve done.

Anthony:
We built Tamr to run natively on the three major cloud providers and fully utilize their native scale-out capabilities. This is incredibly important because it dramatically increases the performance of Tamr and lowers the compute and storage costs for you. This is the reference architecture for the cloud scale-out capabilities of Tamr: we run natively on Apache Spark for distributed compute, on HBase for storage, and on Elasticsearch for search, and we store our metadata in PostgreSQL. We take this reference architecture and apply it to each of the major clouds. So, for example, on Microsoft Azure we use Azure Databricks for our scale-out compute and HDInsight for consumption and storage; on AWS we use EMR for all of our scale-out compute and data storage; and on GCP we use Dataproc for the compute engine and Bigtable for large-scale storage.

Anthony:
So, taken together, we’ve taken this reference architecture and tuned it specifically to run on the three major clouds. And we’ve developed deep and strong sales and technology partnerships with AWS, Microsoft Azure, and Google Cloud; you’re going to hear a lot more about that over the course of these couple of days. Our sales teams are cross-trained and we are engaging joint customers globally. But not only have we chosen to build on these cloud platforms, Microsoft and Google have chosen Tamr. They’ve chosen us for their data mastering challenges, and we look forward to seeing their success on their platforms. So what does this mean for you? Cloud-native architectures are critical for data mastering at scale because they improve performance while lowering your operational running costs. With a traditional architecture, as you scale up the data volumes, you expect to see storage and hosting costs go up exponentially as well.

Anthony:
And there are two important capabilities that bring this down. The first is ephemeral compute, and the second is elastic compute. Ephemeral means we can turn off compute when we’re not using it. Elastic means we can add compute as it’s needed, as a data mastering pipeline requires it. So the cost of scaling, even to terabytes of data and millions of records, stays linear. Now, we worked with an independent third party to test these cloud-native scale-out capabilities on Google Cloud. And what we found is that for large-scale data mastering projects, this results in a nearly 85% decrease in operating costs, which can amount to hundreds of thousands of dollars in annual cost savings. So by using an elastic, ephemeral approach, by running natively on these cloud scale-out architectures, and by allowing you to pay for just what you use, you can now run massive volumes of data with really capable machine learning at tremendous scale and at a reasonable cost.
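
As a back-of-the-envelope illustration of why ephemeral compute keeps costs down (the utilization numbers are assumed, not from the independent study):

```python
# Illustrative arithmetic only: if a statically provisioned cluster runs
# 24/7 but the mastering pipeline only needs it a few hours a day, an
# ephemeral cluster pays for just those hours.
static_hours_per_day = 24
ephemeral_hours_per_day = 4  # assumed pipeline runtime

savings = 1 - ephemeral_hours_per_day / static_hours_per_day
print(f"cost reduction: {savings:.0%}")  # ~83%, in line with the cited ~85%
```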

Anthony:
So, as you’ve seen, we’ve made huge investments in machine learning, user engagement, enrichment, data movement, and cloud-native scale. Now I want to share with you where we’re going longer term. When we think about Tamr today, we have a strong focus, as we talked about earlier, on taking massive volumes of data in and pushing them out into operational stores and analytics stores, and we do this across the three major clouds. Where we’re going, we want to add some really important capabilities. First, doubling down on this idea of boosting the signal in your machine learning with third-party data: we want to deliver curated data from Tamr to further enrich your master data sets. This means building entirely new enrichment sources, sometimes even providing that data ourselves, and allowing you to loop in third-party data, potentially through a data marketplace. Second, we want to deliver apps: apps tailored to your specific requirements, designed to tackle your needs in the context of data and analytics.

Anthony:
So rather than providing a generic platform, we’re building an application framework on top of the platform, allowing us to deliver purpose-built applications to solve your most pressing data challenges. In addition, we want to make that application framework available to you as a customer, so you can build applications to deliver to other parts of your organization, and to our partners, so they can deliver purpose-built applications on that framework as well. And you could imagine gathering those applications together into a sort of app store, which allows you to start your data mastering journey not from a core platform but from a purpose-built application that solves a specific problem. And last, we want to double down on the investments we’ve made in our cloud scale-out architecture. We want to offer Tamr as a fully hosted SaaS solution on our own Tamr cloud. But let me be clear: this is not an “instead of” option, but rather an “in addition to” option.

Anthony:
Our ambition is to offer a hybrid SaaS architecture, which allows you to work with our Tamr cloud while also fully leveraging the investments you’ve made in your cloud, or clouds, of choice. So, to be very clear: in the future, you will be able to have a SaaS version of Tamr but execute the machine learning on data and compute in your cloud provider of choice, run by you. Here is what this looks like over time. We’ve delivered the cloud-native capabilities today, and we are working on a Tamr SaaS solution as we speak. In the future, we will take a fully hosted Tamr SaaS solution and split the data plane from the control plane, allowing you to operate a control plane in our cloud running against a data plane in your cloud-native scale-out, in the cloud of your choice. So you can run entirely SaaS on Tamr, or you can run connected to the cloud architecture of your choice, where data and compute reside on your cloud.

Anthony:
Now, I hope you’ve enjoyed this update. It’s been a pleasure to spend time with you and share all the great work that we’re doing. We’ve shared our view on the changing industry, why the data mastering market is ripe for disruption, and why Tamr is perfectly positioned to do just that; our innovative product announcements, which significantly expand your ability to master data at scale and get business value faster; and our bold vision for a future which includes a hybrid SaaS architecture. And let me end by saying this conference is a gold mine of content, knowledge, and inspiration, and I wish you all the best as you spend time with us over the next few days.