DataMasters Summit 2020

Developing a Modern Data Architecture with the Cloud


Pankaj Dugar

Vice President, Product Partnerships @ Databricks

In this fireside chat, Tamr Chief Product Officer Anthony Deighton and Pankaj Dugar, vice president of product partnerships at Databricks, share how the cloud offers the scalability, favorable total cost of ownership, and elasticity organizations need to better leverage their data. You’ll also learn how Tamr fits with Databricks’ lakehouse approach to data management.

Transcript

Speaker 1:
DataMasters Summit 2020 presented by Tamr.

Anthony Deighton:
All right. Welcome, everyone, to the "Developing a Modern Data Architecture with the Cloud" fireside chat with Databricks. I’m Anthony Deighton, Chief Product Officer here at Tamr, and I’m joined by Pankaj Dugar of Databricks. We wanted to take a few minutes to have a conversation about the impact the cloud has on the opportunities around managing data and getting value out of data for the world’s enterprises. In particular, data is moving into the cloud at an incredibly rapid pace, and that offers significant opportunities for scalability, important total cost of ownership improvements, and significant productivity gains as you look for ways to leverage your data. Databricks is a real leader in providing fantastic, highly scalable, elastic compute on the cloud, and is making some really important investments in how to manage data at rest in the cloud, and we’ll talk a lot about some of those things, I’m sure.
So, thank you, Pankaj, for joining us. Maybe we could start with a brief introduction. Who are you? And then we’ll talk about Databricks.

Pankaj Dugar:
That sounds great. Thanks, Anthony. I’m delighted and honored to be part of the DataMasters series here. My name is Pankaj Dugar, and I manage our product partnerships at Databricks. What that means is it falls on my team to make sure that when a customer is trying to use Databricks with a partner product, they have a seamless experience in doing so. We are responsible for making sure the integrations with our partner companies work well, and we do a lot of awareness building so our customers know what’s actually available to them.

Anthony Deighton:
Excellent. Maybe share a little bit about Databricks as a company, but also as a solution, and the kinds of challenges Databricks helps address.

Pankaj Dugar:
Yeah, absolutely. So, Databricks was founded seven years ago with a vision to allow all data teams, which include data analysts, data scientists, and data engineers, to innovate faster by collaborating on a single, unified platform. The platform allows data teams to work with big data, not only to transform it, but also to do machine learning and predictions on that data and to analyze it. The company was founded by the original creators of Apache Spark, but the platform has grown well beyond what we all know as Spark, which is basically the de facto standard for data processing at a large scale.
When I talk about Databricks having grown well beyond Spark, what I mean is some important new innovations, including Delta, another open source project that we developed and open-sourced a couple of years ago, and MLflow, a machine learning model management platform that allows you not only to track your experiments from the start but also to put your models into production. We have over 6,000 customers around the world, and we process in excess of three exabytes of data per month on the Databricks platform.

Anthony Deighton:
Wow, three exabytes. That’s impressive. As you think about the Databricks business, is there a particular set of companies where you see adoption, or particular verticals that are adopting Databricks more quickly?

Pankaj Dugar:
Yeah, that’s a great question. With the volume and variety of data having exploded exponentially over the last decade or so, Databricks has really come in handy across a variety of industry verticals, including the public sector. We have very large customers in financial services, healthcare, media and entertainment, retail, and CPG, and, as I said, a number of public sector organizations are using Databricks as well. The versatility of the platform is proven out by this very fact: so many large customers, including many, many Fortune 500 companies, use the platform today.

Anthony Deighton:
Yeah. And I think what Tamr sees, as well, is that these are very horizontal challenges. It’s something I’ve spoken about before: every company at its core is really a data business, and so an infrastructure to manage that data is the factory of this century. It’s just something you have to have. And the cloud really changes the economics of that. So, as data itself moves to the cloud, we need to be able to move compute and processing to the cloud, and that seems to be something Databricks has been a real innovator on. Is that fair?

Pankaj Dugar:
Yes, absolutely. You’ve probably heard the term "data is the new oil," but you have to remember that oil in its raw form is fairly useless; you have to take it through a bunch of transformations for it to become usable in everything from automobiles to jet planes. And so it is with data. Because we’re a unified data platform, natively born in the cloud, we give people the ability to use the distributed computing framework of Databricks to work with very, very large data sets. If I remember correctly, some of our largest customers have tables with 20-plus trillion rows that they use to analyze threats, for example, and they use Databricks to actually do it. We have lots of other vertical use cases, too, including fraud detection and anti-money laundering.
I’ll give you a great example. If you use Comcast and you use the voice remote, all of that is powered by Databricks underneath. In fact, when they introduced that feature, it was so powerful that they actually won a technology award at the Emmys for it. In a completely different example, Regeneron, one of our life sciences customers, has used Databricks to accelerate the discovery of a drug that helps with chronic liver disease. So, you see so many of these use cases that have a direct impact not just on human lives, but also on making businesses more competitive. That’s why Databricks has garnered customers across a variety of different industries.

Anthony Deighton:
Absolutely. Again, I think the mastering challenge is a very specific application of the generic idea of machine learning: being able to find the right data when it’s hidden, especially when it’s stuck in all of these different silos. So, with data moving to the cloud, one of the challenges we hear consistently from customers is that you don’t want to just blindly move the mess you have in your enterprise into the cloud. You want to use the move as an opportunity to bring that data together. And following on your oil analogy, if you can bring the different subcomponents of the oil together, then you can refine different kinds of oil products, and that’s very much the idea behind mastering. You have the data inside your organization to make really great decisions, but it’s trapped in all these different silos. If you can apply machine learning-based mastering to it, you can find those unique combinations of data that add real value.

Pankaj Dugar:
Which is why products like Tamr are so needed in the market today. We’ve seen an unprecedented rise in the movement of data into the cloud, and because cloud storage is so cheap, customers have, over time, turned their data lake into a kind of general-purpose storage service. If you’re not organized enough from the get-go, pretty soon your data lake starts to resemble your garage once you’ve lived in a house for multiple years. So, having a product like Tamr that lets you do data mastering, to organize your data lake and the petabytes of data you may have in there, and to apply machine learning on top of that, is what makes Tamr such a successful product.
And I think that is the reason a partnership between Tamr and Databricks is so powerful: you apply machine learning to vast amounts of data to enable customers to master it, and Databricks is, as we know, pretty much the de facto platform in the cloud for the distributed computing that helps customers not only lower the TCO of owning their data, but also get to insights from that data much faster.

Anthony Deighton:
Yeah. Couldn’t agree more, and I think that’s the essence of the partnership. Machine learning that you can’t execute, or that takes weeks or months to execute, is useless; what we need is a high-performance engine to process this data in a highly distributed way on the cloud. Now, you mentioned at the top some of the innovation you’re bringing to market, and I just want to pick on one for a second, which is Delta. It feels like an important shift for Databricks, an important new addition, and I’m not sure everybody knows exactly what it is. So, maybe take a moment and share both what it is and why it’s so valuable.

Pankaj Dugar:
Yeah, absolutely. So, Delta is an open source storage layer that brings ACID transactions to your big data workloads. Normally, when you have a data lake, you have multiple data pipelines working against it, with lots of concurrent reads and writes, and when that happens, it’s very hard to maintain the integrity of the data. Having Delta Lake as a transactional layer on top of your data lake, which essentially turns it into a Delta Lake, is something our customers want, because now they can trust the data in their data lake, which they couldn’t before. Previously, any time they wanted to analyze the data in their data lake, they had to spend lots of time and effort moving portions of it into a data mart or a data warehouse so they could trust that data.
We’re seeing the need for data marts and data warehouses diminish, largely because of a new paradigm in the market called the lakehouse. Quickly, a lakehouse is this: decades ago, data warehouses and data marts became really popular because they were the single source of truth for customers, but as data has moved into the cloud, you’ve seen the emergence of cloud data warehouses. As we discussed a few minutes ago, though, the volume of data in a customer’s data lake is significantly larger than what typically exists in a cloud data warehouse. And customers have started to demand: rather than taking a portion of the data in my data lake and ETLing it into a data warehouse so I can do analytics on it, what would it take for a technology partner of mine to enable me to do the analysis on all of my data in the data lake?
And so it’s a little bit of a play on words, but if you combine a data lake and a data warehouse, we call it a lakehouse. That is the paradigm we’re seeing, and the world is converging on this concept of the lakehouse now.

Anthony Deighton:
So, would it be fair to say that the old way of thinking was: I dump all my data into my data lake, it becomes this data swamp full of junk, and then I take that data out, maybe run it through Tamr to see if I can find some nuggets of truth in it, and then land it in an expensive, difficult-to-manage data warehouse? That’s the old way. But what you’re saying is that, today, customers can address the data in the lake directly, as though it were already in the warehouse, and then run it through Tamr, do the magic that we do, mastering it, categorizing it, and then that is your final data; it can land back in that lake or it could land somewhere else. Is that a fair way of framing it?

Pankaj Dugar:
That is. I mean, the point really being: why use Tamr only on the portion of the data that’s moving to the data warehouse? Why not use Tamr on all of your data in the data lake? And there you go, you’ll have a single source of truth. The other benefit is that, typically, when data comes into the cloud, it lands in your data lake; by the time you actually move that data into a data warehouse, time has passed and there’s a pretty good chance the data has already become stale. In this day and age, you need to be able to analyze data as it’s coming in. So, if you could apply Tamr to all of the data in the data lake, as it actually comes in, that’s golden for the customer, because now they have access to all of their data in the entire data lake, versus having to wait days or weeks while a portion of that data is curated and moved into the data warehouse before analyzing it.

Anthony Deighton:
So, yeah. So, it’s fair to say it’s not an incremental shift; it’s a really new way of thinking about it. Where a data warehouse is just an incremental change, this is a fundamental change that improves the economics and the mechanism by which the customer gets value. Is that fair?

Pankaj Dugar:
100%. I mean, just think about it simplistically from a customer standpoint. Let’s say you have a petabyte of data in your data lake, and 100 terabytes of data in your data warehouse. That’s a tenth of your data, and chances are that data’s stale. What customers are saying is, “I want access to my entire petabyte of data that’s in my data lake, and I want it now.” So, it really is a paradigm shift we’re seeing, which is why we’re calling the lakehouse a paradigm. In fact, it was such an important thing that our founders actually wrote a blog on it. If you’re interested, you can just Google “What is a lakehouse?”; the blog was published earlier this year by the founders of the company, and it really talks about Databricks’ vision for where we see the world going.

Anthony Deighton:
Got it. That sounds like something everyone should hop on over and check out. Now, the obvious challenge is that if you’re dealing with a petabyte of data, or an entire data lake, especially if that lake is messy, isn’t it going to be slow? Crazy slow, even? But no, you guys have solved that.

Pankaj Dugar:
Yeah. I mean, obviously, Spark is a super fast engine, and we continue to make it faster and faster. In addition to that, we just announced something called Photon, currently available in private preview on Azure Databricks. It’s a brand new vectorized engine that we’ve built from the ground up in C++, and it’s 20 times faster than Spark. The type of innovation coming out of Databricks is monumental, and when you add partners like Tamr on top of that, the possibilities don’t just incrementally increase for our customers; you’re talking about a quantum shift in what our customers are now able to do, given that you can process 10 to 20 times the data in the same amount of time as before, because you’re not only applying machine learning to your data, you also have a fundamentally faster engine to work with.

Anthony Deighton:
Yeah. So, it reminds me of a conversation I was having just last week with a large US drug manufacturer, a joint customer of Databricks and ours. The challenge they faced is that they’ve been using Tamr for a long time, and they get tremendous value out of mastering their clinical trials data. They have thousands of clinical trials, all stored in different places, in different ways, and what they want is insight into drug development by looking across all of those trials in one place. But prior to the availability of the cloud, they had to stand up that infrastructure themselves, and they had to manage it and run it themselves, et cetera. It was becoming incredibly costly, and also difficult and time-consuming to manage the infrastructure, let alone get the value out of it.
Now, it was very valuable to them, so they were willing to do it, but we’ve been working together with them and Microsoft to move that entire infrastructure onto Azure Databricks, getting them significant cost savings, and also, I’m sure, significant performance improvements, while still getting that business value out the back.

Pankaj Dugar:
Yeah. I mean, you touched upon two or three really important things that bear reiterating here. One is that when you’re working with a product in the cloud, you’re always working with the latest version, versus on prem, where you may have upgrade schedules once a year, et cetera. So you always have access to the latest innovation as soon as it’s released. The second is that this movement of data to the cloud means you no longer have to maintain a large army of people to maintain your infrastructure on premises. Because the cloud providers have invested disproportionately in making sure their infrastructure is optimized, secure, and reliable, customers no longer need armies of people and significant budgets to manage all of that themselves. Those folks can be easily repurposed and upskilled to work on something a little closer to the business.
I mean, a simplistic way I think about this is: you want to be as close to your business and your data as possible. When you get into managing infrastructure and those kinds of things, you’re moving further into the periphery of the business. Why not leave that to somebody whose business it is to make sure you have the most up-to-date, most optimized, most reliable, and, in many cases, most secure infrastructure?

Anthony Deighton:
Yeah. No, I think that’s exactly right. And what’s very important for people to take away is that storage and data are moving to the cloud, but the compute is, as well. When the two sit next to each other, that’s where machine learning in particular becomes a really valuable technique, because it relies on the ability to look at all of that data. The algorithms Tamr uses, and machine learning algorithms in general, are incredibly complex and compute intensive. In Tamr’s case, we’re looking at all of the pairs of records, and we need people to compare them and give feedback to the model. That’s computationally complex, but it’s computationally complex in a very defined period of time. So, it’s a perfect workload to move to the cloud and take advantage of something like Databricks, where you can process that data extremely quickly and do it across the full data set, because it’s sitting right there in the cloud. That’s exactly what this drug company was finding.
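
As a toy illustration of why the pairwise comparison mentioned here is so compute intensive, consider naive all-pairs record matching. The records, similarity measure, and 0.7 threshold below are invented for illustration; a real mastering system like Tamr uses far more sophisticated, model-driven comparisons, but the quadratic blow-up is the same:

```python
# Toy sketch of pairwise record matching: every record is compared to
# every other, n*(n-1)/2 comparisons, which is why mastering at scale
# benefits from elastic, distributed compute.
from itertools import combinations
from difflib import SequenceMatcher

records = ["Acme Corp", "ACME Corporation", "Globex Inc", "Globex, Inc."]

def similarity(a, b):
    # Simple string similarity as a stand-in for a learned match model.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

pairs = [(a, b, similarity(a, b)) for a, b in combinations(records, 2)]

# A hypothetical threshold stands in for human feedback to the model.
likely_matches = [(a, b) for a, b, score in pairs if score > 0.7]
```

With four records this is six comparisons; with a million records it is roughly half a trillion, which is the kind of bounded-but-huge burst of work that elastic cloud compute handles well.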

Pankaj Dugar:
Exactly. And the elasticity of the cloud, I think, is what you’re referring to, right? If you had to do it on prem, you’d have to spend weeks and months trying to size up what computing power you would need, but in the cloud, when you’re using Tamr to master vast data sets, you don’t need to worry about that, because you get more or less infinite scale. Whatever compute you need is available to you with a snap of a finger.

Anthony Deighton:
Yeah, 100%. And as software vendors, we always talk about technology and how cool it is, but ultimately, what we’re really talking about is economics, right? It’s about getting value and creating new business opportunities through the use of that data.

Pankaj Dugar:
Absolutely. I mean, economics in terms of your ability to monetize the data you have, and in terms of the time you save in getting from data to insight. One of the things I always tell folks, especially new hires at Databricks when I teach the new hire class, is that at the end of the day, all the customer is trying to do is get from raw data to some actionable insight, but that entire process is fraught with friction. The more seamless companies like Tamr and Databricks can make it for the customer to get to insight as quickly as possible, the more successful we’ll be in helping our customers achieve great success.

Anthony Deighton:
Yeah. I worked for many years in the analytics industry, and what we found is that 80% of the work analysts did was not actually analyzing data; it was trying to munge the data into a shape where they could even start to answer questions, or trying to manually diagnose data mastering problems: duplicate data all over the place, stuck in silos, et cetera. And that’s just wasted effort.

Pankaj Dugar:
It’s the number one problem that data scientists, data analysts, and [inaudible 00:25:07] engineers cite, over and over again, when they’re working with data: having to munge and clean the data even before they begin prepping it for analytics. So, very well said.

Anthony Deighton:
So, any last takeaways you want to leave the audience with, and things to think about as it relates to Databricks?

Pankaj Dugar:
Yeah. I mean, we’re in some unprecedented times right now because of COVID, and what we’re seeing is an acceleration of customers wanting to move to the cloud and really make digital transformation a central theme of how the company operates. Data is no longer the realm of the employees at the bottom of the totem pole who just have to produce reports; it has become a boardroom imperative. So, when customers look at data as the new oil, they ought to be using the appropriate technologies to make sure they have all of their data easily available, with confidence that the data has high integrity, so that it actually provides the competitive advantage their business needs.
You can no longer treat data and data analytics as something that’s in the realm of science. This has become a modern imperative, and that’s why terms like "modern data architecture" have become such a significant part of most boardroom discussions. Because if you don’t act, you’re going to be left behind, and if you had been ahead but you’re stalling, your competitors are going to catch up to you.

Anthony Deighton:
Yeah. It goes back to this core point, which is that every business is a data business at its core. You might think you’re a manufacturer, you might think you’re a drug company, but at your core, you are the data that you produce. Then the question becomes: how can you profitably take advantage of that data asset you control and manage? That’s where building a modern cloud-based infrastructure on the back of Databricks is, as you say, an imperative, a requirement of doing business today. And utilizing machine learning and doing data mastering at scale is the mechanism by which you’re going to mine that incredibly valuable asset and build competitive advantage from it. So, there really is a win-win situation for our joint customers.

Pankaj Dugar:
Yeah. And the final thing I’ll say, Anthony, is that when people hear about machine learning and AI, it all seems like science fiction, or, at a minimum, too complex. But companies like Tamr and Databricks are hard at work democratizing AI so it becomes much more accessible to the everyday person working with data, abstracting away all of the difficulties of how these things work underneath.

Anthony Deighton:
Brilliant. Well, thank you. Thank you so much for making the time and joining us at DataMasters, and we look forward to many successful-

Pankaj Dugar:
Joint customers and success stories.

Anthony Deighton:
We have to work together. Exactly.

Pankaj Dugar:
Yes. No, Anthony, thank you for having me. Again, it’s truly a privilege and an honor to be part of this discussion with you.