DataMasters Summit 2020

Accelerating Business Outcomes with Tamr and Google Cloud


Pallab Deb

Head - Partner Solutions & AI Partnerships, Google Cloud

Learn why customers are turning to Tamr and Google Cloud to master data at scale. Topics this session will cover include the cost savings generated by leveraging Tamr’s cloud-native capabilities, how Tamr works with Google’s data services, and what customers should know about migrating to Google Cloud from on-premises systems.


Speaker 1:
DataMasters Summit 2020. Presented by Tamr.

Nicole Wong:
All right. Hi everyone and thanks for joining us for today’s session, which is, accelerating business outcomes with Tamr and Google Cloud. Today, you’ll be hearing from Pallab Deb, head of partner solutions and AI partnerships at Google Cloud. He is responsible for developing a technology partner ecosystem for Google Cloud’s AI product area. Also joining the discussion is Dan Bruckner. Dan helped found Tamr and serves as our lead engineer for scale, performance and, not coincidentally, GCP integration. Lastly, I’m Nicole Wong, a senior sales engineer at Tamr.
I’ll be moderating the session, which is again, accelerating business outcomes with Tamr and Google Cloud. With that, let’s start a conversation. Some of the topics we’ll touch on are, what’s behind the migration to the cloud, what a migration to the cloud looks like, what technologies are making mastering data easier and how Google uses data to generate business outcomes at a global scale. The first question is for Pallab. There was a time when primarily startups used the cloud.
Now, major companies like HSBC, Target, Home Depot and Dow Jones are using Google Cloud to run parts of their business. What’s causing the momentum for large enterprises to move to the cloud?

Pallab Deb:
Thank you, Nicole. Thanks for the question. First and foremost, Dan, good to see you again. It’s a pleasure to be with you and hopefully we’ll have an insightful conversation in the next half an hour or so. Nicole, getting back to your question. Well, I mean, if you look at the way the cloud journey has played out in the enterprise world, of course you had early adopters, like you rightfully pointed out, in terms of startups and born-in-the-cloud companies.
If you look at regular enterprises, whether they are in retail, technology, media, manufacturing, banking, financial services, and so on, I think all of them dipped their feet in the water a few years ago, just to check out what this was all about. They’ve obviously heard about the benefits of cost arbitrage, price arbitrage that they would get by moving applications out from their data centers and have them run on this massive public hyperscaler. Lo and behold, they did get those advantages, right?
I had the benefit of working with some of these early banks. In fact, the first bank in the U.S. that really moved to the cloud in a big way. They were euphoric about the cost advantages that they were able to get on the cloud, and that was a big, big win. Then of course, as they moved more and more workloads, people started asking the question, look, within the bank, for example, speaking of the same bank again, they had spent years building up applications, running these applications in different databases, and then building all sorts of feeder applications around those.
Not always in the most elegant way. It was almost always designed around the constraints that were prevalent on the day those applications were written. These companies started thinking about, “Look, when we go to the cloud and these cloud guys seem to have enormous amounts of scale, whether it’s on the network or on the compute and storage and so on. Maybe it’s worthwhile for us to start rethinking about how to use native cloud abilities. We just don’t take the stuff that we have in there, but we also try to transform that.”
That’s kind of the second phase. We’re seeing a lot of that already happening. I think there are some industries where this is happening faster than the others. I would clearly call out retail. I would call out media. I would even call out, in some ways, insurance and healthcare. Fundamentally, to put it very simply, anything that’s B2C, I think they have a much higher impetus in getting it done because they really want to get the experience of a Netflix or an Uber or an Amazon or an Apple out to their users, right?
Because as users, all of us expect similar experiences, whether we are using the application for a business use or personal use. That’s the second phase. Now the third phase is the most interesting one in my opinion, where people are saying, “Hey, let’s not try to re-engineer these applications. We have lived with it. Yes. We have tried transforming it. What if we were to step back? What if we were to step back and think about the process of the way we write insurance today, or the way we issue mortgages today?”
Any one of us who’s hopefully gone through the mortgage process here knows that it’s an excruciating process in the U.S. Numerous documents have to be written out and people show up. There’s an enormous amount of what we would call waste that happens in the system. There are companies today, large enterprises, they are thinking about the way they’re going to write mortgages in the future, where perhaps there’s not a single document that needs to get printed out.
There’s perhaps not a single individual on the other side, who’s having to read through your document, take figures and facts out of that and put it into a spreadsheet so an underwriter can assess your loan and decide whether it’s a thumbs up or thumbs down. People are rethinking that whole process. People are thinking about how to build it from the ground up in the cloud native world. That to me is the third era of transition of companies adopting the cloud.
Going back to your question, yes, major companies like HSBC, Home Depot, Dow Jones, and so on and so forth, have really gone through this journey. The braver of these ones are the ones that are going to the third phase, or let’s put it this way, all of them are headed there. Some of them are faster than the others in regards to their maturity. That’s a long answer, but hopefully it came across and I answered your question in that respect.

Nicole Wong:
It was very interesting to hear about your perspective and what you think about that. I have a related question, actually, for Dan. We know expanding in the cloud is a game changer for storage and networking infrastructure, and that brings about continuous deployment for dev ops. How does that bring about opportunities for data ops and MDM?

Dan Bruckner:
Yeah. Great question. First of all, thank you, Nicole, for organizing all of this, getting this all together. Pallab, great to see you again, likewise. I’m very excited to dig into this stuff. Yeah. You mentioned the progression from these core compute services and infrastructure that early stages of cloud provided and commoditized and made really accessible. I think as we look at what are the next stages in the rollout of cloud and all this infrastructure, what’s the impact for applications in data and data mastering?
I think Pallab’s points about these later phases in how cloud is evolving are really important. Because essentially what you can see happening now within these more data-oriented applications really we’re realigning around how can we not just take advantage of the large scale compute, but how can we take advantage of these cloud native services that are available in the cloud to get next level power in solving problems? Exactly the phenomenon that Pallab was just describing. Someone in the insurance industry coming in, rethinking from the ground up, how do you solve this problem?
Given the capabilities, the unique capabilities in the cloud. The same is true within data processing and data ops and master data management and so on. What it comes down to in a big way is the ease of use and the power that you get from cloud technology allows you to focus more on your application. In the case of data ops and data engineering, that means you’re spending less time worrying about problems of just low-level server administration, other things like how do I scale up this cluster?
How do I scale it down as needed? How do I make sure my core services aren’t crashing? It’s things like data intensive applications tend to have lots of large scale dependencies, software dependencies, the Hadoop stack, for example. If you’re not spending most of your time administering Hadoop, and you’re instead focused on writing pipelines that solve problems or improving the quality of data, or figuring out better ways to use that data to solve downstream business problems, then you’re going to have much higher impact overall.
You’re going to think about new ways to approach the problem. The shift of energy and focus turns out to be a game changer in that respect.

Nicole Wong:
All right. Awesome. Thanks for sharing that answer, Dan. This is a question for Pallab. Are there technologies that Google Cloud customers are especially interested in using to help improve how they use their data? What pain points does this technology aim to alleviate?

Pallab Deb:
Sure. Nicole, I’ll take up from where Dan left. Dan, you made an amazing point about how in the data world there is a lot of dependency on, of course, the ability of the platform to support what you’re trying to do with the data. Then when you don’t have to worry about that platform, it allows you so much more time to extract value out of that data instead of worrying about how to manage it and how to stand up the Hadoop cluster or run that Spark job and so on and so forth. That’s, Nicole, where we come in. That’s where most of the hyperscalers come in.
All of them have come to the table saying, “Hey, we’ll manage this stuff for you. We’ll offer this as managed services.” True to their pitch, all of the hyperscalers offer services that allow people who bring in their applications or their data management services to not worry about, at the administration level, what needs to happen. It allows you to, like Dan said, focus on making sure you’re meeting your business needs. Google Cloud does all of that. I don’t want to dwell too much on that. I think most of our customers who have spent some time with Cloud get that.
I think in addition, the parts of the Google Cloud platform that I like to highlight are, number one, we do all of this in an extremely secure manner. That I think is super important. Yes. Serverless and managed environments are important, but are they as secure as I would love to be? That’s where we can talk for a long time about security and how the infrastructure on which we support all of these abilities are the same ones that support, let’s say, YouTube or Google Maps and Google Search and so on. It’s built for resilience. It’s built for scale and thankfully, since we haven’t got hacked so far, it’s built for security.
You can come in knowing that you get the best of the Google security as well as scale and that allowing you to do what you need to do with your application when you bring it here in Google Cloud. Now, going a little bit above in the stack, so as to say, we also want to be able to give our customers the freedom of choice. That comes to the fact that for example, when Tamr goes out and meets with … when Dan meets his customers, I’m pretty sure there are quite a few customers that say, “Yes, I’d buy into the cloud, but there are a couple of things out here that I probably wouldn’t move to the cloud.”
Not because they don’t want to for the sake of it, but because perhaps regulation demands that they stay within the premises. Sometimes it’s governmental regulations, sometimes it’s industrial regulations and whatever have you. Therefore, our ability to extend all the best of Google Cloud down to on-premises through our hybrid and multi-cloud management platform, which we call Anthos, is again one important piece.
It’s really been a game changer, Nicole, because it allows us to be able to go and have the customer to say, “Hey, look, okay. If you’re worried about your data not leaving your premises, yet you want to use the power of Google Cloud, and you want to use your developer, give your developer that ubiquity of writing his applications without having to worry if that has to call a database that’s sitting on prem or on the cloud, guess what? We have a platform for that.” As long as you develop an [inaudible 00:13:13] and we just make it simple.
It’s the same Google Cloud console. You get the same experience. You don’t really have to switch between two experiences of saying, “Oh, okay, I’m going to use on-prem GKE container right now or perhaps use something that’s running on cloud itself.” That’s number two. The number three piece or the third piece is when customers come to us and we again want to get them to their success markers besides hybrid, besides security. The third piece we talk about is our ability to get them to extract insights from their data much faster and in a much more valuable manner, which is to say our abilities or the ability that they can tap into, into our machine learning and our AI offerings.
Whether it is data in the form of images or data in the form of structured data, or it could be video streams, or it could be whatever have you. Our ability to look for those and be able to arm you with tools from which you can take insights and feed it into your business process, that I think is, again … I won’t say unparalleled. But we’re definitely better than most of the folks out there in making sure that we are able to get you those insights in the most accurate manner. Then the second most important thing, in the most cost effective manner as well. You don’t have to spin up five GPUs to get to what you want to.
We manage that infrastructure load for you, and we’ll get it to you at a much lower price point with a much higher accuracy, so you can take the insights out of those images and move on to the next step in your process. In the mortgage process, for example, it could mean looking at your W2s and picking up all the data that I needed from your W2s, and feeding it into the next step in the process where an appraiser is perhaps looking at that to decide whatever your loan estimate is. I’m just getting back into the world of Loan IQ [inaudible 00:15:09] meeting on loans.
This customer is really excited and wanted to see how they could use Google’s AI to really automate the loan process in a significant way. That’s the third piece, really.

Nicole Wong:
Pallab, could you talk a little bit about the portfolio of Google Cloud services? If there are any in particular that would benefit companies that are looking to master their data.

Pallab Deb:
I’m going to try and connect what Dan said back to the offerings around data, which I think is super pertinent and super relevant for our conversation. This is where the best of Google Cloud and Tamr come to our customers and show itself to our customers as well. One of the things that Dan talked about was the simplicity of the GCP console and the experience that we have brought into that console. Essentially Nicole, we are trying to extend that same experience and the same approach to our products as well.
For example, in our data portfolio of products, if you look at the data warehouse or the data platform that we talk to our customers about, it’s called BigQuery. Our endeavor is to create the same experience that you get in GCP console or in the consumer-grade Google application onto BigQuery as well. A good example of that is making machine learning available to users through a SQL interface. Think about that. You’re essentially saying basic machine learning capabilities that are available within Google Cloud be exposed to users who only know SQL and are used to writing SQL inside the GCP console.
That’s amazing because that opens up a wide area of the market to be able to interact with and benefit from machine learning. That’s a small example, but Google’s BigQuery is really one of those products on which we think we can help enterprises scale out their ability to manage data, whether it’s structured, whether it’s unstructured. They hopefully won’t have the need to create a data lake because BigQuery is completely serverless. It scales. It allows you to do machine learning on top of itself.
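To illustrate what machine learning through a SQL interface can look like, here is a minimal sketch in Python that only assembles the SQL text for training and scoring a model with BigQuery ML. The dataset, table, and column names (`demo.churn_model`, `demo.customers`, `churned`, and so on) are hypothetical and do not come from the session, and the sketch does not connect to BigQuery. The statements would still have to be submitted through the console or a client library.

```python
def create_model_sql(model, source_table, label, features):
    """Build a BigQuery ML CREATE MODEL statement for logistic regression."""
    # The training query selects the feature columns plus the label column.
    columns = ", ".join(list(features) + [label])
    return (
        f"CREATE OR REPLACE MODEL `{model}`\n"
        f"OPTIONS (model_type = 'logistic_reg', input_label_cols = ['{label}']) AS\n"
        f"SELECT {columns}\n"
        f"FROM `{source_table}`"
    )


def predict_sql(model, source_table, features):
    """Build a query that scores rows of source_table with ML.PREDICT."""
    columns = ", ".join(features)
    return (
        f"SELECT *\n"
        f"FROM ML.PREDICT(MODEL `{model}`,\n"
        f"  (SELECT {columns} FROM `{source_table}`))"
    )


if __name__ == "__main__":
    # Hypothetical names, used only for illustration.
    feature_columns = ["tenure_months", "monthly_spend"]
    print(create_model_sql("demo.churn_model", "demo.customers", "churned", feature_columns))
    print()
    print(predict_sql("demo.churn_model", "demo.new_customers", feature_columns))
```

The point of the sketch is that both training and prediction are plain SQL statements, which is why someone who only knows SQL can use them.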
If you think about comparing this to the traditional world of today, where enterprise typically has a data warehouse where they use operational reports, or they use enterprise business reports for, they additionally have a data lake where they’re putting all the unstructured data. These are two environments. They need to be kept in sync. There’s a lot of overhead in managing a data warehouse and a data lake in tandem. In the Google world hopefully you will not have to do that. You can get all of that out of BigQuery.
Things of that nature, the focus on the simplicity of using that platform, the ability to support a different type of workload, say for example, streaming or batch. Google Pub/Sub is again one of the products that comes into play, especially when you need to work with streaming data. In the context of data, the idea is to make sure the platform’s available to handle all types of data requirements. Then we have partners like Tamr who bring in the ability for customers to deliver mastery over their data when the data is sitting on Google.
In a nutshell, where I’m trying to get towards is really to convey that between Google Cloud and Tamr, we share the same objectives of making sure our customers have the most efficient tool sets in front of them that they’re able to use and consume easily. They spend much more time getting value out of data as opposed to struggling to keep the data platform up and running. I hope that answered the question, Nicole.

Nicole Wong:
Actually, I want to go back a little bit. I remember you said a little bit about lift and shift, and so I want to return to this topic. A lift and shift approach to the cloud migration, it does have many benefits, but on its own it doesn’t solve underlying data quality issues or any siloization of the data. How can data mastering better enable enterprises in the cloud to be able to do this?

Dan Bruckner:
Yeah. To build on the theme of ease of adoption and expansion, and to return to a point that Pallab made earlier, migration to the cloud is really never an all or nothing proposition within an enterprise. It happens incrementally. It happens step by step. There’s some data that’s ready to move where the cultural setting and the regulatory setting is all right and ready, and folks are motivated to do it and get it done. Same for applications within the enterprise. Some of them are going to be ready to migrate while others stay behind.
A big piece of data mastering in that puzzle is we can help facilitate, make sure that as that happens and as data and applications, either migrate or don’t migrate, move to different settings, have different folks responsible for operating them, data mastering can help provide the links that keep all that together. Moving data to the cloud, getting them into this central data warehouse like BigQuery provides, can be seen as a big opportunity for doing some of this cleanup. You move a bunch of sources to the cloud, you connect them to what you already have in the cloud.
You get some linkage and now you know how your applications running in the cloud or your analytics powered off of that cloud data warehouse connect back to operational systems or other analytical systems that are still running on prem. Getting that global view across all the data is I’d say the biggest benefit of thinking about mastering at the time that you’re thinking about cloud migration. It’s going to be several years before all of these processes have moved over and you want to be able to still connect the dots on data distributed across the enterprise and across environments.

Nicole Wong:
Perfect. Thank you for sharing that, Dan. Pallab, this is back to you now. Google uses data to generate business outcomes at a global scale. For enterprises that are looking to achieve the same results, can you share some best practices?

Pallab Deb:
Yeah. Sure. I think Dan did a fantastic job really calling out how the advantages of using a hyperscaler or Google Cloud in the true sense gives you so much more flexibility, so much more cost advantages and the scale as opposed to running it on just virtual machines. I think some of the examples that I’ll call out here are all folks that have gone that full journey, and they’re using Google native services and not just sitting, writing it out on VMs. The one that I like most is, well, unfortunately not the best story for the COVID time space, but this belongs to an airline. This is interesting.
When you think about a large airline with hundreds of airplanes in their fleet and servicing hundreds of destinations, think about the tasks of the [inaudible 00:22:48]. Think about the task of a planner or the operator who’s planning out how scheduling needs to happen in that airline. Apparently it takes about five years to really train up an airline planner to be effective on the job, because it’s just that complex. Now, if you were to build out your entire application that is able to plan for all sorts of contingencies in regards to let’s say airplane allocation, or crew allocation or passenger [inaudible 00:23:26] or whatever have you.
You use the ability to do that. You’re taking hundreds of different signals. Some of it coming in, in real-time to you, some of it could be third-party signals which you never really brought into your scheduling, but now you can, that’s weather data, for example. You’re using machine learning to do that. That’s actually one of my favorite stories of what Google Cloud is doing with one of the bigger European airlines, to be able to really get them really, really efficient in how they plan and operate.
There are similar such examples across industries, but since you asked for at a real global scale, the other ones that I love calling out are, let’s say, in retail. This is with Lowe’s. We’ve all heard of them there in terms of their hardware. They’re a very large hardware department store. What they do in regards to using Google Cloud for planning their merchandise in the store almost on a daily basis.
Being able to take all the inputs in regards to what sales have happened, but more importantly, also any other inputs that need to come in, in regards to that neighborhood around, their area, their service, allowing therefore merchandisers to decide what sort of merchandise to be put up in the store the following morning or the following week and so on and so forth. That’s, again, a good example.
I’m going to the ones that are typically a little bit more back office oriented, because that’s the real hard problem that enterprises face today, of getting this supply chain, their inventory and all of that stuff fixed. That’s where the heart of the problems lie. Then of course going back into, I spoke about mortgages earlier, but that’s a real problem. Today, if you think about how banks and loan originators are thinking about really transforming that space, that’s a remarkable story.
Being able to take 300 documents that typically are needed to close a loan in the U.S. from start to finish and being able to use a document understanding AI from Google Cloud, to be able to recognize all those documents as you upload them. In many cases, you don’t even need to upload them because we’ll be able to go and scan or look over documents that you’ve already submitted, say, for example, your tax returns. Being able to pull stuff out of there so you don’t need to go through the trouble of uploading a whole bunch of documents to the mortgage provider’s website.
Talking about mortgage processing, which again is another interesting area where Google Cloud’s helping to completely transform the process along with our partners. The idea being there to be able to look at these hundreds of documents that go into a loan document and into a loan packet, and being able to use AI to really go through these documents, pull out insights that are needed and help make the next step in the decision cycle. Also, for the user it’s an amazing experience because you don’t really have to now go and upload hundreds of documents for your loan to be processed.
Last part of it is that you can literally cut that cycle time down to days from the weeks and months that it takes today. In a nutshell, Nicole, whether it’s transportation like through the airline example or retail, as you can imagine, or banking financial services and in healthcare, there are many too. I mean, there are publicly available case studies on our work at Ascension Health and with Mayo.
Again, those case studies are again very interesting because they are using AI to look at medical reports, let’s say scanned images from a CT scan or an MRI, and being able to alleviate healing process or the doctor’s diagnosis in a much more effective way. In a nutshell, these are really some of the stories that really stand out. Again, connecting it back to what Dan said, these are customers who were able to take advantage of the native capabilities to deliver this.
If you’re just going to be bringing applications and make it sit on a VM, but still use the old ways of running the application, you’re not going to be able to get this advantage. This is going back to what I said in the beginning, phase one, two and three. These are customers who are either in phase two, or are actually rethinking and redoing the way they want to do a particular business process. These are the customers who are getting those advantages relative to what we have to offer.

Nicole Wong:
Thanks Pallab for sharing those transformative outcomes. That’s really interesting to know about. We’re coming towards the end of our session today. Dan, to wrap things up from your side, could you summarize how Tamr and Google Cloud can help enterprises build a tech stack that supports scale and agility around data mastery?

Dan Bruckner:
Yeah. Absolutely. I’ll speak a little bit, first of all, about the inside the box tech stack within Tamr and what that looks like and then branch out to, okay, what’s the ecosystem going to look like more broadly on GCP when running with Tamr? Yeah. First, looking under the hood of the car. Tamr is going to bring a stack that is built … As I said before, we’re designed to run on top of these open source data processing engines. The key services in that picture are the two I mentioned earlier, Cloud Dataproc, which provides extremely scalable ephemeral cluster computing in particular Spark and data processing.
Then for storage, we use Bigtable for a lot of our primary storage so that we can perform extremely efficient, incremental processing so that large scale processing pipelines operating on millions, tens of millions, hundreds of millions or billions of records and mastering them, don’t need to be recomputed from scratch every time they run as incremental updates and so on come in. Bigtable, again, gives us an easy to manage scalable solution for that problem. We use Elasticsearch which we run on top of Google Kubernetes Engine for indexing data so that users can have a nice interactive experience working with it.
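The idea of incremental processing described here can be sketched in a few lines of Python. This is not Tamr's implementation: the `IncrementalGrouper` class, the exact-match grouping on a normalized name, and the in-memory dictionaries standing in for a persistent store such as Bigtable are all assumptions made for illustration. What it shows is that an update batch only forces recomputation of the groups it touches.

```python
from collections import defaultdict


class IncrementalGrouper:
    """Groups records by a normalized key and recomputes only the groups an update touches."""

    def __init__(self):
        self.records = {}               # record id -> record dict
        self.groups = defaultdict(set)  # group key -> ids of member records
        self.summaries = {}             # group key -> sorted list of member ids

    @staticmethod
    def key_for(record):
        # Hypothetical matching rule: exact match on the trimmed, lower-cased name.
        return record["name"].strip().lower()

    def apply(self, updates):
        """Upsert a batch of records and return the set of group keys that were recomputed."""
        dirty = set()
        for record in updates:
            record_id = record["id"]
            previous = self.records.get(record_id)
            if previous is not None:
                # The record may have moved, so its old group must be recomputed too.
                old_key = self.key_for(previous)
                self.groups[old_key].discard(record_id)
                dirty.add(old_key)
            self.records[record_id] = record
            new_key = self.key_for(record)
            self.groups[new_key].add(record_id)
            dirty.add(new_key)
        # Only the dirty groups are re-summarized; every other group is left untouched.
        for key in dirty:
            members = self.groups[key]
            if members:
                self.summaries[key] = sorted(members)
            else:
                self.summaries.pop(key, None)
                del self.groups[key]
        return dirty


if __name__ == "__main__":
    grouper = IncrementalGrouper()
    grouper.apply([
        {"id": 1, "name": "Acme Corp"},
        {"id": 2, "name": "acme corp "},
        {"id": 3, "name": "Globex"},
    ])
    # A later batch adds one record; only its group is recomputed.
    print(grouper.apply([{"id": 4, "name": "Globex"}]))
    print(grouper.summaries)
```

A real mastering pipeline would replace the exact-key rule with a learned matching model, but the bookkeeping of tracking which groups are affected by a change is the part that avoids recomputing from scratch.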
For some of the smaller items, we run Cloud SQL for storing metadata, which is a key part. Google Cloud Storage is of course, part of everything. Then I’d say the other aspect is how we integrate with the management layer in Google. Tamr’s built to follow all of the best practices around IAM and security and identity management and access control. Everything within your Tamr stack you can manage and control access in the same way that you would manage any other service you’re running on top of GCP, or even how you manage GCP’s own services.
Similarly, for monitoring and logging, we integrate directly with Google Cloud Logging and Monitoring Stackdriver so that you can see Tamr logs, Tamr metrics in context with the same Google services that are running your whole stack. You get the seamless really Google native experience running our stack. Looking at the bigger picture, Tamr is often going to be part of a solution where other key services, that there could be many players, but the usual suspects that we see are BigQuery, of course. Tamr works directly with BigQuery for importing and exporting data sets.
It’s usually where raw data sources that Tamr is going to be processing are staged. That’s the data lake side, you can picture it. Then it’s also where results and this unified data is going to land and become available to a broader user base within an organization. BigQuery is really this key touch point for how Tamr manages data. Tamr runs its own pipelines internally using Dataproc, as I described. But often those pipelines are part of bigger pipelines that are going to be taking raw data from user applications that are either running in the cloud or on prem or potentially that are third-party remote data sources.
Frequently, what we see as part of a complete solution architecture with Tamr is Google Data Fusion will be running as part of the big picture, picking up those sources wherever they are, loading them in, performing basic transformation to get them into a good state and queued up for Tamr. Then Data Fusion will usually land that data into BigQuery where it can get picked up and used downstream. On the other end of the spectrum, Data Fusion may be used to push data out to other use cases, other users.
A frequent pattern we see as well between Data Fusion, BigQuery and Tamr is the use of the Data Loss Prevention APIs to make sure that secure data, data with PII coming into the system gets de-identified and to facilitate workflows where private details aren’t leaked to analysts. That analysts are still seeing clean data, clean unified data, but they’re not seeing the compromising details that may be present in the raw source data. Yeah. There are many other services as part of the complete stack, but that’s the triad that we see again and again and again on top of GCP.
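As a rough illustration of the de-identification step described above, the Python sketch below masks two kinds of PII in free text before it would reach analysts. It does not call any Google Cloud API. The regular expressions, the `deidentify` and `pseudonymize` helpers, and the `demo-salt` value are assumptions made for the example, and a managed service would detect far more kinds of sensitive data than these two patterns.

```python
import hashlib
import re

# Deliberately simple patterns; real detectors cover many more formats.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def pseudonymize(value, salt):
    """Replace a value with a stable salted token so records can still be linked."""
    digest = hashlib.sha256((salt + value).encode("utf-8")).hexdigest()
    return "tok_" + digest[:12]


def deidentify(text, salt="demo-salt"):
    """Tokenize email addresses and redact SSN-like numbers in a string."""
    # Emails become tokens: the same address always maps to the same token.
    text = EMAIL.sub(lambda match: pseudonymize(match.group(0).lower(), salt), text)
    # SSN-like numbers are redacted outright because they are never needed downstream.
    return SSN.sub("[SSN]", text)


if __name__ == "__main__":
    raw = "Contact jane.doe@example.com, SSN 123-45-6789"
    print(deidentify(raw))
```

Tokenizing rather than deleting the email keeps the value usable as a join key for unification, which matches the goal of analysts seeing clean, unified data without the underlying private details.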

Nicole Wong:
Perfect. Thanks for sharing that, Dan. The last question, Pallab, is actually for you. Before we close out, is there anything else that you’d like to mention about the Google Cloud and Tamr partnership?

Pallab Deb:
Well, I think Dan is very eloquent in explaining how Tamr works on Google Cloud and the value that it brings in. I’ll just amplify by saying that, look, customers come to Google Cloud or almost always when we talk to customers to say, “Hey, why did you choose Google Cloud?” It’s almost always because of our data and our AI. Not to say the security and the other things don’t matter, but this is what comes top of mind. They come to you because of your ability to handle massive amounts of data in a very secure and scalable manner, a very user-friendly manner.
Then, of course, the machine learning and AI that you can perform on it. I think when we are able to connect all of these together in the manner that Dan just did, in regards to how the Google core assets on data like BigQuery or Cloud Data Fusion or Dataproc, are really, really leveraged by Tamr and all in the purpose of making data processing simpler and more user-friendly for the end user, that’s a story that really resonates.
I think it just speaks to our partnership that you bring an amazing set of capabilities to Google Cloud, complementing what customers come to us for or know us for, which is our data capabilities. When they discover what Tamr has to offer on top of Google Cloud’s data capabilities, I think that’s an unbeatable combination out there compared to anything else on prem or on any other cloud.

Nicole Wong:
Awesome. Thank you. In our closing statement, I want to say thank you very much to Pallab and Dan for participating in the session today with us, and also for sharing how Tamr and Google Cloud are helping customers better use their data for driving business outcomes. We’ve covered a lot today, and we’ve learned a lot about how Google Cloud and Tamr are helping customers get more value from their data faster. Thank you guys again.

Dan Bruckner:
Thank you.

Pallab Deb:
Thank you, Nicole. Dan, it’s been awesome, as always. Thank you very much.

Dan Bruckner: