datamaster summit 2020

Code Free Data Transformation with PII Leveraging Google Cloud

 

Direnc Uysal & Vaughn Muirhead

Direnc Uysal - Chief Technologist
Vaughn Muirhead - Head of GCP Practice

Australia-based intelia shares effective data pipeline strategies to get your data “ready to master” and why organizations are turning to Google Cloud to manage their data transformation. Hear best practices in sourcing, transforming, and securing multiple diverse data sets into a Google Cloud BigQuery staging area to supercharge your analytics.

Transcript

DATAMASTERS SUMMIT 2020 presented by Tamr.

Speaker 2:
Hi, I am Seychelle Hicks, APAC Lead at Tamr. I’m pleased to welcome you to this discussion about DataOps with our certified partner intelia, a Melbourne-based data analytics, cloud, and intelligent automation consultancy. In DataOps and agile NextGen, best-of-breed and interoperable technology such as Tamr is critical. But equally so are the process and the people required to create and manage an end-to-end solution.

Speaker 2:
intelia has deep expertise and broad experience in delivering DataOps-driven solutions to its clients, while Tamr excels in mastering data at scale with our cloud-native machine learning approach to drive improved business outcomes with more accurate and timely data. There is one very important question that actually precedes that. Before the data from various source systems even comes to Tamr, and from all the different data silos that it needs to come from, a number of data questions and challenges arise.

Speaker 2:
A few of these might be: what sources should I use? Can I get access to the source systems or the data lake? Is there sensitive data like PII that I need to consider? Can my data only remain in certain sovereignties and jurisdictions? And how can I efficiently move and secure this data at scale, among many others?

Speaker 2:
So with this, in today’s discussion Direnc Uysal, Chief Technology Officer at intelia, and Vaughn Muirhead, GCP Practice Lead at intelia, will shed light on effective data pipeline best practices, as shown with GCP BigQuery, in sourcing, transforming, and securing data in order to make it ready to master with Tamr and ultimately to supercharge your analytics. So without further ado, I’m excited to introduce our intelia partners Direnc and Vaughn.

Direnc Uysal:
Thanks Seychelle, I’m very excited to be here and partner with you in bringing Tamr’s capability into the APAC region. As most of you have seen, the Tamr platform has significant capability when organizations are looking to link and master their datasets. Today, we’ll be taking you through the part of the process before Tamr, right? How do you get your data into a staging area to effectively make a great [inaudible 00:02:15]?

Direnc Uysal:
We’ll touch on tools to make your life a lot easier, [inaudible 00:02:19]. We’ll call out some of the business challenges when dealing with PII or confidential data. Then we’ll show you how the Google Cloud Platform provides a fully managed, secure, and scalable solution to sink your data into.

Direnc Uysal:
I want to quickly touch on who intelia is. Vaughn, can you jump to the next slide for me? So intelia is focused on data. That’s all we do. We’re a consultancy headquartered out of Melbourne, Australia, and we work with our clients to mature their data capability. We’re finding more and more organizations want to start leveraging the power of advanced analytics and AI. But [inaudible 00:02:55], there are a number of data maturity steps, right? Including how do you build effective data pipelines and clean and secure your data assets? And then how do you take advantage of NextGen technologies like Tamr to truly unlock the value that your data holds?

Direnc Uysal:
I can tell you [inaudible 00:03:14] on Google [inaudible 00:03:15] is Tamr. We see GCP’s capability in the data space as head and shoulders above the competitors. And today we’ll take you through some of the platform’s capabilities and its underlying power.

Direnc Uysal:
Next slide, Vaughn. What are some of the considerations when integrating your source systems and building data pipelines, right? The key one is that most organizations don’t have clean raw data sets to just plug into the data mastering [inaudible 00:03:41], right? So significant effort is expended on cleaning, transforming, and unifying that data.

Direnc Uysal:
The second area that’s getting a lot more prominence, especially with global initiatives like GDPR in Europe and the Consumer Data Right in Australia, is how private and confidential data is handled. A lot of organizations are establishing clean lines of delineation between who can access PII data and who can access only the attributes that don’t have anything confidential in them.

Direnc Uysal:
And the final area is how scalable your pipelines and staging area are. Do you and your organization want to spend time configuring infrastructure, monitoring it, and making sure it scales out and is resilient, or do you pick a platform like GCP that does this work for you?

Direnc Uysal:
Next one, Vaughn. So today we’ll be giving you a walkthrough of a real-life example of a couple of data pipelines in GCP. We’ll show you how both flat files and database systems, in this case SQL Server, can be integrated and transformed. We’ll show you how to leverage some of the inbuilt smarts of GCP to take some of the work off your hands. And finally, we’ll show you how all the data lands in BigQuery, GCP’s next-generation cloud data warehouse, and is ready for further analytics. With that, I’ll hand over to Vaughn to walk through the demo.

Vaughn Muirhead:
Thanks Direnc. So as Direnc mentioned, we’re going to take you through a real-world example of a lot of the data manipulation techniques that we use, or have used with [inaudible 00:05:20], for a [inaudible 00:05:23] government client in this example. So what you see on the screen is a very simplified data architecture diagram to speak through the general business problem and how we prepare data, specifically the handling of PII, to prepare this data for Tamr.

Vaughn Muirhead:
So this particular client, their business problem was one where they have multiple patients and multiple data silos, and these data silos might exist on-prem or in various cloud infrastructure. And the business problem was that they wanted to understand a single patient as it exists, or as that person exists, across all of those data sets. So Tamr would call the solution to this, I believe, patient mastering.

Vaughn Muirhead:
So the problem here: we have the same person, over years of their life, popping up in multiple data sets for this government client. The name may be misspelled, the phone number might be different, lots of human error and incompatible systems collecting data, and the problem is how do you actually identify a single person across all of those data sets?

Vaughn Muirhead:
Now interestingly, as you might imagine, patient data contains some very, very confidential or personally sensitive information, and this particular client had a requirement such that the team who manages all of the data is actually split into two groups. Group one is allowed to see personally identifiable information and not the attributes that go with it. And group two is able to see attribute information, but not the personally identifiable information. I’ll get into what that really looks like if I zoom into my diagram here.

Vaughn Muirhead:
So again, this is a very simplified, single small table that we generated for the purpose of this explanation. But if you look at the raw data, you can see that we have a table containing people and their [inaudible 00:07:33] information, such as their name and Medicare number. And then we have what we might call attribute information, such as their income or their public housing application status. Again, for example, in this case.

Vaughn Muirhead:
So this is a combined table, what we call raw structured data. In the case of our solution architecture on Google Cloud Platform, this data, as it appears in its raw format when it’s loaded onto Google Cloud, is actually architected to sit in a very secure environment that’s only accessible by the automated service accounts that have to process this data into the next two steps that I’ll talk about, and a very select few authorized users.

Vaughn Muirhead:
So again, the data management team, in order to pay ultimate respect to the handling of personally identifiable information, they’re going to split this data into two: one table that contains PII only, and that’s what we’re showing at the top of our screen here, pardon me, and one that contains the attribute information only, and a key that joins the two.
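
To make that split concrete, here is a minimal sketch of what it might look like as BigQuery SQL run through the Python client. The project, dataset, and column names are illustrative placeholders, not the client’s actual schema.

```python
# Minimal sketch of the PII / attribute split. All project, dataset,
# and column names below are illustrative, not the client's real schema.
from google.cloud import bigquery

client = bigquery.Client(project="secure-raw-project")

split_sql = """
-- PII-only table, landed in the restricted project for group one
CREATE OR REPLACE TABLE `secure-pii-project.staging.patients_pii` AS
SELECT record_id, first_name, last_name, medicare_number, phone, email
FROM `secure-raw-project.raw.patients`;

-- Attribute-only table, landed in the analytics project for group two,
-- keeping the record_id key so the two halves can be related later
CREATE OR REPLACE TABLE `analytics-project.staging.patients_attributes` AS
SELECT record_id, income, public_housing_status
FROM `secure-raw-project.raw.patients`;
"""

# BigQuery accepts the two statements as one multi-statement script.
client.query(split_sql).result()
```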

Vaughn Muirhead:
Just to dive a little bit more explicitly into what we’re doing here, you can see that this table exists on GCP in its own secure project. This is only accessible by group one, who is authorized to see the personally identifiable information. And you can see the architecture here: we went from a full structured table, and we split that table into PII only, again with the record ID, and then we applied the Tamr technology. Again, this is very simplistic, I think the Tamr team can explain to you better how this works in detail. But for the purposes of my example, we’re adding a master ID to these records.

Vaughn Muirhead:
And so again, this is one table. In reality you may have hundreds or thousands of tables with up to billions of records that are going through this same process. For this explanation, we’re simplifying things very much. On the other side of the coin, again, we have a restricted environment only accessible by the data administrators in group two. This contains the attribute information, where our data analytics would take place.

Vaughn Muirhead:
So the ultimate use case is, once all of these patients have been mastered and we have a single record that’s been identified by Tamr, clustered, I suppose, and prepared properly such that all of our records from our disparate data systems have become one, this attribute information is actually what our analytics teams or their end users are going to typically be querying. So we’re removing personally identifiable information and leaving the attribute information behind, with a master ID and a record ID that can be used by authorized users in the event that these tables need to, or can, be joined again later on.
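
For illustration, a hedged sketch of how an authorized user might bring the two halves back together on those shared keys, again with placeholder project, dataset, and column names.

```python
# Hedged sketch: re-joining attributes to PII for an authorized user,
# using the shared record ID (and Tamr's master ID). Names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="secure-pii-project")

rejoin_sql = """
SELECT pii.master_id,
       pii.first_name,
       pii.last_name,
       attr.income,
       attr.public_housing_status
FROM `secure-pii-project.staging.patients_pii`       AS pii
JOIN `analytics-project.staging.patients_attributes` AS attr
  ON pii.record_id = attr.record_id
"""

for row in client.query(rejoin_sql).result():
    print(dict(row))
```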

Vaughn Muirhead:
Now, just to illustrate the example of what I’m talking about with this architecture here, let’s dive into a technical demo, and this will be demonstrating a product on Google Cloud Platform. So this is a managed product called Data Fusion. What Data Fusion does is give us an effectively code-free way to create data pipelines. And I’ll show you what I mean by a data pipeline step by step here.

Vaughn Muirhead:
But basically, we’re taking data from a source system or multiple source systems, and bringing that data from that source system, which in this first use case I’ll show you is Microsoft SQL Server on-premise, and we’re eventually going to put that into Google’s next-generation data warehouse, BigQuery, in a clean and usable format. And much like the previous architecture example which I showed, the main purpose of our pipeline is to actually split that data and place it in BigQuery within those separate, secure environments.

Vaughn Muirhead:
So on the top we have data which is accessible to the group within that team that has access to the attribute information. And then on the bottom side of the pipeline, we have our personally identifiable information being landed in BigQuery in a separate secure environment.
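
As an aside, one way this per-group separation can be expressed on the BigQuery side is with access entries, sketched below at the dataset level; in the demo the separation is done with separate projects, so the dataset IDs and group emails here are placeholder assumptions.

```python
# Rough sketch: granting each group read access to its own BigQuery
# dataset only. Dataset IDs and group emails are placeholders.
from google.cloud import bigquery

client = bigquery.Client()

def grant_group_reader(dataset_id: str, group_email: str) -> None:
    """Append a READER entry for a Google group to a dataset's access list."""
    dataset = client.get_dataset(dataset_id)
    entries = list(dataset.access_entries)
    entries.append(
        bigquery.AccessEntry(
            role="READER",
            entity_type="groupByEmail",
            entity_id=group_email,
        )
    )
    dataset.access_entries = entries
    client.update_dataset(dataset, ["access_entries"])

# Group one sees the PII dataset; group two sees the attribute dataset.
grant_group_reader("secure-pii-project.staging", "pii-admins@example.com")
grant_group_reader("analytics-project.staging", "analysts@example.com")
```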

Vaughn Muirhead:
So quickly, Data Fusion, this is a really interesting product. It does a lot of things, including connecting to various data sources. So most of the data sources that you use today, or that you might think of, you simply choose from this menu on the left. In our case I’ve chosen Microsoft SQL Server and actually pre-prepared a pipeline that I’ll step through shortly.

Vaughn Muirhead:
But it’s as simple as selecting your [inaudible 00:12:38] source, configuring it with [inaudible 00:12:42], with lots of pre-canned transformations that you can use to interpret, decrypt, encrypt, and those types of things. Very common steps that data engineers or your business probably have to do manually every day. And this makes that a lot simpler.

Vaughn Muirhead:
And again, without writing code, this is a big thing. As a cloud architect, I’m much more comfortable with the diagram that I showed previously. So diagrams are my strong suit, but even I have been pretty comfortable using this tool. So I’m not a coder, and I didn’t have to code to get some really interesting things happening.

Vaughn Muirhead:
So what I’ll do today is dive into our pipeline. So if I just look at the properties of the SQL Server connector, once it loads up, I’ll show you how this has been configured and what it’s doing as part of our pipeline.

Vaughn Muirhead:
So very simply, you can see that this is connection information to, in this case, a mock on-premise SQL Server. So we give it the IP address and the connection information. And interestingly, the way that we get data out of our data source in this case is a SQL query. So if anyone is familiar, and most people in the technology industry are pretty familiar, with running at least some form of SQL query, that’s what you do to get data from our source database, in this case, out into the rest of the pipeline.

Vaughn Muirhead:
And in this case it’s a very simple query; we’re selecting star from a human resources table. Again, this is mock data. It’s not big data. If it was, we’d be a little bit smarter with our query rather than just doing a select star, but I think it’s going to be fit for purpose for this example. On the right hand side, we actually have the output schema of our input data source. It’s actually giving us a preview of all the attributes or information in our table. And you can see that some of these have check boxes that are ticked and some of them do not. And simply, this interface allows us to select the data that we care about, that we want to process into the rest of the pipeline. And that’s it.
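
For illustration, a hedged sketch of a more selective import query than the demo’s select star, of the kind you might paste into the source connector; the table, columns, and incremental filter are assumptions, not the demo’s actual schema.

```python
# Hedged sketch of a more selective T-SQL import query than SELECT *.
# Table and column names are illustrative, not the demo's actual schema.
IMPORT_QUERY = """
SELECT record_id,
       first_name,
       last_name,
       phone,
       email,
       income,
       public_housing_status
FROM dbo.human_resources
WHERE modified_date >= DATEADD(day, -1, GETDATE())  -- incremental pull only
"""
```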

Vaughn Muirhead:
So I’ll show you the next step. If you look at the top of our pipeline, this is actually a data wrangling step, the cause of many massive headaches for data engineers around the world. And here I’ll show you a little bit about how we wrangle the data to get, in this case, a very simple input schema to an output schema for splitting our data.

Vaughn Muirhead:
On the left hand side here, you can see our full schema of the data types that we’re interested in. You can get pretty complicated inside here; in this case it’s fairly simple, we’re removing the PII. So on this side you can simply see that we’re keeping everything on this table except things like first name, last name, health number, and those sorts of things.

Vaughn Muirhead:
Of note, we’re actually keeping phone number and email address. These are personally identifiable information, but I’ll talk to you about why we’re keeping those a little bit later on in the pipeline, because it’s actually quite interesting.

Vaughn Muirhead:
Okay. So the next step, and this is where we get into the explanation of why we are keeping that PII data. In fact, let me just give you the previous step. And by the way, Data Fusion allows you to create draft data pipelines and test them without actually executing them across the entire dataset. In this case we’re dealing with several hundred rows. But I’ve seen this product deal with billions, literally billions of rows.

Vaughn Muirhead:
It’s able to do that because it actually executes on top of Google’s big data processing system called Dataproc, which under the hood is actually your Hadoop and Spark ecosystem, just as a managed service from Google. Data Fusion sits on top of that and abstracts it away, as opposed to you writing the Java code to get those pipelines executing. And it gives us a very convenient way to preview and make sure our pipelines are working well.

Vaughn Muirhead:
Just in this preview mode, it took about a minute to execute, and it gives us as many rows as you like; in this case, it’s just the first 10 rows, showing what our input schema looks like, including all the PII, and what our output schema looks like, very simply in this case, with our PII removed except for the phone number and the email address. These are clear text right now.

Vaughn Muirhead:
Moving into our step called redact. So if I click on preview now, again on our left-hand side we see our clear text phone numbers and email addresses, and on the other side we see that they’ve been replaced by these surrogate values. If I just flip quickly into our record view, the surrogate values are actually what’s called deterministic encryption.

Vaughn Muirhead:
So basically what happens here is, it’s no longer personally identifiable. You have a phone number and email address that has now been encrypted in a deterministic way. And this means that if you have a unique value, its encrypted form can be used as a unique key. To try and make that make a little bit more sense: as a data analyst, it’s not really that great if you have a whole table full of millions of records [inaudible 00:18:24] and you can’t actually identify a common, in this case, a common person. So that person might appear thousands of times in that data set. If you have no way to relate those records to each other, you’re in trouble, and I think your analytics might not be as valuable as it could be.

Direnc Uysal:
[crosstalk 00:18:42] in these cases, if you’ve got someone’s record in system A with their phone number, and you send it through this pipeline and it gets encrypted, and you’ve got the same person’s information in another system over here where their mobile number is still the same, that gets encrypted too. Those encrypted values are going to be the same, so you can perform analysis on those.

Vaughn Muirhead:
Yeah, exactly Direnc. Then to move a little bit further, when Tamr does its mastering, it’s going to take that person and find the common phone number that might be its golden record phone number, the source of truth. And then what we’re going to do is actually encrypt that, such that it’s no longer personally identifiable but it’s still useful evidence.

Direnc Uysal:
Still [crosstalk 00:19:26].

Vaughn Muirhead:
Yeah. This tool again, without writing code, with a little bit of configuration, and I’ll show you quickly how that looks. It actually uses machine learning, believe it or not. Under the hood this is a service Google call the DLP API, which stands for Data Loss Prevention API.

Vaughn Muirhead:
This is the configuration screen for the DLP API plugin. We configured a template in the back end, I won’t get into that in too much detail. We’ve got a template ID, and then we’re just defining what sort of things, or what fields, we’re interested in encrypting, effectively. So if it finds a phone number or email address in this field, it’s going to encrypt those [inaudible 00:20:09].

Vaughn Muirhead:
In our case, it’s going to use deterministic encryption. It can use masking; there are lots of different ways to deal with personally identifiable information. In this case, we want this to be unique and reversible, but if it were a credit card number or other information, which we’ll talk about a little bit later in the presentation, you could just mask that, if it’s not useful for analytics and it might be risky to actually have it in clear text. So we might just mask that, which is basically hash marks instead of numbers.
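
For readers who want to see the shape of this, here is a minimal sketch of deterministic encryption with the DLP API’s Python client, similar in spirit to what the Data Fusion plugin drives from its de-identify template. The project, KMS key, wrapped key, and surrogate name are placeholder assumptions, not the demo’s actual configuration.

```python
# Minimal sketch of DLP deterministic de-identification.
# Project, KMS key, wrapped key, and surrogate name are placeholders.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"

info_types = [{"name": "PHONE_NUMBER"}, {"name": "EMAIL_ADDRESS"}]

deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {
                "info_types": info_types,
                "primitive_transformation": {
                    "crypto_deterministic_config": {
                        # The same plaintext always maps to the same surrogate,
                        # so the surrogate can still be used as a join key.
                        "crypto_key": {
                            "kms_wrapped": {
                                "wrapped_key": b"<kms-wrapped-data-key>",
                                "crypto_key_name": (
                                    "projects/my-project/locations/global/"
                                    "keyRings/pii/cryptoKeys/dlp-key"
                                ),
                            }
                        },
                        "surrogate_info_type": {"name": "CONTACT_TOKEN"},
                    }
                },
            }
        ]
    }
}

row = {"value": "0412 345 678, jane.doe@example.com"}
response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": {"info_types": info_types},
        "deidentify_config": deidentify_config,
        "item": row,
    }
)
print(response.item.value)  # CONTACT_TOKEN(...)-style surrogate values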

Vaughn Muirhead:
So finally, to finish off our top pipeline, we’re finally going to move our data into BigQuery. This is just where we land it in a table, and our table looks like this. Again, if I switch into our record view, it’s a very simple, small table, including our surrogate encrypted values. I can show you a little bit later what that looks like in BigQuery itself. But effectively we’ve made it through our data splitting process on the top side of the pipeline, removing PII.

Vaughn Muirhead:
The bottom side of the pipeline I won’t step through in grand detail, but it’s a very similar thing, just the opposite, where this is our PII rather than our attribute information. So I’m just showing the preview as it would appear in BigQuery. We have just our PII and not the attribute information, so to our customer’s requirements for their business problem, we have separated those in an automated, secure, and respectful way.

Vaughn Muirhead:
What I’ll do now is just quickly move into the second pattern I want to show today. So, that pattern was taking data from on-premise SQL Server. The other pattern is actually taking data… It’s a very similar pipeline, almost identical, but instead of taking data directly from an on-premise database, we’re actually going to take data from an extracted file on Google Cloud Storage.

Vaughn Muirhead:
So I’ll flip to the next tab; I’ve pre-prepared a pipeline preview there. So in this case, our source data system is Google Cloud Storage. Now, I just want to talk briefly in terms of options here, the two patterns that we’ve shown today. The first one I showed you was connecting to Microsoft SQL Server directly. This is pretty efficient because you have all of your data pipeline controlled by the cloud.

Vaughn Muirhead:
So you have a very scalable cloud system, you have an easy-to-configure and manageable scheduler that controls when it gets data, and it gets data directly from the source, so there’s no middle person in the loop. So you can understand how this might be very efficient. On the other side of the coin, in my experience, we’re almost never allowed to use this pattern, specifically reaching out to on-premise systems, because they’re not always scalable. And the people who own those systems within an enterprise are typically quite protective of them, and they prefer to control their own destiny, so to speak.

Vaughn Muirhead:
So the second pattern, which I’ll show you, is the other side of the coin, where we have our data owners who own those systems. And again, these might be hundreds of different databases within an organization, all these data silos. We have our owners controlling their own destiny and extracting and pushing the data on their own terms to Google Cloud. This would land in our Google Cloud Storage. From Google Cloud Storage, that’s when we actually process our data pipeline, exactly as I showed before.

Vaughn Muirhead:
There’s one little difference which I’ll show; I’ve actually added something specific to that data, which I’ll get into in a little bit of detail on the PII service shortly. But effectively, in this case you don’t have to worry about your on-premise datasets. [inaudible 00:23:55] is also here, not only just making your data stakeholders feel more comfortable that you’re not going to melt their on-premise systems with some infinitely scalable cloud service.

Vaughn Muirhead:
It also allows, possibly, [inaudible 00:24:12]. We have a lot of connectors here for our multiple data sources, but you might not see everything under the sun in this menu. But almost every data source I can think of allows extraction to a type such as CSV.

Vaughn Muirhead:
And so if I just jump into the properties on this [inaudible 00:24:31] connector quickly, I’ll show you what the source data looks like, and how it’s being processed in terms of this type [inaudible 00:24:42] from the last example.

Vaughn Muirhead:
Right? So in this case, we can see that we just have an endpoint on Google Cloud Storage for our file; this one happens to be CSV. It’s really standard stuff. We’re not telling it to skip the header, but there are lots of different configuration options in here that you might be familiar with, and might need to be aware of, when dealing with a CSV.

Vaughn Muirhead:
And very, very simply, we’re taking that CSV and creating a string in the body. And that’s [inaudible 00:25:12]. We’re moving that into the next part of our pipeline, which is our Wrangler, and this is where we actually create the recipe to deal with the CSV. So we describe what the internal data looks like. There are a few different options when you’re dealing with CSV data, including the delimiter.

Vaughn Muirhead:
So here’s our recipe, and this can be saved and reused, or at least a template of this can be saved and easily copied and pasted, and with simple modifications, perhaps you might have a different separator in your case. Really easy to configure here. But yeah, it’s like a different language just for dealing with CSVs, to automate that headache of potentially preparing or cleaning, or wrangling is the best term actually, which the [inaudible 00:25:58] uses, in order to get your data into a useful structured format.
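
For readers without Data Fusion handy, here is a rough sketch of the same parsing step done directly in Python against Cloud Storage. The bucket, object, delimiter, and column names are placeholders, and this illustrates the logic the Wrangler recipe automates rather than the recipe itself.

```python
# Rough sketch: read a CSV extract from Cloud Storage and split it on a
# configurable delimiter. Bucket, object, and column names are placeholders.
import csv
import io
from google.cloud import storage

BUCKET = "my-landing-bucket"
OBJECT = "extracts/human_resources.csv"
DELIMITER = ","  # swap for "|" or "\t" if the extract uses another separator

blob = storage.Client().bucket(BUCKET).blob(OBJECT)
text = blob.download_as_text()

# First row is treated as the header, mirroring the recipe's parse step.
reader = csv.DictReader(io.StringIO(text), delimiter=DELIMITER)
for row in reader:
    # Keep only the columns we care about, mirroring the ticked boxes.
    record = {k: row[k] for k in ("record_id", "phone", "email", "income")}
    print(record)
```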

Vaughn Muirhead:
What you can see on the right-hand side, this is our output schema. You might be familiar with the tick boxes: we choose the data that we’re interested in bringing into the rest of the pipeline here. This is actually a live preview from Google Cloud Storage. It’s interpreted the schema and it’s showing us what sort of data we’re dealing with. It’s actually really convenient that way.

Vaughn Muirhead:
From there, the rest of our pipeline actually looks really similar, including our redact step here, where we’re actually using Google’s DLP API to redact this data. I’ll dive into this just quickly again here.

Vaughn Muirhead:
Okay. So we have switched to our record view. We have again our email address and our phone number, and on the other side we’re encrypting those just the same as we had in our previous example. The difference that I’ve thrown in here is I’ve actually added this free text field into the data as an example.

Vaughn Muirhead:
Somebody who might understand machine learning and data processing a little bit, you might say, why would I use a heavy solution, or what might seem like a complicated solution, such as machine learning? Simply, if I know that I have a credit card number in a field on a database, why don’t I just apply some kind of hashing algorithm or something like that on-premise before I extract it? Then you don’t have to deal with it.

Vaughn Muirhead:
That is a totally valid solution in a case where you understand where your personally identifiable information lies. A lot of the time we’re not that lucky in enterprise, or our clients aren’t necessarily that lucky, especially in the case where you have a free text field. For some of our clients, this keeps them up at night. You never know what’s going to be in there, and it’s pretty, pretty hard, if not impossible, to add rules to be confident that this personally identifiable information can be discovered and dealt with appropriately.

Vaughn Muirhead:
So what a lot of our customers end up having to do is completely eliminate this field from any data analytics, and that actually reduces the value of the data. So in this case, we’re applying our same Data Loss Prevention API template to the free text field, except we’re configuring it such that we’re using a technique called masking, which I mentioned earlier, rather than encryption, because we don’t need to keep the credit card information. And I’ll show you what this looks like in a little bit more detail when we flip to our BigQuery final stage.
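
As a hedged sketch of what that free-text masking might look like with the DLP API’s Python client: the project name, the info types chosen, and the sample note are illustrative assumptions, not the demo’s actual template.

```python
# Hedged sketch of masking PII discovered inside free text with the DLP API.
# Project name, info types, and the sample note are illustrative only.
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"

note = (
    "Spoke with applicant, DOB 12/03/1981, Medicare 2123 45670 1, "
    "deposit paid on card 4111 1111 1111 1111."
)

info_types = [
    {"name": "DATE_OF_BIRTH"},
    {"name": "AUSTRALIA_MEDICARE_NUMBER"},
    {"name": "CREDIT_CARD_NUMBER"},
]

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": {"info_types": info_types},
        "deidentify_config": {
            "info_type_transformations": {
                "transformations": [
                    {
                        # Irreversible masking: we never need these values back.
                        "primitive_transformation": {
                            "character_mask_config": {"masking_character": "#"}
                        }
                    }
                ]
            }
        },
        "item": {"value": note},
    }
)

print(response.item.value)  # PII found in the free text comes back as ####…
```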

Vaughn Muirhead:
But you can imagine, if you have a whole glob of free text, having a machine understand where the PII is in there and deal with it automatically for you in a highly accurate way, that’s allowing our clients to sleep a little more soundly.

Direnc Uysal:
Yeah, mate. And then a couple of applications of this: organizations that have call centers, where they type things depending on what the customer’s saying to them. And secondly, more and more, especially lately, is speech to text, right? So it’s recordings of conversations, and almost the whole thing goes into a glob, and how do you parse this stuff? This is my [inaudible 00:29:14].

Vaughn Muirhead:
We have lots of use cases like taking notes on a client: this person’s birth date is this, you might think that’s useful. Then you’re having to deal with PII that you might not have understood, or your employer might not have understood, shouldn’t be captured in the first place in most [inaudible 00:29:32].

Vaughn Muirhead:
The final step, actually, before I do that. This demo does exactly the same thing from the outset: it’s splitting our data, PII and attribute information, and landing them in separate tables in BigQuery. I’ll show you, this is the actual BigQuery interface on Google. Its browser interface allows us to look at our datasets and our tables. So in this case, in my architecture diagram, we had separate projects where the data was landing.

Vaughn Muirhead:
For simplicity in this example, I’ve actually got these just separated into separate tables within one dataset. This is not how our production accounts look. What I do want to dive into quickly is, if you look at our first table and click the preview tab, you can see that the data that will appear in this table looks very much like our example that I showed you in the pipeline, which is the side with the personally identifiable information. And I’ll jump into the second example, where we have our non-PII table.

Vaughn Muirhead:
When this preview loads up, you’ll be able to see that we have our encrypted email address and phone number, so you can use those as a key to understand common records inside this table, and you’ll see the redacted version of our free text field. If I showed you the raw data, which I won’t right now because this is all my personal information, I actually tested this out with my credit card number and my birthday and my age and so forth, and my Medicare number. It’s actually found all of these automatically and perfectly and [inaudible 00:31:21].
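
To illustrate using those surrogates as keys, here is a minimal sketch of a BigQuery query grouping the non-PII table by the encrypted phone value; the dataset and column names are placeholder assumptions.

```python
# Hedged sketch: using the deterministic surrogate as a grouping/join key
# in the non-PII table. Dataset and column names are placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="analytics-project")

sql = """
SELECT phone_surrogate,
       COUNT(*)    AS records_for_this_person,
       AVG(income) AS avg_income
FROM `analytics-project.staging.patients_attributes`
GROUP BY phone_surrogate
ORDER BY records_for_this_person DESC
LIMIT 10
"""

for row in client.query(sql).result():
    print(row.phone_surrogate, row.records_for_this_person, row.avg_income)
```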

Vaughn Muirhead:
That’s pretty awesome. And this is one of the reasons to use Google Cloud, because they do have these advanced out-of-the-box services such as the DLP API. And again, they have a managed, code-free, scalable product like Data Fusion, which is easy enough to use for someone like me. I don’t like Java code. But I do like diagrams, and that pipeline looks a lot like a diagram to me. So my thoughts there-

Direnc Uysal:
Perfect. Thank you Vaughn.

Speaker 2:
This is great guys. Um, I really appreciate seeing all the work that’s really required upstream of Tamr to get the data in good shape and to secure it, more than anything, for the intended uses. I guess a question for you is: Tamr is the only data mastering solution that is cloud native, where we leverage the native components of, say, GCP Dataproc for compute, Bigtable for storage, or GKE for Elasticsearch. But what are the benefits of using built-in GCP tools as opposed to other custom integration solutions outside of Tamr?

Vaughn Muirhead:
Okay, I’m going to take that one. So if we take that in terms of using something existing and configurable that’s provided as a service, versus rolling your own, or building or coding, because there are a lot of options. We have lots of creative and smart people on our teams who might be capable of building some of this functionality. I suppose there are several advantages; [inaudible 00:33:00] a couple right now. Obviously one is ease of use and cost, and, for lack of a better term, time to market.

Vaughn Muirhead:
If you actually want to accomplish something, and you have the right tool for the job, then you can pull off the [inaudible 00:33:14] in certain things immediately. And again, Data Fusion would be an example of this, in that it doesn’t take a lot of training to get it up and running, and to be using it without being a coder.

Vaughn Muirhead:
I think that’s one of the first benefits.

Direnc Uysal:
To add to that Vaughn, with more and more of the clients we work with, even their advanced analytics teams, they’re spending so much time on cleansing and massaging the data and writing custom code to be able to do that, that they can’t focus where they should be focusing, on the analytics piece. Right?

Direnc Uysal:
So throwing it over to the tools of [inaudible 00:33:49] of GCP leaves your focus on where you should [inaudible 00:33:52].

Vaughn Muirhead:
A hundred percent, Direnc. There are lots of other reasons why you want to use something that’s purpose-built for this kind of stuff by others and not yourself. One is standardization, as [inaudible 00:34:04], and actually the resources in the market. People already know how to use this tool; if you [inaudible 00:34:11] your own proprietary tool, you might just have a couple of [inaudible 00:34:16] that understand how that works, and then that’s a risk to your business.

Vaughn Muirhead:
And the third thing that I’ll cover, [inaudible 00:34:24] they’re not easy nuts to crack. It’s actually hard to test billions of rows of data while you’re developing your special [inaudible 00:34:33] product. Google have really done a great job here, making sure that their products are scalable. So there’s [inaudible 00:34:41] and more. I don’t think there’s a limit to the amount of data you can [inaudible 00:34:45]; these tools will process that at scale, and work. I mean, I don’t want to [inaudible 00:34:50] really so. A [inaudible 00:34:56] that’s [inaudible 00:34:58] care for. It’s best practice not to try to roll your [inaudible 00:35:03] standards as well. It’s better to [inaudible 00:35:04] stand on the shoulders of giants in this case. Thanks for that question.

Speaker 2:
Yeah, it makes a lot of sense, thinking about why you would use maybe all of GCP. There’s some of this functionality, especially Cloud Data Fusion, that seemed pretty intuitive and could be helpful. Can you use that if you don’t use GCP more widely?

Vaughn Muirhead:
Yeah, absolutely. So, it’s something that not a lot of people consider, but this is a tool, and it’s a managed service. It’s easy to get up and rolling with on GCP, and even if that’s not the only tool that you use, you could absolutely spin up a product like Data Fusion and configure it to reach into your sources, as I’ve shown with SQL Server, and rather than placing that into BigQuery on Google Cloud Platform, we can actually place that processed data right back into [inaudible 00:35:58] source, maybe even the same source that it came from.

Vaughn Muirhead:
So you have this code-free managed service that handles your ETL. It’s actually really, really cost-effective as well. And you can absolutely run that, and even use it to process data on-premise or in other cloud environments if you’d like to.

Speaker 2:
That’s great to hear. Just one more; I want to be conscious of everyone’s time. This has been really interesting, and I loved the demo. When Tamr thinks about scale, we think about it in the context of the three big V’s of big data: you have volume, you have velocity, and you have variety. Creating an MVP architecture, pipeline, or model across a few sources, or several hundred or thousands or even millions of records, is doable by many good, strong teams. But scaling a solution design is much harder. So, how can these data pipelines really scale? I’d love to hear your thoughts on the scale problem that we see all too often.

Vaughn Muirhead:
Sure. So in the case of Data Fusion, the scale is actually handled by Google, because Data Fusion is operating… It’s a nice-looking GUI interface, but it’s actually operating on [inaudible 00:37:16] Google’s big data processing managed service called Dataproc. And so, under the hood of Dataproc is your [inaudible 00:37:25] Spark environment.

Vaughn Muirhead:
So this is tried, tested and true big data technology capable of handling batch and streaming. So Google basically handle that [crosstalk 00:37:36].

Direnc Uysal:
And as [inaudible 00:37:40] points out, it doesn’t matter if you’re doing a pilot, because it’s going to be quite tiny and dealing with [inaudible 00:37:47], all the way through to billions and billions of records. You don’t manage it, you throw it [inaudible 00:37:53] and it scales [inaudible 00:37:54] for you. And I guess that’s the elegance and the beauty of [inaudible 00:38:00] basically.

Vaughn Muirhead:
Yeah, you write the recipe [inaudible 00:38:03], with that pipeline, with a few points and clicks, and it’ll scale that up for you to process [inaudible 00:38:10] to any [inaudible 00:38:10] records. Our examples have been dealing with 10 or a few hundred rows, but I’ve seen it process billions and [inaudible 00:38:18] at all.

Speaker 2:
Awesome. Well, is there anything else that you guys want to share today? Any of your learnings, time on the battlefield about building pipelines?

Direnc Uysal:
Sure. I guess the [inaudible 00:38:34] early [inaudible 00:38:36]. A lot of organizations have that end state in mind: how do I do [inaudible 00:38:41]? How do I lift [inaudible 00:38:43] and capability to [inaudible 00:38:44] in AI and machine learning? Too much [inaudible 00:38:47] place. It’s messy, I need to clean it. Leverage platforms [inaudible 00:38:53], leverage best practices in data pipelines, to lift and gear you up, so your focus is on that end state.

Direnc Uysal:
So your focus is on transforming [inaudible 00:39:03].

Vaughn Muirhead:
Yeah. I suppose the advice from me would be: be bold, it’s not as scary as you think it is. These tools are available, they’re easy to use, and if you’re not comfortable doing that, reach out to [inaudible 00:39:23] who can help you get started.

Speaker 2:
Awesome. Well, I really appreciate your time today sharing all this information. Seeing it live in action, I think, always really brings it to life. These are a lot of big problems that companies and organizations are trying to solve, especially around things like PII and data sovereignty. And Tamr’s thrilled to partner with you in the Australian market and broader APAC to really bring the solution to life.

Speaker 2:
We’re data experts, but there’s a lot that has to happen to that data before it comes into Tamr, which you highlighted today. And I think we’ll have to have a second session on what you do after you have that good, clean, mastered data. So round two will be coming.

Direnc Uysal:
Brilliant. Look forward to it.

Vaughn Muirhead:
[crosstalk 00:40:05].

Speaker 2:
All right, thanks so much everyone.

Direnc Uysal:
[inaudible 00:40:08].