As a modern data mastering platform, Tamr overcomes variety in data to produce mastered records for critical business entities such as People, Organizations, Supplies, and Products using a combination of machine learning and subject matter expert input. Tamr’s machine learning strategy empowers companies to overcome data quality issues that have hindered previous efforts and to achieve higher quality results in a shorter period of time than legacy solutions.
Because the data is relatively easy to access during a migration, the best time to use a data mastering platform to logically group key entities is often while moving one’s systems onto the Snowflake Data Cloud. To illustrate, the example below considers a large corporation with a number of subsidiaries, each with its own customer records. Each division has its own data management strategy. For example, Globex has built a customer mastering pipeline for all of its data, while Universal Exports maintains separate mastering strategies for in-store and online customers. Previous attempts to build a mastered list of customers have failed for this corporation because each subsidiary’s systems form a separate data silo with its own naming conventions, not to mention the differences in data content. As a result, the mastering rules written to master customers for Globex did not readily address the data from Universal Exports.
With the customer data from each subsidiary’s source system now migrated to Snowflake, Tamr can be applied to produce a 360-degree view of our corporation’s customers, taking into account data from all source systems. The flexibility and robustness of the Tamr platform lend themselves well to this task, as the platform has been designed to overcome data variety at scale.
For example, below is a view of one key customer where the value of Tamr is clear. Allyn Berry’s data has been aggregated from 11 records across multiple different sources to produce a 360-degree view that the corporation can effectively leverage. This is a view they did not have prior to using Tamr:
Tamr automatically generates such views of mastered entities and can assign them to subject matter experts within an organization, who provide feedback on the curation process. In this way, we ensure that whoever knows a given set of data best is the person who confirms its accuracy:
Tamr also tracks user engagement as subject matter experts are assigned curation questions to improve on the results created by Tamr’s machine learning model. For example, here we can see that two users, Marcus and John, have been actively interacting within Tamr:
With this context in mind, let’s look at how the Tamr workflow has been enhanced to take advantage of Snowflake’s new governance features to improve the value and security of an organization’s data.
First we’ll show it in action in this video, and then we’ll take a closer look at how this is done:
Snowflake offers the ability to tag objects such as tables and columns in order to identify sensitive data and to aid with compliance and access control. In the following example, we have generated a Snowflake project with five distinct datasets. Each entire dataset, or an individual column within a dataset, can be tagged with one of three sensitivity levels:
1 – Low sensitivity, least restricted information
2 – Medium sensitivity, somewhat restricted information
3 – High sensitivity, most restricted information
Let’s say that a data security officer has reviewed the corporation’s data stored in Snowflake. They have confirmed that data from the Globex subsidiary is considered highly sensitive. In addition, the DOB field is considered PII, and is therefore also highly sensitive. The data security officer creates and applies tags within Snowflake to reflect these sensitivities:
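As a rough illustration of this tagging step, the sketch below builds the Snowflake statements that such a policy could translate into. The tag name `sensitivity` and the table and column names are hypothetical and do not come from the project described here; the statements follow Snowflake's `CREATE TAG` and `ALTER TABLE ... SET TAG` syntax.

```python
# Hypothetical names: the tag "sensitivity" and the tables and columns used
# below are illustrative only.
ALLOWED_LEVELS = ("1", "2", "3")  # 1 = low, 2 = medium, 3 = high sensitivity


def _check_level(level: int) -> str:
    """Return the level as a string, rejecting anything outside 1-3."""
    value = str(level)
    if value not in ALLOWED_LEVELS:
        raise ValueError(f"sensitivity level must be 1, 2, or 3, got {level!r}")
    return value


def create_tag_sql(tag: str = "sensitivity") -> str:
    """Build a CREATE TAG statement limited to the three allowed levels."""
    values = ", ".join(f"'{v}'" for v in ALLOWED_LEVELS)
    return f"CREATE TAG IF NOT EXISTS {tag} ALLOWED_VALUES {values}"


def tag_table_sql(table: str, level: int, tag: str = "sensitivity") -> str:
    """Build a statement that tags an entire table."""
    return f"ALTER TABLE {table} SET TAG {tag} = '{_check_level(level)}'"


def tag_column_sql(table: str, column: str, level: int,
                   tag: str = "sensitivity") -> str:
    """Build a statement that tags a single column of a table."""
    return (f"ALTER TABLE {table} MODIFY COLUMN {column} "
            f"SET TAG {tag} = '{_check_level(level)}'")


# Mirror the scenario: one subsidiary's table and the DOB column are level 3.
statements = [
    create_tag_sql(),
    tag_table_sql("GLOBEX_CUSTOMERS", 3),
    tag_column_sql("UNIVERSAL_EXPORTS_ONLINE", "DOB", 3),
]
for statement in statements:
    print(statement + ";")
```

The generated statements would be run by a role that has permission to create and apply tags; how they are executed (worksheet, connector, or pipeline) is left open here.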
In the above graphic, we demonstrate how a user may use Tamr to ingest datasets with a certain sensitivity value, then repopulate Snowflake with the redacted data. Fortunately for this use case, Tamr can apply metadata to, or read metadata from, a dataset or its underlying attributes. By reading this metadata from Snowflake, we can direct Tamr to censor data based on its sensitivity. Below we see how sensitivity metadata is automatically applied to datasets ingested from Snowflake into Tamr:
Tamr transformations are then automatically applied to remove values based on this sensitivity information. If we now look back at our example customer from before, Allyn Berry, we can see that this view is now the aggregate of 8 records rather than 11. The three records from Globex and Universal Exports Instore have been removed from this 360-degree customer view. In addition, the DOB field has been censored, ensuring PII redaction requirements for this organization have been met.
As a result, the data security officer can have confidence that the data sensitivity tags they create and maintain in Snowflake are applied to the data as it is mastered by Tamr. In fact, this functionality allows multiple master and golden record views of the data to be created automatically, depending on the data consumption use case (e.g. data residency requirements, PII compliance).
Let’s imagine that we have completed the rest of the steps in a Tamr mastering project, and that our mastered customer 360 dataset has been added to Snowflake, along with a version that aligns with the data sensitivity policies set by the data security officer. Both tables are queried and accessed by many different users within Snowflake.
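The redaction logic described above can be sketched in plain Python. This is not Tamr's transformation syntax; it is a minimal illustration that assumes each record carries a `source` field, that sensitivity levels arrive as simple dictionaries, and that untagged datasets and columns default to level 1. All record values below are made up.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def redact(records: List[Record],
           dataset_levels: Dict[str, int],
           column_levels: Dict[str, int],
           max_level: int = 2,
           source_key: str = "source") -> List[Record]:
    """Drop records from overly sensitive datasets and blank sensitive columns.

    Anything tagged above max_level is removed (whole record) or set to
    None (single field). Untagged items are assumed to be level 1.
    """
    kept = []
    for record in records:
        source = record.get(source_key)
        if dataset_levels.get(source, 1) > max_level:
            continue  # the whole source dataset is too sensitive
        kept.append({
            field: None if column_levels.get(field, 1) > max_level else value
            for field, value in record.items()
        })
    return kept


# Illustrative input: three source records for one customer.
records = [
    {"source": "globex", "name": "A. Example", "dob": "1970-01-01"},
    {"source": "ue_online", "name": "A. Example", "dob": "1970-01-01"},
    {"source": "ue_online", "name": "Example, A", "dob": None},
]
dataset_levels = {"globex": 3}   # dataset-level tag
column_levels = {"dob": 3}       # column-level tag

redacted = redact(records, dataset_levels, column_levels, max_level=2)
print(len(redacted), "records kept")
```

Keeping the threshold as a parameter is what allows several views of the same mastered data to be produced, one per consumption use case.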
Human feedback is an integral part of the Tamr workflow. By taking advantage of Snowflake’s data access history feature, a potentially new set of data consumers can be identified and brought into the Tamr feedback workflow. Dynamically including the users who access these tables most often in the Tamr model feedback process helps keep the model performing well over time. The access history for a dataset can be viewed by querying the Access History view. This view can be represented in Tamr’s operational mastering dashboard, as seen below:
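For readers who want to see what querying the Access History view might look like, here is a minimal sketch that builds such a query as a string. It assumes the `SNOWFLAKE.ACCOUNT_USAGE.ACCESS_HISTORY` view and its `BASE_OBJECTS_ACCESSED` column; the table name, the 30-day window, and the function name are hypothetical choices for illustration.

```python
import re

# Accept only plain DATABASE.SCHEMA.TABLE identifiers, since the name is
# interpolated into the SQL text.
_FQN = re.compile(r"[A-Za-z_][A-Za-z0-9_$]*(\.[A-Za-z_][A-Za-z0-9_$]*){2}")


def access_counts_sql(table_fqn: str, days: int = 30) -> str:
    """Build a query counting, per user, the queries that read one table."""
    if not _FQN.fullmatch(table_fqn):
        raise ValueError(f"expected DATABASE.SCHEMA.TABLE, got {table_fqn!r}")
    if not isinstance(days, int) or days <= 0:
        raise ValueError("days must be a positive integer")
    name = table_fqn.upper()  # unquoted Snowflake identifiers are upper case
    return (
        "SELECT user_name, COUNT(DISTINCT query_id) AS access_count\n"
        "FROM snowflake.account_usage.access_history,\n"
        "     LATERAL FLATTEN(input => base_objects_accessed) AS obj\n"
        f"WHERE obj.value:\"objectName\"::STRING = '{name}'\n"
        f"  AND query_start_time >= DATEADD(day, -{days}, CURRENT_TIMESTAMP())\n"
        "GROUP BY user_name\n"
        "ORDER BY access_count DESC"
    )


# Hypothetical location of the mastered customer 360 table.
print(access_counts_sql("analytics.mastered.customer_360"))
```

The resulting rows (user name and access count) are the input that a dashboard or a downstream membership job would consume.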
Now comes the step to connect the list of users accessing Tamr data in Snowflake with the users available in the Tamr application to assign questions to. To do so, we first create a group within Tamr to represent the users in Snowflake accessing the Tamr mastered data:
Next, we define a user policy in Tamr. Within this policy we configure two items. First, we set the scope of the policy to one or more projects. For this example we have granted access to the Tamr mastering projects and their associated golden records projects:
Second, we define the level of access the group members should have to the relevant projects. In this example, they have Reviewer access:
Now, we take advantage of the Access History view in Snowflake to understand who is consuming the Tamr mastered data. Here is the view as of this step:
In Tamr, we add to the Snowflake group those users who have permission to access both Tamr and Snowflake and who have accessed the Tamr mastered data in Snowflake a certain number of times. To do this, the Access History view is programmatically queried to determine who has accessed the Tamr mastered data in Snowflake. The results of this query are passed to Tamr’s APIs to update the group membership. Right now, the only two users that meet these criteria are Marcus Halberstram and James St-John Smythe. With proper user accounts created (perhaps through LDAP integration for Tamr and Snowflake), members of their team can now assign questions to them to accurately train the model:
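The selection criteria above can be expressed as a small function. This sketch assumes that the access counts have already been fetched from Snowflake and that a user has the same username in both systems apart from letter case; the usernames, the service account, and the threshold of 5 are hypothetical. The call that actually updates the group through Tamr's APIs is deliberately left out.

```python
from typing import Dict, Iterable, List


def eligible_reviewers(access_counts: Dict[str, int],
                       tamr_usernames: Iterable[str],
                       min_accesses: int = 5) -> List[str]:
    """Return users who exist in Tamr and meet the access threshold.

    access_counts maps a Snowflake user name to the number of times that
    user accessed the mastered data. Names are compared case-insensitively.
    """
    tamr = {name.lower() for name in tamr_usernames}
    return sorted(
        user.lower()
        for user, count in access_counts.items()
        if count >= min_accesses and user.lower() in tamr
    )


# Illustrative input: counts as they might come back from Snowflake.
access_counts = {
    "MHALBERSTRAM": 12,
    "JSMYTHE": 7,
    "LLEMON": 2,        # has a Tamr account but too few accesses so far
    "SVC_LOADER": 40,   # frequent accessor with no Tamr account
}
tamr_usernames = {"mhalberstram", "jsmythe", "llemon"}

print(eligible_reviewers(access_counts, tamr_usernames))
```

Requiring both conditions keeps service accounts and occasional readers out of the reviewer group.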
Over time, more people may start to consume the Tamr mastered data within Snowflake. By querying the Access History view, we can see this consumption scale and automatically update the Snowflake Reviewer group in Tamr to include these users. For example, now we can see that Liz Lemon is frequently accessing the Tamr customer 360 tables in Snowflake:
By refreshing the group membership, we are now able to assign questions for Liz Lemon to answer:
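A refresh of this kind comes down to comparing the current group with the newly computed one. The sketch below shows that comparison only; the usernames are hypothetical, and whether users who stop accessing the data should also be removed is an assumption rather than something the workflow described here specifies.

```python
from typing import Iterable, List, Tuple


def membership_changes(current: Iterable[str],
                       desired: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (users to add, users to remove) to reach the desired group."""
    current_set, desired_set = set(current), set(desired)
    to_add = sorted(desired_set - current_set)
    to_remove = sorted(current_set - desired_set)  # optional to act on
    return to_add, to_remove


current_members = {"jsmythe", "mhalberstram"}
desired_members = {"jsmythe", "mhalberstram", "llemon"}

to_add, to_remove = membership_changes(current_members, desired_members)
print("add:", to_add, "remove:", to_remove)
```

Running this comparison on a schedule means the group only changes when consumption in Snowflake actually changes.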
The Tamr mastering dashboard is then updated to show the status of her assignments:
With this programmatic approach, Tamr can use Snowflake’s Access History view to ensure that the most appropriate subject matter experts, those accessing the data most, are automatically included in the Tamr mastering project.
Snowflake has released a set of powerful governance features that will be valuable to any enterprise customer. Tamr can use these features to improve the security and governance of an organization’s data and, using a data-driven approach, to expand the set of people involved in Tamr’s feedback workflow. The partnership between Snowflake data governance and Tamr not only improves the quality of the data stored in Snowflake but also leads to increased trust in that data, and thus to further consumption.
If you have any questions about how you can effectively leverage Snowflake and Tamr to master your data, please contact our Technical Sales Director for SAP & Cloud Partnerships, Stuart Rorer, at [email protected]
If you’d like to learn more about the assets used in this blog post and how to use them in your own deployment of Tamr and Snowflake, please feel free to contact your Tamr account manager for more details.