What is a Tamr ID?
A Tamr ID is a persistent identifier (PI or PID) assigned by Tamr that serves as a long-lasting reference to an entity across multiple data sources. Tamr IDs give organizations a primary key they can use across multiple databases when there is no common identifier for the same entity, such as a person, a product, or an organization. A Tamr ID can provide a holistic view of the same entity across multiple operational systems, for example by acting as an Enterprise Master Patient Index (EMPI) for patient health records. It can also provide a longitudinal view of the same entity over time, for example when studying a company that grows through mergers and acquisitions.
What is a “Persistent Identifier”?
An identifier is a unique code applied to “something” so that the “something” can be referenced unambiguously. For example, an ISBN is an identifier for a particular book. In the United States, a Social Security number is an identifier for a particular person. The concept of an identifier is important in database deduplication and entity resolution. In most databases, the identifier is called the primary key: the column or columns whose values uniquely identify each row in a table. A persistent identifier is an identifier that is, in effect, permanently assigned to an entity, regardless of the database in which the entity is represented. To continue the book example, once an ISBN is assigned to a particular book, that number stays associated with that book, regardless of the library or bookshop where the book appears. No other book will ever receive that same ISBN.
Why are Persistent Identifiers Useful?
Persistent identifiers are useful for two main reasons.
- Persistent identifiers are unambiguous: In database systems, we deal with records about many different things: a product, a person, a transaction, a supplier, a contact, and the list goes on. We also have many different ways of referring to those records, such as names, and these leave room for ambiguity. For example, are Tamr and Tamr, Inc. the same company? A human can often pick out the correct record based on context, but it is difficult for a computer to interpret that context correctly.
- Persistent identifiers are persistent: Companies and people can and will change names. The same product might also have a different product code in an SAP system than in an Oracle system. With a persistent identifier, you can track the same entity over time, whether it is a company, a person, or a product, because the identifier itself stays the same.
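As a rough illustration of the two points above, a persistent identifier can be thought of as the value that a crosswalk table maps each source-system key to. The sketch below is a minimal example under stated assumptions: the source names, local keys, ID values, and the helper functions `same_entity` and `records_for` are all hypothetical and are not taken from Tamr's product or API.

```python
# Hypothetical crosswalk: (source system, local key) -> persistent ID.
# All keys and ID values here are invented for illustration.
crosswalk = {
    ("sap", "MAT-00417"): "T-1",
    ("oracle", "ITEM-9921"): "T-1",
    ("sap", "MAT-00520"): "T-2",
}


def same_entity(key_a, key_b, crosswalk):
    """Return True when both source keys map to the same persistent ID."""
    pid_a = crosswalk.get(key_a)
    pid_b = crosswalk.get(key_b)
    return pid_a is not None and pid_a == pid_b


def records_for(pid, crosswalk):
    """List every (source, local key) pair that carries the given persistent ID."""
    return sorted(key for key, value in crosswalk.items() if value == pid)


# The same product has different codes in the two systems,
# but both codes resolve to one persistent ID.
linked = same_entity(("sap", "MAT-00417"), ("oracle", "ITEM-9921"), crosswalk)
sources_for_t1 = records_for("T-1", crosswalk)
```

With a mapping like this, a downstream system can join on the persistent ID instead of trying to reconcile names or local product codes.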
Why is it Hard to Establish Persistent Identifiers?
In the context of enterprise data management, keeping identifiers unambiguous and persistent is a difficult task. On the one hand, the organization creates more data records and adds more data sources over time. On the other hand, the business rules for mastering records may change. To maintain persistent identifiers, the system must be able to compare two clusterings, the previously published one and the current one, and it must be able to manage cluster member survivorship. The difficulty is particularly visible when mastering corporate accounts and assets through mergers, acquisitions, divestitures, and spin-offs. In these cases, records that used to be distinct entities are merged, and records that used to represent the same entity are separated. This is a non-trivial problem, both to solve technically and to present visually.
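To make the comparison of two clusterings concrete, here is a minimal sketch of one possible way to carry identifiers forward. It assumes a simple greedy rule: each current cluster inherits the published ID it shares the most records with, and each ID is reused at most once. This rule, the function name `carry_forward_ids`, and the `T-n` ID format are illustrative assumptions, not a description of how Tamr implements survivorship.

```python
from collections import Counter
from itertools import count


def carry_forward_ids(previous, current, new_ids):
    """Assign persistent IDs to the current clusters.

    previous: dict mapping record id -> published persistent ID
    current:  list of sets of record ids (the new clustering)
    new_ids:  iterator that yields IDs never used before
    Returns a dict mapping persistent ID -> set of record ids.
    """
    # Count how many records each current cluster shares with each published ID.
    candidates = []
    for idx, members in enumerate(current):
        overlap = Counter(previous[r] for r in members if r in previous)
        for pid, shared in overlap.items():
            candidates.append((shared, pid, idx))

    # Greedy rule (an assumption): largest overlap wins, ties broken by ID
    # and cluster position; each ID and each cluster is matched at most once.
    candidates.sort(key=lambda c: (-c[0], c[1], c[2]))
    assigned, used = {}, set()
    for shared, pid, idx in candidates:
        if idx in assigned or pid in used:
            continue
        assigned[idx] = pid
        used.add(pid)

    # Clusters with no surviving ID (new or split-off) get a fresh one.
    result = {}
    for idx, members in enumerate(current):
        pid = assigned[idx] if idx in assigned else next(new_ids)
        result[pid] = set(members)
    return result


# Published state: T-1 = {r1, r2}, T-2 = {r3, r4, r5}.
published = {"r1": "T-1", "r2": "T-1", "r3": "T-2", "r4": "T-2", "r5": "T-2"}
# Current clustering: r3 moved next to r1 and r2, and r6 is a new record.
clusters_now = [{"r1", "r2", "r3"}, {"r4", "r5"}, {"r6"}]
fresh = (f"T-{n}" for n in count(3))
result = carry_forward_ids(published, clusters_now, fresh)

# Merge: two published entities become one cluster; the larger one's ID survives.
merged = carry_forward_ids(
    {"a1": "T-1", "a2": "T-1", "b1": "T-2"},
    [{"a1", "a2", "b1"}],
    (f"T-{n}" for n in count(3)),
)

# Split: one published entity becomes two; the larger part keeps the ID.
split = carry_forward_ids(
    {"x1": "T-1", "x2": "T-1", "x3": "T-1"},
    [{"x1", "x2"}, {"x3"}],
    (f"T-{n}" for n in count(2)),
)
```

In this sketch a merge retires the ID of the smaller cluster, and a split issues a fresh ID for the smaller part. A production system would also need to record retired IDs so that they are never reissued.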
How Does Tamr Manage Persistent Identifiers?
Within Tamr’s mastering workflow, the system assigns a Tamr ID to each cluster produced by grouping records together. It also tracks cluster statistics over time and lets users review the history of records and clusters. These cluster IDs are guaranteed to be unique and stable, so they can be used for downstream tracking in other applications, such as Tableau or Power BI.
Tamr also reports the changes between the last published version of the clusters and the current version, and users can view these changes in a few different ways:
- Overall changes: A review screen shows the total number of clusters over time, as well as the total number of changed, new, and empty clusters since the last published version. Filters on the clusters page allow users to view the exact records and clusters that these metrics refer to.
- Overall metrics per cluster: A sidebar shows the historical size of each cluster over time, as well as the records that have been added and/or removed from the cluster.
- Added/removed records per cluster: A “Diff View” highlights in blue the records that have moved into the cluster since the last published version, and in red the records that have moved out of it. The sidebar information for each record also shows the current cluster and the last cluster that the record was in.
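The views above are all derived from comparing the published clusters with the current ones. The sketch below shows one way such a comparison could be computed with plain set operations. The function names and the exact definitions of "changed", "new", and "empty" are assumptions made for illustration. They are not Tamr's definitions or API.

```python
def diff_clusters(published, current):
    """Per-cluster added and removed records.

    Both arguments map a cluster ID to the set of record ids it contains.
    """
    diff = {}
    for cid in sorted(published.keys() | current.keys()):
        before = published.get(cid, set())
        after = current.get(cid, set())
        diff[cid] = {"added": after - before, "removed": before - after}
    return diff


def summarize(published, current):
    """Overall counts, using the assumed definitions noted in the comments."""
    diff = diff_clusters(published, current)
    return {
        # Clusters that currently hold at least one record.
        "total": sum(1 for members in current.values() if members),
        # Cluster IDs that did not exist in the published version.
        "new": sum(1 for cid in current if cid not in published),
        # Clusters that had records when published and have none now.
        "empty": sum(
            1 for cid, members in published.items()
            if members and not current.get(cid)
        ),
        # Surviving clusters whose membership differs from the published one.
        "changed": sum(
            1 for cid, d in diff.items()
            if cid in published and current.get(cid)
            and (d["added"] or d["removed"])
        ),
    }


published = {"T-1": {"r1", "r2"}, "T-2": {"r3", "r4", "r5"}, "T-4": {"r7"}}
current = {"T-1": {"r1", "r2", "r3"}, "T-2": {"r4", "r5"},
           "T-3": {"r6"}, "T-4": set()}

per_cluster = diff_clusters(published, current)
overall = summarize(published, current)
```

Here `per_cluster["T-1"]["added"]` holds the records a diff view would mark as moved in, and `per_cluster["T-2"]["removed"]` holds those it would mark as moved out.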
Tamr provides a robust process for managing and tracking how entities merge and split, and for keeping the ID on each record in the unified system up to date. Through these IDs, organizations can quickly match against the most up-to-date groups of entities and see the lineage of their mastered data over time.