Enterprises want clean, up-to-date data to drive their digital transformations. A key tool they expect to help them get there is the data catalog: a tool that creates and maintains an inventory of data assets and metadata through the discovery, description, and organization of distributed datasets.
A good data catalog is a solid step toward the best-of-breed DataOps principles that we advocate. In our view, a well-curated data catalog should offer the following: a wide range of connectors to the right data sources, guided stewardship to link assets to domains, automatic classification of data, and data lineage extraction and tracking.
However, investing in a data catalog alone won’t magically solve all messy data problems. Tamr can augment and improve investments in existing catalogs by quickly and easily turning raw source data into published, curated data.
How does Tamr help enrich data catalogs with machine learning?
- Tamr can help catalogs identify related records/sources/attributes.
- Tamr can help catalogs automate data classification & tagging.
- Tamr can publish curated clean data back to the catalog for data consumption.
For data stewards and data management teams, curating a catalog is an iterative process even with the right tools, and the catalog becomes more useful over time. We describe the steps below sequentially. In reality, the work will be messy rather than step-by-step, but you already knew that.
How to curate a data catalog
1. Establish a common business domain
The data stewards who are most often responsible for managing catalogs need to work with the business side of the organization to establish common business terminology, create data domains (the kinds of data they want to curate), and set up relevant business glossaries. Roles for data users and owners are defined, and specific data policies (such as access to PII) are applied. A good data catalog should be able to automate some of the workflows for establishing data access and privacy, based on the roles and data policies applied to specific data assets.
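To make the idea of role- and policy-based access concrete, here is a minimal sketch. It is not how any particular catalog implements access control; the `Asset` class, the `POLICY` table, the role names, and the asset names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Asset:
    """A catalog entry with the policy tags a steward attached to it (hypothetical model)."""
    name: str
    tags: frozenset

# Hypothetical policy table: policy tag -> roles allowed to read assets carrying that tag.
POLICY = {"PII": {"data_steward", "compliance"}}

def can_access(role: str, asset: Asset) -> bool:
    """Allow access only if the role is permitted for every restricted tag on the asset."""
    return all(role in POLICY[tag] for tag in asset.tags if tag in POLICY)

customers = Asset("crm.customers", frozenset({"PII"}))
products = Asset("erp.products", frozenset())

print(can_access("analyst", customers))       # False: analysts are not cleared for PII
print(can_access("data_steward", customers))  # True
print(can_access("analyst", products))        # True: no restricted tags on this asset
```

Keeping the policy in a table keyed by tag means that tagging an asset as PII is enough to restrict it, with no per-asset rule to maintain.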
Here, Tamr can take the business terminology defined for the data sources and learn to apply tags automatically. For example, after the data is profiled, Tamr can read through the content of the data and automatically map the columns that contain names and addresses to “Full Names” and “Address Line”, respectively.
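As a rough illustration of learning tags from column content, the sketch below trains a tiny nearest-centroid classifier on example columns that a steward has already labeled. This is a toy stand-in, not Tamr's model or API: the two features (share of digits, share of letters), the function names, and the sample values are all assumptions made for the example, and it assumes every value is a non-empty string.

```python
from statistics import mean

def features(values):
    """Describe a column by the average share of digits and of letters in its values."""
    digit = mean(sum(c.isdigit() for c in v) / len(v) for v in values)
    alpha = mean(sum(c.isalpha() for c in v) / len(v) for v in values)
    return (digit, alpha)

def train(labeled_columns):
    """Compute one centroid per glossary tag from steward-labeled example columns."""
    centroids = {}
    for tag, columns in labeled_columns.items():
        feats = [features(col) for col in columns]
        centroids[tag] = tuple(mean(dim) for dim in zip(*feats))
    return centroids

def suggest_tag(centroids, values):
    """Suggest the tag whose centroid is closest to the new column's features."""
    f = features(values)
    return min(
        centroids,
        key=lambda tag: sum((a - b) ** 2 for a, b in zip(f, centroids[tag])),
    )

# Hypothetical labeled examples, using the glossary terms from the text.
centroids = train({
    "Full Names": [["Ada Lovelace", "Alan Turing", "Grace Hopper"]],
    "Address Line": [["12 Main St", "221B Baker Street", "1600 Pennsylvania Ave"]],
})

print(suggest_tag(centroids, ["Edsger Dijkstra", "Barbara Liskov"]))
print(suggest_tag(centroids, ["742 Evergreen Terrace", "10 Downing Street"]))
```

The first call suggests "Full Names" and the second "Address Line". Adding a new glossary term only requires new labeled examples, not new rules.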
2. Connect to the data sources
Siloed data is a common problem in most organizations. To be effective, data catalogs need to connect to the data sources that are relevant to the established data domain and ingest metadata into the catalog. The catalog should be able to refresh the metadata continuously. In reality, though, we often see customers dump whatever data they have into the data lake without knowing whether it is relevant, and try to figure out its usefulness later. Some data catalogs offer guided stewardship that helps link assets to domains, and automatic classification that tags the data or columns. However, if you don’t know what the source is and don’t know whether the data is relevant, a catalog won’t be able to tell you.
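Ingesting metadata from a source can be sketched as follows. The example uses an in-memory SQLite database as a stand-in for a real source system; the `harvest_metadata` function, the entry layout, and the table contents are assumptions for illustration, and a real catalog connector would look different.

```python
import sqlite3

def harvest_metadata(conn, source_name):
    """Collect table names, column names and types, and row counts from one source."""
    entries = []
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    for (table,) in tables:
        # PRAGMA table_info returns rows of (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
        row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
        entries.append({
            "source": source_name,
            "table": table,
            "columns": [{"name": c[1], "type": c[2]} for c in columns],
            "row_count": row_count,
        })
    return entries

# Hypothetical source with a single table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (full_name TEXT, address TEXT)")
conn.execute("INSERT INTO customers VALUES ('Ada Lovelace', '12 Main St')")

catalog_entries = harvest_metadata(conn, "crm")
print(catalog_entries)
```

Running the same function on a schedule is one simple way to keep the harvested metadata refreshed as the source changes.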
Fortunately, Tamr can use the tagging learned in step one to enrich the metadata and produce a golden record for metadata. And because we use a machine learning approach rather than a rule-based one, the algorithm adapts more easily to new data sources and can be adjusted as business needs evolve.
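One way to picture a golden record for metadata is to consolidate every column that received the same tag into a single entry. The sketch below does this with a simple majority vote on the column type. It is only an assumed illustration of the concept, not Tamr's method; the `consolidate` function, the field names, and the sample entries are hypothetical.

```python
from collections import Counter, defaultdict

def consolidate(column_entries):
    """Merge column metadata that shares a tag into one record per tag."""
    grouped = defaultdict(list)
    for entry in column_entries:
        grouped[entry["tag"]].append(entry)

    golden = {}
    for tag, entries in grouped.items():
        types = Counter(e["type"] for e in entries)
        golden[tag] = {
            "tag": tag,
            # Majority vote across sources picks the consolidated type.
            "type": types.most_common(1)[0][0],
            # Keep a pointer back to every contributing source column.
            "sources": sorted(f'{e["source"]}.{e["column"]}' for e in entries),
        }
    return golden

# Hypothetical tagged columns from three source systems.
tagged_columns = [
    {"source": "crm", "column": "cust_name", "type": "TEXT", "tag": "Full Names"},
    {"source": "erp", "column": "customer", "type": "VARCHAR", "tag": "Full Names"},
    {"source": "billing", "column": "name", "type": "TEXT", "tag": "Full Names"},
    {"source": "crm", "column": "addr", "type": "TEXT", "tag": "Address Line"},
]

golden_metadata = consolidate(tagged_columns)
print(golden_metadata["Full Names"])
```

Here the three differently named customer-name columns collapse into one "Full Names" record that still lists where each column came from.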
3. Data consumption
As the data catalog continues to be curated, data users can start using it through search functions like Elastic, and the catalog might also provide data profiling and previews. Catalogs will also be able to extract data lineage from Tamr’s back end for audit and compliance purposes. Some data catalogs have good collaboration capabilities that let users review and comment on data for other users to reference.
In other words, Tamr can make catalogs more useful so that data consumers can spend more time on curated data discovery with trustworthy data.
Tamr works closely with different data catalogs. Across multiple projects, we have augmented existing catalog implementations and helped enrich their metadata. Most importantly, we partner with data catalogs in the most efficient and effective way so that we can better serve our customers.