Written by Margaret Soderholm
Harnessing Metadata in a Robust Catalog Opens Dramatic Opportunities for Organizations
With the fast-growing interest in data lakes — a storage solution that allows structured and semi-structured data to live in the same place — attention is turning toward metadata as a way to organize large amounts of enterprise data. Metadata is an ambiguous and generic term, but most commonly refers to attribute names, data types, relationships, basic data quality metrics, usage stats and access controls. Metadata is literally data about data, and is often left unwritten — stored only in the heads of those in the know.
At Tamr, we believe that capturing and harnessing this metadata in a robust catalog can open dramatic opportunities for an organization. Specifically, a catalog can improve the availability of enterprise data, allowing data scientists to quickly and confidently gather the necessary data for analysis, while enabling data stewards to understand how data interacts and connects across sources and silos.
Let’s start with an example: You’re a data scientist tasked with analyzing payment terms across all of your organization’s suppliers, with data from hundreds of ERP systems flowing into a single data lake. The fact that this data is in a single location might give you the illusion that it’s easily accessible for analysis, but it’s actually no easier than trying to find your car keys while rushing out of the house. You know the keys are in the house. But you don’t know where you can immediately find them when it’s time to leave. A well-maintained metadata catalog would make it far easier for you to identify the attributes and sources required for your payment analysis across organizational silos. The result? Significantly less time is spent collecting data, and more accurate outcomes can be achieved.
This all sounds great in theory. But what about in practice? Here are four best practices for starting to manage your own metadata for analytic applications:
1. Start with Questions (The Hard Ones)
Before you begin thinking about metadata, start by thinking of the most impactful business-level questions that your organization wants answered, and the data required to answer them. The data needed for an analysis of cross-sale opportunities between your lines of business may be different from the data needed for a projection of inventory for your stores. Thinking through requirements ahead of time and ensuring they are baked into your metadata catalog can be an immense time-saver when it comes time to perform analysis; you may not have the data immediately ready, but you’ll know what’s available, its quality and reliability, and where to get it.
2. Identify Core Attributes and Sources (Customers, Suppliers, Parts, etc.)
As you develop these key business questions, you’ll no doubt get a better idea about the underlying entities that would be required for analysis. In the payment analysis example above, the key entities would be suppliers and payment terms. For a pharma analysis, it could be patient, drug and experiment data. Understanding these entities and their relationships is critical for downstream analytics.
3. Identify Key Data Experts
The most valuable metadata often isn’t stored in a database or data lake. It’s stored in the brains of people: the data owners and experts who are often spread throughout the enterprise.
Understanding table relationships, completeness or emptiness indicators, and table structure is way too big a job for one person. The knowledge is split up amongst the various domain experts who use the data regularly and IT analysts who create and maintain the data structure. Everything from quality metrics (e.g., ‘99999’ means null in this attribute) to data origins (e.g., average county income was used for the ‘income’ attribute) to much more nuanced information (e.g., inflation in the US vs. inflation in Mexico) is stored somewhere in the minds of your owners and experts. Once you’ve identified the business goals and what kind of data you’ll require, make sure you verify with these experts that you have everything you need, and that you have what you think you have.
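One way to keep this kind of expert knowledge from staying in people’s heads is to record it next to the attribute it describes. The sketch below is a minimal illustration in Python, not a Tamr Catalog API: the `AttributeMetadata` class, its field names and the sample values are all hypothetical. It records a null sentinel and an origin note for an attribute, then uses the sentinel to compute a basic completeness metric.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class AttributeMetadata:
    """Hypothetical catalog entry for one attribute (not a Tamr API)."""
    name: str
    data_type: str
    null_sentinel: Optional[str] = None  # e.g. '99999' means null
    origin: str = ""                     # where the values came from
    expert: str = ""                     # who to ask about this attribute


def completeness(values: Sequence[Optional[str]], meta: AttributeMetadata) -> float:
    """Fraction of values that are really populated, honoring the sentinel."""
    if not values:
        return 0.0
    missing = {None, "", meta.null_sentinel}
    populated = [v for v in values if v not in missing]
    return len(populated) / len(values)


# Illustrative entry built from the examples in the text above.
income = AttributeMetadata(
    name="income",
    data_type="integer",
    null_sentinel="99999",
    origin="average county income",
    expert="(hypothetical) regional sales analyst",
)

sample = ["52000", "99999", "61000", "", "99999"]
print(completeness(sample, income))  # 2 of the 5 values are populated
```

Without the sentinel recorded, the same column would look 80 percent complete, which is exactly the kind of mistake a conversation with the data owner prevents.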
4. Create a Protocol, and Be Consistent
Data changes constantly. New business initiatives and needs pop up every day. Responding to all these changes ad hoc is not going to lead to long-term data stability. Instead, create a more deliberate process for reviewing metadata changes and monitoring data streams for change. Metadata is a critical part of a healthy data ecosystem, but it only takes one oversight or mistake to render it ineffectual.
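A deliberate review process can start small. The sketch below assumes you can take periodic snapshots of each source’s attribute names and types; the `diff_schema` function and the sample snapshots are hypothetical and only illustrate the idea of flagging changes for review rather than reacting to them ad hoc.

```python
from typing import Dict, List


def diff_schema(previous: Dict[str, str], current: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two {attribute_name: data_type} snapshots of one source."""
    added = sorted(set(current) - set(previous))
    removed = sorted(set(previous) - set(current))
    retyped = sorted(
        name for name in set(previous) & set(current)
        if previous[name] != current[name]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical snapshots of a supplier table taken a week apart.
last_week = {"supplier_id": "string", "payment_terms": "string", "income": "integer"}
this_week = {"supplier_id": "string", "payment_terms": "integer", "country": "string"}

changes = diff_schema(last_week, this_week)
for kind, names in changes.items():
    if names:
        # In a real protocol these would be queued for an expert to review.
        print(f"{kind}: {', '.join(names)}")
```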
Clearly, this is the most difficult step to implement. Part of this implementation is deciding what tools to use for tracking and maintaining data deltas. Master Data Management (MDM), software that uses user-defined rules for matching entities and mapping attributes, was developed for exactly this reason. Many of those tools have been incredibly effective at mapping segments of a data ecosystem. But there are a few problems with using this rigid, top-down approach: namely, we don’t know what we don’t know.
At Tamr, we’ve taken a probabilistic, bottom-up approach to building a data unification platform that learns what it doesn’t know. Algorithms automatically connect the vast majority of data sources and resolve duplications, errors and inconsistencies among entities and attributes. When the Tamr system can’t resolve connections automatically, it calls on people in the organization familiar with the data to weigh in on the mapping and improve its quality and integrity.
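The general pattern described here, automatic matching with a human fallback, can be sketched in a few lines. This is not Tamr’s algorithm: plain string similarity from Python’s standard `difflib` module stands in for a learned model, and the `AUTO_ACCEPT` threshold, function names and sample attribute names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher
from typing import List, Tuple

# Hypothetical threshold: matches scoring below it go to a human reviewer.
AUTO_ACCEPT = 0.8

Match = Tuple[str, str, float]  # (source attribute, best target, score)


def similarity(a: str, b: str) -> float:
    """Crude name similarity in [0, 1]; a stand-in for a learned model."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_attributes(source: List[str], target: List[str]) -> Tuple[List[Match], List[Match]]:
    """Split proposed mappings into auto-accepted and needs-review lists."""
    accepted: List[Match] = []
    needs_review: List[Match] = []
    for attr in source:
        best = max(target, key=lambda t: similarity(attr, t))
        score = similarity(attr, best)
        if score >= AUTO_ACCEPT:
            accepted.append((attr, best, score))
        else:
            needs_review.append((attr, best, score))
    return accepted, needs_review


# Illustrative attribute names from two imaginary ERP systems.
accepted, needs_review = map_attributes(
    ["Payment_Terms", "supp_name", "zip"],
    ["payment_terms", "supplier_name", "postal_code"],
)
print("auto-mapped:", [(a, b) for a, b, _ in accepted])
print("ask an expert:", [(a, b) for a, b, _ in needs_review])
```

The design point is the second list: low-confidence matches are routed to the people who know the data instead of being silently accepted or dropped.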
Tamr Catalog is a key component of this unification platform, automatically cataloging all metadata available to the enterprise. Think of it as a logical map for all of your enterprise’s information, organized by logical entities (what the data represents) rather than by where the data is physically stored. This map essentially gives you total visibility into all enterprise metadata regardless of type, platform or source. So when you do start with the hard questions, as these best practices encourage you to do, you’ll have some quick help identifying the sources, attributes and key experts that are central to good metadata management.
(Now if we could only help you find those keys …)
Maggie Soderholm is a Field Engineer at Tamr.