DataOps: Where and How to Start (Stat)

How to spot a data readiness problem and expedite a fix

You’re responsible for managing data as an asset for your organization. From our first post in this “Accelerating DataOps” series, you’ve been reminded that DataOps is a continuous process and not a single event. Second, that perpetually clean, healthy and trustworthy data for your data pipelines is in the critical path to success.

You have critical data that you want to pipeline to solve a business problem or seize a business opportunity. But you won’t know how “bad” the problem, or how rich the opportunity is, until you: (1) get some actually “good” data and (2) make it fit for consumption (e.g., for operational improvement or analytics). 

Unfortunately, the path to clean, curated data is rarely straightforward. You’re almost guaranteed your data will be a mess because of decades of amassing data silos, inconsistent data management, and messy data exacerbated by freewheeling updates and duplicate-record creation. 

Further, the organizational resources to “fix” the data (i.e., people, processes, and technology) are often hard to identify and cost-justify. There are legacy technologies that need to be either embraced or replaced. There are legacy processes that haven’t evolved to take advantage of new skills, knowledge, or measurement capabilities. The cold truth is that there is no single packaged solution to your problem, but rather a collection of best-of-breed solutions from which to choose. 

But where to start? 

The answer: take a triage approach.

Much like how ER doctors must prioritize which patients to treat first in an emergency, you need to determine what to prioritize, and where to martial your initial efforts with DataOps. Which of my entities is in dire need of help? Which is in pretty good shape, or can wait? Which is hopeless? What organizational resources will I need, and by when, to be effective? What technologies need to be involved, and which do we have ready to deploy versus need to acquire? 

Here’s the process, from a high level:

Diagnose the Data Problem

To start, you need to stay agile and get a handle on all of your “symptoms” before you can “cure” the data with a complete DataOps pipeline. Therefore, plan to do a small spike project first. 

Start by building a data pipeline for a single data entity. Whichever entity you choose, make it an entity that is important to a business problem or opportunity and thus likely to be a good sample “patient” that will turn around a quick return on your investment.  For example, a common entity is a Customer Account, widely used across business organizations (although not always known by that exact name), frequently changing, and with data stored in multiple systems (e.g., ERP systems, Salesforce, and other CRMs).

Your team will need to include subject matter experts, people chartered with curation of Customer Accounts within their business unit, and technical resources capable of building your data pipelines and setting up required technological solutions.

Apply the Right Therapies

Next, build a DataOps pipeline through it, going through all of the usual steps, using your most advanced therapies.

 

Modern DataOps Ecosystem

Source: “7 Components of the DataOps Ecosystem,” a DataMasters podcast

 

Cataloging (crawling and registry services) can help find and identify your data sources. From this exercise, you’ll discover how consumption-ready your data is: how distant it may be, how messy its source systems, what software is involved and so on.  Next, go deeper to learn the status of the underlying data sources at an intimate level. For example, how many languages is your data in? Is the data standardized or in 15 different formats? Is it curated, and if so, what is the process of that curation? 

You’ll now be able to prioritize all the related data sources to get a complete picture of them and their health (individually and holistically). This will enable you to focus the right amount of attention on each data source according to priority. 

Data mastering, quality and enrichment will help you get the data ready for production use. Modern schema mapping, low-latency matching, and data mastering techniques can help speed this part of the process. It is important to keep in mind that your data is constantly changing and evolving, thus any solution here must remain iterative and easy to update over time. There is no such thing as a one-time cleanup of a data entity.

Discharge a Stable Patient 

Finally, you’re ready to publish your data to its consumers (such as data scientists or analysts) for use in generating actionable business insights. Because you used the correct and most modern therapies, the data goes home with a clean bill of health. 

For now, anyway. It’s important to remember that this process was a temporary fix to a single data entity at this point. Just like in a real ER visit, the patient is going to need some follow-up appointments to fully heal the disease or wound. (We’ll be covering this in future posts.)

Congratulations: Your triage worked! 

By identifying the real problem and its severity up front and mustering the right resources to solve it, you’ve gotten your data to a better state of health and usability. 

More importantly: You now have a model that can be applied to future “patients.” In fact, if done successfully, the ROI from the spike project can often fund the next pipeline project and so on, creating a self-funding model for a culture of healthy data in perpetuity.  

ROI from your spike project can often fund the next pipeline project and so on, creating a self-funding model for a culture of healthy data in perpetuity.  

You didn’t “boil the ocean” by starting with an over-ambitious, top-down project for data transformation. You made it out of the Emergency Room, alive and well, with a prescription for healthy, analytics-ready data for DataOps that can be reapplied. 

In our next posts, we’ll delve deeper into why delivering business-ready data is such a challenge and why modern data mastering in particular is a must for effective DataOps. We’ll also look at additional data-health-maintenance practices that will keep DataOps running in peak form, including automated source remediation and entity resolution.