How to Build a DataOps Toolkit

The DataOps cross-functional toolkit is not a single tool or platform, nor can it ever be. Rather, it’s a framework that prioritizes responding to change over following a plan. This means that the toolkit is inherently a collection of complementary, best-of-breed tools with interoperability and automation at the core of their design.

Key #1 to the Toolkit: Interoperability

While a good ETL platform certainly appeared to treat interoperability as a core principle, this was, in fact, a reaction to the complete lack of interoperability among traditional data repositories and tools. We all know that in practice, if your ETL platform of choice did not ship a native connector for a particular proprietary database, ERP, CRM, or business suite, that data was simply forfeited from any integration project. Conversely, data repositories that did not support an open data exchange format forced users to work within the confines of that repository, often subsisting on rigid, sub-optimal, worst-of-breed, tacked-on solutions.

The dream of being able to develop a center of excellence around a single data integration vendor quickly evaporated as the rigid constraints of the single vendor tightened with increasing data variety and velocity. That’s why interoperability is perhaps the biggest (and most necessary) departure in the DataOps stack from the data integration tools or platforms of the past. To truly embrace interoperability, we need three components:

Composable Agile Units

The agile DataOps toolkit not only allows a user to plug-and-play a particular tool between different vendor, open-source or home-grown solutions, it also allows greater freedom in deciding the boundaries of your composable agile units. This is key when trying to compose agile DataOps teams and tools while working within the realities of your available data engineering and operations skill sets. If your dashboarding team can work across many different record matching engines, your record matching engine can work across many different dashboarding tools too.

In fact, DataOps tools aspire to the mantra of doing one thing exceptionally well, and thus when taken together, present a set of non-overlapping complementary capabilities that align with your composable units.

Results Import

Still, in practice, even DataOps tools can, and often must, overlap in capability. The answer to this apparent redundancy reveals one of the unique hallmarks of interoperability in the DataOps stack: the tool’s ability to both export and import its results for common overlapping capabilities. This ability allows the DataOps team greater flexibility and control in composing a pipeline, designing around the core non-overlapping ability of each component, while leveraging their import and export functionality to afford the interoperability of common abilities.
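As a minimal sketch of this hallmark, consider two hypothetical tools that overlap in record-matching capability. If each can export and import match results in a common, tool-agnostic format (plain JSON here, with invented field names for illustration), either tool can be composed into the pipeline without forfeiting the other's work:

```python
import json

def export_matches(matches, path):
    """Export record-match results in a plain, tool-agnostic JSON format."""
    with open(path, "w") as f:
        json.dump({"format": "match-results-v1", "matches": matches}, f)

def import_matches(path):
    """Import match results written by any tool that speaks the same format."""
    with open(path) as f:
        payload = json.load(f)
    if payload.get("format") != "match-results-v1":
        raise ValueError("unrecognized results format")
    return payload["matches"]
```

A matching engine would call `export_matches` after scoring record pairs; a dashboarding tool with its own (overlapping) matching capability would call `import_matches` instead of recomputing, keeping each tool focused on its core non-overlapping ability.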

Metadata Exchange

In DataOps, the focus of interoperability is not the need for closer systems integration, e.g. with each tool possessing native support for each other tool. Rather, it’s the need for open interoperable metadata exchange formats. This objective reveals the second unique hallmark of interoperability in the DataOps stack: the ready exchange and enrichment of metadata.

Each of our DataOps tools (data extraction, data unification and dashboarding) requires the ability to both preserve (passthrough) metadata and enrich it. As data flows from source, through extraction, to unification and dashboard, each tool preserves the metadata by supporting a common interoperable metadata exchange format, and then enriches it before exposing it to the next tool. This is a primary and distinguishing feature of a DataOps tool and an essential capability for realizing interoperability in the DataOps stack.
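The passthrough-and-enrich pattern can be sketched in a few lines. This is a hypothetical exchange format (a `metadata` dict carried alongside the data fields), not any particular tool's API; each stage copies the upstream metadata forward untouched and appends its own namespaced keys:

```python
def passthrough_and_enrich(record, tool_name, enrichment):
    """Preserve upstream metadata untouched and append this tool's own keys.

    `record` is a dict in a hypothetical shared exchange format, carrying
    a `metadata` dict alongside its data fields.
    """
    meta = dict(record.get("metadata", {}))  # passthrough: copy, never drop
    # enrich: add this tool's keys under its own namespace
    meta.update({f"{tool_name}.{k}": v for k, v in enrichment.items()})
    return {**record, "metadata": meta}

# Source -> extraction -> unification: metadata accumulates at each stage
record = {"id": 42, "metadata": {"source": "crm_export.csv"}}
record = passthrough_and_enrich(record, "extract", {"run_at": "2021-06-01"})
record = passthrough_and_enrich(record, "unify", {"cluster": "c-17"})
```

Namespacing each tool's keys is one simple way to let every stage enrich freely without clobbering what an upstream tool recorded.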

Key #2 to the Toolkit: Automation

One simple paradigm for automation is that every common UI action or set of actions, and any bulk or data-scale equivalent of those actions, is also available via a well-formed API. In the DataOps toolkit, the API-first ethos of agile development meets the pragmatism of DevOps: any DataOps tool should furnish a suite of APIs that allow the complete automation of its tasks.
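A minimal sketch of that paradigm, assuming a hypothetical REST-style tool (the endpoint paths and client class are invented for illustration): every single-item UI action maps to an API call, and the bulk equivalent is just a composition of the same call. The transport is injected so the sketch stays self-contained; in production it would issue real HTTP requests.

```python
class DataOpsClient:
    """Hypothetical client: every UI action has a well-formed API twin."""

    def __init__(self, base_url, transport):
        self.base_url = base_url.rstrip("/")
        self.transport = transport  # callable(method, url) -> response

    def publish_dataset(self, dataset_id):
        # single-item action, mirroring the UI's "Publish" button
        return self.transport(
            "POST", f"{self.base_url}/datasets/{dataset_id}/publish")

    def publish_datasets(self, dataset_ids):
        # bulk, data-scale equivalent of the same action
        return [self.publish_dataset(d) for d in dataset_ids]
```

The point is not the specific endpoints but the symmetry: if an operator can click it, a script can call it, at any scale.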

Broadly speaking, we can consider that these tasks are performed as part of either continuous or batch automation.

Continuous Automation

If the primary challenge of the DevOps team is to streamline the software release cycle to meet the demands of the agile development process, then the primary objective of the DataOps team is to automate the continuous publishing of datasets and refreshing of every tool’s results or view of those datasets.

The ability of a tool to be automatically updated and refreshed is critical to delivering working data and meeting the business-critical timeliness requirements of that data. The ease with which a dataset can be published and republished directly impacts the feedback cycle with the data consumer and ultimately determines the DataOps team’s ability to realize the goal of shortened lead time between fixes and faster mean time to recovery.
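The publish-then-refresh loop can be sketched as follows. The function and its arguments are illustrative, not any vendor's API: publishing bumps a dataset version, and every subscribed downstream tool (dashboard, quality monitor, etc.) rebuilds its view of the new version, closing the feedback cycle programmatically:

```python
def publish_and_refresh(dataset, subscribers):
    """Publish a new dataset version, then refresh every downstream view.

    `subscribers` are callables, one per downstream tool, each rebuilding
    its view of the freshly published data.
    """
    published = {**dataset, "version": dataset.get("version", 0) + 1}
    views = [refresh(published) for refresh in subscribers]
    return published, views
```

In a real deployment the subscriber callables would invoke each tool's refresh API; the shape of the loop, republish and propagate in one automated motion, is what matters.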

Batch Automation

With most tools already providing sufficient capability to be continuously updated in an automatic, programmatic manner, it’s easy to overlook the distinctly agile need to spin up and tear down a tool or set of tools automatically. However, the ability of a tool to be automatically initialized, started, run, and shut down is a core requirement in the DataOps stack and one that is highly prized by the DataOps team.

The ability of a tool to be automatically stood up from scratch and executed, where every step is codified, delivers on the goal of Automate Everything and unlocks the tenet of Test Everything. It is possible to not only spin up a complete, dataset-to-dashboard pipeline for performing useful work, e.g. data model trials, but also to execute the pipeline programmatically, running unit tests and data flow tests, e.g. did the metadata in the source dataset pass through successfully to the dashboard? Batch automation is critical to realizing repeatability in your DataOps pipelines and a low error rate in publishing datasets.
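A codified trial run, with the data flow test from the paragraph above built in, might look like the following sketch (the stage functions and record shape are assumptions for illustration):

```python
def run_pipeline_trial(source_records, steps):
    """Stand up a dataset-to-dashboard trial where every step is codified.

    Runs each pipeline stage in order, then applies a data flow test:
    did the source metadata survive all the way to the final stage?
    """
    records = source_records
    for step in steps:
        records = step(records)
    assert all("source" in r.get("metadata", {}) for r in records), \
        "data flow test failed: source metadata lost in flight"
    return records
```

Because every step is a callable, the same script that spins the pipeline up for a data model trial also serves as its test harness, which is exactly what Automate Everything and Test Everything demand.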

Creating Your DataOps Toolkit

To meet the demand of delivering working data with ever-increasing data variety and velocity, the key components of the DataOps ecosystem will continue to evolve, requiring new and greater capabilities. The individual tools that are deemed best-of-breed for these capabilities will be those that incorporate interoperability and automation in their fundamental design.

To learn more about how to implement a DataOps framework at your organization, download our ebook, Getting DataOps Right, below. To learn more about Tamr and the role we play in a DataOps ecosystem, schedule a demo.

Getting DataOps Right

In this report, five data industry thought leaders explore DataOps—the automated, process-oriented methodology for delivering clean, reliable data across your organization.




Liam Cleary is a technical lead at Tamr, a machine learning-based data unification company, where he leads data engineering, machine learning and implementation efforts. Prior to Tamr, he was a Post-doctoral Associate at MIT, researching quantum dissipative systems, before working as an internal consultant at Ab Initio, a data integration platform. He has a Ph.D. in Electrical Engineering from Trinity College Dublin.