As I’ve said before, DevOps focused on feature velocity for the big internet companies by enabling the continuous and accelerated delivery of high-quality software artifacts. In much the same way, the essence of DataOps is delivering orders-of-magnitude improvements in analytic and operational velocity to businesses by accelerating the delivery of clean, curated, versioned datasets, organized by logical entities, for all types of professionals in the modern enterprise to consume.
DataOps has a strong future, and with the basics in place, forward-looking vendors and enterprise adopters are already iterating on what the future will look like for everyone else.
In all cases, the most important steps an enterprise can take are to move to a consumption-based DataOps model and to think carefully about the role of its most valuable resource, people, in its ecosystem.
Move to a consumption-based DataOps model
For any data project, the first and most important questions are: “How will data be consumed or used?”, and “Why?” Understanding outcomes and usage lets you understand why the data you’re working on is important to your data consumers. There are four major groups of data consumers: Data Citizens, Data Analysts, Data Scientists, and Developers. Any given person is probably a combination of these personas, but separating them by consumption patterns helps to illuminate their needs and the associated technical endpoints.
For example, Data Citizens probably want to consume data in an HTML page that looks like a table or a Wikipedia page. Data Analysts probably want their data in a visualization tool (Tableau, Spotfire, Qlik, Looker, Domo…) or a spreadsheet. Data Scientists likely want to consume data in a modeling system such as SAS, R or DataRobot, or a proprietary model they’ve built in Python or something similar. And Developers likely want to consume data as RESTful endpoints.
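To make the idea of persona-specific endpoints concrete, here is a minimal sketch in Python of one curated dataset being rendered four ways. Everything in it is an assumption for illustration: the `CUSTOMERS` records, the persona keys, and the function names are hypothetical, and a real deployment would sit behind a web framework, a BI connector, or a feature store rather than plain functions. It also assumes the dataset is non-empty.

```python
import csv
import html
import io
import json

# Hypothetical curated dataset for one logical entity (customers).
CUSTOMERS = [
    {"customer_id": 1, "name": "Acme Corp", "region": "EMEA"},
    {"customer_id": 2, "name": "Smith & Sons", "region": "AMER"},
]


def to_html_table(rows):
    """Data Citizens: a plain HTML table, with values escaped."""
    columns = list(rows[0].keys())
    head = "".join(f"<th>{html.escape(c)}</th>" for c in columns)
    body = "".join(
        "<tr>"
        + "".join(f"<td>{html.escape(str(r[c]))}</td>" for c in columns)
        + "</tr>"
        for r in rows
    )
    return f"<table><tr>{head}</tr>{body}</table>"


def to_csv(rows):
    """Data Analysts: CSV text for a spreadsheet or visualization tool."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def to_columns(rows):
    """Data Scientists: column-oriented data for modeling code."""
    return {c: [r[c] for r in rows] for c in rows[0]}


def to_json_payload(rows):
    """Developers: a JSON body that a RESTful endpoint could return."""
    return json.dumps({"data": rows, "count": len(rows)})


# Hypothetical mapping from persona to delivery format.
RENDERERS = {
    "data_citizen": to_html_table,
    "data_analyst": to_csv,
    "data_scientist": to_columns,
    "developer": to_json_payload,
}


def deliver(persona, rows):
    """Render the same curated rows in the form a persona consumes."""
    return RENDERERS[persona](rows)
```

The point of the sketch is that the dataset is curated once and only the last step, the rendering, varies by consumer.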
Ultimately, this effort to bring together data and deliver it to people in a way that best suits them is about unlocking the potential of enterprise data as a uniquely transformative asset. That means helping people make better-informed decisions, driving improved analytic outcomes, solving problems, and revealing unseen business opportunities in a way they’ve never done before.
Customer data is a perfect example of a core enterprise data asset that, when mastered proactively and continuously, can be the source of incredible business value. A holistic view of customers across an enterprise can bear significant low-hanging operational and analytical fruit. Sales can more readily act on upsell and cross-sell opportunities with existing customers, marketing can run highly targeted campaigns and offer personalized product suggestions to the right customers, and customer service teams can better retain customers by delivering individual and tailored experiences.
Over time, there are many key entity types (customers, products, suppliers, employees, contractors, etc.) that serve as the logical organization method for all of the data a large organization has to master. There are also many industry-specific entity types that need to be mastered. For example, at Tamr we work closely with companies in the energy industry such as Hess to master their Wells data.
To be clear, DataOps still faces significant challenges like inadequate visibility/governance and ineffective monitoring. There are many other gaps, but I see the big challenge right now as governance, and this is because most governance programs use “source-based governance” vs. “consumption-based governance.”
To understand the difference between source-based governance and consumption-based governance it’s helpful to imagine data as water.
Source-based governance focuses on understanding where the water (data) comes from: out of a well, from a river, a lake, the sky – the equivalent of all the operational systems in a company that are data-generating machines. But the most important question for data (as with water) is not where it came from, but rather what exactly is coming out of the faucet, and whether it is fit to drink. Understanding the sources is important but not sufficient to answer those crucial questions.
Consumption-based governance focuses on whether data consumers are using data appropriately according to your information-security and role-based-access policies. It means governing the quality and usage of data at the point of use, whether that is a water fountain, tap, toilet, or hose, to return to our previous example. Without consumption-based governance, you lack the freedom to get data to where it needs to go.
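As a rough illustration of governing at the point of use, the sketch below checks a role-based policy at the moment data is requested and records what was granted and denied. The roles, column names, `POLICIES` table, and `consume` function are all hypothetical, invented for this example; a real program would rely on the access controls of its data platform rather than an in-memory dictionary.

```python
# Hypothetical role-based policies: which columns each role may consume.
POLICIES = {
    "analyst": frozenset({"customer_id", "region", "lifetime_value"}),
    "support": frozenset({"customer_id", "name", "email"}),
}

# Hypothetical curated records.
ROWS = [
    {
        "customer_id": 1,
        "name": "Acme Corp",
        "email": "ops@acme.example",
        "region": "EMEA",
        "lifetime_value": 1200,
    },
]

# Every consumption attempt is recorded here, so usage itself is visible.
AUDIT_LOG = []


def consume(role, rows, requested_columns):
    """Return only the columns the role may see, and log the request."""
    allowed = POLICIES.get(role)
    if allowed is None:
        raise PermissionError(f"no consumption policy for role {role!r}")
    granted = [c for c in requested_columns if c in allowed]
    denied = sorted(set(requested_columns) - allowed)
    AUDIT_LOG.append({"role": role, "granted": granted, "denied": denied})
    return [{c: row[c] for c in granted} for row in rows]
```

The check happens at the faucet, not at the well: the policy is evaluated per request and per role, and the audit log shows who asked for what, regardless of which source system the data originally came from.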
Today, most companies are so distracted with source-based governance (where the water comes from) that they have neither the time nor the resources to govern the data as it’s being consumed, which is ultimately the more consequential activity.
While enterprises have many systems and applications, again and again, there is one key bottleneck to DataOps on both the source and consumption side of data.
People instinctively know that data is an asset. They expect to use it that way. But people also know the most about “their” data in its current state. And for both good and bad reasons, it’s common for people to be mistrustful of the data provided to them by “outsiders” and fearful of sharing their data with others. When “bad” data can undermine their jobs and make them look bad, there will be friction, and that friction puts humans squarely in the critical path of DataOps progress.
To fully benefit from data as an asset – delivering consumable data at scale for competitive analytics and operations – DataOps needs to make people part of the solution.
Leverage people in your DataOps infrastructure
The bottleneck in governance, and in DataOps, is not technology; it’s people and their behaviors. Perhaps the most destructive behavior is data hoarding. Data hoarding is when individuals treat their data like Gollum treats his ring: a coveted item that cannot be shared. For DataOps to be successful, this behavior must end.
Good data leaders establish a broad framework of core behaviors to reinforce data as a shared corporate (not an individual or group) asset. Modern data engineering technologies and best practices rely on this data sharing and are essential to encouraging such behaviors and ingraining them in the data culture of the company. One of the most successful CDOs I’ve worked with took a job at a very large life sciences company on the condition that he would have root access to every single data system at that company. For him, and other data leaders, access to data is table stakes.
For help with this (and to stay on the cutting edge of DataOps), look at creating DataOps Centers of Excellence (COE) that can investigate, pilot, and produce new projects. You need to do this anyway. But, with people being the gating factor in project success, DataOps COEs can also become conversion centers in popularizing new best data practices and behaviors in the business.
Good leadership and behavior can dramatically improve outcomes for organizations looking to lead with data, and one of the most important, tangible benefits should be a dramatic reduction in the need for human beings to perform repetitive, low-value tasks. And so the third step is to automate as many of the low-value tasks as possible.
Automate with technologies that enable non-linear scaling of DataOps
The scale of the DataOps challenge is so large that humans alone could never achieve what needs to be done without standing on the shoulders of machines. Google pioneered the use of machines to make the consumer internet more useful and broadly accessible to the average person. In much the same way, machine-driven, human-guided DataOps requires the thoughtful collaboration of humans and machines to achieve the promise of DataOps: providing clean, curated, versioned datasets to those who can use the data appropriately to make the best decisions possible. The primary components of the modern DataOps infrastructure include:
Crawling and registration (Data cataloging tools)
Movement and automation (ETL and other pipelining tools)
Persistence and compute (Predominantly cloud platforms)
Mastering and quality (Modern data mastering is machine-driven, human-guided)
Consumer data feedback and curation
Governance, both source-based and, more importantly, consumption-based
At Tamr, we use human-guided, machine-learning-driven data mastering to enable the delivery of clean, curated, versioned datasets. Tamr optimizes the use of data experts by offloading the heavy lifting of data mastering and curation to intelligent machines. This saves a DataOps team huge amounts of time and money in generating clean, trustworthy, and ready-to-use data, while also establishing a cycle of machine-driven, human-guided continuous improvement. Tamr replaces manual or legacy rules-based methods of data mastering, neither of which is scalable to today’s data volumes or to DataOps’ goals.
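The general shape of a machine-driven, human-guided loop can be sketched in a few lines of Python. To be clear, this is not how Tamr works: it is a toy stand-in that uses string similarity from the standard library's `difflib` instead of a trained model, compares every pair of records (which would not scale to large volumes), and uses thresholds that are arbitrary assumptions. The function names and sample records are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Toy similarity score in [0, 1]; a stand-in for a trained model."""
    return SequenceMatcher(None, a.lower().strip(), b.lower().strip()).ratio()


def triage_pairs(records, auto_match=0.9, needs_review=0.6):
    """Split candidate pairs into machine-decided matches and a review queue.

    Pairs scoring at or above auto_match are accepted by the machine.
    Pairs between needs_review and auto_match go to a human expert.
    Pairs below needs_review are treated as different entities.
    """
    matches, review_queue = [], []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            score = similarity(records[i], records[j])
            if score >= auto_match:
                matches.append((records[i], records[j]))
            elif score >= needs_review:
                review_queue.append((records[i], records[j], score))
    return matches, review_queue


def apply_review(matches, review_queue, decide):
    """Fold expert decisions into the match list.

    decide is a callable standing in for a human expert: it takes two
    records and returns True if they refer to the same entity.
    """
    confirmed = list(matches)
    for a, b, _score in review_queue:
        if decide(a, b):
            confirmed.append((a, b))
    return confirmed


# Hypothetical customer names from different source systems.
RECORDS = ["Acme Corp", "ACME Corp ", "Acme Corporation", "Globex Inc"]
```

With these sample records, the machine accepts the pair that differs only in case and whitespace on its own, while the "Corp" versus "Corporation" pairs land in the review queue for an expert. In a real system, those expert answers would also feed back into the model so that fewer pairs need review over time.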
It’s still early days
Despite all the recent advancements in DataOps, the truth is we’re still at the very beginning of a significant change in how data is managed as an asset in large companies. For decades, data has been treated as exhaust from operational systems instead of as a strategic asset.
And while there is still a lot of work to build the next generation of modern data engineering infrastructure in large enterprises, the future will be largely determined by our collective behaviors and perceptions.