DataOps as a Discipline

DataOps is an automated, process-oriented methodology used by analytics and data teams to improve the quality and reduce the cycle time of data analytics. What does that mean in terms of benefits? Data-driven aspects of the business can respond rapidly to changing business needs.

DataOps, like DevOps, emerges from the recognition that separating the product—production-ready data—from the process that delivers it—operations—impedes quality, timeliness, transparency, and agility.

The need for the DataOps discipline comes about because data consumption has changed dramatically over the past decade. Just as internet applications raised user expectations for usability, availability, and responsiveness of applications, things like Google Knowledge Panel and Wikipedia have dramatically raised user expectations for usability, availability, and freshness of data.

What’s more, with increased access to very usable self-service data preparation and visualization tools, there are now many users within the enterprise who are ready and able to prepare data for their own use if official channels cannot meet their expectations. In combination, these changes have created an environment where continuing with the cost-laden, delay-plagued, opaque operations used to deliver data in the past is no longer acceptable.

Taking a cue from DevOps, DataOps looks to combine the production and delivery of data into a single, agile practice that directly supports specific business functions. The ultimate goal is to cost-effectively deliver timely, high-quality data that meets the ever-changing needs of the organization.

DataOps Builds on the Wisdom of the Agile Process

Agile software development arose from the observation that software projects run using traditional processes were plagued by:

  • High cost of delivery, long time to delivery, and missed deadlines
  • Poor quality, low user satisfaction, and failure to keep pace with ever-changing requirements
  • Lack of transparency into progress towards goals, and schedule unpredictability
  • Anti-scaling in project size, where the cost per feature of large projects is higher than the cost per feature of small projects
  • Anti-scaling in project duration, where the cost of maintenance grows to overwhelm available resources

We all know the truth of these issues—they plague many data delivery projects today too. With the same frustrations, it stands to reason we can learn from the same approaches seen in the Agile Manifesto:

  1. Individuals and Interactions over processes and tools. The best way to get the most from your team is to support them as people first, and to bring in tools and processes only as necessary to help them be more effective.
  2. Working Software over comprehensive documentation. In the data world, “working software” means working data: data that meets users’ functional needs, quality needs, availability needs, serviceability needs, etc., so that users can accomplish their goals far more readily than they could without it.
  3. Customer Collaboration over contract negotiation. The best way to determine whether a product meets your customer’s needs and expectations is to have the customer use the product and give feedback.
  4. Responding to Change over following a plan. It is much better to plan only as much as necessary to ensure that the team is aligned and the goals are reasonable, then measure often to determine whether course correction is necessary. Only by adapting swiftly to change can the cost of adaptation be kept small.

So how do we do it? The Agile process also has the answers:

    1. Deliver working software frequently—in days or weeks, not months or years—adding functionality incrementally until a release is completed
    2. Get daily feedback from customers—or customer representatives—on what has been done so far
    3. Accept changing requirements, even late in development
    4. Work in small teams (3–7 people) of motivated, trusted, and empowered individuals, with all the skills required for delivery present on each team
    5. Keep teams independent; this means each team’s responsibilities span all domains, including planning, analysis, design, coding, unit testing, acceptance testing, releasing, and building and maintaining tools and infrastructure
    6. Continually invest in automation of everything
    7. Continually invest in the improvement of everything, including process, design, and tools

These practices have enabled countless engineering teams to deliver timely, high-quality products, many of which we use every day. These same practices are now enabling data engineering teams to deliver timely, high-quality data that powers applications and analytics. But there is another transition made in the software world that needs to be picked up in the data world.

When delivering hosted applications and services, agile software development is not enough. It does little good to rapidly develop a feature if it then takes weeks or months to deploy it, or if the application cannot meet availability or other requirements due to the inadequacy of the hosting platform. The application of agile to operations created DevOps, which exists to ensure that hosted applications and services can not only be developed but also delivered in an agile manner.

Embracing Agile Operations for Data and Software

Agile removed many barriers internal to the software development process and enabled teams to deliver product features in days, instead of years. For hosted applications, in particular, the follow-on process of getting a feature deployed retained many of the same problems that Agile intended to address. Bringing development and operations into the same process, and often the same team, can reduce time to delivery to hours or minutes. The principle has been extended to operations for non-hosted applications as well, with similar effect.

This is the core of DevOps. The outcomes DevOps aims to achieve look very similar to those targeted by Agile software development:

  • Improved deployment frequency
  • Faster time to market
  • Lower failure rate of new releases
  • Shortened lead time between fixes
  • Faster mean time to recovery (in the event of a new release crashing or otherwise disabling the current system)

Most of these can be summarized as availability: making sure that the latest working software is consistently available for use. To determine whether a process or organization is improving availability, you need a metric more transparent than percent uptime, one that can be measured continuously, tells you when you are close to your target, and tells you when you are deviating from it.

DevOps, then, has the goal of maximizing the fraction of requests that are successful, at minimum cost. For an application or service, a request can be logging in, opening a page, performing a search, etc. For data, a request can be a query, an update, a schema change, etc. These requests might come directly from users, e.g. on an analysis team, or could be made by applications or automated scripts. Data development produces high-quality data, while DataOps ensures that the data is consistently available, maximizing the fraction of requests that are successful.
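The success-fraction framing above lends itself to direct measurement. As a minimal sketch (the request log, field names, and thresholds here are hypothetical illustrations, not anything prescribed by DataOps itself), the metric could be computed from a log of data requests like so:

```python
from dataclasses import dataclass

@dataclass
class Request:
    kind: str   # e.g. "query", "update", "schema_change"
    ok: bool    # whether the request succeeded

def success_fraction(requests):
    """Fraction of requests that succeeded: the quantity DataOps aims to maximize."""
    if not requests:
        return 1.0  # no demand, no failures
    return sum(r.ok for r in requests) / len(requests)

# A toy request log mixing user queries and automated operations
log = [
    Request("query", True),
    Request("query", True),
    Request("update", False),  # e.g. rejected because of a stale schema
    Request("schema_change", True),
]
print(success_fraction(log))  # 0.75
```

Unlike percent uptime, this number can be computed continuously over a sliding window and broken down by request kind or by client, which makes it easy to see when a particular delivery form is drifting away from its objective.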

Bringing DevOps Together with Data

DataOps is an emerging field, whereas DevOps has been put into practice for many years now. We can use that depth of experience with DevOps as a guide for the developing practice of DataOps. There are many variations in DevOps, but they share a collection of core tenets:

    1. Think services, not servers
      The goal of the data organization is not to deliver a database or a data-powered application, but the data itself, in a usable form. In this model, data is typically not delivered in a single form factor, but simultaneously in multiple form factors to meet the needs of different clients. Each of these delivery forms can have independent service level objectives, and the DataOps organization can track performance relative to those objectives when delivering data.
    2. Infrastructure as code
      From the DataOps perspective, everything involved in delivering data must be embodied in code. Of course, this includes infrastructure such as hosts, networking, and storage, but, importantly, this also covers everything to do with data storage and movement. Nothing can be done as a one-off; everything must be captured in code that is versioned, tested, and released. Only by rigorously following this policy will data operations be predictable, reliable, and repeatable.
    3. Automate everything
      Automation is what enables schema changes to propagate quickly through the data ecosystem. It is what ensures that responses to compliance violations can be made in a timely, reliable and sustainable way. It is what ensures that data freshness guarantees can be upheld. And it is what enables users to provide feedback on how the data does or could better suit their needs so that the process of rapid iteration can be supported. Automation is what enables a small DataOps team to effectively keep data available to the teams, applications, and services that depend on it.
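To make the “infrastructure as code” and “automate everything” tenets concrete, here is a toy sketch of how a dataset’s delivery forms and service level objectives might be declared in versioned code and validated automatically in CI, rather than configured by hand. All names and fields below are hypothetical illustrations:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class DeliverySpec:
    """One delivery form of a dataset, declared in code and kept under version control."""
    name: str                   # e.g. "orders_parquet"
    form_factor: str            # e.g. "parquet", "rest_api", "warehouse_table"
    freshness_slo_minutes: int  # maximum acceptable staleness
    availability_slo: float     # target fraction of successful requests

def validate(spec: DeliverySpec) -> list:
    """Automated checks run in CI, so no spec reaches production as a one-off."""
    problems = []
    if spec.freshness_slo_minutes <= 0:
        problems.append(f"{spec.name}: freshness SLO must be positive")
    if not (0.0 < spec.availability_slo <= 1.0):
        problems.append(f"{spec.name}: availability SLO must be in (0, 1]")
    return problems

# The same dataset delivered simultaneously in multiple form factors,
# each with its own independent service level objectives
specs = [
    DeliverySpec("orders_parquet", "parquet", 60, 0.999),
    DeliverySpec("orders_api", "rest_api", 5, 0.9995),
]
for s in specs:
    assert validate(s) == [], validate(s)
print("all delivery specs valid")
```

Because the specs are plain code, they can be versioned, reviewed, tested, and released like any other artifact, which is precisely what makes data operations predictable, reliable, and repeatable.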

The Agile Data Organization Turns DataOps into a Discipline

DataOps, in conjunction with agile data engineering, builds the next-generation data engineering organization. The goal of DataOps is to extend the Agile process through the operational aspects of data delivery so that the entire organization is focused on timely delivery of working data. Analytics is a major consumer of data, and DataOps in the context of agile analytics has received quite a bit of attention. Other consumers also benefit substantially from DataOps, including governance, operations, security, etc. By combining the engineering skills that are able to produce the data with the operations skills that are able to make it available, teams are able to cost-effectively deliver timely, high-quality data that meets the ever-changing needs of the data-driven enterprise.

In short, DataOps is the transformational change data engineering teams have been waiting for to fulfill their aspirations of enabling their business to gain analytic advantage through the use of clean, complete, current data.

To learn more about the DataOps discipline and how to implement it at your organization, download our ebook, Getting DataOps Right. You can also reach out to us or schedule a demo.



Nik is a technology leader with over two decades of experience building data engineering and machine learning technology for early stage companies. As a technical lead at Tamr, he leads data engineering, machine learning and implementation efforts. Prior to Tamr, he was director of engineering and lead architect at Endeca, where he was instrumental in the development of the search pioneer which Oracle acquired for $1.1B. Previously, he delivered machine learning and data integration platforms with Torrent Systems, Thinking Machines Corp., and Philips Research North America. He has a master’s degree in computer science from Columbia University.