The DataOps Ecosystem Emerges

This guest blog post is by Chris Bergh, Head Chef at DataKitchen, a Tamr partner and co-conspirator behind the impending DataOps revolution.

In 2015, Andy Palmer of Tamr defined the term DataOps: a faster, more flexible approach to data analytics that recognizes the interconnectedness of IT operations, data engineering, data integration, data quality, and data security/privacy.

DataOps has recently garnered a lot of media attention. Blue Hill Research produced a report on DataOps. Data Science Central has written an article. DataOps is now defined in Gartner's glossary of IT terms. There are conferences in the pipeline. Companies, including my own, DataKitchen, are now offering solutions that implement and support DataOps. The word on #DataOps is spreading fast.

The DataOps Manifesto

Some of the key players and supporters coalescing around DataOps have produced a DataOps Manifesto consisting of 18 DataOps principles that summarize the mission, values, philosophies, goals, and best practices of DataOps practitioners. You can view the signatories to the manifesto here. If you agree with the principles of DataOps, please add your signature.

DataOps Explained

For those still new to the term, DataOps is a combination of tools and process improvements that enables rapid-response data analytics at a high level of quality. DataOps adapts more easily to user requirements, even as they evolve, and ultimately supports improved data-driven decision-making. DataOps delivers these benefits by drawing upon lessons learned in other fields and applying them to data analytics:

  • Studies show that software development projects complete significantly faster and with far fewer defects when Agile Development, an iterative project management methodology, replaces the traditional Waterfall sequential methodology. The Agile methodology is particularly effective in environments where requirements are quickly evolving — a situation well known to data analytics professionals.
  • DevOps, which inspired the name DataOps, focuses on continuous delivery by leveraging on-demand IT resources and by automating the testing and deployment of code. This merging of software development and IT operations reduces time to deployment, decreases time to market, minimizes defects, and shortens the time required to resolve issues. Borrowing these methods, DataOps brings the same improvements to data analytics.
  • Like lean manufacturing, DataOps uses statistical process control (SPC) to monitor and control the data analytics pipeline. Applied to data analytics, SPC leads to remarkable improvements in efficiency and quality.
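
To make the SPC idea concrete, here is a minimal sketch, not drawn from any vendor's product, of applying control limits to a single pipeline metric. The metric (daily row counts of a feed) and the sample numbers are hypothetical, chosen only for illustration:

```python
# Illustrative sketch: statistical process control applied to one
# data-pipeline metric -- the daily row count of an incoming feed.
# The historical values below are made up for demonstration.
from statistics import mean, stdev

def control_limits(history, sigmas=3):
    """Return (lower, upper) control limits from historical observations."""
    m, s = mean(history), stdev(history)
    return m - sigmas * s, m + sigmas * s

def in_control(value, history):
    """True if today's value falls inside the control limits."""
    lower, upper = control_limits(history)
    return lower <= value <= upper

daily_row_counts = [10_120, 9_980, 10_240, 10_060, 9_890, 10_150]

print(in_control(10_050, daily_row_counts))  # a typical day -> True
print(in_control(2_300, daily_row_counts))   # feed dropped rows -> False
```

A check like this, run automatically after each pipeline stage, turns silent data problems into visible, actionable signals.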


Implementing DataOps in Seven Simple Steps

Simply stated, DataOps utilizes Agile Development, DevOps and statistical process controls to produce a rapid-response, flexible and robust data-analytics capability. The beauty of DataOps is that you can achieve these benefits while using the tools that you know and love. You can implement DataOps in seven surprisingly simple steps:


  1. Add data and logic tests – An automated test suite enables continuous delivery and robust execution, and ensures that each intermediate stage in the data analytics pipeline produces results within an expected range.
  2. Use a version control system – Data analytics files are just code. As in any complex development project, this design intellectual property must be managed.
  3. Branch and merge – Version control also enables team members to check out code and work on enhancements in parallel, without getting in each other's way.
  4. Use multiple environments – Data analytics team members should work in a development environment, with all the resources and data that they need, rather than on live production systems.
  5. Reuse and containerize – Container technologies like Docker can encapsulate complexity and foster reuse of data analytics components.
  6. Parameterize your processing – Design your data analytics pipeline with parameters that adapt to common run-time conditions.
  7. Use simple storage – Low-cost storage makes data lakes viable. Simple storage keeps the entire data analytics pipeline accessible so that it can be easily updated, a critical aspect of continuous delivery.
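
Several of the steps above can be sketched in a few lines of code. The example below, a hypothetical illustration rather than DataKitchen's actual API, shows a data test guarding an intermediate pipeline stage (step 1) whose thresholds are passed in as run-time parameters (step 6), so the same check can run with relaxed limits in a development environment (step 4):

```python
# Illustrative sketch: a parameterized data test for one pipeline stage.
# Function name, thresholds, and the "customer_id" key are all hypothetical.

def check_stage(rows, min_rows=100, max_null_fraction=0.02, key="customer_id"):
    """Raise ValueError if the stage's output falls outside its expected range."""
    if len(rows) < min_rows:
        raise ValueError(f"expected at least {min_rows} rows, got {len(rows)}")
    nulls = sum(1 for r in rows if r.get(key) is None)
    if nulls / len(rows) > max_null_fraction:
        raise ValueError(f"null fraction {nulls / len(rows):.1%} exceeds limit")
    return True

# A passing run; in a small dev environment the parameters could be relaxed:
sample = [{"customer_id": i} for i in range(150)]
print(check_stage(sample, min_rows=100))  # -> True
```

Because the test and its parameters live in ordinary code files, they can be versioned, branched, and merged (steps 2 and 3) alongside the rest of the pipeline.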


DataOps requires an Agile mindset and must be supported by an automated system that incorporates existing tools into a DataOps development pipeline, as summarized in the seven steps above. DataKitchen provides a DataOps platform that integrates your existing tools into a DataOps workflow. DataKitchen can work with the full range of tools used in data analytics, including Tamr's data unification platform, which identifies and connects disparate enterprise data sources.


DataOps Speeds Ideas into Production

At its core, data analytics is in the "idea" business. These ideas help organizations understand their market and environment in order to better serve customers or users. For ideas to have impact, organizations must develop them and bring them into production. Later, the staff improves and builds upon these ideas through refactoring and reuse of data-analytics assets such as data and code. For many companies today, every step in this process takes far too long.

The goal of DataOps is to reduce the cost of asking new questions and to accelerate the journey from idea to production. Using DataOps, organizations can be much more creative because ideas are easier to vet and implement. These organizations will make better decisions more quickly, increasing their probability of success.