Written by Andy Palmer
Having worked with dozens of Global 2000 customers on their data/analytics initiatives, I have seen a consistent set of key principles for a DataOps ecosystem. These principles stand in stark contrast to the traditional “single vendor,” “single platform” approaches advocated by vendors such as Palantir, Teradata, IBM, Oracle and others. An open, best-of-breed approach is more difficult, but also much more effective in the medium/long term: it represents a winning strategy for a Chief Data Officer, Chief Information Officer and CEO who believe in maximizing the reuse of quality data in the enterprise, and it avoids the over-simplified trap of writing a massive check to a single vendor in the belief that there will be “one throat to choke.”
There are certain key principles of a DataOps ecosystem that we see at work every day in a large enterprise. A modern DataOps infrastructure/ecosystem should be and do the following:
- Cloud First – Scale Out/Distributed
- Highly Automated, Continuous and Agile (Data will change)
- Open/Best of Breed (Not one platform/vendor)
- Loosely Coupled (RESTful Interfaces Table(s) In/Out)
- Lineage/Provenance is Essential
- Bi-Directional Feedback
- Deterministic, Probabilistic, and Humanistic Data Integration
- Both Aggregated AND Federated Storage
- Both Batch AND Streaming
Cloud First – Scale Out/Distributed
The center of gravity for enterprise data has shifted to the cloud. The full transition will take decades, but most of the large companies we work with on a regular basis prefer to start new big data projects natively on the cloud. This is a major improvement, primarily because using cloud-native infrastructure shortens project times: in my experience, by about 50% or more. Additionally, modern cloud database systems are designed to scale out natively and massively simplify operations and maintenance of large quantities of data. Finally, the core compute services available on the large cloud data platforms (Google Cloud Platform, Amazon Web Services, Microsoft Azure, Snowflake and Databricks) are incredibly powerful and easy to scale out quickly as required. Replicating these services on premises would cost more than enterprises can afford. The cloud environments are also elastic, so you can scale up/down as required with little to no capital investment.
Highly Automated, Continuous and Agile (Data will change)
The scale and scope of data in the enterprise has surpassed the ability of bespoke human effort to catalog, move and organize the data. Automating your data infrastructure and using the principles of highly engineered systems (design for operations, repeatability, automated testing and release of data) is critical to keep up with the dramatic pace of change in enterprise data. The principles at work in automating the flow of data from sources to consumption are very similar to those that drove the automation of software build/test and release in DevOps over the past 20 years. This is one of the key reasons we call the overall approach “DataOps”.
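To make the DevOps analogy concrete, here is a minimal sketch of automated checks gating the release of a table, much as unit tests gate a software release. The row layout and the helper names (`no_missing`, `unique`, `run_checks`, `release`) are hypothetical and only illustrate the idea; a real pipeline would likely use a dedicated data-testing tool.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Check = Callable[[List[Row]], bool]


def no_missing(column: str) -> Check:
    """Build a check that passes when every row has a non-empty value in `column`."""
    return lambda rows: all(r.get(column) not in (None, "") for r in rows)


def unique(column: str) -> Check:
    """Build a check that passes when the values in `column` are all distinct."""
    def check(rows: List[Row]) -> bool:
        values = [r.get(column) for r in rows]
        return len(values) == len(set(values))
    return check


def run_checks(rows: List[Row], checks: Dict[str, Check]) -> Dict[str, bool]:
    """Run every named check and report pass/fail per check."""
    return {name: check(rows) for name, check in checks.items()}


def release(rows: List[Row], checks: Dict[str, Check]) -> List[Row]:
    """Hand the table on only if all checks pass; otherwise fail loudly."""
    failed = [name for name, ok in run_checks(rows, checks).items() if not ok]
    if failed:
        raise ValueError("release blocked by failed checks: " + ", ".join(failed))
    return rows
```

Because the checks are plain functions, they can run on every data update rather than once per project, which is what treating data change as the norm requires.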
Data changes constantly. Enterprise data sources should be treated as dynamic objects rather than static objects, and next-gen infrastructure should enable data to flow dynamically and treat data updates as the norm rather than the exception. It is critical to build an infrastructure that supports a continuous flow of data, as the enterprise begins to embrace the dynamic nature of data and the need to instrument the flow of data from many diverse sources through to all potential consumption endpoints.
In the past, enterprise data projects tended to become “boil the ocean” types of projects. For example, most of the Master Data Management (MDM) tools and projects epitomized a “waterfall”-like approach to organizing data. The next generation of data management infrastructure must enable a more agile approach to organizing, aligning and mastering data. Aligning data to consumers’ interests requires a more dynamic, agile approach. The emergence of “data wrangling” and self-service data preparation is a move in the right direction to support a more agile approach to data management. However, we believe that enabling consumers to customize the way they would like to prepare the data is necessary but not sufficient to solve the broader problem of data reuse, which requires some collaborative unification, alignment and mastering of data across the entire organization.
The key to success in the long term is to empower users to shape the data to suit their needs while also broadly organizing and mastering the data to ensure its adequate consistency, quality and integrity as it is used across an organization.
Open/Best of Breed (Not one platform/vendor)
The best way to describe this is to talk about what it is not. The primary characteristic of a modern DataOps ecosystem is that it is NOT a single proprietary software artifact or even a small collection of artifacts from a single vendor.
In the next phase of data management in the enterprise, it would be a waste of time for customers to “sell their data souls” to single vendors that promote proprietary platforms. The ecosystem in DataOps should resemble DevOps ecosystems–where there are many Best of Breed FOSS and proprietary tools that are expected to interoperate via APIs. An open ecosystem results in better software being adopted broadly–and offers the flexibility to replace, with minimal disruption to your business, those software vendors that don’t produce better software.
Closely related to having an open ecosystem is embracing technologies and tools that are best of breed, meaning that each key component of the system is built for purpose, providing a function that is the best available at a reasonable cost. As the tools and technology that the large internet companies built to manage their data go mainstream, the enterprise has been flooded with a set of tools that are powerful, liberating (from the traditional proprietary enterprise data tools) and intimidating.
Selecting the right tools for your data workloads is difficult because of the massive heterogeneity of data in the enterprise and also because of the dysfunction introduced by software sales/marketing organizations that all over-promote their own capabilities (in other words, extreme data-software-vendor hubris). It all sounds the same on the surface, so the only way to really figure out what systems are capable of is to actually try them (or take the word of a proxy, a real customer who has worked with the vendor to deliver value from a production system). This is why the experience of people such as Mark Ramsey, previously at GSK, is such a powerful example. Mark’s attempt to build an ecosystem of 12+ best-of-breed vendors and combine their solutions to manage data as an asset at scale is truly unique and a good reference as to what works and what does not.
Loosely Coupled (RESTful Interfaces Table(s) In/Out)
The next logical question to ask if you embrace best of breed is “How will these various systems/tools communicate, and what is the protocol?” Over the past 20 years, I’ve come to believe that when talking about interfaces between core systems, it’s best to focus on the lowest common denominator. In the case of data, this means tables: both individual tables and collections of tables. I believe that Table(s) In/Table(s) Out is the primary method that should be assumed when integrating these various best-of-breed tools/software artifacts. Tables can be shared or moved using many different methods described under Data Services. A great reference for these table-oriented methods is the popularity of RDDs and DataFrames in the Spark ecosystem. Using “Service Oriented” methods for these interfaces is critical, and the thoughtful design of these services is a core component of a functional DataOps ecosystem.
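As an illustration of Table(s) In/Table(s) Out, the sketch below gives every processing step the same contract: it receives a named collection of tables and returns a named collection of tables. The type aliases, step functions and column names are hypothetical; plain lists of dicts stand in for whatever table format (for example, DataFrames) the tools actually exchange.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
Table = List[Row]
Tables = Dict[str, Table]          # a named collection of tables
Step = Callable[[Tables], Tables]  # the only contract between tools


def dedupe_customers(tables: Tables) -> Tables:
    """Table in, table out: keep the first row seen for each normalized email."""
    seen = set()
    out: Table = []
    for row in tables["customers"]:
        key = str(row["email"]).strip().lower()
        if key not in seen:
            seen.add(key)
            out.append(row)
    return {"customers": out}


def count_by_country(tables: Tables) -> Tables:
    """Table in, table out: derive a summary table from the customers table."""
    counts: Dict[object, int] = {}
    for row in tables["customers"]:
        counts[row["country"]] = counts.get(row["country"], 0) + 1
    summary = [{"country": c, "n": n} for c, n in sorted(counts.items())]
    return {"customers_by_country": summary}


def run_pipeline(steps: List[Step], tables: Tables) -> Tables:
    """Chain steps; each step's output tables are merged over the current set."""
    for step in steps:
        tables = {**tables, **step(tables)}
    return tables
```

Since each step only depends on the table contract, one vendor's tool can be swapped for another without touching the steps around it.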
Overall, we see a pattern of three key types of interfaces that are required/desired inside of these systems.
Three Core Styles of Interfaces
Many people want to consume data in a large enterprise. Some “power users” need to access data in its raw form, whereas others just want to get responses to inquiries that are well formulated.
The design patterns for interfaces that we see being most useful include data access services, messaging services, and REST services. A short description of each is below:
- Data access services that are “View” abstractions over the data and are essentially SQL or SQL-like interfaces. This is the power-user level that data scientists prefer.
- Messaging services that provide the foundation for stateful data interchange, event processing, and data interchange orchestration.
- REST services built on or wrapped around APIs, providing the most flexible direct access to and interchange of data.
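The three styles can be sketched side by side over one small table. This is only an illustration under loud assumptions: an in-memory SQLite view stands in for a data access service, a `queue.Queue` stands in for a message broker, and a function returning JSON stands in for a REST handler. The table, view and function names are all hypothetical.

```python
import json
import queue
import sqlite3


def build_store() -> sqlite3.Connection:
    """Create an in-memory table plus a view that hides the raw layout."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (customer TEXT, amount REAL)")
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?)",
        [("acme", 120.0), ("acme", 30.0), ("globex", 75.0)],
    )
    conn.execute(
        "CREATE VIEW customer_totals AS "
        "SELECT customer, SUM(amount) AS total FROM orders GROUP BY customer"
    )
    return conn


def query(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> list:
    """1. Data access service: power users send SQL against the view."""
    return conn.execute(sql, params).fetchall()


def publish_order(conn: sqlite3.Connection, events: queue.Queue,
                  customer: str, amount: float) -> None:
    """2. Messaging service: record the change and emit an event about it."""
    conn.execute("INSERT INTO orders VALUES (?, ?)", (customer, amount))
    events.put({"type": "order_created", "customer": customer, "amount": amount})


def get_customer_total(conn: sqlite3.Connection, customer: str) -> str:
    """3. REST-style service: a well-formulated question, a JSON answer."""
    row = conn.execute(
        "SELECT customer, total FROM customer_totals WHERE customer = ?",
        (customer,),
    ).fetchone()
    if row is None:
        return json.dumps({"error": "not found"})
    return json.dumps({"customer": row[0], "total": row[1]})
```

The point of the sketch is that all three interfaces sit over the same data, so each kind of consumer can pick the style that suits it.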
Lineage/Provenance is Essential
As data flows through a next-generation data ecosystem, it is of paramount importance to manage lineage metadata properly to ensure reproducible data production for analytics and machine learning. Having as much provenance/lineage for data as possible enables the reproducibility that is essential for data science practices/teams of any significant scale. Ideally, each version of a tabular input and output to a processing step is registered. In addition to tracking inputs and outputs to data processing steps, some metadata about what the processing steps are doing is essential. With a focus on data lineage and processing tracking in place across the data ecosystem, reproducibility goes up and confidence in data increases. It’s important to note that lineage/provenance is not absolute: there are many subtle levels of provenance and lineage, and it’s important to embrace the spectrum and implement what is appropriate. In other words, it’s more of a style of your ecosystem than a component.
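A minimal sketch of registering each version of a step's tabular inputs and outputs follows. It assumes tables are JSON-serializable lists of dicts and uses a truncated SHA-256 content hash as the version id; the `LineageLog` class and its methods are hypothetical, not an existing library API.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Table = List[dict]
Tables = Dict[str, Table]


def table_version(rows: Table) -> str:
    """Content hash of a table: identical content yields an identical version id."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:12]


@dataclass
class LineageRecord:
    step: str
    description: str          # metadata about what the step does
    inputs: Dict[str, str]    # table name -> version id
    outputs: Dict[str, str]   # table name -> version id


@dataclass
class LineageLog:
    records: List[LineageRecord] = field(default_factory=list)

    def run(self, step: str, description: str,
            func: Callable[[Tables], Tables], inputs: Tables) -> Tables:
        """Execute a step and register the versions of its inputs and outputs."""
        outputs = func(inputs)
        self.records.append(LineageRecord(
            step=step,
            description=description,
            inputs={name: table_version(t) for name, t in inputs.items()},
            outputs={name: table_version(t) for name, t in outputs.items()},
        ))
        return outputs

    def produced_by(self, version: str) -> List[LineageRecord]:
        """Find the step(s) that produced a given table version."""
        return [r for r in self.records if version in r.outputs.values()]
```

This sits at the lightweight end of the lineage spectrum: it records table-level versions and a free-text description, not column-level or row-level provenance.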
Bi-Directional Feedback
There is a massive gap in the enterprise: the lack of methods/infrastructure to collect feedback directly from data consumers, and to organize and prioritize the handling of that feedback by curators/stewards/data professionals, so that data consumers’ issues are addressed broadly across an entire organization. Currently, data flows FROM sources through all the various intermediary methods (data warehouses, lakes, data marts, spreadsheets) TO consumers, but there are NO fundamental methods to collect feedback from data consumers broadly. What is needed is, essentially, “feedback services” that are broadly embedded in all analytical consumption tools (viz tools, models, spreadsheets, etc.), creating a “Jira for Data”. In the short term, this will be a dumb queue; in the long term this queue will become more intelligent and automated, helping to automatically curate data as appropriate.
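The short-term “dumb queue” could be as simple as the sketch below: consumption tools submit feedback items, and curators work through the open ones in arrival order. The `Feedback` fields and the `FeedbackQueue` methods are hypothetical and chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Optional


@dataclass
class Feedback:
    dataset: str       # which table the consumer was looking at
    record_id: str     # which record the issue concerns
    field_name: str    # which attribute looks wrong
    comment: str       # what the consumer thinks is wrong
    reported_by: str
    source_tool: str   # e.g. a viz tool, model or spreadsheet
    status: str = "open"


class FeedbackQueue:
    """First-in, first-out queue of consumer feedback for data curators."""

    def __init__(self) -> None:
        self._items: Deque[Feedback] = deque()

    def submit(self, item: Feedback) -> None:
        self._items.append(item)

    def next_open(self) -> Optional[Feedback]:
        """Return the oldest item that is still open, if any."""
        for item in self._items:
            if item.status == "open":
                return item
        return None

    def resolve(self, item: Feedback) -> None:
        item.status = "resolved"

    def open_count(self, dataset: str) -> int:
        return sum(1 for i in self._items
                   if i.dataset == dataset and i.status == "open")
```

Resolved items are kept rather than deleted, so the history of corrections remains available as input for the more automated curation the queue may grow into.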
Deterministic, Probabilistic, and Humanistic Data Integration
When bringing data together from disparate silos, it’s tempting to rely on traditional deterministic approaches to engineer the alignment of data with rules or ETL. We believe that at scale–with many hundreds of sources–the only viable method of bringing data together is the use of machine-based models (probabilistic) + rules (deterministic) + human feedback (humanistic) to bind the schema and records together as appropriate, in the context of both how the data is generated and (perhaps more importantly) how the data is being consumed.
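One way the three approaches can be combined when deciding whether two records refer to the same entity is sketched below. Everything here is an assumption made for illustration: the `tax_id` rule, the thresholds, and in particular the use of `difflib.SequenceMatcher` string similarity as a simple stand-in for a trained matching model.

```python
from difflib import SequenceMatcher
from typing import Dict, Tuple


def rule_match(a: dict, b: dict) -> bool:
    """Deterministic: an identical, non-empty tax id means the same entity."""
    return bool(a.get("tax_id")) and a.get("tax_id") == b.get("tax_id")


def similarity(a: dict, b: dict) -> float:
    """Probabilistic stand-in: name similarity in [0, 1], not a trained model."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def decide(a: dict, b: dict,
           human_labels: Dict[Tuple[int, int], bool],
           accept: float = 0.9, reject: float = 0.5) -> Tuple[str, str]:
    """Return (decision, source); expert answers override rules and scores."""
    key = (a["id"], b["id"])
    if key in human_labels:                      # humanistic
        return ("match" if human_labels[key] else "no_match", "human")
    if rule_match(a, b):                         # deterministic
        return ("match", "rule")
    score = similarity(a, b)                     # probabilistic
    if score >= accept:
        return ("match", "model")
    if score <= reject:
        return ("no_match", "model")
    return ("needs_review", "model")             # route to a human expert
```

Pairs that come back as `needs_review` are the ones worth a person's time; their answers go into `human_labels` and take precedence on the next run.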
Both Aggregated AND Federated Storage
A healthy next-generation data ecosystem embraces data that is both aggregated AND federated. Over the past 40+ years, the industry has gone back and forth between federated and aggregated approaches for integrating data. It’s my strong belief that the modern enterprise requires an overall architecture in which sources and intermediate storage of data will be a combination of both aggregated and federated data. This adds a layer of complexity that was previously challenging to manage but is now practical with some modern design patterns. There are always tradeoffs of performance and control when you aggregate vs. federate. But over and over, I find that workloads across an enterprise (when considered broadly and holistically) require both aggregated and federated storage. In your modern DataOps ecosystem, cloud storage methods can make this much easier. In fact, Amazon S3 and Google Cloud Storage specifically, when correctly configured as a primary storage mechanism, can give you the benefit of both aggregated and federated methods.
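As a rough sketch of mixing the two, the catalog below lets consumers read any table by name without knowing whether it is an aggregated copy held centrally or a federated table fetched from its source on demand. The `Catalog` class and its methods are hypothetical, and the fetch callable stands in for whatever connector reaches the source system.

```python
from typing import Callable, Dict, List

Table = List[dict]


class Catalog:
    """One read interface over aggregated copies and federated sources."""

    def __init__(self) -> None:
        self._aggregated: Dict[str, Table] = {}
        self._federated: Dict[str, Callable[[], Table]] = {}

    def register_aggregated(self, name: str, rows: Table) -> None:
        """Store a copy centrally: fast and controlled, but can go stale."""
        self._aggregated[name] = list(rows)

    def register_federated(self, name: str, fetch: Callable[[], Table]) -> None:
        """Leave data at the source: always current, but read at source speed."""
        self._federated[name] = fetch

    def location(self, name: str) -> str:
        if name in self._aggregated:
            return "aggregated"
        if name in self._federated:
            return "federated"
        raise KeyError(name)

    def read(self, name: str) -> Table:
        """Consumers call this without caring where the table lives."""
        if self.location(name) == "aggregated":
            return list(self._aggregated[name])
        return self._federated[name]()
```

The performance and control tradeoff stays visible through `location`, while the decision to aggregate or federate a given table can change without breaking its consumers.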
Both Batch AND Streaming Processing
The success of Kafka and similar design patterns has validated that a healthy next-gen data ecosystem includes the ability to simultaneously process data from source to consumption in BOTH batch and streaming modes. With all the usual caveats about consistency, these design patterns can give you the best of both worlds–the ability to process batches of data as required and also to process streams of data that provide more real-time consumption.
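A small sketch of supporting both modes with one piece of logic: the per-record transformation is written once and then applied either to a complete batch or lazily to a stream. The `enrich` function and the event fields are hypothetical, and a plain Python iterable stands in for a real stream such as a Kafka topic.

```python
from typing import Dict, Iterable, Iterator, List


def enrich(event: Dict) -> Dict:
    """Shared business logic: convert the amount to USD at the given rate."""
    return {**event, "amount_usd": round(event["amount"] * event["fx_rate"], 2)}


def process_batch(events: List[Dict]) -> List[Dict]:
    """Batch mode: the whole input is available, the whole output is returned."""
    return [enrich(e) for e in events]


def process_stream(events: Iterable[Dict]) -> Iterator[Dict]:
    """Streaming mode: emit each result as soon as its event arrives."""
    for event in events:
        yield enrich(event)
```

Keeping the logic in one function means the batch and streaming paths cannot drift apart, though the usual consistency caveats around ordering and late-arriving data still apply.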
All of the above are obviously high level and come with countless technical caveats. After doing hundreds of implementations at large and small companies, I believe that it’s actually possible to do all of the above within an enterprise, but not without embracing an open and best-of-breed approach. At Tamr, we’re in the middle of exercising all of these principles of DataOps with our customers every day across a diverse and compelling set of data engineering projects.