Written by Andy Palmer
There are many ways to think about the potential components of a next gen enterprise data engineering ecosystem. Our friends at DataKitchen have done a great job with this post, which refers to some solid work by the Eckerson Group. To simplify the question of what you might buy vs. build and which vendors you might consider, I’ve tried to lay out the primary components of a DataOps ecosystem based on the environments I’ve seen people configuring over the past 8-10 years and the tools (new and old) that are available.
I’ve provided a brief summary of each of these components below. If you are interested, we can always meet to discuss any of them in depth.
Over the past 5-10 years, a key function has emerged as a critical starting point in the development of a functional DataOps ecosystem: the Data Catalog/Registry. There are a number of FOSS and commercial projects that attempt to provide tools that enable large enterprises to answer the simple question “what data do you have?” I’ve always believed that best practice for a modern Data Catalog/Registry is to implement a “vendor neutral” system that crawls all tabular data and registers all tabular datasets (files and tables inside a DBMS) using a combination of automation and human guidance.
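To make the "crawl and register" idea concrete, here is a minimal sketch in Python. It is not any particular catalog product: the `CatalogEntry` record, the function names, and the choice of CSV files and SQLite as stand-ins for "files" and "a DBMS" are all assumptions made for illustration. The `steward_notes` field is where the human guidance would go.

```python
import csv
import os
import sqlite3
from dataclasses import dataclass


@dataclass
class CatalogEntry:
    """One registered tabular dataset (hypothetical record layout)."""
    source: str              # file path or database path
    name: str                # file name or table name
    columns: list[str]
    row_count: int
    steward_notes: str = ""  # filled in by a person, not by the crawler


def crawl_csv_files(root: str) -> list[CatalogEntry]:
    """Walk a directory tree and register every CSV file found."""
    entries = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in sorted(filenames):
            if not filename.lower().endswith(".csv"):
                continue
            path = os.path.join(dirpath, filename)
            with open(path, newline="") as handle:
                reader = csv.reader(handle)
                header = next(reader, [])          # first row as column names
                row_count = sum(1 for _ in reader)  # remaining rows
            entries.append(CatalogEntry(path, filename, header, row_count))
    return entries


def crawl_sqlite_tables(db_path: str) -> list[CatalogEntry]:
    """Register every table in a SQLite database (stand-in for any DBMS)."""
    entries = []
    conn = sqlite3.connect(db_path)
    try:
        tables = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        for table in tables:
            # PRAGMA table_info returns one row per column; index 1 is the name.
            columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
            row_count = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
            entries.append(CatalogEntry(db_path, table, columns, row_count))
    finally:
        conn.close()
    return entries
```

A real registry would also record owners, types, and profiling statistics; the point here is only that the first pass can be automated and the annotation left to people.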
There are SO many options for data movement that it’s a bit mind-numbing. They range from the traditional ETL/ELT vendors/tools (Informatica, Talend, Oracle, IBM, Microsoft) to the new breed of movement vendors (my favorites are StreamSets, DataKitchen & KNIME) and the Cloud Data Platform vendors such as Google GCP/Dataflow, Microsoft Azure, AWS, Databricks, Snowflake and the dozens of new ones that every VC in the Bay Area is funding by the month.
One of the benefits of having a more open/interoperable/best of breed approach is that as you need high performance movement tools you can adopt these incrementally. For example, you can use Python/Airflow as the baseline or default across your organization and then titrate in high performance tools like StreamSets where you need the scale/performance. In the long term, this enables your ecosystem of tools to evolve gracefully and avoid massive single vendor lift/shift projects, which are prone to failure despite the expectations that any single vendor might want to set along with an eight-figure-plus proposal.
The tools required to create consistency in data need to be strongly rooted in the use of three key methods: rules, models and human feedback. These three key methods implemented with an eye towards an Agile process are essential to tame the large challenge of variety in enterprise data.
Traditional tools that depend on a single data architect to define, a priori, a static schema, controlled vocabulary, taxonomy, ontologies and relationships are inadequate to solve the challenges of data variety in the modern enterprise. Long term success in aligning and molding data broadly depends on three things working together: thoughtfully engineered data pipelines that integrate rules, active learning based probabilistic models of how data fits together, and deliberate human feedback that provides subject matter expertise and handles corner cases.
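As a rough illustration of how rules, models and human feedback can be layered, the sketch below decides whether two supplier records refer to the same entity. Everything in it is assumed for the example: the record fields, the thresholds, and especially `model_score`, which uses a plain string similarity from Python's `difflib` as a stand-in for a trained probabilistic model. It does not implement active learning; it only shows where a model's score would plug in.

```python
from difflib import SequenceMatcher


def rule_decision(a, b):
    """Deterministic rule: identical non-empty tax IDs mean the same entity.

    Returns True when the rule fires, or None when it does not apply.
    """
    if a.get("tax_id") and a.get("tax_id") == b.get("tax_id"):
        return True
    return None


def model_score(a, b):
    """Stand-in for a learned model: name similarity between 0.0 and 1.0."""
    return SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()


def decide(a, b, human_labels, low=0.6, high=0.9):
    """Return (decision, source) for a pair of records.

    Precedence: human feedback, then rules, then the model.
    Pairs the model is unsure about are routed to a person.
    """
    key = (a["id"], b["id"])
    if key in human_labels:
        return ("match" if human_labels[key] else "no_match", "human")
    if rule_decision(a, b):
        return ("match", "rule")
    score = model_score(a, b)
    if score >= high:
        return ("match", "model")
    if score < low:
        return ("no_match", "model")
    return ("needs_review", "model")


# Usage: an expert's answer for a reviewed pair overrides the model next time.
acme = {"id": 1, "name": "Acme Corp", "tax_id": ""}
acme_long = {"id": 2, "name": "Acme Corporation", "tax_id": ""}
labels = {}
first = decide(acme, acme_long, labels)    # uncertain, goes to a person
labels[(1, 2)] = True                      # the expert confirms the match
second = decide(acme, acme_long, labels)   # now answered from human feedback
```

In a fuller system the human labels would also be fed back as training data, which is what lets the model improve over time instead of asking the same question twice.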
The biggest change in data systems over the past 10 years has been the evolution of data storage platforms, both cloud and on-prem. When my partner Mike and I started Vertica back in 2004, the database industry had plateaued in terms of new design patterns. At the time, everyone in the enterprise was using traditional “row stores” for every type of workload, regardless of the fundamental fit. This was the gist of Mike’s “One Size Does Not Fit All in Database Systems” paper and my Usenix talk in 2010.
Overall, the key next gen platforms that are top of mind now include:
- AWS – Redshift, Aurora, et al
- GCP – Bigtable, Spanner
- Azure – SQL Services et al
The capabilities available in these platforms are dramatic and the pace of improvement has been truly exceptional. I’m specifically not putting the traditional big vendors (IBM, Oracle, Teradata) on this list because most of them are falling further and further behind by the day. The cloud platform vendors have a HUGE advantage relative to the on-prem vendors in that they can radically improve their systems quickly without the latency associated with slow/long on-prem release cycles and the proverbial game of “telephone” that on-prem customers and vendors play.
Once data is organized and cleaned, a mechanism to publish high quality data broadly is essential. This component delivers dynamic datasets, broadly and consistently prepared, in both machine-readable and human-readable form. It also has methods to recommend new data (rows, columns, values, relationships) that are discovered bottom up in the data over time. These methods can be instrumented into consumption endpoints such as analytic tools, so that as new data becomes available, recommendations can be made dynamically to data consumers who may want the continuously improved and updated data in the context of their analytics and operational systems.
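One simple way to picture the recommendation idea is a comparison of schema snapshots: when a published dataset gains columns, consumers who already use that dataset are told about them. The sketch below assumes hypothetical snapshot and subscription dictionaries and covers only new columns, not new rows, values or relationships.

```python
def new_columns(previous, current):
    """Return {dataset: [columns present now but absent from the previous snapshot]}."""
    found = {}
    for dataset, columns in current.items():
        added = sorted(set(columns) - set(previous.get(dataset, [])))
        if added:
            found[dataset] = added
    return found


def recommendations(subscriptions, added):
    """Map each consumer to the new columns in datasets they already use."""
    recs = {}
    for consumer, datasets in subscriptions.items():
        relevant = {d: added[d] for d in datasets if d in added}
        if relevant:
            recs[consumer] = relevant
    return recs


# Usage with made-up snapshots: "suppliers" gained a "risk_score" column.
yesterday = {"suppliers": ["id", "name"], "parts": ["sku", "price"]}
today = {"suppliers": ["id", "name", "risk_score"], "parts": ["sku", "price"]}
subscriptions = {"procurement_dashboard": ["suppliers"], "pricing_model": ["parts"]}

added = new_columns(yesterday, today)
recs = recommendations(subscriptions, added)
```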
This is perhaps the most impactful and least appreciated component of a next gen DataOps ecosystem. Currently, data in most enterprises flows unidirectionally from sources through deterministic and idiosyncratic pipelines towards data warehouses, marts and spreadsheets–and that’s where the story stops. There is an incredible lack of systematic feedback mechanisms to enable data to flow from where data is consumed back up into the pipelines and all the way back to the sources so that the data can be improved systematically over time. Most large organizations lack any “queue” of data problems identified by data consumers.
At Tamr, we’ve created “Steward” to help address this problem–providing a vendor neutral queue of data consumer reported problems with data for the enterprise.
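To show what such a queue might hold at its simplest, here is a generic sketch. It is not Steward's data model or API; the `DataIssue` fields and the `IssueQueue` methods are hypothetical names chosen for the example.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class DataIssue:
    """A problem reported by someone consuming the data (hypothetical layout)."""
    issue_id: int
    dataset: str
    column: str
    description: str
    reported_by: str
    status: str = "open"   # "open" or "resolved"
    resolution: str = ""


class IssueQueue:
    """In-memory queue of consumer-reported data problems."""

    def __init__(self):
        self._issues = []

    def report(self, dataset, column, description, reported_by):
        issue = DataIssue(len(self._issues) + 1, dataset, column,
                          description, reported_by)
        self._issues.append(issue)
        return issue

    def open_issues(self, dataset=None):
        """Open issues, optionally limited to one dataset."""
        return [i for i in self._issues
                if i.status == "open" and (dataset is None or i.dataset == dataset)]

    def resolve(self, issue_id, resolution):
        for issue in self._issues:
            if issue.issue_id == issue_id:
                issue.status = "resolved"
                issue.resolution = resolution
                return issue
        raise KeyError(issue_id)

    def open_counts_by_dataset(self):
        """Where are the problems piling up? Useful for prioritizing fixes upstream."""
        return Counter(i.dataset for i in self.open_issues())


# Usage with made-up reports.
queue = IssueQueue()
queue.report("customers", "country", "Mixed ISO codes and full names", "analyst_a")
queue.report("customers", "email", "Duplicate addresses across accounts", "analyst_b")
queue.report("orders", "amount", "Negative values with no refund flag", "analyst_a")
queue.resolve(1, "Normalized to ISO 3166 codes at the source system")
```

The design choice that matters is that the issue is tied to a dataset and column rather than to a report or dashboard, so the fix can be routed back up the pipeline to the source.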
Governance is a key component of a modern ecosystem: the evolution of data privacy has raised governance to the top of the priority list, driven by the need to comply with many important new regulations. My belief is that the best governance infrastructure focuses on the automation of information access policy and the prosecution of that policy across users in the context of key roles that are aligned with policy. Focusing on governance in the context of information use helps avoid boiling the proverbial infinite ocean of data source complexity. Having a codified information access policy, as well as methods for running that policy in real time as users consume data, should be the core goal of any governance infrastructure initiative.
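A codified policy that is evaluated at the moment of use can be sketched in a few lines. The roles, dataset, and column names below are invented for illustration, and the deny-by-default behavior is a design assumption rather than a description of any specific governance product.

```python
# Policy as data: role -> dataset -> columns that role may read (all hypothetical).
POLICY = {
    "analyst": {"customers": {"customer_id", "region", "lifetime_value"}},
    "support": {"customers": {"customer_id", "name", "email"}},
}


def allowed_columns(policy, role, dataset):
    """Columns a role may read; an unknown role or dataset gets nothing."""
    return policy.get(role, {}).get(dataset, set())


def apply_policy(policy, role, dataset, rows):
    """Filter rows at read time so each role sees only its permitted columns."""
    allowed = allowed_columns(policy, role, dataset)
    if not allowed:
        raise PermissionError(f"role {role!r} may not read dataset {dataset!r}")
    return [{k: v for k, v in row.items() if k in allowed} for row in rows]


# Usage with a made-up record.
rows = [{
    "customer_id": 7,
    "name": "Ada Lovelace",
    "email": "ada@example.com",
    "region": "EMEA",
    "lifetime_value": 1200,
}]
analyst_view = apply_policy(POLICY, "analyst", "customers", rows)
support_view = apply_policy(POLICY, "support", "customers", rows)
```

Because the policy is keyed by role and by how the information is used, it stays small even when the number of underlying data sources is large.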
To learn more about Tamr’s solutions and the importance of implementing a successful DataOps strategy, schedule a demo today or download a copy of our ebook, Getting DataOps Right, below.