As modern enterprises embrace the potential value of data to their organizations, they are beginning to build out a new generation of data infrastructure and human behavioral norms to complement their traditional legacy infrastructure and data culture. The technical methods and behavioral expectations at large internet companies are serving as an aspirational frame of reference for more traditional companies, as they ask “How can we (Global 2000 Companies) manage and monetize data as an asset as effectively as Google, Amazon or LinkedIn?”
New “DataOps” infrastructure and behavior enable enterprises to close the gap between the potential value of their vastly siloed legacy data (Data Suppliers) and the consistent realization of the data’s value by all the people in their organizations (Data Consumers). Data Consumers are not only the newly hired data scientists or traditional analysts. They are also developers embedded in businesses and individual employees on the front lines of the enterprise who have been enabled with next-gen analytic tools over the past 20 years, the result of the “democratization of analytics” in the enterprise. As Christian Chabot said in 2013 on the eve of Tableau’s IPO:
“We believe making it easy for people to see and understand data represents one of the great opportunities in computing this century. We believe there’s a tremendous opportunity to help people answer questions, solve problems and generate meaning from data in a way that has never before been possible. And we believe there’s an opportunity to put that power in the hands of a much broader population of people.”
The emerging popularity of self-service data preparation and the success of vendors such as Alteryx and Trifacta are indications of the significant increase in demand from the average corporate citizen for much higher-quality data aligned with their business needs.
Key Personas in the Enterprise Data Ecosystem
Satisfying demand from data consumers in the enterprise requires enterprises to understand and organize around new functional roles — some that are familiar and some that are new. Based on our work over the past 10+ years, we’ve outlined the personas of key people in the enterprise DataOps ecosystem in the chart below:
As with all organizational changes, as these new personas emerge there will be a complex combination of old roles/new roles and old/new skill sets as technology professionals retool their skill sets to prepare to be the “data brokers” in the enterprise over the coming decades.
People are More Important than Tech
The organizational consequence of the demand for more and better data in the enterprise is the creation of the Chief Data Officer role. Newly appointed Chief Data Officers in large enterprises have a significant task in front of them. Like Chief Information Officers in the 1980s, they need to execute on technical fronts while also facing the unique challenges that are human at the core.
I’ve seen over and over again that the human bits tend to be the primary bottleneck in the successful upgrade or transformation of large companies’ data infrastructure (which is why enterprise software needs the same level of design thinking that users expect and appreciate in their consumer apps).
Some of the most common behavioral data challenges include:
Fearing to share data because of data quality (both because people are worried about being judged and because they worry about being too busy to take on the responsibility of handling data consumers’ requests for fixes). Of course, sharing is the first step to improving quality.
Hoarding data as a method of organizational control or job preservation. This is a fundamental disconnect. If an organization values the data as an asset (like cash), the idea that individuals would hoard the data is no less offensive than individuals in an organization hoarding the cash that they collect from customers.
Obscuring data complexity, that is, the actual number and variety of data sources that large organizations create as they generate data in many idiosyncratic ways. Failing to embrace the complexity, diversity and idiosyncrasy of data generated in a large enterprise is naive.
Limiting access to a very small number of users as a method of control or as a reflection of insecurity about data quality. This often manifests as “data should be available on a need-to-know basis only.”
Data Hoarding – an example of behavior change required to practice modern DataOps
One of my favorite examples of how to tackle data hoarding is a large biopharma company. The person being recruited as their new Chief Data Officer insisted that he get root level access to all the data sources at the company before he would start. When the CEO asked why he needed this, the CDO recruit said that “the data is an asset of the company, but most people in the organization think that they own the data. If I spend all my time negotiating to get access to the data, I won’t have time to help anyone use any of the data.”
I’ve seen this data hoarding pattern repeat in many large companies. A great starting point for any data-driven transformation project is streaming all of a company’s data into a core repository so that you can begin to assess what exists. Collecting and organizing the data is also the first critical step to appropriate governance and privacy: if you don’t know what data you have, you can’t possibly know if you have the appropriate governance in place. Eventually you will want to empower many people in the organization to reuse the data. However, it is critical to separate the use/consumption of the data from the collection/organization/shaping of the data.
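To make the “know what you have before you govern it” point concrete, here is a minimal sketch of a source inventory with a simple governance check. Everything in it is an illustrative assumption rather than a description of any particular product: the SourceEntry structure, the steward field, the SENSITIVE_HINTS list and the governance_gaps function are all hypothetical names, and a real catalog would use far richer metadata than column-name matching.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class SourceEntry:
    """One collected data source in a hypothetical central inventory."""
    name: str
    columns: List[str]
    steward: Optional[str] = None  # who is accountable for governance, if known


# Hypothetical, deliberately crude hints for spotting sensitive columns.
SENSITIVE_HINTS = ("ssn", "email", "dob")


def governance_gaps(catalog: List[SourceEntry]) -> List[str]:
    """List sources that hold sensitive-looking columns but have no steward."""
    gaps = []
    for entry in catalog:
        sensitive = [
            col for col in entry.columns
            if any(hint in col.lower() for hint in SENSITIVE_HINTS)
        ]
        if sensitive and entry.steward is None:
            gaps.append(
                f"{entry.name}: sensitive columns {sensitive} have no steward"
            )
    return gaps


catalog = [
    SourceEntry("hr_db", ["employee_id", "SSN"]),
    SourceEntry("web_logs", ["url", "ts"]),
    SourceEntry("crm", ["Email"], steward="sales-ops"),
]

for gap in governance_gaps(catalog):
    print(gap)
```

The check only works on sources that have already been collected into the inventory, which is the point of the paragraph above: a source nobody has registered cannot show up as a governance gap.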
Next Generation of Technical Infrastructure for DataOps
As Chief Data Officers begin to tackle the human/behavioral challenges, they need to also begin establishing their next-generation technical infrastructure. Having worked with dozens of Global 2000 customers on their data/analytics initiatives at Tamr, we’ve seen some key principles that work well as companies begin to establish their next-generation data infrastructure. These are summarized below and detailed in this post.
A modern DataOps infrastructure/ecosystem should be and do the following:
Highly Automated and Agile
Take Advantage of Best of Breed Tools
Use Table(s) In/Table(s) Out Protocols
Have Layered Interfaces
Track Data Lineage/Provenance
Feature Deterministic, Probabilistic, and Humanistic Data Integration
Combine Both Aggregated AND Federated Methods of Storage and Access
Process Data in Both Batch AND Streaming Modes
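Three of the principles above (table(s) in/table(s) out, lineage tracking, and deterministic integration) can be illustrated together in a small sketch. This is one possible reading of those principles under stated assumptions, not a reference implementation: the Table class, the run_step wrapper and the dedupe_by_email rule are hypothetical names invented for the example, and the sample rows are made up.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Row = Dict[str, object]


@dataclass
class Table:
    """A named table plus an ordered list of provenance notes (hypothetical)."""
    name: str
    rows: List[Row]
    lineage: List[str] = field(default_factory=list)


def run_step(step_name: str,
             fn: Callable[[List[Table]], List[Row]],
             inputs: List[Table],
             output_name: str) -> Table:
    """Run a table(s)-in/table-out step and record where the output came from."""
    rows = fn(inputs)
    lineage: List[str] = []
    for table in inputs:
        lineage.extend(table.lineage)  # carry forward upstream provenance
    input_names = ", ".join(table.name for table in inputs)
    lineage.append(f"{output_name} <- {step_name}({input_names})")
    return Table(output_name, rows, lineage)


def dedupe_by_email(tables: List[Table]) -> List[Row]:
    """Deterministic rule: rows sharing a normalized email are one entity.

    The first row seen for each email wins.
    """
    seen: Dict[str, Row] = {}
    for table in tables:
        for row in table.rows:
            key = str(row["email"]).strip().lower()
            seen.setdefault(key, row)
    return list(seen.values())


crm = Table("crm",
            [{"email": "Ann@Example.com", "name": "Ann"}],
            ["crm <- source"])
billing = Table("billing",
                [{"email": "ann@example.com ", "name": "Ann B."},
                 {"email": "bob@example.com", "name": "Bob"}],
                ["billing <- source"])

customers = run_step("dedupe_by_email", dedupe_by_email,
                     [crm, billing], "customers")

for note in customers.lineage:
    print(note)
```

Because every step takes tables and returns a table, steps from different tools can be chained, and the lineage list travels with the data. A probabilistic or human-reviewed matching step could be slotted in behind the same run_step interface; that part is left out of this sketch.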
We also see many patterns in the design of various components of a next-gen DataOps ecosystem. We have described the key components in this post and summary diagram below.
Storage & Compute
Make no mistake, transforming an enterprise to be data-centric is incredibly hard and very complicated. The buzzwords alone are enough to make your head spin. However, we’ve seen consistently that there is a large quantity of low-hanging analytical fruit waiting to be picked by cleaning up data — tens or even hundreds of millions of dollars in both cost savings and growth potential.