Written by Mark Marinelli
The Expectations Gap
The gap between business users’ expectations of data quality and availability and the actual state of enterprise data has never been larger. Like a digital version of the Borg, new data proliferates across the landscape of large enterprises: vast, relentless, and futile to resist. Unlike the Borg, this new data, whether from organic growth, acquisition, or new applications, assuredly does not assimilate. As my colleague Mike Stonebraker details at great length, we’ve spent decades trying to win the battle against data variety, and generations of technologies have failed to stem the tide.
The Problems: Speed and Scale
The problem isn’t that previous generations of technologies don’t work (they do); it’s that they don’t work at scale. There are plenty of master data management (MDM) and data quality (DQ) tools that do what they are supposed to do for a handful of data sources. I’ve seen some successful MDM systems, but I’ve seen plenty more where the MDM software doesn’t become the one system to rule them all, but rather another data source in the environment, in effect adding to the problem. There are a few reasons these systems break down in the face of modern data environments, but the common ones I see are:
- Domain experts for a particular data type are involved only at the beginning of the project, during requirements gathering, and at the end, for validation. Without more consistent input, the end result is more likely to be off the mark, making it less useful or even worthless. This slows the delivery of these projects.
- As the number of data sources increases, small differences in schemas or field formats compound and create huge variations in data. As a result, the rules required to accurately map schemas and deduplicate records have to become equally complex. The need for complicated rules limits the scale that these projects can tackle.
The Problems Continued: Old Approaches, Old Results
A large part of the problem isn’t the technologies, but the approach. The department most often tasked with solving data unification and quality issues, IT, tends to have a project management background shaped by years of software implementations. These projects are characterized by heavy up-front investment in specifications, long cycle times, and validation at the end of implementation.
Old approaches to data management have a central flaw: they don’t acknowledge that data environments and end user requirements are constantly changing. These changes create both the demand for rapid answers and the complexity in the data landscape that makes those answers hard to deliver quickly. Agile data mastering embraces changing environments by taking an iterative approach, allowing businesses to adapt flexibly and change quickly. I go into this process a bit more over at Dataversity.
An Agile Approach Is the Answer
Luckily, there is an example of a similar problem that has a solution. If we look back at software development, not even two decades ago, we can find an analogous problem in feature development. Engineers would define the specifications and components up front with architects, write the software, and ship it in major releases. This Waterfall approach to software development resulted in a lot of failed releases and applications that didn’t solve the problems they were intended to solve. As engineers thought more critically about their role in the process, they realized that they needed to aggressively and continuously engage users in requirements gathering, examine the actual usage of software to develop improvements, and iterate constantly through rapid releases. The agile approach to software development rapidly rose to prominence and is now the gold standard in building software applications.
That same agile approach, applied to data management, can help us address both speed and scale. By treating data unification as an iterative process and engaging stakeholders early and often, we can correct issues, accommodate emergent requirements, and react to changing data quickly. We avoid rework and optimize our efforts, and can thus solve the speed problem. By augmenting or replacing codified rulesets with new technologies like machine learning models, we can build automated data mapping and cleansing that adapts to data variety and volatility, allowing us to cast a wider net and address the data scale challenge. We can achieve truly scalable data curation; we just need the right approach.
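To make the iterative loop a little more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a description of any particular product: the function names (`similarity`, `fit_threshold`, `curate`) and the `ask_expert` callback are hypothetical, and a plain string-similarity score with a learned cutoff stands in for a real machine learning model. In each round the sketch asks a domain expert about the record pair it is least sure of, then re-fits the cutoff from the answers collected so far.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1] (a stand-in for a real model)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fit_threshold(labeled):
    """Re-fit the match cutoff from (score, is_match) pairs labeled by experts.

    Uses the midpoint between the lowest-scoring confirmed match and the
    highest-scoring confirmed non-match; falls back to 0.5 until both exist.
    """
    matches = [score for score, is_match in labeled if is_match]
    non_matches = [score for score, is_match in labeled if not is_match]
    if not matches or not non_matches:
        return 0.5
    return (min(matches) + max(non_matches)) / 2


def curate(records, ask_expert, rounds=3):
    """Iteratively deduplicate records, asking an expert one question per round.

    `ask_expert(a, b)` is a hypothetical callback returning True when the
    two records refer to the same entity.
    """
    pairs = [(similarity(a, b), a, b) for a, b in combinations(records, 2)]
    labels = {}  # (a, b) -> (score, expert answer)
    threshold = 0.5
    for _ in range(rounds):
        unlabeled = [p for p in pairs if (p[1], p[2]) not in labels]
        if not unlabeled:
            break
        # Ask about the pair closest to the current cutoff (the least certain one).
        score, a, b = min(unlabeled, key=lambda p: abs(p[0] - threshold))
        labels[(a, b)] = (score, ask_expert(a, b))
        threshold = fit_threshold(list(labels.values()))
    # Expert answers win where available; the cutoff decides the rest.
    matches = [
        (a, b)
        for score, a, b in pairs
        if (labels[(a, b)][1] if (a, b) in labels else score >= threshold)
    ]
    return threshold, matches


if __name__ == "__main__":
    companies = ["Acme Corp", "ACME Corporation", "Globex Inc",
                 "Globex Incorporated", "Initech"]
    # Hypothetical expert: two names match when their first word agrees.
    expert = lambda a, b: a.split()[0].lower() == b.split()[0].lower()
    print(curate(companies, expert, rounds=3))
```

The point of the sketch is the shape of the loop rather than the scoring function: expert input arrives in small increments throughout the project instead of only at the start and the end, and each answer immediately changes how the remaining records are handled.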