AWS Glue, ETL, and the Persistent Challenge of Data Variety
Yesterday, Amazon announced the general availability of AWS Glue, which it describes as a fully managed ETL (extract, transform, load) service that aims to streamline the challenges of data preparation. The service was previewed back in December 2016 at Amazon’s re:Invent conference, so while it’s not a surprise to anyone watching the space, the general release of AWS Glue is an important milestone.
The ETL market isn’t going away, but it’s about to get a lot more interesting. AWS Glue is a big deal, and it will be a disruptive force in the traditional ETL market (think Talend, Informatica, IBM, Oracle). While those committed to hybrid and/or multi-cloud architectures will probably view Glue with some trepidation [insert joke about Glue making AWS more sticky here], it will surely create a lot of value for those already committed to AWS, and it will draw in new users attracted to single-vendor, full-stack AWS solutions. It will put serious pressure on traditional ETL vendors fighting for relevance in the cloud. And it will force competing cloud and PaaS providers to move in a similar direction to match Amazon’s new hosted, multi-tenant ETL offering.
Amid all this swirl, the general availability of Glue presents an opportunity to reflect on the limitations of the traditional ETL paradigm (cloud-based or otherwise), particularly when it comes to solving today’s biggest unmet enterprise data challenge. The deterministic approach embodied by traditional ETL and MDM tools fails to solve the core issue of data variety: the third, and most problematic, ‘V’ in the classic view of big data, which has historically emphasized volume and velocity. If you’re a shiny new ‘born on the Web’ company, data variety probably isn’t your biggest problem. If, however, you’re a mature, large-scale enterprise that has been around for decades or even centuries, you’ve almost certainly endured waves of technology adoption, acquisitions and divestitures, shadow IT groups, and successive re-orgs, all of which have exacerbated the problem of data variety: multiple data silos, differing schemas and formats, and wildly variable completeness and quality.
Data variety is the biggest obstacle stopping enterprises from realizing analytic and operational breakthroughs, and traditional ETL and MDM tools, with their deterministic approaches, haven’t helped these companies overcome the challenge of their data silos. Cloud-based ETL won’t solve the problem either; it simply relocates the issue. While it may be more economically attractive and scalable, it shares the fundamental flaws of traditional ETL:
- Rules break: The logic of ETL systems is based on rules determined by developers and enshrined in code. Deterministic, static rules don’t scale well as data variety increases. A better approach is to use machine learning techniques to create bottom-up, probabilistic models for combining and cleaning data (see the first sketch after this list). Not only does this scale better as new sources are added, it also adapts more gracefully to entropy in the underlying data sources themselves.
- Context is king: Human expertise about data’s business meaning is essential, but ETL rule developers often lack the domain knowledge necessary to interpret the data they are trying to integrate. As a result, they either make assumptions, which leads to data quality problems when they guess wrong, or they interview subject matter experts and attempt to codify their findings, a time-consuming process subject to all the vagaries of human communication. A more effective approach is a data integration system that easily captures SME knowledge and uses it to enhance the results generated by machine learning (see the second sketch below).
- Best-of-breed is best: DataOps, the practice of applying the principles behind DevOps to the challenge of increasing analytic velocity, calls for a mix of best-of-breed tools that work well together, rather than the monolithic data management platforms offered by big traditional vendors (Oracle, IBM, Microsoft, et al.).
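To make the first point concrete, here is a minimal sketch in plain Python of why deterministic rules break where probabilistic matching bends. The records, field names, and weights are all illustrative assumptions, not the logic of any particular product; a production system would learn the weights from labeled match/non-match pairs rather than hard-coding them.

```python
# A minimal sketch (standard library only): a brittle deterministic rule
# versus a probabilistic similarity score for matching records from two
# hypothetical silos. All fields and weights are illustrative assumptions.
from difflib import SequenceMatcher

source_a = {"name": "Acme Corp.", "city": "New York", "phone": "212-555-0123"}
source_b = {"name": "ACME Corporation", "city": "NYC", "phone": "(212) 555-0123"}

# Fixed weights for illustration; a real system would learn these from labels.
WEIGHTS = {"name": 0.5, "city": 0.2, "phone": 0.3}

def rule_based_match(a, b):
    # Deterministic rule: exact match on lowercased name AND city.
    # It breaks as soon as a new source spells either field differently.
    return (a["name"].lower() == b["name"].lower()
            and a["city"].lower() == b["city"].lower())

def similarity(x, y):
    # Character-level similarity in [0, 1].
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def digits(s):
    # Normalize phone numbers down to their digits.
    return "".join(ch for ch in s if ch.isdigit())

def probabilistic_match(a, b):
    # A weighted blend of per-field similarities yields a graded score
    # instead of a brittle yes/no.
    return (WEIGHTS["name"] * similarity(a["name"], b["name"])
            + WEIGHTS["city"] * similarity(a["city"], b["city"])
            + WEIGHTS["phone"] * similarity(digits(a["phone"]), digits(b["phone"])))

print("rule-based match:", rule_based_match(source_a, source_b))      # False
print("match score: %.2f" % probabilistic_match(source_a, source_b))  # ~0.7
```

The rule returns False the moment naming conventions drift; the probabilistic score degrades gracefully and leaves room for a threshold, or a human, to decide.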
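And to make the second point concrete, here is an equally hedged sketch of a human-in-the-loop triage step: confident scores are accepted or rejected automatically, while ambiguous pairs go to a subject matter expert whose verdicts are retained as labeled examples for retraining. The thresholds, the `expert_review` stand-in, and the pair data are all hypothetical.

```python
# A minimal sketch of capturing SME knowledge: ambiguous match scores are
# routed to an expert, and the expert's answers become training labels.
# Thresholds, data, and the expert_review stand-in are all hypothetical.

def expert_review(pair):
    # In a real system this would present the pair to an SME in a UI;
    # here we simply simulate the expert's verdict.
    return pair["expert_says_match"]

def triage(pairs, low=0.3, high=0.8):
    decisions, new_labels = [], []
    for pair in pairs:
        score = pair["model_score"]  # stand-in for a learned match probability
        if score >= high:
            decisions.append((pair["id"], "match"))
        elif score <= low:
            decisions.append((pair["id"], "no match"))
        else:
            # Ambiguous: ask a human, and keep the answer as a label so the
            # model improves instead of the rulebook growing.
            label = "match" if expert_review(pair) else "no match"
            decisions.append((pair["id"], label))
            new_labels.append((pair["id"], label))
    return decisions, new_labels

pairs = [
    {"id": 1, "model_score": 0.93, "expert_says_match": True},
    {"id": 2, "model_score": 0.55, "expert_says_match": True},   # goes to the SME
    {"id": 3, "model_score": 0.12, "expert_says_match": False},
]
decisions, new_labels = triage(pairs)
print(decisions)   # [(1, 'match'), (2, 'match'), (3, 'no match')]
print(new_labels)  # [(2, 'match')] -- SME knowledge, captured as data
```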
ETL isn’t going away anytime soon, and AWS Glue is going to make the market a whole lot more dynamic. It will precipitate a series of moves and countermoves by incumbents and new entrants alike. We also think it will shine a brighter light on the enterprise-scale data variety problems that ETL approaches are ill-equipped to tackle. That’s the challenge that we’re focused on at Tamr, and we welcome the opportunity to partner with like-minded enterprises and vendors who see probabilistic data unification as both a complement to existing ETL and the path to unlocking new value from their data assets.
Tamr co-founder and Turing Award winner Dr. Michael Stonebraker has written about new approaches to scalable data unification here, if you’re interested in diving more deeply into the topic.