The kind of broad data unification that Tamr does typically leads to complex deployments that involve many steps across many systems. This type of unification requires a reliable, manageable way to automate and coordinate all of this activity while supporting our approach to DataOps. We tried many different approaches across dozens of projects, but until recently found nothing that really met our needs. Cron provides no redundancy or failover and is challenging to integrate into version control; Bash and Python scripts provide no intrinsic means for monitoring, centralized logging, or restarting from a point of failure. These gaps dramatically increase the cost of building and extending reliable, repeatable data processing workflows. We needed something better. We looked at scheduling tools such as Rundeck, Metronome, and kube-scheduler. We looked at data movement tools such as Luigi, Pinball, and StreamSets. Nothing provided the capabilities we needed in a way that supports the core principles of DataOps, until we encountered Airflow.
Apache Airflow is a workflow management system. Airflow provides tools to define, schedule, execute, and monitor complex workflows that orchestrate activity across many systems. The motivation for Airflow is described eloquently in two blog posts by the original author, Maxime Beauchemin, then of Airbnb: “The Rise of the Data Engineer” and “The Downfall of the Data Engineer”. Over the past couple of years, Airflow has emerged as the clear “best-of-breed” tool for orchestrating data systems.
Data unification, by definition, involves bringing data together from heterogeneous sources. To enable this, we need a tool that is broadly interoperable – one that can manage many systems and is easy to extend to bring additional systems under management. Airflow meets this need nicely: the base package and contrib module contain ‘hooks’ for many databases, web services, applications, and other tools. The folks at Astronomer have also built an impressive library of additional hooks and add-ons. It was also very straightforward to set up an Airflow plugin to provide a hook and suite of operators for Tamr Unify. This means that we can use one tool to orchestrate end-to-end data production, so all data production tasks are visible in one place.
So why Airflow?
Orchestration: One of the appealing aspects of Airflow is that it does one thing extremely well – orchestration. When we were evaluating tools and platforms for orchestration, we eliminated many candidates because they insisted on owning data movement as well as orchestration. In a modern datacenter, data movement is handled by varying tools based on the systems involved, and these tools typically have dedicated infrastructure already. Airflow empowers us to use existing tools and infrastructure for data movement, while centralizing orchestration.
Python-based: Airflow brings the ‘infrastructure as code’ maxim to orchestration. By using Python as the language to express orchestration, Airflow enables us to use a broad, existing toolchain for developing, managing, reviewing and publishing code.
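To make the workflow-as-code idea concrete, here is a minimal sketch in plain Python. It is not Airflow's API: it uses only the standard library's graphlib module (Python 3.9+) to show how tasks and their dependencies can live in ordinary, reviewable code. The task names and the run_workflow helper are hypothetical and chosen purely for illustration.

```python
from graphlib import TopologicalSorter


def run_workflow(tasks, upstream):
    """Run every task after all of its upstream tasks; return the execution order.

    tasks:    maps a task name to a zero-argument callable
    upstream: maps a task name to the set of task names it depends on
    """
    order = list(TopologicalSorter(upstream).static_order())
    for name in order:
        tasks[name]()
    return order


# Hypothetical three-step unification workflow (names are assumptions).
log = []
tasks = {
    "extract_sources": lambda: log.append("extract_sources"),
    "unify_records": lambda: log.append("unify_records"),
    "publish_dataset": lambda: log.append("publish_dataset"),
}
upstream = {
    "extract_sources": set(),
    "unify_records": {"extract_sources"},
    "publish_dataset": {"unify_records"},
}

order = run_workflow(tasks, upstream)
print(order)
```

Because the whole workflow is just a Python module, it can be diffed, code-reviewed, and versioned like any other source file. Airflow layers scheduling, retries, and monitoring on top of the same basic idea.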
Production focused: Deploying data unification into production carries a suite of concerns including availability, scalability, fault analysis, security, and governance. Airflow has given consideration to all of these. It uses a write-ahead log and distributed execution for availability and scalability. There is a plugin to enable monitoring using Prometheus, and the use of standard Python logging makes integration with an ELK stack, for example, straightforward.
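As a rough illustration of why standard Python logging helps here, the sketch below attaches a JSON formatter to an ordinary logger so that each record becomes one machine-parseable line, which is the kind of output a log shipper can typically forward to an ELK stack. This is not Airflow configuration; the JsonFormatter class, the logger name, and the chosen fields are assumptions made for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical field set)."""

    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        })


# Write to an in-memory stream here; a real deployment would use a file or stdout.
stream = io.StringIO()
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("unify_workflow")  # hypothetical logger name
logger.setLevel(logging.INFO)
logger.addHandler(handler)
logger.propagate = False  # keep the example's output confined to our handler

logger.info("task %s finished", "unify_records")
line = stream.getvalue().strip()
print(line)
```

Since the formatter plugs into the standard logging machinery, the code that emits log messages does not need to change when the output format or destination does.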
Deploying data unification workflows with Airflow provides the ability to coordinate activity across diverse systems in a scalable, reliable, restartable way, with excellent integration into contemporary deployment, scaling, and monitoring frameworks. What’s more, as we have developed more and more complex workflows in Airflow, we have found its workflow-as-code approach to result in cleanly structured, highly maintainable code that fits extremely well with other aspects of DataOps. The net result is a significant reduction in cost and increase in extensibility for the orchestration aspect of DataOps, meaning existing organizations can deliver more data, faster.