Written by Nik Bates-Haus
The kind of broad data unification that Tamr does typically leads to complex deployments that involve many steps across many systems. This type of unification requires a reliable, manageable way to automate and coordinate all of this activity while supporting our approach to DataOps. We’ve tried many different approaches across dozens of projects, but until recently we hadn’t found anything that really met our needs. Cron provides no redundancy or failover and is challenging to integrate into version control; Bash and Python scripts provide no intrinsic means for monitoring, centralized logging, or restarting from a point of failure. These gaps dramatically increase the cost of building and extending reliable, repeatable data processing workflows. We needed something better. We looked at scheduling tools such as Rundeck, Metronome, and kube-scheduler. We looked at data movement tools such as Luigi, Pinball, and StreamSets. Nothing provided the capabilities we needed in a way that supports the core principles of DataOps, until we encountered Airflow.
Apache Airflow is a workflow management system. Airflow provides tools to define, schedule, execute, and monitor complex workflows that orchestrate activity across many systems. The motivation for Airflow is described eloquently in two blog posts by the original author, Maxime Beauchemin, then of Airbnb: “The Rise of the Data Engineer” and “The Downfall of the Data Engineer.” Over the past couple of years, Airflow has emerged as the clear “best-of-breed” tool for orchestrating data systems.
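To make the idea of a workflow as a graph of dependent tasks a little more concrete, here is a minimal sketch in plain Python. It is not Airflow's API and not how Airflow is implemented; `run_workflow`, the task names, and the `flaky` flag are all hypothetical, chosen only to show tasks running in dependency order and a rerun picking up at the point of failure instead of starting over.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_workflow(tasks, deps, done=None):
    """Run tasks in dependency order, skipping any already in `done`.

    tasks: dict of task name -> zero-argument callable
    deps:  dict of task name -> set of task names that must finish first
    Returns (done, failed): the set of completed task names, and the name
    of the task that raised, or None if every task succeeded.
    """
    done = set(done or ())
    for name in TopologicalSorter(deps).static_order():
        if name in done:
            continue  # finished in an earlier attempt
        try:
            tasks[name]()
        except Exception:
            return done, name  # stop here; the caller can resume later
        done.add(name)
    return done, None


# Hypothetical three-step pipeline; "unify" fails on the first attempt.
calls = []
flaky = {"fail": True}


def extract():
    calls.append("extract")


def unify():
    if flaky["fail"]:
        raise RuntimeError("unify service unavailable")
    calls.append("unify")


def publish():
    calls.append("publish")


tasks = {"extract": extract, "unify": unify, "publish": publish}
deps = {"unify": {"extract"}, "publish": {"unify"}}

done, failed = run_workflow(tasks, deps)        # stops at "unify"
flaky["fail"] = False                           # the outage is resolved
done, failed = run_workflow(tasks, deps, done)  # resumes; "extract" is not rerun
```

A hand-rolled script would have to carry this bookkeeping itself, along with scheduling, logging, and monitoring, which is the gap a workflow manager is meant to fill.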
Data unification, by definition, involves bringing data together from heterogeneous sources. To enable this, we need a tool that is broadly interoperable: one that can manage many systems and is easy to extend to bring additional systems under management. Airflow meets this need nicely: the base package and contrib module contain “hooks” for many databases, web services, applications, and other tools. The folks at Astronomer have also built an impressive library of additional hooks and add-ons. It was also very straightforward to set up an Airflow plugin to provide a hook and suite of operators for Tamr Unify. This means that we can use one tool to orchestrate end-to-end data production, so all data production tasks are visible in one place.
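The split between hooks and operators can be sketched in plain Python. In Airflow, a hook wraps the connection to an external system and an operator defines a single task through its `execute` method; the classes below are hypothetical stand-ins that mirror that division of labor without importing Airflow. They are not the Tamr Unify plugin, and the toy deduplication step is only a placeholder for real unification work.

```python
class ListHook:
    """Hypothetical hook: hides how one external system is read and written.

    Here the "system" is just an in-memory list of dict records.
    """

    def __init__(self, records=()):
        self.records = list(records)

    def read(self):
        return list(self.records)

    def write(self, records):
        self.records = list(records)


class DedupeOperator:
    """Hypothetical operator: one task, written against hooks only."""

    def __init__(self, source, target, key):
        self.source = source
        self.target = target
        self.key = key

    def execute(self):
        # Keep the first record seen for each value of the key field.
        first_seen = {}
        for record in self.source.read():
            first_seen.setdefault(record[self.key], record)
        self.target.write(first_seen.values())
        return len(first_seen)


# Two "systems" under management, each reached through its own hook.
crm = ListHook([
    {"email": "a@example.com", "name": "Ann"},
    {"email": "a@example.com", "name": "Ann B."},
    {"email": "b@example.com", "name": "Bob"},
])
warehouse = ListHook()

unique_count = DedupeOperator(crm, warehouse, key="email").execute()
```

Because the operator talks only to the hook interface, bringing a new system under management means writing one more hook, while existing operators stay unchanged.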
So why Airflow?
Deploying data unification workflows with Airflow provides the ability to coordinate activity across diverse systems in a scalable, reliable, restartable way, with excellent integration into contemporary deployment, scaling, and monitoring frameworks. What’s more, as we have developed more and more complex workflows in Airflow, we have found that its workflow-as-code approach results in cleanly structured, highly maintainable code that fits extremely well with other aspects of DataOps. The net result is a significant reduction in cost and an increase in extensibility for the orchestration aspect of DataOps, meaning organizations can deliver more data, faster. If you are interested in learning more about deploying data unification supported by Airflow, feel free to contact us.