Declarative DataOps

AWS revolutionized IT. Instead of managing hardware, it allowed developers to set up an entire cloud infrastructure with a few clicks of a button. AWS also released APIs to automate the clicking of said buttons, making it not only possible, but fairly easy, to write a script to “create ten servers” and run it anytime you wanted. There were problems, though. For example, each time you ran the script, you would get ten more servers! And if you weren’t careful, and ran your script too often, you could end up with thousands of servers and a scary bill. One typical solution was to patch the script with more complicated logic that checked what already existed before acting. This wasn’t a silver bullet, though: the script remained fragile, because the state of the system was often too complex for hand-written logic to accommodate.

Fast forward, and DevOps came along and revolutionized IT again. DevOps allowed you to write scripts like “I want ten servers,” which describe not what you want the system to do, but how you want the system to be. Instead of always creating ten new servers, the script could create or remove them depending on how many currently existed. It could adapt to all sorts of different situations and dynamically generate a plan to achieve the desired result. This made DevOps scripts reliable, which was a game-changer and enabled:

  • Version tracking. Scripts became reliable and thus reusable, and reusable scripts are worth keeping around! People would track different versions of their scripts so that if a newer version didn’t do the right thing, they could revert to running an older, working version.
  • Collaboration. Multiple people could each work on, and experiment with, independent versions of the script and then integrate their improvements back into a master version later.
  • Testing. A small instance of the infrastructure could be created for testing and, if it worked well, could be reliably recreated at scale for use in production.
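
The difference between “create ten servers” and “I want ten servers” can be sketched in a few lines of Python. This is a toy illustration only: the fleet is modeled as an in-memory list of names, and `imperative_create` and `reconcile` are hypothetical helpers, not calls from any real cloud API.

```python
def imperative_create(servers: list[str], count: int = 10) -> list[str]:
    """Imperative style: always add `count` servers, however many exist."""
    start = len(servers)
    return servers + [f"server-{start + i}" for i in range(count)]


def reconcile(servers: list[str], desired: int = 10) -> list[str]:
    """Declarative style: add or remove servers until exactly `desired` exist."""
    if len(servers) < desired:
        # Too few: create only the missing ones.
        return imperative_create(servers, desired - len(servers))
    # Enough or too many: keep the first `desired` and drop the rest.
    return servers[:desired]


# Running the imperative script twice doubles the fleet.
fleet_a = imperative_create(imperative_create([]))

# Running the declarative script twice leaves the fleet unchanged.
fleet_b = reconcile(reconcile([]))
```

Because `reconcile` compares the current state with the desired state before acting, it is safe to run as often as you like, which is the property that makes the script worth versioning, sharing, and testing.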

Declarative scripts, which describe how a system should be rather than what to do to it, are what enable DevOps to create reliable, reusable infrastructure-as-code that is open to collaboration and testing. They are how large DevOps teams can work together to consistently make and deploy improvements.

By making data pipelines declarative in the same way, we can bring these benefits to DataOps. One way we do this today is through machine learning. Instead of telling the pipeline what to do with the data in the form of complicated, fragile rules, we teach the system what the data should be through machine learning labels. Tamr learns the rules dynamically, adapting to the inherent nuances in the data.

Machine learning learns the rules that operate as part of the data pipeline. But we are also improving how the pipeline itself is constructed. Today, we construct pipelines by telling Tamr what to do: “Create a project, make some mappings, set the DNF, run a job, etc.” As part of our ongoing modularization work, we are making it possible to tell Tamr what the pipeline should be: for example, “run a job with this exact pipeline, including projects, mappings, DNF, etc.”
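
To make the idea of describing what the pipeline should be more concrete, here is a minimal sketch in Python. It is not Tamr’s API: `PipelineSpec` and `plan` are hypothetical names invented for illustration, mappings are modeled as simple (source, target) pairs, and the DNF is treated as an opaque string. The point is only that, given a description of the desired pipeline, the steps needed to reach it can be computed rather than hand-written.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PipelineSpec:
    """Hypothetical description of what a pipeline should be."""
    project: str
    mappings: frozenset  # set of (source column, target attribute) pairs
    dnf: str             # treated as an opaque value in this sketch


def plan(current: Optional[PipelineSpec], desired: PipelineSpec) -> list[str]:
    """Return the steps needed to turn `current` into `desired`."""
    steps = []
    if current is None or current.project != desired.project:
        steps.append(f"create project {desired.project}")
        current = None  # a new project starts with nothing configured
    existing = current.mappings if current else frozenset()
    for src, dst in sorted(desired.mappings - existing):
        steps.append(f"add mapping {src} -> {dst}")
    for src, dst in sorted(existing - desired.mappings):
        steps.append(f"remove mapping {src} -> {dst}")
    if current is None or current.dnf != desired.dnf:
        steps.append("set DNF")
    return steps


desired = PipelineSpec("customers", frozenset({("name", "full_name")}), "rule-v2")

from_scratch = plan(None, desired)    # every step is needed
already_done = plan(desired, desired)  # nothing to do
```

When the pipeline already matches the description, the plan is empty, so the same description can be applied to a test instance and a production instance without rewriting any steps.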

By making the construction and the execution of pipelines reliable, we will make it easier for DataOps engineers to collaborate, experiment, test, and deploy their pipelines, ultimately leading to consistent improvements in the quality and insightfulness of their data.