Clint Richardson, my colleague and friend, recently published this blog post on Tamr transformations. One of the great things about the post is that it was written out of passion for Tamr software and for what Clint saw as the transformative (no pun intended) experience of easy access to Apache Spark’s computational power for data wrangling.
Here I will explain how small, incremental improvements to an initially rather narrow piece of functionality created one of the most powerful features in the product, one that got a lot of people at Tamr excited. But first, some context: in July 2017, Gartner published a version of its Data Science Hype graph.
While opinions may vary, I believe in the general shape of the curve and the place where Apache Spark fits.
The reasons for the cycle are self-inflicted: products, segments, and ideas go to market before they have strong communities, tooling, knowledge, or experience behind them. Everyone understands, or tries to understand, the power and the potential, but when the time comes to actually do something with the next great thing (Apache Spark), organizations hit walls and the hype bubble deflates.
With this little preamble out of the way, let me talk about the history of Tamr transformations and how we think it could be one of the tools that helps Spark through the Trough of Disillusionment and into the Plateau of Productivity.
Milestone 0: The first version was built to bridge Tamr’s schema-mapping machine-learning artifacts and target data models, mostly SDTM-based, as that was the focus of the day. The version worked fine and was a great help in reshaping name-value-pair-like sources into tabular form and vice versa, but execution was constrained to a single JVM.
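Tamr’s transformation language itself isn’t shown here, so as a concept sketch only, here is what that name-value-pair reshaping looks like in plain Python (field names like `id`, `name`, and `value` are hypothetical):

```python
from collections import defaultdict

def pivot(pairs):
    """Reshape name-value-pair records into tabular rows, grouped by id."""
    rows = defaultdict(dict)
    for rec in pairs:
        rows[rec["id"]][rec["name"]] = rec["value"]
    return [dict(fields, id=key) for key, fields in rows.items()]

def unpivot(rows):
    """The inverse: flatten tabular rows back into name-value pairs."""
    return [
        {"id": row["id"], "name": name, "value": value}
        for row in rows
        for name, value in row.items()
        if name != "id"
    ]

pairs = [
    {"id": 1, "name": "height", "value": 180},
    {"id": 1, "name": "weight", "value": 75},
    {"id": 2, "name": "height", "value": 165},
]
table = pivot(pairs)
# One tabular row per id, with one column per distinct name.
```

The same reshaping in Milestone 0 ran entirely inside one JVM process, which is exactly the constraint the later milestones removed.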
Milestone 1: Tamr transformations were refactored to move the computation logic from code running in a single JVM to code running on Apache Spark.
Milestone 2: A great deal of effort went into the data-type system. On its own, the work had little to show end users, but it positioned the functionality well for Milestone 3.
Milestone 3: Many new functions were added to the Tamr transformations language, including lambda expressions, record explosion, and more, enabling amazingly enjoyable operations on array types. Additional UX enhancements included type-ahead suggestions and syntax highlighting.
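Again as a hedged sketch in plain Python rather than Tamr’s language, the two array operations named above look like this (the `tags` field is a made-up example):

```python
records = [
    {"id": 1, "tags": ["spark", "etl"]},
    {"id": 2, "tags": ["ml"]},
]

# Lambda expression applied over an array field: upper-case every tag.
mapped = [
    {**rec, "tags": [tag.upper() for tag in rec["tags"]]}
    for rec in records
]

# Record explosion: one output record per array element,
# turning an array-valued field into a flat, row-per-value shape.
exploded = [
    {"id": rec["id"], "tag": tag}
    for rec in records
    for tag in rec["tags"]
]
```

In Spark terms these correspond to mapping a function over an array column and to an `explode`-style operation; in Tamr transformations they are expressed without touching the Spark APIs directly.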
At this point it’s fair to look back at where we started and have an OMG moment: I can’t believe I can take pretty much any data of any size (whatever Spark can handle) and work with it right now.
What would you do with Spark if the entire learning curve of Scala and/or Python, DataFrames, and RDDs were removed from your workload and that of your team?
What if the machine learned about your data and complemented your ability to manipulate it at scale, freeing you from being bogged down in the trough of schema mapping and the details of the elaborate Spark APIs?