Building The World’s Best Spark API With SQL

Transformations are perhaps the least sexy part of every data project. Everyone loves the cutting-edge algorithms used to provide new insights, but those algorithms are useless without the transformations that come before them to whip the data into shape. Performing GROUP BYs, joining datasets, and normalizing attribute formats are all essential elements of making data usable for consumers, regardless of whether those consumers are people or technologies. The problem? Current technologies designed for transforming data at scale are unusable by the people who understand the most about the data and how to make it valuable.

This is unacceptable for companies aspiring to make data initiatives consistently agile and collaborative, which are core principles of DataOps. Making DataOps a reality is only possible if the people who understand the most about the data are involved throughout. Applying this concept to schema mapping, record matching, and classification has been a central focus of Tamr’s software platform, Unify, and is the key reason we began developing transformations as a component of our system. Tamr Transformations give users the power of Spark in the form of SQL commands to minimize the technical barrier to transforming data at scale. With a simple ‘SELECT *’, users can begin tapping into scalable, distributed compute using a language they already know and love. Let’s look at an example.
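
To make the idea concrete, here is a minimal sketch of the kinds of transformations mentioned earlier (normalizing an attribute format, joining datasets, and a GROUP BY) written as a single SQL statement. It is an illustration only, not Tamr's API: Python's built-in sqlite3 module stands in for a distributed engine so the snippet runs anywhere, and the table names, column names, and values are all hypothetical.

```python
import sqlite3

# In-memory SQLite is a stand-in for a distributed SQL engine (assumption).
# All table names, column names, and values below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE visits (subject_id TEXT, site_code TEXT, dose_mg REAL);
    CREATE TABLE sites  (site_code TEXT, country TEXT);
    INSERT INTO visits VALUES
        ('s-001', 'a1', 10.0),
        ('S-001', 'A1', 20.0),
        ('s-002', 'b2', 5.0);
    INSERT INTO sites VALUES ('A1', 'US'), ('B2', 'DE');
""")

# One statement covers all three steps:
#   normalize: UPPER() makes inconsistently cased IDs and codes comparable
#   join:      attach each visit to its site's country
#   group by:  total the dose per subject and country
query = """
    SELECT UPPER(v.subject_id) AS subject,
           s.country           AS country,
           SUM(v.dose_mg)      AS total_dose_mg
    FROM visits AS v
    JOIN sites  AS s ON UPPER(v.site_code) = s.site_code
    GROUP BY UPPER(v.subject_id), s.country
    ORDER BY subject
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The point of the sketch is that someone who knows the data and knows SQL can read and change every line of the query, without touching the engine that executes it.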



One of our customers, a large biopharma company, has realized the benefits of simplified transformations at scale by reducing its spend on external consultants by millions of dollars. The fundamental problem they faced was the expense of transforming their clinical studies data into SDTM format, the regulatory format for FDA submission. The problem is actually very well defined: there is a set of rules for how to transform the collected data to SDTM format, and this clarity of definition is in large part due to the common data model the company has already imposed on its clinical studies data. But because of the disconnect between the people at the company with domain expertise on individual transformations and the people with the skills to execute the transformations at scale, they have historically hired expensive consultants to perform the transformations manually, one at a time. Not anymore. The next-generation DataOps ecosystem they developed, which includes Tamr Transformations, has enabled them to leverage the knowledge of internal subject matter experts to build automated pipelines for preparing tens of millions of clinical trial records.

This is just one of many examples of companies spending less and getting more from their data by leveraging best-of-breed technologies to bring subject matter experts closer to the data. The scarcity of data scientists has been making headlines for years, and the problem is only getting worse as new technologies are introduced that require new skills to use. Consider this: according to Time, in 2016 a data scientist could expect a salary increase of ~18% simply by knowing Spark.

Spark is a powerful tool: its syntax isn’t terribly dissimilar to pandas DataFrames, and because of MLlib, it interfaces well with many useful ML algorithms. But if you just want to manipulate and move data around, you have to write a lot of code and invest significant time to make it reusable across multiple sources. In other words, you’re paying a premium for talent that knows Spark, and you’re getting taxed for the complexity of the technology.

We don’t think this is sustainable, and we believe there is a better path forward for enabling enterprises to unleash the value of their data. Our aim is to build the world’s best Spark API, one that makes data transformations at the scale of billions of records as accessible as SQL. We’ve made clear progress toward that goal through our collaboration with companies like Toyota, GE, GSK, Thomson Reuters, and Samsung. We would love to show you what we’ve built. If you’re interested in a demo, please reach out.