DataOps and Data Science

There is a lot of mental energy being put into the topic of FAIR data in the biopharma industry these days. For us here at Tamr, we think of the FAIR data movement in biopharma as very similar to the broader DataOps movement that is playing out across industry writ large. With so many people turning their attention to this topic, there is a lot of great material coming out, but there is also, frankly, a lot of noise. To cut through that noise, we need to get to the actual problems facing data scientists in the industry and what the ideal solution would be (spoiler alert: it’s not a giant George Jetson button).   

As a starting point on this topic, I thought I’d spend some time talking about the connection between the science that many a data scientist did in their past lives and the Data Science/DataOps engineering that surrounds us every day, because there is an important link between the two. Thinking of the two together can also, I think, show how people start to miss the mark regarding what the whole DataOps/Data Science transformation is about. So, let’s start with that important link: doing science well naturally engenders an agile mindset to problem solving, and thus leads to the development of habits that are crucial for both Data Science and DataOps.

A Refresher on the Scientific Method

The scientific method has 4 steps:

  1. Ask a question
  2. Make a hypothesis about answering that question
  3. Derive a prediction based on that hypothesis
  4. Test the prediction

That’s it: 4 straightforward steps. When you’re approaching problems in both academia and business, it’s pretty easy to talk about how what you need to do is use (data) science to solve the problem. Doing science, on the other hand, is extremely hard. You have to be careful, thoughtful, diligent, thorough, patient, and OK with the fact that 90% of the time you realize you’ve either done the wrong thing or that your thinking on a topic is wrong – and you have to face that wrongness without compromising any of the above steps.

It is the sum of these characteristics that leads (good) scientists to naturally develop an agile mindset. You start by defining a minimum set of requirements to test the prediction of the day, and you build out the experiment that satisfies those needs. Most of the time when you finish that stage, you realize there are one or more (hopefully not too large) errors in the way that you are testing your prediction: your code is wrong, your data is wrong, your question grew based on what you learned, your assumption about X is wrong, etc. As you come face-to-face with these errors, the natural inclination is to improve what you’ve done and re-run your experiment [1]. That is, good science is naturally agile.

This process of learning all the things you did wrong (systematic errors in a scientific publication) is how you come to actually understand the question you are working on and what the data is telling you about that question at a deeper level. Keeping track of the things you learned about your data and the myriad ways it revealed prior ignorance is also the only way to get anyone else to believe your results [2].

The Connection to DataOps

This facet of science ties into the larger conversation about DataOps through a key talking point (we make it too): data scientists spend too much of their time getting data to be usable and not enough time analyzing it. We (and everyone in this space) also talk about how sad all those data scientists are because they have to do this. It’s true that DataOps can make their lives better by providing better data with faster improvements. It’s not true, however, that they can do data science well without understanding how the sausage is made. The point isn’t to get them out of the data factory; it’s to give them better tools and habits for dealing with both the quality and the quantity of their data.

Let me highlight the issue with a recent example I heard. A software vendor was giving a talk to demonstrate their data pipelining tool: how it could do all of these fancy things to the data and even run ML/AI on top of it automatically, and how you basically just had to sit next to it and watch it go. Their whole pitch was about the above: let your data scientists be more enabled. Their solution, though, was to completely remove the data scientist from the data collection/cleanup/aggregation process by having their tool run predefined algorithms on the data without any humans being involved whatsoever. Someone in the audience even commented to the effect of “I think you’ll even put all the data scientists out of jobs with your product because you can just put a source in and press a button and get analytics.”

Why is that a (shockingly) wrong point of view? Because magic isn’t real. If you don’t have any insight into what’s actually happening to your data, how can you possibly trust the story being told with it? It goes back to understanding what’s wrong with your data and your assumptions about your data being the only reason you can confidently stand by your results.
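To make “insight into what’s actually happening to your data” a bit more concrete, here is a minimal sketch in plain Python. It is not Tamr Unify’s API or any vendor’s tool; the `StepReport` class, the `drop_missing` function, and the toy records are all hypothetical names and data invented for illustration. The idea is simply that a cleaning step hands back a record of what it changed alongside the cleaned data, so a human can inspect it.

```python
from dataclasses import dataclass, field


@dataclass
class StepReport:
    """Hypothetical record of what one cleaning step did to the data."""
    name: str
    rows_in: int
    rows_out: int
    dropped: list = field(default_factory=list)


def drop_missing(rows, column):
    """Drop rows where `column` is None or empty, and report what was dropped."""
    kept, dropped = [], []
    for row in rows:
        if row.get(column) in (None, ""):
            dropped.append(row)
        else:
            kept.append(row)
    report = StepReport(f"drop_missing({column})", len(rows), len(kept), dropped)
    return kept, report


# Toy records, made up for this example.
records = [
    {"compound": "A-001", "assay_value": 0.42},
    {"compound": "A-002", "assay_value": None},
    {"compound": "A-003", "assay_value": 1.10},
]

cleaned, report = drop_missing(records, "assay_value")
print(f"{report.name}: {report.rows_in} rows in, {report.rows_out} rows out")
for row in report.dropped:
    print("dropped:", row)
```

A push-button pipeline could run the same filter silently; the difference in this sketch is only that the dropped rows stay visible, which is what lets someone ask whether dropping them was the right call.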

I also want to dispel this notion that data scientists are sad about having to be a part of the data collection/aggregating/cleaning process. The people I know love getting up close and personal with data. The reason pre-processing can be a drag isn’t some aversion to dealing with the nitty-gritty of data; that’s what most data scientists cut their teeth on in academia, and they know the importance of doing that part of the job. It can be a drag because sometimes you have to use tools that exacerbate the pain points, usually because they either don’t work well at scale or because encoding your knowledge into them and sharing that knowledge can be difficult. This is a place where Tamr Unify excels: encoding human knowledge about how to improve data quality and operating at scale are our sweet spots, and it’s why our story about DataOps stands head-and-shoulders above the ‘magic button for AI/ML analytics’ story.

What you should expect from a DataOps/Data Science transformation isn’t the removal of humans from the data unification process and magical analytics showing up at your doorstep; it’s getting to the problems in your data more easily and quickly, and so building your awareness of your previous blind spots sooner; it’s the acceleration that comes from making it easier for humans to be in the loop; it’s the faster improvement of the data that comes from applying their knowledge at scale using ML; and finally it’s the enabling of advanced analytics that are meaningful because of the steps that came before.

To learn more about DataOps and best practices for implementing it at your organization, download our ebook, Getting DataOps Right.

 

[1] Obviously blinding is a critical piece of the full story here and please forgive me for not going into detail about implementation of improvements of analyses while staying blind to the actual test you care about.

[2] If someone publishes an article where they say everything worked perfectly, there will be a collective “I call BS” from the reviewers. The nuances in the data are the hard part, but they’re also where the learning occurs and how you can trust the results.




Clint is a Senior Data Scientist/DataOps Engineer at Tamr where he leads several efforts across the life sciences space. Prior to Tamr he got his PhD in Particle Physics and is a co-author on one of the most widely downloaded reviews of machine learning in physics.