DataMasters Summit 2020

Best Practices for DataOps in BioPharma

 

David Cowen

Director, Data and Computational Sciences at GlaxoSmithKline

Best practices for modern DataOps in life sciences — fireside chat

Transcript

Speaker 1:
DataMasters Summit 2020 presented by Tamr.

Speaker 2:
And off to you.

Bernie Kuan:
All right. Hi. My name is Bernie Kuan. I lead the solutions and the pre-sales team at Tamr for our life science and healthcare vertical. Joining me is Clint Richardson who is our technical lead in life sciences and works very closely with many of our pharma customers including one of our speakers today, David Cowen, who is the director of data and computational sciences at GSK. Clint, do you want to give a little introduction about yourself?

Clint Richardson:
Yeah, sure. Thanks Bernie. Like you said, I’m the technical lead in life sciences. I work across several different accounts, but I’ve had the pleasure of working with David for a few years now at GSK in both of the main work streams we do there, around rationalizing data sets at the source and unifying legacy clinical trial data. And David, do you want to say a few words to introduce yourself?

David Cowen:
Certainly. David Cowen, currently I’m the data engineering lead for GSK’s data and computational sciences group within our R&D organization. I’ve been working in this data engineering space, specifically around the clinical area, with Clint and others, [Barrett 00:01:34], at Tamr for about three, four years, putting in place our clinical pipelines.

Bernie Kuan:
Thanks David. So, in your experience, what have been some of the main challenges for life science companies in transforming the way they manage data?

David Cowen:
So, in my experience within GSK, I think there are a number of factors that play a role in the challenges. Some of the institutional-level challenges that we see are simply resistance to change. More specifically in life sciences, though, we see a lot of regulatory demand across the organization in different spaces, and those demands can have an impact. Also related to that is, in my mind, an undervaluing of data in the life sciences space. To expand a little bit on those: the resistance to change is not unique to life sciences and it’s not unique to changes in data management; that’s probably well understood. I think the area of regulatory compliance is somewhat specific to life sciences in the GxP world. We have different areas of compliance: good clinical practices, good laboratory practices, good manufacturing practices. And so those are specific to the areas of life sciences.
Regulatory constraints aren’t necessarily specific to life sciences; certainly the financial industry has them. One of the problems GxP compliance creates for making modifications in data management is that it’s a relatively high bar to be compliant with these rules: you have to establish processes and systems that adhere to them. So once you get that in place, you would like to stay with it, and when groups come along and say, “We’re going to modify the way we handle data,” there is substantial resistance. So that regulatory piece is important. The last one I mentioned there was around undervaluing data. I think the DataOps changes that we’ve seen in the industry have been led by companies like Google and Amazon that have a real data focus. They understand data is a key factor in their product.
In the life science industry, that’s not so much the case. GSK, to some extent, is a pharmaceutical company whose products are drugs, and so data is on the periphery. While for years we’ve known that we have a goldmine of data available to us, really having the will to go after that and make it manageable is something that has not been there. It’s only recently here at GSK that the value of data assets and information assets has been realized, and we’re in the process of trying to capitalize on that.

Clint Richardson:
Thanks David. I think we’re going to dive through a whirlwind of a lot of the key processes we’ve talked about in DataOps, but I wanted to take a minute to follow up on that. From my point of view, being involved with you all at GSK over the last few years, I’ve been really impressed by the commitment to treating data as an asset and to building in the key functionality you need to make sure you can improve quality over time. That commitment to delivering quality data goes to a lot of the things you were talking about, especially around regulation and process. And then there’s also the ability to know when, where, and how to expand the scope of projects, because like you said, it’s always hard to get the buy-in to start these things, and I think it’s also hard to maintain the buy-in. While we were doing all these things, if I look back, I realize they fall into what we now call the DataOps principles, but at the time we were just trying to do the hard lift-and-shift of a lot of the clinical data. And so I think that’s why we’re so excited to dive into this stuff with you today.

Bernie Kuan:
Yeah, thanks Clint. I think GSK has been exemplary in championing the DataOps concepts. That’s now an official term that we use, and we use it very often. To dive more into your DataOps practice, maybe we can frame it around the concepts of process, technology, and people. So let’s dive a little more into that framework with respect to GSK. For process, one of the key things about the DataOps approach is an emphasis on agility, being able to iterate and capture results very quickly. So David, I want to ask you: how do life science companies adopt more agility in their data management, and can you talk a little bit more about your experiences with that approach in your work?

David Cowen:
Certainly. I think with some of those DataOps principles, especially agility, there were issues that we were addressing as a data engineering team. The pieces where I was interested in improvement and efficiency were around reducing cycle time, not only of deploying data pipelines but also of running data through the pipelines, to speed that piece up. Part of addressing that delay was reducing the number of handoffs, especially in deploying our pipelines: we have a number of groups engaged, where we had separate testing teams and separate development teams with various gates between them, and reducing that impedance to our deployment was one of the key pieces. There are other aspects of the value we can get from data. I mean, we talked about having valuable assets, but getting the community to use them is one of our key challenges, and some of these concepts within DataOps look to address that by making the data more approachable and available.
You’ve asked specifically around agile. One of the approaches we’ve taken on with agility is the iterative implementation that is key to the agile approaches that have been rolled out, and I think that’s been key to our DataOps, our data engineering direction. We are very keen to take on manageable chunks, try to add value to the data asset, and make that available quickly, so that we can get feedback from the business community on those aspects. That’s something we’ve embraced and utilized. The other thing we have in that iterative engagement, trying to get feedback from the business: one of the components of agile that we’ve also brought in is more business interaction. With the work that Clint and I have done on the clinical conversion pipeline here at GSK, we had subject matter experts directly on hand who were part of the project team and were very involved in verifying the data and giving us feedback on what we were doing. That’s certainly one of the key concepts of agile, being able to get that feedback quickly, and it was instrumental in the work we were doing.
The other aspect I want to focus on from agile is continuous improvement. It may seem like continuous improvement goes hand in hand with iteration, or is a restatement of it, but I do see a somewhat different aspect to it. It’s not just adding new value incrementally over these iterations; it’s also looking back at what we’ve done previously. I know more now, and we should revisit some decisions we made in the past. I don’t need to approach perfection with that, but certainly as we learn in our deployment or development process, we have been good at looking at decisions that were made, and where there is substantial value in revisiting them, we do that. So the continuous improvement aspect of agile is important to what we’re handling.

Clint Richardson:
Yeah, and David, if I could follow up on that. I think sometimes we see people say, “Okay, I want to do agile stuff,” and what ends up getting rolled out is really a waterfall plan that they’ve chunked up into two-week or month-long segments. The key piece you brought up is the ability to actually stop after each iteration and say, “Okay, what do I need to do next based on what I’ve learned?” And I think that’s a really hard thing to actually do while you’re in the middle of it, because you have a million other fires to put out, and you still have to deliver the data and satisfy user requirements. So I’m curious to hear if you have any thoughts on how to juggle those things and buy yourself the space to actually have that sort of reflectiveness, to actually improve the process.

David Cowen:
Yes. Certainly, you’ve been heavily involved, so you know we’ve not hit 100% perfection in eliminating waterfalling, and we do still have some of those aspects where maybe we set the plan up too early. I do think there is a tension to balance between the two, because you don’t want to get into iteration where you lose sight of the end goal. I referenced it a little bit: I don’t need to do continuous improvement toward a single point of perfection. I do think we did a reasonable job of, number one, learning what those aspects were. DataOps was new to the entire team, and I think the key is that you’ve got to manage expectations with your business users and the executive committee that’s given you sponsorship. Make sure they’re aware: look, we do need some time to come back and do the learning from each iteration, and to reevaluate what we would like to do, with the time and tolerance so that if we change direction in those iterations there is a [inaudible 00:14:16] for that, and that that’s accepted by the end user.

Clint Richardson:
Is that a conversation you can have at the beginning of these projects, or is it one that you always sort of have as it’s ongoing?

David Cowen:
I certainly think you always have it at the beginning of the project, and I think it’s quickly forgotten afterwards. People will come to you and say, “Well, we expected this by this time.” So yes, you have it at the beginning, but then you revisit it throughout. You continue to have those open lines of communication and ensure that people maintain that expectation. I think also, as the project team, you’ve got to understand when you’ve hit some level of value and there are other objectives, so we need to make sure that we are moving forward on delivering additional value around the assets or information.

Bernie Kuan:
You talked a lot about the changes in process and how that’s creating conflict [inaudible 00:15:30] these conflicts, but I want to shift a little bit to the technology side, where in the DataOps framework we often advocate for a [inaudible 00:15:42]-led approach, [inaudible 00:15:44] in our data pipelines. I’m curious to hear your thoughts on how you engage with all the variety of technology out there, and how you implement an approach that helps you automate some of the manual pains that come with traditional approaches.

David Cowen:
So, I’m sorry, Bernie, I missed a little bit of that, but I think your question is about automation: where do we apply it, and how do we determine the right value and what to do?

Bernie Kuan:
Right.

David Cowen:
Okay. And so, automation is a key component of DataOps, and it is one that I approach carefully, because automation for automation’s sake can be dangerous in my mind. Certainly with the clinical conversion pipeline, that was one of the areas where automation was key to what we were doing. We used it in a number of spaces. The testing that we did was very rigorous; with the size of the data, the number of studies, the datasets, it is onerous, and so without automation, without being able to handle some of this programmatically, we never would have been able to achieve the conversion of the thousand studies that we have.
That said, a lot of the process, especially in the early days, was trying to understand what our process is: what we want to do in terms of the conversion, the data movement, and other pieces of the pipeline. Deciding where automation is correct was a balance of how well we understood the process we had. There were a lot of areas where, as we were going through the pipeline, we really weren’t sure it was going to work out the way we needed it to. In those areas, I was very resistant to putting a lot of automation in place. We also had manual verification checks where we needed to ensure that we didn’t automate them out, because they were key to some of the work we did. And we had places where the automation was, or could be, exceptionally expensive, so those were areas where I also tried to downplay it.
That said, we had a lot of automation around a number of pieces of what we were doing. With the work that Clint and then Dominique had done for us on the conversion within Tamr, they had an exceptionally capable regression test suite: we would identify problems with the conversions we were putting in place, get those issues resolved, and then have tests in that regression suite that were automatically run each and every time we did a conversion. That was key. We have a lot of data movement above and beyond the conversion; we have anonymization pieces that needed to be QC’ed. So in the data movement, where we had large sets of domains that needed to be moved, we had automation in those places. We also had automation in verification outside of the conversion; for the anonymization, we put in place a very comprehensive verification of our anonymization pieces. So we have used automation heavily and will continue to develop it more, but I do try to balance it with the value you’re getting from the effort.
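As an aside, the kind of automated regression check David describes, tests that run on every conversion, can be sketched in a few lines. Everything below is illustrative: the record shape, field names, and the single rule are assumptions for the example, not GSK's or Tamr's actual checks.

```python
# Minimal sketch of a regression check run automatically after each
# data conversion. Field names and rules are illustrative only.

def check_required_fields(records, required):
    """Return (index, missing_fields) for records failing a basic check."""
    failures = []
    for i, rec in enumerate(records):
        missing = [f for f in required if rec.get(f) in (None, "")]
        if missing:
            failures.append((i, missing))
    return failures

def run_regression_suite(records):
    """Run every registered check; return a report keyed by check name."""
    checks = {
        "required_fields": lambda rs: check_required_fields(
            rs, required=["study_id", "subject_id", "visit"]
        ),
        # more checks (value ranges, code lists, ...) would register here
    }
    return {name: fn(records) for name, fn in checks.items()}

# Output of a hypothetical conversion: one good record, one bad one.
converted = [
    {"study_id": "S001", "subject_id": "P01", "visit": "BASELINE"},
    {"study_id": "S001", "subject_id": "", "visit": "WEEK4"},
]
report = run_regression_suite(converted)
```

The point of the sketch is the shape, not the rules: each resolved issue becomes a named check that reruns on every conversion, so regressions surface immediately.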

Bernie Kuan:
Right. So [crosstalk 00:19:48]

Clint Richardson:
Two things I wanted to pull out a little bit more. If I look at the process from its inception to now, it’s changed a lot. I remember a workshop we had two and a half years ago where we put everything up on the whiteboard and said, “Okay, which pieces can be automated?” Even just conceptually, which ones could we actually automate? Like you said, going through that really let us understand the process and gave us insight into the cost of automation versus its benefits. One thing I find striking is the growth in how much we’ve been able to automate over the years. That’s another key; it ties into all these principles around [inaudible 00:20:34] improvement and continuous improvement: you’re growing what you can automate away as you go through these cycles. And one thing I wanted to ask you about is how the role of testing plays into your ability to do that, because one of the things we noticed is that it’s a lot easier to automate, and to understand the cost of automating, if you have tests in place and you can actually know what happened and whether the automation was successful.

David Cowen:
No, and I would agree with that point. I think a key to the DataOps philosophy is gaining the ability to make changes and have a high level of confidence, at a low cost, that those changes have not negatively impacted your design. As I said, for the conversion, a lot of our automation efforts were around testing, because that was so key to what we were doing. We put a lot of energy into testing so that we had it readily available, and we could make changes, run new datasets and new studies, and have confidence that they were in good shape based on the various testing pieces. It’s kind of like you said, Clint: the key for me in this work has been our ability to have that full set of test suites so that we could modify our pipeline or bring new data to it. It’s key that we have that confidence level in DataOps.
As I mentioned earlier, the amount of data that we’re handling is impressive. We’re not big data in terms of physics or weather or things like that, but for human review, the data we’ve got is pretty substantial, so as you said, it’s been key for me to have that testing automated and available to us. One of the key components of the agile approach is the ability to experiment and make changes. Tamr has made it very easy for us to propose an algorithm or a conversion technique, get it in place quickly, run it through, and then we have that testing to make sure it hasn’t impacted anything else. We’re able to verify that what we were trying to accomplish is there and that all the other pieces look like they’re in good shape. So yeah, that testing has been a crucial part of what [inaudible 00:23:24] has.

Clint Richardson:
Yeah, and I think from our point of view, it’s also been critical to our ability to extend the scope of what can be delivered at quality. You do the big thing, and then you have a nice set of deliverables, but then you have this really long tail of data quality: I have a few thousand things I’d like to fix; how do I go about fixing them without breaking anything else? Again, it goes to the value added by the activity versus the cost of doing it, and I think having this in place has let us really attack that tail in an impressive way while knowing that we’re preserving the core data quality levels.

David Cowen:
Absolutely. And actually, a piece that I referenced early on is GxP compliance at life sciences institutions. Here at GSK, for this clinical work, we made a decision early that we were not going to utilize this data for FDA submissions, so that the bar would be lower on what we needed to validate. Now, as you were talking about, Clint, one of the pieces of that confidence is that we’ve produced an analytical data asset we can utilize for discovery-type analysis. And as we do that iteration, like you said, we’re addressing additional things. We’re constantly improving the data quality, and we can move to higher levels of validation, if you will, and utilize that data in other areas. I think that will be key moving forward, because the original target was simplicity, not having too much burden in that governance piece, but with this testing we do have the ability to improve the data quality.
Beyond that aspect of moving along the validation spectrum, data quality is key to our data assets, our information assets, because the quickest way to turn off our data scientists is not necessarily having imperfect data (there are always going to be quality issues) but failing to show an ability to improve, to ramp up the data quality, and to address data quality questions that come to us. So a key aspect of moving into the new DataOps world is being able to constantly improve not only your process but also the quality of the asset you’ve got.
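The idea of constantly improving, and at least never regressing, data quality can be made concrete by tracking a simple metric across releases and gating on it. This is a minimal sketch under stated assumptions: the completeness metric, the field name, and the history values are all hypothetical, not measurements from GSK's pipeline.

```python
# Illustrative sketch: track one quality metric per release and fail
# the release if it regresses below the best score seen so far.

def completeness(records, field):
    """Fraction of records with a non-empty value for `field`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)

def quality_gate(history, current, tolerance=0.0):
    """Pass only if the current score is at least the best prior score."""
    best = max(history) if history else 0.0
    return current >= best - tolerance

# Hypothetical completeness scores from three earlier releases.
release_history = [0.91, 0.94, 0.95]
current_release = completeness(
    [{"event_date": "2020-01-02"}, {"event_date": ""}],
    "event_date",
)
passes = quality_gate(release_history, current_release)
```

A ratchet like this is one way to make "quality only goes up" an enforced property of the pipeline rather than an aspiration.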

Bernie Kuan:
David, I want to follow up on what you said earlier about your ability to select certain things to automate and certain things not to automate. I hear a lot of folks talk about DataOps as if you can look at the data pipeline and just automate everything: apply all the technologies we have and improve it, almost like magic. So how do you decide what to automate, and how do you prioritize which component in the pipeline to work on first?

David Cowen:
Yes. A lot of my difficult discussions are with those who want automation, where it is typically unclear to the uninitiated what the cost is and what value they are getting from it. So typically, the underlying decision point is: what value am I getting from it, and how much effort is it taking me to put in place? One of the factors underlying those decisions is how frequently the task is going to be done.
In the early days, the release of data was measured in months, not in days or even weeks. And so automation at that point was not as valuable for us, because the manual cost was not as great. Now, within those releases, there are the tasks Clint and I are talking about, where that’s happening much more frequently, and that’s where you can see automation value. So one of the drivers is the frequency, and then there is the cost, which comes from the other side. That is, if I’m fully automating this, how complex does it need to be? How much thought needs to go into all of these decisions, and is that really buying me a whole lot? Or, if I have a human sitting there looking at it manually, is it really five minutes of their time, so it’s not that expensive?
The other aspect is how dynamic the process is that I’m now automating. As I said, we were iterating a number of pipeline processes to understand how they should work. One of the problems with automation is that if I put it in, in some way that stabilizes the process, or rather it puts it in concrete, if you will, and now changing it becomes more difficult. So that is another measure I look at with automation: how confident are we in this process we’re now automating? If we really feel we’re above 70% confident that it’s going to be relatively stable, then it’s more likely for me to look at it for automation.
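David's decision rule, frequency times manual cost weighed against the build cost and gated on process stability, lends itself to back-of-the-envelope arithmetic. The sketch below is hedged: every number, the two-year horizon, and the 0.7 stability threshold (loosely echoing the "above 70%" remark) are illustrative assumptions, not GSK figures.

```python
# Back-of-the-envelope automation decision: does saved manual effort
# over a planning horizon exceed the cost of building the automation?

def worth_automating(runs_per_year, manual_minutes_per_run,
                     automation_hours, process_stability,
                     horizon_years=2, stability_threshold=0.7):
    """process_stability: rough confidence (0-1) the process won't change.
    An unstable process gets 'set in concrete' too early if automated."""
    if process_stability < stability_threshold:
        return False
    manual_hours = runs_per_year * horizon_years * manual_minutes_per_run / 60
    return manual_hours > automation_hours

# A five-minute check run monthly rarely pays for a 40-hour build...
infrequent = worth_automating(12, 5, automation_hours=40, process_stability=0.9)
# ...while an hour-long step run daily usually does.
frequent = worth_automating(250, 60, automation_hours=40, process_stability=0.9)
```

Real decisions weigh more than hours (error rates, tedium, auditability), but even this crude arithmetic separates the monthly five-minute check from the daily hour-long one.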

Clint Richardson:
Yeah, I think you raise a really good point there. As soon as someone has an automatic way to do something, they totally forget how that thing is done, because you’ve exported the mental load to the automation. And I think you also touched on the fact that that buys a lot of benefits in the right spots, but if you do it everywhere, you actually lose the ability to apply DataOps to that piece of your pipeline, because now it’s just sitting in concrete and no one’s ever looking at it.

Bernie Kuan:
Right, that’s an important point, I think. I’m curious, with all these changes that you’ve been managing at GSK, can you address some of the cultural shifts you might be seeing among the people trying to adopt these DataOps practices?

David Cowen:
Yeah. In many ways there is substantial change, and actually GSK, at the highest level of R&D, is being driven in the DataOps direction. It’s very clear that our organization, and I expect life sciences companies around the world, are very much aware that data and technology are far more prominent players in executing on R&D, so in some ways the cultural changes are already leading. There is a community of scientists coming in who are already keen to have the data readily available to them. They have much greater familiarity with these DataOps tools, the data scientist tools: rather than a monolithic system providing them with certain results, they want the raw data and the ability to do their own analysis and planning.
So as I said, that cultural change is happening, but that is one of the pieces I see in GSK where the capability of that community is far greater. The set of tools they have at their fingertips is better, their command of those tools is greater, and so targeting the data aspect is important; I see a lot of that cultural change. The conversations we have with the scientific community have changed substantially, from the end analytic that they wanted 10, 15 years ago to, now, “here’s the data content that we would like to see.” There are still discussions around what the target data model is that would be ideal for us and optimized, so that continues to evolve. But those are some of the aspects of the cultural differences that I’ve seen here.

Clint Richardson:
I was curious to tie that back to a point you mentioned at the beginning, one of the pain points around undervaluing data as an asset, because what I just heard you say is that now there’s demand for data, and so in some ways the value starts being driven up because of that. So I want to ask a question in that light. One of the key tenets here is that the work you all are doing actually adds value to that asset, because it makes it more available, higher quality, more usable and consumable. Does that help in these cultural shifts, showing that value? Do these things come together in any way?

David Cowen:
I think they do come together, and I think it’s something we’ve got to be proactive about. This concept of scientists having access to data is a cultural change. The piece we’ve got to connect is: if I’ve got scientist A with a dataset and scientist B with that same dataset, they don’t necessarily know that; they don’t know that the data has been brought in independently and is being analyzed slightly separately. So a lot of the value we bring to the data is that we bring the data in, we hand it to you, we get the metadata around the data, and we do a better job of cataloging and saying, here’s what’s available; oh, by the way, it’s also being used here and here. That’s part of the value we add to the data assets: the cataloging, the inventory, the tracking and understanding of what has happened to that data. One of the other areas GSK is currently working on with the Tamr product is building our ability to use machine learning to intelligently understand new data products and how they relate to other data assets, so that we can make that connection for our scientific community.

Clint Richardson:
Yeah, and I guess one question I had was, in the conversations around consumption and use, when you say “I’ve done all this work to rationalize your data assets and present them,” what are the most powerful things for getting people to buy into that whole process? Is it that people on the ground can actually do things they weren’t able to do before? Is it that you can point to real ROI on data processes? What things are the most powerful?

David Cowen:
Unfortunately, I don’t think we are particularly good at measuring that ROI today. The area is so new that those types of measures are not as well understood as we would like. I think the most tangible piece we see for our value add is the speed with which we can now do analysis. We have a strong AI/ML community, and they have many, many ideas, but the speed with which we can now make the data available to them so they can start running is a huge difference. I don’t know if it registers with them or they’re just expecting it, but I think that’s where I see the value in what we do. The other piece goes back to governance. I know my [inaudible 00:37:31] ad hoc, but for these individual analyses, we’re now able to put in place enough governance and metadata to understand where the data came from and how it was analyzed, such that those individual teams can go about their business of doing their AI/ML analysis and we have enough of a record that we could reproduce it at a later point, or point auditors to what was done so they’re comfortable in that space.

Bernie Kuan:
Right, and I think that’s very important and shouldn’t be trivialized: you’ve enabled your scientists and researchers to do analysis, opening up use cases that previously they couldn’t even think about or do because it took so much time finding and gathering the data. It’s interesting that now they almost take it for granted that the data should just be there, thanks to how well run these [inaudible 00:38:36] are, I’m hearing.

David Cowen:
Yes. I mean, it’s good that they can take it for granted, but yeah, I really think that’s the value, and for those who have been around for a long time, they probably do recognize that some of these things that used to take months to years can now be done in days or weeks.

Bernie Kuan:
So the DataMasters Summit is about helping life science companies and other companies implement a modern DataOps practice to really bring about big transformational impacts. So in closing, do you have any learnings that you could share with others on taking on a DataOps approach?

David Cowen:
I’m sure there are many things I could put down for learnings; to simplify, I probably have a handful. The first piece is the iterative approach, which I would recommend to companies that are looking at DataOps. Don’t think you have to have the entire DataOps picture explained, understood, and with solutions in place; I would very much recommend that you take pieces of DataOps, start building on individual capabilities, and grow that over time. I would also recommend that you look at a specific use case. Within GSK, as we built our data center of excellence, we had a mix of use-case-driven opportunities and more “build it and they will come” opportunities.
And I think in the early stages, having a real use case that someone has passion for, or a real interest in getting addressed, is a much better driver. At some point you should be building new capabilities that aren’t necessarily asked for, because people can’t see that vision, but in the early days, have a use case that people are invested in. The other thing I recommend is to look internally. As I said with the scientific community, DataOps is happening in your company whether you know it or not. New skills and new technologies are available, and people are using them. Try to find that within your company, understand what they’re doing and what value they’re getting, and try to build on what’s having success within your own group.
I would extend that and say look externally also. Once you identify where you would like capability in DataOps, there is an enormous range of tooling and technology systems that help in this space. Understand that landscape, try to find the tools that are best for your purposes, and bring those in. I would caution against constantly turning to the newest shiny object; make a commitment to the tools you identify and try to use them, so that you’re not just thrashing with changes. That is certainly something we felt a bit of an impact from at GSK, and we are trying to stabilize: still staying very agile in designs so that we can switch things out quickly, but having a bit of staying power with our technologies. I think those are the four things that I would really recommend out of the box for picking up DataOps.

Bernie Kuan:
Awesome. Those are really good suggestions and I appreciate the insights too.