Read part one of the series.
Answers are edited for content, clarity, and brevity.
Why is now the time to manage enterprise data as an asset?
Andy Palmer, CEO and co-founder, Tamr: It’s become an urgent competitive necessity. It’s no longer enough to have a bunch of IT projects. Every large enterprise is in a position now where it absolutely has to transform into a digital native as quickly as possible.
As large companies begin their digital transformation, it’s important that they manage their data as an asset and think first and foremost about how to organize and prepare that data, so that the other people in the enterprise who consume it share a common understanding of what they’re working from.
Over the last 40 years, we spent a tremendous amount of time in the enterprise automating business processes. Companies like Oracle, SAP, and IBM were all founded on the belief that automating business processes was a good thing. As a result, we’ve created a whole bunch of data-generating machines in the enterprise. Every single day, every ERP system, every procurement system, every CRM system is kicking out data that is a potential asset. Unfortunately, more often than not, this data gets treated like exhaust.
In the last 20 years, a lot of great companies, such as Tableau, Qlik, and Domo, have led the democratization of analytics. This is preparing essentially all the consumers of data in the enterprise to actually take that data and do something useful with it. There’s also been an influx of big data infrastructure solutions (from companies like Cloudera, Vertica, Pivotal, Informatica, and Talend) that have started the process of preparing the data and putting the infrastructure in place.
But now with the advent of the cloud, as large enterprises move the center of gravity for their data to the cloud, it’s time to take advantage of the scalability and elasticity of the cloud and manage data as an asset in the enterprise. The table is set to realize dramatic value from all of this data that’s coming up from the bottom of these data generating machines and empowering all these consumers of data in the enterprise that are using modern analytics platforms.
The time is right to do this now. It hasn’t always been easy to do these kinds of projects, but now between the impetus for digital transformation in the enterprise, the opportunity to apply great state-of-the-art machine learning and the center of gravity now being in the cloud, it’s never been faster, easier and more affordable to manage your data asset and deliver that to all the consumers in your enterprise.
If you have data managers and data stewards, and you have organized a team around these “data people,” why not just hire a whole bunch more of them and continue to use whatever technology you already have?
Marc Alvarez, Vice President of Data Management and Operations, Thomson Reuters: The answer is, you can’t. That’s a myth, sorry. The reality is, you need to centralize your operations to get economies of scale and to get consistency. You need to normalize your content from a very, very heterogeneous landscape, everything from Salesforce to Siebel to everything else. We decoupled the Oracle data warehouses. We’ve got it all. The reality is I don’t think you could hire a big enough team and get to that point. We’ve already off-shored as much as we can. We run big operations in Costa Rica, big operations in Bangalore, big operations in Europe and Eastern Europe, so we’ve already done that, and the reality is we weren’t generating the value. The value comes from the normalization and the integration (of data).
Normalization, integration, governance: all of that is what you need in order to generate the value that our analytics teams are looking for as they analyze, for example, the likelihood of a customer canceling a subscription. They are moving very quickly. They’re pretty smart. We have large teams of data scientists who do this stuff. They rely on the content. The biggest impact we can have for them is to make sure that the content is accounted for, make sure it’s timely, make sure it’s not duplicated. You can only do that with automation.
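To make the point about automation concrete, here is a minimal sketch of normalizing and de-duplicating records that arrive from different systems. It is an illustration only, not a description of how Thomson Reuters does it: the field names (`source`, `name`, `email`, `updated`), the sample records, and the helper functions are all hypothetical, and real entity matching is usually far fuzzier than an exact key comparison.

```python
import re


def normalize_name(name: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(cleaned.split())


def deduplicate(records: list[dict]) -> list[dict]:
    """Keep the most recently updated record per normalized (name, email) key."""
    latest: dict[tuple[str, str], dict] = {}
    for record in records:
        key = (normalize_name(record["name"]), record["email"].strip().lower())
        current = latest.get(key)
        # ISO 8601 date strings compare correctly as plain strings.
        if current is None or record["updated"] > current["updated"]:
            latest[key] = record
    return list(latest.values())


# Hypothetical records for the same customer held in two different systems.
records = [
    {"source": "crm_a", "name": "Acme, Inc.", "email": "Ops@Acme.example", "updated": "2020-01-05"},
    {"source": "crm_b", "name": "ACME Inc", "email": "ops@acme.example ", "updated": "2020-03-01"},
    {"source": "crm_a", "name": "Globex", "email": "info@globex.example", "updated": "2020-02-10"},
]
unique = deduplicate(records)
```

The two Acme records collapse into one because their normalized keys match, and the newer copy is the one that survives. A rule like this can run on every load, which is the kind of consistency that is hard to get from a larger manual team.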
How important is testing and automation in a DataOps strategy?
David Cowen, Director, Data and Computational Sciences, GlaxoSmithKline: I think a key to the DataOps philosophy is gaining the ability to make changes and have high levels of confidence at a low cost. A lot of our automation efforts were around testing because that was so key to what we were doing. And we put a lot of energy into testing, so that we had it readily available and we could make changes, we could run new datasets, new studies, and have confidence that they were in good shape based on the various testing pieces. The key for me has been our ability to have that full set of test suites to then modify our pipeline or bring new data to it. It’s key that we have that confidence level in our DataOps.
As I mentioned earlier, the amount of data that we’re handling is impressive. We’re not big data in terms of physics or weather or things like that, but for human review of the data we’ve got it’s pretty substantial, so as you said, it’s been key to me to have that testing automated and available to us.
One of the key components that we’ve got in the agile approach is the ability to experiment and make changes. Tamr has made it very easy for us to propose an algorithm or a conversion technique, get that in place quickly, run that through, and then test it to make sure that it hasn’t impacted anything else. We’re able to verify this is what we’re trying to accomplish, it’s there, all the other pieces look like they’re in good shape. So yeah, testing has been crucial.
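The workflow described here (propose a conversion, run it, then check that nothing else was affected) can be sketched as a small automated check. This is an assumed illustration rather than GlaxoSmithKline's or Tamr's actual tooling: the conversion step, the field names (`subject_id`, `weight`, `unit`), the plausibility range, and the sample study rows are all hypothetical.

```python
def convert_weight_to_kg(row: dict) -> dict:
    """Hypothetical conversion step: express every weight in kilograms."""
    factors = {"kg": 1.0, "g": 0.001, "lb": 0.45359237}
    out = dict(row)  # copy so the input row is left untouched
    out["weight_kg"] = round(row["weight"] * factors[row["unit"]], 3)
    return out


def run_pipeline(rows: list[dict]) -> list[dict]:
    """Apply the conversion step to every row."""
    return [convert_weight_to_kg(r) for r in rows]


def check_pipeline(before: list[dict], after: list[dict]) -> list[str]:
    """Return the names of failed checks; an empty list means all checks passed."""
    failures = []
    if len(before) != len(after):
        failures.append("row count changed")
    if any(a["subject_id"] != b["subject_id"] for a, b in zip(before, after)):
        failures.append("subject ids changed or reordered")
    if any(not (0 < a["weight_kg"] < 500) for a in after):
        failures.append("weight_kg outside plausible range")
    return failures


# Hypothetical study data with mixed units.
study = [
    {"subject_id": "S1", "weight": 70.0, "unit": "kg"},
    {"subject_id": "S2", "weight": 154.0, "unit": "lb"},
    {"subject_id": "S3", "weight": 65000.0, "unit": "g"},
]
failures = check_pipeline(study, run_pipeline(study))
```

Because the checks are cheap to rerun, the same suite can be applied each time the conversion logic changes or a new study is loaded, which is where the confidence at low cost comes from.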
What’s the difference between data science and data engineering, and where do you see things going right, in terms of the two functions?
Jeremy Achin, Data Scientist and CEO, DataRobot: I’d say data science without data engineering, or machine learning without data engineering, or vice versa, it doesn’t have a lot of value. And so I think you can’t just sit down and do machine learning without deeply understanding the data and working with the data engineers. You have to be willing to engineer the data yourself as a data scientist, or you need to work very closely with data engineers, all the way down to the very raw collection of the data.
If you picture hospitals or laboratories around the country sending in data regarding the pandemic, all of those individual places might be sending it in a slightly different way. Just trying to wrangle that and turn it into something that’s even useful for data science is a lot of work. So I think it’s clearly garbage in, garbage out.
On the data science side of it, really you can’t do much if you don’t have somewhat clean data. You don’t need perfect data, but it has to be somewhat clean. And when it comes to data engineering on its own, there always needs to be a purpose. It’s either something predictive or machine learning, or it could be business intelligence, but there has to be something there too.
Why should organizations think about implementing a best-of-breed ecosystem of solutions, as opposed to writing a big check to Palantir, IBM, or Oracle?
George Fraser, CEO, Fivetran: Well, I think one of the implications of the cloud and of SaaS is that the advantage of having a single vendor is much less than it used to be. All of these tools can talk to each other natively. The sort of impedance of the server closet is gone, and that has really changed that trade-off. You’re better off with three great vendors that do different things well than one mediocre vendor that does all three. And then there’s the other thing, which as a customer you don’t really see: from a company perspective, when you really focus on one problem and you capture more of the market for this whole slice of the stack, you can devote just a crazy, disproportionate amount of effort to solving that problem.
If you look at what’s gone on at Fivetran over the last few years, you see the amount of work we have put into understanding all of the nitty-gritty corner cases of every single database and every single app we support, and into making sure that no matter what happens, the data will match between source and destination. You can only do that kind of thing if it’s the purpose of your whole company. From the perspective of running a company, you can see why you end up with best-of-breed companies emerging in this cloud-based world. From a customer perspective, you just see that if you buy X from a specialized company that just does X, it just works. And if you buy it from a mega vendor who does a million things, you discover that you got a broken car delivered and you’ve got to fix the car. I think that the cloud has really changed the way you build businesses. And then as a customer, that changes the way you do purchasing.
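As a rough illustration of what it means for data to match between source and destination, the sketch below compares two sets of rows by primary key and by a per-row fingerprint. It does not describe Fivetran's implementation; the `reconcile` and `row_fingerprint` helpers, the `id` and `plan` columns, and the sample rows are hypothetical.

```python
import hashlib
import json


def row_fingerprint(row: dict) -> str:
    """Hash a row with keys sorted so that column order does not matter."""
    payload = json.dumps(row, sort_keys=True, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def reconcile(source: list[dict], destination: list[dict], key: str) -> dict:
    """Report keys that are missing, unexpected, or different in the destination."""
    src = {row[key]: row_fingerprint(row) for row in source}
    dst = {row[key]: row_fingerprint(row) for row in destination}
    return {
        "missing": sorted(src.keys() - dst.keys()),
        "unexpected": sorted(dst.keys() - src.keys()),
        "changed": sorted(k for k in src.keys() & dst.keys() if src[k] != dst[k]),
    }


# Hypothetical rows: id 3 never arrived, and id 2 differs between the two sides.
source = [{"id": 1, "plan": "pro"}, {"id": 2, "plan": "free"}, {"id": 3, "plan": "pro"}]
destination = [{"plan": "pro", "id": 1}, {"id": 2, "plan": "team"}]
report = reconcile(source, destination, key="id")
```

Even this toy version hints at the corner cases involved: column order, type coercion, deletes, and late-arriving rows all need a deliberate answer before two systems can be said to match.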