datamaster summit 2020

Getting the Most Out of Your Data Warehouse: Data Movement and Mastering with Snowflake, Redshift, and BigQuery

 

George Fraser

CEO Fivetran

Data warehouses have transformed the data management ecosystem. But what else is needed to ensure that data is accessible and optimized for business outcomes? Hear from George Fraser, CEO of Fivetran, and Andy Palmer, CEO of Tamr, about modern data movement and data mastering capabilities that ensure data warehouses have enterprise-wide data assets that are analytics-ready.

Transcript

Speaker 1:
Datamasters Summit 2020 presented by Tamr.

Megan:
Thank you everyone for joining this session. I hope you’re having a great conference and today we’re joined here by George Fraser, the CEO of Fivetran and Andy Palmer, CEO of Tamr. And they’re going to discuss modern approaches to getting the most out of your data warehouses. Topics will include data movement, data mastering and strategies for ensuring data from your warehouses are analytics ready. And with that, I’m going to hand things over to Andy.

Andy Palmer:
Thanks, Megan. And I’m really thrilled to be here today with George. George is an amazing entrepreneur, and both George and I have worked in the life sciences for many years. George had an amazing career as a scientist and neurobiologist before this, if I’m right. Is that right, George?

George Fraser:
That’s true. In a previous life, I was a scientist.

Andy Palmer:
We both share a passion for biology and all things data. And it’s really great to have you here with us at Datamasters and psyched to hear what you guys are doing at Fivetran and talk about all the things that you’re seeing in the market and that we’re seeing in the market and maybe help all of our customers and potential prospects take data ops to the next level. Would love to hear more about kind of what you’re up to at Fivetran. And I think you got some slides to share.

George Fraser:
Yes, I do. And I’m very glad to be here and very glad to be in such illustrious company. I am a big fan of Tamr and of the work you and your co-founder did years ago in Vertica and C-Store.

Andy Palmer:
Oh, thanks.

George Fraser:
Which in many ways laid the foundation for everything I’m going to talk about today.

Andy Palmer:
Cool.

George Fraser:
Getting right into it, that’s us. And this is the problem that we’re going to talk about today. It’s a problem that I’m sure everyone on this webinar is very familiar with. It’s the problem of centralizing data. If you’re running a business today, you no doubt have many, many operational systems. And these operational systems may include databases like Oracle or PostgreSQL or MySQL. They may include apps like payments systems like Stripe or ERP systems like NetSuite or marketing tools like Marketo or Google Ads, hundreds of marketing tools.

George Fraser:
I joke that the typical marketing team adopts a new tool every Monday and sometimes one on Tuesdays. You have a profusion of tools. These tools are great. They do lots of great things for your business, but if you want to know what is happening globally in your business, you are going to need to centralize that data into a data warehouse. And that is going to be a huge challenge. Now, the most obvious reason why you want to centralize this data is for BI, so that you can build your classic dashboards telling you about the performance of your company, your OKRs, you name it. But dashboards are not the only use case for centralizing data. You are also going to centralize data in order to power operational systems, in order to build products that rely on that data, and sometimes to share data with your customers or with your partners.

George Fraser:
And this problem is big, but the data warehouses that exist today, in particular data warehouses like Snowflake, BigQuery and Redshift that are built for the cloud, are so much better and so much cheaper than the previous generation of data warehouses that they are going to give you the opportunity to radically simplify how you solve this problem. On the one hand, the cloud and the profusion of sources is making this problem harder, but on the other hand, this new generation of technology really has its roots all the way back in the academic work that Andy’s co-founder did years ago in the form of C-Store. These are the descendants of that work; every modern data warehouse is descended from it.

Andy Palmer:
And the BigTable guys, frankly were sort of doing a similar thing at the same time.

George Fraser:
Those modern data warehouses are going to help you to solve this problem in a much simpler way. I often like to think about what I call the modern data stack in this layout. You’ve got your sources, you’ve got a data warehouse and you’ve got maybe some BI tools if that’s what your goal is to build BI dashboards. Or if you have one of those other use cases, you have something else on the right side of this diagram. And in the modern data stack, the data is going to flow from left to right in the stages that we’ve outlined here. I’m going to walk you through each of those stages and how I think you should tackle each step of this process.

George Fraser:
The first step is going to be replicating the data from all of the sources: your databases, your apps, your events from your website, your files coming out of S3 or FTP servers, you name it. The first step is to replicate all of those data sources into your data warehouse. Now you’ll note I said replicate, not integrate. The critical choice that you must make here, if you want to have a simple modern data stack, is to defer the process of integrating your data until later and to treat replication as its own stage of the process. At Fivetran, we often talk about this in terms of the ETL paradigm, extract, transform, load, versus the modern ELT paradigm.
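In Python-flavored terms, the replicate-first step being described might look like the sketch below. Dicts stand in for a source system and a warehouse; every name here is illustrative, not a real Fivetran API.

```python
def replicate(source, warehouse):
    """Land every source table in the warehouse unmodified, prefixed raw_
    to mark it as untransformed staging data. No transform happens here."""
    for table, rows in source.items():
        warehouse[f"raw_{table}"] = list(rows)  # load as-is; integrate later
    return warehouse

# Two "production" tables from a hypothetical CRM land untouched.
crm = {
    "accounts": [{"id": 1, "name": "Acme"}],
    "contacts": [{"id": 7, "account_id": 1, "email": "a@acme.com"}],
}
warehouse = replicate(crm, {})
```

The point of the sketch is what it does not do: there is no mapping into a dimensional schema on the way in, so nothing from the source can be lost.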

George Fraser:
And this diagram tries to explain the difference between these two ways of bringing data into your data warehouse. On the left is the classic ETL pipeline. This was the conventional wisdom for decades about how you should build a data warehouse. Your data warehouse would have a de-normalized dimensional schema; that’s what you see at the bottom there. Your data sources, being production systems, would have a normalized schema, and you would use an ETL tool to transform the data from the normalized schema in your production systems into the dimensional schema of your data warehouse. In flight, as you moved the data, you would transform it.

George Fraser:
The modern approach that we advocate for is ELT. In this approach, you replicate the normalized schema of the production systems directly into the data warehouse. You create a set of permanent tables with a normalized schema in your data warehouse that act as a complete replica of all of those production systems. There’s a trade-off that you make here. The downside of doing this is that you’re going to use more storage. You’re still going to end up needing to transform the data into a dimensional schema inside your data warehouse, which means you’re basically going to store everything twice. It also means you’re probably going to store a bunch of data that you’re not even using. You’re going to store a bunch of columns, or maybe even a bunch of tables, that aren’t part of your dimensional schema right now but, critically, might be part of it in the future.

George Fraser:
And that is the advantage of the ELT paradigm. The ELT paradigm is inherently much more future-proof because your transformations are non-destructive. In a traditional ETL tool, if you change your transformations, you have to replay all the data from the source through your ETL tool, and that can take weeks or months, or sometimes even be impossible because the data may have been deleted from the source and it’s just gone. Whereas in an ELT paradigm, you retain in this staging area a complete, unmodified replica of everything that’s happened. And so if you want to change your transformation, if you realize you made a mistake and you need to fix it, or you just want to do a new analysis and you need to add something to your dimensional schema, it’s as simple as rerunning your new transformation against these tables that still exist in your data warehouse.
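A toy version of that non-destructive re-run, with SQLite standing in for a cloud warehouse and all table names invented: the raw table is the unmodified replica, and the derived table can be dropped and rebuilt at will, with no replay from the source system.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# raw_orders plays the role of the untouched ELT staging replica.
db.execute("CREATE TABLE raw_orders (id INTEGER, amount_cents INTEGER, status TEXT)")
db.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)",
               [(1, 1000, "paid"), (2, 2500, "refunded"), (3, 400, "paid")])

# First attempt at the transform: forgot to exclude refunds.
db.execute("CREATE TABLE fct_revenue AS "
           "SELECT SUM(amount_cents) AS cents FROM raw_orders")

# The fix is just a re-run against the intact raw table.
db.execute("DROP TABLE fct_revenue")
db.execute("CREATE TABLE fct_revenue AS "
           "SELECT SUM(amount_cents) AS cents FROM raw_orders "
           "WHERE status = 'paid'")
```

Because raw_orders was never modified, the corrected transform takes seconds; in the ETL world the same mistake would mean replaying everything from the source.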

George Fraser:
And that’s going to allow you to be much more iterative in the way that you build your data warehouse. Instead of having to design the perfect schema upfront on day one, that’s going to work forever, you can design a schema that solves the problems you have today and then tomorrow you can change it. It’s very akin to the difference between waterfall development and agile development.

George Fraser:
Once you complete this replication process, you’re going to arrive here. You’ve got all your data centralized in your data warehouse, but it hasn’t been integrated yet. It’s still in a normalized schema. Every single data source has a separate set of tables in your data warehouse. The problem is only half solved, but it is half solved. It is definitively half solved. You can count on the fact that these tables contain a complete history of everything that has happened in all of your systems and they’re going to be there as long as you don’t delete them, forever.

George Fraser:
The next step is going to be transformation and modeling. This is a step that, in the previous world, you might’ve accomplished with a tool like Informatica, or with a bunch of Python code that your data engineers had written. That transformation turns the normalized schema that came out of all the production systems into a dimensional schema that’s easier to understand and more convenient for analysis.

George Fraser:
Today, you’re going to accomplish this either using SQL that runs directly against your data warehouse or using an orchestration tool for SQL, like for example, DBT or using a machine learning based tool like Tamr. This is the second half of the process where you take that copy of all the production systems and you rationalize it all into a global view of what is happening in your business. And critically this transformation and modeling process isn’t something you just do once. This is going to change over time. You’re going to transform the data one way to support the analysis you want to do today but then in the future, you’re going to have new ideas, your business is going to change and that modeling process is going to change as well.
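As a sketch of that in-warehouse transformation step, the SQL below (run through SQLite as a stand-in for a cloud warehouse) joins two per-source staging tables into one dimensional table. In practice the SQL might live in a dbt model; all table and column names here are invented for illustration.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Staging tables replicated from two different source systems.
db.execute("CREATE TABLE raw_crm_accounts (id INTEGER, name TEXT)")
db.execute("CREATE TABLE raw_billing_invoices (account_id INTEGER, amount_cents INTEGER)")
db.execute("INSERT INTO raw_crm_accounts VALUES (1, 'Acme'), (2, 'Globex')")
db.execute("INSERT INTO raw_billing_invoices VALUES (1, 1000), (1, 500), (2, 200)")

# The model: one row per account, enriched with billing data from another source.
db.execute("""
    CREATE TABLE dim_account AS
    SELECT a.id, a.name, COALESCE(SUM(i.amount_cents), 0) AS lifetime_cents
    FROM raw_crm_accounts a
    LEFT JOIN raw_billing_invoices i ON i.account_id = a.id
    GROUP BY a.id, a.name
""")
```

When the business changes, only the CREATE TABLE ... AS SELECT statement changes; the staging tables it reads from stay as they are.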

George Fraser:
And then the last step is Nirvana. This is where you get the insights that you built this whole system for. And by the way, this is actually going to be the hardest part: getting people inside your company to actually look at the data and change their minds on the basis of the insights that come out of it.

Andy Palmer:
Awesome. It’s so great, George, so inspirational in so many ways. And I’ve got all these notes about stuff I want to dive into. One of the things you said that struck me first is this idea of storing everything twice. It’s amazing how few people realize that they’re already doing this inside of Oracle and Teradata in the form of materialized views. It’s just under the covers, and you’re paying for all that storage, which I’m sure Larry Ellison appreciates. When people are fearful of replicating their data the way you’ve described, sometimes they’re worried about storing it twice. But my argument is always: you’re already storing it twice, and not only twice, but probably three, four, five, ten, a hundred times if you’ve got a lot of materialized views. Is that the right way to think about it?

George Fraser:
I agree, and it’s just the most oversold problem, this idea that we need to pre-transform our data on the way into the data warehouse to save on storage. If you sit down and just do the math on what the storage costs in Snowflake, in BigQuery, in Redshift, in Databricks, it’s nothing; it’s the cost of paying your data engineer’s salary for a day. And so you should make that trade-off. You should choose to use more storage in order to get higher productivity. That’s really the goal of the modern data stack: to make your team more productive and able to iterate faster.

Andy Palmer:
Totally. And again, if you do this in these modern platforms, you’re actually saving a ton of money, because your footprint in your Oracle Exadata box is diminishing over time. The cost of Oracle Exadata running at scale is overwhelming compared to what it costs you to run this stuff.

George Fraser:
That’s exactly right. The modern data warehouses are an order of magnitude cheaper than the previous generation. I think it was really Redshift that started that. What was revolutionary about Redshift wasn’t that it was radically different from the data warehouses that existed at the time it launched. Redshift was based on ParAccel; it wasn’t radically different from Vertica or Netezza, but it was in the AWS console and it was cheap as dirt. And so that really changed the game; it changed the trade-offs you could make. You could say, “I’m just not going to worry about storing the data twice. I’m going to prioritize productivity.”

Andy Palmer:
Yeah. And it was familiar. Again, AWS is so good at doing this. It was familiar enough that it did feel like just anybody else’s data warehouse. And I think that the Snowflake guys have taken that to the next level. It’s so easy to use Snowflake and the collaboration features in Snowflake are just so compelling.

George Fraser:
Yeah, they really did. They established a new bar for user experience. When they really started to get traction in 2015, the biggest reason why was that the user experience was so good. It was so solid across the board, and both Redshift and BigQuery have made progress over the last couple of years in catching up on that front. But there was a period of time when the gap was just huge.

Andy Palmer:
Amazing. Really, really incredible. One of the other things you said that really resonated with me, about this ELT pattern, was that if you do this, you move the transformation work closer to this really cheap set of compute resources. And of course at Tamr, we’re hacking away on this N-squared-plus-N-cubed problem of mastering lots and lots of data, and having this cheap compute so close to the data is incredibly powerful. You must see that across lots of your customers. Once the data is in the cloud and it’s been replicated, you can do all kinds of cool things with it really fast.

George Fraser:
Yeah, that’s a really good point. If you’re transforming the data in flight, you’re bottlenecked on bandwidth, but you’re also bottlenecked on compute. It’s much harder to grab a lot of compute resources and kick off a giant transformation. The canonical example is: I made a mistake in the way that I modeled my data. I realize it. Now I have a fix, or I think it’s a fix, and I want to try it. How long does it take to try a fix? If you’re doing a classic ETL approach, you have to replay all the data; it could take two weeks. If you’re doing ELT and the fix is a SQL query or a change to your Tamr model, you can grab these huge compute resources, without even necessarily realizing it’s happening, and try that fix in tens of seconds, a minute. And it’s not an exaggeration; you’re talking about your iteration time going from two weeks to one minute. I don’t think I have to explain why that’s a productivity benefit.

George Fraser:
And this is really the thing that I think the cloud brought to data warehouses. I think of data warehousing as really having had two revolutions. One was the column store, and that’s what made them so fast. That was in the 2000s, and I don’t have to tell you about that; you were there. And then the second revolution was really taking that technology to the cloud: the infinite scale-out of the cloud and the low cost of storage that the cloud provides. That really changed it again, because no longer did you have to wait for your vendor to show up with a box on a hand truck and put it into your data center. Can you imagine? It’s like, “Well, I want to rerun all my transformations right now because I want to try out a different variation of them, so I’m just going to call Andy in 2005 at Vertica and be like, can you come in a truck and bring me some more Vertica?”

Andy Palmer:
But it was amazing. Homer Trimen, I don’t know if you know Homer, but he was with us early at Vertica and then went over to Cloudera with Mike Olson and built up all the outbound engineering stuff at Cloudera. We had all of our pre-sales stuff at Vertica running natively on AWS back in 2006, because it just made sense. And I’ll never forget, we had this sales call at Priceline down in Connecticut, and the CIO there said, “Rather than installing this stuff on physical hardware, can you guys just run this for us on AWS instances and host it?” And I remember thinking at the time, wow, that’s a great idea, and I bet we could probably figure that out. And then I didn’t do anything about it. That probably cost us, I don’t know, $40 or $50 billion. But it’s amazing that the opportunity is there now.

Andy Palmer:
And back to this agile approach: one of the things we like to advocate for at Tamr is avoiding the kind of one-size-fits-all schema, the one schema to rule them all. You sort of implied this with the agile approach, that you get to generate this data into schemas that make sense, and maybe multiple schemas. Talk to me a little bit about how you view schemas changing and being more dynamic in this new world order.

George Fraser:
Well, first of all, if you’re a small company, a 200-person startup, it’s common to not even build a real dimensional schema. You just build every analysis directly off the base tables. And at that stage of life, that can be the right thing to do. That’s what Fivetran did for a long time. Only really a year ago, I would say, did we start actually building a real secondary schema on top of the tables we delivered. And that’s a great benefit of this approach. If you’re a small company, or if you’re just a small use case within a big company, you can just do extract, load, analyze, and skip that step. But then, once you get into that mode of doing dimensional modeling or doing master data management, you don’t have to boil the ocean all at once. You can start with whatever’s most important.

George Fraser:
I like to say: take your most important dashboard or your most important analysis, and let’s get that working tomorrow. Then you can iterate on it, and you don’t have to do these giant migrations. You can have an existing dimensional schema that’s working but can’t support everything you’ll ever want to do, and then you can build a second one later that supports a wider family of use cases. You can keep the first one around for a while; maybe it’s deprecated, maybe you’re planning to get rid of it in a couple of years, but you don’t have to do it today. It really is, in many ways, taking the lessons of software engineering and bringing them to data warehousing. Don’t try to think too far ahead. Solve these problems one step at a time.

Andy Palmer:
Yeah. Like you said, it’s the same thing that happened in modern software engineering with DevOps. And one of the reasons we like this term DataOps is that it’s a reflection of how agile you need to be, and can be, and of the productivity improvement. I’m sure you see this; we see it with our customers that do this well. It’s orders of magnitude better than the traditional stuff.

Andy Palmer:
And talk to me for a second about this: as fellow small, up-and-coming companies, we’re operating in this world of next-gen, best-of-breed ecosystem capabilities. And I’ve always thought what you built at Fivetran over the last eight years is just amazing. How should customers think about the best-of-breed ecosystem of folks they can work with, as opposed to writing a big check to Palantir or IBM or Oracle or somebody?

George Fraser:
Well, I think one of the implications of the cloud and of SaaS is that the advantage of having a single vendor is much less than it used to be. All of these tools can talk to each other natively. The impedance of the server closet is gone. It’s really changed that trade-off. You’re better off with three great vendors that do different things than one mediocre vendor that does all three. And then there’s the other thing, from a company perspective; as a customer you don’t really see this, but when you really focus on one problem and you capture more of the market for that whole slice of the stack, you can devote just a crazy, disproportionate amount of effort to solving that problem.

George Fraser:
If you look at what’s gone on at Fivetran over the last few years, the amount of work we have put into understanding all of the nitty-gritty corner cases of every single database and every single app we support, and making sure that no matter what happens, the data will match between source and destination, you can only do that kind of thing if it’s the purpose of your whole company. And so, from a running-a-company perspective, you can kind of see why you end up with best-of-breed companies emerging in this cloud-based world. From a customer perspective, you just see that if you buy X from a specialized company that just does X, it just works. And if you buy it from a mega vendor who does a million things, you discover that you get a broken car delivered and you’ve got to fix the car. I think the cloud has really changed the way you build businesses. And then, as a customer, that changes the way you do purchasing.

Andy Palmer:
Yeah. Well, it’s amazing to say this. There was a period of time at Tamr when we were helping people with these overall problems, and they kept talking about data cataloging and data replication to get things into these base-level target systems. We started a big project to do data cataloging, this was maybe six years ago or so, and I’ll never forget: when I saw the first demo of Fivetran and how focused you guys were, I killed the project. I just said, “Listen, these guys are going to do data replication across many heterogeneous sources way better than we can ever do it, and they’re completely focused on it. We’re going to be downstream of what they’re doing, and we’re going to be really good at, and focused on, mastering and quality once the data is replicated and over there in this modern storage.”

George Fraser:
And it’s so funny, we went through some of the same process from the other direction. I remember having conversations with people at Fivetran and with VCs about, “Oh, maybe at some point we’ll build this, we’ll build that, we’ll build a BI tool, we’ll build a data catalog, we’ll build whatever.” And then what happens is you get more customers, they break everything, and you’re like, “Oh my God, this problem is much harder than it seems at first.” And much more important than it seems at first, because you discover that your first customer did X, but your second customer does X and something a little bit different. And eventually you see these use cases that are just so different. A major moment for me was when we got a customer who uses Fivetran to run payroll, because the data they need to decide how much to pay everyone lives in all these different systems.

George Fraser:
And Fivetran needs to be very reliable for that to work, and it is very reliable, but it was such a different use case from where we started, which was BI dashboards. It just goes to show that this problem of data replication is both very hard and very valuable, way beyond what we originally saw. And so for us, we’re going to keep doing this. I joke with my co-founder Taylor that we’re going to be like Warren Buffett and Charlie Munger: we’ll be 90 and we’ll still be running the company, building our 10,000th connector.

Andy Palmer:
Well, I’ll be long gone by then. And I do feel like I’ve been doing this forever. It’s really amazing to have a chance to work with you, and we have such deep respect for your company. Can you tell us a little bit about it? You guys have been at this for a long time, and you’ve been really focused for a long time. Sometimes that flies in the face of a lot of how venture capitalists and some people think about building new companies from scratch. One of the things I respect about Fivetran so much is your commitment to building a big company in the long term, focused on a very specific problem. I think the Snowflake folks have the same kind of team; they had a really similar culture and attitude. Can you talk about building a modern data company these days, and how you feel about it now that you’re eight-plus years in?

George Fraser:
Yeah. The first two years we were really figuring out what to work on, finding product-market fit. That was the first chapter of the company. It was really an iterative process; it was three of us, and it took two years. And then there was that period you’re getting at, where we were still pretty small. We had customers, and we were growing from one to a few hundred customers, but the growth was actually slower in 2015 and 2016 than it was in 2017, ’18, ’19. It actually accelerated through those years, because to do what we do really well, you really have to hammer away at it for years. There are so many data sources and there are so many corner cases.

George Fraser:
Once we got that initial set of customers and we knew we were really onto something, there was this grind for years of making it work reliably across many data sources, across all possible configurations of those data sources. And it’s true that we were not the most attractive company from an investment perspective at that time. We looked like this company that had been around for four or five years and still didn’t have that much revenue. It was growing, but not that fast. It didn’t look great. And then we started to accelerate as word started to get out that, hey, Fivetran really works, and it really solves this replication problem in a definitive way, so that you will not have to worry about that part of your data pipeline anymore. You’ll still have to worry about all the other parts, but at least this chunk will be gone.

Andy Palmer:
Well, it’s amazing. I remember Mike and I got all these calls from all these random VCs out in the Bay Area asking about you guys. And our unequivocal answer was, “If you have a chance to invest in Fivetran, don’t even think about it, just get in as fast as you can.” It reminded us a lot of, way back in the day, this replication tool that Oracle bought called GoldenGate. And GoldenGate became so critical to Oracle’s whole ecosystem that we were like, “Listen, at a minimum, in the next-gen cloud ecosystem, Fivetran is as valuable as GoldenGate.”

George Fraser:
It’s such a great analogy, and it’s so funny that you say that. The people who are really in the know always make that analogy between Fivetran and GoldenGate. It is actually the best precedent for Fivetran as a company. It’s funny, I remember emailing with Andy Pavlo, who’s a well-known database professor at CMU, and he referred to us as “GoldenGate++,” which I thought was pretty funny. And the reason it’s a good analogy is that they really focused on that replication problem. They said, “We’re not going to do the transformation. We’re not going to do the modeling. We’re not going to do master data management. We’re just going to be really good at replication: really reliable, low latency, high data volumes.” The difference is that Fivetran was founded many years later, in this era of cloud applications where companies are now using so many different tools.

George Fraser:
Number one, the problem is bigger. It’s not just about databases, it’s about apps, and you have to figure out how to do change data capture out of JIRA, which is just ridiculous, don’t even get me started. But the opportunity is also that much bigger. There’s this opportunity to build sort of GoldenGate multiplied by all these different use cases. But we’re very much in the same mode: just focus on being great at that one thing. It’s so important in so many different contexts. The UI will never be that complicated, but that’s a good thing.

Andy Palmer:
Yeah. And you guys have done such a good job, and it’s a different time too. Data is now considered a strategic asset in the enterprise. Back in the GoldenGate days, they were just automating business processes. That was important, and companies like SAP built their businesses on it. But now it’s a whole different level. Data is so critical as a core asset that it’s not just an IT thing anymore.

George Fraser:
Yeah. It’s really the rise of analytics in importance, I think. That’s changed a lot.

Andy Palmer:
Absolutely. Well, George, this has been absolutely fantastic. I’m a huge fan of what you guys are building, and we’re so privileged to be a partner of Fivetran and to have the chance to work together. I really appreciate you joining us for Datamasters, and I look forward to working together in the next couple of years.

George Fraser:
Thanks very much for having me on.

Andy Palmer:
Great. Thanks.

Megan:
Thank you, Andy and George. And if anyone has any questions to follow up with, please don’t hesitate to reach out to both Andy and George’s team. We’ve got some contact information up here and look forward to more discussion in the future.

Andy Palmer:
Thanks Megan.

Megan:
Have a great day. Thank you both.