As deputy CTO in the Obama administration, Nick oversaw the Open Data Initiative, which led to government data being publicly released for use by the private sector.
In this episode of DataMasters, Nick talks about the connection between government data and your favorite mobile apps, the technical challenges agencies faced as they catalogued and listed data sets, and how he created a data culture.
Nick Sinai: US Constitution talks about the census. And Thomas Jefferson was the first director of the census and so this concept of providing data back to everyone is enshrined in our Constitution.
Nate Nelson: Welcome everyone to the Data Masters podcast. I'm your host, Nate Nelson. I'm sitting down with Mark Marinelli. He's the head of product at Tamr. He's going to introduce for us the subject and the guest of today's show.
Mark Marinelli: Hey everyone, if you've ever used GPS to navigate or the weather app on your phone to see if it was going to rain today, you've been using federal government data. The United States government has made a lot of their data publicly available in an attempt to spur private sector innovation like we've seen in the aforementioned examples. Today's guest is Nick Sinai, and his job for a while in the Obama administration was to make increasingly more of these data sets available to the public in hopes of spurring more innovation in the private sector. He served as the deputy chief technology officer for the White House and led the Obama administration's open data initiative. This was a very large-scale project, deliberately purposed to get more public data into the hands of the private sector and work with private sector companies to leverage those data for a lot of mutual benefit.
Mark Marinelli: Today he's going to talk about the technical challenges that the individual government agencies faced as they cataloged and listed all of these data sets, and how storytelling helped him create a culture where government employees better understood the value that these data could drive and worked better with the private sector to drive even more beneficial outcomes outside of their traditional uses of the data.
Nate Nelson: All right, let's get to my interview with Nick Sinai. Nick, if you could start off by just briefly introducing yourself.
Nick Sinai: Yes. My name is Nick Sinai, senior advisor at Insight Partners, a large venture capital and private equity firm in New York. I'm also adjunct faculty at Harvard Kennedy School. I teach a tech and innovation class, field class there and I was the US deputy chief technology officer in the Obama administration.
Nate Nelson: Let's get started today with the state of things before you entered the Obama administration. What kinds of government data were openly available back then? And what are some of the ways that they were used in the private sector?
Nick Sinai: Yeah, so the federal government has a long history of making data open and available. In fact the US Constitution talks about the census. Thomas Jefferson was the first director of the census and so this concept of providing data back to everyone is enshrined in our Constitution. I'd say more recently, you have kind of these canonical examples of GPS, global positioning system, which was originally an Air Force system designed for precision targeting of weapons. That was opened up by initially Reagan and then Clinton. And so now you have a whole variety of applications and services and devices that use GPS. It's in your phone these days.
Nick Sinai: Another kind of canonical example would be weather data. The National Weather Service, part of NOAA, the National Oceanic and Atmospheric Administration, has been making weather data available for decades to help folks avoid hurricanes, but also to private sector weather forecasting companies. And so when you see a weather forecast on the news, you see a weather forecast on your app, that's actually using weather data that your federal government, NOAA, has collected and made available to the private sector.
Nate Nelson: You entered the executive branch and oversaw the open data initiative. That's what we're going to be talking about today. What was this initiative?
Nick Sinai: The open data initiative was really a recognition that the federal government can and should do more around making data open and machine readable for the benefit of everyone. And so whether that's weather data, demographics, health, education, agriculture, government collects a wide variety of information that can be used by private companies, by the public, by students, researchers, journalists, you name it. And so the whole point of the open data initiative was really to build on the progress in past administrations and update this for a modern information age. And so what we did was work with a series of agencies to make that data more open, more machine readable, of course, subject to privacy and national security constraints. And so there's a variety of different plays that we ran from the White House.
Nick Sinai: But the biggest one was to get going. How do you convince people to make their data more available to the world and convince them that there are outside consumers for that data? And so one of the things that we did was started these little events that we called Data Jams. And this was with my boss at the time, US chief technology officer, Todd Park. And so we'd do these data jams where we would bring some subject matter experts, some entrepreneurs, some managers, designers, and some government officials together and put a subject, some data on the table and just see what they could brainstorm in the course of the day. And we'd have them vote on some of the most promising ideas. And it could be an app, it could be a data visualization, it could be a product extension, it could be an API, a little thing. But then taking that and getting them excited about those ideas coming to life.
Nick Sinai: And a handful of those would actually turn into real things. And so we would do our best to celebrate those ideas in a larger event. We used to call these datapaloozas, and so it sounds silly, these Data Jams and these datapaloozas. But what they really did was celebrate the opening up and the use of federal data in private sector innovation. And it showed the data stewards inside of federal government just how powerful and useful that data was outside of their usual stakeholder group.
Nate Nelson: You said at the beginning of your answer that the open data initiative started because the government could do more. What was your personal motivation for doing this? Why did you and your team feel that doing this was worthwhile?
Nick Sinai: Well, we saw, I'd say there's really two motivations. There's on the one hand, you'd see these entrepreneurs who were building great businesses and you felt that we could unleash more of them and the job creation and the innovation that they were bringing was important for the economy. And I had a past career at the time as a venture capitalist, so I was familiar with entrepreneurs and all of the innovation that they could bring. And in many cases they were doing things that were beyond the scope of government. And so they would do things that were outside of the mandate that a government program should be doing.
Nick Sinai: The other thing that motivated us was around transparency, accountability and good government. It's important that government be responsive to the needs of the residents and taxpayers and citizens. And it's also important that government be transparent about how it's operating and whether it's living up to its obligations. And so both the desire to promote additional innovation was something that I felt very strongly about. But I really came to appreciate having served in government for almost six years, the importance of transparency.
Nate Nelson: You mentioned it briefly there, I'm wondering how your experience in the private sector informed your ideas and your actions once you got to government?
Nick Sinai: Well, I had the privilege of working with a number of entrepreneurs before I got into government and you could see the relentless drive that entrepreneurs have. You could see the speed at which they would continue to iterate and grow. And so that really informed some of our thinking, which was, we're going to provide some of the fuel to their fire, but they're going to go on and do some fantastic things. And so let me give you a specific example, one that we've talked a lot about. David Friedberg founded this company called WeatherBill. And, as the story goes, he was driving by a bicycle hut in San Francisco and it was raining. And he wondered, could that bicycle hut entrepreneur insure against the weather? And so he started WeatherBill and the idea of the company was to insure against weather events.
Nick Sinai: And he went and got a lot of data from the federal government. And so he got the farmer's data from USDA. I think there's something like 26 million fields in the US and he went and got all of that, got all kinds of soil data from Interior. Of course he got all this weather data from NOAA, and the company that he started, WeatherBill, was initially meant to insure against weather events, to sell to sporting leagues and those kinds of things. They pivoted to become a company that sold crop insurance and crop management applications to farmers. They renamed the company Climate Corp, they grew the company and actually ended up selling for over a billion dollars. It was a classic venture backed company that created a number of jobs in the Midwest and in Silicon Valley, wealth creation as well.
Nick Sinai: And it's an example of using data from NOAA, from USDA, from Interior, to help farmers manage risk against weather and kind of have higher yields and so forth. It was a really great story that inspired us. It's something that Michael Lewis in his book, The Fifth Risk, spent some time jumping into as well. And it's that kind of thing that we wanted to make easier because we knew that entrepreneurs like David Friedberg had a real tough time getting data from these different agencies and oftentimes they'd have to go in person and get it on tape and get it on disc, and it should just be available in the cloud via API, data set. It should be a lot easier, especially if there's no personally identifiable information and nothing security related, national security related. If it's something like weather data or soil composition data or something like that, we should make that available to the world and let the entrepreneurs like the David Friedbergs of the world go out and create these companies that can help farmers and grow jobs in America.
Nate Nelson: Nick, you've got all this data. I imagine it didn't arrive at your desk neatly on a silver platter. What were the challenges that you faced in preparing it all? Cleaning it up, eliminating redundancies and so on.
Nick Sinai: Yeah, it's easy to tell the story of opening up data. It's somewhat easy to get people to engage with you and pry these kinds of things. But how do you get organized? That's a real challenge. And you have to remember that most of the data doesn't live inside of the White House or the executive office of the president. It lives in these cabinet departments and inside the agencies that make up those departments. And so how do you get them to be more organized? And how do you get them to start this process of cleaning up, de-duping, all those kinds of things? And so for us it was clear that developing a standard taxonomy and a data schema and having some rules of the road around a catalog were important.
Nick Sinai: And so all of that ultimately led to the reboot of data.gov. And so just as context, data.gov was started in 2009, at the very beginning of the Obama administration with a handful of data sets. I think two from the agency, and by the time that I got involved, which was around 2011 and 2012, it had grown considerably and there was a need to make sure that we were standardizing what was on that catalog. And so using an open standard around a data schema was super important, but also a set of rules around how agencies were supposed to get organized in terms of creating their own catalogs. And one of the things that was really important to us was, while we needed to essentially harvest and create a master catalog at data.gov, we recognized that there were a lot of differences between a scientific agency, a service delivery agency, statistical agencies. Agencies just do very different things.
Nick Sinai: How they think about securing their data, how they think about making it available to the public, how they think about these functions were very different. And so this all culminated in an executive order in 2013, the open data executive order, which said that data should be open and machine readable as a default. If you're starting a new program or a new system or something like that, that openness and machine readability should be kind of first order principles as long as there aren't privacy or national security or business confidentiality type of concerns.
Nick Sinai: But it also asked agencies to think about openness and machine readability throughout the lifecycle of data. And I think that's another thing, is so often when we're talking about making data available, we're talking essentially about data dissemination at this end point. But it's important to go upstream and think about how can we make data more clean when we're ingesting it? When we're asking human beings for that data, when we're starting to transform it, all of those kinds of things and not just at this point of dissemination.
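To make the harvesting idea concrete, here is a minimal Python sketch of what checking agency catalogs against a shared schema and rolling them up into one master catalog might look like. It is an illustration only: the interview does not name the actual schema or tooling, so the field names in `REQUIRED_FIELDS`, the `access_level` values, and the `missing_fields` and `harvest` functions are all hypothetical.

```python
from typing import Dict, List, Tuple

# Hypothetical required fields. The interview mentions "a standard taxonomy and
# a data schema" but does not list its contents, so these are assumptions.
REQUIRED_FIELDS = ("identifier", "title", "description", "publisher", "access_level")

Entry = Dict[str, str]
Key = Tuple[str, str]  # (agency, dataset identifier)


def missing_fields(entry: Entry) -> List[str]:
    """Return the required fields that are absent or blank in one catalog entry."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = entry.get(name)
        if value is None or not str(value).strip():
            missing.append(name)
    return missing


def harvest(agency_catalogs: Dict[str, List[Entry]]):
    """Merge per-agency catalogs into a master catalog.

    Returns (master, rejected, withheld):
      master   - valid public entries, de-duplicated by (agency, identifier)
      rejected - (agency, entry, missing field names) for incomplete entries
      withheld - keys of valid entries that are not marked public
    """
    master: Dict[Key, Entry] = {}
    rejected: List[Tuple[str, Entry, List[str]]] = []
    withheld: List[Key] = []
    for agency, entries in agency_catalogs.items():
        for entry in entries:
            missing = missing_fields(entry)
            if missing:
                # Send incomplete entries back to the data steward, not to the public catalog.
                rejected.append((agency, entry, missing))
                continue
            key = (agency, entry["identifier"])
            if entry["access_level"] != "public":
                # Privacy or security-sensitive data stays out of the master catalog.
                withheld.append(key)
                continue
            # De-dupe: keep the first entry seen for each key.
            master.setdefault(key, entry)
    return master, rejected, withheld
```

The sketch keeps rejected and withheld entries visible rather than silently dropping them, which reflects the point that data stewards at each agency need to stay in the loop.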
Nate Nelson: Mark, I know you have a ton of experience in organizing and cataloging big datasets.
Mark Marinelli: Cataloging is massively useful, but also typically massively challenging. Most people think about the challenge of applying a common taxonomy or labeling scheme to the various data sources to categorize them for use, but upstream there's this huge challenge of even knowing who to include in the cataloging process. And more importantly, knowing what data may be sitting under their desk that they can include in the catalog. Incomplete data are just another form of inaccurate data, so it's essential to understand who knows the most about data source X and what other sources they're using alongside data source X. I think that's an area where the data culture is so important to get right.
Mark Marinelli: If people are proactively contributing to the corpus of data and maybe surrendering some control over their data, but knowing that they're contributing to the greater good and they're no longer keeping these data below the radar, then the true breadth and depth of available data can be cataloged and thus leveraged. And technology can help here as well. This is an area where automated discovery of consuming applications, of analysis of usage patterns for the core data sources, these can all be really helpful to complement the traditional core manual process of cataloging the data.
Nate Nelson: Could you give a sense for the scale of the data that we're talking about here? And how many different departments and channels they're all coming from?
Nick Sinai: Yeah, well the scale of the federal government is massive and so there are hundreds of federal agencies. There's a couple dozen kind of cabinet level departments. One way to look at it is on data.gov. There's a couple hundred thousand data sets, but any one dataset could be massive or it could be a collection of many years that goes on. And as I said before, this could cover administrative data, scientific data, statistical data, service delivery, IT and log and machine data is exploding as I'm sure you guys can appreciate. Financial data and grants and contracting data.
Nick Sinai: One concrete example I'll give you is back to the National Oceanic and Atmospheric Administration. They were collecting 20 terabytes a day of weather data from sensors they have, from ocean buoys, from ground sensors, from satellites. But they really were only making a terabyte or two available to the public. And weren't really making use of the other 18 or 19 terabytes a day. And that may be fine because of the mission of NOAA, but that data could also be useful in terms of climate modeling, in terms of other kind of weather or farming applications, that are outside the remit of NOAA. And so that's just one small example of, 20 terabytes a day is not small data by any stretch, but it's one example of the scope that we're talking about here.
Nate Nelson: The government is a big place. We all know that. And I imagine, Nick, correct me if I'm wrong, that there's some bureaucracy involved in any major initiative. How do you possibly begin to change the culture around dealing with data? Not just the data itself, but how people around you look at it and approach it at a place so entrenched in certain ways of thinking?
Nick Sinai: Yeah, it's a great question. I'd say there's a couple things. One is you want to tell a story of what you're trying to do and so we would tell the story of David Friedberg or there's this other entrepreneur from Denver who created a medical app to help people with their symptoms. And so that they would go to the ER, urgent care if they had emergency symptoms. And so those stories of those entrepreneurs were ones we told over and over again until we were blue in the face because you really have to connect the data organization and the data initiative to the mission. In our case it was government, but you could imagine this would be true in a business as well. Is, how do you connect those two? And so I think storytelling is an important piece, because people want to know why you are so focused on this particular data initiative. That'd be one thing.
Nick Sinai: I think you have to make it easy for them to collaborate, so how do you bring them together? And this is true in any large bureaucracy or organization, is there's a whole series of silos. How do you connect people and get them to come together? And one of the things we did, it sounds silly and maybe it's outdated, but we created a listserv across government and anyone with a .gov or .mil email address could join. And so I think we had something like a thousand people on this listserv and we would do conference calls every Tuesday and anyone could join. And so through the course of this, there was a lot of back and forth where people would ask questions, but we also would learn about these great data initiatives and transformations and folks who were opening up data and changing policy.
Nick Sinai: Because a big piece of this was well, changing some of the policies about how they buy data and how they make data available inside the agency as well as publicly. And a lot of times that would happen asymmetrically. It wouldn't necessarily be because of the executive order or some of the OMB guidance, but it would be because they had gotten a good idea, were inspired by some of this and then took that and went to their data leadership, to their CIO, to their senior administrator and figured out a way to incorporate that. Those are two things that I think are helpful, is getting people to talk together and telling the story.
Nate Nelson: And what are you up to these days?
Nick Sinai: I'm still working with Insight Partners, helping a number of Insight portfolio companies. A number of them are helping the public sector and so I'm proud to work in that particular mission. And you can imagine in today's COVID environment, a number of the companies are offering pro bono product and support to help the federal, state and local governments. I'm very proud to be helping a number of companies with that.
Nate Nelson: What did you learn from your work in the executive branch with the open data initiative that you now apply to your work in the private sector?
Nick Sinai: I think there's a number of things around my own transparency. I had the great fortune in the White House of being able to blog about the work that we were doing. And sometimes I authored, sometimes I would help someone else write it or I would amplify someone else's writing. But I think it's important to be transparent about what's happening, even if it's not as sexy of a thing. And trust me, open data and open government is not really a sexy topic per se, but to the extent that we could be transparent, it meant that the entire bureaucracy understood what we were doing and understood why we were doing it. And so I try to be transparent in my work as I work with Insight portfolio companies and other great companies that I'm proud to advise. Trying to be transparent with them and with the entire set of stakeholders, including federal, state and local government about all of the great things that these companies are doing and why it's so important and how it can help support the important mission that everyone is undertaking.
Nate Nelson: What advice would you give to other CDOs, CIOs in positions like you were in, who are trying to do big things with data, really leverage it for the power within and change the culture around data at their organizations?
Nick Sinai: Yeah, I have three things. One would be start small and time bound. I don't know why we insist on MVPs and lean processes with agile software, but we allow the data inventorying and the data cataloging processes to take a long time and kind of not learn from them. I would say, let's take that same iterativeness with data organization initiatives.
Nick Sinai: Two would be, let's use the best of humans and machines. And we did this in the federal government as we thought about harvesting data catalogs from the agencies up to the data.gov level. We used automation wherever we could, and yet we knew that finding and encouraging the data stewards, the human beings who knew the data the best, was absolutely critical. And we had to work with them to understand the sensitivities around, and the context around that data to make sure that we were telling the story of that data and getting it accessible in a responsible way. Number two would be use the best of humans and machines.
Nick Sinai: And three, back to my earlier point is tell a story. Tell the story of what you're doing and why you're doing it. I think it can become contagious because people want to have their thing plug in and be lifted up as part of this initiative. Especially if it gets executive visibility. And in our case, we had President Obama talking about the power of open data to spur private sector innovation. And that's contagious.
Mark Marinelli: To make a point here on machine learning as a specific mechanism for automation. Automation is not just about accelerating the initial functionality, but also the long tail of any data initiative. If you assume that even a sophisticated rule set was able to accomplish the original goal of unifying or cleansing your data, the data and the requirements thereof are going to keep changing. And you're going to end up in a losing game of catch up as you try to retrofit those rules or add new ones to accommodate the ongoing variety and volatility of the data going forward. And this ongoing automation is a perfect application for machine learning. Models are more resilient than rule sets in the face of change. They incorporate a broader array of inputs implicitly without having to add more logic and they have a higher tolerance for deviations from discrete conditions in the data that may pass or fail a rule but will not wholly perturb a model's ability to figure out the answer.
Mark Marinelli: Models can also proactively communicate when their confidence in the results is degrading and oftentimes why. That's really helpful to get that alarm from the monitor that you need to go do something. And when you have to go do something, they can be corrected with really low touch updates. A little bit of feedback from the end users of these data to give a little more training to the model, refresh of the model and you're in better shape. To Nick's point, this is making the best of humans and machines, but there's a strong skew in the automation cycle toward offloading a lot of this ongoing maintenance to the machines.
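As a rough illustration of the "alarm from the monitor" idea, the Python sketch below tracks a rolling average of a model's confidence scores and flags when it drops below a threshold, which would be the cue to collect user feedback and retrain. This is a simplified assumption, not a description of any particular product: the `ConfidenceMonitor` class, its window size, and its threshold are hypothetical, and the scores are assumed to be values between 0 and 1 that some model reports alongside each prediction.

```python
from collections import deque
from statistics import mean
from typing import Optional


class ConfidenceMonitor:
    """Tracks recent model confidence scores and flags sustained degradation."""

    def __init__(self, window: int = 100, threshold: float = 0.8) -> None:
        # Only the most recent `window` scores are kept; older ones fall off.
        self.scores = deque(maxlen=window)
        self.threshold = threshold

    def record(self, confidence: float) -> None:
        """Store the confidence the model reported for one prediction."""
        self.scores.append(confidence)

    def rolling_mean(self) -> Optional[float]:
        """Average confidence over the current window, or None if empty."""
        return mean(self.scores) if self.scores else None

    def needs_review(self) -> bool:
        """True once the window is full and its average is below the threshold.

        Waiting for a full window avoids raising the alarm on a few early,
        unrepresentative predictions.
        """
        if len(self.scores) < self.scores.maxlen:
            return False
        return mean(self.scores) < self.threshold
```

A rule set has no equivalent signal: a record either passes or fails a rule. Averaging over a window is one simple design choice among many; the point is only that a model's confidence gives people something to monitor and act on.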
Nate Nelson: You're actually a bit ahead of me. I also took note of Nick's bringing up machine learning because it stood out from the rest of the subject matter in our interview. It's why my last question to him was about that point. In particular, how to meld machine learning with the culture around IT. Let's listen in and then you and I Mark, we'll hop back on for some final thoughts at the tail end.
Nate Nelson: Before we go, could you give a word or two on the value of machine learning in IT and how we can convince IT folks of its usefulness to their industry?
Nick Sinai: Yeah. I think, from my experience in government and my experience working alongside government, you see a lot of excitement around machine learning, but there's always this question of how to get started and how to do it to work smarter. And in my experience, people want to go home and be with their kids. They want to get the analysis done. They want to be heroes. They want the mission to happen faster and better. And so sometimes there's this misconception that, well, people are defending their turf because they like doing things the hard way.
Nick Sinai: And in all my experiences in and around government, that's really not true. Folks have a protectiveness around their process and around their data for reasons that are statutory, that are legislative around policy. There's a whole series of reasons why people are protective, but ultimately they want to get things done faster and better. And so I've seen a lot of interest in how can we apply machine learning and other automation techniques to clean up data, to make it more available, to automate some of these processes that are now required to make it available to the general public.
Nate Nelson: Mark, you just heard Nick and me there. What are your thoughts coming out of this?
Mark Marinelli: A couple of thoughts spring to mind. The first is the challenge that everybody faces of getting data outside of the silos that it's buried in. We're all very well familiar with that, but how acutely that is felt in the federal government, where you have different agencies, each of which has different data governance policies, each of which has different applications of the data, each of which has different ways of generating the data. If we think that getting the marketing department and the IT department to share a single view of our customer is hard, getting multiple federal agencies to collaborate on just about anything I think is categorically different. It's really interesting to hear how Nick was able to surmount that broad organizational challenge.
Mark Marinelli: The other point is on a data culture, which we've covered in a couple of different episodes here. Important to have everybody internalizing that data has value and thinking about ways to drive value. And when I think about data culture and exemplars thereof, federal government does not spring to mind. It's interesting again to hear how Nick was able to start imbuing these federal government agencies with a bit more of that data culture, seeing how they were exposed to private sector organizations that may be more mature in their thinking and how together they can collaboratively make each other better and lay the groundwork for even more productive use of these data as they become increasingly available.
Nate Nelson: All right, that sounds like a good place to end. Thanks to Nick Sinai for sitting down with us and thank you, Mark for sitting down with me.
Mark Marinelli: Cheers.
Nate Nelson: This has been the Data Masters podcast from Tamr. Thanks to everybody who's listening.