datamaster summit 2020

COVID-19: Building Trust in Data To Save Lives


Paul Balas

Consultant @ 303Computing / Former Chief Advisor Advanced Analytics @ Newmont Mining

Data is an asset to fight the spread of COVID19. Data also needs to be clean and mastered for accurate reporting. Paul Balas embarked on a project to master COVID19 data using Tamr for the data mastering part of the project. Paul will share his work and his findings in this session.


Speaker 1:
DataMasters Summit 2020, presented by Tamr.

Mingo Sanchez:
Hello everyone. And thank you so much for joining us for today’s session, COVID-19: Building Trust in Data To Save Lives. My name is Mingo Sanchez, and I’m a sales engineer here at Tamr. And today it’s my privilege to introduce Paul Balas. A man of many hats, he has worked for over two decades in the data space on many MDM implementations, data warehouse initiatives, and so much more. Paul, the floor is yours.

Paul Balas:
All right, thank you very much, Mingo. First off, I want to thank Tamr. Not only are they a great software vendor, but they’re also a great partner. So if you have the chance to engage with them, I think you’ll have a great experience. I know I did during this project, which we undertook together a little way into the first quarter of this year.
This presentation is about uncovering the systemic challenges that our government faces in managing the COVID-19 pandemic through the use of data, and a proposal to fix it. There was a team of people that came together. We have Mingo, Elizabeth, Katie from Tamr, who participated on the project. And then a good friend of mine and old coworker, Keith Worfolk. And Kamal Maheshwari from InfoWorks. And then we had Brian Haagsman, who I also worked with in the past while I was at Newmont Goldcorp. And then we had some interns who wanted some real-life experience on a project, and they found this one meaningful.
So what’s our agenda today? Today, we’re going to talk about why the data’s wrong. We’re going to talk about why the new system that the Health and Human Services Agency, who’s responsible for this data, is not really addressing some of the core problems in improving the data quality. And then we’re going to talk about how to fix it. We actually built a POC to demonstrate the power of good data governance.
So why is COVID data wrong? Well, the data quality is suspect. And it’s due to a few things. The CDC has an aging system. They are trying to modernize it, but there are some challenges. The virus is spreading very quickly. And in fact, it is the most virulent infectious disease of modern times. And it’s created a large volume of data.
Now, the old system with other infectious diseases was able to deal with these challenges because the volume wasn’t as big. But when we started seeing a lot of cases, they require a lot of hand-touching. And there’s a lot of things that are missing from the data that we want to be collecting. So we’re not capturing what we need to manage the response to the pandemic.
So why do we need this data? What does it do for us? It does a few simple things. It allows us to manage our supply chain. So as we’re overwhelming our hospital capacity, we’re also overwhelming our stock of supplies like PPEs, those are protective gear that healthcare workers wear, testing supplies. The tests have not been able to scale up to the amount that is recommended by epidemiologists and experts to be able to understand and contain the virus as it spreads. And things like ICU beds. We’ve been overwhelming the amount of ICU beds that have been available in the hospitals, and we have to turn people away.
Ensuring that we have enough doctors and other healthcare professionals, at the right place at the right time, to meet the surge demand, is an important factor. And then issuing orders to the public at large about things that will help prevent the spread of the disease, stay-at-home orders, social distancing, shuttering businesses. All these things are critical and are driven by the data.
So if we don’t have trusted data, no decision that our government, and federal workers, and state workers are going to make is going to be driven with conviction. And we’re going to be asking ourselves, could we have saved more lives?
Deborah Birx, Dr. Birx, is part of the federal response team for the pandemic for this administration. Earlier this year, she made this statement. She said, “There is nothing from the CDC that I trust.” Now, that’s a pretty damning statement to make about an agency that has been really the best in the world in terms of public health response in managing pandemics.
So why don’t people trust the data? In fact, it’s not limited to Dr. Birx. I think if you look at the news cycle, every one of us has seen this a few times: some official raises concerns about how the CDC is counting test data. We’ll dig into that one. A Harvard Business Review article, and others. So there’s literally thousands of people who are saying that the data is not good. The data is not standardized. How many tests have been given?
So this is a really simple and fundamental question we should be able to answer, but recently the CDC had some counting errors. Now, you can’t blame them. How could they possibly get this right? They have an archaic system that’s inflexible, and it’s got to adapt quickly to a new reality. So when you look into this in detail, you can see that there are a lot of challenges in getting things that should seem simple right.
So there’s thousands and thousands of these types of examples in the news cycle where we’ve got incorrectly reported data, and everybody is saying the data is not good. Obviously, the CDC system isn’t solving the problem of data quality to manage the pandemic. So what’s the scope of the problem? The scope of the problem is it’s large. The CDC has over 1200 users and about 950 state partners that participate and touch this data as it’s processed across our nation.
There’s over 6,000 hospitals in the U.S. About 2,000 of those are providing data directly to this system called HHS Protect, which is the umbrella system. Additionally, there’s data entry systems, one notably called TeleTracking, which is a way to manage patient intake and patient care through a hospital visit and treatment. The idea that was proposed earlier this year and executed on was to expand the use of this system so that we could record the data that we needed. Now, that sounds like a great idea, and it is a good idea, but it doesn’t really solve the entire data quality problem. Because, as you’ll see, standardized data entry does not necessarily mean data quality.
So the CDC is trying to solve this problem. They’ve been doing it since 2014 when they undertook a modernization process. Getting alignment on what standards need to be in place, and on how the data or process is creating gaps in achieving that standard, can be an extremely time-consuming process.
And without a framework for understanding the as-is condition of the data, years can be wasted trying to define and then close those gaps on quality. And the current focus of the CDC is more about speed of transmission of the data, standardization of what the data is supposed to look like based on validation rules. And those validation rules are implemented in procedural language. And so we’re going to talk a little bit more about the idea of how procedural rules become inflexible, and they’re not very agile to respond at speed.
This system is broad. We’ve got case reporting happening from all the agencies that I’ve talked about. These are healthcare providers in addition to hospitals. There’s also laboratories. And then these go to usually regulatory agencies, they aggregate it, they touch it. And then that goes into databases all across the nation. And then it gets integrated. And so you can start to see how really complex this system is. And it may look like your organization… You’ll say, “Well, yeah, my organization has that type of complexity too.” And it’s true. It probably does.
So, the system and the objectives of HHS around governance is a critical thing to think about. They recognize they have a problem, and they’ve made statements that they’ve got siloed data, they’re not getting the sharing of data that they want, and yet they’re really not applying technology or process innovation to deal with that issue. So let’s take a look at how we could do that.
Standardized data entry versus agile data management. With a standardized data platform, what you end up doing is creating screens with rules on them, and you create data dictionaries, and you train people. And that’s how your ERP systems work; any data entry system works in this fashion. The challenge is that there is a lot of room for error in this type of framework, where you still have to have a lot of knowledge. And all the knowledge that you have to have shades how you put that data into that system.
So this is really a people problem. And we’re trying to address it as best we can with systems, and they’re falling short. So the fallacy of standardized data entry solutions is just that: there’s room for ambiguity. And over to the right you can see, from the TeleTracking system, a hospital’s instructions about how data is supposed to be entered. I don’t know how many times I’ve built data dictionaries and they gather dust.
So it’s about definitions and it’s about transferring really subject matter expertise about how the data’s supposed to behave into a framework. And the added complexity of medical forms creates a very big daunting challenge to get the data right.
Maybe you’re one of these managers who say, yes, we should put in a standardized data entry framework and it’s going to solve our problems. That’s how we’re going to solve data entry in our company. I bring this up because I think there’s a little education that is important. Talking to the people who have to deal with the data as it flows downstream into the analytic pipeline, using data as information to make decisions, these people are integrating data across systems.
And if you talk to your data scientists, and you talk to your data warehouse team, they’re going to enlighten you about how dirty the data is even when you thought it was coming from a really good system. This problem becomes exacerbated when you start to integrate data across systems. Because, oftentimes, you’ll have customer data, for example, in one system and another system, and you have to master it, and the rules for each system are different. So how do you conform and standardize those rules? It has to happen in the middle.
And so, what we’re going to show you next is about how to fix the data in the middle. This is the POC that we really built. It’s addressing some of those issues that we’ve highlighted now, it’s getting people to work towards understanding the problems in the data, and coming up with the rules on how it should be standardized.
So this new system that is being built by our federal government is going to fall short. And the reason it’s going to fall short is that they’re not solving the people problem. How are people aligning towards these new standards? How are people codifying those standards into the systems, that idea of procedures versus machine learning?
So let’s see how we envision this approach. Imagine that this is your problem to solve. So everybody listening, put on your CDO hat, you are now the CDC-CDO. Probably not an enviable position. And you’re tasked to improve our nation’s ability to better manage the next pandemic through the use of data. Your first goal is to understand what are the key issues with the current framework, that’s the as-is, and develop a roadmap to address them. The stakes are very high.
So you come up with some architectural principles. These are guiding principles that will help align a very broad team towards focused outcomes. And you have four simple goals: build better trust in the data. Everybody will agree with that. Understand which issues to fix first and prioritize them. The system should be agile to change. You need to respond with this framework in days and weeks, not months and years, because time equals lives. And you want a way for people to collaborate where they can’t get in a room together. In our current world, people have to collaborate efficiently across our nation and even across the world. And they need to do it around the data.
So you come up with two paths. You say, “I’m going to split this problem into two. Path one, I’m going to standardize the data entry systems because that is a good thing. But we’re not going to roll that out quickly.” We know what it takes to roll out a standardized application: the change management effort, the communication management, the training, the blueprinting. If any of you’ve been through an ERP rollout, you know exactly what I’m talking about. And you probably have that deer in headlights look, and it does bring a little panic to us as data practitioners.
That’s the long path. So you say, “Well, I really want to focus on this people problem and get alignment in how people should understand the data and what the challenges are from all the state agencies, all the hospitals, all the laboratories, how can we do this quickly?” And so, Mingo, maybe you can take us through a little bit about the reference platform.

Mingo Sanchez:
Yeah, absolutely. Well, we here at Tamr specialize in taking the faster path. So the way that we tackled this in the POC was using Tamr for collecting that feedback from people collaboratively, as Paul mentioned, to quickly iterate and get at the data quality issues that were plaguing these data sources. Now, in order to do that, we didn’t do it alone. We integrated with some other systems as well. So InfoWorks was really critical for scraping those data sources in the first place, and getting them into that structured format to be fed into Tamr.
And both of these platforms were built on top of GCP. So we were able to leverage that power of the cloud and use those cloud-native technologies to do this processing. And not only that, but we were able to use some of Google’s other capabilities for enriching the data sources so that we were able to get that same information into Tamr that a person would be using to do this exercise themselves.

Paul Balas:
So the POC was about understanding the data issues, and we wanted to focus on the systemic issues in underlying data flow across all the stakeholders. Now, if you’ve been an information architect, data architect, then you’re very familiar with the process of understanding what the data says, data doesn’t lie, you’re understanding how people think it’s supposed to work, and you know that when you create a conformed data model, you have to get agreement.
And so that process can be very time consuming, especially if you have a lot of attributes, a lot of data elements that you need to deal with. So you wanted to short circuit that and make that conversation happen faster. So you chose to look at issues around testing as the topic. So everything about testing: what’s wrong with the testing data? What are the issues? You believe that if you can focus on testing data, you can get some immediate benefits for public health if you can improve it and build confidence in that data.
So what problems are states having in processing their testing data? Is the testing data being reported consistently and accurately? We’re all familiar with Johns Hopkins University at this point. They’ve got a dashboard, it gets billions of hits and it’s become the de facto authority on COVID-19 data. But did you know that they’re pulling it from other agencies? In fact, they’re getting a lot of their data from the same place the CDC is getting their data.
There’s one major data source that is being provided to JHU. It’s called the COVID Tracking Project. And it’s a crowdsourced effort, a group of right-minded people recognizing a problem around data aggregation. And they’ve been pulling it from all the state and local governments wherever they can. So, they’re pulling the data, they’re providing it to JHU, JHU is taking in the data, everybody’s massaging the data. How do we trust it?
So what type of problems does JHU say they see in the data? They have a GitHub repository where they’ve done a great job of recording all the data issues. So if we could classify that, and if we could classify all the issues across all the other stakeholders, their verbatim statements about what the problems are, we believe that we could have a tool to get this alignment that we’re seeking. And then standardize the data, and then look at a roadmap to address the systemic problems and challenges, and get alignment on what’s most important to solve first.
So when is a test not a test? The CDC, Johns Hopkins, and the COVID Tracking Project, and hundreds of other sites all deal with test data differently. It seems like it’s a simple problem, but it’s not. So the CDC recently made that mistake I alluded to earlier in the news article. What happened here is that the CDC was adding two different types of tests. And it was really conflating the number of tests that we were actually achieving.
Now, the goal of measuring the total number of tests that were being given, it was kind of achieving a high water benchmark of how many we had to achieve to be able to manage the spread. And just as a side note, testing alone doesn’t really solve this problem, you also have to do contact tracing, which is another beast. But, if there’s something that we want to correct for and monitor in our POC, it’s testing and test data. Let’s solve for that. Can our system compare test reports from various agencies to help explain why it’s different?
So here’s the DataOps in action. If you’re not familiar with the term DataOps, it’s basically a way in which to guarantee quality in your data and monitor it through KPIs and a management framework. And it’s continuous. DataOps is not something you do once a quarter, once a week, once a month. It’s continuous monitoring of the data. And with Tamr and InfoWorks, we’re able to provide this framework in our solution.
So what we see here is this dashboard that shows data quality issues and the data measurements that are being reported by the various organizations. We can compare data between reporting agencies and make the variances obvious. We can also use the platform over time to monitor the health of this system. So DataOps, in its most meaningful incarnation, is to continuously perform this monitoring, and notify you as someone who cares if we’ve got unusual occurrences or problems that are cropping up.
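As a rough illustration of the "notify on unusual occurrences" idea, here is a minimal sketch in Python. It is not how Tamr or InfoWorks implement monitoring; the function name, the sample numbers, and the 50% jump threshold are all assumptions made for the example. It scans a cumulative daily count and flags days where the total goes down or jumps implausibly.

```python
def check_cumulative(series, max_daily_jump=0.5):
    """Return (index, issue) pairs for suspicious points in a cumulative series.

    A cumulative count should never decrease, and a day-over-day increase
    larger than `max_daily_jump` (as a fraction of the previous value) is
    treated as worth a human look. Both rules are illustrative assumptions.
    """
    issues = []
    for i in range(1, len(series)):
        prev, cur = series[i - 1], series[i]
        if cur < prev:
            issues.append((i, "decrease"))
        elif prev > 0 and (cur - prev) / prev > max_daily_jump:
            issues.append((i, "jump"))
    return issues

# Invented cumulative "total tests" values for five consecutive days.
totals = [100, 120, 115, 130, 260]
alerts = check_cumulative(totals)
for index, issue in alerts:
    print(f"day {index}: {issue} ({totals[index - 1]} -> {totals[index]})")
```

Run on a schedule against each day's load, a check like this is one small piece of the closed loop between monitoring and fixing.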
And so that way we’ve got a very tightly closed loop in how to monitor and then address. So if you change the timeframe on the slider, there’s a slider down here on the bottom of submission date, you can look at different time slices over time and see if changes have happened in testing. That slider also affects these quality issues off to the right, where we’ve classified the quality issues on a taxonomy that’s bent towards different types of problems that are inherent in data quality. Things like compliance to standards. Things like concurrency or timeliness.
So what you’ll notice in this example is that there’s a variance between the CDC and the COVID Tracking Project, CTP. CTP is the orange line. CDC is the blue line. The red line is JHU. Let’s talk about the variance first. That’s that gray line between the orange and the blue. And you can see it wiggles all over the place. Why is there a variance in something like a total people tested metric? And then off to the right, you can start to see some of the reasons why.
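The gray variance line described here can be sketched in a few lines of Python. This is only an illustration: the counts are invented, and the agency keys, function names, and 10% threshold are assumptions rather than anything taken from the actual dashboard.

```python
from datetime import date

# Hypothetical cumulative "total people tested" counts per agency and day.
# The numbers are invented for illustration; they are not real figures.
reports = {
    "CDC": {date(2020, 4, 12): 1200, date(2020, 4, 13): 1500, date(2020, 4, 14): 1900},
    "CTP": {date(2020, 4, 12): 1000, date(2020, 4, 13): 1450, date(2020, 4, 14): 1600},
}

def variance_series(reports, left, right):
    """Return {day: left - right} for every day both agencies reported."""
    shared_days = sorted(set(reports[left]) & set(reports[right]))
    return {day: reports[left][day] - reports[right][day] for day in shared_days}

def flag_large_gaps(series, reports, baseline, threshold=0.10):
    """List days where the gap exceeds `threshold` as a share of the baseline count."""
    return [day for day, gap in series.items()
            if abs(gap) / reports[baseline][day] > threshold]

gaps = variance_series(reports, "CDC", "CTP")
flagged = flag_large_gaps(gaps, reports, baseline="CTP")
```

The flagged days are the ones worth pairing with the classified quality issues, since those are where an explanation for the disagreement is most needed.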
I want to highlight one thing here, which is, you don’t see the Johns Hopkins University line on testing start until about April 12th. And that’s because that’s when they started reporting and recording it. You can see that there was some wiggle between JHU and the COVID Tracking Project. And they started to align closely after that. But for some reason, the CDC is reporting bigger numbers.
This is the type of problem that we want to explain, we want to understand, and we want to short circuit the conversation for people that have to be engaged and involved in setting the standard. So it’s going well, you’ve got a framework for DataOps that can help you start to understand and solve this problem of data. You want to do a little bit more. You understand that the framework may be able to do some interesting things, and there’s this opportunity of showing how people in the public eye might influence outcomes in people getting tested, in mortality rate. You’ve got a theory and you want to start tracking the data to see if that theory proves out, and it might shape public response to how the pandemic’s being managed.
So you asked the team to classify news data that’s generated by public influencers. You want to classify it by events and you want to classify it by locations. Because with those three things, you might have a nice rich data set, by which you can then correlate it to test results. For example, are more people getting infected when we shutter the economy or less? We would hope less. So, using a traditional tool, a master data management tool to do this type of mastering classification is almost a nonstarter. But with this platform, we were able to do some things very quickly.
So here’s the net result of your influencer exercise. What you see here is a correlation of, in this case, positivity rates. So these are people who were tested, and of those, the people who showed a positive trace of the virus. And you can see, it starts slowly, maybe we were tracking very well in March, and then it starts ramping up, and it gets very worrisome.
And then these lines are actually news events that you classified. And these news events are about things that we might care to track. So we created a taxonomy and we classified our news articles against it. And in this example, we can see in May, Governor Andrew Cuomo announced limited phased re-openings. What you would hope to see in May is that the test positivity rates would not increase significantly. But, after the reopening, we can see in fact that it’s starting to increase. Now, this is not causal, but we can develop data science models and look at correlation and strength of correlation.
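To make the before-and-after comparison concrete, here is a small Python sketch. The daily counts, the event date, and the four-day window are all invented for illustration, and the helper names are hypothetical. As noted in the talk, a difference like this shows association only, not cause.

```python
from datetime import date, timedelta
from statistics import mean

# Hypothetical daily counts: (day, people tested, positives). Invented numbers.
daily = [
    (date(2020, 5, 11), 1000, 50),
    (date(2020, 5, 12), 1000, 40),
    (date(2020, 5, 13), 1000, 60),
    (date(2020, 5, 14), 1000, 50),
    (date(2020, 5, 15), 1000, 50),  # hypothetical day of a classified news event
    (date(2020, 5, 16), 1000, 70),
    (date(2020, 5, 17), 1000, 80),
    (date(2020, 5, 18), 1000, 90),
    (date(2020, 5, 19), 1000, 80),
]

def mean_rate(daily, start, end):
    """Average daily positivity rate (positives / tested) for start <= day < end."""
    rates = [positive / tested for day, tested, positive in daily
             if start <= day < end and tested > 0]
    return mean(rates) if rates else None

def before_after(daily, event_day, window_days=4):
    """Compare mean positivity in equal windows before and after an event day."""
    window = timedelta(days=window_days)
    next_day = event_day + timedelta(days=1)
    before = mean_rate(daily, event_day - window, event_day)
    after = mean_rate(daily, next_day, next_day + window)
    return before, after

before, after = before_after(daily, date(2020, 5, 15))
```

With many classified events, the same comparison can be repeated per event type and location to see where the association is strongest before investing in a proper model.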
So we’ve got a very interesting framework here, we built it very quickly, and it’s allowing us to do DataOps and more. So this new DataOps system is going to be able to provide more than just good data quality for COVID. It will also provide it for other pandemics. Because it will scale very easily. It will allow you to conduct data science experiments. So you can see if there’s correlation between what people do and say with outcomes in managing the pandemic. And you’re going to be able to have a quicker way to solve the data quality challenges and get people aligned around what to focus on first.
Interestingly, just this week, Florida’s governor cleared restaurants and bars to fully open. Now, we’re going into winter, I know Florida has a mild winter, but still, it’s an opportunity for people to spread the infection. And so, what I’ll be interested to see is whether these actions are impacting the infection rates, and the death and mortality rates for the COVID pandemic.
So, what did you achieve as a CDC CDO? You delivered a DataOps framework that will expedite realization of data standards. It puts the power of data governance and master data management into the hands of the experts at the CDC. It doesn’t rely much on IT to get this done. You use people who are subject matter experts to fix the data, and you make good use of their time.
It works in complement with systems like TeleTracking. It’s that idea of data quality in the middle, of DataOps in the middle. And it’s going to scale beyond this infectious disease data, and it can serve as a model for HHS to solve other types of similar problems in order to promote good data quality for the benefit of all citizens.
So now we’re going to talk a little bit about how it was built. We had about 60,000 news articles that we captured from March till about August using an API. We pulled some data from Twitter, some tweets by Trump, some tweets by some celebrities, Kanye West and others, to kind of see if their statements have an influence, could be correlated to the data and outcomes. And then we had data from hundreds of State Health Department websites, where they report and record what their data quality challenges are.
And then we took the Johns Hopkins University GitHub data. So the infrastructure is sitting on Google Cloud Platform; we have some VMs. Very cost effective. We had InfoWorks for data integration and orchestration, harvesting data, running it daily to pull in the next day’s set of data, and then telling us if we have any challenges with that process. It’s another piece of the DataOps framework. We use BigQuery, we use some Python, and we use Google Natural Language for feature extraction. And then Google Cloud Storage.
And then once we had processed that data, the magic happened in Tamr where we did all the master data management classification. And then for visualization, we put it into Tableau. So, briefly, how do you extract meaning from text? Years ago, this would be a very lengthy process and not very good. But we had to be able to take news article data, text, and identify, for example, Florida date of active cases, around 13 April daily data in U.S. So, there’s a state in there, there’s a date in there, there’s some sort of action or event. And how do you get meaning out of that textual sentence?
What we did is we used the Google Natural Language API. It’s a very robust API. And it allows you to extract entities (people, places, and things), sentiment, and syntax. And then categorize those articles, if you choose. And we focused on and heavily leveraged the entity extraction.
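The project's actual API calls are not shown in the talk. As a deliberately simplified stand-in, and not the Google Natural Language API, the following Python sketch pulls a state and a date out of the example sentence with a tiny lookup list and a regular expression. The `STATES` subset, the date pattern, and the function name are assumptions for illustration; a trained entity extractor handles far more variation than this.

```python
import re

# Small illustrative subsets; a real gazetteer would cover every state and
# many more date formats.
STATES = ("Florida", "New York", "Colorado")
MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")

# Matches dates written as "13 April" (day first, then full month name).
DATE_RE = re.compile(r"\b(\d{1,2}) (" + "|".join(MONTHS) + r")\b")

def extract(text):
    """Return the known states and day-month dates mentioned in `text`."""
    states = [s for s in STATES if re.search(r"\b" + re.escape(s) + r"\b", text)]
    dates = [(int(day), month) for day, month in DATE_RE.findall(text)]
    return {"states": states, "dates": dates}

result = extract("Florida date of active cases, around 13 April daily data in U.S.")
print(result)
```

Hand-written patterns like these break quickly on real news text, which is the motivation for using a managed entity extraction service instead.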
And then Tamr allowed us to really spend time analyzing the data rather than time processing the data. We’ll go into a little bit more detail here. Typically, when you talk to most people who do data work, 80% of their time is spent fixing data, 20% analysis. Tamr helps you flip that on its head.
Additionally, there’s this idea of procedural rules versus machine learning. Every single MDM vendor in the market today that you may have in your company uses procedural rules to master data. The old way is you go through a very lengthy cycle where you look at the source data, the developers do this, the developers talk to the business people about the data that they’ve looked at and profiled.
They document what the rules are supposed to be. They write those rules into code. They implement them. Then a QA team validates that the developers did their job right. Then they get in front of the users and they show them what they did. Users go, “Oh, not exactly.” And so it goes back to square one, “Okay, we need another rule.”
And so, that cycle happens over and over again. And the current-day panacea for that problem is agile methodology. Agile by itself is a very good way to approach faster iterations and faster dev cycles, but it really doesn’t solve for the problem of procedural rule-based modeling. And so, this process can take months to years for complex data sets. So you end up not mastering as much data as you’d like to. But Tamr does something really novel in the market. It uses machine learning to actually curate the data.
And what does that mean? What it means is you don’t write code. You don’t write rules in a procedural way. What you do is you give Tamr some data, and Tamr gives you back some pairs and it asks you to match those pairs. It’s a very simple exercise. Anybody could do it with a half-hour training. And it’s got a way for data experts to collaborate on those rules. Maybe I don’t know if these two customers are really the same person, or these two are really the same asset, but I can say, “I don’t know, I need Bob in Omaha to take a look at it because he’s been dealing with this data forever.” And so then Bob can say, “Oh yeah, that’s this.”
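To give a feel for learning from labeled pairs instead of writing rules, here is a toy Python sketch. It is not Tamr's algorithm: it uses a single string-similarity feature from the standard library and learns one cutoff from a handful of invented expert answers, where a real system would use many features and a proper model. All names and data are assumptions.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """String similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# Hypothetical expert answers: (record A, record B, same entity?).
labeled_pairs = [
    ("World Health Organization", "World Health Organisation", True),
    ("Johns Hopkins University", "John Hopkins University", True),
    ("COVID Tracking Project", "The COVID Tracking Project", True),
    ("CDC", "Johns Hopkins University", False),
    ("Florida", "World Health Organization", False),
]

def learn_threshold(pairs):
    """Pick the similarity cutoff that agrees with the most expert labels."""
    scored = [(similarity(a, b), same) for a, b, same in pairs]
    scores = sorted({score for score, _ in scored})
    # Candidate cutoffs sit halfway between neighbouring observed scores.
    candidates = [(lo + hi) / 2 for lo, hi in zip(scores, scores[1:])]

    def agreement(cutoff):
        return sum((score >= cutoff) == same for score, same in scored)

    return max(candidates, key=agreement)

def is_match(a, b, cutoff):
    """Predict whether two records refer to the same entity."""
    return similarity(a, b) >= cutoff

threshold = learn_threshold(labeled_pairs)
print(is_match("Andrew Cuomo", "Gov. Andrew Cuomo", threshold))
```

When an expert like Bob in Omaha answers another pair, it is simply appended to `labeled_pairs` and the cutoff is relearned; no rule code changes, which is the contrast with the procedural cycle described earlier.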
And so it’s got a way to collaborate around the data. And I can’t emphasize enough how important that is in making those subject matter experts more effective and in using their time in a much more efficient way. And so with Tamr, I’m going to demonstrate and show you that I was able to build seven projects. I did it very quickly and I got very high accuracy.
So here’s a little bit of an example of the burden of procedural rules versus machine learning. The orange line is procedural and the gray line is machine learning. The Y-axis is the level of effort that you as a data steward have to put into the system to get it working. And time is the X.
So, let’s take a look at the orange. With procedural rules, you go through that dev cycle I explained. And over time, you start to get it right. “Okay. It’s finally good.” And then what happens is, time moves on and you’ve got some new rules in your data source. And then the level of effort goes back up and it goes up significantly. “Okay. We get over that hump.” And then, “Okay. My burden, my time spent to manage the data to make sure it’s good goes down a little bit.” And then you get another business change.
And this cycle repeats over and over. But what happens over time is, your technical debt and your burden increases. The amount of time you have to spend fixing data can actually increase. And it gets very complex to debug and diagnose when you don’t quite get the data right because the rules become layered. And the rules sometimes conflict with one another.
With machine learning, what happens is, you go through the pair-matching or the classification exercise, and you spend a day, a few days, a week, at the outset, a few weeks, training the model with this pair-matching exercise or the classification method. And then it learns. And so you’ve transferred your IP into a framework. And that addresses a problem for industries where the labor force is aging and you want to capture their knowledge. It’s a big thing to get that knowledge and that tribal knowledge into your system. Right? So it lives on and we don’t lose it. With machine learning, there’s an efficient path forward to get that done.
So, seven projects built in a few weeks. I did a COVID data quality taxonomy, an event taxonomy, and a location taxonomy: here are all the states mentioned in all those 60,000 articles. That took literally a couple of hours to build. An organization project, which is identifying all the organizations like the World Health Organization. That one took a little bit longer. It took about a day. People was an interesting one. I wanted to capture, from this people mastering project I did and by creating a golden master project, the actual people who said things: Paul said something, Trump said something. And so now Liz is going to talk a little bit about taxonomies before and after.

Elizabeth Michael:
Absolutely. So taxonomies are extremely valuable for analysts like me because they allow you to aggregate and drill down into your data in a uniform way across data sources. So when a unified taxonomy is applied across multiple data sources, the analyst or business user can do a side-by-side comparison of records and data from disparate sources while filtering down on a cross-section as a category or a classification within those data sources.
And Tamr has a really unique classification solution that allows the user to build a unified taxonomy across those data sources and collect and implement feedback from a subject matter expert who knows where those things should be categorized. In this project, both the ability to build those cross-source taxonomies and the ease of collaboration made Tamr a great solution.
For this project, we did a combination of both mastering and classification. In the example you see here on the left, you can see the entities that Paul was able to extract from the articles. And while they have tidbits of information we were interested in, as we would expect with language processing, they weren’t uniform.
As you can see on the right, however, Tamr was able to master the public policy influencers, mastering all instances and variations of, for example, Trump as president. Which I, as the analyst, needed to then use to drill into the number of statements per state, the influence of policymakers on testing rates, et cetera. Each of these entities that Paul either mastered or classified in Tamr, we were able to use to drill into the entire dataset, overlapping entity filters to identify, for example, Trump’s statements around event types, as classified in the event taxonomy.

Paul Balas:
Great. Thanks, Liz. Mastering people, this is pretty impressive. I’ve done four master data management projects in my career. I’ve blueprinted a few others. We went from about 530,000 entities, people, places and things, and we were able to extract the people and then master them into golden records. About 9,000 people identified in a few days. And we got very high accuracy. We got, I think, 93%-plus accuracy.
Now, it wasn’t perfect, but it’s got a framework to improve it. And typically, when you’re doing customer master data management, that’s good enough because you’ll start with your high revenue customers anyway, if you’re looking at like a CRM. And then you’ll work the long tail. Same thing here, maybe we focus on the most important people first, and then we work the long tail. And the framework doesn’t require more IT work, it just requires the subject matter experts to train the exceptions.
So I estimate that this system can be maintained in a couple hours a week at scale. And it can decrease to minutes a week as the model learns and progresses. So I don’t even have to monitor this. Tamr can notify me when it’s got an exception and it needs some attention from me as the data steward. It takes very few minutes to go in there and say, “Oh, these are the high value pairs. Let me go look at what Tamr is having a problem with and see how they’re supposed to match.”
So that people master workflow I alluded to earlier was really simple. We did a mastering project for the people. And then if you want to do a golden master, you have to take all those people records; maybe Trump’s in a thousand different articles. It might be spelled a little differently. You want one record to represent him in the way that he should be represented in a dropdown list in the dashboard that Liz showed.
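The golden-record step can be sketched as follows. This is an assumed, simplified survivorship rule (most frequent spelling wins, ties broken by length and then alphabetically), not the rule used in the project, and the cluster ids and mentions are invented.

```python
from collections import Counter

# Hypothetical output of a mastering step: each cluster id maps to the raw
# mentions that were judged to be the same person. Invented data.
clusters = {
    "person-001": ["Trump", "Donald Trump", "President Trump",
                   "Donald Trump", "Donald J. Trump", "Donald Trump"],
    "person-002": ["Andrew Cuomo", "Gov. Cuomo", "Andrew Cuomo"],
}

def golden_name(mentions):
    """Choose one display name for a cluster of mentions.

    The most frequent spelling wins; ties go to the longer spelling, then to
    the alphabetically first one. This rule is an illustrative assumption.
    """
    counts = Counter(m.strip() for m in mentions)
    return min(counts, key=lambda name: (-counts[name], -len(name), name))

golden = {
    cluster_id: {"name": golden_name(mentions), "mention_count": len(mentions)}
    for cluster_id, mentions in clusters.items()
}
```

Each entry in `golden` is the single record that would appear in a dashboard dropdown, while `mention_count` keeps the link back to how often that person appeared in the source articles.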
So, what are our conclusions? COVID pandemic data challenges are a macro view of the same challenges we all face in our own companies. As we try to use data as the information to improve outcomes for our business or outcomes in the public sector, in this case, really it’s a people problem. People need to work together more effectively so that we can erase this pandemic from our lives. Trusted data can truly help us in this. Thank you very much.
I’ve really enjoyed working with Tamr. I want to thank Tamr again, and thank InfoWorks. And especially thank Liz and Mingo for helping on this project.

Mingo Sanchez:
And thank you to you, Paul.

Elizabeth Michael:
Thank you, Paul. It’s a pleasure working with you on this project.

Mingo Sanchez:
Absolutely. Just to echo what Liz said, Paul’s been a great partner on this project. And as much fun as we have working with Paul on this POC, we love to work with new customers on the challenges that they’re facing too. So if this was of interest to you and you’d love to learn more about Tamr and how you could collaborate with us, we encourage you to visit our website to learn more. In the meantime, thank you so much for attending the session and enjoy the rest of the summit.