Written by Michael Brodie
Data curation following strong data science principles is required to demonstrate the veracity of data analysis
Capital in the Twenty-First Century by Thomas Piketty is the most popular and controversial economics book in decades. As a May 31st editorial in The Economist summarized it, “The book’s thesis, that wealth concentrates because the returns to capital are consistently higher than economic growth, has spawned furious debate. Mr Piketty’s preferred remedy (a progressive wealth tax) even more so.”
Piketty’s thesis concerns sustained wealth inequality among the superrich and is based on an analysis of worldwide economic data since 1810. While the thesis and remedy pit supportive liberal economists like Paul Krugman against conservatives like James Pethokoukis in a debate over modeling assumptions (e.g., that output will grow more slowly than the return on capital), the most fervent and objective public debate concerns the data science used in the analysis – specifically, the data curation used to select and combine economic data into the curated dataset for the analysis.
This debate illustrates the role and importance of data science and data curation.
- Data curation, following good data science principles, is critical to achieving and demonstrating the validity, data quality, and significance of data analysis results over small data – and even more so over Big Data. Data curation is thus a critical task, independent of the subsequent analysis, that involves complex decisions and operations.
- Human expertise, of leading economists in this case, is indispensable for some data curation decisions.
- Data analysis with potentially significant impact should be accompanied by data curation provenance, including data governance (veracity, data quality, and significance), as a basis for verifying the results and, if necessary, adjusting the curation to address gaps and rerunning the analysis, as Piketty did.
Piketty followed excellent data science principles in publishing his source data and his data curation methods. Chris Giles, economics editor of the Financial Times, while alleging political motives (i.e., bias), focused his argument on gaps in Piketty’s data curation: specifically, incomplete provenance, such as undocumented curation steps including “unexplained adjustments to the raw source data” and interpolation for missing source data by “migrat[ing] from data based on estate tax records to data from the Survey of Consumer Finances.” Giles went on to debate three methods used to reconcile conflicting source economic data before criticizing curation choices such as using a simple rather than a weighted average of data from Britain, France, and Sweden. He further questioned Piketty’s data quality, alleging “confirmation bias.” Finally, Giles demonstrated the significance of the alleged curation errors by rerunning the analysis on data corrected according to his reading of the book and showing that Piketty’s thesis then failed.
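To make the averaging dispute concrete, the sketch below uses hypothetical figures (not Piketty’s or Giles’s actual data) to show how the choice between a simple and a weighted cross-country average shifts an aggregate estimate:

```python
# Illustrative sketch with hypothetical figures (not Piketty's or Giles's
# actual data): the aggregate "European" top-decile wealth share depends
# on whether country series are combined by simple or weighted average.

# Hypothetical top-10% wealth shares for a single year
shares = {"Britain": 0.71, "France": 0.62, "Sweden": 0.58}

# Hypothetical population weights in millions; choosing the weighting
# variable (population? GDP? total wealth?) is itself a curation decision
weights = {"Britain": 63.0, "France": 66.0, "Sweden": 9.6}

simple_avg = sum(shares.values()) / len(shares)

total = sum(weights.values())
weighted_avg = sum(shares[c] * weights[c] for c in shares) / total

print(f"simple average:   {simple_avg:.4f}")   # ~0.637
print(f"weighted average: {weighted_avg:.4f}")  # ~0.658
```

A two-percentage-point gap from the weighting choice alone – before any dispute over the source data – illustrates why such curation decisions must be documented and defended.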
While political debates can amount to ships passing in the night, the debate on the data science and data curation can be more objective – as this story shows. Piketty again followed good data science practice by acknowledging data curation errors and providing a detailed “technical response” in an addendum to the technical appendix of his book. Meanwhile, other economists and commentators joined the fray, refuting Giles and supporting Piketty’s revised results. The coup de grâce came when the very wealth-inequality economists whose data Giles had used dismissed his critique as “simply untrue” – a scenario reminiscent of my childhood playground argument: “Is Not! Is Too!”
From a data science perspective, the “Piketty Panic” (as Paul Krugman refers to it) has clear implications for all rigorous big data analysis:
- Data curation following data science principles is required to demonstrate the veracity of the results of data analysis. Data curation will therefore become increasingly important as its role and value are understood.
- Data analysis must define data curation requirements (veracity, data quality, significance) that the curated data must meet – and be verified to meet – before it is acceptable as input to analysis.
- Data curation is a separate function that precedes data analysis. Data curation products – such as Tamr – must provide means for curators to manipulate data sources to achieve data curation requirements.
- Human data curation expertise is indispensable in many, if not all, rigorous data analyses.
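As one illustration of these points, a curation tool might record every adjustment alongside its rationale. The sketch below is a minimal, hypothetical design (the class names are assumptions, not Tamr’s actual API):

```python
# Minimal, hypothetical sketch (class names are assumptions, not Tamr's
# API): a curated series that logs every adjustment with its rationale,
# giving reviewers the provenance needed to audit and rerun an analysis.
from dataclasses import dataclass, field

@dataclass
class CurationStep:
    description: str  # what was done, e.g. "filled a missing year"
    rationale: str    # why it was justified

@dataclass
class CuratedSeries:
    source: str
    values: dict                              # year -> value
    provenance: list = field(default_factory=list)

    def adjust(self, year, value, description, rationale):
        """Apply one curation decision and record it in the provenance."""
        self.values[year] = value
        self.provenance.append(CurationStep(description, rationale))

# Hypothetical example: bridging a gap in an estate-tax series
series = CuratedSeries(source="UK estate-tax records (hypothetical)",
                       values={1900: 0.70, 1920: None})
series.adjust(1920, 0.68,
              "filled missing 1920 value",
              "no estate-tax data; bridged from a household survey")

for step in series.provenance:
    print(f"- {step.description}: {step.rationale}")
```

Had each of Piketty’s adjustments carried such a record, reviewers like Giles could have audited the curation directly rather than reverse-engineering it from spreadsheets.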