Written by Julia Neagu
Every quarter since 1964, the U.S. House of Representatives has published the Statement of Disbursements (SOD), a public report of all its receipts and expenditures. The available dataset spans almost 10 years, from Q1 2010 to Q2 2019, and includes 21 attributes and almost 4 million transactions accounting for more than $19B in spend.
Since the data became available online in 2009, nonprofit organizations like the Sunlight Foundation and ProPublica have been collecting the SOD data and converting it into text files to facilitate public analysis and research. Reporters and analysts have widely used the SODs to identify members of Congress with high spending, fact-check spending claims, and track spending habits.
However, gaining timely (and, ideally, actionable, cost-saving) insights from the glut of raw data in these reports remains a challenge. Ideally, we would want to be able to directly tie variations in the overall congressional spend to particular transaction categories as granular as possible.
To perform a more in-depth analysis of congressional spend based on the SODs, I used Tamr to classify the spend before performing some exploratory analytics using Jupyter Notebooks. While this did not magically solve every challenge one might face in deciphering what Congress is spending money on, and where, it was a great learning experience in how the spend data is classified and reported, and in how analysts can begin making sense of it all through rapid data unification and analysis.
An accurate and comprehensive taxonomy is essential for deriving strategic insights from data, such as spending trends and cross-organization comparisons. The lack of a unified and consistent taxonomy (both between different members’ offices and from year to year) can make it impossible to drill down and identify the root cause of a change in spending.
We can infer the taxonomy Congress is relying on by looking at the Category and Purpose columns in the dataset. The Category attribute contains 10 distinct values and is very concentrated: the top three categories (“Rent, communications, utilities”; “Supplies and materials”; and “Travel”) make up over 60% of the total number of transactions. The Purpose attribute can be considered the second level in the taxonomy, but it is extremely fragmented, with almost 6,000 distinct entries.
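A quick way to check this kind of concentration and fragmentation is to count values per column. The following is a minimal sketch on a handful of made-up rows, not the real SOD data; the column names `CATEGORY` and `PURPOSE` and the helper functions are assumptions chosen for illustration.

```python
from collections import Counter

# Toy rows standing in for SOD transactions; the values are illustrative only.
transactions = [
    {"CATEGORY": "Rent, communications, utilities", "PURPOSE": "POSTAGE/COURIER/BOX RENTAL"},
    {"CATEGORY": "Rent, communications, utilities", "PURPOSE": "POSTAGE / COURIER / BOX RENTAL"},
    {"CATEGORY": "Rent, communications, utilities", "PURPOSE": "TELECOM SVC, EQUIP & TOLLS"},
    {"CATEGORY": "Supplies and materials", "PURPOSE": "OFFICE SUPPLIES"},
    {"CATEGORY": "Supplies and materials", "PURPOSE": "OFFICE SUPPLIES"},
    {"CATEGORY": "Travel", "PURPOSE": "COMMERCIAL TRANSPORTATION"},
]


def top_share(rows, column, n):
    """Fraction of rows covered by the n most frequent values in `column`."""
    counts = Counter(row[column] for row in rows)
    covered = sum(count for _, count in counts.most_common(n))
    return covered / len(rows)


def distinct_values(rows, column):
    """Number of distinct values appearing in `column`."""
    return len({row[column] for row in rows})


print(f"Top-2 category share: {top_share(transactions, 'CATEGORY', 2):.0%}")
print(f"Distinct purposes: {distinct_values(transactions, 'PURPOSE')}")
```

A high top-n share combined with a large distinct count at the next level is the pattern described above: a concentrated first level sitting on top of a fragmented second level.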
For Rent, communications, utilities transactions alone there are almost 60 distinct purposes, and multiple seem to overlap (e.g. “POSTAGE / COURIER / BOX RENTAL” and “POSTAGE/COURIER/BOX RENTAL”, “TELECOMSRV/EQ/TOLL CHARGE” and “TELECOM SVC, EQUIP & TOLLS”). This indicates discrepancies in how the transactions were recorded, and the lack of a mutually exclusive, collectively exhaustive (MECE) taxonomy.
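Some of these overlaps are purely typographical and can be surfaced with simple string normalization. The sketch below is one possible approach and not the method used in this analysis; the function names are hypothetical, and the only inputs are the four purpose strings quoted above.

```python
import re
from collections import defaultdict


def normalize_purpose(purpose):
    """Uppercase, remove spaces around slashes, and collapse repeated whitespace."""
    text = purpose.upper().strip()
    text = re.sub(r"\s*/\s*", "/", text)
    text = re.sub(r"\s+", " ", text)
    return text


def group_variants(purposes):
    """Map each normalized purpose to its raw spellings, keeping only real duplicates."""
    groups = defaultdict(set)
    for purpose in purposes:
        groups[normalize_purpose(purpose)].add(purpose)
    return {key: sorted(raw) for key, raw in groups.items() if len(raw) > 1}


purposes = [
    "POSTAGE / COURIER / BOX RENTAL",
    "POSTAGE/COURIER/BOX RENTAL",
    "TELECOMSRV/EQ/TOLL CHARGE",
    "TELECOM SVC, EQUIP & TOLLS",
]

for normalized, variants in group_variants(purposes).items():
    print(normalized, "<-", variants)
```

Normalization merges the two postage spellings but leaves the two telecom entries apart, because they differ in wording rather than spacing. Cases like that are where rule-based cleanup runs out and a learned classification becomes useful.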
In short, deriving any reliable analytical insights from this data set first requires a more effective spend classification using a unified taxonomy. As a starting point, I used one of our standard five-level indirect spend taxonomies, with 10 top-level categories.
I used Tamr to classify transactions into this taxonomy, using three attributes from the dataset: Category, Purpose, and Payee. Tamr’s supervised learning algorithms use user-generated training data (in this case, transaction classifications) to generate the most likely taxonomy classification for each transaction based on these attributes. While incomplete data (e.g. missing categories or purposes) makes the transaction classification more difficult, Tamr can classify transactions up to a certain level in the taxonomy tree, depending on the confidence level (see below).
The main challenge in transaction classification is the lack of complete information in the original dataset. For example, over 130,000 transactions only feature “OTHER SERVICES” as the original transaction category, with no other information about the use of the services. There is therefore an upper bound to the granularity of the classification, as these transactions would only be classified up to the 2nd level in the taxonomy (Office Supplies and Services – Office Services).
For this particular example, I trained the model using my best guess as to the correct transaction classification in order to generate labeled training data. Validation from a congressional spend expert would be required to guarantee the accuracy of the classification model and its results.
The suggestedClassificationPathAboveThreshold column in the classified dataset contains Tamr’s best guess for the category in which a transaction should be placed, up to a taxonomy level that depends on a specified confidence level. For this particular dataset, transactions were classified up to level 5 in the taxonomy, which allows us to analyze the spend in more detail.
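To make the idea of a confidence-dependent depth concrete, here is a minimal sketch of one way such a path could be derived. It is an illustration only and not Tamr’s implementation: the function name, the per-level confidence scores, and the third-level label “Printing” are all hypothetical.

```python
def path_above_threshold(scored_path, threshold):
    """Return the longest root-to-leaf prefix whose levels all meet the threshold.

    `scored_path` is a list of (label, confidence) pairs ordered from the
    top-level category down to the most specific one.
    """
    kept = []
    for label, confidence in scored_path:
        if confidence < threshold:
            break  # stop at the first level the model is not confident about
        kept.append(label)
    return kept


# Hypothetical scores for a transaction recorded only as "OTHER SERVICES".
example = [
    ("Office Supplies and Services", 0.95),
    ("Office Services", 0.85),
    ("Printing", 0.40),  # made-up level-3 label with low confidence
]

print(path_above_threshold(example, threshold=0.8))
```

With a threshold of 0.8, this example stops at the second level, which mirrors how a transaction with little descriptive information ends up with a shallow classification.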
Between Q1 2010 and Q2 2019, the House of Representatives spent over $11B on Human Resources, IT, and Office Supplies and Services.
Of the $5.3B spent on Human Resources, almost all was spent on personnel compensation (salary & benefits), with negligible amounts directed towards consulting and training.
A more in-depth analysis could classify the salary spend at a more granular level, updating the taxonomy to take into account the particular positions that are being compensated (listed under PURPOSE). We notice, however, that over $1.5B (40%+) would not be broken down further, as it lacks further information about the compensation’s purpose or payee; this may be either because of how the original transaction data was recorded, or because of errors in the data parsing.
In Office Supplies and Services ($1.21B), only 30% of the spend ($364M) was classified up to L3 or above. This means that roughly 70% of the spend in this category came from transactions that contained no information beyond the fact that they were either a service or a supply.
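The share of spend classified to a given depth can be computed directly from the classified paths and amounts. The sketch below uses made-up rows and assumes each path is available as a list of taxonomy levels, which may not match how the classified dataset actually stores it; the function name and the “Printing” label are hypothetical.

```python
def share_classified_to_depth(rows, min_depth):
    """Fraction of total spend whose classification path has at least `min_depth` levels."""
    total = sum(amount for _, amount in rows)
    if total == 0:
        return 0.0
    deep = sum(amount for path, amount in rows if len(path) >= min_depth)
    return deep / total


# Toy (path, amount) rows; amounts are illustrative, not taken from the SODs.
rows = [
    (["Office Supplies and Services", "Office Services", "Printing"], 30.0),
    (["Office Supplies and Services", "Office Services"], 50.0),
    (["Office Supplies and Services"], 20.0),
]

print(f"Classified to L3 or deeper: {share_classified_to_depth(rows, 3):.0%}")

# The same ratio using the figures quoted in the text: $364M out of $1.21B.
print(f"Reported share: {364 / 1210:.0%}")
```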
For those of you who want to keep exploring the data, I have shared the Tamr-classified dataset, as well as a Jupyter notebook for the analytics, on Tamr’s GitHub page.
Now that we have classified the data and performed some initial exploratory analysis, I would be curious to hear what kind of analytics and models you would like to see in upcoming blog posts. For example, I am curious to look at year-over-year trends, especially with the goal of weeding out any inconsistencies in transaction reporting.