Matching Imperfect Data: String Similarity

blot jpg

Any Big Data effort that brings together data from multiple sources must determine when records in different sources are referring to the same real-world entity. This process is called record matching, and it needs to be accomplished even when the sources do not share identifiers or data representations. Furthermore, real-world data is often imperfect, and inconsistency or errors in data entry or data processing make this task even more challenging, as they can lead to multiple, different representations of an entity even within a single source. Rule-based matching just isn’t effective in these scenarios – it would take a vastly unmanageable number of rules to cover the full variety of data representations, and a fixed set of deterministic rules can’t handle data varietals that have never been seen before. Probabilistic or fuzzy matching is more forgiving of data inconsistencies, and therefore a very effective tool enabling record matching to achieve much, much higher recall at some expense of precision. Probabilistic matching of text utilizes string similarity functions, which compare two pieces of text and produce a single similarity metric.

Box again

Record Matching: String Similarity in Amazon/Google Product Catalogs

Product records from Amazon and Google

To illustrate the use of string similarity in matching, we can look at an example from the Amazon / Google Products Catalog entity resolution benchmark data set. In the “perfect mapping,” there is a match between an Amazon product titled “evergirl” and a Google product named “evergirl: pc cd-rom video game.” To find this match probabilistically, we need to break the strings down into smaller pieces, and compare the pieces. The two  main methods of string decomposition are n-grams and tokens. Tokenization is more geared toward linguistic text, so it better meets our needs. Tokenization is an art unto itself, especially across multiple languages, and we at Tamr are delighted that others have taken on this challenge and built successful businesses around it.

Now we need to compare token lists. Just looking at which tokens they share isn’t enough:they have only one token in common, but the second has many tokens in common with other pc cd-rom video games, and it should match those less well. The token “evergirl” is much more important than the other tokens, so we want a similarity function that takes this into account. The most common measure of importance of a token is called Inverse Document Frequency (IDF), illustrated below, which is a measure of how unusual a term is in a given corpus. This is an approximation of importance, but it does well. IDF is the log of the number of strings in the corpus, divided by the number of strings that contain the token. The entire corpus contains 4589 records, two of which contain “evergirl”, so its IDF is log(4589/2) = 3.36. In contrast, “cd” appears in 189 documents, so its IDF is 1.39. By this measure, “evergirl” is about 3 times as important as “cd.”

Graph

Comparing Token Lists: Inverse Document Frequency

We now need to combine the IDFs of the 6 different tokens across the two strings into a single measure of similarity. There are several ways to do this; one popular method is to use cosine similarity. We treat each string as a vector in a vastly multi-dimensional space, in which the length of the vector in the nth dimension is the IDF of the nth token. The cosine of the angle between the two vectors is a good similarity metric for several reasons: it accommodates strings of different length, it is 1 for identical vectors and 0 for orthogonal vectors, and we can compute it efficiently using the vector dot product. Using this metric, the similarity between “evergirl” and “evergirl: pc cd-rom video game” is 0.644.

Arrow

Cosine Similarity

It’s always good to check your work; let’s check cosine similarity between “evergirl: pc-cdrom video game” and another string that overlaps on more, but less interesting tokens: “dora the explorer: pc cd-rom video game.” We’d like the similarity of these two to be less than 0.644, since they don’t overlap on high-IDF terms; and, indeed, their similarity is 0.499, as desired.

Of course, probabilistic record matching might be imperfect in some cases. It will probably find the right matches, but probability is not a guarantee. By combining multiple probabilistic methods with the continuous feedback of active machine learning and expert sourcing, we can minimize the downside, while gaining the enormous benefit of a system that is able to perform reliable record matching while automatically accommodating variety, inconsistency and imperfections in data.