August 4, 2017

How Tamr’s Entity Resolution Capability Helps Border Security

Indian film actor Shah Rukh Khan

Shah Rukh Khan is one of the biggest stars in Bollywood. He was worth ~$600M in 2014, putting him just ahead of Tom Cruise and George Clooney. Despite having one of the most recognized faces in the world, the TSA sometimes fails to recognize him and detains him. Each time, he is questioned for a few hours before being released. He then tweets wryly about the experience to his 20M followers, and it becomes a minor diplomatic incident.

In 2009, he was detained at Newark on his way to screen a film about racial profiling. The irony did not amuse his Twitter followers. In 2012, he was detained on his way to give the commencement address at Yale. He joked to the audience the next day, “Whenever I start feeling too arrogant about myself, I always take a trip to America.” In 2016, when detained at LAX, the clearly annoyed Khan tweeted, “I fully understand & respect security with the way the world is, but to be detained at US immigration every damn time really really sucks.” He followed up a few minutes later by tweeting, “The brighter side is while waiting caught some really nice Pokemons.”

Khan seems to be able to laugh at these mistakes, but misidentifications by the computer systems that govern our borders have a real cost. They frighten thousands of innocent people each year as a human screener tries to figure out whether these travellers are terrorists. Conversely, screenings sometimes fail to identify drug smugglers as they drive across the Mexican border. Perfection is impossible, of course, but every percentage point of accuracy matters in border identification.

When it comes to building the most accurate models for screening, machine learning is king. ML powers the Google search bar that is so uncannily accurate it seems to read the user’s mind. It has enabled the speech recognition that drives Alexa and Siri. Computers powered by ML have conquered chess, Jeopardy, and Go. Now, U.S. Customs and Border Protection (CBP) is using ML to improve traveller recognition at the border.

The problem is straightforward to set up. An individual presents at the border with certain attributes: a name, a document ID, an itinerary, and, if you’re lucky, some biometric data. The goal is to match that traveller to all past trips that they’ve taken (and past interactions that they’ve had at the border) and to internal sources that identify good guys and bad guys. It’s a straightforward question of entity resolution. Fortunately, applying machine learning to entity resolution is what Tamr was founded to do. We love this problem.

What makes the problem hard is the asymmetry of the information. The whitelist entry for Shah Rukh Khan is richly populated. But at the border, the information about the traveller may be limited. Tamr’s approach is two-fold. First, we enrich the traveller data as much as possible before making a final determination. If presented with a traveller with a common name, like “Juan Garcia”, we can start by at least aggregating all the trips taken by that passport number to see if there is any helpful historical information or if the travel pattern itself is an interesting signal.

Second, we dig as much signal as possible out of what we have. If all we have is a name, we analyze that name from many angles. We use trigrams to work around typos. We use a metaphone package to create a set of tokens that represent the sounds of that name. We weight tokens by “inverse document frequency” to put more emphasis on the name “Thorbjørn” than the name “Matt”. We use regional knowledge of how names are constructed to parse names correctly.

These approaches are not applied sequentially but in parallel. We feed each approach as a separate signal into the machine learning classifier, and we let the machine decide what to do with the extra data. Tamr’s ML-based approach excels when given various signals as long as there’s enough training data to help the machine build the model.
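To make the idea of parallel name signals concrete, here is a minimal Python sketch of two of them: trigram overlap and IDF-weighted token overlap. It is an illustration under stated assumptions, not Tamr’s implementation. The function names, the Jaccard overlap measure, and the smoothed IDF formula are all choices made for this example, and the metaphone and regional name-parsing signals are left out.

```python
import math
from collections import Counter


def trigrams(name: str) -> set[str]:
    """Character trigrams of a lowercased name, padded so word edges count."""
    padded = f"  {name.lower().strip()} "
    return {padded[i:i + 3] for i in range(len(padded) - 2)}


def trigram_similarity(a: str, b: str) -> float:
    """Jaccard overlap of trigram sets; a small typo changes only a few trigrams."""
    ta, tb = trigrams(a), trigrams(b)
    return len(ta & tb) / len(ta | tb)


def idf_weights(names: list[str]) -> dict[str, float]:
    """Smoothed inverse document frequency per name token (assumed formula)."""
    n = len(names)
    doc_freq = Counter(tok for name in names for tok in set(name.lower().split()))
    return {tok: math.log((1 + n) / (1 + df)) + 1.0 for tok, df in doc_freq.items()}


def idf_overlap(a: str, b: str, idf: dict[str, float], default: float = 1.0) -> float:
    """Share of total IDF weight carried by the tokens both names have in common."""
    ta, tb = set(a.lower().split()), set(b.lower().split())

    def weight(tokens: set[str]) -> float:
        # Tokens never seen in the reference corpus fall back to `default`.
        return sum(idf.get(tok, default) for tok in tokens)

    total = weight(ta | tb)
    return weight(ta & tb) / total if total else 0.0


def name_signals(a: str, b: str, idf: dict[str, float]) -> list[float]:
    """Each signal stays a separate feature; a classifier would learn the weights."""
    return [trigram_similarity(a, b), idf_overlap(a, b, idf)]


# Hypothetical reference corpus, used only to estimate token frequencies.
corpus = ["Matt Smith", "Matt Jones", "Matt Garcia", "Thorbjørn Smith"]
idf = idf_weights(corpus)
features = name_signals("Shahrukh Khan", "Shah Rukh Khan", idf)
```

In this sketch the two scores are returned side by side rather than blended into one number, which mirrors the point above: the classifier, not a hand-tuned rule, decides how much each signal counts.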

At Tamr, we’re convinced that machine learning is the right tool for tackling entity resolution problems, and our experience working with Customs and Border Protection has reinforced that belief. As the U.S. deploys machine learning more widely at its borders, we hope that Shah Rukh Khan and the thousands of other travellers who are wrongfully detained will have smooth and uninterrupted trips through the U.S. border.