Tamr Patents

The Details Behind the Innovation

Our commitment to innovation extends beyond mere words. We're inspired to bring cutting-edge solutions that transform industries and make a tangible difference. Our extensive portfolio of patents is a testament to this commitment. These patents demonstrate our relentless pursuit of original ideas. They stand as proof of our continuous innovation, our propensity to challenge the status quo, and our resolve to drive the future.

Insights

Answer the questions “what changed?” and “how accurate is my ML model?”
Review and Curation of Record Clustering (1 patent in family)
US Patent: US11321359B2

One of the most basic questions about a new version of a data product, and one of the most difficult to answer effectively, is “what has changed since the last version?” Helping our customers answer this question is what drove us to innovate a way to perform automatic, reliable survivorship at scale, which, in turn, enables the automated creation and maintenance of TamrIDs, one of our most popular capabilities. Automated survivorship and stable identifiers make it possible to visualize and review data changes at scale, whether those changes result from changes to source data, to machine learning models, or to human curation activity. By answering the question “what has changed,” we also enable data consumer feedback on those changes.
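
As a rough illustration of why stable identifiers matter for answering “what has changed,” the sketch below compares two versions of a data product keyed by a persistent ID. It is a generic dictionary diff, not Tamr's patented method; the function name `diff_versions` and the sample IDs and records are hypothetical.

```python
from typing import Any, Dict, List

Record = Dict[str, Any]


def diff_versions(old: Dict[str, Record], new: Dict[str, Record]) -> Dict[str, List[str]]:
    """Compare two versions of a data product keyed by a stable identifier.

    Assumption: each key is a persistent ID that survives across versions,
    so the same key in both versions refers to the same real-world entity.
    """
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    changed = sorted(k for k in old.keys() & new.keys() if old[k] != new[k])
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical sample data for two versions.
v1 = {"id-1": {"name": "Acme Corp"}, "id-2": {"name": "Globex"}}
v2 = {"id-1": {"name": "Acme Corporation"}, "id-3": {"name": "Initech"}}

print(diff_versions(v1, v2))
# {'added': ['id-3'], 'removed': ['id-2'], 'changed': ['id-1']}
```

Without stable identifiers, the same comparison would require re-matching every record between versions before any change could be reported.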

Unbiased Cluster Accuracy Metrics (1 patent in family)
US Patent: US11294937B1

Obtaining reliable, cost-effective insight into your model’s accuracy is a significant challenge for machine learning-powered data unification. Tamr aimed to solve this challenge by examining state-of-the-art methods for assessing accuracy. We found that each method had significant deficiencies and introduced biases. These findings inspired us to create a novel, yet straightforward, method of estimating overall accuracy given a very small amount of human input.
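
To make the idea of estimating accuracy from a small amount of human input concrete, here is a minimal sketch of a generic sampling estimate: draw a uniform random sample, have an expert judge each sampled item, and report the observed proportion with a normal-approximation interval. This is a textbook technique shown for illustration only, not the method described in the patent. The function `estimate_accuracy` and the `is_correct` callback (standing in for an expert's judgment) are assumptions.

```python
import math
import random
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def estimate_accuracy(
    items: Sequence[T],
    is_correct: Callable[[T], bool],
    sample_size: int,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Estimate accuracy from a uniform random sample of expert judgments.

    Returns (estimate, lower, upper), where the bounds are an approximate
    95% interval based on the normal approximation to the binomial.
    """
    if not items or sample_size <= 0:
        raise ValueError("need at least one item and a positive sample size")
    rng = random.Random(seed)
    sample = rng.sample(list(items), min(sample_size, len(items)))
    n = len(sample)
    p = sum(1 for item in sample if is_correct(item)) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Hypothetical population: 1,000 items, of which every tenth one is wrong.
population = list(range(1000))
estimate, lower, upper = estimate_accuracy(population, lambda i: i % 10 != 0, 100)
print(f"estimated accuracy {estimate:.2f} (95% interval {lower:.2f} to {upper:.2f})")
```

Sampling uniformly at random is what keeps this estimate from inheriting the selection biases that arise when reviewers only look at items the model flagged.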

Curation

Curate large, diverse data sets at scale
Large Scale Data Curation (3 patents in family)
US Patent: US10929348B2
US Patent: US9542412B2
US Patent: US11500818B2

When Tamr was founded at MIT, technologies existed to address two out of the three V’s of big data. Specialized stores such as Vertica addressed volume and streaming frameworks such as Kafka addressed velocity, but no technology existed to address data variety. Recognizing that human effort alone is cost-prohibitive, and machine learning alone ranges from unreliable and misleading to biased and blatantly incorrect, Tamr explored a new approach. Early results indicated that closely coupling machine learning with subject-matter supervision could yield the efficiency of a machine learning-powered workflow with the trust and reliability of a human-powered workflow. As a result, Tamr collaborated with customers to deliver on the promise of cost-effective accuracy by developing a practical system and method for performing data curation that could scale to millions and billions of records (or more).

Curation with Version Control (1 patent in family)
US Patent: US11042523B2

Tamr is a pioneer in the area of curation with version control. By describing how to incorporate manual data curation into a versioned data product in a principled way, Tamr addresses concerns such as understanding what data, model, and curation changes come together to create a version of a data product. Tamr also addresses how to manage and maintain the benefits of manual curation across changes to models, the data, and its structure.

Reusing Transformations for Evolving Schema Mapping (2 patents in family)
US Patent: US11003636B2
US Patent: US10860548B2

When it comes to data curation, human input is the most expensive aspect. To address this cost issue, Tamr delivered innovations that enable its platform to take advantage of human input outside of the context in which it was originally provided. For example, when a human describes how to modify source data to fit a unified schema, the platform can reuse that description for other source data, even if the other source data differs from the original in content and/or structure.
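
One way to picture this kind of reuse, as a simplified sketch rather than the patented mechanism: write the transformation once against unified attribute names, and let a per-source column mapping absorb the structural differences. The helpers `to_unified` and `clean_phone`, the column names, and the sample rows are all hypothetical.

```python
from typing import Dict


def to_unified(row: Dict[str, str], column_map: Dict[str, str]) -> Dict[str, str]:
    """Rename source columns to unified attribute names using a per-source mapping."""
    return {unified: row[source] for source, unified in column_map.items() if source in row}


def clean_phone(record: Dict[str, str]) -> Dict[str, str]:
    """A transformation written once against the unified schema: keep digits only."""
    digits = "".join(ch for ch in record.get("phone", "") if ch.isdigit())
    return {**record, "phone": digits}


# Two hypothetical sources with different column names and phone formats.
crm_row = {"Phone Number": "(617) 555-0100", "Company": "Acme"}
erp_row = {"tel": "617.555.0199", "vendor_name": "Globex"}
crm_map = {"Phone Number": "phone", "Company": "name"}
erp_map = {"tel": "phone", "vendor_name": "name"}

print(clean_phone(to_unified(crm_row, crm_map)))
# {'phone': '6175550100', 'name': 'Acme'}
print(clean_phone(to_unified(erp_row, erp_map)))
# {'phone': '6175550199', 'name': 'Globex'}
```

The human effort goes into `clean_phone` once; adding a third source only requires a new column mapping.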

Feedback

Capture human feedback in context and use it to train the model
In-Situ Data Issue Reporting, Presentation, and Resolution (1 patent in family)
US Patent: US1081736B1

Tamr recognized that users are more likely to provide feedback on data when they can do so from within the application they are using. Tamr’s innovations provide an interface for immediate feedback that gathers contextual information about the data in view (including the dataset and version, filters applied, and selected data elements) from within a web page, spreadsheet, or visualization platform. This approach not only makes it simpler for users to provide feedback, but it also makes it easier for curators to view that feedback in context and take corrective action. These same techniques also provide users with greater visibility and insight into data quality, including whether or not the data they are viewing has open issues.

Using Clusters to Train Supervised Entity Resolution (1 patent in family)
US Patent: US11049028B1

Data experts are excellent at correcting the output from the system. But traditional machine learning methods are unable to use these corrections to improve the model. Tamr’s patent describes an innovation that translates the feedback that data experts provide into input that the machine learning model requires. This translation ensures that the training remains unbiased, stable, and durable across changes to the data and to the model.
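
A simplified sketch of what such a translation can look like, assuming the model is a pairwise classifier: expert-verified clusters are expanded into labeled record pairs, with matching pairs drawn from within a cluster and non-matching pairs drawn across clusters. This is a generic illustration, not the patented technique, and it does not attempt the bias and stability guarantees described above. The function `pairs_from_clusters` and the record IDs are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple

LabeledPair = Tuple[str, str, bool]


def pairs_from_clusters(clusters: Dict[str, List[str]]) -> List[LabeledPair]:
    """Turn expert-verified clusters into labeled record pairs.

    Records in the same cluster become matching pairs (True);
    records in different clusters become non-matching pairs (False).
    """
    labeled: List[LabeledPair] = []
    for members in clusters.values():
        for a, b in combinations(sorted(members), 2):
            labeled.append((a, b, True))
    for c1, c2 in combinations(sorted(clusters), 2):
        for a in clusters[c1]:
            for b in clusters[c2]:
                labeled.append((min(a, b), max(a, b), False))
    return labeled


# Hypothetical expert-verified clusters.
verified = {"c1": ["r1", "r2"], "c2": ["r3"]}
print(pairs_from_clusters(verified))
# [('r1', 'r2', True), ('r1', 'r3', False), ('r2', 'r3', False)]
```

Enumerating every cross-cluster pair grows quadratically, so a practical system would sample or restrict the non-matching pairs rather than generate them all.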

AI/ML Mastering

Make meaningful connections and translate them into active learning
Scalable Binning for Big Data Deduplication (2 patents in family)
US Patent: US10613785B1
US Patent: US11204707B2

One of the greatest dangers that arises when you are matching a record against a corpus of millions, or even billions, of other records is wasting time comparing records that have nothing to do with each other. To tackle this challenge, Tamr has introduced several innovations, the first of which enables the machine to quickly focus on comparing records with meaningful relevance, making large-scale deduplication feasible and practical.
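
The general idea of binning can be sketched as follows: assign each record to a bin by a cheap key, then compare only records that share a bin. The key used here (the first three letters of the lowercased name) is a deliberately naive assumption for illustration, and the sketch is not the binning scheme described in the patents. The names `bin_key` and `candidate_pairs` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, List, Tuple


def bin_key(record: Dict[str, object]) -> str:
    """Hypothetical binning key: first three letters of the lowercased name."""
    return str(record["name"]).strip().lower()[:3]


def candidate_pairs(records: List[Dict[str, object]]) -> List[Tuple[int, int]]:
    """Return only the record-ID pairs that share a bin."""
    bins: Dict[str, List[int]] = defaultdict(list)
    for rec in records:
        bins[bin_key(rec)].append(int(rec["id"]))
    pairs: List[Tuple[int, int]] = []
    for ids in bins.values():
        pairs.extend(combinations(sorted(ids), 2))
    return sorted(pairs)


# Hypothetical sample records.
records = [
    {"id": 1, "name": "Acme Corp"},
    {"id": 2, "name": "ACME Corporation"},
    {"id": 3, "name": "Globex"},
    {"id": 4, "name": "Globex Inc"},
]

print(candidate_pairs(records))
# [(1, 2), (3, 4)]
```

Four records would yield six pairs under exhaustive comparison; binning reduces that to two. The savings grow rapidly with corpus size, since exhaustive comparison is quadratic in the number of records.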

System for Scalable Hierarchical Classification Using Blocking and Active Learning Method (2 patents in family)
US Patent: US10803105B1
US Patent: US11232143B1

Tamr’s next innovations in the area of AI/ML mastering describe an approach to hierarchical classification that enables the practical use of large taxonomies, such as UNSPSC, which contains over 100,000 categories. The first focuses on how the system is trained. Tamr uses training data multiple times, in multiple ways, which dramatically reduces the amount of training required to achieve a high level of accuracy. The second innovation enables the machine to quickly focus on meaningful categories when applying the model so it can find a best-match category for a given record.

Geospatial Binning (1 patent in family)
US Patent: US10877948B1

When it comes to data matching (“conflation” in GIS terms), geospatial features such as roads, building footprints, and points of interest can provide strong signals for matched records. But traditional methods require dedicated geospatial databases, which often force trade-offs in accuracy. Tamr’s approach to this challenge is to compute similarity, even between disparate feature types such as points of interest and building footprints, while simultaneously avoiding accuracy trade-offs caused by projection. This approach is proven to scale to extremely large feature datasets.
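
As a small illustration of comparing disparate feature types without projecting coordinates onto a plane, the sketch below uses the standard haversine formula to measure great-circle distance directly on latitude and longitude, then takes the distance from a point of interest to the nearest vertex of a building footprint. This is a generic example and not the patented approach. The nearest-vertex shortcut is an approximation (it ignores footprint edges), and the function names and coordinates are hypothetical.

```python
import math
from typing import List, Tuple

EARTH_RADIUS_KM = 6371.0  # mean Earth radius

LatLon = Tuple[float, float]


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two latitude/longitude points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def point_to_footprint_km(point: LatLon, footprint: List[LatLon]) -> float:
    """Approximate distance from a point to a footprint as distance to its nearest vertex."""
    return min(haversine_km(point[0], point[1], v[0], v[1]) for v in footprint)


# Hypothetical point of interest and building footprint (latitude, longitude).
poi = (42.3601, -71.0942)
footprint = [(42.3600, -71.0945), (42.3600, -71.0940), (42.3603, -71.0940), (42.3603, -71.0945)]

print(f"{point_to_footprint_km(poi, footprint) * 1000:.1f} m to nearest footprint vertex")
```

A distance like this could feed a similarity score between a point record and a footprint record, with a threshold chosen for the data at hand.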

Active Learning When Using Clusters for Supervised ML (1 patent in family)
US Patent: US11416780B1

Active learning is a way to select the questions for experts whose answers will have a disproportionately positive impact on the accuracy of machine learning. Tamr’s innovation on learning from clusters translates expert feedback into machine learning input. This innovation goes the other way, translating the technical needs of the active learning system into practical questions that a data expert can answer. Doing so enables the system to build an extremely accurate model from data expert responses to an astonishingly small number of questions.
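
The question-selection half of active learning can be sketched with uncertainty sampling, a common generic strategy: ask the expert about the items the model is least sure of. This is an illustration of the general idea only, not the patented method, and it does not cover the step of turning those items into cluster-level questions. The function `select_questions`, the pair IDs, and the scores are hypothetical.

```python
from typing import List, Tuple


def select_questions(scored_pairs: List[Tuple[str, float]], k: int) -> List[str]:
    """Pick the k pairs whose predicted match probability is closest to 0.5.

    Assumption: each score is the model's probability that the pair is a match,
    so scores near 0.5 mark the pairs the model is least certain about.
    """
    ranked = sorted(scored_pairs, key=lambda item: abs(item[1] - 0.5))
    return [pair_id for pair_id, _ in ranked[:k]]


# Hypothetical model scores for four candidate pairs.
scores = [("a-b", 0.98), ("c-d", 0.52), ("e-f", 0.05), ("g-h", 0.40)]

print(select_questions(scores, 2))
# ['c-d', 'g-h']
```

Pairs the model already scores near 0 or 1 are skipped, so expert time goes to the answers most likely to change the model.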