Unify on the Go
Data unification is a major challenge for enterprises in every industry. Data scientists spend a majority of their time locating data, unifying it from multiple sources, and cleaning it, all before they can begin their real work. This is where Tamr Unify comes in, enabling companies to unify siloed datasets at large scale.
Unify is built on our core belief that machines can do incredible things, but human feedback is irreplaceable. Feedback from the field indicated strong demand for a high-level Python library for interacting with Unify programmatically. We’ve created this library and called it the Python Client for Tamr Unify (tamr-unify-client on PyPI). Two core groups will benefit immediately from the Python Client: data scientists who want to produce robust, trusted analytics, and IT personnel who want to automate Unify’s workflows.
The engineering team at Tamr dedicated their efforts to developing a Python API client, which is now public on GitHub. According to Pedro Cattori, the lead contributor and maintainer of the Python Client:
The purpose of Tamr’s Python API client is to make simple use-cases easy and complex use-cases possible by providing a high-level abstraction on top of the Unify API. Specifically, the API clients let users interact with concepts like “project”, “dataset”, “model”, “operation” instead of the raw API endpoints and JSON formats.
Unify on the Go at Work
To demonstrate how easy it is to enrich your data with Unify’s data using Python (saving data scientists weeks of time on data cleaning), consider the following example.
A data scientist is working on a new problem–for instance, looking at the opioid prescriptions of health providers. The task is to investigate suspicious health providers who overprescribe opioid medicine and understand where their payments come from.
The enterprise already has payment data from health insurance claims or open payment data. These input sources are unified into a single view that contains every payment for each health provider. This view is referred to as unified payment data.
While exploring opioid prescriptions, the data scientist finds an external dataset that may add value to the existing source datasets. They could add this new dataset to the unified enterprise data through Tamr Unify. Alternatively, they could first assess the value of the new dataset quickly, through simple API calls using the Python API client, before adding it to the unified enterprise view.
First, the data scientist does some exploratory data analysis (EDA) on the opioid prescription dataset and focuses on a list of physicians who might overprescribe opioid medicine. Their next move is to look at where the payments came from. This analysis helps explain the root cause of the overprescription pattern, such as whether there is a link between a payment transaction and an industry sponsor.
To quickly enrich data from unified payment data without performing data unification first, the Low Latency Match (LLM) service comes into play. LLM allows quick record matching and avoids waiting for data unification tasks to finish. The Python Client makes it convenient to run LLM from a Python environment such as Jupyter Notebook. Data scientists already use libraries like pandas, NumPy, and NLTK in their daily work. Now they can add tamr-unify-client to their tool belt to unify clean, trusted data quickly and collaboratively.
Below are some example code snippets showing how to use the tamr-unify-client Python package.
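As a rough illustration of the enrichment step in the scenario above, the sketch below matches a short list of physicians against unified payment records and totals each physician's payments by sponsor. It is a local stand-in, not Tamr's method: it does not call the Low Latency Match service, and it uses Python's standard-library difflib for the fuzzy name comparison. The field names (provider, sponsor, amount), the 0.85 threshold, and the sample rows are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase a name and drop commas, periods, and extra whitespace."""
    return " ".join(name.lower().replace(",", " ").replace(".", " ").split())


def best_match(name, candidates, threshold=0.85):
    """Return the candidate most similar to `name`, or None if none reaches the threshold."""
    target = normalize(name)
    scored = [
        (SequenceMatcher(None, target, normalize(c)).ratio(), c) for c in candidates
    ]
    score, candidate = max(scored, default=(0.0, None))
    return candidate if score >= threshold else None


def payments_by_sponsor(physicians, payments, threshold=0.85):
    """Map each matched physician to {sponsor: total amount paid}.

    `payments` is a list of dicts with hypothetical keys
    'provider', 'sponsor', and 'amount'. Unmatched physicians are skipped.
    """
    providers = sorted({p["provider"] for p in payments})
    totals = {}
    for physician in physicians:
        match = best_match(physician, providers, threshold)
        if match is None:
            continue
        by_sponsor = defaultdict(float)
        for payment in payments:
            if payment["provider"] == match:
                by_sponsor[payment["sponsor"]] += payment["amount"]
        totals[physician] = dict(by_sponsor)
    return totals


if __name__ == "__main__":
    # Made-up rows standing in for unified payment data.
    sample_payments = [
        {"provider": "Jane A. Smith", "sponsor": "Acme Pharma", "amount": 1200.0},
        {"provider": "Jane A. Smith", "sponsor": "Beta Labs", "amount": 50.0},
        {"provider": "Robert Jones", "sponsor": "Beta Labs", "amount": 75.0},
    ]
    print(payments_by_sponsor(["jane a smith"], sample_payments))
```

In practice the matching would be delegated to Unify, which has already learned how records relate across sources; the sketch only shows the shape of the enrichment a data scientist is after.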
Tamr’s Python API client is a useful tool that:
- Supports Python and automation: Python is widely used within the data science community. Unify’s Python API client offers the best of both worlds, enabling data scientists to work freely while making it easy to align new and unpredictable data with key business assets.
- Makes it easy and convenient to get started with Unify: Unify’s Python API client allows data scientists to get their hands dirty quickly. Even without prior knowledge of the workflow that generates the end unified dataset, they can leverage the unification power of Unify. This makes Unify accessible to a broader audience, especially data scientists who are new to Unify.
- Enables quick analysis with LLM and record streaming services: LLM is frequently used to perform low latency record matching. Record streaming can be used to fetch records from unified datasets directly. These services allow data scientists to act quickly on unified data without adding new data into Unify.
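To make the record streaming idea more concrete, here is a minimal sketch of consuming a stream of records lazily and summarizing it without loading the whole dataset into memory. The FakeDataset class is a hypothetical stand-in so the example runs without a Unify instance, and the field names (provider_id, sponsor, amount) are assumptions, not Unify's schema.

```python
from collections import Counter


class FakeDataset:
    """Hypothetical stand-in for a client dataset object that streams records."""

    def __init__(self, rows):
        self._rows = rows

    def records(self):
        # Yield one record (a dict) at a time, as a streaming API would.
        yield from self._rows


def stream_matching(records, predicate):
    """Lazily yield only the records for which `predicate` is true."""
    for record in records:
        if predicate(record):
            yield record


def top_sponsors(records, provider_ids, limit=3):
    """Return the `limit` largest (sponsor, total amount) pairs for the given providers."""
    wanted = set(provider_ids)
    totals = Counter()
    for record in stream_matching(records, lambda r: r.get("provider_id") in wanted):
        totals[record["sponsor"]] += record["amount"]
    return totals.most_common(limit)


# With the real client, the iterable would come from the server instead, e.g.
# (assumed from the client's README and docs; verify against your installed version):
#   from tamr_unify_client import Client
#   from tamr_unify_client.auth import UsernamePasswordAuth
#   tamr = Client(UsernamePasswordAuth(username, password), host=host)
#   records = tamr.datasets.by_name("<your unified dataset>").records()
```

Because the records are consumed one at a time, the same function works on a handful of rows in a notebook or on a large unified dataset.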
Tamr’s Python API client is a great complementary tool to our Unify solution.