How Tamr is Helping the Exploration & Production Industry Master Decades of Messy Data

If you are not on the IT side of the energy industry, you may not have heard of the concept of data mastering (often referred to as master data management, or MDM), but those of us who work in other departments have unknowingly been complaining about the lack of it for years. Data mastering is the practice of creating a central data asset and management process that integrates with all the branches of a business to keep data accurate, uniform, and consistent. In the case of oil and gas, master data is the standardized set of records (often called “golden records”) and attributes that detail how information gathered from reservoir, production, drilling, geology, land, and finance will be described and tied together across departments.

As someone who has spent almost two decades in the oil and gas industry, I cannot tell you how much of that time has been spent cleaning and joining datasets for use in reservoir studies, acquisition work, reserve reports, and spur-of-the-moment requests from private equity sponsors. Sanitizing the data for those projects is already time-consuming, but with the demand for production growth and the increased efficiencies in drilling wells, data is being acquired faster than ever before. I joined Tamr because I saw firsthand how hard it has become to master this variety of data and get an accurate view of operations, and I wanted to help solve the problem for the industry.

An Example of the Tamr Process

Let’s consider an example of where Tamr could be useful to your organization. One of my favorite resources is the FracFocus database. It is free, easy to use (a .bak file that you can restore in MS SQL), and comes with query instructions. It is incredibly valuable for any project related to onshore horizontal wells in the US. The database’s main purpose is to keep the public informed of what chemicals are being used in hydraulic fracturing jobs. Not only is the database helpful for identifying public health concerns, it is also advantageous to oil and gas companies because it reveals a good amount of a competitor’s “secret recipe” for completing a well. Specifically, it contains information that is critical to establishing well productivity metrics, which are key for SEC reserve reporting, acquisition and divestiture work, and any project relying on well type curves.

This sounds all well and good, but once you get into the database and start working in it, you find out why it is underutilized – it is an absolute mess. If you can think of a data problem, it is here.

Common issues you will find:

  • Incorrect spellings of chemicals
  • Chemicals being referred to by different names
  • Incorrect state/county associations
  • Correct data in the wrong column
  • Percentages reported as decimals for some wells and as whole numbers for others
  • Outdated operator names for wells that have since changed hands

Tamr uses machine learning (ML) to clean and organize this difficult data. It can tie together the multiple correct names used for the same chemical, group together incorrect spellings, fix incorrect state associations, and perform numeric transformations on the ingredient percentages.
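To make two of those cleanup steps concrete, here is a minimal sketch in Python. It is not Tamr's implementation; it uses simple string similarity from the standard library in place of a trained model, and the function names, the 0.8 cutoff, and the "sum near 1 means fractions" rule are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def to_whole_percent(values):
    """Return one well's ingredient percentages on a 0-100 scale.

    Heuristic (an assumption, not a FracFocus rule): a job's ingredients
    should sum to roughly 100, so a total near 1 suggests the well was
    reported as fractions and needs to be multiplied by 100.
    """
    total = sum(values)
    scale = 100.0 if total <= 1.5 else 1.0
    return [round(v * scale, 4) for v in values]


def group_spellings(names, canonical, cutoff=0.8):
    """Map each raw chemical name to its closest canonical name, or None."""
    mapping = {}
    for raw in names:
        cleaned = raw.strip().lower()
        best, best_score = None, 0.0
        for target in canonical:
            score = SequenceMatcher(None, cleaned, target.lower()).ratio()
            if score > best_score:
                best, best_score = target, score
        # Leave the name unmapped when nothing is similar enough.
        mapping[raw] = best if best_score >= cutoff else None
    return mapping


# Hypothetical sample values, not taken from the real database.
print(to_whole_percent([0.88, 0.11, 0.01]))
print(group_spellings(["Hydrocloric Acid", "water ", "Guar Gum"],
                      ["Hydrochloric Acid", "Water"]))
```

String similarity only catches misspellings. Chemicals that go by genuinely different names would need a lookup on a shared identifier or the kind of learned matching described here.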

Where Tamr differs from a lot of other AI solutions in the E&P industry is the human portion of the software. ML is only as good as the humans who program and use it. Once a user has created associations between columns and Tamr has done some pre-work on the data set, the software generates a series of questions for a company’s subject matter experts (SMEs) to answer, which helps it find more relevant relationships in the data. For a database like FracFocus, a company would probably assign a completion engineer to that task. Within a short time (1-2 hours), the SME can answer enough questions for the software to make sound decisions on its own about how to classify the information. Once that is complete, a clean, mastered version of the data is available for downstream consumption, either through APIs (which can be connected to a data warehouse or tools like Spotfire) or as a CSV file.
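The general idea of asking an expert only about the uncertain cases can be sketched in a few lines of Python. This is an illustration of the human-in-the-loop concept, not Tamr's algorithm; the function names and the 0.9 and 0.5 thresholds are assumptions, and a string-similarity score stands in for a model's confidence.

```python
from difflib import SequenceMatcher
from itertools import combinations


def similarity(a, b):
    """Case-insensitive similarity between two names, from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def triage_pairs(names, auto_match=0.9, auto_reject=0.5):
    """Split candidate name pairs into matched, rejected, and expert questions."""
    matched, rejected, questions = [], [], []
    for a, b in combinations(sorted(set(names)), 2):
        score = similarity(a, b)
        if score >= auto_match:
            matched.append((a, b))      # confident enough to merge
        elif score < auto_reject:
            rejected.append((a, b))     # confident enough to keep apart
        else:
            questions.append((a, b))    # uncertain: route to the SME
    return matched, rejected, questions


# Hypothetical names for illustration.
matched, rejected, questions = triage_pairs(
    ["Hydrochloric Acid", "Hydrocloric Acid", "Water"])
print(matched)
print(len(rejected))
print(triage_pairs(["Guar Gum", "Guar"])[2])
```

In a real workflow the expert's yes/no answers would be fed back to retrain the matcher, so that each round of questions is smaller than the last.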

Tamr is all about uniting multiple datasets, so the next step could easily be taking the now clean version of the FracFocus database and uniting the portion of it that details Pennsylvania wells with the general well header information and production data from the Pennsylvania state data website. Using a process similar to the one used for the fracturing database, this would correct any inconsistencies between the two datasets regarding current operators, well locations, and target formations, and could also help a company estimate the lateral lengths of wells (very important, but often lacking in datasets). What used to take techs and engineers months can now be done with Tamr in a matter of hours.
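As a rough picture of what uniting the two sources involves, the sketch below joins records on the API well number and flags wells where the operator names disagree. It is a simplified stand-in for the mastering process, not how Tamr does it; the record layout, the field names "api" and "operator", and the sample rows are hypothetical.

```python
def normalize_api(api):
    """Reduce an API well number to its 10-digit state-county-well form.

    Strips dashes and drops any trailing sidetrack or event digits.
    """
    digits = "".join(ch for ch in str(api) if ch.isdigit())
    return digits[:10]


def reconcile_operators(frac_rows, state_rows):
    """Join two lists of dict records on API number and flag operator conflicts."""
    state_by_api = {normalize_api(r["api"]): r for r in state_rows}
    merged = []
    for row in frac_rows:
        key = normalize_api(row["api"])
        state = state_by_api.get(key)
        if state is None:
            continue  # well not present in the state dataset
        frac_op = row["operator"].strip().lower()
        state_op = state["operator"].strip().lower()
        merged.append({
            "api": key,
            "frac_operator": row["operator"],
            "state_operator": state["operator"],
            "operator_conflict": frac_op != state_op,
        })
    return merged


# Hypothetical records; 37 is the API state code for Pennsylvania.
frac = [{"api": "37-015-20001-00-00", "operator": "Old Operator LLC"},
        {"api": "37-117-21234", "operator": "Acme Energy"}]
state = [{"api": "3701520001", "operator": "New Operator Inc"},
         {"api": "3711721234", "operator": "ACME ENERGY "}]
for record in reconcile_operators(frac, state):
    print(record)
```

An exact comparison like this only surfaces the conflicts. Deciding which operator name is current, and recognizing that two differently written names are the same company, is the part that benefits from learned matching and expert review.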

Two Use Cases in E&P Where Tamr Can Add Significant Value

I recently attended a Society of Petroleum Engineers (SPE) talk given by a Senior Director from IHS Markit, one of the largest information providers and analyst groups in our industry. The subject of the talk was how technology and digitalization are driving performance in oil and gas companies. IHS Markit sees digitalization as one of the top four trends affecting upstream innovation and technology, and expects it to keep being applied to all elements of oil and gas development and management over the full life cycle of wells. Even if you didn’t hear the talk, you know this is true if you subscribe to any petroleum industry website or magazine – about 2/3 of all articles are about using machine learning to do “X”.

Machine Learning

Machine learning can be incredibly powerful. As technologically behind as E&P is, the industry is making great strides in its use and finding real-world applications. It is becoming less of a black box and is slowly gaining acceptance from upper management. But for ML to work properly, it needs a lot of data…a lot of data that is clean and can have patterns drawn from it. This is where Tamr comes in extremely handy.

Let’s say a company wants to use ML to automate the process of building type curves for its fields and then create a 5-year drill schedule for its proven undeveloped locations (PUDs). More data sources than you would expect will need to be cleaned and mastered. The list of important data types, each with proprietary, public, and third-party sources, would look something like this:

  1. General well header information
  2. Geologic, petrophysical, and geophysical spreadsheets/databases
  3. Well Logs – Same as geologic data, but a company may need to confirm top and bottom picks for formations based on the latest wells
  4. Land/GIS data – You may have a good public source, but it depends on the state
  5. Financial and expense data – Instead of paying a third party, you can find a lot of this information publicly if you are willing to root through SEC filings
  6. Permitting information
  7. Production and pressure databases
  8. Rig availability schedules
  9. Environmental restriction info – more likely with CA, NM, OH, PA, AK
  10. Past reservoir databases – Most likely just your proprietary ones

That is a sizable list! Many of these items will draw on all three kinds of sources (proprietary, public, and third-party), where the same item may be tagged differently, or just differently enough, to cause problems with even getting started on your automation project. Using Tamr to get the data in line, quickly, will be crucial to your success. Just a note: as a chemical and petroleum engineer who has thought “I bet I could build that myself,” I can tell you it is a lot harder than you think.


The other trend that is getting a lot of coverage in industry articles these days is consolidation, which I see continuing over the next 2-3 years. Companies and wealthy investors that currently have access to capital are eyeing once-overvalued companies and assets that are now priced fairly or below market in many parts of the U.S. When a company acquires another company, it inherits a lot of data all at once. It doesn’t matter if the acquired asset is in the same basin as the acquirer or a different basin altogether. Either way, most companies handle their data very differently, and integrating the new asset’s information into the acquirer’s current data set is very difficult and can take years to complete, if it is even started at all. Again, using Tamr, a company can greatly shorten the time it takes to get all the new data cleaned and organized, and can start finding value that the previous owners of the asset may not have realized was there.

Final Thought

To get to where the oil and gas industry wants to go in terms of utilizing the most advanced technologies, it needs to build a solid base from which to do so. Using Tamr as the data mastering solution to combine old, new, and disparate data sets would not only allow E&P to work faster on current automation and large data set mandates, but also put the industry at the forefront of using ML to optimize processes and production in assets ranging from legacy fields to new discoveries.
