With all the buzz around cloud technology, it’s not uncommon to come across a team or organization that is planning, or is already in the process of, migrating data from one or more legacy systems to cloud infrastructure. And this is for good reason. Moving to the cloud offers an opportunity for different consumers (e.g. data scientists, analytics teams) to make use of data in different ways. In fact, this may be exactly what you’re working on: trying to determine which cloud to migrate to, what resources it will require, and how long it will take, among many other factors.
While these are all important considerations, it is critical to think about how you will clean and curate your data before any migration. It’s all too common for organizations to lift and shift their data without first giving thought to improving its quality. So while the data may reside in a new location (the cloud), it is still as unusable as before, and the data consumers still don’t have an intimate knowledge of the data or how it got there.
Why is this? Principally, it’s because the data is still sitting in several places (albeit in the cloud) and has not been organized into logical entities that the business can easily digest.
Think of this issue like moving into a new apartment. Would you pack up the mess in your old place and simply ship it to your new apartment? Of course not. More likely, you would neatly organize and pack the things you plan to bring so that when they get to your new place, they’re easy to unpack and you’ve given yourself a fresh, clean start. The same logic applies to data migrations: figure out what is most important to shift; logically organize, master, and curate it; and then migrate it to the new infrastructure. Taking the time to organize and curate data into logical entities gives you a known baseline to plug into existing applications today and new ones tomorrow.
ETL Alone Cannot Solve Cloud Migration Issues
A typical migration process may rely heavily on an ETL tool to move data from multiple source systems. In addition, ETL services are used to build the required logical entities (customers, accounts, etc.) for reporting, as well as to perform any further processing of the data before it is stored in a downstream data warehouse for consumption. Additional reporting logic may be needed to power downstream analytics and reporting systems, as well as any future migration efforts. Some of the common pitfalls with this strategy are:
Data quality may be poor
The up-front time required to write the ETL logic will cause delays
Too many resources may be required
Time to value will be far too long
The Tamr Advantage
Tamr can be used to generate the logical entities for reporting via accelerated schema mapping and data mastering. This can provide a trusted view of each entity for reporting, combining data from multiple sources, from Day 1. Reporting systems can use this data along with any other data that may not need to go through Tamr (e.g. transactions).
The need for complex additional ETL is greatly reduced thanks to Tamr’s machine learning-based capabilities for reporting and migration.
In this workflow, multiple disparate internal and external data sources are brought together into a landing zone. From here, Tamr’s Schema Mapping provides accelerated entity mapping to reduce the time it takes to align common attributes and build target data models. Once the logical entities are defined, Tamr’s Record Mastering and Golden Records capabilities match and de-dupe the data to provide a new curated layer for the logical entities. These cleansed and mastered datasets are then sent to the new cloud infrastructure, where they eventually feed into downstream reporting and analytics applications.
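To make the shape of this workflow concrete, here is a minimal Python sketch of the three steps: map source attributes to a unified schema, match records that refer to the same entity, and consolidate each match group into a single curated record. It is not Tamr’s API or implementation. The source names, column names, similarity threshold, and the "keep the longest value" consolidation rule are all assumptions made for illustration, and the hand-written mapping and matching rule stand in for what Tamr does with machine learning.

```python
from difflib import SequenceMatcher

# Hypothetical mapping from each source's column names to a unified schema.
SCHEMA_MAP = {
    "crm": {"cust_name": "name", "email_addr": "email", "town": "city"},
    "billing": {"full_name": "name", "email": "email", "city": "city"},
}

def to_unified(source, record):
    """Rename source-specific columns to the unified attribute names."""
    mapping = SCHEMA_MAP[source]
    unified = {mapping[k]: v.strip() for k, v in record.items() if k in mapping}
    unified["source"] = source
    return unified

def is_match(a, b, threshold=0.85):
    """Illustrative rule: same email, or very similar names, means same entity."""
    if a["email"] and a["email"].lower() == b["email"].lower():
        return True
    ratio = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return ratio >= threshold

def cluster(records):
    """Greedy clustering: a record joins the first group that contains a match."""
    clusters = []
    for rec in records:
        for group in clusters:
            if any(is_match(rec, other) for other in group):
                group.append(rec)
                break
        else:
            clusters.append([rec])
    return clusters

def golden_record(group):
    """Consolidate a group: for each attribute, keep the longest value seen."""
    fields = ("name", "email", "city")
    return {f: max((r.get(f, "") for r in group), key=len) for f in fields}

# Records as they might arrive in the landing zone (made-up sample data).
landing_zone = [
    ("crm", {"cust_name": "Jane Doe", "email_addr": "jane@example.com", "town": ""}),
    ("billing", {"full_name": "Jane A. Doe", "email": "JANE@example.com", "city": "Boston"}),
    ("billing", {"full_name": "Raj Patel", "email": "raj@example.com", "city": "Austin"}),
]

unified = [to_unified(src, rec) for src, rec in landing_zone]
curated = [golden_record(group) for group in cluster(unified)]
```

With this sample data, the two Jane Doe records collapse into one curated record that picks up the city from the billing source, while Raj Patel remains a separate entity. The curated list is what would be loaded into the cloud target.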
Under the hood, Tamr is able to map, enrich, match, classify, and consolidate data at scale thanks to its patented human-guided machine learning technology. Business data experts, who are often the people most familiar with the data, contribute directly to the mastering model by answering simple match or no-match questions about the data. This iterative process shortens the time needed to develop accurate, curated datasets as part of the migration because it does not rely on traditional rule development.
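The general idea of learning from match or no-match answers can be sketched in a few lines of Python. This is a toy illustration only, not Tamr’s model: it assumes a single string-similarity score per pair, learns nothing more than a cutoff on that score, and simulates the expert with a hard-coded answer table (`EXPERT`, a hypothetical name). At each step the pair the current model is least sure about is put to the expert, and the cutoff is refit from all answers so far.

```python
from difflib import SequenceMatcher

def similarity(pair):
    """String similarity in [0, 1] between the two names in a pair."""
    a, b = pair
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def fit_threshold(labeled):
    """Return the cutoff t that classifies the most labeled pairs correctly,
    where a pair is predicted to be a match when its score is greater than t."""
    best_t, best_correct = 0.0, -1
    for t in sorted({0.0} | {score for score, _ in labeled}):
        correct = sum((score > t) == is_match for score, is_match in labeled)
        if correct > best_correct:
            best_t, best_correct = t, correct
    return best_t

def most_uncertain(pairs, threshold):
    """Pick the pair whose score is closest to the cutoff: the next question."""
    return min(pairs, key=lambda p: abs(similarity(p) - threshold))

# Simulated expert answers (made-up data standing in for a human reviewer).
EXPERT = {
    ("Globex Inc", "Globex Inc."): True,
    ("Acme Corporation", "Acme Corporation Ltd"): True,
    ("Globex Inc", "Initech"): False,
    ("Acme Corporation", "Apex Holdings"): False,
}

unlabeled = list(EXPERT)
labeled = []
threshold = 0.5  # arbitrary starting guess before any answers are collected

while unlabeled:
    question = most_uncertain(unlabeled, threshold)
    unlabeled.remove(question)
    answer = EXPERT[question]  # in practice, a person answers match / no match
    labeled.append((similarity(question), answer))
    threshold = fit_threshold(labeled)
```

After the loop, comparing `similarity(pair) > threshold` reproduces the expert’s answers for these four pairs without anyone having written a matching rule by hand. A production system would use many more signals than one similarity score, but the feedback loop has the same shape.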
Getting Started with Tamr
The target cloud infrastructure should have data that is ready to use from the get-go. This allows you to immediately start benefiting from all the new cloud features that drove your decision to migrate in the first place. Tamr meets this requirement with machine learning technology that works at scale to cleanse and organize your data during the migration.