Getting The Most From Your AWS Migration with Data Mastering
The benefits of cloud technology are a true business game-changer, which is why so many organizations are moving to AWS. But before you shift to the cloud, you’ll need to decide how best to migrate your data in order to take full advantage of the cloud’s benefits.
The good news about an impending cloud migration is that you have a chance to start fresh. Now is an excellent time to address your legacy data problems, transform your business’s data into an asset, and make it readily available to the entire business for downstream efforts like data science, analytics, and business forecasting.
Here’s the bad news: if you’re counting on a simple, straightforward lift-and-shift migration strategy, you’re setting yourself and your organization up for failure. The reason? With a lift-and-shift strategy, all the issues that plagued your data when hosted on-premise (think siloed data, duplicate records, incomplete records) will follow you to the cloud. Your data will be just as unusable as it was before. This is why, before any migration, it is critical to think about how you will clean, curate, and master your data.
A cloud migration strategy that is optimized for ROI not only moves data, but improves it.
Discover Cloud-Native Data Mastering
What Your Move to AWS Shares with Moving to a New Home
A good way to think about your upcoming AWS migration is like moving into a new home. At the moment, all your stuff (i.e., data) is at your current home in different states of organization. When prepping to move, would you pack up the messes in your old place and simply ship them to your new place? Probably not. You’re more likely to neatly organize and pack the things you plan to bring so that when everything arrives at your new place you know what you have, it’s easy to unpack, and it’s organized, giving you a fresh start.
The same logic applies to a cloud migration, but with one massive advantage. Because cloud compute power and storage are far more economical than on-premise equivalents, migrating to the cloud lets you not only store data more cheaply, but also run computationally intensive machine learning algorithms to do the majority of the organizing, enriching, and mastering at the same time. In essence, you outsource the hard work of cleaning your bad data to a machine while it’s in transit: a major jump in speed and efficiency. By taking advantage of machine learning in the cloud, it’s possible to manage 10 times as much data with one-tenth the people in one-tenth the time.
Tamr’s cloud-native data mastering solution uses machine learning to do the heavy lifting of curating and enriching data, so your organization can use the data in the cloud to drive radically better business decision-making and real business outcomes from mastered data: saving money, driving growth, and reducing risk.
With the need for clean, curated, mastered data prior to a cloud migration established, the questions become how and when to do it. Modern data mastering possesses these critical features:
- Machine learning to master data at hyper-scale
Traditionally, organizing and mastering data has been done with a rules-based approach (if/then). Conventional rules-based systems can be effective on a small scale, relying on human-built rules logic to generate master records. However, rules quickly fall apart when tasked with connecting and reconciling large amounts of highly variable data at scale. Machine learning, on the other hand, becomes more effective at matching records across datasets as more data is added. In fact, huge amounts of data (1M+ records across dozens of systems) provide more signals for the algorithms to identify patterns, matches, and relationships, compressing years of human effort into days.
- Open and interoperable architecture to break down existing data silos
Look for a solution with an open and interoperable architecture that allows businesses to pursue “best-in-breed” solutions for all their data needs. Today’s premier data organizations take a DataOps approach to their technology stacks, which means using the best tool for each specific need, instead of what’s easiest or readily available. Look for solutions that play well with others and are complementary through RESTful APIs and robust integration capabilities.
- Cloud-native technologies that scale effectively
Machine learning is essential to improving data quality. As stated before, manual, rules-based approaches don’t scale and are slow to provide value. However, running large machine learning projects on-prem is incredibly costly and computationally taxing. This is where the cloud can make all the difference. The cloud provides the scale and compute that makes using machine learning efficient and cost-effective.
Additionally, cloud-native solutions are ideal for leveraging the flexibility and scalability of AWS. Cloud-native capabilities (technologies that leverage the built-in elastic and ephemeral compute benefits of the cloud) allow for a highly secure and scalable infrastructure that can add storage and compute power without adding to physical and hosting costs. With this built-in advantage, cloud-native solutions reduce the total cost of ownership and enable data organizations to take advantage of ongoing product enhancements and tooling without allocating additional resources to hardware, system, or software upgrades.
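To make the contrast between rules and machine-learning matching concrete, here is a minimal, illustrative sketch of similarity-based record matching using only the Python standard library. This is not Tamr’s actual algorithm: the records, field weights, and threshold below are invented for the example; a learned model would fit those weights from labeled pairs instead of hand-picking them.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Normalized string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_records(rec_a: dict, rec_b: dict, threshold: float = 0.8) -> bool:
    """Score a candidate record pair on weighted field similarities.

    The weights (0.6 / 0.4) and threshold are hand-picked for illustration;
    a trained model would learn them from labeled match/non-match pairs.
    """
    score = (0.6 * similarity(rec_a["name"], rec_b["name"])
             + 0.4 * similarity(rec_a["city"], rec_b["city"]))
    return score >= threshold


# Two hypothetical supplier records from different source systems
a = {"name": "Acme Industrial Supply, Inc.", "city": "Boston"}
b = {"name": "ACME Industrial Supply Inc", "city": "Boston, MA"}

print(match_records(a, b))  # the fuzzy pair clears the threshold
```

Note that no hand-written rule is needed for the punctuation and casing variants above; the similarity score absorbs them, which is why this approach degrades more gracefully than if/then rules as data variety grows.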
Moving Day: Selecting When and Where to Master Data
There’s also the choice of when in the data migration flow to master your data. An initial thought may be to master everything prior to moving data into the cloud—really leaning into the idea of starting clean. However, there are plenty of advantages to mastering data once it is staged in the cloud. Consider the positives and negatives of each approach in the table below.
| Master Data On-Premise | Master Data in the AWS Cloud Data Lake |
| --- | --- |
| Data is mastered before entering AWS, making it valuable to the entire business | Data is mastered as it enters AWS, making it valuable to the entire business |
| Data may still be siloed and unavailable to the mastering effort | Improved data access allows Tamr to be applied to the entire corpus of data, helping identify data sources that are redundant and should proceed no further in the migration workflow |
| Costly to establish a short-term environment to run large-scale machine learning algorithms on large data sets | That effort can instead be focused on establishing capabilities in the new cloud environment; once provisioned, the cloud-native Tamr instance can continually master data as new sources are added to the data lake, both now and in the future |
| Still need to move data to the cloud | Data is already in the cloud |
| Need to work with additional technologies to connect to on-prem sources and move data to the cloud on either side of the mastering process | Can take advantage of established data migration patterns and read data directly from the cloud data lake |
Case study: How Johnson & Johnson Translates Product Data into Sales Insights with Tamr
Johnson & Johnson’s consumer division, which includes brands like Listerine, Neutrogena and Aveeno, wanted a global view of product sales performance to improve product profitability reporting. By gaining more insight into global sales, J&J hoped to:
- Improve analytics for sales and marketing: Drive growth by applying best practices to pricing and promotions.
- Improve supply chain visibility and sales and operations planning: Reduce supply chain risks like excesses and shortages through richer insight into what’s being sold where.
Obtaining this insight required creating complete, or golden, product records, a process that J&J was attempting with a rules-based system. However, this approach presented several challenges, including:
- Requiring extensive manual labor: Data had to be manually gathered and analyzed, an approach that was time-consuming and subject to errors and discrepancies.
- Slow to produce value: Reporting on the profitability of one consumer product took several months using the manual approach.
- Extensive data variety: Data attributes, like a product’s name, varied by country and cleaning this information was manually intensive.
- Large data volume: Large amounts of data needed to be added constantly, which proved time-consuming.
J&J wanted a modern approach to data management that used machine learning to handle the heavy lifting around cleaning and curating data. An application that leveraged AWS was also important since J&J runs multiple applications on that platform.
After a proof-of-concept project yielded product sales data in weeks instead of months, J&J selected Tamr’s data mastering platform to create golden product records. That Tamr is engineered to run on AWS’ native services, and uses machine learning rather than rules, also appealed to J&J. Mastering product data with Tamr has helped J&J accomplish business goals including:
- Improving sales visibility into its global consumer product base by creating golden product records
- Generating business analytics faster by using machine learning instead of rules
- Identifying the most effective promotional campaigns
- Empowering cross-functional teams with clean data to help them uncover sales insights
Tamr and AWS: Offering a Modern, Agile Data Platform
Tamr is complementary to AWS and delivers mastered data to components like Redshift to support downstream analytics using services such as QuickSight and SageMaker. Used together, Tamr and AWS help enterprises form a modern, agile data platform.
Glue Catalog: Glue Catalog is a knowledge source for Tamr, providing information about what data can be integrated, what it contains, and who uses it. Tamr also uses Glue Catalog as a registry that’s updated with clean, mastered data for each entity. When users need a trusted source of data, they can search Glue Catalog for items tagged as mastered by Tamr.
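As a sketch of how a data consumer might discover mastered tables in that registry, the snippet below filters Glue table definitions on a hypothetical `mastered_by` table parameter (the actual tag name and value would depend on how your registry is configured); the `get_tables` paginator itself is real boto3 Glue API.

```python
def is_mastered(table: dict, tag_key: str = "mastered_by",
                tag_value: str = "tamr") -> bool:
    """True if a Glue table definition carries the (hypothetical)
    mastered-by-Tamr tag in its Parameters map."""
    return table.get("Parameters", {}).get(tag_key) == tag_value


def find_mastered_tables(database: str) -> list[str]:
    """List the names of mastered tables in a Glue database."""
    import boto3  # deferred import: running this requires AWS credentials
    glue = boto3.client("glue")
    names: list[str] = []
    for page in glue.get_paginator("get_tables").paginate(DatabaseName=database):
        names.extend(t["Name"] for t in page["TableList"] if is_mastered(t))
    return names
```

Keeping the tag check in a small pure function (`is_mastered`) makes the filtering logic easy to test without an AWS connection.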
S3 or Redshift/Athena: Data mastered with Tamr can be published to S3 or Redshift/Athena, providing users with access to high-quality, curated data.
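A minimal sketch of reading that published data via Athena follows. The `start_query_execution` call and its parameters are real boto3 Athena API; the table name, database, and S3 output location are placeholders invented for the example.

```python
def mastered_entity_query(table: str, limit: int = 10) -> str:
    """Build a simple Athena SQL query against a (hypothetical) mastered table."""
    return f'SELECT * FROM "{table}" LIMIT {limit}'


def run_athena_query(database: str, table: str, output_s3: str) -> str:
    """Submit the query to Athena and return its execution ID."""
    import boto3  # deferred import: running this requires AWS credentials
    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=mastered_entity_query(table),
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3},
    )
    return resp["QueryExecutionId"]


# Example invocation (placeholder names):
# run_athena_query("mastered_db", "golden_products", "s3://my-results-bucket/athena/")
```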
EMR or Databricks for AWS: Tamr lets organizations use the cloud-native capabilities of AWS to scale their use of services like EMR or Databricks for AWS, which Tamr uses for compute. Leveraging the elastic and ephemeral capabilities of AWS to increase and decrease compute as needed drives cost-efficient data mastering.
Glue: Integration via Tamr’s RESTful APIs allows data quality workflows to be easily invoked as part of a data pipeline. This reduces the barrier to applying data mastering processes.
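A pipeline step invoking such a workflow is, at bottom, an authenticated HTTP call. The sketch below builds one with the Python standard library; the endpoint path, project ID, and auth scheme are placeholders, not Tamr’s actual API routes, so consult Tamr’s API documentation for the real ones.

```python
import urllib.request

# Hypothetical endpoint; the real path and auth scheme come from Tamr's API docs.
TAMR_REFRESH_URL = "https://tamr.example.com/api/versioned/v1/projects/1:refresh"


def build_mastering_trigger(url: str = TAMR_REFRESH_URL,
                            token: str = "API-TOKEN") -> urllib.request.Request:
    """Build the POST request a pipeline step (e.g., a Glue job) would send
    to kick off a mastering workflow. Submitting it is left to the caller:
    urllib.request.urlopen(request)."""
    request = urllib.request.Request(url, data=b"{}", method="POST")
    request.add_header("Authorization", f"Bearer {token}")  # placeholder scheme
    request.add_header("Content-Type", "application/json")
    return request
```

Wrapping the call like this keeps the pipeline step declarative: the orchestrator only needs the URL and a credential, and retries or scheduling stay in the pipeline tool rather than in the mastering logic.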
QuickSight / SageMaker: Analytics and machine learning tools like QuickSight and SageMaker need clean, trusted data, which Tamr provides in a reliable and repeatable manner.
AWS Data Exchange: The enrichment services offered through AWS Data Exchange can be an additional source of data to process using Tamr. Enriching data often involves overcoming challenges around data variety, which is a core problem that Tamr solves.
One of the most valuable assets unlocked by moving to the cloud is the speed at which data can be put to work solving end business problems. But doing so relies on that data being mastered.
Migrations are the perfect catalyst for a conversation around improving data quality. To get the most from a migration, lead with data.
By coming to the cloud with a solid understanding of your critical data (how many customers you have, who your leads are, how many suppliers you work with, what parts you buy), you have a known baseline to plug into existing applications today and new ones tomorrow, setting yourself up for success.
Using Tamr’s machine learning-driven, interoperable, modular approach to mastering data, your migration can amplify the productivity and possibilities of your data in its new home. With Tamr, AWS customers can improve their migrations and be in a position to accelerate critical analytical insights by reconciling internal and external data at scale.