Written by Tamr
Enterprises and organizations have access to a huge variety of data sources today: internal data sources, external public data sources, feeds from the Internet of Things and more. They want to leverage this data, connect and combine it intelligently and efficiently, and tap into it for new kinds of analytics and applications. But neither traditional top-down data-integration approaches nor some of the newer, bottom-up data scientist tools can scale to meet the demands of Big Data Variety.
Come see Tamr Co-Founder and CTO Michael Stonebraker at Strata Conference NY on October 16 discuss how a scalable data curation platform can help enterprises connect and enrich their data so they can leverage all of it.
In his talk, Mike will describe data curation, which he defines as “the process of turning independently created data sources (structured and semi-structured data) into unified data sets ready for analytics, using domain experts to guide the process.” It involves:
- Identifying data sources of interest (whether from inside or outside the enterprise)
- Verifying the data (to ascertain its composition)
- Cleaning the incoming data (for example, 99999 is not a legal ZIP code)
- Transforming the data (for example, from European date format to US date format)
- Integrating it with other data sources of interest (into a composite whole)
- Deduplicating the resulting composite data set
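To make the cleaning, transforming, integrating and deduplicating steps concrete, here is a minimal Python sketch. It is an illustration only, not Tamr's method: the record fields (`name`, `zip`, `date`), the helper names, and the exact-match deduplication rule are all assumptions made for this example. Real curation would lean on domain experts and on much more forgiving matching.

```python
from datetime import datetime


def clean_zip(value):
    """Return a 5-digit ZIP string, or None for malformed or placeholder values."""
    text = str(value).strip()
    # 99999 is the placeholder value called out in the list above.
    if len(text) != 5 or not text.isdigit() or text == "99999":
        return None
    return text


def european_to_us_date(value):
    """Convert a DD/MM/YYYY date string to MM/DD/YYYY."""
    return datetime.strptime(value, "%d/%m/%Y").strftime("%m/%d/%Y")


def curate(sources):
    """Clean, transform, integrate and deduplicate records from several sources.

    `sources` is a list of record lists; each record is a dict with the
    hypothetical keys "name", "zip" and "date".
    """
    seen = set()
    unified = []
    for records in sources:          # integrate: walk every source in turn
        for record in records:
            row = {
                "name": record["name"].strip(),
                "zip": clean_zip(record["zip"]),               # clean
                "date": european_to_us_date(record["date"]),   # transform
            }
            # Deduplicate on an exact, case-insensitive key (a simplification).
            key = (row["name"].lower(), row["zip"], row["date"])
            if key not in seen:
                seen.add(key)
                unified.append(row)
    return unified
```

Calling `curate([internal_records, external_records])` returns one combined list in which invalid ZIP codes become `None`, dates are in US format, and records that match exactly after normalization appear once.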
The more data you need to curate for analytics and other business purposes, the more costly and complex curation becomes – mostly because humans (domain experts, or data owners) aren’t scalable. Mike’s talk will compare and contrast three approaches to data curation for Big Data Variety:
1. ETL (Extract, Transform, Load) tools
2. Data Science tools
3. Enterprise curation tools
You can also join Mike after his talk (at 11:50 AM) for an informal Office Hour.
For more information about Strata NYC scheduling, click here.