Data Preparation in the Big Data Era — O’Reilly Report

Trying to get your head — if not your organization — around how to prepare all the internal and external data now available to your enterprise for analytics? Wondering, given all the complexity and options, where to even start?

Check out O’Reilly’s new report Data Preparation in the Big Data Era: Best Practices for Data Integration by Federico Castanedo, Chief Data Scientist.

It’s a terrific combination of business and technical content in the context of a step-by-step guide to managing the challenge — and opportunity — of the many valuable but disparate internal and external data sources endemic to today’s enterprise.

As Federico notes in the introduction, “preparing and cleaning data for any kind of analysis is notoriously costly, time consuming, and prone to error, with conventional estimates holding that 80% of the total time spent on analysis is spent on data preparation … Substantial ROI gains can be realized by modernizing the techniques and tools enterprises employ in cleaning, combining, and transforming data.”

Topics covered include:

  • Starting with the business question — How to “translate data into useful knowledge … [with] the true potential value of big data … only gained when placed in a business context, where data analysis drives better decisions.”
  • Understanding your data — How to “employ more specific terms, like: raw data, technically-correct data, consistent data, tidy data, aggregated or compressed data, and formatted data” in each stage of the process.
  • Selecting the data to use — How to rectify raw data errors “such as incorrect labels and inconsistent formatting” with, ideally, automated data preparation techniques such as string normalization or approximate string matching (see the first sketch after this list).
  • Analyzing your current data strategy — How to “achieve consistent data,” which requires that “missing values, special values, errors, and outliers must be removed, corrected, or imputed” (see the second sketch after this list). Federico notes that “ideally, you can solve errors by using the expertise of domain experts, who have real-world knowledge about the data and its context.”
  • Assessing alternative ETL and data curation products — How to “deal with the data cleaning and curation problem” — in the context of the radical data variety in the modern enterprise — “cost effectively and at large scale.” Here Federico uses Tamr’s machine-driven, human-guided approach as an example of “a commercial product focused on the data curation problem at scale, attempt[ing] to solve the variety issue of big data.”
  • Delivering the results — How to deliver curated data that “will be consumed and analyzed, often in the form of visualizations” to “different users [who] require different views of the data.”
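
To make the “selecting the data” step concrete, here is a minimal sketch of the kind of automated cleanup the report alludes to: normalizing raw strings and mapping noisy values onto a canonical list with approximate (fuzzy) string matching. It uses only the Python standard library; the supplier names and the 0.7 similarity threshold are illustrative assumptions, not taken from the report.

```python
# Minimal sketch (not the report's code): normalize messy strings, then use
# approximate string matching to map them onto a canonical list of labels.
# The supplier names and the 0.7 threshold are illustrative assumptions.
import re
from difflib import SequenceMatcher

def normalize(value: str) -> str:
    """Lowercase, trim, strip punctuation, and collapse repeated whitespace."""
    value = re.sub(r"[^\w\s]", "", value.lower().strip())
    return re.sub(r"\s+", " ", value)

def best_match(raw: str, canonical: list[str], threshold: float = 0.7):
    """Return the canonical label most similar to `raw`, or None if nothing is close."""
    cleaned = normalize(raw)
    score, label = max(
        (SequenceMatcher(None, cleaned, normalize(c)).ratio(), c) for c in canonical
    )
    return label if score >= threshold else None

canonical_suppliers = ["Acme Corporation", "Globex Inc"]
print(best_match("ACME Corp.", canonical_suppliers))   # -> Acme Corporation
print(best_match("Initech", canonical_suppliers))      # -> None (left for manual review)
```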

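Likewise, for the “consistent data” step, this sketch imputes missing numeric values with the column median and flags outliers for review rather than silently dropping them. Again, the field, the -1 sentinel, and the 1.5 × IQR fence are illustrative assumptions, not the report’s prescription.

```python
# Minimal sketch (not the report's code) of making one numeric field consistent:
# treat a -1 sentinel and None as missing, impute them with the median, and flag
# outliers with Tukey's 1.5 * IQR fence so a domain expert can review them.
from statistics import median, quantiles

raw_ages = [34, 41, None, 29, -1, 38, 200, 45]

observed = [v for v in raw_ages if v is not None and v != -1]
fill = median(observed)                                   # impute with the median
cleaned = [v if (v is not None and v != -1) else fill for v in raw_ages]

q1, _, q3 = quantiles(observed, n=4)                      # quartiles of observed values
lower, upper = q1 - 1.5 * (q3 - q1), q3 + 1.5 * (q3 - q1)
flagged = [(i, v) for i, v in enumerate(cleaned) if not (lower <= v <= upper)]

print(cleaned)   # missing entries replaced by the median age (39.5)
print(flagged)   # [(6, 200)] -- surfaced for expert correction, not silently dropped
```

Flagging rather than deleting keeps the final call with domain experts, echoing Federico’s point that errors are best resolved by people with real-world knowledge of the data and its context.
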
For Federico’s full report, register for a free download here.