Ringing in the New Data Year: What to Expect in 2019

2018 is quickly winding down. At Tamr, we’ve been asking ourselves: “What’s coming in 2019?” “Will it be the year of machine learning?” “What changes will we see?”

Here are a few important trends to watch for the coming year.

1)   Ditch ETL and Invest in Cataloging, Tracking, and Monitoring Data Assets

In 2019, enterprises will realize that before spending money on extract, transform, and load (ETL) processes on the data they know, they first need to find the data they don’t know about. For any asset, step one is to catalog, track, and establish systems for monitoring key data sets and then understand how they relate to downstream analytics. While data catalogs have been discussed for the past few years, they have been treated as second-class citizens or as static logs. Properly managing every data asset requires serious investment in resources, including software, policies, and integration work.

2)   Think Globally, Act Locally

Business units will need to access their own data and take action—all in the same workflow. In 2019, expect more organizations to use data analytics exactly where they’re needed and not in isolation.

Top-down, enterprise-wide design projects to decide on the golden schema and the golden record have proven painfully long and inadequate for handling large volumes of new data. These projects have become some of the most expensive, yet least effective, data investments. As a result, smart enterprises will stop undertaking them. Instead, they will focus on continuous, bottom-up data integration. They’ll concentrate on data pipelines and incremental maintenance and will begin integrating data with heavy assistance from machine learning algorithms and continuously learning integration software.

Business units will continue to acquire data sources rapidly to meet their business needs and goals. They cannot be held back by imperfect standards and top-down policies. Bottom-up processes and continuous integration are the only way they can consume their own data and use their own analytics tools without compromising data integrity. Continuously matching and connecting incoming data to other available data sets will grant all business units broader access to the enterprise-wide data asset, even if it’s a work in progress. It will be the only feasible enabler for fast, consistent analytics.

3)   Humanize Data Analytics

Effective solutions will combine rule-based engines that encode domain knowledge with machine learning solutions that generalize and automate integration and analytics, forming a hybrid AI platform.

The synergy between humans and automatic (mostly machine learning) tools has so far been limited to exercises such as (1) providing training data and (2) validating decisions, such as clusters in record linkage or classes in classification.

While this simple, human-in-the-loop interaction is a must, future tools will need a stronger synergy that builds domain knowledge and enterprise memory into the data curation tools themselves. For example, machine learning models will need to accept human-written rules, as in weak supervision (see The Snorkel Project as an example). They will also need to accommodate hybrid solutions that form ensembles of human-written classifiers with machine-learned ones. The machine learning research community has produced leading results in this direction over the past few years. It’s time to see them in working, deployable solutions.
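To make the idea of human-written rules working alongside a learned model a little more concrete, here is a minimal sketch in plain Python. It does not use Snorkel's API, and it is not a description of how Tamr's product works. The record fields (`name`, `email`, `country`), the three rules, and the `model_score` callable are all hypothetical and chosen only for illustration. A simple majority vote stands in for the more sophisticated ways that weak supervision systems combine noisy rules.

```python
from collections import Counter

# Label values; ABSTAIN means "this rule has no opinion on this pair".
ABSTAIN, NO_MATCH, MATCH = -1, 0, 1


def rule_same_email(pair):
    """Hypothetical rule: identical non-empty emails suggest a match."""
    a, b = pair
    if a.get("email") and a.get("email") == b.get("email"):
        return MATCH
    return ABSTAIN


def rule_different_country(pair):
    """Hypothetical rule: different known countries suggest no match."""
    a, b = pair
    if a.get("country") and b.get("country") and a["country"] != b["country"]:
        return NO_MATCH
    return ABSTAIN


def rule_same_normalized_name(pair):
    """Hypothetical rule: names equal after lowercasing and removing spaces."""
    a, b = pair

    def normalize(text):
        return "".join(text.lower().split())

    if a.get("name") and b.get("name") and normalize(a["name"]) == normalize(b["name"]):
        return MATCH
    return ABSTAIN


RULES = [rule_same_email, rule_different_country, rule_same_normalized_name]


def weak_label(pair, rules=RULES):
    """Combine rule votes by majority; abstain on ties or when no rule fires."""
    votes = [vote for vote in (rule(pair) for rule in rules) if vote != ABSTAIN]
    if not votes:
        return ABSTAIN
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return ABSTAIN  # the rules disagree evenly
    return ranked[0][0]


def hybrid_predict(pair, model_score, rules=RULES, threshold=0.5):
    """Use the rules when they agree; otherwise fall back to a learned model.

    `model_score` is a placeholder for any trained classifier that returns
    a match probability between 0 and 1 for a pair of records.
    """
    label = weak_label(pair, rules)
    if label != ABSTAIN:
        return label
    return MATCH if model_score(pair) >= threshold else NO_MATCH
```

In this sketch, the labels produced by `weak_label` could also serve as noisy training data for the learned model, which is the core idea behind weak supervision. Letting the rules take priority in `hybrid_predict` is only one possible design; a weighted combination of rule votes and model scores would be another.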

With the advent of machine learning and new ways to unify data, 2019 looks promising for enterprises wanting to transform data into a strategic asset. Download our whitepaper to learn more about Tamr’s unique approach of using bottom-up, machine learning-based approaches to unifying disparate datasets within an organization.



Ihab is a co-founder of Tamr and a professor in the Cheriton School of Computer Science at the University of Waterloo, where his main research focuses on the areas of big data and database systems, with special interest in data quality and integration, managing uncertain data, rank-aware query processing, and information extraction. He is a recipient of the Ontario Early Researcher Award (2009), a Cheriton Faculty Fellowship (2013), an NSERC Discovery Accelerator Award (2014), and a Google Faculty Award (2014), and he is an ACM Distinguished Scientist. Ihab is an elected member of the VLDB Endowment board of trustees, elected SIGMOD vice chair, and an associate editor of ACM Transactions on Database Systems (TODS). He holds a Ph.D. in Computer Science from Purdue University and a B.Sc. and an M.Sc. from Alexandria University.