Is Big Data More Important and More Useful Outside of the Analytical Data Stack?

Cloud adoption has been the catalyst for change

Cloud storage and computing have fuelled radical changes to data architecture over the past 10 years. This change has coincided with, and perhaps catalyzed, data’s changing role in our lives.

Data has moved from inside business applications, into an entire Big Data pipeline on the cloud and, finally, into different, often SaaS-based, operational systems.

The analytical data stack led the way in Big Data, and Data Visualisation tools were among the first to be adopted en masse. For many companies, that was their Big Data strategy: a BI tool attached to a Data Lake or Data Warehouse. Analytical and exploratory ETL layers were inserted in between, in addition to Data Marts. Slowly but surely, these methods enabled the core of the Big Data stack to migrate to the cloud.

More recently, data and privacy regulations have spurred those who had not yet adopted Data Governance tools and processes to manage, and stay aware of, data across their proliferating cloud-hosted tools.

The modern big data stack is becoming more and more crucial

Data governance tools can help companies understand the whereabouts of their source data. But manually defining, indexing and cataloging source data is a painstaking task, and the results start to decay as soon as the often mammoth projects are completed. That’s why data teams have started to turn to AI and ML, asking machines to do the work, as it is the only practical, scalable answer. The same has happened in master data management. There is now a path from vast cloud datasets to a reconciled single customer view (SCV), via automated SCV tools such as Tamr. To achieve a single customer view, the varied and disparate data sets collected in the warehouse need to be joined, so that they become useful inputs to the AI models, MI dashboards, or SaaS tools where the operations of the business are now conducted.

Operational activities such as sales, onboarding, and risk management have migrated to cloud applications. However, these processes often still rely on customer or product data originating from legacy applications. As a result, both the business intelligence tools and the operational SaaS systems rely on data from the core of the business: its legacy applications. But rather than connecting to those applications directly, they access the data from the data warehouse, through the Big Data stack’s ingestion, streaming, ETL, or Reverse ETL tools.

We have therefore reached a situation where both the functioning of the business and the understanding of its customers rely on the data pipelines that run through the Big Data stack.

The quality of these pipelines is paramount to making sure that everything stays on track.

This is where Data Lineage comes in. Only with end-to-end integrations and automated tracking can one have a complete picture of all data storage and movement. Previously, automated Data Lineage focused on the analytical side of the Big Data stack, just as automated Single Customer View has more commonly focused on use cases in reporting and analytical applications.

But just as operational use cases have helped to deliver on the promise of Big Data in many businesses, bringing it to life, operational use cases for automated Lineage and Single Customer View have the potential to add the most value from these tools, while reducing risk for businesses.

Figure 1 – Emerging Architectures for Modern Data Infrastructure

Operational Data Lakes have enabled the consolidation of a myriad of data from inside and outside the business, offering more context than ever before in near real-time and surfacing that business context within AI models and operational systems. Where no common unique identifier exists, which is often the case when dealing with third-party data, Single Customer View intelligence may need to extend across all of the ingestion pipelines, with the help of automated Data Mastering. Similarly, Data Lineage could extend across these API feeds too, beyond the ETL, warehouse, and BI tools, to give a rounded view of the data flowing into the architecture.
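To make the "no common unique identifier" problem concrete, here is a minimal sketch of matching records from two sources on fuzzy attribute similarity. It is not how Tamr or any other mastering tool works; it swaps in plain string similarity from the Python standard library where a real tool would use trained models. The field names, record IDs, and the 0.85 threshold are all assumptions made up for illustration.

```python
from difflib import SequenceMatcher


def normalise(value: str) -> str:
    """Lower-case and collapse whitespace so trivial differences do not block a match."""
    return " ".join(value.lower().split())


def similarity(a: dict, b: dict) -> float:
    """Average string similarity over the (hypothetical) name and email fields."""
    fields = ("name", "email")
    scores = [
        SequenceMatcher(None, normalise(a[f]), normalise(b[f])).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def match_records(internal: list[dict], external: list[dict], threshold: float = 0.85):
    """Pair each internal record with its best external candidate, if it is similar enough."""
    matches = []
    for record in internal:
        best = max(external, key=lambda candidate: similarity(record, candidate))
        score = similarity(record, best)
        if score >= threshold:  # the threshold is an illustrative assumption
            matches.append((record["id"], best["id"], round(score, 2)))
    return matches


# Illustrative data: the two sources share no common key.
crm_rows = [
    {"id": "crm-1", "name": "Jane Smith", "email": "jane.smith@example.com"},
    {"id": "crm-2", "name": "Omar Haddad", "email": "omar@example.org"},
]
vendor_rows = [
    {"id": "v-9", "name": "jane  smith", "email": "Jane.Smith@example.com"},
    {"id": "v-7", "name": "Li Wei", "email": "li.wei@example.net"},
]

matches = match_records(crm_rows, vendor_rows)
print(matches)
```

Pairwise comparison like this does not scale to warehouse-sized data, which is part of why teams turn to automated mastering tools rather than hand-written rules.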

The consolidation of data in operational systems has begun to create new data-orientated products in many businesses, in addition to facilitating more efficient traditional business operations. CIOs, CDOs and COOs should be having sleepless nights if they don’t know how their pipelines work, where GDPR data occurs across them, how changes to them are made, how robust they are and, if individual jobs fail, how the business is alerted. Even backups are at risk: a backup of an operational system is only as reliable as the pipeline that fed it on the day the backup was taken, and without the visibility that Data Lineage provides, you cannot know that the pipeline was set up and ran correctly.
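The alerting question can be sketched in a few lines. The example below assumes a strictly sequential pipeline and a hypothetical `alert` callback; in practice that callback might post to a paging or chat tool, and an orchestration platform would normally provide this behaviour. The job names and the error message are invented for illustration.

```python
from typing import Callable


def run_pipeline(
    jobs: list[tuple[str, Callable[[], None]]],
    alert: Callable[[str], None],
) -> dict[str, str]:
    """Run jobs in order; on the first failure, alert and skip everything downstream."""
    status: dict[str, str] = {}
    failed = False
    for name, job in jobs:
        if failed:
            # Never feed downstream systems from a half-finished run.
            status[name] = "skipped"
            continue
        try:
            job()
            status[name] = "ok"
        except Exception as exc:
            failed = True
            status[name] = "failed"
            alert(f"Pipeline job '{name}' failed: {exc}")
    return status


def ingest() -> None:
    pass  # stand-in for a real ingestion step


def transform() -> None:
    raise ValueError("missing column customer_id")  # simulated failure


def reverse_etl() -> None:
    pass  # stand-in for a sync into an operational SaaS tool


alerts: list[str] = []  # a list stands in for a real alerting channel
status = run_pipeline(
    [("ingest", ingest), ("transform", transform), ("reverse_etl", reverse_etl)],
    alert=alerts.append,
)
print(status)
print(alerts)
```

Skipping the downstream jobs is a deliberate choice here: a stale operational system is usually easier to explain than one loaded with partial data.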

Erroneous changes to the API orchestration platform, to the Warehouse / Lake, or in the ETL / Reverse ETL tools could create issues in the data pipelines. These issues could make a Single Customer View inaccurate, could stop an AI model or a dashboard from working and, even worse, could cause an operational system to malfunction altogether. Automated impact analysis in Data Lineage tools is designed to help prevent these issues when data engineers make changes in the pipeline.
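At its simplest, impact analysis is a traversal of the lineage graph: given the asset being changed, list everything downstream of it. The sketch below assumes lineage is already available as a mapping from each asset to the assets it feeds. The asset names are hypothetical, and real lineage tools typically work at column level and build this graph automatically.

```python
from collections import deque

# Hypothetical lineage graph: each key feeds the assets listed as its value.
LINEAGE = {
    "crm_db.customers": ["warehouse.stg_customers"],
    "warehouse.stg_customers": ["warehouse.dim_customer"],
    "warehouse.dim_customer": [
        "bi.revenue_dashboard",
        "ml.churn_model",
        "reverse_etl.sales_saas_sync",
    ],
    "reverse_etl.sales_saas_sync": ["saas.sales_tool"],
}


def downstream_impact(graph: dict[str, list[str]], changed: str) -> list[str]:
    """Breadth-first walk returning every asset downstream of `changed`."""
    seen = {changed}  # also guards against cycles in the graph
    queue = deque([changed])
    impacted = []
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                impacted.append(child)
                queue.append(child)
    return impacted


impacted = downstream_impact(LINEAGE, "warehouse.stg_customers")
for asset in impacted:
    print("review before release:", asset)
```

In this made-up graph, a change to a staging table reaches not only a dashboard and a model but also an operational SaaS tool, which is exactly the kind of dependency that is easy to miss without lineage.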

Just as with software, if a business runs on and relies on data, it needs to make sure that any changes to the Big Data stack are verified and compliant with regulation before they are implemented!

Conclusion

Big Data is now employed across every part of a business and has truly become the raw material of our time. If your Big Data stack fails to feed your AI, SCV or operational systems, it could become one of the most important continuity challenges you face. The same probably cannot be said for your CEO’s dashboard, however much they may like it! My advice, therefore, is to treat and manage your Big Data stack and its pipelines with as much respect, care, and attention as you can.