Learn more about data observability, a frequently overlooked aspect of modern data architectures that is essential to data processing and analytics in modern enterprises, by reading Mateus Pimenta’s latest blog.
The newest and best-of-breed data observability tools are quickly invading the market with the promise of helping companies gain insight into and control over their data processing capabilities. This blog post will go through the challenges companies face in the modern data stack and the features that the newest tools offer to tackle those challenges.
The modern data stack can be described as a new suite of data integration, processing and visualisation tools that emerged some years ago and made analytical capabilities much more accessible to companies of every size. It replaced slow and expensive traditional batch processing infrastructure, such as Hadoop, with a far more flexible and easy-to-use toolset at a much lower total cost of ownership. The modern data stack combines vendors, cloud providers and open-source tools to support a wide range of data use cases, from batch to real-time stream processing, at any scale. Hence the rise of Snowflake, Databricks, Fivetran, dbt and countless others.
For years, companies have been adopting the modern data stack and replacing their heavy infrastructure, often originally running on-premise, with a flexible, cloud-based, cost-effective stack. Today, integrating and processing vast amounts of data is no longer a problem. Operating the data ecosystem that makes this all happen, however, is.
Let’s now take a look at this challenge.
Most data architectures start simple: a few small ETL (extract, transform, load) jobs copy data into your data lake, and a small set of transformations and aggregations generates a report.
However, a few years later, you have a handful of different integration and orchestration tools, a large number of data pipelines, multiple event streams and multiple languages used for cleaning and transforming data.
Understanding and managing these data systems is hard; sometimes you may feel like you have built a Rube Goldberg machine. Even with your best efforts to tame complexity, everything feels complicated and fragile. If you feel like this, you are not alone; in fact, it might not be your fault at all.
Most modern data stack tools are relatively new and often target specific use cases. Finding a single solution that covers 99% of everything you need is virtually impossible; even aiming for 70% can be a challenge. Consequently, the more diverse your organisation’s use cases, the more complex your data architecture is likely to become, and the harder your system will be to run.
This raises the question: what should I do then?
Complex or not, the most important characteristic of your system is that it works and continues to work reliably over time.
It might sound obvious, but it’s not simple to assert that a data system “works”, and it becomes even more challenging in larger systems. Correctly working systems rely on a few things being in place, namely:
The infrastructure that runs your data pipelines is up and running
The source and destination data stores are up and running
Any misbehaviour in the points above can drastically affect the system’s output. This directly impairs your ability to make decisions, the trust in the system erodes, and that key strategic goal to become a truly data-driven organisation becomes more and more distant.
This is why ensuring your data systems are sound and operating as intended is indeed critical. And this is where data observability tools come to our rescue.
Data observability tools are focused on providing a simple, integrated way to verify that your data systems are healthy and working as expected. They do so by integrating with your stack’s myriad tools and data stores to give you a single pane of glass over your data estate.
Most tools come with a vast list of features, but I’ll simplify them into two broad groups: making sure things “work”, and finding out why things are not working.
It’s tempting to assume that things work if a pipeline runs successfully. But a successful run is, in and of itself, generally insufficient. Something is not quite right if that pipeline reports green yet produces only two records instead of the expected ten million.
So an ideal data stack should go further than run-status monitoring; it must be able to assess the quality of the resulting data. It should track volume, accuracy, timeliness and other fundamental characteristics to give consumers a stable and trustworthy dataset, whether to power critical analytical reports, train a data science model or enable new use cases.
For that reason, most data observability tools give you mechanisms to check these and many other aspects of the data manipulated by your system. As the tools can understand the data structure, you can easily create checks for ranges, freshness, consistency, etc., on different data fields. The tools can also automatically detect schema changes or abnormalities in your data, such as sudden changes in volume or freshness. If a check fails, alerts can be raised for investigation. Some tools can also stop downstream processing, acting as a circuit breaker to stop propagating the issue further.
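To make these checks concrete, here is a minimal sketch in plain Python of how volume and range checks might feed a circuit breaker. All the names here (`volume_check`, `range_check`, `publish_if_healthy`) are illustrative assumptions, not the API of any particular observability product.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class CheckResult:
    name: str
    passed: bool
    detail: str

# A check is any callable that inspects the freshly produced rows.
Check = Callable[[list], CheckResult]

def volume_check(min_rows: int) -> Check:
    """Flag runs that produce suspiciously few records."""
    def check(rows: list) -> CheckResult:
        return CheckResult("volume", len(rows) >= min_rows,
                           f"got {len(rows)} rows, expected >= {min_rows}")
    return check

def range_check(field: str, lo: float, hi: float) -> Check:
    """Flag out-of-range values in a given field."""
    def check(rows: list) -> CheckResult:
        bad = [r for r in rows if not (lo <= r[field] <= hi)]
        return CheckResult(f"range:{field}", not bad,
                           f"{len(bad)} out-of-range values")
    return check

def publish_if_healthy(rows: list, checks: list, publish: Callable) -> None:
    """Circuit breaker: refuse to publish downstream if any check fails."""
    results = [check(rows) for check in checks]
    failures = [r for r in results if not r.passed]
    if failures:
        raise RuntimeError("publish blocked: " + "; ".join(
            f"{f.name} ({f.detail})" for f in failures))
    publish(rows)
```

In a real stack the `publish` step would hand off to the next pipeline stage; raising instead of publishing is what stops a bad batch from propagating.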
Those are powerful features that should dramatically improve the health of, and trust in, your data systems. They are also key enablers of modern data approaches such as data mesh and data reliability engineering (SRE’s data cousin), where high data quality and service level objectives (SLOs) are essential to maintaining a stable data ecosystem.
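As a sketch of what a data SLO could look like in practice, the snippet below measures compliance with a hypothetical objective of “data lands within 30 minutes, 95% of the time”; the latencies and threshold are invented for illustration.

```python
def slo_compliance(latencies_minutes: list, target_minutes: float) -> float:
    """Fraction of pipeline runs that met the delivery-time target."""
    met = sum(1 for m in latencies_minutes if m <= target_minutes)
    return met / len(latencies_minutes)

# Hypothetical delivery latencies (minutes) for the last ten runs.
runs = [12, 25, 31, 18, 22, 29, 45, 16, 20, 27]

compliance = slo_compliance(runs, target_minutes=30)  # 0.8: two runs missed the target
meets_objective = compliance >= 0.95                  # False: the 95% SLO is breached
```

The interesting part is not the arithmetic but the conversation it forces: producers and consumers have to agree on a measurable target before anyone can say the system “works”.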
Regardless of how much effort you put into your design and implementation, it’s almost inevitable that things will sometimes go wrong. If you are familiar with microservices architectures, you will know that preparing for such scenarios is vitally important. It’s essential to have the means to identify and resolve unexpected problems, and this is no different in the data space.
Most data observability tools offer two main capabilities to assist you. First, a single pane of glass: a dashboard view of all your systems, so that whenever operators receive an alert or a call from a user, they can quickly look at the whole system and drill down into pipeline execution logs, times and statuses. Second, data lineage: the ability to trace data across the system. With this, operators can understand, at the field level, which pipelines and transformations generated the data in question, where the original data came from, and any downstream systems the incorrect data might have impacted.
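Under the hood, field-level lineage is essentially a directed graph that can be walked in both directions: downstream for impact analysis, upstream for root-cause analysis. A minimal sketch, with made-up field names, might look like this:

```python
from collections import defaultdict, deque

# Hypothetical field-level lineage: each edge points from a source field
# to a field derived from it.
edges = [
    ("raw.orders.amount", "staging.orders.amount_usd"),
    ("raw.fx.rate", "staging.orders.amount_usd"),
    ("staging.orders.amount_usd", "marts.revenue.daily_total"),
]

def build_graph(edges):
    """Index the edges in both directions for fast traversal."""
    downstream, upstream = defaultdict(set), defaultdict(set)
    for src, dst in edges:
        downstream[src].add(dst)
        upstream[dst].add(src)
    return downstream, upstream

def reachable(start, adjacency):
    """Breadth-first walk: every field reachable from `start`."""
    seen, queue = set(), deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

downstream, upstream = build_graph(edges)
# What could a bad FX feed corrupt?
impacted = reachable("raw.fx.rate", downstream)
# Where did a suspect revenue figure come from?
origins = reachable("marts.revenue.daily_total", upstream)
```

Commercial tools build and maintain this graph automatically, often by parsing SQL and pipeline metadata, but the traversal that answers “what is impacted?” and “where did this come from?” is the same idea.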
Together, these capabilities allow operators to troubleshoot data incidents much faster, lowering time to recover. Beyond that, they help you find your systems’ blind and weak spots, so you know where to invest to improve the reliability of your data ecosystem.
Whether you are implementing your first data lake or have already built a data mesh, the suggestion here is not to overlook data observability. Your data quality directly impacts the quality of the decisions you make, and the ability to safely operate your systems will build trust and lay strong foundations for scaling them.
Data observability is a trending topic but certainly not entirely a greenfield space. Some concepts and practices have been borrowed from microservices and adapted into the data space. There are many vendors with great mature products, such as Monte Carlo, Databand, Acceldata and Collibra, to name a few.
Today, it’s almost unthinkable to use microservices without the help of observability tools. Tomorrow, the same will be true for data.
This blog is written exclusively by the OpenCredo team. We do not accept external contributions.