Data Observability is a terrible phrase and the tools are even worse
Some people even struggle to pronounce "Observability" - and I don't blame them
Introduction
“Observability”
It’s a bit of a mouthful, isn’t it?
“Visibility” could have been used instead; it is even one syllable shorter!
No, but seriously: it is a hammer looking for a nail and an architectural disaster. Not only that, it is the source of an immense amount of confusion in the data space.
I do not believe this is a category, and it is certainly not good architectural practice. So let’s go into why and put this to bed.
A Data Observability Product
A Data Observability Product has the following features, at a minimum:
Provides a graphical view of asset-based lineage (tables, dashboards, and so on) that is used for debugging, system design and refactoring
Provides Data Governance specialists with an easy-to-use UI for running data quality tests, which are essentially SQL queries (a sketch of such a test follows this list)
It may or may not, also:
Provide a “Paid Add-on” for CI/CD: a git action that allows you to assess the impact of pull requests on downstream assets
Offer circuit-breakers: integrations with your orchestrator that tell it whether or not it should proceed
Offer column-level lineage in addition to asset-based lineage in a graphical UI
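To make the second feature concrete, here is a minimal sketch of the kind of data quality test these UIs generate for you. The table, the column names, and the DB-API-style connection are all hypothetical; this is an illustration of the pattern, not any vendor’s actual implementation.

```python
# A minimal, hypothetical sketch of a "data quality test": under the hood it is
# just a SQL query plus a pass/fail threshold. `conn` is assumed to be any
# DB-API compatible warehouse connection; table and column names are invented.

NULL_CHECK = """
    SELECT COUNT(*) AS failing_rows
    FROM analytics.orders
    WHERE order_id IS NULL
       OR order_total < 0
"""

def run_quality_test(conn, sql: str, max_failing_rows: int = 0) -> bool:
    """Return True if the test passes, False otherwise."""
    cur = conn.cursor()
    cur.execute(sql)
    failing_rows = cur.fetchone()[0]
    return failing_rows <= max_failing_rows

# A "circuit breaker" is then just the orchestrator refusing to run downstream
# tasks when a test like this fails:
#
#     if not run_quality_test(conn, NULL_CHECK):
#         raise RuntimeError("Data quality test failed; halting downstream models")
```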
The Problems Observability Products Claim to Solve
Data Quality
There are finger-in-the-air, “written-by-consultants” stats that get bandied around about data quality, like the claim that it costs organisations more than $3 trillion a year. But the pain is real: data quality is the bane of our (data engineers’) lives more often than not (77% of people seem to agree).
What are the causes of poor data quality? To my mind:
The people creating the data don’t care (there is an error in Salesforce, this Google Sheet was updated incorrectly). This really sucks for data teams, as it is a hard thing to improve without buy-in from the company
Systems that move data are not robust (this pipeline failed, and now something is missing a day’s worth of data)
The data itself is hard to process (causing pipeline failures)
There are modelling errors (which cause incorrect values to be propagated into end-user systems)
I have spoken to plenty of organisations that have a “shift left” mindset: data teams get pristine event data landing in S3, they load it into their warehouse or use Spark, they have alerting set up, and data quality is not an issue.
This leads to the “Data Observability Paradox”: although these tools claim to solve data quality, they cannot. They can tell you where the problem is, but they do not solve the root cause.
Indeed, some root causes are solved by having robust infrastructure (ingestion, orchestration, alerting, etc.), but the most common one is individuals not caring enough about the data. That is a cultural problem, not one that can be solved with software.
An example - data quality testing and reconciliation
We recently spoke to one of our customers who was streaming data using CDC from SQL Server into Snowflake. They wanted to ensure that SQL Server stayed in sync with Snowflake, and therefore wanted to run a data quality test across both sources every [15] minutes to check that the drift was acceptable. If the test failed, downstream models should not run.
This meant they could not use a standalone “data quality tool”, as it would have required extensive integration with their orchestration system. So we built it for them, and it works really well.
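For illustration only, here is a rough sketch of what such a reconciliation check can look like. The table name, the drift threshold, and the DB-API-style connections are assumptions, not the customer’s actual implementation.

```python
# Hypothetical sketch of a CDC reconciliation check: compare row counts in the
# source (SQL Server) and the target (Snowflake) and fail if the drift is too
# large. Connections are assumed to be DB-API compatible; names are invented.

def count_rows(conn, table: str) -> int:
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    return cur.fetchone()[0]

def reconcile(sqlserver_conn, snowflake_conn, table: str, max_drift_pct: float = 0.5) -> None:
    source_rows = count_rows(sqlserver_conn, table)
    target_rows = count_rows(snowflake_conn, table)
    drift_pct = abs(source_rows - target_rows) / max(source_rows, 1) * 100

    if drift_pct > max_drift_pct:
        # Acts as the circuit breaker: the orchestrator catches this failure
        # and holds the downstream models until the tables are back in sync.
        raise RuntimeError(
            f"{table}: drift of {drift_pct:.2f}% exceeds {max_drift_pct}% threshold"
        )

# The orchestrator schedules reconcile() on the agreed cadence; downstream
# models only run if it completes without raising.
```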
Preventing Data Quality Issues
The CI/CD aspect of data quality tooling is of the utmost importance here. The best way to catch issues is during the development lifecycle, so we definitely think CI/CD for data is important and necessary.
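As a sketch of the idea, assuming a dbt-style project laid out under models/ and a main branch to diff against, a CI gate can be as simple as a script that finds the models changed in the pull request, runs their tests (and the tests of everything downstream of them), and fails the build if anything fails.

```python
# Hypothetical CI gate for a dbt-style project: find the SQL models changed in
# the pull request and run their tests (plus everything downstream), failing
# the build with a non-zero exit code if any test fails. The base branch,
# directory layout, and use of dbt are all assumptions.
import subprocess
import sys
from pathlib import Path

def changed_models(base_branch: str = "origin/main") -> list[str]:
    diff = subprocess.run(
        ["git", "diff", "--name-only", base_branch],
        capture_output=True, text=True, check=True,
    )
    return [
        Path(name).stem
        for name in diff.stdout.splitlines()
        if name.startswith("models/") and name.endswith(".sql")
    ]

def main() -> int:
    models = changed_models()
    if not models:
        return 0  # nothing data-related changed, let the build pass
    selectors = [f"{model}+" for model in models]  # each model and its downstream
    result = subprocess.run(["dbt", "test", "--select", *selectors])
    return result.returncode

if __name__ == "__main__":
    sys.exit(main())
```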
The other way to think about data quality testing is in-pipeline testing in production.
See the article we wrote on source-testing here
Ideally, after loading data somewhere, you should test that it conforms to the schema you expect. You should test the quality, the number of rows, completeness, and so on.
Observability tools do not let you do this; they act after the event, ex post, post hoc. It is like putting a seatbelt on after the car crash has happened.
These kinds of tests need to be ingrained in the data pipeline itself, as tools like Coalesce or dbt-core allow you to do.
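Here is a minimal sketch of what such an in-pipeline check might look like, run immediately after the load step and before anything downstream. The expected schema, the minimum row count, and the use of pandas are assumptions for illustration.

```python
# Hypothetical in-pipeline source test: run straight after the load step and
# before any downstream model. The expected schema, minimum row count, and use
# of pandas are assumptions.
import pandas as pd

EXPECTED_SCHEMA = {
    "order_id": "int64",
    "order_total": "float64",
    "created_at": "datetime64[ns]",
}

def validate_load(df: pd.DataFrame, min_rows: int = 1) -> None:
    # Schema check: every expected column exists with the expected dtype.
    for column, dtype in EXPECTED_SCHEMA.items():
        if column not in df.columns:
            raise ValueError(f"Missing column: {column}")
        if str(df[column].dtype) != dtype:
            raise ValueError(f"{column}: expected {dtype}, got {df[column].dtype}")

    # Volume and completeness checks.
    if len(df) < min_rows:
        raise ValueError(f"Expected at least {min_rows} rows, got {len(df)}")
    if df["order_id"].isnull().any():
        raise ValueError("order_id contains nulls")

# Because this raises inside the pipeline, the orchestrator stops before bad
# data reaches anything downstream: the seatbelt goes on before the crash.
```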
Finding the Data Quality Issues / Preventing Future Quality Issues
You may, as an organisation, know you have data quality issues but not know where they are. Technically, the governance manager may know what tests to write, but find it very difficult to do so: writing 100,000 lines of .yml is, unsurprisingly, rather tedious.
I will grant that observability tools make it extremely easy to write data quality tests. Notwithstanding the previous section’s objection (that these tests run after the event, so only help you identify quality issues once it is too late), if you are a governance professional “doing their job”, then you would hopefully pick data quality issues up quite quickly. Indeed, as long as you pick them up quicker than someone else, nobody really cares if bad data is in production; at least you noticed it first.
Why bother taking that risk, though? Surely it is better to monitor things in real time?
A second nice feature is that data quality tests also cover anomaly detection and trend monitoring.
As such, some of these tests are indicators or predictors of future data quality issues rather than present ones. This is also undoubtedly useful, although it lies much lower down The Engineer’s hierarchy of data needs: what do you care more about, the broken pipeline your CEO is screaming at you about right now, or the pipeline that could break in two weeks’ time?
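To make that concrete, here is a toy sketch of the kind of trend check these tools run: flag today’s value of a metric (say, a daily row count) when it deviates too far from its recent history. The three-standard-deviation threshold and the seven-day window are arbitrary assumptions.

```python
# Toy anomaly check on a daily metric (e.g. a table's row count): flag today's
# value if it is more than `threshold` standard deviations from the recent mean.
import statistics

def is_anomalous(history: list[float], today: float, threshold: float = 3.0) -> bool:
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    if stdev == 0:
        return today != mean
    return abs(today - mean) / stdev > threshold

# Example: seven healthy daily loads, then a sudden drop.
recent_counts = [10_120, 9_980, 10_340, 10_050, 10_210, 9_900, 10_150]
print(is_anomalous(recent_counts, today=4_200))  # True: likely a partial load
```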
Anything else?
There are some other problems that get thrown around, like painful incident-management processes, alert fatigue, and so on.
With data, the best cure is prevention. The other approach we need to start seeing more of is prioritisation. In Orchestra’s data catalog, you can see the assets that form part of your Data Products. You can also see all the other assets in your environment.
Why are these here? Why do they cost so much to maintain? Do we really need to be spending time looking after them?
50% of errors come from stuff we probably don’t need to maintain (therein lies the painful process of incident management). NOTHING EVER GETS USED. Data teams, like it or not, are BAD at prioritising things. Or perhaps stakeholders are bad at saying what they want; either way, the root cause of having too many failures is that we are maintaining an enormous amount of bloat.
Again — can an observability tool help you and your organisation work out what to build and what to kill? You know the answer.
Conclusion
Data Observability is a tiny category and a product of the ZIRP era that unfortunately has a disproportionate amount of marketing dollars behind it and is unduly influencing our industry. It has gathered up an arbitrary collection of problems data teams face, despite not offering any best-practice solutions.
Some of these are technical. Many of them are pretty niche. Most of them, like improving data culture / getting a shift-left mentality, are not software problems at all.
One increasingly evident thing data teams need to do better is prioritise, which in some ways is related to Governance (perhaps the art of choosing what to do or what to care about).
We would do well as Data Engineers and Analytics Leaders to think about the root causes of the challenges we face, rather than simply jump to the next shiny thing 🍒
Learn more about Orchestra
If you have a system that is monitoring all your other systems, then you’ll always be proactive and bad data will never reach production (unless you let it happen). This means happier, more trusting stakeholders, better prioritisation, less firefighting and more building.
You can read more about Orchestra here.