Foreword
Some of the biggest pains we see facing data engineers and analytics engineers are that they spend too much time maintaining boilerplate infrastructure, and still have no visibility into pipeline failures.
That means you’re constantly fighting fires and don’t have time to focus on building. What’s worse, the business doesn’t trust the data.
At Orchestra we’re building a Unified Control Plane for Data Ops. A Data Status page if you like, with some incredible features to give data teams their time back so they can focus on building. You can try it now, free, here.
Introduction
Earlier this week, we announced the MetaEngine, a powerful way to write declarative pipelines that’s a million times faster than building a standard stack from scratch using Airflow.
We’re rewriting the gameplan for running data quality tests.
Sometimes folks end up buying an additional tool just for running DQ tests. These tools:
are slow to use, UI-driven, and prone to error
are costly
create architectural complexity (Orchestra has integrations to DQ testing tools, for example)
You can use AI and the MetaEngine to turn Orchestra into the most powerful data quality testing tool you’ve ever seen. In this article we’ll show you how.
You can also do this with your own orchestrator; there is no need to use Orchestra. It is, of course, much more time-intensive, but the general principle holds true across any architectural design pattern: create a UI to design dynamic DAGs that run DQ tests / dbt-style tests, then clean, store, and surface the metadata.
Context-Awareness with the MetaEngine
Part of the reason the MetaEngine is so called is that the Orchestra engine automatically aggregates and cleans metadata from the entire stack.
This means our AI understands everything about your data. It means you can ask it to insert table values into your data pipeline.
But wait — what’s a Matrix?
Matrix Applications in the MetaEngine
The easiest way to think of a Matrix is as an extension of a configuration file: you put a list into a task definition, and the engine expands it into many tasks, one per entry.
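A minimal sketch of the expansion idea, assuming a hypothetical helper (this is illustrative, not Orchestra's actual engine): a single task template plus a matrix of inputs produces one task per combination.

```python
from itertools import product

def expand_matrix(task_template: dict, matrix_inputs: dict) -> list[dict]:
    """Expand one task template into one task per matrix combination.

    Hypothetical helper -- illustrates the concept, not Orchestra's real engine.
    """
    keys = list(matrix_inputs)
    tasks = []
    for combo in product(*(matrix_inputs[k] for k in keys)):
        # Each task carries its own slice of the matrix as parameters
        params = dict(zip(keys, combo))
        tasks.append({**task_template, "matrix": params})
    return tasks

template = {"integration_job": "SNOWFLAKE_RUN_TEST"}
tasks = expand_matrix(
    template,
    {"TABLES": ["DB.SCHEMA.T1", "DB.SCHEMA.T2", "DB.SCHEMA.T3"]},
)
print(len(tasks))  # → 3, one task per table
```

Add a thousand tables to the list and you get a thousand tasks, with no extra pipeline code.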
What if we had a RUN_DATA_QUALITY_TEST task? Would the MetaEngine allow us to run thousands, if not hundreds of thousands, of DQ tests from just a config file?
YES! Yes it would. Here is how.
The Data Quality Test in the MetaEngine
In the example above, the .yml resembles:
```yaml
version: v1
name: primary_key_test
pipeline:
  fcf80b81-6bd2-49b5-995e-76604b1e77ef:
    tasks:
      2ebc20bb-14b1-49ae-9bee-a38eee49ae29:
        integration: SNOWFLAKE
        integration_job: SNOWFLAKE_RUN_TEST
        parameters:
          statement: "select * from ${{MATRIX.TABLES}} \nwhere pk_count is null"
          error_threshold_expression: '> 0'
          warn_threshold_expression: '> 0'
        depends_on: []
        name: DQ testing
        tags: []
        connection: default_00000
    depends_on: []
    name: ''
matrix:
  inputs:
    TABLES:
      - DATABASE.SCHEMA.TABLE
      - DATABASE.SCHEMA.TABLE2
      - MY_SECOND_DBT_MODEL
      - SLOW_MODEL_9
      # ADD NEW DATA HERE
schedule: []
sensors: {}
webhook:
  enabled: false
```
This means that to add test coverage, all you need to do is update the matrix.
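Conceptually, the engine substitutes each matrix value into the statement template before running it. A rough sketch of that substitution (the regex and the render_statement function are illustrative assumptions, not Orchestra's real templating):

```python
import re

def render_statement(statement: str, matrix_values: dict) -> str:
    """Replace ${{MATRIX.<KEY>}} placeholders with concrete values.

    Illustrative only -- Orchestra's actual templating may differ.
    """
    def sub(match: re.Match) -> str:
        return str(matrix_values[match.group(1)])
    return re.sub(r"\$\{\{MATRIX\.(\w+)\}\}", sub, statement)

sql = render_statement(
    "select * from ${{MATRIX.TABLES}} where pk_count is null",
    {"TABLES": "DATABASE.SCHEMA.TABLE2"},
)
print(sql)  # → select * from DATABASE.SCHEMA.TABLE2 where pk_count is null
```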
New tests can be defined and added:
```yaml
version: v1
name: null_test
pipeline:
  fcf80b81-6bd2-49b5-995e-76604b1e77ef:
    tasks:
      null_test:
        integration: SNOWFLAKE
        integration_job: SNOWFLAKE_RUN_TEST
        parameters:
          statement: "select * from ${{MATRIX.TABLES[\"id\"]}} \nwhere ${{MATRIX.TABLES[\"column\"]}} is null"
          error_threshold_expression: '> 0'
          warn_threshold_expression: '> 0'
        depends_on: []
        name: DQ testing
        tags: []
        connection: default_00000
    depends_on: []
    name: ''
matrix:
  inputs:
    tables:
      - id: shipments
        column: date
      - id: skus
        column: end_date
      # ADD NEW DATA HERE
schedule: []
sensors: {}
webhook:
  enabled: false
```
You could even add these all to the same pipeline.
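With dict-valued matrix entries, each entry supplies several fields to one statement template. A sketch of how that substitution could work (again, the helper and regex are assumptions for illustration):

```python
import re

def render_statement(statement: str, entry: dict) -> str:
    """Substitute ${{MATRIX.<name>["field"]}} placeholders from one matrix entry.

    Hypothetical sketch; Orchestra's real placeholder handling may differ.
    """
    pattern = r'\$\{\{MATRIX\.\w+\["(\w+)"\]\}\}'
    return re.sub(pattern, lambda m: str(entry[m.group(1)]), statement)

entries = [
    {"id": "shipments", "column": "date"},
    {"id": "skus", "column": "end_date"},
]
rendered = [
    render_statement(
        'select * from ${{MATRIX.TABLES["id"]}} where ${{MATRIX.TABLES["column"]}} is null',
        e,
    )
    for e in entries
]
print(rendered[0])  # → select * from shipments where date is null
```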
The results
The MetaEngine is incredibly powerful. In the example below, I ran a subset of the DQ tests above to show you what this looks like.
Here we can see there is a not_null test we’ve defined and parameterised using the Matrix syntactic sugar.
There is a shipments table with an id column we want to check is not null.
We can see on the right-hand panel that the expanded Data Quality Test is indeed selecting from SHIPMENTS where ID is null. The test passed! No nulls.
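The pass/fail logic implied by error_threshold_expression: '> 0' can be sketched as comparing the returned row count against the threshold. The evaluate_threshold helper below is a hypothetical illustration of that check, not Orchestra's actual implementation:

```python
import operator

# Supported comparison operators for threshold expressions
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}

def evaluate_threshold(row_count: int, expression: str) -> bool:
    """Return True if the threshold expression (e.g. '> 0') fires.

    Illustrative assumption about how threshold expressions are evaluated.
    """
    op, value = expression.split()
    return OPS[op](row_count, int(value))

# A not_null test returning zero rows passes; any rows trip the error threshold.
print(evaluate_threshold(0, "> 0"))  # → False (test passed)
print(evaluate_threshold(3, "> 0"))  # → True  (test failed)
```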
The Orchestra Data Quality Dashboard
All the results are automatically propagated by the MetaEngine and displayed in user-friendly dashboards.
This gives you everything you need to debug and monitor your entire Data Quality estate, in code, in a serverless way, using AI. Pretty cool!
Conclusion
We’ve long thought that the orchestrator is the right place to run data quality tests.
Now we’re making that user experience a reality.
By leveraging the MetaEngine to generate SQL quality tests, data and analytics teams can get complete test coverage using the Orchestra framework at a fraction of the cost, and in a fraction of the time, compared with using a separate tool.
An interesting next step would be to explore what this would look like in legacy orchestration frameworks. Practically, the steps you might take are:
Define a Data Quality Operator
Wire the Data Quality Operator into your dynamic DAG framework
Set up CI/CD and Kubernetes
Ensure the metadata is stored and cleaned somewhere
Build a dashboard to observe the results
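The steps above can be sketched framework-agnostically. The DataQualityOperator below is a hypothetical placeholder (in Airflow, for instance, you would subclass BaseOperator instead), showing the operator, the dynamic generation from config, and the metadata emission:

```python
from dataclasses import dataclass

@dataclass
class DataQualityOperator:
    """Step 1: a hypothetical operator that runs one SQL test and checks rows."""
    task_id: str
    statement: str
    error_threshold: int = 0

    def execute(self, run_query) -> dict:
        rows = run_query(self.statement)
        status = "failed" if len(rows) > self.error_threshold else "passed"
        # Step 4: this result dict is what you'd persist somewhere durable
        return {"task_id": self.task_id, "status": status, "rows": len(rows)}

# Step 2: generate one task per config entry (the "dynamic DAG" part)
config = [
    {"table": "shipments", "column": "date"},
    {"table": "skus", "column": "end_date"},
]
dq_tasks = [
    DataQualityOperator(
        task_id=f"not_null__{c['table']}__{c['column']}",
        statement=f"select * from {c['table']} where {c['column']} is null",
    )
    for c in config
]

fake_query = lambda sql: []  # stand-in for a real warehouse client
results = [t.execute(fake_query) for t in dq_tasks]
print(results)
```

Step 5 (the dashboard) then just reads whatever store the result dicts land in.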
If you’ve built this yourself we are sure there are some great learnings. We’d love to chat!
🚨 If you’re ready to build your v2.0 or just getting started with your v1.0, we’d love to chat.
Find out more about Orchestra
Orchestra is a unified control plane for Data and AI Operations.
We help data teams spend less time maintaining infrastructure, make them proactive instead of reactive, and ultimately win trust in data and AI from the business.
We do this by consolidating orchestration with monitoring, data quality testing, and data discovery. You don’t need a separate observability, lineage, or catalog tool with Orchestra.
Check out: