Foreword
Some of the biggest pains we see facing data engineers and analytics engineers are that they spend too much time maintaining boilerplate infrastructure, and still have no visibility into pipeline failures.
That means you’re constantly fighting fires and don’t have time to focus on building. What’s worse, the business doesn’t trust the data.
At Orchestra we’re building a Unified Control Plane for Data Ops. A Data Status page if you like, with some incredible features to give data teams their time back so they can focus on building. You can try it now, free, here.
Introduction
Earlier this week, we announced the MetaEngine, a powerful way to write declarative pipelines that’s a million times faster than building a standard stack from scratch using Airflow.
We’re rewriting the gameplan for running data quality tests.
Sometimes folks end up buying an additional tool just for running DQ tests. These tools:
are slow to use, UI-driven, and prone to error
are costly
create architectural complexity (Orchestra has integrations to DQ testing tools, for example)
You can use AI and the MetaEngine to turn Orchestra into the most powerful data quality testing tool you’ve ever seen. In this article we’ll show you how.
You can also do this with your own orchestrator; there is no need to use Orchestra. It is, of course, much more time-intensive, but the general principle holds true across any architectural design pattern: create a UI to design dynamic DAGs that run DQ tests / dbt-style tests, then clean, store, and surface the metadata.
Context-Awareness with the MetaEngine
Part of the reason the MetaEngine is so called is that the Orchestra engine automatically aggregates and cleans metadata from the entire stack.
This means our AI understands everything about your data. It means you can ask it to insert table values into your data pipeline.
But wait — what’s a Matrix?
Matrix Applications in the MetaEngine
The easiest way to think of a Matrix is as an extension of a configuration file: you put a list into a task definition, and the engine expands it into many tasks, one per entry.
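A minimal sketch of the expansion idea, assuming a hypothetical helper (this is illustrative, not Orchestra's actual engine): a single task template plus a matrix of inputs produces one task per combination.

```python
from itertools import product

def expand_matrix(task_template: dict, matrix_inputs: dict) -> list[dict]:
    """Expand one task template into one task per matrix combination.

    Hypothetical helper -- illustrates the concept, not Orchestra's real engine.
    """
    keys = list(matrix_inputs)
    tasks = []
    for combo in product(*(matrix_inputs[k] for k in keys)):
        # Each task carries its own slice of the matrix as parameters
        params = dict(zip(keys, combo))
        tasks.append({**task_template, "matrix": params})
    return tasks

template = {"integration_job": "SNOWFLAKE_RUN_TEST"}
tasks = expand_matrix(
    template,
    {"TABLES": ["DB.SCHEMA.T1", "DB.SCHEMA.T2", "DB.SCHEMA.T3"]},
)
print(len(tasks))  # → 3, one task per table
```

Add a thousand tables to the list and you get a thousand tasks, with no extra pipeline code.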
What if we had a RUN_DATA_QUALITY_TEST task? Would the MetaEngine allow us to run thousands, if not hundreds of thousands, of DQ tests from just a config file?
YES! Yes it would. Here is how.
The Data Quality Test in the MetaEngine
In the example above, the .yml resembles:
```yaml
version: v1
name: primary_key_test
pipeline:
  fcf80b81-6bd2-49b5-995e-76604b1e77ef:
    tasks:
      2ebc20bb-14b1-49ae-9bee-a38eee49ae29:
        integration: SNOWFLAKE
        integration_job: SNOWFLAKE_RUN_TEST
        parameters:
          statement: "select * from ${{MATRIX.TABLES}} \nwhere pk_count is null"
          error_threshold_expression: '> 0'
          warn_threshold_expression: '> 0'
        depends_on: []
        name: DQ testing
        tags: []
        connection: default_00000
    depends_on: []
    name: ''
matrix:
  inputs:
    TABLES:
      - DATABASE.SCHEMA.TABLE
      - DATABASE.SCHEMA.TABLE2
      - MY_SECOND_DBT_MODEL
      - SLOW_MODEL_9
      # ADD NEW DATA HERE
schedule: []
sensors: {}
webhook:
  enabled: false
```
This means that to add test coverage, all you need to do is update the matrix.
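Conceptually, the engine substitutes each matrix value into the statement template before running it. A rough sketch of that substitution (the regex and the render_statement function are illustrative assumptions, not Orchestra's real templating):

```python
import re

def render_statement(statement: str, matrix_values: dict) -> str:
    """Replace ${{MATRIX.<KEY>}} placeholders with concrete values.

    Illustrative only -- Orchestra's actual templating may differ.
    """
    def sub(match: re.Match) -> str:
        return str(matrix_values[match.group(1)])
    return re.sub(r"\$\{\{MATRIX\.(\w+)\}\}", sub, statement)

sql = render_statement(
    "select * from ${{MATRIX.TABLES}} where pk_count is null",
    {"TABLES": "DATABASE.SCHEMA.TABLE2"},
)
print(sql)  # → select * from DATABASE.SCHEMA.TABLE2 where pk_count is null
```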
New tests can be defined and added:
```yaml
version: v1
name: null_test
pipeline:
  fcf80b81-6bd2-49b5-995e-76604b1e77ef:
    tasks:
      null_test:
        integration: SNOWFLAKE
        integration_job: SNOWFLAKE_RUN_TEST
        parameters:
          statement: "select * from ${{MATRIX.TABLES[\"id\"]}} \nwhere ${{MATRIX.TABLES[\"column\"]}} is null"
          error_threshold_expression: '> 0'
          warn_threshold_expression: '> 0'
        depends_on: []
        name: DQ testing
        tags: []
        connection: default_00000
    depends_on: []
    name: ''
matrix:
  inputs:
    tables:
      - id: shipments
        column: date
      - id: skus
        column: end_date
      # ADD NEW DATA HERE
schedule: []
sensors: {}
webhook:
  enabled: false
```
You could even add these all to the same pipeline.
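With dict-valued matrix entries, each entry supplies several fields to one statement template. A sketch of how that substitution could work (again, the helper and regex are assumptions for illustration):

```python
import re

def render_statement(statement: str, entry: dict) -> str:
    """Substitute ${{MATRIX.<name>["field"]}} placeholders from one matrix entry.

    Hypothetical sketch; Orchestra's real placeholder handling may differ.
    """
    pattern = r'\$\{\{MATRIX\.\w+\["(\w+)"\]\}\}'
    return re.sub(pattern, lambda m: str(entry[m.group(1)]), statement)

entries = [
    {"id": "shipments", "column": "date"},
    {"id": "skus", "column": "end_date"},
]
rendered = [
    render_statement(
        'select * from ${{MATRIX.TABLES["id"]}} where ${{MATRIX.TABLES["column"]}} is null',
        e,
    )
    for e in entries
]
print(rendered[0])  # → select * from shipments where date is null
```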
The results
The MetaEngine is incredibly powerful. In the example below, I ran a subset of the DQ tests above to show you what this looks like.
Here we can see there is a not_null test we’ve defined and parameterised using the Matrix syntactic sugar.
There is a shipments table with an id column we want to check is not null.
We can see on the right-hand panel that the expanded Data Quality Test is indeed selecting from SHIPMENTS where ID is null. The test passed! No nulls.
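The pass/fail logic implied by error_threshold_expression: '> 0' can be sketched as comparing the returned row count against the threshold. The evaluate_threshold helper below is a hypothetical illustration of that check, not Orchestra's actual implementation:

```python
import operator

# Supported comparison operators for threshold expressions
OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}

def evaluate_threshold(row_count: int, expression: str) -> bool:
    """Return True if the threshold expression (e.g. '> 0') fires.

    Illustrative assumption about how threshold expressions are evaluated.
    """
    op, value = expression.split()
    return OPS[op](row_count, int(value))

# A not_null test returning zero rows passes; any rows trip the error threshold.
print(evaluate_threshold(0, "> 0"))  # → False (test passed)
print(evaluate_threshold(3, "> 0"))  # → True  (test failed)
```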
The Orchestra Data Quality Dashboard
All the results are automatically propagated by the MetaEngine and displayed in user-friendly dashboards.
This gives you everything you need to debug and monitor your entire Data Quality estate, in code, in a serverless way, using AI. Pretty cool!
Conclusion
We’ve long thought that the orchestrator is the right place to run data quality tests.
Now we’re making that user experience a reality.
By leveraging the MetaEngine to generate SQL quality tests, data and analytics teams can get complete test coverage using the Orchestra framework at a fraction of the cost, and in a fraction of the time, compared with using a separate tool.
An interesting next step would be to explore what this would look like in legacy orchestration frameworks. Practically, the steps you might take are:
Define a Data Quality Operator
Wire the Data Quality Operator into your dynamic DAG framework
Set up CI/CD and Kubernetes
Ensure the metadata is stored and cleaned somewhere
Build a dashboard to observe the results
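The steps above can be sketched framework-agnostically. The DataQualityOperator below is a hypothetical placeholder (in Airflow, for instance, you would subclass BaseOperator instead), showing the operator, the dynamic generation from config, and the metadata emission:

```python
from dataclasses import dataclass

@dataclass
class DataQualityOperator:
    """Step 1: a hypothetical operator that runs one SQL test and checks rows."""
    task_id: str
    statement: str
    error_threshold: int = 0

    def execute(self, run_query) -> dict:
        rows = run_query(self.statement)
        status = "failed" if len(rows) > self.error_threshold else "passed"
        # Step 4: this result dict is what you'd persist somewhere durable
        return {"task_id": self.task_id, "status": status, "rows": len(rows)}

# Step 2: generate one task per config entry (the "dynamic DAG" part)
config = [
    {"table": "shipments", "column": "date"},
    {"table": "skus", "column": "end_date"},
]
dq_tasks = [
    DataQualityOperator(
        task_id=f"not_null__{c['table']}__{c['column']}",
        statement=f"select * from {c['table']} where {c['column']} is null",
    )
    for c in config
]

fake_query = lambda sql: []  # stand-in for a real warehouse client
results = [t.execute(fake_query) for t in dq_tasks]
print(results)
```

Step 5 (the dashboard) then just reads whatever store the result dicts land in.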
If you’ve built this yourself we are sure there are some great learnings. We’d love to chat!
🚨 If you’re ready to build your v2.0 or just getting started with your v1.0, we’d love to chat.
Find out more about Orchestra
Orchestra is a unified control plane for Data and AI Operations.
We help data teams spend less time maintaining infrastructure, make them proactive instead of reactive, and ultimately win trust in data and AI from the business.
We do this by consolidating orchestration with monitoring, data quality testing, and data discovery. You don’t need a separate observability, lineage, or catalog tool with Orchestra.
Check out: