Unpopular Opinion: Column-Level Lineage is not a feature you need

An unpopular opinion is grounded in some hard truths

Hugo Lu

Nov 16, 2024

Introduction

I recently posted an “Unpopular Opinion” on LinkedIn that proved to be relatively unpopular indeed.

The jury was split, but the discussion did highlight a few things:

People mean different things when they say “column-level lineage”
Column-level lineage solves different problems
Column-level lineage may or may not be a feature of something you choose to build or buy

Let’s dive into these.

Two types of column-level lineage

What is column-level lineage?

Column level lineage for the data system / data modelling

Put simply, it is a graphical representation of which columns or fields (in a table, a dashboard, or anything else really) depend on which other columns or fields.

This is useful because it helps:

Debugging: if you know what columns something failing depends on, you can trace an error back to its source
Refactoring: if you inherit a bunch of spaghetti code from someone, it can be helpful to have a diagrammatic representation of what’s going on, as it’s quicker to understand it this way than by reading code
System Design: visualising a system is a helpful way to work out how to improve it

There are other problems this form of lineage solves, but these are the main ones.
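To make the debugging use-case concrete, here is a minimal sketch of column-level lineage as a plain dependency graph. The column names and the `UPSTREAM` mapping are hypothetical, invented purely for illustration; a real tool would derive them by parsing SQL or reading BI metadata.

```python
from collections import deque

# Hypothetical lineage: each column maps to the columns it is derived from.
UPSTREAM = {
    "dashboard.revenue_delta": ["reporting.rolling_4wk_revenue"],
    "reporting.rolling_4wk_revenue": ["staging.orders.amount", "staging.orders.week"],
    "staging.orders.amount": ["raw.erp_orders.amount"],
    "staging.orders.week": ["raw.erp_orders.created_at"],
}


def trace_upstream(column, upstream):
    """Return every column that `column` depends on, nearest first (breadth-first)."""
    seen = []
    queue = deque(upstream.get(column, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


# Walk back from the failing dashboard field towards its raw sources.
sources = trace_upstream("dashboard.revenue_delta", UPSTREAM)
print(sources)
```

The drawn lineage chart is essentially this traversal rendered as a picture.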

Column-level lineage in CI/CD

Column-level lineage can also refer to column-level lineage checks in CI/CD.

Source: https://medium.com/inthepipeline/free-data-impact-reports-for-your-dbt-data-project-2d87cdf6aa0c

In the link above, we can see the impact of a Pull Request to a git repository. Granted, it’s not actually column level (it’s table level) but you get the gist.

This is really helpful for understanding the blast radius of your pull request. Think twice before renaming all the columns in the customers table — you could break the dashboard your CEO cares about.

Seems helpful, doesn’t it?

I would argue that this second form, column-lineage checks in CI/CD, can be pretty indispensable when you’re democratising access to data modelling.

Without this, analysts who don’t understand everything downstream could break things they shouldn’t, with disastrous consequences.
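As a rough sketch of what such a check does under the hood, the snippet below computes the blast radius of a set of changed columns and flags anything that reaches a protected asset. The `DOWNSTREAM` mapping, the column names, and the `ceo_dashboard.` prefix are all assumptions made up for this example; in practice the mapping would come from your project's parsed SQL or manifest.

```python
# Hypothetical lineage: each column maps to the columns that read from it.
DOWNSTREAM = {
    "customers.customer_id": [
        "orders_enriched.customer_id",
        "ceo_dashboard.active_customers",
    ],
    "customers.email": ["marketing_export.email"],
    "orders_enriched.customer_id": ["ceo_dashboard.revenue_by_customer"],
}


def blast_radius(changed_columns, downstream):
    """Return every column reachable downstream of the changed columns."""
    impacted = set()
    stack = list(changed_columns)
    while stack:
        column = stack.pop()
        for child in downstream.get(column, []):
            if child not in impacted:
                impacted.add(child)
                stack.append(child)
    return impacted


def check_pull_request(changed_columns, downstream, protected_prefixes):
    """Return (ok, impacted, critical); ok is False if a protected asset is hit."""
    impacted = blast_radius(changed_columns, downstream)
    critical = sorted(c for c in impacted if c.startswith(tuple(protected_prefixes)))
    return (not critical), sorted(impacted), critical


# Renaming customers.customer_id reaches the CEO's dashboard, so the check fails.
ok, impacted, critical = check_pull_request(
    ["customers.customer_id"], DOWNSTREAM, ["ceo_dashboard."]
)
print("OK to merge" if ok else f"Blocked, touches: {critical}")
```

A CI job would run something like this against the columns changed in the pull request and exit non-zero when `ok` is false.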

But this is often not a feature, let alone a paid one. These are Git actions (CI jobs such as GitHub Actions), which are normally free. Sure, there are companies like Octopus Deploy that basically sell you managed git actions in terraform for complicated things, but we seem to be at this interesting place in data where no one wants to write these actions themselves.

Instead, there are a few software companies that try open-sourcing these and then decide not to. Can someone please just write these really well, like the GitLab people did, so everyone can just crack on with this as a free feature for all?

The other thing worth pointing out is that when most people say “column level lineage” they aren’t talking about git actions; they’re talking about wanting to see the state of the system they’re in (i.e. the first one).

This is complicated to do. There are numerous companies whose business it is to build column-level lineage charts. These are typically observability tools, but they may be more niche.

The thing to understand is that if you own the bulk of the code you write, and you can get a bit of metadata from a SaaS tool like Tableau, column lineage is easy to build.

This is why Databricks and Snowflake have basic lineage and, I’ll wager, will have a column-level view soon. BigQuery has Dataplex. It’s also available in dbt Power User, if you’re so inclined, within VS Code, i.e. your code editor.

Ok so if it’s all free and helpful what’s the fuss about?

That’s a good question. Let’s think about the three main problems we’re saying the first form of column lineage addresses:

Debugging: if you know what columns something failing depends on, you can trace an error back to its source
Refactoring: if you inherit a bunch of spaghetti code from someone, it can be helpful to have a diagrammatic representation of what’s going on, as it’s quicker to understand it this way than by reading code
System Design: visualising a system is a helpful way to work out how to improve it

System Design with Column Level Lineage

Looks like actually having something end-to-end would be helpful here. But how helpful? What are we solving?

Are we solving for the warehouse? Has someone written a bunch of really odd code? Perhaps we don’t know what entities we’re modelling?

If that’s the problem we’re solving, some table lineage should do. If there are three customers tables, you don’t need to know which fields depend on which; you need to know which one the downstream tables should point at.

Ok maybe the issue is that we’re interested in the dashboards and the downstream. Does knowing what fields depend on what fields really impact how we would design the system? Surely it would be more helpful to know the tables etc. so we can ensure the right use-cases are hooked up to the right data… so I’m not really sure what incremental value column-lineage provides above table or asset lineage here.
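One way to see how little is lost is to collapse column-level edges down to table-level edges. The sketch below does that, assuming (hypothetically) that columns are named `schema.table.column` and that lineage is stored as `(source, target)` pairs; both conventions are mine, not those of any particular tool.

```python
def table_of(column):
    """'schema.table.column' -> 'schema.table' (the last dot separates the column)."""
    return column.rsplit(".", 1)[0]


def to_table_lineage(column_edges):
    """Collapse (source_column, target_column) pairs into unique table-level edges."""
    edges = set()
    for source, target in column_edges:
        source_table, target_table = table_of(source), table_of(target)
        if source_table != target_table:  # drop dependencies within one table
            edges.add((source_table, target_table))
    return sorted(edges)


# Hypothetical column-level edges.
column_edges = [
    ("raw.erp_orders.amount", "staging.orders.amount"),
    ("raw.erp_orders.created_at", "staging.orders.week"),
    ("staging.orders.amount", "reporting.revenue.rolling_4wk_revenue"),
    ("staging.orders.week", "reporting.revenue.rolling_4wk_revenue"),
    ("staging.orders.amount", "staging.orders.amount_gbp"),
]

table_edges = to_table_lineage(column_edges)
print(table_edges)
```

Five column edges become two table edges, and for a system-design conversation those two are usually the ones that matter.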

Refactoring with column based lineage

So here’s a juicy problem: there’s a really important report with some interesting metrics, like the rolling 4 week on 4 week revenue delta by customer, that you just can’t get your head around.

Turns out the orders table is being imported straight into Tableau and there are some funky-ass calculations going on in there! The report keeps breaking, no one knows how the numbers arise, and the report doesn’t get used as people don’t trust it. Screw that, let’s refactor it into a view or materialisation and make Tableau simpler.

Your column-level lineage tool lets you know that:

R4wk on 4wk -> R4wk + R4wk lag-1 -> total revs -> orders

But you don’t know the calculation. You hope it’s simply

sum(orders) over (partition by customer order by week rows between 3 preceding and current row)

but you don’t know this. You are going to have to get into that juicy code.

At which point we’re in the code. The exact thing we wanted to avoid!
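For what it’s worth, here is a sketch of what the hoped-for calculation might look like once it is pulled out of Tableau. It assumes one revenue figure per week for a single customer, and it assumes “4 week on 4 week” means comparing a rolling 4-week total with the previous, non-overlapping 4-week total. Both are guesses; the lineage chart cannot tell you which definition the report really uses.

```python
def rolling_sum(values, window=4):
    """Sum of each value and the (window - 1) values before it.

    Returns None until a full window is available. A SQL window function
    with the same frame would return a partial sum for those rows instead.
    """
    out = []
    for i in range(len(values)):
        if i + 1 < window:
            out.append(None)
        else:
            out.append(sum(values[i + 1 - window : i + 1]))
    return out


def four_week_on_four_week_delta(weekly_revenue):
    """Relative change of the rolling 4-week total vs. the total 4 weeks earlier."""
    rolling = rolling_sum(weekly_revenue, window=4)
    deltas = []
    for i, current in enumerate(rolling):
        previous = rolling[i - 4] if i >= 4 else None
        if current is None or previous is None or previous == 0:
            deltas.append(None)  # not enough history, or nothing to compare against
        else:
            deltas.append((current - previous) / previous)
    return deltas


# Hypothetical weekly revenue for one customer: four good weeks, then four weak ones.
weekly_revenue = [100, 100, 100, 100, 50, 50, 50, 50]
deltas = four_week_on_four_week_delta(weekly_revenue)
print(deltas[-1])
```

Whatever the real definition turns out to be, writing it down in one reviewable place is the actual refactor.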

Debugging with column-based lineage

Ok one final problem: the dashboard is broken. In particular, the rolling 4 week on 4 week sales delta looks off; apparently sales for the last 4 weeks have plummeted an astonishing 50%. What do we do?

Our column-level lineage will tell us that the underlying metric powering this obfuscated metric is orders in an orders table, nested deep in our warehouse.

Turns out an upstream process responsible for pulling order data in from the ERP into the warehouse has broken. Everything else has run on a schedule, as we planned, but there’s no new data in Tableau because of that upstream failure. Column-level lineage has saved the day — we can quickly identify the cause for the failure.

We fix the upstream process, and sort everything out.

But why are we in this position in the first place?

Surely we should have been notified of the upstream failure? Why did the stakeholder find out before us? Why did the models materialise even though the upstream task failed — did we not check for incompleteness in the data first? Why did we refresh the dashboard?

The answer is often that orchestration and pipeline visibility are lacking. Without a system that can govern processes end to end as data is moving, we cannot possibly hope to catch these errors in real time.

Furthermore, such a system would ideally have alerted affected stakeholders (and us), to let them know errors had been identified and we (the friendly, neighbourhood data team) were trying to fix it.

So let’s suppose we have a sophisticated orchestration platform that can handle alerts like this, that also performs very basic data quality tests. Do we really need column-level lineage to identify the failure in real time? The answer is no, of course we don’t.
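To illustrate, here is a toy sketch of the behaviour described above: tasks run in dependency order, a basic data quality test sits between the load and the models, and anything downstream of a failure is skipped while an alert goes out. The task names, the `run_pipeline` helper, and the in-memory `alerts` list are all hypothetical stand-ins, not the API of any real orchestrator.

```python
def run_pipeline(tasks, alert):
    """Run (name, callable, depends_on) tasks in order.

    A task is skipped unless all of its dependencies succeeded, and every
    failure triggers an alert.
    """
    status = {}
    for name, run, depends_on in tasks:
        if any(status.get(dep) != "success" for dep in depends_on):
            status[name] = "skipped"
            continue
        try:
            run()
            status[name] = "success"
        except Exception as exc:
            status[name] = "failed"
            alert(f"{name} failed: {exc}")
    return status


new_rows = {"count": 0}  # pretend the ERP extract ran but delivered nothing


def load_erp_orders():
    new_rows["count"] = 0  # the load "succeeds" with no new data


def check_orders_not_empty():
    # Very basic data quality test: refuse to continue without new orders.
    if new_rows["count"] == 0:
        raise ValueError("no new orders loaded")


def build_revenue_model():
    pass  # would materialise the rolling revenue model


def refresh_dashboard():
    pass  # would trigger the Tableau refresh


alerts = []
status = run_pipeline(
    [
        ("load_erp_orders", load_erp_orders, []),
        ("check_orders_not_empty", check_orders_not_empty, ["load_erp_orders"]),
        ("build_revenue_model", build_revenue_model, ["check_orders_not_empty"]),
        ("refresh_dashboard", refresh_dashboard, ["build_revenue_model"]),
    ],
    alert=alerts.append,
)
print(status)
print(alerts)
```

The dashboard never refreshes with stale numbers, and the data team hears about the problem before the stakeholder does, with no lineage chart involved.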

Conclusion

Column-level lineage has rapidly gained popularity over the last couple of years, giving rise to lots of new companies, new features, and new interpretations.

For Data Engineers, it is generally indispensable for assessing the blast radius of pull requests and changes to codebases. When refactoring and evaluating system design, it is a helpful tool for understanding what depends on what.

However, the main problems are typically solved by other architectural or design elements, like robust orchestration and data pipeline visibility. They may also be solved by table-level lineage, or by column-level lineage checks in CI/CD and free tools.

One point deliberately left out of this discussion is the scenario where companies require granular tracing and audit capabilities as a matter of course, as in heavily regulated industries like Financial or Healthcare Services.

There is one particular regulation in Banking, BCBS 239, that is often used as the justification for heavy-duty lineage tools like IBM’s. One important Principle is:

Principle 2 Data architecture and IT infrastructure — A bank should design, build and maintain data architecture and IT infrastructure which fully supports its risk data aggregation capabilities and risk reporting practices not only in normal times but also during times of stress or crisis, while still meeting the other Principles.

There is nothing actually in the regs that specifically mentions column-level lineage, but it’s accepted as necessary for ensuring reliable, accurate and timely risk reporting. These principles can be applied to any area where there is some highly critical use-case for data and reporting. If banks fail to handle risk, there can be disastrous economic consequences.

If you fail to handle [X] data, what are the disastrous consequences you face?

A hard truth is that for many BI use-cases, there aren’t any. As I write here and in the “gold-rush paradox”, there is a chicken and egg problem. Perhaps there aren’t any critical use-cases because your data isn’t reliable enough.

We need to hold ourselves to higher standards to encourage people to make critical decisions using data — perhaps that does mean leveraging column-level lineage in a stakeholder-facing way. It certainly involves being the “bigger person” and saying “I don’t care if you don’t see the value in data today. We’re going to make it super reliable, super available and easy to understand, then if you still don’t want to make use of it, we’ll try something else”.

That may involve standalone column-level lineage tools, but the lower-hanging fruit will always be orchestration, pipeline visibility, and adherence to other best practices like strong CI/CD checks and rigorous data modelling.

The Orchestra Data Leadership Newsletter
