
What if every commit corresponded to some data? (Screenshot from GitKraken, which I recently discovered and enjoyed looking at.)
When I first started Orchestra, I got very into the idea that Data and Software teams were fundamentally different, because Software teams ship apps while Data Teams ship Apps and Data.
Data Teams might build an App that moves data from A to B but the real thing we care about is that the data in B is timely, reliable and high-fidelity.
There were a few things I thought made this very different. A lot of these were wrong, but I think this one was right:
The only way to test a software application is to create a replica of prod and run it there
We don’t want to create replicas of our data. That’s bad. If I have 100 developers and they all make a pull request at once, that means 100 copies for them to test on locally and then another 100 ephemeral copies in staging and then 100 merge requests into production at once. This is obviously ridiculous.
So what needs to be replicated? Do you copy the entire Snowflake Account? No that’s too much.
Ok what about the entire Database? Hmm still not really necessary. You could just have one database with different schema/datasets within it and then different copies of the tables.
Why even have copies? Why not have zero copy clones?
And so on and so forth.
Why the warehouse pattern is going away
Iceberg means we are going to have more data in object storage like S3. The best warehouses already let you spin up zero-copy clones of data, so instead of copying data every time you need to change your data logic, you can just create a clone.
A clone is significantly cheaper because it is essentially a pointer to the (production) data, which is then used to materialise changes and new data in an ephemeral location.
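To make that concrete, here is a minimal sketch of a zero-copy clone using the Snowflake Python connector. The account, credentials and schema names are made up; the `CREATE ... CLONE` statement is standard Snowflake SQL.

```python
# Minimal sketch: spin up a zero-copy clone of production for a dev branch.
# Account, credentials and schema names are illustrative assumptions.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",   # hypothetical account identifier
    user="dev_user",
    password="***",
    warehouse="DEV_WH",
    database="ANALYTICS",
)

branch = "feature_new_margin_logic"
cur = conn.cursor()

# A clone only copies metadata (pointers to the same underlying micro-partitions),
# so this is fast and cheap no matter how big PROD is.
cur.execute(f"CREATE SCHEMA IF NOT EXISTS DEV_{branch.upper()} CLONE PROD")

# Development happens against the clone; only tables you actually change
# materialise new storage, everything else keeps pointing at production data.
cur.close()
conn.close()
```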
So how should this happen on the lakehouse?
As the data world moves from monolithic data warehouses to modular, flexible lakehouse architectures powered by Apache Iceberg and S3-backed storage, one question looms large:
How do we bring the developer agility of Git workflows to the data layer?
Several vendors and projects are tackling this by reimagining what “Git for data” means in a lakehouse context. Here’s how some of the key players are thinking about it:
🦕 Nessie — Git-Like Version Control for Data Tables
Nessie brings a Git-like experience to data lakes by sitting on top of Iceberg (and Delta Lake, Hudi) to allow branching, committing, merging, and tagging of data versions.
Want to test new transformations without touching prod? Just create a new branch.
Finished with QA? Merge the branch into main.
Need to roll back? Checkout a previous commit.
Nessie integrates directly with Apache Iceberg catalogs, enabling zero-copy clones via Iceberg metadata, not physical data duplication. This allows isolated dev/test environments for data engineering — just like Git branches for code.
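As a rough sketch of that flow, here is what branching and merging looks like with the Nessie Spark SQL extensions. The endpoint, warehouse path, branch and table names are all assumptions.

```python
# Rough sketch of a Nessie branch/merge flow via its Spark SQL extensions.
# Endpoint, warehouse path, branch and table names are illustrative assumptions.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions,"
        "org.projectnessie.spark.extensions.NessieSparkSessionExtensions",
    )
    .config("spark.sql.catalog.nessie", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.nessie.catalog-impl", "org.apache.iceberg.nessie.NessieCatalog")
    .config("spark.sql.catalog.nessie.uri", "http://nessie:19120/api/v2")      # assumed endpoint
    .config("spark.sql.catalog.nessie.warehouse", "s3://my-bucket/warehouse")  # assumed path
    .config("spark.sql.catalog.nessie.ref", "main")
    .getOrCreate()
)

# Branch off production: only metadata is forked, no data files are copied.
spark.sql("CREATE BRANCH IF NOT EXISTS dev_margin_fix IN nessie FROM main")
spark.sql("USE REFERENCE dev_margin_fix IN nessie")

# Writes now land on the branch, fully isolated from main.
spark.sql("INSERT INTO nessie.finance.margins SELECT * FROM nessie.staging.new_margins")

# Once QA passes, fold the branch's commits back into main.
spark.sql("MERGE BRANCH dev_margin_fix INTO main IN nessie")
```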
🧊 LakeFS — Git for Object Storage Itself
LakeFS takes a lower-level approach: it introduces Git-style version control at the object store layer (think S3 or GCS), independently of the table format.
You can create branches of entire datasets.
Run Spark or dbt jobs in isolation.
Promote changes to production when validated.
LakeFS supports Iceberg, Delta, and Hudi, and integrates with pipelines like Airflow and Spark. It’s great for teams that want full data lineage and reproducibility, even outside the table abstraction. It also offers pre-commit hooks, CI-style validation, and automatic merging.
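For illustration, a branch, commit, merge cycle with the high-level lakeFS Python SDK looks roughly like this. The repository, branch and metadata values are made up, and credentials are assumed to be configured out-of-band (e.g. via environment variables).

```python
# Rough sketch of a branch -> commit -> merge cycle with the lakeFS Python SDK.
# Repository, branch and metadata values are assumptions; credentials are assumed
# to be configured out-of-band (e.g. environment variables).
import lakefs

repo = lakefs.repository("analytics-lake")

# Branch the whole repository: like Git, this is a metadata operation,
# not a copy of the underlying objects in S3/GCS.
dev = repo.branch("dev-margin-fix").create(source_reference="main")

# ... run Spark/dbt jobs that write to lakefs://analytics-lake/dev-margin-fix/... ...

# Commit the changes on the branch with some audit metadata.
dev.commit(message="Recompute margins with new logic", metadata={"ticket": "DATA-123"})

# Promote to production once validation (or pre-merge hooks) pass.
dev.merge_into(repo.branch("main"))
```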
🏗️ Bauplan — GitOps and Metadata-Driven Data Dev
Bauplan is building towards a GitOps-native experience for the data lakehouse, with a focus on metadata and declarative infrastructure. It allows teams to define and version data logic, pipelines, and metadata in code, and then apply it to the lakehouse — similar to how Kubernetes configs are managed via GitOps.
It’s early-stage and incredibly cool. It speaks to the growing demand for modular, reproducible workflows in the lakehouse world that go beyond just data files and into orchestration, ownership, and governance.
Also, since a bauplan operation is an immutable, deterministic snapshot of data, infra and code, it is the only platform offering one-line reproducibility (`bauplan re-run jobId xxx`), which I like for time-travel reasons. They have an interesting paper on this (link downloads a PDF): https://arxiv.org/pdf/2404.13682
See the Orchestra Docs for Bauplan here.
🛠️ Y42 — Git-Based DataOps with UI Abstraction
Y42 offers a no-code/low-code interface for modern data stack tooling, but under the hood, it’s Git-powered. Every change in the UI generates a Git commit. This enables:
Full auditability of pipeline and transformation logic.
Easy rollback and versioning.
Collaboration with non-technical users via a UI, while technical users get Git control behind the scenes.
Y42 also integrates with dbt, Airbyte, and modern warehouse/lakehouse stacks, targeting cross-functional collaboration without giving up software engineering principles.
I include them here not because many people are really using this anymore, but because they were very big on virtual data builds, which is essentially the same idea as above applied to warehouses, and the abstraction for users was very nice.
Data Version Control
I know very little about this project (DVC), but I know it exists.
Why It Matters: The Lakehouse Needs More Than Just Storage
By extending this idea of “git for data”, every branch, and every version of that branch, corresponds to a set of inputs that lets you specify the state of the data at any given time.
This is in stark contrast to how many people do Git workflows with tools like dbt and Snowflake, where a branch is opened and versioned, but the versions correspond to the code rather than the data produced by that code.
This is enabled by the transaction log that table formats like Iceberg maintain on top of object storage like S3. Every commit effectively corresponds to a list of transactions applied to the data lake.
The benefits are huge in terms of versioning and rollback, and also in terms of cost: full refreshes are not required when logic changes, you can just roll back by committing the inverse of the transaction log.
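That commit log is already queryable today. Here is a sketch with pyiceberg, where the catalog configuration and table name are illustrative assumptions, showing how each commit is a snapshot you can inspect and read back.

```python
# Sketch: inspecting Iceberg's commit log and time-travelling with pyiceberg.
# Catalog configuration and table names are illustrative assumptions.
from pyiceberg.catalog import load_catalog

catalog = load_catalog("default")              # assumes a catalog configured in .pyiceberg.yaml
table = catalog.load_table("finance.margins")  # hypothetical table

# Every snapshot is effectively a commit: the operation that produced it
# (append, overwrite, delete) plus pointers to the data files involved.
for snap in table.metadata.snapshots:
    print(snap.snapshot_id, snap.timestamp_ms, snap.summary)

# "Rolling back" is reading (or re-pointing the table to) an earlier snapshot,
# rather than re-running a full refresh of the pipeline.
previous = table.metadata.snapshots[-2].snapshot_id
df = table.scan(snapshot_id=previous).to_pandas()
```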
Imagine if the first time you ran `dbt run --select tag:changed_models` you created a new branch of Iceberg data in S3, surfaced as an ephemeral schema in Snowflake. Imagine if every time you changed your code you incrementally updated that data in Iceberg, and could easily roll back to it.
Imagine if, when it came to merge this logic into main, you could simply run tests on the data you just created, run downstream jobs, and then clone the data you just created back into your production pointer. Imagine if this process had an incredibly fast feedback loop so you weren’t always waiting on dbt?
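As a rough sketch of what that CI job could look like today with Snowflake plus dbt (all names, targets and the promotion step are made up; on an Iceberg lakehouse the clone and promote steps collapse into branch create/merge calls like the Nessie and lakeFS examples above):

```python
# Rough sketch of a CI job for a data branch: clone prod, build only the changed
# models into the clone, test them, then promote. Names, targets and the swap
# step are assumptions; a lakehouse would use branch create/merge instead.
import subprocess
import snowflake.connector

def run(cmd):
    subprocess.run(cmd, check=True)

branch = "feature_new_margin_logic"
dev_schema = f"DEV_{branch.upper()}"

conn = snowflake.connector.connect(
    account="my_account", user="ci_user", password="***",
    warehouse="CI_WH", database="ANALYTICS",
)
cur = conn.cursor()

# 1. Zero-copy clone so the branch starts from the current production state.
cur.execute(f"CREATE SCHEMA IF NOT EXISTS {dev_schema} CLONE PROD")

# 2. Build and test only what changed, against the clone (assumes a dbt target
#    pointing at the dev schema and saved production artifacts for state comparison).
run(["dbt", "run", "--select", "state:modified+", "--target", "ci", "--state", "prod-artifacts/"])
run(["dbt", "test", "--select", "state:modified+", "--target", "ci", "--state", "prod-artifacts/"])

# 3. Promote: swap validated objects into production (shown per-table here;
#    a lakehouse merge would promote the whole branch at once).
cur.execute(f"ALTER TABLE PROD.MARGINS SWAP WITH {dev_schema}.MARGINS")

cur.close()
conn.close()
```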
This is the pattern we’re seeing at Orchestra with our new partners, which is super cool. It means fast-growing Data Teams who need to stand up data stacks aren’t just doing it with a context-aware, AI-powered control plane like Orchestra. They’re also literally redefining the developer experience for data, reducing cost, reducing the barrier to entry…it is crazy how much innovation there is.
Here are the thoughts we had about this almost 2 years ago:
Continuous Integration and Continuous Development: learnings from Software to Data (Miniseries)
How CI/CD works for Data Science Pipelines (Miniseries, Part 2)
Principles of effective data delivery: How CI/CD should look for Data teams (Miniseries Part 3)
Performance metrics for high functioning data teams (Miniseries Part 4)
A new Paradigm for Data: Continuous Data Integration and Delivery (Miniseries Part 5)
We’re now at a stage where the feedback loop on the data manipulation itself is super rapid and super cheap, as you can use Iceberg instead of a warehouse.
Furthermore, you get end-to-end visibility and orchestration by using your orchestrator in CI/CD and as your metadata layer, which solves a lot of the problems we discussed in the five-part series above.
It’s crazy how much things have changed, as we said. If you’re interested in chatting CI/CD / AI / Data, please reach out!!