Sunday Scaries: The future of dbt Labs
Analysing everything from Coalesce 2024, the direction of the data industry and dbt’s place within it
Introduction
There was a phrase that stuck with me from the keynote at dbt Labs’ conference in London two years ago:
“I remember my first dbt run”
I really do. There was something so intensely gratifying about seeing those green log messages flash up on screen. Intensely validating, too: proof that I had finally moved on from some basic knowledge to what felt like software engineering.
Boy, did I not know how much further there was to go.
The interesting and somewhat predictable eventuality of that keynote was the shift in focus to all the new developments in dbt Cloud (dbt Labs’ paid offering) and the performance of the company.
It is a classic pattern in technology conferences. What might start out as a small, intimate ‘indie’ conference becomes a victim of its own success — thousands of qualified leads gathered in the same place attracts a huge amount of commercial interest (rightly so).
Fast forward two years: there is no London conference, and the keenest, brightest-eyed of the dbt Community have descended on Vegas to learn what’s in store next for dbt Labs.
dbt Cloud™ alternative | Orchestra + dbt Core™
Running dbt Core™ in your orchestrator has never been easier. Set up rapidly, stop paying for an IDE you don’t need: www.getorchestra.io
One dbt
One dbt is a cultural ethos that analytics practitioners can carry into their day-to-day work.
It centres around:
- Flexibility: use the platform and tools of your choice together with dbt
- Trust: it is a goal of analytics to foster trust and collaboration with end stakeholders
It is a fair point that the analytics and data communities are generally not well understood by business stakeholders, and that is something we should try to change (dbt or not).
Product expansion
As we see across the data space, every company is expanding into new areas, and dbt Labs is no exception.
The bold vision is for dbt Cloud to be not just a transformation tool, but an orchestrator, a catalog, and a data quality monitoring tool. A control plane for data, if you will.
Let’s dive into these different sections.
Orchestrator
You can loosely define an orchestration tool as one that triggers the execution of jobs and automatically handles dependencies between them.
In this definition, running dbt would indeed be akin to orchestration: it pushes queries down to the data warehouse synchronously and handles the dependency management between models.
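To make that concrete, here is a minimal sketch (the model and source names are hypothetical) of how dbt infers those dependencies: the ref() call is all dbt needs to know that fct_revenue must run after stg_orders.

```sql
-- models/stg_orders.sql (hypothetical staging model)
select
    order_id,
    customer_id,
    amount
from {{ source('shop', 'raw_orders') }}

-- models/fct_revenue.sql (a separate file)
-- ref() declares the dependency, so `dbt run` builds
-- stg_orders first, then fct_revenue, with no manual scheduling
select
    customer_id,
    sum(amount) as revenue
from {{ ref('stg_orders') }}
group by customer_id
```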
Now you could say I am somewhat of an orchestration expert myself. After all, it is my job to help data teams get more of their time back through Orchestra, which is a lightweight, managed orchestration platform.
Whereas the word “catalog” in data is confusing, the word “orchestrator” is well defined. There is a minimum viable set of features you need to truly be considered a data orchestrator by the data community.
dbt still lacks many of these. I do not believe the vision of dbt was ever to orchestrate anything and everything. If it had been, the team would essentially have built a better Airflow, which people were already doing at the time.
Observability
The Data Observability category, as I understand it, is a little ambiguous. There are two main problems the tools in this space seem to solve:
- The automation of huge swathes of data quality tests. This saves data engineers taking on a governance role, or data stewards, from writing thousands of lines of yml
- Ensuring changes to code do not impact production data assets in unexpected ways (CI/CD)
CI/CD needs to be handled by the user in git. Sure, there are parallels in software engineering, where successful companies sell managed git actions, but by and large people do this themselves.
I will never understand why CI/CD is such a hard concept for analytics engineers to grasp. Perhaps this is why dbt Cloud offers it out of the box. But we should remember that CI/CD is one good open-source PR away from being free.
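For illustration, here is a minimal sketch of that do-it-yourself version: a hypothetical GitHub Actions workflow running dbt’s slim CI pattern on every pull request (the adapter, secret name and artifacts path are assumptions for the example).

```yaml
# .github/workflows/dbt_ci.yml (hypothetical slim CI workflow)
name: dbt CI
on: pull_request

jobs:
  dbt-build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Install dbt Core plus the adapter for your warehouse
      - run: pip install dbt-core dbt-snowflake
      # Build only modified models and their downstream dependents,
      # comparing against a manifest from the last production run
      - run: dbt build --select state:modified+ --state ./prod-artifacts
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
```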
The first problem, automatically generating data quality tests, is not something we currently see dbt Cloud solve.
Yes, dbt tests are the same thing as the data quality tests that observability tools run.
But the key benefit those tools offer is generating them easily. Writing yml files is tedious. Anyone who has had to write five tests per column on a ten-column table knows how tedious.
dbt would need an easy-to-use interface specifically designed for data quality to generate the lines of yml required to make this an easy experience. Again, that is not something they have right now, so I am interested to see whether this is truly a priority.
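To show what that looks like in practice, here is a hand-written example covering just two columns of a hypothetical model (the relationships test assumes a dim_customers model exists; accepted_range comes from the dbt_utils package). Scale this to ten columns with five tests each and the pain is obvious.

```yaml
# models/schema.yml (hypothetical, abbreviated)
version: 2
models:
  - name: fct_revenue
    columns:
      - name: customer_id
        tests:
          - not_null
          - unique
          - relationships:
              to: ref('dim_customers')
              field: customer_id
      - name: revenue
        tests:
          - not_null
          - dbt_utils.accepted_range:  # requires the dbt_utils package
              min_value: 0
```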
Catalog
The word “catalog” is becoming increasingly poorly defined in data, but there is a good article by Jeremiah Hansen from Snowflake, which I’ve included below.
Catalogs from Sears to Iceberg
What is a data “catalog”? (medium.com)
Essentially, it seems dbt’s catalog is a directory of data assets generated from the metadata of the dbt project.
It is most similar to business data catalogs like Atlan, Collibra, Alation and so on. It is not a catalog for governance and managing access to data assets. It is not a catalog for interacting with iceberg data.
When evaluated like this, dbt Cloud again appears to fall short of the features of business data catalogs. Tools in the latter camp bring in active metadata: metadata generated at source, i.e. in your data warehouse.
They also offer integrations to all your tools, including BI tools and multiple warehouses/data lakes.
There are also features, like Collibra workflows, that are important for maintaining Data Products and workspaces. These features aim to ease data discovery in sprawling enterprise environments with tens or hundreds of thousands of data assets.
It is hard to see how dbt metadata can serve a similar purpose in its current form. Of course, this question would be different if everyone in an organisation did the same thing and stored data in the same place in the same way — if only.
Iceberg
We wrote up a few more detailed thoughts on iceberg below.
dbt™ latest features and takeaways from Coalesce 2024: iceberg
Behind the fanfare, here’s what you need to know about Iceberg (medium.com)
The key takeaway is further vindication that the data community is (rightly?) warming to iceberg.
The promise of A) unifying storage to limit complexity and B) unifying data formats across different use-cases (analytics, operational, ML, AI) offers enormous benefits and is attracting serious executive backing.
Another important takeaway is that dbt’s support for iceberg is indirect. Your data warehouse provider maintains a query engine (the “compute”). This may or may not have the ability to connect to object storage and work with data stored in the iceberg format.
dbt is a developer framework for handling SQL queries and therefore does not interact with iceberg directly. It is a wrapper around your query engine.
This means there is necessarily a sort of dependency lag: first iceberg functionality, then query engine functionality, and finally dbt’s functionality. It may therefore be desirable to leverage an orchestrator that can interact directly with a query engine.
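As a sketch of what this looks like from the dbt side, assuming a warehouse adapter that already supports iceberg (the config keys below mirror recent dbt-snowflake releases; the volume and model names are hypothetical), dbt simply passes the settings through to the DDL the query engine generates:

```sql
-- models/fct_revenue_iceberg.sql (hypothetical model)
{{ config(
    materialized='table',
    table_format='iceberg',        -- interpreted by the adapter/query engine
    external_volume='my_s3_volume' -- hypothetical object storage volume
) }}

select customer_id, sum(amount) as revenue
from {{ ref('stg_orders') }}
group by customer_id
```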
Side note: this perhaps explains why Databricks were willing to pay so much for Tabular. By influencing, and having early information about, the roadmap for iceberg, they ensure that A) delta keeps pace with iceberg and B) those developing the Databricks query engine’s iceberg capabilities have the competitive advantage of knowing the iceberg roadmap, so they will always have first-mover advantage (in theory).
A development vs a deployment environment
It is genuinely incredible that there are literally hundreds of thousands of people using dbt every day.
However, there are different ways we use it. When building out a data product, we are likely writing SQL or ‘dbt code’. This is done in a developer environment, like VS Code.
Hopefully, we don’t spend too much time here, and instead focus on analysing the outputs. These outputs are updated by dbt code *running* on some kind of schedule; dbt running on infrastructure is your deployment environment.
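At its simplest, that deployment environment can be as little as a scheduled job. A minimal sketch, assuming GitHub Actions as the scheduler (any orchestrator works equally well; the adapter and secret name are assumptions):

```yaml
# .github/workflows/dbt_prod.yml (hypothetical scheduled run)
name: dbt production run
on:
  schedule:
    - cron: "0 6 * * *"  # every day at 06:00 UTC

jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: pip install dbt-core dbt-snowflake
      - run: dbt build --target prod
        env:
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}
```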
It is in this sense that there are hundreds of thousands of people running dbt code every day (but not necessarily writing it).
An important thing to note is that dbt Cloud is very much leaning into providing a developer environment, in addition to a deployment environment.
The developer environment faces some stiff competition, especially when you consider that code editors are pretty good. So good that virtually every software engineer uses one. They’re also free.
dbt Cloud’s pricing is user-based. Usage-based pricing never really took off, mainly because as data engineers we know it doesn’t take much compute to run dbt.
Indeed, most of us have worked with dbt locally, and our home computers aren’t that big. Intuitively this makes sense: dbt is just sending API requests to a warehouse, which does the heavy lifting. Sure, you might run into issues when waiting on hundreds or thousands of those requests, but that’s not most of us.
The point being, there is no offering for analysts who are looking for a place to *run* dbt but don’t need a development environment. Other aforementioned features like catalog, observability and so on are nice-to-haves.
As we discuss here, the correct place for dbt to run is in the orchestration plane, which is why Orchestra supports dbt Core. The pricing is lightweight and standalone: $0.065 a minute on a starter plan.
Conclusion
It’s great to see the continued momentum in dbt adoption. What started out as the darling indie-favourite in data engineering is now truly becoming part of the mainstream.
It definitely feels like a pivotal moment for the company behind the success. With the addition of model-based pricing, following in the wake of other notable companies like Hashicorp changing their licensing, the path ahead for dbt is not without risk.
The valley between a self-deployed open-source project and the paid offering is widening. It could one day become a chasm. dbt or not, our job as data professionals is to ensure we do not end up in that chasm 🚠
Learn more about Orchestra
Orchestra is a cloud-based unified control plane for data operations. In English: data orchestration, observability, quality testing and catalog in one. We run dbt Core too and offer lightweight, scalable pricing.