dbt™ latest features and takeaways from Coalesce™ 2024: One dbt™
We are all ONE™. Open your mind.
Source: https://www.bigdatawire.com/2024/10/08/still-too-much-duct-tape-in-data-transformation-dbt-labs-handy-says/
Foreword
Finding your feet and your ethos as a company is hard, and sometimes when we say things that don’t age well, we write blog posts about them to justify those views.
This happens a lot, and I couldn’t help but notice some parallels with One dbt (which various other people have addressed in their own posts since I wrote the post below).
The reason for it is that, as with any profit-maximising company with an open-core product, at some point it becomes difficult to finance giving away stuff for free.
This leads to two versions of the same thing. HashiCorp vs. the Terraform fork, CockroachDB changing their license, and now dbt Labs committing very few resources to dbt-core / adding model-based pricing. This inevitably leads to two choices: pay or maintain. Just as everyone prefers either vanilla or chocolate ice cream, so there are necessarily two groups of people (if you choose to demarcate things that way).
You could challenge this - because irrespective of whether you run dbt Core yourself or use dbt Cloud, everyone’s still writing dbt code. That’s why it’s the dbt community. Hell, do we need a different community for dbt-airflow, dbt-orchestra, dbt-prefect, dbt-azure-vm-container services? No, of course not.
This is a very long-winded way of saying that as data leaders, we need to focus on what the end goal should be and not get distracted by discourse - a good way to call BS on discourse is if it changes a lot over time. What you are reading now, of course, is also spiel to an extent - but judge for yourself if you think it’s any good - we have articles going back years, and they all say the same thing!
We’ve always been adamant that one major hurdle for data teams is having a way to easily orchestrate and connect different parts of the stack, so we are building a platform that can do that. We already wrote about why dbt-core obviously needs to run in an orchestrator.
The second major hurdle for data teams is improving data quality and simplifying the stack. We do not believe it is necessary or desirable to have an observability tool, a lineage tool, an Airflow cluster maintained by a horde of platform engineers who gatekeep data products from the business, a data quality testing tool, a data catalog (and so on), so we are trying to solve in one go the problems those platforms collectively address.
This is architecturally desirable and therefore offers genuine benefits to both data teams and organisations, over and above the “accepted” Modern Data Stack discourse.
At Orchestra we’re building a Unified Control Plane for Data Ops. We run dbt Core and Python, and give you visibility into the entire stack, with some incredible features to give data teams their time back so they can focus on building. You can try it now, free, here.
Introduction (written late Oct-24)
Just a very brief note on One dbt in case anybody else was interested.
I always appreciate ethoses and cultures, especially in data and analytics, where so much of what we talk about is problems caused by things that shouldn’t be solved by SaaS (shock!).
I don’t think there is anything in here anyone can disagree with. Common unifying framework for solving analytics problems? Sounds like standards and guardrails to me — love it.
Platform flexibility? Absolutely. I don’t care so much about Iceberg RIGHT NOW, but there is a reason Orchestra has hundreds of integrations: you should be able to use what you want and not worry about how to integrate it.
Data collaborators: yes, also good. The rise in self-service BI is a bit scary and has actually led to the death of dashboards. The Rise of the Analytics Engineer has similarly, in some cases, led to an ungoverned mess in the data warehouse. I believe we need governance across the end-to-end, i.e. better and more stringent processes around making data available and usable.
However, the idea of “Give the people who have the domain knowledge the tools to self serve within guardrails” has always been attractive to me. As someone who fell into data out of the necessity of avoiding aggregating 30 different Excel sheets, this has personal appeal, but we should also recognise that not everyone in the marketing team is going to become a data whizz overnight.
Trust — trust is obviously important but fundamentally it relates back to standards. You trust your software engineers because they’ve been doing this for 15 years. You might not trust your analytics engineer because they look pretty green to you.
One Elephant™
The elephant in the room, of course, is that it’s One dbt™ and not anything else.
Having platform flexibility, building trust with end stakeholders, and empowering individuals to self-serve analytics in a governed way are general goals for probably 95% of us.
dbt™ or no dbt™.
There are many ways to get rigor into your analytics, and many ways to transform data — I’m not sure how intrinsically linked they are.
If Iceberg is as big as we all think it is, there is going to be a much bigger focus on shift left and streaming. If we can just get our software engineers and data producers to dump data into object storage that is 🔥, we may not even need dbt.
When all that data starts coming through Kafka, don’t forget Kafka is a database and you can transform data there too!
What about Big Data? Remember when that was cool? Spark folks don’t always use dbt (although Dataproc folks seem to enjoy mixing and matching).
What about other frameworks like Coalesce? Semantic Data Fabric? SQLMesh?
What about warehouse-native dependency management tools like Snowflake Tasks or Dataform?
As CEO of a company doing Orchestration I try not to be too much of a shill (genuinely), so I’ve tactfully left out the other elephant which is orchestrators! What about you Airflow / Prefect / Orchestra / Dagster folks?
It’s clear to me that One dbt is pitched to Data Analysts and those that leverage dbt Cloud to the fullest extent. This persona is an upskilling persona that is fairly early on in their journey. Learning dbt is the gateway drug to doing bigger and better things.
For many of us, either with more experience or from a software engineering background, perhaps even data engineers of many years, a bold ethos and analytics development lifecycles may raise eyebrows. But we must remember One dbt is not for us. We already know the importance of doing things properly. We’ve been burnt by not doing it too many times.
One dbt is a cultural rallying call for folks who are about to join the party.
So if you’re one of those folks reading this, welcome. I, too, remember my first dbt run. I remember the first time I ran a query that cost over $1,000. I remember my first presentation about analytics that fell on deaf ears. I remember building my first streaming service. I remember testing my first Lambda function. I remember my first uvicorn main:app --reload. There really is a lot to look forward to, and I hope you develop a crippling addiction as I have.
Find out more about Orchestra
Orchestra is a unified control plane for Data and AI Operations.
We help Data Teams spend less time maintaining infrastructure, make them proactive instead of reactive, and ultimately win trust in data and AI from the Business.
We do this by consolidating Orchestration with monitoring, data quality testing, and data discovery. You don’t need a separate observability, lineage, or catalog tool with Orchestra.