Foreword
Since writing this back in July 2023, a lot of the trends I predicted have come to pass. These include companies like Datafold getting more public traction for their CI tool, and Dagster's recent blog posts and strategy now seemingly revolving around the idea that a Dagster code repo is all a data team needs to operate effectively (which is obviously ludicrous, since even software engineers buy some SaaS; that said, I'm fundamentally aligned with a lot of what the team there advocates for).
What has received the most attention, however, are the machinations within dbt lads, ahem, Labs. dbt are finally focusing on environment management, Continuous Data Integration and Delivery, and integrating elements of observability into their platform to allow dbt to become a data release pipeline. Some recent examples:
Cloning or deferring, by my good friend Kshitij (link); there's a sketch of the underlying idea just after this list
dbt Explorer (launched October 2023 - link)
Environment separation (link)
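As I understand it, the cloning piece leans on warehouse-native zero-copy cloning where the warehouse supports it, so a dev or CI environment can be stood up against production-sized data without duplicating storage. A rough sketch of the underlying idea in Snowflake SQL (the database, schema, and table names here are all hypothetical):

```sql
-- Stand up a CI schema as a zero-copy clone of production.
-- Nothing is physically copied until the clone diverges from its source.
CREATE OR REPLACE SCHEMA analytics.ci_env CLONE analytics.prod;

-- Rebuild only the model under review inside the cloned schema, so it can
-- be inspected and compared against production before the PR is merged.
CREATE OR REPLACE TABLE analytics.ci_env.fct_orders AS
SELECT
    order_id,
    customer_id,
    order_total
FROM analytics.prod.stg_orders;
```

Deferring is the same trade-off from the other direction: rather than cloning upstream objects, dbt can resolve upstream references to an existing environment (typically production), so you only build the models you actually changed.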
Now, of course, dbt have to focus on cloud features, because you can pay for them, and businesses only make money from stuff that's paid for (at least in the short term). But is it a coincidence that they're focusing on all these pain points I've been arguing are currently missing from so much of our tooling? After all, there are land grabs going on everywhere in data. Fivetran can run dbt. Monte Carlo built a GitHub Action. Hightouch literally copied Rudderstack's open-sourced code. dbt aren't doing any of these things; they're doubling down on features that help developers deploy data into production reliably and efficiently.
The reason I think this vindicates what's in this article is that I wrote it in response to tools whose USPs revolve around the work developers do locally, in a dev branch (the dbt command-line tool being enjoyable to use would fall into this category). I felt those offerings were one-sided. I never had many issues in the dev environment; it was in the staging and production environments that I'd have data quality, efficiency, and visibility issues, which is why that's mainly where Orchestra's USPs sit. You can check these out below.
It suggests that the way to truly enhance developer productivity isn't to give us fancier dev tools and ever more niche environments; the gains lie in staging environments and in what happens after the code is written, in tools that help you analyse patterns and deploy data repeatably and efficiently. And of course, tools like SQLMesh have got fancier websites and better cloud offerings themselves over the last six months, but that's not what everyone was hyping up six months ago, now was it?
Anyway, read on and let me know what you think.
When I first learned SQL, my mind was kinda blown. In my first job I worked in Investment Banking, doing tons and tons of analytics using nothing other than Excel. When I realised you could do this with something like SQL, I was amazed; so amazed that for the last four years I've been working in Data Engineering. The deeper I get, the greater the problems people seem to have with SQL and data build tool (dbt). I don't really understand why, and this article speaks a bit to that.
What is SQL and how is it similar to other programming languages?
Why are there different dialects of SQL? Taken from here:
First designed by Donald D. Chamberlin and Raymond F. Boyce in 1974, its longevity is highly unusual for a coding language (only Fortran and C can claim longer lifespans). This longevity can be credited to the benefits of relational databases, as an efficient storage method with a simple language to extract information, and its competitive performance regardless of the scale of deployment.
The language has changed and evolved since its inception, as various database software added functionality in different ways. This has resulted in different "dialects" of SQL, as different companies develop slightly different syntaxes (T-SQL for SQL Server, PL/pgSQL for PostgreSQL, PL/SQL for Oracle).
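A trivial illustration of what "dialect" means in practice: the same requirement, fetching the ten most recent orders, written for two different engines (table and column names are made up):

```sql
-- T-SQL (SQL Server)
SELECT TOP 10 order_id, order_date
FROM orders
ORDER BY order_date DESC;

-- PostgreSQL, Snowflake, and most ANSI-leaning dialects
SELECT order_id, order_date
FROM orders
ORDER BY order_date DESC
LIMIT 10;
```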
Or diagrammatically:

We see there are a few variables:
How the data is stored i.e. the file format
Where we store the data
How we operate on it
How we parse it
So take Databricks: the files are stored as .parquet files. When we write SQL in Databricks, it takes ANSI-standard SQL as input and parses it into a plan of operations (in whatever form Delta Lake understands) that are executed against those Parquet files.
In Snowflake, you're kinda doing the same thing, but using Snowflake's own SQL dialect. This gets translated into operations that manipulate files sitting in object storage such as Amazon S3. The two processes are conceptually exactly the same, but architecturally, what's going on behind the scenes couldn't be more different.
The important point here is that there are lots of SQL dialects and using them is really easy; anyone can manipulate data beyond the means of any Excel sheet. Conceptually, though, a SQL dialect is very, very different to a regular programming language.
General-purpose programming languages like Python and JavaScript ultimately just give people a way to turn source code into something a machine can execute. Statically-typed languages do type checking (i.e., the process of verifying and enforcing the constraints of types on values) at compile time, whereas dynamically-typed languages do type checks at runtime. There is also the distinction between strongly- and weakly-typed languages (the latter don't enforce things like type checking).
This is important because depending on your use-case, you may care about having these features. If you’re building some software that really can’t screw up you need guardrails to help you write things in a robust way. If you’re just hacking around and building some cool data science analytics, you don’t.
For SQL, if you’re a data engineer there is one use-case: transforming data. The only variables are A) how much data and B) how complicated is the transformation.
This is why I don’t understand why there is all this “hot new SQL tooling”. Or maybe I do, I just disagree about why I should use it.
OK, so what is the tooling?
You’re probably familiar with dbt. dbt gives data engineers an easy way to execute SQL in an orchestrated fashion: rather than manually running three dependent SQL statements in the right order, dbt works out the dependency graph and handles it for you.
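For anyone who hasn't seen it, the mechanism is the ref() function: each model selects from other models by name, and dbt infers the order to build them in. A minimal sketch of a dbt model (model and column names are made up):

```sql
-- models/fct_orders.sql
-- Because this model ref()s stg_orders and stg_payments, dbt knows to build
-- both of those first; no hand-written ordering is required.
SELECT
    o.order_id,
    o.customer_id,
    SUM(p.amount) AS order_total
FROM {{ ref('stg_orders') }} AS o
LEFT JOIN {{ ref('stg_payments') }} AS p
    ON o.order_id = p.order_id
GROUP BY
    o.order_id,
    o.customer_id
```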
There are some issues with dbt. It doesn't separate data changes from data model definitions, for example. It's hard to build incremental models (for some people; personally I find it very straightforward). The biggest issue, though, relates to the data workload: when you submit a PR to your dbt repo you don't just change code, you change:
1. Code
2. Data
3. Database
And for (2) and (3), the world of software engineering already has powerful tools that handle these, like Prisma migrations or Alembic (this stuff is non-trivial). With dbt, you can't really do either.
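To illustrate the gap: in application land, a schema change ships as an explicit, versioned migration, separate from the code that reads the table; in dbt, the model definition effectively is the table, so editing a SELECT changes the code, the data, and the schema in one go. A rough sketch in plain SQL (table and column names are made up):

```sql
-- Roughly what a migration tool like Alembic or Prisma emits: an explicit,
-- reviewable change to the database structure, and nothing else.
ALTER TABLE customers ADD COLUMN lifetime_value NUMERIC;

-- Roughly what a table-materialised dbt model does on its next run after
-- you edit the SELECT: the whole object is rebuilt, so code, data, and
-- schema all change at once.
CREATE OR REPLACE TABLE customers AS
SELECT
    customer_id,
    email,
    SUM(order_total) AS lifetime_value
FROM orders
GROUP BY
    customer_id,
    email;
```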
Enter:
SQLMesh
SQLMesh is a way to get over some of the problems of dbt, like incremental loading being hard (again, I'm not sure about this) and the aspects of managing (1)-(3) above that dbt misses out. There is actually a really good Reddit discussion on this.
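For flavour, an incremental model in SQLMesh looks roughly like the below: you declare the model's kind and time column up front, and the engine only processes the interval it is asked to (back)fill, rather than relying on hand-rolled incremental logic (model and column names are made up):

```sql
MODEL (
    name analytics.daily_orders,
    kind INCREMENTAL_BY_TIME_RANGE (
        time_column order_date
    )
);

-- SQLMesh substitutes @start_date / @end_date for whichever interval is
-- being filled, so backfills and reruns are just a matter of date ranges.
SELECT
    order_date,
    COUNT(*) AS order_count
FROM raw.orders
WHERE order_date BETWEEN @start_date AND @end_date
GROUP BY order_date
```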
Semantic Data Fabric
See here:
Today, SDF focuses on a deep understanding of your warehouse in order to build consistent representations of data, modernize and automate data quality and simplify governance.
Which is vague as hell, but to me says: "write some SQL, define some constraints, and we'll let you test all your stuff in a really nice, rigorous way locally, and then when you're ready, you can push these changes to your database (be it Snowflake, BigQuery, whatever) and then you're golden."
There are loads of other tools out there that do transformation (Datameer, Coalesce, Google's Dataform, any all-in-one platform; it's endless), but this new breed of tools focuses on local testing and holistic development / data quality. Which is cool. Supposedly…
Why I don’t think you need this
I am in a privileged position where I can say I am comfortable working in Excel or writing code.
What strikes me as odd about these new SQL tools is just how little credit they give analysts.
They assume we're all, let's say, intellectually challenged: incapable of deploying code and data changes in a way that doesn't blow up the efficacy of our entire organisation.
Want to change the data table under a dashboard? No — you need to make sure column level lineage works first.
Want to change the logic behind a model? No — you need to write a test for this model first, unit test the code, assess the data quality, and only if that succeeds locally may you even suggest PR-ing this into prod.
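And to be clear about what that testing step actually is: in dbt, for instance, the simplest kind of data test is just a SQL query that should return zero rows, and any rows it does return count as failures. Something like this (model and column names are made up):

```sql
-- tests/assert_no_negative_revenue.sql
-- The test fails if any rows come back, i.e. if an order ever has
-- negative revenue in the fct_orders model.
SELECT
    order_id,
    revenue
FROM {{ ref('fct_orders') }}
WHERE revenue < 0
```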
I wasn't a software engineer before I became a data engineer. I worked in Excel, and we cranked through stuff day after day without any of these guidelines. And you know what? I was bloody good at it, and at one point I even considered entering an Excel competition (ridiculous, I know). Excel doesn't have these guardrails. Excel doesn't get PR'd. The work is way more important, too: if you screw up an Excel model in the corporate world you really do get a bollocking, in a way that simply doesn't exist in the world of tech. Point is: you're doing the same task, in a different place, the stakes are higher, the tools are worse, but the system works. In the data world, there is also a huge set of problems that represent hurdles for your data project. Fancy tooling solves but a subset of these, and only in the (hopefully edge) cases where you're doing stuff incorrectly:

The whole point of having a job is that you can kinda become good at it. If we're forever relying on tools that catch exceptions, isn't that mollycoddling? Shouldn't we just be striving to use things like AI copilots and to learn things from first principles? I think screwing up what's in those circles in the diagram should only happen in a minority of cases if you're doing your job properly; you shouldn't become reliant on this testing. And if it's a minority of cases, then the value of the tool is less. If you tell me the value of your tool is extremely high then, by modus tollens, you're assuming that I need it, that I frequently screw up, and therefore that I'm not doing my job properly. I resent people telling me that the answer to doing a poor job is the tooling they use, because I'm not a bad workman and I don't blame my tools.
So what’s going on with SQL then? Why are there all these incredibly smart software engineers building tools that make it impossible for people to screw up?
I believe it's because data is so new, and working with it en masse is so new, that the problems aren't tooling-related; they're structural.
It’s true — dbt made writing and deploying SQL so easy anyone could do it.
This created a problem — a data swamp. A mess of crap lying everywhere for data teams to wade through. Too much badly-written SQL.
This wasn’t created by dbt, though. It was created by the sudden onset of data, data tooling, and the lack of supply of quality data analysts who understand how to do this sort of stuff well.
If you’re good at dbt, and you’re good at building data models, you don’t need tooling that makes it impossible to screw up. You just push to prod (yeah I said it) and crack on.
At my previous job at Codat, we did this a fair bit, way more than we should have, but honestly, for the first year the marginal benefit of a staging environment would have been quite low. I don't think we were an awesome, awesome data team (an article for another time), but I do think our data quality issues were pretty minimal, because, technically, we were pretty good.
Back to dialects
If you're a software engineer, your use cases are extremely varied. You could be writing different types of application that genuinely call for different languages, or different dialects of the same one.
If you're writing SQL, it's simpler. You care about A) the amount of data and B) the complexity of your transformations. You realise a new SQL dialect makes no difference. A new SQL-writing tool (like an alternative to dbt), however, might: say you're writing extremely complex transformations on loads and loads of data, you can't afford to make mistakes, you're pretty new to this data gig, your boss is an ass and you're going to be under fire if anything breaks. Actually, dbt has burned you a few times recently, and now you do want to try out SQLMesh.
However, you're probably just writing some non-critical SQL to power a product usage dashboard for some Sales or Product folk (who happen to be quite friendly), or building some analytics for the finance team on recurring revenue; the latter needs to be pretty ship-shape, but there's no latency requirement on it, and it goes through lots of internal processes before the data gets used for anything important anyway.
The point of these two scenarios is that you are extremely unlikely to be the first person, and I don't see why that will change over time. The value comes from the process and the culture around data. If people value it, they take their time and do things properly, like a diligent 21-year-old cranking through some Excel: they try to get the job done properly, and the tool is almost irrelevant.
The new generation of SQL-help tools assumes there is no process, no culture, and no talent in a data engineer. They assume tooling fixes everything, that your analytical nirvana culminates in the purchase of a software subscription. They fail to recognise that processes around data, and the simple fact that analysts can be diligent and bring a semblance of intelligence and rigour, might also solve the very problems the tooling aims to solve. These tools take very little for granted and, as in programming, they will certainly find their use case. On balance, it just probably won't be yours.
References:
The Data School, Not dated: https://www.thedataschool.co.uk/jack-arnaud/sql-dialects/
Reddit, 2023: https://www.reddit.com/r/dataengineering/comments/12b6fgb/a_dbt_killer_is_born_sqlmesh/
The Semantic Data Fabric blog, 2023: