What Snowflake’s Acquisition of Datavolo means for the Data Industry

Cloudera, Hortonworks, Unstructured Data, and of course — AI

Nov 26, 2024

Substack Note

While I don’t think it is worth speculating, I am sure there was a strategic element to acquiring rather than building out unstructured ELT as part of the Snowflake strategy. For now, though, the acquisition and that strategy seem very much at odds.

While Marketing is intent on tackling Databricks head-on, talking about Data Platforms and Spark, the corporate angle is focussed on AI (and so, too, is its corporate finance side).

I find this super odd given there are very few data engineers meaningfully doing (or wanting to do) AI in a serious way.

Data Leaders should not get distracted by this acquisition. The glaring hole in Snowflake’s platform is the lack of a complete suite of tools that Data Practitioners actually take seriously.

Databricks Workflows gets used instead of Airflow; Snowflake Tasks does not
Databricks are pushing their BI tool hard; Streamlit is not a viable alternative for many
Lakeflow (DBX) is the latest ingestion tool; Snowflake has some native connectors
Unity Catalog is something people also actively use; where is Snowflake’s?

And as I’ve said before, there is no reason to evaluate these things in a like-for-like comparison apart from the fact their marketing team appear to want us to do this, so let us oblige.

The Datavolo acquisition will likely not resonate with engineers but will form part of the company’s AI strategy, which is itself apparently not in line with the anti-Databricks angle.

The biggest missing feature in both platforms? A true control plane for data, like Orchestra.

Introduction

Snowflake announced that they were acquiring Datavolo earlier this week — sum undisclosed.

There are a lot of big acquisitions happening in the Data space, and there are a few things you might think about this one. Is Snowflake really moving into AI? Are they trying to take on Fivetran? Are they trying to respond to Databricks’ Lakeflow? Not exactly.

I spoke with Luke Roquet about his vision in December last year, so here’s what the acquisition means for the industry.

Tight Coupling to NiFi: Context for Snowflake / Datavolo

Something you might not know is that Datavolo is tightly, tightly coupled to Apache NiFi. The founders of Datavolo previously worked at Hortonworks, where they had a similar product called DataFlow.

These are tools with GUIs (similar to Orchestra) that let you move data, transform it, and orchestrate pipelines.

I’ve always thought these were pretty ugly tools that were hard to use, but as Luke pointed out, Cloudera has a 100m business built atop NiFi, so does that really matter? Datavolo’s product was very similar, built on top of NiFi, but with connectors for unstructured data rather than regular, structured data.

So this is interesting, as it’s essentially an extension to the Cloudera platform for unstructured data, built at just the right time for the AI boom.

How will this help Snowflake Customers?

I think this is a hard one: Cloudera is not really something Snowflake customers use a whole lot. They’re both trying to be data platforms, so their customer bases are largely mutually exclusive. The NiFi / Cloudera demographic is similar, so you would expect the same to be true of Datavolo.

I know a lot of Snowflake users who want to move unstructured data, but it’s likely there will need to be a lot of UI work to make this usable. For the most part, it would be challenging to give your average Snowflake customer Apache NiFi for unstructured data and have it take off.

Limitations with Fivetran and other ELT Vendors

An interesting point is that ETL can be much more desirable than ELT for unstructured data. You might want to vectorise data such as images in transit and then land the result, rather than land the images, store them, and then vectorise. The former can be quite a lot more efficient.

This is something Luke mentioned to me as a key difference between unstructured ETL and structured ELT, which is why it made sense that Datavolo was doing something novel.
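To make the efficiency point concrete, here is a minimal Python sketch of the two flows. Everything in it is an assumption for illustration: `toy_embed` is a made-up stand-in for a real embedding model, `Warehouse` is an in-memory stand-in for the destination, and the 8-bytes-per-float figure is just a convenient accounting choice. It is not how Datavolo, NiFi, or Snowflake actually implement this.

```python
from dataclasses import dataclass, field


def toy_embed(blob: bytes, dims: int = 8) -> list[float]:
    """Hypothetical stand-in for an embedding model: folds bytes into `dims` buckets."""
    vec = [0.0] * dims
    for i, b in enumerate(blob):
        vec[i % dims] += b / 255.0
    return vec


@dataclass
class Warehouse:
    """In-memory stand-in for the destination; tracks how many bytes get landed."""
    rows: list = field(default_factory=list)
    bytes_landed: int = 0

    def land(self, key: str, payload, size: int) -> None:
        self.rows.append((key, payload))
        self.bytes_landed += size


BYTES_PER_FLOAT = 8  # assumption: vectors are stored as 64-bit floats


def etl(files: list, wh: Warehouse, dims: int = 8) -> None:
    """ETL: vectorise in transit, land only the vectors."""
    for name, blob in files:
        wh.land(name, toy_embed(blob, dims), size=dims * BYTES_PER_FLOAT)


def elt(files: list, wh: Warehouse, dims: int = 8) -> None:
    """ELT: land the raw files first, then vectorise what was stored."""
    for name, blob in files:
        wh.land(name, blob, size=len(blob))
    for name, blob in files:
        wh.land(name + ".vec", toy_embed(blob, dims), size=dims * BYTES_PER_FLOAT)


def compare(files: list) -> tuple[int, int]:
    """Return (bytes landed by ETL, bytes landed by ELT) for the same input files."""
    etl_wh, elt_wh = Warehouse(), Warehouse()
    etl(files, etl_wh)
    elt(files, elt_wh)
    return etl_wh.bytes_landed, elt_wh.bytes_landed


if __name__ == "__main__":
    # Two fake "images" of roughly 10 KB and 20 KB.
    sample = [("a.png", bytes(range(256)) * 40), ("b.png", bytes(range(256)) * 80)]
    etl_bytes, elt_bytes = compare(sample)
    print(f"ETL landed {etl_bytes} bytes; ELT landed {elt_bytes} bytes")
```

In this toy setup the ETL path only ever lands the small fixed-size vectors, while the ELT path lands every raw file plus the vectors. Real pipelines have other trade-offs (you may want the raw files for re-embedding later), so treat this as an illustration of the argument rather than a benchmark.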

Is this a response to Databricks’ Lakeflow?

The short answer is no. Lakeflow is closer to a way to do regular and standard replication, probably somewhat linked to Databricks’ acquisition of Arcion. I do not think Datavolo is intended to be Snowflake’s alternative to Databricks’ Lakeflow. They do very different things.

How does this play into Snowflake’s AI Strategy?

If you believe everyone is going to be using Snowflake to facilitate AI then you will believe this acquisition makes life a lot easier.

Why? Because with Datavolo it has never been easier to move unstructured data to your warehouse.

I think this is potentially a bit of a miss. As we have seen, the vast majority of AI use-cases are currently being rolled out by software engineers. Software engineers who are using object storage and building applications.

As Data Engineers and Analytics Engineers, not only are we loath to use tools like NiFi but we also understand that the business can barely be trusted with dashboards, let alone AI or ML. We still believe walking before running is important.

And indeed, it is mainly data engineers and analytics engineers that are leveraging Snowflake, not software engineers building systems of operation at energy companies or hedge funds (for the most part, at least).

That said, given the huge boom in AI and the rapid growth of Datavolo, this is likely a strategic but perhaps also an opportunistic acquisition. We do not know how much Snowflake paid, so everything is relative.

But to round out the point; think about it — if you’re a Snowflake user, when was the last time someone asked you to do unstructured data ingestion that wasn’t a JSON? :)

Conclusion

Today was a big day for Snowflake, which reported better-than-expected earnings that sent its stock climbing 19%. The company is clearly doubling down on its efforts to facilitate AI use-cases. We will see if that is what the people want ⛄️

Like NiFi? You’ll Love Orchestra.

