Why Apache Iceberg is heralding a new era of change in Data Engineering
“Bring Your Own Storage (BYOS)” has never been cooler
About me
I’m Hugo Lu — I started my career working in M&A in London before moving to JUUL and falling into data engineering. I headed up the Data function at London-based Scale-up Codat. I’m now CEO at Orchestra, which is a data release pipeline tool that helps Data Teams release data into production reliably and efficiently 🚀
️️⭐️ Also check out our Substack and our internal blog ⭐️
Substack Note
It’s a kinda wild idea that every application should have an analytical layer that supports Bring Your Own Storage. I don’t really think this is a likely megatrend we’re going to see, but I think this article makes for great reading and let’s face it: if everything we used had an Iceberg backend and we all used Iceberg, that would be pretty fucking awesome.
Introduction
For many years, compute and storage were intrinsically linked. If you wanted to own a computer (in the context of business computing), you would need to pick from a menu of computers with differing levels of RAM (compute) and storage.
This bled into software.
Data warehouses typically offered this pricing model. This created tension. What if I had spiky computational requirements? What if my storage needs changed over time?
Snowflake was billed as the first truly elastic data warehouse, because it was designed so storage and compute could scale independently and, in effect, infinitely.
With the rise of open-source software in the data community, there will be an even greater decoupling.
“Bring Your Own Storage” will emerge as a more popular pattern. This article explains why, and how, and what we might do about it at Orchestra (in the world of data orchestration).
What is Apache Iceberg?
Apache Iceberg is an open-source table format for large-scale data systems, designed to provide efficient and reliable management of structured data in distributed environments. It offers features essential for data warehousing and analytics workloads, such as atomic transactions, schema evolution, and efficient data pruning. Iceberg was built to address the limitations of the older Hive table format, and it manages tables whose underlying data is stored in file formats like Apache Parquet and Apache ORC.
Key features of Apache Iceberg include:
1. Atomic Transactions: Iceberg ensures ACID (Atomicity, Consistency, Isolation, Durability) properties for write operations, allowing multiple operations to be committed atomically. This ensures data consistency and reliability even in the face of failures.
2. Schema Evolution: Iceberg supports schema evolution, allowing changes to the table schema without disrupting existing data or requiring expensive metadata rebuilds. This feature facilitates the seamless evolution of data schemas over time.
3. Partitioning and Sorting: Iceberg provides efficient partitioning and sorting mechanisms to optimize query performance. By organizing data into partitions and maintaining sorted data within each partition, Iceberg minimizes the amount of data scanned during query execution.
4. Time Travel: Iceberg supports time travel queries, enabling users to query historical versions of data stored in the table. This feature is particularly useful for auditing, debugging, and analyzing changes to the dataset over time.
5. Incremental Data Updates: Iceberg allows for efficient incremental updates to the table, enabling users to add, update, or delete records without having to rewrite the entire dataset. This significantly reduces the overhead associated with data updates.
6. Metadata Management: Iceberg maintains comprehensive metadata about the table structure, data files, partitions, and transaction history. This metadata is stored separately from the data files, enabling efficient metadata operations and facilitating compatibility with different storage systems.
7. Compatibility and Ecosystem Integration: Iceberg is designed to integrate seamlessly with existing data processing frameworks and tools, including Apache Spark, Apache Hive, and Apache Flink. It provides APIs and connectors for interacting with Iceberg tables in these environments.
For more detailed information and documentation, you can refer to the official Apache Iceberg website (https://iceberg.apache.org/).
What are the alternatives?
There are lots of other table formats, as you might imagine. The battle for the one table format to rule them all rages on; here are a few of our picks:
1. Delta Lake:
— Delta Lake is an open-source storage layer that brings ACID transactions to Apache Spark and big data workloads. It offers features like ACID transactions, schema evolution, time travel, and unified batch and streaming data processing.
— [Delta Lake]
2. Apache Hudi (Hadoop Upserts Deletes and Incrementals):
— Apache Hudi is a storage abstraction that provides incremental data processing and stream processing capabilities on top of Apache Hadoop. It offers features like upserts, deletes, and incremental data processing with built-in support for schema evolution and ACID transactions.
— [Apache Hudi]
3. Apache Parquet:
— Apache Parquet is a columnar storage format from the Hadoop ecosystem that provides efficient storage and encoding of structured data. It is widely used in big data processing frameworks like Apache Spark and Apache Hive. Strictly speaking it is a file format rather than a table format; Iceberg tables typically store their underlying data as Parquet files.
— [Apache Parquet]
4. Apache ORC (Optimized Row Columnar):
— Apache ORC is another columnar storage format from the Hadoop ecosystem that offers efficient compression and encoding techniques to reduce storage space and improve query performance. It is commonly used in data warehousing and analytics workloads. Like Parquet, it is a file format rather than a table format, and is another format Iceberg can use for its data files.
— [Apache ORC]
How does this affect platforms like Snowflake?
Snowflake has been aware of the BYOS demand for a while, and therefore has supported external tables since January 2021.
Given the increasing popularity of Apache Iceberg at large companies, at Orchestra we understand external tables will be sunset in favour of Snowflake’s native support for Iceberg tables.
This is somewhat surprising — there’s no theoretical reason to prefer Iceberg over any other storage layer.
For example, you should in theory be agnostic as to whether data is stored in S3, Google Cloud Storage, or Azure ADLS Gen-2.
Indeed, most of us *are* — Snowflake’s not sunsetting those destinations.
It’s the format layer that’s being sunset in favour of Iceberg. Of course, Snowflake would prefer you store your data in their S3 buckets, in their proprietary format. That way it controls the storage (and can charge you a mark-up) and the format (it owns the data).
But if you want your own storage, and you want to own all your data (i.e. Snowflake effectively just pushes queries down to it), you’re at an impasse without an open-source table format that Snowflake knows how to work with. That is exactly what external tables provide, but they are notoriously inefficient (source), hence the preference for Iceberg.
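As a rough sketch of what “owning your own data” looks like in practice, here is how you might read that same Iceberg table directly from your own object storage with PyIceberg, without touching any warehouse compute. The bucket path, metadata file name, region, and column names are assumptions for illustration.

```python
# Sketch: reading an Iceberg table straight from object storage with PyIceberg.
# The S3 path, metadata file name, and column names are illustrative only;
# credentials are assumed to come from the environment or an instance profile.
import pyarrow as pa
from pyiceberg.table import StaticTable

table = StaticTable.from_metadata(
    "s3://my-bucket/warehouse/analytics/events/metadata/v12.metadata.json",
    properties={"s3.region": "eu-west-1"},
)

# Scan only the columns and rows you need; PyIceberg prunes data files using
# Iceberg's own metadata before reading anything from storage.
arrow_table: pa.Table = (
    table.scan(
        row_filter="user_id = 42",
        selected_fields=("event_id", "user_id", "event_ts"),
    ).to_arrow()
)
print(arrow_table.num_rows)
```

The point of the sketch is that the warehouse is just one of several engines pointed at the same files: the table format, not the vendor, is the contract.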
Why Iceberg?
It’s interesting to note that there are a few other alternatives, but it’s Iceberg that’s getting the most traction.
For example, you could check out Apache Hudi, or Onehouse, who are working on an interesting project called OneTable.
OneTable is an abstraction on top of all of Hudi, Delta and Iceberg. It adds an interoperability layer between all three.
One has to wonder whether all these additional layers truly make systems like Snowflake or Databricks more or less efficient (the answer is surely less). But the point is that the BYOS pattern is enabled by a handful of open table formats. Databricks and Snowflake already support a few of these, so everyone else will surely follow suit.
Why stop at warehouses?
There are increasingly many applications where the database layer served to the end user resembles an analytical database.
Think about Amplitude, for example.
Amplitude is a platform that collects event data.
Users then build charts and conduct analysis on the event data. They build graphs in much the same way that you might build graphs in Looker, and query a database like Snowflake.
There’s no reason that portion of Amplitude’s database can’t be architected on Iceberg, provided the logic required to effectively query Iceberg tables in SQL is also available.
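For illustration, here is a hedged sketch of what that SQL query layer could look like, using DuckDB’s Iceberg extension from Python. The table path is hypothetical, and this is in no way a claim about how Amplitude is actually built.

```python
# Sketch: an application's analytical layer querying an Iceberg table in SQL
# via DuckDB's iceberg extension. The metadata path is hypothetical; S3
# credentials are assumed to be configured (e.g. via a DuckDB secret or env).
import duckdb

con = duckdb.connect()
con.execute("INSTALL iceberg")
con.execute("LOAD iceberg")
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")

# Aggregate events straight out of the Iceberg table in object storage
result = con.execute(
    """
    SELECT date_trunc('day', event_ts) AS day, COUNT(*) AS events
    FROM iceberg_scan('s3://my-bucket/warehouse/analytics/events/metadata/v12.metadata.json')
    GROUP BY 1
    ORDER BY 1
    """
).df()
print(result.head())
```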
There are many pieces of software like Amplitude that effectively store important information that then gets ETL’d out of SaaS applications into places like Snowflake.
This process is costly, and leads to our first would-be benefit of a world where most data-intensive applications offer a BYOS option.
ELT gets dramatically minimised
This is surely the solution to the ELT problem: erase the need for it in the first place.
The second benefit is analytical. Rather than having disparate data sources that require constant EL and T, you have data being updated in place. Provided there are no multi-tenancy issues, you effectively centralise all your data by design.
Imagine if Salesforce, Amplitude, and Xero all simply had Iceberg backends. Imagine your data warehouse was also based on Iceberg, and your Postgres RDS instance needed no replica because you were simply firing events onto a queue to be put into Iceberg.
There would be no need for ELT.
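As a sketch of what that world could look like, here is how an application (or a consumer reading off a queue) might write events straight into an Iceberg table with PyIceberg, so that the warehouse and every other engine see the new data immediately. The catalog configuration, table name, and event schema are all assumptions.

```python
# Sketch: a queue consumer appending a micro-batch of events to an Iceberg table.
# The REST catalog URI, table identifier, and event schema are illustrative only.
from datetime import datetime

import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "default",
    **{
        "type": "rest",
        "uri": "https://my-iceberg-catalog.example.com",
        # auth / warehouse properties would normally go here
    },
)
events_table = catalog.load_table("analytics.events")

# A micro-batch of events pulled off a queue, represented as an Arrow table
batch = pa.table(
    {
        "event_id": pa.array([1001, 1002], type=pa.int64()),
        "user_id": pa.array([42, 7], type=pa.int64()),
        "event_ts": pa.array(
            [datetime(2024, 5, 1, 9, 0, 0), datetime(2024, 5, 1, 9, 0, 5)],
            type=pa.timestamp("us"),
        ),
        "payload": pa.array(['{"page": "home"}', '{"page": "pricing"}']),
    }
)

# The append creates a new Iceberg snapshot: atomic, and immediately visible to
# any engine (Spark, Trino, Snowflake, DuckDB...) reading the same table.
events_table.append(batch)
```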
And Snowflake loses the roughly 10% of spend that goes on storage; anecdotally, that’s about the share my data friends and I pay for storage, if not less.
But they lose more, and you gain more. That’s because the data models sitting in Iceberg that act as the back-ends for applications like Amplitude already have queries pushed down to them by those applications.
This means that instead of EL’ing the raw data and T’ing it in Snowflake, the need simply goes away.
What’s the catch?
The catch is that in order for this to work, you would need to convince the CEO of every data-intensive SaaS business to build their back-end on a database that works on an open table format.
Iceberg is pretty new. And even if it wasn’t, the world of databases doesn’t just work like that.
However, there is HOPE!
The hope is that replication from these services to Iceberg will become a gold standard. We already see this with services like Salesforce and Amplitude: they aren’t willing to offer a BYOS model (it’s too much of an internal lift), but they will give you a copy of your data wherever you want, whenever you want it.
At Orchestra, we aggregate, clean, and enrich all the Data pipeline metadata we can. This means we act as a Data Team for the Data Team, curating a rich metadata set which facilitates rapid debugging, seamless maintenance, and unparalleled depth of insight.
Will we do this using Iceberg? Perhaps. We’re thinking about it. Let us know what you think in the comments! ❄️
Find out more about Orchestra
Orchestra is a platform for getting the most value out of your data as humanly possible. It’s also a feature-rich orchestration tool that can solve for multiple use-cases and solutions. Our docs are here, but why not also check out our integrations: we manage these so you can get started with your pipelines instantly. We also have a blog, written by the Orchestra team and guest writers, and some whitepapers for more in-depth reads.