Snowflake External Tables: eliminate egress overhead (but not all the cost, yet)
Snowflake External Tables enable access to object storage, meaning sources sitting in third-party storage can be queried via tools like dbt
I’m Hugo Lu, CEO and Co-Founder of Orchestra, the Unified Control Plane for Data Teams. I’m a Data Engineer and helped build Codat’s and JUUL’s EMEA Data Functions🚀
Also check out our internal blog ⭐️
Want to see how Orchestra is changing the game by delivering unparalleled cost savings and visibility? Try our Free Tier now.
Introduction
It is really annoying to have to port data over from Object Storage into Data Warehouses.
Not only does it require you to spend valuable resources transferring data over the internet, but you also need another process to run the scripts that move the data.
FEAR NOT!
Snowflake now supports external stages and external tables. In fact, BigQuery does too, via external tables. As do Databricks and, I guess, even Redshift!
In this article, I’ll dive into what these are and why they’re cool.
Step 1: Create an External Storage Account and Bucket
AWS S3:
Sign in to your AWS Management Console.
Navigate to the S3 service.
Create a new bucket, ensuring it’s in the desired region.
Note down the bucket name and the region.
Azure Blob Storage:
Sign in to the Azure portal.
Create a new Storage account.
Create a new container within the storage account.
Note down the storage account name and container name.
Google Cloud Storage:
Sign in to the Google Cloud Console.
Create a new bucket.
Note down the bucket name and the location.
Step 2: Set Up Access Permissions
AWS S3:
Create an IAM user or role with the necessary permissions to access the bucket.
Attach a policy granting s3:GetObject, s3:PutObject, and s3:ListBucket permissions on the bucket.
Generate access keys for the IAM user (if applicable), or set up an IAM role for Snowflake to assume (see the storage integration sketch after this list).
Azure Blob Storage:
Create a Shared Access Signature (SAS) token with the required permissions.
Alternatively, set up role-based access control (RBAC) with the necessary permissions.
Google Cloud Storage:
Create a service account with the required permissions.
Generate a JSON key file for the service account.
Ensure the service account has the Storage Object Viewer and Storage Object Creator roles.
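Rather than scattering access keys or SAS tokens around, you can wrap these permissions in a Snowflake storage integration and let Snowflake assume a role instead. Here's a minimal sketch for S3 — the integration name, role ARN and bucket are placeholders you'd swap for your own (the Azure and GCS variants take a tenant ID or a Snowflake-managed GCS service account in place of the role ARN):
-- Minimal sketch: delegate auth to an IAM role rather than embedding keys
-- (integration name, role ARN and bucket are placeholders)
CREATE OR REPLACE STORAGE INTEGRATION my_s3_integration
TYPE = EXTERNAL_STAGE
STORAGE_PROVIDER = 'S3'
ENABLED = TRUE
STORAGE_AWS_ROLE_ARN = 'arn:aws:iam::123456789012:role/your-snowflake-role'
STORAGE_ALLOWED_LOCATIONS = ('s3://your-bucket-name/');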
Step 3: Configure Network Policies
Ensure that your Snowflake account can access the external storage. For AWS S3, you might need to set up VPC endpoints or bucket policies. For Azure and Google Cloud, ensure that network access rules allow connections from Snowflake’s IP ranges.
This is actually a massive bitch. You can see a bit more here. But the crux of the matter is as follows — imagine your security team has everything locked down in Azure. It’s all in ADLS. And you’re like — ha! I’ve got external tables.
Well — not so fast. All you’re doing is basically giving Snowflake a set of creds to access that data in ADLS, and that data’s locked down, remember?
External tables are cool, but if you haven't thought about security then you're going to fall a bit flat with external stages!
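If you've gone the storage integration route, Snowflake will tell you exactly which principal your security team needs to let in. A quick sketch, reusing the hypothetical integration name from the Step 2 sketch:
-- For S3 this returns STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID;
-- for Azure it returns the consent URL and multi-tenant app name to grant access against
DESC STORAGE INTEGRATION my_s3_integration;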
Step 4: Create the External Stage in Snowflake
AWS S3:
CREATE OR REPLACE STAGE my_s3_stage
URL='s3://your-bucket-name'
CREDENTIALS=(AWS_KEY_ID='your-access-key-id' AWS_SECRET_KEY='your-secret-access-key')
FILE_FORMAT=(TYPE=CSV);
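If your data lives in Azure or Google Cloud instead, the stage looks roughly like this — a hedged sketch with placeholder names and tokens (note that GCS stages don't accept inline credentials, so they need a storage integration):
-- Azure Blob / ADLS: placeholder account, container and SAS token
CREATE OR REPLACE STAGE my_azure_stage
URL='azure://yourstorageaccount.blob.core.windows.net/your-container'
CREDENTIALS=(AZURE_SAS_TOKEN='your-sas-token')
FILE_FORMAT=(TYPE=CSV);

-- GCS: placeholder bucket, referencing a hypothetical storage integration
CREATE OR REPLACE STAGE my_gcs_stage
URL='gcs://your-bucket-name'
STORAGE_INTEGRATION = my_gcs_integration
FILE_FORMAT=(TYPE=CSV);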
Now create the external table itself:
-- Columns on external tables are virtual, derived from the VALUE variant
-- (c1..c4 are the CSV column positions)
CREATE OR REPLACE EXTERNAL TABLE my_external_table (
id STRING AS (VALUE:c1::STRING),
name STRING AS (VALUE:c2::STRING),
age NUMBER AS (VALUE:c3::NUMBER),
country STRING AS (VALUE:c4::STRING)
)
WITH LOCATION=@my_s3_stage
FILE_FORMAT = (TYPE = CSV);
And now you can reference it.
SELECT * FROM my_external_table;
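One gotcha: the external table's file metadata doesn't refresh itself unless you've wired up event notifications, so after new files land you may need a manual refresh — and the METADATA$FILENAME pseudo-column is handy for checking which files Snowflake has actually registered:
-- Re-scan the stage so newly landed files show up
ALTER EXTERNAL TABLE my_external_table REFRESH;

-- See which files the rows are coming from
SELECT METADATA$FILENAME, COUNT(*)
FROM my_external_table
GROUP BY 1;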
Step 5: Running Snowflake External Tables in dbt (data build tool)
Now that you’ve created the external table, it’s literally this simple:
-- models/external_table_reference.sql
WITH external_data AS (
SELECT
id,
name,
age,
country
FROM
your_database.your_schema.my_external_table
)
SELECT
id,
name,
age,
country
FROM
external_data
And I love this. Because now, instead of having to land your data in S3, trigger a script to load that data (incrementally and efficiently, by the way) with an orchestration tool (ahem) and then trigger what's going on in Snowflake, you can JUST reference stuff IN S3 and crack on, completely skipping that middle step.
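One thing worth being deliberate about is materialisation. Left as a view, every downstream query re-scans the files sitting in object storage; if that gets slow or pricey, you can have dbt snapshot the data into a Snowflake-managed table instead. A sketch of the same hypothetical model with that tweak:
-- models/external_table_reference.sql
-- Materialise as a table so downstream models hit Snowflake-managed storage
-- rather than the external stage on every query (use 'view' to stay external)
{{ config(materialized='table') }}

SELECT
id,
name,
age,
country
FROM
your_database.your_schema.my_external_table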
Bonus — Iceberg and storage in your own environment via Snowflake Polaris
The announcement around Snowflake Polaris, the Iceberg Catalog, underlined Snowflake’s commitment to open storage.
Well, that’s one reading — the other is that they are acknowledging that Bring Your Own Storage is likely to be a powerful, unavoidable trend and they are planning for the future.
Either way, it means you should, in the future, be able to run models using Snowflake and, if you choose, dbt (and I suppose Coalesce.io too) on data that’s ALL IN YOUR OWN OBJECT STORAGE.
This is super cool, as it finally removes egress and ingress — often expensive and architecturally unnecessary costs.
Well — sort of. The data is, of course, still moving around. It’s just that you’re now using your data warehouse’s infrastructure to do it (and incurring the costs there). Still, it’s significantly simpler than setting up a separate service just to move the data yourself.
Conclusion
Setting up Snowflake External Tables in just five simple steps opens up a world of possibilities. By leveraging Snowflake’s powerful capabilities, you can easily access data stored in AWS S3, Azure Blob Storage, or Google Cloud Storage without the hassle of data migration or complex ETL processes.
However, it’s crucial to remember the importance of configuring security and network policies correctly. Ensuring proper permissions and access controls can prevent potential security pitfalls and ensure smooth operation.
As Snowflake continues to innovate with features like Snowflake Polaris and the Iceberg Catalog, the future of data storage and processing looks promising. These advancements point towards a more open and flexible data ecosystem, where you can leverage your existing object storage while enjoying the robust analytical capabilities of Snowflake.
I’m Hugo Lu, CEO and Co-Founder of Orchestra, the best-in-class Data Pipeline Management Platform, and I hope this guide has been helpful in simplifying the setup process for Snowflake External Tables.
Happy building!