Running dbt-core on GitHub Actions: a comprehensive guide
GitHub runners are surprisingly versatile, but peel back the onion and you’ll realise deploying dbt-core on GitHub Actions isn’t for the faint-hearted
Introduction
So we checked out what all the fuss was about using GitHub Actions to run dbt-core, not just for Continuous Integration on changes to main but also “in prod”, i.e. on a schedule, be it cron or event-based.
Here’s what we found. Disclaimer — we leveraged some GPT to help us write this, but you’ll see the structure is very logical and human.
Deploying dbt Core on GitHub Actions: A Comprehensive Guide
In the world of data transformation and analytics, dbt (data build tool) has emerged as a powerful tool that enables data analysts and engineers to transform data in their warehouse more effectively. As organizations strive for efficiency and seamless integration in their data operations, the deployment of dbt Core on GitHub Actions has gained attention for its potential to automate and streamline data transformation workflows. This approach, however, comes with its set of advantages and disadvantages that organizations must consider.
Advantages of Deploying dbt Core on GitHub Actions
1. **Workflow Definition in Git**: One of the primary advantages of using GitHub Actions for dbt Core deployment is the ability to define workflows directly in Git. This means that the workflows are version-controlled, allowing for easy tracking of changes, collaboration among team members, and rollback if necessary.
2. **Immediate Access to Latest Code**: Deploying dbt Core through GitHub Actions ensures that the latest version of your code is always at your fingertips. This immediate access facilitates rapid deployment and testing, enhancing productivity and reducing the time from development to deployment.
3. **No Need to Provision Instances for GitHub-Hosted Runners**: For those on a GitHub Enterprise plan, there’s the added benefit of not needing to provision instances for runners. GitHub-hosted runners provide ready-to-use infrastructure that can significantly reduce the setup and maintenance overhead associated with managing your own instances for running dbt jobs.
Disadvantages of Deploying dbt Core on GitHub Actions
1. **Limited by GitHub’s Runners**: A significant limitation comes from being bound by the hosted runners GitHub provides. Unless organizations opt to configure self-hosted runners (a process which currently lacks robust support), they are subject to the constraints of GitHub’s available resources. This can lead to bottlenecks in processing times, especially for data-heavy operations.
2. **’Fire and Forget’ Implementation**: GitHub Actions’ implementation for dbt Core deployments is often described as ‘fire and forget’: once a dbt job has run, triggering subsequent actions or jobs is not straightforward. This limitation can hinder workflow automation, requiring additional effort to manage dependencies and follow-up actions manually (the closest native workaround, the `workflow_run` trigger, is sketched after this list).
3. **Overhead When Connecting to Other Services**: Integrating with other services, such as uploading data to Amazon S3, introduces significant overhead. The setup required to enable GitHub Actions workers to interact with external services can be cumbersome, involving extensive configuration and management of permissions and credentials. This added complexity can detract from the simplicity and efficiency sought through automation.
4. **Cost of GitHub Enterprise Plan**: It’s crucial to note that accessing the full benefits of GitHub Actions, including the use of GitHub hosted runners without needing to provision instances, requires a GitHub Enterprise Plan. The cost of this plan is a significant consideration for organizations, especially smaller ones, making it a notable disadvantage. The pricing structure, dependent on the number of users and level of resources required, can add a considerable expense to the organization’s operational costs.
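To make the ‘fire and forget’ point concrete: the closest thing GitHub offers to native chaining is the `workflow_run` trigger, which fires one workflow when another completes. A minimal sketch, assuming the dbt workflow is named `DBT Core Job` (as in the examples later in this post):

```yaml
# .github/workflows/post-dbt.yml
# Hypothetical follow-up workflow that fires when "DBT Core Job" completes
name: Post DBT Steps
on:
  workflow_run:
    workflows: ["DBT Core Job"]
    types: [completed]
jobs:
  notify:
    runs-on: ubuntu-latest
    # only proceed if the upstream dbt workflow actually succeeded
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    steps:
      - run: echo "dbt finished; kick off downstream work here"
```

This works, but it is one-directional and repository-local, so anything resembling a cross-project DAG still needs an external orchestrator.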
Summary
Deploying dbt Core on GitHub Actions offers a compelling blend of benefits for organizations looking to streamline their data transformation workflows. The ease of integration with Git, immediate access to the latest code, and the elimination of infrastructure provisioning are significant advantages. However, these benefits are balanced by limitations such as resource constraints, the ‘fire and forget’ nature of deployments, overhead in connecting to external services, and the cost of the GitHub Enterprise Plan. Organizations must weigh these factors carefully, considering their specific needs, resources, and goals, to determine if deploying dbt Core on GitHub Actions aligns with their operational strategy and offers the best return on investment.
Ok cool — I want to run it. Give me the Actions.yml file
One note before the YAML: the workflow file itself (`.yml` or `.yaml`) isn’t something a tool like Terraform manages directly. Terraform is infrastructure as code, used to manage cloud services and resources; if you manage GitHub with it, you would typically use it to set up the repository, permissions, and perhaps the secrets that a workflow consumes, not the workflow file itself.
Below is an example GitHub Actions workflow file named `dbt.yml` that you can add to your repository under `.github/workflows/` to trigger a dbt Core job. This example assumes you have your dbt project configured and available in your repository:
```yaml
name: DBT Core Job
on:
  push:
    branches:
      - main
  schedule:
    - cron: '0 3 * * *' # runs every day at 3 AM UTC
jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dbt
        # dbt-core alone cannot connect to a warehouse; install the adapter
        # for your warehouse as well (dbt-postgres is shown as an example)
        run: pip install dbt-core dbt-postgres
      - name: Run dbt
        run: dbt run
        env:
          DBT_PROFILES_DIR: ${{ github.workspace }}
```
This workflow does the following:
- It’s triggered on every push to the `main` branch and also on a schedule (daily at 3 AM UTC).
- It sets up a job named `dbt-run` that runs on the latest Ubuntu runner available on GitHub Actions.
- It checks out your repository using the `actions/checkout` action.
- It sets up Python using the `actions/setup-python` action, specifying Python version 3.11.
- It installs `dbt-core`, plus an adapter for your warehouse, using pip.
- Finally, it runs `dbt run` to execute your dbt project. This step assumes that your dbt project is configured to run with the provided environment and that any necessary dbt profile is either checked into your repository or configured through GitHub Secrets (for database connection settings, etc.).
Please remember, hardcoding sensitive information in your workflow file (like database credentials) is not recommended. Instead, use [GitHub Secrets](https://docs.github.com/en/actions/security-guides/encrypted-secrets) to securely store and access such information within your workflow.
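For example, dbt’s `env_var` function lets you commit a credential-free `profiles.yml` and supply the actual values from GitHub Secrets at runtime. A minimal sketch, assuming a Postgres adapter, a profile called `my_project`, and hypothetical variable names:

```yaml
# profiles.yml (safe to commit: credentials come from the environment)
my_project:
  target: prod
  outputs:
    prod:
      type: postgres
      host: "{{ env_var('DBT_HOST') }}"
      user: "{{ env_var('DBT_USER') }}"
      password: "{{ env_var('DBT_PASSWORD') }}"
      port: 5432
      dbname: analytics
      schema: public
```

The `Run dbt` step in the workflow would then gain an `env` block mapping each variable to a secret, e.g. `DBT_PASSWORD: ${{ secrets.DBT_PASSWORD }}`.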
How do I get the artifacts, like `run_results.json`?
To collect artifacts such as `run_results.json` from a dbt run in a GitHub Actions workflow, you can use the `actions/upload-artifact` action. This action allows you to upload files produced during your workflow to GitHub, where they can be downloaded after the workflow completes. This is particularly useful for retrieving output files like `run_results.json`, `manifest.json`, or any logs that your dbt job might generate.
Here’s how you can modify the provided GitHub Actions workflow example to upload `run_results.json` as an artifact:
```yaml
name: DBT Core Job
on:
  push:
    branches:
      - main
  schedule:
    - cron: '0 3 * * *' # runs every day at 3 AM UTC
jobs:
  dbt-run:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0
      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - name: Install dbt
        # remember the warehouse adapter alongside dbt-core
        run: pip install dbt-core dbt-postgres
      - name: Run dbt
        run: |
          dbt run
          dbt docs generate
      - name: Upload run_results.json
        uses: actions/upload-artifact@v4
        with:
          name: dbt-run-results
          path: target/run_results.json
      - name: Upload manifest.json
        uses: actions/upload-artifact@v4
        with:
          name: dbt-manifest
          path: target/manifest.json
```
In this workflow, after the dbt run:
- The `actions/upload-artifact` action is called to upload `run_results.json` from the `target` directory (the default directory where dbt stores its output files).
- Additionally, you can upload `manifest.json` using a similar step. This file contains a manifest of your dbt project, which can be useful for debugging or analyzing your dbt models.
The uploaded artifacts can be found in the “Artifacts” section of the GitHub Actions run page after the workflow completes. You can download them from there for further analysis or use.
Remember, the paths specified in the `path` keys must match the location of your files within the GitHub Actions runner’s filesystem. If you have configured dbt to output its files to a directory other than `target`, you’ll need to adjust the `path` values accordingly.
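If you would rather fetch artifacts programmatically than click through the UI, the GitHub CLI can download them after a run. A sketch, assuming `gh` is authenticated and the artifact names from the workflow above:

```bash
# list recent runs of the workflow to find a run ID
gh run list --workflow dbt.yml --limit 5

# download the named artifacts from a specific run into ./artifacts
gh run download <run-id> --name dbt-run-results --dir ./artifacts
gh run download <run-id> --name dbt-manifest --dir ./artifacts
```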
Can I trigger GitHub Actions from an external source, like an orchestrator?
Yes, you can trigger GitHub Actions workflows from an external source, such as an orchestrator (e.g., Apache Airflow, Google Cloud Composer, AWS Step Functions), or any other external system capable of making HTTP requests. This is commonly achieved using GitHub’s [repository dispatch event](https://docs.github.com/en/rest/reference/repos#create-a-repository-dispatch-event), which allows you to send a POST request to GitHub’s API to trigger a workflow run. Additionally, you can use the [workflow_dispatch event](https://docs.github.com/en/actions/using-workflows/events-that-trigger-workflows#workflow_dispatch), which provides a way to manually trigger workflows from GitHub’s UI or via the GitHub API.
### Using the Repository Dispatch Event
To trigger a workflow using the repository dispatch event, you’ll first need to create a Personal Access Token (PAT) with `repo` scope. This token is used to authenticate your HTTP request to the GitHub API.
Your workflow file (`*.yml` or `*.yaml` under `.github/workflows/`) needs to be set up to listen for the repository dispatch event:
```yaml
name: External Trigger Workflow
on:
  repository_dispatch:
    types: [orchestrator-event]
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run a script
        run: echo "This workflow was triggered by an external event!"
```
To trigger this workflow, send a POST request to the GitHub API:
```bash
curl -X POST \
  -H "Accept: application/vnd.github.v3+json" \
  -H "Authorization: token YOUR_PERSONAL_ACCESS_TOKEN" \
  https://api.github.com/repos/OWNER/REPO/dispatches \
  -d '{"event_type": "orchestrator-event", "client_payload": {"key": "value"}}'
```
Replace `YOUR_PERSONAL_ACCESS_TOKEN` with your generated PAT, and `OWNER` and `REPO` with your repository details.
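The `client_payload` is how an orchestrator passes parameters into the run; the triggered workflow reads it from the event context. A minimal sketch, reusing the hypothetical `key` field from the curl above as an extra step in the `build` job:

```yaml
# extra step for the "build" job above: read the dispatch payload
- name: Use the payload
  run: echo "Orchestrator sent ${{ github.event.client_payload.key }}"
```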
### Using the Workflow Dispatch Event
The workflow dispatch event is similar but allows for input parameters to be passed and can be triggered directly from the GitHub UI.
First, set up your workflow file to listen for the workflow dispatch event:
```yaml
name: Manual Trigger Workflow
on:
  workflow_dispatch:
    inputs:
      logLevel:
        description: 'Log level'
        required: true
        default: 'warning'
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run a script
        run: echo "Log level is ${{ github.event.inputs.logLevel }}"
```
To trigger this workflow via the GitHub API, send a POST request:
```bash
curl -X POST \
  -H "Accept: application/vnd.github+json" \
  -H "Authorization: token YOUR_PERSONAL_ACCESS_TOKEN" \
  https://api.github.com/repos/OWNER/REPO/actions/workflows/YOUR_WORKFLOW_FILE.yml/dispatches \
  -d '{"ref":"main", "inputs": {"logLevel": "info"}}'
```
Replace `YOUR_WORKFLOW_FILE.yml` with the path to your workflow file, and adjust the `ref` and `inputs` as necessary.
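If the system doing the triggering has the GitHub CLI available, the same dispatch is a one-liner (a sketch, using the same placeholder workflow file name):

```bash
# equivalent workflow_dispatch trigger via the GitHub CLI
gh workflow run YOUR_WORKFLOW_FILE.yml --ref main -f logLevel=info
```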
Both methods provide flexible ways to integrate GitHub Actions into your broader automation or orchestration strategies, enabling you to leverage GitHub’s CI/CD capabilities as part of a larger workflow.
Conclusion
Phew! Quite a lot there.
Now we see there are a few advantages to running dbt-core in GitHub Actions. The code’s already there, it’s incredibly convenient, and you can even get artifacts!
However, there is no built-in visibility and no alerting. There is no concept of cross-project dependencies and no orchestration UI.
Furthermore, making use of the artifacts inevitably requires some other system to be triggered to go and fetch them, typically from an object store like S3.
It almost feels, therefore, that unless you’re going to fall on your large GitHub-shaped sword and argue that GitHub Actions is the ultimate orchestration tool, it’s really only fit for purpose as a single component of a data stack (used specifically to run dbt-core).
Even then, it feels painful and not especially suitable, due to the high interoperability costs.
But hey: don’t want to pay for dbt Cloud and don’t care about visibility? Then GitHub away! 🔨