Skip to content

jordandakota/dbt-duckdb-iceberg-azure

Repository files navigation

dbt-duckdb + Apache Iceberg on Azure

Run dbt with DuckDB and automatically convert your tables to Apache Iceberg format on Azure Storage. No more manual conversion scripts or complex workflows.

uv run dbt_iceberg.py run {any dbt flags or commands here}

What This Does

This project solves a specific problem: dbt-duckdb can't write Iceberg tables directly. So we built a wrapper that:

  1. Runs your dbt models (which output Parquet files)
  2. Automatically converts them to Iceberg tables
  3. Cleans up the temporary files
  4. Handles all the Azure authentication headaches

Setup

Prerequisites

  • Python 3.11+
  • Azure Storage Account
  • Service Principal with "Storage Blob Data Contributor" role

Install

git clone <repository-url>
cd duckdb_iceberg

# Using uv (faster)
pip install uv
uv sync

# Or regular pip
pip install -r requirements.txt

Configure

Create .env file from .env.example:

AZURE_STORAGE_ACCOUNT=your_storage_account
AZURE_CONTAINER=your_container
AZURE_TENANT_ID=your_tenant_id
AZURE_CLIENT_ID=your_client_id
AZURE_CLIENT_SECRET=your_client_secret

Run

# Run everything
uv run python dbt_iceberg.py run

# Run specific models
uv run python dbt_iceberg.py run --select customer_metrics

# See example strategies in action
uv run python dbt_iceberg.py run --select tag:example

How Tables Get Loaded

You control how data gets loaded into Iceberg tables using three strategies:

1. Replace (Default)

Drop the table and recreate it. Simple and clean.

models:
  - name: dim_customers
    config:
      iceberg_strategy: replace  # or just omit it

2. Merge (Upsert)

Update existing rows, insert new ones. Perfect for slowly changing dimensions.

models:
  - name: dim_products
    config:
      iceberg_strategy: merge
      iceberg_merge_keys: [product_id, is_current]

3. Append

Just add new rows. Great for fact tables and event logs.

models:
  - name: fact_sales
    config:
      iceberg_strategy: append
      iceberg_partition_spec: [sale_year, sale_month]

Full Refresh

Need to recreate all tables with fresh schemas? Use --full-refresh:

# Forces all tables to use replace strategy
uv run python dbt_iceberg.py run --full-refresh

This is useful when:

  • You've changed column types in your models
  • You want to fix schema mismatches
  • You need a clean slate

The flag overrides configured strategies - even tables set to merge or append will be dropped and recreated.

Project Structure

duckdb_iceberg/
├── dbt_iceberg.py          # The magic wrapper
├── models/
│   ├── staging/            # Transform raw data
│   ├── marts/              # Business logic (external tables)
│   └── examples/           # See strategies in action
├── utilities/
│   └── iceberg_utils.py    # Azure auth fix + conversion logic
└── data/                   # Sample data

What Happens When You Run It

  1. dbt creates Parquet files in Azure at /temp/dbt/
  2. Script finds all external tables from dbt's manifest
  3. Each Parquet file gets converted to an Iceberg table
  4. Temp files get deleted (unless something fails)

Your data ends up in:

  • /raw/ - Seed data from dbt
  • /temp/dbt/ - Temporary Parquet files (auto-cleaned)
  • /iceberg-warehouse/ - Final Iceberg tables

The Azure Auth Fix

PyIceberg has a bug where service principal auth doesn't work for writes. We built a custom FileIO class that fixes it:

# This is in utilities/iceberg_utils.py
class AzureServicePrincipalFileIO(FsspecFileIO):
    # Properly passes Azure creds to the filesystem

No more "authentication failed" errors when creating tables.

Example Models

Check out the models/examples/ folder to see all three strategies in action:

  • dim_customers_latest.sql - Replace strategy
  • dim_products_scd.sql - Merge with history tracking
  • fact_daily_sales.sql - Append with partitioning

Troubleshooting

Auth errors? Make sure your service principal has "Storage Blob Data Contributor" role.

Container not found? Create it first. The script won't create containers for you.

Weird Python exit errors? Those are from the adlfs library. They're harmless and happen after all the work is done. On Windows, hide them with:

uv run python dbt_iceberg.py run 2>&1 | findstr /V "TypeError.*await"

License

MIT

About

Wrapper for dbt on iceberg with azure

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages