Iceberg materialization test #552


Draft · wants to merge 16 commits into main from iceberg-materialization-test

Conversation

ian-r-rose (Member)

@thehanggit I'm reopening this with some additional experimentation around Iceberg metadata. The main issue is that there can be many different versions of the Iceberg metadata (in general, each change to the table results in a new version of the metadata). Snowflake keeps track of which version is the most recent, but it's not always easy to determine the correct one from a different tool trying to query the Iceberg table.

This creates a new stored procedure that unloads a JSON blob to the iceberg folder in S3 listing the correct metadata file for each Iceberg table in the marts database. I was then able to use that directly in the following sample DuckDB script:

-- Install and load the extensions enabling S3 access and Iceberg reads
install iceberg;
install httpfs;
load iceberg;
load httpfs;

set variable name = 'IMPUTATION__DETECTOR_IMPUTED_AGG_FIVE_MINUTES';
set variable schema = 'DBT_IROSE_IMPUTATION';

-- Since this script doesn't manage the iceberg tables, it doesn't know
-- which version of the metadata is the correct one. We could guess based
-- on heuristics for file names or modification dates, but that is extremely
-- unreliable. So we unload a special file into the iceberg directory that
-- tracks the most recent version of the metadata for each iceberg table
-- (as of the time the current_table_versions.json file was written)
-- This grabs the metadata path for the table we care about.
set variable meta = (
    select metadata from read_json(
        's3://caltrans-pems-dev-us-west-2-marts/iceberg/current_table_versions.json'
    )
    where name = getvariable('name') and schema = getvariable('schema')
);


-- Actually make a query with no authentication!
select
    station_id,
    sample_date,
    sum(volume_sum) as volume,
    count_if(
        volume_imputation_method in ('local', 'regional', 'global', 'local_avg', 'regional_avg')
    ) / count(*) as pct_imputed
from iceberg_scan(getvariable('meta'))
where sample_date > '2025-03-10' and sample_date < '2025-03-17'
group by station_id, sample_date
order by station_id, sample_date;
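The lookup that the `set variable meta` statement performs can also be sketched outside DuckDB. Here is a minimal Python sketch, assuming the JSON file is an array of records with `name`, `schema`, and `metadata` fields (matching the columns the script selects); the record contents below are made up for illustration:

```python
import json

# Illustrative contents mimicking current_table_versions.json. The real file is
# unloaded to S3 by the stored procedure in this PR; these values are invented.
blob = """
[
  {"name": "IMPUTATION__DETECTOR_IMPUTED_AGG_FIVE_MINUTES",
   "schema": "DBT_IROSE_IMPUTATION",
   "metadata": "s3://example-bucket/iceberg/table/metadata/v42.metadata.json"}
]
"""

def resolve_metadata(records, name, schema):
    """Return the tracked metadata path for the given table, or None if absent."""
    for rec in records:
        if rec["name"] == name and rec["schema"] == schema:
            return rec["metadata"]
    return None

records = json.loads(blob)
path = resolve_metadata(
    records, "IMPUTATION__DETECTOR_IMPUTED_AGG_FIVE_MINUTES", "DBT_IROSE_IMPUTATION"
)
```

In the actual script, DuckDB's `read_json` plus the `where` clause performs this same lookup in SQL; the sketch just spells out the logic.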

ian-r-rose self-assigned this on Mar 19, 2025
thehanggit (Contributor)

@thehanggit I'm reopening this with some additional experimentation around iceberg metadata. […] This creates a new stored procedure that unloads a JSON blob to the iceberg folder in S3 that lists the correct metadata file for each iceberg table in the marts database.

@ian-r-rose Thank you, Ian! One question: does this JSON file store the metadata information for all Iceberg tables? And can we use this single file to find the most up-to-date metadata for any table by setting the variables for the table name and schema?

ian-r-rose (Member, Author)

One question: does this JSON file store the metadata information for all Iceberg tables? And can we use this single file to find the most up-to-date metadata for any table by setting the variables for the table name and schema?

Yes, exactly!

I'm attaching an example file that was created as part of this PR (which is also hosted in S3, as you can see in the above script):

current_table_versions.json

thehanggit (Contributor)

Yes, exactly! I'm attaching an example file that was created as part of this PR (which is also hosted in S3, as you can see in the above script): current_table_versions.json

Thanks for the clarification! I will follow your steps and create the necessary Iceberg tables for your review.

ian-r-rose force-pushed the iceberg-materialization-test branch from ed46b84 to 33ef1e1 on May 2, 2025