Iceberg materialization test #552


Draft · wants to merge 16 commits into main from iceberg-materialization-test

Conversation

ian-r-rose (Member)

@thehanggit I'm reopening this with some additional experimentation around Iceberg metadata. The main issue is that there can be many different versions of the Iceberg metadata (in general, each change to the table results in a new version of the metadata). Snowflake keeps track of which version is the most recent, but it's not always easy to determine the correct one from a different tool trying to query the Iceberg table.

This creates a new stored procedure that unloads a JSON blob to the iceberg folder in S3 listing the correct metadata file for each Iceberg table in the marts database. I was then able to use that directly in the following sample DuckDB script:

-- Install and load the extensions enabling S3 access and Iceberg reads
install iceberg;
install httpfs;
load iceberg;
load httpfs;

set variable name = 'IMPUTATION__DETECTOR_IMPUTED_AGG_FIVE_MINUTES';
set variable schema = 'DBT_IROSE_IMPUTATION';

-- Since this script doesn't manage the iceberg tables, it doesn't know
-- which version of the metadata is the correct one. We could guess based
-- on heuristics for file names or modification dates, but that is extremely
-- unreliable. So we unload a special file into the iceberg directory that
-- tracks the most recent version of the metadata for each iceberg table
-- (as of the time the current_table_versions.json file was written)
-- This grabs the metadata path for the table we care about.
set variable meta = (
    select metadata from read_json(
        's3://caltrans-pems-dev-us-west-2-marts/iceberg/current_table_versions.json'
    )
    where name = getvariable('name') and schema = getvariable('schema')
);


-- Actually make a query with no authentication!
select
    station_id,
    sample_date,
    sum(volume_sum) as volume,
    count_if(
        volume_imputation_method in ('local', 'regional', 'global', 'local_avg', 'regional_avg')
    ) / count(*) as pct_imputed
from iceberg_scan(getvariable('meta'))
where sample_date > '2025-03-10' and sample_date < '2025-03-17'
group by station_id, sample_date
order by station_id, sample_date;
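The lookup that the `set variable meta` statement performs can also be sketched outside DuckDB. Here is a minimal Python sketch, assuming the JSON file is an array of records with `name`, `schema`, and `metadata` fields (matching the columns the script selects); the record contents below are made up for illustration:

```python
import json

# Illustrative contents mimicking current_table_versions.json. The real file is
# unloaded to S3 by the stored procedure in this PR; these values are invented.
blob = """
[
  {"name": "IMPUTATION__DETECTOR_IMPUTED_AGG_FIVE_MINUTES",
   "schema": "DBT_IROSE_IMPUTATION",
   "metadata": "s3://example-bucket/iceberg/table/metadata/v42.metadata.json"}
]
"""

def resolve_metadata(records, name, schema):
    """Return the tracked metadata path for the given table, or None if absent."""
    for rec in records:
        if rec["name"] == name and rec["schema"] == schema:
            return rec["metadata"]
    return None

records = json.loads(blob)
path = resolve_metadata(
    records, "IMPUTATION__DETECTOR_IMPUTED_AGG_FIVE_MINUTES", "DBT_IROSE_IMPUTATION"
)
```

In the actual script, DuckDB's `read_json` plus the `where` clause performs this same lookup in SQL; the sketch just spells out the logic.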

ian-r-rose self-assigned this on Mar 19, 2025
thehanggit (Contributor)

@thehanggit I'm reopening this with some additional experimentation around iceberg metadata. […] This creates a new stored procedure that unloads a JSON blob to the iceberg folder in S3 that lists the correct metadata file for each iceberg table in the marts database.

@ian-r-rose Thank you, Ian! One question: does this JSON file store the metadata information for all Iceberg tables? And can we use this single file to find the most up-to-date metadata for any table by setting the variables for the table name and schema?

ian-r-rose (Member, Author)

One question: does this JSON file store the metadata information for all Iceberg tables? And can we use this single file to find the most up-to-date metadata for any table by setting the variables for the table name and schema?

Yes, exactly!

I'm attaching an example file that was created as part of this PR (which is also hosted in S3, as you can see in the above script):

current_table_versions.json

thehanggit (Contributor)

Yes, exactly! I'm attaching an example file that was created as part of this PR (which is also hosted in S3, as you can see in the above script): current_table_versions.json

Thanks for the clarification! I will follow your steps and create the necessary Iceberg tables for your review.

ian-r-rose force-pushed the iceberg-materialization-test branch from ed46b84 to 33ef1e1 on May 2, 2025