iceberg integration with geoparquet #257
dschneider-wxs asked this question in Q&A (unanswered, 0 comments)
Hi - just getting my hands dirty with GeoParquet, trying to build a data lakehouse with Iceberg and GeoParquet. I haven't seen much detailed discussion on this, just "it would be good". PyIceberg doesn't yet support Iceberg v3, which includes the geo datatypes - only Sedona does, as far as I can tell.
Excluding Spark/Sedona for now, what are some approaches people have used to integrate partitioning with Iceberg and Parquet? My data source is actually NetCDF, and I was thinking of extracting it to GeoParquet because it's a vector data cube and doesn't fit well into the Icechunk model.
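For concreteness, the extraction step I have in mind looks roughly like this: flatten the (time, lat, lon) cube into a long row-per-observation table that maps naturally onto a Parquet/Iceberg schema. This is only a sketch - the arrays below are hypothetical stand-ins for data that would actually be read from the NetCDF file (e.g. via xarray or netCDF4):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for coordinate and value arrays read from NetCDF;
# shapes and names are illustrative only.
times = pd.date_range("2024-01-01", periods=3, freq="D")
lats = np.array([10.25, 10.75])
lons = np.array([20.25, 20.75])
values = np.arange(len(times) * len(lats) * len(lons), dtype=float).reshape(
    len(times), len(lats), len(lons)
)

# Flatten the (time, lat, lon) cube into one long table: one row per cell,
# with explicit time/lat/lon columns that Iceberg can partition on later.
t_idx, y_idx, x_idx = np.meshgrid(
    np.arange(len(times)), np.arange(len(lats)), np.arange(len(lons)),
    indexing="ij",
)
df = pd.DataFrame({
    "time": times[t_idx.ravel()],
    "lat": lats[y_idx.ravel()],
    "lon": lons[x_idx.ravel()],
    "value": values.ravel(),
})
print(len(df))  # 3 times x 2 lats x 2 lons = 12 rows
```

From a table like this, either DuckDB or GeoPandas can write the GeoParquet output (GeoPandas via `to_parquet` after building point geometries from lat/lon).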
I was thinking I'd extract values from NetCDF along with their time and spatial coordinates, then write with either DuckDB or GeoPandas. The incoming data arrives in chunks on the order of megabytes. If I add a GeoParquet file to Iceberg, the geometry column becomes a binary blob whose metadata is lost in translation when converted to DuckDB or pandas. I saw that a scan of the Iceberg table can actually yield the relevant Parquet files. Would it make sense to add lat and long columns and set up Iceberg partitioning on them? Then return the data files and read them directly with DuckDB or GeoPandas. Accessing the files directly feels like a bit of a hack, even if just for reads, but as this data grows I need to extract subsets of it and serve them as NetCDF again. I could also optimize the row-group sizes and partitions of the GeoParquet files to improve the actual extraction.
Any feedback on the approach is welcome, hopefully this isn't completely out in left field.