iceberg integration with geoparquet #257
dschneider-wxs asked this question in Q&A (unanswered, 0 comments)
Hi - just getting my hands dirty with GeoParquet, trying to build a data lakehouse with Iceberg and GeoParquet. I haven't seen much detailed discussion on this, just "it would be good". PyIceberg doesn't yet support Iceberg v3, which includes the geo datatypes - only Sedona does, as far as I can tell.
Excluding Spark/Sedona for now, what are some approaches people have used to integrate partitioning with Iceberg and Parquet? My data source is actually NetCDF, and I was thinking of extracting it to GeoParquet because it's a vector data cube and doesn't fit well into the Icechunk model.
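For concreteness, the extraction step I have in mind looks roughly like this: flatten the (time, lat, lon) cube into a long row-per-observation table that maps naturally onto a Parquet/Iceberg schema. This is only a sketch - the arrays below are hypothetical stand-ins for data that would actually be read from the NetCDF file (e.g. via xarray or netCDF4):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for coordinate and value arrays read from NetCDF;
# shapes and names are illustrative only.
times = pd.date_range("2024-01-01", periods=3, freq="D")
lats = np.array([10.25, 10.75])
lons = np.array([20.25, 20.75])
values = np.arange(len(times) * len(lats) * len(lons), dtype=float).reshape(
    len(times), len(lats), len(lons)
)

# Flatten the (time, lat, lon) cube into one long table: one row per cell,
# with explicit time/lat/lon columns that Iceberg can partition on later.
t_idx, y_idx, x_idx = np.meshgrid(
    np.arange(len(times)), np.arange(len(lats)), np.arange(len(lons)),
    indexing="ij",
)
df = pd.DataFrame({
    "time": times[t_idx.ravel()],
    "lat": lats[y_idx.ravel()],
    "lon": lons[x_idx.ravel()],
    "value": values.ravel(),
})
print(len(df))  # 3 times x 2 lats x 2 lons = 12 rows
```

From a table like this, either DuckDB or GeoPandas can write the GeoParquet output (GeoPandas via `to_parquet` after building point geometries from lat/lon).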
I was thinking I'd extract values from NetCDF along with their time and spatial coordinates, then write with either DuckDB or GeoPandas. The incoming data arrives in chunks on the order of megabytes. If I add a GeoParquet file to Iceberg, the geometry column becomes a binary blob whose metadata is lost in translation when converted to DuckDB or pandas. I saw that a scan of the Iceberg table can actually yield the relevant Parquet files. Would it make sense to add lat and long columns and set up Iceberg partitioning on them? Then return the data files and read them directly with DuckDB or GeoPandas. Accessing the files directly feels like a bit of a hack, even if just for reads, but as this data grows I need to extract subsets of it and serve them as NetCDF again. I could also optimize the row-group sizes and partitions of the GeoParquet files to improve the actual extraction.
Any feedback on the approach is welcome, hopefully this isn't completely out in left field.