Initial attempt at 2.0 spec #278
base: main
Conversation
| Field Name | Type | Description |
| -------------- | ------------ | ----------- |
- | encoding | string | **REQUIRED.** Name of the geometry encoding format. Currently `"WKB"`, `"point"`, `"linestring"`, `"polygon"`, `"multipoint"`, `"multilinestring"`, and `"multipolygon"` are supported. |
+ | encoding | string | **REQUIRED.** Name of the geometry encoding format. Only `"WKB"` is supported. |
I contemplated a bit whether we should call this something different than 'wkb', to get across that it's a more native Parquet thing. But our language mostly talks about how the column is encoded, which is WKB, so I ended up just reusing it.
I think keeping "WKB" is good. We want to be accurate, but I think we also don't want people to have to rewrite their GeoParquet metadata parsers if we can avoid it. (For the case where the reader is treating GEOMETRY/GEOGRAPHY as a byte array, WKB is still accurate).
| crs | object\|null | [PROJJSON](https://proj.org/specifications/projjson.html) object representing the Coordinate Reference System (CRS) of the geometry. If the field is not provided, the default CRS is [OGC:CRS84](https://www.opengis.net/def/crs/OGC/1.3/CRS84), which means the data in this column must be stored in longitude, latitude based on the WGS84 datum. |
| orientation | string | Winding order of exterior ring of polygons. If present must be `"counterclockwise"`; interior rings are wound in opposite order. If absent, no assertions are made regarding the winding order. |
- | edges | string | Name of the coordinate system for the edges. Must be one of `"planar"` or `"spherical"`. The default value is `"planar"`. |
+ | edges | string | Name of the coordinate system for the edges. Must be one of `"planar"` or `"non-planar"`. The default value is `"planar"`. |
I'm open to something other than 'non-planar' here, but it seemed weird to keep calling it 'spherical' and then have 'spherical' be one of the algorithm options. It made sense to me to align that one exactly with the Parquet options.
I think we should keep the "spherical" and just add the other options (again, so existing metadata parsers don't have to change). We can add a note that in Parquet the logical type is GEOGRAPHY for values other than planar.
(I think BigQuery might actively be using the spherical edge thing on export? But other than that, probably the only file that actively uses spherical edges is a test file in GeoArrow data, and I had to try really hard to actually write it. Which is to say, backward compatibility probably isn't as important here.)
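A minimal sketch of the note being suggested, assuming "planar" corresponds to the Parquet GEOMETRY logical type and any non-planar edges value corresponds to GEOGRAPHY; the function name and string forms are illustrative only:

```python
# Illustrative only: the suggested note expressed as code. Assumes "planar"
# maps to GEOMETRY and any other edges value (e.g. "spherical") maps to
# GEOGRAPHY carrying that edge interpolation algorithm.
def parquet_logical_type_for_edges(edges: str = "planar") -> str:
    if edges == "planar":
        return "GEOMETRY"
    # Non-planar edges: GEOGRAPHY, with the algorithm recorded on the
    # logical type (this string form is made up for the sketch).
    return f"GEOGRAPHY(algorithm={edges})"

assert parquet_logical_type_for_edges("spherical") == "GEOGRAPHY(algorithm=spherical)"
```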
The `crs` field of GeoParquet MUST reflect the CRS of the Parquet `crs` property on the GEOMETRY or GEOGRAPHY logical type. If the `srid` option is used in Parquet then it MUST reference a CRS identifier from the [Spatial reference identifier](https://spatialreference.org/projjson_index.json) catalog. The GeoParquet `crs` field MUST have the same PROJJSON that is in the corresponding spatial reference identifier catalog.
I contemplated putting more language in here, to try to really guide people more than the core Parquet spec does, as it's a bit vague / underspecified, but I ran out of time and am not super sure what to do. I borrowed from #276 the language on having the CRS identifier - I think we should make it required that it's an actual identified CRS, not just some random authority and code.
I still don't really understand the projjson option in the Parquet spec - where that projjson is supposed to live. It seems like we could narrow the options here and give people a clearer recommendation, but I don't yet understand what that would be. I was also looking at what @rouault did in GDAL, and it seemed like maybe he just had it stick the projjson inline; I see:
required binary field_id=-1 crop:code_list (String);
optional binary field_id=-1 geometry (Geometry(crs={
  "$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"ProjectedCRS","name":"MGI / Austria Lambert",
  "base_crs":{"type":"GeographicCRS","name":"MGI",
    "datum":{"type":"GeodeticReferenceFrame","name":"Militar-Geographische Institut",
      "ellipsoid":{"name":"Bessel 1841","semi_major_axis":6377397.155,"inverse_flattening":299.1528128}},
    "coordinate_system":{"subtype":"ellipsoidal","axis":[
      {"name":"Geodetic latitude","abbreviation":"Lat","direction":"north","unit":"degree"},
      {"name":"Geodetic longitude","abbreviation":"Lon","direction":"east","unit":"degree"}]},
    "id":{"authority":"EPSG","code":4312}},
  "conversion":{"name":"Austria Lambert",
    "method":{"name":"Lambert Conic Conformal (2SP)","id":{"authority":"EPSG","code":9802}},
    "parameters":[
      {"name":"Latitude of false origin","value":47.5,"unit":"degree","id":{"authority":"EPSG","code":8821}},
      {"name":"Longitude of false origin","value":13.3333333333333,"unit":"degree","id":{"authority":"EPSG","code":8822}},
      {"name":"Latitude of 1st standard parallel","value":49,"unit":"degree","id":{"authority":"EPSG","code":8823}},
      {"name":"Latitude of 2nd standard parallel","value":46,"unit":"degree","id":{"authority":"EPSG","code":8824}},
      {"name":"Easting at false origin","value":400000,"unit":"metre","id":{"authority":"EPSG","code":8826}},
      {"name":"Northing at false origin","value":400000,"unit":"metre","id":{"authority":"EPSG","code":8827}}]},
  "coordinate_system":{"subtype":"Cartesian","axis":[
    {"name":"Northing","abbreviation":"X","direction":"north","unit":"metre"},
    {"name":"Easting","abbreviation":"Y","direction":"east","unit":"metre"}]},
  "scope":"Topographic mapping (medium and small scale).","area":"Austria.",
  "bbox":{"south_latitude":46.4,"west_longitude":9.53,"north_latitude":49.02,"east_longitude":17.17},
  "id":{"authority":"EPSG","code":31287}}));
}
Or maybe that's just how it's showing up?
Long answer below, but the tl;dr is that I think inlining PROJJSON into the logical type is the most likely way to ensure the CRS is understood by the consumer. Keeping the PROJJSON requirement at the top level is probably good (although allowing the string "authority:code" might be helpful to unify it with the raster spec).
The Arrow C++ implementation (and probably the Rust implementation will do this too) inlines PROJJSON (or: inlines the GeoArrow CRS, which is usually PROJJSON but is permitted to be other things) into the CRS field because it turns out that zero Parquet implementations expose a way to add to the global key/value metadata store while converting into Parquet types. The other direction is also true...Parquet implementations didn't provide a reference to the global metadata in their type conversion code, so the projjson:metadata_key thing doesn't always work. (It does for Arrow C++ and GDAL I believe but I had to add it to Arrow C++ specifically to make this work).
I prototyped the top-level key/value metadata thing in Arrow C++ once and the other issue I ran into is that, because schema metadata is actively propagated, you have to be very careful what you name the CRS key or you end up with a reference to the wrong CRS if you read/modify/write a file. The best solution I could come up with was naming the key/value metadata projjson_crs_uglyHashOfTheJsonToAvoidCollisions (and then I ended up with more than one of them in some reasonably simple read/modify/write scenarios where only some of those values were referred to in the latest version of the file).
`srid:<must be an integer according to the spec>` doesn't actively refer to EPSG in the spec, so there's no guarantee a consumer will interpret it that way (although I did add an example file and a note in the Parquet example files explaining that most producers will probably mean this, and to please avoid actual EPSG codes if you're trying to use this as an opaque identifier).
Non-Arrow readers and writers have some more flexibility here because they don't have to depend on the GeoArrow-to-Parquet type conversion in Arrow C++ (or soon arrow-rs).
I put examples in the Parquet test files (and a readme explaining how various CRS values might be written/interpreted, including inlined PROJJSON). https://github.com/apache/parquet-testing/tree/master/data/geospatial#geospatial-test-files .
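For concreteness (not part of the proposal), here is one way a producer could obtain the PROJJSON to inline, assuming pyproj is available; EPSG:31287 matches the MGI / Austria Lambert example above:

```python
# Sketch, not normative: get PROJJSON for a known CRS identifier so it can be
# inlined into the GEOMETRY/GEOGRAPHY logical type and mirrored in the
# GeoParquet "crs" field.
from pyproj import CRS

crs = CRS.from_authority("EPSG", 31287)
projjson = crs.to_json()        # PROJJSON as a string
print(projjson[:80], "...")
```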
The bbox values MUST be in the same coordinate reference system as the geometry.
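A small, hypothetical sketch of what that means in practice, assuming geopandas; `total_bounds` is reported in the data's own CRS, which is what a column-level `bbox` entry would carry:

```python
# Hypothetical sketch: bbox values computed in the data's own CRS, as required
# above. The input file name is made up.
import geopandas as gpd

gdf = gpd.read_file("example.gpkg")
minx, miny, maxx, maxy = gdf.total_bounds       # reported in gdf.crs
column_metadata = {"bbox": [float(minx), float(miny), float(maxx), float(maxy)]}
```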
#### covering
Not sure if we want to explain something about the native stats? I figure in a blog post we can explain why covering went away, so it seems like we don't need to mention it here. But just wondering if there's any advice we could / should give people? We could perhaps recommend some spatial sorting, like bringing it in from https://github.com/opengeospatial/geoparquet/blob/main/format-specs/distributing-geoparquet.md#spatial-ordering. Similarly we could recommend zstd.
I think release notes/blog is a good idea. I can add an example of file size savings (I recently calculated that there are about 25GB of coverings in Overture buildings but the percentage is likely much higher for point data with few attributes).
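If the spec or a blog post did recommend spatial ordering plus zstd, it might look roughly like this sketch; `hilbert_distance()` requires a recent geopandas, and the file names are made up:

```python
# Sketch of the advice being discussed (spatial ordering + zstd compression),
# not a spec requirement.
import geopandas as gpd

gdf = gpd.read_file("buildings.gpkg")
gdf = gdf.iloc[gdf.geometry.hilbert_distance().argsort()]   # spatial sort
gdf.to_parquet("buildings.parquet", compression="zstd")
```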
paleolimbot left a comment
Thank you for doing this! A more formal draft like this one has been on my and Jia's TODO list for a long time but we clearly haven't gotten there 😬
  # Optionally both commit and push
- default_stages: [commit]
+ default_stages: [pre-commit]
Uh, not sure why this changed, I'll revert - I was just trying to get the pre-commit hooks to run locally.
It's possible that the latest version of pre-commit renamed the stage?
First draft of a 2.0 specification.
Changes:
The general philosophy of this is to require the full metadata if you want to be GeoParquet 2.0 compliant, and it must match the Parquet metadata. I could imagine a different philosophy where we remove the duplicate information - GeoParquet would not have crs, geometry types, edges / algorithm, and would provide the additional info - primary column, orientation, epoch.
And another approach I've been contemplating is that if a file adds a geometry or geography in Parquet and does it right, we call it 'GeoParquet 2.0 compliant', even if it doesn't write any additional metadata. I stopped short of trying to figure out how exactly to specify that, but I lean towards going in that direction - use the 'brand name' of GeoParquet to encourage Parquet geospatial implementations.
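A rough sketch of what that check could look like for a reader, assuming a pyarrow build whose Parquet implementation exposes the new logical types; the string comparison and file name are purely illustrative:

```python
# Sketch: treat "has a native Parquet GEOMETRY/GEOGRAPHY column" as the
# compliance signal, with or without "geo" key/value metadata.
import pyarrow.parquet as pq

meta = pq.read_metadata("maybe_geo.parquet")    # hypothetical file
has_native_geo = any(
    str(meta.schema.column(i).logical_type).upper().startswith(("GEOMETRY", "GEOGRAPHY"))
    for i in range(meta.num_columns)
)
```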
I also have been contemplating whether it'd be a good idea to do a GeoParquet 1.2, where we leave in covering and arrow types as optional (though maybe deprecated). @Maxxen spurred that thinking, as I believe he said somewhere it could help with backwards compatibility since DuckDB code looks for version 1.x in GeoParquet. But I'm more recently leaning away from that, since older libraries using older versions of Parquet readers look to mostly just crash when they encounter a GEOMETRY or GEOGRAPHY logical type. If they wouldn't crash I'd be inclined toward a compatibility mode, but it seems like we'd just be aiming at the narrow slice of people who upgraded their Parquet reader but haven't changed their geospatial handling. So I'm more inclined to just push towards getting all geospatial libraries to implement reading 2.0 / Parquet types. But I'm definitely open to arguments as to a 1.2 having value. It wouldn't be hard to do, but the messaging might be a bit weird.
I have more open questions, but will put them in the follow-up.