Skip to content

Geoparquet 1.1 spec compliance update#2271

Merged
tomkralidis merged 13 commits intogeopython:masterfrom
cgs-earth:geoparquetUpdates
Mar 9, 2026
Merged

Geoparquet 1.1 spec compliance update#2271
tomkralidis merged 13 commits intogeopython:masterfrom
cgs-earth:geoparquetUpdates

Conversation

@C-Loftus
Copy link
Contributor

@C-Loftus C-Loftus commented Mar 4, 2026

Overview

This PR address #2252

  • If a parquet file has a bbox column, it will be filtered with using the bbox instead of needing to specify multiple x / y columns
  • Added support to allow for id columns that are strings other other types. Previously it just used the row index of the data as the id value no matter what. This caused issues when the id was a string.
    • I added a warning log if the id column is non-numeric. Parquet cannot calculate row statistics as easily for strings or other non-numeric types, and as such, queries are likely to be less efficient
  • Cleaned up some of the internals; previously there were some dangling variables or unclear semantics on how the bbox filter works
  • Cleaned up tests. There is one class for all naive parquet files (i.e. ones that don't follow the geoparquet spec and naively set a point column) and one for all geoparquet compliant files

Related Issue / discussion

Closes #2252

Additional information

  • In the future it would be useful to change all code in this provider to skip the pandas serialization step. It is generally not necessary and adds extra overhead to convert to the intermediate format. However, for the sake of not making too many changes in one PR, I've left it as is.

  • I have not changed any of the code for s3fs. It is possible this could be tweaked for better performance.

Dependency policy (RFC2)

  • I have ensured that this PR meets RFC2 requirements

Updates to public demo

Contributions and licensing

(as per https://github.com/geopython/pygeoapi/blob/master/CONTRIBUTING.md#contributions-and-licensing)

  • I'd like to contribute [feature X|bugfix Y|docs|something else] to pygeoapi. I confirm that my contributions to pygeoapi will be compatible with the pygeoapi license guidelines at the time of contribution
  • I have already previously agreed to the pygeoapi Contributions and Licensing Guidelines

Copy link
Contributor Author

@C-Loftus C-Loftus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Few comments on anything non obvious

' field of provider config'
LOGGER.error(msg)
raise Exception(msg)
raise ProviderGenericError(msg)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raises a specific error instead of generic exception

self.maxy = self.y_field
else:
self.miny, self.maxy = self.y_field
self.bb = [self.minx, self.miny, self.maxx, self.maxy]
Copy link
Contributor Author

@C-Loftus C-Loftus Mar 4, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This variable self.bb was unused

:returns: generator of RecordBatch with the queried values
"""
scanner = pyarrow.dataset.Scanner.from_dataset(self.ds, **kwargs)
scanner = self.ds.scanner(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To my understanding pyarrow more efficiently creates the scanner when calling with the method on the object.

Feel free to let me know if you prefer removing use_threads or tweaking other kwargs here


:returns: dict of 0..n GeoJSON features
"""
result = None
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to store this as a variable since we always just return the result directly

@tomkralidis tomkralidis self-requested a review March 4, 2026 22:50
@tomkralidis
Copy link
Member

@C-Loftus can you address the CI errors? Thanks in advance.

@C-Loftus
Copy link
Contributor Author

C-Loftus commented Mar 5, 2026

Fixed! Sorry I missed that

@C-Loftus
Copy link
Contributor Author

C-Loftus commented Mar 6, 2026

Thanks for your review and feedback! I will prioritize addressing this as soon as I have some time, most likely early next week

@C-Loftus
Copy link
Contributor Author

C-Loftus commented Mar 9, 2026

Thanks again for your review. I believe I should have all feedback addressed.

The one extra change I made in the latest commit is to remove some of the dangling variables from get_test_file_path and just put them directly in the fixture so its is clearer they are only used once and what fixture they correspond to.

@tomkralidis tomkralidis added this to the 0.24.0 milestone Mar 9, 2026
@tomkralidis tomkralidis self-requested a review March 9, 2026 12:50
@tomkralidis tomkralidis merged commit 16736b2 into geopython:master Mar 9, 2026
3 checks passed
@C-Loftus C-Loftus deleted the geoparquetUpdates branch March 9, 2026 12:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Update parquet provider for latest geoparquet spec version / best practices

2 participants