Python GeoArrow Module Proposal
The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around geoarrow-python
fit together really well.
(It's possible this should be written RFC-style as a PR to a docs folder in this repo?)
Goals:
- Modular: the user can install what they need and choose which dependencies they want.
- Interoperable: the user can use C-based and Rust-based (and more? CUDA?) modules together smoothly.
- Extensible: future developers can develop on top of `geoarrow-c` and/or `geoarrow-rust` and largely reuse their Python bindings without having to create ones from scratch.
- Strongly typed: a method like `convex_hull` should always return a `PolygonArray` instead of a generic `GeometryArray` that the user can't "see into".
- Static typing support: at least minimal typing support and IDE autocompletion where possible.
- No strict pyarrow dependency. At least in the longer term, users should not be required to use pyarrow, even though it's likely the vast majority will.
This proposal is based around the new Arrow PyCapsule Interface, which allows libraries to exchange data safely, without memory leaks and without going through pyarrow. It is implemented in pyarrow as of v14+, work is underway to add it to arrow-rs, and presumably nanoarrow support would not be too hard to implement.
Primarily Functional API
A functional API makes it easy to take in data without knowing its provenance. Implementations may also add methods on classes to improve usability, but nothing should be implemented solely as a method.
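As a sketch of what "functional first, methods optional" could look like (`centroid` and `LineStringArray` here are illustrative names, not a settled API):

```python
from typing import Any


def centroid(array: Any) -> Any:
    """Functional entry point: accepts any object exporting __arrow_c_array__."""
    ...


class LineStringArray:
    def centroid(self) -> Any:
        # Optional convenience method: it only delegates to the functional API
        # and never holds the sole implementation.
        return centroid(self)
```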
Data Structures
These are the data structure concepts that I think need to be first-class. Each core implementation will provide classes that conform to one of these concepts.
GeometryArray
This is a logical array of contiguous memory that conforms to the GeoArrow spec. I envision there being `PointArray`, `LineStringArray`, etc. classes that are all subclasses of this.
This object should have an `__arrow_c_array__` member that conforms to the PyCapsule interface. The exported `ArrowSchema` must include extension type information (an extension name of `geoarrow.*` and optionally extension metadata).
Whether the array uses small or large list offsets internally does not matter, but the implementation should respect the `requested_schema` parameter of the PyCapsule interface when exporting.
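To make that concrete, here is a small consumer-side sketch; `as_pyarrow` is a hypothetical helper, and it assumes a pyarrow version whose `pa.array()` accepts objects exporting the PyCapsule interface:

```python
import pyarrow as pa


def as_pyarrow(geometry_array) -> pa.Array:
    # Any GeometryArray implementation (geoarrow-c, geoarrow-rust, ...) that
    # exports __arrow_c_array__ can be consumed without knowing its provenance.
    if not hasattr(geometry_array, "__arrow_c_array__"):
        raise TypeError("expected an object implementing __arrow_c_array__")
    return pa.array(geometry_array)  # zero-copy import via the PyCapsule protocol
```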
GeometryStorageArray?
In geoarrow-rs I've tried to make a distinction between "storage" types (i.e. WKB and WKT) and "analysis" types (i.e. anything zero-copy). This is partly to nudge users away from storing data as WKB and repeatedly operating on the WKB directly. Do we want to make any spec-level distinction between storage and analysis arrays? Should every operation accept storage types? I think it should be fine for a function to declare that it accepts only non-storage types and direct the user to call, say, `parse_wkb`.
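For example, a minimal sketch of such a guard, assuming the extension name is read from serialized Arrow field metadata (a registered pyarrow extension type would expose the name on its type object instead):

```python
import pyarrow as pa

STORAGE_ENCODINGS = {"geoarrow.wkb", "geoarrow.wkt"}


def require_native_encoding(field: pa.Field) -> None:
    # Reject storage-encoded columns and direct the user to parse them first.
    metadata = field.metadata or {}
    name = metadata.get(b"ARROW:extension:name", b"").decode()
    if name in STORAGE_ENCODINGS:
        raise TypeError(
            f"{name} is a storage encoding; call parse_wkb()/parse_wkt() first"
        )
```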
ChunkedGeometryArray
I believe that chunked arrays need to be a first-class data concept. Chunking is core to the Arrow and Parquet ecosystems, and handling something like `unary_union`, which needs the entire column as input to a single kernel, requires understanding some type of chunked input. I envision there being `ChunkedPointArray`, `ChunkedLineStringArray`, etc. classes that are all subclasses of this.
This should have an `__arrow_c_stream__` member. The `ArrowSchema` must represent a valid GeoArrow geometry type and must include extension type information (at least a name of `geoarrow.*` and optionally extension metadata).
This stream should be compatible with Dewey's existing kernel structure that allows for pushing a sequence of arrays into the kernel.
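As a rough illustration of that push-based pattern (the `GeometryKernel` protocol and `run_kernel` driver below are hypothetical, not an existing API):

```python
from typing import Iterable, Protocol


class GeometryKernel(Protocol):
    # Hypothetical kernel shape: arrays are pushed one chunk at a time, and a
    # single result is produced once the whole column has been seen.
    def push(self, array: object) -> None: ...
    def finish(self) -> object: ...


def run_kernel(kernel: GeometryKernel, chunks: Iterable[object]) -> object:
    # Drive a whole-column kernel (e.g. unary_union) with the chunks pulled
    # from a ChunkedGeometryArray's __arrow_c_stream__.
    for chunk in chunks:
        kernel.push(chunk)
    return kernel.finish()
```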
(It looks like pyarrow doesn't implement `__arrow_c_stream__` for a `ChunkedArray`? To me it seems natural for it to exist on a `ChunkedArray`... I'd be happy to open an issue.)
GeometryTable
For operations like joins, kernels need to be aware not only of geometries but also of attribute columns.
This should have an `__arrow_c_stream__` member. The `ArrowSchema` must be a struct type that includes all fields in the table, and the `ArrowArray` must be a struct array that includes all arrays in the table. At least one child of the `ArrowSchema` must have GeoArrow extension type information (an extension name of `geoarrow.*` and optionally extension metadata).
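As an illustration of the exported layout (assuming the existing geoarrow-pyarrow package and its `ga.point()` type constructor):

```python
import pyarrow as pa
import geoarrow.pyarrow as ga  # assumed: the existing geoarrow-pyarrow package

# Every table column becomes a child of one struct type; the "geometry" child
# carries the geoarrow.* extension type, satisfying the requirement above.
schema = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field("population", pa.int64()),
        pa.field("geometry", ga.point()),
    ]
)
struct_type = pa.struct(list(schema))  # the ArrowSchema a GeometryTable would export
```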
Future proofing
Spatial indexes can be serialized within a table or geometry array by having a struct containing the geometry column and a binary-typed run end array holding the bytes of the index (pending geoarrow discussion).
Not sure what other future proofing to consider.
Module hierarchy
General things to consider:
- How much appetite is there for a monorepo-based approach? I.e. for shapely interop, would you rather have an optional dependency on shapely from `geoarrow.pyarrow`, or have a separate library `geoarrow.shapely` that's very minimal? (Personally, I could go either way, but if `geoarrow.shapely` isn't likely to change often, I might lean towards a separate module...?)
- We presumably can't have import cycles across submodules.
- Versioning? I have to say I don't love requiring all libraries to be at the same version number, like the general Arrow libraries do.
geoarrow.pyarrow
- Pyarrow-based extension type classes
- Does not have any external dependencies other than pyarrow
- Holds and registers pyarrow extension types and extension arrays for all classes (see the sketch below).
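A minimal sketch of what using those registered types could look like (the `ga.point()` constructor is assumed from the current geoarrow-pyarrow API):

```python
import geoarrow.pyarrow as ga  # importing registers the extension types with pyarrow

# Type constructors such as ga.point() are assumed from geoarrow-pyarrow's API.
point_type = ga.point()
print(point_type.extension_name)  # "geoarrow.point"
```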
geoarrow.pandas
- Depends on geoarrow-pyarrow, pyarrow, pandas
- Should it have required or optional dependencies on other submodules for operations on arrays?
geoarrow.shapely
Contains two main functions for I/O between geoarrow geometry arrays and shapely, using the shapely to/from ragged array implementation:

```python
import numpy as np
from numpy.typing import NDArray

# ArrowArrayExportable / ArrowStreamExportable are the protocols defined
# under "Static Typing" below.


def to_shapely(
    array: ArrowArrayExportable | ArrowStreamExportable,
) -> NDArray[np.object_]: ...


def from_shapely(
    arr: NDArray[np.object_], *, maxchunk: int
) -> ArrowArrayExportable | ArrowStreamExportable: ...
```
- `from_shapely` returns pyarrow-based extension arrays. Longer term it also takes a parameter for the max chunk size (see the usage sketch below).
- Depends on geoarrow-pyarrow, shapely
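A hypothetical round trip through the two functions above (`shapely.points` builds an array of point geometries; the `maxchunk` value is arbitrary):

```python
import shapely

geoms = shapely.points([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
arrow_geoms = from_shapely(geoms, maxchunk=65_536)  # pyarrow-backed geoarrow array(s)
roundtripped = to_shapely(arrow_geoms)  # back to a numpy array of shapely objects
```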
geoarrow.gdal
Wrapper around pyogrio?
geoarrow.c
I'll let Dewey give his thoughts here.
- Dependency free?
geoarrow.rust.core
- Standalone classes: `PointArray`, `LineStringArray`, etc.
- Future: chunked classes
- No Python dependencies?
- Includes pure-Rust algorithms that don't require a C extension module
- Question: if I don't have Python dependencies, what do I return? Should I wrap my own version of a `Float64Array` and assume the user will call `pyarrow.array()` on the result? Or should I depend on pyarrow in the short term?
geoarrow.rust.proj, geoarrow.rust.geos
- Adds C-based dependencies that may not be desired in `geoarrow.rust.core`.
- Rust dependency on `geoarrow-rs` but no Python dependencies.
- Only functional, no methods on classes (can't add methods to external objects).
Downsides
- Leaks implementation details: does the user want/need to know what's implemented in Rust vs C? Or is that OK because we're targeting advanced users here (and libraries that build on top of `geoarrow.*` will handle making it simple for end users)?
- Multiple copies of geometry array definitions, e.g. `geoarrow.pyarrow.PointArray`, `geoarrow.c.PointArray`, `geoarrow.rust.core.PointArray`. This is, in some ways, unfortunate, but it allows users to control dependencies closely. And it's unavoidable unless functions returned bare PyCapsule objects?
- Explosion of implementations: function definitions in Rust, geoarrow.rust.core, geoarrow.pandas, geopolars.
Static Typing
A full static typing proposal is out of scope here (and some operations just won't be possible to type accurately).
A few methods will be amenable to generics, as shown below. But ideally every function can be given a return type that matches one of the Arrow PyCapsule protocols. At least in the Rust implementation, I'd like to have type stubs that return accurately typed classes (though sadly I'll still have to write the `.pyi` type stubs by hand).
```python
from typing import Protocol, Tuple, TypeVar, reveal_type


class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...


class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        ...


ArrayT = TypeVar("ArrayT", bound=ArrowArrayExportable)
StreamT = TypeVar("StreamT", bound=ArrowStreamExportable)


class PointArray:
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...


class ChunkedPointArray:
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        ...


def translate(array: ArrayT | StreamT, x: float, y: float) -> ArrayT | StreamT:
    ...


p = PointArray()
p2 = translate(p, 1, 1)
reveal_type(p2)
# Type of "p2" is "PointArray"

cp = ChunkedPointArray()
cp2 = translate(cp, 1, 1)
reveal_type(cp2)
# Type of "cp2" is "ChunkedPointArray"
```