Python GeoArrow Module Proposal
The strength of Arrow is in its interoperability, and therefore I think it's worthwhile to discuss how to ensure all the pieces around geoarrow-python
fit together really well.
(It's possible this should be written RFC-style as a PR to a docs folder in this repo?)
Goals:
- Modular: the user can install what they need and choose which dependencies they want.
- Interoperable: the user can use C-based and Rust-based (and more? CUDA?) modules together smoothly.
- Extensible: future developers can develop on top of `geoarrow-c` and/or `geoarrow-rust` and largely reuse their Python bindings without having to create ones from scratch.
- Strongly typed: a method like `convex_hull` should always return a `PolygonArray` instead of a generic `GeometryArray` that the user can't "see into".
- Static typing support: at least minimal typing support and IDE autocompletion where possible.
- No strict pyarrow dependency. At least in the longer term, users should not be required to use pyarrow, even though it's likely the vast majority will.
This proposal is based around the new Arrow PyCapsule Interface, which allows libraries to exchange data safely, without memory leaks and without going through pyarrow. It is implemented in pyarrow as of v14+, work is underway to add it to arrow-rs, and presumably nanoarrow support would not be too hard to implement.
Primarily Functional API
A functional API makes it easy to take in data without knowing its provenance. Implementations may also add methods on classes to improve usability, but nothing should be implemented solely as a method.
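As a sketch of what "functional first, methods optional" could look like (`centroid` and `LineStringArray` here are illustrative names, not a settled API):

```python
from typing import Any


def centroid(array: Any) -> Any:
    """Functional entry point: accepts any object exporting __arrow_c_array__."""
    ...


class LineStringArray:
    def centroid(self) -> Any:
        # Optional convenience method: it only delegates to the functional API
        # and never holds the sole implementation.
        return centroid(self)
```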
Data Structures
These are the data structure concepts that I think need to be first-class. Each core implementation will provide classes that conform to one of these concepts.
GeometryArray
This is a logical array of contiguous memory that conforms to the GeoArrow spec. I envision there being `PointArray`, `LineStringArray`, etc. classes that are all subclasses of this.
This object should have an `__arrow_c_array__` member that conforms to the PyCapsule interface. The exported `ArrowSchema` must include extension type information (an extension name of `geoarrow.*` and optionally extension metadata).
Whether the array uses small or large list offsets internally does not matter, but the implementation should respect the `requested_schema` parameter of the PyCapsule interface when exporting.
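To make that concrete, here is a small consumer-side sketch; `as_pyarrow` is a hypothetical helper, and it assumes a pyarrow version whose `pa.array()` accepts objects exporting the PyCapsule interface:

```python
import pyarrow as pa


def as_pyarrow(geometry_array) -> pa.Array:
    # Any GeometryArray implementation (geoarrow-c, geoarrow-rust, ...) that
    # exports __arrow_c_array__ can be consumed without knowing its provenance.
    if not hasattr(geometry_array, "__arrow_c_array__"):
        raise TypeError("expected an object implementing __arrow_c_array__")
    return pa.array(geometry_array)  # zero-copy import via the PyCapsule protocol
```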
GeometryStorageArray?
In geoarrow-rs I've tried to make a distinction between "storage" types (i.e. WKB and WKT) and "analysis" types (i.e. anything zero-copy). This is partly to nudge users away from storing data as WKB and repeatedly operating on the WKB directly. Do we want to make any spec-level distinction between storage and analysis arrays? Should every operation accept storage types? I think it should be fine for a function to declare that it accepts only non-storage types and direct the user to call, say, `parse_wkb`.
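For example, a minimal sketch of such a guard, assuming the extension name is read from serialized Arrow field metadata (a registered pyarrow extension type would expose the name on its type object instead):

```python
import pyarrow as pa

STORAGE_ENCODINGS = {"geoarrow.wkb", "geoarrow.wkt"}


def require_native_encoding(field: pa.Field) -> None:
    # Reject storage-encoded columns and direct the user to parse them first.
    metadata = field.metadata or {}
    name = metadata.get(b"ARROW:extension:name", b"").decode()
    if name in STORAGE_ENCODINGS:
        raise TypeError(
            f"{name} is a storage encoding; call parse_wkb()/parse_wkt() first"
        )
```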
ChunkedGeometryArray
I believe that chunked arrays need to be a first-class data concept. Chunking is core to the Arrow and Parquet ecosystems, and handling something like `unary_union`, which needs the entire column as input to a single kernel, requires understanding some type of chunked input. I envision there being `ChunkedPointArray`, `ChunkedLineStringArray`, etc. classes that are all subclasses of this.
This should have an `__arrow_c_stream__` member. The `ArrowSchema` must represent a valid GeoArrow geometry type and must include extension type information (at least a name of `geoarrow.*` and optionally extension metadata).
This stream should be compatible with Dewey's existing kernel structure that allows for pushing a sequence of arrays into the kernel.
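As a rough illustration of that push-based pattern (the `GeometryKernel` protocol and `run_kernel` driver below are hypothetical, not an existing API):

```python
from typing import Iterable, Protocol


class GeometryKernel(Protocol):
    # Hypothetical kernel shape: arrays are pushed one chunk at a time, and a
    # single result is produced once the whole column has been seen.
    def push(self, array: object) -> None: ...
    def finish(self) -> object: ...


def run_kernel(kernel: GeometryKernel, chunks: Iterable[object]) -> object:
    # Drive a whole-column kernel (e.g. unary_union) with the chunks pulled
    # from a ChunkedGeometryArray's __arrow_c_stream__.
    for chunk in chunks:
        kernel.push(chunk)
    return kernel.finish()
```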
(It looks like pyarrow doesn't implement `__arrow_c_stream__` for a `ChunkedArray`? To me it seems natural for it to exist on a `ChunkedArray`... I'd be happy to open an issue.)
GeometryTable
For operations like joins, kernels need to be aware not only of geometries but also of attribute columns.
This should have an `__arrow_c_stream__` member. The `ArrowSchema` must be a struct type that includes all fields in the table, and the `ArrowArray` must be a struct array that includes all arrays in the table. At least one child of the `ArrowSchema` must have GeoArrow extension type information (an extension name of `geoarrow.*` and optionally extension metadata).
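As an illustration of the exported layout (assuming the existing geoarrow-pyarrow package and its `ga.point()` type constructor):

```python
import pyarrow as pa
import geoarrow.pyarrow as ga  # assumed: the existing geoarrow-pyarrow package

# Every table column becomes a child of one struct type; the "geometry" child
# carries the geoarrow.* extension type, satisfying the requirement above.
schema = pa.schema(
    [
        pa.field("name", pa.string()),
        pa.field("population", pa.int64()),
        pa.field("geometry", ga.point()),
    ]
)
struct_type = pa.struct(list(schema))  # the ArrowSchema a GeometryTable would export
```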
Future proofing
Spatial indexes can be serialized within a table or geometry array by having a struct containing the geometry column and a binary-typed run end array holding the bytes of the index (pending geoarrow discussion).
Not sure what other future proofing to consider.
Module hierarchy
General things to consider:
- How much appetite is there for a monorepo-based approach? I.e. for shapely interop, would you rather have an optional dependency on shapely from `geoarrow.pyarrow`, or have a separate library `geoarrow.shapely` that's very minimal? (Personally, I could go either way, but if `geoarrow.shapely` isn't likely to change often, I might lean towards a separate module...?)
- We presumably can't have import cycles across submodules.
- Versioning? I have to say I don't love requiring all libraries to be at the same version number, like the general Arrow libraries do.
geoarrow.pyarrow
- Pyarrow-based extension type classes
- Does not have any external dependencies other than pyarrow
- Holds and registers pyarrow extension types and extension arrays for all classes (see the sketch below).
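A minimal sketch of what using those registered types could look like (the `ga.point()` constructor is assumed from the current geoarrow-pyarrow API):

```python
import geoarrow.pyarrow as ga  # importing registers the extension types with pyarrow

# Type constructors such as ga.point() are assumed from geoarrow-pyarrow's API.
point_type = ga.point()
print(point_type.extension_name)  # "geoarrow.point"
```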
geoarrow.pandas
- Depends on geoarrow-pyarrow, pyarrow, pandas
- Should it have required or optional dependencies on other submodules for operations on arrays?
geoarrow.shapely
Contains two main functions for I/O between geoarrow geometry arrays and shapely, using the shapely to/from ragged array implementation:

```python
import numpy as np
from numpy.typing import NDArray

# ArrowArrayExportable / ArrowStreamExportable are the protocols defined
# under "Static Typing" below.


def to_shapely(
    array: ArrowArrayExportable | ArrowStreamExportable,
) -> NDArray[np.object_]: ...


def from_shapely(
    arr: NDArray[np.object_], *, maxchunk: int
) -> ArrowArrayExportable | ArrowStreamExportable: ...
```
- `from_shapely` returns pyarrow-based extension arrays. Longer term it also takes a parameter for the max chunk size (see the usage sketch below).
- Depends on geoarrow-pyarrow, shapely
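A hypothetical round trip through the two functions above (`shapely.points` builds an array of point geometries; the `maxchunk` value is arbitrary):

```python
import shapely

geoms = shapely.points([0.0, 1.0, 2.0], [0.0, 1.0, 2.0])
arrow_geoms = from_shapely(geoms, maxchunk=65_536)  # pyarrow-backed geoarrow array(s)
roundtripped = to_shapely(arrow_geoms)  # back to a numpy array of shapely objects
```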
geoarrow.gdal
Wrapper around pyogrio?
geoarrow.c
I'll let Dewey give his thoughts here.
- Dependency free?
geoarrow.rust.core
- Standalone classes: `PointArray`, `LineStringArray`, etc.
- Future: chunked classes
- No Python dependencies?
- Includes pure-Rust algorithms that don't require a C extension module
- Question: if I don't have Python dependencies, what do I return? Should I wrap my own version of a `Float64Array` and assume the user will call `pyarrow.array()` on the result? Or should I depend on pyarrow in the short term?
geoarrow.rust.proj, geoarrow.rust.geos
- Adds C-based dependencies that may not be desired in `geoarrow.rust.core`.
- Rust dependency on `geoarrow-rs` but no Python dependencies.
- Only functional, no methods on classes (can't add methods to external objects).
Downsides
- Leaks implementation details: does the user want/need to know what's implemented in Rust vs C? Or is that OK because we're targeting advanced users here (and libraries that build on top of `geoarrow.*` will handle making it simple for end users)?
- Multiple copies of geometry array definitions, e.g. `geoarrow.pyarrow.PointArray`, `geoarrow.c.PointArray`, `geoarrow.rust.core.PointArray`. This is, in some ways, unfortunate, but it allows users to control dependencies closely. And it's unavoidable unless functions returned bare PyCapsule objects?
- Explosion of implementations: function definitions in Rust, geoarrow.rust.core, geoarrow.pandas, geopolars.
Static Typing
A full static typing proposal is out of scope here (and some operations just won't be possible to type accurately).
A few methods will be amenable to generics, as shown below. But ideally every function can be given a return type that matches one of the Arrow PyCapsule protocols. At least in the Rust implementation, I'd like to have type stubs that return accurately typed classes (though sadly I'll still have to write the `.pyi` type stubs by hand).
```python
from typing import Protocol, Tuple, TypeVar, reveal_type


class ArrowArrayExportable(Protocol):
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...


class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        ...


ArrayT = TypeVar("ArrayT", bound=ArrowArrayExportable)
StreamT = TypeVar("StreamT", bound=ArrowStreamExportable)


class PointArray:
    def __arrow_c_array__(
        self, requested_schema: object | None = None
    ) -> Tuple[object, object]:
        ...


class ChunkedPointArray:
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object:
        ...


def translate(array: ArrayT | StreamT, x: float, y: float) -> ArrayT | StreamT:
    ...


p = PointArray()
p2 = translate(p, 1, 1)
reveal_type(p2)
# Type of "p2" is "PointArray"

cp = ChunkedPointArray()
cp2 = translate(cp, 1, 1)
reveal_type(cp2)
# Type of "cp2" is "ChunkedPointArray"
```