Commit 89a129c

feat: Add Geometry & Geography Types (#2859)
Closes #1820

# Rationale for this change

Apache Iceberg v3 introduces native `geometry` and `geography` primitive types. This PR adds spec-compliant support for those types in PyIceberg, including:

- New primitive types with parameter handling and format-version enforcement
- Schema parsing and round-trip serialization
- Avro mapping using WKB bytes
- PyArrow / Parquet integration with version-aware fallbacks
- Unit test coverage for type behavior and compatibility constraints

A full design and scope discussion is available in the accompanying RFC:

📄 **RFC: Iceberg v3 Geospatial Primitive Types** (#3004)

The RFC documents scope, non-goals, compatibility constraints, and known limitations.

# Are these changes tested?

Yes.

- Unit tests cover:
  - Type creation (default and custom parameters)
  - `__str__` / `__repr__` round-tripping
  - Equality, hashing, and pickling
  - Format version enforcement (v3-only)
- PyArrow-dependent behavior is version-gated and conditionally tested

# Are there any user-facing changes?

Yes.

- Users may now declare `geometry` and `geography` columns in Iceberg v3 schemas
- Parquet files written with PyArrow ≥ 21 preserve GEO logical types when available
- Limitations (e.g. no WKB↔WKT conversion, no spatial predicates) are documented
1 parent 4ba9a8c commit 89a129c

11 files changed: +909 -1 lines changed

mkdocs/docs/geospatial.md

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Geospatial Types

PyIceberg supports Iceberg v3 geospatial primitive types: `geometry` and `geography`.

## Overview

Iceberg v3 introduces native support for spatial data types:

- **`geometry(C)`**: Represents geometric shapes in a coordinate reference system (CRS)
- **`geography(C, A)`**: Represents geographic shapes with CRS and calculation algorithm

Both types store values as WKB (Well-Known Binary) bytes.

## Requirements

- Iceberg format version 3 or higher
- Optional: `geoarrow-pyarrow` for GeoArrow extension type metadata and interoperability. Without it, geometry and geography are written as binary in Parquet while the Iceberg schema still preserves the spatial type. Install with `pip install pyiceberg[geoarrow]`.

## Usage

### Declaring Columns

```python
from pyiceberg.schema import Schema
from pyiceberg.types import GeographyType, GeometryType, IntegerType, NestedField

# Schema with geometry and geography columns
schema = Schema(
    NestedField(1, "id", IntegerType(), required=True),
    NestedField(2, "location", GeometryType(), required=True),
    NestedField(3, "boundary", GeographyType(), required=False),
)
```

### Type Parameters

#### GeometryType

```python
# Default CRS (OGC:CRS84)
GeometryType()

# Custom CRS
GeometryType("EPSG:4326")
```

#### GeographyType

```python
# Default CRS (OGC:CRS84) and algorithm (spherical)
GeographyType()

# Custom CRS
GeographyType("EPSG:4326")

# Custom CRS and algorithm
GeographyType("EPSG:4326", "planar")
```

### String Type Syntax

Types can also be specified as strings in schema definitions:

```python
# Using string type names
NestedField(1, "point", "geometry", required=True)
NestedField(2, "region", "geography", required=True)

# With parameters
NestedField(3, "location", "geometry('EPSG:4326')", required=True)
NestedField(4, "boundary", "geography('EPSG:4326', 'planar')", required=True)
```

## Data Representation

Values are represented as WKB (Well-Known Binary) bytes at runtime:

```python
# Example: Point(0, 0) in WKB format
point_wkb = bytes.fromhex("0101000000000000000000000000000000000000")
```
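
For readers unfamiliar with the layout, a 2D WKB point consists of a one-byte byte-order flag, a 32-bit geometry type code, and two 64-bit floats. The following is a minimal standard-library sketch of how such a value could be built by hand. The helper name `point_to_wkb` is hypothetical and is not part of PyIceberg.

```python
import struct


def point_to_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (hypothetical helper, not a PyIceberg API)."""
    byte_order = 1     # 1 = little-endian (NDR), 0 = big-endian (XDR)
    geometry_type = 1  # 1 = Point in the WKB geometry type codes
    # "<" = little-endian without padding, B = uint8, I = uint32, d = float64
    return struct.pack("<BIdd", byte_order, geometry_type, x, y)


wkb = point_to_wkb(0.0, 0.0)
print(len(wkb), wkb.hex())
```

In practice a library such as Shapely would usually produce these bytes; the sketch only shows what the opaque `bytes` value contains.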

## Current Limitations

1. **WKB/WKT Conversion**: Converting between WKB bytes and WKT strings requires external libraries (like Shapely). PyIceberg does not include this conversion to avoid heavy dependencies.

2. **Spatial Predicates**: Spatial filtering (e.g., ST_Contains, ST_Intersects) is not yet supported for query pushdown.

3. **Bounds Metrics**: Geometry/geography columns do not currently contribute to data file bounds metrics.

4. **Without geoarrow-pyarrow**: When the `geoarrow-pyarrow` package is not installed, geometry and geography columns are stored as binary without GeoArrow extension type metadata. The Iceberg schema preserves type information, but other tools reading the Parquet files directly may not recognize them as spatial types. Install with `pip install pyiceberg[geoarrow]` for full GeoArrow support.
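
Regarding the first limitation, the conversion that PyIceberg leaves to external libraries can be illustrated for the simplest case. The sketch below handles only 2D points and uses only the standard library; `wkb_point_to_wkt` is a hypothetical helper, not a PyIceberg function, and a real application would more likely rely on a geometry library.

```python
import struct


def wkb_point_to_wkt(wkb: bytes) -> str:
    """Render a 2D WKB point as WKT; raises ValueError for any other input."""
    if len(wkb) != 21:
        raise ValueError("expected a 21-byte 2D WKB point")
    if wkb[0] not in (0, 1):
        raise ValueError(f"invalid WKB byte-order flag: {wkb[0]}")
    endian = "<" if wkb[0] == 1 else ">"
    (geom_type,) = struct.unpack(endian + "I", wkb[1:5])
    if geom_type != 1:
        raise ValueError(f"unsupported WKB geometry type: {geom_type}")
    x, y = struct.unpack(endian + "dd", wkb[5:21])
    return f"POINT ({x:g} {y:g})"


sample = struct.pack("<BIdd", 1, 1, 30.0, 10.0)
print(wkb_point_to_wkt(sample))
```

Other geometry kinds (line strings, polygons, collections) and Z/M coordinates need considerably more parsing, which is why the conversion is delegated to dedicated libraries.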

## Format Version

Geometry and geography types require Iceberg format version 3:

```python
from pyiceberg.table import TableProperties

# Creating a v3 table.
# `catalog` is assumed to be an already-loaded catalog and `schema` the schema declared above.
table = catalog.create_table(
    identifier="db.spatial_table",
    schema=schema,
    properties={
        TableProperties.FORMAT_VERSION: "3"
    }
)
```

Using these types with format version 1 or 2 raises a validation error.

pyiceberg/avro/resolver.py

Lines changed: 26 additions & 0 deletions
@@ -87,6 +87,8 @@
     DoubleType,
     FixedType,
     FloatType,
+    GeographyType,
+    GeometryType,
     IcebergType,
     IntegerType,
     ListType,
@@ -204,6 +206,14 @@ def visit_binary(self, binary_type: BinaryType) -> Writer:
     def visit_unknown(self, unknown_type: UnknownType) -> Writer:
         return UnknownWriter()

+    def visit_geometry(self, geometry_type: "GeometryType") -> Writer:
+        """Geometry is written as WKB bytes in Avro."""
+        return BinaryWriter()
+
+    def visit_geography(self, geography_type: "GeographyType") -> Writer:
+        """Geography is written as WKB bytes in Avro."""
+        return BinaryWriter()
+

 CONSTRUCT_WRITER_VISITOR = ConstructWriter()

@@ -359,6 +369,14 @@ def visit_binary(self, binary_type: BinaryType, partner: IcebergType | None) ->
     def visit_unknown(self, unknown_type: UnknownType, partner: IcebergType | None) -> Writer:
         return UnknownWriter()

+    def visit_geometry(self, geometry_type: "GeometryType", partner: IcebergType | None) -> Writer:
+        """Geometry is written as WKB bytes in Avro."""
+        return BinaryWriter()
+
+    def visit_geography(self, geography_type: "GeographyType", partner: IcebergType | None) -> Writer:
+        """Geography is written as WKB bytes in Avro."""
+        return BinaryWriter()
+

 class ReadSchemaResolver(PrimitiveWithPartnerVisitor[IcebergType, Reader]):
     __slots__ = ("read_types", "read_enums", "context")
@@ -498,6 +516,14 @@ def visit_binary(self, binary_type: BinaryType, partner: IcebergType | None) ->
     def visit_unknown(self, unknown_type: UnknownType, partner: IcebergType | None) -> Reader:
         return UnknownReader()

+    def visit_geometry(self, geometry_type: "GeometryType", partner: IcebergType | None) -> Reader:
+        """Geometry is read as WKB bytes from Avro."""
+        return BinaryReader()
+
+    def visit_geography(self, geography_type: "GeographyType", partner: IcebergType | None) -> Reader:
+        """Geography is read as WKB bytes from Avro."""
+        return BinaryReader()
+

 class SchemaPartnerAccessor(PartnerAccessor[IcebergType]):
     def schema_partner(self, partner: IcebergType | None) -> IcebergType | None:
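
Note on the resolver changes: because every new visitor returns `BinaryWriter` or `BinaryReader`, geometry and geography values share the Avro `bytes` encoding used for binary columns. Per the Avro specification, that encoding is a zig-zag varint length followed by the raw payload. The sketch below re-implements it with the standard library purely to illustrate the on-disk shape; the function names are hypothetical and this is not PyIceberg's encoder.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as an Avro zig-zag varint."""
    z = (n << 1) ^ (n >> 63)  # zig-zag: small magnitudes map to small unsigned values
    out = bytearray()
    while z & ~0x7F:
        out.append((z & 0x7F) | 0x80)  # low 7 bits, continuation bit set
        z >>= 7
    out.append(z)
    return bytes(out)


def avro_bytes(payload: bytes) -> bytes:
    """Encode a payload as an Avro `bytes` value: length prefix, then the raw bytes."""
    return zigzag_varint(len(payload)) + payload


encoded = avro_bytes(b"\x01\x01\x00\x00\x00" + bytes(16))  # a 21-byte WKB point
print(encoded[:1].hex(), len(encoded))
```

The WKB payload itself is written unchanged, so CRS and algorithm parameters live only in the schema, not in each value.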

pyiceberg/conversions.py

Lines changed: 92 additions & 0 deletions
@@ -49,6 +49,8 @@
     DoubleType,
     FixedType,
     FloatType,
+    GeographyType,
+    GeometryType,
     IntegerType,
     LongType,
     PrimitiveType,
@@ -182,6 +184,18 @@ def _(type_: UnknownType, _: str) -> None:
     return None


+@partition_to_py.register(GeometryType)
+@partition_to_py.register(GeographyType)
+@handle_none
+def _(_: PrimitiveType, value_str: str) -> bytes:
+    """Convert a geometry/geography partition string to bytes.
+
+    Note: Partition values for geometry/geography types are expected to be
+    hex-encoded WKB (Well-Known Binary) strings.
+    """
+    return bytes.fromhex(value_str)
+
+
 @singledispatch
 def to_bytes(
     primitive_type: PrimitiveType, _: bool | bytes | Decimal | date | datetime | float | int | str | time | uuid.UUID
@@ -274,6 +288,8 @@ def _(_: UUIDType, value: uuid.UUID | bytes) -> bytes:

 @to_bytes.register(BinaryType)
 @to_bytes.register(FixedType)
+@to_bytes.register(GeometryType)
+@to_bytes.register(GeographyType)
 def _(_: PrimitiveType, value: bytes) -> bytes:
     return value

@@ -355,6 +371,8 @@ def _(_: StringType, b: bytes) -> str:
 @from_bytes.register(BinaryType)
 @from_bytes.register(FixedType)
 @from_bytes.register(UUIDType)
+@from_bytes.register(GeometryType)
+@from_bytes.register(GeographyType)
 def _(_: PrimitiveType, b: bytes) -> bytes:
     return b

@@ -475,6 +493,40 @@ def _(_: UUIDType, val: uuid.UUID) -> str:
     return str(val)


+@to_json.register(GeometryType)
+def _(_: GeometryType, val: bytes) -> str:
+    """Serialize geometry to WKT string per Iceberg spec.
+
+    Note: This requires WKB to WKT conversion which is not yet implemented.
+    The Iceberg spec requires geometry values to be serialized as WKT strings
+    in JSON, but PyIceberg stores geometry as WKB bytes at runtime.
+
+    Raises:
+        NotImplementedError: WKB to WKT conversion is not yet supported.
+    """
+    raise NotImplementedError(
+        "Geometry JSON serialization requires WKB to WKT conversion, which is not yet implemented. "
+        "See https://iceberg.apache.org/spec/#json-single-value-serialization for spec details."
+    )
+
+
+@to_json.register(GeographyType)
+def _(_: GeographyType, val: bytes) -> str:
+    """Serialize geography to WKT string per Iceberg spec.
+
+    Note: This requires WKB to WKT conversion which is not yet implemented.
+    The Iceberg spec requires geography values to be serialized as WKT strings
+    in JSON, but PyIceberg stores geography as WKB bytes at runtime.
+
+    Raises:
+        NotImplementedError: WKB to WKT conversion is not yet supported.
+    """
+    raise NotImplementedError(
+        "Geography JSON serialization requires WKB to WKT conversion, which is not yet implemented. "
+        "See https://iceberg.apache.org/spec/#json-single-value-serialization for spec details."
+    )
+
+
 @singledispatch  # type: ignore
 def from_json(primitive_type: PrimitiveType, val: Any) -> L:  # type: ignore
     """Convert JSON value types into built-in python values.
@@ -594,3 +646,43 @@ def _(_: UUIDType, val: str | bytes | uuid.UUID) -> uuid.UUID:
         return uuid.UUID(bytes=val)
     else:
         return val
+
+
+@from_json.register(GeometryType)
+def _(_: GeometryType, val: str | bytes) -> bytes:
+    """Convert JSON WKT string into WKB bytes per Iceberg spec.
+
+    Note: This requires WKT to WKB conversion which is not yet implemented.
+    The Iceberg spec requires geometry values to be represented as WKT strings
+    in JSON, but PyIceberg stores geometry as WKB bytes at runtime.
+
+    Raises:
+        NotImplementedError: WKT to WKB conversion is not yet supported.
+    """
+    if isinstance(val, bytes):
+        # Already WKB bytes, return as-is
+        return val
+    raise NotImplementedError(
+        "Geometry JSON deserialization requires WKT to WKB conversion, which is not yet implemented. "
+        "See https://iceberg.apache.org/spec/#json-single-value-serialization for spec details."
+    )
+
+
+@from_json.register(GeographyType)
+def _(_: GeographyType, val: str | bytes) -> bytes:
+    """Convert JSON WKT string into WKB bytes per Iceberg spec.
+
+    Note: This requires WKT to WKB conversion which is not yet implemented.
+    The Iceberg spec requires geography values to be represented as WKT strings
+    in JSON, but PyIceberg stores geography as WKB bytes at runtime.
+
+    Raises:
+        NotImplementedError: WKT to WKB conversion is not yet supported.
+    """
+    if isinstance(val, bytes):
+        # Already WKB bytes, return as-is
+        return val
+    raise NotImplementedError(
+        "Geography JSON deserialization requires WKT to WKB conversion, which is not yet implemented. "
+        "See https://iceberg.apache.org/spec/#json-single-value-serialization for spec details."
+    )
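
Note on the conversion changes: the stacked `register` decorators route both spatial types to a single implementation. A self-contained sketch of the same pattern follows, using stand-in classes rather than PyIceberg's real types; every name in it is illustrative, and the `handle_none` wrapper from the diff is omitted.

```python
from functools import singledispatch


# Stand-in type hierarchy (illustrative only, not PyIceberg's classes).
class PrimitiveType: ...
class GeometryType(PrimitiveType): ...
class GeographyType(PrimitiveType): ...


@singledispatch
def partition_to_py(type_: PrimitiveType, value_str: str):
    raise TypeError(f"unsupported type: {type(type_).__name__}")


# Each register() call returns the function unchanged, so the decorators can be stacked.
@partition_to_py.register(GeometryType)
@partition_to_py.register(GeographyType)
def _(_: PrimitiveType, value_str: str) -> bytes:
    # Partition strings are assumed to be hex-encoded WKB, as stated in the docstring above.
    return bytes.fromhex(value_str)


print(partition_to_py(GeometryType(), "0101000000"))
```

Dispatch happens on the runtime type of the first argument, which is why a single body can serve both `GeometryType` and `GeographyType` without an `isinstance` check.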

pyiceberg/io/pyarrow.py

Lines changed: 39 additions & 0 deletions
@@ -156,6 +156,8 @@
     DoubleType,
     FixedType,
     FloatType,
+    GeographyType,
+    GeometryType,
     IcebergType,
     IntegerType,
     ListType,
@@ -799,6 +801,37 @@ def visit_unknown(self, _: UnknownType) -> pa.DataType:
     def visit_binary(self, _: BinaryType) -> pa.DataType:
         return pa.large_binary()

+    def visit_geometry(self, geometry_type: GeometryType) -> pa.DataType:
+        """Convert geometry type to PyArrow type.
+
+        When geoarrow-pyarrow is available, returns a GeoArrow WKB extension type
+        with CRS metadata. Otherwise, falls back to large_binary which stores WKB bytes.
+        """
+        try:
+            import geoarrow.pyarrow as ga
+
+            return ga.wkb().with_crs(geometry_type.crs)
+        except ImportError:
+            return pa.large_binary()
+
+    def visit_geography(self, geography_type: GeographyType) -> pa.DataType:
+        """Convert geography type to PyArrow type.
+
+        When geoarrow-pyarrow is available, returns a GeoArrow WKB extension type
+        with CRS and edge type metadata. Otherwise, falls back to large_binary which stores WKB bytes.
+        """
+        try:
+            import geoarrow.pyarrow as ga
+
+            wkb_type = ga.wkb().with_crs(geography_type.crs)
+            # Map Iceberg algorithm to GeoArrow edge type
+            if geography_type.algorithm == "spherical":
+                wkb_type = wkb_type.with_edge_type(ga.EdgeType.SPHERICAL)
+            # "planar" is the default edge type in GeoArrow, no need to set explicitly
+            return wkb_type
+        except ImportError:
+            return pa.large_binary()
+

 def _convert_scalar(value: Any, iceberg_type: IcebergType) -> pa.scalar:
     if not isinstance(iceberg_type, PrimitiveType):
@@ -2130,6 +2163,12 @@ def visit_binary(self, binary_type: BinaryType) -> str:
     def visit_unknown(self, unknown_type: UnknownType) -> str:
         return "UNKNOWN"

+    def visit_geometry(self, geometry_type: GeometryType) -> str:
+        return "BYTE_ARRAY"
+
+    def visit_geography(self, geography_type: GeographyType) -> str:
+        return "BYTE_ARRAY"
+

 _PRIMITIVE_TO_PHYSICAL_TYPE_VISITOR = PrimitiveToPhysicalType()
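
Note on the PyArrow changes: both schema visitors follow an optional-import pattern, trying the richer dependency first and falling back to plain binary when it is missing. A dependency-free sketch of that control flow is shown below. The function name and the string return values are illustrative stand-ins for the Arrow types and do not appear in PyIceberg.

```python
def spatial_storage_type(crs: str) -> str:
    """Describe which Arrow representation would be chosen (illustrative only)."""
    try:
        # Optional dependency: only the import is attempted, nothing is called here.
        import geoarrow.pyarrow  # noqa: F401
    except ImportError:
        # Fallback path: values are still valid WKB, just without extension metadata.
        return "large_binary"
    return f"geoarrow.wkb(crs={crs})"


print(spatial_storage_type("OGC:CRS84"))
```

Either branch stores the same WKB bytes; only the metadata visible to other Arrow and Parquet readers differs.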
