Commit 06a7b2d

szehon-ho authored and HyukjinKwon committed
[SPARK-55870][SQL] Add docs for Geo types
### What changes were proposed in this pull request?
Add docs for Geometry/Geography type. The types are hidden behind a feature flag, but it should be removed by the next release.

### Why are the changes needed?
There are new types to be introduced in Spark 4.2

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
No

### Was this patch authored or co-authored using generative AI tooling?
Yes, cursor

Closes #54668 from szehon-ho/geo_doc.

Lead-authored-by: Szehon Ho <szehon.apache@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
1 parent c1f4d11 commit 06a7b2d

3 files changed: +165 -0 lines changed


docs/sql-ref-datatypes.md

Lines changed: 16 additions & 0 deletions
@@ -93,6 +93,12 @@ Spark SQL and DataFrames support the following data types:
 |`DayTimeIntervalType(MINUTE, SECOND)`|INTERVAL MINUTE TO SECOND|`INTERVAL '1000:01.001' MINUTE TO SECOND`|
 |`DayTimeIntervalType(SECOND, SECOND)` or `DayTimeIntervalType(SECOND)`|INTERVAL SECOND|`INTERVAL '1000.000001' SECOND`|
 
+* Spatial types
+  Spatial objects as defined in the [OGC Simple Feature Access](https://portal.ogc.org/files/?artifact_id=25355) specification.
+  - `GeometryType`: Represents GEOMETRY values: spatial objects in a Cartesian coordinate system. The type can be fixed to a single SRID, e.g. `geometry(4326)`, or allow mixed SRIDs with `geometry(any)`. Default SRID when not specified is 4326 (WGS 84).
+  - `GeographyType`: Represents GEOGRAPHY values: spatial objects in a geographic coordinate system (latitude/longitude). Edge interpolation is always SPHERICAL. The type can be fixed to a single SRID, e.g. `geography(4326)`, or allow mixed SRIDs with `geography(any)`. Default SRID is 4326 (WGS 84).
+  For more details and built-in functions, see [Geospatial (Geometry/Geography) types](sql-ref-geospatial-types.html).
+
 * Complex types
   - `ArrayType(elementType, containsNull)`: Represents values comprising a sequence of
     elements with the type of `elementType`. `containsNull` is used to indicate if
@@ -137,6 +143,8 @@ from pyspark.sql.types import *
 |**TimestampNTZType**|datetime.datetime|TimestampNTZType()|
 |**DateType**|datetime.date|DateType()|
 |**DayTimeIntervalType**|datetime.timedelta|DayTimeIntervalType()|
+|**GeometryType**|Geometry|GeometryType() or GeometryType(*srid*)|
+|**GeographyType**|Geography|GeographyType() or GeographyType(*srid*)|
 |**ArrayType**|list, tuple, or array|ArrayType(*elementType*, [*containsNull*])<br/>**Note:**The default value of *containsNull* is True.|
 |**MapType**|dict|MapType(*keyType*, *valueType*, [*valueContainsNull]*)<br/>**Note:**The default value of *valueContainsNull* is True.|
 |**StructType**|list or tuple|StructType(*fields*)<br/>**Note:** *fields* is a Seq of StructFields. Also, two fields with the same name are not allowed.|
@@ -171,6 +179,8 @@ You can access them by doing
 |**TimeType**|java.time.LocalTime|TimeType|
 |**YearMonthIntervalType**|java.time.Period|YearMonthIntervalType|
 |**DayTimeIntervalType**|java.time.Duration|DayTimeIntervalType|
+|**GeometryType**|org.apache.spark.sql.types.Geometry|GeometryType or GeometryType(*srid*)|
+|**GeographyType**|org.apache.spark.sql.types.Geography|GeographyType or GeographyType(*srid*)|
 |**ArrayType**|scala.collection.Seq|ArrayType(*elementType*, [*containsNull]*)<br/>**Note:** The default value of *containsNull* is true.|
 |**MapType**|scala.collection.Map|MapType(*keyType*, *valueType*, [*valueContainsNull]*)<br/>**Note:** The default value of *valueContainsNull* is true.|
 |**StructType**|org.apache.spark.sql.Row|StructType(*fields*)<br/>**Note:** *fields* is a Seq of StructFields. Also, two fields with the same name are not allowed.|
@@ -205,6 +215,8 @@ please use factory methods provided in
 |**TimeType**|java.time.LocalTime|DataTypes.TimeType|
 |**YearMonthIntervalType**|java.time.Period|DataTypes.YearMonthIntervalType|
 |**DayTimeIntervalType**|java.time.Duration|DataTypes.DayTimeIntervalType|
+|**GeometryType**|org.apache.spark.sql.types.Geometry|DataTypes.createGeometryType(*srid*)|
+|**GeographyType**|org.apache.spark.sql.types.Geography|DataTypes.createGeographyType(*srid*)|
 |**ArrayType**|java.util.List|DataTypes.createArrayType(*elementType*)<br/>**Note:** The value of *containsNull* will be true.<br/>DataTypes.createArrayType(*elementType*, *containsNull*).|
 |**MapType**|java.util.Map|DataTypes.createMapType(*keyType*, *valueType*)<br/>**Note:** The value of *valueContainsNull* will be true.<br/>DataTypes.createMapType(*keyType*, *valueType*, *valueContainsNull*)|
 |**StructType**|org.apache.spark.sql.Row|DataTypes.createStructType(*fields*)<br/>**Note:** *fields* is a List or an array of StructFields.Also, two fields with the same name are not allowed.|
@@ -228,6 +240,8 @@ please use factory methods provided in
 |**BooleanType**|logical|"bool"|
 |**TimestampType**|POSIXct|"timestamp"|
 |**DateType**|Date|"date"|
+|**GeometryType**|Not supported|Not supported|
+|**GeographyType**|Not supported|Not supported|
 |**ArrayType**|vector or list|list(type="array", elementType=*elementType*, containsNull=[*containsNull*])<br/>**Note:** The default value of *containsNull* is TRUE.|
 |**MapType**|environment|list(type="map", keyType=*keyType*, valueType=*valueType*, valueContainsNull=[*valueContainsNull*])<br/> **Note:** The default value of *valueContainsNull* is TRUE.|
 |**StructType**|named list|list(type="struct", fields=*fields*)<br/> **Note:** *fields* is a Seq of StructFields. Also, two fields with the same name are not allowed.|
@@ -258,6 +272,8 @@ The following table shows the type names as well as aliases used in Spark SQL pa
 |**DecimalType**|DECIMAL, DEC, NUMERIC|
 |**YearMonthIntervalType**|INTERVAL YEAR, INTERVAL YEAR TO MONTH, INTERVAL MONTH|
 |**DayTimeIntervalType**|INTERVAL DAY, INTERVAL DAY TO HOUR, INTERVAL DAY TO MINUTE, INTERVAL DAY TO SECOND, INTERVAL HOUR, INTERVAL HOUR TO MINUTE, INTERVAL HOUR TO SECOND, INTERVAL MINUTE, INTERVAL MINUTE TO SECOND, INTERVAL SECOND|
+|**GeometryType**|GEOMETRY or GEOMETRY(*srid*) or GEOMETRY(ANY)|
+|**GeographyType**|GEOGRAPHY or GEOGRAPHY(*srid*) or GEOGRAPHY(ANY)|
 |**ArrayType**|ARRAY\<element_type>|
 |**StructType**|STRUCT<field1_name: field1_type, field2_name: field2_type, ...><br/> **Note:** ':' is optional.|
 |**MapType**|MAP<key_type, value_type>|

docs/sql-ref-geospatial-types.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
---
layout: global
title: Geospatial (Geometry/Geography) Types
displayTitle: Geospatial (Geometry/Geography) Types
license: |
  Licensed to the Apache Software Foundation (ASF) under one or more
  contributor license agreements. See the NOTICE file distributed with
  this work for additional information regarding copyright ownership.
  The ASF licenses this file to You under the Apache License, Version 2.0
  (the "License"); you may not use this file except in compliance with
  the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License.
---

Spark SQL supports **GEOMETRY** and **GEOGRAPHY** types for spatial data, as defined in the [Open Geospatial Consortium (OGC) Simple Feature Access](https://portal.ogc.org/files/?artifact_id=25355) specification. At runtime, values are represented as **Well-Known Binary (WKB)** and are associated with a **Spatial Reference Identifier (SRID)** that defines the coordinate system. How values are persisted is determined by each data source.

### Overview

| Type | Coordinate system | Typical use and notes |
|------|-------------------|------------------------|
| **GEOMETRY** | Cartesian (planar) | Projected or local coordinates; planar calculations. Represents points, lines, polygons in a flat coordinate system. Suitable for Web Mercator (SRID 3857), UTM, or local grids (e.g. engineering/CAD). Default SRID in Spark is 4326. |
| **GEOGRAPHY** | Geographic (latitude/longitude) | Earth-based data; distances and areas on the sphere/ellipsoid. Coordinates in longitude and latitude (degrees). Edge interpolation is always **SPHERICAL**. Default SRID is 4326 (WGS 84). |

#### When to use GEOMETRY vs GEOGRAPHY

Choose **GEOMETRY** when:

* Data is in **local or projected coordinates** (e.g. engineering/CAD in meters, or map tiles in Web Mercator).
* You need **planar operations** on a small or regional area: intersections, unions, clipping, containment, or overlays where treating the surface as flat is acceptable.
* Vertices are closely spaced or the extent is small enough that Earth curvature is negligible.

Choose **GEOGRAPHY** when:

* Data is **global** or spans large extents (e.g. country boundaries, worldwide points of interest).
* **Distances or areas** must respect Earth curvature (e.g. the shortest path between two cities, or the area of a polygon on the globe).
* Use cases include **aviation, maritime, or global mobility** where great-circle or geodesic behavior matters.

Using the wrong type can give misleading results: for example, the shortest path between London and New York on a sphere crosses Canada, whereas a planar GEOMETRY may suggest a path that does not.
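
The London to New York contrast can be checked numerically. The following is a minimal Python sketch, not Spark code: it assumes a spherical Earth with a mean radius of 6371 km, uses approximate city coordinates, and the helper names `great_circle_km` and `great_circle_midpoint_lat` are hypothetical. It compares the midpoint latitude of the great-circle arc with the midpoint of a straight line drawn in planar longitude/latitude space.

```python
import math

EARTH_RADIUS_KM = 6371.0  # assumed mean radius of a spherical Earth


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance on a sphere, in kilometres."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def great_circle_midpoint_lat(lat1, lon1, lat2, lon2):
    """Latitude in degrees of the midpoint along the great-circle arc."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dlam = math.radians(lon2 - lon1)
    bx = math.cos(p2) * math.cos(dlam)
    by = math.cos(p2) * math.sin(dlam)
    lat_m = math.atan2(math.sin(p1) + math.sin(p2),
                       math.hypot(math.cos(p1) + bx, by))
    return math.degrees(lat_m)


# Approximate (latitude, longitude) pairs, used here only for illustration.
london = (51.5074, -0.1278)
new_york = (40.7128, -74.0060)

distance_km = great_circle_km(*london, *new_york)
spherical_mid_lat = great_circle_midpoint_lat(*london, *new_york)
planar_mid_lat = (london[0] + new_york[0]) / 2  # straight line in lon/lat space

print(f"great-circle distance: {distance_km:.0f} km")
print(f"midpoint latitude on the sphere: {spherical_mid_lat:.2f}")
print(f"midpoint latitude on the plane:  {planar_mid_lat:.2f}")
```

The spherical midpoint lies several degrees north of the planar one, which is why the great-circle route bends toward Canada while the planar line stays over the open Atlantic.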

### Type Syntax in SQL

In SQL you must specify the type with an SRID or `ANY`:

* **Fixed SRID** (all values in the column share one SRID):
  * `GEOMETRY(srid)`, e.g. `GEOMETRY(4326)`, `GEOMETRY(3857)`
  * `GEOGRAPHY(srid)`, e.g. `GEOGRAPHY(4326)`
* **Mixed SRID** (values in the column may have different SRIDs):
  * `GEOMETRY(ANY)`
  * `GEOGRAPHY(ANY)`

Unparameterized `GEOMETRY` or `GEOGRAPHY` (without `(srid)` or `(ANY)`) is not supported in SQL.

### Creating Tables with Geometry or Geography Columns

```sql
-- Fixed SRID: all values must use the given SRID (e.g. WGS 84)
CREATE TABLE points (
  id BIGINT,
  pt GEOMETRY(4326)
);

CREATE TABLE locations (
  id BIGINT,
  loc GEOGRAPHY(4326)
);

-- Mixed SRID: each row can have a different SRID
CREATE TABLE mixed_geoms (
  id BIGINT,
  geom GEOMETRY(ANY)
);
```

### Constructing Geometry and Geography Values

Values are created from **Well-Known Binary (WKB)** using built-in functions. WKB is a standard binary encoding for spatial shapes (points, lines, polygons, etc.). See [Well-known binary](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary) for the format.

**From WKB (binary):**

* `ST_GeomFromWKB(wkb)`: returns GEOMETRY with default SRID 0.
* `ST_GeomFromWKB(wkb, srid)`: returns GEOMETRY with the given SRID.
* `ST_GeogFromWKB(wkb)`: returns GEOGRAPHY with SRID 4326.

**Example (point in WKB, then use in a table):**

```sql
-- Point (1, 2) in WKB (little-endian point, 2D)
SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040');
SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326);
SELECT ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040');

INSERT INTO points (id, pt)
  VALUES (1, ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326));
```
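
To see where the hex literal in the example above comes from, the sketch below builds the same 21 bytes with Python's standard `struct` module and decodes them again. It is an illustration of the WKB layout for a 2D point only (byte-order flag, geometry type, two doubles); the helper names `point_to_wkb` and `wkb_to_point` are hypothetical and are not part of Spark.

```python
import struct


def point_to_wkb(x, y):
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # '<' = little-endian without padding, 'B' = byte-order flag (1 = little-endian),
    # 'I' = geometry type (1 = Point), 'dd' = x and y as 64-bit doubles.
    return struct.pack("<BIdd", 1, 1, x, y)


def wkb_to_point(wkb):
    """Decode a 2D point from WKB, honouring the byte-order flag."""
    if wkb[0] == 1:
        order = "<"   # little-endian
    elif wkb[0] == 0:
        order = ">"   # big-endian
    else:
        raise ValueError(f"invalid byte-order flag: {wkb[0]}")
    geom_type, x, y = struct.unpack(order + "Idd", wkb[1:21])
    if geom_type != 1:
        raise ValueError(f"not a 2D Point: geometry type {geom_type}")
    return x, y


wkb = point_to_wkb(1.0, 2.0)
hex_literal = wkb.hex().upper()

print(hex_literal)        # 0101000000000000000000F03F0000000000000040
print(wkb_to_point(wkb))  # (1.0, 2.0)
```

The printed hex string is the literal used in the SQL example, so the same approach can produce test inputs for other points.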

#### WKB coordinate handling

When parsing WKB, Spark applies the following rules. Violations result in a parse error.

* **Empty points**: For **Point** geometries (including points inside MultiPoint), **NaN** (Not a Number) coordinate values are allowed and represent an **empty point** (e.g. `POINT EMPTY` in Well-Known Text). **LineString** and **Polygon** (and points inside them) do not allow NaN in coordinate values.
* **Non-point coordinates**: Coordinate values in **LineString**, **Polygon** rings, and points that are part of those structures must be **finite** (no NaN, no positive or negative infinity).
* **Infinity**: **Positive or negative infinity** is never accepted in any coordinate value.
* **Polygon rings**: Each ring must be **closed** (first and last point equal) and have **at least 4 points**. A **LineString** must have at least 2 points.
* **GEOGRAPHY bounds**: When WKB is parsed as **GEOGRAPHY** (e.g. via `ST_GeogFromWKB`), longitude must be in **[-180, 180]** (inclusive) and latitude in **[-90, 90]** (inclusive). GEOMETRY does not enforce these bounds.
* **Invalid WKB**: Null or empty input, truncated bytes, an invalid geometry class or byte order, or otherwise malformed WKB is rejected.
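
The coordinate rules above can be restated as small predicates. This is a sketch of the rules as written in this section, not Spark's parser: the function names are hypothetical, coordinates are assumed to be already decoded into `(x, y)` tuples, and x is assumed to carry longitude and y latitude for GEOGRAPHY.

```python
import math


def is_valid_point(x, y):
    """Standalone points may carry NaN (empty point) but never infinity."""
    return not (math.isinf(x) or math.isinf(y))


def is_valid_linestring(coords):
    """A LineString needs at least 2 vertices, all of them finite."""
    return len(coords) >= 2 and all(
        math.isfinite(x) and math.isfinite(y) for x, y in coords
    )


def is_valid_ring(coords):
    """A polygon ring needs at least 4 finite vertices and must be closed."""
    return (
        len(coords) >= 4
        and all(math.isfinite(x) and math.isfinite(y) for x, y in coords)
        and coords[0] == coords[-1]
    )


def is_within_geography_bounds(lon, lat):
    """GEOGRAPHY bounds check for finite coordinates (inclusive on both ends)."""
    return -180.0 <= lon <= 180.0 and -90.0 <= lat <= 90.0


nan = float("nan")
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0), (0.0, 0.0)]

print(is_valid_point(nan, nan))                # True: empty point
print(is_valid_ring(square))                   # True: closed, 5 vertices
print(is_valid_ring(square[:-1]))              # False: first and last vertex differ
print(is_within_geography_bounds(181.0, 0.0))  # False: longitude out of range
```

How an empty (NaN) point interacts with the GEOGRAPHY bounds is not stated in this section, so the bounds predicate is only meaningful for finite inputs.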

### Built-in Geospatial (ST) Functions

Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY values. They are grouped under **st_funcs** in the [Built-in Functions](sql-ref-functions-builtin.html) API.

| Function | Description |
|----------|-------------|
| `ST_AsBinary(geo)` | Returns the GEOMETRY or GEOGRAPHY value as WKB (BINARY). |
| `ST_GeomFromWKB(wkb)` | Parses WKB and returns a GEOMETRY with default SRID 0. |
| `ST_GeomFromWKB(wkb, srid)` | Parses WKB and returns a GEOMETRY with the given SRID. |
| `ST_GeogFromWKB(wkb)` | Parses WKB and returns a GEOGRAPHY with SRID 4326. |
| `ST_Srid(geo)` | Returns the SRID of the GEOMETRY or GEOGRAPHY value (NULL if input is NULL). |
| `ST_SetSrid(geo, srid)` | Returns a new GEOMETRY or GEOGRAPHY with the given SRID. |

**Examples:**

```sql
SELECT hex(ST_AsBinary(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040')));
-- 0101000000000000000000F03F0000000000000040

SELECT ST_Srid(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040'));
-- 4326

SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040'), 3857));
-- 3857
```

### SRID and Stored Values

* **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error (or you can use `ST_SetSrid` to set the value’s SRID to match the column).
* **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed.
* **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required.
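
The three rules above can be modelled in a few lines of Python. This is a toy model under stated assumptions, not Spark's type system: `SpatialColumn` and its methods are hypothetical names, a column SRID of `None` stands in for `GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`, and the check for which SRIDs count as valid is left out because this section does not list them.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SpatialColumn:
    """Toy model of a spatial column; srid=None models the (ANY) variants."""
    name: str
    srid: Optional[int]

    def accepts(self, value_srid: int) -> bool:
        # Fixed-SRID columns require an exact match; mixed-SRID columns
        # accept any SRID (validity of the SRID itself is not modelled here).
        return self.srid is None or self.srid == value_srid

    def writable_to_fixed_srid_format(self) -> bool:
        # Formats that store one SRID per column need a concrete SRID.
        return self.srid is not None


pt = SpatialColumn("pt", 4326)     # models GEOMETRY(4326)
geom = SpatialColumn("geom", None)  # models GEOMETRY(ANY)

print(pt.accepts(4326), pt.accepts(3857))      # True False
print(geom.accepts(4326), geom.accepts(3857))  # True True
print(pt.writable_to_fixed_srid_format())      # True
print(geom.writable_to_fixed_srid_format())    # False
```

In the model, a value that fails `accepts` would first need its SRID changed to the column's SRID, which is the role `ST_SetSrid` plays in the first rule.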

### Data Types Reference

For the full list of supported data types and API usage in Scala, Java, Python, and SQL, see [Data Types](sql-ref-datatypes.html).

docs/sql-ref.md

Lines changed: 1 addition & 0 deletions
@@ -24,6 +24,7 @@ Spark SQL is Apache Spark's module for working with structured data. This guide
 
 * [ANSI Compliance](sql-ref-ansi-compliance.html)
 * [Data Types](sql-ref-datatypes.html)
+* [Geospatial (Geometry/Geography) Types](sql-ref-geospatial-types.html)
 * [Datetime Pattern](sql-ref-datetime-pattern.html)
 * [Number Pattern](sql-ref-number-pattern.html)
 * [Operators](sql-ref-operators.html)
