| 1 | +--- |
| 2 | +layout: global |
| 3 | +title: Geospatial (Geometry/Geography) Types |
| 4 | +displayTitle: Geospatial (Geometry/Geography) Types |
| 5 | +license: | |
| 6 | + Licensed to the Apache Software Foundation (ASF) under one or more |
| 7 | + contributor license agreements. See the NOTICE file distributed with |
| 8 | + this work for additional information regarding copyright ownership. |
| 9 | + The ASF licenses this file to You under the Apache License, Version 2.0 |
| 10 | + (the "License"); you may not use this file except in compliance with |
| 11 | + the License. You may obtain a copy of the License at |
| 12 | + |
| 13 | + http://www.apache.org/licenses/LICENSE-2.0 |
| 14 | + |
| 15 | + Unless required by applicable law or agreed to in writing, software |
| 16 | + distributed under the License is distributed on an "AS IS" BASIS, |
| 17 | + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. |
| 18 | + See the License for the specific language governing permissions and |
| 19 | + limitations under the License. |
| 20 | +--- |
| 21 | + |
| 22 | +Spark SQL supports **GEOMETRY** and **GEOGRAPHY** types for spatial data, as defined in the [Open Geospatial Consortium (OGC) Simple Feature Access](https://portal.ogc.org/files/?artifact_id=25355) specification. At runtime, values are represented as **Well-Known Binary (WKB)** and are associated with a **Spatial Reference Identifier (SRID)** that defines the coordinate system. How values are persisted is determined by each data source. |
| 23 | + |
| 24 | +### Overview |
| 25 | + |
| 26 | +| Type | Coordinate system | Typical use and notes | |
| 27 | +|------|-------------------|------------------------| |
| 28 | +| **GEOMETRY** | Cartesian (planar) | Projected or local coordinates; planar calculations. Represents points, lines, polygons in a flat coordinate system. Suitable for Web Mercator (SRID 3857), UTM, or local grids (e.g. engineering/CAD). Default SRID in Spark is 4326. | |
| 29 | +| **GEOGRAPHY** | Geographic (latitude/longitude) | Earth-based data; distances and areas on the sphere/ellipsoid. Coordinates in longitude and latitude (degrees). Edge interpolation is always **SPHERICAL**. Default SRID is 4326 (WGS 84). | |
| 30 | + |
| 31 | +#### When to use GEOMETRY vs GEOGRAPHY |
| 32 | + |
| 33 | +Choose **GEOMETRY** when: |
| 34 | + |
| 35 | +* Data is in **local or projected coordinates** (e.g. engineering/CAD in meters, or map tiles in Web Mercator). |
| 36 | +* You need **planar operations** on a small or regional area: intersections, unions, clipping, containment, or overlays where treating the surface as flat is acceptable. |
| 37 | +* Vertices are closely spaced or the extent is small enough that Earth curvature is negligible. |
| 38 | + |
| 39 | +Choose **GEOGRAPHY** when: |
| 40 | + |
| 41 | +* Data is **global** or spans large extents (e.g. country boundaries, worldwide points of interest). |
| 42 | +* **Distances or areas** must respect Earth curvature (e.g. the shortest path between two cities, or the area of a polygon on the globe). |
| 43 | +* Use cases include **aviation, maritime, or global mobility** where great-circle or geodesic behavior matters. |
| 44 | + |
| 45 | +Using the wrong type can give misleading results: for example, the shortest path between London and New York on a sphere crosses Canada, whereas a planar GEOMETRY may suggest a path that does not. |
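To make the difference concrete, here is a minimal Python sketch (standard library only, not Spark code) that compares the midpoint of a London to New York segment under planar interpolation with the midpoint of the great-circle arc on a unit sphere. The function names and the approximate city coordinates are assumptions chosen for illustration.

```python
import math

def planar_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint when longitude/latitude are treated as flat x/y."""
    return (lon1 + lon2) / 2, (lat1 + lat2) / 2

def spherical_midpoint(lon1, lat1, lon2, lat2):
    """Midpoint of the great-circle arc on a unit sphere (inputs in degrees)."""
    def to_xyz(lon, lat):
        lam, phi = math.radians(lon), math.radians(lat)
        return (math.cos(phi) * math.cos(lam),
                math.cos(phi) * math.sin(lam),
                math.sin(phi))

    x1, y1, z1 = to_xyz(lon1, lat1)
    x2, y2, z2 = to_xyz(lon2, lat2)
    # The sum of the two unit vectors points at the arc's midpoint
    # (undefined for exactly antipodal inputs).
    x, y, z = x1 + x2, y1 + y2, z1 + z2
    lon = math.degrees(math.atan2(y, x))
    lat = math.degrees(math.atan2(z, math.hypot(x, y)))
    return lon, lat

# Approximate (longitude, latitude) in degrees; illustrative values only.
LONDON = (-0.1278, 51.5074)
NEW_YORK = (-74.0060, 40.7128)

planar = planar_midpoint(*LONDON, *NEW_YORK)
spherical = spherical_midpoint(*LONDON, *NEW_YORK)
print(f"planar midpoint latitude:    {planar[1]:.2f}")
print(f"spherical midpoint latitude: {spherical[1]:.2f}")
```

The spherical midpoint lies several degrees north of the planar one, which is why the great-circle route bends toward higher latitudes while the planar segment does not.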
| 46 | + |
| 47 | +### Type Syntax in SQL |
| 48 | + |
| 49 | +In SQL you must specify the type with an SRID or `ANY`: |
| 50 | + |
| 51 | +* **Fixed SRID** (all values in the column share one SRID): |
| 52 | + * `GEOMETRY(srid)` — e.g. `GEOMETRY(4326)`, `GEOMETRY(3857)` |
| 53 | + * `GEOGRAPHY(srid)` — e.g. `GEOGRAPHY(4326)` |
| 54 | +* **Mixed SRID** (values in the column may have different SRIDs): |
| 55 | + * `GEOMETRY(ANY)` |
| 56 | + * `GEOGRAPHY(ANY)` |
| 57 | + |
| 58 | +Unparameterized `GEOMETRY` or `GEOGRAPHY` (without `(srid)` or `(ANY)`) is not supported in SQL. |
| 59 | + |
| 60 | +### Creating Tables with Geometry or Geography Columns |
| 61 | + |
| 62 | +```sql |
| 63 | +-- Fixed SRID: all values must use the given SRID (e.g. WGS 84) |
| 64 | +CREATE TABLE points ( |
| 65 | + id BIGINT, |
| 66 | + pt GEOMETRY(4326) |
| 67 | +); |
| 68 | + |
| 69 | +CREATE TABLE locations ( |
| 70 | + id BIGINT, |
| 71 | + loc GEOGRAPHY(4326) |
| 72 | +); |
| 73 | + |
| 74 | +-- Mixed SRID: each row can have a different SRID |
| 75 | +CREATE TABLE mixed_geoms ( |
| 76 | + id BIGINT, |
| 77 | + geom GEOMETRY(ANY) |
| 78 | +); |
| 79 | +``` |
| 80 | + |
| 81 | +### Constructing Geometry and Geography Values |
| 82 | + |
| 83 | +Values are created from **Well-Known Binary (WKB)** using built-in functions. WKB is a standard binary encoding for spatial shapes (points, lines, polygons, etc.). See [Well-known binary](https://en.wikipedia.org/wiki/Well-known_text_representation_of_geometry#Well-known_binary) for the format. |
| 84 | + |
| 85 | +**From WKB (binary):** |
| 86 | + |
| 87 | +* `ST_GeomFromWKB(wkb)` — returns GEOMETRY with default SRID 0. |
| 88 | +* `ST_GeomFromWKB(wkb, srid)` — returns GEOMETRY with the given SRID. |
| 89 | +* `ST_GeogFromWKB(wkb)` — returns GEOGRAPHY with SRID 4326. |
| 90 | + |
| 91 | +**Example (point in WKB, then use in a table):** |
| 92 | + |
| 93 | +```sql |
| 94 | +-- Point (1, 2) in WKB (little-endian point, 2D) |
| 95 | +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040'); |
| 96 | +SELECT ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326); |
| 97 | +SELECT ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040'); |
| 98 | + |
| 99 | +INSERT INTO points (id, pt) |
| 100 | +VALUES (1, ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040', 4326)); |
| 101 | +``` |
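The hex literal used above can be reproduced outside Spark. The following Python sketch uses only the standard `struct` module and assumes the plain 2D little-endian point layout: a byte-order flag, a 4-byte geometry type, then two 8-byte doubles. The helper names `point_wkb` and `decode_point_wkb` are hypothetical and are not part of any Spark API.

```python
import struct

def point_wkb(x: float, y: float) -> bytes:
    """Encode a 2D point as little-endian WKB (21 bytes)."""
    # '<' = little-endian, no padding
    # 'B' = byte-order flag (1 means little-endian)
    # 'I' = geometry type (1 means Point)
    # 'dd' = x and y as IEEE 754 doubles
    return struct.pack("<BIdd", 1, 1, x, y)

def decode_point_wkb(wkb: bytes) -> tuple:
    """Decode a 2D point from WKB, honoring the byte-order flag."""
    (byte_order,) = struct.unpack_from("B", wkb, 0)
    if byte_order not in (0, 1):
        raise ValueError("invalid byte-order flag")
    prefix = "<" if byte_order == 1 else ">"
    geom_type, x, y = struct.unpack_from(prefix + "Idd", wkb, 1)
    if geom_type != 1:
        raise ValueError("not a 2D point")
    return x, y

wkb_hex = point_wkb(1.0, 2.0).hex().upper()
print(wkb_hex)
print(decode_point_wkb(bytes.fromhex(wkb_hex)))
```

Reading the literal left to right: `01` is the byte-order flag, `01000000` is the geometry type (Point), `000000000000F03F` is the double 1.0, and `0000000000000040` is the double 2.0.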
| 102 | + |
| 103 | +#### WKB coordinate handling |
| 104 | + |
| 105 | +When parsing WKB, Spark applies the following rules. Violations result in a parse error. |
| 106 | + |
| 107 | +* **Empty points**: For **Point** geometries (including points inside MultiPoint), **NaN** (Not a Number) coordinate values are allowed and represent an **empty point** (e.g. `POINT EMPTY` in Well-Known Text). **LineString** and **Polygon** (and points inside them) do not allow NaN in coordinate values. |
| 108 | +* **Non-point coordinates**: Coordinate values in **LineString**, **Polygon** rings, and points that are part of those structures must be **finite** (no NaN, no positive or negative infinity). |
| 109 | +* **Infinity**: **Positive or negative infinity** is never accepted in any coordinate value. |
| 110 | +* **Polygon rings and LineStrings**: Each polygon ring must be **closed** (first and last point equal) and have **at least 4 points**. A **LineString** must have at least 2 points. |
| 111 | +* **GEOGRAPHY bounds**: When WKB is parsed as **GEOGRAPHY** (e.g. via `ST_GeogFromWKB`), longitude must be in **[-180, 180]** (inclusive) and latitude in **[-90, 90]** (inclusive). GEOMETRY does not enforce these bounds. |
| 112 | +* **Invalid WKB**: Null or empty input, truncated bytes, an invalid geometry class or byte order, and any other malformed WKB are rejected. |
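The coordinate rules above can be summarized in a short Python sketch. This is an illustration of the rules as listed, not Spark's parser: the function names are hypothetical, coordinates are assumed to be already decoded into `(x, y)` tuples, and a point with NaN in either coordinate is assumed to count as empty.

```python
import math

def check_bounds(x, y):
    """GEOGRAPHY only: longitude in [-180, 180], latitude in [-90, 90]."""
    if not (-180.0 <= x <= 180.0 and -90.0 <= y <= 90.0):
        raise ValueError("longitude/latitude out of range")

def check_point(x, y, geography=False):
    """Standalone point (or MultiPoint member): NaN allowed, infinity not."""
    if math.isinf(x) or math.isinf(y):
        raise ValueError("infinity is never accepted")
    if math.isnan(x) or math.isnan(y):
        return  # assumption: any NaN coordinate marks an empty point
    if geography:
        check_bounds(x, y)

def check_linestring(points, geography=False, min_points=2):
    """LineString: enough points, and every coordinate finite."""
    if len(points) < min_points:
        raise ValueError("too few points")
    for x, y in points:
        if not (math.isfinite(x) and math.isfinite(y)):
            raise ValueError("non-finite coordinate")
        if geography:
            check_bounds(x, y)

def check_ring(points, geography=False):
    """Polygon ring: at least 4 finite points, first equal to last."""
    check_linestring(points, geography, min_points=4)
    if points[0] != points[-1]:
        raise ValueError("ring is not closed")

def accepts(check, *args, **kwargs):
    """Return True if the check passes, False if it raises ValueError."""
    try:
        check(*args, **kwargs)
        return True
    except ValueError:
        return False

nan = float("nan")
print(accepts(check_point, nan, nan))                        # empty point
print(accepts(check_point, 200.0, 0.0))                      # GEOMETRY: no bounds
print(accepts(check_point, 200.0, 0.0, geography=True))      # GEOGRAPHY: out of range
print(accepts(check_ring, [(0, 0), (1, 0), (1, 1), (0, 0)]))  # closed, 4 points
print(accepts(check_ring, [(0, 0), (1, 0), (0, 0)]))          # too few points
```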
| 113 | + |
| 114 | +### Built-in Geospatial (ST) Functions |
| 115 | + |
| 116 | +Spark SQL provides scalar functions for working with GEOMETRY and GEOGRAPHY values. They are grouped under **st_funcs** in the [Built-in Functions](sql-ref-functions-builtin.html) API. |
| 117 | + |
| 118 | +| Function | Description | |
| 119 | +|----------|-------------| |
| 120 | +| `ST_AsBinary(geo)` | Returns the GEOMETRY or GEOGRAPHY value as WKB (BINARY). | |
| 121 | +| `ST_GeomFromWKB(wkb)` | Parses WKB and returns a GEOMETRY with default SRID 0. | |
| 122 | +| `ST_GeomFromWKB(wkb, srid)` | Parses WKB and returns a GEOMETRY with the given SRID. | |
| 123 | +| `ST_GeogFromWKB(wkb)` | Parses WKB and returns a GEOGRAPHY with SRID 4326. | |
| 124 | +| `ST_Srid(geo)` | Returns the SRID of the GEOMETRY or GEOGRAPHY value (NULL if input is NULL). | |
| 125 | +| `ST_SetSrid(geo, srid)` | Returns a new GEOMETRY or GEOGRAPHY with the given SRID. | |
| 126 | + |
| 127 | +**Examples:** |
| 128 | + |
| 129 | +```sql |
| 130 | +SELECT hex(ST_AsBinary(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040'))); |
| 131 | +-- 0101000000000000000000F03F0000000000000040 |
| 132 | + |
| 133 | +SELECT ST_Srid(ST_GeogFromWKB(X'0101000000000000000000F03F0000000000000040')); |
| 134 | +-- 4326 |
| 135 | + |
| 136 | +SELECT ST_Srid(ST_SetSrid(ST_GeomFromWKB(X'0101000000000000000000F03F0000000000000040'), 3857)); |
| 137 | +-- 3857 |
| 138 | +``` |
| 139 | + |
| 140 | +### SRID and Stored Values |
| 141 | + |
| 142 | +* **Fixed-SRID columns**: Every value in the column must have the same SRID as the column type. Inserting a value with a different SRID can raise an error; to avoid this, use `ST_SetSrid` to set the value’s SRID to match the column before inserting. |
| 143 | +* **Mixed-SRID columns** (`GEOMETRY(ANY)` or `GEOGRAPHY(ANY)`): Values can have different SRIDs. Only valid SRIDs are allowed. |
| 144 | +* **Storage**: Parquet, Delta, and Iceberg store geometry/geography with a fixed SRID per column; mixed-SRID types are for in-memory/query use. When writing to these formats, a concrete (fixed) SRID is required. |
| 145 | + |
| 146 | +### Data Types Reference |
| 147 | + |
| 148 | +For the full list of supported data types and API usage in Scala, Java, Python, and SQL, see [Data Types](sql-ref-datatypes.html). |