Commit 8d44eea

GH-46270: [C++][Parquet] Clarify GeoStatistics docstring (#46649)
### Rationale for this change

The distinction between "invalid" and "empty" is not clear in the current documentation!

### What changes are included in this PR?

The docstring for GeoStatistics was improved.

### Are these changes tested?

Just documentation!

### Are there any user-facing changes?

No

* GitHub Issue: #46270

Authored-by: Dewey Dunnington <[email protected]>
Signed-off-by: Dewey Dunnington <[email protected]>
1 parent 1ffc766 commit 8d44eea

1 file changed: +16 -12 lines changed

cpp/src/parquet/geospatial/statistics.h

Lines changed: 16 additions & 12 deletions
@@ -63,18 +63,22 @@ class GeoStatisticsImpl;
 /// \brief Base type for computing geospatial column statistics while writing a file
 /// or representing them when reading a file
 ///
-/// Note that NaN values that were encountered within coordinates are omitted; however,
-/// NaN values that were obtained via decoding encoded statistics are propagated. This
-/// behaviour ensures C++ clients that are inspecting statistics via the column metadata
-/// can detect the case where a writer generated NaNs (even though this implementation
-/// does not generate them).
-///
-/// The handling of NaN values in coordinates is not well-defined among bounding
-/// implementations except for the WKB convention for POINT EMPTY, which is consistently
-/// represented as a point whose ordinates are all NaN. Any other geometry that contains
-/// NaNs cannot expect defined behaviour here or elsewhere; however, a row group that
-/// contains both NaN-containing and normal (completely finite) geometries should not be
-/// excluded from predicate pushdown.
+/// These statistics track the minimum and maximum value (omitting NaN values) of the
+/// four possible dimensions (X, Y, Z, and M) and the distinct set of geometry
+/// type/dimension combinations (e.g., point XY, linestring XYZM) present in the data.
+/// Any of these individual components may be "invalid": for example, when reading a
+/// Parquet file, information about individual components obtained from the column
+/// chunk metadata may have been missing or deemed unusable. Orthogonally,
+/// any of these individual components may be "empty": for example, when using
+/// GeoStatistics to accumulate bounds whilst writing, if all geometries in a column chunk
+/// are null, all ranges (X, Y, Z, and M) will be empty. If all geometries in a column
+/// chunk contain only XY coordinates (the most common case), the Z and M ranges will
+/// be empty but the X and Y ranges will contain finite bounds. Empty ranges are
+/// considered "valid" because they are known to represent exactly zero values (in
+/// contrast to an invalid range, whose contents is completely unknown). These concepts
+/// are all necessary for this object to accurately represent (1) accumulated or partially
+/// accumulated statistics during the writing process and (2) deserialized statistics read
+/// from the column chunk metadata during the reading process.
 ///
 /// EXPERIMENTAL
 class PARQUET_EXPORT GeoStatistics {
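
The "invalid" versus "empty" distinction described in the new docstring can be illustrated with a minimal, self-contained sketch. The names `DimensionRange`, `Update`, `Invalidate`, and `CanSkip` below are hypothetical and are not part of the Arrow or Parquet C++ API. Representing an empty range as `min > max` and letting a reader skip a chunk whose range is empty are assumptions made for illustration only.

```cpp
#include <algorithm>
#include <cmath>
#include <limits>

// Hypothetical stand-in for the statistics of one dimension (X, Y, Z, or M).
// This is NOT the GeoStatistics API; it only models the three states.
struct DimensionRange {
  // false means the contents are completely unknown (e.g. metadata was missing).
  bool valid = true;
  // Assumption: an empty range is encoded as min > max (+inf, -inf).
  double min = std::numeric_limits<double>::infinity();
  double max = -std::numeric_limits<double>::infinity();

  // Empty means "known to contain exactly zero values", so it implies valid.
  bool empty() const { return valid && min > max; }

  // Accumulate one ordinate while writing; NaN values are omitted.
  void Update(double value) {
    if (std::isnan(value)) return;
    min = std::min(min, value);
    max = std::max(max, value);
  }

  // Mark the range as unusable; its bounds must no longer be trusted.
  void Invalidate() { valid = false; }
};

// Sketch of a pushdown check: may a column chunk be skipped for a query
// interval [lo, hi] on this dimension?
bool CanSkip(const DimensionRange& r, double lo, double hi) {
  if (!r.valid) return false;   // unknown contents: the chunk must be read
  if (r.empty()) return true;   // assumption: no values, so nothing can match
  return r.max < lo || r.min > hi;  // finite bounds: skip only if disjoint
}
```

In this sketch, a freshly constructed `DimensionRange` is valid and empty, feeding it only NaN ordinates leaves it empty, and calling `Invalidate()` makes it neither empty nor safe to skip, which mirrors the orthogonality the docstring describes.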
