Add 'type' field in stac_table_to_ndjson #105
Conversation
xref stac-utils/rustac#736, looks like …
What is the timeline or general chance that this will be merged? I am wondering whether I should revert to …
I think we need a few things: …
stac_geoparquet/arrow/_api.py (outdated)
```python
        dest: The destination where newline-delimited JSON should be written.
    """
    if (
        isinstance(table, (pa.Table, pa.RecordBatch))
```
Also note: the type hint is `RecordBatchReader`, not `RecordBatch`.
stac_geoparquet/arrow/_api.py (outdated)
```python
    if (
        isinstance(table, (pa.Table, pa.RecordBatch))
        and "type" not in table.schema.names
    ):
        arr = pa.array(["Feature"] * len(table), type=pa.string())
        table = table.add_column(0, "type", arr)

    # Coerce to record batch reader to avoid materializing entire stream
    reader = pa.RecordBatchReader.from_stream(table)
```
Sorry I didn't see my mention before.

> @kylebarron the type on `table` is `table: pa.Table | pa.RecordBatchReader | ArrowStreamExportable`. I wasn't sure if `ArrowStreamExportable` had things like `schema` and `add_column`.
No it doesn't. The definition of `ArrowStreamExportable` is here:

stac-geoparquet/stac_geoparquet/arrow/types.py, lines 6 to 7 in 1c3e672:

```python
class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...  # noqa
```
The point of this is that it can be any Arrow table-like object from any Arrow-compatible library that implements the PyCapsule interface.

The intended use here is to import the external Arrow object into one you can work with. That's what the `pa.RecordBatchReader.from_stream` line does: it converts any Arrow-like representation into a concrete pyarrow object that we know how to work with.
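As a minimal sketch of that import step (the `as_reader` helper name is hypothetical, and `from_stream` assumes a reasonably recent pyarrow):

```python
import pyarrow as pa

def as_reader(table) -> pa.RecordBatchReader:
    # Accepts any object implementing the Arrow PyCapsule stream
    # interface (__arrow_c_stream__): a pyarrow Table, a polars
    # DataFrame, etc. Batches are streamed, not copied up front.
    return pa.RecordBatchReader.from_stream(table)

# Usage: iterate batches lazily from whatever was passed in.
# for batch in as_reader(some_arrow_like_object):
#     ...
```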
But there are still a couple of problems here:

- As written, we'd get different behavior depending on whether you pass in a pyarrow object or a polars (or similar) object. So really you want the conversion to pyarrow to happen first.
- However, we also don't want to force materialization of the input stream, because it could be a larger-than-memory iterator.
What you should do instead is handle it below, inside the `for batch in reader` loop: each iteration yields a `pa.RecordBatch`, I think, and there you can add the `type` key (see the sketch after this comment).
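For illustration, a minimal sketch of that per-batch approach; the `ensure_type_column` helper name is hypothetical, not code from this PR:

```python
import pyarrow as pa

def ensure_type_column(batch: pa.RecordBatch) -> pa.RecordBatch:
    # Prepend a constant "type" column when the batch lacks one.
    if "type" in batch.schema.names:
        return batch
    arr = pa.array(["Feature"] * batch.num_rows, type=pa.string())
    return pa.RecordBatch.from_arrays(
        [arr, *batch.columns],
        names=["type", *batch.schema.names],
    )

# Inside the existing loop only one batch is in memory at a time,
# so the stream is never fully materialized:
# for batch in reader:
#     batch = ensure_type_column(batch)
#     ...
```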
Makes sense, thanks. This should be done in 259a146.
stac_geoparquet/arrow/_api.py (outdated)
```python
        isinstance(table, (pa.Table, pa.RecordBatch))
        and "type" not in table.schema.names
    ):
        arr = pa.array(["Feature"] * len(table), type=pa.string())
```
Probably not a big deal to use `"Feature" * num_rows` amount of memory, but you could easily dictionary-encode this if you wanted; see the sketch below.
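A sketch of that dictionary-encoded variant (the `feature_type_array` helper is hypothetical, not the code merged in this PR):

```python
import pyarrow as pa

def feature_type_array(num_rows: int) -> pa.DictionaryArray:
    # One shared "Feature" string plus int8 indices, instead of
    # num_rows separate string values.
    indices = pa.array([0] * num_rows, type=pa.int8())
    dictionary = pa.array(["Feature"], type=pa.string())
    return pa.DictionaryArray.from_arrays(indices, dictionary)
```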
Good call. Done in 259a146 as well.
* do it per batch, to handle all types
* dict encode the array
I'm rethinking this now that I've reread the context here. Because the entire point of this is to output STAC JSON, I'm OK just calling this a bug and adding it. Unless someone asks for it, I don't plan to add a keyword to disable it.
LGTM. Is this a chance to add a CHANGELOG, since this is a bugfix worth recording?
Thanks all.
This adds a `type` field to tables being exported to ndjson if it isn't present.

@kylebarron the type on `table` is `table: pa.Table | pa.RecordBatchReader | ArrowStreamExportable`. I wasn't sure if `ArrowStreamExportable` had things like `schema` and `add_column`. Do you have a suggestion there?

Closes #78