Add 'type' field in stac_table_to_ndjson #105

Merged

Conversation

TomAugspurger
Collaborator

This adds a type field to tables being exported to ndjson if it isn't present.

@kylebarron the type on table is table: pa.Table | pa.RecordBatchReader | ArrowStreamExportable. I wasn't sure if ArrowStreamExportable had things like schema and add_column. Do you have a suggestion there?

Closes #78

@gadomski
Member

xref stac-utils/rustac#736, looks like type is being a PITA 😆

@fwfichtner

What is the timeline or general chance that this will be merged? I am wondering whether I should revert to stac-geoparquet for now because of stac-utils/rustac#736 (comment)

@TomAugspurger
Collaborator Author

I think we need a few things:

  1. A review
  2. Figure out the ArrowStreamExportable
  3. Decide how to roll out this change. I'd prefer to do it slowly, through a keyword that can control whether the column is included. The default of None will warn and use the old behavior (no type column) and passing True/False will silence the warning.
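The rollout idea in item 3 could be sketched like this. The `add_type` keyword name and `resolve_add_type` helper are hypothetical illustrations of the proposal, not the merged API:

```python
import warnings


def resolve_add_type(add_type=None):
    """Resolve a hypothetical add_type keyword: None warns and keeps the
    old behavior (no 'type' column); an explicit True/False is used as-is
    and silences the warning."""
    if add_type is None:
        warnings.warn(
            "A 'type' column will be added by default in a future release; "
            "pass add_type=True or add_type=False to opt in explicitly.",
            FutureWarning,
        )
        return False
    return add_type
```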

@gadomski gadomski self-requested a review June 16, 2025 11:47
        dest: The destination where newline-delimited JSON should be written.
    """

    if (
        isinstance(table, (pa.Table, pa.RecordBatch))
Collaborator

Also note: the type hint is RecordBatchReader not RecordBatch

Comment on lines 166 to 174
    if (
        isinstance(table, (pa.Table, pa.RecordBatch))
        and "type" not in table.schema.names
    ):
        arr = pa.array(["Feature"] * len(table), type=pa.string())
        table = table.add_column(0, "type", arr)

    # Coerce to record batch reader to avoid materializing entire stream
    reader = pa.RecordBatchReader.from_stream(table)
Collaborator

Sorry, I didn't see my mention before:

@kylebarron the type on table is table: pa.Table | pa.RecordBatchReader | ArrowStreamExportable. I wasn't sure if ArrowStreamExportable had things like schema and add_column.

No it doesn't. The definition of ArrowStreamExportable is here:

    class ArrowStreamExportable(Protocol):
        def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...  # noqa

The point of this is that it can be any Arrow table-like object from any Arrow-compatible library that implements the PyCapsule interface.

The intended use here is to import the external Arrow object into one you can work with. That's what the pa.RecordBatchReader.from_stream line does. It converts from any Arrow-like representation to a concrete pyarrow object that we know how to work with.

But there are still a couple problems here:

  • As written, we'd have different behavior whether you pass in a pyarrow object or a polars or similar object. So really you want the conversion to pyarrow to happen first.
  • However we also don't want to force a materialization of the input stream, because it could be a larger-than-memory iterator.

What you should do is handle it below, within the for batch in reader loop: each iteration yields a pa.RecordBatch, I think, and inside the loop you can add the type column.

Collaborator Author

Makes sense, thanks. This should be done in 259a146.

        isinstance(table, (pa.Table, pa.RecordBatch))
        and "type" not in table.schema.names
    ):
        arr = pa.array(["Feature"] * len(table), type=pa.string())
Collaborator

Probably not a big deal to use "Feature" * num_rows amount of memory, but you could easily dictionary-encode this if you wanted.

Collaborator Author

Good call. Done in 259a146 as well.

* do it per batch, to handle all types
* dict encode the array
@TomAugspurger
Collaborator Author

Decide how to roll out this change. I'd prefer to do it slowly

I'm rethinking this now that I've reread the context here. Because the entire point of this is to output STAC JSON, I'm OK just calling this a bug and adding it. Unless someone asks for it, I don't plan to add a keyword to disable it.

Member

@gadomski gadomski left a comment


LGTM. Is this a chance to add a CHANGELOG, since this is a bugfix worth recording?

@TomAugspurger
Collaborator Author

Thanks all.

@TomAugspurger TomAugspurger merged commit f2cd63c into stac-utils:main Jun 17, 2025
5 checks passed
@TomAugspurger TomAugspurger deleted the tom/stac-geoparquet-field branch June 17, 2025 11:49

Successfully merging this pull request may close these issues.

Automatically include "type": "Feature" in stac_table_to_ndjson?
4 participants