Add 'type' field in stac_table_to_ndjson #105
Conversation
xref stac-utils/rustac#736, looks like …
What is the timeline or general chance that this will be merged? I am wondering whether I should revert to …
I think we need a few things: …
stac_geoparquet/arrow/_api.py (outdated)
```python
        dest: The destination where newline-delimited JSON should be written.
    """
    if (
        isinstance(table, (pa.Table, pa.RecordBatch))
```
Also note: the type hint is `RecordBatchReader`, not `RecordBatch`.
stac_geoparquet/arrow/_api.py (outdated)
```python
    if (
        isinstance(table, (pa.Table, pa.RecordBatch))
        and "type" not in table.schema.names
    ):
        arr = pa.array(["Feature"] * len(table), type=pa.string())
        table = table.add_column(0, "type", arr)

    # Coerce to record batch reader to avoid materializing entire stream
    reader = pa.RecordBatchReader.from_stream(table)
```
Sorry I didn't see my mention before.

> @kylebarron the type on `table` is `table: pa.Table | pa.RecordBatchReader | ArrowStreamExportable`. I wasn't sure if `ArrowStreamExportable` had things like `schema` and `add_column`.
No it doesn't. The definition of `ArrowStreamExportable` is here:

stac-geoparquet/stac_geoparquet/arrow/types.py, lines 6 to 7 in 1c3e672:

```python
class ArrowStreamExportable(Protocol):
    def __arrow_c_stream__(self, requested_schema: object | None = None) -> object: ...  # noqa
```
The point of this is that it can be any Arrow table-like object from any Arrow-compatible library that implements the PyCapsule interface.

The intended use here is to import the external Arrow object into one you can work with. That's what the `pa.RecordBatchReader.from_stream` line does: it converts any Arrow-like representation into a concrete pyarrow object that we know how to work with.
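As a minimal sketch of that import step (the `as_reader` helper name is hypothetical, and `from_stream` assumes a reasonably recent pyarrow):

```python
import pyarrow as pa

def as_reader(table) -> pa.RecordBatchReader:
    # Accepts any object implementing the Arrow PyCapsule stream
    # interface (__arrow_c_stream__): a pyarrow Table, a polars
    # DataFrame, etc. Batches are streamed, not copied up front.
    return pa.RecordBatchReader.from_stream(table)

# Usage: iterate batches lazily from whatever was passed in.
# for batch in as_reader(some_arrow_like_object):
#     ...
```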
But there are still a couple of problems here:

- As written, we'd get different behavior depending on whether you pass in a pyarrow object or a polars (or similar) object. So really you want the conversion to pyarrow to happen first.
- However, we also don't want to force materialization of the input stream, because it could be a larger-than-memory iterator.
What you should do instead is handle it below, inside the `for batch in reader` loop: each iteration yields a `pa.RecordBatch`, I think, and there you can add the `type` key (see the sketch after this comment).
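For illustration, a minimal sketch of that per-batch approach; the `ensure_type_column` helper name is hypothetical, not code from this PR:

```python
import pyarrow as pa

def ensure_type_column(batch: pa.RecordBatch) -> pa.RecordBatch:
    # Prepend a constant "type" column when the batch lacks one.
    if "type" in batch.schema.names:
        return batch
    arr = pa.array(["Feature"] * batch.num_rows, type=pa.string())
    return pa.RecordBatch.from_arrays(
        [arr, *batch.columns],
        names=["type", *batch.schema.names],
    )

# Inside the existing loop only one batch is in memory at a time,
# so the stream is never fully materialized:
# for batch in reader:
#     batch = ensure_type_column(batch)
#     ...
```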
Makes sense, thanks. This should be done in 259a146.
stac_geoparquet/arrow/_api.py (outdated)
```python
        isinstance(table, (pa.Table, pa.RecordBatch))
        and "type" not in table.schema.names
    ):
        arr = pa.array(["Feature"] * len(table), type=pa.string())
```
Probably not a big deal to use `"Feature" * num_rows` amount of memory, but you could easily dictionary-encode this if you wanted; see the sketch below.
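A sketch of that dictionary-encoded variant (the `feature_type_array` helper is hypothetical, not the code merged in this PR):

```python
import pyarrow as pa

def feature_type_array(num_rows: int) -> pa.DictionaryArray:
    # One shared "Feature" string plus int8 indices, instead of
    # num_rows separate string values.
    indices = pa.array([0] * num_rows, type=pa.int8())
    dictionary = pa.array(["Feature"], type=pa.string())
    return pa.DictionaryArray.from_arrays(indices, dictionary)
```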
Good call. Done in 259a146 as well.
* do it per batch, to handle all types
* dict encode the array
I'm rethinking this now that I've reread the context here. Because the entire point of this is to output STAC JSON, I'm OK just calling this a bug and adding it. Unless someone asks for it, I don't plan to add a keyword to disable it.
LGTM. Is this a chance to add a CHANGELOG, since this is a bugfix worth recording?
Thanks all.
This adds a `type` field to tables being exported to ndjson if it isn't present.

@kylebarron the type on `table` is `table: pa.Table | pa.RecordBatchReader | ArrowStreamExportable`. I wasn't sure if `ArrowStreamExportable` had things like `schema` and `add_column`. Do you have a suggestion there?

Closes #78