Description
Is your feature request related to a problem or challenge?
DataFusion currently supports registering files in the Arrow IPC file format as tables:
ctx.register_arrow("my_table", "file.arrow", ArrowReadOptions::default())
    .await
    .unwrap();
ctx.sql("SELECT * FROM my_table LIMIT 10")
    .await
    .unwrap()
    .show()
    .await
    .unwrap();
You can also just reference the file path from SQL in datafusion-cli:
> SELECT * FROM 'file.arrow' LIMIT 10;
You cannot, however, do the same with files in the Arrow IPC stream format. You get the error:
called `Result::unwrap()` on an `Err` value: ArrowError(ParseError("Arrow file does not contain correct footer"), None)
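For context, the footer error follows from how the two formats are laid out. The file format opens and closes with the magic bytes `ARROW1` and carries a footer. The stream format has neither and, in current versions of the spec, starts each message with a `0xFFFFFFFF` continuation marker. Below is a rough sketch of telling the two apart from the first few bytes. It uses only the standard library, the names `IpcFormat`, `sniff_ipc_format`, and `sniff_ipc_file` are hypothetical, and it is not how DataFusion does or would do this.

```rust
use std::fs::File;
use std::io::{self, Read};

/// Magic bytes that open (and close) a file in the Arrow IPC *file* format.
const ARROW_FILE_MAGIC: &[u8; 6] = b"ARROW1";
/// Continuation marker that prefixes each message in the IPC *stream* format.
const STREAM_CONTINUATION: [u8; 4] = [0xFF; 4];

/// Hypothetical result type for this sketch; not a DataFusion or arrow-rs type.
#[derive(Debug, PartialEq, Eq)]
pub enum IpcFormat {
    File,
    Stream,
    Unknown,
}

/// Guess the IPC format from the leading bytes of the data.
/// Assumption: streams written in the legacy layout (no continuation
/// marker) are not recognized and come back as `Unknown`.
pub fn sniff_ipc_format(prefix: &[u8]) -> IpcFormat {
    if prefix.starts_with(ARROW_FILE_MAGIC) {
        IpcFormat::File
    } else if prefix.starts_with(&STREAM_CONTINUATION) {
        IpcFormat::Stream
    } else {
        IpcFormat::Unknown
    }
}

/// Read at most the first 8 bytes of `path` and classify them.
pub fn sniff_ipc_file(path: &str) -> io::Result<IpcFormat> {
    let mut prefix = Vec::with_capacity(8);
    File::open(path)?.take(8).read_to_end(&mut prefix)?;
    Ok(sniff_ipc_format(&prefix))
}
```

Something like `sniff_ipc_file("file.arrow")` could let a single entry point pick the right reader, though whether detection belongs in register_arrow or in a separate function is exactly the kind of guidance I'm asking for.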
Describe the solution you'd like
I would love it if register_arrow supported files in the Arrow IPC stream format, or if an equivalent function were added to do the same. Additionally, it would be great if datafusion-cli could reference stream-format files by path, the same way it can for files in the Arrow IPC file format.
Describe alternatives you've considered
- Convert from the stream format to the file format and then query as shown above.
- Read all the record batches into memory and then register them as a MemTable.
- Add a new StreamProvider impl and use a StreamTable.
There are probably others, too, but none as simple as just being able to register the Arrow file with register_arrow or reference the file directly in datafusion-cli.
Additional context
I'm interested in taking a crack at this feature but, assuming y'all are interested in it, I would love some implementation guidance.
Thanks for your time!