Hi! Please forgive my very naive question! But please may I ask: What are the ultimate aims of arrow-zarr?
I ask because I'm toying with the idea of helping to speed up Python's xarray package. Perhaps by writing some of xarray in Rust. The end-goal would be much faster execution on a single machine by optimising queries, using all CPU cores for processing, and overlapping IO with compute. (None of these ideas are original to me, of course! And some of these ideas are already implemented in packages like dask, cubed, zarrs, and IceChunk.)
It occurred to me that, perhaps, it'd be possible to use datafusion to do this! Specifically, maybe datafusion could be taught how to process labelled multi-dimensional arrays (with the appropriate custom code, of course). Within datafusion, we'd pass around arrow::Tensor structs. The ultimate ambition might be to re-implement a sizable portion of xarray in datafusion, and then expose an xarray-like API to Python.
And then I remembered that @maximedion2 had mentioned arrow-zarr to me a year ago, back when I was playing with io_uring! (And, in all likelihood, it was @maximedion2's comment that had primed me to think about datafusion for multi-dimensional arrays!)
So, I was wondering: Have you already built what I've been dreaming about?! Is arrow-zarr already an attempt to build something like xarray in datafusion?! (Absolutely no worries either way! I don't want to add pressure!)