Arrow2 -> Arrow-rs Migration #5741
universalmind303
started this conversation in
Roadmaps
Replies: 1 comment
-
|
LGTM, Thanks @universalmind303 |
Beta Was this translation helpful? Give feedback.
0 replies
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Uh oh!
There was an error while loading. Please reload this page.
-
Background:
Back in the early days of the Rust Arrow ecosystem, two contenders vied for dominance: arrow2 and arrow-rs. Arrow2 made some enticing promises - better safety, improved performance, and had an active community at the time. We placed our bet on arrow2 and built Daft around it.
Fast-forward to today: arrow2 is completely abandoned, leaving projects like ourselves and Polars to vendor it internally just to keep things running. This creates a massive technical burden and prevents us from adopting new features readily available in the thriving arrow-rs ecosystem.
Why Migrate?
Switching to arrow-rs would unlock several key improvements:
f16 support
Would unblock multiple issues (#5293, #3126, #2889) by leveraging arrow-rs's native f16 support.
Dictionary encoding
Eliminates need to write our own dictionary encoding array (#1346).
Native Rust Lance integration
Simplifies implementation (#4841) by using arrow-rs directly with Lance APIs, removing arrow2 → arrow conversion overhead.
Improved count(*) performance
Benefits from arrow-rs's optimized parquet readers (#4450, #2298), which are up to 9x faster for metadata decoding.
Easier contribution
Using a well-documented and actively maintained library would lower the barrier for community contributions, as contributors won't need to understand our vendored arrow2 fork.
Reduced binary size & compile time
Currently we are compiling both arrow2 and arrow-rs, which increases both the binary size and our compile times. Removing arrow2 means we only need to compile one arrow crate instead of two.
Additional benefits
Result<T,E>datatypeProposed migration guide
Part 1: Encapsulation
Create a new
daft-arrowcrate that will act as a bridge while we are undergoing the migration.Initially daft-arrow will just reexport everything from arrow2. From there, we can ban all usage of
arrow2outside of that crate and make everything usedaft_arrowinstead.This reduces our surface area and makes all arrow usage encapsulated inside this single crate.
Acceptance Criteria:
This stage would be considered complete once there is no usage of
arrow2anywhere outside of this newly introduced crate.Part 2: Interface Migration
In this phase, we'll modify the core
Arraytrait to match the arrow-rs interface while maintaining compatibility with existing code.Once all arrow usage is encapsulated inside
daft_arrow, we'll update thedaft_arrow::Arraytrait to align witharrow_rs::Array. This foundational change requires either:We're prioritizing the Array trait first since it's the cornerstone of our interactions with Arrow data structures. Related components like
TrustedLenandBitmapwill be addressed as they come up.Acceptance Criteria:
daft_arrow::Arraytrait definition mirrors thearrow_rs::ArrayinterfacePart 3: Physical Array Migration
At this point, we’re now using the new
daft_arrow::Arraytrait throughout the codebase, but all of the concrete array implementations are still arrow2.We’ll start by moving over the physical arrays
Now that we are using the equivalent of the arrow_rs::Array trait, we can start swapping out the concrete implementations from arrow2 → arrow-rs.
Acceptance Criteria:
All Physical arrays (
DataArray) are now backed by arrow-rs arrays instead of arrow2 arrays.Part 4: Logical Array Migration
Same as part 3, but for logical arrays
Part 5: Final Cleanup
Now that everything is fully backed by arrow-rs instead of arrow2, we can start removing the compatibility layer (
daft_arrow), and remove our vendored arrow2 crate entirelyNote: we may want to keep
daft_arrowjust to keep the encapsulation layer.Acceptance Criteria
arrow2 is entirely removed from our codebase
Post-Migration Work
As mentioned above, there are various issues that become much easier to implement once we are using arrow-rs
To reiterate a few key work streams:
parquet2forparquetSuccess Metrics
Beta Was this translation helpful? Give feedback.
All reactions