-
Notifications
You must be signed in to change notification settings - Fork 1.4k
Upgrade to arrow/parquet 55, and object_store
to 0.12.0
and pyo3 to 0.24.0
#15466
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -166,7 +166,7 @@ impl ObjectStore for BlockingObjectStore { | |||
fn list( | |||
&self, | |||
prefix: Option<&Path>, | |||
) -> BoxStream<'_, object_store::Result<ObjectMeta>> { | |||
) -> BoxStream<'static, object_store::Result<ObjectMeta>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -305,7 +305,7 @@ impl FileOpener for ArrowOpener { | |||
)?; | |||
// read footer according to footer_len | |||
let get_option = GetOptions { | |||
range: Some(GetRange::Suffix(10 + footer_len)), | |||
range: Some(GetRange::Suffix(10 + (footer_len as u64))), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The changes to usize/u64 are for better wasm support, see
self.inner.get_metadata() | ||
fn get_metadata<'a>( | ||
&'a mut self, | ||
options: Option<&'a ArrowReaderOptions>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -44,7 +44,7 @@ datafusion-common = { workspace = true, default-features = true } | |||
datafusion-expr = { workspace = true } | |||
futures = { workspace = true } | |||
log = { workspace = true } | |||
object_store = { workspace = true } | |||
object_store = { workspace = true, features = ["fs"] } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@@ -884,7 +884,7 @@ SELECT extract(day from arrow_cast('14400 minutes', 'Interval(DayTime)')) | |||
query I | |||
SELECT extract(minute from arrow_cast('14400 minutes', 'Interval(DayTime)')) | |||
---- | |||
14400 | |||
0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use rand::Rng; | ||
use rand::SeedableRng; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I inlined the small amount of code from bench_util so this benchmark is standalone and easier to understand what is tested
🤖 |
🤖: Benchmark completed Details
|
I tried briefly to reproduce the performance improvements reported above and it seems like I can: andrewlamb@Andrews-MacBook-Pro-2:~/Downloads$ for i in `seq 1 5`; do datafusion-cli -f q14.sql ; done | grep seconds
Elapsed 0.374 seconds.
Elapsed 0.386 seconds.
Elapsed 0.378 seconds.
Elapsed 0.366 seconds.
Elapsed 0.366 seconds.
I poked around and I can't quite figure out if the improvement is related to concat batches or maybe reducing the number of IOs for reading the parquet metadata 🤔 |
My theory is that the improvement is due to @rluvaton 's PR to improve concat I thought it might be related to improved pre-fetching / fewer IOs due to However, I tried an experiment on main to reduce IO and it doesn't seem to have changed anything |
object_store
to 0.12.0
and pyo3 to 0.24.0
This should be easy to confirm with |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍🏻
Happy to improve performance 😄 I got more in my chamber |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
Thanks everyone! |
This PR upgrades to the latest arrow/parquet release and all the dependencies.
I used this PR to test the arrow/parquet 55 release prior to creating an RC.
Some benchmark results show performance is as good or better (see below)
I havn't figured out exactly why but I theorize it is at least in part due to @rluvaton 's PR to improve
concat
concat
performance, and addappend_array
for some array builder implementations arrow-rs#7309