Improve concat performance, and add append_array for some array builder implementations #7309

Merged: 20 commits, Apr 6, 2025

Contributor

@rluvaton rluvaton commented Mar 18, 2025

This is on top of


Which issue does this PR close?

N/A

Rationale for this change

This is a building block for implementing specialized concat

What changes are included in this PR?

This PR:

  1. Adds an append_array function to PrimitiveBuilder, GenericByteBuilder and BooleanBuilder, with tests
  2. Changes concat to use this specialized implementation
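
To illustrate the idea behind append_array, here is a minimal sketch built on a toy builder. ToyArray and ToyBuilder are hypothetical stand-ins invented for this example (nulls are a plain Vec<bool> rather than a packed bitmap); they are not the arrow-rs types and are not meant to mirror the implementation in this PR. The only point is that appending a whole array can copy its buffers in bulk instead of pushing one element at a time.

```rust
// Toy model only: `ToyArray` and `ToyBuilder` are hypothetical stand-ins,
// not arrow-rs types. Nulls are tracked with a plain Vec<bool>.
#[derive(Debug, Clone, PartialEq)]
pub struct ToyArray {
    values: Vec<i32>,
    validity: Vec<bool>, // true = valid, false = null
}

impl ToyArray {
    pub fn from_options(items: &[Option<i32>]) -> Self {
        ToyArray {
            // A null still occupies a value slot; store 0 there.
            values: items.iter().map(|v| v.unwrap_or(0)).collect(),
            validity: items.iter().map(|v| v.is_some()).collect(),
        }
    }

    pub fn to_options(&self) -> Vec<Option<i32>> {
        self.values
            .iter()
            .zip(&self.validity)
            .map(|(v, ok)| if *ok { Some(*v) } else { None })
            .collect()
    }
}

#[derive(Default)]
pub struct ToyBuilder {
    values: Vec<i32>,
    validity: Vec<bool>,
}

impl ToyBuilder {
    /// Per-element path: two pushes per appended value.
    pub fn append_option(&mut self, v: Option<i32>) {
        self.values.push(v.unwrap_or(0));
        self.validity.push(v.is_some());
    }

    /// Bulk path: copy both buffers of an existing array with two slice copies.
    pub fn append_array(&mut self, array: &ToyArray) {
        self.values.extend_from_slice(&array.values);
        self.validity.extend_from_slice(&array.validity);
    }

    pub fn finish(self) -> ToyArray {
        ToyArray { values: self.values, validity: self.validity }
    }
}
```

In this sketch a caller can keep one ToyBuilder alive, call append_array each time a new array arrives, and call finish at the end, so the arrays do not all need to be available up front.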

Are there any user-facing changes?

Yes, a new append_array method on the builders listed above.


Local benchmark results:

$ cargo bench --features test_utils -p arrow --bench concatenate_kernel -- --baseline main-concatenate_kernel

concat i32 1024         time:   [234.82 ns 236.58 ns 238.50 ns]
                        change: [-52.116% -51.701% -51.234%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  4 (4.00%) high mild

concat i32 nulls 1024   time:   [357.82 ns 360.23 ns 363.06 ns]
                        change: [-46.113% -45.728% -45.370%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

concat i32 8192 over 100 arrays
                        time:   [130.41 µs 131.34 µs 132.28 µs]
                        change: [-5.5753% -4.6316% -3.5808%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild

concat i32 nulls 8192 over 100 arrays
                        time:   [151.43 µs 153.18 µs 155.05 µs]
                        change: [-4.1616% -2.6759% -1.1288%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

concat 1024 arrays i32 4
                        time:   [8.6263 µs 8.6465 µs 8.6699 µs]
                        change: [-89.718% -89.678% -89.638%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

concat str 1024         time:   [6.0463 µs 6.0539 µs 6.0626 µs]
                        change: [-10.066% -9.9362% -9.8074%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

concat str nulls 1024   time:   [3.7298 µs 3.7344 µs 3.7393 µs]
                        change: [-16.925% -16.796% -16.675%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

concat str_dict 1024    time:   [2.0799 µs 2.0838 µs 2.0877 µs]
                        change: [-1.7071% -1.3975% -1.1049%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

concat str_dict_sparse 1024
                        time:   [5.8639 µs 5.8710 µs 5.8775 µs]
                        change: [-0.7492% -0.3510% +0.0217%] (p = 0.08 > 0.05)
                        No change in performance detected.
Found 13 outliers among 100 measurements (13.00%)
  2 (2.00%) low severe
  8 (8.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

concat str nulls 1024 #2
                        time:   [3.7188 µs 3.7646 µs 3.8541 µs]
                        change: [-17.070% -16.553% -15.777%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  5 (5.00%) high mild
  2 (2.00%) high severe

concat fixed size lists time:   [338.68 µs 341.41 µs 344.52 µs]
                        change: [-0.5865% +1.4834% +3.2434%] (p = 0.14 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  8 (8.00%) high mild
  2 (2.00%) high severe

@github-actions github-actions bot added the "arrow" (Changes to the arrow crate) label Mar 18, 2025
@rluvaton rluvaton changed the title from "feat: add append_array for PrimitiveBuilder" to "feat: add append_array for some array builder implementations" Mar 18, 2025
@tustvold
Contributor

What is the motivation for adding these here, as opposed to just using the array constructors to implement concat? IMO if you aren't constructing by value, there's no point using the builders?

@rluvaton
Contributor Author

rluvaton commented Mar 18, 2025

The motivation is:

  1. It works when you don't have all the arrays available up front
  2. It is easier for users, who would otherwise need to implement concat manually for each builder implementation
  3. It enables a specialized concat implementation that is faster; in my local testing, concat for primitive arrays improves by up to 50%, for example

@rluvaton
Contributor Author

I've pushed the updated concat implementation so you can run the benchmarks locally. My results:

concat i32 1024         time:   [241.43 ns 244.65 ns 247.98 ns]
                        change: [-47.617% -46.762% -45.826%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

concat i32 nulls 1024   time:   [350.99 ns 352.72 ns 354.66 ns]
                        change: [-45.490% -45.239% -44.996%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe

concat 1024 arrays i32 4
                        time:   [7.5176 µs 7.5277 µs 7.5396 µs]
                        change: [-89.988% -89.963% -89.935%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 12 outliers among 100 measurements (12.00%)
  2 (2.00%) high mild
  10 (10.00%) high severe

@tustvold
Contributor

This seems like quite a fair bit of code and codegen to shave off literal nanoseconds, no? Do we see similar returns for the byte array types? Also as this is really just reducing the dispatch overheads, I would expect as the array sizes increase the return would largely disappear.

I dunno I am always pretty wary of adding new unsafe APIs...

@rluvaton rluvaton marked this pull request as ready for review March 23, 2025 11:53
@alamb
Contributor

alamb commented Mar 25, 2025

> This is a building block for implementing specialized concat

I am curious what you mean by this / what you have in mind

Specifically, there is another PR that seems to be related to this type of operation

Also I have a writeup of a more specialized concat / take type operation that might be related

@rluvaton
Contributor Author

> This is a building block for implementing specialized concat
>
> I am curious what you mean by this / what you have in mind

I want to implement concat in a setting where I don't have all the data at hand and keep building the result incrementally. While take_in can be used to implement this, it won't be as efficient.

> Specifically, there is another PR that seems to be related to this type of operation

take_in will always be slower than this function, because it has to support taking the same index multiple times and so can't just copy the data as is.
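
The difference described here can be sketched on plain slices. The two functions below are hypothetical helpers written for illustration; they are not take_in or append_array and make no claim about how either is implemented. They only contrast an index-driven gather with a contiguous copy.

```rust
/// Gather: one bounds-checked lookup per index. Indices may repeat or be
/// out of order, so the source cannot be copied as one contiguous block.
pub fn gather_into(out: &mut Vec<i32>, values: &[i32], indices: &[usize]) {
    out.reserve(indices.len());
    for &i in indices {
        out.push(values[i]);
    }
}

/// Append: the whole source is wanted, in order, so a single contiguous
/// copy is enough and no index buffer is needed at all.
pub fn append_into(out: &mut Vec<i32>, values: &[i32]) {
    out.extend_from_slice(values);
}
```

With identity indices (0, 1, 2, ...) both produce the same output, but the gather version still has to materialize and walk the index list.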

@alamb alamb changed the title from "feat: add append_array for some array builder implementations" to "Improve concat performance, and add append_array for some array builder implementations" Mar 28, 2025
Contributor

@alamb alamb left a comment

Thank you @rluvaton -- I think this is looking quite good so far (I didn't look at previous versions of this PR)

I have a few test suggestions, but otherwise I think all that is needed are a few more benchmarks and a run showing the performance benefits. I am happy to run such benchmarks if you could make a new PR with just the benchmarks

I think this is going to be pretty sweet

// Creating intermediate offsets instead of pushing each offset is faster
// (even if we make MutableBuffer to avoid updating length on each push
// and reserve the necessary capacity, it's still slower)
let mut intermediate = Vec::with_capacity(offsets.len() - 1);
Contributor

🤔 maybe we need something like extend_from_iter in offsets builder (not for this PR)

Contributor

@Dandandan Dandandan Apr 1, 2025

I think many of these instances could be changed to use Vec directly (and use optimized extend, etc. from them)
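
As a rough illustration of the pattern under discussion (collect the rebased offsets in a temporary Vec, then extend the destination once), here is a sketch on plain Vec buffers. append_bytes and its parameters are hypothetical names made up for this example, the offsets are assumed to be i32, and the destination offsets buffer is assumed to already start with 0; this is not the code from the PR.

```rust
/// Append a byte/string array, given as (offsets, data), onto destination
/// buffers. `src_offsets` has one more entry than the number of elements and
/// may start at a non-zero value (for example when the source is a slice).
pub fn append_bytes(
    dst_offsets: &mut Vec<i32>,
    dst_data: &mut Vec<u8>,
    src_offsets: &[i32],
    src_data: &[u8],
) {
    if src_offsets.len() < 2 {
        return; // the source holds no elements
    }
    let first = src_offsets[0];
    let last = src_offsets[src_offsets.len() - 1];
    // Assumption: the destination offsets buffer is never empty.
    let base = *dst_offsets.last().expect("offsets buffer starts with 0");

    // Copy only the byte range the source offsets actually reference.
    dst_data.extend_from_slice(&src_data[first as usize..last as usize]);

    // Rebase every offset except the first into a temporary Vec, then
    // extend the destination in one call instead of pushing one by one.
    let intermediate: Vec<i32> = src_offsets[1..]
        .iter()
        .map(|o| o - first + base)
        .collect();
    dst_offsets.extend_from_slice(&intermediate);
}
```

Whether the temporary Vec is actually faster than pushing directly depends on the buffer type and the compiler; the sketch only shows the shape of the approach.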

alamb pushed a commit that referenced this pull request Apr 2, 2025
…7376)

* bench: add benchmarks for concat boolean and update string bench

* move updated benchmark from #7309 to here

Contributor

@alamb alamb left a comment

Thank you @rluvaton -- I think this PR looks really nice now -- well tested and benchmarked 🏆 🦾

@Dandandan's suggestion here is probably worth trying: https://github.com/apache/arrow-rs/pull/7309/files#r2023700374 (I will file a follow on issue)

I also ran the benchmarks and this PR shows significant improvements (as expected):

++ critcmp main add_append_array_builder
group                          add_append_array_builder               main
-----                          ------------------------               ----
concat 1024 arrays i32 4       1.00     14.9±0.04µs        ? ?/sec    13.85   206.4±0.25µs        ? ?/sec
concat fixed size lists        1.00   706.7±10.18µs        ? ?/sec    1.09   767.3±12.22µs        ? ?/sec
concat i32 1024                1.00    440.9±3.06ns        ? ?/sec    1.79    788.1±9.00ns        ? ?/sec
concat i32 nulls 1024          1.00    785.8±4.06ns        ? ?/sec    1.69   1328.5±2.27ns        ? ?/sec
concat str 1024                1.00     13.8±1.41µs        ? ?/sec    1.15     15.9±0.90µs        ? ?/sec
concat str nulls 1024          1.00      6.9±0.57µs        ? ?/sec    1.36      9.4±0.35µs        ? ?/sec
concat str_dict 1024           1.02      2.9±0.01µs        ? ?/sec    1.00      2.9±0.00µs        ? ?/sec
concat str_dict_sparse 1024    1.00      7.0±0.01µs        ? ?/sec    1.00      7.0±0.02µs        ? ?/sec
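
For readers comparing this table with the Criterion "change" percentages earlier in the thread, the arithmetic can be sketched as follows. The helper names are hypothetical, and the numbers fed into them in the note below are copied from the outputs quoted in this conversation.

```rust
/// Ratio of two mean times in the same unit; a value above 1.0 means the
/// baseline took longer than the candidate.
pub fn slowdown(baseline: f64, candidate: f64) -> f64 {
    baseline / candidate
}

/// Convert a Criterion relative change in percent (negative = faster)
/// into a speedup factor of old time over new time.
pub fn speedup_from_change(percent: f64) -> f64 {
    1.0 / (1.0 + percent / 100.0)
}

/// Average cost per input array in nanoseconds, given a total in microseconds.
pub fn per_array_ns(total_us: f64, arrays: usize) -> f64 {
    total_us * 1000.0 / arrays as f64
}
```

For "concat 1024 arrays i32 4", slowdown(206.4, 14.9) is about 13.85, which matches the ratio column, and per_array_ns gives roughly 202 ns per input array on main against roughly 15 ns with this PR. The -89.678% change from the local run corresponds to a speedup of about 9.7x on that machine.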

@alamb alamb merged commit a5af643 into apache:main Apr 6, 2025
26 checks passed
@alamb
Contributor

alamb commented Apr 6, 2025

Thanks again @rluvaton
