
[libc++] Optimize ranges::copy for random_access_iterator inputs and vector<bool> iterator outputs #120134

Open
wants to merge 7 commits into
base: main

winner245
Contributor

@winner245 winner245 commented Dec 16, 2024

This patch optimizes the performance of {std, ranges}::copy when copying from random_access_iterator inputs to a vector<bool>::iterator output, yielding a performance improvement of up to 3x.

Specifically, for random_access_iterator-pair inputs, instead of iterating through individual bits in vector<bool> using bitwise masks and writing bit by bit, the optimization first assembles the input data into storage words and then copies entire words directly to the underlying storage of vector<bool>. This word-wise copying approach yields a speed-up of up to about 3x for {std, ranges}::copy.
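
To illustrate the word-wise idea, here is a minimal standalone sketch. It is not the libc++ implementation: the helper name pack_bits and the fixed 64-bit word type are assumptions made for illustration, whereas the actual patch operates on the __storage_type of __bit_iterator and also handles a destination that starts in the middle of a word.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of libc++): packs the values in [first, last)
// into 64-bit words, least significant bit first, one bit per input element.
template <class RandomIt>
std::vector<std::uint64_t> pack_bits(RandomIt first, RandomIt last) {
  constexpr std::size_t bits_per_word = 64;
  const std::size_t n = static_cast<std::size_t>(last - first);
  std::vector<std::uint64_t> words((n + bits_per_word - 1) / bits_per_word, 0);

  // Whole words: assemble each word in a local variable, then store it once.
  const std::size_t full = n / bits_per_word;
  for (std::size_t w = 0; w < full; ++w) {
    std::uint64_t word = 0;
    for (std::size_t i = 0; i < bits_per_word; ++i, ++first)
      word |= static_cast<std::uint64_t>(static_cast<bool>(*first)) << i;
    words[w] = word;
  }

  // Trailing partial word, if any.
  const std::size_t rem = n % bits_per_word;
  if (rem != 0) {
    std::uint64_t word = 0;
    for (std::size_t i = 0; i < rem; ++i, ++first)
      word |= static_cast<std::uint64_t>(static_cast<bool>(*first)) << i;
    words[full] = word;
  }
  return words;
}
```

The point of the sketch is that the destination is touched once per word rather than once per bit; the read-modify-write on each individual bit is what the bit-by-bit path pays for.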

This optimization also brings a similar performance improvement for segmented_iterator-pair inputs, where local iterators are random-access. Specifically, the existing segmented iterator optimization first subdivides the segmented inputs (e.g., std::deque and std::join_view inputs) into random-access segments, allowing the newly developed optimization for random_access_iterators to apply on each segment, resulting in a similar 3x speed-up for segmented iterators.
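
The segment-by-segment behaviour can be sketched as follows. This is only an illustration under stated assumptions: segments are modelled as a std::vector of std::vector<int> rather than libc++'s internal segmented-iterator machinery, and the names BitSink and copy_segmented are hypothetical. It shows that each random-access segment is packed in word-sized chunks while the bit offset is carried across segment boundaries.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical bit sink (not part of libc++): stores bits in 64-bit words.
struct BitSink {
  std::vector<std::uint64_t> words;
  std::size_t size = 0; // number of bits written so far

  // Appends one random-access segment, at most one word's worth per chunk.
  template <class RandomIt>
  void append(RandomIt first, RandomIt last) {
    while (first != last) {
      const std::size_t offset = size % 64; // bit position inside current word
      if (offset == 0)
        words.push_back(0);                 // start a fresh word
      const std::size_t room = 64 - offset;
      const std::size_t n =
          std::min<std::size_t>(room, static_cast<std::size_t>(last - first));
      std::uint64_t chunk = 0;
      for (std::size_t i = 0; i < n; ++i, ++first)
        chunk |= static_cast<std::uint64_t>(static_cast<bool>(*first)) << i;
      words.back() |= chunk << offset;      // one store per chunk
      size += n;
    }
  }
};

// Hypothetical driver: feeds each segment to the random-access path in turn.
template <class Segments>
BitSink copy_segmented(const Segments& segments) {
  BitSink sink;
  for (const auto& seg : segments)
    sink.append(seg.begin(), seg.end());
  return sink;
}
```

For example, the segments {1, 0, 1} and {1, 1} produce five bits in a single word, with the second segment starting at bit offset 3.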

As a byproduct of this work, all the iterator-pair and range-based operations that internally call std::copy see a similar speed-up, roughly 2.4x to 3.1x in the benchmarks below. The improved vector<bool> operations include:

  • range-ctor: vector(std::from_range_t, R&& rg, const Allocator& alloc)
  • range-assignment: assign_range(R&& rg)
  • range-insertion: insert_range(const_iterator pos, R&& rg)
  • range-append: append_range(R&& rg)
  • iterator-pair ctor: vector(InputIt first, InputIt last, const Allocator& alloc)
  • iterator-pair assignment: assign(InputIt first, InputIt last)
  • iterator-pair insert: insert(const_iterator pos, InputIt first, InputIt last)
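
As a usage sketch, the iterator-pair members from the list above can be exercised as shown here. The function name demo_iterator_pair_ops is hypothetical, and only the pre-C++23 members are used so the snippet builds without range support. Whether a particular call actually reaches the optimized std::copy path is a libc++ implementation detail and is assumed here, not demonstrated.

```cpp
#include <cassert>
#include <vector>

// Hypothetical demo function: builds a vector<bool> from random-access int
// input using the iterator-pair constructor, assign, and insert.
inline std::vector<bool> demo_iterator_pair_ops(const std::vector<int>& in) {
  std::vector<bool> v(in.begin(), in.end());  // iterator-pair ctor
  v.assign(in.begin(), in.end());             // iterator-pair assignment
  v.insert(v.begin(), in.begin(), in.end());  // iterator-pair insert at front
  return v;                                   // holds the input twice
}
```

Each int is converted to bool on the way in, so an input element of 2 ends up as true.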

Benchmarks

Benchmarks have been added and integrated into the recently enhanced benchmark framework. The results below show substantial performance improvements for both {std, ranges}::copy and vector<bool> operations when the input is random-access (vector, deque); list inputs are essentially unchanged.

{std, ranges}::copy
-------------------------------------------------------------------------------------------
Benchmark                                                   Before         After    Speedup
-------------------------------------------------------------------------------------------
std::copy(vector<int>, std::vector<bool>)/8                8.37 ns       4.07 ns      2.1x
std::copy(vector<int>, std::vector<bool>)/64               60.5 ns       26.3 ns      2.3x
std::copy(vector<int>, std::vector<bool>)/512               577 ns        207 ns      2.8x
std::copy(vector<int>, std::vector<bool>)/4096             5063 ns       1686 ns      3.0x
std::copy(vector<int>, std::vector<bool>)/32768           37919 ns      13111 ns      2.9x
std::copy(vector<int>, std::vector<bool>)/262144         305365 ns     106846 ns      2.9x
std::copy(vector<int>, std::vector<bool>)/1048576       1273307 ns     465840 ns      2.7x
std::copy(deque<int>, std::vector<bool>)/8                 9.70 ns       5.15 ns      1.9x
std::copy(deque<int>, std::vector<bool>)/64                66.6 ns       26.9 ns      2.5x
std::copy(deque<int>, std::vector<bool>)/512                580 ns        218 ns      2.7x
std::copy(deque<int>, std::vector<bool>)/4096              4789 ns       1692 ns      2.8x
std::copy(deque<int>, std::vector<bool>)/32768            37840 ns      14920 ns      2.5x
std::copy(deque<int>, std::vector<bool>)/262144          322336 ns     118503 ns      2.7x
std::copy(deque<int>, std::vector<bool>)/1048576        1232040 ns     479524 ns      2.6x
std::copy(list<int>, std::vector<bool>)/8                  7.69 ns       9.11 ns      0.8x
std::copy(list<int>, std::vector<bool>)/64                 90.8 ns       94.2 ns      1.0x
std::copy(list<int>, std::vector<bool>)/512                 665 ns        706 ns      0.9x
std::copy(list<int>, std::vector<bool>)/4096               7273 ns       7350 ns      1.0x
std::copy(list<int>, std::vector<bool>)/32768             66961 ns      59105 ns      1.1x
std::copy(list<int>, std::vector<bool>)/262144           563599 ns     515650 ns      1.1x
std::copy(list<int>, std::vector<bool>)/1048576         3392020 ns    3366732 ns      1.0x
rng::copy(vector<int>, std::vector<bool>)/8                8.87 ns       4.86 ns      1.8x
rng::copy(vector<int>, std::vector<bool>)/64               74.6 ns       38.3 ns      1.9x
rng::copy(vector<int>, std::vector<bool>)/512               622 ns        258 ns      2.4x
rng::copy(vector<int>, std::vector<bool>)/4096             5008 ns       1773 ns      2.8x
rng::copy(vector<int>, std::vector<bool>)/32768           39716 ns      13775 ns      2.9x
rng::copy(vector<int>, std::vector<bool>)/262144         320175 ns     112049 ns      2.9x
rng::copy(vector<int>, std::vector<bool>)/1048576       1352474 ns     444207 ns      3.0x
rng::copy(deque<int>, std::vector<bool>)/8                 8.71 ns       5.80 ns      1.5x
rng::copy(deque<int>, std::vector<bool>)/64                63.1 ns       28.0 ns      2.3x
rng::copy(deque<int>, std::vector<bool>)/512                648 ns        215 ns      3.0x
rng::copy(deque<int>, std::vector<bool>)/4096              5158 ns       1745 ns      3.0x
rng::copy(deque<int>, std::vector<bool>)/32768            38204 ns      13954 ns      2.7x
rng::copy(deque<int>, std::vector<bool>)/262144          314286 ns     113385 ns      2.8x
rng::copy(deque<int>, std::vector<bool>)/1048576        1245843 ns     468045 ns      2.7x
rng::copy(list<int>, std::vector<bool>)/8                  8.10 ns       10.2 ns      0.8x
rng::copy(list<int>, std::vector<bool>)/64                 97.1 ns       90.6 ns      1.1x
rng::copy(list<int>, std::vector<bool>)/512                 682 ns        671 ns      1.0x
rng::copy(list<int>, std::vector<bool>)/4096               7052 ns       7060 ns      1.0x
rng::copy(list<int>, std::vector<bool>)/32768             57095 ns      56941 ns      1.0x
rng::copy(list<int>, std::vector<bool>)/262144           481966 ns     515630 ns      0.9x
rng::copy(list<int>, std::vector<bool>)/1048576         3103370 ns    3447819 ns      0.9x

vector<bool>
--------------------------------------------------------------------------------------------------------------------
Benchmark                                                                           Before          After    Speedup
--------------------------------------------------------------------------------------------------------------------
std::vector<bool>::ctor(ra_iter, ra_iter) (cheap elements)/1024                    1122 ns         424 ns      2.6x
std::vector<bool>::ctor(ra_iter, ra_iter) (cheap elements)/65536                  73961 ns       26137 ns      2.8x
std::vector<bool>::ctor(ra_iter, ra_iter) (cheap elements)/1048576              1197765 ns      428951 ns      2.8x
std::vector<bool>::assign(ra_iter, ra_iter) (cheap elements)/1024                  1155 ns         424 ns      2.7x
std::vector<bool>::assign(ra_iter, ra_iter) (cheap elements)/65536                73327 ns       26512 ns      2.8x
std::vector<bool>::assign(ra_iter, ra_iter) (cheap elements)/1048576            1204693 ns      439736 ns      2.7x
std::vector<bool>::insert(begin, ra_iter, ra_iter) (cheap elements)/1024           1267 ns         473 ns      2.7x
std::vector<bool>::insert(begin, ra_iter, ra_iter) (cheap elements)/65536         79917 ns       29656 ns      2.7x
std::vector<bool>::insert(begin, ra_iter, ra_iter) (cheap elements)/1048576     1360464 ns      481255 ns      2.8x
std::vector<bool>::ctor(ra_range) (cheap elements)/1024                            1221 ns         477 ns      2.6x
std::vector<bool>::ctor(ra_range) (cheap elements)/65536                          77053 ns       29660 ns      2.6x
std::vector<bool>::ctor(ra_range) (cheap elements)/1048576                      1249673 ns      477625 ns      2.6x
std::vector<bool>::assign_range(ra_range) (cheap elements)/1024                    1326 ns         464 ns      2.9x
std::vector<bool>::assign_range(ra_range) (cheap elements)/65536                  80035 ns       29477 ns      2.7x
std::vector<bool>::assign_range(ra_range) (cheap elements)/1048576              1276558 ns      528217 ns      2.4x
std::vector<bool>::insert_range(ra_range) (cheap elements)/1024                    1352 ns         485 ns      2.8x
std::vector<bool>::insert_range(ra_range) (cheap elements)/65536                  85661 ns       28019 ns      3.1x
std::vector<bool>::insert_range(ra_range) (cheap elements)/1048576              1380750 ns      450849 ns      3.1x
std::vector<bool>::append_range(ra_range) (cheap elements)/1024                    1394 ns         444 ns      3.1x
std::vector<bool>::append_range(ra_range) (cheap elements)/65536                  83086 ns       27550 ns      3.0x
std::vector<bool>::append_range(ra_range) (cheap elements)/1048576              1321555 ns      444045 ns      3.0x

@winner245 winner245 marked this pull request as ready for review December 16, 2024 20:19
@winner245 winner245 requested a review from a team as a code owner December 16, 2024 20:19
@llvmbot llvmbot added the libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. label Dec 16, 2024
@llvmbot
Member

llvmbot commented Dec 16, 2024

@llvm/pr-subscribers-libcxx

Author: Peng Liu (winner245)

Changes

General description

This PR is part of a series aimed at significantly improving the performance of vector<bool>. Each PR focuses on a specific subset of operations so that it is self-contained and easy to review. The main idea is to use a word-wise implementation together with bit manipulation techniques, rather than the purely bit-wise operations of the previous implementation, which results in substantial performance gains.

Current PR

This PR enhances the performance of all range-based operations in vector<bool> by at least 5x. The main idea is to provide a more efficient overload of std::__copy(_InIter __first, _InIter __last, __bit_iterator<_Cp, false> __result), which is used by various range-based operations in vector<bool>. With this efficient overload of std::__copy, all range-based operations benefit from significant performance improvements, which apply to the iterator-pair based range operations as well as C++23's range constructor and {assign, insert, append}_range functions:

  • range-ctor vector(InputIt first, InputIt last, const Allocator& alloc): 5.84x
  • C++23 range-ctor vector(std::from_range_t, R&& rg, const Allocator& alloc): 5.86x
  • range-assignment assign(InputIt first, InputIt last): 5.84x
  • C++23 assign_range(R&& rg): 5.9x
  • range-insert insert(const_iterator pos, InputIt first, InputIt last): 6.38x
  • C++23 insert_range(const_iterator pos, R&& rg): 6.45x
  • C++23 append_range(R&& rg): 5.5x

Before:

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ConstructIterIter/vector_bool/5140480      22432969 ns     22560977 ns           31
BM_ConstructFromRange/vector_bool/5140480     22499312 ns     22632239 ns           31
BM_Assign_IterIter/vector_bool/5140480        22542583 ns     22679677 ns           30
BM_Assign_Range/vector_bool/5140480           22739005 ns     22881371 ns           31
BM_Insert_Iter_IterIter/vector_bool/5140480   23249604 ns     23398233 ns           30
BM_Insert_Range/vector_bool/5140480           23031899 ns     23181587 ns           30
BM_Append_Range/vector_bool/5140480           23432886 ns     23586148 ns           29

After:

--------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations
--------------------------------------------------------------------------------------
BM_ConstructIterIter/vector_bool/5140480       3836990 ns      3857075 ns          182
BM_ConstructFromRange/vector_bool/5140480      3838558 ns      3860015 ns          177
BM_Assign_IterIter/vector_bool/5140480         3856720 ns      3879212 ns          181
BM_Assign_Range/vector_bool/5140480            3849086 ns      3872665 ns          178
BM_Insert_Iter_IterIter/vector_bool/5140480    3639338 ns      3661651 ns          189
BM_Insert_Range/vector_bool/5140480            3569611 ns      3592612 ns          195
BM_Append_Range/vector_bool/5140480            4256268 ns      4284186 ns          168

Full diff: https://github.com/llvm/llvm-project/pull/120134.diff

4 Files Affected:

  • (modified) libcxx/include/__algorithm/copy.h (+50)
  • (modified) libcxx/include/__bit_reference (+3)
  • (modified) libcxx/test/benchmarks/containers/ContainerBenchmarks.h (+58)
  • (added) libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp (+37)
diff --git a/libcxx/include/__algorithm/copy.h b/libcxx/include/__algorithm/copy.h
index 4f30b2050abbaf..f737bc4e98e6d6 100644
--- a/libcxx/include/__algorithm/copy.h
+++ b/libcxx/include/__algorithm/copy.h
@@ -13,6 +13,8 @@
 #include <__algorithm/for_each_segment.h>
 #include <__algorithm/min.h>
 #include <__config>
+#include <__fwd/bit_reference.h>
+#include <__iterator/distance.h>
 #include <__iterator/iterator_traits.h>
 #include <__iterator/segmented_iterator.h>
 #include <__type_traits/common_type.h>
@@ -95,6 +97,54 @@ struct __copy_impl {
     }
   }
 
+  template <class _InIter, class _Cp, __enable_if_t<__has_forward_iterator_category<_InIter>::value, int> = 0>
+  _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX20 pair<_InIter, __bit_iterator<_Cp, false>>
+  operator()(_InIter __first, _InIter __last, __bit_iterator<_Cp, false> __result) {
+    using _It                      = __bit_iterator<_Cp, false>;
+    using __storage_type           = typename _It::__storage_type;
+    __storage_type __n             = static_cast<__storage_type>(std::distance(__first, __last));
+    const unsigned __bits_per_word = _It::__bits_per_word;
+
+    if (__n) {
+      // do first partial word, if present
+      if (__result.__ctz_ != 0) {
+        __storage_type __clz = static_cast<__storage_type>(__bits_per_word - __result.__ctz_);
+        __storage_type __dn  = std::min(__clz, __n);
+        __storage_type __w   = *__result.__seg_;
+        __storage_type __m   = (~__storage_type(0) << __result.__ctz_) & (~__storage_type(0) >> (__clz - __dn));
+        __w &= ~__m;
+        for (__storage_type __i = 0; __i < __dn; ++__i, ++__first)
+          __w |= static_cast<__storage_type>(*__first) << __result.__ctz_++;
+        *__result.__seg_ = __w;
+        if (__result.__ctz_ == __bits_per_word) {
+          __result.__ctz_ = 0;
+          ++__result.__seg_;
+        }
+        __n -= __dn;
+      }
+    }
+    // do middle whole words, if present
+    __storage_type __nw = __n / __bits_per_word;
+    __n -= __nw * __bits_per_word;
+    for (; __nw; --__nw) {
+      __storage_type __w = 0;
+      for (__storage_type __i = 0; __i < __bits_per_word; ++__i, ++__first)
+        __w |= static_cast<__storage_type>(*__first) << __i;
+      *__result.__seg_++ = __w;
+    }
+    // do last partial word, if present
+    if (__n) {
+      __storage_type __w = 0;
+      for (__storage_type __i = 0; __i < __n; ++__i, ++__first)
+        __w |= static_cast<__storage_type>(*__first) << __i;
+      __storage_type __m = ~__storage_type(0) >> (__bits_per_word - __n);
+      *__result.__seg_ &= ~__m;
+      *__result.__seg_ |= __w;
+      __result.__ctz_ = __n;
+    }
+    return std::make_pair(std::move(__first), std::move(__result));
+  }
+
   // At this point, the iterators have been unwrapped so any `contiguous_iterator` has been unwrapped to a pointer.
   template <class _In, class _Out, __enable_if_t<__can_lower_copy_assignment_to_memmove<_In, _Out>::value, int> = 0>
   _LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 pair<_In*, _Out*>
diff --git a/libcxx/include/__bit_reference b/libcxx/include/__bit_reference
index 22637d43974123..e8cbb63988ba54 100644
--- a/libcxx/include/__bit_reference
+++ b/libcxx/include/__bit_reference
@@ -10,6 +10,7 @@
 #ifndef _LIBCPP___BIT_REFERENCE
 #define _LIBCPP___BIT_REFERENCE
 
+#include <__algorithm/copy.h>
 #include <__algorithm/copy_n.h>
 #include <__algorithm/fill_n.h>
 #include <__algorithm/min.h>
@@ -970,6 +971,8 @@ private:
   _LIBCPP_CONSTEXPR_SINCE_CXX20 friend void
   __fill_n_bool(__bit_iterator<_Dp, false> __first, typename _Dp::size_type __n);
 
+  friend struct __copy_impl;
+
   template <class _Dp, bool _IC>
   _LIBCPP_CONSTEXPR_SINCE_CXX20 friend __bit_iterator<_Dp, false> __copy_aligned(
       __bit_iterator<_Dp, _IC> __first, __bit_iterator<_Dp, _IC> __last, __bit_iterator<_Dp, false> __result);
diff --git a/libcxx/test/benchmarks/containers/ContainerBenchmarks.h b/libcxx/test/benchmarks/containers/ContainerBenchmarks.h
index 6d21e12896ec9e..123f7bc95d4745 100644
--- a/libcxx/test/benchmarks/containers/ContainerBenchmarks.h
+++ b/libcxx/test/benchmarks/containers/ContainerBenchmarks.h
@@ -51,6 +51,30 @@ void BM_Assignment(benchmark::State& st, Container) {
   }
 }
 
+template <class Container, class GenInputs>
+void BM_Assign_IterIter(benchmark::State& st, Container c, GenInputs gen) {
+  auto in  = gen(st.range(0));
+  auto beg = in.begin();
+  auto end = in.end();
+  for (auto _ : st) {
+    c.assign(beg, end);
+    DoNotOptimizeData(c);
+    DoNotOptimizeData(in);
+    benchmark::ClobberMemory();
+  }
+}
+
+template <std::size_t... sz, typename Container, typename GenInputs>
+void BM_Assign_Range(benchmark::State& st, Container c, GenInputs gen) {
+  auto in = gen(st.range(0));
+  for (auto _ : st) {
+    c.assign_range(in);
+    DoNotOptimizeData(c);
+    DoNotOptimizeData(in);
+    benchmark::ClobberMemory();
+  }
+}
+
 template <std::size_t... sz, typename Container, typename GenInputs>
 void BM_AssignInputIterIter(benchmark::State& st, Container c, GenInputs gen) {
   auto v = gen(1, sz...);
@@ -108,6 +132,40 @@ void BM_Pushback_no_grow(benchmark::State& state, Container c) {
   }
 }
 
+template <class Container, class GenInputs>
+void BM_Insert_Iter_IterIter(benchmark::State& st, Container c, GenInputs gen) {
+  auto in        = gen(st.range(0));
+  const auto beg = in.begin();
+  const auto end = in.end();
+  for (auto _ : st) {
+    c.resize(100);
+    c.insert(c.begin() + 50, beg, end);
+    DoNotOptimizeData(c);
+    benchmark::ClobberMemory();
+  }
+}
+
+template <class Container, class GenInputs>
+void BM_Insert_Range(benchmark::State& st, Container c, GenInputs gen) {
+  auto in = gen(st.range(0));
+  for (auto _ : st) {
+    c.resize(100);
+    c.insert_range(c.begin() + 50, in);
+    DoNotOptimizeData(c);
+    benchmark::ClobberMemory();
+  }
+}
+
+template <class Container, class GenInputs>
+void BM_Append_Range(benchmark::State& st, Container c, GenInputs gen) {
+  auto in = gen(st.range(0));
+  for (auto _ : st) {
+    c.append_range(in);
+    DoNotOptimizeData(c);
+    benchmark::ClobberMemory();
+  }
+}
+
 template <class Container, class GenInputs>
 void BM_InsertValue(benchmark::State& st, Container c, GenInputs gen) {
   auto in        = gen(st.range(0));
diff --git a/libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp b/libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp
new file mode 100644
index 00000000000000..2ce10cb6d3d1b6
--- /dev/null
+++ b/libcxx/test/benchmarks/containers/vector_bool_operations.bench.cpp
@@ -0,0 +1,37 @@
+//===----------------------------------------------------------------------===//
+//
+// Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
+// See https://llvm.org/LICENSE.txt for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+//===----------------------------------------------------------------------===//
+
+// UNSUPPORTED: c++03, c++11, c++14, c++17, c++20
+
+#include <cstdint>
+#include <cstdlib>
+#include <cstring>
+#include <deque>
+#include <functional>
+#include <memory>
+#include <string>
+#include <vector>
+
+#include "benchmark/benchmark.h"
+#include "ContainerBenchmarks.h"
+#include "../GenerateInput.h"
+
+using namespace ContainerBenchmarks;
+
+BENCHMARK_CAPTURE(BM_ConstructIterIter, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+BENCHMARK_CAPTURE(BM_ConstructFromRange, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+
+BENCHMARK_CAPTURE(BM_Assign_IterIter, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+BENCHMARK_CAPTURE(BM_Assign_Range, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+
+BENCHMARK_CAPTURE(BM_Insert_Iter_IterIter, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)
+    ->Arg(5140480);
+BENCHMARK_CAPTURE(BM_Insert_Range, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+BENCHMARK_CAPTURE(BM_Append_Range, vector_bool, std::vector<bool>{}, getRandomIntegerInputs<bool>)->Arg(5140480);
+
+BENCHMARK_MAIN();
\ No newline at end of file

@winner245 winner245 force-pushed the speed-up-range-function branch 8 times, most recently from b98a2ef to 4239066 Compare December 18, 2024 15:14

github-actions bot commented Dec 18, 2024

✅ With the latest revision this PR passed the C/C++ code formatter.

@winner245 winner245 force-pushed the speed-up-range-function branch 7 times, most recently from 996d1fe to 81c929a Compare December 21, 2024 17:04
Contributor

@philnik777 philnik777 left a comment

Since you're optimizing an algorithm (and nothing specific to vector<bool> itself) we should just benchmark that instead. That's significantly less convoluted. I'd also like to see some additional tests, especially with iterators that don't return a bool. I'm pretty sure your current implementation is completely broken with that. Lastly, I think we should move this into __copy_impl, since we might be able to unwrap iterators to __bit_iterators. I don't think we do that currently, but I see no reason we couldn't in the future. It would also be nice to improve std::move in the same way (and hopefully share the code).

@winner245
Contributor Author

Thank you for your suggestion.

Since you're optimizing an algorithm (and nothing specific to vector<bool> itself) we should just benchmark that instead. That's significantly less convoluted.

My original motivation for this series of work was to improve the performance of vector<bool>. However, I understand your point, and I can focus on optimizing the std::copy and std::move algorithms instead, and benchmark the performance for the algorithms themselves.

I'd also like to see some additional tests, especially with iterators that don't return a bool. I'm pretty sure your current implementation is completely broken with that.

Since we are dealing with __bit_iterator, my current implementation only works for the bool return type. I plan to add template type constraints to _InIter to ensure it returns types that are either bool or convertible to bool. Do you think this approach meets your expectations?

Lastly, I think we should move this into __copy_impl, since we might be able to unwrap iterators to __bit_iterators. I don't think we do that currently, but I see no reason we couldn't in the future. It would also be nice to improve std::move in the same way (and hopefully share the code).

I agree with you and this was also what I planned to do next.
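
On the non-bool concern raised above, a small sketch shows why casting the dereferenced value straight to the storage type misbehaves for inputs such as int. The two function names are hypothetical and a 64-bit word is assumed. The first function mirrors the expression static_cast<__storage_type>(*__first) << pos from the first revision of the patch, and the second converts to bool before widening.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: widens the raw value, so any value other than 0 or 1
// spills into neighbouring bit positions.
inline std::uint64_t set_bit_naive(std::uint64_t word, int value, unsigned pos) {
  return word | (static_cast<std::uint64_t>(value) << pos);
}

// Hypothetical: collapses the value to exactly 0 or 1 first, which matches
// what assigning through a vector<bool> reference would store.
inline std::uint64_t set_bit_via_bool(std::uint64_t word, int value, unsigned pos) {
  return word | (static_cast<std::uint64_t>(static_cast<bool>(value)) << pos);
}
```

With value 2 at position 0, the naive version sets bit 1 and leaves bit 0 clear, while the bool-converting version sets bit 0 as expected.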

@winner245 winner245 force-pushed the speed-up-range-function branch 5 times, most recently from 05c18a3 to 7bef800 Compare January 21, 2025 04:01
@winner245 winner245 force-pushed the speed-up-range-function branch 2 times, most recently from 2f32561 to 948da90 Compare January 25, 2025 03:16
is_convertible<typename iterator_traits<_InIter>::value_type, bool>::value,
int> = 0>
_LIBCPP_HIDE_FROM_ABI _LIBCPP_CONSTEXPR_SINCE_CXX14 pair<_InIter, __bit_iterator<_Cp, false> >
operator()(_InIter __first, _Sent __last, __bit_iterator<_Cp, false> __result) const {
Member

Suggested change
-operator()(_InIter __first, _Sent __last, __bit_iterator<_Cp, false> __result) const {
+operator()(_InIter __first, _Sent __last, __bit_iterator<_Cp, /* IsConst */false> __result) const {

Contributor Author

Thank you for the thought-provoking comment. As suggested, I've implemented segmented iterator inputs for std::copy. With segmented_iterator input, each input segment reduces to a forward_iterator-pair, which is the case this patch optimizes for. As a result, the performance improvements for forward_iterator-pair inputs also extend to segmented_iterator inputs, yielding a 9x speed-up in both cases. For a more detailed explanation, please refer to my updated PR description.

@winner245 winner245 force-pushed the speed-up-range-function branch 2 times, most recently from 3dbf418 to 37d52d7 Compare March 23, 2025 02:32
@winner245 winner245 changed the title [libc++] Speed-up {random_access, forward}_range-based operations in vector<bool>[3/3] [libc++] Optimize ranges::copy for forward_iterator and segmented_iterator Mar 23, 2025
@winner245 winner245 force-pushed the speed-up-range-function branch 4 times, most recently from 2876f93 to f0eb051 Compare March 23, 2025 18:22
@winner245 winner245 force-pushed the speed-up-range-function branch 3 times, most recently from c342808 to 9767903 Compare June 17, 2025 23:46
@winner245 winner245 changed the title [libc++] Optimize ranges::copy for forward_iterator and segmented_iterator [libc++] Optimize ranges::copy for random_accsess_iterator and segmented_iterator Jun 17, 2025
@winner245 winner245 force-pushed the speed-up-range-function branch from 9767903 to 92e8094 Compare June 18, 2025 01:22
@ldionne ldionne changed the title [libc++] Optimize ranges::copy for random_accsess_iterator and segmented_iterator [libc++] Optimize ranges::copy for random_access_iterator and segmented_iterator Jun 18, 2025
@winner245 winner245 changed the title [libc++] Optimize ranges::copy for random_access_iterator and segmented_iterator [libc++] Optimize ranges::copy for random_access_iterator inputs Jun 18, 2025
@winner245 winner245 force-pushed the speed-up-range-function branch from ab34aec to b5751e5 Compare June 19, 2025 03:47
Member

@ldionne ldionne left a comment

LGTM with minor changes to the benchmarks. Thanks!

@@ -50,6 +50,25 @@ constexpr auto wrap_input(std::vector<T>& input) {
return std::ranges::subrange(std::move(b), std::move(e));
}

template <class Iter, class Sent>
class random_access_range_wrapper {
Member

I think what I would do here is actually remove this type, and then do:

auto bm = [&generators, &bench_vb, &tostr]<template <class> class Iterator>(std::string range) {
  for (auto gen : generators)
    bench_vb("append_range(" + range + ")" + tostr(gen), [gen](auto& st) {
      auto const size = st.range(0);
      std::vector<int> in;
      std::generate_n(std::back_inserter(in), size, gen);
      std::ranges::subrange rg(Iterator(std::ranges::begin(in)), Iterator(std::ranges::end(in)));
      DoNotOptimizeData(in);

      Container c;
      for ([[maybe_unused]] auto _ : st) {
        c.append_range(rg);
        c.erase(c.begin(), c.end()); // avoid growing indefinitely
        DoNotOptimizeData(c);
      }
    });
};
bm.template operator()< cpp20_random_access_iterator >("ra_range");

Note that there was also a bug where you did c.append_range(in) instead of c.append_range(rg), making rg unused.

Contributor Author

Thanks for catching this! It has now been fixed, and I've also rerun all the benchmarks. Because we've introduced an extra static_cast to bool inside the inner loop for every bit, there is a non-negligible cost. Overall, however, the optimization still brings a performance improvement of about 3x, as confirmed by my new benchmarks. I've updated the commit message accordingly.

DoNotOptimizeData(c);

for ([[maybe_unused]] auto _ : st) {
c.insert_range(c.begin(), in);
Member

Same issue here with c.insert_range(c.begin(), in);.

Contributor Author

Done.

Member

Not attached to the file: Let's add a release note for this!

Contributor Author

Added.

Contributor

@philnik777 philnik777 left a comment

Could we change the title to mention vector<bool>::iterator? As-is this sounds like we're optimizing copy for any random_access_iterator combination, which clearly isn't the case.

@winner245 winner245 changed the title [libc++] Optimize ranges::copy for random_access_iterator inputs [libc++] Optimize ranges::copy for random_access_iterator inputs and vector<bool> iterator outputs Jun 26, 2025
@winner245
Contributor Author

@philnik777

Could we change the title to mention vector<bool>::iterator? As-is this sounds like we're optimizing copy for any random_access_iterator combination, which clearly isn't the case.

Yeah, I've mentioned explicitly in the title that the optimization works specifically for copying from random_access_iterator to vector<bool>::iterator.

@winner245 winner245 force-pushed the speed-up-range-function branch from cb0d366 to e86e27e Compare June 28, 2025 22:05
@winner245 winner245 force-pushed the speed-up-range-function branch from e86e27e to a53a235 Compare June 28, 2025 22:19
Labels
libc++ libc++ C++ Standard Library. Not GNU libstdc++. Not libc++abi. performance