Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

improve performance of span_iterator w/ clang #1166

Merged
merged 3 commits into from
Nov 12, 2024

Conversation

carsonRadtke
Copy link
Collaborator

Issue: #1165

Before this PR, the range-for loop was ~3300x slower. After this PR, it is ~1.005x slower

The clang optimizer is very good at optimizing current != end, so we changed to this idiom. This moves the Expects assertion into the constructor instead of on the hot-path which is called whenever either operator++ or operator* is called.

Note: The codegen for the assertion is still a missed optimization, but less worrisome as it only happens once per iterator.

@carsonRadtke
Copy link
Collaborator Author

carsonRadtke commented Nov 9, 2024

Benchmark details:

  • M1 Macbook Pro
  • Apple Clang 16.0.0
  • span over std::vector<int>(10000)

Benchmark results:

#include <gsl/gsl>
#include <span>

template <typename T>
void foo(const T &sp) {
    for (const auto &x : sp) {
    }
}

auto gsl_foo = foo<gsl::span<int>>;
auto std_foo = foo<std::span<int>>;

Before:

Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 2.00, 1.97, 1.92
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_gsl_foo       3147 ns         3146 ns       220999
BM_std_foo      0.941 ns        0.941 ns    743739309

After:

Run on (8 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x8)
Load Average: 1.99, 1.97, 1.92
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
BM_gsl_foo      0.955 ns        0.953 ns    724442697
BM_std_foo      0.949 ns        0.948 ns    742548000

Issue: microsoft#1165

Before this PR, the range-for loop was ~3300x slower. After this PR, it
is ~1.005x slower

The clang optimizer is very good at optimizing `current != end`, so
we changed to this idiom. This moves the Expects assertion into the
constructor instead of on the hot-path which is called whenever either
operator++ or operator* is called.

Note: The codegen for the assertion is still a missed optimization,
but less worrisome as it only happens once per iterator.

Note: benchmarks on M1 Macbook Pro w/ Apple Clang 16.0.0
@carsonRadtke carsonRadtke marked this pull request as ready for review November 11, 2024 20:22
@carsonRadtke
Copy link
Collaborator Author

Benchmark CPU Before (ns) CPU After (ns) Ratio
BM_IsSortedStdSpan 325 323 1.0
BM_IsSortedGslSpan 1573 322 4.9
BM_IsSortedRangesStdSpan 325 323 1.0
BM_IsSortedRangesGslSpan 1583 475 3.3
BM_IsSortedCustomStdSpan 442 435 1.0
BM_IsSortedCustomGslSpan 1420 629 2.3
BM_MinElementAlgorithmStdSpan 637 631 1.0
BM_MinElementAlgorithmGslSpan 1913 1852 1.0
BM_MinElementRangeForStdSpan 33.8 33.6 1.0
BM_MinElementRangeForGslSpan 635 33.2 19.1

More benchmarks (thanks to @galenelias). I'd like to see an improvement in std::min_element, but I will open an issue for it instead of blocking this PR.

It would also be nice to have performance regression tests (see #1165 (comment)). I will open an issue for this as well.

@carsonRadtke carsonRadtke merged commit f8ec309 into microsoft:main Nov 12, 2024
95 checks passed
@carsonRadtke carsonRadtke deleted the clang-perf branch November 12, 2024 21:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant