Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gsl::span performance is a bottleneck for adoption #1165

Open
tiagomacarios opened this issue Nov 8, 2024 · 4 comments
Open

gsl::span performance is a bottleneck for adoption #1165

tiagomacarios opened this issue Nov 8, 2024 · 4 comments
Assignees

Comments

@tiagomacarios
Copy link
Member

Office is trying to adopt gsl::span more broadly, but we are seeing some unexpected performance regressions when compiling with clang:
A - range-for loops
B - std algorithms

On a mail thread @StephanTLavavej suggested:

I'd recommend reaching out to the libc++ devs on either the LLVM Discord server or their Discourse forum, about whether they have an "unwrapping" system similar to what MSVC's STL has. (Presumably you'd want to do the same for libstdc++.)

We use libc++'s portable subset of tests, and we're aligned on a few implementation things (backporting import std; to C++20, names of compiler builtins, etc.), but our internal mechanisms are wildly different - there is no commonality between our unwrapping mechanisms.

In C++20 we can detect arbitrary contiguous iterators, which lets us extract raw pointers, but in C++14/17 that's not physically possible so iterators really need to opt in to unwrapping.

Could one of the maintainers please follow up?

@carsonRadtke
Copy link
Collaborator

@tiagomacarios Thank you for reporting this issue.

After a brief look into libcxx's std::span, it looks like they call "special" iterators when Safe Buffers are enabled - these seem to make std::span analogous to gsl::span, but the bounds checking gets optimized away with -O2. I will start a discussion with clang folks to determine their thoughts on the matter and see if there is a protocol that gsl::span can use.

In the meantime, could you share what version of clang you are using?

@carsonRadtke carsonRadtke self-assigned this Nov 8, 2024
@tiagomacarios
Copy link
Member Author

We use 3 different version of clang:

We usually update compilers as soon as new LTS versions are available.

carsonRadtke added a commit to carsonRadtke/GSL that referenced this issue Nov 9, 2024
Issue: microsoft#1165

Before this PR, the range-for loop was ~3300x slower. After this PR, it
is ~1.005x slower

The clang optimizer is very good at optimizing `current != end`, so
we changed to this idiom. This moves the Expects assertion into the
constructor instead of on the hot-path which is called whenever either
operator++ or operator* is called.

Note: The codegen for the assertion is still a missed optimization,
but less worrisome as it only happens once per iterator.

Note: benchmarks on M1 Macbook Pro w/ Apple Clang 16.0.0
carsonRadtke added a commit to carsonRadtke/GSL that referenced this issue Nov 9, 2024
Issue: microsoft#1165

Before this PR, the range-for loop was ~3300x slower. After this PR, it
is ~1.005x slower

The clang optimizer is very good at optimizing `current != end`, so
we changed to this idiom. This moves the Expects assertion into the
constructor instead of on the hot-path which is called whenever either
operator++ or operator* is called.

Note: The codegen for the assertion is still a missed optimization,
but less worrisome as it only happens once per iterator.

Note: benchmarks on M1 Macbook Pro w/ Apple Clang 16.0.0
@tiagomacarios
Copy link
Member Author

From the mail thread I see that this has been fixed. Would you be against having performance tests? I wonder if we could use @galenelias tests and have the perf tests fail if they are not within a threshold from each other.

@carsonRadtke
Copy link
Collaborator

From the mail thread I see that this has been fixed. Would you be against having performance tests? I wonder if we could use @galenelias tests and have the perf tests fail if they are not within a threshold from each other.

Performance tests would be great. I'd be happy to either review a PR with Galen's benchmark or I can create the PR if that is more convenient. I'd rather not add a new dimension to the test matrix because we are already pushing 100 runners per PR commit. If we could add one test that compares to the latest standard library implementation for each Clang, GCC, and MSVC that'd be my preference.

carsonRadtke added a commit that referenced this issue Nov 12, 2024
* improve performance of span_iterator w/ clang

Issue: #1165

Before this PR, the range-for loop was ~3300x slower. After this PR, it
is ~1.005x slower

The clang optimizer is very good at optimizing `current != end`, so
we changed to this idiom. This moves the Expects assertion into the
constructor instead of on the hot-path which is called whenever either
operator++ or operator* is called.

Note: The codegen for the assertion is still a missed optimization,
but less worrisome as it only happens once per iterator.

Note: benchmarks on M1 Macbook Pro w/ Apple Clang 16.0.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants