Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gh-131269: Avoid binding functions in random.py #131270

Merged
merged 2 commits into from
Mar 20, 2025

Conversation

colesbury
Copy link
Contributor

@colesbury colesbury commented Mar 15, 2025

This speeds up calls like random.randint() by about 10% and also avoids some scaling bottlenecks in the free threading build.

This speeds up calls like `random.randint()` by about 10%.
Copy link
Member

@picnixz picnixz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are also some other places where we did from os import urandom as _urandom and used _urandom afterwards. Should this also be changed or is this is still a real optimisation? (namely, replace _urandom by os.urandom?)

@colesbury
Copy link
Contributor Author

The from os import urandom as _urandom; _urandom() pattern may still be slightly faster than calls like os.urandom(). The difference is small enough that I would not use that pattern in new code, but I'm not sure it's worth changing existing code.

@rhettinger rhettinger self-assigned this Mar 18, 2025
@rhettinger
Copy link
Contributor

rhettinger commented Mar 18, 2025

When I last looked at this a few months ago, the speed-up was questionable (it varied quite a bit across builds, compilers, and operating systems). The interpreter implementation is in rapid flux. For a long time, the pre-binding was faster. Then the LOAD_ATTR optimizations pulled ahead, but there is no reason think that advantage will persist. (We has a similar situation where LOAD_GLOBAL briefly became as fast as LOAD_LOCAL which seemed to invalidate previous efforts to use local variables). I would rather not "chase the interpreter" into a local minimum. and keep the current logic intact until the interpreter stablizes. Experience PyPy showing that prebinding could maintain its advantage even with JITted code.

Ultimately, I expect that the current code would win out because it hoists some of the lookup logic outside of the loop. Right now, it adds a STORE_FAST opcode but the cost of that will drop to almost free as the impending JIT work moves forward.

Can you elaborate more on the the "free-threading bottlenecks"? When I last looked at this while working on KDE for the statistics module, I found that prebinding wasn't the problem. Instead, what was needed was separate instances of random.Random() so that there was no shared state.

Addenda: IIRC this was also discussed in the forums this year and Guido was against sweeping through and replacing bound methods in existing stable code.

Maintainer's note: I personally find the calls to random() to be more readable than self.random() sprinkled in the middle of complicated formulas. Unless this edit must be made, I greatly prefer the current code some of which I have recently written.

@rhettinger rhettinger requested a review from tim-one March 18, 2025 01:09
@colesbury
Copy link
Contributor Author

Hi @rhettinger - here's an example that demonstrates the free threading bottleneck, adapted from Paul Moore's program: montecarlo.py. It's nearly twice as fast with this change. It already uses separate instances of random.Random().

The underlying bottleneck is due to two issues:

  • The expression getrandbits = self.getrandbits creates a new method object that increases the reference count of the shared getrandbits function. Even though the random.Random instances are separate, the function is global. The self.getrandbits() call avoids creating the temporary method object.
  • The _PyType_LookupRef call in LOAD_ATTR also temporarily increases the reference count of the shared function. The specialized LOAD_ATTR_METHOD avoids this.

I think it's possible to change the implementation to avoid both of these issues, but that will be complicated and require new techniques.

The single threaded improvement is mostly related to avoiding the creation of the temporary method object and the bookkeeping that involves. I've seen an improvement for Python from to 3.9 to 3.14.

I agree that with a sufficiently advanced JIT the existing code may perform the same as the existing code, but I don't think we're there yet. I also agree that we don't want to make sweeping changes across the code base. I think this change is pretty small and targeted and the new code is still idiomatic Python.

@rhettinger
Copy link
Contributor

rhettinger commented Mar 18, 2025

How about making only the edit to randbelow methods (as proposed in the issue) and not sweeping through the rest of the module? I'm really uncomfortable with the latter. Unless there is compelling urgency for making all of these edits, we can discuss those at the sprints.

Also ISTM the monte carlo benchmark is a toy example and we shouldn't tune to it. Benchmark chasing is rarely advisable and even more so in a time where the interpreter implementation undergoing so many changes.

@colesbury
Copy link
Contributor Author

I've limited the changes to _randbelow_with_getrandbits

@rhettinger rhettinger merged commit 844765b into python:main Mar 20, 2025
38 checks passed
@colesbury colesbury deleted the gh-131269-random branch March 27, 2025 20:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants