Block dimension calculation can lead to non-optimal occupancy #266
It appears that the number of threads we pick does not always give optimal occupancy, see #266
Alright, I think I got it. There are three hardware limits for the execution of blocks on SMs: the maximum number of resident threads per SM (2048 here), the maximum number of resident blocks per SM, and the maximum number of registers per SM (65536 here).

That means in order to run 2048 threads in two blocks concurrently on a single SM, the kernel can use at most 32 registers per thread (32 registers * 2048 threads = 65536 registers). I was thinking of this limit on a per-block basis, assuming that up to 64 registers would be fine. But it is per SM! In the example above, the stateupdater needs 44 registers, so two full 1024-thread blocks cannot be resident at the same time.

I'll leave this issue open for the other points I marked above. Also, we should investigate how the register usage depends on the neuron model definition. Because the mushroom body benchmark, which has a very similar model as the COBAHH benchmark, requires only 32 registers on the A100 GPU, allowing it to reach 100% occupancy.
The stateupdater for the neurons in our
COBAHHUncoupled
benchmark (in single precision) seems to have been simulated with only 640 threads per block even though it could use 1024 threads per block... This leads to a theoretical occupancy of only 62.5%.

We choose the number of threads based on
cudaOccupancyMaxPotentialBlockSize
, which happens here. The documentation says:

But in this blog post it sounds like the number of threads is based on some heuristics and is not best suited for performance-critical kernels:
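For reference, the call in question looks roughly like this (a sketch, not the actual brian2cuda code; the kernel name is hypothetical):

```cuda
#include <cuda_runtime.h>

__global__ void stateupdater_kernel(/* ... */) { /* ... */ }

int main() {
    int minGridSize = 0, blockSize = 0;
    // Returns the block size the heuristic expects to maximize occupancy,
    // and the minimum grid size needed to reach that occupancy.
    cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize,
                                       stateupdater_kernel,
                                       0 /* dynamic smem */,
                                       0 /* no block size limit */);
    return 0;
}
```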
Until now, I didn't see
cudaOccupancyMaxPotentialBlockSize
return a configuration that didn't give optimal occupancy. And I checked the logs for all other benchmarks, which seem to run the maximal number of threads per block where possible. The only occasion where the number of threads is lower is when hardware limits are reached by the kernel (e.g. for COBAHH and Mushroom body in double precision). But that is not the case in single precision, where < 64 registers per thread are required. From my MX150 (same on A100):

This needs some digging into. I think we could just get rid of
cudaOccupancyMaxPotentialBlockSize
altogether? The only reason to keep it would be if, for very small networks, it were more efficient to launch more small blocks instead of a few larger ones. I'm not sure if that is ever the case? Only if scheduler overheads prefer many small kernels over a few large ones?

For now I will add a preference to manually overwrite the thread number given by
cudaOccupancyMaxPotentialBlockSize
and rerun the
COBAHHUncoupled
benchmark.

When fixing this, also do:
Rename
min_num_threads
to
min_num_blocks
(it is wrong).