Description
The stateupdater for the neurons in our `COBAHHUncoupled` benchmark (in single precision) seems to have been simulated with only 640 threads per block even though it could use 1024 threads per block... This leads to a theoretical occupancy of only 62.5%.
We choose the number of threads based on `cudaOccupancyMaxPotentialBlockSize`, which happens here. The documentation says:
Returns grid and block size that achieves maximum potential occupancy for a device function.
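For context, this is roughly how that API is meant to be used to pick a launch configuration; a minimal sketch with a made-up kernel and sizes, not the generated Brian2CUDA code:

```cuda
// Minimal sketch (not the Brian2CUDA code) of using cudaOccupancyMaxPotentialBlockSize
// to choose a launch configuration. Kernel name and sizes are made up.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stateupdater_kernel(int n, float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;  // placeholder for the actual state update
}

int main()
{
    int min_grid_size = 0, block_size = 0;
    // Ask the runtime for the block size that maximizes occupancy for this kernel
    // (0 bytes of dynamic shared memory, no block size limit).
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       stateupdater_kernel, 0, 0);

    const int n = 5000;  // e.g. number of neurons
    const int grid_size = (n + block_size - 1) / block_size;

    float *v;
    cudaMalloc(&v, n * sizeof(float));
    stateupdater_kernel<<<grid_size, block_size>>>(n, v);
    cudaDeviceSynchronize();
    cudaFree(v);

    printf("block size: %d, grid size: %d (min grid for full occupancy: %d)\n",
           block_size, grid_size, min_grid_size);
    return 0;
}
```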
But in this blog post it sounds like the number of threads is based on some heuristics and is not best suited for performance-critical kernels:
cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust), that must launch user-defined kernels. This is also handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.
Until now, I hadn't seen `cudaOccupancyMaxPotentialBlockSize` return a configuration that didn't give optimal occupancy. And I checked the logs for all other benchmarks, which seem to run the maximal number of threads per block where possible. The only occasion where the number of threads is lower is when the kernel hits hardware limits (e.g. for COBAHH and Mushroom body in double precision). But that is not the case in single precision, where < 64 registers per thread are required. From my MX150 (same on A100):
```
|| INFO _run_kernel_neurongroup_stateupdater_codeobject
|| 7 blocks
|| 1024 threads
|| 44 registers per thread
|| 0 bytes statically-allocated shared memory per block
|| 0 bytes local memory per thread
|| 1512 bytes user-allocated constant memory
|| 0.625 theoretical occupancy
```
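For reference, the theoretical occupancy for a given kernel and block size can be checked with `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; a minimal sketch with a stand-in kernel, not the Brian2CUDA occupancy code:

```cuda
// Sketch: query the theoretical occupancy of a kernel for a given block size.
// The kernel is a stand-in; register usage of the real stateupdater will differ.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stateupdater_kernel(int n, float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int block_size = 1024;  // try e.g. 640 vs. 1024
    int max_blocks_per_sm = 0;
    // How many blocks of this size fit on one SM, given the kernel's
    // register/shared-memory usage (0 bytes of dynamic shared memory here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                                  stateupdater_kernel,
                                                  block_size, 0);

    const float occupancy = (max_blocks_per_sm * block_size)
                            / float(prop.maxThreadsPerMultiProcessor);
    printf("theoretical occupancy at %d threads/block: %.3f\n",
           block_size, occupancy);
    return 0;
}
```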
This needs some digging into. I think we could just get rid of `cudaOccupancyMaxPotentialBlockSize` altogether? The only reason to keep it would be if, for very small networks, it were more efficient to launch more, smaller blocks instead of a few larger ones. I'm not sure if that is ever the case? Only if scheduler overheads prefer many small blocks over a few large ones?
For now, I will add a preference to manually override the thread number given by `cudaOccupancyMaxPotentialBlockSize` and rerun the `COBAHHUncoupled` benchmark.
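A rough sketch of what that override could look like at the launch site; the preference semantics and names here are assumptions, not the existing Brian2CUDA API:

```cuda
// Hypothetical launch helper: use a manually preferred block size if one is set,
// otherwise fall back to cudaOccupancyMaxPotentialBlockSize.
#include <cuda_runtime.h>

__global__ void stateupdater_kernel(int n, float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;  // placeholder state update
}

// preferred_block_size <= 0 means "no manual override" (assumed preference semantics)
void launch_stateupdater(int n, float *v, int preferred_block_size)
{
    int block_size = preferred_block_size;
    if (block_size <= 0)
    {
        int min_grid_size = 0;
        cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                           stateupdater_kernel, 0, 0);
    }
    const int grid_size = (n + block_size - 1) / block_size;
    stateupdater_kernel<<<grid_size, block_size>>>(n, v);
}
```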
When fixing this, also do:
- Just found a related issue: Investigate occupancy limitation / calculation on MX150 GPU (#208).
- Rename `min_num_threads` to `min_num_blocks` (the current name is wrong).
- I seem to have changed the " registers per thread" into " registers per block" in the kernel information in a recent commit. That is wrong.
- Check if this issue is fixed and, if so, close it: Invalid argument error in occupancy calculation when num_threads is zero (#123).
- There is a branch where I wanted to fix some occupancy-related thing; check that out: `fix-occupancy-calc`.