Description
The stateupdater for the neurons in our `COBAHHUncoupled` benchmark (in single precision) seems to have been simulated with only 640 threads per block even though it could use 1024 threads per block... This leads to a theoretical occupancy of only 62.5%.
We choose the number of threads based on `cudaOccupancyMaxPotentialBlockSize`, which happens here. The documentation says:
Returns grid and block size that achieves maximum potential occupancy for a device function.
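For context, this is roughly how that API is meant to be used to pick a launch configuration; a minimal sketch with a made-up kernel and sizes, not the generated Brian2CUDA code:

```cuda
// Minimal sketch (not the Brian2CUDA code) of using cudaOccupancyMaxPotentialBlockSize
// to choose a launch configuration. Kernel name and sizes are made up.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stateupdater_kernel(int n, float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;  // placeholder for the actual state update
}

int main()
{
    int min_grid_size = 0, block_size = 0;
    // Ask the runtime for the block size that maximizes occupancy for this kernel
    // (0 bytes of dynamic shared memory, no block size limit).
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       stateupdater_kernel, 0, 0);

    const int n = 5000;  // e.g. number of neurons
    const int grid_size = (n + block_size - 1) / block_size;

    float *v;
    cudaMalloc(&v, n * sizeof(float));
    stateupdater_kernel<<<grid_size, block_size>>>(n, v);
    cudaDeviceSynchronize();
    cudaFree(v);

    printf("block size: %d, grid size: %d (min grid for full occupancy: %d)\n",
           block_size, grid_size, min_grid_size);
    return 0;
}
```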
But in this blog post it sounds like the number of threads is based on some heuristics and is not best suited for performance-critical kernels:
cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust), that must launch user-defined kernels. This is also handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.
Until now, I hadn't seen `cudaOccupancyMaxPotentialBlockSize` return a configuration that didn't give optimal occupancy. And I checked the logs for all other benchmarks, which seem to run the maximal number of threads per block where possible. The only occasion where the number of threads is lower is when the kernel hits hardware limits (e.g. for COBAHH and Mushroom body in double precision). But that is not the case in single precision, where < 64 registers per thread are required. From my MX150 (same on A100):
```
|| INFO _run_kernel_neurongroup_stateupdater_codeobject
|| 7 blocks
|| 1024 threads
|| 44 registers per thread
|| 0 bytes statically-allocated shared memory per block
|| 0 bytes local memory per thread
|| 1512 bytes user-allocated constant memory
|| 0.625 theoretical occupancy
```
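For reference, the theoretical occupancy for a given kernel and block size can be checked with `cudaOccupancyMaxActiveBlocksPerMultiprocessor`; a minimal sketch with a stand-in kernel, not the Brian2CUDA occupancy code:

```cuda
// Sketch: query the theoretical occupancy of a kernel for a given block size.
// The kernel is a stand-in; register usage of the real stateupdater will differ.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void stateupdater_kernel(int n, float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    const int block_size = 1024;  // try e.g. 640 vs. 1024
    int max_blocks_per_sm = 0;
    // How many blocks of this size fit on one SM, given the kernel's
    // register/shared-memory usage (0 bytes of dynamic shared memory here).
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&max_blocks_per_sm,
                                                  stateupdater_kernel,
                                                  block_size, 0);

    const float occupancy = (max_blocks_per_sm * block_size)
                            / float(prop.maxThreadsPerMultiProcessor);
    printf("theoretical occupancy at %d threads/block: %.3f\n",
           block_size, occupancy);
    return 0;
}
```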
This needs some digging into. I think we could just get rid of `cudaOccupancyMaxPotentialBlockSize` altogether? The only reason to keep it would be if, for very small networks, it were more efficient to launch more, smaller blocks instead of a few larger ones. I'm not sure if that is ever the case? Only if scheduler overheads prefer many small blocks over a few large ones?
For now, I will add a preference to manually override the thread number given by `cudaOccupancyMaxPotentialBlockSize` and rerun the `COBAHHUncoupled` benchmark.
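A rough sketch of what that override could look like at the launch site; the preference semantics and names here are assumptions, not the existing Brian2CUDA API:

```cuda
// Hypothetical launch helper: use a manually preferred block size if one is set,
// otherwise fall back to cudaOccupancyMaxPotentialBlockSize.
#include <cuda_runtime.h>

__global__ void stateupdater_kernel(int n, float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;  // placeholder state update
}

// preferred_block_size <= 0 means "no manual override" (assumed preference semantics)
void launch_stateupdater(int n, float *v, int preferred_block_size)
{
    int block_size = preferred_block_size;
    if (block_size <= 0)
    {
        int min_grid_size = 0;
        cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                           stateupdater_kernel, 0, 0);
    }
    const int grid_size = (n + block_size - 1) / block_size;
    stateupdater_kernel<<<grid_size, block_size>>>(n, v);
}
```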
When fixing this, also do:
- Just found a related issue: Investigate occupancy limitation / calculation on MX150 GPU (#208).
- Rename `min_num_threads` to `min_num_blocks` (the current name is wrong).
- I seem to have changed the " registers per thread" into " registers per block" in the kernel information in a recent commit. That is wrong.
- Check if this issue is fixed and, if so, close it: Invalid argument error in occupancy calculation when num_threads is zero (#123).
- There is a branch where I wanted to fix some occupancy-related thing; check that out: `fix-occupancy-calc`.