
Block dimension calculation can lead to non-optimal occupancy #266

Open
@denisalevi

The stateupdater for the neurons in our COBAHHUncoupled benchmark (in single precision) seems to have been simulated with only 640 threads per block even though it could use 1024 threads per block. This results in a theoretical occupancy of only 62.5%.

We choose the number of threads based on cudaOccupancyMaxPotentialBlockSize, which happens here. The documentation says:

Returns grid and block size that achieves maximum potential occupancy for a device function.
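For reference, here is a minimal sketch of how a launch configuration is typically derived from this API; the kernel below is a placeholder, not the actual generated Brian2CUDA stateupdater:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the generated stateupdater.
__global__ void stateupdater_kernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;
}

int main()
{
    int min_grid_size = 0;  // smallest grid that can fully occupy the device
    int block_size = 0;     // block size suggested by the occupancy API
    cudaOccupancyMaxPotentialBlockSize(&min_grid_size, &block_size,
                                       stateupdater_kernel, 0, 0);

    int n = 100000;  // e.g. number of neurons
    int grid_size = (n + block_size - 1) / block_size;  // round up
    printf("suggested block size: %d, grid size: %d\n", block_size, grid_size);
    return 0;
}
```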

But in this blog post it sounds like the number of threads is based on some heuristics and is not best suited for performance-critical kernels:

cudaOccupancyMaxPotentialBlockSize makes it possible to compute a reasonably efficient execution configuration for a kernel without having to directly query the kernel’s attributes or the device properties, regardless of what device is present or any compilation details. This can greatly simplify the task of frameworks (such as Thrust), that must launch user-defined kernels. This is also handy for kernels that are not primary performance bottlenecks, where the programmer just wants a simple way to run the kernel with correct results, rather than hand-tuning the execution configuration.

Until now, I hadn't seen cudaOccupancyMaxPotentialBlockSize return a configuration that didn't give optimal occupancy. And I checked the logs for all other benchmarks, which seem to run the maximal number of threads per block where possible. The only occasions where the number of threads is lower are when the kernel hits hardware limits (e.g. for COBAHH and Mushroom body in double precision). But that is not the case in single precision, where < 64 registers per thread are required. From my MX150 (same on A100):

|| INFO _run_kernel_neurongroup_stateupdater_codeobject
|| 	7 blocks
|| 	1024 threads
|| 	44 registers per thread
|| 	0 bytes statically-allocated shared memory per block
|| 	0 bytes local memory per thread
|| 	1512 bytes user-allocated constant memory
|| 	0.625 theoretical occupancy
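As a sanity check, the reported theoretical occupancy can be recomputed for a given block size via cudaOccupancyMaxActiveBlocksPerMultiprocessor. A hedged sketch (again with a placeholder kernel; the real number depends on the actual kernel's 44 registers per thread):

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel; the occupancy result depends on the register and
// shared-memory usage of the real generated stateupdater.
__global__ void stateupdater_kernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int block_size = 1024;
    int max_blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks_per_sm, stateupdater_kernel, block_size, 0);

    // Theoretical occupancy = resident threads / max resident threads per SM.
    double occupancy = (double)(max_blocks_per_sm * block_size)
                       / prop.maxThreadsPerMultiProcessor;
    printf("theoretical occupancy at %d threads/block: %.3f\n",
           block_size, occupancy);
    return 0;
}
```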

This needs some digging into. I think we could just get rid of cudaOccupancyMaxPotentialBlockSize altogether? The only reason to keep it would be if, for very small networks, it were more efficient to launch many small blocks instead of a few larger ones. I'm not sure if that is ever the case? Only if scheduler overheads favor many small blocks over few large ones? A sketch of the alternative is below.
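A possible replacement, assuming we drop the occupancy API entirely: query the kernel's own hardware limit via cudaFuncGetAttributes and launch the largest block it supports, capped by the problem size (sketch with a placeholder kernel):

```cpp
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for the generated stateupdater.
__global__ void stateupdater_kernel(float *v, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] += 1.0f;
}

int main()
{
    // maxThreadsPerBlock here already accounts for the kernel's register and
    // shared-memory usage, so hardware limits are respected automatically.
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, stateupdater_kernel);

    int n = 100000;  // e.g. number of neurons
    int block_size = std::min(attr.maxThreadsPerBlock, n);
    int grid_size = (n + block_size - 1) / block_size;  // round up
    printf("block size: %d, grid size: %d\n", block_size, grid_size);
    return 0;
}
```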

For now I will add a preference to manually override the thread number returned by cudaOccupancyMaxPotentialBlockSize and rerun the COBAHHUncoupled benchmark.

When fixing this, also do:
