@jataylo commented Dec 30, 2025

From pytorch#168073

It has been observed that heavily contended atomics lead to poor performance.

To address this while minimizing kernel overhead, this PR proposes an FX pass that replaces the index_put operation with an alternative, partitioned scatter approach.

Algorithm:

  1. Enumerate scatter operations: operation_id = [0, 1, 2, ..., N-1]
  2. Assign to partitions: partition_id = operation_id % num_partitions
  3. Create expanded buffers along scatter_dim: size = num_partitions × dim_size
  4. Adjust indices: adjusted_idx = original_idx + (partition_id × dim_size)
  5. Perform partitioned scatter with reduced contention
  6. Reduce across partitions: sum(partitions, dim=scatter_dim)
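
A minimal eager-mode sketch of this transformation (assuming a single index_add-style scatter along dim 0; `partitioned_index_add` and its arguments are illustrative, not the pass's actual implementation):

```python
import torch

def partitioned_index_add(out, indices, values, num_partitions):
    # Equivalent to out.index_add_(0, indices, values), but the scatter is
    # spread across num_partitions copies of the output so atomics only
    # collide within a partition, then summed back down (steps 1-6 above).
    dim_size = out.shape[0]

    # Steps 1-2: enumerate scatter operations and assign each to a partition.
    op_id = torch.arange(indices.numel(), device=indices.device)
    partition_id = op_id % num_partitions

    # Step 3: expanded buffer along the scatter dim (num_partitions * dim_size rows).
    expanded = out.new_zeros((num_partitions * dim_size,) + tuple(out.shape[1:]))

    # Step 4: shift each index into its partition's slice of the expanded buffer.
    adjusted_idx = indices + partition_id * dim_size

    # Step 5: partitioned scatter with reduced contention.
    expanded.index_add_(0, adjusted_idx, values)

    # Step 6: reduce across partitions and accumulate into the real output.
    out += expanded.view(num_partitions, dim_size, *out.shape[1:]).sum(dim=0)
    return out
```

Up to floating-point summation order this matches `out.index_add_(0, indices, values)`; the point is that concurrent atomic adds to the same output row are now split across `num_partitions` rows of the expanded buffer.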

This reduces atomic contention at the cost of extra memory. To limit that cost, we have built heuristics around the total number of partitions for the expanded buffer and cap how large the expanded tensors can be (currently 10% of GPU memory).

Note that the heuristic cannot be perfect, as the actual index data is unknown at compile time. In real-world models the indices contain duplicates and are not uniformly distributed, which increases atomic contention; this cannot currently be modelled, so we estimate contention from the input and output buffer sizes.
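
For illustration only, a hypothetical version of the memory cap could look like the following (the real heuristic in the pass also estimates contention from the buffer sizes and is not reproduced here; `cap_num_partitions` is an illustrative name):

```python
import torch

def cap_num_partitions(out, desired_partitions, budget_frac=0.10):
    # Hypothetical helper: shrink the partition count so the expanded buffer
    # (num_partitions copies of `out` along the scatter dim) stays under
    # budget_frac of total GPU memory, mirroring the 10% cap described above.
    total_mem = torch.cuda.get_device_properties(out.device).total_memory
    bytes_per_copy = max(out.numel() * out.element_size(), 1)
    max_partitions = max(1, int(budget_frac * total_mem) // bytes_per_copy)
    return min(desired_partitions, max_partitions)
```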

Benchmark code: https://gist.github.com/jataylo/dd3a6353ad2859efd65fa87b28aa3ebd
This code executes 3 index_add ops into 3 separate buffers.
N = 1000000
D = 100
n = 501

values = float32 [N,D]
indices = int64 [N]
output = float32 [n, D]

For each run we modify the range of randint to simulate various levels of atomic contention (a narrower range produces more duplicate indices, and therefore more contention).
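
The gist linked above contains the full script; a rough sketch of the setup it describes (an approximation from the description, not the gist verbatim; the randint upper bound `hi` is what each table row varies, and the partitioned_scatter_enabled toggle lives in the Inductor config, whose exact path is not shown here):

```python
import torch

N, D, n = 1_000_000, 100, 501
hi = 7  # "uniform_range 0-7": smaller range -> more duplicate indices -> more contention

def bench_fn(values, indices, outs):
    # Three index_add ops into three separate output buffers.
    for out in outs:
        out.index_add_(0, indices, values)
    return outs

values = torch.randn(N, D, device="cuda")
indices = torch.randint(0, hi + 1, (N,), device="cuda")
outs = [torch.zeros(n, D, device="cuda") for _ in range(3)]

compiled = torch.compile(bench_fn)
compiled(values, indices, outs)  # timed once with the pass enabled, once disabled
```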

We gathered two sets of results: one with partitioned_scatter_enabled=True, the other with partitioned_scatter_enabled=False.

MI300

| uniform_range | no_compile_ms | compile_ms (partitioned_scatter_enabled=False) | compile_ms (partitioned_scatter_enabled=True) | speedup |
|---|---|---|---|---|
| 0-0 | 85.52 | 28.50 | 3.55 | 8.03 |
| 0-1 | 46.99 | 15.66 | 2.47 | 6.33 |
| 0-3 | 25.16 | 8.31 | 2.20 | 3.78 |
| 0-7 | 12.92 | 4.32 | 1.63 | 2.66 |
| 0-15 | 6.66 | 4.24 | 1.60 | 2.66 |
| 0-31 | 3.43 | 3.19 | 1.33 | 2.40 |
| 0-63 | 1.79 | 1.62 | 1.32 | 1.23 |
| 0-127 | 1.76 | 1.59 | 1.24 | 1.28 |
| 0-255 | 1.73 | 1.32 | 1.24 | 1.07 |
| 0-500 | 1.61 | 1.27 | 1.23 | 1.04 |

H100

| uniform_range | no_compile_ms | compile_ms (partitioned_scatter_enabled=False) | compile_ms (partitioned_scatter_enabled=True) |
|---|---|---|---|
| 0-0 | 19.842156 | 5.504691 | 0.756135 |
| 0-1 | 19.516249 | 5.526914 | 0.779779 |
| 0-3 | 10.450396 | 3.079321 | 0.702327 |
| 0-7 | 5.417728 | 1.906823 | 0.683553 |
| 0-15 | 3.545023 | 1.570733 | 0.655154 |
| 0-31 | 2.631531 | 1.223358 | 0.641384 |
| 0-63 | 2.629384 | 0.832651 | 0.63534 |
| 0-127 | 2.629735 | 0.726054 | 0.768287 |
| 0-255 | 2.62846 | 0.625181 | 1.317291 |
| 0-500 | 2.629526 | 0.548981 | 1.724292 |

This could potentially benefit H100 on worst-case examples but would degrade performance in the best case; the atomic-add cost on MI300 is heavier, which makes the approach more beneficial there.

On MI300 we see a mixed bag of end-to-end model results:
https://hud.pytorch.org/benchmark/v3/dashboard/compiler_inductor?renderGroupId=main&time.start=2025-11-05T00%3A00%3A00.000Z&time.end=2025-12-04T02%3A00%3A00.000Z&filters.repo=pytorch%2Fpytorch&filters.benchmarkName=compiler&filters.mode=training&filters.dtype=amp&filters.deviceName=rocm+%28mi300x%29&filters.device=rocm&filters.suite=all&filters.compiler=default&lcommit.commit=38c42c575d342a7ea6f4a555bf845071e03b5f35&lcommit.workflow_id=19635538449&lcommit.date=2025-11-24T14%3A00%3A00Z&lcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rcommit.commit=fedb7f15d177a259bf25c94e888137e0a9a69a81&rcommit.workflow_id=19856622912&rcommit.date=2025-12-02T12%3A00%3A00Z&rcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&lbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&maxSampling=110

Due to the mixed results we will initially ship this as a non-default feature, but CI passed with it enabled here:
https://hud.pytorch.org/pytorch/pytorch/pull/168073?sha=fedb7f15d177a259bf25c94e888137e0a9a69a81

Note there are improvements to make after this lands:

  1. Add dynamic shape support; this needs to be conservative to avoid exploding memory usage.
  2. Update the IR and codegen directly to avoid the iota op and the need to adjust indices via torch ops; we can likely do this in the store codegen itself.
  3. Develop new implementations for memory-constrained environments.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @dllehr-amd @chenyang78

jithunnair-amd and others added 15 commits December 17, 2025 03:11
(cherry picked from commit e294d4d with
modifications for release/2.8)

Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date

(cherry picked from commit 71a30ea)
…for py3.9;

upgrade tensorboard to be compatible with numpy 2

Co-authored-by: Ethan Wee <[email protected]>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
(cherry picked from commit 4b46310)
(cherry picked from commit ba1ba26)
(cherry picked from commit 4e3462e)
(cherry picked from commit 85ac538)
This PR fixes the unit test,

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED
[0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on gfx1101 arch.

This error is caused by an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which leaves a much higher
torch.cuda.max_memory_reserved() behind when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction
runs afterward. To avoid this we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA state.
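
The fix boils down to resetting the allocator state at the start of the test, e.g.:

```python
import torch

# Drop cached blocks and peak-memory statistics left over from earlier tests,
# so max_memory_reserved() only reflects allocations made by this test.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
```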

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…ersistent reduction and no_x_dim removal (ROCm#2454)

Cherry-pick of ROCm#2417
Need to resolve conflicts

---------

Co-authored-by: Jack Taylor <[email protected]>
(cherry picked from commit eb47158)
These changes are currently in the process of being upstreamed. Bringing them into
release 2.9 for a customer model perf improvement.

---------

Co-authored-by: Nichols A. Romero <[email protected]>
Co-authored-by: Sampsa Riikonen <[email protected]>
Co-authored-by: Nichols A. Romero <[email protected]>
Co-authored-by: AmdSampsa <[email protected]>
…m#2742)

hipblaslt should provide better performance in general
…d_memory_with_allocator (ROCm#2811)

Use try/finally block. This follows a similar pattern elsewhere in
test_cuda.py.

Fixes #ROCm/TheRock#2118.
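
A generic sketch of that pattern (illustrative only; the real test body and the state it restores differ, and the memory-fraction setting here is just a stand-in):

```python
import torch

def test_touching_global_cuda_state():
    # Restore global CUDA state in `finally` so a failing assertion cannot
    # leak state into later tests, following the try/finally pattern used
    # elsewhere in test_cuda.py.
    torch.cuda.set_per_process_memory_fraction(0.5)
    try:
        assert torch.cuda.memory_allocated() >= 0  # placeholder check
    finally:
        torch.cuda.set_per_process_memory_fraction(1.0)
        torch.cuda.empty_cache()
```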
…_GROUP_GEMM_CK and default to fallback path (ROCm#2865)

On ROCm, the fast path routes to group_gemm_ck and the slow path to _grouped_mm_fallback.
By default the fast path = False route is taken, since the CK path is not performant yet.
To activate the CK path, set the ROCM_ALLOW_GROUP_GEMM_CK=1 environment variable.
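
For example (a sketch; setting the variable before importing torch is the safe way to make sure the backend sees it):

```python
import os

# Opt into the experimental CK grouped-GEMM path on ROCm; leaving the variable
# unset keeps the default _grouped_mm_fallback route.
os.environ["ROCM_ALLOW_GROUP_GEMM_CK"] = "1"

import torch  # imported after setting the variable
```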

---------

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
…s_121625

Cherry-picks from release/2.9 into release/2.10
@jataylo changed the title from "Squashed commit: Add partitioned scatter approach with optimizations …" to "[release/2.10] [Upstream cherry-pick] Add partitioned scatter approach with optimizations" on Dec 30, 2025
rocm-repo-management-api bot commented Dec 30, 2025

Jenkins build for 913cce10b0702def5f47d91a9f217f70ad1ae339 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
