@jataylo commented Dec 30, 2025

From pytorch#168073

It has been observed that heavily contended atomics lead to poor performance.

To address this while minimizing kernel overhead, this PR proposes an FX pass that replaces the index_put operation with an alternative, partitioned scatter approach.

Algorithm:

  1. Enumerate scatter operations: operation_id = [0, 1, 2, ..., N-1]
  2. Assign to partitions: partition_id = operation_id % num_partitions
  3. Create expanded buffers along scatter_dim: size = num_partitions × dim_size
  4. Adjust indices: adjusted_idx = original_idx + (partition_id × dim_size)
  5. Perform partitioned scatter with reduced contention
  6. Reduce across partitions: sum(partitions, dim=scatter_dim)
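
A minimal eager-mode sketch of this transformation (assuming a single index_add-style scatter along dim 0; `partitioned_index_add` and its arguments are illustrative, not the pass's actual implementation):

```python
import torch

def partitioned_index_add(out, indices, values, num_partitions):
    # Equivalent to out.index_add_(0, indices, values), but the scatter is
    # spread across num_partitions copies of the output so atomics only
    # collide within a partition, then summed back down (steps 1-6 above).
    dim_size = out.shape[0]

    # Steps 1-2: enumerate scatter operations and assign each to a partition.
    op_id = torch.arange(indices.numel(), device=indices.device)
    partition_id = op_id % num_partitions

    # Step 3: expanded buffer along the scatter dim (num_partitions * dim_size rows).
    expanded = out.new_zeros((num_partitions * dim_size,) + tuple(out.shape[1:]))

    # Step 4: shift each index into its partition's slice of the expanded buffer.
    adjusted_idx = indices + partition_id * dim_size

    # Step 5: partitioned scatter with reduced contention.
    expanded.index_add_(0, adjusted_idx, values)

    # Step 6: reduce across partitions and accumulate into the real output.
    out += expanded.view(num_partitions, dim_size, *out.shape[1:]).sum(dim=0)
    return out
```

Up to floating-point summation order this matches `out.index_add_(0, indices, values)`; the point is that concurrent atomic adds to the same output row are now split across `num_partitions` rows of the expanded buffer.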

This reduces atomic contention at the cost of extra memory. To limit that cost, we have built heuristics around the total number of partitions for the expanded buffer and cap how large the expanded tensors can be (currently 10% of GPU memory).

Note that the heuristic cannot be perfect, as the actual index data is unknown at compile time. In real-world models the indices contain duplicates and are not uniformly distributed, which increases atomic contention; this cannot currently be modelled, so we estimate contention from the input and output buffer sizes.
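
For illustration only, a hypothetical version of the memory cap could look like the following (the real heuristic in the pass also estimates contention from the buffer sizes and is not reproduced here; `cap_num_partitions` is an illustrative name):

```python
import torch

def cap_num_partitions(out, desired_partitions, budget_frac=0.10):
    # Hypothetical helper: shrink the partition count so the expanded buffer
    # (num_partitions copies of `out` along the scatter dim) stays under
    # budget_frac of total GPU memory, mirroring the 10% cap described above.
    total_mem = torch.cuda.get_device_properties(out.device).total_memory
    bytes_per_copy = max(out.numel() * out.element_size(), 1)
    max_partitions = max(1, int(budget_frac * total_mem) // bytes_per_copy)
    return min(desired_partitions, max_partitions)
```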

Benchmark code: https://gist.github.com/jataylo/dd3a6353ad2859efd65fa87b28aa3ebd
This code executes 3 index_add ops into 3 separate buffers.
N = 1000000
D = 100
n = 501

values = float32 [N,D]
indices = int64 [N]
output = float32 [n, D]

For each run we modify the range of randint to simulate various levels of atomic contention (a narrower range produces more duplicate indices, and therefore more contention).
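
The gist linked above contains the full script; a rough sketch of the setup it describes (an approximation from the description, not the gist verbatim; the randint upper bound `hi` is what each table row varies, and the partitioned_scatter_enabled toggle lives in the Inductor config, whose exact path is not shown here):

```python
import torch

N, D, n = 1_000_000, 100, 501
hi = 7  # "uniform_range 0-7": smaller range -> more duplicate indices -> more contention

def bench_fn(values, indices, outs):
    # Three index_add ops into three separate output buffers.
    for out in outs:
        out.index_add_(0, indices, values)
    return outs

values = torch.randn(N, D, device="cuda")
indices = torch.randint(0, hi + 1, (N,), device="cuda")
outs = [torch.zeros(n, D, device="cuda") for _ in range(3)]

compiled = torch.compile(bench_fn)
compiled(values, indices, outs)  # timed once with the pass enabled, once disabled
```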

We gathered two sets of results: one with partitioned_scatter_enabled=True, the other with partitioned_scatter_enabled=False.

MI300

| uniform_range | no_compile_ms | compile_ms (partitioned_scatter_enabled=False) | compile_ms (partitioned_scatter_enabled=True) | speedup |
|---|---|---|---|---|
| 0-0 | 85.52 | 28.50 | 3.55 | 8.03 |
| 0-1 | 46.99 | 15.66 | 2.47 | 6.33 |
| 0-3 | 25.16 | 8.31 | 2.20 | 3.78 |
| 0-7 | 12.92 | 4.32 | 1.63 | 2.66 |
| 0-15 | 6.66 | 4.24 | 1.60 | 2.66 |
| 0-31 | 3.43 | 3.19 | 1.33 | 2.40 |
| 0-63 | 1.79 | 1.62 | 1.32 | 1.23 |
| 0-127 | 1.76 | 1.59 | 1.24 | 1.28 |
| 0-255 | 1.73 | 1.32 | 1.24 | 1.07 |
| 0-500 | 1.61 | 1.27 | 1.23 | 1.04 |

H100

| uniform_range | no_compile_ms | compile_ms (partitioned_scatter_enabled=False) | compile_ms (partitioned_scatter_enabled=True) |
|---|---|---|---|
| 0-0 | 19.842156 | 5.504691 | 0.756135 |
| 0-1 | 19.516249 | 5.526914 | 0.779779 |
| 0-3 | 10.450396 | 3.079321 | 0.702327 |
| 0-7 | 5.417728 | 1.906823 | 0.683553 |
| 0-15 | 3.545023 | 1.570733 | 0.655154 |
| 0-31 | 2.631531 | 1.223358 | 0.641384 |
| 0-63 | 2.629384 | 0.832651 | 0.63534 |
| 0-127 | 2.629735 | 0.726054 | 0.768287 |
| 0-255 | 2.62846 | 0.625181 | 1.317291 |
| 0-500 | 2.629526 | 0.548981 | 1.724292 |

This could potentially benefit H100 on worst-case examples but would degrade performance in the best case; the atomic-add cost on MI300 is heavier, which makes the approach more beneficial there.

On MI300 we see a mixed bag of end-to-end model results:
https://hud.pytorch.org/benchmark/v3/dashboard/compiler_inductor?renderGroupId=main&time.start=2025-11-05T00%3A00%3A00.000Z&time.end=2025-12-04T02%3A00%3A00.000Z&filters.repo=pytorch%2Fpytorch&filters.benchmarkName=compiler&filters.mode=training&filters.dtype=amp&filters.deviceName=rocm+%28mi300x%29&filters.device=rocm&filters.suite=all&filters.compiler=default&lcommit.commit=38c42c575d342a7ea6f4a555bf845071e03b5f35&lcommit.workflow_id=19635538449&lcommit.date=2025-11-24T14%3A00%3A00Z&lcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rcommit.commit=fedb7f15d177a259bf25c94e888137e0a9a69a81&rcommit.workflow_id=19856622912&rcommit.date=2025-12-02T12%3A00%3A00Z&rcommit.branch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&lbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&rbranch=refs%2Ftags%2Fciflow%2Finductor-perf-test-nightly-rocm-mi300%2F168073&maxSampling=110

Due to the mixed results we will initially ship this as a non-default feature, but CI passed with it enabled here:
https://hud.pytorch.org/pytorch/pytorch/pull/168073?sha=fedb7f15d177a259bf25c94e888137e0a9a69a81

Note there are improvements to make after this lands:

  1. Add dynamic shape support; this needs to be conservative to avoid exploding memory usage.
  2. Update the IR and codegen directly to avoid the iota op and the need to adjust indices via torch ops; we can likely do this in the store codegen itself.
  3. Develop new implementations for memory-constrained environments.

cc @jeffdaily @sunway513 @jithunnair-amd @pruthvistony @ROCmSupport @hongxiayang @naromero77amd @pragupta @jerrymannil @xinyazhang @voznesenskym @penguinwu @EikanWang @jgong5 @Guobing-Chen @XiaobingSuper @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @ipiszy @kadeng @muchulee8 @amjames @chauhang @aakhundov @coconutruben @dllehr-amd @chenyang78

jithunnair-amd and others added 15 commits December 17, 2025 03:11
(cherry picked from commit e294d4d with
modifications for release/2.8)

Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date

(cherry picked from commit 71a30ea)
…for py3.9;

upgrade tensorboard to be compatible with numpy 2

Co-authored-by: Ethan Wee <[email protected]>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
(cherry picked from commit 4b46310)
(cherry picked from commit ba1ba26)
(cherry picked from commit 4e3462e)
(cherry picked from commit 85ac538)
This PR fixes the unit test,

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED
[0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on gfx1101 arch.

This error is caused by an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which leaves a much higher
torch.cuda.max_memory_reserved() behind when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction
runs afterward. To avoid this we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() to clean up the CUDA state.
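
The fix boils down to resetting the allocator state at the start of the test, e.g.:

```python
import torch

# Drop cached blocks and peak-memory statistics left over from earlier tests,
# so max_memory_reserved() only reflects allocations made by this test.
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()
```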

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…ersistent reduction and no_x_dim removal (ROCm#2454)

Cherry-pick of ROCm#2417
Need to resolve conflicts

---------

Co-authored-by: Jack Taylor <[email protected]>
(cherry picked from commit eb47158)
These changes are currently in the process of being upstreamed. Bringing them into
release 2.9 for a customer model perf improvement.

---------

Co-authored-by: Nichols A. Romero <[email protected]>
Co-authored-by: Sampsa Riikonen <[email protected]>
Co-authored-by: Nichols A. Romero <[email protected]>
Co-authored-by: AmdSampsa <[email protected]>
…m#2742)

hipblaslt should provide better performance in general
…d_memory_with_allocator (ROCm#2811)

Use try/finally block. This follows a similar pattern elsewhere in
test_cuda.py.

Fixes #ROCm/TheRock#2118.
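
A generic sketch of that pattern (illustrative only; the real test body and the state it restores differ, and the memory-fraction setting here is just a stand-in):

```python
import torch

def test_touching_global_cuda_state():
    # Restore global CUDA state in `finally` so a failing assertion cannot
    # leak state into later tests, following the try/finally pattern used
    # elsewhere in test_cuda.py.
    torch.cuda.set_per_process_memory_fraction(0.5)
    try:
        assert torch.cuda.memory_allocated() >= 0  # placeholder check
    finally:
        torch.cuda.set_per_process_memory_fraction(1.0)
        torch.cuda.empty_cache()
```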
…_GROUP_GEMM_CK and default to fallback path (ROCm#2865)

On ROCm, the fast path routes to group_gemm_ck and the slow path to _grouped_mm_fallback.
By default the fast path = False route is taken, since the CK path is not performant yet.
To activate the CK path, set the ROCM_ALLOW_GROUP_GEMM_CK=1 environment variable.
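
For example (a sketch; setting the variable before importing torch is the safe way to make sure the backend sees it):

```python
import os

# Opt into the experimental CK grouped-GEMM path on ROCm; leaving the variable
# unset keeps the default _grouped_mm_fallback route.
os.environ["ROCM_ALLOW_GROUP_GEMM_CK"] = "1"

import torch  # imported after setting the variable
```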

---------

Signed-off-by: Jagadish Krishnamoorthy <[email protected]>
…s_121625

Cherry-picks from release/2.9 into release/2.10
@jataylo changed the title from "Squashed commit: Add partitioned scatter approach with optimizations …" to "[release/2.10] [Upstream cherry-pick] Add partitioned scatter approach with optimizations" on Dec 30, 2025
rocm-repo-management-api bot commented Dec 30, 2025

Jenkins build for 913cce10b0702def5f47d91a9f217f70ad1ae339 commit finished as FAILURE
Links: Pipeline Overview / Build artifacts / Test Results
