
Conversation

@anatoliylitv commented Jan 19, 2026

Overview
This PR adds a new float32 precision API,
torch.backends.cuda.math_sdp.fp32_precision, to configure the fp32 precision
behavior of SDPBackend.MATH.

Rationale
The test/test_transformers.py test suite derives its numerical
tolerances by comparing output tensors computed at the same precision ("reference")
and at a higher precision ("golden"), both produced by SDPBackend.MATH.
However, the golden output is calculated with TF32 rather than FP32, which is in
fact less accurate than the FA/ME backends when they use IEEE FP32 rather than
TF32 for their accumulation.
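The precision gap can be illustrated without a GPU. TF32 keeps only 10 explicit mantissa bits, and emulating that rounding in pure Python shows how a TF32-accumulated "golden" result drifts from the full-precision one. This is a sketch: `tf32_round` and the toy `dot` below are illustrative stand-ins, not PyTorch code.

```python
import math

def tf32_round(x: float) -> float:
    """Quantize a float to TF32-like precision (10 explicit mantissa bits)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * (1 << 11)) / (1 << 11)   # keep ~11 significant bits
    return math.ldexp(m, e)

def dot(xs, ys, quant=lambda v: v):
    """Dot product with every operand and partial sum passed through `quant`."""
    acc = 0.0
    for a, b in zip(xs, ys):
        acc = quant(acc + quant(a) * quant(b))
    return acc

xs = [1 / 3 + i * 1e-3 for i in range(64)]
ys = [2 / 7 - i * 1e-3 for i in range(64)]
ref = dot(xs, ys)                     # full-precision "golden"
tf32 = dot(xs, ys, quant=tf32_round)  # TF32-like "golden"
print(abs(tf32 - ref) / abs(ref))     # relative error on the order of TF32's
                                      # 2**-11 step, far above fp32 epsilon
```

A golden reference computed this way carries TF32-sized error itself, so a backend that accumulates in IEEE FP32 can look "wrong" when compared against it.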

This loss of precision causes false negatives in SDPA tests such as
TestSDPACudaOnlyCUDA.test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_enable_gqa_True_n_heads1_cuda_float16,
at least on the ROCm platform. The false negative disappears after forcing
higher_precision_dtype = torch.float64.

Major Changes
To restore the precision of the golden output, a new API,
torch.backends.cuda.math_sdp.fp32_precision, is introduced, which allows
configuring the "matmul" precision used by SDPBackend.MATH, and a new
decorator, @math_sdp_precision("ieee"), is added to all tests that use
check_out_and_grad. Finally, an assert is added to the innermost function
_check_equal as a sanity check to ensure math_sdp has the right precision
configured for torch.float32 golden tensors.
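A minimal sketch of what such a decorator could look like, assuming it follows the usual save/set/restore pattern for backend flags. The `_MathSDPConfig` class here is a hypothetical stand-in; the real decorator would operate on `torch.backends.cuda.math_sdp`.

```python
import functools

# Hypothetical stand-in for torch.backends.cuda.math_sdp.
class _MathSDPConfig:
    fp32_precision = "tf32"

math_sdp = _MathSDPConfig()

def math_sdp_precision(precision):
    """Run the wrapped test with math_sdp.fp32_precision forced to `precision`,
    restoring the previous value afterwards (even if the test fails)."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            prev = math_sdp.fp32_precision
            math_sdp.fp32_precision = precision
            try:
                return fn(*args, **kwargs)
            finally:
                math_sdp.fp32_precision = prev
        return wrapper
    return deco

@math_sdp_precision("ieee")
def test_something():
    return math_sdp.fp32_precision

print(test_something())           # "ieee" inside the decorated test
print(math_sdp.fp32_precision)    # "tf32" restored afterwards
```

The try/finally restore matters for test suites: a failing test must not leak the IEEE setting into later tests that expect the default.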

Known Issues
The backward pass honors the configuration in effect when backward() is called,
regardless of the configuration in effect when the graph was created.

This is a copy of PR pytorch#167157, created because pytorchbot has limited access to the ROCm fork.

For reference, the PR for checking test state: https://github.com/pytorch/pytorch/pull/169676

Fixes #ISSUE_NUMBER

jithunnair-amd and others added 21 commits January 14, 2026 19:43
(cherry picked from commit e294d4d with
modifications for release/2.8)

Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date

(cherry picked from commit 71a30ea)
…for py3.9;

upgrade tensorboard compatible with numpy 2

Co-authored-by: Ethan Wee <[email protected]>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
(cherry picked from commit 4b46310)
(cherry picked from commit ba1ba26)
(cherry picked from commit 4e3462e)
(cherry picked from commit 85ac538)
This PR fixes the following unit test failure:

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED
[0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on the gfx1101 arch.

The error stems from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which inflates
torch.cuda.max_memory_reserved(); when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction runs
afterward, the size it computes from that inflated value overflows. To avoid
this we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() calls to clean up the CUDA allocator state.
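The pollution mechanism can be sketched with a toy allocator. `FakeAllocator` is purely illustrative (not a PyTorch class); it only mimics how a reserved-memory peak recorded by one test survives into the next unless explicitly reset.

```python
# Toy model of peak-memory bookkeeping: the peak statistic outlives the
# allocation that set it, so a later test reads a stale, inflated value.
class FakeAllocator:
    def __init__(self):
        self.reserved = 0
        self.peak_reserved = 0

    def alloc(self, n):
        self.reserved += n
        self.peak_reserved = max(self.peak_reserved, self.reserved)

    def free(self, n):
        self.reserved -= n

    def reset_peak_memory_stats(self):
        self.peak_reserved = self.reserved

alloc = FakeAllocator()
alloc.alloc(10**10)   # earlier test allocates a huge tensor ...
alloc.free(10**10)    # ... and frees it, but the peak statistic survives
assert alloc.peak_reserved == 10**10

alloc.reset_peak_memory_stats()   # the cleanup step this fix adds
assert alloc.peak_reserved == 0
```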

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…ersistent reduction and no_x_dim removal (#2454)

Cherry-pick of #2417. Needed to resolve conflicts.

---------

Co-authored-by: Jack Taylor <[email protected]>
(cherry picked from commit eb47158)
hipblaslt should provide better performance in general
…d_memory_with_allocator (#2811)

Use a try/finally block. This follows a similar pattern elsewhere in
test_cuda.py.
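A minimal sketch of that pattern, with a hypothetical `run_with_allocator` helper standing in for the test's setup and teardown. The point is only the try/finally shape: state is restored even when the test body raises.

```python
# Hypothetical helper; `state` stands in for whatever global the test mutates.
state = {"allocator": "default"}

def run_with_allocator(name, body):
    prev = state["allocator"]
    state["allocator"] = name
    try:
        return body()
    finally:
        state["allocator"] = prev   # always executed, even on an exception

try:
    run_with_allocator("custom", lambda: 1 / 0)   # body fails ...
except ZeroDivisionError:
    pass
print(state["allocator"])   # ... yet "default" was restored
```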

Fixes #ROCm/TheRock#2118.
… version suffix removed (#2912)

This PR updates triton.txt to point to commit
ba5c1517e6f5906761cf5783036efb587026208d which is a cherry-pick of
[d63831ae4a73b8fdac814f1bc060d669a8ae1b06](ROCm/triton@d63831a)
onto the `release/internal/3.6.x` branch.

The cherry-picked commit removes the git version suffix from the Triton
version string.

Related Triton PR: ROCm/triton#920

---------

Co-authored-by: Jithun Nair <[email protected]>
…den to use ieee rather than tf32

(cherry picked from commit 5b5115c)
(cherry picked from commit c103f03)
(cherry picked from commit 3e92e80)
(cherry picked from commit c1b1775)
rocm-repo-management-api bot commented Jan 19, 2026

Jenkins build for d597c8cc0d5c0ce84521c05a5a1f7f5f46f61520 commit finished as FAILURE

@anatoliylitv changed the title from "Anatoliylitv/math sdp ieee 2.10" to "Add torch.backends.cuda.math_sdp.fp32_precision to 2.10" on Jan 20, 2026