
Conversation

@anatoliylitv commented Jan 19, 2026

Overview
This PR adds a new float32 precision API,
torch.backends.cuda.math_sdp.fp32_precision, to configure the fp32 precision
behavior of SDPBackend.MATH.

Rationale
The test/test_transformers.py test suite derives its numerical
tolerances by comparing output tensors computed at the same precision ("reference")
and at a higher precision ("golden"), both produced by SDPBackend.MATH.
However, the golden output is calculated with TF32 rather than FP32, which is in
fact less accurate than the FA/ME backends when they use IEEE FP32 rather than
TF32 for their accumulation.
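The precision gap can be illustrated without a GPU. TF32 keeps only 10 explicit mantissa bits, and emulating that rounding in pure Python shows how a TF32-accumulated "golden" result drifts from the full-precision one. This is a sketch: `tf32_round` and the toy `dot` below are illustrative stand-ins, not PyTorch code.

```python
import math

def tf32_round(x: float) -> float:
    """Quantize a float to TF32-like precision (10 explicit mantissa bits)."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                   # x = m * 2**e with 0.5 <= |m| < 1
    m = round(m * (1 << 11)) / (1 << 11)   # keep ~11 significant bits
    return math.ldexp(m, e)

def dot(xs, ys, quant=lambda v: v):
    """Dot product with every operand and partial sum passed through `quant`."""
    acc = 0.0
    for a, b in zip(xs, ys):
        acc = quant(acc + quant(a) * quant(b))
    return acc

xs = [1 / 3 + i * 1e-3 for i in range(64)]
ys = [2 / 7 - i * 1e-3 for i in range(64)]
ref = dot(xs, ys)                     # full-precision "golden"
tf32 = dot(xs, ys, quant=tf32_round)  # TF32-like "golden"
print(abs(tf32 - ref) / abs(ref))     # relative error on the order of TF32's
                                      # 2**-11 step, far above fp32 epsilon
```

A golden reference computed this way carries TF32-sized error itself, so a backend that accumulates in IEEE FP32 can look "wrong" when compared against it.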

This loss of precision causes false negatives in SDPA tests such as
TestSDPACudaOnlyCUDA.test_flash_attention_vs_math_ref_grads_batch_size_8_seq_len_q_143_seq_len_k_4_head_dim_203_is_causal_False_dropout_p_0_22_float16_scale_l1_enable_gqa_True_n_heads1_cuda_float16,
at least on the ROCm platform. The false negative disappears after forcing
higher_precision_dtype = torch.float64.

Major Changes
To restore the precision of the golden output, a new API,
torch.backends.cuda.math_sdp.fp32_precision, is introduced, which allows
configuring the "matmul" precision used by SDPBackend.MATH, and a new
decorator, @math_sdp_precision("ieee"), is added to all tests that use
check_out_and_grad. Finally, an assert is added to the innermost function
_check_equal as a sanity check to ensure math_sdp has the right precision
configured for torch.float32 golden tensors.
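A minimal sketch of what such a decorator could look like, assuming it follows the usual save/set/restore pattern for backend flags. The `_MathSDPConfig` class here is a hypothetical stand-in; the real decorator would operate on `torch.backends.cuda.math_sdp`.

```python
import functools

# Hypothetical stand-in for torch.backends.cuda.math_sdp.
class _MathSDPConfig:
    fp32_precision = "tf32"

math_sdp = _MathSDPConfig()

def math_sdp_precision(precision):
    """Run the wrapped test with math_sdp.fp32_precision forced to `precision`,
    restoring the previous value afterwards (even if the test fails)."""
    def deco(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            prev = math_sdp.fp32_precision
            math_sdp.fp32_precision = precision
            try:
                return fn(*args, **kwargs)
            finally:
                math_sdp.fp32_precision = prev
        return wrapper
    return deco

@math_sdp_precision("ieee")
def test_something():
    return math_sdp.fp32_precision

print(test_something())           # "ieee" inside the decorated test
print(math_sdp.fp32_precision)    # "tf32" restored afterwards
```

The try/finally restore matters for test suites: a failing test must not leak the IEEE setting into later tests that expect the default.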

Known Issues
The backward pass honors the configuration in effect when backward() is called,
regardless of the configuration in effect when the graph was created.

This is a copy of PR pytorch#167157, created because pytorchbot has limited access to the ROCm fork.

For reference, the PR for checking test state: https://github.com/pytorch/pytorch/pull/169676

Fixes #ISSUE_NUMBER

jithunnair-amd and others added 21 commits January 14, 2026 19:43
(cherry picked from commit e294d4d with
modifications for release/2.8)

Reintroduce CIRCLE_TAG to be able to set PYTORCH_BUILD_VERSION without date

(cherry picked from commit 71a30ea)
…for py3.9;

upgrade tensorboard compatible with numpy 2

Co-authored-by: Ethan Wee <[email protected]>
(cherry picked from commit e867a3d)
(cherry picked from commit c7a1e32)
(cherry picked from commit 2a215e4)
(cherry picked from commit 866cc1d)
(cherry picked from commit 4b46310)
(cherry picked from commit ba1ba26)
(cherry picked from commit 4e3462e)
(cherry picked from commit 85ac538)
This PR fixes the following unit test failure:

test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction FAILED
[0.1163s]

```
Traceback (most recent call last):
  File "/var/lib/jenkins/pytorch/test/test_cuda.py", line 471, in test_set_per_process_memory_fraction
    tmp_tensor = torch.empty(application, dtype=torch.int8, device="cuda")
RuntimeError: Trying to create tensor with negative dimension -5681285432: [-5681285432]
```
This error occurs only on the gfx1101 arch.

The error stems from an integer overflow: another unit test,
test/test_cuda.py::TestCuda::test_randint_generation_for_large_numel,
creates a tensor with a huge numel, which inflates
torch.cuda.max_memory_reserved(); when
test/test_cuda.py::TestCuda::test_set_per_process_memory_fraction runs
afterward, the size it computes from that inflated value overflows. To avoid
this we introduced torch.cuda.empty_cache() and
torch.cuda.reset_peak_memory_stats() calls to clean up the CUDA allocator state.
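The pollution mechanism can be sketched with a toy allocator. `FakeAllocator` is purely illustrative (not a PyTorch class); it only mimics how a reserved-memory peak recorded by one test survives into the next unless explicitly reset.

```python
# Toy model of peak-memory bookkeeping: the peak statistic outlives the
# allocation that set it, so a later test reads a stale, inflated value.
class FakeAllocator:
    def __init__(self):
        self.reserved = 0
        self.peak_reserved = 0

    def alloc(self, n):
        self.reserved += n
        self.peak_reserved = max(self.peak_reserved, self.reserved)

    def free(self, n):
        self.reserved -= n

    def reset_peak_memory_stats(self):
        self.peak_reserved = self.reserved

alloc = FakeAllocator()
alloc.alloc(10**10)   # earlier test allocates a huge tensor ...
alloc.free(10**10)    # ... and frees it, but the peak statistic survives
assert alloc.peak_reserved == 10**10

alloc.reset_peak_memory_stats()   # the cleanup step this fix adds
assert alloc.peak_reserved == 0
```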

JIRA: https://ontrack-internal.amd.com/browse/SWDEV-535295
(cherry picked from commit f86d184)
(cherry picked from commit 1b44228)
…ersistent reduction and no_x_dim removal (#2454)

Cherry-pick of #2417. Needed to resolve conflicts.

---------

Co-authored-by: Jack Taylor <[email protected]>
(cherry picked from commit eb47158)
hipblaslt should provide better performance in general
…d_memory_with_allocator (#2811)

Use a try/finally block. This follows a similar pattern elsewhere in
test_cuda.py.
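A minimal sketch of that pattern, with a hypothetical `run_with_allocator` helper standing in for the test's setup and teardown. The point is only the try/finally shape: state is restored even when the test body raises.

```python
# Hypothetical helper; `state` stands in for whatever global the test mutates.
state = {"allocator": "default"}

def run_with_allocator(name, body):
    prev = state["allocator"]
    state["allocator"] = name
    try:
        return body()
    finally:
        state["allocator"] = prev   # always executed, even on an exception

try:
    run_with_allocator("custom", lambda: 1 / 0)   # body fails ...
except ZeroDivisionError:
    pass
print(state["allocator"])   # ... yet "default" was restored
```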

Fixes #ROCm/TheRock#2118.
… version suffix removed (#2912)

This PR updates triton.txt to point to commit
ba5c1517e6f5906761cf5783036efb587026208d which is a cherry-pick of
[d63831ae4a73b8fdac814f1bc060d669a8ae1b06](ROCm/triton@d63831a)
onto the `release/internal/3.6.x` branch.

The cherry-picked commit removes the git version suffix from the Triton
version string.

Related Triton PR: ROCm/triton#920

---------

Co-authored-by: Jithun Nair <[email protected]>
…den to use ieee rather than tf32

(cherry picked from commit 5b5115c)
(cherry picked from commit c103f03)
(cherry picked from commit 3e92e80)
(cherry picked from commit c1b1775)
rocm-repo-management-api bot commented Jan 19, 2026

Jenkins build for d597c8cc0d5c0ce84521c05a5a1f7f5f46f61520 commit finished as FAILURE

@anatoliylitv changed the title from "Anatoliylitv/math sdp ieee 2.10" to "Add torch.backends.cuda.math_sdp.fp32_precision to 2.10" on Jan 20, 2026