feat(models): Triton Attention backend #716
base: main
Conversation
fused-attention-batch1-head8-d64-fwd-window=0:
N_CTX Triton [FP16] Flash-2
0 1024.0 53.951905 30.154757
1 2048.0 251.862878 225.601753
2 4096.0 1003.585900 883.967242
3 8192.0 3087.151752 2919.118612
4 16384.0 14424.333821 12164.023961
fused-attention-batch1-head8-d64-fwd-window=256:
N_CTX Triton [FP16] Flash-2
0 1024.0 66.118373 49.047213
1 2048.0 212.385098 224.781601
2 4096.0 627.400725 754.676149
3 8192.0 2403.518850 2072.301033
4 16384.0 5407.098626 4389.974473
fused-attention-batch1-head8-d64-bwd-window=0:
N_CTX Triton [FP16] Flash-2
0 1024.0 29.038600 28.145139
1 2048.0 103.598599 107.078404
2 4096.0 221.267879 466.277452
3 8192.0 238.154238 1926.255278
4 16384.0 244.473300 7453.972862
fused-attention-batch1-head8-d64-bwd-window=256:
N_CTX Triton [FP16] Flash-2
0 1024.0 31.743517 30.864637
1 2048.0 90.926897 119.329910
2 4096.0 219.786079 486.217597
3 8192.0 236.157174 1804.149605
4 16384.0 243.949151 4204.442805
fused-attention-batch1-head8-d128-fwd-window=0:
N_CTX Triton [FP16] Flash-2
0 1024.0 113.015749 97.006563
1 2048.0 355.368512 378.363736
2 4096.0 1539.535832 1366.327294
3 8192.0 7007.632047 6028.029964
4 16384.0 19762.887419 17225.973294
fused-attention-batch1-head8-d128-fwd-window=256:
N_CTX Triton [FP16] Flash-2
0 1024.0 103.508837 89.663668
1 2048.0 525.090419 464.968188
2 4096.0 1235.160998 1332.826567
3 8192.0 3203.617410 2936.990470
4 16384.0 6880.275897 6123.385195
fused-attention-batch1-head8-d128-bwd-window=0:
N_CTX Triton [FP16] Flash-2
0 1024.0 31.435884 62.969277
1 2048.0 58.816000 240.568425
2 4096.0 81.556234 987.361971
3 8192.0 86.844913 3692.221169
4 16384.0 88.609923 8978.312907
fused-attention-batch1-head8-d128-bwd-window=256:
N_CTX Triton [FP16] Flash-2
0 1024.0 31.168216 62.057629
1 2048.0 57.411547 240.228307
2 4096.0 73.313512 920.541072
3 8192.0 75.291760 2300.896123
4 16384.0 75.907589 4835.565817
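(The table names above follow the Triton fused-attention tutorial's benchmark naming. The script that produced these numbers is not shown in this thread; timings like these are typically gathered with `triton.testing.do_bench`, as in this rough sketch — the callable names are placeholders.)

```python
import triton


def bench(fn, q, k, v):
    # Mean wall-clock time in milliseconds over warmed-up repetitions.
    return triton.testing.do_bench(lambda: fn(q, k, v))


# Hypothetical comparison of two attention callables on the same inputs:
# t_triton = bench(triton_attention, q, k, v)
# t_flash  = bench(flash_attn_func, q, k, v)
```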
for more information, see https://pre-commit.ci
…ore into feature/triton-flash-attn
…uld be off if window-size was not a factor of BLOCK_N
Something strange is afoot with the loss for longer runs.
I compared an aifs-single setup over 4 GH200s for 4000 iterations here. Loss looks good, and the triton backend completed faster.
…ring inference runtime
…ad dims. Complex because there are more block sizes and dependencies between them due to sharing a grid. Won't run, but pytest, which doesn't autotune, passes.
japols
left a comment
Nice work!
Some comments on dtypes and a few minor changes.
Co-authored-by: Jan Polster <[email protected]>
Co-authored-by: Jan Polster <[email protected]>
…ore into feature/triton-flash-attn
for more information, see https://pre-commit.ci
…ore into feature/triton-flash-attn
…rid via autotuning
for more information, see https://pre-commit.ci
…ore into feature/triton-flash-attn
ssmmnn11
left a comment
nice!
```python
pre_hook=_host_descriptor_pre_hook,
)
for BM in [32, 64, 128]
for BN in [32, 64, 128]
```
no 16?
Good idea. It passes the pytests and gives a speedup from 8.55 ms/iter to 7.06 ms/iter at 2048 channels... maybe I should try 8 :D
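For reference, a minimal sketch of what widening the autotune search space with the smaller block sizes could look like; the block-size kwarg names, the `num_stages`/`num_warps` values, and the stubbed pre-hook are placeholders, not the PR's actual configuration:

```python
import triton


def _host_descriptor_pre_hook(nargs):
    # Stub standing in for the pre-hook referenced in the hunk above.
    pass


# Hypothetical extension of the autotune search space with block sizes of 16.
configs = [
    triton.Config(
        {"BLOCK_M": BM, "BLOCK_N": BN},
        num_stages=stages,
        num_warps=warps,
        pre_hook=_host_descriptor_pre_hook,
    )
    for BM in [16, 32, 64, 128]
    for BN in [16, 32, 64, 128]
    for stages in [3, 4]
    for warps in [4, 8]
]
```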
```python
# Meaning there is at least (BATCH_SIZE * NUM_HEADS) SMs
# Depending on BLOCK_FIXED, the context window might also be split across SMs
# BLOCK_FIXED is a hyperparameter which triton sets at runtime by running small performance tests
def grid(META):
```
Is N_CTX always divisible by BLOCK_FIXED? If not, do we need a mask in _attn_fwd?
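A minimal sketch of the usual pointer-based pattern when the context length is not a multiple of the block size: launch `triton.cdiv(N_CTX, BLOCK)` programs and mask the tail block. The PR's kernel uses TMA descriptor loads, so this only illustrates the general idea, not its actual code:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _copy_rows(x_ptr, out_ptr, n_ctx, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_ctx  # guards the final partial block when n_ctx % BLOCK != 0
    x = tl.load(x_ptr + offs, mask=mask, other=0.0)
    tl.store(out_ptr + offs, x, mask=mask)


def copy_rows(x: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n_ctx = x.numel()
    # Ceil-division so the last (partial) block still gets a program.
    grid = lambda meta: (triton.cdiv(n_ctx, meta["BLOCK"]),)
    _copy_rows[grid](x, out, n_ctx, BLOCK=128)
    return out
```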
```python
offset_y = off_z * (N_CTX * H) + off_h * N_CTX
qo_offset_y = offset_y + start_m * BLOCK_FIXED
# initialize offsets
offs_m = start_m * BLOCK_FIXED + tl.arange(0, BLOCK_FIXED)
```
Do we need N_CTX % PRE_BLOCK == 0 here as well? Or a mask to avoid writing at invalid locations?
```python
# This frees up threads and registers to do other computations
# TMA requires global memory allocations, so we set the alloc_fn here
def alloc_fn(size: int, align: int, _):
    return torch.empty(size, dtype=torch.int8, device="cuda")
```
Could we allocate on the current device? Or is "cuda" always safe / will we always get the correct device?
```python
def alloc_fn(size: int, align: int, _):
    return torch.empty(size, dtype=torch.int8, device="cuda")


triton.set_allocator(alloc_fn)
```
Is this set in every call? Should we set this once somewhere?
This is also interesting, but not really relevant I guess: pytorch/pytorch#155584
Good catch. Now it's only called once, when the file is imported. Currently there isn't a public way to check if the allocator has been set (here). I left a TODO to add a check once there is a way.
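A sketch of what a device-aware, import-time registration could look like; whether the third `alloc_fn` argument is the stream depends on the Triton version, so treat the details as assumptions:

```python
import torch
import triton


def _alloc_fn(size: int, align: int, _stream):
    # Allocate TMA scratch on the caller's current CUDA device rather than the
    # hard-coded "cuda" default, so multi-GPU ranks land on their own device.
    device = torch.device("cuda", torch.cuda.current_device())
    return torch.empty(size, dtype=torch.int8, device=device)


# Register once at module import; there is currently no public API to query
# whether an allocator is already set (hence the TODO mentioned above).
triton.set_allocator(_alloc_fn)
```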
```python
# -- update output accumulator --
acc = acc * alpha[:, None]
# prepare p and v for the dot
v = desc_v.load([iter_offset, 0])
```
Do we need to be careful not to load something that goes out of bounds?
```python
curr_iter = tl.multiple_of(curr_iter, BLOCK_ITER)  # Tells compiler curr_iter is a multiple of BLOCK_ITER


# -- compute qk ----
k = desc_k.load([iter_offset, 0]).T
```
Do we need to be careful not to load something that goes out of bounds?
```python
qk_scale = sm_scale
qk_scale *= 1.44269504  # 1/ln(2): hack to make computing the exponent faster; by merging in 1/ln(2) now, the cheaper exp2() fn can be called later instead of exp()
# load q: it will stay in SRAM throughout
q = desc_q.load([fixed_offset, 0])
```
Do we need a mask so we don't load anything beyond N_CTX?
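As a side note on the `qk_scale *= 1.44269504` line in the hunk above: exp(x) = exp2(x · log2(e)) and log2(e) = 1/ln(2) ≈ 1.4426950408889634, so folding the constant into the softmax scale once lets the kernel call the cheaper `exp2()` instead of `exp()` later. A quick numeric check (values are arbitrary examples):

```python
import math

LOG2_E = 1.4426950408889634  # log2(e) == 1 / ln(2)
x, sm_scale = 0.73, 0.125
# exp(sm_scale * x) computed via exp2 with the constant folded into the scale
assert math.isclose(math.exp(sm_scale * x), 2.0 ** ((sm_scale * LOG2_E) * x))
print("exp via exp2 matches")
```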
```diff
@@ -1,4 +1,4 @@
-# (C) Copyright 2024 Anemoi contributors.
+# (C) Copyright 2026 Anemoi contributors.
```
Suggested change:
```diff
-# (C) Copyright 2026 Anemoi contributors.
+# (C) Copyright 2024- Anemoi contributors.
```
```python
LOGGER = logging.getLogger(__name__)


# Change attention implementation during inference runtime
ATTENTION_BACKEND = os.environ.get("ANEMOI_INFERENCE_TRANSFORMER_ATTENTION_BACKEND", "")
```
At some point I'd like to consolidate these into utils.env
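A rough sketch of the kind of consolidation meant here; `utils.env` and the helper name are hypothetical, not existing code:

```python
import os


# Hypothetical helper, e.g. in a future anemoi utils.env module.
def get_env(name: str, default: str = "") -> str:
    """Single entry point for reading Anemoi-related environment variables."""
    return os.environ.get(name, default)


ATTENTION_BACKEND = get_env("ANEMOI_INFERENCE_TRANSFORMER_ATTENTION_BACKEND")
```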
```python
}


# Check if 'ANEMOI_INFERENCE_TRANSFORMER_ATTENTION_BACKEND' env var has been set
if ATTENTION_BACKEND != "":
```
Suggested change:
```diff
-if ATTENTION_BACKEND != "":
+if ATTENTION_BACKEND:
```
```python
value = self.lin_v(x)


# Check at runtime if the Attention backend env var has been set, and update attention backend accordingly
if ATTENTION_BACKEND != "":
```
Suggested change:
```diff
-if ATTENTION_BACKEND != "":
+if ATTENTION_BACKEND:
```
| "Dropout probability used for multi-head self attention, default 0.0" | ||
| attention_implementation: str = Field(example="flash_attention") | ||
| "Attention implementation to use. Default to 'flash_attention'." | ||
| attention_implementation: str = Field(example="triton_attention") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is a required str, it doesn't really default to triton_attention
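To illustrate the point (a reduced sketch, not the PR's actual schema): with pydantic, `example=` only documents a sample value and the field remains required; an actual default would need `default=` or a plain assignment:

```python
from pydantic import BaseModel, Field


class ProcessorSchema(BaseModel):  # hypothetical, reduced schema
    # `example` is documentation only -- the field is still required:
    attention_implementation: str = Field(example="triton_attention")
    # An actual default would look like:
    # attention_implementation: str = Field(default="triton_attention")
```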
```diff
 window_size: 512
 dropout_p: 0.0 # GraphTransformer
-attention_implementation: flash_attention # flash_attention, scaled_dot_product_attention
+attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
```
Suggested change:
```diff
-attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
+attention_implementation: triton_attention # Possible values: flash_attention, triton_attention, scaled_dot_product_attention
```
```diff
 window_size: 512
 dropout_p: 0.0
-attention_implementation: flash_attention # Possible values: scaled_dot_product_attention, flash_attention
+attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
```
Suggested change:
```diff
-attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
+attention_implementation: triton_attention # Possible values: flash_attention, triton_attention, scaled_dot_product_attention
```
```diff
 window_size: 512
 dropout_p: 0.0 # GraphTransformer
-attention_implementation: flash_attention # flash_attention, scaled_dot_product_attention
+attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
```
Suggested change:
```diff
-attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
+attention_implementation: triton_attention # Possible values: flash_attention, triton_attention, scaled_dot_product_attention
```
```diff
 window_size: -1
 dropout_p: 0.0
-attention_implementation: flash_attention # Possible values: scaled_dot_product_attention, flash_attention
+attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
```
Suggested change:
```diff
-attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
+attention_implementation: triton_attention # Possible values: flash_attention, triton_attention, scaled_dot_product_attention
```
```diff
 window_size: -1
 dropout_p: 0.0
-attention_implementation: flash_attention # Possible values: scaled_dot_product_attention, flash_attention
+attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
```
Suggested change:
```diff
-attention_implementation: triton # Possible values: flash_attention, triton, scaled_dot_product_attention
+attention_implementation: triton_attention # Possible values: flash_attention, triton_attention, scaled_dot_product_attention
```


Description
Triton backend for the transformer processor. As fast as Flash Attention 2, without all the hassle of installing it.
I based it on the official (MIT-licensed) Triton fused-attention demo and added support for sliding windows. I also changed the backward-pass structure to make it simpler and easier to extend with other attention modifications.
Performance tests and loss comparisons against longer runs are shown in the comments below.
There is a pytest suite which tests numerous different configurations against a reference implementation.
This PR changes the default attention implementation when using the transformer processor to the 'triton' backend. Since the triton backend matches the performance of Flash Attention v2 and does not have to be installed separately, this will allow users to train transformer models out of the box.
I also added two env vars that allow users to select the attention implementation at inference time. This is a quality-of-life feature a few users have suggested.
In the future it would be better to replace the use of env vars with passing this information through the anemoi-inference config.
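A usage sketch for the inference-time override; only the env var name that appears in the diff above is shown, and the exact value string accepted here is an assumption:

```python
import os

# Select the attention backend at inference time without editing the checkpoint config.
os.environ["ANEMOI_INFERENCE_TRANSFORMER_ATTENTION_BACKEND"] = "triton_attention"
```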
📚 Documentation preview 📚: https://anemoi-training--716.org.readthedocs.build/en/716/
📚 Documentation preview 📚: https://anemoi-graphs--716.org.readthedocs.build/en/716/
📚 Documentation preview 📚: https://anemoi-models--716.org.readthedocs.build/en/716/