@dhernandez0 dhernandez0 commented Oct 7, 2025

Motivation

This PR adds a pass to prepare the pipeline for attention.

TODO: wait until #1990 is merged before merging this PR

Note that this includes changes that have been split into the following PRs:

Technical Details

The pass merges the rock.stages into the following four stages:

  • GlobalReadGemm0
  • GlobalReadGemm1 + LDSWriteGemm0 + LDSReadGemm0
  • InitGemm0 + MMAGemm0 + LDSWriteGemm1 + PostProcessGemm0
  • InitGemm1 + LDSReadGemm1 + MMAGemm1 + PostProcessGemm1

PostProcessGemm0 includes softmax.
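
The grouping above can be written down as plain data to make it easier to check which merged stage an original stage lands in. This is only an illustrative Python sketch, not the MLIR pass itself: `MERGED_STAGES` and `merged_stage_index` are hypothetical names, and only the stage names and their grouping come from this PR.

```python
# Illustrative sketch only: plain Python, not the MLIR pass from this PR.
# MERGED_STAGES and merged_stage_index are hypothetical names; the stage
# names and their grouping are copied from the list above.
MERGED_STAGES = [
    ["GlobalReadGemm0"],
    ["GlobalReadGemm1", "LDSWriteGemm0", "LDSReadGemm0"],
    ["InitGemm0", "MMAGemm0", "LDSWriteGemm1", "PostProcessGemm0"],  # PostProcessGemm0 includes softmax
    ["InitGemm1", "LDSReadGemm1", "MMAGemm1", "PostProcessGemm1"],
]


def merged_stage_index(original_stage: str) -> int:
    """Return the index of the merged stage that contains original_stage."""
    for index, group in enumerate(MERGED_STAGES):
        if original_stage in group:
            return index
    raise KeyError(f"unknown stage: {original_stage}")


# Flatten the groups to check that no original stage appears twice.
ALL_STAGES = [stage for group in MERGED_STAGES for stage in group]

if __name__ == "__main__":
    for name in ALL_STAGES:
        print(f"{name} -> merged stage {merged_stage_index(name)}")
```

With this layout, softmax (part of PostProcessGemm0) sits in the same merged stage as MMAGemm0 and LDSWriteGemm1.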

Performance

We get up to a 1.5x speedup vs develop, with a median speedup of 1.11x.

| DataType | G | SeqLenQ | SeqLenK | HeadDimQK | HeadDimV | TFlops develop | TFlops PR #1990 | TFlops this PR | speed up vs #1990 | speed up vs develop |
|---|---|---|---|---|---|---|---|---|---|---|
| f16 | 12 | 384 | 384 | 64 | 64 | 31.41 | 31.87 | 37.83 | 1.19 | 1.20 |
| f16 | 16 | 384 | 384 | 64 | 64 | 40.21 | 44.51 | 50.60 | 1.14 | 1.26 |
| f32 | 12 | 384 | 384 | 64 | 64 | 18.53 | 19.55 | 20.54 | 1.05 | 1.11 |
| f16 | 10 | 4096 | 4096 | 64 | 64 | 231.55 | 239.91 | 243.99 | 1.02 | 1.05 |
| f16 | 10 | 4096 | 64 | 64 | 64 | 82.39 | 82.98 | 92.33 | 1.11 | 1.12 |
| f16 | 20 | 1024 | 1024 | 64 | 64 | 136.24 | 135.72 | 137.67 | 1.01 | 1.01 |
| f16 | 20 | 1024 | 64 | 64 | 64 | 57.14 | 56.18 | 55.79 | 0.99 | 0.98 |
| f16 | 40 | 256 | 256 | 64 | 64 | 53.19 | 50.27 | 58.74 | 1.17 | 1.10 |
| f16 | 40 | 256 | 64 | 64 | 64 | 22.2 | 23.18 | 33.41 | 1.44 | 1.51 |
| f16 | 40 | 64 | 64 | 64 | 64 | 8.58 | 8.71 | 8.92 | 1.02 | 1.04 |
| f16 | 1 | 4096 | 4096 | 512 | 512 | 117.71 | 140.77 | 146.98 | 1.04 | 1.25 |
| f16 | 1 | 4096 | 4096 | 512 | 512 | 115.96 | 120.37 | 127.78 | 1.06 | 1.10 |
| f16 | 32 | 256 | 256 | 128 | 128 | 93.39 | 93.1 | 91.35 | 0.98 | 0.98 |
| f16 | 32 | 256 | 256 | 128 | 128 | 65.81 | 78.85 | 90.93 | 1.15 | 1.38 |
| f16 | 32 | 256 | 256 | 96 | 96 | 63.44 | 61.34 | 75.08 | 1.22 | 1.18 |
| f16 | 32 | 256 | 256 | 96 | 96 | 61.3 | 68.27 | 73.04 | 1.07 | 1.19 |
| f32 | 20 | 1500 | 1500 | 64 | 64 | 71.77 | 73.51 | 74.82 | 1.02 | 1.04 |
| f16 | 12 | 77 | 77 | 64 | 64 | 2.45 | 2.55 | 2.74 | 1.07 | 1.12 |
| f16 | 12 | 77 | 77 | 64 | 64 | 2.52 | 2.73 | 3.18 | 1.17 | 1.26 |
| f16 | 64 | 77 | 77 | 64 | 64 | 13.04 | 13.51 | 15.03 | 1.11 | 1.15 |
| f16 | 20 | 77 | 77 | 64 | 64 | 4 | 4.23 | 4.60 | 1.09 | 1.15 |
| f16 | 20 | 77 | 77 | 64 | 64 | 4.13 | 4.45 | 5.22 | 1.17 | 1.26 |
| f32 | 2 | 64 | 64 | 512 | 512 | 0.56 | 0.69 | 0.74 | 1.07 | 1.32 |
| f32 | 1 | 64 | 64 | 512 | 512 | 0.31 | 0.33 | 0.33 | 1.01 | 1.07 |
| f16 | 32 | 4096 | 4096 | 128 | 128 | 715.07 | 725.89 | 748.19 | 1.03 | 1.05 |
| f16 | 32 | 1 | 4096 | 128 | 128 | 0.6 | 0.58 | 0.60 | 1.04 | 1.00 |
| f32 | 32 | 4096 | 4096 | 128 | 128 | 151.09 | 149.06 | 158.37 | 1.06 | 1.05 |
| f32 | 32 | 1 | 4097 | 128 | 128 | 0.23 | 0.24 | 0.24 | 1.02 | 1.06 |
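
The summary figures can be cross-checked against the "speed up vs develop" column. The sketch below is a quick sanity check in Python, not part of the PR's benchmarking scripts: `SPEEDUP_VS_DEVELOP` is a hypothetical name, and its values are copied from that column as displayed (already rounded to two decimals), so the result can differ slightly from a median taken over unrounded measurements.

```python
import statistics

# "speed up vs develop" column, copied row by row from the table above.
# The values are the rounded figures shown there, not raw measurements.
SPEEDUP_VS_DEVELOP = [
    1.20, 1.26, 1.11, 1.05, 1.12, 1.01, 0.98,
    1.10, 1.51, 1.04, 1.25, 1.10, 0.98, 1.38,
    1.18, 1.19, 1.04, 1.12, 1.26, 1.15, 1.15,
    1.26, 1.32, 1.07, 1.05, 1.00, 1.05, 1.06,
]

median_speedup = statistics.median(SPEEDUP_VS_DEVELOP)
max_speedup = max(SPEEDUP_VS_DEVELOP)
# Configurations that ran slower than develop (speedup below 1.0).
slower_than_develop = sum(1 for s in SPEEDUP_VS_DEVELOP if s < 1.0)

if __name__ == "__main__":
    print(f"configs: {len(SPEEDUP_VS_DEVELOP)}")
    print(f"median speedup: {median_speedup:.3f}")
    print(f"max speedup: {max_speedup:.2f}")
    print(f"configs slower than develop: {slower_than_develop}")
```

On the rounded column this gives a median of about 1.115 and a maximum of 1.51, in line with the summary above, and it shows two configurations at 0.98x.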

Test Plan

Tests pass.

Test Result

Submission Checklist

@dhernandez0 dhernandez0 self-assigned this Oct 7, 2025
@dhernandez0 dhernandez0 requested a review from causten as a code owner October 7, 2025 15:07