Releases: xlite-dev/ffpa-attn-mma
v0.0.2.post4
What's Changed
- [README] Add FFPA Split-D Algo chart by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/79

Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2.post3...v0.0.2.post4
v0.0.2.post3
What's Changed
- [misc] fix macro typo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/75
- [docs] Add FFPA(Split-D) tech blog link by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/77
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2.post2...v0.0.2.post3
v0.0.2.post2
What's Changed
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/71
- [tests] rename test.py -> test_ffpa_attn.py by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/72
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2.post1...v0.0.2.post2
v0.0.2.post1
What's Changed
- [misc] Add latest cutlass 3.7.0 submodule by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/62
- [Bugfix] fix macro typo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/63
- [Misc] Update launch templates configs for small d by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/64
- [misc] remove some wrong comments by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/65
- [test] refactor ffpa-l1 multi-stages tests by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/66
- Revert "[test] refactor ffpa-l1 multi-stages tests" by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/67
- [test] refactor ffpa-l1 multi-stages tests by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/68
- [test] Add official flash-attn -> test cases by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/69
- [feat] support ffpa-l1 registers double buffers by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/70
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2...v0.0.2.post1
v0.0.2
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/2
- [Docs] Add approximate complexity analysis by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/3
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/4
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/5
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/6
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/7
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/8
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/9
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/10
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/11
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/12
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/13
- [FFPA] Refactor FFPA-L1 Part-2 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/14
- [test] Add gen bench table func by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/15
- [README] Update bench, 2x faster than SDPA by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/16
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/17
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/18
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/20
- [FFPA] fix some macro typos by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/21
- [misc] fix setup.py by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/24
- [Misc] find best tflops across multi-stages by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/26
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/29
- [Feature] support L1 QKV smem separation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/32
- [bench] Add more bench perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/34
- [misc] fix bench link typos by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/35
- [Docs] Add Docker image -> Installation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/37
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/38
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/39
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/40
- [README] Update python test cases by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/41
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/42
- [feat] add force ffpa Q*K^T mma acc f16 flag by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/43
- [Bugfix] fix ENABLE_FFPA_FORCE_QK_F16 typo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/44
- [Docs] Add FFPA L1 kernel template signature by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/45
- [feat] support ffpa-l1 persist Q s2r by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/46
- [README] Update ffpa-attn logo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/47
- [Misc] update ffpa-attn title & logo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/48
- [Misc] update ffpa-attn title & logo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/49
- [feat] support ffpa-l1 persist Q g2r by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/50
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/51
- [feat] Add ffpa-l1 launch_templates by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/52
- [feat] ffpa-l1 persist-qkv g2s for d<=256 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/53
- [feat] tune block size for L1 persist kv g2s by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/54
- [feat] ffpa-l1 persist-kv-s2r for small d by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/55
- [feat] update ffpa-l1 small d kernel launch configs by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/56
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/57
- [Bugfix] fix compile error w/o V s2r by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/58
- [feat] refactor launch templates configs by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/59
- [Release] Bump up to v0.0.2 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/60
- [Release] Bump up to v0.0.2 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/61
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1...v0.0.2
v0.0.1.post3
What's Changed
- [Docs] Add Docker image -> Installation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/37
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/38
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/39
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/40
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1.post2...v0.0.1.post3
v0.0.1.post2
What's Changed
- [misc] fix setup.py by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/24
- [Misc] find best tflops across multi-stages by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/26
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/29
- [Feature] support L1 QKV smem separation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/32
- [bench] Add more bench perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/34
- [misc] fix bench link typos by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/35
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1.post1...v0.0.1.post2
v0.0.1.post1
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in #2
- [Docs] Add approximate complexity analysis by @DefTruth in #3
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #4
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #5
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #6
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #7
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #8
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #9
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #10
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #11
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #12
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #13
- [FFPA] Refactor FFPA-L1 Part-2 by @DefTruth in #14
- [test] Add gen bench table func by @DefTruth in #15
- [README] Update bench, 2x faster than SDPA by @DefTruth in #16
- [README] Update README.md by @DefTruth in #17
- [README] Update README.md by @DefTruth in #18
- [README] Update README.md by @DefTruth in #19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in #20
- [FFPA] fix some macro typos by @DefTruth in #21
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1...v0.0.1.post1
cuffpa-py 0.0.1 beta L1 Release
FFPA L1 (Level 1): Benchmark
L1 (Level 1): O(Br * 16) ~ O(1) SRAM complexity and O(d/4) register complexity, with the same GPU HBM memory complexity as FlashAttention. Benchmark config: B=1, H=48, N=8192, D=320-1024 (head dims that FA2 does not support). Notes: * = MMA Acc F32, ^ = MMA Acc F16, Softmax Acc dtype is always F32, T = TFLOPS.
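For context on how a table entry maps to wall-clock time, the sketch below times the PyTorch SDPA baseline at the benchmark shape and converts the measured latency into TFLOPS using the standard 4*B*H*N^2*D FLOP count for the attention forward pass (Q@K^T plus P@V). This is a minimal illustration of the measurement methodology under those assumptions, not the repository's own benchmark script (see test_ffpa_attn.py for that); the FFPA kernels themselves are not invoked here.

```python
import torch
import torch.nn.functional as F

# Benchmark shape from the notes above: B=1, H=48, N=8192, D in 320..1024.
B, H, N, D = 1, 48, 8192, 320
q = torch.randn(B, H, N, D, device="cuda", dtype=torch.half)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm up. At D > 256 FlashAttention-2 is unavailable, so PyTorch falls
# back to another backend such as memory-efficient attention ("SDPA EA").
for _ in range(5):
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20
start.record()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / iters  # average latency in milliseconds

# Forward attention FLOPs: 2*B*H*N^2*D for Q@K^T plus 2*B*H*N^2*D for P@V.
flops = 4 * B * H * N * N * D
print(f"{ms:.2f} ms/iter -> {flops / (ms * 1e-3) / 1e12:.1f} TFLOPS")
```

At this shape one forward pass is 4 * 1 * 48 * 8192^2 * 320 ≈ 4.12e12 FLOPs, so, for example, the 82T SDPA EA entry in the RTX 4090 table corresponds to roughly 50 ms per call and the 154T FFPA L1^ entry to roughly 27 ms.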
- NVIDIA RTX 3080 Laptop (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 13T | 16T | 12T | 16T | 15T | 15T | 15T | 15T | 15T | 15T | 15T | 15T |
FFPA L1* | 32T | 30T | 30T | 28T | 28T | 27T | 26T | 25T | 25T | 25T | 25T | 24T |
Speedup | 2.48x | 1.88x | 2.55x | 1.75x | 1.90x | 1.77x | 1.73x | 1.67x | 1.66x | 1.66x | 1.66x | 1.54x |
FFPA L1^ | 40T | 38T | 39T | 36T | 35T | 34T | 33T | 32T | 31T | 31T | 28T | 27T |
Speedup | 3.07x | 2.42x | 3.33x | 2.24x | 2.35x | 2.19x | 2.19x | 2.13x | 2.03x | 2.03x | 1.90x | 1.74x |
- NVIDIA RTX 4090 (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 82T | 92T | 83T | 84T | 78T | 80T | 78T | 80T | 78T | 80T | 78T | 79T |
FFPA L1* | 136T | 135T | 135T | 132T | 133T | 133T | 132T | 131T | 130T | 125T | 123T | 93T |
Speedup | 1.64x | 1.45x | 1.61x | 1.57x | 1.71x | 1.65x | 1.68x | 1.62x | 1.65x | 1.56x | 1.55x | 1.17x |
FFPA L1^ | 154T | 161T | 160T | 157T | 156T | 155T | 157T | 154T | 149T | 150T | 145T | 100T |
Speedup | 1.85x | 1.73x | 1.92x | 1.87x | 1.99x | 1.93x | 1.99x | 1.90x | 1.90x | 1.88x | 1.84x | 1.25x |
- NVIDIA L20 (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 56T | 63T | 57T | 58T | 55T | 56T | 54T | 55T | 54T | 55T | 54T | 56T |
FFPA L1* | 99T | 95T | 95T | 93T | 94T | 92T | 92T | 90T | 89T | 90T | 90T | 89T |
Speedup | 1.77x | 1.49x | 1.64x | 1.58x | 1.72x | 1.65x | 1.68x | 1.63x | 1.64x | 1.63x | 1.67x | 1.58x |
FFPA L1^ | 96T | 99T | 100T | 92T | 93T | 92T | 93T | 91T | 90T | 90T | 88T | 91T |
Speedup | 1.71x | 1.55x | 1.73x | 1.56x | 1.69x | 1.65x | 1.71x | 1.64x | 1.65x | 1.63x | 1.62x | 1.62x |
- NVIDIA A30 (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 25T | 25T | 24T | 23T | 24T | 24T | 23T | 22T | 22T | 21T | 21T | 18T |
FFPA L1* | 33T | 33T | 32T | 31T | 32T | 32T | 30T | 28T | 25T | 24T | 24T | 24T |
Speedup | 1.33x | 1.33x | 1.30x | 1.31x | 1.33x | 1.33x | 1.32x | 1.23x | 1.15x | 1.11x | 1.11x | 1.27x |
FFPA L1^ | 33T | 33T | 33T | 30T | 31T | 32T | 31T | 30T | 30T | 27T | 24T | 23T |
Speedup | 1.33x | 1.33x | 1.36x | 1.30x | 1.31x | 1.33x | 1.37x | 1.35x | 1.35x | 1.25x | 1.11x | 1.25x |