Releases: xlite-dev/ffpa-attn-mma
v0.0.2.post4
What's Changed
- [README] Add FFPA Split-D Algo chart by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/79

Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2.post3...v0.0.2.post4
v0.0.2.post3
What's Changed
- [misc] fix macro typo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/75
- [docs] Add FFPA(Split-D) tech blog link by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/77
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2.post2...v0.0.2.post3
v0.0.2.post2
What's Changed
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/71
- [tests] rename test.py -> test_ffpa_attn.py by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/72
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2.post1...v0.0.2.post2
v0.0.2.post1
What's Changed
- [misc] Add latest cutlass 3.7.0 submodule by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/62
- [Bugfix] fix macro typo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/63
- [Misc] Update launch templates configs for small d by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/64
- [misc] remove some wrong comments by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/65
- [test] refactor ffpa-l1 multi-stages tests by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/66
- Revert "[test] refactor ffpa-l1 multi-stages tests" by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/67
- [test] refactor ffpa-l1 multi-stages tests by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/68
- [test] Add official flash-attn -> test cases by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/69
- [feat] support ffpa-l1 registers double buffers by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/70
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2...v0.0.2.post1
v0.0.2
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/2
- [Docs] Add approximate complexity analysis by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/3
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/4
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/5
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/6
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/7
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/8
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/9
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/10
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/11
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/12
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/13
- [FFPA] Refactor FFPA-L1 Part-2 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/14
- [test] Add gen bench table func by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/15
- [README] Update bench, 2x faster than SDPA by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/16
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/17
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/18
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/20
- [FFPA] fix some macro typos by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/21
- [misc] fix setup.py by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/24
- [Misc] find best tflops across multi-stages by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/26
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/29
- [Feature] support L1 QKV smem separation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/32
- [bench] Add more bench perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/34
- [misc] fix bench link typos by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/35
- [Docs] Add Docker image -> Installation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/37
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/38
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/39
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/40
- [README] Update python test cases by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/41
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/42
- [feat] add force ffpa Q*K^T mma acc f16 flag by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/43
- [Bugfix] fix ENABLE_FFPA_FORCE_QK_F16 typo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/44
- [Docs] Add FFPA L1 kernel template signature by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/45
- [feat] support ffpa-l1 persist Q s2r by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/46
- [README] Update ffpa-attn logo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/47
- [Misc] update ffpa-attn title & logo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/48
- [Misc] update ffpa-attn title & logo by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/49
- [feat] support ffpa-l1 persist Q g2r by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/50
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/51
- [feat] Add ffpa-l1 launch_templates by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/52
- [feat] ffpa-l1 persist-qkv g2s for d<=256 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/53
- [feat] tune block size for L1 persist kv g2s by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/54
- [feat] ffpa-l1 persist-kv-s2r for small d by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/55
- [feat] update ffpa-l1 small d kernel launch configs by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/56
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/57
- [Bugfix] fix compile error w/o V s2r by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/58
- [feat] refactor launch templates configs by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/59
- [Release] Bump up to v0.0.2 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/60
- [Release] Bump up to v0.0.2 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/61
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1...v0.0.2
v0.0.1.post3
What's Changed
- [Docs] Add Docker image -> Installation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/36
- [Feature] Add mma mode & fully QKV swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/37
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/38
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/39
- [bench] update perf plots for qkv swizzle by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/40
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1.post2...v0.0.1.post3
v0.0.1.post2
What's Changed
- [misc] fix setup.py by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/22
- [misc] update L20, 4090, A30, 3080 bench by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/23
- [FFPA] support L1 multi-stages 3/4 by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/24
- [Misc] find best tflops across multi-stages by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/25
- [FFPA] rename pyffpa -> ffpa_attn by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/26
- [README] Update README.md by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/27
- [FFPA] L1 support prefetch QKV g2s by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/28
- [Bugfix] fix d < 256 accuracy errors by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/29
- [Feature] support L1 QKV smem separation by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/30
- [Feature] Add ENABLE_FFPA_SMEM_SWIZZLE_V flag by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/31
- [bench] Add RTX 3080 Laptop perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/32
- [bench] Add more bench perf plots by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/34
- [misc] fix bench link typos by @DefTruth in https://github.com/DefTruth/ffpa-attn-mma/pull/35
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1.post1...v0.0.1.post2
v0.0.1.post1
What's Changed
- [Misc] Add install.sh & clear.sh by @DefTruth in #2
- [Docs] Add approximate complexity analysis by @DefTruth in #3
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #4
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #5
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #6
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #7
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #8
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #9
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #10
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #11
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #12
- [FFPA] Refactor FFPA-L1 Part-1 by @DefTruth in #13
- [FFPA] Refactor FFPA-L1 Part-2 by @DefTruth in #14
- [test] Add gen bench table func by @DefTruth in #15
- [README] Update bench, 2x faster than SDPA by @DefTruth in #16
- [README] Update README.md by @DefTruth in #17
- [README] Update README.md by @DefTruth in #18
- [README] Update README.md by @DefTruth in #19
- [Bugfix] fix prefill.cuh un-used vars by @DefTruth in #20
- [FFPA] fix some macro typos by @DefTruth in #21
Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1...v0.0.1.post1
cuffpa-py 0.0.1 beta L1 Release
FFPA L1 (Level 1): Benchmark
L1 (Level 1): O(Br * 16) ~ O(1) SRAM complexity and O(d/4) register complexity, with the same GPU HBM memory complexity as FlashAttention. Benchmark config: B=1, H=48, N=8192, D=320-1024 (head dims that FA2 does not support). Notes: * = MMA Acc F32, ^ = MMA Acc F16, Softmax Acc dtype is always F32, T = TFLOPS.
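For context on how a table entry maps to wall-clock time, the sketch below times the PyTorch SDPA baseline at the benchmark shape and converts the measured latency into TFLOPS using the standard 4*B*H*N^2*D FLOP count for the attention forward pass (Q@K^T plus P@V). This is a minimal illustration of the measurement methodology under those assumptions, not the repository's own benchmark script (see test_ffpa_attn.py for that); the FFPA kernels themselves are not invoked here.

```python
import torch
import torch.nn.functional as F

# Benchmark shape from the notes above: B=1, H=48, N=8192, D in 320..1024.
B, H, N, D = 1, 48, 8192, 320
q = torch.randn(B, H, N, D, device="cuda", dtype=torch.half)
k, v = torch.randn_like(q), torch.randn_like(q)

# Warm up. At D > 256 FlashAttention-2 is unavailable, so PyTorch falls
# back to another backend such as memory-efficient attention ("SDPA EA").
for _ in range(5):
    F.scaled_dot_product_attention(q, k, v)
torch.cuda.synchronize()

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
iters = 20
start.record()
for _ in range(iters):
    F.scaled_dot_product_attention(q, k, v)
end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end) / iters  # average latency in milliseconds

# Forward attention FLOPs: 2*B*H*N^2*D for Q@K^T plus 2*B*H*N^2*D for P@V.
flops = 4 * B * H * N * N * D
print(f"{ms:.2f} ms/iter -> {flops / (ms * 1e-3) / 1e12:.1f} TFLOPS")
```

At this shape one forward pass is 4 * 1 * 48 * 8192^2 * 320 ≈ 4.12e12 FLOPs, so, for example, the 82T SDPA EA entry in the RTX 4090 table corresponds to roughly 50 ms per call and the 154T FFPA L1^ entry to roughly 27 ms.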
- NVIDIA RTX 3080 Laptop (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 13T | 16T | 12T | 16T | 15T | 15T | 15T | 15T | 15T | 15T | 15T | 15T |
FFPA L1* | 32T | 30T | 30T | 28T | 28T | 27T | 26T | 25T | 25T | 25T | 25T | 24T |
Speedup | 2.48x | 1.88x | 2.55x | 1.75x | 1.90x | 1.77x | 1.73x | 1.67x | 1.66x | 1.66x | 1.66x | 1.54x |
FFPA L1^ | 40T | 38T | 39T | 36T | 35T | 34T | 33T | 32T | 31T | 31T | 28T | 27T |
Speedup | 3.07x | 2.42x | 3.33x | 2.24x | 2.35x | 2.19x | 2.19x | 2.13x | 2.03x | 2.03x | 1.90x | 1.74x |
- NVIDIA RTX 4090 (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 82T | 92T | 83T | 84T | 78T | 80T | 78T | 80T | 78T | 80T | 78T | 79T |
FFPA L1* | 136T | 135T | 135T | 132T | 133T | 133T | 132T | 131T | 130T | 125T | 123T | 93T |
Speedup | 1.64x | 1.45x | 1.61x | 1.57x | 1.71x | 1.65x | 1.68x | 1.62x | 1.65x | 1.56x | 1.55x | 1.17x |
FFPA L1^ | 154T | 161T | 160T | 157T | 156T | 155T | 157T | 154T | 149T | 150T | 145T | 100T |
Speedup | 1.85x | 1.73x | 1.92x | 1.87x | 1.99x | 1.93x | 1.99x | 1.90x | 1.90x | 1.88x | 1.84x | 1.25x |
- NVIDIA L20 (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 56T | 63T | 57T | 58T | 55T | 56T | 54T | 55T | 54T | 55T | 54T | 56T |
FFPA L1* | 99T | 95T | 95T | 93T | 94T | 92T | 92T | 90T | 89T | 90T | 90T | 89T |
Speedup | 1.77x | 1.49x | 1.64x | 1.58x | 1.72x | 1.65x | 1.68x | 1.63x | 1.64x | 1.63x | 1.67x | 1.58x |
FFPA L1^ | 96T | 99T | 100T | 92T | 93T | 92T | 93T | 91T | 90T | 90T | 88T | 91T |
Speedup | 1.71x | 1.55x | 1.73x | 1.56x | 1.69x | 1.65x | 1.71x | 1.64x | 1.65x | 1.63x | 1.62x | 1.62x |
- NVIDIA A30 (* = MMA Acc F32, ^ = MMA Acc F16, T = TFLOPS)
Algorithm | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
---|---|---|---|---|---|---|---|---|---|---|---|---|
SDPA EA | 25T | 25T | 24T | 23T | 24T | 24T | 23T | 22T | 22T | 21T | 21T | 18T |
FFPA L1* | 33T | 33T | 32T | 31T | 32T | 32T | 30T | 28T | 25T | 24T | 24T | 24T |
Speedup | 1.33x | 1.33x | 1.30x | 1.31x | 1.33x | 1.33x | 1.32x | 1.23x | 1.15x | 1.11x | 1.11x | 1.27x |
FFPA L1^ | 33T | 33T | 33T | 30T | 31T | 32T | 31T | 30T | 30T | 27T | 24T | 23T |
Speedup | 1.33x | 1.33x | 1.36x | 1.30x | 1.31x | 1.33x | 1.37x | 1.35x | 1.35x | 1.25x | 1.11x | 1.25x |