Releases: xlite-dev/ffpa-attn-mma

v0.0.2.post4

17 Mar 01:32
466e004

v0.0.2.post3

05 Mar 12:10
15d3c91

v0.0.2.post2

13 Feb 11:37
ffa4b7c

What's Changed

Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2.post1...v0.0.2.post2

v0.0.2.post1

05 Feb 09:25
6a85c42

What's Changed

Full Changelog: DefTruth/ffpa-attn-mma@v0.0.2...v0.0.2.post1

v0.0.2

23 Jan 02:23
7eb2a3d

What's Changed

Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1...v0.0.2

v0.0.1.post3

14 Jan 16:16
e63cf1b

What's Changed

Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1.post2...v0.0.1.post3

v0.0.1.post2

14 Jan 04:58
8feacdc

What's Changed

Full Changelog: DefTruth/ffpa-attn-mma@v0.0.1.post1...v0.0.1.post2

FFPA 0.0.1.post1

09 Jan 12:55
e1b3bbc

What's Changed

  • [Misc] Add install.sh & clear.sh by @DefTruth in #2
  • [Docs] Add approximate complexity analysis by @DefTruth in #3
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #4
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #5
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #6
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #7
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #8
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #9
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #10
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #11
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #12
  • [FFPA] Refactor FFPA-L1 Part-1✔️ by @DefTruth in #13
  • [FFPA] Refactor FFPA-L1 Part-2✔️ by @DefTruth in #14
  • [test] Add gen bench table func✔️ by @DefTruth in #15
  • [README] Update bench, 2x faster than SDPA✔️ by @DefTruth in #16
  • [README] Update README.md by @DefTruth in #17
  • [README] Update README.md by @DefTruth in #18
  • [README] Update README.md by @DefTruth in #19
  • [Bugfix] fix prefill.cuh un-used vars by @DefTruth in #20
  • [FFPA] fix some macro typos by @DefTruth in #21

Full Changelog: v0.0.1...v0.0.1.post1

🎉 cuffpa-py 0.0.1 beta L1 Release

06 Jan 05:42
e24aede

📖 FFPA L1 (Level 1): Benchmark 🎉🎉

L1: level 1, O(Brx16)~O(1) SRAM complexity, O(d/4) register complexity, and the same GPU HBM memory complexity as FlashAttention. Benchmark settings: B=1, H=48, N=8192, D=320-1024 (FA2 not supported 👀). Notes: *=MMA Acc F32, ^=MMA Acc F16; the Softmax Acc dtype is always F32; T=TFLOPS. 👇Benchmark
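
The T (TFLOPS) figures can be related to wall-clock time with a back-of-the-envelope calculation. The sketch below assumes the common convention of counting 4·B·H·N²·D FLOPs for a non-causal attention forward pass (two matmuls, softmax ignored). The release notes do not state which formula the benchmark script uses, and the helper names `attention_flops`, `tflops`, and `seconds_at` are hypothetical.

```python
def attention_flops(batch: int, heads: int, seq_len: int, head_dim: int) -> int:
    """Approximate forward FLOPs for non-causal attention.

    Assumption: Q@K^T and P@V each cost 2 * N * N * D FLOPs per head,
    and the softmax itself is not counted.
    """
    return 4 * batch * heads * seq_len * seq_len * head_dim


def tflops(flops: int, seconds: float) -> float:
    """Convert a FLOP count and an elapsed time into TFLOPS."""
    return flops / seconds / 1e12


def seconds_at(flops: int, rate_tflops: float) -> float:
    """Time needed to execute `flops` at a sustained rate in TFLOPS."""
    return flops / (rate_tflops * 1e12)


# Benchmark shape from the notes above: B=1, H=48, N=8192, smallest D=320.
flops = attention_flops(batch=1, heads=48, seq_len=8192, head_dim=320)
print(f"work per forward pass: {flops / 1e12:.2f} TFLOP")

# Illustrative only: time implied by a sustained 100 TFLOPS.
print(f"time at 100 TFLOPS: {seconds_at(flops, 100.0) * 1e3:.1f} ms")
```

Because the FLOP count grows linearly with D under this convention, a flat TFLOPS row across head dimensions means runtime scales roughly linearly with D.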

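  • 📚 NVIDIA RTX 3080 Laptop (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm (headdim D) | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDPA EA | 13T | 16T | 12T | 16T | 15T | 15T | 15T | 15T | 15T | 15T | 15T | 15T |
| FFPA L1* | 32T | 30T | 30T | 28T | 28T | 27T | 26T | 25T | 25T | 25T | 25T | 24T |
| Speedup | 2.48x | 1.88x | 2.55x | 1.75x | 1.90x | 1.77x | 1.73x | 1.67x | 1.66x | 1.66x | 1.66x | 1.54x |
| FFPA L1^ | 40T | 38T | 39T | 36T | 35T | 34T | 33T | 32T | 31T | 31T | 28T | 27T |
| Speedup | 3.07x | 2.42x | 3.33x | 2.24x | 2.35x | 2.19x | 2.19x | 2.13x | 2.03x | 2.03x | 1.90x | 1.74x |
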
  • 📚 NVIDIA RTX 4090 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm (headdim D) | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDPA EA | 82T | 92T | 83T | 84T | 78T | 80T | 78T | 80T | 78T | 80T | 78T | 79T |
| FFPA L1* | 136T | 135T | 135T | 132T | 133T | 133T | 132T | 131T | 130T | 125T | 123T | 93T |
| Speedup | 1.64x | 1.45x | 1.61x | 1.57x | 1.71x | 1.65x | 1.68x | 1.62x | 1.65x | 1.56x | 1.55x | 1.17x |
| FFPA L1^ | 154T | 161T | 160T | 157T | 156T | 155T | 157T | 154T | 149T | 150T | 145T | 100T |
| Speedup | 1.85x | 1.73x | 1.92x | 1.87x | 1.99x | 1.93x | 1.99x | 1.90x | 1.90x | 1.88x | 1.84x | 1.25x |

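  • 📚 NVIDIA L20 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm (headdim D) | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDPA EA | 56T | 63T | 57T | 58T | 55T | 56T | 54T | 55T | 54T | 55T | 54T | 56T |
| FFPA L1* | 99T | 95T | 95T | 93T | 94T | 92T | 92T | 90T | 89T | 90T | 90T | 89T |
| Speedup | 1.77x | 1.49x | 1.64x | 1.58x | 1.72x | 1.65x | 1.68x | 1.63x | 1.64x | 1.63x | 1.67x | 1.58x |
| FFPA L1^ | 96T | 99T | 100T | 92T | 93T | 92T | 93T | 91T | 90T | 90T | 88T | 91T |
| Speedup | 1.71x | 1.55x | 1.73x | 1.56x | 1.69x | 1.65x | 1.71x | 1.64x | 1.65x | 1.63x | 1.62x | 1.62x |
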
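  • 📚 NVIDIA A30 (*=MMA Acc F32, ^=MMA Acc F16, T=TFLOPS)

| Algorithm (headdim D) | 320 | 384 | 448 | 512 | 576 | 640 | 704 | 768 | 832 | 896 | 960 | 1024 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| SDPA EA | 25T | 25T | 24T | 23T | 24T | 24T | 23T | 22T | 22T | 21T | 21T | 18T |
| FFPA L1* | 33T | 33T | 32T | 31T | 32T | 32T | 30T | 28T | 25T | 24T | 24T | 24T |
| Speedup | 1.33x | 1.33x | 1.30x | 1.31x | 1.33x | 1.33x | 1.32x | 1.23x | 1.15x | 1.11x | 1.11x | 1.27x |
| FFPA L1^ | 33T | 33T | 33T | 30T | 31T | 32T | 31T | 30T | 30T | 27T | 24T | 23T |
| Speedup | 1.33x | 1.33x | 1.36x | 1.30x | 1.31x | 1.33x | 1.37x | 1.35x | 1.35x | 1.25x | 1.11x | 1.25x |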
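
Each Speedup row is the ratio of the FFPA throughput to the SDPA EA throughput at the same head dimension. As a rough check, the sketch below recomputes that ratio from the rounded TFLOPS values of the RTX 4090 rows (SDPA EA and FFPA L1*). The published speedups were presumably derived from unrounded measurements, so the recomputed values can differ in the second decimal place. The variable and function names are hypothetical.

```python
# Rounded TFLOPS copied from the RTX 4090 rows above (headdim 320 ... 1024).
HEAD_DIMS = [320, 384, 448, 512, 576, 640, 704, 768, 832, 896, 960, 1024]
SDPA_EA = [82, 92, 83, 84, 78, 80, 78, 80, 78, 80, 78, 79]
FFPA_L1_F32 = [136, 135, 135, 132, 133, 133, 132, 131, 130, 125, 123, 93]


def speedups(baseline, candidate):
    """Element-wise throughput ratio candidate / baseline."""
    if len(baseline) != len(candidate):
        raise ValueError("rows must have the same number of columns")
    return [c / b for b, c in zip(baseline, candidate)]


ratios = speedups(SDPA_EA, FFPA_L1_F32)
for d, r in zip(HEAD_DIMS, ratios):
    print(f"D={d:4d}: {r:.2f}x")
```

The same helper applies to any other pair of rows, for example SDPA EA against FFPA L1^ on another device.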