Releases: xlite-dev/Awesome-LLM-Inference

v2.6.20

17 Jun 09:57
1250b60

Full Changelog: v2.6.19...v2.6.20

v2.6.19

27 May 05:52
7d153bd

What's Changed

  • 🔥[SageAttention-3] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training by @DefTruth in #147

Full Changelog: v2.6.18...v2.6.19

v2.6.18

15 May 06:04
7866762

What's Changed

  • Flex Attention: A Programming Model for Generating Optimized Attention Kernels by @DefTruth in #146

Full Changelog: v2.6.17...v2.6.18

v2.6.17

06 May 02:25
6d4ed04

What's Changed

  • 🔥[BitNet v2] Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs by @DefTruth in #144
  • Add The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs by @PiotrNawrot in #145

Full Changelog: v2.6.16...v2.6.17

v2.6.16

27 Apr 08:33
2889533

What's Changed

  • Add PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters by @Lizonghang in #137
  • 🔥🔥[SGLang] Efficiently Programming Large Language Models using SGLang by @DefTruth in #138
  • 🔥[FSDP 1/2] PyTorch FSDP: Getting Started with Fully Sharded Data Parallel (FSDP) by @DefTruth in #139
  • 🔥[MMInference] MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention by @DefTruth in #140
  • Update Multi-GPUs/Multi-Nodes Parallelism by @DefTruth in #141
  • 🔥[Triton-distributed] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives by @DefTruth in #142

Full Changelog: v2.6.15...v2.6.16

v2.6.15

17 Apr 08:08
73d8740

What's Changed

  • MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism by @DefTruth in #131
  • TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators by @DefTruth in #132
  • 🔥[KV Cache Prefetch] Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching by @DefTruth in #133
  • Add SeerAttention and SlimAttention Papers by @sunshinemyson in #135

Full Changelog: v2.6.14...v2.6.15

v2.6.14

31 Mar 04:56
ea4aa30

What's Changed

  • [feat] Add DeepSeek FlashMLA by @shaoyuyoung in #120
  • Add our ICLR2025 work Dynamic-LLaVA by @Blank-z0 in #121
  • 🔥[MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs by @DefTruth in #122
  • update the title of SageAttention2 and add SpargeAttn by @jt-zhang in #123
  • Add DeepSeek Open Sources modules by @DefTruth in #124
  • Update DeepSeek/MLA Topics by @DefTruth in #125
  • Request to Add CacheCraft: A Relevant Work on Chunk-Aware KV Cache Reuse by @skejriwal44 in #126
  • 🔥[X-EcoMLA] Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression by @DefTruth in #127
  • Add download_pdfs.py by @DefTruth in #128
  • Update README.md by @DefTruth in #129
  • Update Mooncake-v3 paper link by @DefTruth in #130

Full Changelog: v2.6.13...v2.6.14

v2.6.13

19 Feb 11:46
0525c4d

Full Changelog: DefTruth/Awesome-LLM-Inference@v2.6.12...v2.6.13

v2.6.12

13 Feb 04:21
1ddf093

Full Changelog: DefTruth/Awesome-LLM-Inference@v2.6.11...v2.6.12

v2.6.11

31 Jan 06:54
d7914c0

Full Changelog: DefTruth/Awesome-LLM-Inference@v2.6.10...v2.6.11