Releases: xlite-dev/Awesome-LLM-Inference

v2.6.20

17 Jun 09:57
1250b60

Full Changelog: v2.6.19...v2.6.20

v2.6.19

27 May 05:52
7d153bd

What's Changed

  • 🔥[SageAttention-3] SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-bit Training by @DefTruth in #147

Full Changelog: v2.6.18...v2.6.19

v2.6.18

15 May 06:04
7866762

What's Changed

  • Flex Attention: A Programming Model for Generating Optimized Attention Kernels by @DefTruth in #146

Full Changelog: v2.6.17...v2.6.18

v2.6.17

06 May 02:25
6d4ed04

What's Changed

  • 🔥[BitNet v2] Native 4-bit Activations with Hadamard Transformation for 1-bit LLMs by @DefTruth in #144
  • Add The Sparse Frontier: Sparse Attention Trade-offs in Transformer LLMs by @PiotrNawrot in #145

Full Changelog: v2.6.16...v2.6.17

v2.6.16

27 Apr 08:33
2889533

What's Changed

  • Add PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters by @Lizonghang in #137
  • 🔥🔥[SGLang] Efficiently Programming Large Language Models using SGLang by @DefTruth in #138
  • 🔥[FSDP 1/2] PyTorch FSDP: Getting Started with Fully Sharded Data Parallel (FSDP) by @DefTruth in #139
  • 🔥[MMInference] MMInference: Accelerating Pre-filling for Long-Context Visual Language Models via Modality-Aware Permutation Sparse Attention by @DefTruth in #140
  • Update Multi-GPUs/Multi-Nodes Parallelism by @DefTruth in #141
  • 🔥[Triton-distributed] TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives by @DefTruth in #142

Full Changelog: v2.6.15...v2.6.16

v2.6.15

17 Apr 08:08
73d8740

What's Changed

  • MegaScale-Infer: Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism by @DefTruth in #131
  • TritonBench: Benchmarking Large Language Model Capabilities for Generating Triton Operators by @DefTruth in #132
  • 🔥[KV Cache Prefetch] Accelerating LLM Inference Throughput via Asynchronous KV Cache Prefetching by @DefTruth in #133
  • Add SeerAttention and SlimAttention Papers by @sunshinemyson in #135

Full Changelog: v2.6.14...v2.6.15

v2.6.14

31 Mar 04:56
ea4aa30

What's Changed

  • [feat] Add DeepSeek FlashMLA by @shaoyuyoung in #120
  • Add our ICLR2025 work Dynamic-LLaVA by @Blank-z0 in #121
  • 🔥[MHA2MLA] Towards Economical Inference: Enabling DeepSeek’s Multi-Head Latent Attention in Any Transformer-based LLMs by @DefTruth in #122
  • update the title of SageAttention2 and add SpargeAttn by @jt-zhang in #123
  • Add DeepSeek Open Sources modules by @DefTruth in #124
  • Update DeepSeek/MLA Topics by @DefTruth in #125
  • Request to Add CacheCraft: A Relevant Work on Chunk-Aware KV Cache Reuse by @skejriwal44 in #126
  • 🔥[X-EcoMLA] Upcycling Pre-Trained Attention into MLA for Efficient and Extreme KV Compression by @DefTruth in #127
  • Add download_pdfs.py by @DefTruth in #128
  • Update README.md by @DefTruth in #129
  • Update Mooncake-v3 paper link by @DefTruth in #130

Full Changelog: v2.6.13...v2.6.14

v2.6.13

19 Feb 11:46
0525c4d

Full Changelog: DefTruth/Awesome-LLM-Inference@v2.6.12...v2.6.13

v2.6.12

13 Feb 04:21
1ddf093

Full Changelog: DefTruth/Awesome-LLM-Inference@v2.6.11...v2.6.12

v2.6.11

31 Jan 06:54
d7914c0

Full Changelog: DefTruth/Awesome-LLM-Inference@v2.6.10...v2.6.11