vipshop/cache-dit

A Unified and Flexible Inference Engine with 🤗🎉
Hybrid Cache Acceleration and Parallelism for DiTs

🔥Highlight

We are excited to announce that 🎉cache-dit v1.1.0 has finally been released! It brings 🔥Context Parallelism and 🔥Tensor Parallelism, making cache-dit a Unified and Flexible Inference Engine for 🤗DiTs. Key features: Unified Cache APIs, Forward Pattern Matching, Block Adapter, DBCache, DBPrune, Cache CFG, TaylorSeer, Context Parallelism, Tensor Parallelism, and 🎉SOTA performance.

pip3 install -U cache-dit # Also, pip3 install git+https://github.com/huggingface/diffusers.git (latest)

You can install the stable release of cache-dit from PyPI, or the latest development version from GitHub. Then try ♥️ Cache Acceleration with just one line of code ~ ♥️

>>> import cache_dit
>>> from diffusers import DiffusionPipeline
>>> pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image") # Can be any diffusion pipeline
>>> cache_dit.enable_cache(pipe) # One-line code with default cache options.
>>> output = pipe(...) # Just call the pipe as normal.
>>> stats = cache_dit.summary(pipe) # Then, get the summary of cache acceleration stats.
>>> cache_dit.disable_cache(pipe) # Disable cache and run original pipe.

📚Core Features

  • 🎉Full 🤗Diffusers Support: Notably, cache-dit now supports nearly all of Diffusers' DiT-based pipelines, covering 30+ model series and nearly 100 pipelines, such as FLUX.1, Qwen-Image, Qwen-Image-Lightning, Wan 2.1/2.2, HunyuanImage-2.1, HunyuanVideo, HiDream, AuraFlow, CogView3Plus, CogView4, CogVideoX, LTXVideo, ConsisID, SkyReelsV2, VisualCloze, PixArt, Chroma, Mochi, SD 3.5, DiT-XL, etc.
  • 🎉Extremely Easy to Use: In most cases, you only need one line of code: cache_dit.enable_cache(...). After calling this API, just use the pipeline as normal.
  • 🎉Easy New Model Integration: Features like Unified Cache APIs, Forward Pattern Matching, Automatic Block Adapter, Hybrid Forward Pattern, and Patch Functor make it highly functional and flexible. For example, we achieved 🎉 Day 1 support for HunyuanImage-2.1 with 1.7x speedup w/o precision loss—even before it was available in the Diffusers library.
  • 🎉State-of-the-Art Performance: Compared with algorithms including Δ-DiT, Chipmunk, FORA, DuCa, TaylorSeer and FoCa, cache-dit achieves SOTA performance, reaching a 7.4x↑🎉 speedup on FLUX.1-dev while maintaining a competitive ClipScore!
  • 🎉Support for 4/8-Steps Distilled Models: Surprisingly, cache-dit's DBCache works for extremely few-step distilled models—something many other methods fail to do.
  • 🎉Compatibility with Other Optimizations: Designed to work seamlessly with torch.compile, Quantization (torchao, 🔥nunchaku), CPU or Sequential Offloading, 🔥Context Parallelism, 🔥Tensor Parallelism, etc. (see the sketch after this list for how these compose).
  • 🎉Hybrid Cache Acceleration: Now supports hybrid Block-wise Cache + Calibrator schemes (e.g., DBCache or DBPrune + TaylorSeerCalibrator). DBCache or DBPrune acts as the Indicator to decide when to cache, while the Calibrator decides how to cache. More mainstream cache acceleration algorithms (e.g., FoCa) will be supported in the future, along with additional benchmarks—stay tuned for updates!
  • 🤗Diffusers Ecosystem Integration: 🔥cache-dit has joined the Diffusers community ecosystem as the first DiT-specific cache acceleration framework! Check out the documentation in the Diffusers docs.
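
To make the compatibility claim above concrete, here is a minimal sketch that layers cache acceleration on top of torch.compile. Only cache_dit.enable_cache(...) is taken from the quick start above; the model id, prompt, and the choice to compile pipe.transformer are illustrative assumptions, and CPU offloading is shown only as an optional, commented-out alternative.

import torch
import cache_dit
from diffusers import DiffusionPipeline

# Load a supported DiT pipeline (FLUX.1-dev is used here only as an example).
pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

# pipe.enable_model_cpu_offload()  # optional: low-VRAM setups (instead of .to("cuda"))

# One-line cache acceleration, exactly as in the quick start.
cache_dit.enable_cache(pipe)

# Optional: compile the transformer for additional speedup on top of caching.
pipe.transformer = torch.compile(pipe.transformer)

image = pipe("a cat wearing sunglasses, studio lighting", num_inference_steps=28).images[0]
image.save("flux_cached.png")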

A comparison between cache-dit and other algorithms shows that, within a 🎉4x speedup ratio (measured in TFLOPs), cache-dit achieves SOTA performance. Please refer to 📚Benchmarks for more details.

| Method | TFLOPs(↓) | SpeedUp(↑) | ImageReward(↑) | Clip Score(↑) |
| --- | --- | --- | --- | --- |
| [FLUX.1-dev]: 50 steps | 3726.87 | 1.00× | 0.9898 | 32.404 |
| Chipmunk | 1505.87 | 2.47× | 0.9936 | 32.776 |
| FORA(N=3) | 1320.07 | 2.82× | 0.9776 | 32.266 |
| DBCache(S) | 1400.08 | 2.66× | 1.0065 | 32.838 |
| DuCa(N=5) | 978.76 | 3.80× | 0.9955 | 32.241 |
| TaylorSeer(N=4,O=2) | 1042.27 | 3.57× | 0.9857 | 32.413 |
| DBCache(S)+TS | 1153.05 | 3.23× | 1.0221 | 32.819 |
| DBCache(M) | 944.75 | 3.94× | 0.9997 | 32.849 |
| DBCache(M)+TS | 944.75 | 3.94× | 1.0107 | 32.865 |
| FoCa(N=5): arxiv.2508 | 893.54 | 4.16× | 1.0029 | 32.948 |
| [FLUX.1-dev]: 22% steps | 818.29 | 4.55× | 0.8183 | 31.772 |
| FORA(N=7) | 670.14 | 5.55× | 0.7418 | 31.519 |
| ToCa(N=12) | 644.70 | 5.77× | 0.7155 | 31.808 |
| DuCa(N=10) | 606.91 | 6.13× | 0.8382 | 31.759 |
| TeaCache(l=1.2) | 669.27 | 5.56× | 0.7394 | 31.704 |
| TaylorSeer(N=7,O=2) | 670.44 | 5.54× | 0.9128 | 32.128 |
| DBCache(F) | 651.90 | 5.72× | 0.9271 | 32.552 |
| FoCa(N=8): arxiv.2508 | 596.07 | 6.24× | 0.9502 | 32.706 |
| DBCache(F)+TS | 651.90 | 5.72× | 0.9526 | 32.568 |
| DBCache(U)+TS | 505.47 | 7.37× | 0.8645 | 32.719 |

🎉Surprisingly, cache-dit still works on extremely few-step distilled models such as Qwen-Image-Lightning: with the F16B16 config, the PSNR is 34.8 and the ImageReward is 1.26, so it maintains relatively high precision (a configuration sketch follows the table below).

| Config | PSNR(↑) | Clip Score(↑) | ImageReward(↑) | TFLOPs(↓) | SpeedUp(↑) |
| --- | --- | --- | --- | --- | --- |
| [Full 4 steps] | INF | 35.5797 | 1.2630 | 274.33 | 1.00× |
| F24B24 | 36.3242 | 35.6224 | 1.2630 | 264.74 | 1.04× |
| F16B16 | 34.8163 | 35.6109 | 1.2614 | 244.25 | 1.12× |
| F12B12 | 33.8953 | 35.6535 | 1.2549 | 234.63 | 1.17× |
| F8B8 | 33.1374 | 35.7284 | 1.2517 | 224.29 | 1.22× |
| F1B0 | 31.8317 | 35.6651 | 1.2397 | 206.90 | 1.33× |
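
For readers wondering what the FnBm labels mean: in DBCache, roughly, the first n and the last m transformer blocks are always computed while the middle blocks are served from cache. The sketch below shows how an F16B16-style setting might be expressed; the BasicCacheConfig class and the Fn_compute_blocks / Bn_compute_blocks parameter names are assumptions for illustration only (the quick start's plain enable_cache(pipe) call is the documented path), so please check the 🎉User_Guide.md for the actual option names.

import cache_dit
from diffusers import DiffusionPipeline

# NOTE: BasicCacheConfig and its parameter names below are assumptions used to
# illustrate the F16B16 notation; see User_Guide.md for the verified API.
from cache_dit import BasicCacheConfig  # assumed import

pipe = DiffusionPipeline.from_pretrained("Qwen/Qwen-Image")

cache_dit.enable_cache(
    pipe,
    cache_config=BasicCacheConfig(
        Fn_compute_blocks=16,  # "F16": always compute the first 16 transformer blocks
        Bn_compute_blocks=16,  # "B16": always compute the last 16 transformer blocks
    ),
)

image = pipe("a scenic mountain lake at sunrise").images[0]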

🔥Supported DiTs

Tip

One Model Series may contain many pipelines. cache-dit applies optimizations at the Transformer level; thus, any pipeline that includes a supported transformer is already supported by cache-dit. ✅: known to work and officially supported; ✖️: not officially supported yet, but may be supported in the future; Q: 4-bit models w/ nunchaku + SVDQ W4A4.

📚Model Cache CP TP 📚Model Cache CP TP
🎉FLUX.1 🎉FLUX.1 Q ✖️
🎉FLUX.1-Fill 🎉FLUX.1-Fill Q ✖️
🎉Qwen-Image 🎉Qwen-Image Q ✖️
🎉Qwen...Edit 🎉Qwen...Edit Q ✖️
🎉Qwen...Lightning 🎉Qwen...Light Q ✖️
🎉Qwen...Control.. 🎉Qwen...E...Light Q ✖️
🎉Wan 2.1 I2V/T2V 🎉Mochi ✖️
🎉Wan 2.1 VACE 🎉HiDream ✖️ ✖️
🎉Wan 2.2 I2V/T2V 🎉HunyuanDiT ✖️
🎉HunyuanVideo 🎉Sana ✖️ ✖️
🎉ChronoEdit 🎉Bria ✖️ ✖️
🎉CogVideoX 🎉SkyReelsV2 ✖️ ✖️
🎉CogVideoX 1.5 🎉Lumina 1/2 ✖️ ✖️
🎉CogView4 🎉DiT-XL ✖️
🎉CogView3Plus 🎉Allegro ✖️ ✖️
🎉PixArt Sigma 🎉Cosmos ✖️ ✖️
🎉PixArt Alpha 🎉OmniGen ✖️ ✖️
🎉Chroma-HD ✅ 🎉EasyAnimate ✖️ ✖️
🎉VisualCloze 🎉StableDiffusion3 ✖️ ✖️
🎉HunyuanImage 🎉PRX T2I ✖️ ✖️
🎉Kandinsky5 ✅️ ✅️ 🎉Amused ✖️ ✖️
🎉LTXVideo 🎉AuraFlow ✖️ ✖️
🎉ConsisID 🎉LongCatVideo ✖️ ✖️
🔥Click here to show many Image/Video cases🔥

🎉Now, cache-dit covers almost All Diffusers' DiT Pipelines🎉
🔥Qwen-Image | Qwen-Image-Edit | Qwen-Image-Edit-Plus 🔥
🔥FLUX.1 | Qwen-Image-Lightning 4/8 Steps | Wan 2.1 | Wan 2.2 🔥
🔥HunyuanImage-2.1 | HunyuanVideo | HunyuanDiT | HiDream | AuraFlow🔥
🔥CogView3Plus | CogView4 | LTXVideo | CogVideoX | CogVideoX 1.5 | ConsisID🔥
🔥Cosmos | SkyReelsV2 | VisualCloze | OmniGen 1/2 | Lumina 1/2 | PixArt🔥
🔥Chroma | Sana | Allegro | Mochi | SD 3/3.5 | Amused | ... | DiT-XL🔥

🔥Wan2.2 MoE | +cache-dit:2.0x↑🎉 | HunyuanVideo | +cache-dit:2.1x↑🎉

🔥Qwen-Image | +cache-dit:1.8x↑🎉 | FLUX.1-dev | +cache-dit:2.1x↑🎉

🔥Qwen...Lightning | +cache-dit:1.14x↑🎉 | HunyuanImage | +cache-dit:1.7x↑🎉

🔥Qwen-Image-Edit | Input w/o Edit | Baseline | +cache-dit:1.6x↑🎉 | 1.9x↑🎉

🔥FLUX-Kontext-dev | Baseline | +cache-dit:1.3x↑🎉 | 1.7x↑🎉 | 2.0x↑ 🎉

🔥HiDream-I1 | +cache-dit:1.9x↑🎉 | CogView4 | +cache-dit:1.4x↑🎉 | 1.7x↑🎉

🔥CogView3 | +cache-dit:1.5x↑🎉 | 2.0x↑🎉| Chroma1-HD | +cache-dit:1.9x↑🎉

🔥Mochi-1-preview | +cache-dit:1.8x↑🎉 | SkyReelsV2 | +cache-dit:1.6x↑🎉

🔥VisualCloze-512 | Model | Cloth | Baseline | +cache-dit:1.4x↑🎉 | 1.7x↑🎉

🔥LTX-Video-0.9.7 | +cache-dit:1.7x↑🎉 | CogVideoX1.5 | +cache-dit:2.0x↑🎉

🔥OmniGen-v1 | +cache-dit:1.5x↑🎉 | 3.3x↑🎉 | Lumina2 | +cache-dit:1.9x↑🎉

🔥Allegro | +cache-dit:1.36x↑🎉 | AuraFlow-v0.3 | +cache-dit:2.27x↑🎉

🔥Sana | +cache-dit:1.3x↑🎉 | 1.6x↑🎉| PixArt-Sigma | +cache-dit:2.3x↑🎉

🔥PixArt-Alpha | +cache-dit:1.6x↑🎉 | 1.8x↑🎉| SD 3.5 | +cache-dit:2.5x↑🎉

🔥Amused | +cache-dit:1.1x↑🎉 | 1.2x↑🎉 | DiT-XL-256 | +cache-dit:1.8x↑🎉
♥️ Please consider leaving a ⭐️ Star to support us ~ ♥️

📖Table of Contents

For more advanced features such as Unified Cache APIs, Forward Pattern Matching, Automatic Block Adapter, Hybrid Forward Pattern, Patch Functor, DBCache, DBPrune, TaylorSeer Calibrator, Hybrid Cache CFG, Context Parallelism and Tensor Parallelism, please refer to the 🎉User_Guide.md for details.
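
As a rough orientation for the parallelism features, the sketch below shows what a two-GPU context-parallel run might look like. Only the torchrun launcher, the standard Diffusers calls, and cache_dit.enable_cache are known quantities here; the ParallelismConfig name and the ulysses_size argument are assumptions made purely for illustration, and the verified options live in the 🎉User_Guide.md.

# Assumed launch command: torchrun --nproc_per_node=2 run_context_parallel.py
import os
import torch
import cache_dit
from diffusers import DiffusionPipeline

# The import below is an assumption for illustration; the real name may differ.
from cache_dit import ParallelismConfig  # assumed

local_rank = int(os.environ.get("LOCAL_RANK", 0))
torch.cuda.set_device(local_rank)

pipe = DiffusionPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
).to("cuda")

cache_dit.enable_cache(
    pipe,
    parallelism_config=ParallelismConfig(ulysses_size=2),  # assumed: shard attention across 2 GPUs
)

image = pipe("a watercolor lighthouse at dusk", num_inference_steps=28).images[0]
if local_rank == 0:
    image.save("flux_context_parallel.png")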

👋Contribute

How to contribute? Star ⭐️ this repo to support us or check CONTRIBUTE.md.

🎉Projects Using CacheDiT

Here is a curated list of open-source projects integrating CacheDiT, including popular repositories like jetson-containers, flux-fast, and sdnext. 🎉CacheDiT has been recommended by: Wan 2.2, Qwen-Image-Lightning, Qwen-Image, LongCat-Video, Kandinsky-5, 🤗diffusers and HelloGitHub, among others.

©️Acknowledgements

Special thanks to vipshop's Computer Vision AI Team for supporting the documentation, testing, and production-level deployment of this project. We learned from the design of, and reused code from, the following projects: 🤗diffusers, ParaAttention, xDiT and TaylorSeer.

©️Citations

@misc{cache-dit@2025,
  title={cache-dit: A Unified and Flexible Inference Engine with Hybrid Cache Acceleration and Parallelism for DiTs.},
  url={https://github.com/vipshop/cache-dit.git},
  note={Open-source software available at https://github.com/vipshop/cache-dit.git},
  author={DefTruth, vipshop.com},
  year={2025}
}