Wuyang Li · Wentao Pan · Po-Chien Luan · Yang Gao · Alexandre Alahi
Technical introduction (unofficial): AI Papers Slop (English); WechatApp (Chinese)
Stable Video Infinity (SVI) is able to generate ANY-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines in ANY domain.
- OpenSVI: Everything is open-sourced: training & evaluation scripts, datasets, and more.
- Infinite Length: No inherent limit on video duration; generate arbitrarily long stories (see the 10-minute "Tom and Jerry" demo).
- Versatile: Supports diverse in-the-wild generation tasks: multi-scene short films, single-scene animations, skeleton-/audio-conditioned generation, cartoons, and more.
- Efficient: Only LoRA adapters are tuned, requiring very little training data: anyone can make their own SVI easily.
You can watch our 8-minute crazy-version Tom & Jerry video on Bilibili or YouTube. If you find this project useful, we would really appreciate your star ⭐, which encourages us to keep contributing to the open-source community! This repository will be continuously maintained. Thank you!
📧 Contact: [email protected]
We've recently discovered that some users have been using SVI workflows incorrectly. We apologize for any confusion. Please note that the SVI LoRA cannot be used directly with the original Wan 2.1 workflow; it requires modified padding settings.
Please use our official workflow: Stable-Video-Infinity/comfyui_workflow, which supports independent prompts for each video clip. Big thanks to @RuneGjerde, @Kijai, and @Taiwan1912!
Due to the significant impact of quantization and step distillation on the SVI-Film workflow, we currently only open-source the SVI-Shot workflow. Using our official workflow will generate infinite-length videos without drifting or forgetting. Below is a 3-minute interactive video demo (with a distinct prompt for each 5-second video continuation):
SVI-Shot-.Interactive-3min.mp4
If you can't wait for the official ComfyUI release, try the testing versions of the Shot and Film workflows first on consumer GPUs, based on quantization and distillation LoRAs: Here. The official (more stable) version may be updated soon. Due to model quantization, video quality may be affected (it is better to use more sampling steps than 4/8).
- Please ensure that every video clip uses a different seed.
- SVI-Film uses 5 motion frames (last 5 frames) for i2v, not 1.
- SVI-Tom shares the workflow with SVI-Film, but uses 1 motion frame.
- SVI-Shot uses 1 motion frame (last frame) and uses extra VACE-based padding (the given reference image).
- Use the boat and cat demos for 50s generation and compare them with the reproduced ones to verify correctness.
- SVI-Shot also supports using different text for clips. See here. Thanks @Taiwan1912! (A minimal sketch of this clip-by-clip continuation loop follows this list.)
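For readers scripting their own pipelines, here is a minimal sketch of the continuation loop implied by the tips above. `generate_clip` is a hypothetical placeholder for whatever SVI inference backend or workflow you call; only the motion-frame counts and the fresh-seed-per-clip rule come from the list above.

```python
import random

def generate_clip(motion_frames, prompt, seed):
    """Hypothetical placeholder for the actual SVI inference call / workflow."""
    raise NotImplementedError("plug in your SVI inference backend here")

def continue_video(first_clip, prompt_stream, num_motion_frames=1):
    """Clip-by-clip continuation: condition each new clip on the last
    `num_motion_frames` frames of the previous clip (1 for SVI-Shot/SVI-Tom,
    5 for SVI-Film) and draw a different seed for every clip."""
    clips = [first_clip]
    for prompt in prompt_stream:
        motion_frames = clips[-1][-num_motion_frames:]  # last frame(s) of previous clip
        seed = random.randint(0, 2**31 - 1)             # each clip must use a new seed
        clips.append(generate_clip(motion_frames, prompt, seed))
    return clips
```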
Thank you for playing with SVI!
[10-31-2025] Official SVI-Shot ComfyUI workflow
[10-23-2025] Preview of Wan 2.2-5B-SVI and some tips for custom SVI implementation: See DevLog!
[10-21-2025] The error-banking strategy is optimized, further improving stability. See details in DevLog!
[10-13-2025] SVI is now fully open-sourced and online!
PS: Wan 2.2-5B-SVI is coming.
Self-Forcing achieves frame-by-frame causality, whereas SVI, a hybrid version, operates with clip-by-clip causality and bidirectional attention within each clip.
Targeting film and creative content production, our SVI design mirrors a director's workflow: (1) Directors repeatedly review clips in both forward and reverse directions to ensure quality, often calling "CUT" and "AGAIN" multiple times during the creative process. SVI maintains bidirectionality within each clip to emulate this process. (2) After that, directors seamlessly connect different clips along the temporal axis with causality (and some scene-transition animation), which aligns with SVI's clip-by-clip causality. The Self-Forcing series is better suited for scenarios prioritizing real-time interaction (e.g., gaming). In contrast, SVI focuses on story content creation, requiring higher standards for both content and visual quality. Intuitively, SVI's paradigm has unique advantages in end-to-end high-quality video content creation.
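To make this hybrid causality concrete, the toy sketch below builds an attention mask under my own assumptions about token layout: tokens attend bidirectionally to every token inside their own clip and only causally to tokens of earlier clips. It illustrates the idea rather than SVI's actual attention implementation.

```python
import torch

def clip_causal_mask(num_clips: int, tokens_per_clip: int) -> torch.Tensor:
    """Boolean attention mask: True = attention allowed.
    Bidirectional within each clip, causal (past-only) across clips."""
    total = num_clips * tokens_per_clip
    clip_id = torch.arange(total) // tokens_per_clip       # clip index of each token
    # Query token i may attend to key token j iff j's clip is not in i's future.
    return clip_id.unsqueeze(1) >= clip_id.unsqueeze(0)

# Example: 3 clips of 4 tokens each -> a 12x12 block lower-triangular mask
# whose diagonal blocks are fully True (bidirectional within a clip).
print(clip_causal_mask(3, 4).int())
```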
Please refer to the FAQ for more questions.
We tested the environment with an A100 80G, CUDA 12.0, and torch 2.8.0 (our reproduced environment). The following script automatically installs the older torch==2.5.0; we have also verified the lower versions torch==2.4.1 and torch==2.5.0. Feel free to let us know if you run into issues.
conda create -n svi python=3.10
conda activate svi
# For svi family
pip install -e .
pip install flash_attn==2.8.0.post2
# If you encounter issues with flash-attn installation, please refer to the details at https://github.com/vita-epfl/Stable-Video-Infinity/issues/3.
conda install -c conda-forge ffmpeg
conda install -c conda-forge librosa
conda install -c conda-forge libiconv
huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P

| Model | Task | Input | Output | Hugging Face Link | Comments |
|---|---|---|---|---|---|
| ALL | Infinite possibility | Image + X | X video | 🤗 Folder | Family bucket! I want to play with all! |
| SVI-Shot | Single-scene generation | Image + Text prompt | Long video | 🤗 Model | Generate a consistent long video with 1 text prompt. (No drifting or forgetting observed in our 20-min test) |
| SVI-Film-Opt-10212025 (Latest) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate a creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate a creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film (Transition) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate a creative long video with 1 text prompt stream. (More scene transitions due to the training data) |
| SVI-Tom&Jerry | Cartoon animation | Image | Cartoon video | 🤗 Model | Generate creative long cartoon videos with 1 text prompt stream. (No drifting or forgetting observed in our 20-min test) |
| SVI-Talk | Talking head | Image + Audio | Talking video | 🤗 Model | Generate long videos with audio-conditioned human speech. (No drifting or forgetting observed in our 10-min test) |
| SVI-Dance | Dancing animation | Image + Skeleton | Dance video | 🤗 Model | Generate long videos with skeleton-conditioned human dancing |
Note: If you want to play with T2V, you can directly use SVI with an image generated by any T2I model!
# login with your fine-grained token
huggingface-cli login
# Option 1: Download SVI Family bucket!
huggingface-cli download vita-video-gen/svi-model --local-dir ./weights/Stable-Video-Infinity --include "version-1.0/*"
# Option 2: Download individual models
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-shot.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-opt-10212025.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-transitions.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-tom.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-talk.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-dance.safetensors --local-dir ./weights/Stable-Video-Infinity

# Download audio encoder
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base
# Download multitalk weight
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk
# Link Multitalk
ln -s $PWD/weights/MeiGen-MultiTalk/multitalk.safetensors weights/Wan2.1-I2V-14B-480P/

# Download UniAnimate-DiT weights (for SVI-Dance)
huggingface-cli download ZheWang123/UniAnimate-DiT --local-dir ./weights/UniAnimate-DiT

After downloading all the models, your weights/ directory structure should look like this:
weights/
├── Wan2.1-I2V-14B-480P/
│   ├── diffusion_pytorch_model-00001-of-00007.safetensors
│   ├── diffusion_pytorch_model-00002-of-00007.safetensors
│   ├── diffusion_pytorch_model-00003-of-00007.safetensors
│   ├── diffusion_pytorch_model-00004-of-00007.safetensors
│   ├── diffusion_pytorch_model-00005-of-00007.safetensors
│   ├── diffusion_pytorch_model-00006-of-00007.safetensors
│   ├── diffusion_pytorch_model-00007-of-00007.safetensors
│   ├── diffusion_pytorch_model.safetensors.index.json
│   ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── Wan2.1_VAE.pth
│   ├── multitalk.safetensors (symlink)
│   └── README.md
├── Stable-Video-Infinity/
│   └── version-1.0/
│       ├── svi-shot.safetensors
│       ├── svi-film.safetensors
│       ├── svi-film-transitions.safetensors
│       ├── svi-tom.safetensors
│       ├── svi-talk.safetensors
│       └── svi-dance.safetensors
├── chinese-wav2vec2-base/ (for SVI-Talk)
│   ├── config.json
│   ├── model.safetensors
│   ├── preprocessor_config.json
│   └── README.md
├── MeiGen-MultiTalk/ (for SVI-Talk)
│   ├── diffusion_pytorch_model.safetensors.index.json
│   ├── multitalk.safetensors
│   └── README.md
└── UniAnimate-DiT/ (for SVI-Dance)
    ├── dw-ll_ucoco_384.onnx
    ├── UniAnimate-Wan2.1-14B-Lora-12000.ckpt
    ├── yolox_l.onnx
    └── README.md
The following scripts use the data in data/demo for inference. You can also run inference on custom data by simply changing the data path.
# SVI-Shot
bash scripts/test/svi_shot.sh
# SVI-Film
bash scripts/test/svi_film.sh
# SVI-Talk
bash scripts/test/svi_talk.sh
# SVI-Dance
bash scripts/test/svi_dance.sh
# SVI-Tom&Jerry
bash scripts/test/svi_tom.sh

Currently, the Gradio demo only supports SVI-Shot and SVI-Film.
bash gradio_demo.sh

We have prepared the toy training data in data/toy_train/. You can simply follow this data format to train SVI with your custom data.
Please modify --num_nodes if you use more nodes for training. We have tested training with both 8 and 64 GPUs, where a larger batch size gave better performance.
# SVI-Shot
# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# Start training
bash scripts/train/svi_shot.sh

# SVI-Film
# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# Start training
bash scripts/train/svi_film.sh

# SVI-Talk
# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py
# Start training
bash scripts/train/svi_talk.sh

# SVI-Dance
# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py
# Start training
bash scripts/train/svi_dance.sh

# Change .pt files to .safetensors files
# zero_to_fp32.py is automatically generated in your model dir; change $DIR_WITH_SAFETENSORS to your desired directory
python zero_to_fp32.py . $DIR_WITH_SAFETENSORS --safe_serialization

# (Optionally) Extract and only save LoRA parameters to reduce disk space
python utils/extract_lora.py --checkpoint_dir $DIR_WITH_SAFETENSORS --output_dir $XXX

Please modify the inference scripts in ./scripts/test/ accordingly by changing the inference samples and pointing to your new weights.
You can also use our benchmark datasets, built with our Automatic Prompt Stream Engine (see Appendix A for more details), where you can find images and the associated prompt streams for specific storylines (a toy example of a prompt stream is sketched after the first table below).
| Data | Use | HuggingFace Link | Comment |
|---|---|---|---|
| Consistent Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt |
| Creative Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream following a storyline (1 prompt per 5-second clip) |
| Creative Video Generation (More prompts) | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream following a storyline (1 prompt per 5-second clip) |
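As a rough, hypothetical illustration of what a prompt stream looks like (the files in the released datasets define the actual format, and the paths and keys below are made up for illustration), each entry pairs a reference image with one prompt per 5-second continuation:

```python
# Hypothetical prompt stream: one prompt per 5-second clip, plus a reference image.
# The released benchmark files define the authoritative layout.
prompt_stream = {
    "reference_image": "data/demo/cat.png",  # starting frame for i2v (example path)
    "prompts": [
        "A cat wakes up on a sunny windowsill and stretches.",
        "The cat jumps down and walks toward the kitchen.",
        "The cat plays with a ball of yarn on the kitchen floor.",
    ],
}

# Each prompt drives one 5-second continuation, so the total length grows
# linearly with the number of prompts in the stream.
print(f"{len(prompt_stream['prompts']) * 5} seconds of video planned")
```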
The following is the training data we used for the SVI family.
| Data | Use | HuggingFace Link | Comment |
|---|---|---|---|
| Customized Datasets | Train | 🤗 Dataset | You can make your customized datasets using this format |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | MixKit Dataset |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | UltraVideo Dataset |
| Human Talking | Train | 🤗 Dataset | 5k subset from Hallo 3 |
| Human Dancing | Train | 🤗 Dataset | TikTok |
huggingface-cli download --repo-type dataset vita-video-gen/svi-benchmark --local-dir ./data/svi-benchmark

- Release everything about SVI
- Wan 2.2 5B based SVI [Issue #1 #7]
- Wan 2.2 14B based SVI [Issue #1]
- Streaming generation model
- [Call for TODO] Write down your idea in the Issues
We greatly appreciate the tremendous effort behind the following fantastic projects!
[1] Wan: Open and Advanced Large-Scale Video Generative Models
[2] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
[3] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation
If you find our work helpful for your research, please consider citing our paper. Thank you so much!
@article{li2025stable,
title={Stable Video Infinity: Infinite-Length Video Generation with Error Recycling},
author={Li, Wuyang and Pan, Wentao and Luan, Po-Chien and Gao, Yang and Alahi, Alexandre},
journal={arXiv preprint arXiv:2510.09212},
year={2025}
}

We propose Stable Video Infinity (SVI), which is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise schedulers, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new inputs. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art performance.
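For readers who prefer code to prose, below is a heavily simplified, hypothetical sketch of the error-recycling loop described in the abstract (inject a banked error, approximate the prediction in one step, compute the residual, and bank it for later resampling). The model signature, integration convention, and loss are assumptions made for illustration, not the released training code.

```python
import random
import torch
import torch.nn.functional as F

error_bank = {}  # discretized timestep bucket -> banked error tensors (replay memory)

def error_recycling_step(dit, clean_latents, t, text_emb):
    """One conceptual iteration of error-recycling fine-tuning (heavily simplified).
    `dit(x, t, text_emb)` is assumed to predict a flow-matching velocity; t in [0, 1]."""
    bucket = int(t * 10)  # discretize the timestep for banking/resampling

    # (i) Inject a resampled historical error so the input mimics the
    #     error-accumulated latents seen at autoregressive test time.
    noisy_input = clean_latents.clone()
    if error_bank.get(bucket):
        noisy_input = noisy_input + random.choice(error_bank[bucket])

    # (ii) Approximate the model's prediction with a single integration step
    #      and measure its error as the residual to the clean latents.
    with torch.no_grad():
        velocity = dit(noisy_input, t, text_emb)
        one_step_pred = noisy_input + (1.0 - t) * velocity
        residual_error = (one_step_pred - clean_latents).detach()

    # (iii) Bank the residual into replay memory for later resampling.
    error_bank.setdefault(bucket, []).append(residual_error)

    # Finally, supervise the DiT on the error-injected input; a plain
    # flow-matching-style regression toward the clean latents stands in
    # here for the actual objective used in the paper.
    loss = F.mse_loss(dit(noisy_input, t, text_emb), clean_latents - noisy_input)
    return loss
```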