SVI

Stable Video Infinity: Infinite-Length Video Generation with Error Recycling

Wuyang Li · Wentao Pan · Po-Chien Luan · Yang Gao · Alexandre Alahi

VITA@EPFL

Technical introduction (unofficial): AI Papers Slop (English); WeChat article (Chinese)

✨ Highlights

Stable Video Infinity (SVI) can generate videos of ANY length with high temporal consistency, plausible scene transitions, and controllable streaming storylines in ANY domain.

  • OpenSVI: Everything is open-sourced: training & evaluation scripts, datasets, and more.
  • Infinite Length: No inherent limit on video duration; generate arbitrarily long stories (see the 10-minute "Tom and Jerry" demo).
  • Versatile: Supports diverse in-the-wild generation tasks: multi-scene short films, single-scene animations, skeleton-/audio-conditioned generation, cartoons, and more.
  • Efficient: Only LoRA adapters are tuned, requiring very little training data: anyone can make their own SVI easily.

You can watch our crazy 8-minute Tom & Jerry video on Bilibili or YouTube. If you find this project useful, we would really appreciate your star ⭐, which encourages us to keep developing for the open-source community! This repository will be continuously maintained. Thank you!

📧 Contact: [email protected]

😀 ComfyUI Users

Official ComfyUI

We've recently discovered that some users have been using SVI workflows incorrectly. We apologize for any confusion. Please note that the SVI LoRA cannot be used directly with the original Wan 2.1 workflow; it requires modified padding settings.

Please use our official workflow: Stable-Video-Infinity/comfyui_workflow, which supports independent prompts for each video clip. Big thanks to @RuneGjerde, @Kijai, and @Taiwan1912!

Due to the significant impact of quantization and step distillation on the SVI-Film workflow, we currently only open-source the SVI-Shot workflow. Using our official workflow will generate infinite-length videos without drifting or forgetting. Below is a 3-minute interactive video demo (a distinct prompt for each 5-second video continuation):

SVI-Shot-.Interactive-3min.mp4

Some Important Checks

If you can't wait for the official ComfyUI release, try the testing versions of the Shot and Film workflows first on consumer GPUs, using quantization and distillation LoRAs: Here. The official (more stable) version may be updated soon. Due to model quantization, video quality may be affected (try more sampling steps than 4/8).

  • Please ensure that every video clip uses a different seed (see the sketch after this list).
  • SVI-Film uses 5 motion frames (last 5 frames) for i2v, not 1.
  • SVI-Tom shares the workflow with SVI-Film, but uses 1 motion frame.
  • SVI-Shot uses 1 motion frame (last frame) and uses extra VACE-based padding (the given reference image).
  • Use the boat and cat demos for 50s generation and compare them with the reproduced ones to verify correctness.
  • SVI-Shot also supports using different text for clips. See here. Thanks @Taiwan1912!
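
For readers reproducing these settings outside ComfyUI, the sketch below illustrates the per-clip seed and motion-frame logic described above. It is only a schematic: generate_clip and its arguments are hypothetical placeholders, not the repository's actual API.

import torch

# Schematic autoregressive continuation loop (placeholder API, not the real SVI interface).
# SVI-Film conditions on the last 5 frames, SVI-Tom and SVI-Shot on the last 1 frame;
# SVI-Shot additionally uses VACE-based padding with the given reference image.
def continue_video(generate_clip, reference_image, prompts, num_motion_frames=5):
    video_frames = [reference_image]                 # the first clip starts from the reference image
    for clip_idx, prompt in enumerate(prompts):
        seed = 1234 + clip_idx                       # every clip must use a different seed
        generator = torch.Generator("cuda").manual_seed(seed)
        motion = video_frames[-num_motion_frames:]   # the last N frames drive the next continuation
        new_clip = generate_clip(prompt=prompt, motion_frames=motion, generator=generator)
        video_frames.extend(new_clip)
    return video_frames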

Thank you for playing with SVI!

🔥 News

[10-31-2025] Official SVI-Shot ComfyUI workflow
[10-23-2025] Preview of Wan 2.2-5B-SVI and some tips for custom SVI implementations: see DevLog!
[10-21-2025] The error-banking strategy is optimized, further improving stability. See details in DevLog!
[10-13-2025] SVI is now fully open-sourced and online!

PS: Wan 2.2-5B-SVI is coming.

โ“ Frequently Asked Questions

Bidirectional or Causal (Self-Forcing)?

Self-Forcing achieves frame-by-frame causality, whereas SVI, a hybrid version, operates with clip-by-clip causality and bidirectional attention within each clip.

Targeting film and creative content production, our SVI design mirrors a director's workflow: (1) Directors repeatedly review clips in both forward and reverse directions to ensure quality, often calling "CUT" and "AGAIN" multiple times during the creative process. SVI maintains bidirectionality within each clip to emulate this process. (2) After that, directors seamlessly connect different clips along the temporal axis with causality (and some scene-transition animation), which aligns with SVI's clip-by-clip causality. The Self-Forcing series is better suited for scenarios prioritizing real-time interaction (e.g., gaming). In contrast, SVI focuses on story content creation, requiring higher standards for both content and visual quality. Intuitively, SVI's paradigm has unique advantages in end-to-end high-quality video content creation.
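
To make the hybrid design concrete, the minimal sketch below builds the corresponding attention pattern: bidirectional within each clip, causal across clips. This is an explanatory illustration only, not code from the SVI repository.

import torch

def clip_block_causal_mask(num_clips: int, frames_per_clip: int) -> torch.Tensor:
    """Boolean mask (True = attention allowed): frames attend bidirectionally inside
    their own clip and causally (one-way) to all earlier clips."""
    n = num_clips * frames_per_clip
    clip_id = torch.arange(n) // frames_per_clip
    # a query in clip i may attend to keys in clip j whenever j <= i
    return clip_id.unsqueeze(1) >= clip_id.unsqueeze(0)

print(clip_block_causal_mask(num_clips=3, frames_per_clip=2).int())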

Paradigm comparison

Please refer to the FAQ for more questions.

🔧 Environment Setup

We have tested the environment with an A100 80G, CUDA 12.0, and torch 2.8.0; this is our reproduced environment. The following script will automatically install the older version torch==2.5.0. We have also tested the lower versions torch==2.4.1 and torch==2.5.0. Feel free to open an issue if you run into problems.

conda create -n svi python=3.10 
conda activate svi

# For svi family
pip install -e .
pip install flash_attn==2.8.0.post2
# If you encounter issues with flash-attn installation, please refer to the details at https://github.com/vita-epfl/Stable-Video-Infinity/issues/3.

conda install -c conda-forge ffmpeg
conda install -c conda-forge librosa
conda install -c conda-forge libiconv
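
After installation, a quick sanity check along these lines (our suggestion, not part of the official scripts) confirms that torch, CUDA, and flash-attn are wired up correctly:

# Run inside the activated svi environment
import torch
import flash_attn  # importing verifies the wheel matches your torch/CUDA build

print("torch:", torch.__version__)
print("CUDA runtime:", torch.version.cuda, "| GPU available:", torch.cuda.is_available())
print("flash-attn:", flash_attn.__version__)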

📦 Model Preparation

Download Wan 2.1 I2V 14B

huggingface-cli download Wan-AI/Wan2.1-I2V-14B-480P --local-dir ./weights/Wan2.1-I2V-14B-480P

Download SVI Family

| Model | Task | Input | Output | Hugging Face Link | Comments |
|---|---|---|---|---|---|
| ALL | Infinite possibility | Image + X | X video | 🤗 Folder | Family bucket! I want to play with all! |
| SVI-Shot | Single-scene generation | Image + Text prompt | Long video | 🤗 Model | Generate consistent long video with 1 text prompt. (This will never drift or forget in our 20 min test) |
| SVI-Film-Opt-10212025 (Latest) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate creative long video with 1 text prompt stream (5 seconds per text). |
| SVI-Film (Transition) | Multi-scene generation | Image + Text prompt stream | Film-style video | 🤗 Model | Generate creative long video with 1 text prompt stream. (More scene transitions due to the training data) |
| SVI-Tom&Jerry | Cartoon animation | Image | Cartoon video | 🤗 Model | Generate creative long cartoon videos with 1 text prompt stream (This will never drift or forget in our 20 min test) |
| SVI-Talk | Talking head | Image + Audio | Talking video | 🤗 Model | Generate long videos with audio-conditioned human speaking (This will never drift or forget in our 10 min test) |
| SVI-Dance | Dancing animation | Image + Skeleton | Dance video | 🤗 Model | Generate long videos with skeleton-conditioned human dancing |

Note: If you want to play with T2V, you can directly use SVI with an image generated by any T2I model!
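
For example, you can create a starting frame with any off-the-shelf T2I model and feed it to SVI as the input image. The snippet below uses SDXL via diffusers purely as an illustration; the model choice and output path are arbitrary.

import torch
from diffusers import StableDiffusionXLPipeline

# Any T2I model works; SDXL is only an example
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

image = pipe("a tabby cat sailing a small wooden boat at sunset").images[0]
image.save("data/demo/t2i_start_frame.png")  # then point the SVI inference script at this image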

# login with your fine-grained token
huggingface-cli login

# Option 1: Download SVI Family bucket!
huggingface-cli download vita-video-gen/svi-model --local-dir ./weights/Stable-Video-Infinity --include "version-1.0/*"

# Option 2: Download individual models
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-shot.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-opt-10212025.safetensors  --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-film-transitions.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-tom.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-talk.safetensors --local-dir ./weights/Stable-Video-Infinity
# huggingface-cli download vita-video-gen/svi-model version-1.0/svi-dance.safetensors --local-dir ./weights/Stable-Video-Infinity

Download Multitalk Cross-Attention for SVI-Talk Training/Test

# Download audio encoder
huggingface-cli download TencentGameMate/chinese-wav2vec2-base --local-dir ./weights/chinese-wav2vec2-base 
huggingface-cli download TencentGameMate/chinese-wav2vec2-base model.safetensors --revision refs/pr/1 --local-dir ./weights/chinese-wav2vec2-base

# Download multitalk weight
huggingface-cli download MeiGen-AI/MeiGen-MultiTalk --local-dir ./weights/MeiGen-MultiTalk

# Link Multitalk
ln -s $PWD/weights/MeiGen-MultiTalk/multitalk.safetensors weights/Wan2.1-I2V-14B-480P/

Download UniAnimate-DiT LoRA for SVI-Dance Training

huggingface-cli download ZheWang123/UniAnimate-DiT --local-dir ./weights/UniAnimate-DiT

Check Model

After downloading all the models, your weights/ directory structure should look like this:

weights/
├── Wan2.1-I2V-14B-480P/
│   ├── diffusion_pytorch_model-00001-of-00007.safetensors
│   ├── diffusion_pytorch_model-00002-of-00007.safetensors
│   ├── diffusion_pytorch_model-00003-of-00007.safetensors
│   ├── diffusion_pytorch_model-00004-of-00007.safetensors
│   ├── diffusion_pytorch_model-00005-of-00007.safetensors
│   ├── diffusion_pytorch_model-00006-of-00007.safetensors
│   ├── diffusion_pytorch_model-00007-of-00007.safetensors
│   ├── diffusion_pytorch_model.safetensors.index.json
│   ├── models_clip_open-clip-xlm-roberta-large-vit-huge-14.pth
│   ├── models_t5_umt5-xxl-enc-bf16.pth
│   ├── Wan2.1_VAE.pth
│   ├── multitalk.safetensors (symlink)
│   └── README.md
├── Stable-Video-Infinity/
│   └── version-1.0/
│       ├── svi-shot.safetensors
│       ├── svi-film.safetensors
│       ├── svi-film-transitions.safetensors
│       ├── svi-tom.safetensors
│       ├── svi-talk.safetensors
│       └── svi-dance.safetensors
├── chinese-wav2vec2-base/ (for SVI-Talk)
│   ├── config.json
│   ├── model.safetensors
│   ├── preprocessor_config.json
│   └── README.md
├── MeiGen-MultiTalk/ (for SVI-Talk)
│   ├── diffusion_pytorch_model.safetensors.index.json
│   ├── multitalk.safetensors
│   └── README.md
└── UniAnimate-DiT/ (for SVI-Dance)
    ├── dw-ll_ucoco_384.onnx
    ├── UniAnimate-Wan2.1-14B-Lora-12000.ckpt
    ├── yolox_l.onnx
    └── README.md
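
A small helper like the following (our own check, not shipped with the repository) can confirm that the key files from the tree above are in place before you run inference:

from pathlib import Path

# Minimal presence check for the weights listed above; extend it for SVI-Talk / SVI-Dance if needed
REQUIRED = [
    "Wan2.1-I2V-14B-480P/diffusion_pytorch_model.safetensors.index.json",
    "Wan2.1-I2V-14B-480P/models_t5_umt5-xxl-enc-bf16.pth",
    "Wan2.1-I2V-14B-480P/Wan2.1_VAE.pth",
    "Stable-Video-Infinity/version-1.0/svi-shot.safetensors",
    "Stable-Video-Infinity/version-1.0/svi-film.safetensors",
]

root = Path("weights")
missing = [p for p in REQUIRED if not (root / p).exists()]
print("All required weights found." if not missing else f"Missing: {missing}")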

🎮 Play with Official SVI

Inference Scripts

The following scripts use the data in data/demo for inference. You can also run inference on custom data by simply changing the data path.

# SVI-Shot
bash scripts/test/svi_shot.sh 

# SVI-Film
bash scripts/test/svi_film.sh 

# SVI-Talk
bash scripts/test/svi_talk.sh 

# SVI-Dance
bash scripts/test/svi_dance.sh 

# SVI-Tom&Jerry
bash scripts/test/svi_tom.sh 

Gradio Demo

Currently, the Gradio demo only supports SVI-Shot and SVI-Film.

bash gradio_demo.sh

🔥 Train Your Own SVI

We provide toy training data in data/toy_train/. Simply follow its data format to train SVI on your custom data. Please modify --num_nodes if you train on more nodes. We have tested training with both 8 and 64 GPUs; a larger batch size gave better performance.

SVI-Shot

# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# start training
bash scripts/train/svi_shot.sh 

SVI-Film

# (Optionally) Use scripts/data_preprocess/process_mixkit.py from CausVid to pre-process data
# start training
bash scripts/train/svi_film.sh 

SVI-Talk

# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py 

# Start training
bash scripts/train/svi_talk.sh 

SVI-Dance

# Preprocess the toy training data
python scripts/data_preprocess/prepare_video_audio.py 

# Start training
bash scripts/train/svi_dance.sh 

๐Ÿ“ Test Your Trained SVI

Model Post-processing

# Convert .pt checkpoint files to .safetensors files
# zero_to_fp32.py is automatically generated in your model dir; set $DIR_WITH_SAFETENSORS to your desired output directory
python zero_to_fp32.py . $DIR_WITH_SAFETENSORS --safe_serialization

# (Optional) Extract and save only the LoRA parameters to reduce disk space
python utils/extract_lora.py --checkpoint_dir $DIR_WITH_SAFETENSORS --output_dir $XXX
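
To double-check the extracted LoRA file before inference, a sketch like this (our own; the file path is a placeholder for whatever extract_lora.py wrote to your --output_dir) loads it with safetensors and reports the parameter count:

from safetensors.torch import load_file

# Replace the path with the file produced by utils/extract_lora.py
state_dict = load_file("path/to/extracted_lora.safetensors")
num_params = sum(t.numel() for t in state_dict.values())
print(f"{len(state_dict)} tensors, {num_params / 1e6:.1f}M LoRA parameters")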

Inference

Please modify the inference scripts in ./scripts/test/ accordingly, pointing them to your inference samples and your newly trained weights.

🗃️ Datasets

You can also use our benchmark datasets, built with our Automatic Prompt Stream Engine (see Appendix A of the paper for more details), where you can find images and the associated prompt streams for specific storylines.

| Data | Use | HuggingFace Link | Comment |
|---|---|---|---|
| Consistent Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt |
| Creative Video Generation | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream according to storyline (1 prompt per 5 sec clip) |
| Creative Video Generation (More prompts) | Test | 🤗 Dataset | Generate 1 long video using 1 text prompt stream according to storyline (1 prompt per 5 sec clip) |

The following is the training data we used for the SVI family.

| Data | Use | HuggingFace Link | Comment |
|---|---|---|---|
| Customized Datasets | Train | 🤗 Dataset | You can make your customized datasets using this format |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | MixKit Dataset |
| Consistent/Creative Video Generation | Train | 🤗 Dataset | UltraVideo Dataset |
| Human Talking | Train | 🤗 Dataset | 5k subset from Hallo 3 |
| Human Dancing | Train | 🤗 Dataset | TikTok |

huggingface-cli download --repo-type dataset vita-video-gen/svi-benchmark --local-dir ./data/svi-benchmark

📋 TODO List

  • Release everything about SVI

  • Wan 2.2 5B based SVI [Issue #1 #7]

  • Wan 2.2 14B based SVI [Issue #1]

  • Streaming generation model

  • [Call for TODO] Write down your ideas in the Issues

๐Ÿ™ Acknowledgement

We greatly appreciate the tremendous effort behind the following fantastic projects!

[1] Wan: Open and Advanced Large-Scale Video Generative Models
[2] UniAnimate-DiT: Human Image Animation with Large-Scale Video Diffusion Transformer
[3] Let Them Talk: Audio-Driven Multi-Person Conversational Video Generation

โค๏ธ Citation

If you find our work helpful for your research, please consider citing our paper. Thank you so much!

@article{li2025stable,
  title={Stable Video Infinity: Infinite-Length Video Generation with Error Recycling},
  author={Li, Wuyang and Pan, Wentao and Luan, Po-Chien and Gao, Yang and Alahi, Alexandre},
  journal={arXiv preprint arXiv:2510.09212},
  year={2025}
}

📌 Abstract

We propose Stable Video Infinity (SVI) that is able to generate infinite-length videos with high temporal consistency, plausible scene transitions, and controllable streaming storylines. While existing long-video methods attempt to mitigate accumulated errors via handcrafted anti-drifting (e.g., modified noise scheduler, frame anchoring), they remain limited to single-prompt extrapolation, producing homogeneous scenes with repetitive motions. We identify that the fundamental challenge extends beyond error accumulation to a critical discrepancy between the training assumption (seeing clean data) and the test-time autoregressive reality (conditioning on self-generated, error-prone outputs). To bridge this hypothesis gap, SVI incorporates Error-Recycling Fine-Tuning, a new type of efficient training that recycles the Diffusion Transformer (DiT)'s self-generated errors into supervisory prompts, thereby encouraging DiT to actively identify and correct its own errors. This is achieved by injecting, collecting, and banking errors through closed-loop recycling, autoregressively learning from error-injected feedback. Specifically, we (i) inject historical errors made by DiT to intervene on clean inputs, simulating error-accumulated trajectories in flow matching; (ii) efficiently approximate predictions with one-step bidirectional integration and calculate errors with residuals; (iii) dynamically bank errors into replay memory across discretized timesteps, which are resampled for new input. SVI is able to scale videos from seconds to infinite durations with no additional inference cost, while remaining compatible with diverse conditions (e.g., audio, skeleton, and text streams). We evaluate SVI on three benchmarks, including consistent, creative, and conditional settings, thoroughly verifying its versatility and state-of-the-art role.
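
The pseudocode-style sketch below restates the error-recycling loop from the abstract for readers who prefer code. Every name is a placeholder chosen for exposition, and the plain reconstruction loss stands in for the actual flow-matching objective; this is not the training code in this repository.

# Schematic of one Error-Recycling Fine-Tuning step (placeholders only)
def error_recycling_step(dit, clean_latents, text, timestep, error_bank):
    # (i) inject a banked historical error into the clean input to simulate
    #     the error-accumulated trajectories seen at autoregressive test time
    banked_error = error_bank.sample(timestep)
    corrupted_input = clean_latents + banked_error

    # (ii) approximate the prediction with a cheap one-step pass and
    #      measure the residual error against the clean target
    prediction = dit(corrupted_input, text, timestep)
    residual_error = (prediction - clean_latents).detach()

    # (iii) bank the fresh error by (discretized) timestep so later steps can resample it
    error_bank.push(timestep, residual_error)

    # supervise the model on the error-injected input (simplified objective)
    loss = ((prediction - clean_latents) ** 2).mean()
    return loss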

SVI intro