
Commit a9e4883

Update Wan Animate Docs (#12658)
* Update the Wan Animate docs to reflect the most recent code
* Further explain input preprocessing and link to original Wan Animate preprocessing scripts
1 parent 63dd601 commit a9e4883

2 files changed: +19 −31 lines changed
docs/source/en/api/models/wan_animate_transformer_3d.md

Lines changed: 1 addition & 1 deletion
@@ -18,7 +18,7 @@ The model can be loaded with the following code snippet.
 ```python
 from diffusers import WanAnimateTransformer3DModel

-transformer = WanAnimateTransformer3DModel.from_pretrained("Wan-AI/Wan2.2-Animate-14B-720P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
+transformer = WanAnimateTransformer3DModel.from_pretrained("Wan-AI/Wan2.2-Animate-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
 ```

 ## WanAnimateTransformer3DModel
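The changed line uses `torch.bfloat16`, so a fully self-contained version of the updated snippet also needs `import torch`; nothing else below is new relative to the diff:

```python
import torch
from diffusers import WanAnimateTransformer3DModel

# Load the Wan-Animate transformer from the renamed checkpoint in bfloat16
transformer = WanAnimateTransformer3DModel.from_pretrained(
    "Wan-AI/Wan2.2-Animate-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
```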

docs/source/en/api/pipelines/wan.md

Lines changed: 18 additions & 30 deletions
@@ -281,7 +281,7 @@ For replacement mode, you additionally need:
 - **Mask video**: A mask indicating where to generate content (white) vs. preserve original (black)

 > [!NOTE]
-> The preprocessing tools are available in the original Wan-Animate repository. Integration of these preprocessing steps into Diffusers is planned for a future release.
+> Raw videos should not be used for inputs such as `pose_video`, which the pipeline expects to be preprocessed to extract the proper information. Preprocessing scripts to prepare these inputs are available in the [original Wan-Animate repository](https://github.com/Wan-Video/Wan2.2?tab=readme-ov-file#1-preprocessing). Integration of these preprocessing steps into Diffusers is planned for a future release.

 The example below demonstrates how to use the Wan-Animate pipeline:
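In practice the conditioning videos come out of the upstream preprocessing scripts and are then loaded like any other video. A minimal sketch, assuming the preprocessing has already written its outputs to disk (the file paths below are placeholders, not names fixed by the pipeline):

```python
from diffusers.utils import load_video

# Placeholder path: output of the Wan-Animate preprocessing step, not raw footage
pose_video = load_video("preprocessed/pose.mp4")  # pose frames extracted from the driving video
# The other conditioning inputs (face, background, and mask videos) are loaded the same way.
```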

@@ -293,13 +293,10 @@ import numpy as np
 import torch
 from diffusers import AutoencoderKLWan, WanAnimatePipeline
 from diffusers.utils import export_to_video, load_image, load_video
-from transformers import CLIPVisionModel

 model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
 vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
-pipe = WanAnimatePipeline.from_pretrained(
-    model_id, vae=vae, torch_dtype=torch.bfloat16
-)
+pipe = WanAnimatePipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
 pipe.to("cuda")

 # Load character image and preprocessed videos
@@ -330,11 +327,11 @@ output = pipe(
     negative_prompt=negative_prompt,
     height=height,
     width=width,
-    num_frames=81,
-    guidance_scale=5.0,
-    mode="animation", # Animation mode (default)
+    segment_frame_length=77,
+    guidance_scale=1.0,
+    mode="animate", # Animation mode (default)
 ).frames[0]
-export_to_video(output, "animated_character.mp4", fps=16)
+export_to_video(output, "animated_character.mp4", fps=30)
 ```

 </hfoption>
@@ -345,14 +342,10 @@ import numpy as np
 import torch
 from diffusers import AutoencoderKLWan, WanAnimatePipeline
 from diffusers.utils import export_to_video, load_image, load_video
-from transformers import CLIPVisionModel

 model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
-image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float16)
 vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
-pipe = WanAnimatePipeline.from_pretrained(
-    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
-)
+pipe = WanAnimatePipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
 pipe.to("cuda")

 # Load all required inputs for replacement mode
@@ -387,11 +380,11 @@ output = pipe(
     negative_prompt=negative_prompt,
     height=height,
     width=width,
-    num_frames=81,
-    guidance_scale=5.0,
-    mode="replacement", # Replacement mode
+    segment_frame_length=77,
+    guidance_scale=1.0,
+    mode="replace", # Replacement mode
 ).frames[0]
-export_to_video(output, "character_replaced.mp4", fps=16)
+export_to_video(output, "character_replaced.mp4", fps=30)
 ```

 </hfoption>
@@ -402,14 +395,10 @@ import numpy as np
 import torch
 from diffusers import AutoencoderKLWan, WanAnimatePipeline
 from diffusers.utils import export_to_video, load_image, load_video
-from transformers import CLIPVisionModel

 model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
-image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float16)
 vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
-pipe = WanAnimatePipeline.from_pretrained(
-    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
-)
+pipe = WanAnimatePipeline.from_pretrained(model_id, vae=vae, torch_dtype=torch.bfloat16)
 pipe.to("cuda")

 image = load_image("path/to/character.jpg")
@@ -443,25 +432,24 @@ output = pipe(
     negative_prompt=negative_prompt,
     height=height,
     width=width,
-    num_frames=81,
+    segment_frame_length=77,
     num_inference_steps=50,
     guidance_scale=5.0,
-    num_frames_for_temporal_guidance=5, # Use 5 frames for temporal guidance (1 or 5 recommended)
+    prev_segment_conditioning_frames=5, # Use 5 frames for temporal guidance (1 or 5 recommended)
     callback_on_step_end=callback_fn,
     callback_on_step_end_tensor_inputs=["latents"],
 ).frames[0]
-export_to_video(output, "animated_advanced.mp4", fps=16)
+export_to_video(output, "animated_advanced.mp4", fps=30)
 ```

 </hfoption>
 </hfoptions>

 #### Key Parameters

-- **mode**: Choose between `"animation"` (default) or `"replacement"`
-- **num_frames_for_temporal_guidance**: Number of frames for temporal guidance (1 or 5 recommended). Using 5 provides better temporal consistency but requires more memory
-- **guidance_scale**: Controls how closely the output follows the text prompt. Higher values (5-7) produce results more aligned with the prompt
-- **num_frames**: Total number of frames to generate. Should be divisible by `vae_scale_factor_temporal` (default: 4)
+- **mode**: Choose between `"animate"` (default) or `"replace"`
+- **prev_segment_conditioning_frames**: Number of frames for temporal guidance (1 or 5 recommended). Using 5 provides better temporal consistency but requires more memory
+- **guidance_scale**: Controls how closely the output follows the text prompt. Higher values (5-7) produce results more aligned with the prompt. For Wan-Animate, CFG is disabled by default (`guidance_scale=1.0`) but can be enabled to support negative prompts and finer control over facial expressions. (Note that CFG will only target the text prompt and face conditioning.)


 ## Notes
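To illustrate the `guidance_scale` behavior described in the Key Parameters list above, here is a minimal sketch of re-enabling CFG; the conditioning arguments are assumed to mirror the animation example earlier on the page (in particular, the `image` keyword is an assumption, while `pose_video`, `prompt`, `negative_prompt`, `height`, `width`, `segment_frame_length`, and `mode` appear in the updated docs):

```python
# Sketch only: same pipeline and conditioning inputs as the animation example above,
# with classifier-free guidance turned back on.
output = pipe(
    image=image,                      # assumed argument name for the character image
    pose_video=pose_video,            # preprocessed pose video (see the preprocessing note)
    prompt=prompt,
    negative_prompt=negative_prompt,  # only takes effect when guidance_scale > 1.0
    height=height,
    width=width,
    segment_frame_length=77,
    guidance_scale=5.0,               # 1.0 (default) disables CFG; higher values follow the prompt and face conditioning more closely
    mode="animate",
).frames[0]
```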
