
Commit 1a8dd43

Merge branch 'main' into wan-pipeline
2 parents 0ee299a + d8e4805 commit 1a8dd43

39 files changed (+3881 -120 lines)

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions
@@ -387,6 +387,8 @@
      title: Transformer2DModel
    - local: api/models/transformer_temporal
      title: TransformerTemporalModel
    - local: api/models/wan_animate_transformer_3d
      title: WanAnimateTransformer3DModel
    - local: api/models/wan_transformer_3d
      title: WanTransformer3DModel
  title: Transformers
docs/source/en/api/models/wan_animate_transformer_3d.md

Lines changed: 30 additions & 0 deletions
@@ -0,0 +1,30 @@
<!-- Copyright 2025 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# WanAnimateTransformer3DModel

A Diffusion Transformer model for 3D video-like data was introduced in [Wan Animate](https://github.com/Wan-Video/Wan2.2) by the Alibaba Wan Team.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import WanAnimateTransformer3DModel

transformer = WanAnimateTransformer3DModel.from_pretrained("Wan-AI/Wan2.2-Animate-14B-720P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
```

## WanAnimateTransformer3DModel

[[autodoc]] WanAnimateTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput

docs/source/en/api/pipelines/wan.md

Lines changed: 238 additions & 17 deletions
@@ -40,6 +40,7 @@ The following Wan models are supported in Diffusers:
- [Wan 2.2 T2V 14B](https://huggingface.co/Wan-AI/Wan2.2-T2V-A14B-Diffusers)
- [Wan 2.2 I2V 14B](https://huggingface.co/Wan-AI/Wan2.2-I2V-A14B-Diffusers)
- [Wan 2.2 TI2V 5B](https://huggingface.co/Wan-AI/Wan2.2-TI2V-5B-Diffusers)
- [Wan 2.2 Animate 14B](https://huggingface.co/Wan-AI/Wan2.2-Animate-14B-Diffusers)

> [!TIP]
> Click on the Wan models in the right sidebar for more examples of video generation.
@@ -95,15 +96,15 @@ pipeline = WanPipeline.from_pretrained(
pipeline.to("cuda")

prompt = """
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

@@ -150,15 +151,15 @@ pipeline.transformer = torch.compile(
)

prompt = """
The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""
negative_prompt = """
Bright tones, overexposed, static, blurred details, subtitles, style, works, paintings, images, static, overall gray, worst quality,
low quality, JPEG compression residue, ugly, incomplete, extra fingers, poorly drawn hands, poorly drawn faces, deformed, disfigured,
misshapen limbs, fused fingers, still picture, messy background, three legs, many people in the background, walking backwards
"""

@@ -249,6 +250,220 @@ The code snippets available in [this](https://github.com/huggingface/diffusers/p

The general rule of thumb when preparing inputs for the VACE pipeline is that any input image, or video frame used for conditioning, should have a corresponding black mask. A black mask tells the model not to generate new content in that area and to use it only for conditioning; parts or frames that the model should generate must instead have a white mask.
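For illustration, here is a minimal sketch of that convention using PIL. It supplies one real frame as conditioning (paired with an all-black mask) and leaves the remaining frames to be generated (paired with all-white masks). The gray placeholder frames, the frame count, and the exact way these lists are passed to the pipeline are assumptions of this sketch; refer to the linked code snippets for the inputs each VACE task actually expects.

```python
from PIL import Image

width, height, num_frames = 832, 480, 81

# One real frame to condition on; gray placeholders stand in for frames to be generated.
first_frame = Image.open("path/to/first_frame.png").convert("RGB").resize((width, height))
video = [first_frame] + [Image.new("RGB", (width, height), (128, 128, 128)) for _ in range(num_frames - 1)]

# Black mask = keep/condition on this frame; white mask = let the model generate it.
mask = [Image.new("L", (width, height), 0)] + [Image.new("L", (width, height), 255) for _ in range(num_frames - 1)]
```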

</hfoption>
</hfoptions>

### Wan-Animate: Unified Character Animation and Replacement with Holistic Replication

[Wan-Animate](https://huggingface.co/papers/2509.14055) by the Wan Team.

*We introduce Wan-Animate, a unified framework for character animation and replacement. Given a character image and a reference video, Wan-Animate can animate the character by precisely replicating the expressions and movements of the character in the video to generate high-fidelity character videos. Alternatively, it can integrate the animated character into the reference video to replace the original character, replicating the scene's lighting and color tone to achieve seamless environmental integration. Wan-Animate is built upon the Wan model. To adapt it for character animation tasks, we employ a modified input paradigm to differentiate between reference conditions and regions for generation. This design unifies multiple tasks into a common symbolic representation. We use spatially-aligned skeleton signals to replicate body motion and implicit facial features extracted from source images to reenact expressions, enabling the generation of character videos with high controllability and expressiveness. Furthermore, to enhance environmental integration during character replacement, we develop an auxiliary Relighting LoRA. This module preserves the character's appearance consistency while applying the appropriate environmental lighting and color tone. Experimental results demonstrate that Wan-Animate achieves state-of-the-art performance. We are committed to open-sourcing the model weights and its source code.*

Project page: https://humanaigc.github.io/wan-animate

This model was mostly contributed by [M. Tolga Cangöz](https://github.com/tolgacangoz).

#### Usage

The Wan-Animate pipeline supports two modes of operation:

1. **Animation mode** (default): animates a character image using the motion and expressions from the preprocessed reference videos.
2. **Replacement mode**: replaces a character in a background video with a new character while preserving the scene.

##### Prerequisites

Before using the pipeline, you need to preprocess your reference video to extract:
- **Pose video**: Contains skeletal keypoints representing body motion
- **Face video**: Contains facial feature representations for expression control

For replacement mode, you additionally need:
- **Background video**: The original video containing the scene
- **Mask video**: A mask video indicating where to generate content (white) vs. preserve the original (black); see the sketch below
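As an illustration of these inputs, the sketch below builds a simple mask video from a fixed bounding box and checks that the conditioning videos line up frame for frame. The box coordinates, the file paths, and the assumption that all conditioning videos must have matching frame counts are illustrative, not part of the documented API; only `load_video` from `diffusers.utils` is taken from the examples below.

```python
from PIL import Image, ImageDraw
from diffusers.utils import load_video

pose_video = load_video("path/to/pose_video.mp4")
face_video = load_video("path/to/face_video.mp4")
background_video = load_video("path/to/background_video.mp4")

# Build a mask video: white where the new character should be generated,
# black where the original background must be preserved.
mask_video = []
for frame in background_video:
    mask = Image.new("L", frame.size, 0)  # start fully black (preserve everything)
    ImageDraw.Draw(mask).rectangle((200, 80, 520, 700), fill=255)  # illustrative region to generate
    mask_video.append(mask)

# Sanity check: this sketch assumes the conditioning videos are frame-aligned.
assert len(pose_video) == len(face_video) == len(background_video) == len(mask_video)
```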

> [!NOTE]
> The preprocessing tools are available in the original Wan-Animate repository. Integration of these preprocessing steps into Diffusers is planned for a future release.

The example below demonstrates how to use the Wan-Animate pipeline:

<hfoptions id="Animate usage">
<hfoption id="Animation mode">

```python
import numpy as np
import torch
from diffusers import AutoencoderKLWan, WanAnimatePipeline
from diffusers.utils import export_to_video, load_image, load_video

model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanAnimatePipeline.from_pretrained(
    model_id, vae=vae, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Load character image and preprocessed videos
image = load_image("path/to/character.jpg")
pose_video = load_video("path/to/pose_video.mp4")  # Preprocessed skeletal keypoints
face_video = load_video("path/to/face_video.mp4")  # Preprocessed facial features

# Resize image to match VAE constraints
def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

image, height, width = aspect_ratio_resize(image, pipe)

prompt = "A person dancing energetically in a studio with dynamic lighting and professional camera work"
negative_prompt = "blurry, low quality, distorted, deformed, static, poorly drawn"

# Generate animated video
output = pipe(
    image=image,
    pose_video=pose_video,
    face_video=face_video,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,
    mode="animation",  # Animation mode (default)
).frames[0]
export_to_video(output, "animated_character.mp4", fps=16)
```

</hfoption>
<hfoption id="Replacement mode">

```python
import numpy as np
import torch
from diffusers import AutoencoderKLWan, WanAnimatePipeline
from diffusers.utils import export_to_video, load_image, load_video
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float16)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanAnimatePipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# Load all required inputs for replacement mode
image = load_image("path/to/new_character.jpg")
pose_video = load_video("path/to/pose_video.mp4")  # Preprocessed skeletal keypoints
face_video = load_video("path/to/face_video.mp4")  # Preprocessed facial features
background_video = load_video("path/to/background_video.mp4")  # Original scene
mask_video = load_video("path/to/mask_video.mp4")  # Black: preserve, White: generate

# Resize image to match video dimensions
def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

image, height, width = aspect_ratio_resize(image, pipe)

prompt = "A person seamlessly integrated into the scene with consistent lighting and environment"
negative_prompt = "blurry, low quality, inconsistent lighting, floating, disconnected from scene"

# Replace character in background video
output = pipe(
    image=image,
    pose_video=pose_video,
    face_video=face_video,
    background_video=background_video,
    mask_video=mask_video,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,
    mode="replacement",  # Replacement mode
).frames[0]
export_to_video(output, "character_replaced.mp4", fps=16)
```

</hfoption>
<hfoption id="Advanced options">

```python
import numpy as np
import torch
from diffusers import AutoencoderKLWan, WanAnimatePipeline
from diffusers.utils import export_to_video, load_image, load_video
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.2-Animate-14B-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float16)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanAnimatePipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda")

image = load_image("path/to/character.jpg")
pose_video = load_video("path/to/pose_video.mp4")
face_video = load_video("path/to/face_video.mp4")

def aspect_ratio_resize(image, pipe, max_area=720 * 1280):
    aspect_ratio = image.height / image.width
    mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
    height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
    width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
    image = image.resize((width, height))
    return image, height, width

image, height, width = aspect_ratio_resize(image, pipe)

prompt = "A person dancing energetically in a studio"
negative_prompt = "blurry, low quality"

# Advanced: Use temporal guidance and a custom callback
def callback_fn(pipe, step_index, timestep, callback_kwargs):
    # You can modify latents or other tensors here
    print(f"Step {step_index}, Timestep {timestep}")
    return callback_kwargs

output = pipe(
    image=image,
    pose_video=pose_video,
    face_video=face_video,
    prompt=prompt,
    negative_prompt=negative_prompt,
    height=height,
    width=width,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
    num_frames_for_temporal_guidance=5,  # Use 5 frames for temporal guidance (1 or 5 recommended)
    callback_on_step_end=callback_fn,
    callback_on_step_end_tensor_inputs=["latents"],
).frames[0]
export_to_video(output, "animated_advanced.mp4", fps=16)
```

</hfoption>
</hfoptions>

#### Key Parameters

- **mode**: Choose between `"animation"` (default) and `"replacement"`.
- **num_frames_for_temporal_guidance**: Number of frames used for temporal guidance (1 or 5 recommended). Using 5 gives better temporal consistency but requires more memory.
- **guidance_scale**: Controls how closely the output follows the text prompt. Higher values (5-7) produce results more aligned with the prompt.
- **num_frames**: Total number of frames to generate. `num_frames - 1` should be divisible by `vae_scale_factor_temporal` (default: 4), e.g. 81. A quick way to pick a valid value is shown below.
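As a quick illustration of that constraint (assuming the default `vae_scale_factor_temporal = 4`), the hypothetical helper below rounds an arbitrary frame count down to the nearest valid value:

```python
def nearest_valid_num_frames(requested: int, vae_scale_factor_temporal: int = 4) -> int:
    # Valid frame counts have the form k * vae_scale_factor_temporal + 1 (e.g. 77, 81, 85).
    return max(1, (requested - 1) // vae_scale_factor_temporal * vae_scale_factor_temporal + 1)

print(nearest_valid_num_frames(80))  # 77
print(nearest_valid_num_frames(81))  # 81
```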

## Notes

- Wan2.1 supports LoRAs with [`~loaders.WanLoraLoaderMixin.load_lora_weights`].
@@ -281,10 +496,10 @@ The general rule of thumb to keep in mind when preparing inputs for the VACE pip

# use "steamboat willie style" to trigger the LoRA
prompt = """
steamboat willie style, golden era animation, The camera rushes from far to near in a low-angle shot,
revealing a white ferret on a log. It plays, leaps into the water, and emerges, as the camera zooms in
for a close-up. Water splashes berry bushes nearby, while moss, snow, and leaves blanket the ground.
Birch trees and a light blue sky frame the scene, with ferns in the foreground. Side lighting casts dynamic
shadows and warm highlights. Medium composition, front view, low angle, with depth of field.
"""

@@ -359,6 +574,12 @@ The general rule of thumb to keep in mind when preparing inputs for the VACE pip
- all
- __call__

## WanAnimatePipeline

[[autodoc]] WanAnimatePipeline
- all
- __call__

## WanPipelineOutput

[[autodoc]] pipelines.wan.pipeline_output.WanPipelineOutput

examples/community/README.md

Lines changed: 1 addition & 1 deletion
@@ -5488,7 +5488,7 @@ Editing at Scale", many thanks to their contribution!

This implementation of Flux Kontext allows users to pass multiple reference images. Each image is encoded separately, and the resulting latent vectors are concatenated.

-As explained in Section 3 of [the paper](https://arxiv.org/pdf/2506.15742), the model's sequence concatenation mechanism can extend its capabilities to handle multiple reference images. However, note that the current version of Flux Kontext was not trained for this use case. In practice, stacking along the first axis does not yield correct results, while stacking along the other two axes appears to work.
+As explained in Section 3 of [the paper](https://huggingface.co/papers/2506.15742), the model's sequence concatenation mechanism can extend its capabilities to handle multiple reference images. However, note that the current version of Flux Kontext was not trained for this use case. In practice, stacking along the first axis does not yield correct results, while stacking along the other two axes appears to work.
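For intuition only, the toy sketch below shows what "sequence concatenation" of separately encoded references looks like at the tensor level. The shapes, variable names, and choice of axis are illustrative assumptions and are not taken from this community pipeline's code.

```python
import torch

# Toy shapes only: two reference images encoded separately into packed
# latent token sequences of shape (batch, num_tokens, channels).
ref_latents_1 = torch.randn(1, 4096, 64)
ref_latents_2 = torch.randn(1, 4096, 64)

# Sequence concatenation: both references end up in a single token
# sequence that the transformer can attend over jointly.
joint_latents = torch.cat([ref_latents_1, ref_latents_2], dim=1)
print(joint_latents.shape)  # torch.Size([1, 8192, 64])
```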

## Example Usage
