
Inconsistent Inference Results with Diffusers’ Implementation of WAN 2.1 14B I2V #11160

Closed
matabear-wyx opened this issue Mar 27, 2025 · 7 comments
Labels: bug (Something isn't working)

Comments


matabear-wyx commented Mar 27, 2025

Describe the bug

When using the diffusers library to run inference with WAN 2.1 14B I2V, the generated results do not align with those produced by the official inference code or DiffSynth-Studio. This discrepancy occurs even when using the same GPU, seed, precision, and prompt.

diffusers result:
https://github.com/user-attachments/assets/629b396a-df34-499d-9216-afbae0278c9b

DiffSynth-Studio result:
https://github.com/user-attachments/assets/ecad66c9-2b62-4099-8728-85816255fb31

Reproduction

import numpy as np
import torch
from PIL import Image

from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

from utils import seed_everything  # local helper, not part of diffusers (a sketch follows this script)

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "/data/yuxiong/pretrained-weights/Wan2.1-I2V-14B-720P-Diffusers"

# Image encoder and VAE are loaded in float32; the transformer runs in bfloat16.
image_encoder = CLIPVisionModel.from_pretrained(
    model_id, subfolder="image_encoder", torch_dtype=torch.float32
)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda:1")

image = load_image(
    "/data/yuxiong/DiffSynth-Studio/tmp/set_2/正面全身/sS2103030079966395_00.jpg"
)

# Snap the resolution to the nearest multiple of the VAE spatial scale factor
# times the transformer patch size, preserving the input aspect ratio.
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height), Image.LANCZOS)

prompt = "The video begins with a static full-body shot of a model wearing an elegant outfit. The camera gently zooms in, focusing on key elements like fabric texture, stitching, and design details. A soft breeze causes the fabric to move slightly, giving it a natural and dynamic feel. The lighting subtly enhances different angles, making the material appear more luxurious. The video concludes with a slow transition back to the full-frame view."

# Standard Wan negative prompt, in Chinese. Roughly: "facial distortion, nudity,
# garish colors, overexposed, static, blurry details, subtitles, style, artwork,
# painting, frame, motionless, overall gray, worst quality, low quality, JPEG
# compression artifacts, ugly, mutilated, extra fingers, poorly drawn hands,
# poorly drawn face, deformed, disfigured, malformed limbs, fused fingers,
# motionless frame, cluttered background, three legs, many people in the
# background, walking backwards".
negative_prompt = "脸部畸变,裸露,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

seed = 40561179
seed_everything(seed)

output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,  # default: 50
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,  # default: 5.0
    generator=torch.Generator("cpu").manual_seed(seed),
).frames[0]
export_to_video(output, "output/output_50.mp4", fps=16)
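seed_everything comes from a local utils module and is not shown in the issue. A minimal sketch of what such a helper typically does, assuming it seeds Python's, NumPy's, and PyTorch's RNGs (the body here is a reconstruction, not the reporter's actual code):

import os
import random

import numpy as np
import torch

def seed_everything(seed: int) -> None:
    # Hypothetical reconstruction: seed every RNG that could influence
    # generation so repeated runs are comparable.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)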

Logs

System Info

latest diffusers build from source
Linux
Python 3.11

Who can help?

No response

matabear-wyx added the bug label on Mar 27, 2025

pizzapy3 commented Apr 3, 2025

Same:
https://github.com/user-attachments/assets/b4c81d50-3aee-4d82-847a-2e623675c37c

I suspect the diffusers version is not the same thing as the official version. Something isn't quite right about it.

asomoza (Member) commented Apr 3, 2025

Hi! I could reproduce the results and we're investigating the cause. In the meantime, can you check whether the same happens with the 480p model?
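(For reference, only two values in the reproduction script above need to change to try the 480P checkpoint. This is a sketch: the Hub id comes from the model list in the script's comment, and 480 * 832 is the 480P target area the diffusers docs use.)

model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
max_area = 480 * 832  # 480P target area instead of 720 * 1280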

a-r-r-o-w (Member) commented

It seems like it could be an issue with the default scheduler. @bghira opened PRs to fix the mapping in model_index.json.

We've notified the Wan team to take a look. The scheduler configs by themselves are correct (they are those of UniPCMultistepScheduler, which is intended), but the mapping in model_index.json instantiates FlowMatchEulerDiscreteScheduler, which seems to generate worse results.
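Until a fixed checkout is in place, the intended scheduler can be forced manually at load time. A minimal sketch that reuses the scheduler config shipped with the checkpoint; the flow_shift value is an assumption based on the diffusers Wan docs (3.0 for 480P, 5.0 for 720P):

from diffusers import UniPCMultistepScheduler

# Assumes `pipe` is an already-loaded WanImageToVideoPipeline. Replace the
# scheduler that model_index.json mapped in with the intended UniPC one.
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=5.0
)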

bghira (Contributor) commented Apr 4, 2025

It's resolved now.

asomoza (Member) commented Apr 4, 2025

Nice! I tested it and, at least for me, the video now looks normal. I had been using the 480p model with a manually set up FlowMatchEulerDiscreteScheduler all along and this never happened to me, so I'm intrigued about what triggers the bad results.

bghira (Contributor) commented Apr 4, 2025

During initial exploration into adding Wan 2.1 to SimpleTuner, horrible nightmare-fuel output was coming from Wan 2.1, and this boils down to a few causes:

Not enough steps for inference
    Unless you're using UniPC, you probably need at least 40 steps. UniPC can bring the number down a little, but you'll have to experiment.
Incorrect scheduler configuration
    It was using the normal Euler flow-matching schedule, but the betas distribution seems to work best (see the sketch after this list).
Incorrect resolution
    Wan 2.1 only really works correctly at the resolutions it was trained on; other sizes sometimes work if you get lucky, but bad results are common.
Bad CFG value
    Wan 2.1 1.3B in particular seems sensitive to CFG values, but a value around 4.0-5.0 seems safe.
Bad prompting
    Of course, video models seem to require a team of mystics to spend months in the mountains on a zen retreat to learn the sacred art of prompting, because their datasets and caption style are guarded like the Holy Grail.
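On the scheduler point: if you stay on the Euler flow-matching scheduler rather than UniPC, the beta sigma schedule is exposed as a config flag in recent diffusers releases. A minimal sketch; reading "the betas distribution" as use_beta_sigmas is an assumption:

from diffusers import FlowMatchEulerDiscreteScheduler

# Assumes `pipe` is an already-loaded Wan pipeline. use_beta_sigmas swaps the
# default sigma spacing for a beta-distributed schedule (requires scipy).
pipe.scheduler = FlowMatchEulerDiscreteScheduler.from_config(
    pipe.scheduler.config, use_beta_sigmas=True
)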

asomoza (Member) commented Apr 4, 2025

Thanks a lot, @bghira! With that I have something to work with, and we'll probably have to update the docs.
