Inconsistent Inference Results with Diffusers’ Implementation of WAN 2.1 14B I2V #11160
Comments
Same here: I suspect the diffusers version is not the same as the official implementation. Something isn't quite right about it.
Hi! I could reproduce the results and we're investigating the cause. In the meantime, can you check whether the same happens with the 480p model?
It seems like it could be an issue with the default scheduler. @bghira opened the following PRs to fix the mapping in
We've notified the Wan team to take a look. The scheduler configs by themselves are correct (they are those of UniPCMultistepScheduler, which is intended), but the mapping in
It's resolved now.
Nice! I tested it and, at least for me, the video now looks normal. I used the 480p model with
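As a workaround while the mapping fix lands, here is a minimal sketch of setting the intended scheduler explicitly on the pipeline. It assumes UniPCMultistepScheduler with flow_shift=5.0 for the 720P checkpoint (3.0 for 480P); those shift values come from the Wan reference settings and are assumptions, not something confirmed in this thread.

from diffusers import UniPCMultistepScheduler

# Assumed workaround: keep the checkpoint's own scheduler config but rebuild the
# scheduler explicitly, overriding flow_shift (5.0 for the 720P I2V model, 3.0 for 480P).
pipe.scheduler = UniPCMultistepScheduler.from_config(
    pipe.scheduler.config, flow_shift=5.0
)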
During initial exploration into adding Wan 2.1 to SimpleTuner, Wan 2.1 was producing horrible nightmare-fuel output, and this boils down to a couple of reasons:
Thanks a lot, @bghira! With that I have something to work with, and we will probably have to update the docs.
Describe the bug
When using the diffusers library to run inference with WAN 2.1 14B I2V, the generated results do not align with those produced by the official inference code or DiffSynth-Studio. This discrepancy occurs even when using the same GPU, seed, precision, and prompt.
diffusers result:
https://github.com/user-attachments/assets/629b396a-df34-499d-9216-afbae0278c9b
DiffSynth-Studio result:
https://github.com/user-attachments/assets/ecad66c9-2b62-4099-8728-85816255fb31
Reproduction
import torch
import numpy as np
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel
from utils import seed_everything  # local helper that seeds Python/NumPy/torch RNGs
from PIL import Image

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "/data/yuxiong/pretrained-weights/Wan2.1-I2V-14B-720P-Diffusers"
# Load the CLIP image encoder and VAE in float32; the rest of the pipeline is loaded in bfloat16 below.
image_encoder = CLIPVisionModel.from_pretrained(
    model_id, subfolder="image_encoder", torch_dtype=torch.float32
)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(
model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda:1")
image = load_image(
"/data/yuxiong/DiffSynth-Studio/tmp/set_2/正面全身/sS2103030079966395_00.jpg"
)
max_area = 720 * 1280
aspect_ratio = image.height / image.width
# Snap height and width down to multiples of vae_scale_factor_spatial * patch_size
# so the resolution is valid for the VAE and the transformer patchifier.
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height), Image.LANCZOS)
prompt = "The video begins with a static full-body shot of a model wearing an elegant outfit. The camera gently zooms in, focusing on key elements like fabric texture, stitching, and design details. A soft breeze causes the fabric to move slightly, giving it a natural and dynamic feel. The lighting subtly enhances different angles, making the material appear more luxurious. The video concludes with a slow transition back to the full-frame view."
# Standard Wan negative prompt (in Chinese): facial distortion, nudity, oversaturated colors,
# overexposure, static frames, blurry details, subtitles, worst/low quality, JPEG artifacts,
# extra or fused fingers, deformed limbs, cluttered background, etc.
negative_prompt = "脸部畸变,裸露,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"
seed = 40561179
seed_everything(seed)
output = pipe(
image=image,
prompt=prompt,
negative_prompt=negative_prompt,
num_inference_steps=50, # default: 50
height=height,
width=width,
num_frames=81,
guidance_scale=5.0, # default: 5.0
generator=torch.Generator("cpu").manual_seed(seed),
).frames[0]
export_to_video(output, "output/output_50.mp4", fps=16)
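To answer the question above about whether the same thing happens with the 480P model, here is a minimal self-contained sketch of the equivalent 480P run. The model id Wan-AI/Wan2.1-I2V-14B-480P-Diffusers comes from the comment in the script above; the 480*832 target area, flow_shift=3.0, the input path, and the placeholder prompts are assumptions for illustration.

import torch
import numpy as np
from PIL import Image
from diffusers import AutoencoderKLWan, WanImageToVideoPipeline, UniPCMultistepScheduler
from diffusers.utils import export_to_video, load_image
from transformers import CLIPVisionModel

model_id = "Wan-AI/Wan2.1-I2V-14B-480P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(model_id, subfolder="image_encoder", torch_dtype=torch.float32)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16)
# Assumed reference setting: a lower flow shift for the 480P model than for 720P.
pipe.scheduler = UniPCMultistepScheduler.from_config(pipe.scheduler.config, flow_shift=3.0)
pipe.to("cuda")

image = load_image("input.jpg")  # hypothetical input image path
max_area = 480 * 832  # assumed 480P target area
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height), Image.LANCZOS)

output = pipe(
    image=image,
    prompt="...",  # same prompt as in the 720P script above
    negative_prompt="...",  # same negative prompt as above
    height=height,
    width=width,
    num_frames=81,
    num_inference_steps=50,
    guidance_scale=5.0,
    generator=torch.Generator("cpu").manual_seed(40561179),
).frames[0]
export_to_video(output, "output_480p.mp4", fps=16)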
Logs
System Info
latest diffusers build from source
Linux
Python 3.11
Who can help?
No response