### Describe the bug
When using the diffusers library to run inference with Wan2.1 I2V 14B (720P), the generated results do not match those produced by the official inference code or by DiffSynth-Studio, even when using the same GPU, seed, precision, and prompt.
diffusers result:
https://github.com/user-attachments/assets/629b396a-df34-499d-9216-afbae0278c9b
DiffSynth-Studio result:
https://github.com/user-attachments/assets/ecad66c9-2b62-4099-8728-85816255fb31
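
To make "do not align" concrete, the two exported videos can be compared numerically. This is a minimal sketch, assuming `imageio` with ffmpeg support is installed and that both clips share the same resolution and frame count; the file names are placeholders:

```python
# Sketch: quantify the frame-level mismatch between the two runs.
# Assumes imageio with ffmpeg support; file names are placeholders.
import imageio
import numpy as np

a = np.stack(imageio.mimread("diffusers_output.mp4", memtest=False)).astype(np.float32)
b = np.stack(imageio.mimread("diffsynth_output.mp4", memtest=False)).astype(np.float32)

# Identical outputs would give ~0 here (pixel values are on a 0-255 scale).
print("mean abs pixel diff:", np.abs(a - b).mean())
```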
### Reproduction
```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPVisionModel

from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from utils import seed_everything  # local helper module, not part of diffusers

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "/data/yuxiong/pretrained-weights/Wan2.1-I2V-14B-720P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(
    model_id, subfolder="image_encoder", torch_dtype=torch.float32
)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda:1")

image = load_image(
    "/data/yuxiong/DiffSynth-Studio/tmp/set_2/正面全身/sS2103030079966395_00.jpg"
)

# Snap height/width to multiples of the spatial stride (VAE spatial scale factor
# x transformer patch size) while keeping roughly max_area pixels.
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height), Image.LANCZOS)

prompt = "The video begins with a static full-body shot of a model wearing an elegant outfit. The camera gently zooms in, focusing on key elements like fabric texture, stitching, and design details. A soft breeze causes the fabric to move slightly, giving it a natural and dynamic feel. The lighting subtly enhances different angles, making the material appear more luxurious. The video concludes with a slow transition back to the full-frame view."
# Negative prompt kept verbatim in Chinese since it is part of the repro; roughly:
# "facial distortion, nudity, garish colors, overexposed, static, blurry details,
# subtitles, style, artwork, painting, frame, motionless, overall gray, worst quality,
# low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly
# drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused
# fingers, static frame, cluttered background, three legs, crowded background,
# walking backwards"
negative_prompt = "脸部畸变,裸露,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

seed = 40561179
seed_everything(seed)
output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,  # default: 50
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,  # default: 5.0
    generator=torch.Generator("cpu").manual_seed(seed),
).frames[0]
export_to_video(output, "output/output_50.mp4", fps=16)
```
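
Note that `seed_everything` is imported from a local `utils` module that is not part of diffusers and is not shown here; a typical implementation is roughly the following sketch (the actual helper may differ):

```python
# Sketch of a typical seed_everything helper; the reporter's local
# utils.seed_everything is not shown and may differ.
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG (and CUDA via PyTorch)
    torch.cuda.manual_seed_all(seed)  # explicitly seeds all CUDA devices
```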
### Logs
### System Info

- diffusers: latest, built from source
- OS: Linux
- Python: 3.11
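Since "latest, built from source" does not pin an exact revision, recording the versions actually in use helps with triage; a small snippet like this prints them:

```python
# Print the exact interpreter and library versions in use.
import platform

import diffusers
import torch

print("python:", platform.python_version())
print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
```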
### Who can help?
No response