### Describe the bug
When using the diffusers library to run inference with Wan2.1 I2V 14B (720P), the generated results do not match those produced by the official inference code or by DiffSynth-Studio, even when using the same GPU, seed, precision, and prompt.
diffusers result:
https://github.com/user-attachments/assets/629b396a-df34-499d-9216-afbae0278c9b
DiffSynth-Studio result:
https://github.com/user-attachments/assets/ecad66c9-2b62-4099-8728-85816255fb31
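
To make "do not align" concrete, the two exported videos can be compared numerically. This is a minimal sketch, assuming `imageio` with ffmpeg support is installed and that both clips share the same resolution and frame count; the file names are placeholders:

```python
# Sketch: quantify the frame-level mismatch between the two runs.
# Assumes imageio with ffmpeg support; file names are placeholders.
import imageio
import numpy as np

a = np.stack(imageio.mimread("diffusers_output.mp4", memtest=False)).astype(np.float32)
b = np.stack(imageio.mimread("diffsynth_output.mp4", memtest=False)).astype(np.float32)

# Identical outputs would give ~0 here (pixel values are on a 0-255 scale).
print("mean abs pixel diff:", np.abs(a - b).mean())
```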
### Reproduction
```python
import numpy as np
import torch
from PIL import Image
from transformers import CLIPVisionModel

from diffusers import AutoencoderKLWan, WanImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
from utils import seed_everything  # local helper module, not part of diffusers

# Available models: Wan-AI/Wan2.1-I2V-14B-480P-Diffusers, Wan-AI/Wan2.1-I2V-14B-720P-Diffusers
model_id = "/data/yuxiong/pretrained-weights/Wan2.1-I2V-14B-720P-Diffusers"
image_encoder = CLIPVisionModel.from_pretrained(
    model_id, subfolder="image_encoder", torch_dtype=torch.float32
)
vae = AutoencoderKLWan.from_pretrained(model_id, subfolder="vae", torch_dtype=torch.float32)
pipe = WanImageToVideoPipeline.from_pretrained(
    model_id, vae=vae, image_encoder=image_encoder, torch_dtype=torch.bfloat16
)
pipe.to("cuda:1")

image = load_image(
    "/data/yuxiong/DiffSynth-Studio/tmp/set_2/正面全身/sS2103030079966395_00.jpg"
)

# Snap height/width to multiples of the spatial stride (VAE spatial scale factor
# x transformer patch size) while keeping roughly max_area pixels.
max_area = 720 * 1280
aspect_ratio = image.height / image.width
mod_value = pipe.vae_scale_factor_spatial * pipe.transformer.config.patch_size[1]
height = round(np.sqrt(max_area * aspect_ratio)) // mod_value * mod_value
width = round(np.sqrt(max_area / aspect_ratio)) // mod_value * mod_value
image = image.resize((width, height), Image.LANCZOS)

prompt = "The video begins with a static full-body shot of a model wearing an elegant outfit. The camera gently zooms in, focusing on key elements like fabric texture, stitching, and design details. A soft breeze causes the fabric to move slightly, giving it a natural and dynamic feel. The lighting subtly enhances different angles, making the material appear more luxurious. The video concludes with a slow transition back to the full-frame view."
# Negative prompt kept verbatim in Chinese since it is part of the repro; roughly:
# "facial distortion, nudity, garish colors, overexposed, static, blurry details,
# subtitles, style, artwork, painting, frame, motionless, overall gray, worst quality,
# low quality, JPEG compression artifacts, ugly, mutilated, extra fingers, poorly
# drawn hands, poorly drawn face, deformed, disfigured, malformed limbs, fused
# fingers, static frame, cluttered background, three legs, crowded background,
# walking backwards"
negative_prompt = "脸部畸变,裸露,色调艳丽,过曝,静态,细节模糊不清,字幕,风格,作品,画作,画面,静止,整体发灰,最差质量,低质量,JPEG压缩残留,丑陋的,残缺的,多余的手指,画得不好的手部,画得不好的脸部,畸形的,毁容的,形态畸形的肢体,手指融合,静止不动的画面,杂乱的背景,三条腿,背景人很多,倒着走"

seed = 40561179
seed_everything(seed)
output = pipe(
    image=image,
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=50,  # default: 50
    height=height,
    width=width,
    num_frames=81,
    guidance_scale=5.0,  # default: 5.0
    generator=torch.Generator("cpu").manual_seed(seed),
).frames[0]
export_to_video(output, "output/output_50.mp4", fps=16)
```
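
Note that `seed_everything` is imported from a local `utils` module that is not part of diffusers and is not shown here; a typical implementation is roughly the following sketch (the actual helper may differ):

```python
# Sketch of a typical seed_everything helper; the reporter's local
# utils.seed_everything is not shown and may differ.
import random

import numpy as np
import torch


def seed_everything(seed: int) -> None:
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)           # seeds the CPU RNG (and CUDA via PyTorch)
    torch.cuda.manual_seed_all(seed)  # explicitly seeds all CUDA devices
```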
### Logs
### System Info

- diffusers: latest, built from source
- OS: Linux
- Python: 3.11
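Since "latest, built from source" does not pin an exact revision, recording the versions actually in use helps with triage; a small snippet like this prints them:

```python
# Print the exact interpreter and library versions in use.
import platform

import diffusers
import torch

print("python:", platform.python_version())
print("torch:", torch.__version__)
print("diffusers:", diffusers.__version__)
```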
### Who can help?
No response