Why is Qwen-Image inference slower in diffusers than in ComfyUI?

### Describe the bug

I tested diffusers and ComfyUI with the same parameters and checked that the input shapes match.
The parameters are:

<img width="788" height="462" alt="Image" src="https://github.com/user-attachments/assets/3cb6630e-2e2a-4bc4-a74f-34a2b8bff1c5" />

<img width="1352" height="716" alt="Image" src="https://github.com/user-attachments/assets/7ea93b46-3396-490d-af19-af608769d8c0" />

<img width="1458" height="420" alt="Image" src="https://github.com/user-attachments/assets/2434f999-e077-490a-bd93-440221dadd3b" />

I checked the shapes of the text embeddings and the VAE tensors; they are identical. Both setups use the same PyTorch attention and the same bfloat16 version of the model.
Still, the speed is 2.39 it/s in diffusers vs. 2.7 it/s in ComfyUI.

I have put a lot of effort into this but cannot find what is affecting the speed.
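To put the gap in concrete terms, here is a small arithmetic sketch that converts the two reported rates into per-step latency. It uses only the numbers from this report (2.39 it/s, 2.7 it/s, 20 steps); the variable names are just for illustration.

```python
# Reported throughput from the progress bars (iterations per second).
diffusers_its = 2.39
comfyui_its = 2.7
num_steps = 20  # num_inference_steps used in both runs

# Per-step latency is the reciprocal of the throughput.
diffusers_step_s = 1.0 / diffusers_its
comfyui_step_s = 1.0 / comfyui_its

# Extra time diffusers spends per step and over the full denoising loop.
extra_per_step_ms = (diffusers_step_s - comfyui_step_s) * 1000.0
extra_total_s = (diffusers_step_s - comfyui_step_s) * num_steps

# Relative slowdown in per-step time.
slowdown_pct = (diffusers_step_s / comfyui_step_s - 1.0) * 100.0

print(f"extra per step: {extra_per_step_ms:.1f} ms")
print(f"extra over {num_steps} steps: {extra_total_s:.2f} s")
print(f"per-step slowdown: {slowdown_pct:.1f} %")
```

So the difference is roughly 48 ms per step, or a bit under one second over the 20-step run.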

My diffusers version is 0.36.0.dev0


### Reproduction

```python
import os

# Restrict the process to a single GPU before torch is imported.
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

import torch
from diffusers import QwenImagePipeline

pipeline = QwenImagePipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16, device_map="cuda"
)

prompt = """女孩"""  # "girl"
inputs = {
    "prompt": prompt,
    # "negative_prompt": " ",
    # "generator": torch.manual_seed(42),
    "generator": torch.Generator(device="cuda").manual_seed(1125488487853216),
    "width": 1216,
    "height": 832,
    "true_cfg_scale": 1,
    "num_inference_steps": 20,
    "guidance_scale": 1.0,
    "num_images_per_prompt": 1,
}

with torch.inference_mode():
    output = pipeline(**inputs)
    output_image = output.images[0]
output_image.save("output.png")
```
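For anyone who wants to reproduce the numbers without relying on the progress bar, below is a minimal timing sketch. The helper `measure_its` and its parameters are hypothetical (not part of diffusers or ComfyUI). It times a callable end to end, with an optional `sync` hook; on a GPU one would presumably pass `torch.cuda.synchronize` so that queued kernels are included in the measurement. The demo uses a sleep-based stand-in instead of the real pipeline so that it runs anywhere.

```python
import time
from statistics import median


def measure_its(run, num_steps, warmup=1, repeats=3, sync=None):
    """Return the median iterations per second of `run` over `repeats` timed calls.

    `run` performs one full generation of `num_steps` steps.
    `sync` is an optional callable invoked before and after each timed call
    (assumption: torch.cuda.synchronize when timing GPU work).
    """
    sync = sync or (lambda: None)

    # Warm-up calls are executed but not timed (allocator and kernel caches).
    for _ in range(warmup):
        run()

    rates = []
    for _ in range(repeats):
        sync()
        start = time.perf_counter()
        run()
        sync()
        elapsed = time.perf_counter() - start
        rates.append(num_steps / elapsed)
    return median(rates)


if __name__ == "__main__":
    # Stand-in workload: 20 "steps" of 5 ms each instead of the real pipeline.
    def fake_generation():
        for _ in range(20):
            time.sleep(0.005)

    print(f"{measure_its(fake_generation, num_steps=20):.2f} it/s")
```

With the reproduction script, `run` would be something like `lambda: pipeline(**inputs)` and `num_steps` would be 20. Note that this times the whole call, including text encoding and VAE decoding, so the result will likely be lower than the progress-bar figure, which covers only the denoising loop.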

### Logs

```shell

```

### System Info


- 🤗 Diffusers version: 0.36.0.dev0
- Platform: Linux-5.15.0-160-generic-x86_64-with-glibc2.35
- Running on Google Colab?: No
- Python version: 3.12.3
- PyTorch version (GPU?): 2.8.0+cu128 (True)
- Flax version (CPU?/GPU?/TPU?): not installed (NA)
- Jax version: not installed
- JaxLib version: not installed
- Huggingface_hub version: 0.34.0
- Transformers version: 4.57.1
- Accelerate version: 1.11.0
- PEFT version: 0.17.1
- Bitsandbytes version: not installed
- Safetensors version: 0.6.2
- xFormers version: not installed
- Accelerator: 8x NVIDIA A800-SXM4-80GB, 81920 MiB each
- Using GPU in script?: <fill in>
- Using distributed or parallel set-up in script?: <fill in>

### Who can help?

_No response_
