
Commit dbe80e3

Merge branch 'main' into add-magi-1

2 parents 147fa12 + eeae033

File tree: 169 files changed, +12918 −1499 lines


.github/workflows/push_tests.yml

Lines changed: 3 additions & 0 deletions
```diff
@@ -76,6 +76,7 @@ jobs:
       run: |
         uv pip install -e ".[quality]"
         uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
+        uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
     - name: Environment
       run: |
         python utils/print_env.py
@@ -127,6 +128,7 @@ jobs:
         uv pip install -e ".[quality]"
         uv pip install peft@git+https://github.com/huggingface/peft.git
         uv pip uninstall accelerate && uv pip install -U accelerate@git+https://github.com/huggingface/accelerate.git
+        uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
 
     - name: Environment
       run: |
@@ -178,6 +180,7 @@ jobs:
     - name: Install dependencies
       run: |
         uv pip install -e ".[quality,training]"
+        uv pip uninstall transformers huggingface_hub && uv pip install --prerelease allow -U transformers@git+https://github.com/huggingface/transformers.git
     - name: Environment
       run: |
         python utils/print_env.py
```
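
These steps pin the CI environment to prerelease builds of `transformers` and `accelerate` installed straight from GitHub. A quick way to confirm the pins took effect (a hypothetical check, not part of the workflow) is to print the installed versions:

```python
# Expect ".devN" / prerelease version strings after the git-based installs above.
import accelerate
import transformers

print("transformers:", transformers.__version__)
print("accelerate:", accelerate.__version__)
```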

docs/source/en/_toctree.yml

Lines changed: 18 additions & 4 deletions
```diff
@@ -22,6 +22,8 @@
     title: Reproducibility
   - local: using-diffusers/schedulers
     title: Schedulers
+  - local: using-diffusers/automodel
+    title: AutoModel
   - local: using-diffusers/other-formats
     title: Model formats
   - local: using-diffusers/push_to_hub
@@ -119,6 +121,8 @@
     title: ComponentsManager
   - local: modular_diffusers/guiders
     title: Guiders
+  - local: modular_diffusers/custom_blocks
+    title: Building Custom Blocks
   title: Modular Diffusers
 - isExpanded: false
   sections:
@@ -329,6 +333,8 @@
     title: BriaTransformer2DModel
   - local: api/models/chroma_transformer
     title: ChromaTransformer2DModel
+  - local: api/models/chronoedit_transformer_3d
+    title: ChronoEditTransformer3DModel
   - local: api/models/cogvideox_transformer3d
     title: CogVideoXTransformer3DModel
   - local: api/models/cogview3plus_transformer2d
@@ -375,6 +381,8 @@
     title: QwenImageTransformer2DModel
   - local: api/models/sana_transformer2d
     title: SanaTransformer2DModel
+  - local: api/models/sana_video_transformer3d
+    title: SanaVideoTransformer3DModel
   - local: api/models/sd3_transformer2d
     title: SD3Transformer2DModel
   - local: api/models/skyreels_v2_transformer_3d
@@ -385,6 +393,8 @@
     title: Transformer2DModel
   - local: api/models/transformer_temporal
     title: TransformerTemporalModel
+  - local: api/models/wan_animate_transformer_3d
+    title: WanAnimateTransformer3DModel
   - local: api/models/wan_transformer_3d
     title: WanTransformer3DModel
   title: Transformers
@@ -448,6 +458,8 @@
 - sections:
   - local: api/pipelines/overview
     title: Overview
+  - local: api/pipelines/auto_pipeline
+    title: AutoPipeline
   - sections:
     - local: api/pipelines/audioldm
       title: AudioLDM
@@ -460,8 +472,6 @@
     - local: api/pipelines/stable_audio
       title: Stable Audio
     title: Audio
-  - local: api/pipelines/auto_pipeline
-    title: AutoPipeline
   - sections:
     - local: api/pipelines/amused
       title: aMUSEd
@@ -525,6 +535,8 @@
       title: HiDream-I1
     - local: api/pipelines/hunyuandit
       title: Hunyuan-DiT
+    - local: api/pipelines/hunyuanimage21
+      title: HunyuanImage2.1
     - local: api/pipelines/pix2pix
       title: InstructPix2Pix
     - local: api/pipelines/kandinsky
@@ -567,6 +579,8 @@
       title: Sana
     - local: api/pipelines/sana_sprint
       title: Sana Sprint
+    - local: api/pipelines/sana_video
+      title: Sana Video
     - local: api/pipelines/self_attention_guidance
       title: Self-Attention Guidance
     - local: api/pipelines/semantic_stable_diffusion
@@ -628,14 +642,14 @@
 - sections:
   - local: api/pipelines/allegro
     title: Allegro
+  - local: api/pipelines/chronoedit
+    title: ChronoEdit
   - local: api/pipelines/cogvideox
     title: CogVideoX
   - local: api/pipelines/consisid
     title: ConsisID
   - local: api/pipelines/framepack
     title: Framepack
-  - local: api/pipelines/hunyuanimage21
-    title: HunyuanImage2.1
   - local: api/pipelines/hunyuan_video
     title: HunyuanVideo
   - local: api/pipelines/i2vgenxl
```
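
Every `_toctree.yml` entry pairs a `local` page path with a `title`, and `sections` lists nest recursively; new pages such as the ones added here must keep that shape. A minimal consistency check (a sketch, assuming PyYAML and the repository layout above):

```python
import yaml

# Walk docs/source/en/_toctree.yml and assert every page entry carries a title.
with open("docs/source/en/_toctree.yml") as f:
    toc = yaml.safe_load(f)

def walk(nodes):
    for node in nodes:
        if "local" in node:
            assert "title" in node, f"missing title for {node['local']}"
        walk(node.get("sections", []))

walk(toc)
```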

docs/source/en/api/models/auto_model.md

Lines changed: 1 addition & 9 deletions
````diff
@@ -12,15 +12,7 @@ specific language governing permissions and limitations under the License.
 
 # AutoModel
 
-The `AutoModel` is designed to make it easy to load a checkpoint without needing to know the specific model class. `AutoModel` automatically retrieves the correct model class from the checkpoint `config.json` file.
-
-```python
-from diffusers import AutoModel, AutoPipelineForText2Image
-
-unet = AutoModel.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet")
-pipe = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", unet=unet)
-```
-
+[`AutoModel`] automatically retrieves the correct model class from the checkpoint `config.json` file.
 
 ## AutoModel
 
````

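This commit also adds a `using-diffusers/automodel` guide to the toctree, which presumably absorbs the removed quickstart. For context, a minimal sketch of the behavior the remaining sentence describes, based on the example removed above:

```python
from diffusers import AutoModel, AutoPipelineForText2Image

# AutoModel inspects the checkpoint's config.json under the given subfolder and
# instantiates the matching class (here, the SD v1.5 UNet) without naming it.
unet = AutoModel.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", subfolder="unet")
pipe = AutoPipelineForText2Image.from_pretrained("stable-diffusion-v1-5/stable-diffusion-v1-5", unet=unet)
```
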
docs/source/en/api/models/chronoedit_transformer_3d.md

Lines changed: 33 additions & 0 deletions
````diff
@@ -0,0 +1,33 @@
+<!-- Copyright 2025 The ChronoEdit Team and HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License. -->
+
+# ChronoEditTransformer3DModel
+
+A Diffusion Transformer model for 3D video-like data, introduced in [ChronoEdit: Towards Temporal Reasoning for Image Editing and World Simulation](https://huggingface.co/papers/2510.04290) by NVIDIA and the University of Toronto (Jay Zhangjie Wu, Xuanchi Ren, Tianchang Shen, Tianshi Cao, Kai He, Yifan Lu, Ruiyuan Gao, Enze Xie, Shiyi Lan, Jose M. Alvarez, Jun Gao, Sanja Fidler, Zian Wang, Huan Ling).
+
+> **TL;DR:** ChronoEdit reframes image editing as a video generation task, using the input and edited images as start/end frames to leverage pretrained video models with temporal consistency. A temporal reasoning stage introduces reasoning tokens to ensure physically plausible edits and visualize the editing trajectory.
+
+The model can be loaded with the following code snippet.
+
+```python
+import torch
+from diffusers import ChronoEditTransformer3DModel
+
+transformer = ChronoEditTransformer3DModel.from_pretrained("nvidia/ChronoEdit-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
+```
+
+## ChronoEditTransformer3DModel
+
+[[autodoc]] ChronoEditTransformer3DModel
+
+## Transformer2DModelOutput
+
+[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
````
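
Since this repo's `config.json` records the class, the same checkpoint can also be resolved without naming it, per the `AutoModel` behavior documented above (a sketch, assuming the Hub repo is laid out as in the snippet):

```python
import torch
from diffusers import AutoModel

# Resolves ChronoEditTransformer3DModel from the checkpoint's config.json.
transformer = AutoModel.from_pretrained(
    "nvidia/ChronoEdit-14B-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16
)
```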
docs/source/en/api/models/sana_video_transformer3d.md

Lines changed: 36 additions & 0 deletions
````diff
@@ -0,0 +1,36 @@
+<!-- Copyright 2025 The SANA-Video Authors and HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License. -->
+
+# SanaVideoTransformer3DModel
+
+A Diffusion Transformer model for 3D data (video), introduced in [SANA-Video: Efficient Video Generation with Block Linear Diffusion Transformer](https://huggingface.co/papers/2509.24695) by NVIDIA and MIT HAN Lab (Junsong Chen, Yuyang Zhao, Jincheng Yu, Ruihang Chu, Junyu Chen, Shuai Yang, Xianbang Wang, Yicheng Pan, Daquan Zhou, Huan Ling, Haozhe Liu, Hongwei Yi, Hao Zhang, Muyang Li, Yukang Chen, Han Cai, Sanja Fidler, Ping Luo, Song Han, Enze Xie).
+
+The abstract from the paper is:
+
+*We introduce SANA-Video, a small diffusion model that can efficiently generate videos up to 720x1280 resolution and minute-length duration. SANA-Video synthesizes high-resolution, high-quality and long videos with strong text-video alignment at a remarkably fast speed, deployable on RTX 5090 GPU. Two core designs ensure our efficient, effective and long video generation: (1) Linear DiT: We leverage linear attention as the core operation, which is more efficient than vanilla attention given the large number of tokens processed in video generation. (2) Constant-Memory KV cache for Block Linear Attention: we design block-wise autoregressive approach for long video generation by employing a constant-memory state, derived from the cumulative properties of linear attention. This KV cache provides the Linear DiT with global context at a fixed memory cost, eliminating the need for a traditional KV cache and enabling efficient, minute-long video generation. In addition, we explore effective data filters and model training strategies, narrowing the training cost to 12 days on 64 H100 GPUs, which is only 1% of the cost of MovieGen. Given its low cost, SANA-Video achieves competitive performance compared to modern state-of-the-art small diffusion models (e.g., Wan 2.1-1.3B and SkyReel-V2-1.3B) while being 16x faster in measured latency. Moreover, SANA-Video can be deployed on RTX 5090 GPUs with NVFP4 precision, accelerating the inference speed of generating a 5-second 720p video from 71s to 29s (2.4x speedup). In summary, SANA-Video enables low-cost, high-quality video generation.*
+
+The model can be loaded with the following code snippet.
+
+```python
+import torch
+from diffusers import SanaVideoTransformer3DModel
+
+transformer = SanaVideoTransformer3DModel.from_pretrained("Efficient-Large-Model/SANA-Video_2B_480p_diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
+```
+
+## SanaVideoTransformer3DModel
+
+[[autodoc]] SanaVideoTransformer3DModel
+
+## Transformer2DModelOutput
+
+[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
+
````
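
The latency claim in the abstract is easy to sanity-check with the numbers it quotes (units assumed to be wall-clock seconds):

```python
# 71 s -> 29 s for a 5-second 720p video with NVFP4 on an RTX 5090, per the abstract.
baseline_s, nvfp4_s = 71.0, 29.0
print(f"NVFP4 speedup: {baseline_s / nvfp4_s:.1f}x")  # ~2.4x, matching the reported figure
```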
docs/source/en/api/models/wan_animate_transformer_3d.md

Lines changed: 31 additions & 0 deletions
````diff
@@ -0,0 +1,31 @@
+<!-- Copyright 2025 The HuggingFace Team. All rights reserved.
+
+Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
+an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
+specific language governing permissions and limitations under the License. -->
+
+# WanAnimateTransformer3DModel
+
+A Diffusion Transformer model for 3D video-like data, introduced in [Wan Animate](https://github.com/Wan-Video/Wan2.2) by the Alibaba Wan Team.
+
+The model can be loaded with the following code snippet.
+
+```python
+import torch
+from diffusers import WanAnimateTransformer3DModel
+
+transformer = WanAnimateTransformer3DModel.from_pretrained("Wan-AI/Wan2.2-Animate-14B-720P-Diffusers", subfolder="transformer", torch_dtype=torch.bfloat16)
+```
+
+## WanAnimateTransformer3DModel
+
+[[autodoc]] WanAnimateTransformer3DModel
+
+## Transformer2DModelOutput
+
+[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
````
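
For sizing purposes, a rough weights-only footprint estimate (assuming the "14B" in the repository name is the parameter count; activations and the rest of the pipeline come on top):

```python
# bfloat16 stores 2 bytes per parameter.
params = 14e9          # assumed from "14B" in the repo name
bytes_per_param = 2
print(f"~{params * bytes_per_param / 2**30:.0f} GiB of VRAM for the weights alone")  # ~26 GiB
```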
