Releases: mindspore-lab/mindone
MindONE v0.3.0 release
We are thrilled to announce the release of MindONE v0.3.0, featuring more state-of-the-art multi-modal understanding and generative models and better compatibility with transformers and diffusers. MindONE now supports the latest features of diffusers v0.32.2, including over 160 pipelines, 50 models, and 35 schedulers, allowing users to easily develop new image/video/audio generation models or transfer existing models from torch to mindspore. MindONE v0.3.0 is built on MindSpore 2.5 and optimized for Ascend NPUs, ensuring high-performance training for various generative models such as Open-Sora, CogVideoX, and Janus-Pro from DeepSeek.
Key Features
- Support Diffusers v0.32.2
MindONE now supports the following new pipelines for image and video generation, along with new training scripts:
- Video Generation Pipelines: CogVideoX, Latte, Mochi-1, Allegro, LTXVideo, HunyuanVideo, and more.
- Image Generation Pipelines: CogView3/4, Stable Diffusion 3.5, Flux, SANA, Lumina, Kolors, AuraFlow, and more.
- Training Scripts: CogVideoX SFT & LoRA, Flux SFT & LoRA & ControlNet, and SD3/3.5 SFT & LoRA.
For more details, visit the diffusers documentation.
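As a quick illustration, the snippet below sketches how one of the newly supported video pipelines could be driven through mindone.diffusers; the model id, frame count, and export_to_video helper follow the upstream diffusers CogVideoX example and should be treated as assumptions rather than a verified recipe.
import mindspore
from mindone.diffusers import CogVideoXPipeline
# assumption: export_to_video mirrors diffusers.utils.export_to_video
from mindone.diffusers.utils import export_to_video
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-2b",  # pretrained weights from the Hugging Face Hub
    mindspore_dtype=mindspore.float16,
)
prompt = "A panda playing a guitar in a bamboo forest"
# mindone.diffusers pipelines return tuples, as in the SDXL example further below
video = pipe(prompt=prompt, num_frames=49)[0][0]
export_to_video(video, "panda.mp4", fps=8)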
- Expanded Multi-Modal Generative Models
MindONE v0.3.0 adds various state-of-the-art generative models as examples, ensuring efficient training performance on Ascend NPUs, including:
task | model | inference | finetune | pretrain | institute |
---|---|---|---|---|---|
Image-to-Video | hunyuanvideo-i2v 🔥🔥 | ✅ | ✖️ | ✖️ | Tencent |
Text/Image-to-Video | wan2.1 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
Text-to-Image | cogview4 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Zhipuai |
Text-to-Video | step_video_t2v 🔥🔥 | ✅ | ✖️ | ✖️ | StepFun |
Image-Text-to-Text | qwen2_vl 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
Any-to-Any | janus 🔥🔥🔥 | ✅ | ✅ | ✅ | DeepSeek |
Any-to-Any | emu3 🔥🔥 | ✅ | ✅ | ✅ | BAAI |
Class-to-Image | var🔥🔥 | ✅ | ✅ | ✅ | ByteDance |
Text/Image-to-Video | hpcai open sora 2.0 🔥🔥 | ✅ | ✖️ | ✖️ | HPC-AI Tech |
Text/Image-to-Video | cogvideox 1.5 5B~30B 🔥🔥 | ✅ | ✅ | ✅ | Zhipu |
Text-to-Video | open sora plan 1.3🔥🔥 | ✅ | ✅ | ✅ | PKU |
Text-to-Video | hunyuanvideo🔥🔥 | ✅ | ✅ | ✅ | Tencent |
Text-to-Video | movie gen 30B🔥🔥 | ✅ | ✅ | ✅ | Meta |
Video-Encode-Decode | magvit | ✅ | ✅ | ✅ | |
Text-to-Image | story_diffusion | ✅ | ✖️ | ✖️ | ByteDance |
Image-to-Video | dynamicrafter | ✅ | ✖️ | ✖️ | Tencent |
Video-to-Video | venhancer | ✅ | ✖️ | ✖️ | Shanghai AI Lab |
Text-to-Video | t2v_turbo | ✅ | ✅ | ✅ | |
Text/Image-to-Video | video composer | ✅ | ✅ | ✅ | Alibaba |
Text-to-Image | flux 🔥 | ✅ | ✅ | ✖️ | Black Forest Lab |
Text-to-Image | stable diffusion 3 🔥 | ✅ | ✅ | ✖️ | Stability AI |
Text-to-Image | kohya_sd_scripts | ✅ | ✅ | ✖️ | kohya |
Text-to-Image | t2i-adapter | ✅ | ✅ | ✅ | Shanghai AI Lab |
Text-to-Image | ip adapter | ✅ | ✅ | ✅ | Tencent |
Text-to-3D | mvdream | ✅ | ✅ | ✅ | ByteDance |
Image-to-3D | instantmesh | ✅ | ✅ | ✅ | Tencent |
Image-to-3D | sv3d | ✅ | ✅ | ✅ | Stability AI |
Text/Image-to-3D | hunyuan3d-1.0 | ✅ | ✅ | ✅ | Tencent |
- Support Text-to-Video Data Curation
MindONE v0.3.0 adds a new pipeline for text-to-video filtering, which supports scene detection and video splitting, de-duplication, aesthetic/ocr/lpips/nsfw scoring, and video captioning.
For more details, visit the t2v curation documentation.
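To make the scoring stage concrete, here is a tiny, self-contained sketch of threshold-based filtering; the clip names, score fields, and thresholds are made-up illustrations and do not reflect the actual curation module's API or defaults.
# hypothetical scores produced by the aesthetic/ocr/nsfw scoring steps (values are illustrative)
clip_scores = {
    "clip_0001.mp4": {"aesthetic": 5.2, "ocr": 0.01, "nsfw": 0.00},
    "clip_0002.mp4": {"aesthetic": 3.1, "ocr": 0.40, "nsfw": 0.02},
}
# keep clips with a high aesthetic score, little on-screen text, and no NSFW content
kept = [
    name for name, s in clip_scores.items()
    if s["aesthetic"] >= 4.5 and s["ocr"] <= 0.05 and s["nsfw"] < 0.5
]
print(kept)  # ['clip_0001.mp4']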
MindONE 0.2.0
We are excited to announce the official release of MindONE, a state-of-the-art repository dedicated to multi-modal understanding and content generation. Built on MindSpore 2.3.1 and optimized for Ascend NPUs, MindONE provides a comprehensive suite of algorithms and models designed to facilitate advanced content generation across various modalities, including images, audio, videos, and even 3D objects.
Key Features
- diffusers support on MindSpore
We aim to provide an interface and usage fully consistent with huggingface/diffusers.
Only the changes necessary for MindSpore are made, so users coming from torch can switch seamlessly.
- import torch
+ import mindspore
- from diffusers import DiffusionPipeline
+ from mindone.diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
-   torch_dtype=torch.float16,
+   mindspore_dtype=mindspore.float16,
    use_safetensors=True,
)
prompt = "An astronaut riding a green horse"
images = pipe(prompt=prompt)[0][0]
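If the pipeline runs end to end, `images` holds a single PIL image, so it can be written to disk with the standard PIL API (the filename below is just an example):
images.save("astronaut_rides_horse.png")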
Important
Because huggingface/diffusers is still under active development,
many features are not yet well supported.
Currently, most functionality of huggingface/diffusers v0.29.x is supported.
For details, see MindOne Diffusers.
- MindSpore patch for transformers
This MindSpore patch for huggingface/transformers enables researchers and developers
working on text-to-image (t2i) and text-to-video (t2v) generation to use pretrained text and image models
from huggingface/transformers on MindSpore.
Only the Ascend-related modules are modified; all other modules reuse huggingface/transformers directly.
The following example shows how to download and use a pretrained model. Note that the model class comes from mindone.transformers, while everything else comes from huggingface/transformers.
from mindspore import Tensor
# use tokenizer from huggingface/Transformers
from transformers import AutoTokenizer
# use model from mindone.transformers
-from transformers import CLIPTextModel
+from mindone.transformers import CLIPTextModel
model = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = AutoTokenizer.from_pretrained("openai/clip-vit-base-patch32")
inputs = tokenizer(
    ["a photo of a cat", "a photo of a dog"],
    padding=True,
-   return_tensors="pt",
+   return_tensors="np",
)
-outputs = model(**inputs)
+outputs = model(Tensor(inputs.input_ids))
For details, see MindOne Transformers.
- State-of-the-Art generative models
MindONE showcases various state-of-the-art generative models as examples, ensuring efficient training performance on Ascend NPUs, including:
model | features |
---|---|
hpcai open sora | support v1.0/1.1/1.2 large scale training with dp/sp/zero |
open sora plan | support v1.0/1.1/1.2 large scale training with dp/sp/zero |
stable diffusion | support sd 1.5/2.0/2.1, vanilla fine tune, lora, dreambooth, textual inversion
stable diffusion xl | support sai style (Stability AI) vanilla fine tune, lora, dreambooth
dit | support text to image fine tune |
hunyuan_dit | support text to image fine tune |
pixart_sigma | support text to image fine tune at different aspect ratios
latte | support unconditional text to image fine tune
animate diff | support motion module and lora training |
dynamicrafter | support image to video generation |