We are thrilled to announce the release of MindONE 0.3.0, featuring more state-of-the-art multi-modal understanding and generative models and better compatibility with transformers and diffusers. MindONE now supports the latest features in diffusers v0.32.2, including over 160 pipelines, 50 models, and 35 schedulers. It allows users to easily develop new image/video/audio generation models or transfer existing models from PyTorch to MindSpore. MindONE 0.3.0 is built on MindSpore 2.5 and optimized for Ascend NPUs, ensuring high-performance training for various generative models such as Open-Sora, CogVideoX, and Janus-Pro from DeepSeek.
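Because mindone.diffusers mirrors the diffusers API, porting a pipeline is usually just a matter of swapping the import. Below is a minimal sketch, assuming the SDXL weights are available on the Hugging Face Hub and that `mindspore_dtype` plays the role of diffusers' `torch_dtype`:

```python
# Minimal sketch: Stable Diffusion XL inference on Ascend via mindone.diffusers.
# Assumes mindone.diffusers mirrors the diffusers API, with `mindspore_dtype`
# standing in for diffusers' `torch_dtype`.
import mindspore as ms
from mindone.diffusers import StableDiffusionXLPipeline  # was: from diffusers import ...

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    mindspore_dtype=ms.float16,
)
# Pipelines return tuples; the first field holds the generated images.
image = pipe("An astronaut riding a green horse")[0][0]
image.save("astronaut.png")
```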
Key Features
- Support Diffusers v0.32.2
MindONE now supports the following new pipelines for image and video generation, along with new training scripts (see the sketch after this list):

- Video generation pipelines: CogVideoX, Latte, Mochi-1, Allegro, LTXVideo, HunyuanVideo, and more.
- Image generation pipelines: CogView3/4, Stable Diffusion 3.5, Flux, SANA, Lumina, Kolors, AuraFlow, and more.
- Training scripts: CogVideoX SFT & LoRA, Flux SFT & LoRA & ControlNet, and SD3/3.5 SFT & LoRA.
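For instance, one of the new video pipelines can be run the same way as an image pipeline. The following is a sketch under the assumption that `CogVideoXPipeline` and `export_to_video` are exposed under `mindone.diffusers` exactly as in diffusers:

```python
# Sketch: text-to-video with CogVideoX via mindone.diffusers (assumed to mirror
# the diffusers CogVideoXPipeline and export_to_video utilities).
import mindspore as ms
from mindone.diffusers import CogVideoXPipeline
from mindone.diffusers.utils import export_to_video

pipe = CogVideoXPipeline.from_pretrained("THUDM/CogVideoX-2b", mindspore_dtype=ms.float16)
# Pipelines return tuples; the first field holds the generated video frames.
frames = pipe(
    prompt="a panda playing guitar in a bamboo forest",
    num_inference_steps=50,
    num_frames=49,
)[0][0]
export_to_video(frames, "panda.mp4", fps=8)
```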
For more details, visit the diffusers documentation.
- Expanded Multi-Modal Generative Models
MindONE v0.3.0 adds a range of state-of-the-art generative models as runnable examples, with efficient training performance on Ascend NPUs (an inference sketch for one of them follows the table):
| Task | Model | Inference | Finetune | Pretrain | Institute |
|---|---|---|---|---|---|
| Image-to-Video | hunyuanvideo-i2v 🔥🔥 | ✅ | ✖️ | ✖️ | Tencent |
| Text/Image-to-Video | wan2.1 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
| Text-to-Image | cogview4 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Zhipu AI |
| Text-to-Video | step_video_t2v 🔥🔥 | ✅ | ✖️ | ✖️ | StepFun |
| Image-Text-to-Text | qwen2_vl 🔥🔥🔥 | ✅ | ✖️ | ✖️ | Alibaba |
| Any-to-Any | janus 🔥🔥🔥 | ✅ | ✅ | ✅ | DeepSeek |
| Any-to-Any | emu3 🔥🔥 | ✅ | ✅ | ✅ | BAAI |
| Class-to-Image | var 🔥🔥 | ✅ | ✅ | ✅ | ByteDance |
| Text/Image-to-Video | hpcai open sora 2.0 🔥🔥 | ✅ | ✖️ | ✖️ | HPC-AI Tech |
| Text/Image-to-Video | cogvideox 1.5 5B~30B 🔥🔥 | ✅ | ✅ | ✅ | Zhipu AI |
| Text-to-Video | open sora plan 1.3 🔥🔥 | ✅ | ✅ | ✅ | PKU |
| Text-to-Video | hunyuanvideo 🔥🔥 | ✅ | ✅ | ✅ | Tencent |
| Text-to-Video | movie gen 30B 🔥🔥 | ✅ | ✅ | ✅ | Meta |
| Video-Encode-Decode | magvit | ✅ | ✅ | ✅ | |
| Text-to-Image | story_diffusion | ✅ | ✖️ | ✖️ | ByteDance |
| Image-to-Video | dynamicrafter | ✅ | ✖️ | ✖️ | Tencent |
| Video-to-Video | venhancer | ✅ | ✖️ | ✖️ | Shanghai AI Lab |
| Text-to-Video | t2v_turbo | ✅ | ✅ | ✅ | |
| Text/Image-to-Video | video composer | ✅ | ✅ | ✅ | Alibaba |
| Text-to-Image | flux 🔥 | ✅ | ✅ | ✖️ | Black Forest Labs |
| Text-to-Image | stable diffusion 3 🔥 | ✅ | ✅ | ✖️ | Stability AI |
| Text-to-Image | kohya_sd_scripts | ✅ | ✅ | ✖️ | kohya |
| Text-to-Image | t2i-adapter | ✅ | ✅ | ✅ | Shanghai AI Lab |
| Text-to-Image | ip adapter | ✅ | ✅ | ✅ | Tencent |
| Text-to-3D | mvdream | ✅ | ✅ | ✅ | ByteDance |
| Image-to-3D | instantmesh | ✅ | ✅ | ✅ | Tencent |
| Image-to-3D | sv3d | ✅ | ✅ | ✅ | Stability AI |
| Text/Image-to-3D | hunyuan3d-1.0 | ✅ | ✅ | ✅ | Tencent |
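As an illustration of the multi-modal understanding side, the qwen2_vl example above can be driven through mindone.transformers. This is a hedged sketch, assuming mindone.transformers mirrors the Hugging Face Qwen2-VL classes and that the processor's NumPy outputs are converted to MindSpore tensors by hand; the conventions here are assumptions, not the exact example code shipped in the repo:

```python
# Hedged sketch: Qwen2-VL image-to-text inference on Ascend. Assumes
# mindone.transformers mirrors the Hugging Face Qwen2-VL classes and that
# `mindspore_dtype` plays the role of transformers' `torch_dtype`.
import mindspore as ms
from PIL import Image
from transformers import AutoProcessor  # the processor comes from vanilla transformers
from mindone.transformers import Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(model_id, mindspore_dtype=ms.float16)
processor = AutoProcessor.from_pretrained(model_id)

# Build a single-image chat prompt; apply_chat_template inserts the image placeholder.
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image."},
]}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[Image.open("cat.png")], return_tensors="np")

# Assumption: NumPy processor outputs are converted to MindSpore tensors manually.
inputs = {k: ms.Tensor(v) for k, v in inputs.items()}
out = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```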
- Support Text-to-Video Data Curation
MindONE v0.3.0 adds a new data curation pipeline for text-to-video training, which supports scene detection and video splitting, de-duplication, aesthetic/OCR/LPIPS/NSFW scoring, and video captioning (the stage flow is sketched below).
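To make the stage order concrete, here is a purely illustrative sketch of how such a curation pipeline chains together; every function name and threshold below is a hypothetical placeholder, not the actual MindONE API:

```python
# Purely illustrative: the stage flow of a text-to-video curation pipeline.
# All function names are hypothetical placeholders; thresholds are arbitrary examples.
def curate(raw_videos):
    clips = []
    for video in raw_videos:
        scenes = detect_scenes(video)              # scene detection
        clips.extend(split_video(video, scenes))   # video splitting
    clips = deduplicate(clips)                     # near-duplicate removal
    kept = [
        c for c in clips
        if aesthetic_score(c) > 4.5                # keep visually pleasing clips
        and ocr_text_ratio(c) < 0.1                # drop text-heavy clips
        and lpips_motion_score(c) > 0.2            # drop (near-)static clips
        and not is_nsfw(c)                         # safety filter
    ]
    return [(c, caption(c)) for c in kept]         # caption clips into T2V training pairs
```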
For more details, visit the t2v curation documentation.