This is the official open source release of STARFlow and STARFlow-V, state-of-the-art transformer autoregressive flow models for high-quality image and video generation.
STARFlow introduces a novel transformer autoregressive flow architecture that combines the expressiveness of autoregressive models with the efficiency of normalizing flows. The model achieves state-of-the-art results in both text-to-image and text-to-video generation tasks.
- STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis (NeurIPS 2025 Spotlight)
- STARFlow-V: End-to-End Video Generative Modeling with Autoregressive Normalizing Flows (arXiv TBD)
🎬 View Video Results Gallery - See examples of generated videos and comparisons (coming soon)
```bash
# Clone the repository
git clone https://github.com/apple/ml-starflow
cd ml-starflow

# Set up conda environment (recommended)
bash scripts/setup_conda.sh

# Or install dependencies manually
pip install -r requirements.txt
```

Important: You'll need to download the pretrained model checkpoints and place them in the `ckpts/` directory. For example:
- `ckpts/starflow_3B_t2i_256x256.pth` - For text-to-image generation
- `ckpts/starflow-v_7B_t2v_caus_480p_v3.pth` - For text-to-video generation
Model Release Timeline: Pretrained checkpoints will be released soon. Please check back or watch this repository for updates.
The checkpoint files are not included in this repository due to size constraints.
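Once the checkpoints are published, setting them up amounts to creating the directory and moving the files into place. A minimal sketch (the source paths are placeholders for wherever you saved the downloads):

```bash
# Create the checkpoint directory at the repository root
mkdir -p ckpts

# Move the downloaded checkpoints into place
# (source paths are placeholders; substitute your actual download location)
mv ~/Downloads/starflow_3B_t2i_256x256.pth ckpts/
mv ~/Downloads/starflow-v_7B_t2v_caus_480p_v3.pth ckpts/
```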
Generate high-quality images from text prompts:
```bash
# Basic image generation (256x256)
bash scripts/test_sample_image.sh "a film still of a cat playing piano"

# Custom prompt and settings
torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "your custom prompt here" \
    --sample_batch_size 8 \
    --cfg 3.6 \
    --aspect_ratio "1:1" \
    --seed 999
```
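To sample several prompts in one go, a simple shell loop over the bundled script works. A minimal sketch; the prompts are arbitrary examples:

```bash
# Generate one batch of images per prompt using the bundled sampling script
for prompt in \
    "a film still of a cat playing piano" \
    "a watercolor painting of a lighthouse" \
    "a macro photo of a dew-covered leaf"
do
    bash scripts/test_sample_image.sh "$prompt"
done
```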
Generate videos from text descriptions:

```bash
# Basic video generation (480p, ~5 seconds)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera"

# With custom input image for TI2V video generation
bash scripts/test_sample_video.sh "a cat playing piano" "/path/to/input/image.jpg"

# Longer video generation (specify target length in frames)
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 241  # ~15 seconds at 16 fps
bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" 481  # ~30 seconds at 16 fps

# Advanced video generation
torchrun --standalone --nproc_per_node 8 sample.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --checkpoint_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --caption "your video prompt here" \
    --sample_batch_size 1 \
    --cfg 3.5 \
    --aspect_ratio "16:9" \
    --out_fps 16 \
    --jacobi 1 --jacobi_th 0.001 \
    --target_length 161  # Customize video length
```

Train your own STARFlow model for text-to-image generation:
```bash
# Quick training test
bash scripts/test_train_image.sh 10 16

# Full training with custom parameters
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --epochs 100 \
    --batch_size 1024 \
    --wandb_name "my_starflow_training"
```

Train STARFlow-V for text-to-video generation:
```bash
# Quick training test
bash scripts/test_train_video.sh 10 8

# Resume training from checkpoint
torchrun --standalone --nproc_per_node 8 train.py \
    --model_config_path "configs/starflow-v_7B_t2v_caus_480p.yaml" \
    --resume_path "ckpts/starflow-v_7B_t2v_caus_480p_v3.pth" \
    --epochs 100 \
    --batch_size 192
```

Extract individual frames from multi-video grids:
```bash
# Extract frames from a video containing multiple video grids
python scripts/extract_image_from_video.py --input_video path/to/video.mp4 --output_dir output/

# Extract images with custom settings
python scripts/extract_images.py input_file.mp4
```

STARFlow 3B (text-to-image):

- Resolution: 256×256
- Architecture: 6-block deep-shallow architecture
- Text Encoder: T5-XL
- VAE: SD-VAE
- Features: RoPE positional encoding, mixed precision training
STARFlow-V 7B (text-to-video):

- Resolution: up to 640×480 (480p)
- Temporal: 81 frames (~5 seconds at 16 FPS)
- Architecture: 6-block deep-shallow architecture (full sequence)
- Text Encoder: T5-XL
- VAE: WAN2.2-VAE
- Features: Causal attention, autoregressive generation, variable length support
Key features:

- Autoregressive Flow Architecture: Novel combination of autoregressive models and normalizing flows
- High-Quality Generation: FID scores and visual quality competitive with state-of-the-art diffusion models
- Flexible Resolution: Support for various aspect ratios and resolutions
- Efficient Training: FSDP support for large-scale distributed training
- Fast Sampling: Block-wise Jacobi iteration for accelerated inference
- Text Conditioning: Advanced text-to-image/video capabilities
- Video Generation: Temporal consistency and smooth motion
Image model configuration (`configs/starflow_3B_t2i_256x256.yaml`):

- `img_size: 256` - Output image resolution
- `txt_size: 128` - Text sequence length
- `channels: 3072` - Model hidden dimension
- `cfg: 3.6` - Classifier-free guidance scale
- `noise_std: 0.3` - Flow noise standard deviation
Video model configuration (`configs/starflow-v_7B_t2v_caus_480p.yaml`):

- `img_size: 640` - Video frame resolution
- `vid_size: '81:16'` - Temporal dimensions (frames:downsampling)
- `fps_cond: 1` - FPS conditioning enabled
- `temporal_causal: 1` - Causal temporal attention
Key sampling flags (see the example after this list):

- `--cfg` - Classifier-free guidance scale (higher = more prompt adherence)
- `--jacobi` - Enable Jacobi iteration for faster sampling
- `--jacobi_th` - Jacobi convergence threshold
- `--jacobi_block_size` - Block size for Jacobi iteration
- `--aspect_ratio` - Output aspect ratio ("1:1", "16:9", "4:3", etc.)
- `--seed` - Random seed for reproducible generation
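As a concrete example, these flags can be combined to trade sampling speed against fidelity. The caption and values below are illustrative assumptions, not tuned recommendations:

```bash
torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "a watercolor painting of a lighthouse" \
    --cfg 4.5 \
    --jacobi 1 --jacobi_th 0.001 --jacobi_block_size 16 \
    --aspect_ratio "16:9" \
    --seed 42
```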
Project structure:

```
├── train.py                      # Main training script
├── sample.py                     # Sampling and inference
├── transformer_flow.py           # Core model implementation
├── dataset.py                    # Dataset loading and preprocessing
├── finetune_decoder.py           # Decoder fine-tuning script
├── utils/                        # Utility modules
│   ├── common.py                 # Core utility functions
│   ├── model_setup.py            # Model configuration and setup
│   ├── training.py               # Training utilities and metrics
│   └── inference.py              # Evaluation and metrics
├── configs/                      # Model configuration files
│   ├── starflow_3B_t2i_256x256.yaml
│   └── starflow-v_7B_t2v_caus_480p.yaml
├── scripts/                      # Example training and sampling scripts
│   ├── test_sample_image.sh
│   ├── test_sample_video.sh
│   ├── test_train_image.sh
│   ├── test_train_video.sh
│   ├── setup_conda.sh
│   ├── extract_images.py
│   └── extract_image_from_video.py
└── misc/                         # Additional utilities
    ├── pe.py                     # Positional encodings
    ├── lpips.py                  # LPIPS loss
    └── wan_vae2.py               # Video VAE implementation
```
Image generation tips:

- Use guidance scales between 2.0 and 5.0 for balanced quality and diversity
- Experiment with different aspect ratios for your use case
- Enable Jacobi iteration (`--jacobi 1`) for faster sampling
- Use higher-resolution models for detailed outputs
- The default script uses optimized settings: `--jacobi_th 0.001` and `--jacobi_block_size 16` (see the sketch after this list)
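To see what Jacobi iteration buys you on your hardware, time the same prompt with and without it. A minimal sketch, assuming `--jacobi 0` disables the acceleration:

```bash
# Sequential sampling (assumption: --jacobi 0 disables Jacobi iteration)
time torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "a film still of a cat playing piano" \
    --jacobi 0

# Block-wise Jacobi iteration with the default script's settings
time torchrun --standalone --nproc_per_node 1 sample.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --checkpoint_path "ckpts/starflow_3B_t2i_256x256.pth" \
    --caption "a film still of a cat playing piano" \
    --jacobi 1 --jacobi_th 0.001 --jacobi_block_size 16
```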
Video generation tips:

- Start with shorter sequences (81 frames) and gradually increase length (161, 241, 481+ frames), as shown in the loop after this list
- Use input images (`--input_image`) for more controlled generation
- Adjust FPS settings based on content type (8-24 FPS)
- Consider temporal consistency when crafting prompts
- The default script uses `--jacobi_block_size 64`
- Longer videos: use `--target_length` to generate videos beyond the training length (requires `--jacobi 1`)
- Frame reference: 81 frames ≈ 5s, 161 frames ≈ 10s, 241 frames ≈ 15s, 481 frames ≈ 30s (at 16 fps)
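One way to follow the ramp-up advice above is to loop the provided test script over increasing frame counts. A minimal sketch; the argument order follows the earlier examples, with "none" skipping the input image:

```bash
# Same prompt at ~5s, ~10s, and ~15s (at 16 fps)
for frames in 81 161 241; do
    bash scripts/test_sample_video.sh "a corgi dog looks at the camera" "none" "$frames"
done
```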
Training tips:

- Use FSDP for efficient large-model training
- Start with smaller batch sizes and scale up
- Monitor loss curves and adjust learning rates accordingly
- Use gradient checkpointing to reduce memory usage
- The test scripts include `--dry_run 1` for validation (see the sketch after this list)
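Before committing to a long run, a dry-run pass can validate the data pipeline and model setup. A minimal sketch, assuming `train.py` accepts the same `--dry_run` flag that the test scripts pass:

```bash
# Validate config parsing, data loading, and model setup without a full run
# (assumption: train.py accepts --dry_run, as the test scripts suggest)
torchrun --standalone --nproc_per_node 1 train.py \
    --model_config_path "configs/starflow_3B_t2i_256x256.yaml" \
    --batch_size 16 \
    --dry_run 1
```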
If you use STARFlow in your research, please cite:
```bibtex
@article{gu2025starflow,
  title={STARFlow: Scaling Latent Normalizing Flows for High-resolution Image Synthesis},
  author={Gu, Jiatao and Chen, Tianrong and Berthelot, David and Zheng, Huangjie and Wang, Yuyang and Zhang, Ruixiang and Dinh, Laurent and Bautista, Miguel Angel and Susskind, Josh and Zhai, Shuangfei},
  journal={NeurIPS},
  year={2025}
}
```

LICENSE: Please check the repository LICENSE before using the provided code, and LICENSE_MODEL before using the released models.
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
