
Releases: huggingface/diffusers

Patch Release: Fix community vs. hub pipelines revision

06 Nov 14:47
  • [Custom Pipelines] Make sure that community pipelines can use repo revision by @patrickvonplaten

v0.22.0: LCM, PixArt-Alpha, AnimateDiff, PEFT integration for LoRA, and more

06 Nov 13:03

Latent Consistency Models (LCM)


LCMs enable a significantly faster inference process for diffusion models. They require far fewer inference steps to produce high-resolution images without compromising image quality too much. Below is a usage example:

import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained("SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32)

# To save GPU memory, torch.float16 can be used, but it may compromise image quality.
pipe.to(torch_device="cuda", torch_dtype=torch.float32)

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"

# Can be set to 1~50 steps. LCMs support fast inference even with <= 4 steps. Recommended: 1~8 steps.
num_inference_steps = 4 

images = pipe(prompt=prompt, num_inference_steps=num_inference_steps, guidance_scale=8.0).images

Refer to the documentation to learn more.

LCM comes with both text-to-image and image-to-image pipelines and they were contributed by @luosiallen, @nagolinc, and @dg845.
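
The image-to-image workflow is analogous. Below is a minimal sketch assuming the LatentConsistencyModelImg2ImgPipeline class with its image and strength arguments; replace the init-image path with your own image:

import torch
from diffusers import LatentConsistencyModelImg2ImgPipeline
from diffusers.utils import load_image

pipe = LatentConsistencyModelImg2ImgPipeline.from_pretrained(
    "SimianLuo/LCM_Dreamshaper_v7", torch_dtype=torch.float32
)
pipe.to("cuda")

# Placeholder: any local image path or URL works here
init_image = load_image("path/to/init_image.png")

prompt = "Self-portrait oil painting, a beautiful cyborg with golden hair, 8k"

# Very few steps are needed; strength controls how much of the init image is preserved
image = pipe(
    prompt=prompt,
    image=init_image,
    num_inference_steps=4,
    strength=0.5,
    guidance_scale=8.0,
).images[0]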

PixArt-Alpha


PixArt-Alpha is a Transformer-based text-to-image diffusion model that rivals the quality of the existing state-of-the-art ones, such as Stable Diffusion XL, Imagen, and DALL-E 2, while being more efficient.

It was trained with T5 text embeddings and supports a maximum sequence length of 120 tokens, which allows for more detailed prompt inputs and unlocks better-quality generations.

Despite the large text encoder, with model offloading it takes a little under 11 GB of VRAM to run the PixArtAlphaPipeline:

from diffusers import PixArtAlphaPipeline
import torch 

pipeline_id = "PixArt-alpha/PixArt-XL-2-1024-MS"
pipeline = PixArtAlphaPipeline.from_pretrained(pipeline_id, torch_dtype=torch.float16)
pipeline.enable_model_cpu_offload()

prompt = "A small cactus with a happy face in the Sahara desert."
image = pipeline(prompt).images[0]
image.save("sahara.png")

Check out the docs to learn more.

AnimateDiff


AnimateDiff is a modelling framework that allows you to create videos using pre-existing Stable Diffusion text-to-image models. It achieves this by inserting motion module layers into a frozen text-to-image model and training it on video clips to extract a motion prior.

These motion modules are applied after the ResNet and Attention blocks in the Stable Diffusion UNet. Their purpose is to introduce coherent motion across image frames. To support these modules, we introduce the concepts of a MotionAdapter and a UNetMotionModel. These serve as a convenient way to use these motion modules with existing Stable Diffusion models.

The following example demonstrates how you can utilize the motion modules with an existing Stable Diffusion text-to-image model.

import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

You can convert an existing 2D UNet into a UNetMotionModel:

from diffusers import MotionAdapter, UNetMotionModel, UNet2DConditionModel

unet = UNetMotionModel()

# Load from an existing 2D UNet and MotionAdapter
unet2D = UNet2DConditionModel.from_pretrained("SG161222/Realistic_Vision_V5.1_noVAE", subfolder="unet")
motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# Create a UNetMotionModel from the 2D UNet and the motion adapter
unet_motion = UNetMotionModel.from_unet2d(unet2D, motion_adapter=motion_adapter)

# Or load motion modules after init
unet_motion.load_motion_modules(motion_adapter)

# freeze all 2D UNet layers except for the motion modules for finetuning
unet_motion.freeze_unet2d_params()

# Save only motion modules
unet_motion.save_motion_modules("<path to save model>", push_to_hub=True)

AnimateDiff also comes with motion LoRA modules, letting you control subtleties:

import torch
from diffusers import MotionAdapter, AnimateDiffPipeline, DDIMScheduler
from diffusers.utils import export_to_gif

# Load the motion adapter
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")
# load SD 1.5 based finetuned model
model_id = "SG161222/Realistic_Vision_V5.1_noVAE"
pipe = AnimateDiffPipeline.from_pretrained(model_id, motion_adapter=adapter)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-zoom-out", adapter_name="zoom-out")

scheduler = DDIMScheduler.from_pretrained(
    model_id, subfolder="scheduler", clip_sample=False, timestep_spacing="linspace", steps_offset=1
)
pipe.scheduler = scheduler

# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_model_cpu_offload()

output = pipe(
    prompt=(
        "masterpiece, bestquality, highlydetailed, ultradetailed, sunset, "
        "orange sky, warm lighting, fishing boats, ocean waves seagulls, "
        "rippling water, wharf, silhouette, serene atmosphere, dusk, evening glow, "
        "golden hour, coastal landscape, seaside scenery"
    ),
    negative_prompt="bad quality, worse quality",
    num_frames=16,
    guidance_scale=7.5,
    num_inference_steps=25,
    generator=torch.Generator("cpu").manual_seed(42),
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")

(Example output: AnimateDiff combined with the zoom-out motion LoRA.)

Check out the documentation to learn more.

PEFT 🤝 Diffusers

There are many adapters (LoRA, for example) trained in different styles to achieve different effects. You can even combine multiple adapters to create new and unique images. With the 🤗 PEFT integration in 🤗 Diffusers, it is really easy to load and manage adapters for inference.

Here is an example of combining multiple LoRAs using this new integration:

from diffusers import DiffusionPipeline
import torch

pipe_id = "stabilityai/stable-diffusion-xl-base-1.0"
pipe = DiffusionPipeline.from_pretrained(pipe_id, torch_dtype=torch.float16).to("cuda")

# Load LoRA 1.
pipe.load_lora_weights("CiroN2022/toy-face", weight_name="toy_face_sdxl.safetensors", adapter_name="toy")
# Load LoRA 2.
pipe.load_lora_weights("nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", adapter_name="pixel")

# Combine the adapters.
pipe.set_adapters(["pixel", "toy"], adapter_weights=[0.5, 1.0])

# Perform inference.
prompt = "toy_face of a hacker with a hoodie, pixel art"
image = pipe(
    prompt, num_inference_steps=30, cross_attention_kwargs={"scale": 1.0}, generator=torch.manual_seed(0)
).images[0]
image
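
You can also switch to a single adapter or disable LoRA entirely without reloading the pipeline. Here is a brief sketch continuing from the snippet above; disable_lora() is assumed to be available per the PEFT integration docs:

# Use only the "toy" adapter for the next generation
pipe.set_adapters("toy")
image = pipe(
    "toy_face of a hacker with a hoodie", num_inference_steps=30, generator=torch.manual_seed(0)
).images[0]

# Turn all LoRA adapters off to get the base model behavior back
pipe.disable_lora()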


Refer to the documentation to learn more.

Community components with community pipelines

We have had support for community pipelines for a while now. This enables fast integration for pipelines we cannot directly integrate within the core codebase of the library. However, community pipelines always rely on the building blocks from Diffusers, which can be restrictive for advanced use cases.

To address this, starting with this release we're elevating community pipelines with community components 🤗 By specifying trust_remote_code=True and writing the pipeline repository in a specific way, users can customize their pipeline and component code as flexibly as possible:

from diffusers import DiffusionPipeline
import torch

pipeline = DiffusionPipeline.from_pretrained(
    "<change-username>/<change-id>", trust_remote_code=True, torch_dtype=torch.float16
).to("cuda")

prompt = "hello"

# Text embeds
prompt_embeds, negative_embeds = pipeline.encode_prompt(prompt)

# Keyframes generation (8x64x40, 2fps)
video_frames = pipeline(
    prompt_embeds=prompt_embeds,
    negative_prompt_embeds=negative_embeds,
    num_frames=8,
    height=40,
    width=64,
    num_inference_steps=2,
    guidance_scale=9.0,
    output_type="pt"
).frames

Refer to the [documentation](https://huggingface.co/docs/diffusers/main/en/using-diffusers/custom_pipeline_overview#commu...


Patch Release: Fix Lora fusing/unfusing

29 Sep 15:52

Patch Release: Fix LoRA attention processor for xformers.

27 Sep 16:49

Patch Release: CPU offloading + Lora load/Text inv load & Multi Adapter

18 Sep 22:14

Patch Release v0.21.1: Fix import and config loading for `from_single_file`

14 Sep 11:19
  • Fix model offload bug when key isn't present by @DN6 in #5030
  • [Import] Don't force transformers to be installed by @patrickvonplaten in #5035
  • allow loading of sd models from safetensors without online lookups using local config files by @vladmandic in #5019
  • [Import] Add missing settings / Correct some dummy imports by @patrickvonplaten in #5036

v0.21.0: Würstchen, Faster LoRA loading, Faster imports, T2I Adapters for SDXL, and more

13 Sep 15:48

Würstchen

Würstchen is a diffusion model whose text-conditional component works in a highly compressed latent space of images, allowing cheaper and faster inference.

Here is how to use Würstchen as a pipeline:

import torch
from diffusers import AutoPipelineForText2Image
from diffusers.pipelines.wuerstchen import DEFAULT_STAGE_C_TIMESTEPS

pipeline = AutoPipelineForText2Image.from_pretrained("warp-ai/wuerstchen", torch_dtype=torch.float16).to("cuda")

caption = "Anthropomorphic cat dressed as a firefighter"
images = pipeline(
	caption,
	height=1024,
	width=1536,
	prior_timesteps=DEFAULT_STAGE_C_TIMESTEPS,
	prior_guidance_scale=4.0,
	num_images_per_prompt=4,
).images

To learn more about the pipeline, check out the official documentation.

This pipeline was contributed by one of the authors of Würstchen, @dome272, with help from @kashif and @patrickvonplaten.

👉 Try out the model here: https://huggingface.co/spaces/warp-ai/Wuerstchen

T2I Adapters for Stable Diffusion XL (SDXL)

T2I-Adapter is an efficient plug-and-play model that provides extra guidance to pre-trained text-to-image models while freezing the original large text-to-image models.

In collaboration with the Tencent ARC researchers, we trained T2I Adapters on various conditions: sketch, canny, lineart, depth, and openpose.

Below is how to use the StableDiffusionXLAdapterPipeline.

First, ensure that controlnet_aux is installed:

pip install -U controlnet_aux==0.0.7

Then we can initialize the pipeline:

import torch
from controlnet_aux.lineart import LineartDetector
from diffusers import (AutoencoderKL, EulerAncestralDiscreteScheduler,
                       StableDiffusionXLAdapterPipeline, T2IAdapter)
from diffusers.utils import load_image, make_image_grid

# load adapter
adapter = T2IAdapter.from_pretrained(
    "TencentARC/t2i-adapter-lineart-sdxl-1.0", torch_dtype=torch.float16, varient="fp16"
).to("cuda")

# load pipeline
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
euler_a = EulerAncestralDiscreteScheduler.from_pretrained(
    model_id, subfolder="scheduler"
)
vae = AutoencoderKL.from_pretrained(
    "madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16
)
pipe = StableDiffusionXLAdapterPipeline.from_pretrained(
    model_id,
    vae=vae,
    adapter=adapter,
    scheduler=euler_a,
    torch_dtype=torch.float16,
    variant="fp16",
).to("cuda")

# load lineart detector
line_detector = LineartDetector.from_pretrained("lllyasviel/Annotators").to("cuda")

We then load an image and compute the lineart conditioning:

url = "https://huggingface.co/Adapter/t2iadapter/resolve/main/figs_SDXLV1.0/org_lin.jpg"
image = load_image(url)
image = line_detector(image, detect_resolution=384, image_resolution=1024)

Then we generate:

prompt = "Ice dragon roar, 4k photo"
negative_prompt = "anime, cartoon, graphic, text, painting, crayon, graphite, abstract, glitch, deformed, mutated, ugly, disfigured"
gen_images = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    image=image,
    num_inference_steps=30,
    adapter_conditioning_scale=0.8,
    guidance_scale=7.5,
).images[0]

Refer to the official documentation to learn more about StableDiffusionXLAdapterPipeline.

This blog post summarizes our experiences and provides all the resources (including the pre-trained T2I Adapter checkpoints) to get started using T2I Adapters for SDXL.

We’re also releasing a training script for training your custom T2I Adapters on SDXL. Check out the documentation to learn more.

Thanks to @MC-E (one of the authors of T2I Adapters) for contributing the StableDiffusionXLAdapterPipeline in #4696.

Faster imports

We introduced “lazy imports” (#4829) to significantly improve the time it takes to import our modules (such as pipelines, models, and so on). Below is a comparison of the timings with and without lazy imports on import diffusers.

With lazy imports:

real    0m0.417s
user    0m0.714s
sys     0m0.499s

Without lazy imports:

real    0m5.391s
user    0m5.299s
sys     0m1.273s
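
The speed-up comes from deferring submodule imports until an attribute is actually accessed. The snippet below is a simplified sketch of that idea, not the library's exact implementation:

import importlib
import types


class _LazyModule(types.ModuleType):
    """Simplified lazy module: submodules are only imported on first attribute access."""

    def __init__(self, name, import_structure):
        super().__init__(name)
        # Maps an attribute name to the submodule that defines it,
        # e.g. {"DiffusionPipeline": "pipelines", "UNet2DConditionModel": "models"}
        self._import_structure = import_structure
        self.__all__ = list(import_structure.keys())

    def __getattr__(self, attr):
        if attr not in self._import_structure:
            raise AttributeError(f"module {self.__name__!r} has no attribute {attr!r}")
        submodule_name = self._import_structure[attr]
        submodule = importlib.import_module(f"{self.__name__}.{submodule_name}")
        value = getattr(submodule, attr)
        setattr(self, attr, value)  # cache so later accesses skip the import machinery
        return value

# A package's __init__.py would then replace itself with such a wrapper,
# so "import diffusers" only pays for the submodules it actually touches.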

Faster LoRA loading

Previously, loading LoRA parameters with load_lora_weights() could be time-consuming, as reported in #4975. To address this, we introduced a low_cpu_mem_usage argument to the load_lora_weights() method in #4994, which speeds up loading significantly. Just pass low_cpu_mem_usage=True to benefit.
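
For example, a minimal sketch reusing checkpoint names that appear elsewhere in these notes:

from diffusers import DiffusionPipeline
import torch

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# low_cpu_mem_usage=True skips unnecessary parameter initialization while loading the LoRA
pipe.load_lora_weights(
    "nerijs/pixel-art-xl", weight_name="pixel-art-xl.safetensors", low_cpu_mem_usage=True
)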

LoRA fusing

LoRA weights can now be fused into the model weights, allowing models with loaded LoRA weights to run as fast as models without them. It also makes it possible to fuse multiple LoRAs into the same model.
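
A minimal sketch of the workflow, assuming a pipeline object that already has LoRA weights loaded (for instance via load_lora_weights() as shown above):

# Merge the loaded LoRA weights into the underlying model weights ...
pipe.fuse_lora()

# ... run inference at the speed of the base model ...
image = pipe("pixel art, a cute corgi", num_inference_steps=30).images[0]

# ... then undo the fusion to restore the original weights if needed
pipe.unfuse_lora()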

For more information, have a look at the documentation and the original PR: #4473.

More support for LoRAs

Almost all LoRA formats out there for SDXL are now supported. For more details, please check the documentation.

All commits


Patch Release 0.20.2 - Correct SDXL Inpaint Strength Default

31 Aug 18:43

The strength default of the Stable Diffusion XL inpainting pipeline was accidentally set to 1.0 when the pipeline was created; it should be 0.9999 instead. This patch release fixes that.

All commits

Patch Release: Fix `torch.compile()` support for ControlNets

28 Aug 04:47

3eb498e#r125606630 introduced a 🐛 that broke the torch.compile() support for ControlNets. This patch release fixes that.

All commits

v0.20.0: SDXL ControlNets with MultiControlNet, GLIGEN, Tiny Autoencoder, SDXL DreamBooth LoRA in free-tier Colab, and more

17 Aug 08:46

SDXL ControlNets 🚀

The 🧨 diffusers team has trained two ControlNets on Stable Diffusion XL (SDXL):


You can find all the SDXL ControlNet checkpoints here, including some smaller ones (5 to 7x smaller).

To know more about how to use these ControlNets to perform inference, check out the respective model cards and the documentation. To train custom SDXL ControlNets, you can try out our training script.
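
As a quick reference, here is a minimal inference sketch; the canny checkpoint name and the conditioning-image path are assumptions for illustration, so adapt them to the checkpoint and edge map you actually use:

import torch
from diffusers import AutoencoderKL, ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Checkpoint name assumed from the SDXL ControlNet collection
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
vae = AutoencoderKL.from_pretrained("madebyollin/sdxl-vae-fp16-fix", torch_dtype=torch.float16)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to("cuda")

# Placeholder: a precomputed canny edge map of the structure you want to preserve
canny_image = load_image("path/to/canny_edge_map.png")

image = pipe(
    "aerial view of a futuristic city at sunset",
    image=canny_image,
    controlnet_conditioning_scale=0.5,
    num_inference_steps=30,
).images[0]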

MultiControlNet for SDXL

This release also introduces support for combining multiple ControlNets trained on SDXL and performing inference with them. Refer to the documentation to learn more.
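
Concretely, you pass a list of ControlNets and, at call time, one conditioning image and one scale per ControlNet. The checkpoint names below are assumptions for illustration:

import torch
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

controlnets = [
    ControlNetModel.from_pretrained("diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16),
    ControlNetModel.from_pretrained("diffusers/controlnet-depth-sdxl-1.0", torch_dtype=torch.float16),
]
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", controlnet=controlnets, torch_dtype=torch.float16
).to("cuda")

# At inference time, supply one conditioning image and one scale per ControlNet, e.g.:
# image = pipe(prompt, image=[canny_image, depth_image],
#              controlnet_conditioning_scale=[0.5, 0.5]).images[0]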

GLIGEN

The GLIGEN model was developed by researchers and engineers from University of Wisconsin-Madison, Columbia University, and Microsoft. The StableDiffusionGLIGENPipeline can generate photorealistic images conditioned on grounding inputs. Along with text and bounding boxes, if input images are given, this pipeline can insert objects described by text at the region defined by bounding boxes. Otherwise, it’ll generate an image described by the caption/prompt and insert objects described by text at the region defined by bounding boxes. It’s trained on COCO2014D and COCO2014CD datasets, and the model uses a frozen CLIP ViT-L/14 text encoder to condition itself on grounding inputs.


(GIF from the official website)

Grounded inpainting

import torch
from diffusers import StableDiffusionGLIGENPipeline
from diffusers.utils import load_image

# Insert objects described by text at the region defined by bounding boxes
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-inpainting-text-box", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

input_image = load_image(
    "https://hf.co/datasets/huggingface/documentation-images/resolve/main/diffusers/gligen/livingroom_modern.png"
)
prompt = "a birthday cake"
boxes = [[0.2676, 0.6088, 0.4773, 0.7183]]
phrases = ["a birthday cake"]

images = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_inpaint_image=input_image,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
).images
images[0].save("./gligen-1-4-inpainting-text-box.jpg")

Grounded generation

import torch
from diffusers import StableDiffusionGLIGENPipeline
from diffusers.utils import load_image

# Generate an image described by the prompt and
# insert objects described by text at the region defined by bounding boxes
pipe = StableDiffusionGLIGENPipeline.from_pretrained(
    "masterful/gligen-1-4-generation-text-box", variant="fp16", torch_dtype=torch.float16
)
pipe = pipe.to("cuda")

prompt = "a waterfall and a modern high speed train running through the tunnel in a beautiful forest with fall foliage"
boxes = [[0.1387, 0.2051, 0.4277, 0.7090], [0.4980, 0.4355, 0.8516, 0.7266]]
phrases = ["a waterfall", "a modern high speed train running through the tunnel"]

images = pipe(
    prompt=prompt,
    gligen_phrases=phrases,
    gligen_boxes=boxes,
    gligen_scheduled_sampling_beta=1,
    output_type="pil",
    num_inference_steps=50,
).images
images[0].save("./gligen-1-4-generation-text-box.jpg")

Refer to the documentation to learn more.

Thanks to @nikhil-masterful for contributing GLIGEN in #4441.

Tiny Autoencoder

@madebyollin trained two Autoencoders (on Stable Diffusion and Stable Diffusion XL, respectively) to dramatically cut down the image decoding time. The effects are especially pronounced when working with larger-resolution images. You can use AutoencoderTiny to take advantage of it.

Here’s the example usage for Stable Diffusion:

import torch
from diffusers import DiffusionPipeline, AutoencoderTiny

pipe = DiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1-base", torch_dtype=torch.float16
)
pipe.vae = AutoencoderTiny.from_pretrained("madebyollin/taesd", torch_dtype=torch.float16)
pipe = pipe.to("cuda")

prompt = "slice of delicious New York-style berry cheesecake"
image = pipe(prompt, num_inference_steps=25).images[0]
image.save("cheesecake.png")

Refer to the documentation to learn more. Refer to this material to understand the implications of using this Autoencoder in terms of inference latency and memory footprint.

Fine-tuning Stable Diffusion XL with DreamBooth and LoRA on a free-tier Colab Notebook

Stable Diffusion XL’s (SDXL) high memory requirements often seem restrictive when it comes to using it for downstream applications. Even if one uses parameter-efficient fine-tuning techniques like LoRA, fine-tuning just the UNet component of SDXL can be quite memory-intensive. So, running it on a free-tier Colab Notebook (that usually has a 16 GB T4 GPU attached) seems impossible.

Now, with better support for gradient checkpointing and other recipes like 8-bit Adam (via bitsandbytes), it is possible to fine-tune the UNet of SDXL with DreamBooth and LoRA on a free-tier Colab Notebook.

Check out the Colab Notebook to learn more.

Thanks to @ethansmith2000 for improving the gradient checkpointing support in #4474.

Support of push_to_hub for models, schedulers, and pipelines

Our models, schedulers, and pipelines now support a push_to_hub option in save_pretrained() and also come with a dedicated push_to_hub() method. Below are some usage examples.

Models

from diffusers import ControlNetModel

controlnet = ControlNetModel(
    block_out_channels=(32, 64),
    layers_per_block=2,
    in_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    cross_attention_dim=32,
    conditioning_embedding_out_channels=(16, 32),
)
controlnet.push_to_hub("my-controlnet-model")
# or controlnet.save_pretrained("my-controlnet-model", push_to_hub=True)

Schedulers

from diffusers import DDIMScheduler

scheduler = DDIMScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
)
scheduler.push_to_hub("my-controlnet-scheduler")

Pipelines

from diffusers import (
    UNet2DConditionModel,
    AutoencoderKL,
    DDIMScheduler,
    StableDiffusionPipeline,
)
from transformers import CLIPTextModel, CLIPTextConfig, CLIPTokenizer

unet = UNet2DConditionModel(
    block_out_channels=(32, 64),
    layers_per_block=2,
    sample_size=32,
    in_channels=4,
    out_channels=4,
    down_block_types=("DownBlock2D", "CrossAttnDownBlock2D"),
    up_block_types=("CrossAttnUpBlock2D", "UpBlock2D"),
    cross_attention_dim=32,
)

scheduler = DDIMScheduler(
    beta_start=0.00085,
    beta_end=0.012,
    beta_schedule="scaled_linear",
    clip_sample=False,
    set_alpha_to_one=False,
)

vae = AutoencoderKL(
    block_out_channels=[32, 64],
    in_channels=3,
    out_channels=3,
    down_block_types=["DownEncoderBlock2D", "DownEncoderBlock2D"],
    up_block_types=["UpDecoderBlock2D", "UpDecoderBlock2D"],
    latent_channels=4,
)

text_encoder_config = CLIPTextConfig(
    bos_token_id=0,
    eos_token_id=2,
    hidden_size=32,
    intermediate_size=37,
    layer_norm_eps=1e-05,
    num_attention_heads=4,
    num_hidden_layers=5,
    pad_token_id=1,
    vocab_size=1000,
)
text_encoder = CLIPTextModel(text_encoder_config)
tokenizer = CLIPTokenizer.from_pretrained("hf-internal-testing/tiny-random-clip")

components = {
    "unet": unet,
    "scheduler": scheduler,
    "vae": vae,
    "text_encoder": text_encoder,
    "tokenizer": tokenizer,
    "safety_checker": None,
    "feature_extractor": None,
}
pipeline = StableDiffusionPipeline(**components)
pipeline.push_to_hub("my-pipeline")

Refer to the documentation to know more.

Thanks to @Wauplin for his generous and constructive feedback (see #4218) on this feature.

Better support for loading Kohya-trained LoRA checkpoints

Providing seamless support for loading Kohya-trained LoRA checkpoints from diffusers is important for us. This is wh...
