Dreaming Masks with FLUX.1 Kontext

💡 Using FLUX.1 Kontext for creating segmentation masks for objects absent from images, enabling workflows in inpainting and virtual try-ons.

This project is part of the Black Forest Labs Hackathon

Project Description

Image segmentation is a challenging task. Over the years, many segmentation models have been developed to tackle it, including YOLO, SAM / SAM2, Sapiens, and even promptable pipelines like GroundingDINO + SAM / SAM2. While effective in certain settings, these methods often struggle with out-of-distribution samples, produce unreliable results when complex masks are needed, are awkward to prompt (e.g. you have to provide points or bounding boxes), and, more critically, they cannot generate masks for objects that don’t exist in the given image. For example, consider putting glasses on a model who is not wearing any. Before even thinking about inpainting the glasses, the first step is to generate a mask that defines exactly where the glasses should appear. With current segmentation models, this is nearly impossible to automate reliably. Another example is putting long socks on a model who is wearing short socks. With current segmentation models, one would have to cobble together several masks (i.e. short-sock mask + calf mask - (optionally) a shoe mask) and then attempt to merge them with post-processing tricks like mask dilation or hole filling. In short, existing segmentation models fall short because they can only identify what’s already in the image - as the saying goes, you can’t get blood from a stone - until now!

To address these challenges, our proof-of-concept leverages FLUX.1 Kontext. Instead of relying on fixed segmentation models, we train multiple lightweight LoRAs to generate complex segmentation masks even for objects absent from the original image. Each LoRA is tailored to generate one specific type of mask. For example, a dedicated LoRA can be trained to segment a long sock region (defined as the area below the knee, excluding parts from clothing and shoes). Additionally, we have found that as few as 10 training samples are sufficient for FLUX.1 Kontext to learn the mask and produce highly consistent results. Our approach generalizes well and achieves strong results on out-of-sample data, opening the door to new workflows in tasks like inpainting and virtual try-on. Honestly, the final results blew us away!

But first things first …

Motivation: Using Generative AI in Virtual Try-On Tasks

Virtual try-on tasks focus on transferring clothing items - such as t-shirts, socks, bras, or trousers - from flatlay images (i.e. images that show the product itself, usually on a white background) onto on-model images (i.e. images that show a model wearing a specific outfit from a specific brand, usually either on a neutral background such as white or gray, or as a mood shot where the model is placed in an appealing setting). The latter can also be AI-generated. Technically, the workflow always requires (a) a flatlay image and (b) a reference on-model image (see examples below). The workflow then proceeds as follows:

  1. Extract a mask around the target area in the reference on-model image that is to be modified (for example, isolating the black trousers worn by the model).
  2. Supply both the flatlay image (e.g. showing leopard-print trousers) and the reference on-model image, along with its corresponding mask from step 1, to a specialized virtual try-on model - for instance, CatVTON.
  3. Ask the virtual try-on model to replace the outfit in the on-model image highlighted by the mask from step 1 (e.g. the black trousers) with the product from the flatlay image (e.g. the leopard-print trousers). See the results below and the workflow sketch after the figure.
Flatlay Image | Reference On-Model Image | Generated On-Model Image
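Schematically, the whole loop looks as follows. This is a minimal sketch, not our actual pipeline code: the two helper functions are placeholders for whatever mask generator and try-on model you plug in (in this project, a FLUX.1 Kontext mask LoRA for step 1 and a try-on model such as CatVTON for steps 2-3).

```python
from PIL import Image


def generate_mask(on_model: Image.Image, prompt: str) -> Image.Image:
    """Step 1 (placeholder): mask the area of the on-model image to modify.

    In this project the mask comes from a FLUX.1 Kontext mask LoRA
    (see Process and Inference below); a classic segmentation model could
    be plugged in instead, with the limitations discussed in this README.
    """
    raise NotImplementedError


def virtual_tryon(flatlay: Image.Image, on_model: Image.Image,
                  mask: Image.Image) -> Image.Image:
    """Steps 2-3 (placeholder): hand the flatlay image, the on-model image,
    and the mask to a virtual try-on model such as CatVTON and return the
    generated on-model image."""
    raise NotImplementedError


flatlay = Image.open("flatlay.png")    # (a) product shot on a white background
on_model = Image.open("on_model.png")  # (b) reference on-model image

mask = generate_mask(on_model, "put a green mask for trousers on the person")
result = virtual_tryon(flatlay, on_model, mask)
result.save("generated_on_model.png")
```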

One notoriously difficult but very important step in this process is automatically constructing the correct mask for the area to be modified (step 1). Traditional image segmentation models can only outline objects already present; they cannot accurately predict a mask for new or complex items. For example, changing short socks to long socks in a given image requires the segmentation model to mask the foot, the sock, and the calves; another example is replacing a t-shirt with a hoodie, where, in addition to the t-shirt, we also have to mask the upper and lower arms of the model in the on-model image. See examples below.

Typical limitations of current segmentation models:

  1. It’s impossible to mask an object that is not present in the given image (e.g. glasses when the model is not wearing any glasses yet)
  2. It’s really hard to construct complex masks, because the models only know about the objects actually present in a given image (e.g. the long-sock mask needed to replace bare feet / short socks with long socks)
    • For example, SAM2 is really good at identifying a person as a whole, but getting a mask for, say, only the calves is really difficult (you can try to use a bounding box, but even that comes with limitations: where do the bounding-box coordinates come from?)
  3. Most segmentation models struggle with out-of-distribution samples. For example, Sapiens is really good at identifying different parts of the human body and different clothing types, but as soon as only a part of the body is visible (e.g. the lower legs), it fails to recognize the body parts and clothing types.
1. Problem | 2. Problem
img1 | img2
How can we automatically build such a complex mask (the green area in each image)?

Some Notes on FLUX.1 Kontext Capabilities (or should we say Incapabilities)

  • The first limitation we observed is the difficulty of generating specific new objects in images. We initially assumed FLUX.1 Kontext could directly create modifications such as turning short socks into long socks. In practice, this turned out to be unreliable. There are many variations of what long socks could look like, and text prompts alone are too ambiguous to consistently control length or style. Misunderstandings can also occur, for example when the model replaces shoes with only socks, even though the shoes should remain in the image. This makes it difficult to depend on inherent generative capabilities for precise object modifications.
  • Another point concerns sampling efficiency. Standard workflows with FLUX.1 Kontext typically require 20 or more sampling steps to achieve strong results. With our approach, however, high-quality masks can often be produced with as few as 10 steps, sometimes even fewer. This significantly reduces computation time and makes the pipeline much more efficient for downstream segmentation tasks.
  • The third limitation is consistency of image structure. When directly modifying images, such as adding long socks, FLUX.1 Kontext occasionally introduces unintended alterations, including slight positional shifts of the person or changes in the proportions of elements. That makes downstream segmentation much harder, if not impossible. With our LoRA-based mask generation approach, these structural inconsistencies did not appear in testing. This stability is important for automation, since it ensures the masks are produced without unwanted changes.

Goal


To tackle this problem, the idea is to train multiple lightweight FLUX.1 Kontext LoRAs that are capable of generating complex segmentation masks even for objects absent from the original image. Each LoRA is tailored to generate one specific type of mask. For example, a dedicated LoRA can be trained to segment a long sock region (defined as the area below the knee, excluding parts from clothing and shoes).

Process

We used Ostris’ AI Toolkit to train the FLUX.1 Kontext LoRAs. You can access it via HuggingFace or launch a RunPod Template. Basically, we followed the instructions in this video. Settings that differ from the defaults:

  • Prompt / Caption: “put a green mask for [OBJECT] on the person”, where [OBJECT] refers to the area-of-interest (e.g. glasses, socks). We added the caption to each target image.
  • Linear Rank: 64
  • Steps: 1000-2000
  • Resolutions: [512,768,1024]

Overall, we trained 4 different LoRAs that can be used in virtual try-on tasks, namely: Socks-LoRA, Hat-LoRA, Glasses-LoRA, and Sweatshirt-LoRA. For each of these LoRAs, we collected ~10 training samples in the form of (a) a control image (the image that serves as the starting point for FLUX.1 Kontext) and (b) a target image (the same image with a green mask over the to-be-modified area). Note that we created these masks manually in Photoshop. For some examples, see below.

Control Target
socks-07.png socks-07-target.png
glasses-03.jpg glasses-03-target.jpg
hats-03.jpg hats-03-target.jpg
sweatshirt-04.jpeg sweatshirt-04-target.jpeg

You can find the training (control + target images) and testing (+ results) data here:

  • Socks-LoRA. Link
    • Prompt / Caption: put a green mask for socks and lower leg skin on the person
  • Hat-LoRA. Link
    • Prompt / Caption: put a green mask for a hat on the person
  • Glasses-LoRA. Link
    • Prompt / Caption: put a green mask for glasses on the person
  • Sweatshirt-LoRA. Link
    • Prompt / Caption: put a green mask for a sweatshirt on the person
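At test time, each LoRA is driven with exactly the caption it was trained on. Below is a minimal sketch of running the Socks-LoRA with the diffusers FluxKontextPipeline instead of ComfyUI; it assumes a recent diffusers release with Kontext support, the FLUX.1 Kontext [dev] weights, a locally downloaded LoRA (directory and file name are placeholders), and the guidance scale commonly suggested for Kontext. Note the reduced step count discussed earlier.

```python
import torch
from diffusers import FluxKontextPipeline
from diffusers.utils import load_image

# Load FLUX.1 Kontext [dev] and attach one of the trained mask LoRAs.
# "./socks_lora" / "socks_lora.safetensors" are placeholder names.
pipe = FluxKontextPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-Kontext-dev", torch_dtype=torch.bfloat16
).to("cuda")
pipe.load_lora_weights("./socks_lora", weight_name="socks_lora.safetensors")

control = load_image("socks-10.png")  # reference on-model image (control)

# Reuse the training caption as the prompt; ~10 sampling steps were usually
# enough for clean masks (vs. 20+ for typical Kontext edits).
masked = pipe(
    image=control,
    prompt="put a green mask for socks and lower leg skin on the person",
    guidance_scale=2.5,
    num_inference_steps=10,
).images[0]
masked.save("socks-10-masked.png")
```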

Notes on the training data

  • Select reference on-model images that represent your target distribution in terms of size, shape, perspective, crop, etc.
  • Before creating your masks, ask yourself: what does the ideal mask look like? For example, in the socks case we first assumed we would only be given barefoot models, but later broadened that to barefoot models, models wearing shoes, models wearing short socks, models wearing long socks, etc.
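We built the target images by hand in Photoshop, but if you already have a binary mask, an equivalent green-overlay target can be composited programmatically. A minimal Pillow sketch (file names and the exact shade of green are placeholders):

```python
from PIL import Image

# control.png: the unmodified on-model image
# mask.png:    a black/white mask of the to-be-modified area (white = masked)
control = Image.open("control.png").convert("RGB")
mask = Image.open("mask.png").convert("L")

# Paint the masked region in solid green, mimicking the hand-made targets
# (pure green (0, 255, 0) is an assumption; use whatever shade you mask with).
green = Image.new("RGB", control.size, (0, 255, 0))
target = Image.composite(green, control, mask)  # green where mask is white
target.save("control-target.png")
```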

Results


You can find the final LoRAs (saved at different steps; 1000 steps was usually enough) here:

We (visually) evaluated the performance of each LoRA on 4-5 different testing images and the results are just mind-blowing. See for yourself:

Socks-LoRA

socks-10-results.png

socks-11-results.png

socks-12-results.png

socks-13-results.png

Hats-LoRA

hats-11-results.png

hats-12-results.png

hats-13-results.png

hats-14-results.png

Glasses-LoRA

glasses-10-results.png

glasses-11-results.png

glasses-12-results.png

glasses-13-results.png

Sweatshirt-LoRA

sweatshirt-11-results.png

sweatshirt-12-results.png

sweatshirt-13-results.png

sweatshirt-14-results.png

Inference


We provide a ComfyUI workflow that can be used to extract the mask from the given FLUX.1 Kontext prediction. The green nodes are adjustable. Everything else should stay the same.

flux-kontext-segmentation.json
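If you prefer plain Python over ComfyUI, the green region can also be pulled out of the Kontext prediction with a simple color threshold. A minimal OpenCV sketch (file names are placeholders, and the HSV bounds are a starting point that will likely need tuning to the exact green your LoRA produces):

```python
import cv2
import numpy as np

# The FLUX.1 Kontext prediction with the green mask painted onto the person.
pred = cv2.imread("socks-10-masked.png")

# Threshold roughly "green" pixels in HSV space.
hsv = cv2.cvtColor(pred, cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv, np.array([40, 80, 80]), np.array([85, 255, 255]))

# Close small holes and remove specks before handing the binary mask to an
# inpainting or virtual try-on model.
kernel = np.ones((5, 5), np.uint8)
mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)

cv2.imwrite("socks-10-mask.png", mask)
```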
