This notebook demonstrates how to fine-tune the kind of CLIP model used by Stable Diffusion on a self-defined dataset, using the Hugging Face libraries. It is based on the Hugging Face `transformers` CLIP example.
The dataset should be provided as a collection of `.jpg` or `.jpeg` images. For each image there should be a `.txt` file with the same base name that contains the caption:
- `fluffy-dog.jpg`
- `fluffy-dog.txt` - caption for `fluffy-dog.jpg`, for example `a picture of a fluffy dog`.
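If you want to sanity-check the pairing before running the notebook, a minimal sketch along these lines works (the folder path is a placeholder, not part of the notebook):

```python
from pathlib import Path

# Placeholder - point this at your image folder.
root_folder = Path("/path/to/your/images")

# Report any image that is missing its same-named caption file.
for image_path in sorted(root_folder.iterdir()):
    if image_path.suffix.lower() in {".jpg", ".jpeg"}:
        if not image_path.with_suffix(".txt").exists():
            print(f"missing caption for {image_path.name}")
```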
The `huggingface_finetune_clip_runner.ipynb` notebook contains a code cell that builds a `.json` file for such a collection of files, in a format that Hugging Face `datasets` can load.
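As an illustration only (the exact schema is defined by the notebook; the `image`/`caption` field names here are assumptions), such a file is typically JSON Lines, with one object per image pairing a path with its caption:

```json
{"image": "/path/to/fluffy-dog.jpg", "caption": "a picture of a fluffy dog"}
{"image": "/path/to/tabby-cat.jpg", "caption": "a photo of a tabby cat"}
```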
Load `huggingface_finetune_clip_runner.ipynb` in an environment that already has PyTorch and torchvision installed. Work through the cells one by one; you will need to change `root_folder` and `out_json` to match your setup:

```python
root_folder = "/Users/damian/2.current/stablediffusion/buzzybee/fullsize"
out_json = "/Users/damian/2.current/stablediffusion/buzzybee.json"
```
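For reference, a minimal sketch of what such a JSON-building cell might look like, assuming the `image`/`caption` field names shown above (match whatever schema the notebook's training code actually expects):

```python
import json
from pathlib import Path

# Uses root_folder and out_json as set above.
root = Path(root_folder)

with open(out_json, "w") as f:
    for image_path in sorted(root.iterdir()):
        if image_path.suffix.lower() not in {".jpg", ".jpeg"}:
            continue
        caption_file = image_path.with_suffix(".txt")
        if not caption_file.exists():
            print(f"skipping {image_path.name}: no caption file")
            continue
        caption = caption_file.read_text().strip()
        # One JSON object per line, which the datasets json loader accepts.
        f.write(json.dumps({"image": str(image_path), "caption": caption}) + "\n")
```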