WACV 2025
Anh-Quan Cao1 Maximilian Jaritz2 Matthieu Guillaumin2 Raoul de Charette1 Loris Bazzani2
If you find this work or code useful, please cite our paper and give this repo a star:
@InProceedings{cao2024latteclip,
  title     = {LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts},
  author    = {Anh-Quan Cao and Maximilian Jaritz and Matthieu Guillaumin and Raoul de Charette and Loris Bazzani},
  year      = {2024},
  booktitle = {arXiv}
}
- 17/12/2024: code is released.
- 14/10/2024: code will be available soon.
Follow these steps to install the necessary dependencies:
Create a new conda environment and install the dependencies:
conda create -n latteclip python=3.10
conda activate latteclip
Navigate to the latteclip directory and run the following commands:
make install
make install-training
Install LLaVA by following the official instructions here:
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
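After installing, a quick sanity check that the package is importable (the package name llava is assumed from the official repository):
# Verify the LLaVA package can be imported
python -c "import llava; print('LLaVA OK')"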
Create a folder to store the data and set its path in the bash variable $LATTECLIP_DATA_DIR:
mkdir -p /path/to/data
export LATTECLIP_DATA_DIR=/path/to/data
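Optionally, persist the variable across shell sessions (a minimal sketch, assuming bash):
# Append the export to your shell startup file so new shells pick it up
echo 'export LATTECLIP_DATA_DIR=/path/to/data' >> ~/.bashrc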
Download the data from this link and extract all files into $LATTECLIP_DATA_DIR.
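For example, if the download is a gzipped tarball (the filename below is hypothetical; use the actual name of the downloaded archive):
# Extract the downloaded archive into the data directory
tar -xzf latteclip_data.tar.gz -C "$LATTECLIP_DATA_DIR"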
Navigate to the latteclip directory and run the preprocessing script to create the WebDataset tar files and extract the CLIP features:
cd latteclip
bash scripts/preprocess/preprocess.sh
To generate image descriptions, run the following command:
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh $MACHINE_ID $NUM_MACHINE classname_dtd dtd $NUM_PROCESSES_PER_GPU $NUM_GPUS
For example, assume you have 2 machines, 1 GPU per machine, and 5 generation processes per Tesla V100 32 GB GPU:
Machine 0:
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 2 classname_dtd dtd 5 1
Machine 1:
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 1 2 classname_dtd dtd 5 1
To run everything on a single machine with one GPU instead, use the following commands (one per dataset):
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_dtd dtd 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_eurosat eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_scene sun397 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_flower flower102 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_food101 food101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_pets oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_car stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_ufc ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_caltech caltech101 5 1
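As a convenience (not part of the repository scripts), the same per-dataset commands can be driven by one loop over the prompt/dataset pairs listed above; $pair is left unquoted on purpose so it splits into the two arguments:
# Run caption generation for every dataset sequentially on one machine
for pair in "classname_dtd dtd" "classname_eurosat eurosat" "classname_scene sun397" \
            "classname_flower flower102" "classname_food101 food101" "classname_pets oxford_pets" \
            "classname_car stanford_cars" "classname_ufc ucf101" "classname_caltech caltech101"; do
  bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 $pair 5 1
done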
The process is similar to generating image descriptions. Use the following commands:
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 dtd_describe_common_v3 dtd 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 eurosat_describe_common_v3 eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 sun397_describe_common_v3 sun397 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 flower102_describe_common_v3 flower102 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 food101_describe_common_v3 food101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 pets_describe_common_v3 oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 car_describe_common_v3 stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 ufc_describe_common_v3 ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 caltech_describe_common_v3 caltech101 5 1
To train the model on dtd, run:
bash scripts/unsupervised/dtd/dtd_fine_tune_multiclass.sh $lr $class_per_image $device $port $seed $exp_name
- $lr: learning rate
- $class_per_image: number of classes per image (always set to 1)
- $device: device ID
- $port: port for the job (not used)
- $seed: random seed
- $exp_name: experiment name
For example, to train with learning rate 1e-7, on device 0, with port 25680, random seed 1, and experiment name exp_dtd:
bash scripts/unsupervised/dtd_fine_tune_multiclass.sh 1e-7 1 0 25680 1 exp_dtd
bash scripts/unsupervised/eurosat_fine_tune_multiclass.sh 1e-7 1 0 25666 1 exp_eurosat
bash scripts/unsupervised/caltech101_fine_tune_multiclass.sh 1e-7 1 0 25665 1 exp_caltech101
bash scripts/unsupervised/fgvc_aircraft/fgvc_aircraft_fine_tune_multiclass.sh 1e-7 1 0 25667 1 exp_fgvc_aircraft
bash scripts/unsupervised/flower102_fine_tune_multiclass.sh 1e-7 1 0 25668 1 exp_flower102
bash scripts/unsupervised/food101_fine_tune_multiclass.sh 1e-7 1 0 25669 1 exp_food101
bash scripts/unsupervised/oxford_pets_fine_tune_multiclass.sh 1e-7 1 0 25670 1 exp_oxford_pets
bash scripts/unsupervised/stanford_cars/stanford_cars_fine_tune_multiclass.sh 1e-7 1 0 25671 1 exp_stanford_cars
bash scripts/unsupervised/sun397_fine_tune_multiclass.sh 1e-7 1 0 25672 1 exp_sun397
bash scripts/unsupervised/ucf101_fine_tune_multiclass.sh 1e-7 1 0 25673 1 exp_ucf101
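Since the seed is a positional argument, sweeping several seeds is a short loop; a sketch for dtd (the experiment names below are hypothetical):
# Fine-tune on DTD with several random seeds, one experiment name per seed
for seed in 1 2 3; do
  bash scripts/unsupervised/dtd_fine_tune_multiclass.sh 1e-7 1 0 25680 "$seed" "exp_dtd_seed${seed}"
done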
Note: logs will be stored in the logs folder.
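To follow training progress, one option (assuming plain-text log files directly under logs/) is:
# Stream the most recently modified log file
tail -f "$(ls -t logs/* | head -n 1)"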
This repository is built upon OpenCLIP and LLaVA.
The research was conducted mainly during Quan’s internship at Amazon. It was also supported by the ANR project SIGHT (ANR-20-CE23-0016) and the SAMBA collaborative project co-funded by BpiFrance in the Investissement d’Avenir Program. Computation was performed partly using HPC resources from GENCI–IDRIS (AD011012808R2, AD011014102R1). We thank Ajanthan Thalaiyasingam and Mohammad Fahes for their insightful suggestions. We also extend our gratitude to Mohammad Fahes and Ivan Lopes for their thorough proofreading.