
LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts

WACV 2025

Anh-Quan Cao¹    Maximilian Jaritz²    Matthieu Guillaumin²    Raoul de Charette¹    Loris Bazzani²

¹ Inria    ² Amazon


If you find this work or code useful, please cite our paper and give this repo a star:

@InProceedings{cao2024latteclip,
      title={LatteCLIP: Unsupervised CLIP Fine-Tuning via LMM-Synthetic Texts}, 
      author={Anh-Quan Cao and Maximilian Jaritz and Matthieu Guillaumin and Raoul de Charette and Loris Bazzani},
      year={2024},
      booktitle = {arXiv}
}

News

  • 17/12/2024: code is released.
  • 14/10/2024: code will be available soon.

Table of Contents

  1. Installation
  2. Data Preparation
  3. Generate Descriptions
  4. Training
  5. Acknowledgement

Installation

Follow these steps to install the necessary dependencies:

1. Install OpenCLIP's Dependencies

Create a new conda environment and install the dependencies:

conda create -n latteclip python=3.10
conda activate latteclip

Navigate to the latteclip directory and run the following commands:

make install
make install-training
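
To quickly verify the environment (assuming make install puts open_clip and PyTorch into the active conda environment), you can try importing the packages:

python -c "import torch, open_clip; print('torch', torch.__version__, '- open_clip OK')"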

2. Install LLaVA

Follow the official installation instructions from the LLaVA repository:

git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
pip install -e .
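
To check that the editable install succeeded (the package is registered as llava), a quick import should work:

python -c "import llava; print('LLaVA import OK')"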

Data Preparation

1. Create the Data Directory

Create a folder to store the data and point the environment variable $LATTECLIP_DATA_DIR to it:

mkdir -p /path/to/data
export LATTECLIP_DATA_DIR=/path/to/data
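
If you want the variable to persist across shells, you can append the export to your shell profile (shown for bash; adjust to your setup):

echo 'export LATTECLIP_DATA_DIR=/path/to/data' >> ~/.bashrc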

2. Download the Data

Download the data from this link and extract all files into $LATTECLIP_DATA_DIR.
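
The archive name below is only a placeholder; substitute the file(s) you actually downloaded:

tar -xzf downloaded_data.tar.gz -C "$LATTECLIP_DATA_DIR"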

3. Run the Preprocess Script

Navigate to the latteclip directory and run the preprocessing script to create the webdataset tar files and extract the CLIP features:

cd latteclip
bash scripts/preprocess/preprocess.sh
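
As a quick sanity check (the exact layout under $LATTECLIP_DATA_DIR depends on the preprocessing script), you can list the generated webdataset shards:

find "$LATTECLIP_DATA_DIR" -name "*.tar" | head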

Generate Descriptions

1. Generate Image Descriptions

To generate image descriptions, follow these steps:

Example with dtd Dataset

Run the following command:

bash scripts/unsupervised/extract_captions_llava_multiprocess.sh $MACHINE_ID $NUM_MACHINE classname_dtd dtd $NUM_PROCESSES_PER_GPU $NUM_GPUS

Here, $MACHINE_ID is the index of the current machine, $NUM_MACHINE is the total number of machines, $NUM_PROCESSES_PER_GPU is the number of generation processes per GPU, and $NUM_GPUS is the number of GPUs per machine; classname_dtd and dtd select the prompt and the dataset.

If You Have Multiple Machines

Assume you have 2 machines, 1 GPU per machine, and 5 generation processes per Tesla V100 32GB GPU:

Machine 0:

bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 2 classname_dtd dtd 5 1

Machine 1:

bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 1 2 classname_dtd dtd 5 1

Generate Image Descriptions for Other Datasets

Use the following commands:

bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_dtd dtd 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_eurosat eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_scene sun397 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_flower flower102 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_food101 food101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_pets oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_car stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_ufc ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 classname_caltech caltech101 5 1
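
If you prefer to launch all of the above sequentially on a single machine with one GPU, you can loop over the prompt/dataset pairs (pairs copied from the commands above; adjust the process and GPU counts to your hardware):

pairs=(
  "classname_dtd dtd"
  "classname_eurosat eurosat"
  "classname_scene sun397"
  "classname_flower flower102"
  "classname_food101 food101"
  "classname_pets oxford_pets"
  "classname_car stanford_cars"
  "classname_ufc ucf101"
  "classname_caltech caltech101"
)
for pair in "${pairs[@]}"; do
  # Unquoted $pair expands into the prompt name and the dataset name.
  bash scripts/unsupervised/extract_captions_llava_multiprocess.sh 0 1 $pair 5 1
done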

2. Generate Group Descriptions

The process is similar to generating image descriptions. Use the following commands:

bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 dtd_describe_common_v3 dtd 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 eurosat_describe_common_v3 eurosat 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 sun397_describe_common_v3 sun397 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 flower102_describe_common_v3 flower102 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 food101_describe_common_v3 food101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 pets_describe_common_v3 oxford_pets 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 car_describe_common_v3 stanford_cars 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 ufc_describe_common_v3 ucf101 5 1
bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 caltech_describe_common_v3 caltech101 5 1
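
The same looping pattern works for the group-description commands above (single machine and one GPU assumed):

pairs=(
  "dtd_describe_common_v3 dtd"
  "eurosat_describe_common_v3 eurosat"
  "sun397_describe_common_v3 sun397"
  "flower102_describe_common_v3 flower102"
  "food101_describe_common_v3 food101"
  "pets_describe_common_v3 oxford_pets"
  "car_describe_common_v3 stanford_cars"
  "ufc_describe_common_v3 ucf101"
  "caltech_describe_common_v3 caltech101"
)
for pair in "${pairs[@]}"; do
  bash scripts/unsupervised/extract_captions_llava_compare.sh 0 1 $pair 5 1
done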

Training

To train the model on dtd, run:

bash scripts/unsupervised/dtd/dtd_fine_tune_multiclass.sh $lr $class_per_image $device $port $seed $exp_name
  • $lr: Learning rate
  • $class_per_image: Number of classes per image (always set to 1)
  • $device: Device ID
  • $port: Port for the job (Not used)
  • $seed: Random seed
  • $exp_name: Experiment name

Example

To train with learning rate 1e-7, one class per image, on device 0, with port 25680, random seed 1, and experiment name exp_dtd:

bash scripts/unsupervised/dtd_fine_tune_multiclass.sh 1e-7 1 0 25680 1 exp_dtd

Train on Other Datasets

bash scripts/unsupervised/eurosat_fine_tune_multiclass.sh 1e-7 1 0 25666 1 exp_eurosat
bash scripts/unsupervised/caltech101_fine_tune_multiclass.sh 1e-7 1 0 25665 1 exp_caltech101
bash scripts/unsupervised/fgvc_aircraft/fgvc_aircraft_fine_tune_multiclass.sh 1e-7 1 0 25667 1 exp_fgvc_aircraft
bash scripts/unsupervised/flower102_fine_tune_multiclass.sh 1e-7 1 0 25668 1 exp_flower102
bash scripts/unsupervised/food101_fine_tune_multiclass.sh 1e-7 1 0 25669 1 exp_food101
bash scripts/unsupervised/oxford_pets_fine_tune_multiclass.sh 1e-7 1 0 25670 1 exp_oxford_pets
bash scripts/unsupervised/stanford_cars/stanford_cars_fine_tune_multiclass.sh 1e-7 1 0 25671 1 exp_stanford_cars
bash scripts/unsupervised/sun397_fine_tune_multiclass.sh 1e-7 1 0 25672 1 exp_sun397
bash scripts/unsupervised/ucf101_fine_tune_multiclass.sh 1e-7 1 0 25673 1 exp_ucf101
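
If you want to average results over several random seeds, a small sweep is convenient (the seeds, port, and experiment names below are only illustrative):

for seed in 1 2 3; do
  bash scripts/unsupervised/dtd_fine_tune_multiclass.sh 1e-7 1 0 25680 "$seed" "exp_dtd_seed${seed}"
done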

Note

Logs will be stored in the logs folder.


Acknowledgement

This repository is built upon OpenCLIP and LLaVA.

The research was conducted mainly during Quan’s internship at Amazon. It was also supported by the ANR project SIGHT (ANR-20-CE23-0016) and by the SAMBA collaborative project, co-funded by BpiFrance in the Investissement d’Avenir Program. Computation was performed partly using HPC resources from GENCI–IDRIS (AD011012808R2, AD011014102R1). We thank Ajanthan Thalaiyasingam and Mohammad Fahes for their insightful suggestions. We also extend our gratitude to Mohammad Fahes and Ivan Lopes for their thorough proofreading.