Authors: Sanghwan Kim, Rui Xiao, Mariana-Iuliana Georgescu, Stephan Alaniz, Zeynep Akata
Vision-Language Models (VLMs) trained with contrastive loss have achieved significant advancements in various vision and language tasks. However, the global nature of contrastive loss makes VLMs focus predominantly on foreground objects, neglecting other crucial information in the image, which limits their effectiveness in downstream tasks. To address these challenges, we propose COSMOS: CrOSs-MOdality Self-distillation for vision-language pre-training that integrates a novel text-cropping strategy and cross-attention module into a self-supervised learning framework. We create global and local views of images and texts (i.e., multi-modal augmentations), which are essential for self-distillation in VLMs. We further introduce a cross-attention module, enabling COSMOS to learn comprehensive cross-modal representations optimized via a cross-modality self-distillation loss. COSMOS consistently outperforms previous strong baselines on various zero-shot downstream tasks, including retrieval, classification, and semantic segmentation. Additionally, it surpasses CLIP-based models trained on larger datasets in visual perception and contextual understanding tasks.
We release the pre-trained COSMOS models on Hugging Face. The pre-trained models and their performance on COCO (I2T R@1 and T2I R@1), Flickr30k (I2T R@1 and T2I R@1), and ImageNet (Top-1 accuracy) are reported below. For the full results, please refer to our paper.
Checkpoint | Arch. | Pre-training Data | COCO I2T R@1 | COCO T2I R@1 | Flickr I2T R@1 | Flickr T2I R@1 | ImageNet Top-1 |
---|---|---|---|---|---|---|---|
cosmos_vitb16_cc3m | ViT-B/16 | CC3M-recap | 53.1 | 40.1 | 84.1 | 68.6 | 37.1 |
cosmos_vitb16_cc12m | ViT-B/16 | CC12M-recap | 64.2 | 48.9 | 91.4 | 76.2 | 51.4 |
cosmos_vitb16_yfcc15m | ViT-B/16 | YFCC15M-recap | 67.5 | 50.9 | 92.6 | 79.6 | 52.4 |
cosmos_vitb16_merged30m | ViT-B/16 | Merged30M | 68.0 | 52.5 | 92.9 | 80.3 | 57.6 |
cosmos_vitb16_pixelprose | ViT-B/16 | PixelProse | 62.4 | 43.4 | 89.9 | 73.6 | 59.6 |
cosmos_vitb32_cc3m | ViT-B/32 | CC3M-recap | 47.6 | 33.5 | 74.3 | 59.2 | 33.0 |
cosmos_vitb32_cc12m | ViT-B/32 | CC12M-recap | 59.6 | 43.0 | 86.5 | 69.8 | 46.7 |
cosmos_vitb32_yfcc15m | ViT-B/32 | YFCC15M-recap | 64.5 | 46.0 | 90.2 | 73.3 | 48.1 |
cosmos_vitb32_merged30m | ViT-B/32 | Merged30M | 64.3 | 48.4 | 89.9 | 76.1 | 53.4 |
cosmos_vitb32_pixelprose | ViT-B/32 | PixelProse | 57.2 | 38.9 | 85.6 | 66.3 | 54.3 |
All pre-trained models can be loaded directly from Hugging Face by setting `--huggingface-model-name` and `--huggingface-repo-name` during inference. Optionally, you can download each weight separately and pass the `--resume path/to/pretrained_weights` flag to the inference code.
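If you prefer the manual route, the checkpoints listed above can be fetched with the standard `huggingface-cli` tool. The sketch below assumes the checkpoint files sit at the top level of the `sankim2/cosmos` repository and uses `./checkpoints` as an arbitrary local directory.

```bash
# Sketch: download one checkpoint locally and point --resume at it.
pip install -U "huggingface_hub[cli]"
huggingface-cli download sankim2/cosmos cosmos_vitb16_merged30m.pt --local-dir ./checkpoints

# Then, in the inference command, replace the Hugging Face flags with:
#   --resume ./checkpoints/cosmos_vitb16_merged30m.pt
```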
You can set up your virtual environment by following the instructions below. We built our code repository upon OpenCLIP, which is still updated frequently, so we recommend checking their repo for a detailed tutorial on creating an environment best suited to your system. A conda environment is also possible with the same Python and PyTorch versions.
First, download the COSMOS GitHub repo and navigate to the project's root directory `cosmos/`.

```bash
git clone https://github.com/ExplainableML/cosmos.git
cd cosmos/
```
Create a virtual environment using Python 3.12 and activate it.

```bash
python3.12 -m venv cosmos_env
source cosmos_env/bin/activate
```
Install all requirements via pip.

```bash
pip install --upgrade pip
pip install -r requirements.txt
```
If you want to run the semantic segmentation tasks, please follow SCLIP to install its dependencies as well. We list their commands below for completeness.

```bash
pip install openmim
mim install mmcv==2.0.1 mmengine==0.8.4 mmsegmentation==1.1.1
pip install ftfy regex yapf==0.40.1
```
Optionally, you can use Anaconda to set up the environment.

```bash
conda create --name cosmos_env python=3.12
conda activate cosmos_env
```

Then, install all dependencies as follows.

```bash
pip install --upgrade pip
pip install -r requirements.txt
```
Check `datasets/README.md` to prepare all the inference datasets for the retrieval, classification, and segmentation tasks.
To reproduce the downstream-task results (image-text retrieval, image classification, semantic segmentation) from the COSMOS paper, we provide an example inference bash script for each task: `src/inference_retrieval.sh`, `src/inference_classification.sh`, and `src/inference_segmentation.sh`.
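Assuming the datasets from `datasets/README.md` are prepared and the path flags inside each script point to your local directories, the scripts can be launched directly:

```bash
bash src/inference_retrieval.sh
bash src/inference_classification.sh
bash src/inference_segmentation.sh
```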
Here are detailed explanations of important flags.
- `--huggingface-repo-name`: Name of the Hugging Face repo where the pre-trained models are stored. Should be fixed as `sankim2/cosmos`.
- `--huggingface-model-name`: Name of the pre-trained model. Options include `cosmos_vitb16_cc3m.pt`, `cosmos_vitb16_cc12m.pt`, `cosmos_vitb16_yfcc15m.pt`, `cosmos_vitb16_merged30m.pt`, `cosmos_vitb16_pixelprose.pt` for ViT-B/16 and `cosmos_vitb32_cc3m.pt`, `cosmos_vitb32_cc12m.pt`, `cosmos_vitb32_yfcc15m.pt`, `cosmos_vitb32_merged30m.pt`, `cosmos_vitb32_pixelprose.pt` for ViT-B/32.
- `--model`: Model architecture; must match `--huggingface-model-name`. Options include `ViT-B-16` and `ViT-B-32`.
- `--precision`: Defaults to `amp`, as used in our paper.
- `--workers`: Number of data-loading workers; adjust according to your system.
`--data-root-dir` should point to the directory containing the COCO and Flickr30k validation sets. Please refer to `src/inference_retrieval.sh` for running inference on the retrieval task.
`--imagenet-val` should point to the directory containing the ImageNet validation set. Please refer to `src/inference_classification.sh` for running inference on the classification task.
`--seg-w-background` controls whether to evaluate on the segmentation benchmarks that include a background class. If `--use-csa` is set, the model uses the Correlative Self-Attention (CSA) block from SCLIP for segmentation. Please refer to `src/inference_segmentation.sh` for running inference on the segmentation task. A sketch of how these flags fit together is shown below.
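For orientation, here is a rough sketch of a retrieval invocation using the flags documented above. The `python main.py` entry point and the local paths are placeholders (the real invocation lives inside the provided scripts); only the flag names and fixed values are taken from this README.

```bash
# Illustrative flag combination for retrieval inference (entry point and paths are placeholders).
python main.py \
  --huggingface-repo-name sankim2/cosmos \
  --huggingface-model-name cosmos_vitb16_merged30m.pt \
  --model ViT-B-16 \
  --precision amp \
  --workers 8 \
  --data-root-dir /path/to/retrieval_data

# For classification, replace --data-root-dir with --imagenet-val /path/to/imagenet/val;
# for segmentation, add --seg-w-background and optionally --use-csa.
```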
In order to train COSMOS from scratch, the synthetic long-caption datasets should be downloaded from DreamLIP's recaptioned CC3M-recap, CC12M-recap, YFCC15M-recap and their combination (Merged-30M), as well as PixelProse. Notably, COSMOS requires all pre-training datasets to be processed into the webdataset format to achieve higher I/O efficiency for large-scale training. Below, we take CC3M-recap as an example to demonstrate how to prepare the pre-training data; the preparation for the other datasets is similar. We share the same pre-training datasets as FLAIR. Please check their repo as well if you find it interesting!
- Download DreamLIP's annotations for CC3M-recap:

```bash
wget https://huggingface.co/datasets/qidouxiong619/dreamlip_long_captions/resolve/main/cc3m_3long_3short_1raw_captions_url.csv
```
- Scrape the images from the URLs in the CSV using img2dataset and shard them into the webdataset format (see the sketch below).
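A minimal img2dataset sketch is given below. The column names and output settings are assumptions based on typical img2dataset usage, not values from the released scripts; check the header of the downloaded CSV and the webdataset layout expected by the training code before running it at scale.

```bash
# Sketch: scrape CC3M-recap images and write webdataset shards with img2dataset.
# NOTE: --url_col / --caption_col are assumptions; verify them against the CSV header.
pip install img2dataset
img2dataset \
  --url_list cc3m_3long_3short_1raw_captions_url.csv \
  --input_format csv \
  --url_col url \
  --caption_col caption \
  --output_format webdataset \
  --output_folder cc3m_recap_wds \
  --processes_count 16 \
  --thread_count 64 \
  --image_size 256
```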
COSMOS is trained on a Slurm GPU cluster with 16 NVIDIA A100 40GB GPUs (for CC3M) or 128 NVIDIA A100 40GB GPUs (for the other, larger datasets). In `src/`, we provide example Slurm training scripts for each dataset: `train_cc3m.sh`, `train_cc12m.sh`, `train_yfcc15m.sh`, `train_merged30m.sh`, and `train_pixelprose.sh`.
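Assuming the Slurm account, partition, and path settings inside the scripts have been adapted to your cluster, a training job can be submitted in the usual way:

```bash
sbatch src/train_cc3m.sh
```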
Important flags are described below:
- `--train-data`: Root directory where the training data (webdataset shards) is stored.
- `--train-num-samples`: Total number of training samples; adjust based on your available data.
- `--use-imagecrop-aug`: Use the multi-crop image augmentation described in the paper.
- `--global-crops-number`: Number of global image crops. Fixed as 2.
- `--local-crops-number`: Number of local image crops.
- `--crop-scale`: Determines the scale `s` of the global and local crops: (0.05, s) for local crops and (s, 1.0) for global crops. Fixed as 0.4.
- `--caption-sampling-mode`: Determines how captions are sampled. Fixed as `textcrop` or `textcrop_pixelprose`.
- `--num-sampled-captions`: Total number of sampled captions (global + local).
- `--momentum-teacher`: Initial momentum value for the teacher; adjust based on the batch size. We used 0.999 for a 1k batch size and 0.99 for a 4k batch size.
- `--fix-momentum`: Fix the momentum value during training.
- `--output-all`: Output both patch (or word) tokens and the [cls] (or [eot]) tokens.
- `--attentional-pool`: Add the cross-attention module to the model.
- `--cosmos`: Use the COSMOS loss during training.
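Putting these together, the sketch below shows roughly what the flag block inside a training script might look like. The `python main.py` launcher, the shard pattern, and the crop/caption counts are illustrative placeholders rather than values taken from the released scripts; only the flag names and the fixed values (2 global crops, crop scale 0.4) come from this README.

```bash
# Illustrative training flag combination (entry point, paths, and counts are placeholders).
python main.py \
  --model ViT-B-16 \
  --precision amp \
  --train-data '/path/to/cc3m_recap_wds/{00000..00331}.tar' \
  --train-num-samples 2800000 \
  --use-imagecrop-aug \
  --global-crops-number 2 \
  --local-crops-number 4 \
  --crop-scale 0.4 \
  --caption-sampling-mode textcrop \
  --num-sampled-captions 6 \
  --momentum-teacher 0.999 \
  --fix-momentum \
  --output-all \
  --attentional-pool \
  --cosmos
```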
We visualize the attention weights of image and text cross-attention modules. Patch-wise (image) and token-wise (caption) attention weights are both normalized between 0 and 1.
We thank OpenCLIP for providing the amazing code base. We also acknowledge DreamLIP and PixelProse for providing various pre-training datasets with captions from MLLMs, and we are grateful to SCLIP for providing the detailed scheme for the semantic segmentation task.
If you find our work useful, please star this repo and cite:
```bibtex
@article{kim2025cosmos,
  title={COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training},
  author={Kim, Sanghwan and Xiao, Rui and Georgescu, Mariana-Iuliana and Alaniz, Stephan and Akata, Zeynep},
  journal={CVPR},
  year={2025}
}
```