
CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation

Paper · Colab Demo

Accepted to ICCV 2025 Oral

📄 Overview

[Figure: CorrCLIP framework]

We reveal that inter-class correlations impair CLIP's segmentation performance. Accordingly, we propose CorrCLIP, which reconstructs the scope and value of patch correlations to reduce inter-class correlations. We further leverage two additional branches to strengthen the final patch features, and we update the segmentation maps with generated masks to improve spatial consistency. CorrCLIP achieves superior performance across eight benchmarks.
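To make the "scope" reconstruction concrete, here is a minimal illustrative sketch of restricting patch correlations to class-agnostic regions. This is our own simplification, not the authors' implementation; the feature shapes, the per-patch region assignment, and the plain softmax attention are all assumptions:

import torch

def region_restricted_attention(feats, region_ids):
    # feats: (N, D) patch features; region_ids: (N,) region index per patch,
    # e.g. derived from class-agnostic masks produced by a mask generator
    sim = feats @ feats.t()                                    # raw patch correlations
    same_region = region_ids[:, None] == region_ids[None, :]   # scope: same region only
    sim = sim.masked_fill(~same_region, float('-inf'))         # cut inter-region links
    attn = sim.softmax(dim=-1)                                 # re-normalized correlation values
    return attn @ feats                                        # aggregated patch features

Because each patch can only attend within its own region, correlations across different objects, and hence across classes, are suppressed by construction.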

📦 Dependencies

# git clone this repository
git clone https://github.com/zdk258/CorrCLIP.git
cd CorrCLIP

# create new anaconda env
conda create -n CorrCLIP python=3.10
conda activate CorrCLIP

# install dependencies
pip install torch==2.1.2 torchvision==0.16.2 torchaudio==2.1.2 --index-url https://download.pytorch.org/whl/cu121
pip install -U openmim
mim install mmengine==0.10.7
mim install mmcv==2.1.0
pip install mmsegmentation==1.2.2
pip install -r requirements.txt

⚙️ Mask Generator Configuration

Modify mask_generator in base_config.py to switch between mask generators. To speed up inference, you can use a smaller model and adjust the corresponding parameters in set_mask_generator.
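A minimal sketch of such an edit, assuming the option is a plain string (only the name mask_generator comes from this README; the valid values mirror the sections below):

# base_config.py
mask_generator = 'sam2'   # assumed format; alternatives per the sections below: None, 'mask2former', 'eomt', 'entityseg'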

SAM2

To replicate the results from our paper, we recommend using the pre-generated SAM2 masks (the generation parameters are listed in the paper). Proceed as follows; a shell sketch of these steps appears after the list.

  1. Set mask_generator to None.
  2. Download region masks.
  3. Extract to the data/ directory.
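A shell sketch of these steps (the archive name and format are placeholders for the actual download):

# hypothetical archive name; use the file from the download link above
mkdir -p data
unzip sam2_masks.zip -d data/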

If you prefer to generate masks on the fly (a loading sketch follows this list):

  1. Set mask_generator to sam2.
  2. Download sam2_hiera_large weights.
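For reference, loading these weights with the sam2 package typically looks like the following; this is a sketch, and the config name and checkpoint path are assumptions that depend on your local layout:

from sam2.build_sam import build_sam2
from sam2.automatic_mask_generator import SAM2AutomaticMaskGenerator

# paths are assumptions; adjust to where you store the files
model = build_sam2("sam2_hiera_l.yaml", "checkpoints/sam2_hiera_large.pt")
mask_generator = SAM2AutomaticMaskGenerator(model, points_per_side=32)

points_per_side is one speed/density knob to tune when generating masks dynamically.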

Mask2Former

  1. Set mask_generator to mask2former.
  2. The first time you run the code, it will automatically download the mask2former-swin-large-coco-panoptic weights from Hugging Face; a sketch of the equivalent manual fetch is shown below.
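That auto-download is roughly equivalent to fetching the checkpoint yourself with transformers (a sketch; the Hugging Face repo id facebook/mask2former-swin-large-coco-panoptic is our assumption based on the weight name above):

from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# assumed repo id; downloads and caches the weights on first call
repo = "facebook/mask2former-swin-large-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(repo)
model = Mask2FormerForUniversalSegmentation.from_pretrained(repo)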

EoMT

  1. Set mask_generator to eomt.
  2. The first time you run the code, it will automatically download the coco_panoptic_eomt_large_640 weights from Hugging Face (see the pre-download sketch below to cache them ahead of time).
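To cache the EoMT weights before an offline run, huggingface_hub can fetch them up front (a sketch; the repo id tue-mps/coco_panoptic_eomt_large_640 is an assumption based on the weight name above):

from huggingface_hub import snapshot_download

# assumed repo id; downloads the checkpoint into the local HF cache
snapshot_download("tue-mps/coco_panoptic_eomt_large_640")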

EntitySeg

  1. Set mask_generator to entityseg.

  2. Install relevant dependencies:

    # install Detectron2
    python -m pip install 'git+https://github.com/facebookresearch/detectron2.git'

    # build the Entity API
    cd CropFormer/entity_api/PythonAPI
    make
    cd ../../..

    # compile the deformable-attention CUDA ops used by CropFormer
    cd CropFormer/mask2former/modeling/pixel_decoder/ops
    python setup.py build install
    cd ../../../../../
    
  3. Download Mask2Former_hornet_3x weights.

🚀 Evaluation

1. Datasets

With background class: PASCAL VOC (VOC21), PASCAL Context (PC60), and COCO Object (Object).

Without background class: VOC20 and PC59 (i.e., VOC21 and PC60 without the background category), Cityscapes (City), ADE20K (ADE), and COCO Stuff 164k (Stuff).

Please follow the MMSeg data preparation document to download and pre-process the datasets, then move them to the data/ directory; a typical layout is sketched below. The COCO Object dataset can be converted from COCO Stuff 164k by executing the following command:

python datasets/cvt_coco_object.py PATH_TO_COCO_STUFF164K -o PATH_TO_COCO164K
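For orientation, a typical data/ layout under MMSeg defaults might look like this (a sketch; the exact folder names follow the MMSeg data preparation document):

data/
├── VOCdevkit/         # VOC2012 (VOC) and VOC2010 (PASCAL Context)
├── ade/               # ADE20K
├── cityscapes/
└── coco_stuff164k/    # COCO Stuff 164k; COCO Object is converted from it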

2. Running

# single-GPU:
python eval.py --config config/cfg_DATASET.py 

# multi-GPU:
bash dist_test.sh config/cfg_DATASET.py NUM_GPU

# evaluation on all datasets:
python eval_all.py
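For example, evaluating on VOC21 with 4 GPUs would look like this (cfg_voc21.py is a hypothetical name; match the actual files in config/):

# hypothetical config name following the cfg_DATASET.py pattern
bash dist_test.sh config/cfg_voc21.py 4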

3. Results

CorrCLIP's performance improves as the mask generator improves. The following tables report results with different mask generators across the eight benchmarks:

ViT-B:

| Mask Generator | VOC21 | VOC20 | PC59 | PC60 | City | ADE | Stuff | Object | Avg |
|----------------|-------|-------|------|------|------|-----|-------|--------|-----|
| SAM2₃₂         | 74.8  | 88.8  | 48.8 | 44.2 | 49.4 | 26.9| 31.6  | 43.7   | 51.0|
| SAM2₈          | 73.9  | 87.6  | 48.0 | 43.7 | 47.9 | 26.5| 31.8  | 43.6   | 50.4|
| Mask2Former    | 73.9  | 87.8  | 48.2 | 43.7 | 44.3 | 24.6| 33.9  | 46.2   | 50.3|
| EoMT           | 76.0  | 90.6  | 50.4 | 45.4 | 48.0 | 26.7| 34.5  | 46.6   | 52.3|
| EntitySeg      | 76.2  | 89.6  | 50.7 | 45.7 | 51.6 | 28.6| 32.4  | 44.5   | 52.4|

ViT-L:

| Mask Generator | VOC21 | VOC20 | PC59 | PC60 | City | ADE | Stuff | Object | Avg |
|----------------|-------|-------|------|------|------|-----|-------|--------|-----|
| SAM2₃₂         | 76.7  | 91.5  | 50.8 | 44.9 | 51.1 | 30.7| 34.0  | 49.4   | 53.6|
| SAM2₈          | 76.2  | 91.2  | 49.9 | 44.2 | 48.9 | 29.8| 33.7  | 49.0   | 52.9|
| Mask2Former    | 76.3  | 90.8  | 50.2 | 44.6 | 45.4 | 26.9| 35.6  | 52.2   | 52.7|
| EoMT           | 78.0  | 92.2  | 52.8 | 46.4 | 50.2 | 30.1| 36.3  | 52.9   | 54.9|
| EntitySeg      | 78.9  | 92.0  | 53.0 | 46.8 | 53.7 | 32.6| 34.9  | 51.0   | 55.4|

🤖 Gradio Inference

We provide a Gradio demo to perform segmentation on images with custom category names. You can run it on your own machine.

The demo offers two optional mask generators: SAM2 and EntitySeg. Using them requires their respective model weights and dependencies.

python demo_gradio.py
[Figure: CorrCLIP Gradio demo]

✍️ Citation

@article{zhang2024corrclip,
  title={CorrCLIP: Reconstructing Patch Correlations in CLIP for Open-Vocabulary Semantic Segmentation},
  author={Zhang, Dengke and Liu, Fagui and Tang, Quan},
  journal={arXiv preprint arXiv:2411.10086},
  year={2024}
}

🙏 Acknowledgement

Our implementation is based on ClearCLIP, ProxyCLIP, DINO, SAM2, Mask2Former, EoMT, and EntitySeg. Thanks for their awesome work!
