# TCL: Text-grounded Contrastive Learning (CVPR'23)

Official PyTorch implementation of [**Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs**](https://arxiv.org/abs/2212.00785), by *Junbum Cha, Jonghwan Mun, and Byungseok Roh*, CVPR 2023.

**T**ext-grounded **C**ontrastive **L**earning (TCL) is an open-world semantic segmentation framework trained using only image-text pairs. TCL enables a model to learn region-text alignment without a train-test discrepancy.

We will release a demo soon.

<div align="center">
<figure>
  <img alt="" src="./assets/method.jpg">
</figure>
</div>


## Results

TCL can perform segmentation both on (a, c) existing segmentation benchmarks and on (b) arbitrary concepts in the wild, such as proper nouns and free-form text.

<div align="center">
<figure>
  <img alt="" src="./assets/main.jpg">
</figure>
</div>

<br/>

<details>
<summary> Additional examples in PASCAL VOC </summary>
<p align="center">
<img src="./assets/examples-voc.jpg" width="800" />
</p>
</details>

<details>
<summary> Additional examples in the wild </summary>
<p align="center">
<img src="./assets/examples-in-the-wild.jpg" width="800" />
</p>
</details>


## Dependencies

We used PyTorch 1.12.1 and torchvision 0.13.1. Install the remaining dependencies with:

```bash
pip install -U openmim
mim install mmcv-full==1.6.2 mmsegmentation==0.27.0
pip install -r requirements.txt
```

Note that the order of packages in the requirements file roughly reflects how important it is to match the pinned version.
We recommend matching our versions at least for `webdataset`, `mmsegmentation`, and `timm`.
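
As a quick, optional sanity check after installation, you can print the installed versions of the key packages (a minimal sketch; the package list simply mirrors the pins above):

```bash
# show name/version pairs for the key pinned packages
pip show torch torchvision mmcv-full mmsegmentation timm webdataset | grep -E "^(Name|Version)"
```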


## Datasets

Note that much of this section is adapted from the [data preparation section of the GroupViT README](https://github.com/NVlabs/GroupViT#data-preparation).

We use [webdataset](https://webdataset.github.io/webdataset/) as a scalable data format for training and [mmsegmentation](https://github.com/open-mmlab/mmsegmentation) for semantic segmentation evaluation.

The overall file structure is as follows:

```shell
TCL
├── data
│   ├── gcc3m
│   │   ├── gcc-train-000000.tar
│   │   ├── ...
│   ├── gcc12m
│   │   ├── cc-000000.tar
│   │   ├── ...
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── VOCdevkit
│   │   ├── VOC2012
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClass
│   │   │   ├── ImageSets
│   │   │   │   ├── Segmentation
│   │   ├── VOC2010
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClassContext
│   │   │   ├── ImageSets
│   │   │   │   ├── SegmentationContext
│   │   │   │   │   ├── train.txt
│   │   │   │   │   ├── val.txt
│   │   │   ├── trainval_merged.json
│   │   ├── VOCaug
│   │   │   ├── dataset
│   │   │   │   ├── cls
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   ├── coco_stuff164k
│   │   ├── images
│   │   │   ├── train2017
│   │   │   ├── val2017
│   │   ├── annotations
│   │   │   ├── train2017
│   │   │   ├── val2017
```

The instructions for preparing each dataset are as follows.

### Training datasets

For training, we use Conceptual Captions 3M and 12M. We use the [img2dataset](https://github.com/rom1504/img2dataset) tool to download and preprocess both datasets.

#### GCC3M

Please download the training split annotation file from [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/download) and name it `gcc3m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc3m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' data/gcc3m/*
```
Please refer to the [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for more details.
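
If the download succeeded, each shard under `data/gcc3m` is a tar file of paired image-caption samples. A minimal, optional spot-check of one shard (the shard name below is just an example taken from the file structure above):

```bash
# list the first entries of a shard; expect paired .jpg and .txt files per sample
tar -tf data/gcc3m/gcc-train-000000.tar | head
```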

#### GCC12M

Please download the annotation file from [Conceptual 12M](https://github.com/google-research-datasets/conceptual-12m) and name it `gcc12m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```
sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc12m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/cc-/' data/gcc12m/*
```
Please refer to the [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for more details.


### Evaluation datasets

In the paper, we use 8 benchmarks: (i) with a background class: PASCAL VOC, PASCAL Context, and COCO-Object, and (ii) without a background class: PASCAL VOC20, PASCAL Context59, COCO-Stuff, Cityscapes, and ADE20k.
Since some benchmarks share data sources (e.g., VOC20 and VOC), we need to prepare 5 datasets: PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k.

Please download and set up the [PASCAL VOC](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc), [PASCAL Context](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context), [COCO-Stuff164k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k), [Cityscapes](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#cityscapes), and [ADE20k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#ade20k) datasets following the [MMSegmentation data preparation document](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md).
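
After setup, the directories should match the file structure shown above. An optional, minimal check (paths follow that structure):

```bash
# each command should list a few entries if the corresponding dataset is in place
ls data/VOCdevkit/VOC2012/JPEGImages | head -3
ls data/ade/ADEChallengeData2016/images/validation | head -3
ls data/cityscapes/leftImg8bit/val | head -3
ls data/coco_stuff164k/images/val2017 | head -3
```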

#### COCO-Object

The COCO-Object dataset uses only the object classes of the COCO-Stuff164k dataset, collected from its instance segmentation annotations.
Run the following command to convert the instance segmentation annotations into semantic segmentation annotations:

```shell
python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/
```


## Training

We use 16 and 8 NVIDIA V100 GPUs for the main and ablation experiments, respectively.

### Single node

```
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --cfg ./configs/tcl.yml
```

### Multi node

```
torchrun --rdzv_endpoint=$HOST:$PORT --nproc_per_node=auto --nnodes=$NNODES --node_rank=$RANK main.py --cfg ./configs/tcl.yml
```
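
For example, a hypothetical two-node run (the hostname `node0` and port `29500` below are placeholders) could look like:

```bash
# on node 0, which also serves as the rendezvous host
torchrun --rdzv_endpoint=node0:29500 --nproc_per_node=auto --nnodes=2 --node_rank=0 main.py --cfg ./configs/tcl.yml
# on node 1
torchrun --rdzv_endpoint=node0:29500 --nproc_per_node=auto --nnodes=2 --node_rank=1 main.py --cfg ./configs/tcl.yml
```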

## Evaluation

Zero-shot transfer to semantic segmentation:

```
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --resume checkpoints/tcl.pth --eval
```


## Citation

```bibtex
@inproceedings{cha2022tcl,
    title={Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs},
    author={Cha, Junbum and Mun, Jonghwan and Roh, Byungseok},
    booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
    year={2023}
}
```


## License

This project is released under the [MIT license](./LICENSE).