Commit e643d68

Release code
1 parent ceb5b28 commit e643d68

63 files changed: +5631 −12 lines

.gitignore (+9)

*.pyc
.vscode
__pycache__
output
.ipynb_checkpoints
notebooks
tcp-checker
checkpoints/
data/

LICENSE (+21)

MIT License

Copyright (c) 2023 Kakao Brain Corp.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

README.md (+189 −12)
# TCL: Text-grounded Contrastive Learning (CVPR'23)

Official PyTorch implementation of [**Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs**](https://arxiv.org/abs/2212.00785), *Junbum Cha, Jonghwan Mun, Byungseok Roh*, CVPR 2023.

**T**ext-grounded **C**ontrastive **L**earning (TCL) is an open-world semantic segmentation framework that uses only image-text pairs. TCL enables a model to learn region-text alignment without a train-test discrepancy.

We will release a demo soon.

<!-- <div align="center"> -->
<!-- <figure> -->
<!-- <img alt="" src="./assets/radar_chart.jpg" width="480"> -->
<!-- </figure> -->
<!-- </div> -->
<div align="center">
<figure>
<img alt="" src="./assets/method.jpg">
</figure>
</div>


## Results

TCL can perform segmentation on both (a, c) existing segmentation benchmarks and (b) arbitrary concepts, such as proper nouns and free-form text, in images in the wild.

<div align="center">
<figure>
<img alt="" src="./assets/main.jpg">
</figure>
</div>

<br/>

<details>
<summary> Additional examples in PASCAL VOC </summary>
<p align="center">
<img src="./assets/examples-voc.jpg" width="800" />
</p>
</details>

<details>
<summary> Additional examples in the wild </summary>
<p align="center">
<img src="./assets/examples-in-the-wild.jpg" width="800" />
</p>
</details>
## Dependencies

We used PyTorch 1.12.1 and torchvision 0.13.1.

```bash
pip install -U openmim
mim install mmcv-full==1.6.2 mmsegmentation==0.27.0
pip install -r requirements.txt
```

Note that the order of packages in the requirements roughly reflects how important it is to match each version. We recommend using the same versions at least for `webdataset`, `mmsegmentation`, and `timm`.
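A quick, optional way to confirm that the installed versions match the pins above (this check is not part of the released code):

```python
# Optional environment check: print the installed versions of the key
# dependencies so they can be compared against the pinned versions above.
from importlib.metadata import version

for pkg in ["torch", "torchvision", "mmcv-full", "mmsegmentation", "timm", "webdataset"]:
    print(f"{pkg:>16}: {version(pkg)}")
```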
## Datasets

Note that much of this section is adapted from the [data preparation section of the GroupViT README](https://github.com/NVlabs/GroupViT#data-preparation).

We use [webdataset](https://webdataset.github.io/webdataset/) as a scalable data format for training and [mmsegmentation](https://github.com/open-mmlab/mmsegmentation) for semantic segmentation evaluation.

The overall file structure is as follows:

```shell
TCL
├── data
│   ├── gcc3m
│   │   ├── gcc-train-000000.tar
│   │   ├── ...
│   ├── gcc12m
│   │   ├── cc-000000.tar
│   │   ├── ...
│   ├── cityscapes
│   │   ├── leftImg8bit
│   │   │   ├── train
│   │   │   ├── val
│   │   ├── gtFine
│   │   │   ├── train
│   │   │   ├── val
│   ├── VOCdevkit
│   │   ├── VOC2012
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClass
│   │   │   ├── ImageSets
│   │   │   │   ├── Segmentation
│   │   ├── VOC2010
│   │   │   ├── JPEGImages
│   │   │   ├── SegmentationClassContext
│   │   │   ├── ImageSets
│   │   │   │   ├── SegmentationContext
│   │   │   │   │   ├── train.txt
│   │   │   │   │   ├── val.txt
│   │   │   ├── trainval_merged.json
│   │   ├── VOCaug
│   │   │   ├── dataset
│   │   │   │   ├── cls
│   ├── ade
│   │   ├── ADEChallengeData2016
│   │   │   ├── annotations
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   │   │   ├── images
│   │   │   │   ├── training
│   │   │   │   ├── validation
│   ├── coco_stuff164k
│   │   ├── images
│   │   │   ├── train2017
│   │   │   ├── val2017
│   │   ├── annotations
│   │   │   ├── train2017
│   │   │   ├── val2017
```

The instructions for preparing each dataset are as follows.
### Training datasets

For training, we use Conceptual Captions 3M and 12M. We use the [img2dataset](https://github.com/rom1504/img2dataset) tool to download and preprocess both datasets.

#### GCC3M

Please download the training-split annotation file from [Conceptual Captions 3M](https://ai.google.com/research/ConceptualCaptions/download) and name it `gcc3m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```
sed -i '1s/^/caption\turl\n/' gcc3m.tsv
img2dataset --url_list gcc3m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc3m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/gcc-train-/' data/gcc3m/*
```
Please refer to the [img2dataset CC3M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc3m.md) for more details.
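Once the shards are downloaded, they can be read directly with `webdataset`. A minimal sketch (illustrative only; the brace-expanded shard range below is an example, so adjust it to what `img2dataset` actually produced):

```python
# Illustrative sketch: iterate a few image-text pairs from the GCC3M shards.
# The shard range is an example and should match the shards on disk.
import webdataset as wds

dataset = (
    wds.WebDataset("data/gcc3m/gcc-train-{000000..000009}.tar")
    .decode("pil")            # decode images into PIL.Image objects
    .to_tuple("jpg", "txt")   # yield (image, caption) pairs
)

for i, (image, caption) in enumerate(dataset):
    print(image.size, caption[:60])
    if i == 2:
        break
```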
#### GCC12M

Please download the annotation file from [Conceptual Captions 12M](https://github.com/google-research-datasets/conceptual-12m) and name it `gcc12m.tsv`.

Then run `img2dataset` to download the image-text pairs and save them in the webdataset format:
```
sed -i '1s/^/caption\turl\n/' gcc12m.tsv
img2dataset --url_list gcc12m.tsv --input_format "tsv" \
    --url_col "url" --caption_col "caption" --output_format webdataset \
    --output_folder data/gcc12m \
    --processes_count 16 --thread_count 64 \
    --image_size 512 --resize_mode keep_ratio --resize_only_if_bigger True \
    --enable_wandb True --save_metadata False --oom_shard_count 6
rename -d 's/^/cc-/' data/gcc12m/*
```
Please refer to the [img2dataset CC12M tutorial](https://github.com/rom1504/img2dataset/blob/main/dataset_examples/cc12m.md) for more details.
### Evaluation datasets

In the paper, we use 8 benchmarks: (i) with a background class: PASCAL VOC, PASCAL Context, and COCO-Object, and (ii) without a background class: PASCAL VOC20, PASCAL Context59, COCO-Stuff, Cityscapes, and ADE20k.
Since some benchmarks share data sources (e.g., VOC20 and VOC), we only need to prepare 5 datasets: PASCAL VOC, PASCAL Context, COCO-Stuff164k, Cityscapes, and ADE20k.

Please download and set up the [PASCAL VOC](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-voc), [PASCAL Context](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#pascal-context), [COCO-Stuff164k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#coco-stuff-164k), [Cityscapes](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#cityscapes), and [ADE20k](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md#ade20k) datasets following the [MMSegmentation data preparation document](https://github.com/open-mmlab/mmsegmentation/blob/master/docs/en/dataset_prepare.md).
#### COCO Object

The COCO-Object dataset uses only the object classes of the COCO-Stuff164k dataset, obtained by collecting the instance segmentation annotations.
Run the following command to convert the instance segmentation annotations to semantic segmentation annotations:

```shell
python convert_dataset/convert_coco.py data/coco_stuff164k/ -o data/coco_stuff164k/
```
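The shipped `convert_coco.py` handles this conversion; as a rough sketch of the underlying idea only (not the actual script, and assuming `pycocotools` plus a standard `instances_*.json` annotation file), instance masks are painted into one semantic label map per image:

```python
# Rough sketch of instance-to-semantic conversion (not the repo's convert_coco.py):
# paint every instance mask of one image into a single semantic label map.
import numpy as np
from PIL import Image
from pycocotools.coco import COCO

coco = COCO("data/coco_stuff164k/annotations/instances_val2017.json")  # assumed path
img_id = coco.getImgIds()[0]
info = coco.loadImgs(img_id)[0]

semantic = np.zeros((info["height"], info["width"]), dtype=np.uint8)
for ann in coco.loadAnns(coco.getAnnIds(imgIds=img_id)):
    mask = coco.annToMask(ann).astype(bool)
    semantic[mask] = ann["category_id"]  # the real script remaps ids to contiguous labels

Image.fromarray(semantic).save("coco_object_example.png")
```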
## Training

We use 16 and 8 NVIDIA V100 GPUs for the main and ablation experiments, respectively.

### Single node

```
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --cfg ./configs/tcl.yml
```

### Multi node

```
torchrun --rdzv_endpoint=$HOST:$PORT --nproc_per_node=auto --nnodes=$NNODES --node_rank=$RANK main.py --cfg ./configs/tcl.yml
```

## Evaluation

Zero-shot transfer to semantic segmentation:

```
torchrun --rdzv_endpoint=localhost:5 --nproc_per_node=auto main.py --resume checkpoints/tcl.pth --eval
```
## Citation

```bibtex
@inproceedings{cha2022tcl,
  title={Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs},
  author={Cha, Junbum and Mun, Jonghwan and Roh, Byungseok},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2023}
}
```


## License

This project is released under the [MIT license](./LICENSE).

assets/main.jpg (296 KB)

assets/method.jpg (270 KB)

configs/default.yml (+90)

_base_: "eval.yml"

data:
  batch_size: 256
  pin_memory: true
  num_workers: 6
  seed: ${train.seed}
  dataset:
    meta:
      gcc3m:
        type: img_txt_pair
        path: ./data/gcc3m
        prefix: gcc-train-{000000..00347}.tar
        length: 2881393
      gcc12m:
        type: img_txt_pair
        path: ./data/gcc12m
        prefix: cc-{000000..001175}.tar
        length: 11286526
    train:
      - gcc3m
      - gcc12m

  img_aug:
    deit_aug: true
    img_size: 224
    img_scale: [0.08, 1.0]
    interpolation: bilinear
    color_jitter: 0.4
    auto_augment: 'rand-m9-mstd0.5-inc1'
    re_prob: 0.25
    re_mode: 'pixel'
    re_count: 1
  text_aug: null

train:
  start_step: 0
  total_steps: 50000
  warmup_steps: 20000
  ust_steps: 0
  base_lr: 1.6e-3
  weight_decay: 0.05
  min_lr: 4e-5
  clip_grad: 5.0
  fp16: true
  fp16_comm: true  # use fp16 grad compression for multi-node training
  seed: 0

  lr_scheduler:
    name: cosine

  optimizer:
    name: adamw
    eps: 1e-8
    betas: [0.9, 0.999]


evaluate:
  pamr: false
  kp_w: 0.0
  bg_thresh: 0.5

  save_logits: null

  eval_only: false
  eval_freq: 5000
  template: simple
  task:
    - voc
    - voc20
    - context
    - context59
    - coco_stuff
    - coco_object
    - cityscapes
    - ade20k


checkpoint:
  resume: ''
  save_topk: 0
  save_all: false  # if true, save every evaluation step


model_name: "default"  # display name in the logger
output: ???
tag: default
print_freq: 20
seed: 0
wandb: false
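The `${train.seed}` interpolation and the `???` mandatory marker follow OmegaConf conventions, and `_base_: "eval.yml"` points at the evaluation config below. As a rough illustration only (the repository's actual config loader may differ), such a file could be loaded and merged like this:

```python
# Illustrative sketch (not the repo's loader): resolve a config that declares a
# `_base_` file and uses ${...} interpolation, via OmegaConf.
from pathlib import Path
from omegaconf import OmegaConf

def load_config(path: str):
    cfg = OmegaConf.load(path)
    base = cfg.pop("_base_", None)
    if base is not None:
        # Merge this config on top of its base, so child values win.
        cfg = OmegaConf.merge(load_config(str(Path(path).parent / base)), cfg)
    return cfg

cfg = load_config("configs/default.yml")
print(cfg.data.batch_size)  # 256
print(cfg.data.seed)        # ${train.seed} resolves to 0
```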

configs/eval.yml (+30)

evaluate:
  pamr: true
  bg_thresh: 0.4
  kp_w: 0.3

  eval_only: true
  template: custom
  task:
    - voc
    - voc20
    - context
    - context59
    - coco_stuff
    - coco_object
    - cityscapes
    - ade20k

  # training splits
  t_voc20: segmentation/configs/_base_/datasets/t_pascal_voc12_20.py
  t_context59: segmentation/configs/_base_/datasets/t_pascal_context59.py

  # evaluation
  voc: segmentation/configs/_base_/datasets/pascal_voc12.py
  voc20: segmentation/configs/_base_/datasets/pascal_voc12_20.py
  context: segmentation/configs/_base_/datasets/pascal_context.py
  context59: segmentation/configs/_base_/datasets/pascal_context59.py
  coco_stuff: segmentation/configs/_base_/datasets/stuff.py
  coco_object: segmentation/configs/_base_/datasets/coco.py
  cityscapes: segmentation/configs/_base_/datasets/cityscapes.py
  ade20k: segmentation/configs/_base_/datasets/ade20k.py
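Each task name above maps to an mmsegmentation-style dataset config under `segmentation/configs/`. Assuming these are standard mmcv config files (mmcv-full 1.x), one of them can be inspected like this (illustrative only):

```python
# Illustrative only: load one of the referenced dataset configs with mmcv 1.x
# and dump the resolved dataset pipeline and paths.
from mmcv import Config

cfg = Config.fromfile("segmentation/configs/_base_/datasets/pascal_voc12.py")
print(cfg.pretty_text)
```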
