1010The official implementation of the paper "[Vision Transformer Adapter for Dense Predictions](https://arxiv.org/abs/2205.08534)".
1111
1212## News
13-
14- (2022/06/04) Segmentation is released.\
15- (2022/06/02) Detection is released and segmentation will come soon.\
16- (2022/05/17) ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev.\
13+ (2022/06/09) ViT-Adapter-L yields 60.4 box AP and 52.5 mask AP on COCO test-dev.\
14+ (2022/06/04) Code and models are released.\
15+ (2022/05/17) ~~ViT-Adapter-L yields 60.1 box AP and 52.1 mask AP on COCO test-dev.~~\
1716(2022/05/12) ViT-Adapter-L reaches 85.2 mIoU on Cityscapes test set without coarse data.\
1817(2022/05/05) ViT-Adapter-L achieves the SOTA on ADE20K val set with 60.5 mIoU!
1918
@@ -29,35 +28,73 @@ This work investigates a simple yet powerful adapter for Vision Transformer (ViT
2928
3029## SOTA Model Zoo
3130
32- COCO test-dev
33-
34- | Method | Framework | Pre-train | Lr schd | box AP | mask AP | #Param |
35- | :------------------:| :---------:| :---------:| :-------:| :------------------------------------------------------------------------------------------:| :------------------------------------------------------------------------------------------:| :------:|
36- | ViT-Adapter-L | HTC++ | BEiT | 3x | [58.5](https://drive.google.com/file/d/11zpPSvmuAn7aP5brxzHE8naObnOfFxby/view?usp=sharing) | [50.8](https://drive.google.com/file/d/1wIbtzfHfPqkvZaSivzcsh4HWu1oSiun6/view?usp=sharing) | 401M |
37- | ViT-Adapter-L (MS) | HTC++ | BEiT | 3x | [60.1](https://drive.google.com/file/d/1i-qjgUK4CMwZcmu5pkndldwfVbdkw5sU/view?usp=sharing) | [52.1](https://drive.google.com/file/d/16mlEOPY7K-Xpx_CL650A-LWbVDm2vl4X/view?usp=sharing) | 401M |
38-
39- ADE20K val
31+ **COCO mini-val and test-dev**
32+
33+
34+ <table >
35+ <tr align =center >
36+ <td rowspan="2" align=center><b>Method</b></td>
37+ <td rowspan="2" align=center><b>Framework</b></td>
38+ <td rowspan="2" align=center><b>Pre-train</b></td>
39+ <td rowspan="2" align=center><b>Schd</b></td>
40+ <td colspan="2" align=center><b>mini-val</b></td>
41+ <td colspan="2" align=center><b>test-dev</b></td>
42+ <td rowspan="2" align=center><b>#Param</b></td>
43+ </tr >
44+ <tr >
45+ <td>box AP</td>
46+ <td>mask AP</td>
47+ <td>box AP</td>
48+ <td>mask AP</td>
49+ </tr >
50+ <tr align =center >
51+ <td>ViT-Adapter-L</td>
52+ <td>HTC++</td>
53+ <td>BEiT</td>
54+ <td>3x</td>
55+ <td>58.4</td>
56+ <td>50.8</td>
57+ <td><a href="https://drive.google.com/file/d/1lXQxf5PJ0g0bQNkMMrhG63jal0NsmYjb/view?usp=sharing">58.9</a></td>
58+ <td><a href="https://drive.google.com/file/d/1nyuONJcHHXki0Cn8dCgbPZ9D_MURh47t/view?usp=sharing">51.3</a></td>
59+ <td>401M</td>
60+ </tr >
61+ <tr align =center >
62+ <td>ViT-Adapter-L$^\dagger$</td>
63+ <td>HTC++</td>
64+ <td>BEiT</td>
65+ <td>3x</td>
66+ <td>60.2</td>
67+ <td>52.2</td>
68+ <td><a href="https://drive.google.com/file/d/15t2Oc3FiNeLr6RnKOJ-0IbI7b2LalxbX/view?usp=sharing">60.4</a></td>
69+ <td><a href="https://drive.google.com/file/d/1TIPOJC6ieZS_ZRNCbo_AW4UqYAkQIjyN/view?usp=sharing">52.5</a></td>
70+ <td>401M</td>
71+ </tr >
72+ </table >
73+
74+ $\dagger$ denotes multi-scale testing.
75+
76+ **ADE20K val**
4077
4178| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
4279| :-------------:| :-----------:| :---------------:| :-----:| :---------:| :------------------------------------------------------------------------------------------:| :------------------------------------------------------------------------------------------:| :------:|
4380| ViT-Adapter-L | UperNet | BEiT | 160k | 640 | [58.0](https://drive.google.com/file/d/1KsV4QPfoRi5cj2hjCzy8VfWih8xCTrE3/view?usp=sharing) | [58.4](https://drive.google.com/file/d/1haeTUvQhKCM7hunVdK60yxULbRH7YYBK/view?usp=sharing) | 451M |
4481| ViT-Adapter-L | Mask2Former | BEiT | 160k | 640 | [58.3](https://drive.google.com/file/d/1jj56lSbc2s4ZNc-Hi-w6o-OSS99oi-_g/view?usp=sharing) | [59.0](https://drive.google.com/file/d/1hgpZB5gsyd7LTS7Aay2CbHmlY10nafCw/view?usp=sharing) | 568M |
4582| ViT-Adapter-L | Mask2Former | COCO-Stuff-164k | 80k | 896 | [59.4](https://drive.google.com/file/d/1B_1XSwdnLhjJeUmn1g_nxfvGJpYmYWHa/view?usp=sharing) | [60.5](https://drive.google.com/file/d/1UtjmgcYKR-2h116oQXklUYOVcTw15woM/view?usp=sharing) | 571M |
4683
47- Cityscapes val/test
84+ **Cityscapes val/test**
4885
4986| Method | Framework | Pre-train | Iters | Crop Size | val mIoU | val/test +MS | #Param |
5087| :-------------:| :-----------:| :---------:| :-----:| :---------:| :------------------------------------------------------------------------------------------:| :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------:| :------:|
5188| ViT-Adapter-L | Mask2Former | Mapillary | 80k | 896 | [84.9](https://drive.google.com/file/d/1LKy0zz-brCBbKGmUWquadILaBHdDLR6s/view?usp=sharing) | [85.8](https://drive.google.com/file/d/1LSJvK1BPSbzm9eWpKL8Xo7RmYBrd2xux/view?usp=sharing) / [85.2](https://www.cityscapes-dataset.com/anonymous-results/?id=0ca6821dc3183ff970bd5266f812df2eaa4519ecb1973ca1308d65a3b546bf27) | 571M |
5289
53- COCO-Stuff-10K
90+ **COCO-Stuff-10K**
5491
5592| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
5693| :-------------:| :-----------:| :---------:| :-----:| :---------:| :------------------------------------------------------------------------------------------:| :------------------------------------------------------------------------------------------:| :------:|
5794| ViT-Adapter-L | UperNet | BEiT | 80k | 512 | [51.0](https://drive.google.com/file/d/1xZodiAvOLGaLtMGx_btYVZIMC2VKrDhI/view?usp=sharing) | [51.4](https://drive.google.com/file/d/1bmFG9GA4bRqOEJfqXcO7nWYPwG3wSk2J/view?usp=sharing) | 451M |
5895| ViT-Adapter-L | Mask2Former | BEiT | 40k | 512 | [53.2](https://drive.google.com/file/d/1Buewc1n7GBAcBDXeia-QarujrDZqc_Sx/view?usp=sharing) | [54.2](https://drive.google.com/file/d/1kQgJUHDeQoO3pPY6QoXRKwyF7heT7wCJ/view?usp=sharing) | 568M |
5996
60- Pascal Context
97+ **Pascal Context**
6198
6299| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
63100| :-------------:| :-----------:| :---------:| :-----:| :---------:| :------------------------------------------------------------------------------------------:| :------------------------------------------------------------------------------------------:| :------:|
@@ -68,7 +105,7 @@ Pascal Context
68105
69106### COCO mini-val
70107
71- Baseline Detectors
108+ **Baseline Detectors**
72109
73110| Method | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param |
74111| :-------------:| :----------:| :---------:| :-------:| :---:| :------:| :-------:| :------:|
@@ -77,7 +114,7 @@ Baseline Detectors
77114| ViT-Adapter-B | Mask R-CNN | DeiT | 3x | Yes | 49.6 | 43.6 | 120M |
78115| ViT-Adapter-L | Mask R-CNN | AugReg | 3x | Yes | 50.9 | 44.8 | 348M |
79116
80- Advanced Detectors
117+ **Advanced Detectors**
81118
82119| Method | Framework | Pre-train | Lr schd | Aug | box AP | mask AP | #Param |
83120| :-------------:| :-------------------:| :---------:| :-------:| :---:| :------:| :-------:| :------:|
@@ -88,7 +125,7 @@ Advanced Detectors
88125| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 25ep | LSJ | 50.3 | 44.7 | 122M |
89126| ViT-Adapter-B | Upgraded Mask R-CNN | MAE | 50ep | LSJ | 50.8 | 45.1 | 122M |
90127
91- ADE20K val
128+ **ADE20K val**
92129
93130| Method | Framework | Pre-train | Iters | Crop Size | mIoU | +MS | #Param |
94131| :-------------:| :---------:| :---------:| :-----:| :---------:| :----:| :----:| :------:|