A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning (CVPR 2025)
By Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, and Xiaojuan Qi.
An overview of this paper. (a) We conduct a comprehensive study evaluating pre-trained vision models (PVMs) on visuomotor control and perception tasks, analyzing how different (model, data) pre-training combinations affect performance. Our analysis reveals that DINO/iBOT excel while MAE underperforms. (b) We investigate the performance drop of DINO/iBOT when trained on non-(single-)object-centric (NOC) data, discovering that they struggle to learn objectness from NOC data, a capability that strongly correlates with robot manipulation performance. (c) We introduce SlotMIM, which incorporates explicit objectness guidance during training to effectively learn object-centric representations from NOC data. (d) Through scaled-up pre-training and evaluation across six tasks, we demonstrate that SlotMIM adaptively learns different types of objectness depending on the characteristics of the pre-training dataset, outperforming existing methods.
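For intuition, the object-centric grouping idea can be sketched as below. This is a minimal, hypothetical PyTorch snippet, not the released implementation: the names `group_patches` and `prototypes` are illustrative, and it only shows the SlotCon/Slot-Attention-style soft assignment of patch features to learnable prototypes ("slots"):

```python
import torch
import torch.nn.functional as F

def group_patches(patch_feats, prototypes, temperature=0.1):
    """Softly assign ViT patch features to prototype ("slot") vectors.

    patch_feats: (B, N, D) patch embeddings from a ViT backbone.
    prototypes:  (K, D) learnable prototype vectors.
    Returns slot features (B, K, D) and the assignment map (B, N, K).
    """
    # Cosine similarity between patches and prototypes.
    feats = F.normalize(patch_feats, dim=-1)      # (B, N, D)
    protos = F.normalize(prototypes, dim=-1)      # (K, D)
    logits = feats @ protos.t() / temperature     # (B, N, K)

    # Soft assignment of each patch to a prototype.
    assign = logits.softmax(dim=-1)               # (B, N, K)

    # Aggregate patches into slot features, weighted by assignment.
    weights = assign / assign.sum(dim=1, keepdim=True).clamp(min=1e-6)
    slots = weights.transpose(1, 2) @ patch_feats # (B, K, D)
    return slots, assign

# Example: 196 ViT patches, 64 prototypes, 256-dim features.
x = torch.randn(2, 196, 256)
p = torch.randn(64, 256)
slots, assign = group_patches(x, p)
print(slots.shape, assign.shape)  # (2, 64, 256), (2, 196, 64)
```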
Set up the experimental environment as follows:
- Create the environment
```bash
conda create -n slotmim python=3.9 -y
conda activate slotmim
```
- Install PyTorch & torchvision (you may also pick another compatible version)
```bash
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```
- Clone our repo
```bash
git clone https://github.com/CVMI-Lab/SlotMIM && cd ./SlotMIM
```
- (Optional) Create soft links for the datasets
```bash
mkdir datasets
ln -s ${PATH_TO_COCO} ./datasets/coco
ln -s ${PATH_TO_IMAGENET} ./datasets/imagenet
```
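As a quick sanity check after the steps above, the following snippet (a minimal sketch, assuming the symlink layout shown in the previous step) verifies the PyTorch install and the dataset paths:

```python
import pathlib
import torch
import torchvision

# Verify the PyTorch / torchvision install and CUDA visibility.
print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())

# Verify that the (optional) dataset symlinks resolve.
for name in ["coco", "imagenet"]:
    path = pathlib.Path("datasets") / name
    print(path, "->", path.resolve(), "exists:", path.exists())
```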
We currently provide the code for pre-training SlotMIM, along with evaluation scripts for object discovery, classification, object detection, and segmentation. For training, please check ./scripts/; for evaluation, please check ./transfer/, eval_voc.py, and eval_knn.py. We have also released pre-trained checkpoints of our model and of other re-implemented baselines here. Please feel free to explore them for now; we will continue to update the README with more instructions and integrate evaluation scripts for robotics tasks in the future.
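For example, a released checkpoint can be inspected with plain PyTorch before plugging it into the evaluation scripts. The file name below is a placeholder for whichever checkpoint you download, and the key layout is an assumption to be checked:

```python
import torch

# Placeholder path: substitute the checkpoint file you downloaded.
ckpt = torch.load("slotmim_checkpoint.pth", map_location="cpu")

# Checkpoints are typically dicts; inspect the top-level keys and a few
# parameter names to find which sub-dict holds the model weights.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt)[:10])
    if "state_dict" in ckpt:
        print(list(ckpt["state_dict"])[:5])
```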
If you find this repo useful for your research, please consider citing our papers:
```bibtex
@inproceedings{wen2022slotcon,
  title={Self-Supervised Visual Representation Learning with Semantic Grouping},
  author={Wen, Xin and Zhao, Bingchen and Zheng, Anlin and Zhang, Xiangyu and Qi, Xiaojuan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@inproceedings{wen2025slotmim,
  title={A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning},
  author={Wen, Xin and Zhao, Bingchen and Chen, Yilun and Pang, Jiangmiao and Qi, Xiaojuan},
  booktitle={CVPR},
  year={2025}
}
```
Our codebase builds upon several publicly available open-source projects. Specifically, we have modified and integrated the following repos into this project: iBOT, MAE, SlotCon, PixPro, DINO, and Slot Attention.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.