A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning (CVPR 2025)
By Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, and Xiaojuan Qi.
An overview of this paper. (a) We conduct a comprehensive study evaluating pre-trained vision models (PVMs) on visuomotor control and perception tasks, analyzing how different (model, data) pre-training combinations affect performance. Our analysis reveals that DINO/iBOT excel while MAE underperforms. (b) We investigate the performance drop of DINO/iBOT when trained on non-(single-)object-centric (NOC) data, discovering that they struggle to learn objectness from NOC data, a capability that strongly correlates with robot manipulation performance. (c) We introduce SlotMIM, which incorporates explicit objectness guidance during training to effectively learn object-centric representations from NOC data. (d) Through scaled-up pre-training and evaluation across six tasks, we demonstrate that SlotMIM adaptively learns different types of objectness depending on the characteristics of the pre-training dataset, outperforming existing methods.
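For intuition, the object-centric grouping idea can be sketched as below. This is a minimal, hypothetical PyTorch snippet, not the released implementation: the names `group_patches` and `prototypes` are illustrative, and it only shows the SlotCon/Slot-Attention-style soft assignment of patch features to learnable prototypes ("slots"):

```python
import torch
import torch.nn.functional as F

def group_patches(patch_feats, prototypes, temperature=0.1):
    """Softly assign ViT patch features to prototype ("slot") vectors.

    patch_feats: (B, N, D) patch embeddings from a ViT backbone.
    prototypes:  (K, D) learnable prototype vectors.
    Returns slot features (B, K, D) and the assignment map (B, N, K).
    """
    # Cosine similarity between patches and prototypes.
    feats = F.normalize(patch_feats, dim=-1)      # (B, N, D)
    protos = F.normalize(prototypes, dim=-1)      # (K, D)
    logits = feats @ protos.t() / temperature     # (B, N, K)

    # Soft assignment of each patch to a prototype.
    assign = logits.softmax(dim=-1)               # (B, N, K)

    # Aggregate patches into slot features, weighted by assignment.
    weights = assign / assign.sum(dim=1, keepdim=True).clamp(min=1e-6)
    slots = weights.transpose(1, 2) @ patch_feats # (B, K, D)
    return slots, assign

# Example: 196 ViT patches, 64 prototypes, 256-dim features.
x = torch.randn(2, 196, 256)
p = torch.randn(64, 256)
slots, assign = group_patches(x, p)
print(slots.shape, assign.shape)  # (2, 64, 256), (2, 196, 64)
```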
Set up the experimental environment as follows:
- Create the environment
```bash
conda create -n slotmim python=3.9 -y
conda activate slotmim
```
- Install PyTorch & torchvision (you may also pick another compatible version)
```bash
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
```
- Clone our repo
```bash
git clone https://github.com/CVMI-Lab/SlotMIM && cd ./SlotMIM
```
- (Optional) Create soft links for the datasets
```bash
mkdir datasets
ln -s ${PATH_TO_COCO} ./datasets/coco
ln -s ${PATH_TO_IMAGENET} ./datasets/imagenet
```
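As a quick sanity check after the steps above, the following snippet (a minimal sketch, assuming the symlink layout shown in the previous step) verifies the PyTorch install and the dataset paths:

```python
import pathlib
import torch
import torchvision

# Verify the PyTorch / torchvision install and CUDA visibility.
print(torch.__version__, torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())

# Verify that the (optional) dataset symlinks resolve.
for name in ["coco", "imagenet"]:
    path = pathlib.Path("datasets") / name
    print(path, "->", path.resolve(), "exists:", path.exists())
```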
We currently provide the code for pre-training SlotMIM, along with evaluation scripts for object discovery, classification, object detection, and segmentation. For training, please check ./scripts/; for evaluation, please check ./transfer/, eval_voc.py, and eval_knn.py. We have also released pre-trained checkpoints of our model and of other re-implemented baselines here. Please feel free to explore them for now; we will continue to update the README with more instructions and integrate evaluation scripts for robotics tasks in the future.
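For example, a released checkpoint can be inspected with plain PyTorch before plugging it into the evaluation scripts. The file name below is a placeholder for whichever checkpoint you download, and the key layout is an assumption to be checked:

```python
import torch

# Placeholder path: substitute the checkpoint file you downloaded.
ckpt = torch.load("slotmim_checkpoint.pth", map_location="cpu")

# Checkpoints are typically dicts; inspect the top-level keys and a few
# parameter names to find which sub-dict holds the model weights.
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt)[:10])
    if "state_dict" in ckpt:
        print(list(ckpt["state_dict"])[:5])
```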
If you find this repo useful for your research, please consider citing our papers:
```bibtex
@inproceedings{wen2022slotcon,
  title={Self-Supervised Visual Representation Learning with Semantic Grouping},
  author={Wen, Xin and Zhao, Bingchen and Zheng, Anlin and Zhang, Xiangyu and Qi, Xiaojuan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@inproceedings{wen2025slotmim,
  title={A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning},
  author={Wen, Xin and Zhao, Bingchen and Chen, Yilun and Pang, Jiangmiao and Qi, Xiaojuan},
  booktitle={CVPR},
  year={2025}
}
```
Our codebase builds upon several publicly available open-source projects. Specifically, we have modified and integrated the following repos into this project: iBOT, MAE, SlotCon, PixPro, DINO, and Slot Attention.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.