A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning

A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning (CVPR 2025)
By Xin Wen, Bingchen Zhao, Yilun Chen, Jiangmiao Pang, and Xiaojuan Qi.

Introduction

[Figure: framework overview]

An overview of this paper. (a) We conduct a comprehensive study evaluating pre-trained vision models (PVMs) on visuomotor control and perception tasks, analyzing how different pre-training (model, data) combinations affect performance. Our analysis reveals that DINO and iBOT excel while MAE underperforms. (b) We investigate the performance drop of DINO/iBOT when trained on non-(single-)object-centric (NOC) data, discovering that they struggle to learn objectness from NOC data, a capability that strongly correlates with robot manipulation performance. (c) We introduce SlotMIM, which incorporates explicit objectness guidance during training to effectively learn object-centric representations from NOC data. (d) Through scaled-up pre-training and evaluation across six tasks, we demonstrate that SlotMIM adaptively learns different types of objectness depending on the characteristics of the pre-training dataset, outperforming existing methods.

Getting started

Requirements

The following is an example of setting up the experimental environment:

  • Create the environment
conda create -n slotmim python=3.9 -y
conda activate slotmim
  • Install PyTorch & torchvision (other recent versions should also work; see the sanity check after this list)
conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=12.1 -c pytorch -c nvidia
  • Clone our repo
git clone https://github.com/CVMI-Lab/SlotMIM && cd ./SlotMIM
  • (Optional) Create a soft link for the datasets
mkdir datasets
ln -s ${PATH_TO_COCO} ./datasets/coco
ln -s ${PATH_TO_IMAGENET} ./datasets/imagenet

At this stage, we have provided the code for pre-training SlotMIM, along with evaluation scripts for object discovery, classification, object detection, and segmentation. For training, please check ./scripts/; for evaluation, please check ./transfer/, eval_voc.py, and eval_knn.py. We have also released pre-trained checkpoints of our model and other re-implemented baselines here. Please feel free to explore them for now; we will continue to update this README with more instructions and integrate evaluation scripts for robotics tasks in the future.
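For reference, a typical workflow might look like the sketch below. The pre-training script name and the eval_knn.py flags shown here are placeholders of ours, not confirmed by the repo, so please consult ./scripts/ and the evaluation files for the actual names and arguments:

# launch SlotMIM pre-training (hypothetical script name; see ./scripts/ for the real ones)
bash ./scripts/pretrain_slotmim.sh

# run k-NN evaluation on a released checkpoint (the --pretrained flag is an assumption; see eval_knn.py)
python eval_knn.py --pretrained /path/to/checkpoint.pth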

Citing this work

If you find this repo useful for your research, please consider citing our papers:

@inproceedings{wen2022slotcon,
  title={Self-Supervised Visual Representation Learning with Semantic Grouping},
  author={Wen, Xin and Zhao, Bingchen and Zheng, Anlin and Zhang, Xiangyu and Qi, Xiaojuan},
  booktitle={Advances in Neural Information Processing Systems},
  year={2022}
}

@inproceedings{wen2025slotmim,
  title={A Data-Centric Revisit of Pre-Trained Vision Models for Robot Learning},
  author={Wen, Xin and Zhao, Bingchen and Chen, Yilun and Pang, Jiangmiao and Qi, Xiaojuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2025}
}

Acknowledgment

Our codebase builds upon several publicly available repositories. Specifically, we have modified and integrated the following into this project: iBOT, MAE, SlotCon, PixPro, DINO, and Slot Attention.

License

This project is licensed under the Apache License 2.0 - see the LICENSE file for details.
