This is the official PyTorch implementation of the work:
VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection
Wuyang Li¹, Zhu Yu², Alexandre Alahi¹
¹École Polytechnique Fédérale de Lausanne (EPFL); ²Zhejiang University
Contact: [email protected]
VoxDet addresses semantic occupancy prediction with an instance-centric formulation inspired by dense object detection: a Voxel-to-Instance (VoxNT) trick freely transfers voxel-level class labels to instance-level offset labels (a conceptual sketch follows the highlights below).
- Versatile: Adaptable to various voxel-based scenarios, such as camera and LiDAR settings.
- Powerful: Achieves joint state-of-the-art performance on both camera-based and LiDAR-based SSC benchmarks.
- Efficient: Fast (~1.3× speed-up) and lightweight (~57.9% parameter reduction).
- Leaderboard Topper: Achieves 63.0 IoU (single-frame model), securing 1st place on the SemanticKITTI leaderboard.
Note that VoxDet is a single-frame single-model method without extra data and labels.
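To make the idea concrete, here is a minimal conceptual sketch of a voxel-to-instance label transfer. It is not the official VoxNT implementation: using per-class connected components as pseudo-instances and instance centers as offset targets are our assumptions for illustration only.

```python
# Conceptual sketch only -- NOT the official VoxNT code.
# Pseudo-instances are per-class connected components; every voxel is
# assigned the offset from itself to its pseudo-instance center.
import numpy as np
from scipy import ndimage

def voxel_to_instance_offsets(voxel_labels: np.ndarray, free_class: int = 0):
    """voxel_labels: (X, Y, Z) int class labels -> (X, Y, Z, 3) offset labels."""
    coords = np.stack(
        np.meshgrid(*[np.arange(s) for s in voxel_labels.shape], indexing="ij"),
        axis=-1,
    ).astype(np.float32)
    offsets = np.zeros(voxel_labels.shape + (3,), dtype=np.float32)
    for cls in np.unique(voxel_labels):
        if cls == free_class:
            continue  # skip free space
        components, num = ndimage.label(voxel_labels == cls)
        for inst in range(1, num + 1):
            mask = components == inst
            center = coords[mask].mean(axis=0)      # pseudo-instance center
            offsets[mask] = center - coords[mask]   # voxel-to-center offset
    return offsets
```

The point of such a transfer is that it needs no extra annotation: the offset labels are derived "for free" from the voxel-level class labels already present in the dataset.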
Please refer to docs/install.md for detailed installation instructions. This work is built on the CGFormer codebase, so installation, data preparation, training, and inference are consistent with CGFormer. If something is missing, you can check that codebase :)
Please refer to docs/dataset.md for detailed dataset preparation instructions. Remember to change `data_root`, `ann_file`, and `stereo_depth_root` in every config file to your data paths.
Download the depth pretraining model from OneDrive, then change `load_from` in all config files accordingly. This pretraining is consistent with CGFormer and uses the config configs/pretrain.py.
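For orientation, a config excerpt might look as follows; the key names come from the instructions above, while the values and surrounding structure are placeholders, not the actual config contents.

```python
# Hypothetical config excerpt -- replace the placeholder paths with yours.
data_root = "/path/to/semantickitti"            # dataset root
ann_file = "/path/to/semantickitti/labels.pkl"  # annotation file (placeholder name)
stereo_depth_root = "/path/to/stereo_depth"     # precomputed stereo depth
load_from = "ckpts/pretrain.ckpt"               # depth pretraining checkpoint (placeholder name)
```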
Please refer to the reproduced logs in the logs folder (regenerated after code cleaning) to verify that every step is correct.
Train VoxDet on SemanticKITTI (camera) with 2× A100 40G:
```bash
CUDA_VISIBLE_DEVICES=0,1 python main.py \
--config_path configs/voxdet-semantickitti-cam.py \
--log_folder voxdet-semantickitti-cam \
--seed 42 \
--log_every_n_steps 100
```

or with 4 GPUs (24GB memory):
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
--config_path configs/4gpu-semantickitti-cam.py \
--log_folder voxdet-semantickitti-cam \
--seed 42 \
--log_every_n_steps 100
```

For LiDAR-based SemanticKITTI training:

```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 python main.py \
--config_path configs/voxdet-semnatickitti-lidar.py \
--log_folder voxdet-semnatickitti-lidar \
--seed 42 \
--log_every_n_steps 100
```

Train VoxDet on KITTI-360 (camera) with 2× A100 80G:
```bash
CUDA_VISIBLE_DEVICES=0,1 python main.py \
--config_path configs/voxdet-kitt360-cam.py \
--log_folder voxdet-kitt360-cam \
--seed 42 \
--log_every_n_steps 100
```

To evaluate, download the pretrained models and place them in the ckpts/ folder, then run:
```bash
python main.py \
--eval --ckpt_path ./ckpts/voxdet-semantickitti-cam.ckpt \
--config_path configs/voxdet-semantickitti-cam.py \
--log_folder voxdet-semantickitti-cam-eval \
--seed 42 \
--log_every_n_steps 100
```

or, for the LiDAR model:

```bash
python main.py \
--eval --ckpt_path ./ckpts/voxdet-semantickitti-lidar.ckpt \
--config_path configs/voxdet-semnatickitti-lidar.py \
--log_folder voxdet-semantickitti-lidar-eval \
--seed 42 \
--log_every_n_steps 100
```

Add `--save_path pred` to save prediction results:
```bash
python main.py \
--eval --ckpt_path ./ckpts/voxdet-semantickitti-cam.ckpt \
--config_path configs/voxdet-semantickitti-cam.py \
--log_folder voxdet-semantickitti-cam-eval \
--seed 42 \
--log_every_n_steps 100 \
--save_path pred
```

For official SemanticKITTI leaderboard submission:
```bash
python main.py \
--eval --ckpt_path ./ckpts/voxdet-semantickitti-cam.ckpt \
--config_path configs/voxdet-semantickitti-cam-submit.py \
--log_folder voxdet-semantickitti-cam-submission \
--seed 42 \
--log_every_n_steps 100 \
--save_path submission \
--test_mapping
```

Note that with naive temporal fusion, VoxDet can achieve 20+ mIoU on the SemanticKITTI test set (see the logs folder).
We provide all reproduced artifacts (models, configs, logs, everything) after the code cleaning on OneDrive. We did not test them on the test set, so the performance might be slightly higher or lower than reported in the paper, but it should be very similar according to the TensorBoard logs.
We provide pretrained models for different configurations (results on the test set).
| Method | Dataset | Modality | IoU | mIoU | Config |
|---|---|---|---|---|---|
| VoxDet | SemanticKITTI | Camera | 47.81 | 18.67 | config |
| VoxDet | SemanticKITTI | LiDAR | 63.0 | 26.0 | config |
| VoxDet | KITTI-360 | Camera | 48.59 | 21.40 | config |
Please refer to docs/visualization.md.
VoxDet provides multiple configuration files for different scenarios:
- configs/voxdet-semantickitti-cam.py: Camera-based SemanticKITTI training
- configs/voxdet-semnatickitti-lidar.py: LiDAR-based SemanticKITTI training
- configs/voxdet-kitt360-cam.py: Camera-based KITTI-360 training
- configs/4gpu-semantickitti-cam.py: 4-GPU optimized SemanticKITTI training
- configs/baseline-dev-semantickitti-cam.py: Improved baseline with engineering tricks
- configs/pretrain.py: First-stage depth pretraining. If you want to redo this step yourself, use organize_ckpt.py to process the checkpoint for model loading (see the sketch after this list); our trained model is on OneDrive and is suggested for direct use.
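As a rough illustration of what such checkpoint processing usually involves (only the script name organize_ckpt.py comes from this repo; the checkpoint keys and the prefix remapping below are assumptions, not its actual logic):

```python
# Hypothetical sketch -- NOT the actual organize_ckpt.py.
# Extract model weights from a Lightning-style checkpoint and strip the
# wrapper prefix so the keys match the model targeted by load_from.
import torch

ckpt = torch.load("pretrain.ckpt", map_location="cpu")  # placeholder filename
state_dict = ckpt.get("state_dict", ckpt)

# Remap "model.xxx" -> "xxx"; the exact prefix is an assumption.
new_state = {k.removeprefix("model."): v for k, v in state_dict.items()}

torch.save({"state_dict": new_state}, "pretrain_organized.ckpt")
```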
VoxDet (blue curve) is significantly more efficient and effective than the previous state-of-the-art method, CGFormer (gray curve).
- Release the arXiv paper
- Release the unified codebase, including both camera-based and LiDAR-based implementations
- Release all models
If you find our work helpful for your research, please consider citing our paper:
```bibtex
@inproceedings{li2025voxdet,
  title={VoxDet: Rethinking 3D Semantic Occupancy Prediction as Dense Object Detection},
  author={Li, Wuyang and Yu, Zhu and Alahi, Alexandre},
  booktitle={NeurIPS},
  year={2025}
}
```

We greatly appreciate the tremendous effort behind the following projects!
- FCOS: Fully Convolutional One-Stage Object Detection
- Context and Geometry Aware Voxel Transformer for Semantic Scene Completion
- SIGMA: Semantic-complete Graph Matching For Domain Adaptive Object Detection
- Revisiting the Sibling Head in Object Detector
- VoxFormer: a Cutting-edge Baseline for 3D Semantic Occupancy Prediction



