Description
Bug Report: Batch Normalization (BN) Statistics Differ Across Ranks After PreciseBN Hook
Steps to Reproduce:
Clone the repository:
git clone https://github.com/facebookresearch/moco.git
cd moco/detection
Run the following command:
python train_net.py \
  --config-file configs/pascal_voc_R_50_C4_24k.yaml \
  --num-gpus 8 \
  OUTPUT_DIR "temp/train" \
  SEED 0 \
  SOLVER.MAX_ITER 1
Observations:
During update_bn_stats, the batch sizes differ across Distributed Data Parallel (DDP) ranks.
The statistics are not synchronized across ranks, so each rank holds different BN statistics after the PreciseBN hook runs.
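To make the observation concrete, here is a toy sketch in plain Python. It does not call fvcore, Detectron2, or PyTorch, and the activation values and helper names (local_mean, weighted_global_mean) are hypothetical. It only illustrates, under the assumption that each rank estimates statistics from its own batches, why ranks with different batch sizes end up with different means, and why a plain average of per-rank means would not match a sample-count-weighted one.

```python
# Toy illustration only: plain Python, no fvcore / Detectron2 / torch calls.
# All names and numbers below are hypothetical.


def local_mean(batch):
    """Mean of the activations one rank sees in its own batch."""
    return sum(batch) / len(batch)


def weighted_global_mean(batches):
    """Sample-count-weighted mean over all ranks.

    This is the value every rank would hold if per-rank sums and counts
    were combined (for example with an all-reduce) before dividing.
    """
    total = sum(sum(b) for b in batches)
    count = sum(len(b) for b in batches)
    return total / count


# Activations for one BN channel on two ranks with different
# batch sizes (3 samples vs. 1 sample).
rank_batches = [[1.0, 2.0, 3.0], [10.0]]

per_rank = [local_mean(b) for b in rank_batches]  # each rank's own estimate
unweighted = sum(per_rank) / len(per_rank)        # naive average of the means
weighted = weighted_global_mean(rank_batches)     # weighted by sample count

print(per_rank, unweighted, weighted)
```

In this sketch the two ranks disagree with each other, and the naive average of their means also disagrees with the sample-weighted value, so any cross-rank reduction would need to account for the per-rank batch sizes.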
Expected Behavior:
The BN statistics should be consistent across all ranks after the PreciseBN hook.
Environment:
OS: Linux
Python Version: 3.8.0
Framework: Detectron2 v0.6
PyTorch: 2.4.1+cu124
GPU: NVIDIA GeForce RTX 4090 (8 GPUs)
CUDA: Version 12.4
Other dependencies:
fvcore: 0.1.5.post20221221
iopath: 0.1.9
Command to Collect Full Environment Details:
wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py
python collect_env.py
Note:
No private dataset is required to reproduce this bug. All required resources are publicly available.