Different DDP ranks have varying batch normalization (BN) statistics after the PreciseBN hook because precise_bn in fvcore does not synchronize the batch size across ranks. #5409

Open
@jobmart

Bug Report: Different Batch Normalization (BN) Statistics After PreciseBN Hook
Steps to Reproduce:

Clone the repository:
git clone https://github.com/facebookresearch/moco.git
cd moco/detection
Run the following command:
python train_net.py \
  --config-file configs/pascal_voc_R_50_C4_24k.yaml \
  --num-gpus 8 \
  OUTPUT_DIR "temp/train" \
  SEED 0 \
  SOLVER.MAX_ITER 1
Observations:

During update_bn_stats, the batch size seen on each Distributed Data Parallel (DDP) rank differs.
Because these batch sizes are not synchronized across ranks, each rank ends up with different BN statistics after the PreciseBN hook.
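
The effect described above can be illustrated with a small, self-contained sketch. This is a simplified model and not the fvcore implementation: it assumes the per-iteration means are already identical on every rank and that the final statistic is a batch-size-weighted average. The helper name weighted_mean and all numbers are hypothetical.

```python
def weighted_mean(per_iter_means, per_iter_batch_sizes):
    """Batch-size-weighted average of per-iteration means."""
    total = sum(per_iter_batch_sizes)
    weighted_sum = sum(m * n for m, n in zip(per_iter_means, per_iter_batch_sizes))
    return weighted_sum / total


# Per-iteration means, assumed identical on every rank (hypothetical values).
means = [0.0, 1.0, 4.0]

# Hypothetical local batch sizes observed by each rank at each iteration.
batch_sizes_rank0 = [2, 2, 4]
batch_sizes_rank1 = [4, 2, 2]

# Each rank weights the same means by its own local batch sizes,
# so the aggregated statistic differs between ranks.
rank0 = weighted_mean(means, batch_sizes_rank0)
rank1 = weighted_mean(means, batch_sizes_rank1)

# If every rank instead used the batch size summed over all ranks,
# the weights would match and so would the result.
global_batch_sizes = [a + b for a, b in zip(batch_sizes_rank0, batch_sizes_rank1)]
synced = weighted_mean(means, global_batch_sizes)

print(rank0, rank1, synced)  # 2.25 1.25 1.75
```

In this toy model, summing the batch size across ranks before weighting is enough to make every rank agree. Whether that is the right fix for precise_bn itself is an assumption that would need to be checked against the fvcore source.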
Expected Behavior:

The BN statistics should be consistent across all ranks after the PreciseBN hook.
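
One way to check this expectation is to collect the BN buffers from every rank and compare them against rank 0. The sketch below is only an illustration: the function find_mismatched_buffers and the buffer names are hypothetical, and the per-rank dictionaries are hard-coded here. In a real run they could be collected with something like torch.distributed.all_gather_object.

```python
def find_mismatched_buffers(stats_per_rank, tol=1e-6):
    """Return names of buffers that differ from rank 0 by more than tol on any rank."""
    reference = stats_per_rank[0]
    mismatched = []
    for name, ref_values in reference.items():
        for other in stats_per_rank[1:]:
            if any(abs(a - b) > tol for a, b in zip(ref_values, other[name])):
                mismatched.append(name)
                break  # one differing rank is enough to flag this buffer
    return mismatched


# Hypothetical BN statistics gathered from two ranks (index = rank).
gathered = [
    {"bn1.running_mean": [0.10, 0.20], "bn1.running_var": [1.00, 1.00]},
    {"bn1.running_mean": [0.10, 0.25], "bn1.running_var": [1.00, 1.00]},
]

print(find_mismatched_buffers(gathered))  # ['bn1.running_mean']
```

An empty list would indicate that all ranks hold the same BN statistics within the chosen tolerance.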
Environment:

OS: Linux
Python Version: 3.8.0
Framework: Detectron2 v0.6
PyTorch: 2.4.1+cu124
GPU: NVIDIA GeForce RTX 4090 (8 GPUs)
CUDA: Version 12.4
Other dependencies:
fvcore: 0.1.5.post20221221
iopath: 0.1.9
Command to Collect Full Environment Details:

wget -nc -q https://github.com/facebookresearch/detectron2/raw/main/detectron2/utils/collect_env.py
python collect_env.py
Note:
No private dataset is required to reproduce this bug. All required resources are publicly available.
