Hello! I tried to reproduce stage-2 training with the datasets from HF and ran into a problem with DeepSpeed and the SlowFast network: I get NaN values during training. With vanilla torchrun (no DeepSpeed), the problem doesn't occur.
I'm using the following package versions:
deepspeed==0.14.2
torch==2.1.2
accelerate==1.2.0
transformers==4.44.0
flash-attn==2.5.2
bitsandbytes==0.41.0
The DeepSpeed config is scripts/zero3.json.
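
One thing that may help narrow this down: NaNs that show up only under DeepSpeed are sometimes related to the mixed-precision settings in the config. That is an assumption and has not been confirmed for this case. Below is a minimal sketch that prints the precision-related fields of the config so they can be compared with the torchrun run. The helper name `summarize_precision` is hypothetical, and the sketch only assumes the standard `fp16`, `bf16` and `zero_optimization` sections of a DeepSpeed config.

```python
import json
from pathlib import Path


def summarize_precision(ds_config: dict) -> dict:
    """Collect the precision-related settings from a DeepSpeed config dict.

    Hypothetical helper for debugging; it only reads the config and does
    not change anything about training.
    """
    summary = {}
    for mode in ("fp16", "bf16"):
        section = ds_config.get(mode, {})
        # "enabled" can be true/false, or the string "auto" when the
        # Hugging Face integration fills it in from the training arguments.
        summary[mode] = section.get("enabled", False)
    # ZeRO stage, or None if the section is missing.
    summary["zero_stage"] = ds_config.get("zero_optimization", {}).get("stage")
    return summary


if __name__ == "__main__":
    config_path = Path("scripts/zero3.json")  # config path from this report
    if config_path.exists():
        print(summarize_precision(json.loads(config_path.read_text())))
```

If `fp16` turns out to be enabled (directly or through "auto"), comparing against a run with `bf16` might show whether precision plays a role here.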