SlowFast outputs NaN with deepspeed

Hello! I tried to reproduce stage-2 training with datasets from HF and encountered a problem with DeepSpeed and the SlowFast network. It outputs NaN values during training. When using vanilla torchrun without DeepSpeed, the problem doesn't occur.

I'm using:
deepspeed==0.14.2
torch==2.1.2
accelerate==1.2.0
transformers==4.44.0
flash-attn==2.5.2
bitsandbytes==0.41.0

I'm running with the scripts/zero3.json config.
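Since the NaNs only show up under DeepSpeed, it may be worth confirming which precision mode the config actually enables. Below is a minimal sketch for printing the precision-related fields of a DeepSpeed config. The keys `fp16`, `bf16`, and `zero_optimization.stage` are standard DeepSpeed config keys, but `summarize_precision` is a hypothetical helper name, and the example values are made up for illustration; they are not the actual contents of scripts/zero3.json.

```python
import json


def summarize_precision(ds_config: dict) -> dict:
    """Collect the precision-related settings from a DeepSpeed config dict.

    Missing sections are treated as disabled. Values are returned as-is,
    so a placeholder such as "auto" is reported rather than resolved.
    """
    fp16 = ds_config.get("fp16", {})
    bf16 = ds_config.get("bf16", {})
    zero = ds_config.get("zero_optimization", {})
    return {
        "fp16_enabled": fp16.get("enabled", False),
        "bf16_enabled": bf16.get("enabled", False),
        "zero_stage": zero.get("stage"),
    }


# Hypothetical config contents, for illustration only.
example_config = json.loads(
    '{"fp16": {"enabled": "auto"},'
    ' "bf16": {"enabled": "auto"},'
    ' "zero_optimization": {"stage": 3}}'
)

print(summarize_precision(example_config))
```

To check the real file, the same function can be called on `json.load(open("scripts/zero3.json"))` from the repository root.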
