Training detail #65

zhlhlhlhl · 2024-12-10T08:24:41Z

Hi, really impressive work! I'm curious why you use ZeRO-3 in pretraining but ZeRO-2 in instruction training.

jpthu17 · 2024-12-11T08:30:05Z

We find that training with ZeRO-3 hangs very easily because the lengths of the video and image data vary greatly.
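For readers unfamiliar with the two settings: ZeRO stage 2 partitions optimizer states and gradients across ranks, while stage 3 additionally partitions the model parameters, which then have to be gathered during forward and backward passes. A minimal sketch of how the stage is selected in a DeepSpeed-style config follows. The helper `make_zero_config` and the specific values (`bf16`, `overlap_comm`, `contiguous_gradients`) are illustrative assumptions, not this repository's actual configuration.

```python
import json


def make_zero_config(stage: int) -> dict:
    """Build a minimal DeepSpeed-style config dict for a ZeRO stage.

    Hypothetical helper for illustration; the values below are
    assumptions, not the settings used by the authors.
    """
    if stage not in (2, 3):
        raise ValueError("this sketch only covers ZeRO stages 2 and 3")
    return {
        "bf16": {"enabled": True},  # assumed precision setting
        "zero_optimization": {
            # 2: shard optimizer states and gradients
            # 3: additionally shard the parameters themselves
            "stage": stage,
            "overlap_comm": True,
            "contiguous_gradients": True,
        },
    }


# Stage choice as described in the thread: ZeRO-3 for pretraining,
# ZeRO-2 for instruction training.
PRETRAIN_CONFIG = make_zero_config(3)
INSTRUCTION_CONFIG = make_zero_config(2)

if __name__ == "__main__":
    print(json.dumps(INSTRUCTION_CONFIG, indent=2))
```

The only difference between the two dicts is the `stage` field, so switching stages is a one-line config change; the trade-off is that stage 2 keeps a full copy of the parameters on every GPU and therefore uses more memory per device.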

zhlhlhlhl · 2024-12-11T13:13:40Z

Yes, that's true. Also, in the instruction training phase, I hit a CUDA out-of-memory error at 5258/8056. Were your experiments conducted on A100 GPUs with 40GB or 80GB of memory?
