Description
First of all, thank you for your great work on this project. It's a very valuable contribution.
The Issue
When num_workers > 0, it appears that all worker processes load highly duplicated, or even identical, batches of data within each epoch. This can significantly reduce training efficiency and potentially harm model performance, because the model repeatedly sees the same small set of samples in every batch instead of a diverse mix of data.
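To make this easier to check, here is a minimal, self-contained sketch (the dataset and all names are purely illustrative, not the project's code) that prints which worker drew which "random" value. If the workers share RNG state, the same drawn values show up under different worker ids:

```python
import random

import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class ToyDataset(Dataset):
    """Stand-in for a dataset that picks its sample with random.* calls
    instead of using the incoming index."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        info = get_worker_info()
        worker_id = info.id if info is not None else -1
        drawn = random.randint(0, 10_000)  # which sample would be picked
        return worker_id, drawn


if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2)
    for worker_ids, drawn in loader:
        # If the workers share RNG state, identical `drawn` values appear
        # under different worker ids.
        print(worker_ids.tolist(), drawn.tolist())
```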
My understanding is that each worker process starts with the same RNG seed inherited from the main process. As a result, when each worker calls random.randint and the subsequent random.choice calls, it generates the exact same sequence of "random" numbers, so the workers load and process the exact same data samples, defeating the purpose of parallel data loading.
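If this diagnosis is right, one common workaround would be to pass a worker_init_fn to the DataLoader that re-seeds the Python and NumPy RNGs from the per-worker torch seed. This is only a sketch under the assumption of a standard PyTorch DataLoader; seed_worker and train_dataset are illustrative names, not part of the repository:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader


def seed_worker(worker_id):
    # torch.initial_seed() already differs per worker (base_seed + worker_id),
    # so reuse it to seed the other RNGs the dataset code relies on.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)


# Illustrative usage; `train_dataset` stands in for the real dataset object:
# loader = DataLoader(train_dataset, batch_size=4, num_workers=4,
#                     worker_init_fn=seed_worker)
```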
Although it seems the original LISA training scripts might not set num_workers > 0, I enabled it to speed up my data loading pipeline, and that's when I noticed this potential issue.
I'm not entirely certain if my understanding of the problem is correct, so I wanted to raise it here for discussion. I would appreciate it if you, or anyone else in the community with similar experiences, could weigh in.