Description
First of all, thank you for your great work on this project. It's a very valuable contribution.
The Issue
When num_workers > 0, it appears that all worker processes load highly duplicated, or even identical, batches of data within each epoch. This can significantly reduce training efficiency and potentially harm model performance, because the model repeatedly sees the same small set of samples in every batch instead of a diverse mix of data.
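To make this easier to check, here is a minimal, self-contained sketch (the dataset and all names are purely illustrative, not the project's code) that prints which worker drew which "random" value. If the workers share RNG state, the same drawn values show up under different worker ids:

```python
import random

import torch
from torch.utils.data import DataLoader, Dataset, get_worker_info


class ToyDataset(Dataset):
    """Stand-in for a dataset that picks its sample with random.* calls
    instead of using the incoming index."""

    def __len__(self):
        return 8

    def __getitem__(self, idx):
        info = get_worker_info()
        worker_id = info.id if info is not None else -1
        drawn = random.randint(0, 10_000)  # which sample would be picked
        return worker_id, drawn


if __name__ == "__main__":
    loader = DataLoader(ToyDataset(), batch_size=2, num_workers=2)
    for worker_ids, drawn in loader:
        # If the workers share RNG state, identical `drawn` values appear
        # under different worker ids.
        print(worker_ids.tolist(), drawn.tolist())
```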
My understanding is that each worker process starts with the same RNG seed inherited from the main process. As a result, when each worker calls random.randint and the subsequent random.choice calls, it generates the exact same sequence of "random" numbers, so the workers load and process the exact same data samples, defeating the purpose of parallel data loading.
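If this diagnosis is right, one common workaround would be to pass a worker_init_fn to the DataLoader that re-seeds the Python and NumPy RNGs from the per-worker torch seed. This is only a sketch under the assumption of a standard PyTorch DataLoader; seed_worker and train_dataset are illustrative names, not part of the repository:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader


def seed_worker(worker_id):
    # torch.initial_seed() already differs per worker (base_seed + worker_id),
    # so reuse it to seed the other RNGs the dataset code relies on.
    worker_seed = torch.initial_seed() % 2**32
    random.seed(worker_seed)
    np.random.seed(worker_seed)


# Illustrative usage; `train_dataset` stands in for the real dataset object:
# loader = DataLoader(train_dataset, batch_size=4, num_workers=4,
#                     worker_init_fn=seed_worker)
```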
Although it seems the original LISA training scripts might not set num_workers > 0, I enabled it to speed up my data loading pipeline, and that's when I noticed this potential issue.
I'm not entirely certain if my understanding of the problem is correct, so I wanted to raise it here for discussion. I would appreciate it if you, or anyone else in the community with similar experiences, could weigh in.