
[BUG?] Potential Data Loading Issue with num_workers > 0 due to randomness in __getitem__ #206

@jiazhen-code

Description


First of all, thank you for your great work on this project. It's a very valuable contribution.

The Issue
When num_workers > 0, all worker processes appear to load highly duplicated or even identical batches of data within each epoch. This can significantly reduce training efficiency and potentially harm model performance, since the model repeatedly sees the same few samples instead of a diverse set of data.

This happens because every worker process inherits the same RNG state from the main process. When each worker then calls random.randint and the subsequent random.choice inside __getitem__, they all generate the exact same sequence of "random" numbers, so they load and process the exact same data samples, defeating the purpose of parallel data loading.
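To make the failure mode concrete, here is a minimal stdlib-only sketch (no PyTorch required). It simulates two workers that inherit the same base seed, versus two workers that re-seed with their worker ID, which is essentially what PyTorch's `worker_init_fn` hook lets you do. The names `BASE_SEED`, `broken_worker_samples`, and `fixed_worker_samples` are hypothetical, chosen only for this illustration:

```python
import random

BASE_SEED = 0  # stands in for the RNG state every forked worker inherits

def broken_worker_samples(worker_id, n=5):
    # Bug: every worker starts from the same inherited RNG state, so the
    # worker_id has no effect on which "random" indices get drawn.
    rng = random.Random(BASE_SEED)
    return [rng.randint(0, 9999) for _ in range(n)]

def fixed_worker_samples(worker_id, n=5):
    # Fix: fold the worker ID into the seed (what a DataLoader
    # worker_init_fn would do), so each worker draws its own sequence.
    rng = random.Random(BASE_SEED + worker_id)
    return [rng.randint(0, 9999) for _ in range(n)]

print(broken_worker_samples(0) == broken_worker_samples(1))  # -> True: duplicated data
print(fixed_worker_samples(0) == fixed_worker_samples(1))    # -> False: distinct data
```

In a real pipeline the re-seeding would live in a `worker_init_fn` passed to `DataLoader`, seeding Python's `random` (and NumPy, if used) per worker rather than relying on the inherited state.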

Although the original LISA training scripts do not seem to set num_workers > 0, I enabled it to speed up my data loading pipeline, and that is when I noticed this potential issue.

I'm not entirely certain my understanding of the problem is correct, so I wanted to raise it here for discussion. I would appreciate it if you, or anyone else in the community who has seen similar behavior, could weigh in.
