Fix dataset generation: deterministic per-index seeding and collate-compatible image format #574

Closed
wants to merge 4 commits into from

dipeshbabu

What does this PR do?

Issues Addressed

  1. Fixes "Error in the Getting Started Example -- default_collate cannot handle PIL.Images" (#573), which covers two critical bugs in the Getting Started example:
    • Duplicate data when using optimize with num_workers > 1 due to unseeded randomness.
    • default_collate error caused by returning PIL Images incompatible with PyTorch batching.

Root Causes

  • Duplicate Data: Workers shared the global numpy random state, leading to identical random values across processes.
  • Collate Error: PIL Images cannot be batched by PyTorch’s default_collate.
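
The duplicate-data cause can be reproduced in a single process. The sketch below is a simulation, not the litdata worker code: it assumes each worker starts from the same inherited global numpy state (as happens when a process is forked) and replays that state by hand to show that both "workers" then draw the same values.

```python
import numpy as np

# Snapshot the global RNG state, standing in for what a forked worker inherits.
inherited_state = np.random.get_state()

# "Worker 0" draws from the global RNG.
worker0 = np.random.randint(0, 256, size=4)

# "Worker 1" starts from the same inherited state and draws again.
np.random.set_state(inherited_state)
worker1 = np.random.randint(0, 256, size=4)

# Identical starting state means identical "random" samples.
duplicates = bool(np.array_equal(worker0, worker1))
print(duplicates)
```

Any approach that gives each sample its own generator, rather than relying on the shared global state, avoids this.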

Changes

  1. Deterministic Data Generation:
    • Seed numpy’s RNG uniquely per index using np.random.default_rng(seed=index).
    • Replace np.random.randint with the seeded generator’s rng.integers(...).
  2. Collate Compatibility:
    • Return images as numpy arrays instead of PIL Images.
  3. Documentation Updates:
    • Updated the Getting Started example to reflect both fixes.
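
The first two changes above can be sketched roughly as follows. This is a minimal illustration, not the PR diff: the function name `random_images`, the 32x32x3 image shape, and the 10-class range are assumptions chosen for the example.

```python
import numpy as np

def random_images(index: int) -> dict:
    """Build one fake sample whose content depends only on its index."""
    # A generator seeded by the index is independent of the global numpy
    # state, so every worker produces the same sample for the same index.
    rng = np.random.default_rng(seed=index)

    # Keep the image as a uint8 numpy array rather than a PIL Image.
    image = rng.integers(0, 256, size=(32, 32, 3), dtype=np.uint8)

    return {"index": index, "image": image, "class": int(rng.integers(10))}

sample_a = random_images(3)
sample_b = random_images(3)
print(np.array_equal(sample_a["image"], sample_b["image"]))
```

Calling the function twice with the same index yields the same image, while different indices get different seeds and therefore different data.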

Result

  • No duplicate data across workers.
  • StreamingDataLoader now works out-of-the-box with the example.
  • Improved efficiency (no runtime PIL-to-tensor conversions).

PR review

Anyone in the community is free to review the PR once the tests have passed.

Did you have fun?

Make sure you had fun coding 🙃

codecov bot commented Apr 28, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (e789fb6) to head (4bb0eef).
Report is 17 commits behind head on main.

Additional details and impacted files
@@         Coverage Diff         @@
##           main   #574   +/-   ##
===================================
  Coverage    79%    79%           
===================================
  Files        40     40           
  Lines      6098   6098           
===================================
  Hits       4818   4818           
  Misses     1280   1280           

@bhimrazy
Collaborator

bhimrazy commented May 2, 2025

Hi @dipeshbabu,
The provided sample is just a minimal demo — in real-world use cases, you’d typically work with actual .jpg or .png images, which get optimized and loaded as tensors directly.

If you’re saving PIL images directly, you can handle them either by subclassing the streaming dataset to apply transforms or by passing a custom collate_fn like:

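import litdata as ld

# Keep each field as a plain Python list so default_collate never sees the PIL images.
def collate_fn(batch):
    return {
        "image": [sample["image"] for sample in batch],
        "class": [sample["class"] for sample in batch],
    }

# train_dataset is the streaming dataset created earlier in the example.
train_dataloader = ld.StreamingDataLoader(train_dataset, collate_fn=collate_fn)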
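
To illustrate what this collate_fn hands back for a batch, here is a small standalone sketch. The sample values are made-up placeholders: plain strings stand in for the PIL images, so it runs without litdata, PyTorch, or Pillow.

```python
def collate_fn(batch):
    # Gather each field into a list instead of stacking it into a tensor.
    return {
        "image": [sample["image"] for sample in batch],
        "class": [sample["class"] for sample in batch],
    }

# Placeholder samples; in the real example each "image" would be a PIL Image.
batch = [
    {"image": "img-0", "class": 3},
    {"image": "img-1", "class": 7},
]

collated = collate_fn(batch)
print(collated)
```

Because the images stay in a Python list, any conversion to tensors is left to the training loop or to a transform applied in a dataset subclass.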

If you’re interested in making further contributions to litdata, we’d be happy to discuss and collaborate on our Discord — join us in the #litdata channel!
Or please feel free to discuss directly over the issues.

@Borda Borda added the waiting on author Waiting for user input or feedback. label May 14, 2025
@deependujha
Collaborator

Hi @dipeshbabu
IMO, it makes more sense to add the collate_fn code to getting_started/stream.py, so that the example is complete and doesn't raise an error immediately.

@bhimrazy
Collaborator

Hey @dipeshbabu, just following up — if you're interested, adding the collate_fn directly in getting_started/stream.py and any other related places would be a great addition to round out the fix. Would be awesome to have that as part of your first contribution! 🙌

@deependujha
Collaborator

Closing this since there’s been no response from the author.

@bhimrazy
Collaborator

Let's include the collate_fn then.

Labels
waiting on author Waiting for user input or feedback.
Development

Successfully merging this pull request may close these issues.

Error in the Getting Started Example -- default_collate cannot handle PIL.Images
4 participants