Cycle option for StreamingDataLoader #524


Open
Aceticia opened this issue Mar 24, 2025 · 8 comments
Labels
enhancement New feature or request waiting on author Waiting for user input or feedback.

Comments

@Aceticia

🚀 Feature

A function or an argument in StreamingDataLoader to cycle the passed in StreamingDataset.

Motivation

Many CV training scenarios involve training for multiple epochs while controlling the exact number of training steps, independent of the underlying dataset size. E.g., given a CombinedStreamingDataset of some length, restart its iteration when it is exhausted.

Pitch

I'm not quite sure how this should be done - maybe in the `__iter__` method of StreamingDataLoader, we could catch the final iteration and restart it?

@Aceticia Aceticia added the enhancement New feature or request label Mar 24, 2025
@tchaton
Collaborator

tchaton commented Mar 26, 2025

You could check PyTorch Lightning Cycle Loaders: https://lightning.ai/docs/pytorch/stable/api/lightning.pytorch.utilities.combined_loader.html

Or create your own wrapper that iterates for a given number of steps.
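The second option can be sketched in a few lines of plain Python (names like `take_steps` are hypothetical, and a list stands in for a real StreamingDataLoader):

```python
from itertools import islice

def cycle(loader):
    """Restart iteration over `loader` whenever it is exhausted.

    Assumes a non-empty loader; an empty one would loop forever.
    """
    while True:
        yield from loader

def take_steps(loader, num_steps):
    """Yield exactly `num_steps` batches, cycling the loader as needed."""
    return islice(cycle(loader), num_steps)

# Toy usage, with a list standing in for a StreamingDataLoader:
batches = list(take_steps([1, 2, 3], 7))  # [1, 2, 3, 1, 2, 3, 1]
```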

@bhimrazy bhimrazy added the waiting on author Waiting for user input or feedback. label Mar 29, 2025
@bhimrazy
Collaborator

Hi @Aceticia, ParallelStreamingDataset #576 has just been merged into litdata. Feel free to give it a try! You can find more details in the README under the sections Parallel Streaming and Cycle Datasets.

@philgzl
Contributor

philgzl commented May 26, 2025

As mentioned by @lantiga in #576, the ability to cycle datasets using ParallelStreamingDataset is a nice option as it is, but this should probably be upstreamed to StreamingDataset in the future.

I'm not sure whether this issue should stay open so we don't forget.

@deependujha
Collaborator

Hi @philgzl, are you suggesting something like:

```python
sd = ld.StreamingDataset("..", cycle=True)
```

and so when `__iter__` raises StopIteration, we don't increase the epoch count and just restart the iterator?
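A minimal sketch of that behavior, as a hypothetical wrapper rather than the actual litdata API:

```python
from itertools import islice

class CycledDataset:
    """Hypothetical sketch of `cycle=True` semantics: when the inner
    iterator is exhausted, restart it instead of raising StopIteration.
    Assumes a non-empty dataset."""

    def __init__(self, dataset):
        self.dataset = dataset

    def __iter__(self):
        it = iter(self.dataset)
        while True:
            try:
                yield next(it)
            except StopIteration:
                # Restart iteration; the epoch count would stay unchanged.
                it = iter(self.dataset)
                yield next(it)

# Toy usage: the stream never ends, so we slice off a fixed number of items.
first_five = list(islice(CycledDataset([0, 1]), 5))  # [0, 1, 0, 1, 0]
```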

@philgzl
Contributor

philgzl commented May 26, 2025

Mmh I was thinking of a similar solution to what was implemented in ParallelStreamingDataset:

```python
dset = ld.StreamingDataset("..", length=100)
```

Iterating over the dataset once then yields 100 samples. If the dataset has fewer than 100 samples, we cycle and shuffle internally. If we iterate over the dataset a second time, we resume from where we left off without re-shuffling, and yield 100 samples again.

This way we can disentangle the epoch length (as in the number of items yielded by iter) from the actual number of samples in the dataset. I believe this is what OP meant.
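A toy sketch of those semantics (hypothetical class, no shuffling for simplicity, not the litdata implementation):

```python
class FixedLengthDataset:
    """Hypothetical sketch of the proposed `length` semantics: each pass
    yields exactly `length` items, and the underlying position persists
    across passes, cycling through the data when it runs out.
    Shuffling is omitted here for simplicity."""

    def __init__(self, data, length):
        self.data = data
        self.length = length
        self._pos = 0  # persists across epochs

    def __len__(self):
        return self.length

    def __iter__(self):
        for _ in range(self.length):
            yield self.data[self._pos % len(self.data)]
            self._pos += 1

# Toy usage: epoch length 4 over 3 underlying samples.
ds = FixedLengthDataset([1, 2, 3], 4)
epoch1 = list(ds)  # [1, 2, 3, 1]
epoch2 = list(ds)  # [2, 3, 1, 2] - resumes where epoch 1 left off
```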

@philgzl
Contributor

philgzl commented May 26, 2025

I realize now that maybe what you meant with cycle=True is the same as length=float("inf").

@deependujha
Collaborator

Thanks for the clarification.

Similar to ParallelStreamingDataset, pass an int or "inf". Sounds good to me.

@philgzl
Contributor

philgzl commented May 26, 2025

Yes, and then I guess this feature should be removed from ParallelStreamingDataset, since we would just pass StreamingDataset instances that were already configured to cycle.
