Add repeat() for iterable datasets #7192

alex-hh · 2024-10-02T17:48:13Z

Feature request

It would be useful to be able to repeat iterable datasets indefinitely in a straightforward way, giving the user complete control over when iteration starts and ends.

An IterableDataset.repeat(n) function could do this automatically.

Motivation

This feature was discussed in issue #7147. It would remove the need for the hack of interleaving datasets with probability 0 as a simple way to achieve this functionality.

An additional benefit might be simpler use of iterable datasets in a distributed setting.
If the user can assume that datasets will repeat indefinitely, then issues around different numbers of samples appearing on different devices (e.g. #6437, #6594, #6623, #6719) can potentially be resolved by simply doing:

ids.repeat(None).take(n_samples_per_epoch)
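
To make the intended semantics concrete, here is a minimal pure-Python sketch of what repeat(None).take(n) would do on shards of uneven length. It does not use the datasets library; the `repeat` and `take` helpers, the `shards` dict, and the sample values are hypothetical stand-ins for illustration only.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def repeat(make_iter: Callable[[], Iterable[T]], num_times: Optional[int]) -> Iterator[T]:
    """Yield all items of a fresh iterable num_times times; None means forever.

    Note: with num_times=None and an empty source this loops without yielding.
    """
    done = 0
    while num_times is None or done < num_times:
        yield from make_iter()
        done += 1


def take(items: Iterable[T], n: int) -> Iterator[T]:
    """Yield at most the first n items."""
    return islice(items, n)


# Hypothetical uneven shards, as two ranks might see them.
shards = {0: ["a", "b", "c"], 1: ["x", "y"]}
n_samples_per_epoch = 4

# Every rank yields exactly n_samples_per_epoch items, whatever its shard size.
epochs = {
    rank: list(take(repeat(lambda data=data: iter(data), None), n_samples_per_epoch))
    for rank, data in shards.items()
}
```

Under these assumptions each rank ends its epoch after the same number of samples, which is the property the distributed use case needs.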

Your contribution

I'm not familiar enough with the codebase to assess how straightforward this would be to implement.

If it might be very straightforward, I could possibly have a go.

alex-hh · 2024-10-03T09:59:16Z

Perhaps concatenate_datasets can already be used to achieve almost the same effect?

lhoestq · 2024-10-03T12:53:33Z

concatenate_datasets does the job when there is a finite number of repetitions, but to support .repeat() forever we need new logic in iterable_dataset.py
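
The distinction can be illustrated with plain Python iterators, as a rough sketch rather than the library's implementation. Concatenating n copies needs n known up front, while repeating forever needs a loop that restarts the source. The function names below are hypothetical, and `itertools.chain` stands in for concatenation.

```python
from itertools import chain, islice


def repeat_by_concatenation(make_iter, n):
    # Finite case: chain n fresh copies, analogous to concatenating n datasets.
    return chain.from_iterable(make_iter() for _ in range(n))


def repeat_forever(make_iter):
    # Infinite case: no finite list of copies exists, so restart the source in a loop.
    while True:
        yield from make_iter()


finite = list(repeat_by_concatenation(lambda: [1, 2], 3))
first_five = list(islice(repeat_forever(lambda: [1, 2]), 5))
```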

lhoestq · 2025-03-18T10:48:32Z

done in #7198

alex-hh added the enhancement label Oct 2, 2024

alex-hh mentioned this issue in "streaming datasets doesn't work properly with multi-node" #6623 (open) Oct 2, 2024

lhoestq closed this as completed Mar 18, 2025
