Add repeat() for iterable datasets #7192

Closed
alex-hh opened this issue Oct 2, 2024 · 3 comments
Labels
enhancement New feature or request

Comments

@alex-hh
Contributor

alex-hh commented Oct 2, 2024

Feature request

It would be useful to be able to repeat iterable datasets indefinitely in a straightforward way, giving the user complete control over when iteration starts and ends.

An IterableDataset.repeat(n) function could do this automatically.

Motivation

This feature was discussed in issue #7147. It would remove the need for the hack of interleaving datasets with probability 0 as a simple way to achieve this functionality.

An additional benefit might be simpler use of iterable datasets in a distributed setting.
If the user can assume that datasets repeat indefinitely, then issues around different numbers of samples appearing on different devices (e.g. #6437, #6594, #6623, #6719) can potentially be resolved by simply doing:

ids.repeat(None).take(n_samples_per_epoch)
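
To illustrate why this pattern evens out sample counts across devices, here is a minimal plain-Python sketch. It does not use the datasets library: the `repeat` and `take` helpers, the `shards` dictionary, and the per-rank lists are all hypothetical stand-ins for the proposed behaviour, where `None` is assumed to mean "repeat forever".

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, Optional, TypeVar

T = TypeVar("T")


def repeat(make_iter: Callable[[], Iterable[T]], num_times: Optional[int]) -> Iterator[T]:
    """Yield all items from make_iter() num_times times; None means forever."""
    count = 0
    while num_times is None or count < num_times:
        yielded = False
        for item in make_iter():
            yielded = True
            yield item
        if not yielded:
            return  # empty source: stop instead of looping forever
        count += 1


def take(items: Iterable[T], n: int) -> Iterator[T]:
    """Yield at most the first n items."""
    return islice(items, n)


# Hypothetical uneven shards: rank 0 holds 5 samples, rank 1 holds only 3.
shards = {0: ["a", "b", "c", "d", "e"], 1: ["x", "y", "z"]}
n_samples_per_epoch = 4

# Every rank repeats its own shard forever, then takes a fixed number of samples.
per_rank = {
    rank: list(take(repeat(lambda data=data: iter(data), None), n_samples_per_epoch))
    for rank, data in shards.items()
}

print(per_rank[0])  # ['a', 'b', 'c', 'd']
print(per_rank[1])  # ['x', 'y', 'z', 'x']
```

Each rank ends up with exactly `n_samples_per_epoch` samples regardless of shard size; the smaller shard simply wraps around.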

Your contribution

I'm not familiar enough with the codebase to assess how straightforward this would be to implement.

If it might be very straightforward, I could possibly have a go.

@alex-hh
Contributor Author

alex-hh commented Oct 3, 2024

Perhaps concatenate_datasets can already be used to achieve almost the same effect?

@lhoestq
Member

lhoestq commented Oct 3, 2024

concatenate_datasets does the job when there is a finite number of repetitions, but repeating forever with .repeat() needs new logic in iterable_dataset.py
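
The distinction can be sketched in plain Python as an analogy only; this is not the code in iterable_dataset.py or in any pull request. The helper names `repeat_finite` and `repeat_forever` are hypothetical. Concatenating N copies needs N to be known up front, whereas repeating forever has to restart the source lazily each time it runs out.

```python
from itertools import chain, islice
from typing import Iterator, List, TypeVar

T = TypeVar("T")


def repeat_finite(source: List[T], num_times: int) -> Iterator[T]:
    """Concatenation-style repetition: chain num_times references to the source."""
    return chain.from_iterable([source] * num_times)


def repeat_forever(source: List[T]) -> Iterator[T]:
    """Unbounded repetition: restart the source whenever it is exhausted."""
    while True:
        empty = True
        for item in source:
            empty = False
            yield item
        if empty:
            return  # nothing to repeat; avoid spinning forever


source = [1, 2, 3]

finite = list(repeat_finite(source, 2))
print(finite)  # [1, 2, 3, 1, 2, 3]

# An infinite stream can only be consumed through a bound such as islice.
first_seven = list(islice(repeat_forever(source), 7))
print(first_seven)  # [1, 2, 3, 1, 2, 3, 1]
```

No finite list of copies can express the second case, which is why it calls for separate handling.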

@lhoestq
Member

lhoestq commented Mar 18, 2025

Done in #7198

@lhoestq lhoestq closed this as completed Mar 18, 2025