streaming datasets doesn't work properly with multi-node

### Feature request

Let’s say I have a dataset with 5 samples with values [1, 2, 3, 4, 5], with 2 GPUs (for DDP) and batch size of 2. This dataset is an `IterableDataset` since I am streaming it.

Now I split the dataset using `split_dataset_by_node` to ensure it doesn’t get repeated. And since it’s already splitted, I don’t have to use `DistributedSampler` (also they don't work with iterable datasets anyway)?

But in this case I noticed that the:

First iteraton:
first GPU will get → [1, 2]
first GPU will get → [3, 4]

Second iteraton:
first GPU will get → [5]
first GPU will get → Nothing

which actually creates an issue since in case of `DistributedSampler`, the samples are repeated internally to ensure non of the GPUs at any iteration is missing any data for gradient sync.

So my questions are:

1. Here since splitting is happening before hand, how to make sure each GPU get’s a batch at each iteration to avoid gradient sync issues?
2. Do we need to use `DistributedSampler`? If yes, how?
3. in the docstrings of `split_dataset_by_node`, this is mentioned: *"If the dataset has a number of shards that is a factor of `world_size` (i.e. if `dataset.n_shards % world_size == 0`), then the shards are evenly assigned across the nodes, which is the most optimized. Otherwise, each node keeps 1 example out of `world_size`, skipping the other examples."* Can you explain the last part here?
4. If `dataset.n_shards % world_size != 0`, is it possible to shard the streaming dataset on the fly to avoid the case where data is missing?

### Motivation

Somehow streaming datasets should work with DDP since for big LLMs a lot of data is required and DDP/multi-node is mostly used to train such models and streaming can actually help solve the data part of it.

### Your contribution

Yes, I can help in submitting the PR once we get mutual understanding on how it should behave.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

streaming datasets doesn't work properly with multi-node #6623

Feature request

Motivation

Your contribution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

streaming datasets doesn't work properly with multi-node #6623

Description

Feature request

Motivation

Your contribution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions