
Provide an option for a robust dataset iterator with error handling #7612

Open
@wwwjn

Description


Feature request

Add an option to skip corrupted data samples. Currently, datasets raises an error when a data sample is corrupted, so that the user is made aware and can handle the corruption themselves. However, when I try to catch the error at the user level, the iterator raises StopIteration the next time I call next().
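
For illustration only, the requested option might look something like this (the on_error parameter is hypothetical and does not exist in load_dataset today):

        from datasets import load_dataset

        # Hypothetical API, for illustration only: "on_error" is NOT an
        # existing load_dataset argument.
        dataset = load_dataset(
            "pixparse/cc12m-wds",
            split="train",
            streaming=True,
            on_error="skip",  # hypothetical: silently drop samples that fail to decode
        )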

The way I try to do error handling is (this doesn't work, unfortunately):

        import numpy as np
        from PIL import Image
        from datasets import load_dataset

        # Load the dataset with streaming enabled
        dataset = load_dataset(
            "pixparse/cc12m-wds", split="train", streaming=True
        )
        # Get an iterator from the dataset
        iterator = iter(dataset)

        successful = 0
        errors = 0

        while True:
            try:
                # Try to get the next example
                example = next(iterator)

                # Try to access and process the image
                image = example["jpg"]
                pil_image = Image.fromarray(np.array(image))
                pil_image.verify()  # Verify it's a valid image file

            except StopIteration:  # Code path 1
                print("\nStopIteration was raised! Reached the end of the dataset")
                break

            except Exception:  # Code path 2
                errors += 1
                print("Error! Skipping this sample")
                continue
            else:
                successful += 1

This is because the IterableDataset itself already raised the error (reaching Code path 2), and if I call next() again, it hits Code path 1: the inner iterator of the IterableDataset has been stopped, so calling next() on it raises StopIteration.

So I cannot skip corrupted data samples this way. I would also love to hear any suggestions for building a robust dataloader.
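
One workaround that may help in the meantime: if the corruption surfaces at image-decode time, automatic decoding can be disabled with cast_column and the Image(decode=False) feature, so the iterator yields raw bytes and the try/except lives entirely in user code. A minimal sketch under that assumption (if the underlying archive itself is unreadable, the inner iterator will still stop):

        import io

        from PIL import Image
        from datasets import Image as ImageFeature, load_dataset

        dataset = load_dataset("pixparse/cc12m-wds", split="train", streaming=True)
        # Yield raw bytes instead of decoded images, so a corrupted JPEG
        # no longer raises inside the iterator itself
        dataset = dataset.cast_column("jpg", ImageFeature(decode=False))

        successful = errors = 0
        for example in dataset:
            try:
                # With decode=False the column is a dict with "bytes" and "path"
                pil_image = Image.open(io.BytesIO(example["jpg"]["bytes"]))
                pil_image.verify()  # Raises if the image data is corrupted
            except Exception:
                errors += 1  # Skip this sample; the iterator stays usable
                continue
            successful += 1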

Thanks for your help in advance!

Motivation

Corruption in public datasets may be common

Many users rely on public datasets, and public datasets may contain corrupted samples, especially image or video datasets. I fully understand that it is the dataset owner's and the user's responsibility to ensure data integrity and to run data cleaning or preprocessing, but an option to skip corrupted samples would make life easier for developers consuming these datasets.

Use cases

For example, a robust dataloader would make it easy for users who want to run quick tests across different datasets and pick the one that fits their needs. A user could then consume an IterableDataset with streaming=True without first downloading the dataset and removing its corrupted samples.

Your contribution

The error handling might not be trivial and may need more careful design.
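
As one sketch of the design space, a user-side wrapper can be built on the existing IterableDataset.skip() API: on an error it rebuilds the streaming dataset and fast-forwards past the failing sample. Since skip() on a streaming dataset re-reads from the beginning, this is only practical when corrupted samples are rare; a built-in option would presumably skip in place instead. The robust_iter helper below is illustrative, not a proposed implementation:

        from datasets import load_dataset

        def robust_iter(build_dataset, max_restarts=100):
            """Yield examples, skipping any sample that raises while being produced.

            build_dataset is a zero-argument callable that returns a fresh
            streaming IterableDataset.
            """
            consumed = 0
            restarts = 0
            while restarts <= max_restarts:
                try:
                    # Fast-forward past everything already yielded (or skipped)
                    for example in build_dataset().skip(consumed):
                        consumed += 1
                        yield example
                    return  # clean end of the dataset
                except Exception:
                    consumed += 1  # step over the sample that raised
                    restarts += 1

        # Usage sketch:
        # factory = lambda: load_dataset("pixparse/cc12m-wds", split="train", streaming=True)
        # for example in robust_iter(factory):
        #     ...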
