
Provide an option for a robust dataset iterator with error handling #7612

Open
@wwwjn

Description


Feature request

Add an option to skip corrupted data samples. Currently, datasets raises an error when a data sample is corrupted, so that the user is made aware and can handle the corruption themselves. However, when I try to catch the error at the user level, the iterator raises StopIteration the next time I call next().
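
For illustration only, the requested option might look something like this (the on_error parameter is hypothetical and does not exist in load_dataset today):

        from datasets import load_dataset

        # Hypothetical API, for illustration only: "on_error" is NOT an
        # existing load_dataset argument.
        dataset = load_dataset(
            "pixparse/cc12m-wds",
            split="train",
            streaming=True,
            on_error="skip",  # hypothetical: silently drop samples that fail to decode
        )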

The way I try to do error handling is (this doesn't work, unfortunately):

        import numpy as np
        from PIL import Image
        from datasets import load_dataset

        # Load the dataset with streaming enabled
        dataset = load_dataset(
            "pixparse/cc12m-wds", split="train", streaming=True
        )
        # Get an iterator from the dataset
        iterator = iter(dataset)

        successful = 0
        errors = 0

        while True:
            try:
                # Try to get the next example
                example = next(iterator)

                # Try to access and process the image
                image = example["jpg"]
                pil_image = Image.fromarray(np.array(image))
                pil_image.verify()  # Verify it's a valid image file

            except StopIteration:  # Code path 1
                print("\nStopIteration was raised! Reached the end of the dataset")
                break

            except Exception:  # Code path 2
                errors += 1
                print("Error! Skipping this sample")
                continue
            else:
                successful += 1

This is because the IterableDataset itself already raised the error (reaching Code path 2), and if I call next() again, it hits Code path 1: the inner iterator of the IterableDataset has been stopped, so calling next() on it raises StopIteration.

So I cannot skip corrupted data samples this way. I would also love to hear any suggestions for building a robust dataloader.
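
One workaround that may help in the meantime: if the corruption surfaces at image-decode time, automatic decoding can be disabled with cast_column and the Image(decode=False) feature, so the iterator yields raw bytes and the try/except lives entirely in user code. A minimal sketch under that assumption (if the underlying archive itself is unreadable, the inner iterator will still stop):

        import io

        from PIL import Image
        from datasets import Image as ImageFeature, load_dataset

        dataset = load_dataset("pixparse/cc12m-wds", split="train", streaming=True)
        # Yield raw bytes instead of decoded images, so a corrupted JPEG
        # no longer raises inside the iterator itself
        dataset = dataset.cast_column("jpg", ImageFeature(decode=False))

        successful = errors = 0
        for example in dataset:
            try:
                # With decode=False the column is a dict with "bytes" and "path"
                pil_image = Image.open(io.BytesIO(example["jpg"]["bytes"]))
                pil_image.verify()  # Raises if the image data is corrupted
            except Exception:
                errors += 1  # Skip this sample; the iterator stays usable
                continue
            successful += 1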

Thanks for your help in advance!

Motivation

Corruption in public datasets may be common

Many users rely on public datasets, and public datasets may contain corrupted samples, especially image or video datasets. I fully understand that it is the dataset owner's and the user's responsibility to ensure data integrity and to run data cleaning or preprocessing, but an option to skip corrupted samples would make life easier for developers consuming these datasets.

Use cases

For example, a robust dataloader would make it easy for users who want to run quick tests across different datasets and pick the one that fits their needs. A user could then consume an IterableDataset with streaming=True without first downloading the dataset and removing its corrupted samples.

Your contribution

The error handling might not be trivial and may need more careful design.
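
As one sketch of the design space, a user-side wrapper can be built on the existing IterableDataset.skip() API: on an error it rebuilds the streaming dataset and fast-forwards past the failing sample. Since skip() on a streaming dataset re-reads from the beginning, this is only practical when corrupted samples are rare; a built-in option would presumably skip in place instead. The robust_iter helper below is illustrative, not a proposed implementation:

        from datasets import load_dataset

        def robust_iter(build_dataset, max_restarts=100):
            """Yield examples, skipping any sample that raises while being produced.

            build_dataset is a zero-argument callable that returns a fresh
            streaming IterableDataset.
            """
            consumed = 0
            restarts = 0
            while restarts <= max_restarts:
                try:
                    # Fast-forward past everything already yielded (or skipped)
                    for example in build_dataset().skip(consumed):
                        consumed += 1
                        yield example
                    return  # clean end of the dataset
                except Exception:
                    consumed += 1  # step over the sample that raised
                    restarts += 1

        # Usage sketch:
        # factory = lambda: load_dataset("pixparse/cc12m-wds", split="train", streaming=True)
        # for example in robust_iter(factory):
        #     ...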
