Description
Feature request
Adding an option to skip corrupted data samples. Currently, datasets raises an error if a data sample is corrupted, which makes the user aware of the corruption and lets them handle it. However, when I try to catch the error at the user level, the iterator raises StopIteration the next time I call next().
The way I tried to do error handling (this doesn't work, unfortunately):
import numpy as np
from datasets import load_dataset
from PIL import Image

# Load the dataset with streaming enabled
dataset = load_dataset(
    "pixparse/cc12m-wds", split="train", streaming=True
)

# Get an iterator from the dataset
iterator = iter(dataset)

successful = 0
errors = 0

while True:
    try:
        # Try to get the next example
        example = next(iterator)
        # Try to access and process the image
        image = example["jpg"]
        pil_image = Image.fromarray(np.array(image))
        pil_image.verify()  # Verify it's a valid image file
    except StopIteration:  # Code path 1
        print("\nStopIteration was raised! Reached the end of the dataset")
        break
    except Exception as e:  # Code path 2
        errors += 1
        print(f"Error ({e})! Skipping this sample")
        continue
    else:
        successful += 1
This is because the IterableDataset already throws the error during next() (reaching Code path 2). And if I call next() again, it hits Code path 1: the inner iterator of IterableDataset (code) has already been stopped, so calling next() on it raises StopIteration.
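This matches standard Python generator semantics: once a generator raises an exception, it is permanently finished, and every later next() call raises StopIteration. A minimal stand-alone illustration (no datasets involved):

```python
def flaky():
    yield "sample-1"
    raise ValueError("corrupted sample")
    yield "sample-2"  # never reached

it = flaky()
print(next(it))  # sample-1

try:
    next(it)  # the generator raises here...
except ValueError:
    pass

# ...and is now exhausted: further next() calls raise
# StopIteration instead of resuming at "sample-2".
print(next(it, "exhausted"))  # exhausted
```

So no amount of try/except at the consumer side can resume the dataset's internal generator once decoding has raised inside it.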
So I cannot skip corrupted data samples this way. I would also love to hear any suggestions for building a robust dataloader.
Thanks for your help in advance!
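One workaround I can think of (a hedged sketch, not a datasets-endorsed recipe): keep the failure out of the dataset's own iterator by disabling automatic decoding, e.g. dataset = dataset.cast_column("jpg", datasets.Image(decode=False)) to receive raw bytes, then decode in user code inside a wrapper generator that skips failures. The skip pattern itself is shown below on a simulated byte stream, with JSON decoding standing in for image decoding; skip_corrupted and process are names I made up, not a datasets API:

```python
import json

def skip_corrupted(samples, process):
    # Yield process(sample) for each sample; skip any sample whose
    # processing raises, without killing the underlying iterator.
    for sample in samples:
        try:
            item = process(sample)
        except Exception:
            continue  # corrupted sample: skip and keep iterating
        yield item

# Simulated raw stream: the middle payload is corrupted.
stream = [b'{"id": 1}', b'\xff\xfe not valid', b'{"id": 2}']
decoded = list(skip_corrupted(stream, lambda raw: json.loads(raw.decode("utf-8"))))
print(decoded)  # [{'id': 1}, {'id': 2}]
```

Because the exception now happens in process() rather than inside the dataset's generator, the loop over samples keeps going.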
Motivation
Corruption in public datasets might be common
Many users rely on public datasets, and a public dataset might contain some corrupted data, especially datasets with images / videos etc. I totally understand that it is the dataset owner's and user's responsibility to ensure data integrity and to run data cleaning or preprocessing, but an option to skip corrupted samples would make things easier for developers who use the dataset.
Use cases
For example, a robust dataloader would make it easy for users who want to run quick tests on different datasets and choose the one that fits their needs. A user could then use an IterableDataset with streaming=True
to use the dataset directly, without downloading it and removing corrupted data samples from it first.
Your contribution
The error handling might not be trivial and might need more careful design.