Dataset Source for Training EVA-VLIP + SDXL as a Visual Autoencoder (LAION-COCO Dataset) #97

@shanpoyang654

Description


I'm working on training EVA-VLIP + SDXL as a visual autoencoder using the LAION-COCO dataset from Hugging Face. I've noticed that:

- The dataset contains parquet files with image URLs and captions, not the actual images.
- Many URLs appear to be broken or inaccessible.
- As a result, only a small fraction of the images can be downloaded successfully.
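For context, the common way to materialize LAION-style URL/caption parquet shards is to attempt every download and simply skip dead links. Below is a minimal sketch of that approach; the `URL` and `TEXT` column names and the `try_fetch`/`download_pairs` helper names are assumptions for illustration, not the actual LAION-COCO schema.

```python
import pandas as pd
import requests

def try_fetch(url, timeout=5):
    """Attempt to download one image; return its bytes, or None on any failure."""
    try:
        r = requests.get(url, timeout=timeout)
        r.raise_for_status()
        return r.content
    except Exception:
        return None

def download_pairs(df, fetch=try_fetch):
    """Keep only the (image_bytes, caption) pairs whose URL still resolves."""
    kept = []
    for row in df.itertuples(index=False):
        data = fetch(row.URL)
        if data is not None:
            kept.append((data, row.TEXT))
    return kept
```

In practice the LAION community's img2dataset tool does this at scale (parallel downloads, resizing, and failure logging), and is the usual way these parquet shards are turned into an image dataset.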

Questions:

1. For your training runs, did you use:
   - Only the images successfully downloaded from LAION-COCO's URLs?
   - Or a more complete, pre-downloaded copy of the dataset?

2. If you used an alternative dataset source:
   - Could you recommend where to obtain a more reliable version of LAION-COCO with higher URL availability?
   - Are there other datasets you'd recommend for training SDXL conditioned on visual embeddings?

3. For the image-text pairs that fail to download:
   - Did you implement any fallback strategies (e.g., using cached copies from other sources)?
   - Or did you simply exclude these samples from training?
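On the fallback question, one option sometimes tried before dropping a sample is to retry the URL through a web-archive snapshot. The sketch below only illustrates that idea; the `web.archive.org/web/0/<url>` pattern (which redirects to the latest snapshot) and the `fetch` callback are assumptions on my part, not anything the authors have confirmed using.

```python
def fetch_with_fallback(url, fetch):
    """Try the original URL first; on failure, try the Internet Archive's
    latest snapshot of that URL. Returns image bytes, or None if both fail
    (in which case the sample would be excluded from training)."""
    data = fetch(url)
    if data is None:
        data = fetch("https://web.archive.org/web/0/" + url)
    return data
```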
