A simple and robust Python script that automates the cleaning, validation, and processing of image datasets for machine learning projects.
This pipeline ensures that only high-quality, valid data is used for training AI models by inspecting each file, logging its metadata, and sorting it based on a defined set of rules.
- Validates Images: Checks each file for corruption, correct file format (`.jpg`, `.jpeg`, `.png`), and minimum dimensions.
- Extracts Metadata: Gathers key information like image dimensions (width, height), file size, and image format.
- Logs to a Database: Records the results for every file, both passed and failed, into a structured SQLite database for easy querying and analysis.
- Sorts Files: Automatically moves files to `validated` or `bad` folders based on the inspection outcome.
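As a rough illustration of the validation and metadata steps above, the per-file check might look like the sketch below. The function name, the `MIN_DIM` threshold, and the extension allow-list are assumptions for illustration, not the repository's actual code:

```python
from pathlib import Path

from PIL import Image

ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png"}  # assumed allow-list
MIN_DIM = 64  # assumed minimum width/height in pixels

def inspect_image(path: Path) -> dict:
    """Validate one file and collect its metadata; never raises on bad input."""
    result = {
        "file": path.name, "passed": False, "reason": None,
        "width": None, "height": None, "format": None,
        "size_bytes": path.stat().st_size,
    }
    if path.suffix.lower() not in ALLOWED_EXTENSIONS:
        result["reason"] = "unsupported extension"
        return result
    try:
        with Image.open(path) as img:
            img.verify()  # raises on truncated or corrupt data
        with Image.open(path) as img:  # reopen: verify() leaves the handle unusable
            result.update(width=img.width, height=img.height, format=img.format)
    except Exception as exc:
        result["reason"] = f"corrupt image: {exc}"
        return result
    if result["width"] < MIN_DIM or result["height"] < MIN_DIM:
        result["reason"] = "below minimum dimensions"
        return result
    result["passed"] = True
    return result
```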
- Language: Python 3
- Libraries:
  - `Pillow` (PIL): For image inspection and validation.
  - `sqlite3`: For database management.
  - `tqdm`: For displaying a clean progress bar.
  - `pathlib`: For modern, OS-agnostic path management.
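Of these, `sqlite3` and `pathlib` ship with the Python standard library, so the project's `requirements.txt` plausibly lists only the third-party packages (the exact contents and any version pins are an assumption):

```
Pillow
tqdm
```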
To run this pipeline, follow these steps:
- Clone the repository:

  ```bash
  git clone https://github.com/thonmay/python-data-pipeline.git
  cd python-data-pipeline
  ```
- Create a virtual environment and install dependencies:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows: venv\Scripts\activate
  pip install -r requirements.txt
  ```
- Add data:

  Place your raw image files into the `source_data` folder. (The folder already contains around 1,000 images.)
- Set up the database:

  This script only needs to be run once to create the database file and table.

  ```bash
  python setup_database.py
  ```
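  For reference, a minimal `setup_database.py` might look like the sketch below; the table name and columns are assumptions inferred from the metadata fields described above, not necessarily the repository's actual schema:

  ```python
  import sqlite3

  # Assumed schema: one row per inspected file, passed or failed.
  conn = sqlite3.connect("image_metadata.db")
  conn.execute("""
      CREATE TABLE IF NOT EXISTS image_metadata (
          id INTEGER PRIMARY KEY AUTOINCREMENT,
          filename TEXT NOT NULL,
          width INTEGER,
          height INTEGER,
          format TEXT,
          size_bytes INTEGER,
          passed INTEGER NOT NULL,  -- 1 = validated, 0 = rejected
          reason TEXT               -- failure reason, NULL on success
      )
  """)
  conn.commit()
  conn.close()
  ```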
- Run the pipeline:

  ```bash
  python main_pipeline.py
  ```
The script will process all files in `source_data`, populate the `image_metadata.db` database, and move the files to the appropriate `processed_data` subfolders.
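Conceptually, `main_pipeline.py` ties these pieces together roughly as follows. This is a sketch rather than the repository's actual code: it assumes the hypothetical `inspect_image` helper and `image_metadata` table from the earlier sketches, and that the `validated` and `bad` subfolders live under `processed_data`:

```python
import shutil
import sqlite3
from pathlib import Path

from tqdm import tqdm

from validation import inspect_image  # hypothetical module holding the earlier sketch

SOURCE = Path("source_data")
VALIDATED = Path("processed_data") / "validated"
BAD = Path("processed_data") / "bad"

def run_pipeline() -> None:
    VALIDATED.mkdir(parents=True, exist_ok=True)
    BAD.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect("image_metadata.db")
    for path in tqdm(sorted(SOURCE.iterdir()), desc="Processing"):
        if not path.is_file():
            continue
        meta = inspect_image(path)
        # Log every file, passed or failed, into the assumed image_metadata table.
        conn.execute(
            "INSERT INTO image_metadata "
            "(filename, width, height, format, size_bytes, passed, reason) "
            "VALUES (?, ?, ?, ?, ?, ?, ?)",
            (meta["file"], meta["width"], meta["height"], meta["format"],
             meta["size_bytes"], int(meta["passed"]), meta["reason"]),
        )
        # Sort the file into the matching processed_data subfolder.
        dest = VALIDATED if meta["passed"] else BAD
        shutil.move(str(path), str(dest / path.name))
    conn.commit()
    conn.close()

if __name__ == "__main__":
    run_pipeline()
```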