The repositories share a set of core modules while remaining decoupled from one another. To keep each repository focused, task-specific notebooks belong in the corresponding task repository.
Data Ingestion:
Repository | Description |
---|---|
scrapelib | Utilities for large-scale data scraping and extraction, enabling dataset collection and preprocessing. |
unibox | Unified data access layer for seamless intake and export across various file formats (e.g., Parquet, PNG) and storage backends (e.g., local, S3, Hugging Face). |
dataproc5 | Orchestrates data processing pipelines with Kedro, aggregating silver- and gold-tier data from scrapes. |
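
To illustrate what a "unified data access layer" means in practice, here is a minimal sketch of format dispatch by file extension. It is not unibox's actual API: the names `load_any`, `save_any`, and `HANDLERS` are hypothetical, and only local JSON and CSV files are covered (no Parquet, images, S3, or Hugging Face), so that the example runs on the standard library alone.

```python
import csv
import json
from pathlib import Path
from typing import Any, Callable, Dict, List, Tuple


def _load_json(path: Path) -> Any:
    with path.open("r", encoding="utf-8") as f:
        return json.load(f)


def _save_json(data: Any, path: Path) -> None:
    with path.open("w", encoding="utf-8") as f:
        json.dump(data, f)


def _load_csv(path: Path) -> List[Dict[str, str]]:
    # Rows come back as dicts keyed by the header row; all values are strings.
    with path.open("r", newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


def _save_csv(rows: List[Dict[str, Any]], path: Path) -> None:
    # Column order is taken from the first row (empty input writes an empty header).
    fieldnames = list(rows[0].keys()) if rows else []
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


# Hypothetical registry: file suffix -> (loader, saver).
HANDLERS: Dict[str, Tuple[Callable[[Path], Any], Callable[[Any, Path], None]]] = {
    ".json": (_load_json, _save_json),
    ".csv": (_load_csv, _save_csv),
}


def _handler_for(path: Path) -> Tuple[Callable[[Path], Any], Callable[[Any, Path], None]]:
    try:
        return HANDLERS[path.suffix.lower()]
    except KeyError:
        raise ValueError(f"unsupported file format: {path.suffix!r}") from None


def load_any(path: str) -> Any:
    """Load a file, choosing the parser from its extension."""
    p = Path(path)
    loader, _ = _handler_for(p)
    return loader(p)


def save_any(data: Any, path: str) -> None:
    """Save data, choosing the serializer from the target extension."""
    p = Path(path)
    _, saver = _handler_for(p)
    saver(data, p)
```

With a registry like this, callers use one pair of functions regardless of format, for example `save_any(rows, "tags.csv")` followed by `load_any("tags.csv")`. Supporting a new format or backend means adding an entry to the registry rather than changing call sites.
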
Model Training & Inference:
Repository | Description |
---|---|
trainlib | Framework for training and experiment logging, supporting classifiers, SDXL, VLM, and other models. |
procslib | Inference framework for trained models, supporting aesthetics scoring, taggers, CV2 metrics, and VLM-based evaluations. |
Data Processing & Experimentation:
Repository | Description |
---|---|
aeslib | Aesthetic score processing, including data collection, cleaning, quality assurance, and model evaluation. Excludes training logic. |
audiolib | Handles audio-related data processing, including segmentation, tagging, and dataset preparation. |
imagelib | Image data processing for SD/SDXL training, encompassing metadata collection, dataset pipelines, and filtering configurations. Excludes training logic. |
videolib | Video data processing for sources like HunyuanVideo and LTXV, featuring video sectioning, optical flow filtering, VLM tagging, and dataset preparation. |