-
Notifications
You must be signed in to change notification settings - Fork 3
Open
Labels
datasetsDatasets related featuresDatasets related featuresfeature requestNew feature or requestNew feature or request
Description
User Story
Sometimes suitable datasets cannot be found on huggingface and it's better to use a smaller local dataset + synthesis / augmentation using some local files such as PDF, word, slides, etc. Handling these media would be helpful as well to turn documents into trainable datasets!
This issue can be tackled in the following steps:
- First validate and enhance general support for local datasets uploading + preprocessing
- Implement support for vision local datasets
- Implement various synthesis and augmentation strategies powered by https://github.com/meta-llama/synthetic-data-kit
The tool is designed to follow a simple CLI structure with 4 commands:
ingest various file formats
create your fine-tuning format: QA pairs, QA pairs with CoT, summary format
curate: Using Llama as a judge to curate high quality examples.
save-as: After that you can simply save these to a format that your fine-tuning workflow requires.
Acceptance criteria
Upgrades to preprocessing service
This is an end-to-end example with Llama, which we should replicate with Gemma: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Meta_Synthetic_Data_Llama3_2_(3B).ipynb
Metadata
Metadata
Assignees
Labels
datasetsDatasets related featuresDatasets related featuresfeature requestNew feature or requestNew feature or request