
[feature] Improve general support for local datasets and documents #72

@supreme-gg-gg

Description


User Story

Sometimes a suitable dataset cannot be found on Hugging Face, and it is better to use a smaller local dataset plus synthesis/augmentation driven by local files such as PDFs, Word documents, slides, etc. Handling these media types so that documents can be turned into trainable datasets would be helpful as well!
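For context, pulling raw text out of such documents is the easy part. A minimal sketch with `pypdf` (one possible library choice, not necessarily what we will end up using; `local_report.pdf` is a made-up file name):

```python
# Minimal sketch: extract plain text from a local PDF so it can feed a
# synthesis/augmentation step. pypdf is only one possible library choice,
# and "local_report.pdf" is a made-up file name.
from pypdf import PdfReader

reader = PdfReader("local_report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted {len(text)} characters from {len(reader.pages)} pages")
```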

This issue can be tackled in the following steps:

  1. Validate and enhance general support for uploading + preprocessing local datasets
  2. Implement support for local vision datasets
  3. Implement various synthesis and augmentation strategies powered by https://github.com/meta-llama/synthetic-data-kit
     synthetic-data-kit is designed around a simple CLI structure with 4 commands:

     - `ingest`: parse various file formats
     - `create`: generate your fine-tuning format: QA pairs, QA pairs with CoT, or summaries
     - `curate`: use Llama as a judge to keep only high-quality examples
     - `save-as`: save the results in whatever format your fine-tuning workflow requires
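As a rough sketch of how the `save-as` output could flow back into our preprocessing service, here is a minimal Python example. The file name `qa_pairs.json` and the `question`/`answer` field names are assumptions for illustration, not the kit's actual output schema:

```python
# Minimal sketch: load QA pairs saved by `save-as` into a Hugging Face Dataset.
# Assumptions: the output is a JSON/JSONL file named "qa_pairs.json" with
# "question" and "answer" fields; the real schema depends on how
# synthetic-data-kit saves its results and may differ.
from datasets import load_dataset

dataset = load_dataset("json", data_files="qa_pairs.json", split="train")

def to_prompt_completion(example):
    # Reshape each QA pair into a generic prompt/completion layout
    # (hypothetical field names for our preprocessing service).
    return {
        "prompt": example["question"],
        "completion": example["answer"],
    }

dataset = dataset.map(to_prompt_completion, remove_columns=dataset.column_names)
print(dataset[0])
```

If something like this works, the rest of the preprocessing pipeline ideally should not need to care whether the data originally came from Hugging Face or from local documents.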

Acceptance criteria

Upgrades to preprocessing service

This is an end-to-end example with Llama, which we should replicate with Gemma: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Meta_Synthetic_Data_Llama3_2_(3B).ipynb
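On the Gemma side, the main data-formatting difference from that Llama notebook is the chat template. A minimal sketch using `transformers`, assuming the `prompt`/`completion` columns from the snippet above and using `google/gemma-2-2b-it` only as a stand-in checkpoint:

```python
# Minimal sketch: render QA pairs with Gemma's chat template instead of Llama's.
# Assumptions: "prompt"/"completion" columns as in the earlier sketch, and
# "google/gemma-2-2b-it" as a stand-in for whichever Gemma checkpoint we target.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

def format_for_gemma(example):
    # Gemma's chat template has no system role, so we only pass user and
    # assistant turns; apply_chat_template(tokenize=False) returns a string.
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

# Usage on the dataset from the previous sketch:
# dataset = dataset.map(format_for_gemma)
```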

Labels

datasets (Datasets related features), feature request (New feature or request)
