
[feature] Improve general support for local datasets and documents #72

@supreme-gg-gg

Description


User Story

Sometimes a suitable dataset cannot be found on Hugging Face, and it is better to use a smaller local dataset plus synthesis/augmentation driven by local files such as PDFs, Word documents, slides, etc. Handling these media types so that documents can be turned into trainable datasets would be helpful as well!
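For context, pulling raw text out of such documents is the easy part. A minimal sketch with `pypdf` (one possible library choice, not necessarily what we will end up using; `local_report.pdf` is a made-up file name):

```python
# Minimal sketch: extract plain text from a local PDF so it can feed a
# synthesis/augmentation step. pypdf is only one possible library choice,
# and "local_report.pdf" is a made-up file name.
from pypdf import PdfReader

reader = PdfReader("local_report.pdf")
text = "\n".join(page.extract_text() or "" for page in reader.pages)
print(f"Extracted {len(text)} characters from {len(reader.pages)} pages")
```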

This issue can be tackled in the following steps:

  1. Validate and enhance general support for uploading + preprocessing local datasets
  2. Implement support for local vision datasets
  3. Implement various synthesis and augmentation strategies powered by https://github.com/meta-llama/synthetic-data-kit
     synthetic-data-kit is designed around a simple CLI structure with 4 commands:

     - `ingest`: parse various file formats
     - `create`: generate your fine-tuning format: QA pairs, QA pairs with CoT, or summaries
     - `curate`: use Llama as a judge to keep only high-quality examples
     - `save-as`: save the results in whatever format your fine-tuning workflow requires
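As a rough sketch of how the `save-as` output could flow back into our preprocessing service, here is a minimal Python example. The file name `qa_pairs.json` and the `question`/`answer` field names are assumptions for illustration, not the kit's actual output schema:

```python
# Minimal sketch: load QA pairs saved by `save-as` into a Hugging Face Dataset.
# Assumptions: the output is a JSON/JSONL file named "qa_pairs.json" with
# "question" and "answer" fields; the real schema depends on how
# synthetic-data-kit saves its results and may differ.
from datasets import load_dataset

dataset = load_dataset("json", data_files="qa_pairs.json", split="train")

def to_prompt_completion(example):
    # Reshape each QA pair into a generic prompt/completion layout
    # (hypothetical field names for our preprocessing service).
    return {
        "prompt": example["question"],
        "completion": example["answer"],
    }

dataset = dataset.map(to_prompt_completion, remove_columns=dataset.column_names)
print(dataset[0])
```

If something like this works, the rest of the preprocessing pipeline ideally should not need to care whether the data originally came from Hugging Face or from local documents.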

Acceptance criteria

Upgrades to preprocessing service

This is an end-to-end example with Llama, which we should replicate with Gemma: https://colab.research.google.com/github/unslothai/notebooks/blob/main/nb/Meta_Synthetic_Data_Llama3_2_(3B).ipynb
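On the Gemma side, the main data-formatting difference from that Llama notebook is the chat template. A minimal sketch using `transformers`, assuming the `prompt`/`completion` columns from the snippet above and using `google/gemma-2-2b-it` only as a stand-in checkpoint:

```python
# Minimal sketch: render QA pairs with Gemma's chat template instead of Llama's.
# Assumptions: "prompt"/"completion" columns as in the earlier sketch, and
# "google/gemma-2-2b-it" as a stand-in for whichever Gemma checkpoint we target.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-2b-it")

def format_for_gemma(example):
    # Gemma's chat template has no system role, so we only pass user and
    # assistant turns; apply_chat_template(tokenize=False) returns a string.
    messages = [
        {"role": "user", "content": example["prompt"]},
        {"role": "assistant", "content": example["completion"]},
    ]
    return {"text": tokenizer.apply_chat_template(messages, tokenize=False)}

# Usage on the dataset from the previous sketch:
# dataset = dataset.map(format_for_gemma)
```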

Labels

datasets (Datasets related features), feature request (New feature or request)
