introduce lofi and hifi commands for train, generate. Add Data Mixing
This enhancement discusses "more intensive" training and data generation techniques as well as a new Data Mixing command. This is all built off of the command redesign. The goal here is to produce higher fidelity models using the CLI.

Signed-off-by: Charlie Doern <[email protected]>
cdoern committed May 18, 2024
# Introduce Lofi and Hifi Commands for Key ilab Functions to Increase Model Fidelity

This document describes adding different data generation, mixing, and training backends to ilab to enable higher fidelity training.

Currently, all training is done via QLoRA or similar techniques. Adding the following commands will enable higher fidelity training and introduce new capabilities such as data mixing.

## Key Components

### Building off of the InstructLab Structure Redesign

After github.com/instructlab/instructlab/pull/990 is merged, ilab will use a parent -> child command structure. This proposal operates under that new structure.

This is the proposed new structure under the above Enhancement:

```console
ilab
|
|_______model
| |
| |____convert
| |____download
| |____train (--convert)
| |____serve (-i)
| |____chat
| |____inference
| |
|_______data
| |
| |____generate
| |
|_______config
| |
| |____init
| |
|_______tax
| |
| |____diff
| |____check
| |____download
```

And this would be the structure after these new commands are added:

```console
ilab
|
|_______model
| |
| |____convert
| |____download
| |____train (--convert)
| | |
| | |______lofi *
| | |______hifi *
| |
| |____serve (-i)
| |____chat
| |____inference
| |
|_______data
| |
| |____generate
| | |
| | |______lofi *
| | |______hifi *
| |
| |____mix *
| |
|_______config
| |
| |____init
| |
|_______tax
| |
| |____diff
| |____check
| |____download
```

The starred commands are the new ones under the redesigned structure. We are opting to add lofi and hifi as subcommands of train and generate, rather than as flags, given their completely different functionality and backend structure.

These commands would connect to the InstructLab backend, which will take the form of libraries. The lofi commands, if accepted, will be equivalent to what currently exists for generate and train: QLoRA with PyTorch, MLX, etc.

The Hifi versions would validate the existence of hardware that can properly run the generation, mixing, and training backends. At least for training, the existing infrastructure simply shells out to various Python scripts, libraries, etc. So, as long as we consolidate this backend code into a place that can be imported into ilab without breaking other dependencies, this should be more of a structural change than a functional one. We know the backend code works on an isolated system; we just need to make it pluggable.

Hifi can run locally on someone's laptop or desktop and even utilize DeepSpeed if they have multiple GPUs. On a more powerful system, the user can also run it in a container, utilize DeepSpeed, and potentially even distribute the workload across machines using torch.distributed.
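
To make the multi-machine case concrete, a launch under stock torch.distributed tooling might look like the sketch below (node counts, the rendezvous endpoint, and the training script name are placeholders, not part of this proposal):

```console
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=head-node:29500 \
  train.py --input-dir generated/mixed
```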

### Reasoning

Plugging into hardware acceleration and multi-phase training is the logical next step for ilab. Ensuring we do this in a clean way that does not overload our current commands is also crucial. Many of the processes in the backend are confusing, so we want to abstract some of the steps away from users while still giving them a reasonable amount of choice in configuring these new processes. However, maintaining the current laptop story is important to users without hardware access. Splitting these two paths into separate commands maintains the integrity of each.

### ilab model train hifi

This command would take something like the following arguments:

* `--gpus=str`: the number of GPUs (of those available) to use for this process, in the form `0-1`, `8`, etc.
* `--multi-phase=bool`: whether to run multi-phase training
* `--optimizer=str` (deepspeed, fsdp, adam): the optimizer to use during training
* `--distributed` or `--containerized`: triggers a containerized version of the train workflow with hardcoded dependencies
* `--learning-rate=int` (?)
* `--batch-len=int`
* `--input-dir=str`: where the generated and mixed data to be trained on lives
* `--model-name=str`: the name of the model to be output
* `--output-dir=str`: where to put the model after training
* `--num-epochs=int`: the number of epochs to run in this phase of training

Plus, most likely, many of the current train flags.
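
A hypothetical invocation, purely for illustration (flag spellings, the model name, and default values are not final):

```console
ilab model train hifi \
  --gpus=0-3 \
  --multi-phase=true \
  --optimizer=deepspeed \
  --input-dir=generated/mixed \
  --model-name=my-hifi-model \
  --output-dir=models/ \
  --num-epochs=10
```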

#### Implementation Specifics

The Transformers library (which, along with torch, is what we currently use for training) has options for FSDP, DeepSpeed, and many of the related arguments already in use elsewhere in the project. My idea here is to use these as much as we can in the CLI rather than write our own custom versions that we then have to maintain.
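
As a rough sketch of that idea, assuming we stay on `transformers.Trainer`: the helper below and its flag-to-argument mapping are hypothetical, not an agreed design, but the `TrainingArguments` fields it sets all exist in the library today.

```python
# Hypothetical sketch: map ilab's hifi train flags onto the FSDP/DeepSpeed
# support that already ships with transformers, instead of maintaining
# custom training loops per optimizer.
from transformers import TrainingArguments


def build_training_args(optimizer: str, learning_rate: float, batch_len: int,
                        num_epochs: int, output_dir: str) -> TrainingArguments:
    kwargs = dict(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_len,
    )
    if optimizer == "fsdp":
        # torch FSDP via transformers: shard model and optimizer state
        kwargs["fsdp"] = "full_shard auto_wrap"
    elif optimizer == "deepspeed":
        # transformers defers ZeRO stage, offload, etc. to a DeepSpeed config
        kwargs["deepspeed"] = "ds_config.json"
    # "adam" falls through to transformers' default AdamW optimizer
    return TrainingArguments(**kwargs)
```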

There might be a use case for kicking off some custom DeepSpeed code for more in-depth inter-checkpoint eval. That would be the point of the `--containerized` flag. When this flag is used, all user options would be honored and passed as build args into a containerized workflow that might use some different code under the hood.
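
For illustration, the flag-to-build-arg handoff could look something like the sketch below (the image name and build args are invented):

```console
podman build \
  --build-arg NUM_EPOCHS=10 \
  --build-arg OPTIMIZER=deepspeed \
  -t ilab-train-hifi .
podman run --device nvidia.com/gpu=all ilab-train-hifi
```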

Keep in mind, though, this is the community CLI! I feel as though we should try to find a middle ground between server use cases and community use cases. Having the default path in the InstructLab CLI be torch+transformers makes sense for the following use cases:

1. Developer with a gaming PC:
   * Transformers+PyTorch support QLoRA and FSDP. While DeepSpeed might be a more "server-rack" use case, having multi-phase training in the CLI for anyone with a consumer GPU makes sense.
2. Someone interested in ML, who has a homelab, or *anything with 2 GPUs*:
   * Transformers+PyTorch supports DeepSpeed on a single system, spreading the training over the GPUs. Any professional or hobbyist with 2 GPUs will be looking for this option.
3. The laptop use case:
   * Maintaining QLoRA as the performant training mode for the laptop is crucial, as most people cannot handle the full models. However, unlocking better results by using FSDP+QLoRA could improve local results and get people more interested in InstructLab.

The above use cases create a spectrum of possibilities for what users can do with ilab! Adding `ilab model train lofi/hifi`, each with different options for optimizers: Adam (default, QLoRA), FSDP (QLoRA or non-QLoRA train), and DeepSpeed (non-QLoRA, multi-GPU), increases the number of situations where ilab is viable. Adding other options like `--multi-phase`, `--learning-rate`, etc. gives the user granular control over this new multi-phased training approach.

### ilab model train lofi

This command would have the same arguments as the current train as of github.com/instructlab/instructlab/pull/1157.

The main emphasis here is on the backend to be used: mlx or pytorch. GPU acceleration here is not an option, nor is multi-phase training.
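
For example, backend selection might look like the following; the `--backend` flag name is illustrative and would mirror whatever the current train command settles on:

```console
ilab model train lofi --backend=mlx
ilab model train lofi --backend=pytorch
```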

### ilab data generate hifi

This command would take something like the following arguments:

* `--num-samples=int`
* `--num-grounded-questions=int`
* `--num-gen-proc=int`
* `--num-util-proc=int` (or is this for mixing?)
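
A hypothetical invocation (values are illustrative):

```console
ilab data generate hifi \
  --num-samples=1000 \
  --num-grounded-questions=10 \
  --num-gen-proc=8 \
  --num-util-proc=4
```
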
### ilab data generate lofi

This command would be the same as the existing `ilab generate`.

### ilab data mix

This command would take something like the following arguments:

* `--num-util-proc=int`
* `--output-dir=str` (defaults to `generated/mixed`)
* `--knowledge-recipes=[]str` (path to YAML)
* `--skill-recipes=[]str` (path to YAML)
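
A hypothetical invocation (recipe paths are placeholders):

```console
ilab data mix \
  --knowledge-recipes=recipes/knowledge.yaml \
  --skill-recipes=recipes/skills.yaml \
  --num-util-proc=8 \
  --output-dir=generated/mixed
```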

*Do we need an `ilab recipe` command?*

## Alternatives

The other alternative is to keep the same train and generate commands and instead add a `--backend` or `--hifi` flag to trigger the high fidelity code. The issue here is that `ilab train` is already overloaded with pytorch, mlx, etc. Adding more switches and dials to the main train code would make it hard to maintain.
