# Introduce Commands that Run Jobs with Different Fidelity Levels for Key ilab Functions

This document describes adding different data generation, mixing, and training backends to ilab to enable higher fidelity training using the backend code.

Currently, all training is done via QLoRA or the like. Adding the following commands will enable higher fidelity training and introduce new commands such as data mixing.

## Key Components

### Building off of the InstructLab Structure Redesign

After github.com/instructlab/instructlab/pull/990 is merged, ilab will use a parent -> child command structure. This proposal operates under that new structure.

This is the proposed new structure under the above enhancement:

```console
ilab
|
|_______model
|       |
|       |____convert
|       |____download
|       |____train (--convert)
|       |____serve (-i)
|       |____chat
|       |____inference
|       |
|_______data
|       |
|       |____generate
|       |
|_______config
|       |
|       |____init
|       |
|_______tax
|       |
|       |____diff
|       |____check
|       |____download
```

And this would be the structure after these new commands are added:

```console
ilab
|
|_______model
|       |
|       |____convert
|       |____download
|       |____train (--convert)
|       |    |
|       |    |______integrated *
|       |    |______phased *
|       |    |
|       |____serve (-i)
|       |____chat
|       |____inference
|       |
|_______data
|       |
|       |____generate
|       |    |
|       |    |______lofi * (name pending)
|       |    |______hifi * (name pending)
|       |    |
|       |____mix *
|       |
|_______checkpoint
|       |
|       |____evaluate *
|       |
|_______config
|       |
|       |____init
|       |
|_______tax
|       |
|       |____diff
|       |____check
|       |____download
```

The starred commands are the new ones under the redesigned structure. We are opting to add `integrated` and `phased` as subcommands under `train`, and `lofi`/`hifi` under `generate`, rather than flags, given their completely different functionality and backend structure.

These commands would connect to the InstructLab backend, which will be in the form of libraries. The lower fidelity commands, if accepted, will be the equivalent of what currently exists for `generate` and `train`: QLoRA, with PyTorch, MLX, etc.

The higher fidelity versions would validate the existence of hardware that can properly run the generation, mixing, and training backends. At least for training, the existing infrastructure simply shells out to various Python scripts, libraries, etc. So, as long as we combine this backend code into a place that can be imported into ilab without breaking other dependencies, this should be more of a structural change than a functional one. We know the backend code works on an isolated system; we just need to make it pluggable.
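
As a minimal sketch of what that validation could look like, assuming a torch-based check (the `validate_hardware` name and the thresholds are hypothetical, not settled requirements):

```python
import torch


def validate_hardware(min_gpus: int = 1, min_vram_gib: float = 16.0) -> bool:
    """Best-effort check that this host can run the high fidelity backends.

    The GPU count and VRAM thresholds are placeholders, not settled requirements.
    """
    if not torch.cuda.is_available() or torch.cuda.device_count() < min_gpus:
        return False
    # Total memory of the first CUDA device, in GiB.
    vram_gib = torch.cuda.get_device_properties(0).total_memory / 2**30
    return vram_gib >= min_vram_gib
```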

High fidelity can run locally on someone's laptop or desktop and can even utilize deepspeed if they have GPUs. On a more powerful system, the user can also run it in a container, utilize deepspeed, and potentially even distribute the workload across machines using torch.distributed.
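
For the multi-machine case, here is a minimal sketch of the process-group setup, assuming a torchrun-style launcher has set the usual environment variables (`RANK`, `WORLD_SIZE`, `LOCAL_RANK`); none of this is settled ilab code:

```python
import os

import torch
import torch.distributed as dist


def init_distributed() -> int:
    """Join the training process group; assumes torchrun-style env vars are set."""
    dist.init_process_group(backend="nccl")  # NCCL backend for CUDA GPUs
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)
    return local_rank
```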

A key component here, too, is checkpoint evaluation. We want users to have an understanding of model checkpoints, starting with the ability to run evaluation on them in between phased training runs. The output of `ilab checkpoint evaluate` will tell the user what to point `ilab model train phased` arguments towards.

### Reasoning

Plugging into hardware acceleration and multi-phase training is the logical next step for ilab. Ensuring we do this in a clean way that does not overload our current commands is also crucial. Many of the processes in the backend are confusing, so we want to abstract some of the steps away from users while also giving them a reasonable amount of choice in configuring these new processes. However, maintaining the current laptop story is important for users without hardware access. Splitting these two paths into separate commands maintains the integrity of each.

### ilab model train integrated

This command would take something like the following arguments:

* `--gpus=str`, describes the number of GPUs (of what is available) to use for this process. This comes in the form of: 0-1, 8, etc.
* `--quantize=bool`, enables QLoRA, which loads the model in a quantized form so it can fit on a consumer GPU (see the sketch after this list)
* `--optimizer=str` (deepspeed, fsdp), describes the optimizer framework to use during training
* `--learning-rate=int` (?)
* `--batch-len=int`
* `--input-dir=str`, describes where the generated+mixed data lives to be trained on
* `--model-name=str`, name of the model to be output
* `--output-dir=str`, where to put the model after training
* `--num-epochs=int`, the number of epochs to run in this phase of training
* `--device=str`, the accelerator to use: cuda, rocm, cpu, mlx, mps
* ...plus many of the current train flags, probably
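
For the `--quantize` path, here is a minimal sketch of loading the model in 4-bit form, assuming the Hugging Face `transformers`, `bitsandbytes`, and `peft` stack; the model name and LoRA hyperparameters below are illustrative, not decided:

```python
import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# 4-bit NF4 quantization so the base model fits in consumer GPU VRAM.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "instructlab/merlinite-7b-lab",  # placeholder model name
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Attach low-rank adapters; only these small matrices are trained.
lora_config = LoraConfig(r=4, lora_alpha=16, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
```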

#### Implementation Specifics

The Transformers library (which is what we currently use for training, along with torch) has options for fsdp, deepspeed, and many of the related arguments currently in use elsewhere in the project. My idea here is to use these as best we can in the CLI rather than make our own custom versions that we then have to maintain.
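
As a rough illustration of that approach, here is a minimal sketch reusing `model` from a loading step like the one sketched above and assuming an already-tokenized `train_dataset`; every value shown is a placeholder, not a settled default:

```python
from transformers import Trainer, TrainingArguments

# FSDP and DeepSpeed are first-class TrainingArguments options, so ilab can
# mostly translate its CLI flags into this one object rather than maintain
# custom training loops.
args = TrainingArguments(
    output_dir="./training_results",  # --output-dir
    num_train_epochs=5,               # --num-epochs
    learning_rate=2e-4,               # --learning-rate
    per_device_train_batch_size=8,    # roughly --batch-len
    fsdp="full_shard auto_wrap",      # --optimizer=fsdp
    # deepspeed="ds_config.json",     # --optimizer=deepspeed (use one or the other)
)

trainer = Trainer(model=model, args=args, train_dataset=train_dataset)
trainer.train()
```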

There might be a use case for kicking off some custom deepspeed code for more in-depth inter-checkpoint eval. That would be the point of the `phased` command. The phased train command is meant to be run in conjunction with `ilab checkpoint evaluate`, which will give the user the best checkpoint to run the next phase on.

Keep in mind, though, this is the community CLI! I feel as though we should try to find a middle ground between server use cases and community use cases. Having the default path in the InstructLab CLI be torch+transformers makes sense for the following use cases:

1. Developer with a gaming PC:
   * Transformers+PyTorch supports QLoRA and FSDP. While deepspeed might be a more "server-rack" use case, having multi-phase training in the CLI for anyone with a consumer GPU makes sense.
2. Someone interested in ML, has a homelab, or *anything with 2 GPUs*:
   * Transformers+PyTorch supports deepspeed on a single system, spreading the training over the GPUs. Any professional or hobbyist that has 2 GPUs will be looking for this option.
3. The laptop use case:
   * Maintaining QLoRA as the performant training mode for the laptop is crucial, as most people cannot handle the full models. However, unlocking some better results by using FSDP+QLoRA could improve local results and get people more interested in InstructLab.

The above use cases create a spectrum of what users can do with ilab! Adding `ilab model train integrated/phased`, each with different optimizer options (Adam as the default with QLoRA, FSDP for QLoRA or non-QLoRA training, and deepspeed for non-QLoRA, multi-GPU training), increases the number of situations where ilab is viable. Adding other options like `--multi-phase`, `--learning-rate`, etc. gives the user granular control over this new multi-phased training approach.

### ilab model train phased

This command would take roughly the following arguments:

* `--optimizer=str` (deepspeed, fsdp), describes the optimizer framework to use during training
* `--device=str`, the accelerator to use: cpu, cuda, rocm, mlx, mps
* `--model-dir=path`, dir where the model to be used in this phase is located
* `--data-dir=path`, dir where the data for this phase is located

Note there is no LoRA in this command, and there is no quantization. The `ilab model train integrated` command will use the transformers library with PyTorch because those have awesome plugins for deepspeed and fsdp WITH LoRA and QLoRA. Those are absolute necessities for the community use case. However, they will be mostly unused in the "High Fidelity" use case.

### ilab data generate hifi

This command would take something like the following arguments:

* `--num-samples=int`
* `--num-grounded-questions=int`
* `--num-gen-proc=int`
* `--num-util-proc=int` (or is this for mixing?)

### ilab data generate lofi

This command would be the same as the existing `ilab generate`.

### ilab data mix

This command would take something like the following arguments:

* `--num-util-proc=int`
* `--output-dir=str` (defaults to generated/mixed)
* `--knowledge-recipes=[]str` (path to yaml)
* `--skill-recipes=[]str` (path to yaml)

*Do we need an `ilab recipe` cmd?*
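
Since the recipe format is not pinned down yet, here is a rough sketch of mixing, assuming a recipe is a YAML file listing dataset paths and sampling ratios; the schema, field names, and `mix_recipe` helper are all hypothetical:

```python
import json
import random

import yaml


def mix_recipe(recipe_path: str, output_path: str) -> None:
    """Hypothetical mixer: sample each listed dataset per its ratio, shuffle, write JSONL."""
    with open(recipe_path) as f:
        recipe = yaml.safe_load(f)  # assumed schema: {datasets: [{path, sampling_ratio}]}

    mixed = []
    for entry in recipe["datasets"]:
        with open(entry["path"]) as f:
            samples = [json.loads(line) for line in f]
        # Take the configured fraction of each source dataset.
        k = int(len(samples) * entry.get("sampling_ratio", 1.0))
        mixed.extend(random.sample(samples, k))

    random.shuffle(mixed)
    with open(output_path, "w") as f:
        for sample in mixed:
            f.write(json.dumps(sample) + "\n")
```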

## Workflows

### ilab model train integrated

A user on a desktop with a consumer GPU (assume an RTX 20/30 series) would run something like:

`ilab model train integrated --device=cuda --quantize --optimizer=deepspeed --num-epochs=5`

This would:

1. load the model in 4-bit quantized form onto the GPU's VRAM
2. set up the transformers trainer with this model and with a hardcoded deepspeed config that ilab would come with (sketched below)
3. train for 5 epochs using FusedAdam as the optimizer and deepspeed on top of that
4. give you a model in safetensors format (we cannot convert a quantized safetensors model)
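
A minimal sketch of what that hardcoded deepspeed config could look like; transformers accepts a dict in place of a JSON file path for its `deepspeed` argument, and the ZeRO stage and offload choices here are assumptions, not settled defaults:

```python
# Hypothetical built-in config that ilab would ship with.
ILAB_DEEPSPEED_CONFIG = {
    "zero_optimization": {
        "stage": 2,                              # shard optimizer state and gradients
        "offload_optimizer": {"device": "cpu"},  # spill optimizer state to CPU RAM
    },
    "bf16": {"enabled": "auto"},
    "gradient_accumulation_steps": "auto",
    "train_micro_batch_size_per_gpu": "auto",
}
```

Passing `deepspeed=ILAB_DEEPSPEED_CONFIG` to `TrainingArguments` would then wire it into the trainer from the earlier sketch.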

The big advantage here is more "fine-tuned" training than currently exists in the CLI because of deepspeed (or fsdp). The user could even set this up for multi-GPU or multi-system support with future ilab enhancements.

### ilab model train phased

A user on a GPU-enabled server would run something like the following (assuming phase00 has run):

`ilab model train phased --device=cuda --model-dir=./phase00/model --data-dir=./phase00/data`
`ilab checkpoint evaluate ./phase05/checkpoints --output-dir=./phase10`
`ilab model train phased --device=cuda --model-dir=./phase10/model --data-dir=./phase10/data`
....

Basically, they would run phased training with an eval in between. The eval looks at the checkpoints output by the previous phase and outputs a model dir in the next phase's working directory.
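
A rough sketch of what `ilab checkpoint evaluate` could do under the hood, assuming checkpoints land in `checkpoint-*` directories; the `score_checkpoint` helper is a placeholder for whatever benchmark is chosen:

```python
import shutil
from pathlib import Path


def score_checkpoint(checkpoint_dir: Path) -> float:
    """Placeholder: run the chosen benchmark and return a score (higher is better)."""
    raise NotImplementedError


def evaluate_checkpoints(checkpoints_dir: str, output_dir: str) -> Path:
    """Pick the best checkpoint from the previous phase and stage it for the next one."""
    candidates = sorted(Path(checkpoints_dir).glob("checkpoint-*"))
    best = max(candidates, key=score_checkpoint)
    # Stage the winner as the next phase's model dir, e.g. ./phase10/model.
    dest = Path(output_dir) / "model"
    shutil.copytree(best, dest)
    return dest
```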

## Alternatives

The alternative is to keep the same train and generate commands and instead add a `--backend` or `--hifi` flag to trigger the high fidelity code. The issue here is that `ilab train` is already overloaded with PyTorch, MLX, etc. Adding more switches and dials to the main train code will make it hard to maintain.