introduce lofi and hifi commands for train, generate. Add Data Mixing
This enhancement discusses "more intensive" training and data generation techniques as well as a new Data Mixing command. This is all built off of the command redesign. The goal here is to produce higher fidelity models using the CLI.

Signed-off-by: Charlie Doern <[email protected]>
cdoern committed May 18, 2024
# Introduce Lofi and Hifi Commands for Key ilab Functions to Increase Model Fidelity

This document describes adding different data generation, mixing, and training backends to ilab to enable higher fidelity training.

Currently, all training is done via QLoRA or similar techniques. Adding the following commands will enable higher fidelity training and introduce new capabilities such as data mixing.

## Key Components

### Building off of the InstructLab Structure Redesign

After github.com/instructlab/instructlab/pull/990 is merged, ilab will use a parent -> child command structure. This proposal operates under that new structure.

This is the proposed new structure under the above Enhancement:

```console
ilab
|
|_______model
| |
| |____convert
| |____download
| |____train (--convert)
| |____serve (-i)
| |____chat
| |____inference
| |
|_______data
| |
| |____generate
| |
|_______config
| |
| |____init
| |
|_______tax
| |
| |____diff
| |____check
| |____download
```

And this would be the structure after these new commands are added:

```console
ilab
|
|_______model
| |
| |____convert
| |____download
| |____train (--convert)
| | |
| | |______lofi *
| | |______hifi *
| |
| |____serve (-i)
| |____chat
| |____inference
| |
|_______data
| |
| |____generate
| | |
| | |______lofi *
| | |______hifi *
| |
| |____mix *
| |
|_______config
| |
| |____init
| |
|_______tax
| |
| |____diff
| |____check
| |____download
```

The starred commands are the new ones under the redesigned structure. We are opting to add lofi and hifi as subcommands of train and generate, rather than as flags, given their completely different functionality and backend structure.

These commands would connect to the InstructLab backend, which will take the form of libraries. The lofi commands, if accepted, will be equivalent to what currently exists for generate and train: QLoRA with PyTorch, MLX, etc.

The Hifi versions would validate the existence of hardware that can properly run the generation, mixing, and training backends. At least for training, the existing infrastructure simply shells out to various Python scripts, libraries, etc. So, as long as we consolidate this backend code into a place that can be imported into ilab without breaking other dependencies, this should be more of a structural change than a functional one. We know the backend code works on an isolated system; we just need to make it pluggable.

Hifi can run locally on someone's laptop or desktop and even utilize DeepSpeed if they have multiple GPUs. On a more powerful system, the user can also run it in a container, utilize DeepSpeed, and potentially even distribute the workload across machines using torch.distributed.
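
To make the multi-machine case concrete, a launch under stock torch.distributed tooling might look like the sketch below (node counts, the rendezvous endpoint, and the training script name are placeholders, not part of this proposal):

```console
torchrun \
  --nnodes=2 \
  --nproc_per_node=8 \
  --rdzv_backend=c10d \
  --rdzv_endpoint=head-node:29500 \
  train.py --input-dir generated/mixed
```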

### Reasoning

Plugging into hardware acceleration and multi-phase training is the logical next step for ilab. Ensuring we do this in a clean way that does not overload our current commands is also crucial. Many of the processes in the backend are confusing, so we want to abstract some of the steps away from users while still giving them a reasonable amount of choice in configuring these new processes. However, maintaining the current laptop story is important to users without hardware access. Splitting these two paths into separate commands maintains the integrity of each.

### ilab model train hifi

This command would take something like the following arguments:

* `--gpus=str`: the number of GPUs (of those available) to use for this process, in the form `0-1`, `8`, etc.
* `--multi-phase=bool`: whether to run multi-phase training
* `--optimizer=str` (deepspeed, fsdp, adam): the optimizer to use during training
* `--distributed` or `--containerized`: triggers a containerized version of the train workflow with hardcoded dependencies
* `--learning-rate=int` (?)
* `--batch-len=int`
* `--input-dir=str`: where the generated and mixed data to be trained on lives
* `--model-name=str`: the name of the model to be output
* `--output-dir=str`: where to put the model after training
* `--num-epochs=int`: the number of epochs to run in this phase of training

Plus, most likely, many of the current train flags.
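
A hypothetical invocation, purely for illustration (flag spellings, the model name, and default values are not final):

```console
ilab model train hifi \
  --gpus=0-3 \
  --multi-phase=true \
  --optimizer=deepspeed \
  --input-dir=generated/mixed \
  --model-name=my-hifi-model \
  --output-dir=models/ \
  --num-epochs=10
```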

#### Implementation Specifics

The Transformers library (which, along with torch, is what we currently use for training) has options for FSDP, DeepSpeed, and many of the related arguments already in use elsewhere in the project. My idea here is to use these as much as we can in the CLI rather than write our own custom versions that we then have to maintain.
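
As a rough sketch of that idea, assuming we stay on `transformers.Trainer`: the helper below and its flag-to-argument mapping are hypothetical, not an agreed design, but the `TrainingArguments` fields it sets all exist in the library today.

```python
# Hypothetical sketch: map ilab's hifi train flags onto the FSDP/DeepSpeed
# support that already ships with transformers, instead of maintaining
# custom training loops per optimizer.
from transformers import TrainingArguments


def build_training_args(optimizer: str, learning_rate: float, batch_len: int,
                        num_epochs: int, output_dir: str) -> TrainingArguments:
    kwargs = dict(
        output_dir=output_dir,
        num_train_epochs=num_epochs,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_len,
    )
    if optimizer == "fsdp":
        # torch FSDP via transformers: shard model and optimizer state
        kwargs["fsdp"] = "full_shard auto_wrap"
    elif optimizer == "deepspeed":
        # transformers defers ZeRO stage, offload, etc. to a DeepSpeed config
        kwargs["deepspeed"] = "ds_config.json"
    # "adam" falls through to transformers' default AdamW optimizer
    return TrainingArguments(**kwargs)
```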

There might be a use case for kicking off some custom DeepSpeed code for more in-depth inter-checkpoint eval. That would be the point of the `--containerized` flag. When this flag is used, all user options would be honored and passed as build args into a containerized workflow that might use some different code under the hood.
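
For illustration, the flag-to-build-arg handoff could look something like the sketch below (the image name and build args are invented):

```console
podman build \
  --build-arg NUM_EPOCHS=10 \
  --build-arg OPTIMIZER=deepspeed \
  -t ilab-train-hifi .
podman run --device nvidia.com/gpu=all ilab-train-hifi
```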

Keep in mind, though, this is the community CLI! I feel as though we should try to find a middle ground between server use cases and community use cases. Having the default path in the InstructLab CLI be torch+transformers makes sense for the following use cases:

1. Developer with a gaming PC:
   * Transformers+PyTorch support QLoRA and FSDP. While DeepSpeed might be a more "server-rack" use case, having multi-phase training in the CLI for anyone with a consumer GPU makes sense.
2. Someone interested in ML, who has a homelab, or *anything with 2 GPUs*:
   * Transformers+PyTorch supports DeepSpeed on a single system, spreading the training over the GPUs. Any professional or hobbyist with 2 GPUs will be looking for this option.
3. The laptop use case:
   * Maintaining QLoRA as the performant training mode for the laptop is crucial, as most people cannot handle the full models. However, unlocking better results by using FSDP+QLoRA could improve local results and get people more interested in InstructLab.

The above use cases create a spectrum of possibilities for what users can do with ilab! Adding `ilab model train lofi/hifi`, each with different options for optimizers: Adam (default, QLoRA), FSDP (QLoRA or non-QLoRA train), and DeepSpeed (non-QLoRA, multi-GPU), increases the number of situations where ilab is viable. Adding other options like `--multi-phase`, `--learning-rate`, etc. gives the user granular control over this new multi-phased training approach.

### ilab model train lofi

This command would have the same arguments as the current train as of github.com/instructlab/instructlab/pull/1157.

The main emphasis here is on the backend to be used: mlx or pytorch. GPU acceleration here is not an option, nor is multi-phase training.
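
For example, backend selection might look like the following; the `--backend` flag name is illustrative and would mirror whatever the current train command settles on:

```console
ilab model train lofi --backend=mlx
ilab model train lofi --backend=pytorch
```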

### ilab data generate hifi

This command would take something like the following arguments:

* `--num-samples=int`
* `--num-grounded-questions=int`
* `--num-gen-proc=int`
* `--num-util-proc=int` (or is this for mixing?)
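
A hypothetical invocation (values are illustrative):

```console
ilab data generate hifi \
  --num-samples=1000 \
  --num-grounded-questions=10 \
  --num-gen-proc=8 \
  --num-util-proc=4
```
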
### ilab data generate lofi

This command would be the same as the existing `ilab generate`.

### ilab data mix

This command would take something like the following arguments:

* `--num-util-proc=int`
* `--output-dir=str` (defaults to `generated/mixed`)
* `--knowledge-recipes=[]str` (path to YAML)
* `--skill-recipes=[]str` (path to YAML)
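
A hypothetical invocation (recipe paths are placeholders):

```console
ilab data mix \
  --knowledge-recipes=recipes/knowledge.yaml \
  --skill-recipes=recipes/skills.yaml \
  --num-util-proc=8 \
  --output-dir=generated/mixed
```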

*Do we need an `ilab recipe` command?*

## Alternatives

The other alternative is to keep the same train and generate commands and instead add a `--backend` or `--hifi` flag to trigger the high fidelity code. The issue here is that `ilab train` is already overloaded with pytorch, mlx, etc. Adding more switches and dials to the main train code would make it hard to maintain.
