This document is an entrypoint into the code used for An Empirical Study of Mamba-based Language Models.
We are releasing the parameters for some of the models described in that technical report via HuggingFace.
The code in the main branch is no longer compatible with the Mamba2-* checkpoints. You can load them using the fixed snapshot of the code used for the technical report.
Create and run a Docker container using the Dockerfile.
docker build -t your_image_name:your_tag .
docker run --gpus all -it --rm \
-v /path/to/megatron:/workspace/megatron \
-v /path/to/dataset:/workspace/dataset \
-v /path/to/checkpoints:/workspace/checkpoints \
-w /workspace/megatron/examples/mamba \
your_image_name:your_tag
train.sh is an example pretraining script, showing how to run on a single node. Select between 800M-scale and 8B-scale models by setting the MODEL_SCALE variable. The 8B-scale hybrid model architecture is the same as the one described in the technical report.
Use run_text_gen_server_8b.sh to start a text generation server using an 8B hybrid checkpoint. This is configured to run the 8B hybrid model described in the technical report, with tensor model parallel size set to 1.
The arguments in the script will need to be changed if the checkpoint uses a different model parallel configuration or differs in other ways, such as model architecture. For example, to run the 8B pure Mamba-2 model, change --hybrid-attention-ratio and --hybrid-mlp-ratio to 0.0, or remove them.
Use run_text_gen_server_8b_gpt3.sh to start a text generation server using the 8B reference Transformer checkpoint.
For inference, the model must be configured to match the checkpoint file used, including the hybrid layer configuration and model parallel configuration.
If you need to convert a hybrid checkpoint file to a different tensor parallel or pipeline parallel size, use the hybrid conversion script. There is an example run command at the end of that file.
Before running that script, you will need to set PYTHONPATH to include the root directory of your Megatron-LM repository clone.
export PYTHONPATH=<path-to-megatron>:$PYTHONPATH
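One quick way to confirm that PYTHONPATH is set as intended is to ask Python whether the package can be found before launching the conversion script. The sketch below is only an illustration: the helper name `is_importable` is hypothetical, and it assumes the repository's top-level package is named `megatron`.

```python
import importlib.util


def is_importable(package_name: str) -> bool:
    """Return True if Python can locate the named top-level package."""
    # find_spec returns None when a top-level package is not on sys.path.
    return importlib.util.find_spec(package_name) is not None


# Assumption: the top-level package in the repository clone is "megatron".
print("megatron importable:", is_importable("megatron"))
```

If this prints `False`, the path given in PYTHONPATH most likely does not point at the root of the repository clone.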
--hybrid-attention-ratio ATT specifies a target ratio of attention layers to total layers. For example, 4 attention layers out of 48 total layers (4/48 ≈ 0.083) is specified by --hybrid-attention-ratio 0.08.
--hybrid-mlp-ratio MLP specifies a target ratio of MLP layers to total layers. For example, 24 MLP layers out of 48 total layers is specified by --hybrid-mlp-ratio 0.5.
- (ATT + MLP) must be less than or equal to 1.0.
- (1.0 - ATT - MLP) is the hybrid Mamba ratio, the ratio of Mamba layers to total layers.
- ATT = MLP = 0 is a pure Mamba model.
- ATT = MLP = 0.5 is a Transformer model.
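The constraints above can be checked with a little arithmetic. The following is a minimal sketch, not the allocation algorithm used by the codebase: the function name `hybrid_layer_counts` is hypothetical, and rounding each target to the nearest whole layer is an assumption made for illustration.

```python
def hybrid_layer_counts(total_layers: int, att: float, mlp: float) -> dict:
    """Estimate per-type layer counts from target ratios (hypothetical helper)."""
    if att < 0.0 or mlp < 0.0:
        raise ValueError("ratios must be non-negative")
    if att + mlp > 1.0:
        raise ValueError("ATT + MLP must be less than or equal to 1.0")
    # Assumption: round each target to the nearest whole layer.
    attention = round(total_layers * att)
    # Never hand out more layers than remain after attention is placed.
    mlp_layers = min(round(total_layers * mlp), total_layers - attention)
    # Whatever is left over is Mamba: the (1.0 - ATT - MLP) share.
    mamba = total_layers - attention - mlp_layers
    return {"attention": attention, "mlp": mlp_layers, "mamba": mamba}


# The example values from the text: 48 layers, ATT = 0.08, MLP = 0.5.
print(hybrid_layer_counts(48, att=0.08, mlp=0.5))
# {'attention': 4, 'mlp': 24, 'mamba': 20}
```

With ATT = MLP = 0 the same helper reports 48 Mamba layers, and with ATT = MLP = 0.5 it reports none, matching the two limiting cases listed above.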
If either ATT or MLP is greater than 0.0 or if --hybrid-override-pattern is specified, the log file will include information about the hybrid layer pattern used. --hybrid-override-pattern can be used to specify a different pattern than the default, algorithmically generated one.
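When writing an override pattern by hand, it can help to check which ratios it implies. The sketch below is illustrative only and rests on an assumption that this document does not state: that the pattern is a string with one character per layer, using `M` for Mamba, `*` for attention and `-` for MLP. Verify the symbol convention against the code before relying on it. The helper name `pattern_ratios` and the 8-layer pattern are hypothetical.

```python
from collections import Counter

# Assumed symbol convention (not stated in this document): one character per layer.
SYMBOLS = {"M": "mamba", "*": "attention", "-": "mlp"}


def pattern_ratios(pattern: str) -> dict:
    """Return the fraction of layers of each type in a layer pattern string."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    unknown = set(pattern) - set(SYMBOLS)
    if unknown:
        raise ValueError(f"unexpected symbols: {sorted(unknown)}")
    counts = Counter(pattern)
    total = len(pattern)
    return {name: counts[symbol] / total for symbol, name in SYMBOLS.items()}


# Hypothetical 8-layer pattern, for illustration only.
print(pattern_ratios("M-M*M-M-"))
# {'mamba': 0.5, 'attention': 0.125, 'mlp': 0.375}
```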
This codebase currently only supports Mamba-2, and not the original version of Mamba. However, the fixed snapshot of the code used for the technical report can be configured to run the original version of Mamba.