This repo runs Mamba models in C. It uses the tokenizer of the Zephyr family of models, available on Hugging Face.
It now runs at about the same speed as llama2.c when comparing mamba-130m against llama-110m.
Clone the repo on your device:
```
git clone https://github.com/SalmanHabeeb/mamba-inference-in-c.git
```
and navigate to the folder:
```
cd mamba-inference-in-c
```
Build using:
```
make runomp
```
Other build commands are listed inside the Makefile.
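For example, assuming the Makefile mirrors llama2.c's targets (an assumption; check the actual Makefile), single-threaded builds might look like:
```
# Hypothetical targets, assuming parity with llama2.c's Makefile:
make run       # basic -O3 build, no OpenMP
make runfast   # -Ofast build, still single-threaded
```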
Models can be downloaded manually from this link, or using gdown:
```
# For the model
gdown https://drive.google.com/file/d/1cI6_LmfSuKLtgGNyOUbcQ2K_2h6a-KaL/view?usp=sharing --fuzzy
# For the tokenizer
gdown https://drive.google.com/file/d/1qUjULatBdbrJaJqsJuTrtuWYt4G6GI2D/view?usp=sharing --fuzzy
```
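If gdown isn't installed yet, it's available from PyPI:
```
pip install gdown
```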
Now run the model to generate text using:
```
OMP_NUM_THREADS=2 ./run "path/to/model"
```
and for chatting, use:
```
OMP_NUM_THREADS=2 ./run "path/to/model" -m chat
```
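If the command-line interface mirrors llama2.c's (a plausible guess given this repo's lineage, but an assumption; only the `-m chat` flag above is confirmed here), sampling options might look like:
```
# Hypothetical flags, assuming llama2.c-style options (-t temperature,
# -n steps, -i prompt); verify against run.c before relying on them:
OMP_NUM_THREADS=2 ./run "path/to/model" -t 0.8 -n 256 -i "Once upon a time"
```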
You can download a chat-finetuned, quantized version of the model using:
```
wget https://huggingface.co/SalmanHabeeb/quantized-mambas-c-format/resolve/main/mamba-2.8b-64g-zephyr.bin
```
To run quantized models, build with any of the make commands above and then run:
```
OMP_NUM_THREADS=2 ./runq "path/to/model"
```
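The "64g" in the filename above suggests group-wise quantization with a group size of 64, in the style of llama2.c's Q8_0 format. Here is a minimal sketch of that idea, assuming this fork inherits the same scheme (the struct and function below are illustrative, not this repo's actual code):
```c
#include <math.h>
#include <stdint.h>

/* Illustrative Q8_0-style group quantization, not this repo's actual code.
 * Each group of `group_size` weights shares one float scale, so a "64g"
 * model stores one scale per 64 int8 weights. */
typedef struct {
    int8_t *q;  /* quantized values, one per weight */
    float  *s;  /* scale factors, one per group */
} QuantizedTensor;

void quantize(QuantizedTensor *t, const float *x, int n, int group_size) {
    for (int g = 0; g < n / group_size; g++) {
        /* find the largest magnitude in this group */
        float wmax = 0.0f;
        for (int i = 0; i < group_size; i++) {
            float v = fabsf(x[g * group_size + i]);
            if (v > wmax) wmax = v;
        }
        /* scale so that wmax maps to 127, then round each weight */
        float scale = wmax / 127.0f;
        t->s[g] = scale;
        for (int i = 0; i < group_size; i++) {
            float q = (scale != 0.0f) ? x[g * group_size + i] / scale : 0.0f;
            t->q[g * group_size + i] = (int8_t)roundf(q);
        }
    }
}
```
Keeping the scales per group of 64 keeps quantization error local while storing roughly one byte per weight.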
Models can also be exported to the required .bin format by running:
```
python export_model.py path/to/save/model --checkpoint "huggingface-chkpoint"
```
Similarly, the tokenizer can be obtained using:
```
python export_tokenizer.py path/to/save/tokenizer --tokenizer "huggingface-tokenizer"
```
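For example (the output paths here are hypothetical; "state-spaces/mamba-130m" and "HuggingFaceH4/zephyr-7b-beta" are real Hugging Face repos, chosen as plausible inputs):
```
python export_model.py model.bin --checkpoint "state-spaces/mamba-130m"
python export_tokenizer.py tokenizer.bin --tokenizer "HuggingFaceH4/zephyr-7b-beta"
```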
The above chart details the time consumed per token by each function. Most of the time is spent in the in_proj matmul, since that is the largest matrix multiplication apart from the final one that produces the logits. I still don't understand why there are so many peaks and valleys, but it may have something to do with caching.
The chart can be generated using this code:
```
CC=gcc   # or clang
$CC -Ofast -fopenmp -march=native run-timeit.c -lm -o run-timeit
./run-timeit
python plot.py
```
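For reference, run-timeit.c presumably wraps each function in a monotonic clock and records the elapsed time per call. A minimal sketch of that approach, with an illustrative matmul sized roughly like mamba-130m's in_proj (d_model = 768 projecting to 4 * 768); the names below are not the repo's actual symbols:
```c
#include <stdio.h>
#include <time.h>

/* Milliseconds between two monotonic timestamps. */
static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Naive matmul: out = W x, with W of shape (d, n). */
static void matmul(float *out, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) val += w[i * n + j] * x[j];
        out[i] = val;
    }
}

int main(void) {
    /* Roughly in_proj-sized for mamba-130m: d_model = 768 -> 2 * d_inner = 3072. */
    enum { N = 768, D = 4 * 768 };
    static float x[N], w[N * D], out[D];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    matmul(out, x, w, N, D);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("matmul (in_proj-sized): %.3f ms\n", elapsed_ms(t0, t1));
    return 0;
}
```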
- karpathy/llama2.c -- This repository is essentially an unofficial fork of karpathy/llama2.c
- johnma2006/mamba-minimal -- Inspired the Mamba implementation
- havenhq/mamba-chat -- For the chat fine-tuning of models
- state-spaces/mamba -- The official Mamba implementation