
mamba-inference-in-c

This repo runs Mamba models in C. It uses the tokenizer from the Zephyr family of models, available on Hugging Face.

It now runs at about the same speed as llama2.c when comparing mamba-130m against llama-110m.

Build and Run

Clone the repo onto your device:

git clone https://github.com/SalmanHabeeb/mamba-inference-in-c.git

and navigate to the folder:

cd mamba-inference-in-c

Build using

make runomp

Other build targets are listed in the Makefile.

Models can be downloaded manually from the Google Drive links below, or using gdown:

# For model
gdown https://drive.google.com/file/d/1cI6_LmfSuKLtgGNyOUbcQ2K_2h6a-KaL/view?usp=sharing --fuzzy

# For tokenizer
gdown https://drive.google.com/file/d/1qUjULatBdbrJaJqsJuTrtuWYt4G6GI2D/view?usp=sharing --fuzzy

Now run the model to generate text using:

OMP_NUM_THREADS=2 ./run "path/to/model"

and for chatting, use:

OMP_NUM_THREADS=2 ./run "path/to/model" -m chat

Quantized models

You can download a chat-finetuned quantized version of the model using:

wget https://huggingface.co/SalmanHabeeb/quantized-mambas-c-format/resolve/main/mamba-2.8b-64g-zephyr.bin

To run quantized models, build with any make target and then run:

OMP_NUM_THREADS=2 ./runq "path/to/model"

Export

Models can also be exported to the required .bin format by running:

python export_model.py path/to/save/model --checkpoint "huggingface-chkpoint"

Similarly, the tokenizer can be exported using:

python export_tokenizer.py path/to/save/tokenizer --tokenizer "huggingface-tokenizer"

Performance

[Chart: time consumed per token by each function]

The above chart details the time consumed per token by each function. Most of the time is spent in the in_proj matmul, since that is the largest matrix multiplication apart from the final one that produces the logits. I still don't understand why there are so many peaks and valleys, but it may have something to do with the cache.
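The hot loop behind that profile is a llama2.c-style matrix-vector multiply. A sketch of the idea (the repo's actual function may differ in layout details):

```c
/* out = W @ x, where W is (d, n) row-major, x has n entries,
   out gets d entries. The outer loop is split across OpenMP
   threads, one output row per iteration; compile with -fopenmp. */
void matmul(float *out, const float *x, const float *W, int n, int d) {
    #pragma omp parallel for
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++)
            val += W[i * (long)n + j] * x[j];
        out[i] = val;
    }
}
```

Parallelizing over output rows is safe because each iteration writes a disjoint `out[i]`, so no synchronization is needed inside the loop.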

The chart can be generated using these commands:

CC=gcc  # or clang
$CC -Ofast -fopenmp -march=native run-timeit.c -lm -o run-timeit
./run-timeit
python plot.py

Attribution

License

MIT License
