This repo runs Mamba models in C. It uses the tokenizer of the Zephyr family of models, available on Hugging Face.
It now runs at about the same speed as llama2.c when comparing mamba-130m against llama-110m.
Clone the repo on your device:
```
git clone https://github.com/SalmanHabeeb/mamba-inference-in-c.git
```
and navigate to the folder:
```
cd mamba-inference-in-c
```
Build using:
```
make runomp
```
Other build commands are listed inside the Makefile.
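For example, assuming the Makefile mirrors llama2.c's targets (an assumption; check the actual Makefile), single-threaded builds might look like:
```
# Hypothetical targets, assuming parity with llama2.c's Makefile:
make run       # basic -O3 build, no OpenMP
make runfast   # -Ofast build, still single-threaded
```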
Models can be downloaded manually from this link, or using gdown:
```
# For the model
gdown https://drive.google.com/file/d/1cI6_LmfSuKLtgGNyOUbcQ2K_2h6a-KaL/view?usp=sharing --fuzzy
# For the tokenizer
gdown https://drive.google.com/file/d/1qUjULatBdbrJaJqsJuTrtuWYt4G6GI2D/view?usp=sharing --fuzzy
```
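If gdown isn't installed yet, it's available from PyPI:
```
pip install gdown
```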
Now run the model to generate text using:
```
OMP_NUM_THREADS=2 ./run "path/to/model"
```
and for chatting, use:
```
OMP_NUM_THREADS=2 ./run "path/to/model" -m chat
```
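If the command-line interface mirrors llama2.c's (a plausible guess given this repo's lineage, but an assumption; only the `-m chat` flag above is confirmed here), sampling options might look like:
```
# Hypothetical flags, assuming llama2.c-style options (-t temperature,
# -n steps, -i prompt); verify against run.c before relying on them:
OMP_NUM_THREADS=2 ./run "path/to/model" -t 0.8 -n 256 -i "Once upon a time"
```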
You can download a chat-finetuned, quantized version of the model using:
```
wget https://huggingface.co/SalmanHabeeb/quantized-mambas-c-format/resolve/main/mamba-2.8b-64g-zephyr.bin
```
To run quantized models, build with any of the make commands above and then run:
```
OMP_NUM_THREADS=2 ./runq "path/to/model"
```
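The "64g" in the filename above suggests group-wise quantization with a group size of 64, in the style of llama2.c's Q8_0 format. Here is a minimal sketch of that idea, assuming this fork inherits the same scheme (the struct and function below are illustrative, not this repo's actual code):
```c
#include <math.h>
#include <stdint.h>

/* Illustrative Q8_0-style group quantization, not this repo's actual code.
 * Each group of `group_size` weights shares one float scale, so a "64g"
 * model stores one scale per 64 int8 weights. */
typedef struct {
    int8_t *q;  /* quantized values, one per weight */
    float  *s;  /* scale factors, one per group */
} QuantizedTensor;

void quantize(QuantizedTensor *t, const float *x, int n, int group_size) {
    for (int g = 0; g < n / group_size; g++) {
        /* find the largest magnitude in this group */
        float wmax = 0.0f;
        for (int i = 0; i < group_size; i++) {
            float v = fabsf(x[g * group_size + i]);
            if (v > wmax) wmax = v;
        }
        /* scale so that wmax maps to 127, then round each weight */
        float scale = wmax / 127.0f;
        t->s[g] = scale;
        for (int i = 0; i < group_size; i++) {
            float q = (scale != 0.0f) ? x[g * group_size + i] / scale : 0.0f;
            t->q[g * group_size + i] = (int8_t)roundf(q);
        }
    }
}
```
Keeping the scales per group of 64 keeps quantization error local while storing roughly one byte per weight.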
Models can also be exported to the required .bin format by running:
```
python export_model.py path/to/save/model --checkpoint "huggingface-chkpoint"
```
Similarly, the tokenizer can be obtained using:
```
python export_tokenizer.py path/to/save/tokenizer --tokenizer "huggingface-tokenizer"
```
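For example (the output paths here are hypothetical; "state-spaces/mamba-130m" and "HuggingFaceH4/zephyr-7b-beta" are real Hugging Face repos, chosen as plausible inputs):
```
python export_model.py model.bin --checkpoint "state-spaces/mamba-130m"
python export_tokenizer.py tokenizer.bin --tokenizer "HuggingFaceH4/zephyr-7b-beta"
```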
The above chart details the time consumed per token by each function. Most of the time is spent in the in_proj matmul, since that is the largest matrix multiplication apart from the final one that produces the logits. I still don't understand why there are so many peaks and valleys, but it may have something to do with caching.
The chart can be generated using this code:
```
CC=gcc   # or clang
$CC -Ofast -fopenmp -march=native run-timeit.c -lm -o run-timeit
./run-timeit
python plot.py
```
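For reference, run-timeit.c presumably wraps each function in a monotonic clock and records the elapsed time per call. A minimal sketch of that approach, with an illustrative matmul sized roughly like mamba-130m's in_proj (d_model = 768 projecting to 4 * 768); the names below are not the repo's actual symbols:
```c
#include <stdio.h>
#include <time.h>

/* Milliseconds between two monotonic timestamps. */
static double elapsed_ms(struct timespec a, struct timespec b) {
    return (b.tv_sec - a.tv_sec) * 1e3 + (b.tv_nsec - a.tv_nsec) / 1e6;
}

/* Naive matmul: out = W x, with W of shape (d, n). */
static void matmul(float *out, const float *x, const float *w, int n, int d) {
    for (int i = 0; i < d; i++) {
        float val = 0.0f;
        for (int j = 0; j < n; j++) val += w[i * n + j] * x[j];
        out[i] = val;
    }
}

int main(void) {
    /* Roughly in_proj-sized for mamba-130m: d_model = 768 -> 2 * d_inner = 3072. */
    enum { N = 768, D = 4 * 768 };
    static float x[N], w[N * D], out[D];
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    matmul(out, x, w, N, D);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("matmul (in_proj-sized): %.3f ms\n", elapsed_ms(t0, t1));
    return 0;
}
```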
- karpathy/llama2.c -- This repository is essentially an unofficial fork of karpathy/llama2.c
- johnma2006/mamba-minimal -- Inspired the Mamba implementation
- havenhq/mamba-chat -- For the chat fine-tuning of models
- state-spaces/mamba -- The official Mamba implementation