Commit 9a9b69e (parent 8e33517): Add doc for llamacpp example
1 file changed, +86 -0 lines
examples/cpp/llamacpp/README.md

Lines changed: 86 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,86 @@

This example uses [llama.cpp](https://github.com/ggerganov/llama.cpp) to deploy a Llama-2-7B-Chat model using the TorchServe C++ backend.
The handler C++ source code for this example can be found [here](../../../cpp/src/examples/llamacpp/).

### Setup

1. Follow the instructions in [README.md](../../../cpp/README.md) to build the TorchServe C++ backend.

```bash
cd ~/serve/cpp
./build.sh
```

2. Download the model

```bash
cd ~/serve/examples/cpp/llamacpp
curl -L https://huggingface.co/TheBloke/Llama-2-7B-Chat-GGUF/resolve/main/llama-2-7b-chat.Q5_0.gguf?download=true -o llama-2-7b-chat.Q5_0.gguf
```

3. Create a [config.json](config.json) with the path of the downloaded model weights (if your checkout is not under `/home/ubuntu`, see the note after this setup list):

```bash
echo '{
"checkpoint_path" : "/home/ubuntu/serve/examples/cpp/llamacpp/llama-2-7b-chat.Q5_0.gguf"
}' > config.json
```

4. Copy the handler .so file

While building the C++ backend, the `libllamacpp_handler.so` file is generated in the [llamacpp_handler](../../../cpp/test/resources/examples/llamacpp/llamacpp_handler) folder.

```bash
cp ../../../cpp/test/resources/examples/llamacpp/llamacpp_handler/libllamacpp_handler.so ./
```
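
Optionally, before generating the MAR file, you can sanity-check steps 2 and 3 and regenerate `config.json` with a path derived from your own checkout instead of the hard-coded `/home/ubuntu` location. This is a minimal sketch, assuming it is run from this example's directory where the weights were downloaded:

```bash
# the GGUF weights should be a multi-gigabyte file
ls -lh llama-2-7b-chat.Q5_0.gguf

# rewrite config.json with an absolute path based on the current directory
echo "{ \"checkpoint_path\" : \"$PWD/llama-2-7b-chat.Q5_0.gguf\" }" > config.json
cat config.json
```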

### Generate MAR file

Now let's generate the MAR file:

```bash
torch-model-archiver --model-name llm --version 1.0 --handler libllamacpp_handler:LlamaCppHandler --runtime LSP --extra-files config.json
```
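
Unless you pass `--export-path`, `torch-model-archiver` writes the archive to the current directory, so an `llm.mar` file should now exist:

```bash
ls -lh llm.mar
```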

Create a model store directory and move the MAR file into it:

```
mkdir model_store
mv llm.mar model_store/
```

### Inference

Start TorchServe using the following command:

```
torchserve --ncs --model-store model_store/
```
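
Once the server is up, you can check that it is healthy via TorchServe's standard ping endpoint on the inference port:

```bash
# should report {"status": "Healthy"}
curl http://localhost:8080/ping
```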

Register the model using the following command:

```
curl -v -X POST "http://localhost:8081/models?initial_workers=1&url=llm.mar&batch_size=2&max_batch_delay=5000"
```
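
To confirm that the model is registered and its worker has started, you can query TorchServe's standard describe-model endpoint on the management port:

```bash
# lists the llm model's workers and their status
curl http://localhost:8081/models/llm
```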

Infer the model using the following command:

```
curl http://localhost:8080/predictions/llm -T prompt1.txt
```
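
The prediction requests read the prompt from plain text files (`prompt1.txt` above, `prompt2.txt` in the batching example below). If you need sample files, you could create them like this; the prompt text is purely illustrative:

```bash
echo "Hello my name is" > prompt1.txt
echo "Once upon a time there was" > prompt2.txt
```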

This example supports batching. Since the model was registered with `batch_size=2` and `max_batch_delay=5000`, two requests sent concurrently are served as a single batch. To run a batch prediction, run the following commands:

```
curl http://localhost:8080/predictions/llm -T prompt1.txt & curl http://localhost:8080/predictions/llm -T prompt2.txt &
```

Sample Response

```
Hello my name is Daisy. Daisy is three years old. She loves to play with her toys.
One day, Daisy's mommy said, "Daisy, it's time to go to the store." Daisy was so excited! She ran to the store with her mommy.
At the store, Daisy saw a big, red balloon. She wanted it so badly! She asked her mommy, "Can I have the balloon, please?"
Mommy said, "No, Daisy. We don't have enough money for that balloon."
Daisy was sad. She wanted the balloon so much. She started to cry.
Mommy said, "Daisy, don't cry. We can get the balloon. We can buy it and take it home."
Daisy smiled. She was so happy. She hugged her mommy and said, "Thank you, mommy!"
```
