README.md: 15 additions & 6 deletions
@@ -83,6 +83,18 @@ This ran at about 4 tokens/s compiled with [OpenMP](#OpenMP) on 96 threads on my
base models... ¯\\_(ツ)_/¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily, and have a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script, and going wild with finetunes all within the repo!

You can also chat with the Llama Chat models. Export the chat model exactly as above:
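
A sketch of that export step, assuming the Meta chat weights sit at `path/to/llama-2-7b-chat` and that `export.py` takes the same `--meta-llama` flag as the base-model export (both the path and the flag usage here are assumptions):

```bash
# assumption: export.py accepts --meta-llama pointing at the chat checkpoint
# directory, mirroring the base-model export; adjust the path to your weights
python export.py llama2_7b_chat.bin --meta-llama path/to/llama-2-7b-chat
```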
Then chat with it by specifying the chat mode using the `-m` flag, e.g.:

```bash
./run llama2_7b_chat.bin -m chat
```

## huggingface models
We can load any huggingface models that use the Llama 2 architecture. See the script [export.py](export.py) and the `--hf` flag to export the model .bin file.
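
A sketch of such an export, assuming a Llama 2 architecture checkpoint id like `meta-llama/Llama-2-7b-hf` (the exact invocation is an assumption here):

```bash
# assumption: the --hf flag takes a huggingface model id (or local path)
# for any Llama 2 architecture checkpoint and writes a llama2.c .bin file
python export.py llama2_7b_hf.bin --hf meta-llama/Llama-2-7b-hf
```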
@@ -207,8 +219,7 @@ You can also experiment with replacing `gcc` with `clang`.
If compiling with gcc, try experimenting with `-funroll-all-loops`; see PR [#183](https://github.com/karpathy/llama2.c/pull/183).
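
For instance, a possible compile line with loop unrolling added (a sketch; the actual Makefile targets may use different flags):

```bash
# hypothetical: an -Ofast build with loop unrolling enabled, modeled on the
# fast compile described elsewhere in this README
gcc -Ofast -funroll-all-loops -o run run.c -lm
```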
**OpenMP**. Big improvements can also be achieved by compiling with OpenMP, which "activates" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.
You'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). Then you can compile with `make runomp`, which does:
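
A sketch of what that OpenMP build can look like, assuming a clang build with `-fopenmp` (the actual `make runomp` recipe may differ):

```bash
# approximate OpenMP build; the real `make runomp` recipe may differ
clang -Ofast -fopenmp -march=native run.c -lm -o run
```

At runtime the thread count is then controlled through OpenMP, e.g. via the `OMP_NUM_THREADS` environment variable.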
@@ -324,13 +335,11 @@ If your candidate PRs have elements of these it doesn't mean they won't get merg
## unsorted todos

- add support in run.c of reading version 1+ files from export, later deprecate "version 0"
- runq.c (int8 quantization) add
- run.cu (CUDA) investigate and merge
- add more tests inside [test.c](test.c)
- add Engine class for use in sample.py that does efficient inference in PyTorch, e.g. KV cache keeping
- make it easier to add a new dataset with not too much pain