llama : reuse compute graphs #14482
Conversation
This should be ready for review. Currently, there is some small gain for Metal, where it would be interesting to try to reuse the Metal command buffers to speed this up even further on the backend side. Currently, we use:

res &= self_kq_mask->ne[0] == mctx->get_n_kv();
res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);

res &= mctx->get_supports_set_rows(); // TODO: tmp
If update() is implemented for the recurrent cache, I think it could work even without adapting it to ggml_set_rows, because the head offset tends to be the same for similar consecutive ubatches in find_slot. That might not work as well once multiple recurrent state cells per sequence are implemented (because they won't get re-used as much), but at that point it should be possible to use ggml_set_rows.
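To illustrate the idea, here is a minimal sketch of such a reuse check; the struct and function names are hypothetical and only mirror the reasoning above, not the actual llama.cpp recurrent-cache API.

```cpp
// Hypothetical sketch: reuse of a recurrent-cache graph could be gated on the slot
// assignment staying the same between consecutive ubatches, since find_slot() tends
// to return the same head offset in that case. Names here are illustrative only.
#include <cstdint>

struct rs_slot_info {
    int64_t head; // offset of the first state cell assigned to the ubatch
    int64_t n_rs; // number of recurrent state cells used by the ubatch
};

static bool rs_can_reuse(const rs_slot_info & prev, const rs_slot_info & cur) {
    // same offset and same extent -> the existing views into the state buffers stay valid
    return cur.head == prev.head && cur.n_rs == prev.n_rs;
}
```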
Target: #14285
Reuse computation graphs from the previous ubatch when possible. Works with any batch size and any model.
Note: This functionality requires the ggml_set_rows() operator to be supported (see #14285), which currently means running with the LLAMA_SET_ROWS environment variable from #14285.

In order to be able to reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. inp_embd, inp_pos, inp_attn, etc.).

This PR adds logic to update a previous llm_graph_result by verifying that the new llm_graph_params would result in the same tensor shapes. For this to work, we should no longer preemptively reset the scheduler after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new ubatch is compatible. See the new llm_graph_result::update() method:

llama.cpp/src/llama-graph.h, lines 506 to 525 in fc4fdf6
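The excerpt itself is not reproduced here; as a rough, non-authoritative sketch, the update path can be thought of as a compatibility check followed by a per-input refresh (the helper name is_compatible and the inputs container are assumptions, not the actual member names from llama-graph.h):

```cpp
// Sketch of the reuse path (illustrative names, not the actual llama-graph.h code):
// a previous llm_graph_result can be reused only if the new params produce the same
// tensor shapes, and every graph input agrees to be rewired to the new ubatch.
bool llm_graph_result_sketch_update(const llm_graph_params & params) {
    // the graph topology must be fully determined by the inputs, so if the new
    // parameters are not compatible with the old ones, the graph has to be rebuilt
    if (!is_compatible(params)) { // assumed helper comparing shapes/strides/etc.
        return false;
    }

    // give each input a chance to veto the reuse and to pick up the new memory
    // context, so that the subsequent set_inputs() call fills the correct buffers
    bool res = true;
    for (auto & input : inputs) { // assumed container of llm_graph_input_i pointers
        res &= input->update(params);
    }

    return res;
}
```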
The other change that is needed is to introduce a way to swap the llama_memory_context of all graph inputs, so that the new call to llm_graph_result_i::set_inputs() uses the correct context from the current ubatch. This is performed by calling the llm_graph_input_i::update() method of all input tensors.
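For example, based on the diff excerpt quoted in the review thread above, the check for the attention input looks roughly like this (a sketch; the surrounding method shape and the cast to the unified KV cache context are assumptions):

```cpp
// Sketch of a graph input's update() (based on the review-thread excerpt above):
// the input adopts the new memory context and succeeds only if the tensors that
// were allocated for the previous graph still have the required shapes.
bool llm_graph_input_attn_sketch_update(const llm_graph_params & params) {
    // assumed: swap in the memory context that belongs to the current ubatch
    mctx = static_cast<const llama_kv_cache_unified_context *>(params.mctx);

    bool res = true;

    res &= self_kq_mask->ne[0] == mctx->get_n_kv();
    res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);

    res &= mctx->get_supports_set_rows(); // TODO: tmp (see the review thread above)

    return res;
}
```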
To enable this feature, define the LLAMA_SET_ROWS environment variable and add the new --graph-reuse CLI argument to the llama.cpp tools.

API Changes

- Add bool llama_context_params::graph_reuse. Default is false.
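For reference, opting in through the C API would look roughly like this (a sketch: the model path is illustrative and error handling is mostly omitted):

```cpp
#include "llama.h"

int main() {
    // note: the feature also requires running with LLAMA_SET_ROWS=1 (see #14285)
    llama_model_params   mparams = llama_model_default_params();
    llama_context_params cparams = llama_context_default_params();

    cparams.graph_reuse = true; // new parameter from this PR, default is false

    llama_model * model = llama_model_load_from_file("model.gguf", mparams); // illustrative path
    if (!model) {
        return 1;
    }

    llama_context * ctx = llama_init_from_model(model, cparams);

    // ... decode as usual; compatible consecutive ubatches can now reuse the compute graph

    llama_free(ctx);
    llama_model_free(model);

    return 0;
}
```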
Tests

LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa -gr

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128 -gr
Benchmark on M2 Ultra:
TODO

- is_same() methods?

Next PRs

- Remove the llama_graph_result_i interface - it does not seem to have any purpose
- Avoid passing ggml_cgraph * gf everywhere; simply move it to llm_graph_context