llama : reuse compute graphs #14482

Draft: ggerganov wants to merge 1 commit into master from gg/llama-reuse-graphs

Conversation

ggerganov (Member) commented on Jul 1, 2025

target #14285

Reuse computation graphs from the previous ubatch when possible. Works with any batch size and any model.

Note

The functionality currently requires the LLAMA_SET_ROWS environment variable from #14285

  • CPU
  • Metal
  • CUDA
  • Vulkan
  • OpenCL
  • SYCL

This functionality requires the ggml_set_rows() operator to be supported (see #14285). To reuse a compute graph, its topology (shapes, strides, parameters, etc.) has to be entirely defined by the set of input tensors (e.g. inp_embd, inp_pos, inp_attn, etc.).

This PR adds logic to update a previous llm_graph_result by verifying that the new llm_graph_params would result in the same tensor shapes. For this to work, the scheduler is no longer preemptively reset after processing a batch, so that all buffers from the previous graph remain allocated and ready for reuse in case the new ubatch is compatible. See the new llm_graph_result::update() method:

llama.cpp/src/llama-graph.h, lines 506 to 525 (at fc4fdf6):

// try to update the existing graph result using the new graph parameters
// this can only be done if we determine that the resulting graph using the new graph parameters
// would be identical to the existing graph. in that case, we simply have to update the memory
// contexts of the input tensors of the graph and we can reuse it for another computation
// return true if the graph was updated and can be reused
bool update(const llm_graph_params & params) override {
    if (!this->params.is_same(params)) {
        return false;
    }

    bool res = true;

    for (auto & input : inputs) {
        res &= input->update(params);
    }

    return res;
}

The other required change is a way to swap the llama_memory_context of all graph inputs, so that the next call to llm_graph_result_i::set_inputs() uses the correct context from the current ubatch. This is done by calling the llm_graph_input_i::update() method of all input tensors, as sketched below.
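
As an illustration, a per-input update() for a unified KV-cache attention input could rebind the memory context and verify that the mask shapes still match. The cast target type and the params.mctx member are assumptions here; self_kq_mask, get_n_kv() and GGML_KQ_MASK_PAD mirror the snippet quoted in the review comment further down:

// hypothetical sketch of a graph input's update():
// rebind the memory context coming from the new ubatch and check that the
// existing input tensors still have the right shapes for reuse
bool update(const llm_graph_params & params) override {
    mctx = static_cast<const llama_kv_cache_unified_context *>(params.mctx);

    bool res = true;

    res &= self_kq_mask->ne[0] == mctx->get_n_kv();
    res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);

    return res;
}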

To enable this feature, define the LLAMA_SET_ROWS environment variable and pass the new --graph-reuse (-gr) CLI argument to the llama.cpp tools.

API Changes

  • Add bool llama_context_params::graph_reuse. Default is false. See the usage sketch below.
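
A minimal usage sketch from application code, assuming the field is exposed in llama.h as described above; the model path is a placeholder and the surrounding model/context setup uses the existing public API:

#include "llama.h"

int main() {
    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_model_load_from_file("model.gguf", mparams);

    llama_context_params cparams = llama_context_default_params();
    cparams.graph_reuse = true; // new field from this PR, default is false

    llama_context * ctx = llama_init_from_model(model, cparams);

    // ... run decoding as usual; compatible consecutive ubatches reuse the graph ...

    llama_free(ctx);
    llama_model_free(model);
    llama_backend_free();
    return 0;
}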

Tests

LLAMA_SET_ROWS=1 ./bin/llama-cli -m ../models/llama-3.2-3b-instruct/ggml-model-q8_0.gguf -p "I believe the meaning of life is" -n 32 --top-k 1 -fa -gr

LLAMA_SET_ROWS=1 ./bin/llama-parallel -m ../models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -np 8 -ns 128 -s 1 -c 4096 -fa -n 128 -gr

Benchmark on M2 Ultra:

LLAMA_ARG_GRAPH_REUSE=1 LLAMA_SET_ROWS=1 ./scripts/compare-commits.sh gg/kv-cache-use-set-rows gg/llama-reuse-graphs -m models/qwen2.5-3b-coder/ggml-model-q8_0.gguf -m models/qwen2.5-3b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q4_0.gguf -m models/qwen2.5-1.5b-coder/ggml-model-q8_0.gguf -m models/gemma-3-4b/ggml-model-q4_0.gguf -m models/llama-3.2-1b-instruct/ggml-model-q8_0.gguf -fa 0,1 -t 1 -r 10 -n 1,32 -p 0
Model             Test   t/s master   t/s gg/llama-reuse-graphs   Speedup
gemma3 4B Q4_0    tg32       113.56                      116.19      1.02
llama 1B Q8_0     tg32       275.25                      280.95      1.02
qwen2 1.5B Q4_0   tg32       206.63                      215.55      1.04
qwen2 1.5B Q8_0   tg32       174.63                      180.79      1.04
qwen2 3B Q4_0     tg32       141.91                      147.95      1.04
qwen2 3B Q8_0     tg32       111.40                      114.48      1.03

TODO

  • Clean up and improve the new interfaces and members
  • Avoid graph input dynamic casts in is_same methods?
  • Allow graph reuse for more models
  • Manual user option to force-disable graph reuse?

Next PRs

  • Remove the llama_graph_result_i interface - it does not seem to serve any purpose
  • Be able to compare the unique sequence ids of 2 ubatches
  • Avoid passing ggml_cgraph * gf everywhere. Simply move it to llm_graph_context
  • Try to reuse Metal graphs via MTLIndirectCommandBuffer
  • Make the CUDA backend reuse CUDA graphs using this new mechanism

ggerganov mentioned this pull request on Jul 1, 2025
rgerganov marked this pull request as ready for review on July 2, 2025 06:04
ggerganov force-pushed the gg/kv-cache-use-set-rows branch from 2f577c5 to 30b4d4e on July 2, 2025 12:49
Base automatically changed from gg/kv-cache-use-set-rows to master on July 3, 2025 07:53
ggerganov force-pushed the gg/llama-reuse-graphs branch from f61b0f7 to d9e1781 on July 3, 2025 08:00
gabe-l-hart mentioned this pull request on Jul 3, 2025
ggerganov marked this pull request as draft on July 4, 2025 05:50
ggerganov force-pushed the gg/llama-reuse-graphs branch 3 times, most recently from 0d9c3d4 to fc4fdf6, on July 5, 2025 11:57
ggerganov force-pushed the gg/llama-reuse-graphs branch from fc4fdf6 to 76681e3 on July 5, 2025 12:26
ggerganov (Member, Author) commented:

This should be ready for review. Currently, there is a small gain for Metal, where ggml_set_rows() is available. We basically save the time needed to create a new ggml_cgraph for each ubatch.

It would be interesting to try reusing the Metal command buffers to speed this up even further on the backend side. Currently, we use MTLCommandBuffer to encode the compute commands, and these objects do not allow the commands to be reused. However, according to the Apple documentation, MTLIndirectCommandBuffer can be reused multiple times, so it seems to be what we need. It's still not clear to me how to encode compute commands into it (the docs only show examples of encoding rendering commands), but it might be possible. Any hints would be highly appreciated.

ggerganov requested a review from slaren on July 5, 2025 13:16
compilade (Collaborator) commented on Jul 5, 2025, on this snippet:

    res &= self_kq_mask->ne[0] == mctx->get_n_kv();
    res &= self_kq_mask->ne[1] == GGML_PAD(params.ubatch.n_tokens, GGML_KQ_MASK_PAD);

    res &= mctx->get_supports_set_rows(); // TODO: tmp

If update() is implemented for the recurrent cache, I think it could work even without adapting it to ggml_set_rows, because the head offset tends to be the same for similar consecutive ubatches in find_slot.

That might not work as well once multiple recurrent state cells per sequence are implemented (because they won't get re-used as much), but at that point it should be possible to use ggml_set_rows.
