
server : chunked prefill support #10718

Draft · wants to merge 1 commit into master

Conversation

ggerganov (Owner) commented:

This is an example of what I think is "chunked prefill". The idea is to avoid blocking text-generating slots when a new, large prompt arrives for processing in parallel.
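Schematically, the batching logic this implies looks something like the sketch below. This is a minimal illustration with hypothetical `Slot` and `build_batch` names, not the actual server.cpp code: generating slots always get their single decode token into the batch, and a pending prompt only fills whatever room is left, one chunk at a time.

```cpp
// schematic sketch only: hypothetical types, not the real server implementation
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {
    std::vector<int32_t> prompt_tokens;       // prompt tokens still waiting to be prefilled
    size_t               n_prefilled = 0;     // how many of them have been processed so far
    bool                 generating  = false; // slot is currently in the decode phase
    int32_t              next_token  = 0;     // next token to decode, if generating
};

// Build one batch: decode tokens go in first, then a *chunk* of any pending
// prompt fills the remaining space instead of the whole prompt at once.
static std::vector<int32_t> build_batch(std::vector<Slot> & slots, size_t n_batch) {
    std::vector<int32_t> batch;

    for (Slot & s : slots) {
        if (s.generating && batch.size() < n_batch) {
            batch.push_back(s.next_token); // generating slots are never starved
        }
    }

    for (Slot & s : slots) {
        if (batch.size() >= n_batch) {
            break;
        }
        const size_t remaining = s.prompt_tokens.size() - s.n_prefilled;
        const size_t room      = n_batch - batch.size();
        const size_t take      = remaining < room ? remaining : room;

        // take only as many prompt tokens as still fit in this batch
        batch.insert(batch.end(),
                     s.prompt_tokens.begin() + s.n_prefilled,
                     s.prompt_tokens.begin() + s.n_prefilled + take);
        s.n_prefilled += take;
    }

    return batch; // the caller would submit this to llama_decode and repeat
}
```

The second loop is the key point: a long prompt advances one chunk per scheduling round, so no generation slot has to wait for the entire prefill to finish.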

```bash
# start the server with 3 parallel slots
# (-ngl 99: offload all layers to the GPU, -fa: flash attention, -c 0: context size from the model, -np 3: 3 slots)
./bin/llama-server -m ../models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf -ngl 99 -fa --port 8033 -c 0 -np 3

# generation task with a small prompt
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n '{ messages: [{ role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Write quick sort in c++." }], "stream": true }')"

# task with a large prompt
curl --request POST --url http://127.0.0.1:8033/completion --header "Content-Type: application/json" --data '{"prompt": "'"$(printf 'hello %.0s' $(seq 1 8149))"'. I believe the meaning of life is","n_predict": 64, "cache_prompt": true}' | jq
```

With this PR, the first task is no longer "blocked" by the second task's long prompt processing.

Still, I'm not sure how valuable this feature is. Chunking the prompts like this makes overall prompt processing slower, so even though the server feels more responsive, the total wait time summed over all requests is longer.
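For intuition only (made-up numbers, not measurements): if an uninterrupted 8192-token prefill takes about 2 s, then without chunking the generating slots stall for those 2 s but the prefill batches run at full efficiency; with chunking, the generating slots keep streaming throughout, but the same 8192 prompt tokens are spread across more scheduling rounds shared with single-token decode work, so the long request finishes later than 2 s and the summed completion time across requests grows.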
