
server : chunked prefill support #10718

Draft · wants to merge 1 commit into master

Conversation

ggerganov (Owner) commented:

This is an example of what I think is "chunked prefill". The idea is to avoid blocking text-generating slots when a new, large prompt arrives for processing in parallel.
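Schematically, the batching logic this implies looks something like the sketch below. This is a minimal illustration with hypothetical `Slot` and `build_batch` names, not the actual server.cpp code: generating slots always get their single decode token into the batch, and a pending prompt only fills whatever room is left, one chunk at a time.

```cpp
// schematic sketch only: hypothetical types, not the real server implementation
#include <cstddef>
#include <cstdint>
#include <vector>

struct Slot {
    std::vector<int32_t> prompt_tokens;       // prompt tokens still waiting to be prefilled
    size_t               n_prefilled = 0;     // how many of them have been processed so far
    bool                 generating  = false; // slot is currently in the decode phase
    int32_t              next_token  = 0;     // next token to decode, if generating
};

// Build one batch: decode tokens go in first, then a *chunk* of any pending
// prompt fills the remaining space instead of the whole prompt at once.
static std::vector<int32_t> build_batch(std::vector<Slot> & slots, size_t n_batch) {
    std::vector<int32_t> batch;

    for (Slot & s : slots) {
        if (s.generating && batch.size() < n_batch) {
            batch.push_back(s.next_token); // generating slots are never starved
        }
    }

    for (Slot & s : slots) {
        if (batch.size() >= n_batch) {
            break;
        }
        const size_t remaining = s.prompt_tokens.size() - s.n_prefilled;
        const size_t room      = n_batch - batch.size();
        const size_t take      = remaining < room ? remaining : room;

        // take only as many prompt tokens as still fit in this batch
        batch.insert(batch.end(),
                     s.prompt_tokens.begin() + s.n_prefilled,
                     s.prompt_tokens.begin() + s.n_prefilled + take);
        s.n_prefilled += take;
    }

    return batch; // the caller would submit this to llama_decode and repeat
}
```

The second loop is the key point: a long prompt advances one chunk per scheduling round, so no generation slot has to wait for the entire prefill to finish.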

```bash
# start the server with 3 parallel slots
# (-ngl 99: offload all layers to the GPU, -fa: flash attention, -c 0: context size from the model, -np 3: 3 slots)
./bin/llama-server -m ../models/qwen2.5-32b-coder-instruct/ggml-model-q8_0.gguf -ngl 99 -fa --port 8033 -c 0 -np 3

# generation task with a small prompt
curl --request POST --url http://localhost:8033/v1/chat/completions -H "Content-Type: application/json" -H "Authorization: Bearer no-key" -d "$(jq -n '{ messages: [{ role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Write quick sort in c++." }], "stream": true }')"

# task with a large prompt
curl --request POST --url http://127.0.0.1:8033/completion --header "Content-Type: application/json" --data '{"prompt": "'"$(printf 'hello %.0s' $(seq 1 8149))"'. I believe the meaning of life is","n_predict": 64, "cache_prompt": true}' | jq
```

With this PR, the first task is no longer "blocked" by the second task's long prompt processing.

Still, I'm not sure how valuable this feature is. Chunking the prompts like this makes overall prompt processing slower, so even though the server feels more responsive, the total wait time summed over all requests is longer.
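For intuition only (made-up numbers, not measurements): if an uninterrupted 8192-token prefill takes about 2 s, then without chunking the generating slots stall for those 2 s but the prefill batches run at full efficiency; with chunking, the generating slots keep streaming throughout, but the same 8192 prompt tokens are spread across more scheduling rounds shared with single-token decode work, so the long request finishes later than 2 s and the summed completion time across requests grows.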
