[Bugfix][Frontend] Fixed issue where requests with duplicate request IDs might be sent to EngineCore simultaneously #15326
base: main
Conversation
…IDs might be sent to EngineCore simultaneously
Signed-off-by: 盏一 <[email protected]>
```diff
@@ -373,7 +393,6 @@ def _update_stats_from_finished(self, req_state: RequestState,
             num_prompt_tokens=len(req_state.prompt_token_ids),
             max_tokens_param=req_state.max_tokens_param,
             req_stats=req_state.stats)
-        self.lora_states.finish_request(req_state)
```
See comment
Thanks, I agree this can cause leaks if metrics are disabled
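To make the concern concrete, here is a minimal sketch of the direction implied by the diff above: moving the LoRA cleanup off the stats-only path. The class wrapper and the `log_stats` guard are assumptions for illustration, not the PR's exact code.

```python
class OutputProcessorSketch:
    """Illustrative only; mirrors the shape of the PR's OutputProcessor."""

    def finish_request(self, req_state) -> None:
        # Release LoRA bookkeeping unconditionally: if it only ran inside
        # _update_stats_from_finished, disabling metrics would skip it and
        # leak lora_states entries for every finished request.
        self.lora_states.finish_request(req_state)
        if self.log_stats:  # assumed guard on the metrics path
            self._update_stats_from_finished(req_state)
```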
Thanks for your contribution! I agree that this is a race condition. Appreciate you digging in!
```python
        self.handle_abort_reqs(request_ids_to_abort)
        return request_ids_to_abort

    def flatten_req_to_abort(self, req_ids: Iterable[str]) -> list[str]:
```
Can we call this something more descriptive, like `get_parent_and_children_reqs`?
```python
            ret.extend(parent.child_requests)
        return ret
```
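For reference, a sketch of the helper under the suggested name, reconstructed around the two lines quoted above; the `parent_requests` mapping is an assumption, not a verified attribute.

```python
from collections.abc import Iterable

class OutputProcessorSketch:
    def get_parent_and_children_reqs(self, req_ids: Iterable[str]) -> list[str]:
        # Expand each request ID into itself plus any child request IDs
        # (parallel sampling fans one parent request out into n children).
        ret: list[str] = []
        for req_id in req_ids:
            ret.append(req_id)
            parent = self.parent_requests.get(req_id)  # assumed mapping
            if parent is not None:
                ret.extend(parent.child_requests)
        return ret
```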
# "Aborted request", meaning the frontend first detects that |
This should be a docstring rather than a comment.
# "Finished request", meaning EngineCore first detects that | ||
# the request has ended, and the resources related to the request | ||
# maintained by EngineCore have been released. | ||
def _handle_finished_reqs(self, req_id): |
Let's call this `def finish_request(self, request_id: str) -> None`
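Combining both suggestions, the method might look like this sketch; the body is a placeholder, and only the renamed signature and the comment-to-docstring conversion are being illustrated.

```python
class OutputProcessorSketch:
    def finish_request(self, request_id: str) -> None:
        """Handle a "finished request".

        EngineCore first detected that the request has ended, and the
        resources related to the request maintained by EngineCore have
        already been released, so the frontend can now drop its own state.
        """
        self.request_states.pop(request_id, None)  # assumed frontend bookkeeping
```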
```python
        put the RequestOutput objects into the queue for
        handling by the per-request generate() tasks.

        * If there is no queue (for usage with LLMEngine),
```
Can you add a comment to the docstring about why we finish the stop string requests externally to this function?
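A sketch of what that note could look like; the wording and the surrounding signature are illustrative, not the PR's final text.

```python
class OutputProcessorSketch:
    def process_outputs(self, engine_core_outputs):
        """Put the RequestOutput objects into the queue for handling by the
        per-request generate() tasks.

        Note: stop strings are detected here in the frontend, so requests
        that hit a stop string are finished outside this function;
        EngineCore still holds their state and must first be told to
        abort them before their IDs may be reused.
        """
```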
```python
        await self.engine_core.abort_requests_async(request_ids)
        # At this point, the abort message has already been sent to EngineCore,
```
Can you update this comment to explain why this ordering is important for the race condition?
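For instance, the expanded comment could spell out the ordering the way this sketch does; the method shape is assumed from the snippet above, not taken verbatim from the PR.

```python
class AsyncLLMSketch:
    async def abort(self, request_ids: list[str]) -> None:
        # Tell EngineCore first. Until this completes, the frontend still
        # "sees" these IDs, so adding a duplicate is rejected up front.
        await self.engine_core.abort_requests_async(request_ids)
        # Only now is it safe to release frontend-side state: EngineCore no
        # longer holds the requests, so a new request reusing one of these
        # IDs cannot reach EngineCore while the old one is still live there.
        self.output_processor.handle_abort_reqs(request_ids)
```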
Thanks a ton! I reviewed the implementation in detail and you have fixed the problem! Just left some minor comments about naming the functions and comments. Ping me on Slack when this is ready!
Currently, vLLM allows users to send duplicate request IDs. At the same time, numerous modules in EngineCore use request IDs as dictionary keys, such as `KVCacheManager.req_to_blocks`. This is based on the assumption that EngineCore always expects the frontend to first abort a request before adding a new one with the same request ID.

Currently, `AsyncLLM` ensures that duplicate request IDs must first be aborted before they can be added, through the sequence `AsyncLLM._add_request` -> `OutputProcessor.add_request`.
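The guard that enforces this lives on the add path. A sketch of its shape, where the exact error message and the `_make_request_state` helper are assumptions:

```python
class OutputProcessorSketch:
    def add_request(self, request, queue=None) -> None:
        request_id = request.request_id
        # Duplicate IDs that are still live in the frontend are rejected;
        # callers must abort the old request before reusing its ID.
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states[request_id] = self._make_request_state(request)
```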
We can easily simulate the potential bug by enlarging the possible time window with an `await asyncio.sleep(13)` inserted at the BUG point.
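A repro sketch under stated assumptions: `engine` stands in for an `AsyncLLM` instance and `params` for a `SamplingParams`; neither call signature is taken verbatim from the codebase.

```python
import asyncio

async def repro(engine, params) -> None:
    async def one_call() -> None:
        # Both calls deliberately reuse the same request ID.
        async for _ in engine.generate("hello", params, request_id="dup-id"):
            pass

    # Run the two calls concurrently: with the artificially enlarged window,
    # the second add can reach EngineCore while the first request is live.
    await asyncio.gather(one_call(), one_call())
```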
To fix this issue, we categorized completed requests into two types:

- `handle_abort_reqs` for "aborted requests", where the frontend first detects that the request has ended
- `_handle_finished_reqs` for "finished requests", where EngineCore first detects that the request has ended

and ensured that the scope of request visibility in the frontend always includes the scope of request visibility in EngineCore.
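That visibility invariant can be stated as a check. A sketch, assuming each set is a snapshot of one side's live request IDs:

```python
def check_visibility_invariant(frontend_req_ids: set[str],
                               engine_core_req_ids: set[str]) -> None:
    # Every request EngineCore still tracks must also still be visible to
    # the frontend; otherwise a duplicate ID could be re-added while
    # EngineCore holds state for the old request.
    assert engine_core_req_ids <= frontend_req_ids
```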