[Bugfix][Frontend] Fixed issue where requests with duplicate request IDs might be sent to EngineCore simultaneously #15326
base: main
Conversation
…IDs might be sent to EngineCore simultaneously
Signed-off-by: 盏一 <[email protected]>
```diff
@@ -373,7 +393,6 @@ def _update_stats_from_finished(self, req_state: RequestState,
             num_prompt_tokens=len(req_state.prompt_token_ids),
             max_tokens_param=req_state.max_tokens_param,
             req_stats=req_state.stats)
-        self.lora_states.finish_request(req_state)
```
See comment
Thanks, I agree this can cause leaks if metrics are disabled
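To make the concern concrete, here is a minimal sketch of the direction implied by the diff above: moving the LoRA cleanup off the stats-only path. The class wrapper and the `log_stats` guard are assumptions for illustration, not the PR's exact code.

```python
class OutputProcessorSketch:
    """Illustrative only; mirrors the shape of the PR's OutputProcessor."""

    def finish_request(self, req_state) -> None:
        # Release LoRA bookkeeping unconditionally: if it only ran inside
        # _update_stats_from_finished, disabling metrics would skip it and
        # leak lora_states entries for every finished request.
        self.lora_states.finish_request(req_state)
        if self.log_stats:  # assumed guard on the metrics path
            self._update_stats_from_finished(req_state)
```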
Thanks for your contribution! I agree that this is a race condition. Appreciate you digging in!
```python
        self.handle_abort_reqs(request_ids_to_abort)
        return request_ids_to_abort

    def flatten_req_to_abort(self, req_ids: Iterable[str]) -> list[str]:
```
Can we call this something more descriptive, like `get_parent_and_children_reqs`?
```python
            ret.extend(parent.child_requests)
        return ret
```
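For reference, a sketch of the helper under the suggested name, reconstructed around the two lines quoted above; the `parent_requests` mapping is an assumption, not a verified attribute.

```python
from collections.abc import Iterable

class OutputProcessorSketch:
    def get_parent_and_children_reqs(self, req_ids: Iterable[str]) -> list[str]:
        # Expand each request ID into itself plus any child request IDs
        # (parallel sampling fans one parent request out into n children).
        ret: list[str] = []
        for req_id in req_ids:
            ret.append(req_id)
            parent = self.parent_requests.get(req_id)  # assumed mapping
            if parent is not None:
                ret.extend(parent.child_requests)
        return ret
```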
# "Aborted request", meaning the frontend first detects that |
This should be a docstring rather than a comment.
# "Finished request", meaning EngineCore first detects that | ||
# the request has ended, and the resources related to the request | ||
# maintained by EngineCore have been released. | ||
def _handle_finished_reqs(self, req_id): |
Let's call this `def finish_request(self, request_id: str) -> None`
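Combining both suggestions, the method might look like this sketch; the body is a placeholder, and only the renamed signature and the comment-to-docstring conversion are being illustrated.

```python
class OutputProcessorSketch:
    def finish_request(self, request_id: str) -> None:
        """Handle a "finished request".

        EngineCore first detected that the request has ended, and the
        resources related to the request maintained by EngineCore have
        already been released, so the frontend can now drop its own state.
        """
        self.request_states.pop(request_id, None)  # assumed frontend bookkeeping
```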
```python
        put the RequestOutput objects into the queue for
        handling by the per-request generate() tasks.

        * If there is no queue (for usage with LLMEngine),
```
Can you add a comment to the docstring about why we finish the stop string requests externally to this function?
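A sketch of what that note could look like; the wording and the surrounding signature are illustrative, not the PR's final text.

```python
class OutputProcessorSketch:
    def process_outputs(self, engine_core_outputs):
        """Put the RequestOutput objects into the queue for handling by the
        per-request generate() tasks.

        Note: stop strings are detected here in the frontend, so requests
        that hit a stop string are finished outside this function;
        EngineCore still holds their state and must first be told to
        abort them before their IDs may be reused.
        """
```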
```python
        await self.engine_core.abort_requests_async(request_ids)
        # At this point, the abort message has already been sent to EngineCore,
```
Can you update this comment to explain why this ordering is important for the race condition?
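For instance, the expanded comment could spell out the ordering the way this sketch does; the method shape is assumed from the snippet above, not taken verbatim from the PR.

```python
class AsyncLLMSketch:
    async def abort(self, request_ids: list[str]) -> None:
        # Tell EngineCore first. Until this completes, the frontend still
        # "sees" these IDs, so adding a duplicate is rejected up front.
        await self.engine_core.abort_requests_async(request_ids)
        # Only now is it safe to release frontend-side state: EngineCore no
        # longer holds the requests, so a new request reusing one of these
        # IDs cannot reach EngineCore while the old one is still live there.
        self.output_processor.handle_abort_reqs(request_ids)
```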
Thanks a ton! I reviewed the implementation in detail and you have fixed the problem! Just left some minor comments about naming the functions and comments. Ping me on Slack when this is ready!
Currently, vLLM allows users to send duplicate request IDs. At the same time, numerous modules in EngineCore use request IDs as dictionary keys, such as `KVCacheManager.req_to_blocks`. This is based on the assumption that EngineCore always expects the frontend to first abort a request before adding a new one with the same request ID.

Currently, `AsyncLLM` ensures that duplicate request IDs must first be aborted before they can be added, through the sequence `AsyncLLM._add_request` -> `OutputProcessor.add_request`.
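The guard that enforces this lives on the add path. A sketch of its shape, where the exact error message and the `_make_request_state` helper are assumptions:

```python
class OutputProcessorSketch:
    def add_request(self, request, queue=None) -> None:
        request_id = request.request_id
        # Duplicate IDs that are still live in the frontend are rejected;
        # callers must abort the old request before reusing its ID.
        if request_id in self.request_states:
            raise ValueError(f"Request id {request_id} already running.")
        self.request_states[request_id] = self._make_request_state(request)
```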
We can easily simulate the potential bug by enlarging the possible time window with an `await asyncio.sleep(13)` inserted at the BUG point.
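A repro sketch under stated assumptions: `engine` stands in for an `AsyncLLM` instance and `params` for a `SamplingParams`; neither call signature is taken verbatim from the codebase.

```python
import asyncio

async def repro(engine, params) -> None:
    async def one_call() -> None:
        # Both calls deliberately reuse the same request ID.
        async for _ in engine.generate("hello", params, request_id="dup-id"):
            pass

    # Run the two calls concurrently: with the artificially enlarged window,
    # the second add can reach EngineCore while the first request is live.
    await asyncio.gather(one_call(), one_call())
```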
To fix this issue, we categorized completed requests into two types:

- `handle_abort_reqs` for "aborted requests", where the frontend first detects that the request has ended
- `_handle_finished_reqs` for "finished requests", where EngineCore first detects that the request has ended

and ensured that the scope of request visibility in the frontend always includes the scope of request visibility in EngineCore.
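That visibility invariant can be stated as a check. A sketch, assuming each set is a snapshot of one side's live request IDs:

```python
def check_visibility_invariant(frontend_req_ids: set[str],
                               engine_core_req_ids: set[str]) -> None:
    # Every request EngineCore still tracks must also still be visible to
    # the frontend; otherwise a duplicate ID could be re-added while
    # EngineCore holds state for the old request.
    assert engine_core_req_ids <= frontend_req_ids
```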