Skip to content

Conversation

zhongdaor-nv
Copy link
Contributor

@zhongdaor-nv zhongdaor-nv commented Oct 15, 2025

Overview:

This PR expands the reasoning effort test suite to include comprehensive tool calling and reasoning tests for GPT-OSS models. The changes refactor the test structure to use module-scoped fixtures for better test efficiency and add three new test cases covering tool calling scenarios and reasoning capabilities.

Details:

  • Refactored test fixtures to use module-scoped runtime services (NATS, etcd, frontend, worker) to avoid repeated startup/teardown
  • Renamed GPTOSSWorkerProcess to VllmWorkerProcess for clarity
  • Renamed REASONING_TEST_MODEL to TEST_MODEL for broader test coverage
  • Added test_tool_calling to verify basic tool calling with weather and system health tools
  • Added test_tool_calling_second_round to test multi-turn conversations with tool call results
  • Added test_reasoning to validate reasoning capabilities with mathematical problems
  • Simplified _send_chat_request helper to accept generic payload instead of specific parameters
  • Added reusable tool definitions: WEATHER_TOOL and SYSTEM_HEALTH_TOOL

Where should the reviewer start?

Review tests/frontend/reasoning_effort/test_reasoning_effort.py to examine the new test cases and fixture refactoring. Pay attention to the module-scoped fixtures and how the new tool calling tests validate different conversation flows.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

  • Tests
    • Added end-to-end coverage for reasoning effort comparisons, ensuring higher effort produces more detailed reasoning or token usage.
    • Validated tool-calling flows, including follow-up interactions, with tool outputs reflected in final answers.
    • Added checks for basic reasoning correctness and response structure to improve stability.
    • Removed legacy reasoning-effort E2E tests to reduce redundancy and flakiness.

Signed-off-by: zhongdaor <[email protected]>
@zhongdaor-nv zhongdaor-nv marked this pull request as ready for review October 15, 2025 04:26
@zhongdaor-nv zhongdaor-nv requested review from a team as code owners October 15, 2025 04:26
Copy link
Contributor

coderabbitai bot commented Oct 15, 2025

Walkthrough

Replaces removed frontend reasoning-effort E2E test with a new VLLM-focused E2E test suite. Introduces process managers for Dynamo frontend and VLLM worker, shared fixtures for NATS/ETCD, and tests covering reasoning effort comparison, tool calling (single and follow-up rounds), and basic reasoning validation.

Changes

Cohort / File(s) Summary of Changes
Remove legacy reasoning-effort E2E test
tests/frontend/reasoning_effort/test_reasoning_effort.py
Deleted end-to-end reasoning-effort test harness, including custom frontend/worker process managers, helpers, and a single comparative reasoning-effort test.
Add VLLM E2E tests and process managers
tests/frontend/test_vllm.py
Added module with DynamoFrontendProcess and VllmWorkerProcess, fixtures for NATS/ETCD and process lifecycles, helpers for chat requests and parsing, and tests for reasoning effort, tool calling, second-round tool flow, and basic reasoning.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Py as pytest test
  participant FE as DynamoFrontendProcess
  participant WK as VllmWorkerProcess
  participant API as Frontend HTTP API
  Note over FE,WK: Setup phase (module-scoped)
  Py->>FE: start() + health check
  Py->>WK: start() + health check
  Note over Py,API: Reasoning Effort Comparison
  Py->>API: POST /chat (low effort)
  API->>WK: Dispatch request
  WK-->>API: Response (content, usage)
  API-->>Py: Return low-effort result
  Py->>API: POST /chat (high effort)
  API->>WK: Dispatch request
  WK-->>API: Response (content, usage)
  API-->>Py: Return high-effort result
  Py->>Py: Assert tokens/length: high ≥ low
Loading
sequenceDiagram
  autonumber
  participant Py as pytest test
  participant FE as DynamoFrontendProcess
  participant WK as VllmWorkerProcess
  participant API as Frontend HTTP API
  participant Tool as Tools (weather, system health)
  Note over Py,API: Tool Calling - Round 1
  Py->>API: POST /chat (tool-enabled)
  API->>WK: Dispatch
  WK-->>API: Message with tool_calls
  API-->>Py: Return tool call(s)
  Note over Py,API: Tool Calling - Round 2 (follow-up)
  Py->>Tool: Execute prior tool_calls
  Tool-->>Py: Tool results (e.g., temperature)
  Py->>API: POST /chat (include tool results + prior calls)
  API->>WK: Dispatch
  WK-->>API: Final content with tool-derived data
  API-->>Py: Return final message
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I hop through logs and fixtures bright,
Spin up workers, health checks light—
With tools and tokens side by side,
High effort’s longer, stats don’t hide.
VLLM hums, the frontend sings,
A bunny ships well-tested things. 🐇✨

Pre-merge checks

❌ Failed checks (2 warnings)
Check name Status Explanation Resolution
Description Check ⚠️ Warning The pull request description follows the repository template for Overview, Details, and reviewer guidance but leaves the Related Issues section empty, which is required by the template. Populate the Related Issues section with the appropriate issue references using action keywords such as Closes, Fixes, Resolves, or Relates to.
Docstring Coverage ⚠️ Warning Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (1 passed)
Check name Status Explanation
Title Check ✅ Passed The title clearly summarizes the main change by indicating the addition of tool calling and reasoning tests for the frontend on GPT-OSS and uses concise, descriptive language that aligns with the changeset.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/frontend/test_vllm.py (2)

67-72: Consider using shutil.rmtree with ignore_errors=True for cleaner cleanup.

The manual try/except pattern works but can be simplified:

-        try:
-            shutil.rmtree(log_dir)
-            logger.info(f"Cleaned up existing log directory: {log_dir}")
-        except FileNotFoundError:
-            # Directory doesn't exist, which is fine
-            pass
+        shutil.rmtree(log_dir, ignore_errors=True)

108-111: Consider using shutil.rmtree with ignore_errors=True for cleaner cleanup.

Same simplification opportunity as in the frontend class:

-        try:
-            shutil.rmtree(log_dir)
-        except FileNotFoundError:
-            pass
+        shutil.rmtree(log_dir, ignore_errors=True)
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a1391e and 371a1fb.

📒 Files selected for processing (2)
  • tests/frontend/reasoning_effort/test_reasoning_effort.py (0 hunks)
  • tests/frontend/test_vllm.py (1 hunks)
💤 Files with no reviewable changes (1)
  • tests/frontend/reasoning_effort/test_reasoning_effort.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/frontend/test_vllm.py (3)
tests/conftest.py (3)
  • EtcdServer (192-215)
  • NatsServer (218-229)
  • predownload_models (124-136)
tests/utils/managed_process.py (1)
  • ManagedProcess (71-568)
tests/utils/payloads.py (1)
  • check_models_api (232-243)
🪛 Ruff (0.14.0)
tests/frontend/test_vllm.py

168-168: Unused function argument: runtime_services

(ARG001)


181-181: Avoid specifying long messages outside the exception class

(TRY003)


191-191: Avoid specifying long messages outside the exception class

(TRY003)


203-203: Avoid specifying long messages outside the exception class

(TRY003)


212-212: Unused function argument: request

(ARG001)


212-212: Unused function argument: runtime_services

(ARG001)


212-212: Unused function argument: predownload_models

(ARG001)


278-278: Unused function argument: request

(ARG001)


278-278: Unused function argument: runtime_services

(ARG001)


278-278: Unused function argument: predownload_models

(ARG001)


321-321: Unused function argument: request

(ARG001)


321-321: Unused function argument: runtime_services

(ARG001)


321-321: Unused function argument: predownload_models

(ARG001)


385-385: Unused function argument: request

(ARG001)


385-385: Unused function argument: runtime_services

(ARG001)


385-385: Unused function argument: predownload_models

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
tests/frontend/test_vllm.py (3)

207-271: LGTM!

The test correctly compares reasoning effort with a sensible fallback mechanism when token counts are unavailable.


273-313: LGTM!

The test correctly validates tool calling behavior and verifies the expected tool is invoked.


380-417: LGTM!

The test appropriately validates reasoning capabilities with a mathematical problem and checks for numerical output.

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Signed-off-by: zhongdaor-nv <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant