test: add tool calling and reasoning tests for frontend on GPT-OSS #3636

zhongdaor-nv · 2025-10-15T04:25:39Z

Overview:

This PR expands the reasoning effort test suite to include comprehensive tool calling and reasoning tests for GPT-OSS models. The changes refactor the test structure to use module-scoped fixtures for better test efficiency and add three new test cases covering tool calling scenarios and reasoning capabilities.

Details:

Refactored test fixtures to use module-scoped runtime services (NATS, etcd, frontend, worker) to avoid repeated startup/teardown
Renamed GPTOSSWorkerProcess to VllmWorkerProcess for clarity
Renamed REASONING_TEST_MODEL to TEST_MODEL for broader test coverage
Added test_tool_calling to verify basic tool calling with weather and system health tools
Added test_tool_calling_second_round to test multi-turn conversations with tool call results
Added test_reasoning to validate reasoning capabilities with mathematical problems
Simplified _send_chat_request helper to accept generic payload instead of specific parameters
Added reusable tool definitions: WEATHER_TOOL and SYSTEM_HEALTH_TOOL

Where should the reviewer start?

Review tests/frontend/reasoning_effort/test_reasoning_effort.py to examine the new test cases and fixture refactoring. Pay attention to the module-scoped fixtures and how the new tool calling tests validate different conversation flows.

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Tests
- Added end-to-end coverage for reasoning effort comparisons, ensuring higher effort produces more detailed reasoning or token usage.
- Validated tool-calling flows, including follow-up interactions, with tool outputs reflected in final answers.
- Added checks for basic reasoning correctness and response structure to improve stability.
- Removed legacy reasoning-effort E2E tests to reduce redundancy and flakiness.

Signed-off-by: zhongdaor <[email protected]>

…reasoning-and-tool-calling-parser

coderabbitai · 2025-10-15T04:26:36Z

Walkthrough

Replaces removed frontend reasoning-effort E2E test with a new VLLM-focused E2E test suite. Introduces process managers for Dynamo frontend and VLLM worker, shared fixtures for NATS/ETCD, and tests covering reasoning effort comparison, tool calling (single and follow-up rounds), and basic reasoning validation.

Changes

Cohort / File(s)	Summary of Changes
Remove legacy reasoning-effort E2E test `tests/frontend/reasoning_effort/test_reasoning_effort.py`	Deleted end-to-end reasoning-effort test harness, including custom frontend/worker process managers, helpers, and a single comparative reasoning-effort test.
Add VLLM E2E tests and process managers `tests/frontend/test_vllm.py`	Added module with DynamoFrontendProcess and VllmWorkerProcess, fixtures for NATS/ETCD and process lifecycles, helpers for chat requests and parsing, and tests for reasoning effort, tool calling, second-round tool flow, and basic reasoning.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Py as pytest test
  participant FE as DynamoFrontendProcess
  participant WK as VllmWorkerProcess
  participant API as Frontend HTTP API
  Note over FE,WK: Setup phase (module-scoped)
  Py->>FE: start() + health check
  Py->>WK: start() + health check
  Note over Py,API: Reasoning Effort Comparison
  Py->>API: POST /chat (low effort)
  API->>WK: Dispatch request
  WK-->>API: Response (content, usage)
  API-->>Py: Return low-effort result
  Py->>API: POST /chat (high effort)
  API->>WK: Dispatch request
  WK-->>API: Response (content, usage)
  API-->>Py: Return high-effort result
  Py->>Py: Assert tokens/length: high ≥ low

sequenceDiagram
  autonumber
  participant Py as pytest test
  participant FE as DynamoFrontendProcess
  participant WK as VllmWorkerProcess
  participant API as Frontend HTTP API
  participant Tool as Tools (weather, system health)
  Note over Py,API: Tool Calling - Round 1
  Py->>API: POST /chat (tool-enabled)
  API->>WK: Dispatch
  WK-->>API: Message with tool_calls
  API-->>Py: Return tool call(s)
  Note over Py,API: Tool Calling - Round 2 (follow-up)
  Py->>Tool: Execute prior tool_calls
  Tool-->>Py: Tool results (e.g., temperature)
  Py->>API: POST /chat (include tool results + prior calls)
  API->>WK: Dispatch
  WK-->>API: Final content with tool-derived data
  API-->>Py: Return final message

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Poem

I hop through logs and fixtures bright,
Spin up workers, health checks light—
With tools and tokens side by side,
High effort’s longer, stats don’t hide.
VLLM hums, the frontend sings,
A bunny ships well-tested things. 🐇✨

Pre-merge checks

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description Check	⚠️ Warning	The pull request description follows the repository template for Overview, Details, and reviewer guidance but leaves the Related Issues section empty, which is required by the template.	Populate the Related Issues section with the appropriate issue references using action keywords such as Closes, Fixes, Resolves, or Relates to.
Docstring Coverage	⚠️ Warning	Docstring coverage is 75.00% which is insufficient. The required threshold is 80.00%.	You can run `@coderabbitai generate docstrings` to improve docstring coverage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title Check	✅ Passed	The title clearly summarizes the main change by indicating the addition of tool calling and reasoning tests for the frontend on GPT-OSS and uses concise, descriptive language that aligns with the changeset.

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🧹 Nitpick comments (2)

tests/frontend/test_vllm.py (2)

67-72: Consider using shutil.rmtree with ignore_errors=True for cleaner cleanup.

The manual try/except pattern works but can be simplified:

-        try:
-            shutil.rmtree(log_dir)
-            logger.info(f"Cleaned up existing log directory: {log_dir}")
-        except FileNotFoundError:
-            # Directory doesn't exist, which is fine
-            pass
+        shutil.rmtree(log_dir, ignore_errors=True)

108-111: Consider using shutil.rmtree with ignore_errors=True for cleaner cleanup.

Same simplification opportunity as in the frontend class:

-        try:
-            shutil.rmtree(log_dir)
-        except FileNotFoundError:
-            pass
+        shutil.rmtree(log_dir, ignore_errors=True)

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 6a1391e and 371a1fb.

📒 Files selected for processing (2)

tests/frontend/reasoning_effort/test_reasoning_effort.py (0 hunks)
tests/frontend/test_vllm.py (1 hunks)

💤 Files with no reviewable changes (1)

tests/frontend/reasoning_effort/test_reasoning_effort.py

🧰 Additional context used

🧬 Code graph analysis (1)

tests/frontend/test_vllm.py (3)

tests/conftest.py (3)

EtcdServer (192-215)

NatsServer (218-229)

predownload_models (124-136)

tests/utils/managed_process.py (1)

ManagedProcess (71-568)

tests/utils/payloads.py (1)

check_models_api (232-243)

🪛 Ruff (0.14.0)

tests/frontend/test_vllm.py

168-168: Unused function argument: runtime_services

(ARG001)

181-181: Avoid specifying long messages outside the exception class

(TRY003)

191-191: Avoid specifying long messages outside the exception class

(TRY003)

203-203: Avoid specifying long messages outside the exception class

(TRY003)

212-212: Unused function argument: request

(ARG001)

212-212: Unused function argument: runtime_services

(ARG001)

212-212: Unused function argument: predownload_models

(ARG001)

278-278: Unused function argument: request

(ARG001)

278-278: Unused function argument: runtime_services

(ARG001)

278-278: Unused function argument: predownload_models

(ARG001)

321-321: Unused function argument: request

(ARG001)

321-321: Unused function argument: runtime_services

(ARG001)

321-321: Unused function argument: predownload_models

(ARG001)

385-385: Unused function argument: request

(ARG001)

385-385: Unused function argument: runtime_services

(ARG001)

385-385: Unused function argument: predownload_models

(ARG001)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)

GitHub Check: Build and Test - dynamo

🔇 Additional comments (3)

tests/frontend/test_vllm.py (3)

207-271: LGTM!

The test correctly compares reasoning effort with a sensible fallback mechanism when token counts are unavailable.

273-313: LGTM!

The test correctly validates tool calling behavior and verifies the expected tool is invoked.

380-417: LGTM!

The test appropriately validates reasoning capabilities with a mathematical problem and checks for numerical output.

tests/frontend/test_vllm.py

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: zhongdaor-nv <[email protected]>

add tests

5300471

Signed-off-by: zhongdaor <[email protected]>

pull-request-size bot added the size/XL label Oct 15, 2025

github-actions bot added the test label Oct 15, 2025

Merge branch 'main' into zhongdaor/dis-821-add-e2e-tests-for-gpt-oss-…

371a1fb

…reasoning-and-tool-calling-parser

copy-pr-bot bot temporarily deployed to GITLAB October 15, 2025 04:25 Inactive

zhongdaor-nv marked this pull request as ready for review October 15, 2025 04:26

zhongdaor-nv requested review from a team as code owners October 15, 2025 04:26

copy-pr-bot bot temporarily deployed to GITLAB October 15, 2025 04:26 Inactive

coderabbitai bot reviewed Oct 15, 2025

View reviewed changes

tests/frontend/test_vllm.py Outdated Show resolved Hide resolved

Apply suggestion from @coderabbitai[bot]

a6dd1ff

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: zhongdaor-nv <[email protected]>

copy-pr-bot bot temporarily deployed to GITLAB October 15, 2025 04:34 Inactive

copy-pr-bot bot temporarily deployed to GITLAB October 15, 2025 04:35 Inactive

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

test: add tool calling and reasoning tests for frontend on GPT-OSS #3636

test: add tool calling and reasoning tests for frontend on GPT-OSS #3636

zhongdaor-nv commented Oct 15, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

coderabbitai bot commented Oct 15, 2025 •

edited

Loading

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

test: add tool calling and reasoning tests for frontend on GPT-OSS #3636

Are you sure you want to change the base?

test: add tool calling and reasoning tests for frontend on GPT-OSS #3636

Conversation

zhongdaor-nv commented Oct 15, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Overview:

Details:

Where should the reviewer start?

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

Summary by CodeRabbit

Uh oh!

coderabbitai bot commented Oct 15, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Poem

Pre-merge checks

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

zhongdaor-nv commented Oct 15, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Oct 15, 2025 •

edited

Loading