-
Notifications
You must be signed in to change notification settings - Fork 641
test: add tool calling and reasoning tests for frontend on GPT-OSS #3636
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
test: add tool calling and reasoning tests for frontend on GPT-OSS #3636
Conversation
Signed-off-by: zhongdaor <[email protected]>
…reasoning-and-tool-calling-parser
WalkthroughReplaces removed frontend reasoning-effort E2E test with a new VLLM-focused E2E test suite. Introduces process managers for Dynamo frontend and VLLM worker, shared fixtures for NATS/ETCD, and tests covering reasoning effort comparison, tool calling (single and follow-up rounds), and basic reasoning validation. Changes
Sequence Diagram(s)sequenceDiagram
autonumber
participant Py as pytest test
participant FE as DynamoFrontendProcess
participant WK as VllmWorkerProcess
participant API as Frontend HTTP API
Note over FE,WK: Setup phase (module-scoped)
Py->>FE: start() + health check
Py->>WK: start() + health check
Note over Py,API: Reasoning Effort Comparison
Py->>API: POST /chat (low effort)
API->>WK: Dispatch request
WK-->>API: Response (content, usage)
API-->>Py: Return low-effort result
Py->>API: POST /chat (high effort)
API->>WK: Dispatch request
WK-->>API: Response (content, usage)
API-->>Py: Return high-effort result
Py->>Py: Assert tokens/length: high ≥ low
sequenceDiagram
autonumber
participant Py as pytest test
participant FE as DynamoFrontendProcess
participant WK as VllmWorkerProcess
participant API as Frontend HTTP API
participant Tool as Tools (weather, system health)
Note over Py,API: Tool Calling - Round 1
Py->>API: POST /chat (tool-enabled)
API->>WK: Dispatch
WK-->>API: Message with tool_calls
API-->>Py: Return tool call(s)
Note over Py,API: Tool Calling - Round 2 (follow-up)
Py->>Tool: Execute prior tool_calls
Tool-->>Py: Tool results (e.g., temperature)
Py->>API: POST /chat (include tool results + prior calls)
API->>WK: Dispatch
WK-->>API: Final content with tool-derived data
API-->>Py: Return final message
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
Pre-merge checks❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/frontend/test_vllm.py (2)
67-72
: Consider usingshutil.rmtree
withignore_errors=True
for cleaner cleanup.The manual try/except pattern works but can be simplified:
- try: - shutil.rmtree(log_dir) - logger.info(f"Cleaned up existing log directory: {log_dir}") - except FileNotFoundError: - # Directory doesn't exist, which is fine - pass + shutil.rmtree(log_dir, ignore_errors=True)
108-111
: Consider usingshutil.rmtree
withignore_errors=True
for cleaner cleanup.Same simplification opportunity as in the frontend class:
- try: - shutil.rmtree(log_dir) - except FileNotFoundError: - pass + shutil.rmtree(log_dir, ignore_errors=True)
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
tests/frontend/reasoning_effort/test_reasoning_effort.py
(0 hunks)tests/frontend/test_vllm.py
(1 hunks)
💤 Files with no reviewable changes (1)
- tests/frontend/reasoning_effort/test_reasoning_effort.py
🧰 Additional context used
🧬 Code graph analysis (1)
tests/frontend/test_vllm.py (3)
tests/conftest.py (3)
EtcdServer
(192-215)NatsServer
(218-229)predownload_models
(124-136)tests/utils/managed_process.py (1)
ManagedProcess
(71-568)tests/utils/payloads.py (1)
check_models_api
(232-243)
🪛 Ruff (0.14.0)
tests/frontend/test_vllm.py
168-168: Unused function argument: runtime_services
(ARG001)
181-181: Avoid specifying long messages outside the exception class
(TRY003)
191-191: Avoid specifying long messages outside the exception class
(TRY003)
203-203: Avoid specifying long messages outside the exception class
(TRY003)
212-212: Unused function argument: request
(ARG001)
212-212: Unused function argument: runtime_services
(ARG001)
212-212: Unused function argument: predownload_models
(ARG001)
278-278: Unused function argument: request
(ARG001)
278-278: Unused function argument: runtime_services
(ARG001)
278-278: Unused function argument: predownload_models
(ARG001)
321-321: Unused function argument: request
(ARG001)
321-321: Unused function argument: runtime_services
(ARG001)
321-321: Unused function argument: predownload_models
(ARG001)
385-385: Unused function argument: request
(ARG001)
385-385: Unused function argument: runtime_services
(ARG001)
385-385: Unused function argument: predownload_models
(ARG001)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Build and Test - dynamo
🔇 Additional comments (3)
tests/frontend/test_vllm.py (3)
207-271
: LGTM!The test correctly compares reasoning effort with a sensible fallback mechanism when token counts are unavailable.
273-313
: LGTM!The test correctly validates tool calling behavior and verifies the expected tool is invoked.
380-417
: LGTM!The test appropriately validates reasoning capabilities with a mathematical problem and checks for numerical output.
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com> Signed-off-by: zhongdaor-nv <[email protected]>
Overview:
This PR expands the reasoning effort test suite to include comprehensive tool calling and reasoning tests for GPT-OSS models. The changes refactor the test structure to use module-scoped fixtures for better test efficiency and add three new test cases covering tool calling scenarios and reasoning capabilities.
Details:
GPTOSSWorkerProcess
toVllmWorkerProcess
for clarityREASONING_TEST_MODEL
toTEST_MODEL
for broader test coveragetest_tool_calling
to verify basic tool calling with weather and system health toolstest_tool_calling_second_round
to test multi-turn conversations with tool call resultstest_reasoning
to validate reasoning capabilities with mathematical problems_send_chat_request
helper to accept generic payload instead of specific parametersWEATHER_TOOL
andSYSTEM_HEALTH_TOOL
Where should the reviewer start?
Review
tests/frontend/reasoning_effort/test_reasoning_effort.py
to examine the new test cases and fixture refactoring. Pay attention to the module-scoped fixtures and how the new tool calling tests validate different conversation flows.Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)
Summary by CodeRabbit