Tool calling functionality #477


Open · wants to merge 7 commits into base: scosman/tools_scaffolding

Conversation

@scosman (Collaborator) commented Aug 3, 2025

Summary by CodeRabbit

  • New Features

    • Enhanced support for iterative tool usage within a single conversation turn, allowing multiple tool calls per turn.
    • Improved handling and formatting of tool calls and responses in chat history and model output.
    • Added support for math tools in model interactions.
    • Introduced a new method to list available tools.
    • Added detailed trace information to run outputs for better debugging.
  • Bug Fixes

    • Added validation to prevent conflicts between tool usage and structured output formats.
  • Tests

    • Introduced comprehensive tests for tool integration, chat history, and tool call handling.
    • Added end-to-end tests verifying model and tool interaction for math tasks.

scosman added 3 commits July 31, 2025 23:55
…aving the checkpoint.

Prompt:
We are mid-refactor on adding tool calling to this class, which makes a series of LLM chat calls.

The chat formatter manages generating the user messages in order. We want to add tool calling.

It's partly done. We can handle a single turn of tool calls (inside the main turn) and it works. But we need to extend it to do several more things:

 - Allow it to make many tool calls in a row (don't force tool-choice to none after one), and keep iterating on tool calls until they are all complete, before moving to the next turn from the chat formatter.
 - Currently the chat formatter takes care of generating the whole chat history for the next turn. This will no longer work, as some of the chat history will be tool calls/responses it's not aware of. Refactor to make it only generate the next message, not the history. The litellm_adapter class should manage the message history, and at the end of the turn it should have the complete chat saved in a property (we'll use this later, but test it now).
 - Note that there's a special tool call `task-response`, which is used for ending a top-level turn (a chat-formatter managed turn). We take the JSON from that and save it as a message (working). This needs to be treated differently from the other tool calls, as it always ends the turn.

Guidance
 - handle all tool calling in a new function that the main loop calls. We don't want the main chat loop (while True) getting more complex. A minimal sketch of this split follows after this list.
 - add appropriate tests, and keep the code testable
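The sketch below is illustrative only, not the implementation in this PR. It shows the requested split: the main chat loop delegates to a single helper that keeps iterating on tool calls, treats the task_response tool as turn-ending, and returns the full message history so the adapter can keep it in a property. The names MAX_CALLS_PER_TURN, the loop helper, and task_response echo names that appear later in this PR discussion; the parameter shapes, callables, and message dicts are assumptions.

# Hedged sketch of the requested split: the main chat loop stays simple and a
# dedicated helper iterates on tool calls until the model stops requesting them.
# Callable parameters and message shapes below are assumptions for illustration.
from typing import Any, Awaitable, Callable

MAX_CALLS_PER_TURN = 10


async def handle_tool_calls_loop(
    call_model: Callable[[list[dict[str, Any]]], Awaitable[Any]],
    run_tool: Callable[[Any], Awaitable[str]],
    messages: list[dict[str, Any]],
) -> tuple[str, list[dict[str, Any]]]:
    """Iterate on tool calls within one turn; return final content and full history."""
    for _ in range(MAX_CALLS_PER_TURN):
        message = await call_model(messages)
        messages.append(
            {"role": "assistant", "content": message.content, "tool_calls": message.tool_calls}
        )
        if not message.tool_calls:
            # No tools requested: the content is the final answer for this turn.
            return message.content, messages
        for tool_call in message.tool_calls:
            if tool_call.function.name == "task_response":
                # Special case: the task_response tool's JSON arguments end the turn.
                return tool_call.function.arguments, messages
            result = await run_tool(tool_call)
            messages.append(
                {"role": "tool", "tool_call_id": tool_call.id, "content": result}
            )
    raise RuntimeError(f"Too many tool calls (limit {MAX_CALLS_PER_TURN}) in a single turn")

In the actual adapter the accumulated messages would end up on a property so the caller can persist the full trace; the real method and argument names live in litellm_adapter.py.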
coderabbitai bot (Contributor) commented Aug 3, 2025

Walkthrough

This update introduces iterative tool call handling in the LiteLlmAdapter, allowing multiple consecutive tool invocations within a single turn. It adds new and refactored methods to manage chat history, process tool calls, and integrate tools into model requests. Extensive new tests validate these behaviors, including paid-integration tests for math tools. Additionally, a stub method for available tools is added to the base adapter, and the RunOutput dataclass is extended with a trace field.

Changes

Cohort / File(s) / Change Summary

  • Base Adapter Tool Interface (libs/core/kiln_ai/adapters/model_adapters/base_adapter.py): Added import for KilnTool and introduced a stub available_tools method in BaseAdapter (a minimal sketch of this stub appears after this list).
  • LiteLlmAdapter Iterative Tool Call Handling (libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py): Refactored to support iterative tool calls per turn, maintaining chat history, adding tool call loop logic, tool integration in completion kwargs, and helper methods for tool management and execution.
  • Unit Tests for LiteLlmAdapter Tool Integration (libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py): Added fixtures and comprehensive tests for tool formatting, completion kwargs, chat history, tool call loops, and _run method behavior with and without tools.
  • Paid Integration Tests for Math Tools (libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py): Introduced a new test module with helpers and async tests to verify math tool integration with the adapter and OpenAI models.
  • RunOutput Extension (libs/core/kiln_ai/adapters/run_output.py): Added optional trace field to RunOutput dataclass to capture detailed execution trace information.
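For orientation, a rough, illustrative sketch of the base-adapter stub described above. Only the method name (available_tools) and the KilnTool return type come from the summary; the stand-in class names and the empty default body are assumptions.

# Illustrative sketch only; KilnToolStub stands in for kiln_ai.tools.base_tool.KilnTool.
from abc import ABC


class KilnToolStub:  # stand-in for the real KilnTool
    pass


class BaseAdapterSketch(ABC):
    def available_tools(self) -> list[KilnToolStub]:
        """Tools this adapter can offer to the model; the base returns none."""
        return []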

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant LiteLlmAdapter
    participant Model
    participant Tool

    User->>LiteLlmAdapter: Send input message
    LiteLlmAdapter->>LiteLlmAdapter: Append to chat history
    loop Tool Call Loop (max N times)
        LiteLlmAdapter->>Model: Send chat history + available tools
        Model-->>LiteLlmAdapter: Return tool call(s) or final response
        alt Tool call(s) present
            LiteLlmAdapter->>Tool: Execute tool(s) with arguments
            Tool-->>LiteLlmAdapter: Return tool result(s)
            LiteLlmAdapter->>LiteLlmAdapter: Append tool results to chat history
        else Final response
            LiteLlmAdapter->>LiteLlmAdapter: Extract output and break loop
        end
    end
    LiteLlmAdapter-->>User: Return output and updated chat history

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~40 minutes

Poem

In the warren where logic and numbers entwine,
The adapter now loops, making tool calls in line.
With chat history growing and math tools in play,
Our tests leap and bound in a bright, clever way.
Oh, what fun for a rabbit to see—
Tools and models, in harmony! 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 93b0bd4 and 9ccbc6e.

📒 Files selected for processing (1)
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (windows-latest)
  • GitHub Check: Build Desktop Apps (macos-latest)
  • GitHub Check: Build Desktop Apps (macos-13)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)


coderabbitai bot (Contributor) left a comment

Actionable comments posted: 5

🔭 Outside diff range comments (1)
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py (1)

47-59: Use a different provider in test config to avoid OpenRouter initialization

The test config uses model_provider_name="openrouter" which may trigger OpenRouter initialization and API key checks even when mocked.

Consider using a provider that doesn't require API keys for testing:

 @pytest.fixture
 def config():
     return LiteLlmConfig(
         base_url="https://api.test.com",
         run_config_properties=RunConfigProperties(
             model_name="test-model",
-            model_provider_name="openrouter",
+            model_provider_name="openai",
             prompt_id="simple_prompt_builder",
             structured_output_mode="json_schema",
         ),
         default_headers={"X-Test": "test"},
         additional_body_options={"api_key": "test_key"},
     )

This should prevent the OpenRouter initialization errors in the tests.

🧹 Nitpick comments (2)
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py (1)

80-86: Remove duplicate assertion

There's a duplicate assertion checking for "64" in the output.

Remove the duplicate assertion:

         assert "64" in run.output.output
         assert run.id is not None
         assert (
             run.input
             == "You should answer the following question: four plus six times 10"
         )
-        assert "64" in run.output.output
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (1)

32-33: Document the rationale for MAX_CALLS_PER_TURN limit

The constant MAX_CALLS_PER_TURN = 10 prevents infinite loops, but the specific value seems arbitrary.

Add a comment explaining why 10 was chosen:

-MAX_CALLS_PER_TURN = 10
+# Maximum number of tool calls allowed per turn to prevent infinite loops
+# and excessive token usage. Based on typical tool interaction patterns.
+MAX_CALLS_PER_TURN = 10
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 599ba54 and e8a0152.

📒 Files selected for processing (4)
  • libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (2 hunks)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (7 hunks)
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py (2 hunks)
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/project.mdc)

**/*.py: Always assume pydantic 2 (not pydantic 1)
The project supports Python 3.10 and above

Files:

  • libs/core/kiln_ai/adapters/model_adapters/base_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
**/test_*.py

📄 CodeRabbit Inference Engine (.cursor/rules/project.mdc)

**/test_*.py: Always use pytest for tests in Python code
Test brevity is important. Use approaches for re-use and brevity including using fixtures for repeated code, and using pytest parameterize for similar tests

Files:

  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py
🧬 Code Graph Analysis (2)
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (1)
libs/core/kiln_ai/tools/base_tool.py (1)
  • KilnTool (40-82)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (6)
libs/core/kiln_ai/datamodel/json_schema.py (1)
  • validate_schema_with_value_error (50-70)
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py (3)
  • model_provider (79-94)
  • build_chat_formatter (205-249)
  • available_tools (323-325)
libs/core/kiln_ai/tools/base_tool.py (1)
  • KilnTool (40-82)
libs/core/kiln_ai/utils/exhaustive_error.py (1)
  • raise_exhaustive_enum_error (5-6)
libs/core/kiln_ai/adapters/chat/chat_formatter.py (7)
  • messages (45-46)
  • next_turn (56-58)
  • next_turn (62-79)
  • next_turn (95-125)
  • next_turn (141-174)
  • next_turn (178-195)
  • intermediate_outputs (51-53)
libs/core/kiln_ai/adapters/run_output.py (1)
  • RunOutput (8-11)
🪛 GitHub Actions: Debug Detector
libs/core/kiln_ai/adapters/model_adapters/base_adapter.py

[error] 324-324: Developer content found: TODO comment present in code.

libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py

[error] 531-531: Developer content found: TODO comment present in code.


[error] 643-643: Developer content found: TODO comment present in code.

🪛 GitHub Actions: Build and Test
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py

[error] 728-728: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 767-767: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 821-821: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 908-908: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 988-988: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 1040-1040: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys

🪛 GitHub Actions: Coverage Report
libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py

[error] 728-728: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 767-767: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 821-821: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 908-908: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 988-988: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys


[error] 1040-1040: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (macos-latest)
  • GitHub Check: Build Desktop Apps (windows-latest)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)
  • GitHub Check: Build Desktop Apps (macos-13)

Comment on lines 129 to 138
            # If no tool calls, return the content as final output
            if content:
                return content, messages

            # If we get here with no content and no tool calls, break
            break

        raise RuntimeError(
            f"Too many tool calls ({tool_calls_count}). Stopping iteration to avoid using too many tokens."
        )

⚠️ Potential issue

Handle edge case when model returns neither content nor tool calls

The loop breaks silently when there's no content and no tool calls, then raises a generic error about too many tool calls which is misleading.

Add proper error handling for this edge case:

             # If no tool calls, return the content as final output
             if content:
                 return content, messages
 
             # If we get here with no content and no tool calls, break
-            break
+            raise RuntimeError(
+                "Model returned neither content nor tool calls. This may indicate an issue with the model response."
+            )
 
-        raise RuntimeError(
-            f"Too many tool calls ({tool_calls_count}). Stopping iteration to avoid using too many tokens."
-        )
+        # Only reached if we exit the while loop due to tool_calls_count
+        raise RuntimeError(
+            f"Too many tool calls ({tool_calls_count}). Stopping iteration to avoid using too many tokens."
+        )
🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py around lines 129
to 138, the code breaks the loop silently when there is neither content nor tool
calls, then raises a misleading RuntimeError about too many tool calls. Add
explicit error handling for this edge case by raising a clear and specific
exception or returning an appropriate error message indicating that the model
returned neither content nor tool calls, so the issue is accurately reported.

Comment on lines +531 to +536
            # TODO: maybe reconsider this. Model should be able to choose between a final answer or a tool call on any turn. But good models have json_schea, so do we need to support both? If we do, merge them, and consider auto vs forced when merging (only forced for final, auto for merged).
            # Check for a conflict between tools and response format using tools
            if has_tools and "tools" in response_format_options:
                raise ValueError(
                    "Function calling/tools can't be used as the JSON response format if you're also using tools. Please select a different structured output mode."
                )

⚠️ Potential issue

Remove TODO comment and clarify the design decision

The pipeline flagged this TODO comment. The comment raises a valid design question about supporting both structured output and tools simultaneously.

Either implement the functionality or document why the current approach was chosen:

-            # TODO: maybe reconsider this. Model should be able to choose between a final answer or a tool call on any turn. But good models have json_schea, so do we need to support both? If we do, merge them, and consider auto vs forced when merging (only forced for final, auto for merged).
             # Check for a conflict between tools and response format using tools
             if has_tools and "tools" in response_format_options:
                 raise ValueError(
                     "Function calling/tools can't be used as the JSON response format if you're also using tools. Please select a different structured output mode."
                 )

Consider tracking this design question in an issue if it needs further discussion.

🧰 Tools
🪛 GitHub Actions: Debug Detector

[error] 531-531: Developer content found: TODO comment present in code.

🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py around lines 531
to 536, remove the TODO comment about reconsidering the model's choice between
final answer or tool call and the question about supporting both JSON schema and
tools. Instead, clarify the design decision by either implementing support for
both or explicitly documenting why the current restriction exists. If the design
question requires further discussion, create and reference an issue for tracking
it.

Comment on lines +643 to +645
                    # TODO: check if this really needs to be a string. If it does, run should be forced to return a string and we should avoid the str()
                    "content": str(result),
                }

🛠️ Refactor suggestion

⚠️ Potential issue

Clarify tool return type requirements

The TODO comment indicates uncertainty about whether tool results must be strings.

Define the return type contract clearly:

                     "tool_call_id": tool_call.id,
                     "name": tool_name,
-                    # TODO: check if this really needs to be a string. If it does, run should be forced to return a string and we should avoid the str()
-                    "content": str(result),
+                    "content": str(result),  # Tool results are always stringified for consistency
                 }

Consider updating the KilnTool.run() method signature to specify the return type should be str or Any that can be stringified, and document this requirement in the base class.
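A short, hedged sketch of the contract the reviewer is proposing. The class name below is a stand-in, not the actual KilnTool in kiln_ai/tools/base_tool.py, and whether run() is async is an assumption.

# Illustrative contract only: document that tool results come back as strings.
from abc import ABC, abstractmethod
from typing import Any


class KilnToolContract(ABC):
    @abstractmethod
    async def run(self, **kwargs: Any) -> str:
        """Execute the tool and return its result.

        Suggested contract: the result must already be a string so the adapter
        can place it directly into the tool message content without calling str().
        """
        ...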

🧰 Tools
🪛 GitHub Actions: Debug Detector

[error] 643-643: Developer content found: TODO comment present in code.

🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py around lines 643
to 645, clarify the expected return type of the KilnTool.run() method by
updating its signature to specify it should return a string or a type that can
be converted to a string. Add documentation in the base class to explicitly
state this requirement, and remove the need for calling str() on the result in
this code by enforcing the contract at the method level.

Comment on lines +709 to +740
async def test_handle_tool_calls_loop_no_tool_calls(config, mock_task):
    """Test _handle_tool_calls_loop with no tool calls in response"""
    adapter = LiteLlmAdapter(config=config, kiln_task=mock_task)
    initial_messages = [{"role": "user", "content": "Hello"}]

    # Mock response with no tool calls
    mock_response = Mock()
    mock_choice = Mock()
    mock_message = Mock()
    mock_message.content = "Hello response"
    mock_message.tool_calls = None
    mock_choice.message = mock_message
    mock_response.choices = [mock_choice]

    with patch.object(
        adapter,
        "acompletion_checking_response",
        return_value=(mock_response, mock_choice),
    ):
        content, messages = await adapter._handle_tool_calls_loop(
            Mock(), initial_messages, None
        )

    assert content == "Hello response"
    assert len(messages) == 2
    assert messages[0] == {"role": "user", "content": "Hello"}
    assert messages[1] == {
        "role": "assistant",
        "content": "Hello response",
        "tool_calls": None,
    }


💡 Verification agent

🧩 Analysis chain

Fix OpenRouter API key errors in tests

Multiple tests are failing with "ValueError: Attempted to use OpenRouter without an API key set". Even though acompletion_checking_response is mocked, the tests are still trying to initialize an OpenRouter connection.

The issue might be in the test setup or the way the adapter is initialized. Consider:

  1. Mock the litellm module earlier in the test setup
  2. Ensure the config doesn't trigger OpenRouter initialization
  3. Mock any litellm initialization that happens during adapter creation

🏁 Script executed:

#!/bin/bash
# Check how OpenRouter is being initialized in the codebase
rg -A 5 "OpenRouter|openrouter" --type py

Length of output: 82194


Mock Litellm Client in tests to avoid OpenRouter API-key errors

Apply one of the following fixes in libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py:

• Add an autouse fixture to stub out Litellm’s client before any adapter instantiates:

+ import pytest
+ import litellm
+
+ @pytest.fixture(autouse=True)
+ def patch_litellm_client(monkeypatch):
+     """Prevent real OpenRouter/OpenAI clients from initializing in unit tests."""
+     class DummyClient:
+         def __init__(*args, **kwargs): pass
+     monkeypatch.setattr(
+         "kiln_ai.adapters.model_adapters.litellm_adapter.litellm.Client",
+         DummyClient,
+     )

• Or inject a dummy API key into your existing config fixture so that OpenRouter never complains:

 @pytest.fixture
 def config():
     run_props = RunConfigProperties(
         model_name="test-model",
         model_provider_name=ModelProviderName.openrouter,
         prompt_id="simple_prompt_builder",
         structured_output_mode="json_schema",
     )
-    return LiteLlmConfig(run_config_properties=run_props)
+    return LiteLlmConfig(
+        run_config_properties=run_props,
+        default_headers={},
+        additional_body_options={"api_key": "test-openrouter-key"},
+    )

Either approach decouples your adapter’s initialization from real API keys and stops the "ValueError: Attempted to use OpenRouter without an API key set." failures in async tests.

Committable suggestion skipped: line range outside the PR's diff.

🧰 Tools
🪛 GitHub Actions: Build and Test

[error] 728-728: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys

🪛 GitHub Actions: Coverage Report

[error] 728-728: ValueError: Attempted to use OpenRouter without an API key set. Get your API key from https://openrouter.ai/settings/keys

🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter.py around
lines 709 to 740, the test fails due to missing OpenRouter API key when
LiteLlmAdapter initializes. Fix this by either adding an autouse pytest fixture
that mocks or stubs the Litellm client before any adapter instantiation, or
modify the existing config fixture to inject a dummy API key. This prevents the
adapter from attempting real API calls and avoids the ValueError about missing
API keys during async tests.

coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

♻️ Duplicate comments (3)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (3)

132-141: Handle edge case when model returns neither content nor tool calls

The loop breaks silently when there's no content and no tool calls, then raises a generic error about too many tool calls which is misleading.


534-539: Remove TODO comment and clarify the design decision

The pipeline flagged this TODO comment. The comment raises a valid design question about supporting both structured output and tools simultaneously.


646-648: Clarify tool return type requirements

The TODO comment indicates uncertainty about whether tool results must be strings.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e8a0152 and c4102c3.

📒 Files selected for processing (2)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (7 hunks)
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • libs/core/kiln_ai/adapters/model_adapters/test_litellm_adapter_paid.py
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/project.mdc)

**/*.py: Always assume pydantic 2 (not pydantic 1)
The project supports Python 3.10 and above

Files:

  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
🧠 Learnings (1)
📚 Learning: the `glm_z1_rumination_32b_0414` model was intentionally removed from the built_in_models list due t...
Learnt from: leonardmq
PR: Kiln-AI/Kiln#418
File: libs/core/kiln_ai/adapters/ml_model_list.py:0-0
Timestamp: 2025-07-16T09:37:39.816Z
Learning: The `glm_z1_rumination_32b_0414` model was intentionally removed from the built_in_models list due to output formatting issues: output was duplicated in both `output` and `reasoning` fields, and contained random internal JSON in the output. This model should not be re-added without addressing these formatting problems.

Applied to files:

  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
🪛 GitHub Actions: Debug Detector
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py

[error] 109-109: Developer TODO comment found: 'TODO: ensure structure'


[error] 534-534: Developer TODO comment found: 'TODO: maybe reconsider this. Model should be able to choose between a final answer or a tool call on any turn. But good models have json_schea, so do we need to support both? If we do, merge them, and consider auto vs forced when merging (only forced for final, auto for merged).'


[error] 646-646: Developer TODO comment found: 'TODO: check if this really needs to be a string. If it does, run should be forced to return a string and we should avoid the str()'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Desktop Apps (windows-latest)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)
  • GitHub Check: Build Desktop Apps (macos-13)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (macos-latest)
🔇 Additional comments (5)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (5)

1-34: Constants and imports look good

The new imports and constants are appropriate for implementing tool calling functionality. The limits for MAX_CALLS_PER_TURN and MAX_TOOL_CALLS_PER_TURN provide good safeguards against runaway iterations.


50-67: Well-designed state management for tools and chat history

Good use of caching for available tools and proper encapsulation of chat history with a read-only property.
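As a rough illustration of the pattern being praised (not the adapter's actual code), caching plus a read-only property might look like the sketch below; all names and types are assumptions.

# Illustrative sketch: cache tools once, expose history without allowing mutation.
from typing import Any


class AdapterStateSketch:
    def __init__(self) -> None:
        self._cached_tools: list[Any] | None = None
        self._chat_history: list[dict[str, Any]] = []

    def available_tools(self) -> list[Any]:
        if self._cached_tools is None:
            self._cached_tools = self._load_tools()
        return self._cached_tools

    @property
    def chat_history(self) -> list[dict[str, Any]]:
        # Return a copy so callers cannot mutate the adapter's history.
        return list(self._chat_history)

    def _load_tools(self) -> list[Any]:
        return []  # placeholder; the real adapter builds KilnTool instances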


143-273: Well-structured implementation of tool calling in the run method

The integration of tool calling with proper chat history management is well done. The handling of different scenarios (tool calls vs regular completion) is clear, and the limitation regarding logprobs from tool calls is properly documented.


275-288: Good helper method for response validation

Excellent addition of a helper method that ensures type safety and provides clear error messages.


592-656: Excellent implementation of tool processing with proper validation

The tool processing logic is well-structured with:

  • Proper argument parsing and JSON schema validation
  • Clear error messages for debugging
  • Good separation between task_response and normal tools
  • Appropriate error handling for edge cases

The validation against tool schemas before execution is particularly good for catching errors early.
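A hedged sketch of those processing steps, for orientation only: parse the arguments as JSON, validate them against the tool's schema, run the tool, and wrap the result as a tool message. The helper names (validate_arguments, run) and the synchronous style are assumptions; the real adapter awaits its tools and handles task_response separately.

# Illustrative processing of a single tool call; not the adapter's actual code.
import json
from typing import Any


def process_tool_call(tool_call: Any, tools_by_name: dict[str, Any]) -> dict[str, Any]:
    tool = tools_by_name.get(tool_call.function.name)
    if tool is None:
        raise ValueError(f"Model requested unknown tool: {tool_call.function.name}")
    try:
        args = json.loads(tool_call.function.arguments)
    except json.JSONDecodeError as e:
        raise ValueError(f"Tool arguments were not valid JSON: {e}") from e
    tool.validate_arguments(args)  # hypothetical schema check; fails fast with a clear error
    result = tool.run(**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call.id,
        "name": tool_call.function.name,
        "content": str(result),
    }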

Comment on lines +109 to +111
                # TODO: ensure structure
                "tool_calls": tool_calls,
            }

⚠️ Potential issue

Remove TODO comment and ensure message structure

The TODO comment should be addressed. Either implement validation for the assistant message structure or document why it's not needed.

Apply this diff to ensure proper structure:

            # Add the assistant's response to messages
            assistant_message = {
                "role": "assistant",
                "content": content,
-                # TODO: ensure structure
-                "tool_calls": tool_calls,
+                "tool_calls": tool_calls if tool_calls else [],
            }
🧰 Tools
🪛 GitHub Actions: Debug Detector

[error] 109-109: Developer TODO comment found: 'TODO: ensure structure'

🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py around lines 109
to 111, remove the TODO comment about ensuring the structure of the assistant
message. Instead, implement validation logic to confirm the message structure
meets expected requirements before assigning tool_calls, or add a clear comment
explaining why such validation is unnecessary. This will resolve the TODO and
clarify the code's intent.

scosman added 2 commits August 3, 2025 20:58
… More TODOs for unhandled cases.

Now running on all paid test cases.

A bunch of the new tests don't work yet.
coderabbitai bot (Contributor) left a comment

Actionable comments posted: 3

♻️ Duplicate comments (4)
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (4)

103-109: Remove TODO and ensure proper message structure

The TODO comment should be addressed. The assistant message structure should be validated or documented.

Apply this diff to ensure proper structure:

            # Add the assistant's response to messages
            assistant_message = {
                "role": "assistant",
                "content": content,
-                # TODO: ensure structure
-                "tool_calls": tool_calls,
+                "tool_calls": tool_calls if tool_calls else [],
            }

129-138: Handle edge case when model returns neither content nor tool calls

The loop breaks silently when there's no content and no tool calls, potentially leading to a misleading error message.


504-509: Remove TODO and document design decision

The TODO comment raises a valid design question about supporting both structured output and tools.


616-618: Clarify tool return type requirements

The TODO indicates uncertainty about tool result type requirements.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c4102c3 and 93b0bd4.

📒 Files selected for processing (2)
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (6 hunks)
  • libs/core/kiln_ai/adapters/run_output.py (2 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
**/*.py

📄 CodeRabbit Inference Engine (.cursor/rules/project.mdc)

**/*.py: Always assume pydantic 2 (not pydantic 1)
The project supports Python 3.10 and above

Files:

  • libs/core/kiln_ai/adapters/run_output.py
  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
🧠 Learnings (1)
📚 Learning: the `glm_z1_rumination_32b_0414` model was intentionally removed from the built_in_models list due t...
Learnt from: leonardmq
PR: Kiln-AI/Kiln#418
File: libs/core/kiln_ai/adapters/ml_model_list.py:0-0
Timestamp: 2025-07-16T09:37:39.816Z
Learning: The `glm_z1_rumination_32b_0414` model was intentionally removed from the built_in_models list due to output formatting issues: output was duplicated in both `output` and `reasoning` fields, and contained random internal JSON in the output. This model should not be re-added without addressing these formatting problems.

Applied to files:

  • libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py
🪛 GitHub Actions: Debug Detector
libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py

[error] 106-106: Developer TODO comment found: 'TODO: ensure structure'


[error] 180-180: Developer TODO comment found: 'TODO check format'


[error] 189-189: Developer TODO comment found: 'TODO'


[error] 236-236: Developer TODO comment found: 'TODO: usage is likely incorrect. We're only getting the usage from the final response, not each call'


[error] 504-504: Developer TODO comment found: 'TODO: maybe reconsider this. Model should be able to choose between a final answer or a tool call on any turn. But good models have json_schea, so do we need to support both? If we do, merge them, and consider auto vs forced when merging (only forced for final, auto for merged).'


[error] 616-616: Developer TODO comment found: 'TODO: check if this really needs to be a string. If it does, run should be forced to return a string and we should avoid the str()'

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04)
  • GitHub Check: Build Desktop Apps (ubuntu-22.04-arm)
  • GitHub Check: Build Desktop Apps (macos-13)
  • GitHub Check: Build Desktop Apps (macos-latest)
  • GitHub Check: Build Desktop Apps (windows-latest)
🔇 Additional comments (4)
libs/core/kiln_ai/adapters/run_output.py (1)

1-12: LGTM!

The addition of the trace field to store conversation history is well-implemented. The type annotation correctly allows for a list of dictionaries with flexible value types, which is appropriate for capturing various message formats and tool call results.
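For context, a minimal sketch of what the described change might look like; only the field name (trace) and its list-of-dicts typing come from this review, and the rest of the dataclass is reduced to a placeholder.

# Illustrative sketch of the extended RunOutput; not the actual definition.
from dataclasses import dataclass
from typing import Any


@dataclass
class RunOutputSketch:
    output: str  # placeholder for the existing fields
    trace: list[dict[str, Any]] | None = None  # full conversation history, incl. tool calls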

libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py (3)

1-34: LGTM!

The new imports and constants are well-chosen for implementing tool calling functionality. The iteration limits (10 turns, 30 tool calls) provide reasonable safeguards against runaway loops.


245-258: Well-implemented response validation

Good defensive programming with comprehensive validation of the response structure before accessing nested fields.
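A rough sketch of that defensive pattern, assuming a litellm-style response object; the real helper is acompletion_checking_response and may differ in signature and error types.

# Illustrative response check: fail loudly before touching nested fields.
from typing import Any


def check_response(response: Any) -> Any:
    """Return the first choice, raising a clear error if the structure is off."""
    if response is None or not getattr(response, "choices", None):
        raise RuntimeError("Model response contained no choices")
    choice = response.choices[0]
    if getattr(choice, "message", None) is None:
        raise RuntimeError("Model response choice contained no message")
    return choice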


562-626: Robust tool call processing implementation

The implementation includes excellent validation:

  • Tool availability checking
  • JSON parsing with error handling
  • Schema validation against tool definitions
  • Clear error messages for debugging

The separation of concerns between task_response and regular tools is well-handled.

Comment on lines +179 to +186
            # Add the assistant message to chat history
            # TODO check format
            messages.append(
                {
                    "role": "assistant",
                    "content": prior_output,
                }
            )

⚠️ Potential issue

Remove TODO and validate message format

The TODO comment indicates uncertainty about the message format. This should be validated or documented.

Apply this diff to add validation:

            # Add the assistant message to chat history
-            # TODO check format
            messages.append(
                {
                    "role": "assistant",
                    "content": prior_output,
                }
            )

Consider adding a helper method to validate message structure if needed.
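If such a helper were added, it might look roughly like the sketch below; this is an assumption about shape, not code from the PR.

# Illustrative validator for an OpenAI-style chat message dict.
from typing import Any


def validate_chat_message(message: dict[str, Any]) -> dict[str, Any]:
    if message.get("role") not in {"system", "user", "assistant", "tool"}:
        raise ValueError(f"Unexpected message role: {message.get('role')!r}")
    if "content" not in message and "tool_calls" not in message:
        raise ValueError("Message must have content or tool_calls")
    return message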

🧰 Tools
🪛 GitHub Actions: Debug Detector

[error] 180-180: Developer TODO comment found: 'TODO check format'

🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py around lines 179
to 186, remove the TODO comment about checking the message format and add
validation to ensure the message dictionary has the correct structure with
required keys "role" and "content" as strings. Implement or call a helper method
to validate the message format before appending it to the messages list, and
document the expected message structure clearly.

Comment on lines +188 to 190
        # Get the final response from the chat formatter
        # TODO
        intermediate_outputs = chat_formatter.intermediate_outputs()

⚠️ Potential issue

Remove incomplete TODO comment

Line 189 contains an incomplete TODO comment that provides no context.

Apply this diff:

        # Get the final response from the chat formatter
-        # TODO
        intermediate_outputs = chat_formatter.intermediate_outputs()
🧰 Tools
🪛 GitHub Actions: Debug Detector

[error] 189-189: Developer TODO comment found: 'TODO'

🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py at lines 188 to
190, remove the incomplete TODO comment on line 189 as it provides no useful
context or information.

Comment on lines +236 to +242
        # TODO: usage is likely incorrect. We're only getting the usage from the final response, not each call
        usage = (
            self.usage_from_response(last_response)
            if last_response is not None
            else None
        )


⚠️ Potential issue

Address incorrect usage calculation

The TODO correctly identifies that usage is calculated only from the final response, which will underreport actual token usage across multiple tool calls.

This is a significant issue that could lead to incorrect cost tracking and rate limiting. Consider accumulating usage across all model calls or documenting this limitation clearly.

Consider implementing a usage accumulator that tracks tokens across all calls within the conversation, similar to how messages are accumulated.
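One hedged sketch of that accumulator idea, assuming the common OpenAI/litellm usage fields (prompt_tokens, completion_tokens, total_tokens); the adapter's actual response objects may differ.

# Illustrative accumulator: sum usage across every model call in the turn.
from dataclasses import dataclass
from typing import Any, Iterable


@dataclass
class AccumulatedUsage:
    prompt_tokens: int = 0
    completion_tokens: int = 0
    total_tokens: int = 0

    def add(self, usage: Any) -> None:
        self.prompt_tokens += getattr(usage, "prompt_tokens", 0) or 0
        self.completion_tokens += getattr(usage, "completion_tokens", 0) or 0
        self.total_tokens += getattr(usage, "total_tokens", 0) or 0


def accumulate_usage(responses: Iterable[Any]) -> AccumulatedUsage:
    total = AccumulatedUsage()
    for response in responses:
        usage = getattr(response, "usage", None)
        if usage is not None:
            total.add(usage)
    return total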

🧰 Tools
🪛 GitHub Actions: Debug Detector

[error] 236-236: Developer TODO comment found: 'TODO: usage is likely incorrect. We're only getting the usage from the final response, not each call'

🤖 Prompt for AI Agents
In libs/core/kiln_ai/adapters/model_adapters/litellm_adapter.py around lines 236
to 242, the usage calculation only accounts for the final response, missing
token usage from earlier calls. To fix this, implement an accumulator that sums
usage from all individual model call responses throughout the conversation,
similar to how messages are aggregated. Replace the single usage_from_response
call with logic that iterates over all responses, accumulates their usage data,
and returns the total usage to ensure accurate cost tracking.
