Make chat context summarization action-aware#5099
Conversation
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…it/agents into brian/summarize-with-tools
theomonnom
left a comment
There was a problem hiding this comment.
This is great, just a few nits.
Also do you think we necessarily need xml? This seems harder than just doing something very simple
| class MessageRenderable(BaseModel, ABC): | ||
| """Base class for chat context items that can be converted into a `ChatMessage`.""" | ||
|
|
||
| @abstractmethod | ||
| def to_message(self) -> ChatMessage: | ||
| pass | ||
|
|
There was a problem hiding this comment.
I don't think we need this abstraction? It feels unnecessary
There was a problem hiding this comment.
Also do you think we necessarily need xml? This seems harder than just doing something very simple
I don’t think XML is strictly necessary, but I’d still prefer it here. It makes the boundaries between sections much clearer for the model, and it also helps avoid cases where plain labels like User: / Assistant: could get confused with the actual message content.
For example, if a user message itself contains \n User: or \n Assistant:, a simple text format can break down pretty easily and user can easily do prompt injections on it. XML makes that structure more explicit and resilient.
It also seems to be a pretty standard prompting pattern, e.g. Claude Code does something similar. So to me it feels like a safer default, without really adding much downside.
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Summary
Reworks
ChatContext._summarize()so the LLM "sees" tool call inputs/outputs alongside conversation messages, producing summaries that capture knowledge gained from actions rather than only what was said in dialogue.Key changes:
FunctionCallandFunctionCallOutputitems into the summarization input as XML (<function_call>,<function_call_output>) so the summarizer LLM can distill action results into the summary.MessageRenderablebase class for items that can be converted to aChatMessagerepresentation (currentlyFunctionCallandFunctionCallOutput).self.itemsdirectly (not a filtered message list), preserving correct causal ordering of messages, tool calls, and handoffs in the tail.TaskGroupto stop excluding items before summarization (exclude_function_call=False,exclude_handoff=False, etc.) so that_summarizereceives the full context to kick off action aware summarization.Future Todos
Comparison: old vs new summarization
Full terminal output
Given a realistic support conversation with 3 tool calls (
lookup_order,check_warranty,create_rma), an agent handoff, and agenerate_return_labelcall, here's howkeep_last_turns=1summarization compares (view above full output):lookup_orderoutput)check_warrantydatacreate_rmacallStructural ordering (tail correctness)
Both produce 9 items after summarization, but the tail composition differs:
Old — tail ordering is broken:
create_rmaFunctionCallcreate_rmaFunctionCallOutputgenerate_return_labelFunctionCallgenerate_return_labelFunctionCallOutputThe user message (item 8) appears after the function calls it triggered (items 3–7). This happens because the old code computed the tail on a filtered list of only
ChatMessageitems, so function calls were not accounted for when choosing the split point. The result is a context where causality is inverted — the RMA was created before the user even provided their address.New — causal ordering preserved:
create_rmaFunctionCallcreate_rmaFunctionCallOutputgenerate_return_labelFunctionCallgenerate_return_labelFunctionCallOutputThe tail is computed on
self.itemsdirectly, so the user message that caused the RMA creation sits before the function calls it triggered. Any downstream LLM consuming this context sees events in the order they actually happened.Test plan
test_summarize_head_tail_split_basic— tail preserves last N turns, head is summarizedtest_summarize_head_tail_split_with_renderables— FunctionCall/FunctionCallOutput in tail region are preserved with correct orderingtest_summarize_keep_last_turns_zero— everything summarized, no tailtest_summarize_preserves_structural_items— system messages and AgentHandoff survivetest_summarize_skips_when_not_enough_messages— early return when budget covers all messagestest_chat_ctx.py: 18/18)