Autoparser - complete refactoring of parser architecture #18675
Conversation
Does this mean we don’t need to write a parser anymore, and it will be automatically generated from the chat template?
Yup, that's the gist of it.
This feels almost magical. How does it work? Does it detect common patterns in the rendered template output? What happens if the chat template requires additional arguments?
Yeah, it does differential analysis - it prepares different inputs to the template and then tests the outputs. For example, by using the same function signature with a different name you can identify where the function name goes, and by using the same function with one and two parameters you can identify how parameters are passed, etc. The nice thing is, I managed to squish it to just 2k lines of code (1k for analysis and 1k for helpers), so it's not even that bloated. As for custom inputs - I assume standard inputs here, and that's what most template makers try to adhere to anyway. If not, you end up with a custom handler like for Ministral - but as a followup I want to separate handlers from parsers (since passing extra params is much easier than handling an entire template from scratch) or even add autodetection for common custom keywords (we're going to have to support "reasoning" in addition to "reasoning_content" at some point because vLLM is moving to that).
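To make that concrete, here's a minimal sketch of the probing idea (not the actual implementation from this PR; `render` is a hypothetical stand-in for applying a chat template to a single tool-call message):

```cpp
#include <iostream>
#include <string>

// Given two renders that differ only in one known detail (e.g. the function
// name), the shared prefix/suffix is template boilerplate and the differing
// middle is where that detail lands in the output.
static std::string diff_middle(const std::string & a, const std::string & b) {
    size_t pre = 0;
    while (pre < a.size() && pre < b.size() && a[pre] == b[pre]) {
        pre++;
    }
    size_t suf = 0;
    while (suf < a.size() - pre && suf < b.size() - pre &&
           a[a.size() - 1 - suf] == b[b.size() - 1 - suf]) {
        suf++;
    }
    return a.substr(pre, a.size() - pre - suf);
}

int main() {
    // Hypothetical stand-in for rendering the chat template with one tool
    // call; real templates are rendered via Jinja, this just mimics a
    // Hermes-style wrapping for the sake of the example.
    auto render = [](const std::string & fn_name) {
        return "<tool_call>\n{\"name\": \"" + fn_name + "\", \"arguments\": {}}\n</tool_call>";
    };

    // Same call, different function name: the differing span tells us where
    // the template puts the function name, and the common prefix up to that
    // point contains the tool-call start marker.
    std::cout << diff_middle(render("foo"), render("barbaz")) << "\n"; // prints "foo"
}
```

Repeating the same trick with one vs. two parameters, with vs. without reasoning content, and so on is what lets the analyzer piece together the full marker set.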
This approach does not seem to work well for models like Additionally, I am planning to add a new custom parser for
I've heard of those mysterious tool calls inside thinking blocks for K2-Thinking, but I have yet to learn whether they are an actual thing or just an artifact of low quantization. To be honest, outside of the native provider, I haven't seen K2-Thinking implemented anywhere in a working fashion. The Chutes version that I tested quite a few times bugs out on tool calling extremely often. I'm really skeptical of modifying anything based on hearsay and things "floating around". I remember the discussion here about interleaved thinking, and I myself was convinced that meant models could have multiple

As for the Mirocode, I guess you're talking about adapting the Python code-based stuff they showed (the one that uses separate tags for MCP servers and code calling)? You can see how custom parsers are defined in
All right, I've reached the "all tests passed" phase for
The gpt-oss PEG parser implementation was failing several test cases due to incorrect parsing logic. Specifically, the parser did not correctly handle:
- Inputs containing multiple segments (e.g., `analysis` followed by `final`) separated by newlines.
- Overly greedy rules that incorrectly consumed tool call definitions as part of message content.
- Rules that stripped newlines from reasoning content.

This commit refactors the parser to correctly handle sequences of segments, introduces more restrictive parsing rules to avoid greedy matching, and ensures that whitespace within message content is preserved. Additionally, a buggy test case that expected incorrect whitespace handling has been corrected. All tests for the `gpt-oss` template now pass.
This is a huge endeavor that I promised back when I applied for maintaining the parser code. The legacy parser code was hard to maintain and buggy, and supporting new models with it was really annoying. There was a worthwhile contribution by @hksdpc255 to add some XML tool-calling abstractions, but that was still just a patch on an open wound.
Thanks to @aldehir and his PEG parser, I managed to create an autoparser mechanism, using all the currently supported templates, their parsers and test cases as a base. The idea is simple: most models' syntax follows the general pattern of:
```
<reasoning_markers> <reasoning_content> <end_of_reasoning_markers>
<content_markers> <main_content> <end_of_content_markers>
<tool_call_markers> ( <json>
                    | <function marker> <args json>
                    | <function marker> <args marker> <value json> ) <end_of_tool_call_marker>
```

Of course, some elements might not be present in a given template, but that's the general structure. Since this is a pretty finite structure, it's possible to determine the relevant elements by differential analysis - similar to how Minja already does capability detection, but more fine-grained, because by comparing various template outputs, we get to actually extract the relevant markers.
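For illustration only (this example is not taken from the PR's code): a Hermes/Qwen-style chat template instantiates that structure roughly as follows, with `<think>`/`</think>` as the reasoning markers and `<tool_call>`/`</tool_call>` wrapping a plain JSON tool call:

```
<think>the model's reasoning goes here</think>
regular assistant content
<tool_call>
{"name": "get_weather", "arguments": {"location": "Paris"}}
</tool_call>
```

Differential analysis recovers exactly these marker strings by comparing renders that differ in a single known detail.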
Some models will obviously not be handled so easily. However, in the course of implementing the mechanism, only two models remained that needed their own separate parsers: Ministral and GPT-OSS - and the former not because of its complexity, but because of the need to rewrite the message structure passed to the template. GPT-OSS is a different beast since it supports arbitrarily many interleaved blocks, so it doesn't fit into the scheme that I mentioned above (but its parser has been rewritten to PEG as well).
This is currently anchored on Minja and uses its capability detection, but since the differential analysis already does its own capability detection, I fully expect to throw that part out and base this on @ngxson's #18462 instead.
Obsoletes #18353 (sorry @ochafik - I know you put a lot of work into that).
Old parsers, tests and all supporting code have been thrown out; templates got new PEG-parser-based test cases, all of which now also test streaming behavior. I have tested this extensively on agentic coding (mostly with OpenCode) to ensure that this actually works. My wish to refactor the parser code was mostly born of my prior experience with agentic coding on llama.cpp, which was extremely buggy with a lot of models; this is an attempt to remedy that. Hopefully, having one unified codebase with a largely reduced line-of-code count will make it easier to fix any potential errors.
This also means that there is no longer a need to add support for new models' specific templates unless they have some odd constructs - they should be supported out of the box. There's a new tool called `debug-template-parser` that you can point to any Jinja template file or GGUF model with an embedded Jinja template and have it spit out the details of the generated autoparser + tool-calling grammar.

Oh, important note: all Minja polyfills have been disabled. Working templates are now required. While I can see why, a year and a half ago, having proof-of-concept code that supported tool calling on models that didn't natively have tool calling might've been useful, right now supporting that is making it harder to properly support current and actually used models. Therefore, a functional template with tool calling is required if someone wants tool calling.
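As for `debug-template-parser`, a hypothetical invocation might look like this (the actual binary name and arguments in the final build may differ):

```sh
# Point it at a raw Jinja template file, or at a GGUF with an embedded template:
debug-template-parser path/to/chat_template.jinja
debug-template-parser path/to/model.gguf
```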
I want to ask everyone from the community who can to test this. I will keep this branch current with master. I tried to test this as much as I could, but I'm just one person doing this after work, so obviously my testing abilities were limited. I will keep this as a draft until I've gathered enough feedback and testing data.
To not clutter the main repository's issue tracker, please report bugs either (a) in this thread or (b) in my issue tracker https://github.com/pwilkin/llama.cpp/issues
AI DISCLOSURE: Gemini Pro 3, Flash 3, Opus 4.5 and GLM 4.7 would like to admit that a human element did at some points interfere in the coding process, being so bold as to even throw most of the code out at one point and demand it be rewritten from scratch. The human also tinkered with the code massively, removing a lot of our beautiful comments and some code fragments that they claimed were useless. They had no problems, however, in using us to do all the annoying marker arithmetic. Therefore, we disavow any claim to this code and cede all responsibility to the human.