server: add OpenAI compatible response format for /completions #10627
Conversation
I think we can just replace the existing format. Besides, @Nero7991 you should add a test in `test_completion.py`.
@ngxson So `test_completion.py` is currently testing for the existing response format. Unless we're switching completely to OpenAI compat, we'd need to figure out what response type the server is compiled for, right? Or should I create a new `test_completion_oai_compat.py` file?
The idea is good overall, but I think this needs a bit more refactoring to make it "cleaner".
Also, I don't think we should hide it behind a compiler flag; it's just not convenient for most users. Another idea is that we can add a specific field to each request, say `oai_compat`, and set it to `true` by default. Users who don't want the OAI response would then need to explicitly add `"oai_compat": false`.
I'll propose my approach via another PR.
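To make that idea concrete, here is a minimal sketch of how such a per-request flag could be read, assuming the request body has already been parsed into an nlohmann::json object; the field name `oai_compat` is only the one proposed above and does not exist in the server yet:

```cpp
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Sketch only: per-request opt-out of the OpenAI-compatible response format.
// Defaults to true, so clients that want the legacy llama.cpp format would
// have to send "oai_compat": false explicitly.
static bool wants_oai_compat(const json & request_body) {
    return request_body.value("oai_compat", true);
}
```

The HTTP handler could then pick between the legacy and the OpenAI-style formatter based on this flag, instead of deciding at compile time.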
examples/server/server.cpp (Outdated)
    send_final_response(slot);
#else
    send_final_response_oaicompat(slot);
`send_final_*` is called from the inference thread, but what we're doing here is only formatting the response, which should be done at the HTTP layer. I'd suggest moving your code to a new function `format_final_*_oaicompat`, much like what we have with `format_final_response_oaicompat`.
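As a rough illustration of that separation (a sketch only; the internal result keys `content` and `stop` are assumptions and may not match the actual fields in `server.cpp`), the formatting step could be a pure function over the result JSON that is called from the HTTP handler rather than from the slot code:

```cpp
#include <ctime>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Sketch: shape an internal completion result into an OpenAI-style
// /v1/completions response body, without touching the inference thread.
static json format_final_completion_oaicompat(const json & result,
                                              const std::string & model_name) {
    const std::time_t t = std::time(nullptr);
    return json {
        {"id",      "cmpl-" + std::to_string(static_cast<long long>(t))},
        {"object",  "text_completion"},
        {"created", static_cast<int64_t>(t)},
        {"model",   model_name},
        {"choices", json::array({ json {
            {"text",          result.value("content", std::string())},
            {"index",         0},
            {"logprobs",      nullptr},
            {"finish_reason", result.value("stop", false) ? "stop" : "length"},
        }})},
    };
}
```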
I noticed this yesterday. I found that there's a function called `handle_completions_generic` (there was a TODO suggesting merging it with `handle_chat_completions`). I've done that and will create another PR for it, since it's probably the right way to do this.
I can probably do the `oai_compat` field set to `true` by default later and send a draft PR.
Be aware that I'm doing a big refactoring in #10643 to reduce the usage of JSON internally. This can introduce quite a lot of conflicts with your code.
This adds support for the full (almost) OpenAI API response format for the completion-related endpoints, including when `logprobs` is specified.
The frontend is also modified to support this format as well as the existing format, so it remains functional.
The HELM benchmarks from CRFM support an OpenAI-compatible API server, so this makes it possible to test differently quantized models for degradation against that benchmark. I tested it on a QwQ Preview 32B GGUF Q4_K_M to evaluate the model against other frontier models.
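For the `logprobs` part mentioned above, here is a rough sketch of how the OpenAI-style logprobs object for one choice could be assembled (field names follow the public OpenAI completions schema; the internal per-token structure is an assumption for illustration, not the server's actual one):

```cpp
#include <string>
#include <vector>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// Hypothetical internal representation of one sampled token, for illustration only.
struct token_prob {
    std::string text;
    float       logprob;
};

// Sketch: build the OpenAI-style "logprobs" object for one completion choice.
static json format_logprobs_oaicompat(const std::vector<token_prob> & tokens) {
    json out = {
        {"tokens",         json::array()},
        {"token_logprobs", json::array()},
        {"top_logprobs",   json::array()},
        {"text_offset",    json::array()},
    };
    size_t offset = 0;
    for (const auto & t : tokens) {
        out["tokens"].push_back(t.text);
        out["token_logprobs"].push_back(t.logprob);
        out["top_logprobs"].push_back(json::object()); // per-token alternatives omitted in this sketch
        out["text_offset"].push_back(offset);
        offset += t.text.size();
    }
    return out;
}
```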
This support can be enabled by compiling with the `OAI_FULL_COMPAT` preprocessor definition, like so:

Using `make`: `make CXXFLAGS="-DOAI_FULL_COMPAT" llama-server`

When compiled correctly and after running `llama-server`, the output should include `INFO: OpenAI full compatibility mode enabled`, as seen in the following output snippet:

Example:

Response: