server: add OpenAI compatible response format for /completions #10627


Closed

Conversation

@Nero7991 commented Dec 2, 2024

Adds support for an (almost) full OpenAI API response format for the completion-related endpoints, including when logprobs is specified.

The frontend is also modified to support this format as well as the existing format, so it remains functional.

The HELM benchmarks from CRFM support an OpenAI-compatible API server, so this makes it possible to test differently quantized models for degradation against that benchmark. I tested it on a QwQ Preview 32B GGUF Q4_K_M to evaluate the model against other frontier models.

This support is enabled at compile time with the OAI_FULL_COMPAT preprocessor definition, like so:

Using make:

make CXXFLAGS="-DOAI_FULL_COMPAT" llama-server
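
Using CMake (assumed equivalent, not tested here; CMAKE_CXX_FLAGS is simply the usual place to pass an extra define):

cmake -B build -DCMAKE_CXX_FLAGS="-DOAI_FULL_COMPAT"
cmake --build build --target llama-server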

When compiled correctly, running llama-server should print INFO: OpenAI full compatibility mode enabled, as seen in the following output snippet:

llama_new_context_with_model:  CUDA_Host compute buffer size =    18.01 MiB
llama_new_context_with_model: graph nodes  = 2246
llama_new_context_with_model: graph splits = 2
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
INFO: OpenAI full compatibility mode enabled
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system

Example:

curl --request POST \
  --url http://localhost:8080/completion \
  --header "Content-Type: application/json" \
  --data '{"prompt": "Building a website can be done in 10 simple steps:","max_tokens": 8, "logprobs": 2}'

Response:

{
  "id": "cmpl-0",
  "id_slot": 0,
  "index": 0,
  "tokens_predicted": 8,
  "tokens_evaluated": 13,
  "generation_settings": {
    "n_ctx": 4096,
    "n_predict": -1,
    "model": "models/qwq-32b-preview-q4_k_m.gguf",
    "seed": 4294967295,
    "seed_cur": 3068603297,
    "temperature": 0.800000011920929,
    "dynatemp_range": 0,
    "dynatemp_exponent": 1,
    "top_k": 40,
    "top_p": 0.949999988079071,
    "min_p": 0.05000000074505806,
    "xtc_probability": 0,
    "xtc_threshold": 0.10000000149011612,
    "typical_p": 1,
    "repeat_last_n": 64,
    "repeat_penalty": 1,
    "presence_penalty": 0,
    "frequency_penalty": 0,
    "dry_multiplier": 0,
    "dry_base": 1.75,
    "dry_allowed_length": 2,
    "dry_penalty_last_n": -1,
    "dry_sequence_breakers": [
      "\n",
      ":",
      "\"",
      "*"
    ],
    "mirostat": 0,
    "mirostat_tau": 5,
    "mirostat_eta": 0.10000000149011612,
    "penalize_nl": false,
    "stop": [],
    "max_tokens": 8,
    "n_keep": 0,
    "n_discard": 0,
    "ignore_eos": false,
    "stream": false,
    "n_probs": 2,
    "min_keep": 0,
    "grammar": "",
    "samplers": [
      "dry",
      "top_k",
      "typ_p",
      "top_p",
      "min_p",
      "xtc",
      "temperature"
    ],
    "speculative": false,
    "speculative.n_max": 16,
    "speculative.n_min": 5,
    "speculative.p_min": 0.8999999761581421,
    "timings_per_token": false
  },
  "has_new_line": false,
  "truncated": false,
  "stopped_eos": false,
  "stopped_word": false,
  "stopped_limit": true,
  "stopping_word": "",
  "tokens_cached": 20,
  "timings": {
    "prompt_n": 13,
    "prompt_ms": 59.178,
    "prompt_per_token_ms": 4.552153846153846,
    "prompt_per_second": 219.67623103180236,
    "predicted_n": 8,
    "predicted_ms": 186.64,
    "predicted_per_token_ms": 23.33,
    "predicted_per_second": 42.86326618088299
  },
  "object": "text_completion",
  "created": 1733161457,
  "model": "models/qwq-32b-preview-q4_k_m.gguf",
  "choices": [
    {
      "text": " choosing a domain name, registering it,",
      "index": 0,
      "logprobs": {
        "tokens": [
          " choosing",
          " a",
          " domain",
          " name",
          ",",
          " registering",
          " it",
          ","
        ],
        "token_logprobs": [
          -0.8389889001846313,
          -0.03926413506269455,
          -0.09884411841630936,
          -0.04721870273351669,
          0,
          -0.5166370272636414,
          -0.494428426027298,
          0
        ],
        "top_logprobs": [
          {
            " ": -0.8389889001846313,
            " \n": -2.3360304832458496
          },
          {
            " a": -0.03926413506269455,
            " the": -3.2570128440856934
          },
          {
            " domain": -0.09884411841630936,
            " theme": -3.2751708030700684
          },
          {
            " name": -0.04721870273351669,
            ",": -3.076481342315674
          },
          {
            ",": 0
          },
          {
            " selecting": -0.5166370272636414,
            " registering": -1.518433690071106
          },
          {
            " the": -0.494428426027298,
            " a": -1.4887559413909912
          },
          {
            ",": 0
          }
        ],
        "text_offset": [
          0,
          9,
          11,
          18,
          23,
          24,
          36,
          39
        ]
      },
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 13,
    "completion_tokens": 8,
    "total_tokens": 21
  }
}
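
A minimal client-side sketch for consuming this response, assuming the server above is running on localhost:8080 with OAI_FULL_COMPAT enabled and the requests package is installed; the field names follow the example response shown:

import requests

# Same request as the curl example above.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Building a website can be done in 10 simple steps:",
        "max_tokens": 8,
        "logprobs": 2,
    },
)
resp.raise_for_status()
data = resp.json()

# OpenAI-style fields: choices, usage, and per-token logprobs.
choice = data["choices"][0]
print("text:", choice["text"])
print("finish_reason:", choice["finish_reason"])
for tok, lp in zip(choice["logprobs"]["tokens"], choice["logprobs"]["token_logprobs"]):
    print(f"  {tok!r}: {lp:.4f}")
print("usage:", data["usage"])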

@ngxson (Collaborator) commented Dec 2, 2024

I think we can just replace the /completion response with the OAI-compat one, instead of hiding it behind -DOAI_FULL_COMPAT. I don't see anyone actually using the non-OAI-compat format, and OAI-compat is pretty much a standard today thanks to its portability. What do you think about this @ggerganov ?

Besides, @Nero7991 you should add a test in test_completion.py to make sure that this works correctly. You can start with from openai import OpenAI and ask Copilot to complete the rest.
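
A rough sketch of what such a test could look like, assuming the OpenAI Python client is pointed at a locally running llama-server; the real test_completion.py starts the server through its own fixtures, so the wiring below is only illustrative:

from openai import OpenAI

def test_completion_openai_compat_format():
    # Assumption: llama-server is already running on localhost:8080 and the
    # /v1/completions route maps to the same completion handler.
    client = OpenAI(base_url="http://localhost:8080/v1", api_key="sk-no-key-required")
    res = client.completions.create(
        model="local",  # llama-server serves whatever model it was started with
        prompt="Building a website can be done in 10 simple steps:",
        max_tokens=8,
        logprobs=2,
    )
    assert res.object == "text_completion"
    assert res.choices[0].finish_reason in ("length", "stop")
    assert res.choices[0].logprobs is not None
    assert len(res.choices[0].logprobs.tokens) == res.usage.completion_tokens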

@Nero7991 (Author) commented Dec 3, 2024

@ngxson So test_completion.py is currently testing for the existing response format. Unless we're switching completely to OpenAI compat, we'd need to figure out which response type the server was compiled for, right? Or should I create a new test_completion_oai_compat.py file?

@ngxson (Collaborator) left a comment

The idea is good overall, but I think this needs a bit more refactoring to make it more "clean".

Also, I don't think we should hide it behind a compiler flag; that's just not convenient for most users. Another idea is that we can add a specific field to each request, say "oai_compat", and set it to true by default. Users who don't want the OAI response would need to explicitly add "oai_compat": false.

I'll propose my approach via another PR

send_final_response(slot);
#else
send_final_response_oaicompat(slot);
@ngxson (Collaborator) commented Dec 3, 2024

send_final_* is called from the inference thread, but what we're doing here is only formatting the response, which should be done at the HTTP layer. I'd suggest moving your code into a new function format_final_*_oaicompat, much like what we have with format_final_response_oaicompat.

@Nero7991 (Author) replied

I noticed this yesterday. I found that there's a function called handle_completions_generic (there was a TODO suggesting merging it with handle_chat_completions), and I've done that. I'll create another PR with it, since that's probably the right way to do it.

I can probably also make oai_compat default to true later and send a draft PR.
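
Purely to illustrate that proposal (the oai_compat field is not implemented yet, so the name and default shown are hypothetical), a client opting out of the OAI-style response could look like:

import requests

# Hypothetical request: "oai_compat" is only a proposed field, not part of the
# current llama-server API.
resp = requests.post(
    "http://localhost:8080/completion",
    json={
        "prompt": "Building a website can be done in 10 simple steps:",
        "max_tokens": 8,
        "oai_compat": False,  # ask for the legacy (non-OpenAI) response format
    },
)
print(resp.json())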

@ngxson (Collaborator) commented Dec 3, 2024

Be aware that I'm doing a big refactoring in #10643 to reduce the usage of JSON internally. This can introduce quite a lot of conflicts with your code.
