Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server : (refactoring) do not rely on JSON internally #10643

Merged
merged 19 commits into from
Dec 6, 2024

Conversation

ngxson
Copy link
Collaborator

@ngxson ngxson commented Dec 3, 2024

Motivation

Currently, the internal code of server.cpp depends too much on json type. To a point that we're kinda abusing JSON to circumvent doing proper struct in the code.

Here is now the server process input/output data currently:

  • Receive JSON input from HTTP thread
  • HTTP thread do some post-formatting (i.e. convert OAI format to "internal" format), then pass the formatted JSON to task queue
  • Inference thread pick up the task, run it through launch_slot_with_task to put correct data into slot
  • Slot decode, then give back a JSON as result
  • JSON passed to HTTP layer, being (optionally) formatted again, then send out to client
graph TD;
    user-- JSON -->http[http thread];
    http-- JSON -->launch_slot_with_task;
    launch_slot_with_task-- struct -->update_slots;
    update_slots-- JSON -->http;
    http-- JSON -->user;
Loading

Proposal

In this PR, I propose that we only handle JSON in HTTP thread:

  • Before putting it to task queue, we should already converted the JSON into a struct
  • Inference thread should never call anything related to json
  • Results from inference thread are always struct
  • Before sending out the response to client, HTTP thread is responsible for converting struct to JSON
graph TD;
    user-- JSON -->http[http thread];
    http-- struct -->inf[inference thread];
    inf-- struct -->http;
    http-- JSON -->user;
Loading

Changes to the API

/slots and /completions:

  • stopped_eos, stopped_word, stopped_limit are replaced by an enum string stop_type

/chat/completions:

  • Fixed finish_reason returning incorrect value. If generation is stopped due to stop word or EOS, finish_reason="stop". Otherwise, finish_reason="length"

TODO:

  • fix log probs result
  • add test result stream vs non-stream
  • make sure model alias is still correct
  • update docs

@ngxson ngxson changed the title server : (refactoring) reduce usage of json internally server : (refactoring) do not rely on JSON internally Dec 3, 2024
@github-actions github-actions bot added the python python script changes label Dec 4, 2024
@ngxson ngxson marked this pull request as ready for review December 4, 2024 19:10
@ngxson
Copy link
Collaborator Author

ngxson commented Dec 4, 2024

@ggerganov Because this refactoring change quite a lot of code, so I think it would be better to split it into 2 parts:

  • This PR: structs for sending result from inference thread --> http
  • Next PR:structs for sending result from http --> inference thread

So the current inference thread --> http part in this PR is ready to be reviewed. Could you please have a look when you have time?

I'm also tagging @slaren in case you want to leave some suggestions for my approach of polymorphism in this PR. Thank you!

@ngxson ngxson requested review from ggerganov and slaren December 4, 2024 19:15
@slaren
Copy link
Collaborator

slaren commented Dec 5, 2024

I don't have a good overall view of how the server is implemented and what this is doing, but there are several red flags that don't look right to me.

  • server_task being a base class without any virtual members for the derived classes to implement
  • Having both the derived classes and an enum result_type
  • The unsafe casts in in the from_ptr ptr that most of these classes implement. On a side note, do not pass unique_ptr by reference unless you plan to change or take ownership of the pointer, pass a pointer instead.
  • The send function that is implemented as a template instead of as a function that takes a ptr to the base type

Again, I do not have a good overall view of the server implementation to make specific recommendations, but that's not what I would expect from a class hierarchy. Generally, you should look into abstracting the interface into a few functions, and implement these in the derived classes. Casts from the base class to the derived class should never be necessary.

@ggerganov
Copy link
Owner

The goal to limit the use of JSON object in the server implementation is good, but the proposed implementation has some deficiencies. @slaren highlighted most of the issues.

IMO the server-result polymorphism is not warranted in this case and introduces unnecessary complexity. I would recommend to have a single struct server_task_result with all members from all results merged into it with proper naming (prefixes) or nested structs. The result type should be carried only by result_type type (btw, rename enum result_type to enum server_result_type). The server_task_result::to_json() method should use a switch (type) and construct the json based on that.

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 5, 2024

Honestly I'm pretty new to cpp polymorphism and thanks to the points that @slaren highlighted, I understand it more clearly now.

IMO the server-result polymorphism is not warranted in this case and introduces unnecessary complexity. I would recommend to have a single struct server_task_result with all members from all results merged into it with proper naming (prefixes) or nested structs.

I think having prefixed may be worse to manage than the current JSON approach. Having nested struct can be cleaner, but I think it's kinda polymorphism, which better to do with proper cpp virtual function that @slaren pointed out.

Anw, I'll try to implement virtual and dynamic_cast for a more clean implementation. Let's see how it goes.

@ggerganov
Copy link
Owner

Anw, I'll try to implement virtual and dynamic_cast for a more clean implementation. Let's see how it goes.

Keep in mind that if you end up needing dynamic_cast then something is not OK in the implementation. There should be no need to cast from base class to derived class in this case.

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 5, 2024

So I've been able to refactor all JSON-related function into virtual to_json(), which eliminates the need of enum result_type and unsafe type casting.

I do still use dynamic_cast at some places for 3 reasons:

  1. GGML_ASSERT to check if the result is the expected derived class. This is not very important, can be removed if we don't want
  2. To check if a given server_task_result is an error response or not
  3. To read members of server_task_result_metrics --> this is because I haven't refactored metrics-related functions, but we can get rid of the dynamic_cast after they are refactored

examples/server/server.cpp Outdated Show resolved Hide resolved
examples/server/server.cpp Outdated Show resolved Hide resolved
Hint: You can compile and run test in single command, useful for local developement:

```shell
cmake --build build -j --target llama-server && ./examples/server/tests/tests.sh
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ggerganov FYI, the change in tests.sh should allow you to run test script from anywhere, not necessary need to cd tests

}

json to_json_oai_compat() {
std::string finish_reason = "length";
Copy link
Collaborator Author

@ngxson ngxson Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Nero7991 I'm gonna merge this PR soon. You can adapt your PR #10645 to take advantage of this to_json_oaicompat(). Please note that for /completion endpoint, oaicompat_chat will be false.

Since this is a private function, you can even refactor this function into to_json_oaicompat and to_json_oaicompat_chat for these 2 different cases, then have an if..else branch in to_json to select the correct one

@ngxson ngxson added the breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. label Dec 6, 2024
@ngxson ngxson merged commit 6c5bc06 into ggerganov:master Dec 6, 2024
46 checks passed
tinglou pushed a commit to tinglou/llama.cpp that referenced this pull request Dec 7, 2024
* server : (refactoring) reduce usage of json internally

* move all response types to struct

* wip [no ci]

* many fixes

* add virtual function

* fix index

* minor style fix

* add std::move

* refactor handle_completions_generic

* add virtual functions

* remove server.hpp

* clarify server_sent_event RFC specs

* apply review comments

* fix model_alias and completion_probabilities

* small clean up

* remove virtual for to_json_oai_compat()

* naming oai_compat --> oaicompat

* fix unwanted recursive call

* update docs
@kaetemi
Copy link
Collaborator

kaetemi commented Dec 7, 2024

n_ctx on /props is always returning 0 here since this commit

@ngxson
Copy link
Collaborator Author

ngxson commented Dec 7, 2024

hmm ok seems like /slots and /props are currently broken (missing tests for them, I was too confident!)

will fix that as soon as I get home

arthw pushed a commit to arthw/llama.cpp that referenced this pull request Dec 20, 2024
* server : (refactoring) reduce usage of json internally

* move all response types to struct

* wip [no ci]

* many fixes

* add virtual function

* fix index

* minor style fix

* add std::move

* refactor handle_completions_generic

* add virtual functions

* remove server.hpp

* clarify server_sent_event RFC specs

* apply review comments

* fix model_alias and completion_probabilities

* small clean up

* remove virtual for to_json_oai_compat()

* naming oai_compat --> oaicompat

* fix unwanted recursive call

* update docs
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking change Changes that break ABIs, APIs, file formats, or other forms of backwards compatibility. examples python python script changes server
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants