Replies: 6 comments 6 replies
-
I have been using the server for 3 months in a production environment with active users on Kubernetes, and I am pretty happy with the current server version in terms of features, code, stability, tests and performance. In particular, I think that if we continue in this direction we can claim that the server is production ready and that we have an efficient LLM serving solution. @ggerganov @ngxson What do you think?
-
I'm curious what your experience is with load balancing and cache redundancy across pods. Do you use sticky sessions in any way, or do you find cached tokens to be problematic with round-robin load balancing? I think the server's most underrated feature is caching tokens between completions, saving the time to re-tokenize and re-infer prompts that have already been read before.
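A minimal sketch of that caching from the client side, assuming a llama.cpp server on localhost:8080; `cache_prompt` is the /completion request field that asks the server to reuse the KV cache of a shared prompt prefix (exact behaviour can vary between versions):

```sh
# First request: the server tokenizes and evaluates the full prompt and
# keeps the slot's KV cache around because cache_prompt is set.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a helpful assistant.\nUser: Summarize the project README.",
  "n_predict": 128,
  "cache_prompt": true
}'

# Second request sharing the same prefix: the common part can be reused
# instead of being re-tokenized and re-evaluated -- but only if the request
# lands on the same instance (and slot), which is why sticky sessions or
# consistent hashing on a conversation id matter behind a round-robin balancer.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a helpful assistant.\nUser: Summarize the project README. Then list its build steps.",
  "n_predict": 128,
  "cache_prompt": true
}'
```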
-
I think what is still missing for a lot of use cases is something like a […]
-
Sorry for hijacking this thread. This is pretty awesome; to all who have been doing the work, thank you and salute! Would it be interesting to create a tutorial or guide on how to use llama.cpp in production? From reading the comments above I gather that there is some production readiness, but performance and scaling seem a bit vague or under-advertised to me. I am particularly interested in making this serve-ready, with the following features to compete with proprietary model providers (i.e. Claude and OpenAI):
[…] Etc. For example, what would be the recipe to load-balance via an NGINX reverse proxy deployed over a couple of CPU VMs? (Saying CPU here is intentional; I think this could tap into CPU-based "cheap" instances, as in the old days when the Linux cluster projects started: https://en.wikipedia.org/wiki/Beowulf_cluster.) If the community here could provide me with mentoring, insights and details, I can volunteer to create this guide or set of guides :) (I had just managed to run Phi-3-mini-4k-instruct-q4.gguf on my completely beaten MacBook Pro 2019 with an Intel Core i9, after working through a few misleading notes and cross-referencing info from Hugging Face; this is just too cool, as also manifested in Llamafile.)
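Not an authoritative recipe, just a sketch under stated assumptions: two CPU VMs (hypothetically `vm1` and `vm2`) each running a llama.cpp server instance, fronted by a stock NGINX reverse proxy; flag values are placeholders, and the server binary is `./server` in older builds, `llama-server` in newer ones.

```sh
# On each CPU VM (vm1, vm2): start a server bound to all interfaces,
# with thread count and context size chosen for the box.
./llama-server -m Phi-3-mini-4k-instruct-q4.gguf \
    --host 0.0.0.0 --port 8080 --threads 8 --ctx-size 4096

# On the proxy VM: a minimal NGINX upstream, written out via a heredoc.
sudo tee /etc/nginx/conf.d/llama.conf >/dev/null <<'EOF'
upstream llama_backends {
    least_conn;                    # or ip_hash; stickier, friendlier to the prompt cache
    server vm1:8080;
    server vm2:8080;
}
server {
    listen 80;
    location / {
        proxy_pass http://llama_backends;
        proxy_read_timeout 600s;   # CPU generations can be slow
        proxy_buffering off;       # needed for streamed (SSE) responses
    }
}
EOF
sudo nginx -s reload
```

The `ip_hash` variant trades perfectly even load for better prompt-cache hit rates, which ties back to the sticky-session question earlier in the thread.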
-
@phymbert, this work is so impressive. I expect your work on bringing llama.cpp to a server-ready state will make a great contribution to its adoption by more people and companies in the future! 🤔 However, I have one question. Before selecting llama.cpp as their final serving system, many users will run benchmark tests. In summary, when running […] These situations could be due to problems in the log-printing code, or because my server system is hitting bottlenecks at the host level. Therefore, I would like to ask @phymbert: have you obtained consistent and accurate results when benchmarking llama.cpp on a server system using […]
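One low-tech way to separate log-printing issues from host-level bottlenecks is to repeat the same deterministic request and compare the timings the server itself reports. A hedged sketch, assuming a local server on port 8080 and the `timings` object that recent builds return from /completion (field names may differ in your version):

```sh
# Repeat an identical greedy-sampled request and print the generation speed
# the server measured for each run; large run-to-run variance points at the
# host (throttling, noisy neighbours) rather than at the logging code.
for i in $(seq 1 5); do
  curl -s http://localhost:8080/completion -d '{
    "prompt": "Write one sentence about llamas.",
    "n_predict": 64,
    "temperature": 0,
    "cache_prompt": false
  }' | jq '.timings.predicted_per_second'
done
```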
-
Love the idea of promoting the server to production ready. To be honest, it has more features than Ollama! Also, most developers are building native bindings for the C++ code, but if we come up with a production server, they will be able to use it directly via its endpoints, and by doing so it will be much easier to stay aligned with llama.cpp.
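As a small illustration of what "use it directly via endpoints" means in practice (host, port and model name are assumptions here): the server exposes an OpenAI-compatible chat route, so plain curl or existing OpenAI client libraries work without any native bindings.

```sh
# Chat completion against the OpenAI-compatible endpoint of a local server.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Say hello in one short sentence."}
    ]
  }'
```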
-
llama.cpp is considered production ready. But what about the server?
In general, a production-ready system can include the following aspects:
Since a couple of months, the following PRs have been added: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` #6254
What are the missing features or steps to make the server production ready?
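For reference, a sketch of a launch command combining the flags mentioned above; the values are illustrative assumptions rather than tuning advice, and the binary is `./server` in older trees, `llama-server` in newer ones.

```sh
# Hypothetical production-style launch using the flags referenced above.
./llama-server -m model.gguf \
    --host 0.0.0.0 --port 8080 \
    --threads 16 \
    --ubatch-size 512 \
    --log-disable
```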