Replies: 6 comments 6 replies
-
I have been using the server for 3 months in a production environment with active users on Kubernetes, and I am pretty happy with the current server version in terms of features, code, stability, tests and performance. In particular, I think that if we continue in this direction we can claim that the server is production ready and that we have an efficient LLM serving solution. @ggerganov @ngxson What do you think?
-
I'm curious what your experience is with load balancing and cache redundancy across pods. Do you use sticky sessions in any way, or do you find cached tokens to be problematic with round-robin load balancing? I think the server's most underrated feature is caching tokens between completions, saving the time to re-tokenize and re-infer prompts that have already been read before.
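A minimal sketch of that caching from the client side, assuming a llama.cpp server on localhost:8080; `cache_prompt` is the /completion request field that asks the server to reuse the KV cache of a shared prompt prefix (exact behaviour can vary between versions):

```sh
# First request: the server tokenizes and evaluates the full prompt and
# keeps the slot's KV cache around because cache_prompt is set.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a helpful assistant.\nUser: Summarize the project README.",
  "n_predict": 128,
  "cache_prompt": true
}'

# Second request sharing the same prefix: the common part can be reused
# instead of being re-tokenized and re-evaluated -- but only if the request
# lands on the same instance (and slot), which is why sticky sessions or
# consistent hashing on a conversation id matter behind a round-robin balancer.
curl -s http://localhost:8080/completion -d '{
  "prompt": "You are a helpful assistant.\nUser: Summarize the project README. Then list its build steps.",
  "n_predict": 128,
  "cache_prompt": true
}'
```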
-
I think what is still missing for a lot of use cases is something like a […]
-
Sorry for hijacking this thread. This is pretty awesome; to all who have been doing the work, thank you and salute! Would it be interesting to create a tutorial or guide on how to use llama.cpp in production? From reading the comments above I gather that there is some production readiness, but performance and scaling seem a bit vague or under-advertised to me. I am particularly interested in making this serve-ready, with the following features to compete with proprietary model providers (i.e. Claude and OpenAI):
[…] Etc. For example, what would be the recipe to load-balance via an NGINX reverse proxy deployed over a couple of CPU VMs? (Saying CPU here is intentional; I think this could tap into CPU-based "cheap" instances, as in the old days when the Linux cluster projects started: https://en.wikipedia.org/wiki/Beowulf_cluster.) If the community here could provide me with mentoring, insights and details, I can volunteer to create this guide or set of guides :) (I had just managed to run Phi-3-mini-4k-instruct-q4.gguf on my completely beaten MacBook Pro 2019 with an Intel Core i9, after working through a few misleading notes and cross-referencing info from Hugging Face; this is just too cool, as also manifested in Llamafile.)
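Not an authoritative recipe, just a sketch under stated assumptions: two CPU VMs (hypothetically `vm1` and `vm2`) each running a llama.cpp server instance, fronted by a stock NGINX reverse proxy; flag values are placeholders, and the server binary is `./server` in older builds, `llama-server` in newer ones.

```sh
# On each CPU VM (vm1, vm2): start a server bound to all interfaces,
# with thread count and context size chosen for the box.
./llama-server -m Phi-3-mini-4k-instruct-q4.gguf \
    --host 0.0.0.0 --port 8080 --threads 8 --ctx-size 4096

# On the proxy VM: a minimal NGINX upstream, written out via a heredoc.
sudo tee /etc/nginx/conf.d/llama.conf >/dev/null <<'EOF'
upstream llama_backends {
    least_conn;                    # or ip_hash; stickier, friendlier to the prompt cache
    server vm1:8080;
    server vm2:8080;
}
server {
    listen 80;
    location / {
        proxy_pass http://llama_backends;
        proxy_read_timeout 600s;   # CPU generations can be slow
        proxy_buffering off;       # needed for streamed (SSE) responses
    }
}
EOF
sudo nginx -s reload
```

The `ip_hash` variant trades perfectly even load for better prompt-cache hit rates, which ties back to the sticky-session question earlier in the thread.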
-
@phymbert, this work is so impressive. I expect your work on bringing llama.cpp to a server-ready state will make a great contribution to its adoption by more people and companies in the future! 🤔 However, I have one question. Before selecting llama.cpp as their final serving system, many users will run benchmark tests. In summary, when running […] These situations could be due to problems in the log-printing code, or because my server system is hitting bottlenecks at the host level. Therefore, I would like to ask @phymbert: have you obtained consistent and accurate results when benchmarking llama.cpp on a server system using […]
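One low-tech way to separate log-printing issues from host-level bottlenecks is to repeat the same deterministic request and compare the timings the server itself reports. A hedged sketch, assuming a local server on port 8080 and the `timings` object that recent builds return from /completion (field names may differ in your version):

```sh
# Repeat an identical greedy-sampled request and print the generation speed
# the server measured for each run; large run-to-run variance points at the
# host (throttling, noisy neighbours) rather than at the logging code.
for i in $(seq 1 5); do
  curl -s http://localhost:8080/completion -d '{
    "prompt": "Write one sentence about llamas.",
    "n_predict": 64,
    "temperature": 0,
    "cache_prompt": false
  }' | jq '.timings.predicted_per_second'
done
```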
-
Love the idea of promoting the server to production ready. To be honest, it has more features than Ollama! Also, most developers are building native bindings for the C++ code, but if we come up with a production server, they will be able to use it directly via its endpoints, and by doing so it will be much easier to stay aligned with llama.cpp.
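As a small illustration of what "use it directly via endpoints" means in practice (host, port and model name are assumptions here): the server exposes an OpenAI-compatible chat route, so plain curl or existing OpenAI client libraries work without any native bindings.

```sh
# Chat completion against the OpenAI-compatible endpoint of a local server.
curl -s http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-model",
    "messages": [
      {"role": "user", "content": "Say hello in one short sentence."}
    ]
  }'
```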
-
llama.cpp is considered production ready. But what about the server?
In general, a production-ready system can include the following aspects:
Since a couple of months, the following PRs have been added: `--threads` and `--threads`, `--ubatch-size`, `--log-disable` #6254
What are the missing features or steps to make the server production ready?
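For reference, a sketch of a launch command combining the flags mentioned above; the values are illustrative assumptions rather than tuning advice, and the binary is `./server` in older trees, `llama-server` in newer ones.

```sh
# Hypothetical production-style launch using the flags referenced above.
./llama-server -m model.gguf \
    --host 0.0.0.0 --port 8080 \
    --threads 16 \
    --ubatch-size 512 \
    --log-disable
```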