The Feature
Limit health-check requests to the bare minimum in cost and make them as reliable as possible against very verbose or overthinking LLMs. I think it is enough to request only a single-token response from the LLM in chat-completion mode, and likewise for other endpoints where applicable.
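A minimal sketch of what such a capped health-check request could look like, assuming an OpenAI-compatible chat-completions payload; the helper name and the model string are illustrative, not part of any existing API:

```python
def build_health_check_payload(model: str) -> dict:
    """Build the cheapest possible chat-completion health-check request:
    one short user message and at most one generated token."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        # Cap generation at a single token so a verbose or reasoning-heavy
        # model cannot run up cost or trip the health-check timeout.
        "max_tokens": 1,
    }

payload = build_health_check_payload("gpt-4o-mini")
```

The key point is the `max_tokens: 1` cap: the check only needs to confirm that the model answers at all, so any non-empty completion is sufficient and longer output is pure waste.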
Motivation, pitch
Currently, the health checks can introduce unnecessary load, and healthy models can time out because their responses run too long (especially reasoning models).
Are you a ML Ops Team?
No
Twitter / LinkedIn details
No response