
[Feature]: Limit health check request max_tokens to 1 (where applicable) #9584

Open
Ithanil opened this issue Mar 27, 2025 · 0 comments · May be fixed by #9587
Labels
enhancement New feature or request

Comments


Ithanil commented Mar 27, 2025

The Feature

Limit the cost of health check requests to the bare minimum and make them as reliable as possible against very verbose or overthinking LLMs. Requesting only a single response token should be enough in chat completion mode, and in other modes where applicable (see the sketch below).
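
A minimal sketch of the idea, assuming the public `litellm.completion` API; the `health_check` helper name and the `"ping"` prompt are illustrative, not the actual health check implementation:

```python
import litellm

def health_check(model: str) -> bool:
    """Probe a chat-completion model with the cheapest possible request."""
    try:
        litellm.completion(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,  # proposed: cap the response at a single token
        )
        return True
    except Exception:
        # Any provider/network error marks the model unhealthy.
        return False
```

With the response capped at one token, even a model that would otherwise produce a long answer should stay well within the health check timeout, and each probe costs at most a single output token.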

Motivation, pitch

Currently, the health checks can introduce unnecessary load, and timeouts can occur on models that are actually healthy because their responses run too long (especially with reasoning models).

Are you an ML Ops Team?

No

Twitter / LinkedIn details

No response

@Ithanil Ithanil added the enhancement New feature or request label Mar 27, 2025