
[Feature]: Limit health check request max_tokens to 1 (where applicable) #9584

Open
Ithanil opened this issue Mar 27, 2025 · 0 comments · May be fixed by #9587
Labels
enhancement New feature or request

Comments


Ithanil commented Mar 27, 2025

The Feature

Limit the cost of health check requests to the bare minimum and make them as reliable as possible against very verbose or overthinking LLMs. Requesting only a single response token should be enough in chat completion mode, and in other modes where applicable (see the sketch below).
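
A minimal sketch of the idea, assuming the public `litellm.completion` API; the `health_check` helper name and the `"ping"` prompt are illustrative, not the actual health check implementation:

```python
import litellm

def health_check(model: str) -> bool:
    """Probe a chat-completion model with the cheapest possible request."""
    try:
        litellm.completion(
            model=model,
            messages=[{"role": "user", "content": "ping"}],
            max_tokens=1,  # proposed: cap the response at a single token
        )
        return True
    except Exception:
        # Any provider/network error marks the model unhealthy.
        return False
```

With the response capped at one token, even a model that would otherwise produce a long answer should stay well within the health check timeout, and each probe costs at most a single output token.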

Motivation, pitch

Currently, the health checks can introduce unnecessary load, and timeouts can occur on models that are actually healthy because their responses run too long (especially with reasoning models).

Are you an ML Ops Team?

No

Twitter / LinkedIn details

No response

@Ithanil Ithanil added the enhancement New feature or request label Mar 27, 2025