Description
First of all, thanks for the great library!
Issue encountered
We're evaluating LLMs deployed on dedicated inference infrastructure with vLLM/vLLM Production Stack and want to run log-likelihood-based benchmarks (LL-based MMLU, ARC, WinoGrande, etc. + some custom ones) via API rather than using in-process backends such as transformers or vLLM.
From my understanding, the current LiteLLM backend only supports generative metrics (`greedy_until`) since it targets `/chat/completions` exclusively. The `loglikelihood()` method isn't implemented, which means LL-based benchmarks can't run against API endpoints.
If I missed an existing way to do this (via LiteLLM or another backend), please point me to the relevant docs!
Solution/Feature
Support for `loglikelihood()` in API-based evaluation, likely by querying `/completions` (not `/chat/completions`) with `echo=True` and the `logprobs` parameter to retrieve prompt token logprobs.
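For concreteness, here is a minimal sketch of the kind of request this would translate to, assuming a vLLM OpenAI-compatible server; the base URL and model name are placeholders, and exact parameter support should be checked against the serving backend:

```python
import requests

# Placeholder endpoint/model for a vLLM OpenAI-compatible server.
BASE_URL = "http://localhost:8000/v1"
MODEL = "my-served-model"

payload = {
    "model": MODEL,
    "prompt": "Question: What is 2 + 2?\nAnswer: 4",  # context + continuation concatenated
    "max_tokens": 0,     # no generation needed; some servers may require >= 1 instead
    "echo": True,        # echo the prompt back so its tokens appear in the response
    "logprobs": 1,       # attach per-token logprobs
    "temperature": 0,
}

resp = requests.post(f"{BASE_URL}/completions", json=payload).json()
lp = resp["choices"][0]["logprobs"]
# Legacy-completions logprob fields (parallel lists over the echoed tokens):
print(lp["tokens"])          # token strings
print(lp["token_logprobs"])  # per-token logprobs (first entry is typically None)
print(lp["text_offset"])     # character offsets, handy for locating the continuation span
```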
Prior art / inspiration:
- lm-evaluation-harness: its `local-completions` backend does exactly this for vLLM-served models
- LiteLLM itself supports text completions via `litellm.text_completion()`, which could be leveraged (see the sketch after this list)
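A rough sketch of how `litellm.text_completion()` might be used for this. The model name, `api_base`, and the helper itself are illustrative assumptions, and the response is assumed to mirror the legacy completions logprobs layout (`tokens` / `token_logprobs` / `text_offset`):

```python
import litellm

def continuation_logprob(model: str, context: str, continuation: str, api_base: str) -> float:
    """Illustrative helper: sum the logprobs of `continuation` given `context`."""
    resp = litellm.text_completion(
        model=model,                   # e.g. "openai/<served-model-name>" for an OpenAI-compatible server
        prompt=context + continuation,
        api_base=api_base,
        max_tokens=0,                  # some servers may require >= 1; then ignore the extra generated token
        echo=True,
        logprobs=1,
        temperature=0,
    )
    # Assumed to follow the legacy OpenAI completions logprobs layout.
    lp = resp.choices[0].logprobs
    offsets = lp["text_offset"] if isinstance(lp, dict) else lp.text_offset
    token_logprobs = lp["token_logprobs"] if isinstance(lp, dict) else lp.token_logprobs
    # First token whose character offset falls inside the continuation; approximate if a
    # single token straddles the context/continuation boundary.
    start = next(i for i, off in enumerate(offsets) if off >= len(context))
    return sum(x for x in token_logprobs[start:] if x is not None)
```

A `loglikelihood()` implementation in the LiteLLM backend could then batch such calls across requests.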
This would also benefit from making chat template application configurable (currently hardcoded as `use_chat_template=True` in `litellm_model.py#L161`).
Possible alternatives
- Custom model class: implement `loglikelihood()` myself following the custom model guide, or expand the `LiteLLMClient` class to query `/completions` directly (a rough skeleton is sketched after this list)
- Generative alternatives: convert LL-based tasks to use generative metrics (e.g., `exact_match` on a generated "A/B/C/D" answer), though this changes evaluation semantics
- Different tooling: use lm-eval-harness for LL benchmarks and lighteval for generative ones (less ideal for workflow consistency)
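For the first alternative, the wiring could look roughly like the skeleton below. The real base class, request/response types, and method signatures must come from the custom model guide for the installed lighteval version; every name here (including the `continuation_logprob` helper from the previous sketch) is a placeholder, not lighteval's actual API.

```python
# Placeholder skeleton only: the real base class, request/response types, and method
# signatures come from lighteval's custom model guide and are not reproduced here.
class CompletionsLoglikelihoodModel:  # would inherit from lighteval's model base class
    def __init__(self, model_name: str, api_base: str):
        self.model_name = model_name
        self.api_base = api_base

    def loglikelihood(self, requests):
        # Each request is assumed to expose the context and the continuation to score.
        return [
            continuation_logprob(self.model_name, req.context, req.continuation, self.api_base)
            for req in requests
        ]
```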
Would appreciate any guidance on the recommended approach, or info on whether this is on the roadmap. Thanks!
Version Info
0.13.0