
[FT] Log-likelihood metrics support for API-based evaluation (LiteLLM / completions endpoint) #1093

@DenysYurchenko24

Description

First of all, thanks for the great library!

Issue encountered

We're evaluating LLMs deployed on dedicated inference infrastructure with vLLM / vLLM Production Stack and want to run log-likelihood-based benchmarks (LL-based MMLU, ARC, WinoGrande, etc., plus some custom ones) via API rather than through in-process backends such as transformers or vLLM.
From my understanding, the current LiteLLM backend only supports generative metrics (greedy_until), since it targets /chat/completions exclusively. The loglikelihood() method isn't implemented, so LL-based benchmarks can't run against API endpoints.
If I missed an existing way to do this (via LiteLLM or another backend), please point me to the relevant docs!

Solution/Feature

Support for loglikelihood() in API-based evaluation, likely by querying /completions (not /chat/completions) with echo=True plus logprobs (an integer in the completions API, e.g. logprobs=1) to retrieve per-token logprobs for the prompt itself.
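
For reference, a minimal sketch of that mechanism against an OpenAI-compatible /v1/completions endpoint (base URL, model name, and the max_tokens=0 convention are assumptions; some servers want max_tokens=1 instead):

```python
# Minimal sketch, assuming an OpenAI-compatible /v1/completions endpoint
# (e.g. a vLLM server) that honors echo + logprobs.
import requests

def prompt_token_logprobs(base_url: str, model: str, text: str) -> list:
    resp = requests.post(
        f"{base_url}/v1/completions",
        json={
            "model": model,
            "prompt": text,
            "max_tokens": 0,  # score only; some servers require 1 here
            "echo": True,     # echo the prompt tokens back in the response
            "logprobs": 1,    # attach logprobs to the echoed tokens
        },
        timeout=120,
    )
    resp.raise_for_status()
    lp = resp.json()["choices"][0]["logprobs"]
    # token_logprobs[0] is None: the first token has no conditioning context
    return lp["token_logprobs"]
```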

Prior art / inspiration:

  • The lm-evaluation-harness local-completions backend does exactly this for vLLM-served models
  • LiteLLM itself supports text completions via litellm.text_completion(), which could be leveraged (see the sketch below)
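
Roughly what the LiteLLM route could look like for scoring a continuation given a context (a sketch only: field names follow the OpenAI completions logprobs schema, text_offset-based alignment is an assumption about what the backing server returns, and token/character boundary effects at the context/continuation split are glossed over):

```python
# Sketch of loglikelihood(context, continuation) on top of
# litellm.text_completion(). Assumes OpenAI-style echoed logprobs with
# `tokens`, `token_logprobs`, and `text_offset`; batching, retries, and
# the is_greedy check are omitted.
import litellm

def loglikelihood(model: str, context: str, continuation: str) -> float:
    resp = litellm.text_completion(
        model=model,
        prompt=context + continuation,
        max_tokens=0,  # score only; some servers require 1 here
        echo=True,
        logprobs=1,
    )
    lp = resp.choices[0].logprobs
    start = len(context)
    # Keep only tokens whose character offset falls inside the continuation.
    return sum(
        logprob
        for offset, logprob in zip(lp.text_offset, lp.token_logprobs)
        if offset >= start and logprob is not None
    )
```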

This would also benefit from making chat template application configurable (currently hardcoded as use_chat_template=True in litellm_model.py#L161), since scoring over /completions needs the raw prompt without a chat template applied.

Possible alternatives

  1. Custom model class — Implement loglikelihood() myself, following the custom model guide or extending the LiteLLMClient class, and query /completions directly (see the sketch after this list)
  2. Generative alternatives — Convert LL-based tasks to use generative metrics (e.g., exact_match on generated "A/B/C/D") — though this changes evaluation semantics
  3. Different tooling — Use lm-eval-harness for LL benchmarks and lighteval for generative ones (less ideal for workflow consistency)
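
For alternative 1, the wiring would presumably look something like the sketch below. The base-class import follows the custom model guide; the request field names (r.context, r.choice) and the expected return format are assumptions to verify against the installed lighteval version:

```python
# Rough shape of alternative 1: a custom lighteval model that scores via
# /completions, reusing the loglikelihood() helper sketched above.
from lighteval.models.abstract_model import LightevalModel  # per the guide

class CompletionsAPIModel(LightevalModel):
    def __init__(self, model: str):
        self.model = model

    def loglikelihood(self, requests):
        # r.context / r.choice are assumed field names; check the actual
        # request dataclass in the custom model guide.
        return [
            loglikelihood(self.model, r.context, r.choice) for r in requests
        ]

    def greedy_until(self, requests):
        # Generative metrics can keep going through /chat/completions,
        # e.g. via litellm.completion().
        raise NotImplementedError
```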

Would appreciate any guidance on the recommended approach, or info on whether this is on the roadmap. Thanks!

Version Info

0.13.0
