Description
First of all, thanks for the great library!
Issue encountered
We're evaluating LLMs deployed on dedicated inference infrastructure with vLLM/vLLM Production Stack and want to run log-likelihood-based benchmarks (LL-based MMLU, ARC, WinoGrande, etc. + some custom ones) via API rather than using in-process backends such as transformers or vLLM.
From my understanding, the current LiteLLM backend only supports generative metrics (`greedy_until`) since it targets `/chat/completions` exclusively. The `loglikelihood()` method isn't implemented, which means LL-based benchmarks can't run against API endpoints.
If I missed an existing way to do this (via LiteLLM or another backend), please point me to the relevant docs!
Solution/Feature
Support for `loglikelihood()` in API-based evaluation, likely by querying `/completions` (not `/chat/completions`) with `echo=True` and the `logprobs` parameter to retrieve prompt token logprobs.
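For concreteness, here is a minimal sketch of the kind of request this would translate to, assuming a vLLM OpenAI-compatible server; the base URL and model name are placeholders, and exact parameter support should be checked against the serving backend:

```python
import requests

# Placeholder endpoint/model for a vLLM OpenAI-compatible server.
BASE_URL = "http://localhost:8000/v1"
MODEL = "my-served-model"

payload = {
    "model": MODEL,
    "prompt": "Question: What is 2 + 2?\nAnswer: 4",  # context + continuation concatenated
    "max_tokens": 0,     # no generation needed; some servers may require >= 1 instead
    "echo": True,        # echo the prompt back so its tokens appear in the response
    "logprobs": 1,       # attach per-token logprobs
    "temperature": 0,
}

resp = requests.post(f"{BASE_URL}/completions", json=payload).json()
lp = resp["choices"][0]["logprobs"]
# Legacy-completions logprob fields (parallel lists over the echoed tokens):
print(lp["tokens"])          # token strings
print(lp["token_logprobs"])  # per-token logprobs (first entry is typically None)
print(lp["text_offset"])     # character offsets, handy for locating the continuation span
```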
Prior art / inspiration:
- lm-evaluation-harness: its `local-completions` backend does exactly this for vLLM-served models
- LiteLLM itself supports text completions via `litellm.text_completion()`, which could be leveraged (see the sketch after this list)
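A rough sketch of how `litellm.text_completion()` might be used for this. The model name, `api_base`, and the helper itself are illustrative assumptions, and the response is assumed to mirror the legacy completions logprobs layout (`tokens` / `token_logprobs` / `text_offset`):

```python
import litellm

def continuation_logprob(model: str, context: str, continuation: str, api_base: str) -> float:
    """Illustrative helper: sum the logprobs of `continuation` given `context`."""
    resp = litellm.text_completion(
        model=model,                   # e.g. "openai/<served-model-name>" for an OpenAI-compatible server
        prompt=context + continuation,
        api_base=api_base,
        max_tokens=0,                  # some servers may require >= 1; then ignore the extra generated token
        echo=True,
        logprobs=1,
        temperature=0,
    )
    # Assumed to follow the legacy OpenAI completions logprobs layout.
    lp = resp.choices[0].logprobs
    offsets = lp["text_offset"] if isinstance(lp, dict) else lp.text_offset
    token_logprobs = lp["token_logprobs"] if isinstance(lp, dict) else lp.token_logprobs
    # First token whose character offset falls inside the continuation; approximate if a
    # single token straddles the context/continuation boundary.
    start = next(i for i, off in enumerate(offsets) if off >= len(context))
    return sum(x for x in token_logprobs[start:] if x is not None)
```

A `loglikelihood()` implementation in the LiteLLM backend could then batch such calls across requests.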
This would also benefit from making chat template application configurable (currently hardcoded as `use_chat_template=True` in `litellm_model.py#L161`).
Possible alternatives
- Custom model class: implement `loglikelihood()` myself following the custom model guide, or expand the `LiteLLMClient` class to query `/completions` directly (a rough skeleton is sketched after this list)
- Generative alternatives: convert LL-based tasks to use generative metrics (e.g., `exact_match` on a generated "A/B/C/D" answer), though this changes evaluation semantics
- Different tooling: use lm-eval-harness for LL benchmarks and lighteval for generative ones (less ideal for workflow consistency)
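For the first alternative, the wiring could look roughly like the skeleton below. The real base class, request/response types, and method signatures must come from the custom model guide for the installed lighteval version; every name here (including the `continuation_logprob` helper from the previous sketch) is a placeholder, not lighteval's actual API.

```python
# Placeholder skeleton only: the real base class, request/response types, and method
# signatures come from lighteval's custom model guide and are not reproduced here.
class CompletionsLoglikelihoodModel:  # would inherit from lighteval's model base class
    def __init__(self, model_name: str, api_base: str):
        self.model_name = model_name
        self.api_base = api_base

    def loglikelihood(self, requests):
        # Each request is assumed to expose the context and the continuation to score.
        return [
            continuation_logprob(self.model_name, req.context, req.continuation, self.api_base)
            for req in requests
        ]
```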
Would appreciate any guidance on the recommended approach, or info on whether this is on the roadmap. Thanks!
Version Info
0.13.0