
[FT] Support for retriever-augmented and latent-memory models. #1109

@akshathmangudi

Description


Issue encountered

LightEval currently evaluates models as input -> output text generators. However, a growing class of retrieval-augmented models performs retrieval and reasoning within a latent space, or via tightly coupled retriever-generator systems, which standard evaluation supports only partially.

What kind of models can we support?

  1. Compression-native / latent RAG (e.g. Apple's CLaRa): documents are compressed into learned latent representations; retrieval and reasoning happen in continuous latent space.
  2. Joint retriever-generator models (e.g. RETRO, ATLAS): retrieval behavior materially affects generation but is not visible in standard evaluations.
  3. Latent memory systems (e.g. DSI): documents are stored and accessed implicitly via model parameters rather than text chunks.

In these cases, comparison against classic RAG baselines becomes difficult.

LightEval can already evaluate outputs from such systems, but it lacks a way to model them as RAG systems rather than as plain language models.

Solution/Feature

I don't have a fixed solution yet; rather, I'm opening a discussion on the feasibility of this feature, which I will take full initiative on if it's viable.

Upon searching the codebase, I found a few files like pipeline.py and some driver code that support classes like LightevalModel. Maybe we can create some form of model adapter that wraps a RAG system; that way we could run the existing benchmarks available within LightEval against these new systems. This is of course theoretical thinking, but I would love your thoughts on this. A rough sketch of what I mean is below.
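
To make the adapter idea concrete, here is a minimal, purely illustrative sketch. The import path, the `greedy_until` signature, and the `retriever`/`generator` objects are all assumptions on my part and would need to be checked against the real LightevalModel interface:

```python
# Purely illustrative: adapt a retriever-generator system to LightEval's
# model interface. The import path and method names below are assumptions
# based on a quick read of the codebase, not the confirmed API.
from lighteval.models.abstract_model import LightevalModel


class RAGModelAdapter(LightevalModel):
    """Wraps a coupled retriever + generator behind the LightevalModel API.

    `retriever` and `generator` are hypothetical stand-ins for whatever
    latent-RAG system is under evaluation (e.g. a CLaRa- or RETRO-style
    model). Only generation is sketched; the other abstract methods
    (loglikelihood, etc.) would also need implementations.
    """

    def __init__(self, retriever, generator, top_k: int = 5):
        self.retriever = retriever
        self.generator = generator
        self.top_k = top_k

    def greedy_until(self, requests):
        # Run retrieval first (possibly in latent space), then condition
        # generation on the retrieved representations instead of raw text.
        results = []
        for request in requests:
            retrieved = self.retriever.retrieve(request.context, k=self.top_k)
            results.append(self.generator.generate(request.context, retrieved))
        return results
```

The appeal is that any benchmark LightEval already runs against a plain language model could be pointed at the adapter unchanged, with retrieval staying an internal detail of the wrapped system.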
