Support more aggregations of numeric eval output besides average and histogram #637

Open
@zsiegel92

Suppose you have a function that uses an LLM to generate a score of 1, 2, or 3 from some text.

You label some data with ground truth and upload it to Laminar to evaluate with:

import { evaluate, LaminarDataset } from "@lmnr-ai/lmnr"

async function getScoreFromOneToThree(inputString: string): Promise<number> {
  return 1 // Placeholder for something that uses LLM
}

const dataset = new LaminarDataset<string, { score: number }>("my dataset")

evaluate({
  data: dataset,
  executor: getScoreFromOneToThree,
  evaluators: {
    correctness: async (o, t) => {
      if (!t) {
        throw new Error("No test case target")
      }
      return {
        exactlyCorrect: Number(o === t.score), // Average of this is % correct
        scoreMinusTarget: o - t.score, // Average of this is mean error
        scoreMinusTargetSquared: (o - t.score) ** 2, // Average of this is MSE
        scoreMinusTargetAbsolute: Math.abs(o - t.score), // Average of this is MAE
      }
    },
  },
})

When we run the eval we get an average (arithmetic mean) of each of those metrics along with a histogram:

[Image: screenshot of the evaluation results, showing the average and histogram for each metric]

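Problem: there is no way to get aggregations other than the average, such as RMSE (the square root of MSE), standard deviation, or max error. These cannot be recovered by averaging a per-datapoint value, because they need the full set of scores.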
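To make the gap concrete, here is a minimal sketch of the aggregations in question, computed over the whole array of per-datapoint errors. The error values and helper names are made up for illustration and are not part of Laminar.

```typescript
// Per-datapoint errors (output minus target); illustrative values only.
const errors: number[] = [0, 1, -1, 2, 0];

// Arithmetic mean, the only aggregation currently available.
const mean = (xs: number[]): number =>
  xs.reduce((sum, x) => sum + x, 0) / xs.length;

// RMSE: square root of the mean of the squared errors.
const rmse = (errs: number[]): number =>
  Math.sqrt(mean(errs.map((e) => e ** 2)));

// Population standard deviation of the errors.
const stdDev = (errs: number[]): number => {
  const m = mean(errs);
  return Math.sqrt(mean(errs.map((e) => (e - m) ** 2)));
};

// Largest absolute error across all datapoints.
const maxAbsError = (errs: number[]): number =>
  Math.max(...errs.map((e) => Math.abs(e)));

console.log({
  rmse: rmse(errors),
  stdDev: stdDev(errors),
  maxAbsError: maxAbsError(errors),
});
```

Each of these applies a final step (a square root, a max, a subtraction of the mean) after looking at every datapoint, which is why returning another per-datapoint number from the evaluator does not help.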

There are a few straightforward ways to address this:

  • Return the full evaluation results to the runtime where the evaluation was run, i.e. const results = evaluate({...}) instead of evaluate returning void. Users can then compute and log their own preferred metrics. Not as good as having it in the web UI 🤷‍♂️
  • Have the UI automatically display some of these aggregations, such as RMSE.
  • Add an API like reporters: { exactlyCorrect: async (outputs: o[], targets: t[]) => RMSE(outputs, targets) } and include the results as datapoints in the line graph alongside the other averaged fields (the graph at /project/[projectId]/evaluations/[evaluationId] that shows p99, p95, etc.).
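A rough sketch of what the reporters idea could look like, assuming each reporter receives all outputs and targets for a run and returns a single number. The Reporter type, rmseReporter, and runReporters are hypothetical names for illustration; nothing here exists in the Laminar SDK today.

```typescript
// Hypothetical shape of a reporter: aggregates a whole run into one number.
type Reporter<O, T> = (outputs: O[], targets: T[]) => Promise<number> | number;

type Target = { score: number };

// Example reporter: RMSE between numeric outputs and target scores.
const rmseReporter = (outputs: number[], targets: Target[]): number => {
  if (outputs.length === 0 || outputs.length !== targets.length) {
    throw new Error("Outputs and targets must be non-empty and the same length");
  }
  const sumOfSquares = outputs.reduce((sum, o, i) => {
    const t = targets[i];
    if (!t) {
      throw new Error("No test case target");
    }
    return sum + (o - t.score) ** 2;
  }, 0);
  return Math.sqrt(sumOfSquares / outputs.length);
};

// Hypothetical runner: evaluates every reporter once per run and
// returns one aggregated value per reporter name.
async function runReporters<O, T>(
  reporters: Record<string, Reporter<O, T>>,
  outputs: O[],
  targets: T[],
): Promise<Record<string, number>> {
  const results: Record<string, number> = {};
  for (const [name, reporter] of Object.entries(reporters)) {
    results[name] = await reporter(outputs, targets);
  }
  return results;
}
```

With something like this, each run would yield one value per reporter name, which could be plotted as a datapoint next to the existing averages.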
