Support more aggregations of numeric eval output besides average and histogram #637

Suppose you have a function that uses an LLM to generate a score of 1, 2, or 3 from some text.

You label some data with ground truth and upload it to Laminar to evaluate with:

```typescript
import { evaluate, LaminarDataset } from "@lmnr-ai/lmnr"

// Placeholder for a function that asks an LLM to score the input as 1, 2, or 3
async function getScoreFromOneToThree(inputString: string): Promise<number> {
  return 1
}

const dataset = new LaminarDataset<string, { score: number }>("my dataset")

evaluate({
  data: dataset,
  executor: getScoreFromOneToThree,
  evaluators: {
    correctness: async (o, t) => {
      if (!t) {
        throw new Error("No test case target")
      }
      return {
        exactlyCorrect: Number(o === t.score), // Average of this is the fraction correct (accuracy)
        scoreMinusTarget: o - t.score, // Average of this is the mean signed error
        scoreMinusTargetSquared: (o - t.score) ** 2, // Average of this is the mean squared error (MSE)
        scoreMinusTargetAbsolute: Math.abs(o - t.score), // Average of this is the mean absolute error (MAE)
      }
    },
  },
})
```

When we run the eval we get an average (arithmetic mean) of each of those metrics along with a histogram:

![Image](https://github.com/user-attachments/assets/5c830275-7b60-4bad-98b3-fad77557060d)

**Problem: there is no way to get any aggregation other than the average**, such as RMSE (square root of MSE), standard deviation, or max error.
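To make the request concrete, here is a rough sketch of how the aggregations named above could be computed from the per-datapoint `scoreMinusTarget` values. The helper names (`mean`, `rmse`, `stdDev`, `maxAbsError`) are made up for illustration and are not part of the Laminar SDK; the standard deviation shown is the population form, which is an assumption about which variant would be wanted.

```typescript
// Hypothetical helpers, NOT part of the Laminar SDK.
// `errors` holds one `scoreMinusTarget` value per datapoint.

function mean(values: number[]): number {
  if (values.length === 0) {
    throw new Error("Cannot aggregate an empty array")
  }
  return values.reduce((sum, v) => sum + v, 0) / values.length
}

// Root mean squared error: square root of the mean of the squared errors.
function rmse(errors: number[]): number {
  return Math.sqrt(mean(errors.map((e) => e ** 2)))
}

// Population standard deviation of the errors (divides by n, not n - 1).
function stdDev(errors: number[]): number {
  const m = mean(errors)
  return Math.sqrt(mean(errors.map((e) => (e - m) ** 2)))
}

// Largest absolute error across all datapoints.
function maxAbsError(errors: number[]): number {
  if (errors.length === 0) {
    throw new Error("Cannot aggregate an empty array")
  }
  return Math.max(...errors.map((e) => Math.abs(e)))
}
```

None of these can be recovered from the averaged fields alone, except RMSE, which is the square root of the averaged `scoreMinusTargetSquared`.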

There are a few straightforward ways to address this:

- Return the full evaluation results to the runtime where the evaluation was run, i.e. `const results = evaluate({...})` instead of `evaluate` being `:void`. Users can then compute and log their own preferred metrics. Not as good as having it in the web UI 🤷‍♂️
- Have the UI automatically display some of these, such as RMSE.
- Add an API like `reporters: { exactlyCorrect: async (outputs: o[], targets: t[]) => RMSE(outputs, targets) }` and plot the results as datapoints in the line graph alongside the other averaged fields (the graph at `/project/[projectId]/evaluations/[evaluationId]` that shows p99, p95, etc.)
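For the third option, the `reporters` idea could look something like the sketch below. Everything here is hypothetical: the `Reporter` type, the `runReporters` helper, and the `Target` shape are assumptions for discussion, and nothing like this exists in the Laminar SDK today.

```typescript
// Hypothetical shape for the proposed `reporters` option.
// A reporter sees ALL outputs and targets at once and returns one number.
type Reporter<O, T> = (outputs: O[], targets: T[]) => number | Promise<number>

// Matches the target type of the dataset in the example above.
type Target = { score: number }

const reporters: Record<string, Reporter<number, Target>> = {
  // Root mean squared error between executor outputs and target scores.
  rmse: (outputs, targets) => {
    const squared = outputs.map((o, i) => (o - targets[i].score) ** 2)
    return Math.sqrt(squared.reduce((a, b) => a + b, 0) / squared.length)
  },
}

// Runs every reporter once over the full result set and returns
// one aggregate value per reporter name.
async function runReporters<O, T>(
  all: Record<string, Reporter<O, T>>,
  outputs: O[],
  targets: T[],
): Promise<Record<string, number>> {
  if (outputs.length !== targets.length) {
    throw new Error("outputs and targets must have the same length")
  }
  const result: Record<string, number> = {}
  for (const [name, reporter] of Object.entries(all)) {
    result[name] = await reporter(outputs, targets)
  }
  return result
}
```

Each returned value would then become one datapoint per evaluation run, so it could be plotted in the same line graph as the averaged fields.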
