Suppose you have a function that uses an LLM to generate a score of 1, 2, or 3 from some text.
You label some data with ground truth and upload it to Laminar to evaluate with:
```ts
import { evaluate, LaminarDataset } from "@lmnr-ai/lmnr"

async function getScoreFromOneToThree(inputString: string): Promise<number> {
  return 1 // Placeholder for something that uses an LLM
}

const dataset = new LaminarDataset<string, { score: number }>("my dataset")

evaluate({
  data: dataset,
  executor: getScoreFromOneToThree,
  evaluators: {
    correctness: async (o, t) => {
      if (!t) {
        throw new Error("No test case target")
      }
      return {
        exactlyCorrect: Number(o === t.score), // Average of this is % correct
        scoreMinusTarget: o - t.score, // Average of this is mean error
        scoreMinusTargetSquared: (o - t.score) ** 2, // Average of this is MSE
        scoreMinusTargetAbsolute: Math.abs(o - t.score), // Average of this is MAE
      }
    },
  },
})
```
When we run the eval, we get an average (arithmetic mean) of each of those metrics, along with a histogram.

Problem: there is no way to get aggregations more useful than the average, such as RMSE (the square root of MSE), standard deviation, max error, etc.
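For concreteness, each of these is a simple reduction over the per-datapoint values the evaluator already returns. A rough sketch in plain TypeScript (the helper below is mine, not part of the SDK):

```ts
// Sketch only: the aggregations I'd like to see, computed over the
// per-datapoint `scoreMinusTarget` values. This helper is hypothetical.
function aggregateErrors(errors: number[]) {
  const n = errors.length
  const mean = errors.reduce((a, b) => a + b, 0) / n
  const mse = errors.reduce((a, b) => a + b ** 2, 0) / n
  const variance = errors.reduce((a, b) => a + (b - mean) ** 2, 0) / n
  return {
    meanError: mean,                                // what the UI shows today
    rmse: Math.sqrt(mse),                           // square root of MSE
    stdDev: Math.sqrt(variance),                    // spread of the error
    maxAbsError: Math.max(...errors.map(Math.abs)), // worst-case error
  }
}
```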
There are a few straightforward ways to address this:
- Eject full evaluation results back to the runtime where the evaluation was run, i.e. `const results = evaluate({...})` instead of `evaluate` being `void`. Then users can log their own preferred metrics (sketched below). Not as good as having it in the web UI 🤷‍♂️
- Have the UI automatically display some of these (RMSE, etc.).
- Add some API like `reporters: { exactlyCorrect: async (outputs: o[], targets: t[]) => RMSE(outputs, targets) }` (also sketched below) and include these as datapoints in the line graph of the other averaged fields (the graph at `/project/[projectId]/evaluations/[evaluationId]` that has p99, p95, etc.).
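To make the first and third options concrete, here are two rough sketches. Everything in them that is not already in the snippet above (the `EvalResults` shape, the `reporters` key and its signature) is hypothetical, not an existing part of the SDK.

```ts
// Option 1 (hypothetical): evaluate() resolves with per-datapoint evaluator
// values instead of void, so callers can compute their own aggregations.
interface EvalResults {
  // evaluator name -> numeric field -> one value per datapoint
  scores: Record<string, Record<string, number[]>>
}

const results = (await evaluate({
  data: dataset,
  executor: getScoreFromOneToThree,
  evaluators: {
    correctness: async (o, t) => ({
      scoreMinusTargetSquared: (o - (t?.score ?? 0)) ** 2,
    }),
  },
})) as unknown as EvalResults // cast needed because evaluate() is void today

const sqErrors = results.scores.correctness.scoreMinusTargetSquared
const rmse = Math.sqrt(sqErrors.reduce((a, b) => a + b, 0) / sqErrors.length)
console.log({ rmse }) // each user logs whatever aggregation they prefer
```

```ts
// Option 3 (hypothetical): a `reporters` option that receives all outputs and
// targets for the run and produces one extra scalar per reporter, shown next
// to the averaged evaluator fields in the UI.
evaluate({
  data: dataset,
  executor: getScoreFromOneToThree,
  evaluators: {
    correctness: async (o, t) => ({ exactlyCorrect: Number(o === t?.score) }),
  },
  reporters: {
    rmse: async (outputs: number[], targets: { score: number }[]) =>
      Math.sqrt(
        outputs.reduce((acc, o, i) => acc + (o - targets[i].score) ** 2, 0) /
          outputs.length,
      ),
  },
})
```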