Evals · LLMs · Security / 11 min read

Building an Eval Loop You Actually Trust

By Arsalan Mosenia · Published

The single most common mistake I see with LLM evaluation is using a model to grade its own outputs with no calibration against human judgment. The grader agrees with the model because they share the same biases.

A useful eval loop has three properties: it is reproducible, it is calibrated against a small human-labeled set, and it is cheap enough to run on every change. Most teams have one of the three. Getting all three is what separates “we have evals” from “our evals are load-bearing.”
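Calibration, in particular, is measurable: periodically compare the judge's verdicts against your human-labeled set and track agreement over time. A minimal sketch of that check (the function and label names are mine, purely illustrative) using Cohen's kappa, which corrects raw agreement for chance:

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Chance-corrected agreement between human labels and judge verdicts."""
    assert len(human) == len(judge) and human, "need paired, non-empty labels"
    n = len(human)
    # Observed agreement: fraction of items where the judge matches the human.
    p_o = sum(h == j for h, j in zip(human, judge)) / n
    # Expected chance agreement, from each rater's marginal label distribution.
    h_counts, j_counts = Counter(human), Counter(judge)
    p_e = sum((h_counts[l] / n) * (j_counts[l] / n) for l in h_counts | j_counts)
    if p_e == 1.0:
        return 1.0  # degenerate case: both raters always emit the same label
    return (p_o - p_e) / (1 - p_e)

# Example: 4 gold items, judge flips one verdict.
human = ["pass", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass"]
print(round(cohens_kappa(human, judge), 2))  # 0.5
```

If kappa drifts downward after a judge-prompt or judge-model change, the eval results stopped meaning what they used to, regardless of what the pass rate says.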

What follows is a recipe that has held up: task-specific rubrics, a small gold set you maintain like production code, and a judge model from a different model family than the one under test.
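Wired together, the loop itself is small. A sketch under assumptions of mine (the gold-set schema, the `generate` and `judge` callables, and the pass/fail verdict vocabulary are illustrative, not anything the recipe prescribes):

```python
def run_eval(gold_set, generate, judge, threshold=0.9):
    """Run the model over the gold set, judge each output against its
    per-case rubric, and report whether the pass rate clears the bar."""
    results = []
    for case in gold_set:
        output = generate(case["prompt"])
        verdict = judge(case["rubric"], output)  # "pass" or "fail"
        results.append({"id": case["id"], "verdict": verdict, "output": output})
    pass_rate = sum(r["verdict"] == "pass" for r in results) / len(results)
    return pass_rate >= threshold, pass_rate, results

# Stub model and judge, so the harness is testable without any API calls.
gold = [
    {"id": "g1", "prompt": "2+2?", "rubric": "must equal 4"},
    {"id": "g2", "prompt": "capital of France?", "rubric": "must say Paris"},
]
fake_generate = lambda p: "4" if "2+2" in p else "Paris"
fake_judge = lambda rubric, out: "pass" if out in rubric else "fail"
ok, rate, _ = run_eval(gold, fake_generate, fake_judge)
print(ok, rate)  # True 1.0
```

In practice `generate` wraps the model under test and `judge` wraps the judge model with the rubric in its prompt; keeping both as plain callables is what makes the loop cheap enough to run on every change, since you can swap in cached or stubbed versions in CI.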
