Building an Eval Loop You Actually Trust
The single most common mistake I see with LLM evaluation is using a model to grade its own outputs with no calibration against human judgment. The grader agrees with the model because they share the same biases.
A useful eval loop has three properties: it is reproducible, it is calibrated against a small human-labeled set, and it is cheap enough to run on every change. Most teams have one of the three. Getting all three is what separates “we have evals” from “our evals are load-bearing.”
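The calibration property is the one worth measuring directly: score your judge's labels against the human gold labels and correct for chance agreement. A minimal sketch using Cohen's kappa, with hypothetical pass/fail labels (the label values and gold-set items here are illustrative, not from the text):

```python
from collections import Counter

def cohens_kappa(human: list[str], judge: list[str]) -> float:
    """Agreement between human and judge labels, corrected for chance."""
    assert len(human) == len(judge) and human
    n = len(human)
    # Raw agreement rate.
    observed = sum(h == j for h, j in zip(human, judge)) / n
    # Agreement you'd expect if the judge labeled at random
    # with its own label frequencies.
    h_counts, j_counts = Counter(human), Counter(judge)
    labels = set(human) | set(judge)
    expected = sum(h_counts[l] * j_counts[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical labels on a six-item gold set.
human = ["pass", "pass", "fail", "pass", "fail", "pass"]
judge = ["pass", "fail", "fail", "pass", "fail", "pass"]
print(round(cohens_kappa(human, judge), 2))  # → 0.67
```

A common convention is to distrust the judge below roughly 0.6 and re-tune the rubric until agreement recovers; the exact threshold is a judgment call, not a standard.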
What follows is a recipe that has held up: task-specific rubrics, a small gold set you maintain like production code, and a judge model from a different family than the model under test.
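The loop itself can be wired together in a few lines. This is a sketch under stated assumptions: the rubric text, the `judge` callable, and the gold-set items are all hypothetical placeholders (in practice `judge` would wrap a model client from a different family than the model under test):

```python
import json

# Hypothetical task-specific rubric; yours should name concrete criteria.
RUBRIC = """Score the answer 'pass' or 'fail' against these criteria:
1. Directly answers the question.
2. States no facts absent from the provided context.
Respond with JSON: {"verdict": "pass" | "fail", "reason": "..."}"""

def run_eval(gold_set: list[dict], judge) -> dict:
    """Grade each gold-set item with a judge callable and tally a pass rate.

    `judge` is any function (prompt str -> JSON str), so the loop is
    reproducible and cheap to run on every change.
    """
    results = []
    for item in gold_set:
        prompt = f"{RUBRIC}\n\nQuestion: {item['question']}\nAnswer: {item['answer']}"
        verdict = json.loads(judge(prompt))["verdict"]
        results.append({**item, "verdict": verdict})
    passed = sum(r["verdict"] == "pass" for r in results)
    return {"pass_rate": passed / len(results), "results": results}

# Stub judge for illustration only; swap in a real model client in practice.
def fake_judge(prompt: str) -> str:
    ok = prompt.endswith("Answer: 4")
    return json.dumps({"verdict": "pass" if ok else "fail", "reason": "stub"})

gold = [
    {"question": "2 + 2?", "answer": "4"},
    {"question": "Capital of France?", "answer": "Berlin"},
]
print(run_eval(gold, fake_judge)["pass_rate"])  # → 0.5
```

Because the judge is injected as a plain callable, the same loop runs against a recorded-response stub in CI and a live model in a nightly job, which keeps it cheap enough to run on every change.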