Building evaluation loops that survive contact with real models and real users.
The fastest way to ship a bad model is to grade it with a worse one. A practical guide to evaluation that survives contact with production.