We tell every new client the same thing on the first day: the eval harness is not a quality gate; it is the actual product. The model is replaceable. The prompt is replaceable. The harness — the thing that decides whether a change made the system better or worse — is what we build the engagement around.
Why eval-first wins
If you can't tell whether yesterday's experiment regressed, you are guessing. Guessing is fine for a hack day. It does not survive a board review.
- Eval gives the team a shared definition of "better." No more taste-debates.
- It catches model-update regressions automatically. The provider quietly swaps a checkpoint? You find out before the customer does.
- It makes the cost-quality trade obvious. Drop down a model tier, watch the eval score drop with it. Decide.
The eval harness is the bit you can't ship without. Everything else is replaceable on a quarter's notice.
The shape we use
Every project we run starts with a flat list of test cases — a JSONL file, in the repo, plain English. It looks something like this:
```json
{ "id": "u-0021", "input": "What's the policy on…", "expect_contains": ["…", "…"], "must_not_contain": ["personal data"] }
```
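Rows of that shape need very little machinery to run. A minimal sketch of a checker for them follows; the function names, the pass/fail semantics (every `expect_contains` string present, no `must_not_contain` string present), and the `ask` callback are our assumptions for illustration, not a specific client's tooling:

```python
import json

def run_case(case: dict, answer: str) -> bool:
    """One row passes if every expected string appears in the answer
    and no forbidden string does. Missing keys mean no constraint."""
    missing = [s for s in case.get("expect_contains", []) if s not in answer]
    leaked = [s for s in case.get("must_not_contain", []) if s in answer]
    return not missing and not leaked

def run_harness(path: str, ask) -> tuple[int, int]:
    """ask(input_text) -> model answer string. Returns (passed, total)."""
    passed = total = 0
    with open(path) as f:
        for line in f:
            if not line.strip():
                continue  # tolerate blank lines in the JSONL file
            case = json.loads(line)
            total += 1
            if run_case(case, ask(case["input"])):
                passed += 1
    return passed, total
```

Substring matching is deliberately crude; the point is that the whole gate fits in a screen of code and lives in the repo next to the test file.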
Three rules
- Every PR runs the harness. The pass rate goes in the PR description.
- New requirements add a row before they add code.
- Failing rows do not become "known issues." They become tickets.
What it doesn't replace
Real users still tell you things the eval set won't. The harness keeps you from regressing on what you already know. Customer interviews keep you discovering what you don't.
If this resonated, our next piece walks through what we put in the eval set when retrieval is part of the system. Subscribe below.