Shipping evaluation frameworks that survive contact with production

Every team we've worked with started with an evaluation set and ended up rebuilding it after their first production incident. The first version usually proved that the demo still worked. It did not prove that the product could keep working after new users, new documents, new model behavior, and new support tickets arrived.

The fix is to treat the eval harness as a product surface, not a test fixture.

Treat evaluation as a product

The eval harness has users: engineers debugging regressions, product owners deciding whether a release can ship, support leads deciding whether a failure becomes a blocker, and finance owners watching model cost. Those users have workflows, and the data has a schema. Leave the workflows and the schema undesigned, and within a quarter you'll have an eval set that nobody trusts.

The failure mode is predictable: the first eval is a spreadsheet of happy-path examples, the product ships, support finds edge cases, and nobody knows whether the new examples belong in the blocking set, the monitoring set, or the backlog. By then, the eval is treated as a test fixture instead of a product surface.

The three sets we separate

The first mistake is putting every example into one blocking set. Production examples do not all play the same role.

Blocking set: small, stable, and severe. If this fails, the release stops.
Monitoring set: larger and noisier. If this drifts, someone investigates, but not every miss blocks a release.
Backlog set: real failures that are not yet understood well enough to become a gate.

That separation matters because eval trust is fragile. If the blocking set contains ambiguous examples, engineers learn to ignore red runs. If the monitoring set is too small, product quality drifts without anyone noticing. If the backlog never graduates into a gate, production keeps rediscovering the same failure.
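One way to make the separation concrete is to tag every case with its set and let only the blocking set stop a release. The sketch below is a minimal illustration, not a prescribed implementation: the names Tier, CaseResult, and release_decision are hypothetical, and the 0.90 monitoring floor is a placeholder value, not a recommendation.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    BLOCKING = "blocking"      # small, stable, severe: any failure stops the release
    MONITORING = "monitoring"  # larger, noisier: drift triggers an investigation
    BACKLOG = "backlog"        # real failures not yet understood well enough to gate


@dataclass(frozen=True)
class CaseResult:
    case_id: str
    tier: Tier
    passed: bool


def release_decision(results, monitoring_floor=0.90):
    """Return (decision, case_ids) for one eval run.

    monitoring_floor is an assumed pass-rate threshold for the monitoring set.
    """
    blocking_failures = [
        r.case_id for r in results if r.tier is Tier.BLOCKING and not r.passed
    ]
    if blocking_failures:
        return "stop", blocking_failures

    monitored = [r for r in results if r.tier is Tier.MONITORING]
    if monitored:
        pass_rate = sum(r.passed for r in monitored) / len(monitored)
        if pass_rate < monitoring_floor:
            misses = [r.case_id for r in monitored if not r.passed]
            return "investigate", misses

    # Backlog results are recorded elsewhere but never influence the decision.
    return "ship", []
```

Keeping the backlog out of the decision entirely is the design choice that protects trust: an ambiguous example can be tracked without ever turning a run red.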

What the harness includes now

A versioned eval dataset stored alongside the code.
A scorecard with explicit thresholds for blocking a release.
A weekly review where new regressions get triaged before they pile up.
A fixture generator for recurring edge cases: bad sources, stale records, malformed tool calls, unsafe requests, and "needs human" moments.
A cost and latency budget next to the quality threshold.
A named owner: one person who can decide whether a new production failure becomes a release blocker.
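A scorecard with explicit thresholds can be as small as three numbers checked together, so that cost and latency sit next to quality instead of in a separate dashboard. The following is a sketch under assumptions: the field names, the choice of p95 latency, and the per-request cost unit are all illustrative, and real thresholds would come from the team's own review.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scorecard:
    min_blocking_pass_rate: float    # quality threshold for the blocking set
    max_cost_per_request_usd: float  # cost budget (assumed unit: USD per request)
    max_p95_latency_ms: float        # latency budget (assumed metric: p95)


@dataclass(frozen=True)
class RunSummary:
    blocking_pass_rate: float
    cost_per_request_usd: float
    p95_latency_ms: float


def check_release(card, run):
    """Return the names of violated thresholds; an empty list means a green run."""
    violations = []
    if run.blocking_pass_rate < card.min_blocking_pass_rate:
        violations.append("quality")
    if run.cost_per_request_usd > card.max_cost_per_request_usd:
        violations.append("cost")
    if run.p95_latency_ms > card.max_p95_latency_ms:
        violations.append("latency")
    return violations
```

Storing the Scorecard values in the same repository as the versioned dataset means a threshold change shows up in review like any other change.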

What belongs in the first version

Real user inputs, not invented prompt examples.
Expected answers, refusal cases, and "needs human" cases.
A small adversarial set for the failures the business cannot tolerate.
A cost and latency budget next to the quality threshold.
An owner who can decide when a new production failure becomes a release blocker.
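The first-version contents above imply a small schema: each case records a real input, the behavior expected of the system, and whether the case is adversarial. Here is a minimal sketch of such a record, assuming one JSON object per line; the field names and the three behavior labels are hypothetical choices made for illustration.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Assumed labels for the three kinds of expected outcome named in the list above.
EXPECTED_BEHAVIORS = {"answer", "refuse", "needs_human"}


@dataclass(frozen=True)
class EvalCase:
    case_id: str
    user_input: str                       # a real user input, not an invented prompt
    expected_behavior: str                # one of EXPECTED_BEHAVIORS
    expected_answer: Optional[str] = None
    adversarial: bool = False


def parse_case(line):
    """Parse one JSON line into an EvalCase, rejecting incomplete records."""
    raw = json.loads(line)
    case = EvalCase(
        case_id=raw["case_id"],
        user_input=raw["user_input"],
        expected_behavior=raw["expected_behavior"],
        expected_answer=raw.get("expected_answer"),
        adversarial=raw.get("adversarial", False),
    )
    if case.expected_behavior not in EXPECTED_BEHAVIORS:
        raise ValueError(f"{case.case_id}: unknown behavior {case.expected_behavior!r}")
    if case.expected_behavior == "answer" and not case.expected_answer:
        raise ValueError(f"{case.case_id}: answer cases need an expected_answer")
    return case
```

Rejecting malformed records at load time keeps the dataset honest: a case without an agreed expected outcome is a backlog item, not an eval case.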

What does not belong yet

Not every weird production failure belongs in the blocking eval the day it appears. Some failures need product judgment first. Some reveal a missing workflow state. Some are data-quality problems upstream of the model. Some should become monitoring examples before they become release gates.

The eval review exists to make that call deliberately. Otherwise the harness becomes a junk drawer of scary anecdotes, and the team stops trusting it.
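The review's decision can be recorded as a simple routing rule so that every failure leaves with a destination. This is only a sketch of one possible ordering, and the review itself still makes the call: the FailureReport flags, the destination names, and the precedence between them are assumptions, not a description of any specific process.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureReport:
    summary: str
    needs_product_judgment: bool = False    # expected behavior is not yet decided
    missing_workflow_state: bool = False    # the product lacks a state for this case
    upstream_data_issue: bool = False       # the data was wrong before the model saw it
    expected_behavior_agreed: bool = False  # reviewers agree on the right outcome
    business_cannot_tolerate: bool = False  # severe enough to gate a release


def triage(report):
    """Return a destination for a production failure (hypothetical labels)."""
    if report.needs_product_judgment:
        return "product_review"
    if report.missing_workflow_state:
        return "workflow_fix"
    if report.upstream_data_issue:
        return "data_fix"
    if not report.expected_behavior_agreed:
        return "backlog"
    if report.business_cannot_tolerate:
        return "blocking"
    return "monitoring"
```

For example, triage(FailureReport("stale price quoted", upstream_data_issue=True)) returns "data_fix", which keeps a data-quality problem out of the blocking set.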

The handover standard

At handover, your team should be able to answer five questions without us:

Which eval cases block a release?
Which examples are monitored but not blocking?
How does a new production failure become an eval case?
Who owns threshold changes?
What cost and latency budget does the model route have to respect?

The point is not to make the eval huge. The point is to make it trusted enough that a green run means something and a red run stops the release.