Verifier-gated agent loops — the eval, moved from CI into the runtime

The factory rule is to write the eval before the prompt. The natural extension, which almost nobody has shipped to production yet, is to put the eval inside the loop. Not as a release gate. As a runtime gate. A second model, smaller and specialized, sitting between the frontier model and any irreversible side effect, checking the work before it ships.

We have been calling this verifier-gated agent loops. The pattern is straightforward. The literature is converging on it. The tooling is barely there. It is one of the highest-conviction bets we are making about how agentic systems will operate by the end of 2026.

The shape of the loop

A frontier model proposes an action. The action is well-formed — a tool call with arguments, a SQL query, a draft response, a code patch, a customer email. Before the action executes, a verifier runs.

The verifier is not the same model. It is smaller and faster, often a fine-tuned open-weights model trained on the failure modes that matter in your domain. Its only job is to score the proposed action against a narrow, written rubric: is this query within scope, is this email factually grounded in the retrieved context, is this code patch consistent with the test it claims to fix.
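To make "a narrow, written rubric" concrete, here is a minimal sketch of a rubric expressed as data. Everything named here (Verdict, RubricItem, score, ALLOWED_TABLES, the shape of the action dict) is hypothetical, and the rule-based checks are stand-ins: in a real system each check would be a call to the verifier model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Verdict:
    passed: bool
    feedback: str = ""  # name of the failed rubric item, empty on pass


@dataclass(frozen=True)
class RubricItem:
    name: str
    check: Callable[[dict], bool]  # True when the action satisfies this item


def score(action: dict, rubric: List[RubricItem]) -> Verdict:
    """Return a failing verdict for the first unmet item, else a pass."""
    for item in rubric:
        if not item.check(action):
            return Verdict(passed=False, feedback=item.name)
    return Verdict(passed=True)


# Hypothetical rubric for a SQL tool call: read-only, allowed tables only.
ALLOWED_TABLES = {"orders", "customers"}

sql_rubric = [
    RubricItem("read_only", lambda a: a["sql"].strip().lower().startswith("select")),
    RubricItem("in_scope", lambda a: set(a["tables"]) <= ALLOWED_TABLES),
]
```

The point of the structure is that every rubric item has a name, so a failed check can be handed back to the proposer as specific feedback rather than a bare rejection.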

If the verifier passes, the action executes. If it fails, the loop retries with the failure mode as feedback, escalates to a human, or aborts. The frontier model never directly causes a side effect. Only an action the verifier has passed reaches the side-effect boundary.

A verifier-gated loop: the frontier model proposes an action; a small verifier model checks it against the task's success criteria; on pass, the action commits at the side-effect boundary; on fail, the loop retries with the verifier's feedback instead of acting. The verifier sits in the runtime, between the model and anything irreversible.
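The loop described above can be sketched in a few lines. This is an illustration under stated assumptions, not a reference implementation: gated_step, Outcome, Verdict, and the four callables are hypothetical names, the retry budget is arbitrary, and this variant escalates to a human when retries run out (aborting instead would be a one-line change).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class Outcome(Enum):
    COMMITTED = "committed"
    ESCALATED = "escalated"


@dataclass
class Verdict:
    passed: bool
    feedback: str = ""  # failure mode reported by the verifier


def gated_step(
    propose: Callable[[Optional[str]], dict],   # frontier model: feedback -> action
    verify: Callable[[dict], Verdict],          # small verifier model
    commit: Callable[[dict], None],             # the only path to a side effect
    escalate: Callable[[dict, Verdict], None],  # hand the action to a human
    max_retries: int = 2,
) -> Outcome:
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    feedback: Optional[str] = None
    for _ in range(max_retries + 1):
        action = propose(feedback)
        verdict = verify(action)
        if verdict.passed:
            commit(action)  # side effect happens only after a pass
            return Outcome.COMMITTED
        feedback = verdict.feedback  # retry with the failure mode as feedback
    escalate(action, verdict)  # retries exhausted: nothing was committed
    return Outcome.ESCALATED
```

The design choice worth noticing is that propose never receives commit. The proposer can only return a description of an action, so there is no code path by which an unverified action reaches the side-effect boundary.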

Why a separate model and not the same one self-checking

Self-critique by the same model is a known weak gate: a model can't grade its own homework. It catches the obvious failures and misses the ones that share its bias, which are precisely the failures that matter in production. A verifier with a different training distribution, often a smaller model fine-tuned on the actual failure cases your system produced, is more likely to see what the proposer missed.

The economics also work out. The frontier model runs once per step, and so does the verifier. Because the verifier is roughly an order of magnitude cheaper per call, total per-step cost goes up by ten to fifteen percent. In exchange, the blast radius of a wrong action drops to near zero for the failure modes the verifier is trained to catch.
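The arithmetic behind that range is easy to check. The sketch below is a back-of-envelope model, not a measurement: per_step_overhead is a hypothetical helper, costs are expressed relative to one frontier call, and the retry model (each rejection triggers a fresh proposal, rejections independent) is our simplifying assumption.

```python
def per_step_overhead(verifier_cost_ratio: float, reject_rate: float = 0.0) -> float:
    """Fractional increase in expected per-step cost from adding the gate.

    verifier_cost_ratio: verifier call cost / frontier call cost
                         (0.1 means the verifier is 10x cheaper).
    reject_rate: probability the verifier rejects a proposal; each rejection
                 is assumed to trigger one more propose-and-verify attempt.
    """
    if not 0.0 <= reject_rate < 1.0:
        raise ValueError("reject_rate must be in [0, 1)")
    # Expected attempts until the first pass (geometric distribution).
    expected_attempts = 1.0 / (1.0 - reject_rate)
    gated_cost = expected_attempts * (1.0 + verifier_cost_ratio)
    ungated_cost = 1.0  # one frontier call, no verifier, no retries
    return gated_cost / ungated_cost - 1.0
```

Under these assumptions, a verifier at one-tenth the cost with no rejections adds 10 percent per step, and a 4 percent rejection rate brings the overhead to about 14.6 percent. A gate that rejects far more often than that costs noticeably more than the headline range.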

What you need before you can ship one

The verifier is a model. A model needs a training set. A training set needs failure examples. Most teams do not have a clean record of their agent's past failures because they were not capturing them. The pre-work for verifier-gated loops is six to eight weeks of disciplined failure capture in the eval pipeline before any verifier is trained — the same discipline that decides whether an agent is ready for production.
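Failure capture does not need heavy tooling to start. A minimal sketch, assuming an append-only JSON Lines file is an acceptable store; the FailureRecord fields and the log_failure helper are hypothetical, chosen to hold what a verifier fine-tune would later need (the proposed action, the context it was judged against, and a labeled failure mode).

```python
import json
import time
from dataclasses import asdict, dataclass, field
from typing import IO


@dataclass
class FailureRecord:
    task_id: str
    proposed_action: dict  # the action exactly as the frontier model emitted it
    context: str           # what the action should have been grounded in
    failure_mode: str      # short label, e.g. "out_of_scope" or "ungrounded_claim"
    label_source: str      # who or what assigned the label, e.g. "human_review"
    timestamp: float = field(default_factory=time.time)


def log_failure(record: FailureRecord, sink: IO[str]) -> None:
    """Append one failure as a single JSON line to an open text stream."""
    sink.write(json.dumps(asdict(record), sort_keys=True) + "\n")
```

A usage example would open the log with open("failures.jsonl", "a") and call log_failure from wherever the eval pipeline marks a run as failed. The file name is illustrative.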

Once you have that, the verifier is small, focused, and trainable in days. We have shipped verifiers in the seven-to-thirteen-billion parameter range that match the gate quality of GPT-4-class self-critique at one-twentieth the cost.

Where this is going

Two trends will make this pattern routine inside eighteen months. Frontier vendors will start shipping verifier models as a first-class product, the way they shipped embedding models in 2023. And the cost of small fine-tunes will keep falling, which means the verifier becomes a per-engagement asset, not a shared one.

The teams shipping verifier-gated loops in 2026 will look like the teams shipping CI in 2010: doing something the rest of the industry will treat as obvious in three years and is treating as exotic right now. The eval is moving from being a release gate to being a runtime gate. That is the same move CI made back then, and it is going to land just as hard.

What we are not doing yet

We are not running verifier-gated loops on tasks where the verifier itself is the bottleneck on quality. Some domains — long-horizon strategic reasoning, open-ended creative work — do not yet have a verifier model that scores reliably. For those, we still keep the human in the loop. The verifier is not magic. It is an eval with a runtime interface.