The eval that's allowed to kill the build

A discovery eval is not a deliverable. It is the instrument that decides whether the build should happen at all — and sometimes the honest output is don't build. When it is, the right move is to say so and refund the fee in week two, before anyone has spent the build budget. This is how that's supposed to work.

The shape of the engagement before we wrote the eval

A team comes in with a real problem and a real budget. The most common version is some form of "RAG over our internal docs so support / sales / ops stops asking the same questions." Approved budget. Executive sponsor. A vendor demo already in the room.

We do not start writing prompts. We start writing the eval.

Days 1–9: questions, not prompts

We sit with the people who are actually going to use the system. Not the executive sponsor — the agents, the analysts, the operators. We ask each of them to write thirty to fifty questions they would expect the system to answer correctly. We collect their honest answers, including the ones that contain numbers, dates, or judgment calls.

That set is the eval. Real inputs. Expected outputs. Refusal cases ("the system should say it does not know"). The cases where the right answer is "escalate to a human."

By day nine we score whatever already exists — the vendor demo, the in-house prototype, a baseline LLM call. We score it against the eval the users wrote.

When the math kills the build

When the eval kills a build, the score isn't close. Picture a strict pass rate in the high twenties and a lenient one in the forties — the kind of gap no prompt engineering closes.

Pass rate alone is not the kill signal. The kill signal is the unit economics implied by the pass rate. If the bot is wrong on three out of four answers and a wrong answer creates a downstream escalation, the bot is generating tickets, not deflecting them. The build does not need to be cheaper. The build needs to not happen.

We bring the eval, the failure modes, and the implied unit economics to the steering meeting. We recommend not building. We refund the discovery fee.

What the eval actually decides

The eval does more than gate quality: it decides whether the next ninety days of engineering exist at all. Most teams discover this only after they have spent the build budget. The factory rule is that we discover it in the first two weeks, on our risk, not yours.

When the eval is clean, the build is straightforward — green eval merges, red eval does not, the operating system runs itself. When the eval cannot be written, no amount of prompt engineering will make the system honest. That is the project we will not take.