A discovery eval is not a deliverable. It is the instrument we use to decide whether the build should happen at all. Twice in the last eighteen months it told us not to build, and we refunded the discovery fee on the spot. This is what those weeks looked like.
The shape of the engagement before we wrote the eval
A team comes in with a real problem and a real budget. The most common version is some form of "RAG over our internal docs so support / sales / ops stops asking the same questions." Approved budget. Executive sponsor. A vendor demo already in the room.
We do not start writing prompts. We start writing the eval.
Days 1–9: questions, not prompts
We sit with the people who are actually going to use the system. Not the executive sponsor — the agents, the analysts, the operators. We ask each of them to write thirty to fifty questions they would expect the system to answer correctly. We collect their honest answers, including the ones that contain numbers, dates, or judgment calls.
That set is the eval. Real inputs. Expected outputs. Refusal cases ("the system should say it does not know"). The cases where the right answer is "escalate to a human."
By day nine we score whatever already exists — the vendor demo, the in-house prototype, a baseline LLM call. We score it against the eval the users wrote.
When the math kills the build
In the two cases we walked away, the score was not close. The strict pass rate was somewhere in the high twenties. The lenient pass rate was somewhere in the forties.
Pass rate alone is not the kill signal. The kill signal is the unit economics implied by the pass rate. If the bot is wrong on three out of four answers and a wrong answer creates a downstream escalation, the bot is generating tickets, not deflecting them. The build does not need to be cheaper. The build needs to not happen.
We bring the eval, the failure modes, and the implied unit economics to the steering meeting. We recommend not building. We refund the discovery fee.
What the eval actually decides
The eval is not a quality gate. It is the decision-making instrument for whether the next ninety days of engineering exist. Most teams discover this only after they have spent the build budget. The factory rule is that we discover it in the first two weeks, on our risk, not yours.
When the eval is clean, the build is straightforward — green eval merges, red eval does not, the operating system runs itself. When the eval cannot be written, no amount of prompt engineering will make the system honest. That is the project we will not take.