zoff.tech

AI software factory · The same three engineers, every engagement · 4–6 a year

Production AI systems your engineers can operate. Built on one rule: write the eval before the prompt.

Most AI builds survive the demo and fail the first real operating week. We ship systems your on-call can operate at 3 a.m., gated by an evaluation harness your team owns and tested by a real user in week 2 while the architecture is still cheap to change.

What’s in the box

Three operating assets your team owns before handover.

Most engagements end with a repo and a good intention. Ours end with three things your team can operate without us.

eval

The eval, before the prompt

Every build starts with an eval dataset and explicit thresholds. If we can’t write a defensible eval in the first 14 days, we kill the engagement and refund the discovery fee. We’ve done it twice.
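
To make "an eval dataset and explicit thresholds" concrete, here is a minimal sketch of what such a harness might look like. Everything in it is an assumption for illustration: the names `EvalCase`, `run_eval`, and `gate`, the sample questions, the exact-match scorer, and the 0.9 threshold are hypothetical and do not come from any real engagement. A production eval would likely use a richer scorer than string equality.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class EvalCase:
    question: str   # a real question the system must answer
    expected: str   # the answer a domain owner has signed off on


def exact_match(answer: str, expected: str) -> bool:
    # Placeholder scorer (assumption): case-insensitive string equality.
    return answer.strip().lower() == expected.strip().lower()


def run_eval(system: Callable[[str], str], cases: Sequence[EvalCase]) -> float:
    """Return the fraction of cases the system answers correctly."""
    if not cases:
        raise ValueError("an empty eval set cannot gate anything")
    passed = sum(exact_match(system(c.question), c.expected) for c in cases)
    return passed / len(cases)


def gate(score: float, threshold: float) -> int:
    """Exit code for a CI job: 0 lets the change merge, 1 blocks it."""
    return 0 if score >= threshold else 1


# Hypothetical eval set and a stub system, for illustration only.
CASES = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Which plan includes SSO?", "Enterprise"),
    EvalCase("Who approves invoices over $10k?", "Finance lead"),
    EvalCase("What is the support SLA?", "4 hours"),
]
STUB_ANSWERS = {
    "What is the refund window?": "30 days",
    "Which plan includes SSO?": "enterprise",
    "Who approves invoices over $10k?": "Finance lead",
    "What is the support SLA?": "24 hours",  # deliberately wrong
}


def stub_system(question: str) -> str:
    return STUB_ANSWERS[question]


score = run_eval(stub_system, CASES)      # 3 of 4 cases correct
exit_code = gate(score, threshold=0.9)    # below the threshold, so blocked
```

The point of the sketch is the ordering: the cases and the threshold exist before any prompt does, and the gate returns a plain exit code that a merge check can consume.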

runbook

The runbook your on-call opens at 3 a.m.

Not the slide with arrows. The actual document: alerts, degradation modes, when to wake a human, how to roll back. We sign it off in the final week.

ip

IP, data, and the model bill — under your name, from day one

Your repo. Your OpenAI / Anthropic / AWS keys. Your dashboards. When we hand over, your team doesn’t ask us for credentials — they already had them, the whole time.

Factory output

Three live systems you can inspect, not sanitized case studies.

Microstax, BidGenie, DeOne — public AI systems with visible product decisions, tradeoffs, and operating constraints. Product leaders can inspect the workflow decisions. Engineers can inspect the constraints, failure modes, and operating assumptions.

See all three →

How we work

Four operating rules.

Each one has ended a sales conversation. Each one has protected a delivery.

  1. Eval before prompt. Always.

    Most teams write the prompt first and ship if it "looks good." We write the eval set first: real questions, real answers, real thresholds. That eval decides whether a change merges. No green eval, no merge. Your team owns the decision instead of trusting our opinion.

  2. No orphan PoCs. A real user is inside the system by week 2.

    A PoC with no path to production usually hides the hard decisions. We pick a real user (often someone on your team) and put what we have so far in their hands. What we learn in week 2 changes what we ship in week 8. Architecture calls get made against a transcript, not only on a whiteboard.

  3. The cheapest model that passes the eval wins.

    Quality, latency, and cost are scored together. Frontier when we must; smaller when we don’t. The eval makes that call, not a vendor relationship. In recent builds, paths that didn’t need frontier reasoning ran under $0.50 per 1M tokens — cutting model spend 80%+ versus a default-to-frontier build. The difference isn’t just cost — it’s latency, throughput, and the freedom to re-evaluate when the problem moves.

  4. Zero bait-and-switch. The same core trio from scope to handover.

    You pay for staff-level execution, not junior training. The same trio—architect, engineer, designer—who scopes your system is the one writing the code, the evals, and the runbooks. We restrict our intake to protect this focus.
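
The third rule, "the cheapest model that passes the eval wins," can be sketched as a small selection function. This is an illustrative sketch under stated assumptions, not our production tooling: the `Candidate` fields, the `pick_model` name, the candidate names, and every number below are invented placeholders rather than measurements from a real build.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    name: str
    eval_score: float         # fraction of eval cases passed, 0.0 to 1.0
    p95_latency_s: float      # 95th-percentile latency in seconds
    usd_per_1m_tokens: float  # blended price per 1M tokens


def pick_model(
    candidates: Sequence[Candidate],
    min_score: float,
    max_latency_s: float,
) -> Optional[Candidate]:
    """Return the cheapest candidate that clears both quality and latency bars."""
    passing = [
        c for c in candidates
        if c.eval_score >= min_score and c.p95_latency_s <= max_latency_s
    ]
    # None means nothing passed: fix the system or revisit the thresholds.
    return min(passing, key=lambda c: c.usd_per_1m_tokens, default=None)


# Placeholder candidates (assumed values, for illustration only).
CANDIDATES = [
    Candidate("tiny", eval_score=0.81, p95_latency_s=0.4, usd_per_1m_tokens=0.10),
    Candidate("small", eval_score=0.93, p95_latency_s=0.8, usd_per_1m_tokens=0.40),
    Candidate("frontier", eval_score=0.97, p95_latency_s=2.1, usd_per_1m_tokens=3.00),
]

winner = pick_model(CANDIDATES, min_score=0.90, max_latency_s=3.0)

# Spend reduction versus defaulting to the most expensive candidate.
frontier_price = max(c.usd_per_1m_tokens for c in CANDIDATES)
savings = 1 - winner.usd_per_1m_tokens / frontier_price
```

With these placeholder numbers the cheapest candidate fails the quality bar, so the mid-priced one wins. Because the decision is a function of the eval score, rerunning it when prices or the problem change is cheap.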

The arithmetic

The arithmetic. Published so fit is clear before we spend engineering time.

Fixed fee or time-and-materials, depending on risk. We tell you which pricing model we’re proposing on the first call, and why.

Discovery + evaluation: from $8k · 1–2 weeks
Build (small): $40–80k · 6–8 weeks
Build (medium): $80–160k · 10–14 weeks
Audit / review: $15–25k · 3 weeks · fixed price

We publish prices because opaque pricing wastes procurement cycles before technical fit is even clear. If these ranges align with your budget, we’ll talk engineering, not sales pitches, on our first call.

What we say no to

We say no when we can’t defend the outcome.

Concrete examples from the last year:

  • "RAG over all our documents." No concrete question, no eval, no go.
  • Proofs-of-concept with no path to production. We build systems to be operated, not slide decks to be filed away.
  • An agent that replaces humans on legal, medical, or financial decisions.
  • Projects where the success rubric is "we’ll know it when we see it."
  • Projects where the prompt is treated as the product and operating it (evals, alerts, rollback) is an afterthought.

If your project falls outside our focus, we will introduce you to teams better suited for that path. That introduction is always free.

The three shapes

Agentic apps and AI tools, built for production

We design and ship bounded agentic workflows, copilots, retrieval systems, internal AI tools, and eval harnesses that your engineers can operate after handover.

What we write

Latest insights

Essays on what the factory teaches us in production. No think pieces.


Bring us the problem, the owner, the budget range, and the date.

Here’s what we cover on the call, in this order:

  1. What concrete problem this solves, for whom, now.
  2. What the eval would look like. Can we write it?
  3. Your budget and your date.
  4. Whether we’re a fit. If we’re not, who is.

The call is strictly engineering: no sales decks, no agency discovery theater. If your process requires a multi-page RFP before a technical brief, email us instead.
