AI software factory · The same three engineers, every engagement · 4–6 engagements a year
Production AI systems your engineers can operate. Built on one rule: write the eval before the prompt.
Most AI builds survive the demo and fail the first real operating week. We ship systems your on-call can operate at 3 a.m., gated by an evaluation harness your team owns and tested by a real user in week 2 while the architecture is still cheap to change.
What’s in the box
Three operating assets your team owns before handover.
Most engagements end with a repo and a good intention. Ours end with three things your team can operate without us.
The eval, before the prompt
Every build starts with an eval dataset and explicit thresholds. If we can’t write a defensible eval in the first 14 days, we kill the engagement and refund the discovery fee. We’ve done it twice.
The runbook your on-call opens at 3 a.m.
Not the slide with arrows. The actual document: alerts, degradation modes, when to wake a human, how to roll back. We sign it off in the final week.
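As a rough illustration of what one page of such a runbook might encode, here is a minimal sketch. The alert names, degradation modes, and rollback steps are hypothetical examples, not entries from a real engagement, and the escalate-by-default rule for unknown alerts is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunbookEntry:
    alert: str             # what fired
    degradation_mode: str  # how the system keeps serving in the meantime
    page_human: bool       # whether on-call gets woken up
    rollback: str          # how to get back to the last known-good state


# Hypothetical entries, for illustration only.
RUNBOOK = {
    entry.alert: entry
    for entry in [
        RunbookEntry("eval_score_drop", "serve cached answers only", True,
                     "redeploy the previous prompt and model config"),
        RunbookEntry("model_provider_outage", "fail over to the backup model", False,
                     "restore the primary model when the provider recovers"),
        RunbookEntry("cost_spike", "rate-limit non-critical traffic", True,
                     "revert the last routing change"),
    ]
}


def lookup(alert: str) -> RunbookEntry:
    """Return the entry for an alert; unknown alerts page a human by default."""
    fallback = RunbookEntry(alert, "none defined", True, "none defined; escalate")
    return RUNBOOK.get(alert, fallback)
```

Calling `lookup("model_provider_outage")` returns an entry that does not page anyone, while an alert the runbook has never seen falls through to the escalate-by-default entry.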
IP, data, and the model bill — under your name, from day one
Your repo. Your OpenAI / Anthropic / AWS keys. Your dashboards. When we hand over, your team doesn’t ask us for credentials — they already had them, the whole time.
Factory output
Three live systems you can inspect, not sanitized case studies.
Microstax, BidGenie, DeOne — public AI systems with visible product decisions, tradeoffs, and operating constraints. Product leaders can inspect the workflow decisions. Engineers can inspect the constraints, failure modes, and operating assumptions.
BidGenie
An AI workflow that turns RFPs, DDQs, and security questionnaires into reviewable drafts in hours, with human approval built into every step.
First drafts in hours instead of days · human approval at every choke point · zero unreviewed AI text leaves the system
DeOne
A science-grounded dating platform — psychometric assessments, multi-dimensional matching, and an AI coach that understands both sides of the conversation.
50+ matching dimensions · psychometrically grounded AI coach · crisis detection built in before launch
Microstax
An agent-native environment runtime — isolated, governed Kubernetes sandboxes for human developers and autonomous AI agents.
< 60s spin-up · 8+ hrs saved per developer per week · onboarding cut from 2 weeks to 1 day
How we work
Four operating rules.
Each one has ended a sales conversation. Each one has protected a delivery.
01
Eval before prompt. Always.
Most teams write the prompt first and ship if it "looks good." We write the eval set first: real questions, real answers, real thresholds. That eval is what decides whether a change merges. No green eval, no merge. This leaves your team owning the decision instead of trusting our opinion.
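The gate itself can be very small. The sketch below is one possible shape, not our harness: the eval cases, the `candidate` system, and the exact-match scoring rule are all assumptions made for illustration, and real evals usually need a richer scorer.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalCase:
    question: str
    expected: str


def pass_rate(cases: list[EvalCase], system: Callable[[str], str]) -> float:
    """Fraction of cases where the answer matches (case-insensitive exact match)."""
    if not cases:
        raise ValueError("an empty eval set cannot gate a merge")
    hits = sum(
        1
        for case in cases
        if system(case.question).strip().lower() == case.expected.strip().lower()
    )
    return hits / len(cases)


def merge_allowed(cases: list[EvalCase], system: Callable[[str], str],
                  threshold: float) -> bool:
    """No green eval, no merge: pass only at or above the agreed threshold."""
    return pass_rate(cases, system) >= threshold


# Hypothetical eval set and candidate system, for illustration only.
CASES = [
    EvalCase("What is the refund window?", "30 days"),
    EvalCase("Which plan includes SSO?", "Enterprise"),
    EvalCase("Is data encrypted at rest?", "Yes"),
    EvalCase("What is the uptime SLA?", "99.9%"),
]


def candidate(question: str) -> str:
    answers = {
        "What is the refund window?": "30 days",
        "Which plan includes SSO?": "enterprise",
        "Is data encrypted at rest?": "yes",
        "What is the uptime SLA?": "99.5%",  # wrong on purpose
    }
    return answers[question]
```

With these made-up cases the candidate answers three of four correctly, so `merge_allowed(CASES, candidate, threshold=0.9)` blocks the change. In CI, that boolean would typically decide the exit code of the check.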
02
No orphan PoCs. A real user is inside the system by week 2.
A PoC with no path to production usually hides the hard decisions. We pick a real user (often someone on your team) and put what we have so far in their hands. What we learn in week 2 changes what we ship in week 8. Architecture calls get made against a transcript, not only on a whiteboard.
03
The cheapest model that passes the eval wins.
Quality, latency, and cost are scored together. Frontier when we must; smaller when we don’t. The eval makes that call, not a vendor relationship. In recent builds, paths that didn’t need frontier reasoning ran under $0.50 per 1M tokens — cutting model spend 80%+ versus a default-to-frontier build. The difference isn’t just cost — it’s latency, throughput, and the freedom to re-evaluate when the problem moves.
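A minimal sketch of that selection rule, assuming each candidate model has already been run through the eval. The model names, scores, latencies, and prices below are invented for illustration and do not describe any real vendor or build.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Candidate:
    name: str
    eval_score: float              # fraction of eval cases passed, 0.0 to 1.0
    p95_latency_ms: float
    usd_per_million_tokens: float


def cheapest_passing(candidates: list[Candidate], min_score: float,
                     max_latency_ms: float) -> Optional[Candidate]:
    """Filter on quality and latency first, then pick the lowest price."""
    passing = [
        c for c in candidates
        if c.eval_score >= min_score and c.p95_latency_ms <= max_latency_ms
    ]
    return min(passing, key=lambda c: c.usd_per_million_tokens, default=None)


def savings_fraction(chosen_price: float, baseline_price: float) -> float:
    """Share of spend saved relative to a baseline price per token."""
    return 1.0 - chosen_price / baseline_price


# Hypothetical candidates, for illustration only.
CANDIDATES = [
    Candidate("frontier", 0.97, 1800, 3.00),
    Candidate("mid", 0.93, 900, 0.80),
    Candidate("small", 0.92, 400, 0.40),
    Candidate("tiny", 0.81, 200, 0.10),
]

chosen = cheapest_passing(CANDIDATES, min_score=0.90, max_latency_ms=2000)
```

Here `tiny` is cheapest but fails the quality bar, so `small` wins. If no candidate passes, the function returns `None` instead of silently falling back to the most expensive option, which keeps that decision explicit.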
04
Zero bait-and-switch. The same core trio from scope to handover.
You pay for staff-level execution, not junior training. The same trio—architect, engineer, designer—who scopes your system is the one writing the code, the evals, and the runbooks. We restrict our intake to protect this focus.
The arithmetic
The arithmetic. Published so fit is clear before we spend engineering time.
Fixed fee or time-and-materials, depending on risk. We tell you which pricing model we’re proposing on the first call, and why.
- Discovery + evaluation: from $8k · 1–2 weeks
- Build (small): $40–80k · 6–8 weeks
- Build (medium): $80–160k · 10–14 weeks
- Audit / review: $15–25k · 3 weeks · fixed price
We publish prices because opaque pricing wastes procurement cycles before technical fit is even clear. If these ranges align with your budget, we'll talk engineering—not sales pitches—on our first call.
What we say no to
We say no when we can’t defend the outcome.
Concrete examples from the last year:
- "RAG over all our documents." No concrete question, no eval, no go.
- Proofs-of-concept with no path to production. We build systems to be operated, not slide decks to be filed away.
- An agent that replaces humans on legal, medical, or financial decisions.
- Projects where the success rubric is "we’ll know it when we see it."
- Projects where the prompt is treated as the product and the operating system is an afterthought.
If your project falls outside our focus, we will introduce you to teams better suited for that path. That introduction is always free.
The three shapes
Agentic apps and AI tools, built for production
We design and ship bounded agentic workflows, copilots, retrieval systems, internal AI tools, and eval harnesses that your engineers can operate after handover.
Agentic application builds
Agentic system review
AI tools for existing products
What we write
Latest insights
Essays on what the factory teaches us in production. No think pieces.
The permission map every production agent needs before it calls a tool
Tool-using agents need an explicit map of what they can read, write, mutate, escalate, and never touch.
Using AI to code is not the same as building AI systems
AI-assisted coding is becoming table stakes. AI systems engineering is becoming the real differentiator. Here is the difference, and why it matters.
The 3 a.m. AI runbook
Production AI fails in ways ordinary app runbooks do not cover. The operating plan has to include quality drift, retrieval failure, model outages, cost spikes, and human escalation.
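To make the permission-map idea from the first essay concrete, here is a minimal sketch. The resource names are hypothetical, and so is the reading of "write" as creating new records and "mutate" as changing existing ones. Unknown resources are denied by default, which is an assumption of this sketch, not a quote from the essay.

```python
from enum import Enum


class Access(Enum):
    READ = "read"
    WRITE = "write"        # assumed here to mean: create new records
    MUTATE = "mutate"      # assumed here to mean: change existing records
    ESCALATE = "escalate"  # anything not explicitly allowed goes to a human
    NEVER = "never"        # the agent must not touch this resource at all


# Hypothetical resources, for illustration only.
PERMISSIONS = {
    "knowledge_base": {Access.READ},
    "draft_answers": {Access.READ, Access.WRITE, Access.MUTATE},
    "customer_email": {Access.ESCALATE},
    "billing_records": {Access.NEVER},
}


def authorize(resource: str, action: Access) -> str:
    """Return 'allow', 'escalate', or 'deny' for a requested tool action."""
    allowed = PERMISSIONS.get(resource, {Access.NEVER})  # deny by default
    if Access.NEVER in allowed:
        return "deny"
    if action in allowed:
        return "allow"
    if Access.ESCALATE in allowed:
        return "escalate"
    return "deny"
```

The check runs before the tool call, so `authorize("customer_email", Access.WRITE)` hands the request to a human and `authorize("billing_records", Access.READ)` is refused outright.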
Bring us the problem, the owner, the budget range, and the date.
Here’s what we cover, in this order:
- What concrete problem this solves, for whom, now.
- What the eval would look like. Can we write it?
- Your budget and your date.
- Whether we’re a fit. If we’re not, who is.
The call is strictly engineering: no sales decks, no agency discovery theater. If your process requires a multi-page RFP before a technical brief, email us instead.