Latest insights

Jun 24, 2026
A model can't grade its own homework
Self-improving training loops generate their own data — and drift when the model also judges it. The versions that actually work add an external verifier, not a bigger model.
Jun 24, 2026
You don't pick a smaller model — you compress the right one
Choosing a smaller model trades capability for cost blindly. Compression keeps the model that passes your eval and trims what it doesn't use — but only an eval tells you how far you can cut.
Jun 24, 2026
Optimizing each agent doesn't optimize the system
In a multi-agent system, a locally perfect prompt can make the whole system worse. The fix is to optimize prompts for the handoff, not the agent; the handoff is also why these systems are hard to debug.
Jun 24, 2026
Prompt injection is a threat model, not a bug to patch
Prompt injection can't be filtered away — instruction and data share one channel. The defense isn't a smarter prompt; it's least privilege on the tools the agent can call.
Jun 24, 2026
A right answer from the wrong tool call is still a bug
Most agent evals score the final answer and ignore the tool calls. But an agent can be right by luck or by an unauthorized action — score the tool use separately, or it bites you in production.
Jun 16, 2026
AI agent development cost, the honest version
What an AI agent actually costs to build — real ranges, the hidden total-cost-of-ownership nobody quotes, and why we publish our prices when almost no one else will.
Jun 16, 2026
Your codebase needs an LLM wiki
AI coding agents are only as good as the context they load. A small, current, in-repo wiki written for the machine makes an agent fit your codebase instead of fighting it.
Jun 16, 2026
You don't fix hallucination, you survive it
A new paper shrinks model hallucination again — but no technique makes it zero. The production answer isn't a fix; it's the eval, the gate, the trace, and the checkpoint.
Jun 15, 2026
Boutique AI engineering firm vs Big 4 — when each one wins
When a boutique AI firm beats the Big 4 and when it doesn't — the honest split on cost, seniority, and who actually builds the system you're paying for.
Jun 13, 2026
Why AI projects fail in production — and how we don't
60–95% of AI pilots never reach production. The cause is almost never the model. Here are the real reasons they die, and the three operating rules that keep ours alive.
Jun 11, 2026
How to evaluate an AI agent before it ships
A practical checklist for deciding whether an AI agent is safe for production — building or buying. The seven things that separate a demo from a system.
Jun 9, 2026
Red flags when hiring an AI agency
The warning signs that an AI vendor will burn your budget — and the questions that separate a firm that ships production systems from one that ships a demo and a roadmap.
Jun 4, 2026
You can't debug what you can't trace
Multi-agent systems fail in ways event logs can't see. The fix is a trace — a typed tree of spans with the reasoning attached. Observability is part of the build.
Jun 4, 2026
Evolutionary divergence — when one agent learns and the rest break
When agents adapt from production feedback, one can improve locally while breaking what its peers depend on. Version-stamped traces catch the drift before users do.
Jun 4, 2026
Instrument once — OpenTelemetry is the agent tracing contract
Proprietary tracing formats are a lock-in tax. OpenTelemetry's GenAI conventions let you instrument an agent once and send the traces anywhere you choose.
Jun 4, 2026
Tracing agents on the Microsoft stack
A field guide to agent observability on Azure — Semantic Kernel, Azure AI Foundry tracing, and Application Insights, kept portable with OpenTelemetry and no lock-in.
Jun 4, 2026
Tracing agents on the Google stack
A field guide to agent observability on Google Cloud — ADK and Agent Engine on Vertex AI, Gemini, and Cloud Trace, kept portable with OpenTelemetry.
Jun 3, 2026
Chatbot vs. workforce
The difference between a chatbot and an agent system is the difference between a tool and a team. What an agent system actually is, and why it is the work that lasts.
Jun 2, 2026
AI slop is what shipping without an eval looks like
Faceless AI content channels are a business model with no quality bar, and they fail for the same reason ungated AI systems fail in production. The eval is the difference.
Jun 1, 2026
When the tool eats the service
AI content repurposing is real demand with a closing window — the tools are automating the service away. A way to think about building where the moat outlasts the model.
May 30, 2026
A chat widget is not a system
Anyone can deploy a website chat agent in an afternoon. That is exactly the problem. Where these quietly fail, and why the answer is a system, not a widget.
May 28, 2026
What it takes to put an agent on the phone line
The voice-agent demo is an afternoon. The system that answers a plumber's phone at 2 a.m. without losing the job is the actual work. Here is the gap.
May 26, 2026
If it worked, they wouldn't sell it
A field guide to AI snake oil — the trading bots, the guaranteed returns, the products that are really courses. The test is simple, and it generalizes.
May 23, 2026
The audit is the product
Most AI consulting fails before the build, in a vague assessment nobody can act on. The audit is not a sales step. It is the first deliverable, and it should stand on its own.
May 21, 2026
Using AI to code is not the same as building AI systems
AI-assisted coding is becoming table stakes. AI systems engineering is becoming the real differentiator. Here is the difference, and why it matters.
May 19, 2026
The permission map every production agent needs before it calls a tool
Tool-using agents need an explicit map of what they can read, write, mutate, escalate, and never touch — decided before the build, not during an incident.
May 12, 2026
RAG does not start with embeddings. It starts with answerability.
Before you tune retrieval, prove the question can even be answered from the source corpus — with a citation a human would accept. Answerability comes before embeddings.
May 5, 2026
The 3 a.m. AI runbook
Production AI fails in ways ordinary runbooks don't cover. The operating plan must handle quality drift, retrieval failure, model outages, cost spikes, and human escalation.
Apr 28, 2026
Cache-aware agent architecture, or why your loop is paying for the same context fifteen times
Prompt caching is no longer a performance optimization. It is an architectural constraint that decides whether a long-running agent is economic to operate.
Apr 21, 2026
MCP is becoming the production interface for agents — own it like one
The Model Context Protocol is moving from a developer convenience to the production interface between agents and your systems. Here is what changes when you treat it that way.
Apr 14, 2026
Verifier-gated agent loops — the eval, moved from CI into the runtime
A small verifier model sitting between the frontier model and the side-effect boundary is the most useful piece of agent architecture nobody is shipping yet.
Apr 7, 2026
The cheapest model that passes the eval wins
How a working eval harness picks the model — and how often the cheapest model that passes beats the frontier one the team came in expecting.
Mar 31, 2026
No orphan PoCs: put a real user in the system by week 2
A PoC with no path to production hides the hard decisions. A week-2 user forces them into the open while the architecture is still cheap to change.
Mar 24, 2026
What a real user breaks by day twelve that no spec would catch
Why a real user belongs inside the system by week 2 — and the kinds of architecture decisions they force into the open that no spec would catch.
Mar 17, 2026
The eval that's allowed to kill the build
How a discovery eval decides whether a build should happen at all — and why the honest output is sometimes 'don't build', before you've spent the build budget.
Mar 10, 2026
Shipping evaluation frameworks that survive contact with production
An evaluation harness is a product, not a notebook. Ship it like one — versioned, owned, instrumented, and able to stop a release when a run goes red.
