AI-Native · Jun 9, 2026 · 10 min read

Operationalize: You Built an Agent. Now Make It an Employee.

A third of companies have an agent doing real work in production. Most of them built a heroic one-off: brilliant, fragile, and understood by exactly one engineer. That's not a capability — it's a liability with good PR.

Operationalize · Phase 3 · Built AI agents

This is the phase where I stop teasing companies, because getting here is real. Per the funnel in my series opener, the research finds only about a third of organizations have begun to scale AI beyond experimentation, even though 62% are experimenting with agents. That gap, between trying agents and having one do actual work in production, is this phase's entry fee. If that's you, something genuinely shipped: a system reads the ticket and resolves it, processes the invoice, triages the lead. Money or time changes hands because of a model's decisions.

And yet the Operationalize phase has its own trap, and most companies that reach it are already in it. The agent works, but nobody can say how well it works. It was built by one talented engineer in a burst of enthusiasm, it lives in a repo nobody else touches, and it has no evals, no guardrails beyond hope, and no documented blast radius. It is, in other words, a heroic artifact. And heroic artifacts don't compound; they accumulate risk while everyone congratulates themselves.

The one-off trap

Here's the test I run in phase-three companies. I ask three questions about their proudest agent. One: what's its error rate this week, and is that better or worse than last week? Two: if its model provider upgrades the underlying model tomorrow, how will you know whether the agent got better or worse? Three: if the engineer who built it resigns on Friday, who runs it on Monday?

Silence on all three, almost every time. The agent is in production but not operated. It's the difference between owning a truck and running a logistics company. And the silence has a consequence that compounds quietly: because nobody trusts the agent in a way they can defend with numbers, nobody dares to extend it, replicate the pattern, or point it at anything more important. The organization's one success becomes its ceiling.

Manage agents like employees, because that's what they are

The mental model that gets companies out of this trap is one I've argued before: design agents like stupid employees. Not stupid as an insult; stupid as a design constraint. Picture a new hire who is fast, tireless, occasionally brilliant, and prone to confidently making things up. You would never onboard that person by giving them root access and walking away. You'd give them a narrow job description, examples of work done well, boundaries on what they can decide alone, and an escalation path for everything else. Then you'd review their work, frequently at first, then by sampling.

Every piece of that maps directly onto agent architecture:

  • Job description → a narrow, explicit scope. Agents fail in proportion to the vagueness of their mandate. "Handle refunds under $200 matching these policies" works; "handle customer issues" is an incident waiting for a timestamp.
  • Probation review → an eval suite. A library of real cases with known-good outcomes, run continuously, so quality is a number on a chart instead of a vibe in a standup.
  • Decision boundaries → guardrails in code. What the agent may do autonomously, what requires human sign-off, and what it must never touch, enforced by the system, not the prompt.
  • Escalation path → an explicit uncertainty route. The single highest-leverage design feature of any agent is knowing when to say "this one's not for me" and hand off cleanly to a human.
  • Manager → a named owner in the org chart. Not "the AI team." A person whose job includes this agent's output, the same way a team lead owns a junior's output.
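
As a rough illustration of "enforced by the system, not the prompt," here is a minimal Python sketch of the refund example from the list. Everything named in it (`RefundRequest`, `route`, the policy IDs, the confidence threshold) is hypothetical and chosen for illustration; only the "under $200" limit comes from the example above.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    AUTO_APPROVE = "auto_approve"  # the agent may act alone
    ESCALATE = "escalate"          # hand off cleanly to a human
    REFUSE = "refuse"              # something the agent must never touch


@dataclass(frozen=True)
class RefundRequest:
    amount_usd: float
    policy_id: str     # the policy the agent believes applies
    confidence: float  # agent's self-reported confidence, 0.0 to 1.0


# Hypothetical values; a real deployment would load these from reviewed config.
MAX_AUTO_REFUND_USD = 200.0
ALLOWED_POLICIES = frozenset({"damaged_item", "late_delivery"})
MIN_CONFIDENCE = 0.8


def route(request: RefundRequest) -> Decision:
    """Apply the job description in code, after the model has made its proposal."""
    if request.amount_usd <= 0:
        return Decision.REFUSE  # malformed request: never act on it
    if request.policy_id not in ALLOWED_POLICIES:
        return Decision.ESCALATE  # outside the narrow scope
    if request.amount_usd >= MAX_AUTO_REFUND_USD:
        return Decision.ESCALATE  # above the autonomous limit
    if request.confidence < MIN_CONFIDENCE:
        return Decision.ESCALATE  # "this one's not for me"
    return Decision.AUTO_APPROVE
```

The point of the design is that the model only proposes. A request for $50 under an allowed policy with high confidence is approved automatically, while the same request at $200, or with low confidence, goes to a human regardless of what the prompt says.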

When companies tell me their agent "isn't reliable enough for production," the problem is almost never the model. It's that they built an employee with no job description, no manager, and no performance review, and they're surprised it behaves like one.

Testing the non-deterministic

The engineering discipline underneath all of this is evaluation, and it deserves its own emphasis because it's the thing phase-three companies most consistently skip. Traditional software testing assumes determinism: same input, same output, green check. Agents broke that assumption, and most teams responded by quietly giving up on testing rather than changing how they test. The answer, which I detailed in testing non-deterministic agents, is statistical: golden datasets, scored rubrics, pass-rate thresholds, regression runs on every prompt change and every model bump.
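
To make the statistical idea concrete, here is a minimal sketch of a pass-rate check over a golden dataset. It assumes the agent can be called as a plain function from input text to an outcome label and that exact match is an acceptable score; a real suite would likely substitute a scored rubric. The names `pass_rate` and `passes_gate` are hypothetical.

```python
from typing import Callable

# A golden case pairs a real input with its known-good outcome.
GoldenCase = tuple[str, str]


def pass_rate(
    agent: Callable[[str], str],
    cases: list[GoldenCase],
    runs_per_case: int = 5,
) -> float:
    """Fraction of (case, run) pairs where the agent matched the expected outcome.

    Each case is run several times because the same input may not
    produce the same output twice.
    """
    total = len(cases) * runs_per_case
    if total <= 0:
        raise ValueError("need at least one case and one run per case")
    passed = sum(
        agent(prompt) == expected
        for prompt, expected in cases
        for _ in range(runs_per_case)
    )
    return passed / total


def passes_gate(rate: float, threshold: float) -> bool:
    """The 'green check' becomes a threshold on a rate, not a single assertion."""
    return rate >= threshold
```

Run on every prompt change and every model bump, the output is a single number that can be charted over time, and the threshold is a decision the agent's owner makes explicitly rather than something implied by a test passing once.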

Evals are not a quality nicety. They are the asset that makes everything after this phase possible. Your eval suite is what lets you upgrade models the week they ship instead of the quarter after, swap providers when the price drops, and extend the agent's scope with evidence instead of courage. Companies in the Industrialize phase, the ones scaling, are running on eval suites the way traditional companies run on accounting. No books, no business.
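
One way to picture "upgrade models the week they ship" is a comparison of per-case pass rates between the current model and the candidate on the same golden cases. The sketch below is an assumption about how such a gate could look, not a description of any particular tool; `compare_models`, `max_drop`, and `case_drop` are hypothetical names and defaults.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class Comparison:
    baseline_rate: float
    candidate_rate: float
    regressed_cases: list[str]
    ok_to_upgrade: bool


def compare_models(
    baseline: dict[str, float],   # case id -> pass rate on the current model
    candidate: dict[str, float],  # case id -> pass rate on the new model
    max_drop: float = 0.02,       # tolerated drop in the overall pass rate
    case_drop: float = 0.2,       # per-case drop that counts as a regression
) -> Comparison:
    """Compare two eval runs over the same golden cases."""
    if not baseline or baseline.keys() != candidate.keys():
        raise ValueError("both runs must cover the same non-empty set of cases")
    base_rate = mean(baseline.values())
    cand_rate = mean(candidate.values())
    regressed = sorted(
        case_id
        for case_id in baseline
        if baseline[case_id] - candidate[case_id] > case_drop
    )
    ok = (base_rate - cand_rate) <= max_drop and not regressed
    return Comparison(base_rate, cand_rate, regressed, ok)
```

Checking individual cases as well as the average matters here: a new model can improve the overall number while quietly breaking one category of input, and the list of regressed cases is what tells the owner where to look.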

Your eval suite is the asset. The agent is just this quarter's expression of it.

Exit criteria: from artifact to pattern

You've exited the Operationalize phase when the first agent stops being a story and becomes a template. Concretely: the second agent took a fraction of the time of the first because it inherited the scaffolding (evals, guardrails, escalation, observability) instead of reinventing it. There's a named owner for every agent. Error rates are charted, reviewed, and tied to an error budget someone agreed to. And the most telling sign: the line organization is asking for the next agent, with a workflow already in mind, rather than the AI team hunting for use cases.
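
For the error-budget criterion, a weekly review can be as small as the sketch below. It answers the first of the three questions (this week's error rate, and whether it is better or worse than last week's) and checks the result against an agreed budget. `WeekStats`, `review`, and the figures in the usage note are hypothetical, made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeekStats:
    label: str
    handled: int  # items the agent processed this week
    errors: int   # items later judged wrong by review or sampling

    @property
    def error_rate(self) -> float:
        # An idle week has no errors to report.
        return self.errors / self.handled if self.handled else 0.0


def review(previous: WeekStats, current: WeekStats, budget: float) -> tuple[str, bool]:
    """Return the week-over-week trend and whether the week stayed within budget."""
    if current.error_rate < previous.error_rate:
        trend = "better"
    elif current.error_rate > previous.error_rate:
        trend = "worse"
    else:
        trend = "flat"
    return trend, current.error_rate <= budget
```

With made-up numbers, `review(WeekStats("W22", 1000, 30), WeekStats("W23", 1200, 24), budget=0.025)` compares a 3% week against a 2% week and reports the trend as better and the week as within a 2.5% budget.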

What this looks like when I do it with you

Operationalize is the most hands-on stretch of an AI-native engagement, and it's where the white-glove part is most literal: I'm in the architecture, in the code review, and in the eval design. The work is converting your heroic artifact into a managed pattern: building the eval harness it never had, wrapping it in real guardrails and escalation, putting observability on it, and then extracting the scaffolding so the second and third agents take weeks, not quarters.

The other half of the work is organizational, and it's the half nobody budgets for: deciding who manages the agents. Someone in your line organization is about to become a manager of non-human workers: reviewing samples of their output, tuning their boundaries, and owning their mistakes. I've argued that every engineer is a manager now, and this phase is where that stops being a metaphor and starts being someone's actual Tuesday. Choosing those people, training them, and adjusting what success looks like for them is operating-model work disguised as an engineering project, and it's exactly the seam where I work.

Next in the series is "Industrialize: Scaling Agents Without Scaling the Chaos," which covers what happens when the agents multiply and the real costs show up. And if your proudest agent just failed my three questions, let's talk →