AI-Native EngineeringJun 1, 2026 · 11 min readUpdated Jul 6, 2026

How to Test Non-Deterministic AI Agents Beyond Green CI

You can't test an AI agent by asserting on strings, and you can't trust a green build either. You test the behavior, by replaying real histories, injecting the exact RAG context, and grading the tool calls, and you test it adversarially, because a determined nine-year-old is a better red team than your pipeline.

Oshri Cohen

Chief Product & Technology Officer

Green CIStill failed

Picture the support call you never want to take. You shipped an AI agent that talks to kids, a tutor, a homework helper, a game companion. Your pipeline is green. Every test passed. And a nine-year-old just spent a rainy afternoon poking at it until it swore at him, and his parent is now on the phone. Try telling that parent your CI/CD was fine. "The build passed" is not a sentence you can say to someone whose child was just cursed at by your product.

That gap, between a green pipeline and an agent that actually behaves, is the whole problem with testing non-deterministic systems. The traditional reflex is to assert that output equals an expected string. That breaks on day one, because the model phrases the same correct answer ten different ways. But the deeper failure is subtler and far more dangerous: your tests were checking the wrong thing entirely. They were checking your code. The part that changed, and the part that swore, was the model.

You don't make the model deterministic, you can't. Here is the method in one sentence: you test a non-deterministic AI agent by making the test deterministic, replaying real conversations with the model version, retrieved context, and tools pinned, grading the tool calls instead of the prose, and gating releases on a scored pass rate rather than a green build. Then you test the conditions that actually break agents in the wild: long, hostile conversations. Those are two different jobs, and most teams do neither.

Test the behavior, by replaying the exact situation

An agent's output isn't produced by the prompt alone. It's produced by the prompt plus the conversation history, plus whatever your retrieval layer stuffed into the context, plus the tools it had available. Change any of those and you change the behavior. So if you want a deterministic test of behavior, you have to reproduce all of it, not a clean toy prompt, but the actual situation the agent was in.

That's the core technique: capture real interactions and replay them with everything pinned. Same test cases. Same conversation history. The exact RAG context, injected verbatim rather than re-retrieved live. The same tool definitions. Now the only variable left is the model's judgment, and everything around it is as repeatable as a calculator. You've turned "the agent does something different every run" into "the agent faces the identical situation every run, and here is what it did."

Pin the model version. "Latest" is not a dependency you want changing silently under your tests. Use a dated snapshot and upgrade on purpose.
Inject the RAG context, don't re-retrieve it. Freeze the exact documents that were in the window. A test that re-runs retrieval is testing your search index's mood today, not the agent's behavior.
Replay the full conversation history, turn for turn, so the agent is in the same state it was in when it misbehaved, not a sanitized first message.
Fix temperature, seeds, and tool definitions as part of the fixture. Shrink the random surface to the one irreducible thing: the model's decision.

You don't reproduce a bug by asking the agent the same question. You reproduce it by putting the agent back in the same situation, same history, same retrieved context, same tools.

Grade the tool calls, not the prose

One thing stays deterministic even when the words don't: what the agent did. An agent that talks to a child should, when asked to do something off-limits, refuse, and often that refusal is a tool call, a routing decision, or a guardrail firing rather than a turn of phrase. You can check those exactly, every time.

So stop grading the paragraph and start grading the actions. Did it call the safety classifier before responding? Did it route an out-of-policy request to the refusal path instead of answering it? Did it stay inside the tools it was allowed to touch and never reach for the one it wasn't? Did the profanity filter run on the way out? Two answers can be worded differently and both be fine; but "called the escalation tool" versus "didn't" is a binary you can assert on with total confidence.

Tool selection: given this input and history, did the agent call the right tool, with arguments that match the schema?
Forbidden actions: did it avoid the tools and paths it had no business using? Negative assertions matter as much as positive ones.
Guardrail invocation: did the safety check, the PII filter, the refusal path actually fire when the situation called for it?
Structured outputs over prose: where you can make the agent emit typed fields instead of free text, do, a JSON decision is checkable; a paragraph is an argument.
Grounding: every claim traces to the injected context. A statement that isn't supported by what you fed it is a hallucination, and it's deterministically detectable.

This is how you test the behavior without demanding identical words. You assess the agent the way you'd assess an employee: not on whether they recited the same sentence, but on whether they followed the process, used the right tools, and stayed inside their authority. The transcript varies; the contract doesn't.

Now test the situations that actually break agents

Replaying known histories protects you against regressions you've already seen. It does nothing for the failure that hasn't happened yet, and with agents, the dangerous failures don't look like a wrong answer to a clean question. They look like a system slowly pushed off the rails by an adversary who has all afternoon. This is why agents need a different class of end-to-end test, one most teams have never written: the agent has to survive a hostile, multi-turn, full-length conversation, not a tidy one-shot prompt.

The child is the red team

A determined nine-year-old trying to make your bot swear is a more creative adversary than your test suite, and infinitely more persistent. Kids will rephrase, role-play, bargain, spell things out letter by letter, ask the bot to "pretend," and try the same trick forty times with tiny variations. Your e2e tests have to do the same: long, adversarial conversations that probe for profanity, cheating, unsafe advice, and every "just this once" the model might cave to. One polite test prompt proves nothing about what happens on turn sixty.

Context saturation is the exploit nobody tests

What turns a persistent kid into a successful one is mechanical. Your safety rules live in the system prompt, near the top of the context. As a conversation grows long, and an adversarial one grows long fast, it fills the window. The model's effective attention on those early instructions weakens, the guardrails get buried under a mountain of the user's own text, and eventually the thing that was "smart" enough to refuse on turn three forgets it was ever told to. The kid didn't outwit the model. He saturated it.

Context saturation is when a long conversation buries the system prompt and the guardrails quietly stop firing. If you only test short conversations, you only test the agent at its best.

So you have to test the agent at saturation on purpose. Build conversations that fill the context with adversarial filler and then make the off-limits request, and confirm the guardrail still holds when the system prompt is no longer the loudest thing in the room. This is a deterministic test of a real attack: same long history, injected verbatim, replayed, and a hard assertion that the refusal still fires. If it doesn't, you've found your swear-at-a-child bug in CI instead of on a phone call.

The model changed and nobody told you

The other silent killer: the provider ships an update. Your code didn't change, your tests didn't change, your green checkmark didn't change, but the model did, and the behavior you carefully validated last month is gone. Maybe it's better. Maybe it's now "not as smart" on exactly the adversarial cases you cared about. A model is a dependency that can change its behavior without changing a single line of your code, which is precisely why a passing build tells you nothing about it unless your build actually exercises the model.

Pin the version, and treat every model upgrade as a release that must pass the full adversarial suite before it ships. The eval suite is the gate; the model bump waits behind it like any other risky change.

"Always deliver correctly" is a number, not a checkmark

You need to be able to say your agents will always behave, and you can't prove "always" with a single green run of a non-deterministic system. You prove it the way a factory proves quality: you run the adversarial cases many times, you score them, and you gate on a threshold you chose deliberately. The deliverable isn't pass/fail. It's a pass rate, held above a line, watched over time.

Build a golden set of real and adversarial cases, captured incidents, red-team transcripts, saturation attacks, with known-correct behavior. This is your most valuable testing asset; it's your spec and your regression suite in one.
Run each case many times, not once. A single run of a stochastic system tells you almost nothing. Twenty runs tell you a distribution.
Grade on behavior and tool calls, with deterministic checks where you can and a narrow LLM-as-judge, pinned and spot-checked against human labels, only where the criterion is genuinely qualitative.
Gate hard on the safety cases. For "does it ever swear at a child," the acceptable rate is zero, and that's a release-blocking threshold, not a metric on a dashboard.
Track the trend and re-run the whole suite on every model bump, so a silent regression shows up as a falling number instead of a phone call.

Use a smarter model to judge, and only run it when it matters

Now the part nobody wants to budget for. Grading whether an agent subtly cheated, gave unsafe advice, or crossed a line on turn sixty of a hostile conversation is harder than the agent's own job. A cheap classifier won't catch the clever failures, the ones a smart kid engineers and a lawyer later reads aloud in a deposition. So your judge in CI/CD should be a more capable model than the one you ship, run at low temperature against a strict rubric. You're using a smarter, more expensive grader to sit in judgment of a cheaper production agent, the way you'd have a senior reviewer sign off on a junior's work.

Yes, that costs real money. Yes, it is dramatically slower, many cases, many runs each, graded by a frontier model. That's not a reason to skip it; it's a reason to run it when it can actually catch something. This suite has no business firing on every commit to a README or a CSS tweak. It should run when the thing it tests changes: the agent code, the prompts, the tool definitions, the guardrails, the retrieval config, and when you bump the model version. Gate it on those paths and it goes from "too expensive and slow to keep" to "runs exactly when the behavior could have moved."

Judge with a stronger model than production, pinned and low-temperature, spot-checked against human labels so your ruler doesn't drift.
Trigger on change, not on schedule: agent code, prompts, tool specs, guardrails, RAG config, or model version. A docs commit shouldn't pay for a frontier eval run.
Keep the fast deterministic tests on every push, replays, tool-call assertions, schema checks, and reserve the slow, expensive behavioral gate for the changes that warrant it.
Make it a hard release gate, not an advisory job someone learns to ignore. Slow and blocking beats fast and decorative.

Any number on that invoice is far lower than a lawsuit, or your product on CNN under the very catchy tagline "AI hurts students."

Frame the cost honestly and it stops being a debate. A frontier-model eval suite that runs on every meaningful change is a rounding error next to a single regulatory inquiry, a class action, or a news cycle with your logo next to a crying parent. The expensive option was never the smart grader. The expensive option is finding out in public.

The platform's guardrails are not your alibi

A fair objection at this point: doesn't the platform handle a lot of this? Your system prompt is careful. AWS Bedrock Guardrails will filter profanity and block whole topics. AgentCore gives you isolation, identity, and policy around the agent. All of it is genuinely good, and you should use every bit of it. But none of it transfers the responsibility off your desk. When a kid gets cursed at, "Bedrock was supposed to catch that" is not a defense you get to make, to the parent, to a regulator, or to yourself. You shipped the agent, so you own how it behaves no matter how many of the vendor's features you switched on.

Guardrails are a layer, and layers have gaps. A topic filter doesn't understand a clever euphemism, a policy engine doesn't know your product's specific definition of "cheating," and a managed runtime doesn't notice that context saturation just quietly defanged the very rules you configured. The only thing that actually tells you the assembled system behaves is testing the assembled system, your prompt, your tools, your retrieval, and the platform's guardrails, all wired together exactly as they run in production.

The vendor owns a feature. You own the outcome. When the parent calls, no guardrail product picks up the phone for you.

Which is why the suite can't live only in a clean test harness against a mocked model. You have to run it against the real, live, deployed agent, the actual endpoint, with the actual guardrails switched on, the actual retrieval, the actual tools, and you have to do it every single time you deploy. A deploy changes the running system whether or not it changed your code: a guardrail policy gets updated, an infra default shifts, a dependency moves, the provider swaps the model underneath you. Real live testing on every deploy is the one check that proves the thing your users will actually touch still refuses, still routes, still holds the line. Everything upstream of it is a prediction about a system you haven't deployed yet; the post-deploy run is where you find out if the prediction held.

Put it together and you get a suite with a clear shape: fast deterministic replays of real histories with injected RAG context and graded tool calls on every push, adversarial e2e conversations that push to saturation, a slower, smarter, statistical safety gate that runs whenever the agent's behavior could have changed, and a real live test fired against the deployed agent on every deploy. Each layer is deterministic about a different thing, and all of them measure behavior instead of admiring a build.

Because the unit of acceptance was never the pipeline. It's the parent on the phone. Green CI that ships a model swearing at a kid isn't a passing test, it's a passing test of the wrong thing, which is worse than no test at all, because it bought you confidence you hadn't earned. Test the behavior, replay the real situation, grade the actions, and push the agent as hard as a bored nine-year-old will. Then the green check means what you always assumed it meant: the agent will do its job, every time, even when someone is trying very hard to make it fail.

Building this eval discipline into a team is a large part of what I do when I help a company become AI-native. If your agents are shipping on green CI and a prayer, let's talk →