May 1, 2026

Non-deterministic agents still have deterministic consequences

Gagan Devagiri & Ao Fu

Co-Founders, Pome

For most of software's history, the same input produced the same output. Staging environments worked because you could replay conditions. Tests passed or failed against a fixed expectation. Debugging meant following cause and effect to the end.

Agents broke that contract on purpose. The same prompt, run twice, can take different paths and land in different final states. That's what makes an agent useful instead of a script.

But the systems agents act on didn't change. A Stripe charge still moves money. A merged pull request is still merged. A row deleted from a database is still gone. The agent is non-deterministic. The consequence is not, and that gap is where things go wrong.

CI/CD, feature flags, and rollbacks all assume the system behaves the same way on every run. Agents don't, so that infrastructure doesn't transfer.

Mocks fail from a different direction. A mock Stripe API has no concept of a frozen account, a disputed charge, or a customer who got a partial refund last week. It accepts the call and returns success. Real Stripe doesn't. You find out when the customer emails you.
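To make the difference concrete, here is a minimal sketch that contrasts a stateless mock with an in-memory fake that remembers earlier mutations. Everything in it is hypothetical: `StatelessMock`, `StatefulFake`, `Charge`, and the failure reasons are illustrative names, not Stripe's API or its real error codes.

```python
from dataclasses import dataclass, field


class StatelessMock:
    """Accepts every refund call, regardless of account history."""

    def refund(self, charge_id: str, amount: int) -> dict:
        return {"status": "succeeded", "charge": charge_id, "amount": amount}


@dataclass
class Charge:
    amount: int          # original charge, in cents
    refunded: int = 0    # total refunded so far
    disputed: bool = False


@dataclass
class StatefulFake:
    """Hypothetical in-memory stand-in that tracks account and charge state."""

    frozen: bool = False
    charges: dict[str, Charge] = field(default_factory=dict)

    def refund(self, charge_id: str, amount: int) -> dict:
        charge = self.charges[charge_id]
        if self.frozen:
            return {"status": "failed", "reason": "account_frozen"}
        if charge.disputed:
            return {"status": "failed", "reason": "charge_disputed"}
        if amount > charge.amount - charge.refunded:
            return {"status": "failed", "reason": "exceeds_refundable_amount"}
        charge.refunded += amount  # the mutation persists for later calls
        return {"status": "succeeded", "charge": charge_id, "amount": amount}


# A customer who already received a partial refund of 3000 on a 5000 charge.
mock = StatelessMock()
fake = StatefulFake(charges={"ch_1": Charge(amount=5000, refunded=3000)})

mock_result = mock.refund("ch_1", 5000)  # succeeds, because nothing is checked
fake_result = fake.refund("ch_1", 5000)  # fails, because only 2000 remains
```

The agent code is identical in both cases. Only the fake can show that a full refund after a partial one is the wrong move.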

GitHub is the same story. A mock repo accepts any merge regardless of branch protection, required reviewers, or a conflicting commit that landed ten minutes earlier from a parallel agent run. The mock passes, production rejects, the agent retries, opens a duplicate PR, and your CI queue backs up.
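The parallel-run case can be sketched the same way. The example below is a hypothetical in-memory repository, not the GitHub API: `FakeRepo`, `PullRequest`, and the rejection strings are made up for illustration, and it simplifies by treating any change to the base branch as a conflict.

```python
from dataclasses import dataclass, field


@dataclass
class PullRequest:
    branch: str
    base_head: str       # the head of main when the agent branched
    approvals: int = 0


@dataclass
class FakeRepo:
    """Hypothetical repo that enforces review rules and tracks its head."""

    head: str = "a1"
    required_approvals: int = 1
    merged: list[str] = field(default_factory=list)

    def merge(self, pr: PullRequest) -> str:
        if pr.approvals < self.required_approvals:
            return "rejected: review_required"
        if pr.base_head != self.head:
            # Simplification: a moved base always counts as a conflict.
            return "rejected: base_out_of_date"
        self.merged.append(pr.branch)
        self.head = f"{self.head}+{pr.branch}"  # main advances on every merge
        return "merged"


# Two agent runs branch from the same commit and both get approved.
repo = FakeRepo()
run_a = PullRequest("agent-run-a", base_head="a1", approvals=1)
run_b = PullRequest("agent-run-b", base_head="a1", approvals=1)

first = repo.merge(run_a)   # merges and moves main forward
second = repo.merge(run_b)  # rejected, because main moved underneath it
```

A test harness built on something like this can check whether the agent reads the rejection reason and rebases, or blindly retries and opens a duplicate PR.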

So most agent testing today happens in production. Teams watch what their agents do to real users, trace back from failures, and patch. It's a reasonable response to an infrastructure gap, but it doesn't scale, and the surface area is only getting bigger. An agent with access to your GitHub, Stripe, and support inbox can reach further into your business than most employees can. Employees usually pause and ask before doing something irreversible. An agent calls the API and moves on.

What you need before deploying an agent is a picture of what it can touch versus what it should touch. Those are different lists, and closing the gap takes more than eval scores. It takes running the agent against realistic state, and keeping a record of every tool call and mutation so you can trace a failure back to where it started.
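As a rough illustration of both ideas, the sketch below wraps an agent's tools in a ledger that records every call and separates what is reachable from what is allowed. It is not Pome's implementation. `ToolLedger`, `read_row`, and `delete_row` are hypothetical names, and the "database" is a plain dictionary.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ToolLedger:
    """Hypothetical wrapper: logs every tool call and enforces an allowlist."""

    reachable: dict[str, Callable[..., Any]]  # what the agent can touch
    allowed: set[str]                         # what the agent should touch
    records: list[dict] = field(default_factory=list)

    def call(self, name: str, **kwargs: Any) -> Any:
        # Record the attempt before doing anything, so blocked calls are traceable.
        entry = {"step": len(self.records) + 1, "tool": name, "args": kwargs}
        self.records.append(entry)
        if name not in self.allowed:
            entry["outcome"] = "blocked"
            raise PermissionError(f"{name} is not on the allowed list")
        result = self.reachable[name](**kwargs)
        entry["outcome"] = "ok"
        entry["result"] = result
        return result

    def excess(self) -> set[str]:
        """Tools the agent can reach but was never meant to use."""
        return set(self.reachable) - self.allowed


db = {"row_1": "alice"}


def read_row(key: str) -> Any:
    return db.get(key)


def delete_row(key: str) -> Any:
    return db.pop(key)


ledger = ToolLedger(
    reachable={"read_row": read_row, "delete_row": delete_row},
    allowed={"read_row"},
)

value = ledger.call("read_row", key="row_1")
try:
    ledger.call("delete_row", key="row_1")
except PermissionError:
    pass  # the attempt is still in ledger.records, marked as blocked
```

Here `ledger.excess()` is the gap between the two lists, and `ledger.records` is the trail you would follow back from a failure.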

That's the gap we're building Pome to close. Evals, mocks, and integration tests were all built for software that behaves the same way twice. Agents are different. The tooling should be too.
