QA for non-deterministic agentic systems

Traditional acceptance testing assumes deterministic outputs you can enumerate. Agentic systems produce a distribution of behaviors, and the contracts agencies signed in 2024 and 2025 are now failing their first acceptance gates because of the mismatch.

The acceptance criteria in most government and large-enterprise software contracts were written for a different kind of system. They assume that for a given input, the system produces a knowable correct output, and the test is whether the system produces it. Test cases enumerate inputs. Pass rates are reported. Acceptance is signed.

That model works for deterministic software. It does not work for agentic systems. And the contracts signed in 2024 and 2025 are hitting their first acceptance gates in 2026 with criteria copied from the old template — producing disputes that are visible now in three states we have been briefed on.

This is not a model-quality problem. It is a contract-and-process problem that engineering and operations have to fix together, because no single discipline has both the language to describe non-deterministic acceptance and the operational instinct to make it stick.

Why the deterministic playbook breaks

The deterministic acceptance model has three properties that all fail for agentic systems.

Enumerable outputs. In a deterministic system, the set of acceptable outputs for an input is a small, known set, usually one. In an agentic system, the same input can produce different valid outputs at different sampling temperatures, with different tool-use paths, against different retrieved context. The set of acceptable outputs is a distribution, not a list.

Stable pass rates. A deterministic test suite that passes today passes tomorrow unless the code changes. An agentic system can have a test suite that passes today, fails tomorrow because the upstream model version was updated, and passes again the day after because the prompt was tuned — without any change in the code the agency owns.

Output equivalence. In deterministic testing, two outputs either match or they do not. In agentic systems, two outputs can be semantically equivalent and lexically different, or lexically identical and semantically different (the same words used to mean different things in different cases). Equivalence has to be evaluated, not compared.
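To make the distinction concrete, here is a minimal sketch in Python. It assumes the agent's output can be reduced to a structured record; the AgentOutput type, its field names, and the section number "4.2" are hypothetical placeholders, not part of any real system. Real equivalence evaluation will usually need more than field comparison, but the contrast with string matching is the point.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AgentOutput:
    """Hypothetical structured output: free text plus the fields that carry the decision."""
    text: str
    recommendation: str        # e.g. "approve", "deny", "refer"
    citations: frozenset[str]  # policy sections the agent relied on


def lexically_equal(a: AgentOutput, b: AgentOutput) -> bool:
    # Deterministic-style comparison: the raw strings must match exactly.
    return a.text == b.text


def behaviorally_equivalent(a: AgentOutput, b: AgentOutput) -> bool:
    # Evaluated equivalence: same decision on the same policy basis,
    # regardless of how the explanation is worded.
    return a.recommendation == b.recommendation and a.citations == b.citations


# Different wording, same decision and basis.
run_1 = AgentOutput("Applicant meets the income test under section 4.2.",
                    "approve", frozenset({"4.2"}))
run_2 = AgentOutput("Income is below the limit in section 4.2, so approve.",
                    "approve", frozenset({"4.2"}))
# Same wording as run_1, different decision.
run_3 = AgentOutput("Applicant meets the income test under section 4.2.",
                    "deny", frozenset({"4.2"}))
```

A string comparison treats run_1 and run_2 as a mismatch and run_1 and run_3 as a match. The field-level evaluation reverses both verdicts, which is the behavior an acceptance test actually needs.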

When an acceptance criterion is written as "the system produces the correct output for ninety-five percent of test cases," it relies on all three of these properties at once, and all three fail. The result is a clause that cannot be evaluated meaningfully but cannot easily be revised mid-contract either, which is why the disputes in 2026 are going to drag.

What envelope testing looks like

The shift is from testing for a specific answer to testing whether the system stays inside an acceptable behavioral envelope. The envelope is defined by a small number of behavioral properties — usually three to six per agent — that the system must satisfy, while the specific outputs are allowed to vary.

Concretely, for an eligibility-preparation agent the envelope might be:

The agent must cite a specific policy basis for every recommendation it produces (cite-or-decline rule).
The agent must not invent a policy citation; every citation must resolve to a real section of the relevant State Plan or federal regulation.
The agent must produce the same recommendation, within tolerance, on semantically equivalent variants of the same case.
The agent must produce a different recommendation when a load-bearing fact in the case changes (sensitivity rule).
The agent must decline rather than guess when it is missing a fact it would need to make the recommendation.
The agent must never modify a determination field (a hard rule enforced at the tool layer, tested as a safety property).

Each of these is testable. Each fails in a meaningful way when violated. Together they bound the behavior space the agency cares about, without requiring that the agent produce a single specific phrasing of an answer.
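As an illustration, the envelope above can be sketched as a set of small predicates in Python. Everything here is an assumption made for the example: the shape of a run record, the field names, and the KNOWN_SECTIONS set standing in for a lookup against the real State Plan or regulation index. The consistency check uses strict equality, which is a simplification of "within tolerance."

```python
from typing import Any, Callable, Dict

# Hypothetical record of one agent run; field names are illustrative only.
Run = Dict[str, Any]

# Stand-in for a lookup against the real policy index (placeholder IDs).
KNOWN_SECTIONS = {"SP-4.2", "SP-7.1"}


def cites_or_declines(run: Run) -> bool:
    # Every recommendation carries at least one citation, unless the agent declined.
    return bool(run["declined"]) or len(run["citations"]) > 0


def citations_resolve(run: Run) -> bool:
    # No invented citations: each one must exist in the known index.
    return all(c in KNOWN_SECTIONS for c in run["citations"])


def declines_when_fact_missing(run: Run) -> bool:
    # If a required fact is absent, the only acceptable behavior is to decline.
    return bool(run["declined"]) or not run["missing_required_fact"]


def never_writes_determination(run: Run) -> bool:
    # Safety property: the determination field is never among the fields written.
    return "determination" not in run["fields_written"]


SINGLE_RUN_PROPERTIES: Dict[str, Callable[[Run], bool]] = {
    "cite_or_decline": cites_or_declines,
    "citations_resolve": citations_resolve,
    "declines_when_fact_missing": declines_when_fact_missing,
    "never_writes_determination": never_writes_determination,
}


def check_envelope(run: Run) -> Dict[str, bool]:
    # One pass/fail verdict per property, never a single blended score.
    return {name: check(run) for name, check in SINGLE_RUN_PROPERTIES.items()}


def consistent(a: Run, b: Run) -> bool:
    # Pairwise property: semantically equivalent variants get the same recommendation.
    return a["recommendation"] == b["recommendation"]


def sensitive(base: Run, changed: Run) -> bool:
    # Pairwise property: changing a load-bearing fact changes the recommendation.
    return base["recommendation"] != changed["recommendation"]


good_run = {
    "declined": False,
    "citations": ["SP-4.2"],
    "missing_required_fact": False,
    "fields_written": ["notes"],
    "recommendation": "approve",
}
```

Calling check_envelope(good_run) returns a verdict per property. Note that none of the checks looks at the wording of the agent's answer.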

This is what an acceptance suite for an agentic system actually looks like: behavioral properties, tested at scale against a holdout, with thresholds the procurement officer can hold the vendor to.

The eval is the deliverable, not the model

The hardest mental shift, particularly for procurement and program leadership coming from the deterministic world, is that the agent itself is not the thing being purchased. The eval harness is.

An agent without a runnable, agency-owned eval harness is a black box. Performance claims cannot be verified. Drift cannot be detected. Re-evaluation against new policy guidance is impossible without re-engaging the vendor at billable rates. The vendor's commercial incentive is to keep this dependency, which means it is the procurement officer's job to break it.

The contract clause that does the work is short:

The vendor shall deliver, as a Government-data-rights deliverable prior to acceptance, an eval harness consisting of (a) a behavioral specification of the agent in machine-readable form, (b) a holdout dataset representative of the agent's deployment population, and (c) a runnable evaluation script that produces a per-property pass/fail report from a fresh execution of the agent against the holdout. The Government shall retain unrestricted rights to execute, modify, and extend the eval harness for the life of the contract and any successor contracts.

That clause turns the eval from an internal vendor artifact into a contractual deliverable. The agency now owns the only tool by which the agent's performance can be measured, and that tool can be re-run by anyone with access — including the OIG, the next vendor, and the agency's own engineering team.
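A rough sketch of what the three parts of that deliverable could look like, in Python. This is an illustration under stated assumptions, not a reference implementation: the JSON layout, the property names, the min_pass_rate thresholds, the case fields, and stub_agent are all invented for the example.

```python
import json
from typing import Any, Callable, Dict, List

Case = Dict[str, Any]
Output = Dict[str, Any]

# (a) Behavioral specification in machine-readable form (hypothetical layout;
#     the thresholds are placeholders, not recommended values).
SPEC_JSON = """
{
  "agent": "eligibility-preparation",
  "properties": [
    {"name": "cite_or_decline", "min_pass_rate": 1.0},
    {"name": "declines_when_fact_missing", "min_pass_rate": 0.95}
  ]
}
"""

# (b) Holdout dataset; two toy cases stand in for a representative sample.
HOLDOUT: List[Case] = [
    {"id": "c1", "missing_required_fact": False},
    {"id": "c2", "missing_required_fact": True},
]

# Property checks keyed by the names used in the specification.
CHECKS: Dict[str, Callable[[Case, Output], bool]] = {
    "cite_or_decline":
        lambda case, out: out["declined"] or bool(out["citations"]),
    "declines_when_fact_missing":
        lambda case, out: out["declined"] or not case["missing_required_fact"],
}


# (c) Runnable evaluation: fresh execution, per-property pass/fail report.
def run_harness(agent: Callable[[Case], Output],
                holdout: List[Case],
                spec: Dict[str, Any]) -> Dict[str, Dict[str, Any]]:
    outputs = [(case, agent(case)) for case in holdout]  # run the agent now, not from cache
    report: Dict[str, Dict[str, Any]] = {}
    for prop in spec["properties"]:
        check = CHECKS[prop["name"]]
        failures = [case["id"] for case, out in outputs if not check(case, out)]
        pass_rate = 1 - len(failures) / len(outputs)
        report[prop["name"]] = {
            "pass_rate": pass_rate,
            "passed": pass_rate >= prop["min_pass_rate"],
            "failed_case_ids": failures,
        }
    return report


def stub_agent(case: Case) -> Output:
    # Placeholder for the vendor's agent: declines when a required fact is missing.
    if case["missing_required_fact"]:
        return {"declined": True, "citations": []}
    return {"declined": False, "citations": ["SP-4.2"]}


report = run_harness(stub_agent, HOLDOUT, json.loads(SPEC_JSON))
```

Because the agent is passed in as a plain callable, the same harness can be pointed at a new model version or a successor vendor's agent without changing the specification or the holdout.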

Continuous verification: the production gap that operations sees first

Operations teams notice agent behavior degrading before anyone else does. They see the case routing change. They see reviewers pushing back on the agent's recommendations more often. They see processing time slip. By the time engineering instruments a drift detector, operations has already been managing around the regression for weeks.

The right pattern is to make operations a first-class input to the eval. The behavioral properties from the acceptance envelope continue to run against live traffic, not just the holdout, and the operations team sees the property-by-property pass rate on a dashboard that updates in production. When a property starts failing more often, the alert goes to operations and engineering at the same time, with the specific cases that triggered it surfaced for review.

Three dashboard capabilities matter operationally:

Per-property pass rates over time. Not a single accuracy number. A trend per property, so a regression in cite-or-decline is distinguishable from a regression in sensitivity.

Case-level attribution. When a property starts failing more often, the specific cases that failed are queryable. Operations can read them in plain language, not just in metrics.

Reviewer-override correlation. A drop in reviewer acceptance rate is often the earliest signal of an agent regression, and it is a signal operations has, not engineering. Connecting reviewer feedback into the eval dashboard turns operations into the system's early-warning sensor instead of the system's complaint line.
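The first two of these can be sketched in a few lines of Python. The Result record, the "day-1" style labels, and the floor threshold are assumptions made for the example; a production version would read from whatever store the live property checks write to.

```python
from collections import defaultdict
from typing import Any, Dict, List, NamedTuple


class Result(NamedTuple):
    day: str      # reporting period label (hypothetical format)
    case_id: str
    prop: str     # behavioral property name
    passed: bool


def daily_pass_rates(results: List[Result]) -> Dict[str, Dict[str, float]]:
    # One trend per property, so regressions in different properties stay distinguishable.
    counts = defaultdict(lambda: [0, 0])  # (prop, day) -> [passed, total]
    for r in results:
        counts[(r.prop, r.day)][0] += int(r.passed)
        counts[(r.prop, r.day)][1] += 1
    trend: Dict[str, Dict[str, float]] = {}
    for (prop, day), (n_passed, n_total) in counts.items():
        trend.setdefault(prop, {})[day] = n_passed / n_total
    return trend


def alerts(results: List[Result], floor: float) -> List[Dict[str, Any]]:
    # Flag each property/day below the floor and attach the failing cases for review.
    found = []
    for prop, by_day in daily_pass_rates(results).items():
        for day, rate in sorted(by_day.items()):
            if rate < floor:
                failed = [r.case_id for r in results
                          if r.prop == prop and r.day == day and not r.passed]
                found.append({"property": prop, "day": day,
                              "pass_rate": rate, "failed_case_ids": failed})
    return found


sample = [
    Result("day-1", "a", "cite_or_decline", True),
    Result("day-1", "b", "cite_or_decline", True),
    Result("day-2", "c", "cite_or_decline", False),
    Result("day-2", "d", "cite_or_decline", True),
    Result("day-2", "e", "sensitivity", True),
]
```

With a floor of 0.9, alerts(sample, 0.9) flags only cite_or_decline on day-2 and names case "c" as the failure. One possible way to fold in the reviewer signal is to log reviewer overrides as Result rows under their own property name, so they trend on the same dashboard.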

The acceptance gap that produces disputes

Agencies signed contracts in 2024 and 2025 that did not have any of this. The acceptance language reads like deterministic-software acceptance, the vendor delivered against the deterministic-software interpretation of the clause, and now the agencies are discovering they cannot independently verify that the agent does what the contract said.

The vendor's position is that they hit the agreed acceptance criteria. The agency's position is that the agreed acceptance criteria did not actually verify the behavior they cared about. Both are true. The dispute is being negotiated in real time across multiple states.

The fix for the contracts being signed now is straightforward and pre-baked into the procurement library: the acceptance clause names the behavioral envelope, the eval harness is a deliverable, and the agency-owned re-evaluation cadence is a contract obligation. The fix for the contracts already in flight is harder. It usually requires a contract modification that the vendor will resist, because the modification changes what they are being measured against. In the cases we have seen, the leverage to push that modification through has come from the program office's existing audit-readiness obligations, which the vendor cannot defend against without producing the same artifacts the modification asks for.

What to do Monday

Open the active SOW for any in-flight or recently signed agentic-AI engagement. Find the acceptance section. Read it.

If the acceptance language uses words like "correct output," "correct result," or "accuracy" without naming the behavioral properties being measured, you are in the dispute zone. Document this finding before you take any further action.
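For a long SOW, a first pass can be automated. The sketch below is a crude heuristic in Python, not a substitute for reading the clause: the outcome terms come from the paragraph above, while the list of "property marker" words is an assumption chosen for illustration and would need tuning per contract.

```python
import re
from typing import List

# Outcome-only terms named in the text above.
OUTCOME_TERMS = re.compile(r"\bcorrect (?:output|result)s?\b|\baccuracy\b", re.IGNORECASE)

# Hypothetical markers suggesting that a behavioral property is actually named.
PROPERTY_MARKERS = re.compile(
    r"\b(?:cite|citation|decline|holdout|behavioral property)\b", re.IGNORECASE)


def flag_clauses(sow_text: str) -> List[str]:
    # Split on sentence or clause endings, then keep clauses that use an
    # outcome term without naming any behavioral property.
    clauses = [c.strip() for c in re.split(r"(?<=[.;])\s+", sow_text) if c.strip()]
    return [c for c in clauses
            if OUTCOME_TERMS.search(c) and not PROPERTY_MARKERS.search(c)]


sample_sow = (
    "The system shall produce the correct output for ninety-five percent of test cases. "
    "The system shall cite a specific policy basis for every recommendation, "
    "with accuracy measured against a holdout."
)
flagged = flag_clauses(sample_sow)
```

Here only the first sentence of sample_sow is flagged. Anything the scan flags still needs a human read before it goes into the documented finding.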

Rewrite one acceptance criterion as a behavioral property. The smallest meaningful change is to take a single criterion (say, "the system shall correctly identify eligible cases") and replace it with two behavioral properties (say, "the system shall cite a specific policy basis for every recommendation" and "the system shall not invent policy citations, measured against a holdout"). The vendor will push back on this, and that push-back is the conversation worth having.

If the vendor cannot deliver an eval harness as a contract deliverable, you do not have a system you can verify. Find that out now, not at acceptance.

Where Vardr fits

We rewrite acceptance criteria for agentic systems before they are signed, and we rewrite them after they are signed when the dispute has already started. The Vardr Procurement Language Library has the behavioral-envelope clauses pre-staged and tested against multiple state and federal templates. The eval harness pattern is a reference architecture we have built against three agentic agency deployments to date. The fastest path to acceptance is upstream, in the solicitation. The next fastest path is the modification, and we have run that play before.