Evals as procurement deliverables

If you cannot describe how a vendor's system will be measured after award, you have bought a black box. Eval criteria belong in the solicitation, not the post-award honeymoon — and writing them well is a joint engineering-and-procurement act.

Government procurement evolved against a category of software whose behavior is fully specified before the contract is signed. The system shall do X. The vendor delivers a system that does X. The acceptance test confirms the system does X. The contract closes.

This model breaks the moment the system in question learns, updates, or composes its behavior from a model the vendor does not fully own. The acceptance gate is no longer a yes/no on whether the system does X. It is a question about whether the system continues to do X across model updates, data drift, and population shifts the agency cannot control. The question is unanswerable without a runnable eval.

The clean version of this argument is one sentence: if you cannot independently measure the system after award, you cannot hold the vendor to anything. The clean fix is also one sentence: the eval is a contract deliverable, written into the solicitation, with government data rights.

The rest of this piece is what the clean version is too short to cover — what the eval has to actually contain, what the contractual clauses look like, and how to get the eval through procurement review.

The acceptance gap GAO has been pointing at

GAO has been raising performance-measure concerns in its AI work since 2023. The 2025 follow-up reports continue to identify federal agencies that have signed AI contracts without a defined way to measure whether the deployed system is performing. In the past year the issue has shifted from a procurement concern to an OIG concern. Several agencies are now writing remediation plans for systems that have been in production for over a year without a runnable performance baseline.

The agencies caught flat-footed are the ones that did one of two things wrong. They either accepted vendor-defined performance criteria that they could not independently verify, or they wrote the performance criteria as a deterministic acceptance test that does not apply to a system whose outputs are non-deterministic. Both produce the same operational outcome: a deployed system the agency cannot defend at audit.

The fix is upstream. Once a contract is signed without enforceable eval criteria, the leverage to add them after the fact is limited and expensive. The agencies that get this right in 2026 are the ones putting the eval in the solicitation.

What the eval has to contain

An eval is not a document. It is three artifacts the vendor delivers alongside the system itself, all of which must be runnable by the agency without vendor assistance.

A behavioral specification, machine-readable. A versioned, structured artifact that lists the behavioral properties the system must exhibit. Each property is independently testable. For an eligibility-preparation agent, properties might include cite-or-decline, citation-resolves, sensitivity-to-load-bearing-facts, decline-when-missing-information, and never-mutate-determination. The list is short (typically three to six properties) and is the artifact against which the system is measured.
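As an illustration only, a behavioral specification of this kind could be expressed as a small versioned structure that serializes to JSON. The field names, the system name, the one-line descriptions, and the threshold values below are all assumptions made for the sketch; only the five property identifiers come from the text above.

```python
import json
from dataclasses import dataclass, asdict


@dataclass(frozen=True)
class BehavioralProperty:
    property_id: str      # stable identifier that reports refer to
    description: str      # plain-language paraphrase (illustrative wording)
    min_pass_rate: float  # hypothetical threshold; set per program


@dataclass(frozen=True)
class BehavioralSpec:
    system: str
    version: str          # the spec is versioned alongside the system
    properties: tuple

    def to_json(self) -> str:
        # asdict() recurses into the nested dataclasses; tuples become JSON lists
        return json.dumps(asdict(self), indent=2)


SPEC = BehavioralSpec(
    system="eligibility-preparation-agent",  # hypothetical system name
    version="1.0.0",
    properties=(
        BehavioralProperty("cite-or-decline",
                           "Every answer cites a source or declines.", 0.99),
        BehavioralProperty("citation-resolves",
                           "Every cited source can be located.", 0.99),
        BehavioralProperty("sensitivity-to-load-bearing-facts",
                           "Changing a decisive fact changes the output.", 0.95),
        BehavioralProperty("decline-when-missing-information",
                           "Declines when required information is absent.", 0.99),
        BehavioralProperty("never-mutate-determination",
                           "Never writes to a determination field.", 1.0),
    ),
)

print(SPEC.to_json())
```

Keeping the identifiers stable across spec versions is what lets a report produced next year be compared with the one produced at acceptance.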

A holdout dataset, representative of the deployment population. Not the vendor's training data. Not a synthetic suite assembled for demonstration. A holdout that reflects the actual population the system will see in production — including the demographic distribution, the messy data quality, the edge cases, and the population shifts that have already been observed in the historical record. The dataset is owned by the agency, with the vendor providing schema documentation and any data necessary to interpret it correctly.

A runnable evaluation script. Not a PDF describing how to evaluate. Actual code that takes the system and the holdout and produces a per-property pass/fail report. The script runs in the agency's environment, against a current version of the system, on demand. The output is a structured report with property-by-property breakdowns, demographic-group breakouts where the regulations require them, and pointers to specific cases that failed each property.
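A minimal sketch of what such a script might look like, assuming the system can be called as a function on one case and that each property is implemented as a check over the case and the output. The stub system, the case fields (`case_id`, `has_source`), and the output fields (`declined`, `citations`) are hypothetical stand-ins, not a real vendor interface.

```python
import json
from typing import Callable, Dict, List

Case = Dict[str, object]
Output = Dict[str, object]
Check = Callable[[Case, Output], bool]


def evaluate(system: Callable[[Case], Output],
             holdout: List[Case],
             checks: Dict[str, Check]) -> Dict[str, dict]:
    """Run every holdout case through the system and score each property."""
    report = {pid: {"passed": 0, "total": 0, "failed_case_ids": []}
              for pid in checks}
    for case in holdout:
        output = system(case)  # fresh execution, not a stored vendor result
        for pid, check in checks.items():
            entry = report[pid]
            entry["total"] += 1
            if check(case, output):
                entry["passed"] += 1
            else:
                # keep a pointer to the specific case that failed this property
                entry["failed_case_ids"].append(case["case_id"])
    for entry in report.values():
        entry["pass_rate"] = (entry["passed"] / entry["total"]
                              if entry["total"] else 0.0)
    return report


def cite_or_decline(case: Case, output: Output) -> bool:
    # Passes when the system declined or supplied at least one citation.
    return bool(output["declined"]) or len(output["citations"]) > 0


def stub_system(case: Case) -> Output:
    # Hypothetical stand-in for the vendor system.
    if case["has_source"]:
        return {"declined": False, "citations": ["doc-1"]}
    # Case c3 answers without citing anything, to show a recorded failure.
    return {"declined": case["case_id"] != "c3", "citations": []}


holdout = [
    {"case_id": "c1", "has_source": True},
    {"case_id": "c2", "has_source": False},
    {"case_id": "c3", "has_source": False},
]
report = evaluate(stub_system, holdout, {"cite-or-decline": cite_or_decline})
print(json.dumps(report, indent=2))
```

The report is plain structured data, so the same output can feed an acceptance decision, a monitoring dashboard, or an audit file.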

Together, these three are the eval harness. The harness is the deliverable. The system itself is a passenger on the deliverable.

The contract clause that does the work

A single clause, dropped into the solicitation, changes the entire shape of the acceptance gate:

The Vendor shall deliver, as a Government-data-rights deliverable prior to acceptance, an Eval Harness consisting of (a) a Behavioral Specification of the System in machine-readable form enumerating the behavioral properties against which the System shall be measured; (b) a Holdout Dataset representative of the deployment population, with full data documentation; and (c) a Runnable Evaluation Script that produces a per-property pass/fail report from a fresh execution of the System against the Holdout Dataset. The Government shall retain unrestricted rights, in perpetuity, to execute, modify, and extend the Eval Harness for the purposes of acceptance testing, ongoing performance monitoring, re-evaluation following Model Updates or Population Shifts, and transition to a successor contract.

The clause has three load-bearing properties.

It names what is delivered (the three components of the harness, not vague performance criteria). It names the data-rights posture (Government rights, in perpetuity). And it names the use cases (acceptance, ongoing monitoring, re-evaluation, transition) — closing the loophole that would let the vendor argue the harness is for one-time acceptance only and not for subsequent use.

This is the smallest meaningful unit of language that turns the eval from theater into a working contract.

Fairness and false-positive cost, written contractually

Benefits programs care about fairness and false-positive cost in ways that general-purpose evals do not capture. A model that is 95% accurate on a SNAP-eligibility task can still be unacceptable if the 5% of misses are concentrated in a protected subgroup, or if most of those misses are wrongful denials rather than wrongful approvals.

The contract has to name this.

The behavioral specification must require per-property pass rates broken out by demographic group when the relevant regulations apply (the protected classes under 7 CFR § 272.6(a) for SNAP, the categories under 42 CFR § 435.901 for Medicaid eligibility, the limited-English-proficiency requirements under Executive Order 13166, and similar). The pass rate is not a single number; it is a vector. A pass rate that meets the threshold in aggregate but fails the threshold within a protected group is a fail, not a pass.
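The "vector, not a number" rule can be sketched as follows. The group labels, the 0.95 threshold, and the case counts are invented for illustration; the point is only that the acceptance decision looks at every group as well as the aggregate.

```python
from collections import defaultdict


def pass_rates_by_group(results, group_key):
    """Return {group: pass rate} for one property."""
    counts = defaultdict(lambda: [0, 0])  # group -> [passed, total]
    for r in results:
        tally = counts[r[group_key]]
        tally[0] += int(r["passed"])
        tally[1] += 1
    return {g: passed / total for g, (passed, total) in counts.items()}


def accept(results, group_key, threshold):
    """Accept only if the aggregate AND every group meet the threshold."""
    overall = sum(int(r["passed"]) for r in results) / len(results)
    by_group = pass_rates_by_group(results, group_key)
    failing = sorted(g for g, rate in by_group.items() if rate < threshold)
    return {
        "overall": overall,
        "by_group": by_group,
        "failing_groups": failing,
        "accepted": overall >= threshold and not failing,
    }


# Illustrative data: group A passes 89 of 90 cases, group B passes 7 of 10.
results = (
    [{"group": "A", "passed": i < 89} for i in range(90)]
    + [{"group": "B", "passed": i < 7} for i in range(10)]
)
decision = accept(results, "group", threshold=0.95)
print(decision)
```

Here the aggregate pass rate is 96%, which clears the threshold, but group B sits at 70%, so the property is reported as a fail.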

The behavioral specification must also name the cost of different failure modes. A false denial is not the same as a false approval. The eval reports each separately, and the acceptance criteria treat them separately. The asymmetry is a load-bearing property of the system, not an afterthought.
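One way to make the asymmetry concrete is to compute the two error rates separately and give each its own limit. The record fields and the limits (1% for false denials, 5% for false approvals) are hypothetical values chosen for the example, not figures from any regulation or from this article.

```python
def error_breakdown(records):
    """Report false denials and false approvals as separate rates."""
    eligible = [r for r in records if r["truth"] == "eligible"]
    ineligible = [r for r in records if r["truth"] == "ineligible"]
    false_denials = sum(r["predicted"] == "ineligible" for r in eligible)
    false_approvals = sum(r["predicted"] == "eligible" for r in ineligible)
    return {
        "false_denial_rate": false_denials / len(eligible) if eligible else 0.0,
        "false_approval_rate": (false_approvals / len(ineligible)
                                if ineligible else 0.0),
    }


def meets_criteria(breakdown, max_false_denial_rate, max_false_approval_rate):
    # Each failure mode has its own limit; neither can offset the other.
    return (breakdown["false_denial_rate"] <= max_false_denial_rate
            and breakdown["false_approval_rate"] <= max_false_approval_rate)


# Illustrative data: 3 of 100 eligible cases denied, 4 of 100 ineligible approved.
records = (
    [{"truth": "eligible", "predicted": "ineligible" if i < 3 else "eligible"}
     for i in range(100)]
    + [{"truth": "ineligible", "predicted": "eligible" if i < 4 else "ineligible"}
       for i in range(100)]
)
breakdown = error_breakdown(records)
ok = meets_criteria(breakdown,
                    max_false_denial_rate=0.01,    # hypothetical limit
                    max_false_approval_rate=0.05)  # hypothetical limit
print(breakdown, ok)
```

Overall accuracy in this example is 96.5%, yet the check fails because the false-denial rate alone exceeds its limit.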

These clauses can be pre-staged in the solicitation. They are not exotic. The reason most solicitations omit them is that the program office did not write them and the vendor would prefer they not be there.

Re-evaluation triggers

Acceptance is a moment. Operations is a continuous obligation. The contract has to name when re-evaluation is required and who pays for it.

Three triggers reliably matter.

Model updates from the vendor or upstream provider. If the vendor or any upstream model provider updates the model the system depends on, the system must be re-evaluated against the holdout before the update takes effect in production. This is not optional, and it is not a billable change request. It is contractually owed.
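A hedged sketch of how the "re-evaluate before the update takes effect" rule might be enforced in a deployment pipeline: fingerprint every model component the system depends on, and allow promotion only when a passing eval report exists for that exact fingerprint. The component names, version strings, and the `all_properties_passed` field are assumptions for illustration.

```python
import hashlib
import json


def fingerprint(components: dict) -> str:
    """Hash the full set of model components the system depends on."""
    canonical = json.dumps(components, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def may_promote(candidate_fingerprint: str, eval_reports: dict) -> bool:
    """Allow promotion only if this exact configuration passed the eval."""
    report = eval_reports.get(candidate_fingerprint)
    return report is not None and report["all_properties_passed"]


# Configuration that was evaluated at acceptance (hypothetical identifiers).
accepted = fingerprint({"vendor_build": "2.3.0", "upstream_model": "model-a"})
eval_reports = {accepted: {"all_properties_passed": True}}

# The upstream provider ships a new model; the fingerprint changes.
updated = fingerprint({"vendor_build": "2.3.0", "upstream_model": "model-b"})

print(may_promote(accepted, eval_reports))  # configuration already evaluated
print(may_promote(updated, eval_reports))   # blocked until re-evaluated
```

Because the fingerprint covers upstream components as well as the vendor's own build, an upstream model swap cannot reach production unnoticed.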

Data drift in the population. If the population distribution shifts beyond a defined threshold, re-evaluation is required. We have used an 8% shift in the load-bearing feature distributions as a starting point; the threshold is settable per program, and the contract should name the drift metric it is measured with. The drift detection is itself part of the deliverable.
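The article does not say which drift metric sits behind the 8% figure. The sketch below assumes total variation distance over categorical feature distributions, which is one simple choice among several; the feature names and sample data are invented.

```python
from collections import Counter


def distribution(values):
    """Empirical distribution of a categorical feature."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def total_variation_distance(p, q):
    """Half the sum of absolute differences between two distributions."""
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)


def drifted_features(baseline, current, threshold=0.08):
    """Return {feature: distance} for features whose drift exceeds the threshold."""
    flagged = {}
    for feature, base_values in baseline.items():
        distance = total_variation_distance(
            distribution(base_values), distribution(current[feature]))
        if distance > threshold:
            flagged[feature] = distance
    return flagged


# Illustrative data: household size shifts by 10 points, language by 2.
baseline = {
    "household_size": ["1"] * 50 + ["2"] * 50,
    "language": ["en"] * 90 + ["es"] * 10,
}
current = {
    "household_size": ["1"] * 40 + ["2"] * 60,
    "language": ["en"] * 88 + ["es"] * 12,
}
flagged = drifted_features(baseline, current)
print(flagged)  # only household_size exceeds the 0.08 threshold
```

Whatever metric a program chooses, fixing it in the contract alongside the threshold keeps the trigger from becoming a matter of interpretation.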

Policy guidance changes. If the regulatory or program-manual basis under which the system was evaluated changes — a new CMS State Health Official Letter, an updated SNAP Memo, a new OMB guidance — re-evaluation against the updated policy basis is required. The vendor is owed reasonable notice but not optionality.

The acceptance clause and the re-evaluation clause together produce a contract that survives the system's lifetime, not just its launch.

How to get this through procurement review

Procurement officers are not obstructionists. They are pattern-matchers, and the patterns they hold are the patterns of deterministic software acceptance. The objections we have seen to eval-as-deliverable clauses come from vendors, land on the procurement officer's desk, and are predictable. Each has a short answer.

The vendor will object that the holdout dataset is proprietary. It is not. The holdout dataset is a representative sample of the population the agency itself is serving. The agency has the data already. The vendor is being asked to document a schema and write a script.

The vendor will object that the eval harness reveals trade secrets. It does not. The eval harness reveals what the system does, not how it does it. Knowing the property "never mutate a determination field" is satisfied tells the agency nothing about the model architecture.

The vendor will object that the in-perpetuity data-rights clause is non-standard. It is increasingly standard. CMS modular contracting language for Medicaid eligibility systems has moved this direction since 2023. The Vardr Procurement Language Library has the precedent citations pre-staged.

The vendor will object that re-evaluation on every model update is operationally infeasible. Re-evaluation can be automated. The runnable evaluation script is the thing that makes it tractable. The vendor's objection here is a tell that they cannot actually produce the script — which is itself the reason to ask for it.

What to do Monday

If you have a solicitation in draft, add the eval-as-deliverable clause to the deliverables section. Cite it as a precondition of acceptance.

If you have a recently signed contract, request a contract modification that adds the eval-harness deliverable. The leverage to push this through usually comes from the agency's existing audit-readiness obligations, which the vendor cannot defend without producing the same artifacts the modification asks for.

If you have a contract about to be awarded, hold the award until the eval-harness clause is in the document. The award cycle is the last point at which this clause can be added without paying for it as a change order.

If you have nothing in motion, write the clause anyway. It will be in the next solicitation, and the time to draft it is before the deadline pressure starts.

Where Vardr fits

We help agencies write the eval clause into the solicitation before award, and we help them retrofit the clause via modification after award when the program is already at risk. The Vardr Procurement Language Library has the eval-as-deliverable clauses pre-staged and tested against multiple state and federal templates. The Reference Architecture treats the eval harness as a first-class system component the agency owns and runs — not a vendor artifact the agency hopes to receive.
