04 Downloadable artifact
The M-25-21 Engineering Compliance Checklist
30 concrete checkpoints for federal civilian AI use cases — what the memos actually demand in code, configuration, and operations.
30 checkpoints · 7 sections · v1.0 — published May 2026 by Vardr Partners
We built this checklist for the program offices we work with: a concrete, line-by-line test for whether the M-24-10 / M-25-21 obligations are operational in code, not in a policy binder. Walk it honestly once. The boxes you cannot check are the ones the next OIG review will land on.
Section 1
1. AI Inventory as a data product
The inventory is not a spreadsheet. It is a queryable, versioned system of record for every AI use case in scope.
INV-01 Inventory has a schema with explicit fields for system identifier, deployment status, last assessment date, minimum-practice coverage, vendor, model version, and pointer to the runnable evaluation.
Expectation: Schema is published, versioned, and the inventory rejects writes that fail validation.
INV-02 Inventory is queryable by API, not only by humans clicking through a UI.
Expectation: Other agency systems can read inventory state programmatically.
INV-03 Inventory reflects current production reality, not last quarter's snapshot.
Expectation: Update SLA is named and monitored. Drift between inventory and production triggers an alert.
INV-04 Rights-impacting and safety-impacting designations are stored as first-class fields, not narrative notes.
Expectation: Filterable, reportable, auditable. The designation drives downstream minimum-practice obligations.
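A minimal sketch of how INV-01 and INV-04 might look in code, assuming a Python service sits in front of the inventory store. The field names, the `Status` and `Designation` values, and the `validate_entry` helper are illustrative assumptions, not terms taken from the memos.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum

SCHEMA_VERSION = "1.0.0"  # hypothetical; bump whenever a field changes


class Status(Enum):
    PLANNED = "planned"
    PILOT = "pilot"
    PRODUCTION = "production"
    RETIRED = "retired"


class Designation(Enum):
    # Stored as a first-class field so it can be filtered and reported on.
    RIGHTS_IMPACTING = "rights_impacting"
    SAFETY_IMPACTING = "safety_impacting"
    BOTH = "both"
    NEITHER = "neither"


@dataclass(frozen=True)
class InventoryEntry:
    system_id: str
    status: Status
    designation: Designation
    last_assessment: date
    minimum_practices_covered: tuple[str, ...]
    vendor: str
    model_version: str
    evaluation_ref: str  # pointer to the runnable evaluation, e.g. a repo path


def validate_entry(entry: InventoryEntry, today: date) -> list[str]:
    """Return validation errors; an empty list means the write is accepted."""
    errors = []
    if not entry.system_id.strip():
        errors.append("system_id is required")
    if not entry.model_version.strip():
        errors.append("model_version is required")
    if not entry.evaluation_ref.strip():
        errors.append("evaluation_ref is required")
    if entry.last_assessment > today:
        errors.append("last_assessment cannot be in the future")
    if entry.designation is not Designation.NEITHER and not entry.minimum_practices_covered:
        errors.append("designated use cases must list minimum-practice coverage")
    return errors
```

A write path would call `validate_entry` before persisting and reject the write whenever the returned list is non-empty. Because the designation is an enum rather than free text, a query such as "all production systems designated rights-impacting" becomes a plain filter.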
Section 2
2. Pre-deployment evaluation
Impact assessments are runnable harnesses, not documents. The agency owns the harness. The vendor does not.
EVAL-01 Evaluation suite is implemented as code in a repository the agency controls.
Expectation: Can be run on demand by agency staff without vendor assistance.
EVAL-02 Demographic-group breakouts are part of the standard evaluation, not a one-off analysis.
Expectation: Group definitions are documented. Breakouts run automatically on every evaluation invocation.
EVAL-03 Distribution-shift tests compare production data to evaluation data.
Expectation: Triggered automatically on model updates and on a scheduled cadence in production.
EVAL-04 Fairness criteria are named explicitly per use case.
Expectation: The criterion is one a reviewer can verify against the output, not an aspiration.
EVAL-05 Evaluation results are stored as versioned artifacts, addressable by model version and evaluation date.
Expectation: Comparable across releases. Auditable from outside the team that ran the evaluation.
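The sketch below illustrates EVAL-02, EVAL-04, and EVAL-05 together: a breakout that runs on every invocation, one explicitly named criterion, and a result addressable by model version and date. It assumes records arrive as dicts with `group`, `label`, and `prediction` keys. The accuracy-gap criterion and the `artifact_id` format are illustrative choices; the right criterion depends on the use case.

```python
from collections import defaultdict
from datetime import date


def group_breakout(records):
    """Compute accuracy per demographic group."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        totals[r["group"]] += 1
        correct[r["group"]] += int(r["label"] == r["prediction"])
    return {g: correct[g] / totals[g] for g in totals}


def evaluate(records, model_version: str, run_date: date, max_accuracy_gap: float):
    """Run the breakout and check one explicitly named fairness criterion."""
    if not records:
        raise ValueError("evaluation set is empty")
    per_group = group_breakout(records)
    gap = max(per_group.values()) - min(per_group.values())
    return {
        # Hypothetical addressing scheme: model version plus evaluation date.
        "artifact_id": f"{model_version}/{run_date.isoformat()}",
        "model_version": model_version,
        "run_date": run_date.isoformat(),
        "per_group_accuracy": per_group,
        "criterion": f"accuracy gap between groups <= {max_accuracy_gap}",
        "observed_gap": gap,
        "passed": gap <= max_accuracy_gap,
    }
```

The returned dict is plain data, so it can be serialized and stored unchanged. A reviewer outside the team can then compare `observed_gap` against the stated criterion without rerunning anything.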
Section 3
3. Per-decision provenance
Any AI-influenced decision must be reproducible — same inputs in, same decision out — on demand. Replay is the test.
PROV-01 Every decision touchpoint emits an append-only event-log entry.
Expectation: Decision id, timestamp, model version, prompt template version, retrieval set hash, feature snapshot hash, user identifier, outcome.
PROV-02 Feature-store entries are immutable and support point-in-time (time-travel) queries.
Expectation: The value of feature X at decision time can be retrieved years later without joining against current state.
PROV-03 Prompts and tool schemas are content-addressed and versioned.
Expectation: The exact prompt and tools used for a decision can be reconstructed from the event log.
PROV-04 Retrieved documents are content-addressed.
Expectation: The exact text returned to the model at decision time is retrievable. Not the URL — the text.
PROV-05 Caseworker action (accept, override, or modify) is recorded, with the reason for any override.
Expectation: The model's recommendation that was overridden is preserved, not replaced.
PROV-06 Reconstruction time is measured and bounded.
Expectation: A decision from 24 months ago can be replayed within a published SLA, not best-effort.
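One way to make the event log in PROV-01 tamper-evident and the hashes in PROV-03 and PROV-04 concrete is sketched below, using only the Python standard library. The `DecisionLog` class, the hash-chain design, and the in-memory list are assumptions for illustration. A production log would sit on durable, write-once storage.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder previous-hash for the first entry


def content_address(text: str) -> str:
    """Content address: SHA-256 hex digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _canonical(body: dict) -> str:
    # Canonical JSON so the same body always produces the same hash.
    return json.dumps(body, sort_keys=True, separators=(",", ":"))


class DecisionLog:
    """Append-only log; each entry commits to the previous one via its hash."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> dict:
        prev_hash = self._entries[-1]["entry_hash"] if self._entries else GENESIS
        body = dict(event, prev_hash=prev_hash)
        entry = dict(body, entry_hash=content_address(_canonical(body)))
        self._entries.append(entry)
        return dict(entry)  # return a copy so callers cannot mutate the log

    def verify(self) -> bool:
        """Recompute every hash and check each link in the chain."""
        prev_hash = GENESIS
        for entry in self._entries:
            body = {k: v for k, v in entry.items() if k != "entry_hash"}
            if body["prev_hash"] != prev_hash:
                return False
            if content_address(_canonical(body)) != entry["entry_hash"]:
                return False
            prev_hash = entry["entry_hash"]
        return True
```

An event would carry the fields the checklist names (decision id, timestamp, model version, prompt template version, retrieval set hash, feature snapshot hash, user identifier, outcome). The hash fields come from `content_address` applied to the exact prompt text and the exact retrieved text. Storing the text itself under its hash is what lets a replay fetch the original wording rather than whatever a URL serves today.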
Section 4
4. Minimum-practice operationalization
Each applicable minimum practice needs four things: a detector, an on-call owner, a remediation SLA, and a notification path. Without all four, the practice is a wish.
OPS-01 Each applicable minimum practice has a named detector.
Expectation: A test, a metric, or a monitor that surfaces violations. Not a quarterly review.
OPS-02 Each detector has an on-call rotation.
Expectation: A specific person or rotating role receives the alert. Not 'the team.'
OPS-03 Each violation has a remediation SLA.
Expectation: Time-bounded. Tracked. Reported.
OPS-04 Each violation has a notification path to the appropriate oversight body.
Expectation: The agency, the impacted population, and the OMB liaison receive the notification as the SLA prescribes.
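The four-element test in this section can itself be automated. The sketch below assumes controls are declared as data. The `PracticeControl` fields and the `gaps` helper are hypothetical names, and the practice identifiers passed in are whatever the agency uses.

```python
from dataclasses import dataclass
from datetime import timedelta

ELEMENTS = ("detector", "on_call", "remediation_sla", "notify")


@dataclass(frozen=True)
class PracticeControl:
    practice: str
    detector: str               # name of a test, metric, or monitor
    on_call: str                # rotation or role that receives the alert
    remediation_sla: timedelta  # time allowed to remediate a violation
    notify: tuple[str, ...]     # parties on the notification path


def gaps(controls, applicable_practices):
    """Map each practice to the elements it is missing; complete ones are omitted."""
    by_practice = {c.practice: c for c in controls}
    report = {}
    for practice in applicable_practices:
        control = by_practice.get(practice)
        if control is None:
            report[practice] = list(ELEMENTS)
            continue
        missing = []
        if not control.detector.strip():
            missing.append("detector")
        if not control.on_call.strip():
            missing.append("on_call")
        if control.remediation_sla <= timedelta(0):
            missing.append("remediation_sla")
        if not control.notify:
            missing.append("notify")
        if missing:
            report[practice] = missing
    return report
```

Running `gaps` in continuous integration turns an unowned practice into a failing build rather than a finding in a later review.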
Section 5
5. Adverse-action transparency
Notices generated by AI-influenced decisions must reference the policy basis, the path to challenge, and a plain-language description of what the system considered.
ADV-01 Notice template references the specific policy text and version that bound the decision.
Expectation: Not 'in accordance with policy.' Direct citation, retrievable.
ADV-02 Notice describes, in plain language, what facts the system considered.
Expectation: Reviewable by counsel against the operative regulation. Survives appeal.
ADV-03 Notice explains the path to challenge.
Expectation: Form name, deadline, where to send, where to call.
ADV-04 Notice templates are versioned and bound to the policy version active at decision time.
Expectation: A notice for a decision from 18 months ago renders with the template active then, not the current one.
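ADV-04 comes down to a point-in-time lookup: given a decision date, return the template and policy versions that were in effect then. A minimal sketch using the standard-library `bisect` module follows. The `TemplateHistory` class and its method names are assumptions for illustration.

```python
from bisect import bisect_right
from datetime import date


class TemplateHistory:
    """Effective-dated record of notice template and policy versions."""

    def __init__(self):
        self._effective = []  # effective dates, strictly increasing
        self._versions = []   # (template_version, policy_version), aligned by index

    def publish(self, effective: date, template_version: str, policy_version: str):
        if self._effective and effective <= self._effective[-1]:
            raise ValueError("publish versions in increasing effective-date order")
        self._effective.append(effective)
        self._versions.append((template_version, policy_version))

    def active_on(self, decision_date: date):
        """Return the (template, policy) pair in effect on the decision date."""
        i = bisect_right(self._effective, decision_date)
        if i == 0:
            raise LookupError("no template was in effect on that date")
        return self._versions[i - 1]
```

Rendering an old notice then means calling `active_on` with the decision date stored in the event log, never with today's date. The same lookup pattern applies to point-in-time feature values.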
Section 6
6. Procurement obligations
The procurement language requires the vendor to provide artifacts the agency needs to run impact assessments without the vendor.
PROC-01 Contract requires the vendor to provide test datasets representative of the deployment population.
Expectation: Including demographic-group coverage. Not aspirational; a deliverable.
PROC-02 Contract requires fairness criteria to be named explicitly in the SOW.
Expectation: Defined per use case. Reviewable by the program office.
PROC-03 Contract requires runnable benchmarks tied to the use case.
Expectation: Executable by agency staff. Not a slide deck. Not a written report.
PROC-04 Contract requires the vendor to provide model artifacts in a form the agency can re-evaluate.
Expectation: If the vendor cannot meet this, the contract identifies the alternative — a black-box re-evaluation harness or a model-card-plus-prompt-version commitment.
Section 7
7. Cross-cutting governance
The connective tissue between the inventory, the evaluations, the operations, and the procurement.
GOV-01 A single accountable role owns the AI minimum-practices program at the agency.
Expectation: Named, authorized, resourced. Not a steering committee.
GOV-02 Inventory, evaluations, and incidents are reviewed quarterly.
Expectation: Output of the review is recorded in the inventory. Action items have owners and dates.
GOV-03 Public reporting of in-scope use cases is current with OMB requirements.
Expectation: Same source of truth as the internal inventory, with appropriate redactions only.
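For GOV-03, "same source of truth" suggests the public report is derived from the internal record, not maintained by hand. A small sketch follows, assuming inventory entries are available as dicts. The `public_view` helper and the redaction marker are illustrative, and which fields may be redacted is a policy decision this code does not make.

```python
REDACTED = "[redacted]"


def public_view(entry: dict, redact: frozenset) -> dict:
    """Derive the public record from the internal one, masking only named fields."""
    unknown = redact - set(entry)
    if unknown:
        # A stale redaction list should fail loudly, not silently publish.
        raise KeyError(f"redaction list names unknown fields: {sorted(unknown)}")
    return {k: (REDACTED if k in redact else v) for k, v in entry.items()}
```

Because every field not named in the redaction list passes through untouched, the public and internal views cannot drift apart.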
Want a second set of eyes on the result?
Bring the marked-up checklist to a principal-level briefing. 30 minutes, no prep, run by Frank or Payton directly. We walk the unchecked boxes and tell you which ones would actually fail a defensible OIG review.