04 Downloadable artifact

The M-25-21 Engineering Compliance Checklist

30 concrete checkpoints for federal civilian AI use cases — what the memos actually demand in code, configuration, and operations.

30 checkpoints · 7 sections · v1.0 — published May 2026 by Vardr Partners

We built this checklist for the program offices we work with: a concrete, line-by-line test for whether the M-24-10 / M-25-21 obligations are operational in code, not in a policy binder. Walk it honestly once. The boxes you cannot check are the ones the next OIG review will land on.

Section 1
1. AI Inventory as a data product
The inventory is not a spreadsheet. It is a queryable, versioned system of record for every AI use case in scope.
- INV-01 Inventory has a schema with explicit fields for system identifier, deployment status, last assessment date, minimum-practice coverage, vendor, model version, and pointer to the runnable evaluation.
  Expectation: Schema is published, versioned, and the inventory rejects writes that fail validation.
- INV-02 Inventory is queryable by API, not only by humans clicking through a UI.
  Expectation: Other agency systems can read inventory state programmatically.
- INV-03 Inventory reflects current production reality, not last quarter's snapshot.
  Expectation: Update SLA is named and monitored. Drift between inventory and production triggers an alert.
- INV-04 Rights-impacting and safety-impacting designations are stored as first-class fields, not narrative notes.
  Expectation: Filterable, reportable, auditable. The designation drives downstream minimum-practice obligations.
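As a minimal sketch of INV-01 and INV-04, the record below carries the fields the checklist names, and the store rejects writes that fail validation. The class names, the enum values, and the specific validation rules are illustrative assumptions, not an official schema.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum


class DeploymentStatus(Enum):
    # Illustrative status values; an agency would define its own.
    PLANNED = "planned"
    PILOT = "pilot"
    PRODUCTION = "production"
    RETIRED = "retired"


@dataclass(frozen=True)
class InventoryRecord:
    """One AI use case. Field names are assumptions based on INV-01."""
    system_id: str
    deployment_status: DeploymentStatus
    last_assessment_date: date
    minimum_practice_coverage: dict[str, bool]  # practice name -> covered?
    vendor: str
    model_version: str
    evaluation_pointer: str  # e.g. repository path of the runnable harness
    rights_impacting: bool   # first-class fields (INV-04), not narrative notes
    safety_impacting: bool


def validate(record: InventoryRecord, today: date) -> list[str]:
    """Return a list of validation errors; an empty list means the record is valid."""
    errors = []
    for name in ("system_id", "vendor", "model_version", "evaluation_pointer"):
        if not getattr(record, name).strip():
            errors.append(f"{name} must not be empty")
    if record.last_assessment_date > today:
        errors.append("last_assessment_date is in the future")
    designated = record.rights_impacting or record.safety_impacting
    if designated and not record.minimum_practice_coverage:
        errors.append("designated use case must list minimum-practice coverage")
    return errors


class Inventory:
    """In-memory stand-in for a versioned system of record."""

    def __init__(self) -> None:
        self._versions: dict[str, list[InventoryRecord]] = {}

    def write(self, record: InventoryRecord, today: date) -> None:
        errors = validate(record, today)
        if errors:
            # Reject the write instead of storing an invalid record.
            raise ValueError("; ".join(errors))
        # Append rather than overwrite, so earlier versions stay readable.
        self._versions.setdefault(record.system_id, []).append(record)

    def current(self, system_id: str) -> InventoryRecord:
        return self._versions[system_id][-1]
```

Because the designations are booleans on the record, filtering for rights-impacting use cases is a plain query rather than a text search over notes.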
Section 2
2. Pre-deployment evaluation
Impact assessments are runnable harnesses, not documents. The agency owns the harness. The vendor does not.
- EVAL-01 Evaluation suite is implemented as code in a repository the agency controls.
  Expectation: Can be run on demand by agency staff without vendor assistance.
- EVAL-02 Demographic-group breakouts are part of the standard evaluation, not a one-off analysis.
  Expectation: Group definitions are documented. Breakouts run automatically on every evaluation invocation.
- EVAL-03 Distribution-shift tests compare production data to evaluation data.
  Expectation: Triggered automatically on model updates and on a scheduled cadence in production.
- EVAL-04 Fairness criteria are named explicitly per use case.
  Expectation: The criterion is one a reviewer can verify against the output, not an aspiration.
- EVAL-05 Evaluation results are stored as versioned artifacts, addressable by model version + evaluation date.
  Expectation: Comparable across releases. Auditable from outside the team that ran the evaluation.
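One way to make EVAL-02 and EVAL-04 concrete is to compute a per-group metric on every run and test it against a named threshold. The sketch below uses the gap in false-positive rate across groups as the criterion. That choice, the row keys (`group`, `label`, `prediction`), and the function names are assumptions for illustration; each use case would name its own criterion.

```python
from collections import defaultdict


def false_positive_rates(rows, group_key="group"):
    """Per-group false-positive rate: predicted 1 when the true label is 0.

    Groups with no negative examples are omitted, since the rate is undefined.
    """
    negatives = defaultdict(int)
    false_pos = defaultdict(int)
    for row in rows:
        if row["label"] == 0:
            negatives[row[group_key]] += 1
            if row["prediction"] == 1:
                false_pos[row[group_key]] += 1
    return {group: false_pos[group] / count for group, count in negatives.items()}


def check_fpr_gap(rates, max_gap):
    """Named, verifiable criterion: the largest FPR gap between groups stays within max_gap."""
    gap = max(rates.values()) - min(rates.values())
    return {
        "criterion": "max_fpr_gap",
        "threshold": max_gap,
        "observed_gap": gap,
        "passed": gap <= max_gap,
    }
```

The returned dictionary is the kind of record a reviewer can check against the raw output: it names the criterion, the threshold, and the observed value rather than reporting only pass or fail.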
Section 3
3. Per-decision provenance
Any AI-influenced decision must be reproducible — same inputs in, same decision out — on demand. Replay is the test.
- PROV-01 Every decision touchpoint emits an append-only event-log entry.
  Expectation: Each entry records decision ID, timestamp, model version, prompt template version, retrieval set hash, feature snapshot hash, user identifier, and outcome.
- PROV-02 Feature-store entries are immutable and support point-in-time (time-travel) queries.
  Expectation: The value of feature X at decision time can be retrieved years later without joining against current state.
- PROV-03 Prompts and tool schemas are content-addressed and versioned.
  Expectation: The exact prompt and tools used for a decision can be reconstructed from the event log.
- PROV-04 Retrieved documents are content-addressed.
  Expectation: The exact text returned to the model at decision time is retrievable. Not the URL: the text.
- PROV-05 Caseworker action (accept, override, or modify) is recorded, with the reason for any override.
  Expectation: The model's recommendation that was overridden is preserved, not replaced.
- PROV-06 Reconstruction time is measured and bounded.
  Expectation: A decision from 24 months ago can be replayed within a published SLA, not best-effort.
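The event-log entry in PROV-01 and the content addressing in PROV-03 and PROV-04 can be sketched as follows. This is an in-memory illustration under stated assumptions: SHA-256 as the content address, JSON as the log encoding, and every class, function, and field name here is hypothetical. The `retrieved_doc_keys` field is an addition beyond the checklist's list, included so the exact texts can be fetched back.

```python
import hashlib
import json
from datetime import datetime, timezone


def content_address(data: bytes) -> str:
    """Address content by its SHA-256 digest."""
    return "sha256:" + hashlib.sha256(data).hexdigest()


def canonical_hash(obj) -> str:
    """Hash a JSON-serializable object independent of dictionary key order."""
    payload = json.dumps(obj, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return content_address(payload)


class ContentStore:
    """Stores text under its own hash, so a key always returns the exact text."""

    def __init__(self):
        self._blobs = {}

    def put(self, text: str) -> str:
        key = content_address(text.encode("utf-8"))
        self._blobs[key] = text
        return key

    def get(self, key: str) -> str:
        return self._blobs[key]


def build_decision_event(decision_id, model_version, prompt_template,
                         retrieved_docs, features, user_id, outcome,
                         store, now=None):
    # Store the exact retrieved text (not a URL) and keep its address.
    doc_keys = [store.put(doc) for doc in retrieved_docs]
    return {
        "decision_id": decision_id,
        "timestamp": (now or datetime.now(timezone.utc)).isoformat(),
        "model_version": model_version,
        "prompt_template_version": store.put(prompt_template),
        "retrieval_set_hash": canonical_hash(doc_keys),
        "retrieved_doc_keys": doc_keys,
        "feature_snapshot_hash": canonical_hash(features),
        "user_id": user_id,
        "outcome": outcome,
    }


class EventLog:
    """Append-only by interface: entries can be added and read, never edited."""

    def __init__(self):
        self._entries = []

    def append(self, event: dict) -> None:
        self._entries.append(json.dumps(event, sort_keys=True))

    def read(self, index: int) -> dict:
        return json.loads(self._entries[index])

    def __len__(self) -> int:
        return len(self._entries)
```

A replay would read the entry, fetch the prompt and documents from the store by their keys, and rerun the pinned model version. A production log would also need durable, tamper-evident storage, which this sketch does not attempt.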
Section 4
4. Minimum-practice operationalization
Each minimum practice that applies needs four things: a detector, an on-call rotation, a remediation SLA, and a notification path. Without these four, the practice is a wish.
- OPS-01 Each applicable minimum practice has a named detector.
  Expectation: A test, a metric, or a monitor that surfaces violations. Not a quarterly review.
- OPS-02 Each detector has an on-call rotation.
  Expectation: A specific person or rotating role receives the alert. Not 'the team.'
- OPS-03 Each violation has a remediation SLA.
  Expectation: Time-bounded. Tracked. Reported.
- OPS-04 Each violation has a notification path to the appropriate oversight body.
  Expectation: The agency, the impacted population, and the OMB liaison receive the notification as the SLA prescribes.
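The four-part test in this section lends itself to an automated gap report. The sketch below assumes a simple registry of practices. The `PracticeControl` fields and the example practice names in the usage note are hypothetical, chosen only to mirror OPS-01 through OPS-04.

```python
from dataclasses import dataclass
from typing import Optional

# The four items every applicable minimum practice must have.
REQUIRED = ("detector", "on_call", "remediation_sla_hours", "notification_path")


@dataclass
class PracticeControl:
    practice: str
    detector: Optional[str] = None           # test, metric, or monitor name
    on_call: Optional[str] = None            # person or rotating role, not "the team"
    remediation_sla_hours: Optional[int] = None
    notification_path: Optional[str] = None  # who is notified, and how


def missing_controls(controls):
    """Return {practice: [missing field names]} for practices lacking any of the four."""
    gaps = {}
    for control in controls:
        missing = [name for name in REQUIRED if getattr(control, name) in (None, "")]
        if missing:
            gaps[control.practice] = missing
    return gaps
```

For example, a registry entry such as `PracticeControl("ongoing-monitoring", detector="drift-monitor")` would be reported as missing `on_call`, `remediation_sla_hours`, and `notification_path`. Running this check in CI is one way to keep a practice from shipping with only a detector.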
Section 5
5. Adverse-action transparency
Notices generated by AI-influenced decisions must reference the policy basis, the path to challenge, and a plain-language description of what the system considered.
- ADV-01 Notice template references the specific policy text and version that bound the decision.
  Expectation: Not 'in accordance with policy.' Direct citation, retrievable.
- ADV-02 Notice describes, in plain language, what facts the system considered.
  Expectation: Reviewable by counsel against the operative regulation. Survives appeal.
- ADV-03 Notice explains the path to challenge.
  Expectation: Form name, deadline, where to send, where to call.
- ADV-04 Notice templates are versioned and bound to the policy version active at decision time.
  Expectation: A notice for a decision from 18 months ago renders with the template active then, not the current one.
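ADV-04 amounts to a point-in-time lookup: given a decision date, select the template whose effective date most recently precedes it. A minimal sketch follows, assuming templates are keyed by effective date. `TemplateHistory`, `render_notice`, and the placeholder names are hypothetical.

```python
from bisect import bisect_right
from datetime import date


class TemplateHistory:
    """Notice templates ordered by the date each one took effect."""

    def __init__(self):
        self._effective = []   # sorted effective dates
        self._templates = []   # (policy_version, body), parallel to _effective

    def publish(self, effective: date, policy_version: str, body: str) -> None:
        index = bisect_right(self._effective, effective)
        self._effective.insert(index, effective)
        self._templates.insert(index, (policy_version, body))

    def active_on(self, decision_date: date):
        """Return the (policy_version, body) in force on decision_date."""
        index = bisect_right(self._effective, decision_date)
        if index == 0:
            raise LookupError(f"no template in effect on {decision_date}")
        return self._templates[index - 1]


def render_notice(history: TemplateHistory, decision_date: date, **fields) -> str:
    # Always render with the template active at decision time, never the latest.
    policy_version, body = history.active_on(decision_date)
    return body.format(policy_version=policy_version, **fields)
```

Rendering a notice for an old decision then depends only on the decision date stored with that decision, so publishing a new template cannot change how past notices read.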
Section 6
6. Procurement obligations
The procurement language requires the vendor to provide artifacts the agency needs to run impact assessments without the vendor.
- PROC-01 Contract requires the vendor to provide test datasets representative of the deployment population.
  Expectation: Including demographic-group coverage. Not aspirational; a deliverable.
- PROC-02 Contract requires fairness criteria to be named explicitly in the SOW.
  Expectation: Defined per use case. Reviewable by the program office.
- PROC-03 Contract requires runnable benchmarks tied to the use case.
  Expectation: Executable by agency staff. Not a slide deck. Not a written report.
- PROC-04 Contract requires the vendor to provide model artifacts in a form the agency can re-evaluate.
  Expectation: If the vendor cannot meet this, the contract identifies the alternative: a black-box re-evaluation harness or a model-card-plus-prompt-version commitment.
Section 7
7. Cross-cutting governance
The connective tissue between the inventory, the evaluations, the operations, and the procurement.
- GOV-01 A single accountable role owns the AI minimum-practices program at the agency.
  Expectation: Named, authorized, resourced. Not a steering committee.
- GOV-02 Quarterly review of inventory, evaluations, and incidents.
  Expectation: Output of the review is recorded in the inventory. Action items have owners and dates.
- GOV-03 Public reporting of in-scope use cases is current with OMB requirements.
  Expectation: Same source of truth as the internal inventory, with appropriate redactions only.

Want a second set of eyes on the result?

Bring the marked-up checklist to a principal-level briefing. 30 minutes, no prep, run by Frank or Payton directly. We walk the unchecked boxes and tell you which ones would actually fail a defensible OIG review.

