Vardr Partners

04 Downloadable artifact

The M-25-21 Engineering Compliance Checklist

30 concrete checkpoints for federal civilian AI use cases — what the memos actually demand in code, configuration, and operations.

30 checkpoints · 7 sections · v1.0 — published May 2026 by Vardr Partners

We built this checklist for the program offices we work with: a concrete, line-by-line test for whether the M-24-10 / M-25-21 obligations are operational in code, not in a policy binder. Walk it honestly once. The boxes you cannot check are the ones the next OIG review will land on.

  1. Section 1

    1. AI Inventory as a data product

    The inventory is not a spreadsheet. It is a queryable, versioned system of record for every AI use case in scope.

    • INV-01 Inventory has a schema with explicit fields for system identifier, deployment status, last assessment date, minimum-practice coverage, vendor, model version, and a pointer to the runnable evaluation.

      Expectation: Schema is published, versioned, and the inventory rejects writes that fail validation.

    • INV-02 Inventory is queryable by API, not only by humans clicking through a UI.

      Expectation: Other agency systems can read inventory state programmatically.

    • INV-03 Inventory reflects current production reality, not the last quarter's snapshot.

      Expectation: Update SLA is named and monitored. Drift between inventory and production triggers an alert.

    • INV-04 Impact designations (rights-impacting and safety-impacting under M-24-10, high-impact under M-25-21) are stored as first-class fields, not narrative notes.

      Expectation: Filterable, reportable, auditable. The designation drives downstream minimum-practice obligations.
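One way to picture INV-01 through INV-04 together is a small, in-memory sketch. The field names, the `Inventory` class, the validation rules, and the sample record below are all illustrative assumptions, not a prescribed schema; a real system of record would sit behind an API and a durable store.

```python
from dataclasses import dataclass, replace
from datetime import date
from enum import Enum

SCHEMA_VERSION = "1.0.0"  # hypothetical version label for the published schema


class DeploymentStatus(Enum):
    PLANNED = "planned"
    PILOT = "pilot"
    PRODUCTION = "production"
    RETIRED = "retired"


@dataclass(frozen=True)
class InventoryRecord:
    system_id: str
    deployment_status: DeploymentStatus
    last_assessment_date: date
    minimum_practice_coverage: dict[str, bool]  # practice name -> covered?
    vendor: str
    model_version: str
    evaluation_pointer: str  # where the runnable evaluation lives
    rights_impacting: bool   # first-class field, not a narrative note
    safety_impacting: bool


class ValidationError(ValueError):
    """Raised when a write does not satisfy the schema rules."""


def validate(record: InventoryRecord, today: date) -> None:
    errors = []
    if not record.system_id.strip():
        errors.append("system_id is empty")
    if not record.model_version.strip():
        errors.append("model_version is empty")
    if not record.evaluation_pointer.strip():
        errors.append("evaluation_pointer is empty")
    if record.last_assessment_date > today:
        errors.append("last_assessment_date is in the future")
    designated = record.rights_impacting or record.safety_impacting
    if designated and not record.minimum_practice_coverage:
        errors.append("designated use case has no minimum-practice coverage")
    if errors:
        raise ValidationError("; ".join(errors))


class Inventory:
    """Append-only version history per system; invalid writes are rejected."""

    def __init__(self) -> None:
        self._versions: dict[str, list[InventoryRecord]] = {}

    def write(self, record: InventoryRecord, today: date) -> None:
        validate(record, today)
        self._versions.setdefault(record.system_id, []).append(record)

    def current(self, system_id: str) -> InventoryRecord:
        return self._versions[system_id][-1]

    def query(self, *, rights_impacting=None, safety_impacting=None):
        results = []
        for versions in self._versions.values():
            latest = versions[-1]
            if rights_impacting is not None and latest.rights_impacting != rights_impacting:
                continue
            if safety_impacting is not None and latest.safety_impacting != safety_impacting:
                continue
            results.append(latest)
        return results


# Usage: a valid write is accepted, an invalid one is rejected.
inventory = Inventory()
record = InventoryRecord(
    system_id="benefits-triage",            # hypothetical use case
    deployment_status=DeploymentStatus.PRODUCTION,
    last_assessment_date=date(2026, 3, 15),
    minimum_practice_coverage={"impact_assessment": True},
    vendor="ExampleVendor",
    model_version="2.3.1",
    evaluation_pointer="evals/benefits-triage@abc123",
    rights_impacting=True,
    safety_impacting=False,
)
inventory.write(record, today=date(2026, 5, 1))
try:
    inventory.write(replace(record, evaluation_pointer=""), today=date(2026, 5, 1))
    rejected = False
except ValidationError:
    rejected = True
```

Because the designation is a typed field, `inventory.query(rights_impacting=True)` is a filter rather than a text search, which is what lets the designation drive downstream obligations.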

  2. Section 2

    2. Pre-deployment evaluation

    Impact assessments are runnable harnesses, not documents. The agency owns the harness. The vendor does not.

    • EVAL-01 Evaluation suite is implemented as code in a repository the agency controls.

      Expectation: Can be run on demand by agency staff without vendor assistance.

    • EVAL-02 Demographic-group breakouts are part of the standard evaluation, not a one-off analysis.

      Expectation: Group definitions are documented. Breakouts run automatically on every evaluation invocation.

    • EVAL-03 Distribution-shift tests compare production data to evaluation data.

      Expectation: Triggered automatically on model updates and on a scheduled cadence in production.

    • EVAL-04 Fairness criteria are named explicitly per use case.

      Expectation: The criterion is one a reviewer can verify against the output, not an aspiration.

    • EVAL-05 Evaluation results are stored as versioned artifacts, addressable by model version and evaluation date.

      Expectation: Comparable across releases. Auditable from outside the team that ran the evaluation.
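As a rough illustration of EVAL-02 through EVAL-05, the sketch below runs a group breakout on every invocation, checks one explicitly named criterion, and computes a population stability index (PSI) as one possible distribution-shift test. The criterion (a maximum selection-rate gap), the PSI choice, the thresholds, and the sample data are all assumptions made for the example; each use case would name its own.

```python
import math
from collections import defaultdict
from datetime import date


def selection_rates(records, group_key):
    """Share of positive predictions per group."""
    totals = defaultdict(int)
    positives = defaultdict(int)
    for row in records:
        group = row[group_key]
        totals[group] += 1
        positives[group] += 1 if row["prediction"] == 1 else 0
    return {group: positives[group] / totals[group] for group in totals}


def population_stability_index(eval_counts, prod_counts, eps=1e-6):
    """PSI between two binned distributions given as per-bin counts."""
    if len(eval_counts) != len(prod_counts):
        raise ValueError("both distributions must use the same bins")
    eval_total = sum(eval_counts)
    prod_total = sum(prod_counts)
    psi = 0.0
    for e, p in zip(eval_counts, prod_counts):
        expected = max(e / eval_total, eps)  # eps guards against empty bins
        actual = max(p / prod_total, eps)
        psi += (actual - expected) * math.log(actual / expected)
    return psi


def run_evaluation(records, group_key, max_gap, model_version, eval_date):
    """Breakout runs on every invocation; the result is a storable artifact."""
    if not records:
        raise ValueError("no evaluation records supplied")
    rates = selection_rates(records, group_key)
    gap = max(rates.values()) - min(rates.values())
    return {
        "artifact_id": f"{model_version}/{eval_date.isoformat()}",
        "criterion": f"selection-rate gap across {group_key} <= {max_gap}",
        "rates": rates,
        "gap": gap,
        "passed": gap <= max_gap,
    }


# Usage with made-up data: group A is selected at 0.5, group B at 0.25.
records = [
    {"group": "A", "prediction": 1}, {"group": "A", "prediction": 1},
    {"group": "A", "prediction": 0}, {"group": "A", "prediction": 0},
    {"group": "B", "prediction": 1}, {"group": "B", "prediction": 0},
    {"group": "B", "prediction": 0}, {"group": "B", "prediction": 0},
]
result = run_evaluation(records, "group", 0.2, "2.3.1", date(2026, 5, 1))
psi_same = population_stability_index([50, 30, 20], [50, 30, 20])
psi_shifted = population_stability_index([50, 30, 20], [20, 30, 50])
```

The point of returning the criterion string alongside the numbers is that a reviewer outside the team can check the stated test against the stored output without rerunning anything.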

  3. Section 3

    3. Per-decision provenance

    Any AI-influenced decision must be reproducible — same inputs in, same decision out — on demand. Replay is the test.

    • PROV-01 Every decision touchpoint emits an append-only event-log entry.

      Expectation: Each entry records decision ID, timestamp, model version, prompt template version, retrieval set hash, feature snapshot hash, user identifier, and outcome.

    • PROV-02 Feature-store entries are immutable and support time travel (point-in-time reads).

      Expectation: The value of feature X at decision time can be retrieved years later without joining against current state.

    • PROV-03 Prompts and tool schemas are content-addressed and versioned.

      Expectation: The exact prompt and tools used for a decision can be reconstructed from the event log.

    • PROV-04 Retrieved documents are content-addressed.

      Expectation: The exact text returned to the model at decision time is retrievable. Not the URL — the text.

    • PROV-05 Caseworker action (accept, override, or modify) is recorded, along with the reason for any override or modification.

      Expectation: The model's recommendation that was overridden is preserved, not replaced.

    • PROV-06 Reconstruction time is measured and bounded.

      Expectation: A decision from 24 months ago can be replayed within a published SLA, not best-effort.
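The replay test can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions: SHA-256 as the content address, an in-memory `ContentStore` and `EventLog`, a content hash standing in for the prompt template version, and a deterministic `toy_model` in place of the real model pinned to `model_version`. None of these names come from the memos.

```python
import hashlib
import json
from dataclasses import dataclass


class ContentStore:
    """Content-addressed blob store: the key is the hash of the text itself."""

    def __init__(self) -> None:
        self._blobs: dict[str, str] = {}

    def put(self, text: str) -> str:
        key = "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
        self._blobs[key] = text
        return key

    def get(self, key: str) -> str:
        return self._blobs[key]


@dataclass(frozen=True)
class DecisionEvent:
    decision_id: str
    timestamp: str
    model_version: str
    prompt_template_hash: str
    retrieval_set_hash: str
    feature_snapshot_hash: str
    user_id: str
    outcome: str


class EventLog:
    """Append-only: entries can be added and read, never updated or deleted."""

    def __init__(self) -> None:
        self._events: list[DecisionEvent] = []

    def append(self, event: DecisionEvent) -> None:
        if any(e.decision_id == event.decision_id for e in self._events):
            raise ValueError(f"decision {event.decision_id} already logged")
        self._events.append(event)

    def get(self, decision_id: str) -> DecisionEvent:
        return next(e for e in self._events if e.decision_id == decision_id)


def record_decision(log, store, *, decision_id, timestamp, model_version,
                    prompt_template, documents, features, user_id, model_fn):
    prompt_hash = store.put(prompt_template)
    doc_hashes = [store.put(doc) for doc in documents]  # the text, not the URL
    retrieval_hash = store.put(json.dumps(doc_hashes))  # ordered manifest
    feature_hash = store.put(json.dumps(features, sort_keys=True))
    outcome = model_fn(prompt_template, documents, features)
    event = DecisionEvent(decision_id, timestamp, model_version, prompt_hash,
                          retrieval_hash, feature_hash, user_id, outcome)
    log.append(event)
    return event


def replay(event, store, model_fn) -> bool:
    """Rebuild the inputs from the log entry alone and compare outcomes."""
    prompt = store.get(event.prompt_template_hash)
    manifest = json.loads(store.get(event.retrieval_set_hash))
    documents = [store.get(doc_hash) for doc_hash in manifest]
    features = json.loads(store.get(event.feature_snapshot_hash))
    return model_fn(prompt, documents, features) == event.outcome


def toy_model(prompt, documents, features):
    # Hypothetical deterministic stand-in for the model at model_version.
    return "approve" if features["income"] <= 30000 else "refer"


log, store = EventLog(), ContentStore()
event = record_decision(
    log, store, decision_id="D-0001", timestamp="2024-05-01T14:30:00Z",
    model_version="2.3.1", prompt_template="Assess eligibility for {applicant}.",
    documents=["Policy excerpt text as returned to the model."],
    features={"income": 28000, "household_size": 3},
    user_id="caseworker-17", model_fn=toy_model,
)
replayed_ok = replay(log.get("D-0001"), store, toy_model)
```

Storing the feature snapshot by content hash is one way to meet PROV-02 without a join against current state; a point-in-time feature store is another. Timing the `replay` call is where the PROV-06 SLA measurement would attach.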

  4. Section 4

    4. Minimum-practice operationalization

    Each minimum practice that applies needs four things: a detector, an on-call rotation, a remediation SLA, and a notification path. Without these four, the practice is a wish.

    • OPS-01 Each applicable minimum practice has a named detector.

      Expectation: A test, a metric, or a monitor that surfaces violations. Not a quarterly review.

    • OPS-02 Each detector has an on-call rotation.

      Expectation: A specific person or rotating role receives the alert. Not 'the team.'

    • OPS-03 Each violation has a remediation SLA.

      Expectation: Time-bounded. Tracked. Reported.

    • OPS-04 Each violation has a notification path to the appropriate oversight body.

      Expectation: The agency, the impacted population, and the OMB liaison receive the notification as the SLA prescribes.
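The four-element test lends itself to a mechanical check. The sketch below assumes a hypothetical `PracticeControl` record per practice and flags whichever of the four elements is missing; the practice names, detector names, and recipients are invented for the example.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import Optional

# Values that name nobody in particular do not count as an on-call owner.
VAGUE_OWNERS = {"the team", "team", "tbd"}


@dataclass(frozen=True)
class PracticeControl:
    practice: str
    detector: Optional[str]               # test, metric, or monitor name
    on_call: Optional[str]                # specific person or rotating role
    remediation_sla: Optional[timedelta]
    notification_path: tuple[str, ...]    # who is told when the SLA fires


def gaps(control: PracticeControl) -> list[str]:
    """Return which of the four required elements are missing."""
    missing = []
    if not control.detector:
        missing.append("detector")
    if not control.on_call or control.on_call.strip().lower() in VAGUE_OWNERS:
        missing.append("on_call")
    if control.remediation_sla is None or control.remediation_sla <= timedelta(0):
        missing.append("remediation_sla")
    if not control.notification_path:
        missing.append("notification_path")
    return missing


def audit(controls) -> dict[str, list[str]]:
    """Map each incomplete practice to its missing elements."""
    report = {}
    for control in controls:
        missing = gaps(control)
        if missing:
            report[control.practice] = missing
    return report


controls = [
    PracticeControl(
        practice="ongoing_monitoring",
        detector="drift_monitor_daily",
        on_call="ai-ops-primary",
        remediation_sla=timedelta(hours=72),
        notification_path=("agency", "impacted population", "OMB liaison"),
    ),
    PracticeControl(
        practice="notice_and_appeal",
        detector="notice_render_check",
        on_call="the team",
        remediation_sla=None,
        notification_path=("agency",),
    ),
]
report = audit(controls)
```

Run in CI against the inventory, a check like this turns an unchecked OPS box into a failing build rather than a finding.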

  5. Section 5

    5. Adverse-action transparency

    Notices generated by AI-influenced decisions must reference the policy basis, the path to challenge, and a plain-language description of what the system considered.

    • ADV-01 Notice template references the specific policy text and version that bound the decision.

      Expectation: Not 'in accordance with policy.' Direct citation, retrievable.

    • ADV-02 Notice describes, in plain language, what facts the system considered.

      Expectation: Reviewable by counsel against the operative regulation. Survives appeal.

    • ADV-03 Notice explains the path to challenge.

      Expectation: Form name, deadline, where to send, where to call.

    • ADV-04 Notice templates are versioned and bound to the policy version active at decision time.

      Expectation: A notice for a decision from 18 months ago renders with the template active then, not the current one.
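ADV-04 amounts to a point-in-time lookup: given a decision date, find the template that was in force then. A minimal sketch, assuming templates are keyed by an effective date and rendered with Python's `string.Template`; the policy citations, form name, and wording are placeholders, not real regulatory text.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date
from string import Template


@dataclass(frozen=True)
class NoticeTemplate:
    effective_from: date
    policy_citation: str  # direct, retrievable citation with version
    body: str             # string.Template syntax


class TemplateRegistry:
    def __init__(self, templates) -> None:
        self._templates = sorted(templates, key=lambda t: t.effective_from)
        self._dates = [t.effective_from for t in self._templates]

    def active_on(self, decision_date: date) -> NoticeTemplate:
        """Latest template whose effective date is on or before the decision."""
        index = bisect_right(self._dates, decision_date) - 1
        if index < 0:
            raise LookupError(f"no template in force on {decision_date}")
        return self._templates[index]


def render_notice(registry, decision_date, facts, appeal_form, deadline_days):
    template = registry.active_on(decision_date)
    return Template(template.body).substitute(
        policy=template.policy_citation,
        facts="; ".join(facts),
        form=appeal_form,
        deadline_days=deadline_days,
    )


BODY = ("This decision was made under $policy. We considered: $facts. "
        "To challenge it, file $form within $deadline_days days.")

registry = TemplateRegistry([
    NoticeTemplate(date(2024, 1, 1), "Example Policy Manual section 4.2, v1", BODY),
    NoticeTemplate(date(2025, 6, 1), "Example Policy Manual section 4.2, v2", BODY),
])

# A decision from November 2024 renders with v1 even though v2 is current.
notice = render_notice(
    registry,
    decision_date=date(2024, 11, 15),
    facts=["reported household income", "household size"],
    appeal_form="Form EX-100 (hypothetical)",
    deadline_days=60,
)
```

`substitute` raises if any placeholder is left unfilled, so a template that adds a required field cannot silently render an incomplete notice.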

  6. Section 6

    6. Procurement obligations

    Procurement language requires the vendor to deliver the artifacts the agency needs to run impact assessments without the vendor's help.

    • PROC-01 Contract requires the vendor to provide test datasets representative of the deployment population.

      Expectation: Including demographic-group coverage. Not aspirational; a deliverable.

    • PROC-02 Contract requires fairness criteria to be named explicitly in the SOW.

      Expectation: Defined per use case. Reviewable by the program office.

    • PROC-03 Contract requires runnable benchmarks tied to the use case.

      Expectation: Executable by agency staff. Not a slide deck. Not a written report.

    • PROC-04 Contract requires the vendor to provide model artifacts in a form the agency can re-evaluate.

      Expectation: If the vendor cannot meet this, the contract identifies the alternative — a black-box re-evaluation harness or a model-card-plus-prompt-version commitment.

  7. Section 7

    7. Cross-cutting governance

    The connective tissue between the inventory, the evaluations, the operations, and the procurement.

    • GOV-01 A single accountable role owns the AI minimum-practices program at the agency.

      Expectation: Named, authorized, resourced. Not a steering committee.

    • GOV-02 Inventory, evaluations, and incidents are reviewed quarterly.

      Expectation: Output of the review is recorded in the inventory. Action items have owners and dates.

    • GOV-03 Public reporting of in-scope use cases is current with OMB requirements.

      Expectation: Same source of truth as the internal inventory, with appropriate redactions only.
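For GOV-03, "same source of truth" can be read as: the public report is a projection of the internal record, never a separately maintained copy. A small sketch, assuming an allowlist of publishable fields; the field names and the sample record are hypothetical.

```python
# Hypothetical allowlist: only these internal fields are ever published.
PUBLIC_FIELDS = (
    "system_id",
    "purpose",
    "deployment_status",
    "rights_impacting",
    "safety_impacting",
)


def public_view(record: dict) -> dict:
    """Derive the public entry from the internal record by allowlist."""
    missing = [field for field in PUBLIC_FIELDS if field not in record]
    if missing:
        raise KeyError(f"internal record lacks publishable fields: {missing}")
    return {field: record[field] for field in PUBLIC_FIELDS}


internal_record = {
    "system_id": "benefits-triage",
    "purpose": "Prioritize incoming benefit applications for review",
    "deployment_status": "production",
    "rights_impacting": True,
    "safety_impacting": False,
    "vendor_contract_number": "REDACTED-IN-PUBLIC-VIEW",
    "on_call": "ai-ops-primary",
}
published = public_view(internal_record)
```

An allowlist fails closed: a field added to the internal schema stays private until someone decides to publish it.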

Want a second set of eyes on the result?

Bring the marked-up checklist to a principal-level briefing. 30 minutes, no prep, run by Frank or Payton directly. We walk the unchecked boxes and tell you which ones would actually fail a defensible OIG review.