Centaur caseworker — interfaces for human–agent collaboration

Most 'the AI didn't work' postmortems are adoption failures. Caseworkers route around tools they don't trust, the metrics quietly collapse, and the procurement looks like shelfware. The fix is product design and continuous verification — and the goal is a caseworker who does more, not fewer caseworkers.

The first wave of caseworker-facing AI deployments is now eighteen months into production in several states, and the sustained-usage data is in. It is not flattering.

We have seen programs where the tool was used by under twenty percent of the licensed caseworkers six months in. Programs where the integration was technically working — telemetry showed the model was returning suggestions, the UI was rendering them — but the caseworkers were closing the suggestion panel before reading it. Programs where the procurement spent eighteen months and seven figures to deploy a tool that the caseworkers, in practice, were copy-pasting around.

The instinct is to read this as a model-quality problem. It is not. Adoption failures look like model failures from the outside but they are product, interface, and trust failures. They are also entirely fixable, with a much smaller engineering investment than building a better model. The goal is not fewer caseworkers — it is a caseworker who handles substantially more cases at higher quality with less administrative drudgery. That is the centaur model, borrowed from chess: a human paired with a system performs better than either alone.

This piece is about how to get there.

Why caseworkers route around tools

The pattern is consistent across the programs we have looked at. The tool gets installed. Caseworkers are trained on it. For the first week, they engage with the suggestions out of curiosity and a sense of obligation. By week three, two things have happened.

First, the tool has produced a small number of suggestions that were clearly wrong in ways the caseworker could spot but the system could not — usually a missing piece of context the system had no way to see (a phone call note in the caseworker's head, a community circumstance the caseworker knows, a pattern in the claimant's history the model was not trained on).

Second, the tool has slowed the caseworker down in cases where the caseworker already knew the answer. The suggestion panel is one more thing to read. The reasoning trace is one more thing to scroll past. For the cases the caseworker can handle quickly, the tool is a tax.

After these two experiences, the caseworker's mental model of the tool shifts from "this is helping me" to "this is something I have to manage." From there, the route-around is rational. The caseworker handles the easy cases without consulting the tool, consults the tool selectively when they want to confirm a hard call, and avoids the suggestion panel for anything in between.

This is not a failure of caseworker training. It is the rational response of an experienced professional to a tool that imposes a cost on them per case without proportionate value. The fix is in the tool, not in the training.

What an interface for review actually looks like

The form-based UI is the legacy. The caseworker is asked to enter the same data the system already has, in the same fields the system already populated, and then asked to confirm the system's suggestion in a panel off to the side. The cognitive load is not just the suggestion — it is the entire ritual of navigating around the suggestion.

The interface that works inverts the relationship. The form is not the primary surface. The decision packet is. The caseworker arrives at a case and is presented with:

The proposed disposition, large and central.
The two or three load-bearing facts that drove it, each one a link to the source document or data point so the caseworker can verify.
A summary of what the system did and did not have access to (so the caseworker knows whether they need to look elsewhere).
An accept / amend / reject control that takes one click for the simple cases and opens a detail view for the complex ones.

The caseworker is no longer doing data entry. They are doing review, with all the context they need surfaced in the place they are looking. The hard cases get the attention they deserve. The straightforward cases get handled in seconds.

This is not a different model. It is a different interface against the same model. The model's accuracy did not change. The caseworker's accuracy went up because they no longer skim past the suggestion to get back to the form; the suggestion is the form.
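As a rough illustration, the decision packet described above could be modeled as a small data structure that the review screen renders top to bottom. This is a minimal sketch, not a reference design: the names `Fact`, `DecisionPacket`, `render`, and every field and sample value are hypothetical, chosen only to mirror the four elements listed earlier.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Fact:
    """One load-bearing fact plus a pointer the caseworker can follow to verify it."""
    statement: str
    source_ref: str  # hypothetical identifier for a document or data point


@dataclass
class DecisionPacket:
    case_id: str
    proposed_disposition: str
    facts: list[Fact]  # the two or three facts that drove the proposal
    sources_checked: list[str] = field(default_factory=list)
    sources_unavailable: list[str] = field(default_factory=list)


def render(packet: DecisionPacket) -> list[str]:
    """Lay the packet out in the order the caseworker reads it."""
    lines = [f"PROPOSED: {packet.proposed_disposition}"]
    lines += [f"  {f.statement} -> {f.source_ref}" for f in packet.facts]
    lines.append("Checked: " + (", ".join(packet.sources_checked) or "none"))
    lines.append("Not available: " + (", ".join(packet.sources_unavailable) or "none"))
    lines.append("[ accept ]  [ amend ]  [ reject ]")
    return lines


packet = DecisionPacket(
    case_id="case-001",  # illustrative values only
    proposed_disposition="Eligible",
    facts=[Fact("Reported income is below the threshold", "doc:wage-record")],
    sources_checked=["state wage records"],
    sources_unavailable=["employer verification"],
)
print("\n".join(render(packet)))
```

The point of the shape is that the disposition comes first and the data-entry form does not appear at all; what the system could not see is a field of its own rather than something the caseworker has to infer.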

Surfacing the why, legibly

The reason a caseworker rejects an AI suggestion is rarely "the model was wrong." It is usually "the model did not consider X, and I know X applies here." When the system surfaces its reasoning in a way that lets the caseworker see whether X was considered, the caseworker can accept, amend, or reject from a position of confidence rather than suspicion.

The reasoning trace has to be three things:

Tied to evidence, not narration. "The applicant's reported income of $X falls below the SNAP threshold of $Y for a household of N, per Section 273.9, based on the wage data retrieved from State Wage Records on date Z" — not "the model thinks this person is eligible because their income looks low." The trace is the evidence pack the caseworker can re-verify in fifteen seconds.

Honest about what the model did not consider. If the system did not pull data from a particular source because the integration was unavailable, that gap is visible. If a fact in the case was outside the model's training distribution, the model is encouraged to flag it rather than guess. Trust is built by the system being honest about its limits, not by it appearing universally confident.

Scannable, not exhaustive. Three to five bullet points. Not a paragraph. Caseworkers under load do not read paragraphs. They scan for the load-bearing fact. The trace has to be optimized for the way the work is actually done.

When the why is legible, the caseworker spends less time on each case and trusts the tool on more cases. The centaur effect is real: the human plus the tool is more accurate and faster than either alone. The cases that need the caseworker's full judgment still get it. The cases that did not need it stop consuming it.
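The three properties of the trace can be made concrete with a short sketch. It assumes each piece of evidence arrives already ordered by importance and carries its own rule citation and source; the `Evidence` type, `build_trace` function, and `MAX_BULLETS` cap are hypothetical names, and the X, Y, N, and Z placeholders are kept from the example above rather than filled in.

```python
from dataclasses import dataclass

MAX_BULLETS = 5  # upper end of the "three to five" range in the text


@dataclass(frozen=True)
class Evidence:
    claim: str         # what the system asserts
    rule: str          # the policy or regulation section relied on
    source: str        # the system the value was retrieved from
    retrieved_on: str  # date of retrieval


def build_trace(evidence: list[Evidence], gaps: list[str]) -> list[str]:
    """Return at most MAX_BULLETS bullets; known gaps always keep one slot."""
    slots = MAX_BULLETS - (1 if gaps else 0)
    bullets = [
        f"{e.claim} (per {e.rule}; {e.source}, retrieved {e.retrieved_on})"
        for e in evidence[:slots]  # assumes evidence is ordered most important first
    ]
    if gaps:
        bullets.append("Not considered: " + "; ".join(gaps))
    return bullets


trace = build_trace(
    [Evidence(
        claim="Reported income $X is below the SNAP threshold $Y for a household of N",
        rule="Section 273.9",
        source="State Wage Records",
        retrieved_on="date Z",
    )],
    gaps=["employer verification (integration unavailable)"],
)
for bullet in trace:
    print("-", bullet)
```

Reserving a slot for the gaps is a design choice in this sketch: when the trace is truncated to stay scannable, the line about what the system did not consider is the one that should never be cut.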

Trust as a measurable adoption variable

Trust is not a soft variable. It is measurable, and the measurements predict adoption failures before they happen.

The single best leading indicator we have seen is the acceptance rate on simple cases. When caseworkers stop accepting the system's suggestions on cases the system is clearly handling well, the trust slope has turned and the route-around is forming. The signal usually shows up four to six weeks before the program-office adoption review sees a problem in the aggregate numbers.
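One simple way to watch for that turn is to compare the most recent weeks of simple-case accept rates against the earlier baseline. The sketch below is an assumption-laden starting point, not a validated detector: the three-week window and the ten-point drop threshold are placeholders to tune against your own data, and the function names are hypothetical.

```python
from statistics import mean


def weekly_accept_rates(weeks):
    """weeks: (accepted, total) counts for simple cases, oldest first. Empty weeks are skipped."""
    return [accepted / total for accepted, total in weeks if total > 0]


def trust_slope_turning(weeks, recent=3, min_drop=0.10):
    """True when the recent average accept rate sits at least min_drop below the baseline."""
    rates = weekly_accept_rates(weeks)
    if len(rates) <= recent:
        return False  # not enough history to form a baseline
    baseline = mean(rates[:-recent])
    current = mean(rates[-recent:])
    return (baseline - current) >= min_drop


# Made-up counts for illustration: six steady weeks, then three declining ones.
declining = [(90, 100)] * 6 + [(80, 100), (75, 100), (70, 100)]
steady = [(90, 100)] * 9
print(trust_slope_turning(declining), trust_slope_turning(steady))
```

Run per caseworker and per office rather than on the aggregate, since the aggregate is exactly where the signal arrives late.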

Three dashboard metrics catch this.

Accept / amend / reject rates, segmented by case complexity. Track them over time per caseworker, per office, and per case type. A drop in the accept rate on simple cases is the canary. On complex cases, a steady amend rate is healthier than a 100% accept rate: it indicates the caseworker is using the tool as a starting point and contributing real judgment.

Time-to-disposition on accepted vs. amended cases. When accepted cases take longer than they used to, the caseworker is double-checking everything because they no longer trust the system. The metric exposes the cognitive tax even when the headline accept rate looks stable.

Override-followed-by-confirmation. When a caseworker rejects a suggestion and the case later comes back through (an appeal, a quality-control review) confirming the caseworker was right, that is signal the tool can learn from. When the system was right and the caseworker overrode anyway, that is signal the trust path is broken and needs explicit attention.
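The three metrics above can be computed from a flat export of case records. The sketch below assumes each record is a dictionary with hypothetical `complexity`, `action`, and `minutes` keys, and that a later appeal or quality-control outcome is available as `caseworker_upheld` or `system_upheld`; none of these names come from a real case-management system. It also treats both amend and reject as an override, which is an interpretation rather than something the metric definition fixes.

```python
from collections import Counter, defaultdict
from statistics import median

ACTIONS = ("accept", "amend", "reject")


def rates_by_complexity(records):
    """Share of each action within each complexity tier."""
    counts = defaultdict(Counter)
    for r in records:
        counts[r["complexity"]][r["action"]] += 1
    return {
        tier: {a: c[a] / sum(c.values()) for a in ACTIONS}
        for tier, c in counts.items()
    }


def median_minutes_by_action(records):
    """Median time-to-disposition per action; watch 'accept' creep upward."""
    minutes = defaultdict(list)
    for r in records:
        minutes[r["action"]].append(r["minutes"])
    return {a: median(v) for a, v in minutes.items()}


def classify_override(action, later_review):
    """later_review is 'caseworker_upheld', 'system_upheld', or None if never reviewed."""
    if action == "accept" or later_review is None:
        return "no_signal"
    if later_review == "caseworker_upheld":
        return "tool_can_learn"
    return "trust_needs_attention"


records = [  # made-up rows, for illustration only
    {"complexity": "simple", "action": "accept", "minutes": 2},
    {"complexity": "simple", "action": "accept", "minutes": 3},
    {"complexity": "simple", "action": "accept", "minutes": 4},
    {"complexity": "simple", "action": "reject", "minutes": 9},
    {"complexity": "complex", "action": "amend", "minutes": 20},
    {"complexity": "complex", "action": "accept", "minutes": 12},
]
print(rates_by_complexity(records)["simple"])
print(median_minutes_by_action(records))
```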

These metrics belong to operations, not to engineering. They are the discipline that turns adoption from a hope into a measurable property of the deployment.

What the caseworker keeps

This is the part most procurement decks skip. The tool that works is the one that leaves authority with the caseworker. Specifically:

The caseworker decides on every determination. The system prepares. The system never adjudicates.
The caseworker can override any suggestion without justification, and the override is logged as a normal action, not flagged as deviance.
The caseworker's judgment is the load-bearing input to the determination. The system's role is to make that judgment faster, better informed, and more consistent across cases.
The cases that need the caseworker's full attention still get it. The system never tries to make the complex cases simple; it makes the simple cases fast so the caseworker has more time for the hard ones.

The result is a caseworker who handles more cases per day, makes fewer errors, and spends a larger share of their time on the cases that genuinely need a human professional's attention. The caseload goes up. The error rate goes down. The job becomes more about judgment and less about paperwork. We have not seen a deployment that aimed for this and produced caseworker layoffs; the bottleneck in benefits delivery is not headcount, it is throughput per worker.
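These guarantees are easier to keep when they are built into how a determination gets recorded. The sketch below shows one way that could look, under stated assumptions: `Determination`, `decide`, and `audit_log` are hypothetical names, and requiring the caseworker's own disposition on amend or reject is a choice made for this example, not a rule taken from any particular system.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Determination:
    case_id: str
    disposition: str
    decided_by: str             # always a caseworker id; there is no system path
    action: str                 # "accept", "amend", or "reject"
    note: Optional[str] = None  # optional, never required


audit_log: list[Determination] = []


def decide(case_id, proposed, caseworker_id, action, disposition=None, note=None):
    """Record the caseworker's determination; the proposal is only an input."""
    if not caseworker_id:
        raise ValueError("a determination needs a caseworker")
    if action == "accept":
        final = proposed
    elif action in ("amend", "reject"):
        if disposition is None:
            raise ValueError("amend and reject need the caseworker's own disposition")
        final = disposition
    else:
        raise ValueError(f"unknown action: {action}")
    entry = Determination(case_id, final, caseworker_id, action, note)
    audit_log.append(entry)  # same record shape for every action, no deviance flag
    return entry


decide("case-001", "Eligible", "cw-17", "accept")
decide("case-002", "Eligible", "cw-17", "reject", disposition="Ineligible")  # no note given
print(audit_log)
```

Two things are deliberately absent: a code path that finalizes a case without a caseworker id, and a required justification field on overrides.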

What to do Monday

Pull last month's data for whichever AI tool you have deployed for caseworkers. Find the per-caseworker accept rate over time. If it is dropping, especially on the simple cases, you have an adoption problem that the aggregate dashboards have not surfaced yet.

Sit with a caseworker for an hour. Watch them work. Note every place they have to switch context, copy-paste, or skim past the AI suggestion. Each of those is an interface failure you can fix in the next sprint.

Add one metric to the operations dashboard: the share of cases on which the caseworker accepted the suggestion as presented, amended it, or rejected it, segmented by case complexity. A drop in the accept rate on simple cases is the leading indicator of the route-around forming.

Where Vardr fits

We run adoption recoveries on caseworker-facing AI tools that are not being used in practice, and we design the interfaces upstream when the tool is still being scoped. The product and operations work pair naturally — Kyal's product instinct identifies the interface failures that the metrics will not surface for months; Sam's continuous-verification discipline catches the trust slope turning in time to do something about it. The deliverable is a working tool the caseworker reaches for, not an unused license the agency is still paying for.