Disaster spikes are the real load test for benefits systems

Benefits systems do not fail at steady state. They fail during disaster declarations, open enrollment, and rate changes. If your agent workloads cannot autoscale, degrade gracefully, and preserve due process under a spike, you create errors at the worst possible moment.

The volume of a state benefits system in steady-state operation does not predict the volume of the same system seventy-two hours after a disaster declaration. The ratio is not 2x. It is not 5x. In the disasters whose after-action reviews we have worked on, the ratio has been 10x to 25x for the intake-facing systems, sustained for two to twelve weeks.

This is not a load-testing problem. Load testing assumes you know the curve. The disaster curve is shaped by the disaster, not the system, and the disasters of the past twelve months — the FEMA-declared events from the 2025 Atlantic hurricane season, the southern winter freezes, the western wildfire SNAP-D rounds — have given several states fresh evidence that their systems do not survive their own surge profile.

GAO has flagged improper-payment surges tied to surge processing in its 2026 program-integrity reports. The combination of the operational failure and the audit consequence is what makes this category different from other engineering work. A benefits system that produces wrongful denials during a surge is not just a degraded system — it is a system the program office has to explain to the federal partner, to the OIG, and sometimes to a court.

Why steady state misleads

The architecture that handles a steady-state benefits workload reads from and writes to the same systems in proportions that have been stable for years. The cache hit rates are predictable. The database connection pools are sized for steady state. The queue depths sit in a range engineering knows. Operations sees a familiar dashboard.

A disaster surge breaks every one of these assumptions simultaneously.

The same applicants who would have spread their submissions across a month all submit in the same forty-eight hours. The document-upload service that normally sees 200 uploads per minute now sees 3,000. The eligibility-check service that normally calls the wage-records integration 4,000 times per hour calls it 60,000 times per hour. The agent that produces eligibility recommendations is now competing with caseworker queries, fraud-flag investigations, and document-completeness checks for the same shared resources.
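
The two examples above imply the same multiplier. A quick check in Python, using only the rates quoted in this section (the dictionary keys are illustrative labels, not real service names):

```python
# Steady-state and surge rates quoted in the paragraph above.
# The keys are illustrative labels, not real service names.
RATES = {
    "document_uploads_per_minute": (200, 3_000),
    "wage_record_calls_per_hour": (4_000, 60_000),
}


def surge_multiplier(steady: float, surge: float) -> float:
    """Return how many times larger the surge rate is than steady state."""
    if steady <= 0:
        raise ValueError("steady-state rate must be positive")
    return surge / steady


multipliers = {name: surge_multiplier(*pair) for name, pair in RATES.items()}

for name, m in multipliers.items():
    print(f"{name}: {m:.0f}x")
```

Both come out at 15x, inside the 10x to 25x range cited earlier. Any limit tuned to steady state (a rate limiter, a connection pool, a queue maximum) is undersized by roughly the same factor.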

The result is a cascade. The wage-records integration starts rate-limiting because the rate-limiter was set against steady-state traffic. The agent's tool calls start timing out. The caseworker UI starts spinning. The fraud-flag service queues up. The document-upload service starts dropping uploads because its queue depth exceeded the configured maximum. Each of these is recoverable in isolation. In combination, they produce the system collapse a program office is trying to explain at a press conference the next week.

The lesson is not that load tests should be larger. The lesson is that the architecture has to be built for a curve that the load test cannot accurately model — and the building has to happen before the disaster, because the only thing harder than architecting a system for a surge is architecting one during a surge.

Demand modeling from real history

Steady-state plus a multiplier is the wrong model. The right model starts from the historical disaster record, broken out by program.

For SNAP, the disaster surge is shaped by the Disaster SNAP (D-SNAP) rules. The intake window is short — usually a week to two weeks — and the volume is concentrated in dense bursts at office openings. Wage-records and household-circumstance verification is the bottleneck, because the standard records often do not reflect the disaster-affected disruption to employment and housing. Manual document review queues spike.

For UI, the surge profile depends on the disaster type. A natural disaster produces a multi-week ramp as displaced workers file. A mass-layoff event produces a single-day spike followed by a long tail. A pandemic-style event produces a sustained surge that does not return to steady state for months. Each shape requires different engineering trade-offs.

For Medicaid, the surge is usually around the redetermination cliff or after a CMS State Health Official letter that opens a new eligibility window. The volume is lower than SNAP or UI but the determination complexity per case is higher.

The right demand model is not "what is the peak QPS." It is "what is the realistic distribution of intake volume, per program, per disaster type, per day-of-week, including the document-completeness backlog that follows." This model is buildable from the program office's own historical data, plus the FEMA disaster declarations record. It is the artifact every benefits-system architecture team should have on file and almost none do.
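
One way such a model might start is sketched below: a minimal Python example, assuming daily intake counts per program and a list of declaration dates are available. The function name, the data shapes, and the toy numbers are all hypothetical, chosen only to show the mechanics. It omits the day-of-week breakdown and the document-completeness backlog, which a real model would add.

```python
from collections import defaultdict
from datetime import date, timedelta
from statistics import median


def surge_profile(intake, declarations, window_days=14):
    """Build surge ratios by day offset for each (program, disaster_type).

    intake: dict mapping (program, date) -> applications received that day.
    declarations: list of (declaration_date, disaster_type) tuples.
    Returns {(program, disaster_type): {day_offset: ratio_to_baseline}}.
    Simplification: a later declaration of the same type overwrites an
    earlier one; a real model would keep every event separately.
    """
    # Exclude post-declaration days from the baseline so the surge
    # itself does not inflate what counts as "normal".
    surge_days = {
        decl + timedelta(days=k)
        for decl, _ in declarations
        for k in range(window_days)
    }
    baseline_counts = defaultdict(list)
    for (program, day), count in intake.items():
        if day not in surge_days:
            baseline_counts[program].append(count)
    baseline = {p: median(c) for p, c in baseline_counts.items()}

    profile = defaultdict(dict)
    for decl, disaster_type in declarations:
        for program, base in baseline.items():
            for k in range(window_days):
                count = intake.get((program, decl + timedelta(days=k)))
                if count is not None and base > 0:
                    profile[(program, disaster_type)][k] = count / base
    return dict(profile)


# Toy data, invented for illustration: 30 quiet days at 100 per day,
# then a declaration followed by a three-day burst.
start = date(2025, 1, 1)
intake = {("SNAP", start + timedelta(days=i)): 100 for i in range(30)}
decl = start + timedelta(days=30)
for k, count in enumerate([1500, 2500, 1000]):
    intake[("SNAP", decl + timedelta(days=k))] = count

profile = surge_profile(intake, [(decl, "hurricane")], window_days=3)
print(profile[("SNAP", "hurricane")])
```

The output is a curve per program and disaster type rather than a single peak number, which is the point: capacity targets can then be set against the shape, not against a guessed multiplier.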

Backpressure that prefers human review over auto-decision

The single most important architectural property under surge is what the system does when it cannot meet the latency budget for synchronous decisions.

The wrong answer — the one we see most often in postmortems — is auto-decision. Under pressure, the system relaxes its decision criteria, makes more determinations automatically without human review, and clears its queue. Throughput goes up. Wrongful denials and approvals go up with it. The OIG audit lands eighteen months later and the program office is explaining decisions made under duress.

The right answer is to queue with a tracked SLA. When the synchronous tier overflows its budget, the case routes to a human-review queue with the explicit SLA the policy actually requires (the 30-day SNAP processing window, the timeliness obligations under 42 CFR § 435.912 for Medicaid). The queue depth grows. Operations sees it. Surge staffing comes online. The system fails by being slow, not by being wrong.

This is the right failure mode for two reasons. First, the consequence of slow is recoverable; the consequence of wrong is not. Second, every applicable regulatory framework — SNAP, Medicaid, UI — treats the timeliness obligation as the floor, with exceptions for surge conditions that the agency can document. Auto-decisioning under load has no equivalent exception.

The architecture has to make queuing the easy path and auto-decisioning impossible without an explicit policy override. The override has to be a signed document, not a configuration toggle.
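
A minimal sketch of that property in Python follows. Every name here (Route, ReviewQueue, SignedOverride, route_case, relax_criteria) is hypothetical, and the capacity check is deliberately crude. The intent is only to show overflow going to a queue with a due date, and criteria relaxation being unreachable without a reference to a signed document.

```python
from dataclasses import dataclass, field
from datetime import date, timedelta
from enum import Enum
from typing import Optional


class Route(Enum):
    SYNCHRONOUS = "synchronous"
    HUMAN_REVIEW = "human_review"


@dataclass(frozen=True)
class SignedOverride:
    """Reference to a signed policy document (hypothetical shape)."""
    document_id: str
    signed_by: str


@dataclass
class ReviewQueue:
    sla_days: int
    items: list = field(default_factory=list)

    def enqueue(self, case_id: str, received: date) -> date:
        """Queue the case and return the date its SLA expires."""
        due = received + timedelta(days=self.sla_days)
        self.items.append((case_id, due))
        return due


def route_case(case_id, received, sync_in_flight, sync_capacity, queue):
    """Use the synchronous tier if it has room; otherwise queue the case.

    There is no branch that relaxes decision criteria under load.
    """
    if sync_in_flight < sync_capacity:
        return Route.SYNCHRONOUS, None
    return Route.HUMAN_REVIEW, queue.enqueue(case_id, received)


def relax_criteria(override: Optional[SignedOverride]) -> str:
    """Relaxation needs a signed document, never a configuration flag."""
    if not isinstance(override, SignedOverride):
        raise PermissionError("relaxing criteria requires a signed override")
    return override.document_id  # recorded with every affected decision


queue = ReviewQueue(sla_days=30)  # the 30-day SNAP window cited above
route, due = route_case("case-001", date(2025, 9, 1), 500, 500, queue)
print(route.value, due)
```

With the synchronous tier full, the case lands in the review queue with a due date thirty days out. Queue depth and the nearest due date are the numbers operations would watch.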

The pre-engineered degradation plan

There is a written artifact every benefits-system program office should have on file, signed by the relevant deputy attorney general or general counsel, before any surge event. It is the degradation plan, and it names:

Which decisions can be queued, sampled, or delayed, in priority order, with the regulatory basis for the delay.
Which decisions cannot be deferred under any circumstances, with the same basis.
Which alerts are sent to claimants when their case is delayed, with the language pre-approved by counsel and the legal basis cited.
Which manual processes activate when the automated tier is over budget, with named owners.
What happens to the data integrity and audit trail when the system is operating in degraded mode — the same evidence-store requirements apply, with the surge state itself recorded in the packet.
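
The plan is a legal document first, but its contents can also be kept in a machine-checkable form so that gaps show up before a surge does. The sketch below assumes a hypothetical schema. The decision classes, owners, template IDs, and basis strings are placeholders, not policy recommendations.

```python
from dataclasses import dataclass
from enum import Enum


class Handling(Enum):
    QUEUE = "queue"
    SAMPLE = "sample"
    DELAY = "delay"
    NEVER_DEFER = "never_defer"


@dataclass(frozen=True)
class DecisionRule:
    decision_class: str
    handling: Handling
    priority: int                 # lower number is degraded first
    regulatory_basis: str         # citation supplied by counsel
    manual_owner: str = ""        # named owner of the manual fallback
    notice_template_id: str = ""  # counsel-approved claimant notice


@dataclass(frozen=True)
class DegradationPlan:
    signed_by: str
    signed_on: str
    rules: tuple

    def deferral_order(self):
        """Deferrable decision classes, in the order they are degraded."""
        deferrable = [r for r in self.rules
                      if r.handling is not Handling.NEVER_DEFER]
        return [r.decision_class
                for r in sorted(deferrable, key=lambda r: r.priority)]

    def validate(self):
        """Return a list of gaps; an empty list means nothing is missing."""
        problems = []
        if not self.signed_by or not self.signed_on:
            problems.append("plan is unsigned")
        for r in self.rules:
            if not r.regulatory_basis:
                problems.append(f"{r.decision_class}: missing regulatory basis")
            if r.handling is not Handling.NEVER_DEFER:
                if not r.manual_owner:
                    problems.append(f"{r.decision_class}: missing manual owner")
                if not r.notice_template_id:
                    problems.append(f"{r.decision_class}: missing notice template")
        return problems


# Placeholder content only; real classes and citations come from counsel.
plan = DegradationPlan(
    signed_by="General Counsel (placeholder)",
    signed_on="2025-06-01",
    rules=(
        DecisionRule("class_b", Handling.QUEUE, 2, "BASIS-PLACEHOLDER",
                     "owner-b", "notice-b"),
        DecisionRule("class_a", Handling.DELAY, 1, "BASIS-PLACEHOLDER",
                     "owner-a", "notice-a"),
        DecisionRule("class_c", Handling.NEVER_DEFER, 0, "BASIS-PLACEHOLDER"),
    ),
)
print(plan.deferral_order())
print(plan.validate())
```

A check like validate() could run on every change to the plan, so that a rule without a cited basis, a named owner, or an approved notice is caught long before a declaration.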

We have seen agencies handle the 2025 disaster declarations well and we have seen agencies handle them poorly. The single biggest correlate of "well" is the presence of this written, pre-approved plan. The agencies that did the policy work in real time, during the surge, made worse decisions and produced worse audit outcomes.

The plan does not have to be elaborate. It has to exist, it has to be signed, and the operations team has to know where it is. The first surge after a new commissioner arrives is the one most likely to find the plan missing or out of date.

Notices, appeal windows, and due process under load

The due-process surface does not soften under surge. Notice requirements still apply. Appeal windows still run. The standard for what an automated determination has to disclose is the same on a quiet Tuesday and during a hurricane.

This produces a specific operational requirement: every determination made under degraded conditions has to carry, in its evidence-store record, the fact that it was made under those conditions. Hearing officers and OIG reviewers will treat surge-context decisions with appropriate weight if the surge context is documented. If it is not documented, the surge-context defense is unavailable retroactively.
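
As an illustration of that requirement, the sketch below refuses to write a decision record unless the surge state is attached. The field names are hypothetical, and the SHA-256 digest is an added assumption (a simple way to make later edits to the record detectable), not something the evidence store described here is known to do.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class SurgeState:
    degraded: bool
    declaration_ref: str  # identifier of the triggering declaration, if any
    plan_version: str     # version of the degradation plan in force
    queue_depth: int      # human-review queue depth at decision time


@dataclass(frozen=True)
class EvidenceRecord:
    decision_id: str
    outcome: str
    decided_at: str       # ISO 8601 timestamp, UTC
    surge_state: SurgeState

    def digest(self) -> str:
        """Hash of the full record, so later edits are detectable."""
        payload = json.dumps(asdict(self), sort_keys=True).encode("utf-8")
        return hashlib.sha256(payload).hexdigest()


def record_decision(decision_id, outcome, surge_state, now=None):
    """Build the record; the surge state is mandatory even on quiet days."""
    if not isinstance(surge_state, SurgeState):
        raise ValueError("surge state must be recorded with every decision")
    now = now or datetime.now(timezone.utc)
    return EvidenceRecord(decision_id, outcome, now.isoformat(), surge_state)


state = SurgeState(
    degraded=True,
    declaration_ref="DECLARATION-PLACEHOLDER",
    plan_version="plan-v3",
    queue_depth=12_400,
)
fixed_time = datetime(2025, 9, 1, 14, 30, tzinfo=timezone.utc)
record = record_decision("decision-001", "queued_for_review", state, fixed_time)
print(record.decided_at, record.surge_state.degraded, record.digest()[:12])
```

Recording the state on every decision, degraded or not, is the design choice that matters: a reviewer can then tell "made under normal conditions" apart from "conditions not recorded".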

The notice language under surge can be different from the notice language under steady state, provided the difference is pre-approved by counsel and the legal basis is cited. The agencies that prepare this language ahead of time have it in production within hours of a declaration. The agencies that do not have it tend to send the standard notice and explain it at the hearing later.

What to put in the architecture review

For systems being designed now or recompeting, the resilience clauses should not be vague. Sample contract language:

The System shall implement, document, and operationally verify, on a Government-defined cadence, surge-resilience properties including (a) per-decision-class capacity targets validated against historical disaster volumes for the Program; (b) a backpressure mechanism that routes overflow cases to a queue with a tracked SLA rather than relaxing decision criteria; (c) a pre-engineered Degradation Plan, signed by the Government's General Counsel, naming which decisions are queued, sampled, or delayed under sustained surge; (d) an Evidence Store record for every Decision that captures the surge state under which the Decision was made; and (e) a tabletop exercise of the Degradation Plan, run by the Vendor with the Government, no less than annually.

The annual tabletop is the most undervalued element. It is the operational equivalent of the rollback rehearsal — the artifact that surfaces the gaps in the plan while there is still time to fix them.

What to do Monday

Pull the historical intake-volume data for your largest program, and overlay it with the FEMA disaster declarations from the past five years. The shape of the surge is in the data.

Find the degradation plan. If you do not have one, that is the work. If you have one but counsel has not signed off on it in the past twelve months, that is the work.

Schedule a tabletop. Pick a realistic scenario — a hurricane making landfall, a manufacturing plant closure, a winter freeze. Walk the plan end-to-end with operations, engineering, and counsel in the room. The tabletop will fail in places. The failures are the gift.

Where Vardr fits

We run the architecture review for surge resilience, we run the tabletop exercises that test the degradation plan against realistic scenarios, and we help draft the counsel-signed plan when it does not yet exist. Frank's two decades shipping high-volume real-time decisioning systems give the engineering framing real grounding; Payton's deep experience with government delivery and due-process posture under stress makes the policy framing defensible to the people who will actually have to defend it after the surge.
