Four Production Reliability Patterns for AI Agents (Beyond Retry-With-Backoff)
Most agent tutorials stop at retry-with-backoff. Production stops there too, and that's how you get the $437 retry-loop incident from April 2026. The four patterns that actually keep an agent up at 3 AM are circuit breakers, partial success, human-in-the-loop, and graceful degradation, and the trick is knowing which failure signal triggers which pattern.
On April 29, 2026, a now-widely-cited incident report described an AI agent that hit a transient upstream error, retried with exponential backoff as designed, and didn't stop. The retries ran for hours overnight. The bill the next morning was $437 in API costs for thousands of identical failing tool calls. The agent's logic was correct, the retry library was correct, the backoff was correct. The pattern was wrong: retry-with-backoff is the default reliability pattern in every framework's tutorial, and it is also the cheapest way to lose money in production.
Real agent reliability needs more than one pattern. The agent failure surface has a taxonomy. Each kind of failure wants a different response. The four patterns below are the ones I've shipped in production. Each one solves a specific failure shape that retry-with-backoff doesn't.
The failure taxonomy
- Transient: a single request failed, the upstream is healthy, a retry will probably succeed. Examples: a 503, a TCP reset, a 429 with a small Retry-After. Retry-with-backoff is correct here.
- Persistent: the upstream is hard down or systematically rejecting your requests. Retrying is actively harmful (it bills you and adds load). Circuit breaker is correct.
- Quality: the upstream technically returned a 200 OK but the content is wrong (schema violation, hallucinated tool name, refusal where there shouldn't be one). HTTP-level retry won't help; this needs validation gates and quality-aware fallback.
- Partial: a batch of N items finished with K successes and N-K failures, mixed transient and persistent. You want to keep the successes and only retry the right slice of failures. Partial-success protocol is correct.
- Unrecoverable: the agent genuinely doesn't have enough information or authority to make the next decision. Backing off and retrying is the wrong move; the right move is to escalate. Human-in-the-loop is correct.
- Cost / rate-limit pressure: the system is functioning but the budget for the primary path is exhausted. Switching to a cheaper or slower fallback is correct: graceful degradation.
Most production agents I've reviewed handle exactly the first row (transient) and treat everything else as a transient. That's the bug.
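Written down as a dispatch table, the taxonomy fits on a screen. The keys and pattern names below are just labels for the categories above, not any library's API:

# The taxonomy as a dispatch table: classify the failure first, then fire the
# matching pattern instead of defaulting everything to retry-with-backoff.
PATTERN_FOR_FAILURE = {
    "transient":      "retry_with_backoff",
    "persistent":     "circuit_breaker",
    "quality":        "validation_gate_plus_fallback",
    "partial":        "partial_success_protocol",
    "unrecoverable":  "human_in_the_loop",
    "cost_pressure":  "graceful_degradation",
}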
Pattern 1: Circuit breaker (with quality awareness)
The textbook circuit breaker tracks HTTP error rate over a window: if more than X% of calls in the last N seconds fail, open the breaker, refuse new calls for a cooldown, then half-open and probe before closing. This works. It's not enough. The 2026 advance is tracking quality failures alongside HTTP failures: schema violations, refusal patterns, semantically wrong outputs. A circuit that only watches HTTP status will happily let you burn cash on 200-OK garbage.
The implementation in my error-patterns repo opens the breaker after 3 consecutive failures of either type (HTTP or schema), with a 30-second cooldown that scales up exponentially per consecutive open-state cycle (30s, 1m, 2m, 4m, capped at 5m). When the breaker is open, the orchestrator routes to a fallback path or returns a structured 'unavailable' response to the caller. The breaker is per-tool, not global: if your weather tool is degraded, your flight tool keeps working.
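A minimal sketch of that behavior follows. The QualityAwareBreaker class and its validate hook are my illustration here, not the repo's actual API:

import time

class ToolUnavailable(Exception):
    """Raised instead of calling the tool while the breaker is open."""

class QualityAwareBreaker:
    # One breaker instance per tool, not global: a degraded weather tool
    # must not take the flight tool down with it.
    def __init__(self, failure_threshold=3, base_cooldown=30.0, max_cooldown=300.0):
        self.failure_threshold = failure_threshold
        self.base_cooldown = base_cooldown
        self.max_cooldown = max_cooldown
        self.consecutive_failures = 0
        self.open_cycles = 0   # consecutive open-state cycles; drives the cooldown backoff
        self.open_until = 0.0

    def _trip(self):
        cooldown = min(self.base_cooldown * 2 ** self.open_cycles, self.max_cooldown)
        self.open_until = time.monotonic() + cooldown   # 30s, 1m, 2m, 4m, capped at 5m
        self.open_cycles += 1
        self.consecutive_failures = 0

    def call(self, tool_fn, validate, *args, **kwargs):
        probing = self.open_cycles > 0          # half-open: we were open and the cooldown elapsed
        if time.monotonic() < self.open_until:
            raise ToolUnavailable("breaker open: route to fallback or return 'unavailable'")
        try:
            result = tool_fn(*args, **kwargs)   # HTTP-level failures raise here
            if not validate(result):            # quality failure: 200 OK but wrong content
                raise ValueError("schema/quality check failed")
        except Exception:
            if probing:
                self._trip()                    # failed probe: re-open with a longer cooldown
            else:
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    self._trip()                # 3 consecutive failures of either type
            raise
        self.consecutive_failures = 0           # any success closes the breaker and resets backoff
        self.open_cycles = 0
        return result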
Pattern 2: Partial success protocol
The agent decides to call N tools, or process N items, in a single logical operation. Three of N succeed, two fail. The default behavior in most frameworks is to fail the whole operation and retry the whole batch. This is wrong for two reasons: it duplicates the side effects of the successes (sent emails, written rows) and it wastes compute / cost re-running work that already worked.
The right shape is what AWS Lambda calls 'partial batch response': the operation returns a structured result that says exactly which item IDs failed and why. The orchestrator (Trigger.dev in my case, but the pattern is platform-agnostic) re-queues only the failed slice. The successes are committed. The failed items get their own retry budget.
from dataclasses import dataclass

@dataclass
class BatchResult:
    successes: list[tuple[str, object]]   # (item.id, result)
    retryable: list[tuple[str, str]]      # (item.id, error) -> re-queue with its own retry budget
    fatal: list[tuple[str, str]]          # (item.id, error) -> dead-letter

def process_batch(items: list[Item]) -> BatchResult:
    successes, retryable, fatal = [], [], []
    for item in items:
        try:
            result = process_one(item)  # idempotent by item.id
            successes.append((item.id, result))
        except RetryableError as e:
            retryable.append((item.id, str(e)))
        except FatalError as e:
            fatal.append((item.id, str(e)))
    return BatchResult(successes=successes, retryable=retryable, fatal=fatal)

# Orchestrator decides:
# commit successes, re-queue retryable failures, dead-letter fatals.

The non-obvious detail: process_one has to be idempotent by item.id. The agent's tool calls cache or de-dupe by the item ID, so re-running the same item twice produces the same effect once. Without idempotency, partial-success creates duplicate side effects every time you retry. With idempotency, the worst case of a stuck retry is wasted compute, not corrupted state.
Pattern 3: Human-in-the-loop, structured
There are decisions an agent should not make autonomously: irreversible writes to a customer-facing record, cost commitments above a threshold, ambiguous classification with low confidence, edge cases the eval didn't cover. The pattern most production teams reach for is 'add a confirmation step.' That's correct in spirit and almost always implemented poorly.
Poor implementation: a chat message that says 'should I do X? (yes/no)' and waits indefinitely. This blocks the queue, ages out review tickets, and gives the human reviewer almost none of the context they need.
The shape that works: a structured escalation event with a fixed schema (the question, the agent's recommendation, the evidence the agent gathered, the smallest decision the human needs to make), a deadline (after which the agent takes a configurable default action: usually 'do nothing' for irreversible ops, 'proceed' for reversible ones), and an acknowledgment loop so the agent knows the human saw it. The human's decision becomes another tool result the agent consumes; the workflow resumes deterministically from the checkpoint that was waiting on it.
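A minimal sketch of that escalation shape, assuming nothing about any particular platform; the field names and the resolve helper are illustrative:

from dataclasses import dataclass, field
from datetime import datetime, timedelta, timezone

@dataclass
class EscalationEvent:
    # The smallest decision the human needs to make, plus everything needed to make it.
    question: str            # "Refund order #1234 for $310?"
    recommendation: str      # the agent's own suggested answer
    evidence: list[str]      # tool results / excerpts the agent gathered
    options: list[str]       # e.g. ["approve", "reject", "need more info"]
    default_action: str      # taken at the deadline: 'do nothing' for irreversible ops,
                             # 'proceed' for reversible ones
    deadline: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc) + timedelta(hours=1))
    acknowledged: bool = False   # set once a human has actually seen the event

def resolve(event: EscalationEvent, human_choice: str | None) -> str:
    # The human's decision (or the timed-out default) becomes just another tool
    # result the agent consumes when the checkpointed workflow resumes.
    if human_choice is not None:
        return human_choice
    if datetime.now(timezone.utc) >= event.deadline:
        return event.default_action
    raise TimeoutError("still waiting on the human; keep the checkpoint parked")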
Trigger.dev v4 makes this clean because the wait point is a first-class checkpoint: the workflow snapshots state, the platform persists it, the agent process can be torn down and respawned, and the wait resumes when the human responds. Other durable workflow platforms (Temporal, Inngest) have equivalent primitives; the pattern is the platform's, not the framework's.
Pattern 4: Graceful degradation
Rate limit or budget exhaustion is not a failure to retry past. It's a signal to take a different path. The pattern is a fallback chain, not a single primary endpoint. The chain in my reference implementation goes:
- Primary: the model the agent was designed against (Claude Sonnet 4.6 in 2026 vintage). 80%+ of calls land here.
- Same-provider downshift: the cheapest model from the same provider that can do the job (Claude Haiku 4.5). The agent's prompts and tool schemas are compatible; quality is lower but the structure is preserved.
- Cross-provider: a different vendor's model with a translated prompt and adapted tool schema (GPT-5 or Gemini Flash). This requires that the agent code has a model-portable abstraction, which is half the work of building this fallback.
- Self-hosted: a local fine-tuned model that handles the long tail when external providers are out (rare, but it's the path that kept the agent up during the November 2025 multi-vendor outage that made the news).
The trick is knowing which signal triggers which level of the chain. Rate limit returned with Retry-After under 10 seconds: stay on the primary, just wait. Hard 429 with no Retry-After: downshift to same-provider cheaper model. Authentication failure: skip to cross-provider (something is wrong with the primary credentials). Multi-second 5xx pattern: cross-provider. Sustained outage: self-hosted. The decision tree is small and explicit, not derived per-call from a generic 'try the next thing' loop.
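Sketched as code, the decision tree really is that small. The level names, the signal dict, and the degrade function are illustrative, not a real library API:

# Fallback levels, in order of preference.
PRIMARY, SAME_PROVIDER_CHEAPER, CROSS_PROVIDER, SELF_HOSTED = range(4)

def degrade(signal: dict) -> int:
    kind = signal["kind"]
    retry_after = signal.get("retry_after")      # seconds, if the provider sent one
    if kind == "rate_limit" and retry_after is not None and retry_after <= 10:
        return PRIMARY                           # just wait out the small Retry-After
    if kind == "rate_limit":
        return SAME_PROVIDER_CHEAPER             # hard 429, no usable Retry-After
    if kind == "auth_failure":
        return CROSS_PROVIDER                    # the primary credentials are the problem
    if kind == "server_error_pattern":
        return CROSS_PROVIDER                    # sustained 5xx from the primary
    if kind == "sustained_outage":
        return SELF_HOSTED
    return PRIMARY                               # one-off transients stay on the primary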
Why Trigger.dev v4 specifically
I've shipped these patterns on Temporal, Inngest, and Trigger.dev. For agent-shaped workloads (long pauses, many waits, idempotent steps, occasional human escalations) Trigger.dev v4's checkpoint-resume model fits the work better than Temporal's replay model, in my experience. Replay-based engines re-run your workflow from the start on every wake-up, which is a correctness win for deterministic business logic and a tax for non-deterministic agent calls. Trigger.dev's snapshot-and-restore approach treats the workflow as plain TypeScript and snapshots state at await points; you don't have to wrap every LLM call in a deterministic activity primitive.
The trade-off goes the other way for high-stakes financial or compliance workflows where the determinism guarantee is the point. Temporal is the right answer there. The 'pick the platform that matches the workload's determinism profile' rule is the actual lesson, not 'always use X.'
What I would change
- Add a budget guardrail per agent run, not per provider account. Today the cost circuit is at the API key level. A per-run cap that hard-stops if a single agent invocation exceeds $X would have prevented the kind of $437 retry-loop incident from the headline (a rough sketch follows this list).
- Surface the partial-success failure list as a first-class observability signal, not a log line. A dashboard that shows 'these item IDs failed in the last hour, by reason' is the lowest-overhead human review surface I know.
- Treat the human-in-the-loop deadline as a tunable per escalation type. Today every escalation has the same 1-hour deadline. Some decisions (a refund) should wait longer; others (a customer is waiting for a response) should default sooner.
- Add a 'replay drill' to the eval harness that simulates each kind of failure (transient, persistent, quality, partial, unrecoverable, rate-limit) and verifies the right pattern fires. Today the patterns are tested individually but not exercised as a coordinated set.
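For the first item, the per-run guardrail I have in mind is roughly this shape. RunBudget and its charge method are hypothetical, not something that exists in the repo today:

class RunBudgetExceeded(Exception):
    pass

class RunBudget:
    # Hypothetical per-run cap: every model/tool call charges against the budget
    # for this single agent invocation and hard-stops the run when it's exhausted.
    def __init__(self, limit_usd: float):
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def charge(self, cost_usd: float) -> None:
        self.spent_usd += cost_usd
        if self.spent_usd > self.limit_usd:
            # Stop the run outright instead of letting a retry loop bill all night.
            raise RunBudgetExceeded(
                f"run spent ${self.spent_usd:.2f} against a ${self.limit_usd:.2f} cap")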
The bigger lesson
Reliability for agents is not a single technique. It's a taxonomy of failures and a small set of patterns matched to each row. The mistake most teams make is treating all failures as transient and reaching for retry-with-backoff. The opposite mistake (treat all failures as unrecoverable, escalate everything) is what makes a human-in-the-loop system unusable. The middle is the work: classify the failure, fire the matching pattern, write down the decision so the next person can read it.
If a hiring manager asks me what 'production agent infrastructure' means, this is the answer. Not because the patterns are exotic, but because shipping all four together with the right triggers, telemetry, and observability is the difference between an agent that demos and an agent that runs.
References
- DEV Community: 'AI Agent Circuit Breakers: The Reliability Pattern Production Teams Are Missing' (2026)
- AWS: Kinesis / Lambda partial batch failure documentation
- Trigger.dev v4 vs Temporal: checkpoint-resume vs replay (trigger.dev/vs/temporal)
- Inngest vs Temporal comparison (akka.io / inngest.com)
- Centre for Long-Term Resilience: 180,000-transcript study on agent misalignment (Oct 2025 - Mar 2026)
- Maxim AI: 'Retries, Fallbacks, and Circuit Breakers in LLM Apps: A Production Guide'
- AI Agent Error-Handling Patterns (public repo, tanayshah.dev/projects/ai-agent-error-patterns/)