AI Agent Error-Handling Patterns
Stop your AI agents from failing silently. Four production reliability patterns, with tests and upgrade paths.
Most AI agent tutorials show the happy path: LLM responds, task succeeds, everyone's happy. Real production systems need to survive cascading failures when the LLM provider is down, partial batch failures (95 items succeed and 5 fail — now what?), edge cases where the AI can't decide and needs human judgment, and rate limits that force you to fall back to cheaper models.
I kept seeing teams ship AI features that worked beautifully in demo and broke catastrophically in week 2 — silent retry storms burning through OpenAI credits, batches that quietly dropped 5% of items, agents that stalled on ambiguous input. The patterns to fix all of this are well-known in distributed systems but mostly absent from the AI-agent literature. So I codified them.
The repo implements all four patterns on Trigger.dev v4 (a durable task runner, native to the agentic-workflow shape) in TypeScript. Each pattern has a standalone CLI test that runs in ~3ms, no server needed, so you can validate behavior in CI.
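For flavor, here is the rough shape of one of those tasks. This is a minimal sketch assuming the v4 SDK's `task` export and its declarative retry options; the task id and payload are invented for illustration, not taken from the repo.

```ts
import { task } from "@trigger.dev/sdk";

// Hypothetical payload for illustration.
type EnrichLeadPayload = { leadId: string };

// A durable task: the runner persists each attempt, so the retry policy
// below survives process restarts, unlike an in-process agent loop.
export const enrichLead = task({
  id: "enrich-lead",
  retry: {
    maxAttempts: 3,
    factor: 2, // exponential backoff between attempts
    minTimeoutInMs: 1_000,
    maxTimeoutInMs: 10_000,
  },
  run: async (payload: EnrichLeadPayload) => {
    // Call the LLM here; a thrown error hands control to the retry policy.
    return { leadId: payload.leadId, enriched: true };
  },
});
```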
The patterns are designed to compose: a single agent task can use circuit breaker + partial success + graceful degradation simultaneously. Each pattern documents a clear production upgrade path (Redis-backed circuit breaker state, Postgres-backed batch tracking, Slack-backed human escalation, Sentry-backed observability) so the example code is the starting point, not the end state.
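To make "compose" concrete, here is an illustrative sketch (not the repo's code) of two patterns layered in one call path: a shared circuit breaker guards every item of a batch, and the batch reports a partial-success split instead of a boolean.

```ts
// Minimal circuit breaker: fail fast while open, probe after a cooldown.
class CircuitBreaker {
  private failures = 0;
  private openedAt = 0;
  private isOpen = false;
  constructor(private threshold = 5, private cooldownMs = 30_000) {}

  async exec<T>(fn: () => Promise<T>): Promise<T> {
    if (this.isOpen && Date.now() - this.openedAt < this.cooldownMs) {
      throw new Error("circuit open: failing fast");
    }
    // Either closed, or half-open: the next call is a probe.
    try {
      const result = await fn();
      this.isOpen = false;
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.isOpen || this.failures >= this.threshold) {
        this.isOpen = true;
        this.openedAt = Date.now(); // (re)open and restart the cooldown
      }
      throw err;
    }
  }
}

// Partial success: per-item outcomes instead of all-or-nothing.
async function runBatch<I, O>(items: I[], work: (item: I) => Promise<O>) {
  const breaker = new CircuitBreaker();
  const succeeded: O[] = [];
  const failed: { item: I; error: unknown }[] = [];
  for (const item of items) {
    try {
      succeeded.push(await breaker.exec(() => work(item)));
    } catch (error) {
      failed.push({ item, error }); // record the failure, keep the batch going
    }
  }
  return { succeeded, failed };
}
```

Once the breaker opens, the remaining items fail fast into the `failed` bucket instead of hammering a downed provider, which is exactly the kind of composition the patterns are meant for.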
Trigger.dev v4 over LangGraph / Temporal
LangGraph is the right tool for in-process agent loops; Temporal is the right tool for sprawling enterprise workflows. For most agent teams, the durable-task primitive sits in between — and that's exactly Trigger.dev's shape. Picking it forced the patterns to express themselves at the right level of abstraction.
Standalone CLI tests, not just integration tests
If the patterns require a server, a database, and an API key to validate, they won't get used. The CLI test runs all four patterns in ~3ms with zero setup, so the project demos itself in the time it takes to clone.
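Here is the shape of such a test, sketched with Node's built-in test runner so nothing beyond `node` is required; the import path and class are stand-ins for the repo's actual exports.

```ts
import test from "node:test";
import assert from "node:assert/strict";

// Hypothetical import path standing in for the repo's real export.
import { CircuitBreaker } from "./circuit-breaker";

test("breaker opens after the failure threshold", async () => {
  const breaker = new CircuitBreaker(2 /* threshold */, 60_000 /* cooldown */);
  const boom = () => Promise.reject(new Error("provider down"));

  await assert.rejects(breaker.exec(boom)); // failure 1
  await assert.rejects(breaker.exec(boom)); // failure 2: breaker opens
  // Now it fails fast without touching the provider at all.
  await assert.rejects(breaker.exec(boom), /circuit open/);
});
```

Run it with `node --test`: no server, no database, no API key.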
Document the upgrade path, not just the demo
Most error-handling examples are toys. The README explicitly maps each demo decision to its production version: in-memory state → Redis; console alerts → Slack/PagerDuty; mock LLM → real LLM with cost tracking. Readers can see exactly what to change when they adopt the patterns.
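One way to keep that swap mechanical is to hide each demo decision behind an interface. A sketch for the circuit-breaker state (names are invented):

```ts
type BreakerState = { failures: number; openedAt: number };

// The breaker reads and writes state through this seam, so swapping
// in-memory state for Redis is a one-line change at the call site.
interface BreakerStateStore {
  get(key: string): Promise<BreakerState | null>;
  set(key: string, state: BreakerState): Promise<void>;
}

// Demo version: process-local, resets on restart.
class InMemoryStateStore implements BreakerStateStore {
  private states = new Map<string, BreakerState>();
  async get(key: string) {
    return this.states.get(key) ?? null;
  }
  async set(key: string, state: BreakerState) {
    this.states.set(key, state);
  }
}

// The production version implements the same interface over Redis
// (a hash per breaker key, with a TTL), shared across all workers.
```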
Trigger.dev v4 is purpose-built for long-running agent tasks. It beats LangGraph for orchestration sprawl and beats Temporal for AI-shaped workflows: the right primitive for production agents in 2026.
Vendor lock-in is a 2026 liability. The graceful-degradation pattern lets agents survive provider outages, rate limits, and cost spikes by automatic fallback through a provider chain.
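A minimal sketch of such a chain, with `declare`d stand-ins for the real SDK calls:

```ts
// Hypothetical stand-ins for real provider SDK calls.
declare function callOpenAI(prompt: string): Promise<string>;
declare function callAnthropic(prompt: string): Promise<string>;
declare function renderTemplate(prompt: string): string;

type Provider = { name: string; complete: (prompt: string) => Promise<string> };

// Ordered by preference: quality first, then cost, then a non-LLM floor
// that always succeeds (a canned template response).
const chain: Provider[] = [
  { name: "gpt-4", complete: callOpenAI },
  { name: "claude", complete: callAnthropic },
  { name: "template", complete: async (p) => renderTemplate(p) },
];

async function completeWithFallback(prompt: string): Promise<string> {
  const errors: string[] = [];
  for (const provider of chain) {
    try {
      return await provider.complete(prompt);
    } catch (err) {
      // Outage, rate limit, or budget cap: degrade to the next rung.
      errors.push(`${provider.name}: ${String(err)}`);
    }
  }
  throw new Error(`all providers failed: ${errors.join("; ")}`);
}
```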
When an agent gets stuck, durable resume tokens beat polling-based human-in-the-loop (HITL) by orders of magnitude in cost and latency: a suspended run consumes nothing until a human responds. Standard primitive for serious AI products.
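The core of the pattern in plain TypeScript, as an in-memory sketch; production would persist the token durably and deliver it through Slack, per the upgrade paths above.

```ts
import { randomUUID } from "node:crypto";

type Decision = "approve" | "reject";

// In-memory registry of paused runs. Production: durable storage keyed
// by token, with the token embedded in a Slack message's action URL.
const pending = new Map<string, (decision: Decision) => void>();

// Agent side: mint a token, notify a human, suspend until resumed.
function escalateToHuman(question: string): Promise<Decision> {
  const token = randomUUID();
  console.log(`Human input needed: ${question} (resume token: ${token})`);
  return new Promise((resolve) => pending.set(token, resolve));
}

// Human side (e.g. a webhook handler): one call resumes the exact run.
// No polling loop, so the suspended agent does no work while it waits.
function resume(token: string, decision: Decision): boolean {
  const resolve = pending.get(token);
  if (!resolve) return false; // unknown or already-used token
  pending.delete(token);
  resolve(decision);
  return true;
}
```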
Schema-first agent design: every tool input/output is Zod-validated, every error path has a typed shape. No stringly-typed data in production agents.
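What that looks like with Zod; the tool and its fields are invented for illustration.

```ts
import { z } from "zod";

// Every tool input is validated before the agent acts on it.
const SearchInput = z.object({
  query: z.string().min(1),
  maxResults: z.number().int().positive().default(10),
});

// Every error path has a typed shape: a discriminated union forces the
// caller to handle each outcome instead of matching on error strings.
const SearchResult = z.discriminatedUnion("status", [
  z.object({ status: z.literal("ok"), urls: z.array(z.string()) }),
  z.object({ status: z.literal("rate_limited"), retryAfterMs: z.number() }),
  z.object({ status: z.literal("failed"), reason: z.string() }),
]);
type SearchResult = z.infer<typeof SearchResult>;

// A malformed tool call becomes a typed failure the LLM can react to,
// not a thrown string.
function parseToolInput(raw: unknown) {
  const parsed = SearchInput.safeParse(raw);
  return parsed.success
    ? parsed.data
    : ({ status: "failed", reason: parsed.error.message } as const);
}
```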
- The most underrated pattern of the four is partial success. Teams underestimate how often batch operations are 95/5 splits, and how much pain comes from treating those as binary success/failure (see the sketch after this list).
- Graceful degradation across providers (GPT-4 → Claude → template) is more nuanced than it looks: different providers have different output formats, so the fallback chain has to either normalize outputs or accept lossy responses. Worth a separate post.
- Human-in-the-loop with resume tokens turns out to be the right primitive for most "agent gets stuck" situations. The pattern is well-suited to durable-task runners and surprisingly hard to retrofit onto stateless agent loops.
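Here is the partial-success sketch referenced above: `Promise.allSettled` plus an explicit split, so a 95/5 batch reports exactly which five items failed instead of collapsing to a boolean. Unlike the sequential composition sketch earlier, this settles the whole batch concurrently; the item and error types are illustrative.

```ts
type BatchReport<I, O> = {
  succeeded: { item: I; output: O }[];
  failed: { item: I; error: string }[];
};

async function settleBatch<I, O>(
  items: I[],
  work: (item: I) => Promise<O>,
): Promise<BatchReport<I, O>> {
  const results = await Promise.allSettled(items.map((item) => work(item)));
  const report: BatchReport<I, O> = { succeeded: [], failed: [] };
  results.forEach((result, i) => {
    if (result.status === "fulfilled") {
      report.succeeded.push({ item: items[i], output: result.value });
    } else {
      report.failed.push({ item: items[i], error: String(result.reason) });
    }
  });
  // Downstream can now retry exactly the failed 5 of 100, not all 100
  // and not zero, which is the whole point of the pattern.
  return report;
}
```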