The notes.
Technical notes on what shipping production AI infrastructure in 2026 actually looks like — agent compliance, multi-vendor orchestration, MCP, the trade-offs nobody warns you about.
Tessen: Building the Harness for AI Agents — From Forensic Capture to Runtime Control
Traditional observability treats an agent call like a web request. But an agent is a program that thinks, and it fails in ways a span can't show you. Tessen is the harness I'm building: two lines to capture everything your agent actually does in production, then catch the runaway loop before the bill does.
Why Vector Similarity Alone Lies in RAG (and the Rerank Step Most Pipelines Skip)
Vector top-k retrieval is the standard RAG starting point and it has a known failure mode: high cosine similarity does not equal high relevance. The fix is well-known too, and most production pipelines I read in code review skip it. The two-stage paradigm (broad recall, narrow precision) plus the right reranker buys you 17-40 percentage points of accuracy for ~120ms of latency. Here's the framework.
How to Evaluate a Founding Engineer in 2026: A Playbook for Seed-Stage Founders
Most founders pattern-match founding-engineer hires off senior-IC interview rubrics and end up with the wrong person. Here is the four-dimensional grid I use, the specific signals that predict success, and the anti-patterns that look strong in interviews but stall the team six weeks in.
Structured Outputs vs Tool Calling for LLM Data Extraction: Pick by Intent, Not by Habit
Both endpoints take a JSON schema. Both return validated structured data. They are not the same primitive and picking the wrong one is the most common shape of 'this LLM extraction is unreliable in production' I see in code review. The rule is: structured outputs for extraction and classification, tool calling for triggering actions, and structural validity is not semantic correctness. Here's the framework I use.
Why Prompt-Injection Filters Don't Save You (and What Actually Limits the Blast Radius)
OWASP LLM01:2025 ranks prompt injection as the #1 LLM risk. 73% of 2025 production deployments had a prompt-injection vulnerability and adaptive attack success rates against state-of-the-art defenses exceed 85%. The honest production answer is not a better filter; it's a defense-in-depth architecture that assumes injection happens and limits what the attacker can actually do once it does. Here's what I ship and what I'd ship.
Why Your Agent's UI Lags Behind Its Tool Calls (and the Streaming JSON Parser That Fixes It)
Anthropic streams tool-call arguments as `input_json_delta` events that don't form valid JSON until the block closes. Most agent UIs wait for the close, parse the full string, then render the result, which means the user stares at 'agent is calling a tool…' for two seconds while the args stream in. The fix is a partial-mode JSON parser that emits valid intermediate states. The fix exists in the Anthropic SDK already; the production patterns around using it correctly are what this post is about.
Picking a Durable Workflow Engine for AI Agents in 2026: Trigger.dev v4 vs Inngest vs Temporal
Temporal raised $300M at a $5B valuation in February 2026 on the back of 1.86 trillion AI-native action executions. Trigger.dev v4 went GA in the same window with explicit AI-agent positioning. Inngest is the event-driven third option. The choice is not 'pick the best one' but 'pick the one whose primitives match your workflow's determinism profile.' Here's the decision framework.
Why Your iOS Streaming Chat Is Cooking the GPU (and the 30-Line Debounce Buffer That Fixes It)
MarkdownUI plus AsyncSequence is the obvious way to render a streaming Claude or GPT response in SwiftUI. It is also the way I shipped a chat app where the phone got noticeably warm during long replies. The fix is a 30-line debounce buffer that re-parses on newlines or every 80ms, whichever comes first. Same UX, GPU stays cool, the bug stops showing up in the App Store reviews.
Shipping 100+ Tools to Claude Without Bloating the Cache: Anthropic Tool Search and Deferred Loading
At 50 tools your agent's tool-selection accuracy is 84-95%. At 200 tools it falls to 41-83%. The naive fixes (semantic prefilter, RAG over tool definitions) all invalidate your prompt cache, which is the other expensive thing you're trying to avoid. Anthropic's `defer_loading: true` plus tool search is the rare feature that solves both problems at once. Here's the design and the gotchas I hit shipping it.
When to Use SSE vs WebSocket for AI Agent Streaming (and Why I Use Both)
The 'just use WebSocket' default for any real-time AI feature is wrong. Server-Sent Events is the right protocol for server-to-client token streaming (chat, agent output, tool results). WebSocket is the right protocol for client-to-server audio capture and any genuinely bidirectional channel. Same product, two protocols, and the choice between them is a memory and battery decision before it's a feature decision.
Building a Sub-2-Second Sales Coach: Two-Path Architecture for Real-Time Conversation AI
Live conversation coaching has a sub-2-second latency budget. Post-call analysis benefits from 30 seconds of frontier-model reasoning. The same agent can't satisfy both. The pattern that worked is two parallel paths feeding off the same diarized stream: a fast path on Llama 3.3 + Groq for in-the-moment nudges, a slow path on Claude Opus / Gemini 2.5 for the manager's ride-along artifact.
Anonymizing PII Client-Side Before It Reaches the LLM (Why I Don't Trust the Gateway)
The 2026 default for LLM PII protection is a server-side gateway: Presidio plus LiteLLM, Lakera Guard, Skyflow. The gateway sees the PII, redacts it, forwards the scrubbed prompt. That's a meaningful improvement over plaintext, and the gateway still has the data. The fat-client pattern moves redaction to the device: PII never leaves the trust boundary, not even to the gateway. Here's the design, the threat model that justifies it, and why I'd defend it for healthcare and legal AI products.
Skill and Memory Injection for Agent Loops: Why I Don't Let the Agent Page Its Own Memory
The agent-memory landscape in 2026 has settled into four big buckets: provider-managed, Letta-style self-paging, Mem0/Zep middleware, and Anthropic Agent Skills. The pattern I ship is none of those exactly: orchestrator-driven injection into the cached prefix, with a hard boundary between skills (how the agent does things) and memory (what the agent knows). Here's why I picked that shape and where the others are still right.
What 1M Context Actually Buys You (and What It Doesn't): Production Patterns from a 2026 Agent Loop
Anthropic made 1M context generally available on Claude Sonnet 4.6 in March 2026 with no surcharge. The marketing reads like RAG is dead. The production reality reads like: lost-in-the-middle is real, 1M is best as a tool, not a default, and the right answer is usually a hybrid. Here's how I actually use the 1M window in an agent loop and where I still reach for RAG.
Why I Use a Postgres Append-Only Log for Agent Chat (Not Redis Streams)
The 2026 default for resumable LLM chat is Redis Streams behind Vercel AI SDK 5's resumeStream. It works, and it doubles your storage footprint with two systems of record: Postgres for messages, Redis for resumption. The version I shipped uses one append-only Postgres table for both. Here's the schema, the seq-number contract, and why I'd defend the choice in front of a hiring manager.
Bubblewrap, Landlock, gVisor, Firecracker: Choosing a Sandbox for AI Agent Code Execution in 2026
Anthropic uses bubblewrap for Claude Code and gVisor for Claude web. OpenAI Codex defaults to Landlock. Vercel and E2B use Firecracker microVMs. The four leading sandbox options are not interchangeable, and 'pick the strongest one' is the wrong heuristic. Here's the two-axis decision model I use, the use cases each option is right for, and why I picked bubblewrap + seccomp + AST validation for the in-house Python execution layer instead of reaching for a microVM.
Four Production Reliability Patterns for AI Agents (Beyond Retry-With-Backoff)
Most agent tutorials stop at retry-with-backoff. Production stops there too, and that's how you get the $437 retry-loop incident from April 2026. The four patterns that actually keep an agent up at 3 AM are circuit breakers, partial success, human-in-the-loop, and graceful degradation, and the trick is knowing which failure signal triggers which pattern.
Why Every Tool in Your MCP Server Needs a Different TTL
The default MCP server caches uniformly: one defaultTTL of an hour, applied across every tool. The default is wrong because every tool exposes data with a different volatility profile and a different upstream reliability profile. Here's the two-axis framework I use to pick TTLs, the three real tools that ended up at minutes / hours / days, and what the June 2026 MCP spec roadmap does (and doesn't) solve.
Building a Black-Box Flight Recorder for Streaming Anomalies
Logging every event in a high-volume streaming pipeline drowns the storage system and obscures the rare event you actually need. The flight-recorder pattern (rolling ring buffer, snapshot on trigger) trades retention for context: you keep the last N seconds in memory and dump them on anomaly. Here's the three-stage Kafka design I shipped, and the math behind 5K-events-before / 3K-events-after.
From Days to Hours: Migrating a 20M-Record Wikipedia ML Pipeline From Sync to Async
The naive asyncio.gather rewrite gets you a 3x speedup and a 503 from the upstream API. The right pattern is semaphore + httpx + bounded concurrency + a smart retry policy that respects the rate-limit headers the upstream actually returns. Here's the version that took ingestion from days to hours, and the four bottlenecks that turned out not to be I/O at all.
Why I Run Postgres Migrations on Container Startup, Not From CI
The internet consensus is that database migrations belong in your CI/CD pipeline, never in your container's entrypoint. The consensus is right for the wrong reasons. Here are the four coordination problems people are actually trying to avoid, the 30-line Postgres advisory-lock pattern that solves all four, and why container-startup migrations are the simplest deploy story for a small team that doesn't have a release engineer.
What I Learned About Anthropic's Prompt Cache From Running an Agent Loop in Production
Prompt caching is sold as a 90% cost cut. In production agent loops it can quietly become a 30% cost increase, depending on five things the docs do not put in big letters. Here are the patterns that made the math actually work for me, including the on-demand tool loading trick that keeps the cache alive.
Why I Use gRPC for the Agent-to-Sandbox Bridge (and JSON-RPC Inside It)
Most teams pick one wire protocol and use it everywhere. The right answer is to pick by trust boundary: gRPC + Protobuf for the API-to-pod hop where you control both ends, JSON-RPC over a subprocess pipe inside the sandbox where the network surface has to be zero. Here's the two-tier design and the math behind it.
Why I Built My Own Agent Eval Harness Instead of Reaching for LangSmith
The off-the-shelf agent observability tools (LangSmith, Braintrust, Phoenix, Langfuse) are excellent for what they do. They are not eval harnesses. Here's the difference, why it matters, and the ~200-line Redis-coordinated wave scheduler I wrote when I needed a real one.
Shipping an Agent iOS App From Zero in Two Weeks: What Survived, What Didn't
Native iOS clients are still the highest-fidelity surface for an AI agent — push, haptics, secure enclave, real local state. Here are the five engineering calls that survived contact with production, and the two I'd undo if I were starting over.
Building a Zero-Data-Retention Layer for Production LLM Agents
Anthropic's hosted Programmatic Tool Calling is fast, accurate, and absolutely incompatible with Zero Data Retention. Here's the request-interception pattern enterprise teams use to keep customer data on-prem while preserving model code quality.
Multi-Vendor Agent Design: Why One Model Isn't Enough in 2026
Single-vendor agent architectures are a 2024 pattern. In 2026, the right move is splitting the loop — Claude for reasoning, Gemini for high-resolution vision, Llama (via Groq) for sub-100ms hot paths. Here's the orchestration shape that actually ships.
Designing Tool Surfaces for LLM Agents: What Goes On the Tool, What Stays In the Loop
Tool-surface design is the highest-leverage knob in production agent infrastructure — and the one most engineers underweight. Here's the design language for cache-friendly, token-minimal, domain-shaped tools that scale past the demo.
Picking MCP Servers for an Agent Without Drowning the Context Window: A Selection Heuristic for 2026
An MCP server is roughly 500-1,000 tokens of context per tool, billed every turn forever. The right number for a production agent is almost always 3-5, not 15. Here's the heuristic I use and the math behind it.
What the Bubblewrap Sandbox Escape Tells Us About Agent Runtime Hardening in 2026
An autonomous agent that can disable its own sandbox is a sandbox you no longer have. Lessons from a real 2026 escape — and the four-layer model I use to reason about agent runtime isolation in production.