TANAY.SHAH
// NOTES · 30 POSTS

The notes.

Technical notes on what shipping production AI infrastructure in 2026 actually looks like — agent compliance, multi-vendor orchestration, MCP, the trade-offs nobody warns you about.

01 · 2026-06-05 · 8 MIN READ

Tessen: Building the Harness for AI Agents — From Forensic Capture to Runtime Control

Traditional observability treats an agent call like a web request. But an agent is a program that thinks, and it fails in ways a span can't show you. Tessen is the harness I'm building: two lines to capture everything your agent actually does in production, then catch the runaway loop before the bill does.

02 · 2026-05-10 · 7 MIN READ

Why Vector Similarity Alone Lies in RAG (and the Rerank Step Most Pipelines Skip)

Vector top-k retrieval is the standard RAG starting point and it has a known failure mode: high cosine similarity does not equal high relevance. The fix is well-known too, and most production pipelines I read in code review skip it. The two-stage paradigm (broad recall, narrow precision) plus the right reranker buys you 17-40 percentage points of accuracy for ~120ms of latency. Here's the framework.

03 · 2026-05-10 · 11 MIN READ

How to Evaluate a Founding Engineer in 2026: A Playbook for Seed-Stage Founders

Most founders pattern-match founding-engineer hires off senior-IC interview rubrics and end up with the wrong person. Here is the four-dimensional grid I use, the specific signals that predict success, and the anti-patterns that look strong in interviews but stall the team six weeks in.

04 · 2026-05-09 · 7 MIN READ

Structured Outputs vs Tool Calling for LLM Data Extraction: Pick by Intent, Not by Habit

Both endpoints take a JSON schema. Both return validated structured data. They are not the same primitive and picking the wrong one is the most common shape of 'this LLM extraction is unreliable in production' I see in code review. The rule is: structured outputs for extraction and classification, tool calling for triggering actions, and structural validity is not semantic correctness. Here's the framework I use.

05 · 2026-05-08 · 8 MIN READ

Why Prompt-Injection Filters Don't Save You (and What Actually Limits the Blast Radius)

OWASP LLM01:2025 ranks prompt injection as the #1 LLM risk. 73% of 2025 production deployments had a prompt-injection vulnerability and adaptive attack success rates against state-of-the-art defenses exceed 85%. The honest production answer is not a better filter; it's a defense-in-depth architecture that assumes injection happens and limits what the attacker can actually do once it does. Here's what I ship and what I'd ship.

06 · 2026-05-07 · 7 MIN READ

Why Your Agent's UI Lags Behind Its Tool Calls (and the Streaming JSON Parser That Fixes It)

Anthropic streams tool-call arguments as `input_json_delta` events that don't form valid JSON until the block closes. Most agent UIs wait for the close, parse the full string, then render the result, which means the user stares at 'agent is calling a tool…' for two seconds while the args stream in. The fix is a partial-mode JSON parser that emits valid intermediate states. The fix exists in the Anthropic SDK already; the production patterns around using it correctly are what this post is about.

07 · 2026-05-06 · 8 MIN READ

Picking a Durable Workflow Engine for AI Agents in 2026: Trigger.dev v4 vs Inngest vs Temporal

Temporal raised $300M at a $5B valuation in February 2026 on the back of 1.86 trillion AI-native action executions. Trigger.dev v4 went GA the same window with explicit AI-agent positioning. Inngest is the event-driven third option. The choice is not 'pick the best one' but 'pick the one whose primitives match your workflow's determinism profile.' Here's the decision framework.

08 · 2026-05-05 · 7 MIN READ

Why Your iOS Streaming Chat Is Cooking the GPU (and the 30-Line Debounce Buffer That Fixes It)

MarkdownUI plus AsyncSequence is the obvious way to render a streaming Claude or GPT response in SwiftUI. It is also the way I shipped a chat app where the phone got noticeably warm during long replies. The fix is a 30-line debounce buffer that re-parses on newlines or every 80ms, whichever comes first. Same UX, GPU stays cool, the bug stops showing up in the App Store reviews.
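The flush rule described above (re-parse on a newline or after 80ms, whichever comes first) is language-agnostic, so here is a minimal sketch of it in Python rather than Swift. It is not the post's implementation: the `DebounceBuffer` class, its `push` method, and the injectable `clock` parameter are hypothetical names chosen for illustration.

```python
import time


class DebounceBuffer:
    """Accumulate streamed text; signal a re-parse on newline or after `interval` seconds."""

    def __init__(self, interval=0.08, clock=time.monotonic):
        self.interval = interval      # 80ms, per the rule above
        self.clock = clock            # injectable so the logic can be tested
        self.text = ""
        self.last_flush = clock()

    def push(self, chunk):
        """Append a chunk. Return the full text if it is time to re-render, else None."""
        self.text += chunk
        now = self.clock()
        if "\n" in chunk or now - self.last_flush >= self.interval:
            self.last_flush = now
            return self.text
        return None
```

This sketch only checks the interval when a chunk arrives. A real client would also need a timer for quiet periods and a final flush when the stream ends.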

09 · 2026-05-04 · 7 MIN READ

Shipping 100+ Tools to Claude Without Bloating the Cache: Anthropic Tool Search and Deferred Loading

At 50 tools your agent's tool-selection accuracy is 84-95%. At 200 tools it falls to 41-83%. The naive fixes (semantic prefilter, RAG over tool definitions) all invalidate your prompt cache, which is the other expensive thing you're trying to avoid. Anthropic's `defer_loading: true` plus tool search is the rare feature that solves both problems at once. Here's the design and the gotchas I hit shipping it.

10 · 2026-05-03 · 7 MIN READ

When to Use SSE vs WebSocket for AI Agent Streaming (and Why I Use Both)

The 'just use WebSocket' default for any real-time AI feature is wrong. Server-Sent Events is the right protocol for server-to-client token streaming (chat, agent output, tool results). WebSocket is the right protocol for client-to-server audio capture and any genuinely bidirectional channel. Same product, two protocols, and the choice between them is a memory and battery decision before it's a feature decision.

11 · 2026-05-02 · 8 MIN READ

Building a Sub-2-Second Sales Coach: Two-Path Architecture for Real-Time Conversation AI

Live conversation coaching has a sub-2-second latency budget. Post-call analysis benefits from 30 seconds of frontier-model reasoning. The same agent can't satisfy both. The pattern that worked is two parallel paths feeding off the same diarized stream: a fast path on Llama 3.3 + Groq for in-the-moment nudges, a slow path on Claude Opus / Gemini 2.5 for the manager's ride-along artifact.

12 · 2026-05-01 · 8 MIN READ

Anonymizing PII Client-Side Before It Reaches the LLM (Why I Don't Trust the Gateway)

The 2026 default for LLM PII protection is a server-side gateway: Presidio plus LiteLLM, Lakera Guard, Skyflow. The gateway sees the PII, redacts it, forwards the scrubbed prompt. That's a meaningful improvement over plaintext, and the gateway still has the data. The fat-client pattern moves redaction to the device: PII never leaves the trust boundary, not even to the gateway. Here's the design, the threat model that justifies it, and why I'd defend it for healthcare and legal AI products.

13 · 2026-04-30 · 8 MIN READ

Skill and Memory Injection for Agent Loops: Why I Don't Let the Agent Page Its Own Memory

The agent-memory landscape in 2026 has settled into four big buckets: provider-managed, Letta-style self-paging, Mem0/Zep middleware, and Anthropic Agent Skills. The pattern I ship is none of those exactly: orchestrator-driven injection into the cached prefix, with a hard boundary between skills (how the agent does things) and memory (what the agent knows). Here's why I picked that shape and where the others are still right.

14 · 2026-04-29 · 8 MIN READ

What 1M Context Actually Buys You (and What It Doesn't): Production Patterns from a 2026 Agent Loop

Anthropic made 1M context generally available on Claude Sonnet 4.6 in March 2026 with no surcharge. The marketing reads like RAG is dead. The production reality reads like: lost-in-the-middle is real, 1M is best as a tool not a default, and the right answer is usually a hybrid. Here's how I actually use the 1M window in an agent loop and where I still reach for RAG.

15 · 2026-04-28 · 8 MIN READ

Why I Use a Postgres Append-Only Log for Agent Chat (Not Redis Streams)

The 2026 default for resumable LLM chat is Redis Streams behind Vercel AI SDK 5's resumeStream. It works, and it doubles your storage footprint with two systems of record: Postgres for messages, Redis for resumption. The version I shipped uses one append-only Postgres table for both. Here's the schema, the seq-number contract, and why I'd defend the choice in front of a hiring manager.

16 · 2026-04-27 · 8 MIN READ

Bubblewrap, Landlock, gVisor, Firecracker: Choosing a Sandbox for AI Agent Code Execution in 2026

Anthropic uses bubblewrap for Claude Code and gVisor for Claude web. OpenAI Codex defaults to Landlock. Vercel and E2B use Firecracker microVMs. The four leading sandbox options are not interchangeable, and 'pick the strongest one' is the wrong heuristic. Here's the two-axis decision model I use, the use cases each option is right for, and why I picked bubblewrap + seccomp + AST validation for the in-house Python execution layer instead of reaching for a microVM.

17 · 2026-04-26 · 8 MIN READ

Four Production Reliability Patterns for AI Agents (Beyond Retry-With-Backoff)

Most agent tutorials stop at retry-with-backoff. Production stops there too, and that's how you get the $437 retry-loop incident from April 2026. The four patterns that actually keep an agent up at 3 AM are circuit breakers, partial success, human-in-the-loop, and graceful degradation, and the trick is knowing which failure signal triggers which pattern.

18 · 2026-04-25 · 7 MIN READ

Why Every Tool in Your MCP Server Needs a Different TTL

The default MCP server caches uniformly: one defaultTTL of an hour, applied across every tool. The default is wrong because every tool exposes data with a different volatility profile and a different upstream reliability profile. Here's the two-axis framework I use to pick TTLs, the three real tools that ended up at minutes / hours / days, and what the June 2026 MCP spec roadmap does (and doesn't) solve.

19 · 2026-04-24 · 7 MIN READ

Building a Black-Box Flight Recorder for Streaming Anomalies

Logging every event in a high-volume streaming pipeline drowns the storage system and obscures the rare event you actually need. The flight-recorder pattern (rolling ring buffer, snapshot on trigger) trades retention for context: you keep the last N seconds in memory and dump them on anomaly. Here's the three-stage Kafka design I shipped, and the math behind 5K-events-before / 3K-events-after.
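The ring-buffer-plus-snapshot idea can be sketched in a few lines of Python. This is a single-process, in-memory illustration under stated assumptions, not the three-stage Kafka design from the post. The `FlightRecorder` class and its `before` / `after` parameters are hypothetical names, with defaults taken from the 5K / 3K figures above.

```python
from collections import deque


class FlightRecorder:
    """Keep the last `before` events; on trigger, also capture the next `after`."""

    def __init__(self, before=5000, after=3000):
        self.after = after
        self.ring = deque(maxlen=before)  # rolling pre-trigger context
        self.pending = None               # snapshot currently being filled
        self.remaining = 0                # post-trigger events still to capture
        self.snapshots = []               # completed snapshots

    def _close(self):
        self.snapshots.append(self.pending)
        self.pending = None

    def record(self, event, is_anomaly=False):
        if self.pending is not None:
            # A snapshot is open: append post-trigger events until the quota is met.
            self.pending.append(event)
            self.remaining -= 1
            if self.remaining == 0:
                self._close()
        elif is_anomaly:
            # Freeze the pre-trigger context plus the triggering event itself.
            self.pending = list(self.ring) + [event]
            self.remaining = self.after
            if self.remaining == 0:
                self._close()
        self.ring.append(event)
```

In this sketch an anomaly that arrives while a snapshot is still open is folded into that snapshot rather than starting a new one. Whether that is the right policy depends on how often triggers cluster.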

20 · 2026-04-23 · 8 MIN READ

From Days to Hours: Migrating a 20M-Record Wikipedia ML Pipeline From Sync to Async

The naive asyncio.gather rewrite gets you a 3x speedup and a 503 from the upstream API. The right pattern is semaphore + httpx + bounded concurrency + a smart retry policy that respects the rate-limit headers the upstream actually returns. Here's the version that took ingestion from days to hours, and the four bottlenecks that turned out not to be I/O at all.
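The bounded-concurrency part of that pattern can be sketched with the standard library alone. This is a minimal illustration, assuming a caller-supplied `fetch` coroutine in place of a real httpx call. `fetch_all`, `fetch`, and `max_concurrency` are hypothetical names, and the retry and rate-limit handling the teaser mentions is deliberately left out.

```python
import asyncio


async def fetch_all(ids, fetch, max_concurrency=10):
    """Run fetch(id) for every id, with at most max_concurrency in flight."""
    sem = asyncio.Semaphore(max_concurrency)

    async def bounded(item_id):
        async with sem:  # waits here once the cap is reached
            return await fetch(item_id)

    # gather returns results in the same order as the inputs
    return await asyncio.gather(*(bounded(i) for i in ids))
```

The difference from the naive rewrite is that `gather` still schedules every task up front, but the semaphore caps how many reach the upstream at once.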

21 · 2026-04-22 · 7 MIN READ

Why I Run Postgres Migrations on Container Startup, Not From CI

The internet consensus is that database migrations belong in your CI/CD pipeline, never in your container's entrypoint. The consensus is right for the wrong reasons. Here are the four coordination problems people are actually trying to avoid, the 30-line Postgres advisory-lock pattern that solves all four, and why container-startup migrations are the simplest deploy story for a small team that doesn't have a release engineer.

22 · 2026-04-21 · 7 MIN READ

What I Learned About Anthropic's Prompt Cache From Running an Agent Loop in Production

Prompt caching is sold as a 90% cost cut. In production agent loops it can quietly become a 30% cost increase, depending on five things the docs do not put in big letters. Here are the patterns that made the math actually work for me, including the on-demand tool loading trick that keeps the cache alive.

23 · 2026-04-20 · 7 MIN READ

Why I Use gRPC for the Agent-to-Sandbox Bridge (and JSON-RPC Inside It)

Most teams pick one wire protocol and use it everywhere. The right answer is to pick by trust boundary: gRPC + Protobuf for the API-to-pod hop where you control both ends, JSON-RPC over a subprocess pipe inside the sandbox where the network surface has to be zero. Here's the two-tier design and the math behind it.

24 · 2026-04-19 · 8 MIN READ

Why I Built My Own Agent Eval Harness Instead of Reaching for LangSmith

The off-the-shelf agent observability tools (LangSmith, Braintrust, Phoenix, Langfuse) are excellent for what they do. They are not eval harnesses. Here's the difference, why it matters, and the ~200-line Redis-coordinated wave scheduler I wrote when I needed a real one.

25 · 2026-04-18 · 7 MIN READ

Shipping an Agent iOS App From Zero in Two Weeks: What Survived, What Didn't

Native iOS clients are still the highest-fidelity surface for an AI agent — push, haptics, secure enclave, real local state. Here are the five engineering calls that survived contact with production, and the two I'd undo if I were starting over.

26 · 2026-04-17 · 6 MIN READ

Building a Zero-Data-Retention Layer for Production LLM Agents

Anthropic's hosted Programmatic Tool Calling is fast, accurate, and absolutely incompatible with Zero Data Retention. Here's the request-interception pattern enterprise teams use to keep customer data on-prem while preserving model code quality.

27 · 2026-04-16 · 5 MIN READ

Multi-Vendor Agent Design: Why One Model Isn't Enough in 2026

Single-vendor agent architectures are a 2024 pattern. In 2026, the right move is splitting the loop — Claude for reasoning, Gemini for high-resolution vision, Llama (via Groq) for sub-100ms hot paths. Here's the orchestration shape that actually ships.

28 · 2026-04-15 · 7 MIN READ

Designing Tool Surfaces for LLM Agents: What Goes On the Tool, What Stays In the Loop

Tool-surface design is the highest-leverage knob in production agent infrastructure — and the one most engineers underweight. Here's the design language for cache-friendly, token-minimal, domain-shaped tools that scale past the demo.

29 · 2026-04-14 · 6 MIN READ

Picking MCP Servers for an Agent Without Drowning the Context Window: A Selection Heuristic for 2026

An MCP server is roughly 500-1,000 tokens of context per tool, billed every turn forever. The right number for a production agent is almost always 3-5, not 15. Here's the heuristic I use and the math behind it.

30 · 2026-04-13 · 7 MIN READ

What the Bubblewrap Sandbox Escape Tells Us About Agent Runtime Hardening in 2026

An autonomous agent that can disable its own sandbox is a sandbox you no longer have. Lessons from a real 2026 escape — and the four-layer model I use to reason about agent runtime isolation in production.
