Tessen: Building the Harness for AI Agents — From Forensic Capture to Runtime Control

Traditional observability treats an agent call like a web request. But an agent is a program that thinks, and it fails in ways a span can't show you. Tessen is the harness I'm building: two lines to capture everything your agent actually does in production, then catch the runaway loop before the bill does.

By Tanay Shah · Founding Engineer · NYC

It is 3am and your AI agent has spent the last two hours retrying the same broken tool call. By the time anyone looks, it has burned $2,000 in API spend on a loop nobody noticed — because the dashboard was green the whole time. Every individual request returned 200. The agent was 'working.' It just happened to be working on the same failure forty-seven times in a row. This is the failure mode traditional observability was never built to catch, and it is the reason I started building Tessen.

Why an agent isn't a web request

Observability tools treat each model call as a span: a request went out, a response came back, here is the latency. That framing is correct for a stateless API and wrong for an agent. An agent is a program that thinks — it loops, it calls tools, it retries, it falls back to a cheaper model when the good one rate-limits. The failures that page you live in the spaces *between* the calls: a loop iteration count that crept from 3 on Monday to 47 on Friday, a fallback chain that silently dropped to a worse model nobody remembers configuring, a tool that fails 12% of the time and that the agent never reports because it just tries again. A span tells you a request happened. It does not tell you your agent is stuck.
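
To make the gap concrete, here is a minimal sketch. It assumes a hypothetical list of span-like records; the field names `status`, `tool`, and `args` are illustrative and are not Tessen's schema. Every span passes a per-request health check, yet counting consecutive identical calls shows the agent is stuck.

```python
from itertools import groupby

# Hypothetical span records: every request "succeeded" at the HTTP level.
spans = [{"status": 200, "tool": "fetch_invoice", "args": '{"id": 17}'} for _ in range(47)]
spans.append({"status": 200, "tool": "send_email", "args": '{"to": "ops"}'})


def all_green(records):
    """Span-level view: is every individual request healthy?"""
    return all(r["status"] == 200 for r in records)


def longest_repeat(records):
    """Agent-level view: the longest run of consecutive identical tool calls."""
    best_key, best_len = None, 0
    for key, run in groupby(records, key=lambda r: (r["tool"], r["args"])):
        length = sum(1 for _ in run)
        if length > best_len:
            best_key, best_len = key, length
    return best_key, best_len


green = all_green(spans)
stuck_call, repeats = longest_repeat(spans)
print(green, stuck_call[0], repeats)
```

The dashboard question (`all_green`) and the agent question (`longest_repeat`) are answered from the same records. Only the second one looks across calls.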

So the first question for any harness is not 'can you draw a trace,' it is 'do you actually have the whole record of what the agent did — and will you still have it next week.' Both halves of that turn out to be hard.

Capture has to be forensic, and it has to be structural

Forensic first. Tessen captures every call at full depth — the thinking blocks, the tool-use blocks, cache-read versus cache-write token counts, the retry chain, the real cost of every cache miss. Not a sampled trace, not a redacted summary: the whole record, written append-only, so when something breaks at 3am you can reconstruct exactly what the agent saw and did at each step. You cannot reason about a loop you only sampled one iteration of.
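
As a rough illustration of what an append-only record could look like, the sketch below writes one JSON line per call and never rewrites earlier lines. The `CallRecord` fields, the `retry_of` link, and the JSONL file layout are all assumptions made for this example, not Tessen's actual storage format.

```python
import json
import os
import tempfile
from dataclasses import asdict, dataclass, field
from typing import Optional


@dataclass
class CallRecord:
    """One captured model call. Field names are illustrative assumptions."""
    call_id: str
    model: str
    thinking_blocks: list = field(default_factory=list)
    tool_use_blocks: list = field(default_factory=list)
    cache_read_tokens: int = 0
    cache_write_tokens: int = 0
    retry_of: Optional[str] = None  # call_id of the attempt this call retried


def append_record(path, record):
    """Append one JSON line; lines already in the file are never modified."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(asdict(record)) + "\n")


def retry_chain(path, call_id):
    """Follow retry_of links back to the first attempt, oldest first."""
    with open(path, encoding="utf-8") as f:
        by_id = {r["call_id"]: r for r in map(json.loads, f)}
    chain, current = [], call_id
    while current is not None:
        chain.append(current)
        current = by_id[current]["retry_of"]
    return list(reversed(chain))


log_path = os.path.join(tempfile.mkdtemp(), "capture.jsonl")
append_record(log_path, CallRecord("c1", "model-a", cache_write_tokens=1200))
append_record(log_path, CallRecord("c2", "model-a", cache_read_tokens=1200, retry_of="c1"))
append_record(log_path, CallRecord("c3", "model-a", cache_read_tokens=1200, retry_of="c2"))
chain = retry_chain(log_path, "c3")
print(chain)
```

Because every attempt is kept, the retry chain can be rebuilt after the fact, which a sampled trace cannot guarantee.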

Structural second, and this is the part that almost nobody gets right. Forensic capture has a quiet enemy: vendor SDK drift. The obvious way to record an agent is to wrap each leaf method — messages.create, chat.completions.create, one patch per call site per vendor. It works on the day you ship it. Then a vendor releases a new SDK that renames a method, moves a class, or adds a streaming variant, and your capture silently goes dark. For a tool whose entire job is to never miss anything, 'silently goes dark on a Tuesday SDK bump' is the worst possible failure.

Trunk-patching: the structural answer to SDK drift

The fix is architectural, not a longer list of patches. Instead of wrapping N leaf methods per vendor, Tessen patches the one base client that every call for that vendor flows through — the trunk — and then lays a second floor at the HTTP transport layer underneath all of them. One trunk per vendor, one floor under every vendor. When a vendor ships a new method next month, it still constructs the same base client and still goes out over the same HTTP transport, so it is captured automatically — without me having written a patch for a method that didn't exist yet.
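
The idea can be sketched against a stand-in SDK. Everything below is hypothetical: `BaseClient`, `_request`, and the method names are invented for illustration and do not mirror any real vendor's layout or Tessen's internals. The point is only that a method added after the patch is still recorded, because it funnels through the same trunk.

```python
import functools

captured = []


# A stand-in vendor SDK (hypothetical layout).
class BaseClient:
    def _request(self, method, path, body):
        # Every public method funnels through here: this is the trunk.
        return {"path": path, "ok": True}


class VendorClient(BaseClient):
    def messages_create(self, **body):
        return self._request("POST", "/v1/messages", body)


def patch_trunk(base_cls):
    """Wrap the single base-class request method once, instead of each leaf."""
    original = base_cls._request

    @functools.wraps(original)
    def recording_request(self, method, path, body):
        response = original(self, method, path, body)
        captured.append({"method": method, "path": path, "body": body})
        return response

    base_cls._request = recording_request


patch_trunk(BaseClient)


# Later, the vendor ships a method that did not exist when the patch was written.
class VendorClientV2(VendorClient):
    def messages_stream(self, **body):
        return self._request("POST", "/v1/messages?stream=true", body)


client = VendorClientV2()
client.messages_create(prompt="hi")
client.messages_stream(prompt="hi")
paths = [c["path"] for c in captured]
print(paths)
```

This sketch covers only the trunk. The second floor described above would apply the same move one level down, at the HTTP transport, so that a call bypassing the base client is still seen.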

The same property is why LangGraph, langchain-anthropic, and raw vendor clients all flow through Tessen with no framework-specific code: the patch lives on the vendor's base client, below the framework, so anything built on top of that client inherits capture for free. This is the difference between an instrument you babysit and one that just keeps working. For a harness, 'keeps recording when the vendor ships a breaking change while you're asleep' is not a nice-to-have. It is the product.

Two lines

The wedge has to be near-zero friction, because the engineer who needs this is already underwater. So the install is exactly what you would want it to be:

pip install tessen

import tessen
tessen.init()  # that's it — every model call your agent makes is now captured at forensic depth

No decorators sprinkled on every call, no manual spans, no rewriting your agent to fit someone's tracing API. The patch finds the vendor clients and wraps them where they live. You write your agent the way you already do; Tessen records what it actually does. Source stays on your machine — capture lands locally first, and shipping anywhere is opt-in.

From capture to control

Capture is the wedge, not the point. The point is what you do with the record. The next layer reads the capture together with the agent's own code and produces findings an engineer can act on instead of charts an engineer has to decode: 'your agent silently retried this call 4 times last Tuesday and it cost $230', or 'this tool fails 12% of the time and the agent never surfaces it', or 'your loop iteration count grew 3x this week'. Not a dashboard you interpret. An answer you act on.
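
A toy version of that step might look like the following. The analyzer is still being built, so this is purely a sketch under assumptions: the event schema, the thresholds, and the finding wording are all made up here to show the shape of "record in, sentence out".

```python
from collections import Counter

# Hypothetical captured events; the schema is an assumption for illustration.
tool_calls = (
    [{"tool": "search", "ok": True}] * 88
    + [{"tool": "search", "ok": False}] * 12
    + [{"tool": "lookup", "ok": True}] * 50
)
loop_iterations = {"last_week": [3, 3, 4, 2], "this_week": [9, 10, 8, 9]}


def tool_failure_findings(calls, threshold=0.10):
    """Report tools whose failure rate meets or exceeds the threshold."""
    totals, failures = Counter(), Counter()
    for call in calls:
        totals[call["tool"]] += 1
        if not call["ok"]:
            failures[call["tool"]] += 1
    findings = []
    for tool, total in totals.items():
        rate = failures[tool] / total
        if rate >= threshold:
            findings.append(f"tool '{tool}' fails {rate:.0%} of the time")
    return findings


def loop_growth_finding(iterations, factor=2.0):
    """Report week-over-week growth in mean loop iterations, if large enough."""
    before = sum(iterations["last_week"]) / len(iterations["last_week"])
    after = sum(iterations["this_week"]) / len(iterations["this_week"])
    if after >= factor * before:
        return f"loop iteration count grew {after / before:.1f}x this week"
    return None


findings = tool_failure_findings(tool_calls)
growth = loop_growth_finding(loop_iterations)
if growth:
    findings.append(growth)
for line in findings:
    print(line)
```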

And the layer after that is the reason it is called a harness and not a viewer: active control. Runtime guards that block a runaway loop before it drains the budget. PR diffs that fix the fragility the analyzer found, against your real code. The arc is capture, then insight, then control — moving from knowing what your agent did to stopping the thing you cannot defend before it ever reaches production. A viewer tells you the beast got loose. A harness keeps it on the leash.
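
A runtime guard of this kind could be as small as the sketch below. `LoopGuard`, `RunawayLoopError`, and the two limits are hypothetical names chosen for the example; the active-control layer is not shipped, so this shows one plausible shape and not Tessen's API.

```python
class RunawayLoopError(RuntimeError):
    """Raised when the guard refuses to let another call through."""


class LoopGuard:
    """Hypothetical guard: stops an agent on repetition or budget limits."""

    def __init__(self, max_spend_usd, max_identical_calls):
        self.max_spend_usd = max_spend_usd
        self.max_identical_calls = max_identical_calls
        self.spend_usd = 0.0
        self._last_key = None
        self._repeat_count = 0

    def check(self, tool, args, cost_usd):
        """Call before each tool or model call; raises instead of letting it run."""
        key = (tool, args)
        self._repeat_count = self._repeat_count + 1 if key == self._last_key else 1
        self._last_key = key
        if self._repeat_count > self.max_identical_calls:
            raise RunawayLoopError(f"{tool} repeated {self._repeat_count} times in a row")
        if self.spend_usd + cost_usd > self.max_spend_usd:
            raise RunawayLoopError(f"budget of ${self.max_spend_usd:.2f} would be exceeded")
        self.spend_usd += cost_usd


guard = LoopGuard(max_spend_usd=5.0, max_identical_calls=3)
completed, blocked = 0, False
for _ in range(47):  # the agent wants to retry the same broken call 47 times
    try:
        guard.check("fetch_invoice", '{"id": 17}', cost_usd=0.5)
    except RunawayLoopError:
        blocked = True
        break
    completed += 1
print(completed, blocked, guard.spend_usd)
```

The design choice worth noting is that the check runs before the call, so the blocked attempt costs nothing. A viewer would only report the 47 attempts afterwards.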

I am building this for one specific person: the founder or staff engineer who has been paged in the last ninety days for something their agent did that they could not defend — the $2k retry loop, the silent model downgrade, the commitment an agent made to a customer that opened legal exposure. Not the hobbyist with a weekend Slack bot. The person running a production agent who needs a control layer between an unpredictable model and production going wrong.

Where this is going

Every category of software eventually grows a control plane — the layer that sits between the unpredictable thing and the blast radius. Tessen is the bet that production AI agents need the same thing, and that the version that wins is both structural (capture that does not break when a vendor ships at 2am) and active (control, not just visibility). It is early and honest about it: the capture engine and the trunk-patching architecture are shipped and on PyPI today; the analyzer is what I am building now; the active-control layer is the horizon I am building toward.

If you run a production agent and have your own 3am story — the loop, the silent fallback, the bill you had to explain the next morning — I want to hear it, because those stories are the spec. And if you are a founder or engineer thinking about the agent-reliability problem from the other side, I am at tanayshah2024@gmail.com. Tame the beast.
