Why I Built My Own Agent Eval Harness Instead of Reaching for LangSmith
The off-the-shelf agent observability tools (LangSmith, Braintrust, Phoenix, Langfuse) are excellent for what they do. They are not eval harnesses. Here's the difference, why it matters, and the ~200-line harness (built around a Redis-coordinated wave scheduler) I wrote when I needed a real one.
When a team starts caring about agent quality, the first instinct is to install LangSmith. It traces every model call, scores them with an LLM-as-judge, draws a dashboard, and the eval problem appears solved. For a class of teams (LangChain-native, OpenAI-heavy, monitoring-first) that instinct is correct. For the team I was on, it would have been the wrong tool. So I wrote a custom harness instead. This post is about the difference between an observability platform and an eval harness, and the design that fell out of caring about that difference.
Observability is not evaluation
LangSmith, Braintrust, Helicone, Langfuse, Arize Phoenix: I read the docs for all of them before deciding. They are genuinely good products. They share a shape: instrument your production agent, ship spans to a backend, score the spans (rule-based or LLM-as-judge), surface the result in a dashboard. That shape is observability with eval grafted on. It answers the question 'how is my deployed agent doing right now.'
I needed answers to a different question: 'before this commit ships, how does the agent do on a fixed corpus of 196 tasks across 9 disciplines, run in clean environments, with deterministic replay, gated by a baseline.' That is not observability. That is a regression-test harness. And the second question is the one a buyer cares about, because a buyer doesn't see the dashboard: they see the artifact the agent produces, and they want to know it won't get worse on the day they renew.
What a real eval harness has to do
- ▸Cleanroom isolation per task. Each task starts from a fresh tmpdir, fresh database state, no carryover from prior runs. Anthropic's own engineering team has written about this: in some internal evals they observed Claude getting an unfair advantage by reading git history from previous trials. Shared state poisons the signal.
- ▸Deterministic replay. Same model, same seeds, same input, same output, every time. If a task fails today and the harness can't reproduce the failure tomorrow, the failure is noise and you can't act on it.
- ▸Wave-coordinated parallelism. You can't fan out 196 tasks at once: you'll swamp the model provider, hit rate limits, and the failures will show up as HTTP errors instead of agent errors. You need to run in coordinated waves, where wave size is chosen for your token budget and rate-limit ceiling.
- ▸Baseline-vs-candidate scoring. You don't compare the candidate run to an absolute floor: you compare it to the baseline run on the previous commit. The right gate is 'does this PR regress the eval set' not 'does the agent score above 80%.' The latter is a vanity metric and the former actually catches regressions.
- ▸Model-agnostic surface. Today's agent runs on Claude. The next eval might be Claude vs Gemini, vs a fine-tune, vs a smaller, cheaper model. The harness has to let you swap models without being rewritten.
- ▸Idempotent task IDs. Re-running the same eval against the same commit should produce the same artifact set. No fresh UUIDs, no timestamps in the artifact path. This is the small unsexy thing that makes diff tooling work.
Every off-the-shelf tool gets a few of these right. At the time I evaluated, none of them got all of them right, partly because none of them were built for this shape: a regression suite an engineer runs on their laptop or in CI before a deploy, against a private corpus, with model-agnostic comparison.
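To make the last two requirements concrete, here is a minimal sketch of cleanroom isolation and idempotent artifact paths. The names (artifact_path, run_in_cleanroom, run_agent) and the corpus layout are illustrative, not lifted from the harness; the point is that the artifact path is a pure function of commit and task, and every trial starts from a freshly copied tmpdir.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path

def artifact_path(out_root: Path, commit: str, task_id: str) -> Path:
    """Deterministic artifact location: same commit + task -> same path.
    No UUIDs, no timestamps, so re-runs overwrite rather than duplicate and diffs line up."""
    run_id = hashlib.sha256(f"{commit}:{task_id}".encode()).hexdigest()[:12]
    return out_root / commit / f"{task_id}-{run_id}.json"

def run_in_cleanroom(task_dir: Path, run_agent) -> dict:
    """Copy the task's inputs into a fresh tmpdir so nothing carries over between trials:
    no git history, no leftover database state, no prior artifacts to peek at."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "task"
        shutil.copytree(task_dir / "input", workdir)
        return run_agent(workdir)
```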
The harness, in 200 lines of Python
What I built was not novel, just specific. The whole thing fits in a single file. It takes a path to a corpus, a path to an agent factory, and a wave size. It produces a directory of per-task artifacts that a follow-up scoring pass turns into a baseline-vs-candidate diff.
```
corpus/                                   fresh tmpdir per task
  task_001/{input, expected}  ─────────►  ┌──────────────────┐
  task_002/{input, expected}              │  agent factory   │
  task_003/{input, expected}              │  (lean SDK call) │
  ...                                     └────────┬─────────┘
                                                   │
┌─────────────────────┐   SETNX wave_lock          ▼
│   wave scheduler    │  ──────────────►  ┌──────────────────┐
│   (Redis-locked)    │  ◄──── score ◄────│  artifact write  │
└──────────┬──────────┘                   │  (idempotent ID) │
           │                              └────────┬─────────┘
           ▼                                       │
   diff vs baseline  ◄─────────────────────────────┘
```

The interesting part is the Redis-locked wave scheduler. Tasks are not enqueued individually. They are bucketed into waves, and each wave acquires a Redis SETNX lock keyed on the wave ID before the workers can begin. If a developer runs the harness on their laptop while CI is running it on the same corpus, the second invocation waits, takes the lock when the first releases it, and the runs do not interleave. Lock TTL is a 10-minute heartbeat: if a worker dies mid-wave, the lock expires, and the next invocation knows to re-run the orphaned tasks rather than leaving them unscored.
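A minimal sketch of that locking scheme with redis-py, assuming one lock key per wave and the 10-minute TTL described above. The key format, lock value, and polling interval are illustrative, and a production version would also guard the release so it can't delete a lock another invocation now holds.

```python
import time

import redis

r = redis.Redis()
WAVE_TTL_S = 600  # 10-minute heartbeat: a dead worker simply lets the lock expire

def acquire_wave_lock(wave_id: str, poll_s: float = 5.0) -> None:
    """Block until this invocation owns the wave (SET NX with a TTL)."""
    key = f"eval:wave_lock:{wave_id}"
    while not r.set(key, "held", nx=True, ex=WAVE_TTL_S):
        time.sleep(poll_s)  # another invocation (laptop or CI) owns it; wait our turn

def heartbeat(wave_id: str) -> None:
    """Called periodically by live workers so the lock outlives a long wave."""
    r.expire(f"eval:wave_lock:{wave_id}", WAVE_TTL_S)

def release_wave_lock(wave_id: str) -> None:
    r.delete(f"eval:wave_lock:{wave_id}")
```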
Wave size was the lever I tuned the most. Too small (waves of 1) and a 196-task run takes 40 minutes of mostly-idle waiting on rate limits. Too large (waves of 50) and the provider's per-minute token budget rejects half the requests. Wave size 8, with each wave's slowest task gating the next wave's start, was the equilibrium for the model-and-budget combination I was running against. That number is not portable: it is the hyperparameter you tune for your specific account quotas.
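The wave loop itself is a few lines. This sketch assumes a run_one(task) callable that drives one cleanroom trial and a task object with a task_id, both illustrative; pool.map blocks until the slowest task in the wave returns, which is exactly the gating behaviour described above.

```python
from concurrent.futures import ThreadPoolExecutor

WAVE_SIZE = 8  # tuned for one account's rate limits and token budget; not portable

def run_in_waves(tasks, run_one, wave_size: int = WAVE_SIZE) -> dict:
    """Run tasks in fixed-size waves; each wave's slowest task gates the next wave's start."""
    artifacts = {}
    for i in range(0, len(tasks), wave_size):
        wave = tasks[i : i + wave_size]
        # acquire_wave_lock(...) from the sketch above would wrap this block.
        with ThreadPoolExecutor(max_workers=wave_size) as pool:
            for task, artifact in zip(wave, pool.map(run_one, wave)):
                artifacts[task.task_id] = artifact
    return artifacts
```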
Why direct SDK, not LangGraph
The agent the harness exercises is also lean: a direct Anthropic SDK loop, six tools, no framework. I want to be precise about why this matters for evaluation specifically. The harness has to be confident that a regression in score is a regression in the agent, not a regression in the framework wrapping the agent. Frameworks (LangGraph, AutoGen, CrewAI) ship interesting changes to their internals on their own release cadence. If your eval pins the framework version, you have a different problem (you're now eval-testing a frozen framework). If it floats, you have framework regressions polluting your agent regression signal.
Direct SDK keeps the surface small enough that the only moving piece is the agent code I wrote and the model I'm calling. When the score moves, I can attribute the move. That is worth the (modest) cost of writing the loop yourself.
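For reference, the shape of such a loop against the Anthropic Messages API is roughly the following; the model ID, tool definitions, and dispatch function are placeholders, and a real loop needs error handling and a turn limit.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def run_agent(prompt: str, tools: list[dict], dispatch) -> str:
    """Minimal tool-use loop: call the model, execute tool_use blocks, feed results back."""
    messages = [{"role": "user", "content": prompt}]
    while True:
        resp = client.messages.create(
            model="claude-sonnet-4-20250514",  # placeholder; pin whatever the eval targets
            max_tokens=4096,
            tools=tools,
            messages=messages,
        )
        if resp.stop_reason != "tool_use":
            return "".join(b.text for b in resp.content if b.type == "text")
        # Echo the assistant turn, then answer every tool call it made.
        messages.append({"role": "assistant", "content": resp.content})
        results = [
            {"type": "tool_result", "tool_use_id": b.id, "content": dispatch(b.name, b.input)}
            for b in resp.content
            if b.type == "tool_use"
        ]
        messages.append({"role": "user", "content": results})
```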
What off-the-shelf is still right for
I want to be fair to the products I evaluated and rejected, because the rejection was specific to the use case. If you're running a chat application and you want to know how it's behaving in production, LangSmith (with LangChain), Phoenix (with anything), Helicone (with OpenAI), or Langfuse (self-hosted) will pay for themselves quickly. Production telemetry is a real problem and these tools solve it well.
What I rejected was using one of them as a regression-test harness. They sit at one end of the spectrum (production observability) and the harness sits at the other (pre-deploy regression). The two problems share vocabulary (eval, score, dataset) but the access patterns are opposite: production is a high-volume continuous stream, regression is a low-volume coordinated batch. Trying to use one tool for both is the kind of architectural decision you regret six months later when the dashboards don't show what you need to ship.
What I'd change
- ▸Stand the harness up earlier. I wrote it after the agent already had real users. The team would have shipped the first three releases with more confidence if the harness existed at release zero. The opportunity cost of writing it earlier was about a week and the cost of not having it was several reverted deploys.
- ▸Score with multiple judges, not one. The first version used a single LLM-as-judge for scoring. Scores were noisy across judge runs (about 8% variance on the same artifact). Three independent judges with majority vote is the cheapest fix and was the version I shipped second (a sketch follows this list).
- ▸Externalize the corpus from the harness repo. Mixing test data into the harness repo creates a temptation to edit tests when they fail. Putting the corpus in a separate, read-mostly repo with PR-gated changes adds the right friction.
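A minimal version of that majority vote, assuming a judge_once callable that returns a pass/fail verdict for one artifact; the name and signature are illustrative.

```python
from collections import Counter

def majority_judge(artifact: str, expected: str, judge_once, n_judges: int = 3) -> bool:
    """Run n independent LLM-as-judge passes and keep the majority verdict,
    which damps the run-to-run variance a single judge shows."""
    votes = [judge_once(artifact, expected) for _ in range(n_judges)]
    return Counter(votes).most_common(1)[0][0]
```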
The bigger lesson
The reflex 'pick a vendor' is the right reflex when the vendor is solving a problem you would otherwise have to learn. It is the wrong reflex when the vendor is solving an adjacent problem and your team is going to bend its workflow to fit the vendor's model. Eval, in 2026, is on the boundary between those two cases: the vendors are mature, but the harness shape (cleanroom, wave-coordinated, baseline-vs-candidate) is specific enough that you'll usually be better served writing the 200 lines yourself and using the vendor for what it actually does well, which is production observability.
If a hiring manager asks me how I think about agent quality, this is the post I'm going to point at. Not because the harness is special, but because the decision to write it (and not to write it for the wrong reasons) is the work.
References
- ▸Anthropic engineering: "Demystifying evals for AI agents" (anthropic.com/engineering)
- ▸LangSmith documentation: tracing, evaluation, dataset management (docs.langchain.com/langsmith)
- ▸Braintrust: agent evals platform with CI gates (braintrust.dev)
- ▸Arize Phoenix: open-source self-hosted eval (arize.com/docs/phoenix)
- ▸Langfuse: open-source LLM engineering platform, acquired by ClickHouse Jan 2026 (langfuse.com)
- ▸OpenAI Evals: open framework + registry (github.com/openai/evals)
- ▸Helicone: LLM observability proxy (helicone.ai)