Designing Tool Surfaces for LLM Agents: What Goes On the Tool, What Stays In the Loop
Tool-surface design is the highest-leverage knob in production agent infrastructure — and the one most engineers underweight. Here's the design language for cache-friendly, token-minimal, domain-shaped tools that scale past the demo.
Anthropic's own engineering team published guidance recently ("Writing effective tools for agents") with a single load-bearing claim: even small refinements to tool descriptions yield dramatic improvements in agent quality. They cite a Claude Sonnet evaluation where re-shaping tool descriptions alone moved the agent toward state-of-the-art on SWE-bench Verified. Not the model. Not the prompt. The tool descriptions.
If you're building production agent infrastructure in 2026 and you've spent more time tuning your system prompt than you have tuning your tool surface, you have the priority order inverted. Tools are how the agent acts on the world. Their shape — names, descriptions, argument schemas, return values — is the actual API the agent learns to use. Treating tool design as plumbing instead of as prompt engineering is the most common mistake I see in production agent codebases.
Three forces in tension
Every tool surface optimizes against three competing pressures: token budget (every tool description sits in your prompt every turn — multiply by 100 turns a day across a thousand sessions and the math gets ugly fast), capability coverage (you need enough tools that the agent can actually do the work), and selection accuracy (the more tools you have, the more often the agent picks the wrong one). These three are in genuine tension. You can't max all three.
- Few tools, dense descriptions (token-cheap, high accuracy, low capability): works for narrow agents.
- Many tools, terse descriptions (token-expensive, low accuracy, high capability): works only with `tool_search` (Anthropic's tool search tool, available when the surface exceeds ~30 tools).
- Many tools, dense descriptions (token-bankrupt, low accuracy, high capability): common, and broken. Don't do this.
- Few tools, terse descriptions (token-cheap, low accuracy, low capability): also broken; the agent flounders on edge cases.
The right answer is usually few tools (6-10), dense descriptions, sharply domain-shaped argument schemas. The constraint isn't "how many capabilities does my agent need?" — it's "what's the smallest set of tools that compose to cover the capability space?" Most teams under-decompose: they build one tool per capability, end up with 30+ tools, blow their token budget, and watch their agent's selection accuracy collapse.
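To see why a 30-tool surface blows the budget, run the arithmetic. Here's a back-of-envelope sketch in Python; every number in it is an illustrative assumption, not a benchmark:

```python
# Back-of-envelope cost of re-sending tool definitions on every turn.
# All constants are illustrative assumptions, not measured values.

TOKENS_PER_TOOL = 350      # name + description + argument schema, densely written
TURNS_PER_SESSION = 100    # agentic loops burn turns quickly
SESSIONS_PER_DAY = 1_000
PRICE_PER_MTOK = 3.00      # USD per million uncached input tokens (assumed)

def daily_tool_cost(n_tools: int, cache_discount: float = 0.0) -> float:
    """Dollars per day spent purely on tool definitions riding along in the prompt."""
    tokens = n_tools * TOKENS_PER_TOOL * TURNS_PER_SESSION * SESSIONS_PER_DAY
    return tokens / 1_000_000 * PRICE_PER_MTOK * (1 - cache_discount)

print(f"8 tools, uncached:    ${daily_tool_cost(8):,.0f}/day")        # $840/day
print(f"30 tools, uncached:   ${daily_tool_cost(30):,.0f}/day")       # $3,150/day
print(f"30 tools, 90% cached: ${daily_tool_cost(30, 0.9):,.0f}/day")  # $315/day
```

The last line is the `cache_control` lever from the principles below: the same surface, an order of magnitude cheaper.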
Six principles that actually move the needle
- Domain-shape the arguments. A `read_page` tool that takes a page number and returns OCR text plus drawing metadata is meaningfully better than a `read` tool that takes a URL. The argument schema teaches the agent what the world looks like. Put the domain in the schema; don't paper over it in the description. (See the sketch after this list.)
- Reject lazy inputs at the schema level. A vision tool that accepts an optional `question` prompt will get fed "describe this page" 80% of the time, which is the worst question. Make `question` required in the schema and document what makes a good question in the description. A question pattern enforced at parse time pays for itself a hundredfold in downstream output quality.
- Return high-signal, stable identifiers. Tool responses should return only the fields the agent needs to reason about its next step: slugs and UUIDs over opaque internal references, page numbers plus drawing IDs over byte offsets. The agent will pass these back through to other tools; if the IDs are unstable, the agent's chain of thought breaks.
- Namespace your tools when you have multiple services: `github_list_prs`, `slack_send_message`, `pgvector_search`. Service prefixes make tool selection unambiguous and dramatically improve `tool_search` accuracy at scale, which matters especially once your surface crosses 15+ tools.
- Cache-control the last tool. Anthropic's `cache_control: {"type": "ephemeral"}` on the trailing tool definition discounts cached input tokens by ~90% on subsequent turns. On agentic workflows with thousands of requests this is often the single largest cost lever available, and it's one line of code.
- Engineer your error responses. When a tool errors, the response goes back to the agent verbatim. A useful error response ("Page 47 not found. Available pages: 1-46. Did you mean page 7?") teaches the agent to recover. An opaque traceback teaches it to give up.
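Here's what several of these principles look like stacked in a single definition. A minimal sketch in the Anthropic Messages API tool format; the `drawings_read_page` tool, its fields, and the 46-page drawing set are hypothetical:

```python
# Domain-shaped, namespaced tool definition. The tool name, fields, and page
# count are hypothetical; the dict structure follows Anthropic's Messages API
# tool format, passed as tools=... to client.messages.create().
tools = [
    {
        "name": "drawings_read_page",  # service-prefixed for unambiguous selection
        "description": (
            "Read one page of the drawing set. Returns OCR text plus drawing "
            "metadata (drawing ID, revision, discipline). Pages are 1-46."
        ),
        "input_schema": {
            "type": "object",
            "properties": {
                "page": {"type": "integer", "minimum": 1, "maximum": 46},
                "question": {
                    "type": "string",
                    "description": (
                        "A specific question about this page, e.g. 'What is "
                        "the beam spacing on gridline C?'. Generic questions "
                        "like 'describe this page' produce low-signal answers."
                    ),
                },
            },
            # 'question' is required: lazy inputs are rejected at parse time.
            "required": ["page", "question"],
        },
        # On the LAST tool in the list, this caches every definition up to
        # here; later turns read them at a ~90% discount.
        "cache_control": {"type": "ephemeral"},
    },
]

def drawings_read_page(page: int, question: str) -> str:
    """Tool implementation. Whatever this returns goes back to the agent verbatim."""
    if not 1 <= page <= 46:
        # Engineered error: names the failure and shows the recovery path.
        return f"Page {page} not found. Available pages: 1-46."
    return f"(OCR text and drawing metadata for page {page} would go here)"
```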
The hardest tradeoff: structured outputs vs. natural-text outputs
Anthropic's recommendation, somewhat counter-intuitive: keep tool I/O close to what the model has seen naturally occurring in text on the internet. Heavy JSON schemas, deeply nested structured outputs, and tight grammar constraints reduce the model's effective intelligence by pushing it out of its training distribution. A grep tool that returns lines of plain text with line numbers is a better tool surface than one that returns a normalized JSON tree of matches with positional metadata. The model already knows what grep output looks like; it has to learn the JSON tree from scratch.
The exception: outputs the *next tool call* will consume programmatically. Those should be structured. The pattern that works in practice is structured-where-passed-to-machine, natural-where-read-by-the-agent. A page index returns structured pages-with-metadata for downstream tools, but the per-page summary is plain text that the agent reads.
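A sketch of that split, with hypothetical function names and record fields:

```python
import json

# Natural-text output: read directly by the agent. Plain grep-style lines
# stay inside the model's training distribution.
def search_spec(matches: list[tuple[int, str]]) -> str:
    return "\n".join(f"{line_no}: {text}" for line_no, text in matches)

# Structured output: consumed programmatically by the next tool call, so
# stable, parseable fields win over prose.
def index_pages(pages: list[dict]) -> str:
    return json.dumps(
        [
            {
                "page": p["page"],
                "drawing_id": p["drawing_id"],  # stable ID, passed on to other tools
                "summary": p["summary"],        # plain text, read by the agent
            }
            for p in pages
        ],
        indent=2,
    )

print(search_spec([(212, "anchor bolts shall be ASTM F1554 Grade 55")]))
# 212: anchor bolts shall be ASTM F1554 Grade 55
```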
Build the eval before you build the tool
The most consequential change in agent engineering practice in 2026: the eval comes first. Define ten tasks the agent should be able to do. Codify the expected tool-call sequence for each. Run the eval against the current tool surface. The eval IS the spec — it's what lets you prompt-engineer tool descriptions with feedback. Without the eval you're tuning blind, which means you're not tuning at all; you're just changing things and hoping.
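A minimal version of that harness; the task prompts, tool names, and `run_agent` callback are hypothetical stand-ins:

```python
# Each eval task pins the tool-call sequence a correct run should produce.
EVAL_TASKS = [
    {
        "prompt": "What revision is drawing S-201?",
        "expected_calls": ["drawings_search", "drawings_read_page"],
    },
    {
        "prompt": "List open PRs that touch the auth service.",
        "expected_calls": ["github_list_prs"],
    },
]

def run_eval(run_agent) -> float:
    """run_agent(prompt) returns the list of tool names the agent called, in order."""
    passed = 0
    for task in EVAL_TASKS:
        actual = run_agent(task["prompt"])
        if actual == task["expected_calls"]:
            passed += 1
        else:
            print(f"FAIL: {task['prompt']!r}")
            print(f"  expected {task['expected_calls']}, got {actual}")
    return passed / len(EVAL_TASKS)
```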
Anthropic's own engineering posts call this out: they use Claude Code itself to automatically optimize tool definitions against an eval. That feedback loop — model → eval → suggest description tweak → re-eval — is the highest-ROI workflow in agent engineering today. If you're not running it, you're shipping tool surfaces with random performance.
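One way to wire that loop yourself; `propose_tweak` and `score_surface` are hypothetical callbacks, and none of this is Anthropic's implementation:

```python
# Hill-climb tool descriptions against the eval. propose_tweak(tools) asks a
# model for a revised set of descriptions; score_surface(tools) wires the
# candidate tools into the agent and returns the eval pass rate.
def optimize_descriptions(tools, score_surface, propose_tweak, rounds: int = 5):
    best_score = score_surface(tools)
    for _ in range(rounds):
        candidate = propose_tweak(tools)   # model suggests a description tweak
        score = score_surface(candidate)   # re-run the eval against it
        if score > best_score:             # keep strict improvements only
            tools, best_score = candidate, score
    return tools, best_score
```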
What this signals to the hiring market
Tool-surface design is the kind of thing that doesn't show up on a résumé but reveals itself instantly in technical conversation. Ask any candidate "how would you design the tool surface for an agent that does X" and the strong answers all sound similar: they reach for domain-shaping, they default to fewer tools with denser descriptions, they think about `cache_control`, they reject the agent's underspecified inputs at the schema level, and they want to see the eval. The weak answers reach for more tools.
This is the technical detail recruiters and CTOs in the AI agent space — Adept, Sierra, Decagon, Cognition, Hebbia, Harvey, Glean's agent team — actually pattern-match on. "Has shipped a thoughtful tool surface" is the new "has shipped at scale." If you're hiring AI engineers in 2026, the question "talk me through your tool surface for X" is the highest-signal screen you can run.