Shipping 100+ Tools to Claude Without Bloating the Cache: Anthropic Tool Search and Deferred Loading
At 50 tools your agent's tool-selection accuracy is 84-95%. At 200 tools it falls to 41-83%. The naive fixes (semantic prefilter, RAG over tool definitions) all invalidate your prompt cache, which is the other expensive thing you're trying to avoid. Anthropic's `defer_loading: true` plus tool search is the rare feature that solves both problems at once. Here's the design and the gotchas I hit shipping it.
The 'too many tools' problem is the second-most expensive bug I've watched teams ship in production agent loops (the most expensive is the retry-loop one). The shape: someone wires up 80 tools because the agent platform wants to expose every internal capability, the model's tool-selection accuracy quietly drops from the 95% it had at 6 tools to the 60s at 80, and the team spends two weeks debugging 'why does the agent pick the wrong tool sometimes.' The answer: the model doesn't have enough attention left after parsing the catalog. The canonical fix (semantic prefilter, RAG over tool descriptions) is well-trodden, but it trades one cost (context bloat) for another (cache invalidation). Anthropic's `defer_loading: true` plus the tool search tool is the cleaner answer, and this post is what I learned shipping it.
The numbers behind the problem
- ▸~6 tools, ~1.5K tokens of definitions: 95%+ tool-selection accuracy on most modern models. This is the sweet spot for production.
- ▸~50 tools, ~8K tokens: 84-95% accuracy, depending on model. Tolerable, but you can feel it in eval scores.
- ▸~200 tools, ~32K tokens: 41-83%. The variance opens up dramatically; some models cope, most don't. The agent starts inventing tool names that almost-match the real ones.
- ▸OpenAI's hard cap is 128 tools per agent. The hard cap is downstream of empirical degradation; nobody picks 128 because that's the right number, they pick it because that's where it stops working.
The conclusion the field has converged on: load 5-7 tools into the active context per turn. Anything else is a discovery problem, not a 'put it in the prompt' problem.
Why the naive fixes are wrong
- ▸Semantic prefilter at the orchestrator. Before the LLM call, embed the user turn, vector-search the tool catalog, and send the top-k tool definitions to Claude. Effective for accuracy. Catastrophic for the prompt cache: every turn has a different prefilter result, the cached prefix is invalidated each turn, and you pay full input price on every call. The cost math is worse than just shipping all 80 tools; the toy sketch after this list makes the failure concrete.
- ▸Multi-step routing agent. A small / cheap model classifies the user's intent, then dispatches to a specialist agent with a smaller catalog. Adds latency (one extra inference hop), adds engineering complexity (now you have N agents to maintain), and the routing model's own accuracy ceiling becomes your accuracy ceiling.
- ▸Tool description compression. Cut the tool definitions to one-liners. Improves token count, hurts the model's ability to pick the right one because the descriptions were the signal it used. Net effect: usually negative.
- ▸Just remove tools. Sometimes correct (if you have unused tools, drop them). Doesn't generalize: real agent platforms exposing many backend systems legitimately need many tools.
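To see why the prefilter kills the cache, here's a toy model of prefix caching. It is not Anthropic's implementation; the only property it borrows is that the cache keys on an exact prefix of system prompt plus rendered tools.

```python
import hashlib
import json

# Toy model: the prompt cache keys on an exact serialized prefix of
# system prompt + tools. Any change to the tools array is a cache miss.
def prompt_cache_key(system: str, tools: list[dict]) -> str:
    prefix = json.dumps({"system": system, "tools": tools}, sort_keys=True)
    return hashlib.sha256(prefix.encode()).hexdigest()[:12]

SYSTEM = "You are a coding agent."
CATALOG = [{"name": f"tool_{i}", "description": f"does thing {i}"} for i in range(80)]

# Stand-in for the embedding search: any deterministic per-turn selection
# works here, because the point is only that different turns pick
# different top-k subsets of the catalog.
def prefilter(user_turn: str, k: int = 7) -> list[dict]:
    return sorted(CATALOG, key=lambda t: hash((t["name"], user_turn)))[:k]

for turn in ["open a PR", "post to slack", "file a jira ticket"]:
    print(turn, "->", prompt_cache_key(SYSTEM, prefilter(turn)))
# Three turns, three distinct cache keys: the cached prefix never survives
# a turn boundary, so every call pays full input-token price.
```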
The Anthropic answer: defer_loading + tool_search
Anthropic shipped two beta features in late 2025 that compose into the right design:
- ▸`defer_loading: true` on a tool definition. The tool is registered with Anthropic but stripped from the rendered tools section before the cache key is computed. The model doesn't see the full schema in the cached prefix; it sees a stub that the cache treats as a no-op.
- ▸`tool_search_tool_bm25_20251119` (and a regex variant). A built-in tool the model can call with a natural-language query. The tool returns the names + short descriptions of matching deferred tools. The model then calls one of them; when it does, the full tool definition is appended to the message history as a `tool_reference` block, and the model can use it for the rest of the turn.
The combination is what makes it work. `defer_loading` without tool search means the model can't discover the tools. Tool search without `defer_loading` means the search returns tools that are already in context, defeating the point. Together: the cached prefix stays small and stable, and the catalog scales to hundreds of tools without paying the accuracy or cache cost.
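In request terms, the setup looks roughly like this. A minimal sketch against the Python SDK; the built-in tool's `name` field and the `advanced-tool-use-2025-11-20` beta flag are assumptions from the docs at the time of writing, so verify both before copying.

```python
import anthropic

client = anthropic.Anthropic()

tools = [
    # The search entry point lives in the small, stable cached prefix.
    # Assumption: the built-in tool's name field; verify against the docs.
    {"type": "tool_search_tool_bm25_20251119", "name": "tool_search_tool_bm25"},
    # A deferred catalog tool: registered and searchable, but stripped from
    # the rendered tools section before the cache key is computed.
    {
        "name": "github.create_pull_request",  # vendor.action, per the conventions below
        "description": "Create a pull request on GitHub. "
                       "Use when the user wants to open a PR.",
        "input_schema": {
            "type": "object",
            "properties": {
                "repo": {"type": "string"},
                "title": {"type": "string"},
                "head": {"type": "string"},
                "base": {"type": "string"},
            },
            "required": ["repo", "title", "head", "base"],
        },
        "defer_loading": True,
    },
    # ... the other 80+ deferred tools ...
]

response = client.beta.messages.create(
    model="claude-sonnet-4-5",  # any model that supports the tool search beta
    max_tokens=2048,
    betas=["advanced-tool-use-2025-11-20"],  # assumption: current beta flag name
    tools=tools,
    messages=[{"role": "user", "content": "Open a PR for the auth-fix branch"}],
)
```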
┌─────────────────────────────────────────────────────────┐
│ CACHED PREFIX (1h TTL, stable across the session) │
│ ┌────────────────────────────────────────────────┐ │
│ │ ALWAYS-LOADED CORE TOOLS (~6) │ │
│ │ - find_tool (the search entry point) │ │
│ │ - read, write, exec, list, status │ │
│ ├────────────────────────────────────────────────┤ │
│ │ DEFERRED TOOLS (N stubs, defer_loading: true) │ │
│ │ stripped from cache key │ │
│ │ - github.create_pull_request │ │
│ │ - github.list_issues │ │
│ │ - slack.send_message │ │
│ │ - jira.create_ticket │ │
│ │ - ... 80+ more, never in attention until │ │
│ │ the model searches for them │ │
│ └────────────────────────────────────────────────┘ │
└────────────────────────────┬────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ MESSAGES (append-only) │
│ ... user turn ... │
│ ... assistant calls find_tool("github") ... │
│ ... tool_search returns 2 matches ... │
│ ... tool_reference: github.create_pull_request ... │ ← inline
│ ... assistant calls github.create_pull_request ... │ not in
│ ... tool_result ... │ prefix
└─────────────────────────────────────────────────────────┘
The design choices I made on top of it
- ▸Six core tools always loaded. read, write, exec, list, status, find_tool. These cover ~80% of agent turns; only when the agent needs a domain-specific capability does it call find_tool. The 80/20 keeps the search-and-load path off the hot path for common turns.
- ▸Tool name conventions for searchability. I use `vendor.action` naming (`github.create_pull_request`, not `create_github_pr`) because BM25 search ranks the tokens 'github' and 'create' as separate signals. A tool named `gh_pr_make` is essentially invisible to the search.
- ▸Short, declarative descriptions tuned for retrieval. The deferred tool's stub description is what the search returns. 'Create a pull request on GitHub. Use when the user wants to open a PR.' beats 'Pull request creation utility.' The search ranks on natural-language match; write descriptions that match natural-language queries.
- ▸One catalog per tenant, not per session. The deferred-tool catalog should be stable across a tenant's sessions so the cache hashes consistently. If the catalog changes per session (per-user customization), you re-invalidate at every session start, defeating part of the point.
- ▸Audit log when `find_tool` fires. Every search query is logged. The logs become an organic dataset for 'what tools were searched for but not found' (a discovery gap) and 'what tools were searched for but never invoked' (probably a description-quality issue). This is the cheapest observability surface for the catalog. A sketch of the catalog shape and the logging hook follows this list.
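The helper names here (`CatalogEntry`, `to_deferred_tool`, `log_search`) are mine, not an Anthropic or MCP API; the sketch just pins down the conventions from the bullets above.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("tool_catalog")

@dataclass
class CatalogEntry:
    vendor: str       # "github", "slack", "jira", ...
    action: str       # "create_pull_request", "send_message", ...
    description: str  # retrieval-tuned: "Create a pull request on GitHub. Use when..."
    input_schema: dict

def to_deferred_tool(entry: CatalogEntry) -> dict:
    # vendor.action keeps 'github' and 'create' as separate BM25 signals;
    # a name like gh_pr_make would be nearly invisible to the search.
    return {
        "name": f"{entry.vendor}.{entry.action}",
        "description": entry.description,
        "input_schema": entry.input_schema,
        "defer_loading": True,
    }

def log_search(query: str, matches: list[str]) -> None:
    # Empty matches = discovery gap. Matches that never become invocations
    # (joined against tool_use logs offline) = description-quality issue.
    log.info("find_tool query=%r matches=%s", query, matches)
```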
The three-tool variant (Speakeasy's design)
Speakeasy's documented pattern goes one level further: instead of two tools (find_tool + the discovered tool), they use three: search, describe, execute. The model searches for tools (returns names), describes a chosen tool (returns the full schema), then executes it. Each step is a separate tool call. Speakeasy reports token reductions of up to 160x, because the schema only loads once the model is committed to using the tool.
I haven't shipped the three-tool variant in production. It's strictly more efficient at the schema-loading layer, but it adds a turn to every tool invocation (search → describe → execute is three model turns instead of one), which can exceed the latency budget for interactive workflows. For batch / agentic workflows where each turn is already multi-second, the three-tool pattern is probably the right call. For interactive chat where each agent turn is 1-3 seconds, the two-tool variant is the better fit.
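The three-tool surface is small enough to sketch in full. This is my reconstruction of the shape, not Speakeasy's implementation; the registry and handler wiring are assumptions.

```python
# Each registry entry: {"description": str, "input_schema": dict, "handler": callable}
REGISTRY: dict[str, dict] = {}

def search_tools(query: str) -> list[str]:
    """Step 1 (cheap): return only the names of matching tools."""
    q = query.lower()
    return [name for name, t in REGISTRY.items() if q in t["description"].lower()]

def describe_tool(name: str) -> dict:
    """Step 2: load the full schema, only once the model has picked a tool."""
    t = REGISTRY[name]
    return {"name": name, "description": t["description"], "input_schema": t["input_schema"]}

def execute_tool(name: str, arguments: dict):
    """Step 3: dispatch. A real implementation validates arguments
    against the schema before calling the handler."""
    return REGISTRY[name]["handler"](**arguments)
```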
Common mistakes I've seen (and made)
- ▸Forgetting that `defer_loading` is part of every request, not a one-time session setting. If the deferred-tools list changes between turns, the cache invalidates. Add tools to the catalog as a stable rollout, not as per-turn personalization.
- ▸Loading every discovered tool's full schema permanently for the rest of the session. Once a tool is referenced, the `tool_reference` block stays in the message stream. By turn 20 the agent is dragging 15 tool schemas around. Consider truncating: drop `tool_reference` blocks for tools that haven't been used in N turns (a pruning sketch follows this list).
- ▸Treating tool search as a router. The search is a discovery mechanism, not a classifier. If the model already knows which tool it wants to use, calling search first is just overhead. The instruction in the system prompt should be: 'use a core tool if one fits; only call find_tool when no core tool fits.'
- ▸Ignoring the regex variant. The BM25 search is great for natural-language queries; the regex variant is great when the model knows the exact pattern (e.g., 'find me all the github.* tools'). Both should be in the toolbox.
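Here's the truncation idea from the second bullet as a sketch. It assumes the Messages-API shape where content is a list of typed blocks; the exact `tool_reference` block fields are assumptions.

```python
def prune_tool_references(messages: list[dict], keep_window: int = 10) -> list[dict]:
    """Drop tool_reference blocks for tools with no tool_use call in the
    last `keep_window` messages. Block shapes are assumptions, not the
    documented API."""
    # Tools the model actually invoked recently stay loaded.
    recently_used = {
        block["name"]
        for msg in messages[-keep_window:]
        for block in (msg["content"] if isinstance(msg.get("content"), list) else [])
        if block.get("type") == "tool_use"
    }
    pruned = []
    for msg in messages:
        content = msg.get("content")
        if not isinstance(content, list):
            pruned.append(msg)  # plain-string content passes through untouched
            continue
        kept = [
            b for b in content
            if not (b.get("type") == "tool_reference"
                    and b.get("name") not in recently_used)
        ]
        if kept:
            pruned.append({**msg, "content": kept})
    return pruned
```

One caveat: because the cache is prefix-based, pruning a block mid-stream invalidates everything after it, so prune in batches at natural boundaries rather than on every turn.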
What I would change if I rebuilt this layer
- ▸Generate tool descriptions from the same source as the OpenAPI / proto specs they wrap. Today descriptions drift from the actual tool implementations. A code-gen step in CI that produces both the schema and the natural-language description from a single source eliminates a class of search-quality bugs.
- ▸Add explicit `retire_tool` semantics so the agent can drop a `tool_reference` block when it's done with it. Today the block stays in the message stream until the session ends; explicit retirement would let the model manage its own working set more tightly.
- ▸Track per-tool selection accuracy in the eval harness. The 'did the model pick the right tool given the user turn' metric is one of the highest-signal eval surfaces, and it's easy to instrument once the catalog is stable (sketched below).
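The metric is a few lines once the harness logs (expected_tool, chosen_tool) pairs per eval case; the function name is hypothetical.

```python
from collections import Counter

def per_tool_selection_accuracy(results: list[tuple[str, str]]) -> dict[str, float]:
    """results: (expected_tool, chosen_tool) pairs from eval runs.
    Per-tool accuracy localizes a regression to the tool whose description
    (or a near-duplicate neighbor's) is causing the confusion."""
    totals: Counter = Counter()
    hits: Counter = Counter()
    for expected, chosen in results:
        totals[expected] += 1
        hits[expected] += int(expected == chosen)
    return {tool: hits[tool] / totals[tool] for tool in totals}
```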
The bigger lesson
The 'too many tools' problem is solvable, but the right solution depends on a feature most tutorials still don't mention. The pattern is small (six core tools always loaded, N deferred tools discoverable via search) and the cost reduction at scale is dramatic. The object-level mistake is to reach for a custom semantic prefilter and find out three months later that the cache was the actual budget item; the meta-mistake is to keep loading 80 tools and convince yourself that 60% accuracy is 'just how the model works.'
If a hiring manager asks me how I think about agent tool design at scale, this is the answer. Not 'we use Claude because it has tool search,' but 'here's the accuracy curve, here's the cache constraint, here's the catalog architecture that lets us scale to hundreds of tools without paying either tax.' That's the kind of architectural literacy the work demands in 2026.
References
- ▸Anthropic: Tool search tool documentation (platform.claude.com/docs/agents-and-tools/tool-use/tool-search-tool)
- ▸Anthropic: Tool reference and defer_loading (platform.claude.com/docs/agents-and-tools/tool-use/tool-reference)
- ▸Anthropic Engineering: 'Introducing advanced tool use on the Claude Developer Platform'
- ▸Unified.to: 'Scaling MCP Tools with Anthropic's Defer Loading' (2026)
- ▸Speakeasy: 'Dynamic tool discovery in MCP' — three-tool search/describe/execute pattern
- ▸WRITER engineering: 'When too many tools become too much context'
- ▸Jenova.ai: 'AI Tool Overload: Why More Tools Mean Worse Performance'