Skill and Memory Injection for Agent Loops: Why I Don't Let the Agent Page Its Own Memory
The agent-memory landscape in 2026 has settled into four big buckets: provider-managed, Letta-style self-paging, Mem0/Zep middleware, and Anthropic Agent Skills. The pattern I ship is none of those exactly: orchestrator-driven injection into the cached prefix, with a hard boundary between skills (how the agent does things) and memory (what the agent knows). Here's why I picked that shape and where the others are still right.
If you ask 'how do production agents remember things' in 2026 you get five different answers depending on who you ask. Letta's people say 'three-tier paging, the agent manages its own memory.' Mem0 and Zep's people say 'middleware on top of your LLM calls, vector + graph.' Anthropic's people now say 'Agent Skills, filesystem-based, progressive disclosure.' OpenAI / Claude Projects people say 'just turn on memory in the product.' And a fifth crowd says 'roll your own, the abstractions don't fit our workflow.' Each answer is correct for someone. The shape I keep ending up with is none of these exactly. This post is about why.
The four architectures, briefly
- ▸Provider-managed (ChatGPT memory, Claude Projects). The platform owns the memory, the developer doesn't see it, the model uses it. Cheapest to ship, opaque to debug, hard to evaluate. Right for chatty consumer products; wrong for an agent platform you're shipping to enterprise customers who need ZDR (zero data retention).
- ▸Letta / MemGPT-style self-paging. The agent has tools to move information between core (in-context), archival (vector store), and recall (conversation log). The agent decides what to remember and what to evict. Stateful, self-improving, exciting. Also non-deterministic in a way that makes evals hard, and the failure mode is 'the agent forgot the thing you needed it to remember.'
- ▸Mem0 / Zep middleware. CRUD layer in front of a vector or graph store, called by the agent or by the orchestrator. Mem0's recent benchmarks: 91% lower p95 latency than full-context, 90% fewer tokens, with roughly 6 percentage points of accuracy given up versus full context. Production-ready, framework-agnostic, opinionated about extraction.
- ▸Anthropic Agent Skills (October 2025 / GA Q1 2026). Filesystem-based capability modules: a SKILL.md file with YAML frontmatter plus scripts and resources. Progressive disclosure: the agent reads the skill metadata, then loads the body only when it needs to (sketched below). Open standard, cross-platform, designed for reusability across Claude.ai / Code / API.
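To make progressive disclosure concrete, a minimal loader-side sketch in Python. The two-step split (frontmatter first, body on demand) follows the SKILL.md convention; the hand-rolled frontmatter parsing and the function names are illustrative, not part of the spec.

```python
from pathlib import Path
import yaml  # PyYAML

def skill_metadata(skill_dir: Path) -> dict:
    """Step 1: read only the YAML frontmatter (name, description, ...)."""
    text = (skill_dir / "SKILL.md").read_text()
    # Frontmatter sits between the first two '---' fences.
    _, frontmatter, _ = text.split("---", 2)
    return yaml.safe_load(frontmatter)

def skill_body(skill_dir: Path) -> str:
    """Step 2, on demand: load the full instructions only when the skill is in scope."""
    return (skill_dir / "SKILL.md").read_text().split("---", 2)[2]
```

The orchestrator can surface metadata for every installed skill cheaply and pay the token cost of a body only for skills that are actually in scope.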
Each of these is the right answer for some workload. The mistake is picking one because it's the trendy one in the conference talk you saw last week, when the actual question is 'what's my workload?'
The shape I ship: orchestrator-driven injection into the cached prefix
┌────────────────────────────────────────────────┐
│ ORCHESTRATOR (server-side, my code) │
│ decides at request time: │
│ - which skills are in scope for this user? │
│ - which memory items are relevant? │
│ - what's the eval rubric / policy? │
└──────────────────────┬─────────────────────────┘
│ assemble
▼
┌────────────────────────────────────────────────┐
│ CACHED PREFIX (1h TTL, cache_control) │
│ ┌──────────────────────────────────────┐ │
│ │ TOOLS (always) │ │
│ ├──────────────────────────────────────┤ │
│ │ SKILLS (this user's enabled set) │ │
│ │ - skill A: SKILL.md + helpers │ │
│ │ - skill B: SKILL.md + helpers │ │
│ ├──────────────────────────────────────┤ │
│ │ MEMORY (this user's relevant facts) │ │
│ │ - episodic: recent interactions │ │
│ │ - semantic: stable preferences │ │
│ │ - procedural: learned heuristics │ │
│ └──────────────────────────────────────┘ │
└──────────────────────┬─────────────────────────┘
│
▼
┌────────────────────────────────────────────────┐
│ CHAT HISTORY (5m TTL) + USER TURN (uncached)│
└────────────────────────────────────────────────┘

Three things make this work:
- ▸The orchestrator decides what goes in. Not the agent. The agent's job is to reason and act, not to manage its own memory store. The orchestrator runs a deterministic selection (which skills are enabled for this user, which memory entries match the conversation context above a similarity threshold) and assembles the prefix before the model is invoked; a code sketch follows this list.
- ▸Skills and memory live in the cached prefix, not the message stream. Both are slow-moving relative to the conversation: a user's enabled skill set rarely changes mid-session, their semantic memory updates on a slower clock than each turn. Putting them in the cached prefix means the prompt-cache absorbs the cost; the marginal per-request input is just chat history + the new user turn.
- ▸Skills and memory are kept separate. Skills are how the agent does things (procedures, tool usage patterns, output schemas). Memory is what the agent knows (about this user, this domain, this session's history). Mixing them is a footgun: skills should be deterministic and reviewed; memory should be malleable and personalized.
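Concretely, a minimal sketch of the assembly step against Anthropic's Messages API. The model id and section headers are illustrative, and the selection logic is elided (it's the pipeline described below); the point is the shape: one stable system block, one long-TTL cache breakpoint, with history and the new turn kept out of it.

```python
import anthropic

client = anthropic.Anthropic()

def build_prefix(skills_text: str, memory_text: str) -> list[dict]:
    """One stable system block = one cache breakpoint for the whole skills+memory zone."""
    return [{
        "type": "text",
        "text": f"# Skills\n{skills_text}\n\n# Memory\n{memory_text}",
        # 1h TTL keeps the prefix warm across think-time between turns.
        # (Extended TTLs may require Anthropic's extended-cache-ttl beta header.)
        "cache_control": {"type": "ephemeral", "ttl": "1h"},
    }]

def run_turn(prefix: list[dict], history: list[dict], user_turn: str):
    return client.messages.create(
        model="claude-sonnet-4-5",  # illustrative model id
        max_tokens=1024,
        system=prefix,              # identical bytes every turn -> cache hit
        messages=history + [{"role": "user", "content": user_turn}],
    )
```

The discipline that matters is byte-stability: the prefix is rendered from a fixed schema, so the same skills and memory always hash to the same cache entry.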
Why injection beats self-paging for production agents
The Letta / MemGPT vision (the agent manages its own memory) is genuinely beautiful. It's also a debugging nightmare in production. Three reasons I land on injection instead:
- ▸Determinism for evals. The same input has to produce a comparable output across runs, otherwise the eval harness's diff is meaningless. If the agent silently decided to evict a memory entry between the baseline run and the candidate run, the score moved for a reason that has nothing to do with the prompt change you were testing. Orchestrator-driven injection makes the inputs reproducible.
- ▸Compliance and auditability. Enterprise customers ask 'what did the model see when it generated this answer?' If the agent paged its own memory, the audit log has to reconstruct which entries were in core vs archival at decision time, a non-trivial query. Injection makes 'what did the model see' a trivially recoverable record: the prefix as assembled, logged at request time (see the sketch after this list).
- ▸Cache-friendliness. Self-paging changes the prefix on most turns (the agent paged something in or out). Each change invalidates the prompt cache for everything after it. Injection assembles a prefix that's stable across the session: skills don't change, memory updates are batched at the start of new sessions, the cache stays warm for the entire conversation. The cost math gets dramatically better.
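What the auditability point buys you in practice: the record is written once, at assembly time, and recovery is a lookup. A sketch, with illustrative field names:

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class PrefixAudit:
    request_id: str
    user_id: str
    skill_ids: tuple[str, ...]   # versioned ids, e.g. "invoice-approval@3.2.0"
    memory_ids: tuple[str, ...]  # the exact entries injected for this request
    prefix_sha256: str           # hash of the assembled prefix bytes
    assembled_at: str

def audit_prefix(request_id, user_id, skill_ids, memory_ids, prefix_text) -> PrefixAudit:
    """Written once per request; 'what did the model see' becomes a single query."""
    return PrefixAudit(
        request_id=request_id,
        user_id=user_id,
        skill_ids=tuple(skill_ids),
        memory_ids=tuple(memory_ids),
        prefix_sha256=hashlib.sha256(prefix_text.encode()).hexdigest(),
        assembled_at=datetime.now(timezone.utc).isoformat(),
    )
```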
The skill / memory boundary
Most teams I review conflate skills and memory. The conflation is expensive. A clean boundary:
- ▸Skill: a versioned, code-reviewed instruction module. 'Here is how you write a SQL query against the customer schema.' 'Here is how you format an invoice approval response.' Skills are stable, they live in the repo, they have tests. They are written by engineers and reviewed before deploy. They are not personalized.
- ▸Memory: a personalized, dynamic piece of knowledge about this user or this session. 'This user prefers concise answers.' 'This customer's invoice format is XYZ.' 'In this session, the user already said no to option A.' Memory is malleable, it lives in a database, it updates as the session progresses. It is generated, not authored.
Mixing them produces two failure modes. Skills get personalized (drift from the eval'd version, tests start failing, regressions appear). Memory gets versioned and code-reviewed (a friction nightmare for what should be a fast-moving learned signal). Keep them in separate stores, separate schemas, separate review processes. The architectural cost is small; the operational benefit is large.
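In code, the boundary is just two schemas that never share a table. A sketch, with illustrative field names:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)   # skills are immutable at runtime
class Skill:
    skill_id: str         # e.g. "invoice-approval-response"
    version: str          # bumped via code review, like any deploy
    body_md: str          # the SKILL.md body, authored by engineers
    # Lives in the repo; changed via PR; covered by evals. Never personalized.

@dataclass                # memory is mutable by design
class MemoryEntry:
    entry_id: str
    user_id: str          # partitioned per user
    kind: str             # "episodic" | "semantic" | "procedural"
    text: str             # e.g. "prefers concise answers"
    score: float          # utility signal from reranking
    updated_at: datetime
    # Lives in a database; written by the system as sessions progress; never code-reviewed.
```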
How memory selection works in practice
Memory injection isn't 'shove all the user's memory into the prefix.' That's expensive, and once the memory store passes roughly 100K tokens it hits lost-in-the-middle retrieval problems. The selection pipeline I run on each request, sketched in code after the list:
- ▸Step 1: pre-filter by user_id + recency. Memory is partitioned per user; entries older than the user's per-tier retention horizon are excluded.
- ▸Step 2: vector-similarity search with the user's last few turns as the query. Top-k (typically 10-30) relevant entries.
- ▸Step 3: rerank by a small model (Claude Haiku or a fine-tune) that scores 'is this memory entry actually useful for the next decision' to drop low-signal noise.
- ▸Step 4: budget-aware truncation. The memory zone has a token budget (typically 5-15K). If the reranked entries exceed it, drop the lowest-scored ones until the rest fit.
- ▸Step 5: format into the cached-prefix shape. Stable schema across requests so the cache hashes consistently.
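A sketch of the whole pipeline. The vector search, reranker, and tokenizer are injected as callables because they're deployment-specific; every threshold and horizon here is illustrative, not a recommendation.

```python
from datetime import datetime, timedelta, timezone

def select_memory(entries, last_turns, *, nearest, rerank, count_tokens,
                  k=30, budget_tokens=10_000, min_score=0.5,
                  horizon=timedelta(days=90)):
    # Step 1: pre-filter by recency (entries are already partitioned per user).
    cutoff = datetime.now(timezone.utc) - horizon
    fresh = [e for e in entries if e.updated_at >= cutoff]

    # Step 2: vector similarity against the user's last few turns.
    top_k = nearest(query=last_turns, entries=fresh, k=k)

    # Step 3: rerank with a small model, highest-utility first.
    scored = sorted(((rerank(last_turns, e), e) for e in top_k),
                    key=lambda pair: pair[0], reverse=True)

    # Step 4: budget-aware truncation; lowest-scored entries drop first.
    selected, used = [], 0
    for score, entry in scored:
        if score < min_score:
            break  # everything after this is lower-scored noise
        cost = count_tokens(entry.text)
        if used + cost <= budget_tokens:
            selected.append(entry)
            used += cost

    # Step 5: formatting into the stable prefix schema happens in one place
    # downstream, so the cache hashes consistently across requests.
    return selected
```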
Step 3 is the one that makes the difference. A naive top-k vector search returns plausible-looking entries that don't actually help; reranking with a small model that's read the same prompt context catches the cases where vector similarity is misleading. The rerank is cheap (Haiku call, sub-second) and adds maybe 5% latency for a meaningful quality win.
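For completeness, a sketch of what that rerank call can look like against the Messages API. The model id, the 0-10 scale, and the prompt wording are all illustrative choices, not a spec.

```python
import anthropic

client = anthropic.Anthropic()

def rerank(last_turns: str, entry_text: str) -> float:
    """Score: is this memory entry actually useful for the next decision?"""
    response = client.messages.create(
        model="claude-haiku-4-5",  # illustrative small-model id
        max_tokens=4,
        system="Score 0-10 how useful the memory entry is for answering the "
               "conversation's next turn. Reply with the number only.",
        messages=[{
            "role": "user",
            "content": f"Conversation:\n{last_turns}\n\nMemory entry:\n{entry_text}",
        }],
    )
    try:
        return float(response.content[0].text) / 10.0  # normalize to 0-1
    except ValueError:
        return 0.0  # an unparseable score counts as low-signal
```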
When self-paging (Letta-style) is the right answer
I want to be fair to the architecture I'm not using:
- ▸Long-running autonomous agents. An agent that runs for days or weeks without a human in the loop should manage its own memory; the orchestrator can't anticipate every state transition. Letta-style self-paging is right for research / exploratory work and for autonomous deep-research bots.
- ▸Multi-agent collaboration where memory is the coordination surface. If three agents are working together and the shared memory is part of the protocol, you want each agent to be able to write and evict autonomously.
- ▸When the cost of getting it wrong is low. A consumer chatbot that occasionally forgets a preference is annoying but not fatal. A self-paging architecture is fine here; the determinism cost doesn't matter much.
On Anthropic Agent Skills specifically
The October 2025 / Q1 2026 Agent Skills release is the closest thing to a standard the field has. Filesystem-based, progressive disclosure (the agent reads metadata first, body only when needed), open spec. I think it's the right shape for how skills should be defined, and I've migrated my skill format to match the SKILL.md + YAML frontmatter convention so future portability is cheap. The piece Skills doesn't solve is the memory side, which is correctly out of its scope: skills are about capabilities, not about per-user state. The two layers compose; you don't pick between them.
What I would change
- ▸Add explicit memory-tier promotion / demotion as a background job. Today entries move between tiers (recent / consolidated / archived) on a coarse cadence. A small daily job that promotes high-utility entries (high rerank scores across many sessions) and demotes low-utility ones would tighten the signal; a sketch follows this list.
- ▸Treat memory as event-sourced too (per the append-only chat-events post). Today memory writes overwrite; an append-only history of how each fact evolved would let me debug 'why does the agent think the user prefers X' more cheaply.
- ▸Bring the rerank model in-house for the highest-volume tenants. Haiku is cheap but not free; a small distilled rerank model could drop p99 latency further at modest engineering cost.
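The first of those is small enough to sketch. The store interface, tier names, and thresholds here are all hypothetical.

```python
def retier_memory(store, promote_at: float = 0.8, demote_at: float = 0.2,
                  min_sessions: int = 5):
    """Hypothetical daily job: move entries between tiers based on observed utility."""
    for entry in store.all_entries():
        stats = store.rerank_stats(entry.entry_id)  # mean rerank score + session count
        if stats.sessions < min_sessions:
            continue  # not enough signal yet
        if stats.mean_score >= promote_at and entry.tier != "consolidated":
            store.set_tier(entry.entry_id, "consolidated")  # promote: stays in scope longer
        elif stats.mean_score <= demote_at and entry.tier != "archived":
            store.set_tier(entry.entry_id, "archived")      # demote: out of the hot path
```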
The bigger lesson
Memory is the place agent infrastructure decisions compound the most. Get it wrong and your evals are non-reproducible, your compliance audit is impossible, and your prompt cache is permanently cold. Get it right and the agent feels like it knows the user, the costs stay bounded, and the system is debuggable. The framework choice is downstream of the architectural choice; the architectural choice is downstream of 'who decides what the agent remembers.' My answer to that question is 'the orchestrator, deterministically' for the workloads I ship. Other answers are right for other workloads. The mistake is not making the choice at all: defaulting to whatever the framework ships and finding out two months later that the default doesn't match your shape.
If a hiring manager asks me how I think about memory in agent infrastructure, this is the framing. Not 'we use Letta because it's the cool one' or 'we use Mem0 because it has good benchmarks,' but 'here's the deterministic injection shape, here's why it fits enterprise workloads with eval discipline, here's the path to swap it out if the workload changes.' The architectural literacy is the hire, not the framework choice.
References
- ▸Letta (formerly MemGPT): three-tier core/archival/recall self-paging architecture (letta.com)
- ▸Mem0: 'Building Production-Ready AI Agents with Scalable Long-Term Memory' (arXiv 2504.19413)
- ▸Zep: temporal knowledge-graph memory layer (getzep.com)
- ▸Anthropic Agent Skills: official docs (platform.claude.com/docs/agents-and-tools/agent-skills)
- ▸anthropics/skills: public skills repository (github.com/anthropics/skills)
- ▸LangMem: episodic / semantic / procedural memory types (LangChain blog)
- ▸MIRIX: 6-type multi-agent memory architecture (arXiv 2507.07957)