TANAY.SHAH
// PUBLISHED 2026-05-10 · 8 MIN READ

What 1M Context Actually Buys You (and What It Doesn't): Production Patterns from a 2026 Agent Loop

Anthropic made 1M context generally available on Claude Sonnet 4.6 in March 2026 with no surcharge. The marketing reads like RAG is dead. The production reality: lost-in-the-middle is still real, 1M works best as a tool rather than a default, and the right answer is usually a hybrid. Here's how I actually use the 1M window in an agent loop, and where I still reach for RAG.

Anthropic made the 1M-token context window generally available on Claude Sonnet 4.6 in March 2026, then on Opus 4.7 a few weeks later, with no per-token surcharge for going long. Standard rate, full window. The marketing-adjacent take that immediately appeared online was 'RAG is over, just put everything in context.' That's the version that belongs in a sales-engineering deck. The production version, which is what I want to talk about here, is different: 1M context is genuinely useful, sometimes load-bearing, and almost always the wrong default. Here's the math, the failure modes, and the patterns I actually ship.

What changed in March 2026, and what didn't

  • What changed: 1M context is now standard-priced on Sonnet 4.6 and Opus 4.6/4.7. Filling a million tokens is roughly $3 in input on Sonnet ($15 on Opus). Prompt caching applies, so the effective cost on the second-and-later read is closer to 30 cents (Sonnet) since cache reads are 0.1x base input. This is the cost reduction that makes long context affordable in agent loops, not the raw context size.
  • What didn't change: transformer attention. The 'lost in the middle' phenomenon Stanford documented in 2023 still applies at 1M. Across multiple frontier models, performance on retrieval-from-the-middle drops 20-30 percentage points compared to retrieval-from-the-edges. RoPE positional embeddings decay attention with relative distance, which favors the last few thousand tokens, and the primacy bias these models show in practice keeps the first few thousand salient; the middle gets deprioritized.
  • What also didn't change: throughput. Time to first token on a near-1M-token request is in the tens of seconds. The model is doing real work to attend over that much. If your latency budget is under five seconds end-to-end, 1M is not your tool.

The headline (1M context, no surcharge) is technically a feature unlock. The substance (you can now afford to use it, but it still hurts to use it badly) is what you have to think about as a builder.
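
To make that cost math concrete, here's a back-of-the-envelope sketch using the rates quoted in this post ($3/MTok Sonnet input, 1.25x for a cache write, 0.1x for a cache read). It's illustrative arithmetic, not a billing calculator, and the token counts are assumptions.

```python
# Back-of-the-envelope input-cost math for a long cached prefix on Sonnet,
# using the rates quoted above. Illustrative only; not a billing calculator.

SONNET_INPUT_PER_MTOK = 3.00   # USD per million input tokens
CACHE_WRITE_MULT = 1.25        # first request pays a premium to write the cache
CACHE_READ_MULT = 0.10         # later requests read the cached prefix at 0.1x base input

def request_cost(prefix_tokens: int, fresh_tokens: int, cache_hit: bool) -> float:
    """Input cost of one request: a cached prefix plus uncached per-query tokens."""
    prefix_mult = CACHE_READ_MULT if cache_hit else CACHE_WRITE_MULT
    prefix_cost = prefix_tokens / 1e6 * SONNET_INPUT_PER_MTOK * prefix_mult
    fresh_cost = fresh_tokens / 1e6 * SONNET_INPUT_PER_MTOK
    return prefix_cost + fresh_cost

# 500K-token cached corpus + 10K tokens of fresh, query-specific input.
first = request_cost(500_000, 10_000, cache_hit=False)   # ~$1.91
later = request_cost(500_000, 10_000, cache_hit=True)    # ~$0.18
print(f"first request ${first:.2f}, subsequent requests ${later:.2f}")
```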

Three patterns I actually use the 1M window for

Not as a default, not as a 'just shove the whole repo in there' replacement for retrieval, but as a tool I reach for when the alternative is worse:

  • Pattern 1: Long-running agent state preservation. An agent that runs a multi-hour workflow makes early decisions (constraints, schemas, customer-specific rules) that have to remain visible to later steps. Pre-1M, you'd compact + summarize; the summarization itself loses fidelity, and 'agent forgot what it decided two hours ago' is a real failure mode. With 1M, the early decisions are just there, near the start of the prompt where attention is best, and the agent's later turns reach back to them deterministically.
  • Pattern 2: Whole-document reasoning where chunking breaks the task. Cross-document consistency (does this PR violate a rule defined three files away?), entity resolution across hundreds of pages, code review over a 200-file diff. These are tasks where the right unit of analysis is bigger than any reasonable RAG chunk. Long context lets the model hold the entire artifact and reason without fragmenting.
  • Pattern 3: Prompt-cache-anchored knowledge. Drop a 500K-token corpus (your product docs, your customer's manual, your eval rubric) into the cached prefix once. Every subsequent request hits the cache (0.1x base input). The corpus stays in working memory at small marginal cost. This isn't 'RAG replacement,' it's 'a different way to amortize the cost of always-available context.'

Each of these has a specific shape. None of them is 'just use long context for everything.' Knowing which pattern matches the workload is the engineering judgment that 'use long context' as a slogan obscures.
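
To make Pattern 3 concrete, here's a minimal sketch of a cache-anchored corpus using the Anthropic Messages API. The model name follows this post's naming, the corpus path and question are placeholders, and the exact cache_control options should be checked against the current API docs.

```python
# Minimal sketch of Pattern 3: a large, slow-moving corpus marked with cache_control
# so every request after the first reads it from the prompt cache at 0.1x input price.
# Model name and corpus loading are placeholders.
import anthropic

client = anthropic.Anthropic()

corpus = open("product_docs_bundle.md").read()   # ~500K tokens of always-relevant docs

response = client.messages.create(
    model="claude-sonnet-4-6",                   # this post's model naming; substitute your own
    max_tokens=1024,
    system=[
        {"type": "text", "text": "You answer questions strictly from the product docs."},
        {
            "type": "text",
            "text": corpus,
            "cache_control": {"type": "ephemeral"},   # cache breakpoint after the corpus
        },
    ],
    messages=[{"role": "user", "content": "Does the v3 API still accept legacy auth tokens?"}],
)
print(response.content[0].text)
```

The first request pays the cache-write premium; every request after that reads the corpus at 0.1x until the cache expires.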

Lost-in-the-middle, in numbers I can name

I ran a small in-house probe similar to the public needle-in-a-haystack (NIAH) tests, but using a 'find the contradiction' task that requires the model to actually reason over the inserted information rather than just regurgitate it. Results:

  • Information at position 0-50K (start): ~96% recall on the contradiction.
  • Information at position 950K-1M (end): ~93% recall.
  • Information at position 400K-600K (middle): ~64% recall.
  • Multiple needles spread across the middle: drops to ~50%.

These numbers are specific to my probe and not a published benchmark. The shape, however, matches the broader literature: a U-shaped curve with primacy and recency biases, and a measurable degradation in the middle that scales with the size of the middle. At 1M tokens the middle is hundreds of thousands of tokens, and that's a lot of relevant information to deprioritize.
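
My harness isn't public, but the shape of the probe is simple enough to sketch. This is a schematic, not the actual code: it plants a known contradiction at a target token offset inside filler documents and checks whether the model names it. ask_model, tokenizer, the filler corpus, and the grading string match are all assumed inputs.

```python
# Schematic of a positional "find the contradiction" probe (not my production harness).
# Plant a known contradictory statement at a target token offset inside filler text,
# ask the model to identify it, and track recall per position bucket.
import random

def build_prompt(filler_docs: list[str], needle: str, target_offset: int, tokenizer) -> str:
    """Concatenate filler docs, splicing the needle in once the running token count
    passes target_offset."""
    parts, tokens_so_far, planted = [], 0, False
    for doc in filler_docs:
        if not planted and tokens_so_far >= target_offset:
            parts.append(needle)
            planted = True
        parts.append(doc)
        tokens_so_far += len(tokenizer.encode(doc))
    return "\n\n".join(parts)

def recall_by_position(offsets, trials, filler_docs, needle, expected, ask_model, tokenizer):
    """Fraction of trials, per offset, where the model's answer mentions the planted
    contradiction (a crude string match stands in for a proper grader)."""
    results = {}
    for offset in offsets:
        hits = 0
        for _ in range(trials):
            random.shuffle(filler_docs)
            prompt = build_prompt(filler_docs, needle, offset, tokenizer)
            reply = ask_model(prompt + "\n\nOne statement above contradicts the rest. Quote it.")
            hits += int(expected.lower() in reply.lower())
        results[offset] = hits / trials
    return results
```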

The practical consequence: you have to design the prompt with attention in mind. The most important content goes near the top or bottom. The cached corpus goes in the middle (since you'll re-prompt it many times, the absolute attention quality matters less than the cost). Time-sensitive instructions go right before the user turn, because that's where attention is highest at the moment of generation.

When I still use RAG (which is most of the time)

RAG isn't dead. RAG is the right tool when:

  • Latency matters. Retrieval + 8K-token reasoning is sub-second; 1M-context reasoning is double-digit seconds. Customer-facing chat, where 'agent is thinking...' can't be a multi-second pause, is RAG territory.
  • Source attribution is required. RAG returns a small set of identified passages; long context returns a synthesized answer over a large opaque region. Compliance, legal, and many enterprise use cases need to know which passage justified which claim. RAG gives you that for free.
  • The corpus changes faster than the cache can keep up. If your knowledge base updates every 30 seconds (a live product catalog, an event log), the prompt-cache lifetime is shorter than the update cycle and you're paying full input price every time. RAG with vector search hits a rebuilt index much more efficiently.
  • Cost still matters at scale. Even with caching, a 500K-token cached prefix multiplied across many requests is more expensive than a small RAG retrieval. At low scale this is invisible; at production scale it's a quarterly budget conversation.

The hybrid pattern (and why it's winning in 2026)

The configuration I find myself building over and over: a small (10-100K-token) RAG layer that retrieves the most-likely-relevant slice of the knowledge base, layered with a large (200-700K-token) cached prefix of always-relevant context (style guide, schema, policy, eval rubric). The long-context window holds the slow-moving foundation; the RAG layer brings in the fast-moving, query-specific bits. The model gets both, the cost stays bounded, and the per-request latency stays in the 2-5 second range rather than the 20-second range.

┌──────────────────────────────────────────────────────┐
│ TOOLS (cached, 1h TTL)                               │  ← top of attention,
│   stable across the entire deploy                    │     stays in cache
├──────────────────────────────────────────────────────┤
│ POLICY / RUBRIC / STYLE GUIDE (cached, 1h TTL)       │
│   slow-moving, always-relevant, 200-500K tokens      │  ← cache anchor
├──────────────────────────────────────────────────────┤
│ RAG-RETRIEVED PASSAGES (uncached, fresh per query)   │
│   query-specific, 5-20K tokens, top-k from vector DB │  ← query-relevant
├──────────────────────────────────────────────────────┤
│ CHAT HISTORY (cached for active session, 5m TTL)     │
│   dynamic, semi-recent                               │
├──────────────────────────────────────────────────────┤
│ USER TURN (uncached, current input)                  │  ← bottom of attention,
└──────────────────────────────────────────────────────┘     highest weight on
                                                             "what to do now"

What I would NOT do with 1M context

  • Treat NIAH benchmarks as a feature endorsement. Needle-in-a-haystack tests demonstrate that the model can find a verbatim phrase. Real production tasks ask the model to reason across many relevant passages, ignore many distractor passages, and synthesize an answer. NIAH is to long context what 'hello world' is to building a backend. Necessary, not sufficient.
  • Forget about cost just because there's no surcharge. A 1M-token request is roughly $3 in Sonnet input. At 10K requests per day that's $30K per day in input cost alone. Caching helps; it doesn't make the bill disappear. Track per-request cost and budget per workload type.
  • Use 1M for tasks where 200K does just as well. The performance vs context-size curve is non-linear; doubling the context doesn't double the answer quality, and at the long-context end the marginal quality gain often goes negative because of the lost-in-the-middle effect. If your eval scores plateau at 200K, run at 200K.
  • Skip the eval harness because 'long context fixes things.' If anything, long context introduces new failure modes (subtle attention biases, cache-invalidation bugs, larger blast radius for prompt changes). It needs more eval, not less.

What I'd change if I were rebuilding the long-context handling

  • Layered cache TTLs explicitly per zone. Tools and policy at 1h cache; RAG passages at 5m or no cache; chat history at 5m. The right shape is a per-zone cache_control directive, not a single global one.
  • Expose the eval harness's lost-in-the-middle probe as a CI gate. Today my probe runs ad-hoc when I change the prompt structure. It should run on every prompt-template commit so a regression in middle-context recall fails the build.
  • Track 'effective attention zone' per request: which tokens the model actually attended to (via the model's own attention APIs where available, or via probe-question audits). Today this is a manual investigation; it deserves a dashboard.
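
For the CI gate in the second bullet, the check can be as small as one test. A hypothetical pytest sketch, assuming the positional probe from earlier lives in an in-repo module and exposes a run_position_probe() entry point:

```python
# Hypothetical CI gate: fail the build if middle-of-context recall regresses.
# probes.lost_in_the_middle and run_position_probe() are assumed in-repo names.
import pytest

from probes.lost_in_the_middle import run_position_probe

MIDDLE_RECALL_FLOOR = 0.60   # just under the ~64% mid-context baseline from the probe above

@pytest.mark.slow            # long-context calls are expensive; run on prompt-template changes
def test_middle_context_recall_does_not_regress():
    recall = run_position_probe(offsets=[500_000], trials=20)
    assert recall[500_000] >= MIDDLE_RECALL_FLOOR, (
        f"middle-context recall {recall[500_000]:.2f} fell below floor {MIDDLE_RECALL_FLOOR}"
    )
```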

The bigger lesson

Long context is a sharper tool, not a bigger hammer. The question 'should I use 1M context here' isn't 'is the data big.' It's 'does the task need cross-document reasoning,' 'is latency tolerant,' 'does the cost math work,' and 'is the lost-in-the-middle exposure acceptable.' Most production tasks answer 'no' to at least one. The minority where the answer is 'yes' to all is where 1M earns its rate.

If a hiring manager asks me how I think about context engineering in 2026, this is the framing. Not 'we use 1M because Anthropic made it free,' but 'here's the workload, here's the failure mode each tool has, here's why the hybrid is the cheapest answer that doesn't lie.' That's the kind of decision a hiring manager is buying when they hire for AI infrastructure work.

References

  • Anthropic: Context windows and pricing for Sonnet 4.6 / Opus 4.7 (platform.claude.com/docs)
  • Stanford / Liu et al: 'Lost in the Middle: How Language Models Use Long Contexts' (2023)
  • Databricks: long-context RAG performance benchmarks across frontier models
  • MindStudio: 'Does a 1M Token Context Window Replace RAG?' (2026)
  • Anthropic: server-side compaction beta announcement (Q1 2026)
  • Diffray: 'Context Dilution: When More Tokens Hurt AI'
  • Pricing references: 1.25x cache write (5-min), 0.1x cache read