Why Vector Similarity Alone Lies in RAG (and the Rerank Step Most Pipelines Skip)
Vector top-k retrieval is the standard RAG starting point, and it has a known failure mode: high cosine similarity does not equal high relevance. The fix is equally well known, yet most production pipelines I read in code review skip it. The two-stage paradigm (broad recall, then narrow precision) plus the right reranker buys you 17-40 percentage points of accuracy for ~120ms of latency. Here's the framework.
Build a RAG pipeline using the default tutorial path and you'll end up with: chunk the documents, embed them, store in a vector DB, retrieve top-k by cosine similarity, stuff the results into the prompt, ship it. It works, demos great, and starts producing wrong answers at production scale for a specific reason: vector similarity is not relevance. A document with high cosine similarity to the query can match the query's vocabulary without actually answering it. This is the most-skipped fix in production RAG and the one with the biggest accuracy payoff. The cost is ~120ms of latency and one additional model call per query. The benefit, on the published 2026 benchmarks, is 17-40 percentage points of accuracy depending on the eval. This post is the framework I use to decide which reranker, where, and what to actually measure.
Why vector similarity lies
The intuition that breaks: 'this chunk is semantically close to the query, so it must answer the query.' Being close in embedding space means the chunk is in the same neighborhood as the query: similar topic, similar vocabulary, similar style. None of those guarantee that the chunk contains the specific information the query is asking for. A chunk discussing 'how to set up a Postgres advisory lock' is in the same neighborhood as the query 'do advisory locks work through PgBouncer?', yet it does not answer the query.
Cross-encoder rerankers solve this by jointly encoding the query and each candidate document with full self-attention, instead of independently encoding them and comparing the result vectors (which is what dense retrieval does). The cross-encoder sees the query and document together and produces a relevance score that reflects 'does this document answer this query.' The cost: cross-encoders are expensive per pair, so you can't run them over the whole index; you run them over a small candidate set the bi-encoder has already shortlisted.
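To make the distinction concrete, here's a minimal sketch using the sentence-transformers library; the checkpoints and toy documents are stand-ins for illustration, not a recommendation:

```python
# A minimal sketch contrasting bi-encoder similarity with cross-encoder relevance.
# Model names are public sentence-transformers checkpoints chosen for illustration.
from sentence_transformers import SentenceTransformer, CrossEncoder, util

query = "do advisory locks work through PgBouncer?"
docs = [
    "How to set up a Postgres advisory lock with pg_advisory_lock().",
    "PgBouncer in transaction pooling mode breaks session-level features such as advisory locks.",
]

# Stage 1 style: encode query and documents independently, compare vectors.
bi_encoder = SentenceTransformer("all-MiniLM-L6-v2")
q_emb = bi_encoder.encode(query, convert_to_tensor=True)
d_emb = bi_encoder.encode(docs, convert_to_tensor=True)
cosine_scores = util.cos_sim(q_emb, d_emb)[0]  # same-neighborhood score, not relevance

# Stage 2 style: encode each (query, document) pair jointly with full self-attention.
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
relevance_scores = cross_encoder.predict([(query, d) for d in docs])

for doc, cos, rel in zip(docs, cosine_scores, relevance_scores):
    print(f"cosine={float(cos):.3f}  rerank={float(rel):.3f}  {doc[:60]}")
```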
The two-stage paradigm, in numbers
The whole-pipeline budget for a customer-facing RAG application is typically 2-5 seconds end-to-end. That breaks down: ~200ms retrieval, ~120ms rerank, ~1-3s generation, ~100ms of overhead. The reranker step is well under 10% of the budget and produces almost half the accuracy improvement.
Hybrid retrieval at stage 1
Pure dense retrieval misses exact keyword matches (a query about 'CVE-2026-21852' doesn't get a hit on a doc that mentions it because embeddings under-weight numeric tokens). Pure sparse retrieval (BM25) misses semantic paraphrases ('container migration' vs 'moving a workload between clusters' share no keywords). Production RAG in 2026 runs both in parallel and fuses them with Reciprocal Rank Fusion (RRF), which combines two ranked lists into one in a way that's robust to scale differences between the two scoring systems. The fusion is a one-line algorithm: sum 1/(k + rank) for each document across both lists, sort by the sum.
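Here's that one-line algorithm as a minimal sketch; k=60 is the constant from the original RRF paper, and the ID lists are placeholders:

```python
# Reciprocal Rank Fusion: merge a dense and a sparse ranked list of document IDs.
# k=60 is the constant from the original RRF paper; the input lists are placeholders.
from collections import defaultdict

def rrf_fuse(dense_ids: list[str], sparse_ids: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = defaultdict(float)
    for ranked in (dense_ids, sparse_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)  # only rank matters, so raw score scales can't skew the fusion
    return sorted(scores, key=scores.get, reverse=True)

fused = rrf_fuse(["d3", "d7", "d1"], ["d7", "d9", "d3"])
# d7 and d3 appear in both lists, so they rise to the top of the fused ranking.
```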
Hybrid retrieval at stage 1 plus a reranker at stage 2 is the production-tested combination. Skipping either layer leaves accuracy on the table; skipping both produces the 'why is the LLM saying things that aren't in our docs' bug.
Picking the reranker
The 2026 reranker landscape has clear winners by axis:
- Best raw accuracy: Zerank 2 (1638 ELO on the head-to-head leaderboard) and Cohere Rerank 4 Pro (1629 ELO). Both are managed APIs, both at the higher latency tier (300-600ms).
- Best balance of accuracy + latency: Voyage Rerank 2.5. Strong ELO, ~600ms typical, multilingual support.
- Lowest latency: Jina Reranker v3 at ~188ms with 81.33% Hit@1. Right when the latency budget is tight.
- Cheapest at scale: FlashRank (CPU-based, MiniLM cross-encoder), 15-30ms locally. Right when API cost dominates and you can self-host.
- Open-source / self-hostable: BGE-reranker-v2-m3 or v2-Gemma. ~85% of Cohere quality, full data control. Right for ZDR-bound workloads where the document content is sensitive.
The choice within this landscape is the same shape as most other engineering decisions: blast radius (does the data have to stay on our infrastructure?), latency budget (sub-200ms or 'whatever fits in 2s'), cost (API per-call vs CPU instance), and accuracy floor (how bad does the LLM's wrong answer get when reranker accuracy drops 5%?). Most teams I review default to Cohere Rerank because it's the documented happy path; that's a fine default, just not the universal answer.
LLM-as-judge as the reranker
A different approach to reranking is to skip the cross-encoder entirely and use a small LLM (Claude Haiku, Gemini Flash, GPT-4o-mini) as the judge. The pitch: the LLM understands the query better than a cross-encoder; the cost is ~500ms-1s and ~2K input tokens per query. The cases where this wins are queries that require multi-hop reasoning or domain knowledge the cross-encoder doesn't have. The cases where this loses are high-volume RAG where the LLM bill at the rerank stage exceeds the cross-encoder API bill.
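A minimal sketch of the LLM-as-judge variant, using the Anthropic SDK; the model name, prompt wording, and 0-10 scale are assumptions, not a fixed recipe:

```python
# LLM-as-judge reranking: ask a small model to score each candidate's relevance.
# The model name, prompt wording, and 0-10 scale are illustrative assumptions.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

def judge_score(query: str, passage: str) -> float:
    response = client.messages.create(
        model="claude-3-5-haiku-latest",  # any small, cheap model works here
        max_tokens=8,
        messages=[{
            "role": "user",
            "content": (
                "Rate from 0 to 10 how well the passage answers the query. "
                "Reply with only the number.\n\n"
                f"Query: {query}\n\nPassage: {passage}"
            ),
        }],
    )
    try:
        return float(response.content[0].text.strip())
    except ValueError:
        return 0.0  # unparseable reply counts as irrelevant

def llm_rerank(query: str, candidates: list[str], top_n: int = 5) -> list[str]:
    scored = sorted(candidates, key=lambda p: judge_score(query, p), reverse=True)
    return scored[:top_n]
```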
I use LLM-as-judge for the agent's skill / memory selection (per the skill-memory post on this blog) because the workload is low-volume and the relevance criterion is hard to express with a cross-encoder. I use Cohere or Voyage for high-volume document RAG where each query is straightforward 'is this passage about the question's topic.' Right tool for the right query shape.
What I would change in the default
- Don't ship without a reranker. The 17-40 percentage point accuracy improvement is too big to leave on the table for the 'just retrieve' setup.
- Don't ship without hybrid retrieval. Pure dense retrieval fails on the keyword-match cases (numeric tokens, exact phrasings, code identifiers) that BM25 trivially handles. The fusion is one function call.
- Measure Recall@K, MRR, and nDCG, not just 'did the LLM say the right thing.' The whole-pipeline metric is too coarse to tell you which stage is the bottleneck. Per-stage metrics turn 'the LLM is wrong sometimes' into 'the reranker is dropping the right document at position 4 because the query phrasing is unusual.'
- Use the eval harness on the retrieval pipeline, not just the LLM output. Build a fixed corpus of (query, ideal-document-id) pairs; run the retrieval pipeline against it on every change; track the metrics over time (a minimal sketch follows this list). This is how you catch reranker regressions before they ship.
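A minimal sketch of that harness's metric layer; `run_pipeline` is a hypothetical hook that returns ranked document IDs for a query, and nDCG is omitted for brevity:

```python
# Per-stage retrieval metrics over a fixed eval corpus of (query, ideal_doc_id) pairs.
# `run_pipeline` is a hypothetical hook returning ranked doc IDs for a query.

def recall_at_k(ranked_ids: list[str], ideal_id: str, k: int) -> float:
    return 1.0 if ideal_id in ranked_ids[:k] else 0.0

def reciprocal_rank(ranked_ids: list[str], ideal_id: str) -> float:
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id == ideal_id:
            return 1.0 / rank
    return 0.0

def evaluate(corpus: list[tuple[str, str]], run_pipeline, k: int = 50) -> dict[str, float]:
    recalls, rrs = [], []
    for query, ideal_id in corpus:
        ranked_ids = run_pipeline(query)  # e.g. stage 1 only, or stage 1 + rerank
        recalls.append(recall_at_k(ranked_ids, ideal_id, k))
        rrs.append(reciprocal_rank(ranked_ids, ideal_id))
    return {f"recall@{k}": sum(recalls) / len(recalls), "mrr": sum(rrs) / len(rrs)}
```

Run it once against Stage 1 output and once against the reranked output; the gap between the two numbers is what the reranker is buying you on your own data.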
Common mistakes I see in production RAG
- Reranking too many candidates. The reranker's cost grows linearly with candidate count. Reranking 200 candidates wastes 4x the latency of reranking 50 for similar final quality. Stage 1 should narrow to 50-100; reranking 500+ is rarely worth it.
- Using cosine similarity as the reranker. People notice the precision problem, then 'fix' it by rerunning vector similarity with a better embedding model. The better embedding helps Stage 1 retrieval; it does not solve the precision problem at Stage 2 because the query and document are still encoded separately. You need joint encoding (a cross-encoder) for the precision improvement.
- Treating the reranker as static. The right reranker for your workload depends on the document distribution, the query phrasing, and the latency budget. The default works as a starting point; an A/B test against alternatives is what tells you which one fits your data.
- Skipping citation grounding. Once you have a reranker, you have a confidence score per chunk. Use it: when generating, require the LLM to cite the chunk_id for each claim, and surface the score so users can spot when the model cites a low-ranked chunk (a minimal sketch follows this list). This is the most underweighted observability surface in RAG.
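A minimal sketch of the citation-grounding shape using Pydantic; the field names and the 0.3 threshold are illustrative assumptions, not a spec:

```python
# Citation grounding: require each generated claim to name the chunk it came from,
# and carry the reranker score alongside so low-confidence citations are visible.
# The field names and the 0.3 threshold are illustrative assumptions.
from pydantic import BaseModel

class CitedClaim(BaseModel):
    claim: str
    chunk_id: str        # must match a chunk that was actually passed to the LLM
    rerank_score: float  # copied from the reranker output for that chunk

class GroundedAnswer(BaseModel):
    answer: str
    citations: list[CitedClaim]

def flag_weak_citations(answer: GroundedAnswer, threshold: float = 0.3) -> list[CitedClaim]:
    """Surface claims that cite chunks the reranker scored poorly."""
    return [c for c in answer.citations if c.rerank_score < threshold]
```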
What I ship
On the agent products I've shipped where RAG is part of the loop:
- Stage 1: Postgres + pgvector for dense retrieval, BM25 (Postgres tsvector or a separate index) for sparse, RRF fusion in a single SQL CTE. Top 50.
- Stage 2: Cohere Rerank 4 Pro on customer-facing public corpora; BGE-reranker-v2-m3 self-hosted for ZDR-bound private corpora. Top 5-10 reranked.
- Stage 3: Anthropic Claude with structured outputs requiring citations, evaluated with the eval harness from the earlier post on this blog.
- Per-stage metrics: Recall@50 at Stage 1, MRR@10 at Stage 2, citation correctness at Stage 3, all tracked in the eval harness. The dashboards show which stage moved when a regression appears. (The Stage 1 to Stage 2 hand-off is sketched below.)
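A minimal sketch of that hand-off using the Cohere SDK; `hybrid_retrieve` is a hypothetical wrapper around the pgvector + tsvector RRF query, and the model string is a placeholder for whichever rerank model you've settled on:

```python
# Stage 1 -> Stage 2 glue: hybrid retrieval narrows to 50, the reranker picks the top 5.
# `hybrid_retrieve` is a hypothetical wrapper around the pgvector + tsvector RRF query;
# the Cohere model name is a placeholder for whichever rerank model you settle on.
import os
import cohere

co = cohere.Client(api_key=os.environ["CO_API_KEY"])

def retrieve_and_rerank(query: str, hybrid_retrieve, top_n: int = 5) -> list[dict]:
    candidates = hybrid_retrieve(query, limit=50)  # [{"chunk_id": ..., "text": ...}, ...]
    response = co.rerank(
        model="rerank-v3.5",                       # placeholder model name
        query=query,
        documents=[c["text"] for c in candidates],
        top_n=top_n,
    )
    # Attach the reranker's relevance score to each surviving chunk for citation grounding.
    return [
        {**candidates[r.index], "rerank_score": r.relevance_score}
        for r in response.results
    ]
```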
The bigger lesson
RAG quality in production is dominated by the retrieval pipeline, not the LLM. Teams that obsess over the prompt while shipping naive vector retrieval get worse results than teams that ship a careful retrieval-plus-rerank pipeline with a less-tuned prompt. The prompt is the easy part; the retrieval architecture is the work. The two-stage paradigm and the reranker step are the cheapest interventions with the largest ROI; skipping them produces the median 'our RAG works most of the time' product that nobody trusts in production.
If a hiring manager asks me how I think about RAG quality in 2026, this is the framing. Not 'we use Cohere because it's the best,' but 'here's the two-stage architecture, here's the per-stage metric set, here's the reranker decision tree per workload.' Architectural literacy plus measurement discipline, in that order, is how RAG goes from demoware to product.
References
- Cohere: Rerank 3.5 / 4 / 4 Pro release notes and benchmarks
- Voyage AI: rerank-2 / rerank-2.5 multilingual rerankers
- Jina AI: Reranker v2 / v3 (~188ms, 81.33% Hit@1)
- BAAI: BGE Reranker v2-m3 / v2-Gemma (open source)
- Zerank 2: top of the head-to-head ELO leaderboard (1638)
- FlashRank: CPU-friendly cross-encoder (15-30ms)
- arXiv 2502.17036: 'Language Model Re-rankers are Fooled by Lexical Similarities'
- OptyxStack: 'Hybrid Search + Reranking Playbook' (2026 production write-up)