Preamble
Naive top-k vector search is a decent demo and a fragile production strategy. Users paraphrase; documents use different jargon; multi-hop questions need facts that never co-occur in a single chunk. When the backend is a smaller distilled model (Mistral-class, Qwen, LLaMA derivatives), the generator is less able to “paper over” bad retrieval with parametric knowledge—so the retrieval layer has to pull its weight.
This post collects upgrades worth reaching for before touching fancier graph RAG or agent loops. They pair well with the foundations in RAG Foundations: Embeddings, Chunking, and the Retrieval Loop and with the LoRA/serving posts that follow (LoRA: Fine-Tuning Small Distilled Models Without Full Weights, Serving Distilled Models Behind an HTTP API).
Re-ranking: a second opinion on (query, passage) pairs
Bi-encoders (your embedding model) score candidates cheaply because query and document vectors are precomputed. Cross-encoders and learned re-rankers jointly attend to query and text; they are heavier per pair but far sharper at deciding whether a chunk truly answers the question.
In production the usual pattern is:
- Retrieve a wider candidate set (e.g. 50–200) with fast vectors.
- Re-rank down to 3–10 with a cross-encoder or a small reranker model.
- Pass only that short list to the LLM.
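The widen-then-narrow pattern above is small enough to sketch end to end. `vector_search` and `rerank_score` below are hypothetical stand-ins for whatever index and cross-encoder your stack provides; the toy lambdas at the bottom only illustrate the shape.

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    vector_search: Callable[[str, int], list[str]],  # stand-in: cheap bi-encoder retrieval
    rerank_score: Callable[[str, str], float],       # stand-in: expensive pairwise scorer
    k_wide: int = 100,
    k_tight: int = 5,
) -> list[str]:
    """Widen with fast vectors, then narrow with a per-pair scorer."""
    candidates = vector_search(query, k_wide)
    ranked = sorted(candidates, key=lambda p: rerank_score(query, p), reverse=True)
    return ranked[:k_tight]

# Toy stand-ins so the pipeline shape is visible without a real index.
docs = ["refund policy: 30 days", "shipping times vary", "refunds need a receipt"]
hits = retrieve_then_rerank(
    "how do refunds work",
    vector_search=lambda q, k: docs[:k],
    rerank_score=lambda q, p: float("refund" in p),
    k_tight=2,
)
print(hits)  # the two refund-related passages rank first
```

Only `hits` reaches the LLM prompt; everything wider stays in the retrieval layer.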
The latency trade-off is real. Watch p95 end-to-end latency; many teams limit reranking to "high value" intents or gate it behind a cheap ambiguity classifier.
Query expansion and HyDE
Query expansion adds terms or sub-questions so lexical and embedding gaps shrink. Simple forms include synonyms from a domain glossary or LLM-generated bullet points that rephrase the user ask.
HyDE (Hypothetical Document Embeddings) flips the idea: ask the LLM to write a fake answer as if it knew the corpus, embed that synthetic paragraph, and search with it. It can bridge vocabulary mismatch; it can also amplify hallucinated specifics if you are not careful. HyDE fits better as an optional path behind a feature flag, not the default for regulated domains.
Metadata filters and tenancy
Vectors without filters become a soup. Retrieval should be scoped by tenant, product, language, document type, or time range whenever those axes exist in the source system. That mirrors the “bounded context” thinking from earlier architecture writing: fewer collisions, easier debugging, and less chance of leaking one customer’s chunk into another’s session.
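A minimal sketch of that scoping, assuming hypothetical metadata fields (`tenant`, `lang`, `updated`); real vector databases push these predicates into the index query so out-of-scope chunks never reach vector scoring at all.

```python
from datetime import date

# Hypothetical chunk metadata; field names are illustrative, not a schema.
chunks = [
    {"id": "c1", "tenant": "acme", "lang": "en", "updated": date(2024, 5, 1)},
    {"id": "c2", "tenant": "globex", "lang": "en", "updated": date(2024, 5, 1)},
    {"id": "c3", "tenant": "acme", "lang": "de", "updated": date(2022, 1, 1)},
]

def in_scope(chunk, *, tenant, lang, not_before):
    """Hard scope: wrong tenant, language, or age is excluded before similarity."""
    return (
        chunk["tenant"] == tenant
        and chunk["lang"] == lang
        and chunk["updated"] >= not_before
    )

scoped = [
    c["id"]
    for c in chunks
    if in_scope(c, tenant="acme", lang="en", not_before=date(2023, 1, 1))
]
print(scoped)  # only the in-tenant, in-language, recent chunk survives
```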
Evaluation that catches regressions
Three layers of signal beat one pretty demo:
- Retrieval quality — Labeled (query, relevant_chunk_id) pairs—or weaker proxies like "any gold paragraph in top-k"—give you recall@k and MRR. When embedders or chunking change, this layer screams first.
- Answer groundedness — Does the final answer stick to retrieved text? LLM-as-judge helps if the judge is validated against human ratings and prompts stay versioned. A single automated score should not pass as compliance evidence alone.
- Latency and cost — Track histograms for embed, search, rerank, and generate. A "smarter" stack that misses SLOs is a failed deploy.
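The retrieval-quality metrics above are a few lines each; a minimal sketch of recall@k and MRR over one query (averaging over a labeled set is left to the harness):

```python
def recall_at_k(gold_ids: set[str], retrieved_ids: list[str], k: int) -> float:
    """1.0 if any gold chunk appears in the top k, else 0.0; average over queries."""
    return float(bool(gold_ids & set(retrieved_ids[:k])))

def mrr(gold_ids: set[str], retrieved_ids: list[str]) -> float:
    """Reciprocal rank of the first gold hit; 0.0 if no gold chunk is retrieved."""
    for rank, cid in enumerate(retrieved_ids, start=1):
        if cid in gold_ids:
            return 1.0 / rank
    return 0.0

retrieved = ["c9", "c4", "c1", "c7"]
print(recall_at_k({"c1"}, retrieved, k=3))  # 1.0: gold appears at rank 3
print(mrr({"c1"}, retrieved))               # 1/3: first gold hit at rank 3
```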
Automate a regression job when the corpus or models change. Same discipline as pinning dependencies: record model revisions, index build IDs, and dataset hashes.
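One way to record those identifiers is a small manifest written next to each index build; the field names and values below are placeholders, and in practice you would hash the actual dataset file bytes.

```python
import hashlib
import json

def sha256_hex(data: bytes) -> str:
    """Content hash for a dataset snapshot; store it with the index build."""
    return hashlib.sha256(data).hexdigest()

# Stand-in for the bytes of questions.jsonl; hash the real file in your job.
dataset_bytes = b'{"question": "how do refunds work", "gold_chunk_ids": ["c1"]}\n'

manifest = {
    "embedder_revision": "abc123",       # placeholder: pin the exact Hub commit
    "index_build_id": "idx-2024-06-01",  # placeholder build identifier
    "dataset_sha256": sha256_hex(dataset_bytes),
}
print(json.dumps(manifest, indent=2))
```

When a regression shows up, the manifest tells you exactly which embedder, index, and dataset combination produced it.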
Step-by-step: widen → re-rank → generate
This is the default production upgrade path once naive top-k retrieval plateaus.
- Vector retrieve — Pull `k_wide` (50–200) ids + texts with scores from your index (same embedder revision as in RAG Foundations: Embeddings, Chunking, and the Retrieval Loop).
- Re-rank — Score each `(query, passage)` pair with a cross-encoder or hosted rerank API; sort descending; keep `k_tight` (3–10).
- Prompt — Inject only the `k_tight` passages into the chat template; log their ids on the response object for debugging.
- Measure — Offline: recall@k on labeled data before and after rerank (rerank can hurt if the bi-encoder already ranked gold in position 80 and the cross-encoder mis-scores length-biased chunks).
Tooling comparison for re-ranking
| Approach | Examples | Pros | Cons |
|---|---|---|---|
| Cross-encoder (self-hosted) | sentence-transformers CrossEncoder, small Hub models | Strong accuracy; no per-query API bill; full control | GPU latency; batching complexity at high QPS |
| Hosted rerank | Cohere rerank, Voyage, Jina rerank APIs | Fast integration; scales with vendor | Cost + egress; must pin API version behavior |
| LLM pointwise scoring | “Rate 0–5 how well this passage answers…” | Flexible rubric | Slow, expensive, judge drift; validate against humans |
| ColBERT-style late interaction | RAGatouille, ColBERT indexes | Excellent retrieval quality for some corpora | Heavier index + ops than flat vectors |
Cross-encoder or hosted rerank is the usual workhorse; LLM-as-judge for reranking is a last resort after simpler baselines fail.
Illustrative cross-encoder pass (after you already have a list of candidate passage strings, `passages`):
```python
from sentence_transformers import CrossEncoder

# Joint (query, passage) scoring: heavier than bi-encoder dot products, but sharper.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
pairs = [[query, p] for p in passages]
scores = model.predict(pairs)

# Sort passages by descending score (key avoids comparing the strings on ties).
ranked = [p for _, p in sorted(zip(scores, passages), key=lambda t: t[0], reverse=True)]
tight = ranked[:8]
```
Pin the CrossEncoder weights the same way you pin Hub LLM revisions.
HyDE and query expansion: controlled experiments
Query expansion — Maintain a domain glossary JSON; at query time append synonyms, or ask a small LLM for 3 paraphrases and embed the centroid / run multi-query fusion (retrieve per variant, merge with max score). Log which variant hit gold.
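The "merge with max score" step of multi-query fusion can be sketched as below; the per-variant `(chunk_id, score)` lists are hypothetical outputs of running retrieval once per paraphrase.

```python
from collections import defaultdict

def fuse_max(results_per_variant: list[list[tuple[str, float]]], k: int) -> list[str]:
    """Merge per-variant (chunk_id, score) lists, keeping each id's best score."""
    best: dict[str, float] = defaultdict(lambda: float("-inf"))
    for results in results_per_variant:
        for cid, score in results:
            best[cid] = max(best[cid], score)
    return [cid for cid, _ in sorted(best.items(), key=lambda t: t[1], reverse=True)][:k]

# Hypothetical scores from three query variants (original + two paraphrases).
variants = [
    [("c1", 0.82), ("c2", 0.40)],
    [("c3", 0.77), ("c1", 0.50)],
    [("c2", 0.90)],
]
print(fuse_max(variants, k=2))  # c2 wins via its best variant, then c1
```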
HyDE — (1) Prompt LLM: “Write a short factual paragraph that would answer: …” (2) Embed that paragraph. (3) Search. A/B behind a flag: measure precision@k on a holdout set; watch for hallucinated proper nouns in the synthetic paragraph polluting retrieval.
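The three HyDE steps reduce to a few lines once the LLM, embedder, and index are injectable; everything named `fake_*` below is a toy stand-in so the control flow is visible without a real stack.

```python
def hyde_search(query: str, llm, embed, index_search, k: int):
    """HyDE: search with the embedding of a synthetic answer, not the raw query."""
    hypothetical = llm(f"Write a short factual paragraph that would answer: {query}")
    return index_search(embed(hypothetical), k)

# Toy stand-ins; llm, embed, and index_search are whatever your stack provides.
fake_llm = lambda prompt: "Refunds are issued within 30 days with a receipt."
fake_embed = lambda text: [float(len(text))]      # placeholder "vector"
fake_index = lambda vec, k: [("c1", 0.9)][:k]
result = hyde_search("how do refunds work", fake_llm, fake_embed, fake_index, k=1)
print(result)
```

Because the synthetic paragraph is logged as part of the trace, hallucinated proper nouns that pollute retrieval are easy to spot during the A/B.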
Lightweight eval harness (copy-paste structure)
Keep a `questions.jsonl` with `question`, `gold_chunk_ids[]`, and an optional `metadata_filter`. For each line:
- Run retrieval (and optional rerank).
- Compute recall@k: did any gold id appear in the top k returned ids?
- Emit a CSV: `question_id, gold_hit@20, gold_rank, latency_ms`.
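The harness above fits in one function; a minimal sketch, assuming `retrieve` is your retrieval (plus optional rerank) callable and the JSONL is passed in as text:

```python
import csv
import io
import json
import time

def eval_retrieval(jsonl_text: str, retrieve, k: int = 20) -> str:
    """Run each labeled question, compute hit@k and gold rank, return CSV text."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["question_id", f"gold_hit@{k}", "gold_rank", "latency_ms"])
    for i, line in enumerate(jsonl_text.splitlines()):
        row = json.loads(line)
        start = time.perf_counter()
        ids = retrieve(row["question"])[:k]
        latency_ms = (time.perf_counter() - start) * 1000
        gold = set(row["gold_chunk_ids"])
        ranks = [r for r, cid in enumerate(ids, start=1) if cid in gold]
        writer.writerow([i, int(bool(ranks)), ranks[0] if ranks else "", f"{latency_ms:.1f}"])
    return out.getvalue()

sample = '{"question": "how do refunds work", "gold_chunk_ids": ["c1"]}'
print(eval_retrieval(sample, retrieve=lambda q: ["c9", "c1"]))
```

The CSV is diffable across runs, which is what makes the merge gate mechanical rather than vibes-based.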
Gate merges on recall@k not moving backward when you change chunking, embedder, or reranker. The “answer groundedness” layer in this post stacks on top once retrieval is no longer random.
Conclusion
Treat RAG as a product surface. Log which chunk IDs supported each answer; wire request IDs through search and generation; and assume distilled backends will not hide bad retrieval.
LoRA: Fine-Tuning Small Distilled Models Without Full Weights picks up LoRA on those same backends so domain tone and tools improve without full fine-tunes.