Preamble
Retrieval-Augmented Generation is the pattern everyone names when they want the model to cite your PDFs, tickets, or runbooks instead of improvising from weights alone. It is less like a single model feature and more like a small data product: you ingest documents, turn them into searchable units, pull the best candidates at query time, and only then hand text to the generator.
This month locks in vocabulary and failure modes before the later posts go deeper on re-ranking, evaluation, and the distilled-model stack paired with RAG through the rest of 2026.
The core loop, end to end
The happy path looks boring on purpose—that is a good sign.
- Ingest — You load sources (HTML, Markdown, PDF text, Slack exports, whatever). Normalization matters: strip boilerplate headers, fix encoding, and keep enough metadata (source URL, section, `updated_at`) that you can filter later.
- Chunk — Models do not read whole repositories in one gulp. You split text into overlapping segments—classically hundreds to low thousands of tokens per chunk, with overlap so sentences at boundaries are not orphaned. Tables and lists punish naive “split every N characters” strategies; structure-aware splitting is preferable when the format allows it.
- Embed — An embedding model maps each chunk to a dense vector. At query time you embed the user question (queries and passages are sometimes better served by asymmetric encoders—one flavor for passages, one for questions—but many stacks use a single model for both).
- Retrieve — You score candidates with cosine similarity, dot product (equivalent to cosine once vectors are unit-normalized), or a hybrid of BM25 + vectors. You return top-k chunks, optionally after metadata filters (tenant, product line, language).
- Generate — The LLM receives a system or developer instruction, the retrieved passages in a clearly delimited block, and the user message. The model’s job is to ground the answer in those passages while refusing when nothing supports a claim.
That loop is simple to diagram and surprisingly easy to get subtly wrong in production.
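The scoring choice in the Retrieve step is worth making concrete. A minimal pure-Python sketch (no vector DB, toy three-dimensional vectors) showing that dot product and cosine similarity agree once vectors are unit-normalized:

```python
import math

def dot(a: list[float], b: list[float]) -> float:
    """Plain dot product over two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by both vector norms."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

def normalize(v: list[float]) -> list[float]:
    """Scale a vector to unit length."""
    n = math.sqrt(dot(v, v))
    return [x / n for x in v]

q = [0.3, 0.4, 0.5]   # toy query embedding
d = [0.1, 0.9, 0.2]   # toy passage embedding

# On raw vectors, the two scores differ...
assert abs(dot(q, d) - cosine(q, d)) > 1e-9
# ...but after unit-normalization, dot product equals cosine similarity.
assert abs(dot(normalize(q), normalize(d)) - cosine(q, d)) < 1e-9
```

This is why many vector stores normalize at index time and then use the cheaper dot product at query time.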
Chunking: where the most silent damage shows up
If chunks are too large, you dilute relevance—the vector averages over too many topics and retrieval becomes fuzzy. If chunks are too small, you lose definitions and antecedents (“it”, “the policy”) that only make sense in a wider window.
Overlap is not waste; it is insurance that a sentence sliced in half still appears whole in at least one neighbor chunk. Overlap percentage is a tunable knob alongside chunk size, not a fixed constant from a tutorial.
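The insurance claim is easy to demonstrate with a hypothetical character-based splitter: a sentence cut at one chunk boundary survives intact in an overlapping neighbor.

```python
def split_with_overlap(text: str, size: int, overlap: int) -> list[str]:
    """Character-based splitter with overlapping windows (illustrative only)."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
        start += size - overlap  # step back by `overlap` characters each window
    return chunks

text = "Refunds take five days. Escalate after seven. Contact billing ops."
no_overlap = split_with_overlap(text, size=30, overlap=0)
with_overlap = split_with_overlap(text, size=30, overlap=12)

# Without overlap, "Escalate after seven." is sliced across two chunks;
# with overlap, at least one chunk contains the full sentence.
assert not any("Escalate after seven." in c for c in no_overlap)
assert any("Escalate after seven." in c for c in with_overlap)
```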
Tables deserve a callout. Row-wise splitting can destroy joins across columns; sometimes keeping a whole small table in one chunk is better than perfect token parity. When tables are huge, consider extracting them to a structured store and retrieving them by key while still RAG-ing the narrative docs.
Embeddings and the vector store
The embedding model defines the geometry of your search. A model trained for semantic similarity helps with paraphrases; a weaker or badly matched embedder sends the generator irrelevant paragraphs that look related in keyword space only.
Vector databases and libraries differ in filtering, hybrid search, quantization, and ops maturity. For early work, reproducibility (fixed model revision, recorded chunking parameters) matters more than shaving microseconds off single-query latency. Once traffic grows, batching, caching query embeddings, and separating hot tenants matter.
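Caching query embeddings is one of the cheapest of those wins once the same questions repeat. A sketch using `functools.lru_cache` around a hypothetical `embed_query` function—the toy body below stands in for a real embedding-model call:

```python
from functools import lru_cache

CALLS = {"count": 0}  # tracks how often the "model" is actually invoked

@lru_cache(maxsize=10_000)
def embed_query(q: str) -> tuple[float, ...]:
    """Hypothetical embedder; a real one would call your embedding model.
    Returns a tuple so the cached value is immutable."""
    CALLS["count"] += 1
    # Toy deterministic "embedding": character-code averages over 4 strides.
    return tuple(sum(ord(c) for c in q[i::4]) / max(len(q[i::4]), 1) for i in range(4))

embed_query("what is the refund policy?")
embed_query("what is the refund policy?")  # served from cache, no second call
assert CALLS["count"] == 1
```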
Pitfalls to plan for explicitly
Stale corpora
If the retrieval index lags the truth, the model confidently cites outdated policy. Version your index builds and expose “as of” metadata when users need it.
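One lightweight shape for that metadata is a build manifest written next to the index and surfaced in answers; the field names below are illustrative, not a standard:

```python
from datetime import datetime, timezone

# Hypothetical build manifest stored alongside the index.
build_manifest = {
    "index_version": "docs-v14",
    "embedder": "sentence-transformers/all-MiniLM-L6-v2",
    "built_at": datetime.now(timezone.utc).isoformat(),
}

def format_answer(answer: str, manifest: dict) -> str:
    """Append 'as of' metadata so users can judge staleness themselves."""
    return (
        f"{answer}\n\n(as of index {manifest['index_version']}, "
        f"built {manifest['built_at'][:10]})"
    )

print(format_answer("Refunds take five business days.", build_manifest))
```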
Lost in the middle
Stuffing ten long chunks into the context window does not mean the model weighs them evenly. Keeping k modest, summarizing long threads, or re-ranking (covered in RAG in Production: Re-ranking, HyDE, and Simple Evals) presents fewer, sharper passages.
Evaluation theater
Eyeballing answers in a notebook is not a test harness. Hold out question–answer pairs, measure whether the right chunk would be retrieved, and track regressions when you change embedder or chunking.
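The retrieval half of that harness can start embarrassingly small: for each held-out question, check whether the gold chunk id appears in the top-k results. A pure-Python sketch with a stubbed retriever (swap in your real `collection.query`; all ids are illustrative):

```python
def recall_at_k(eval_set: list[dict], retrieve, k: int = 5) -> float:
    """Fraction of questions whose gold chunk id appears in the top-k results."""
    hits = 0
    for ex in eval_set:
        if ex["gold_chunk_id"] in retrieve(ex["question"], k):
            hits += 1
    return hits / len(eval_set)

def fake_retrieve(question: str, k: int) -> list[str]:
    """Stub standing in for collection.query; returns canned ranked ids."""
    canned = {
        "refund window?": ["chunk-3", "chunk-9"],
        "escalation path?": ["chunk-1", "chunk-7"],
    }
    return canned.get(question, [])[:k]

eval_set = [
    {"question": "refund window?", "gold_chunk_id": "chunk-3"},
    {"question": "escalation path?", "gold_chunk_id": "chunk-2"},  # a miss
]
assert recall_at_k(eval_set, fake_retrieve, k=5) == 0.5
```

Track this number across embedder and chunking changes and regressions stop being anecdotes.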
Tooling map: how teams actually wire the loop
The loop above is runtime-agnostic. In practice you glue ingest → chunk → embed → index → query → generate with one of these families (often mixed):
| Layer | Representative tools | What you trade |
|---|---|---|
| Orchestration | LangChain, LlamaIndex, Haystack, custom Python | LangChain/LlamaIndex ship fast demos and integrations; custom code minimizes magic and keeps dependencies thin. |
| Chunking | Framework splitters, Unstructured for messy PDFs, langchain_text_splitters, hand-rolled Markdown headers | PDFs/HTML reward structure-aware parsers; naive CharacterTextSplitter is fine only for clean text. |
| Embeddings | sentence-transformers, OpenAI/Cohere/Voyage APIs, models on the Hub | APIs = ops simplicity + cost; self-hosted = reproducibility + no egress; pin model revision. |
| Vector index | FAISS (in-process), Chroma, Qdrant, pgvector, managed cloud DBs | FAISS/Chroma for prototypes; pgvector when you already own Postgres and want SQL + metadata filters; Qdrant/Milvus when you outgrow a single node. |
| Lexical / hybrid | BM25 via Elasticsearch/OpenSearch, built-in hybrid in some vector DBs | Pure vectors miss exact SKUs and codes; hybrid fixes “semantic near-miss” failures. |
| Generation | Any chat API or vLLM/TGI self-host (see Serving Distilled Models Behind an HTTP API) | Same RAG stack; swap generator without re-indexing if prompts stay stable. |
Pros and cons in one sentence each:
- Framework-first (LangChain/LlamaIndex) accelerates time-to-demo but can hide tokenizer/chunk bugs until production.
- Vector-in-Postgres reduces moving parts for teams that already run Postgres and need row-level security.
- A separate vector DB scales ANN search and hybrid features but adds another backup and monitoring target.
Minimal reproducible path (self-hosted embed + local index)
Goal: one machine, fixed dependency versions, same chunks every run.
- Environment — Python 3.11+, `pip install chromadb sentence-transformers` (pin versions in `requirements.txt`; record the `pip freeze` hash in your ADR). For header-aware splitting later, add `langchain-text-splitters` or Unstructured and record the extra dependency as part of the index version.
- Corpus — Start with a folder of `.md` or `.txt` files; later swap in Unstructured for PDFs. Record file list + checksums.
- Chunk — Fixed `chunk_size` (tokens or characters), fixed `chunk_overlap`, single splitter implementation. Log the parameters beside the index name.
- Embed — Load one Hub model id with `revision=`; embed documents in batches; persist the Chroma collection name + embed model id.
- Retrieve — For each eval question, `collection.query(query_texts=[q], n_results=k)`; log returned chunk ids + scores.
- Generate — Build one prompt template: system instruction + `---`-delimited passages + user question. Pin the chat model revision the same way as the embedder.
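That one prompt template can be a plain function; the `---` delimiters follow the convention named above, and everything else (the system wording, the layout) is an assumption to adapt:

```python
SYSTEM = (
    "Answer only from the provided passages. "
    "If the passages do not support an answer, say you don't know."
)

def build_prompt(passages: list[str], question: str) -> str:
    """One template everywhere: system rule, ----delimited passages, question."""
    block = "\n---\n".join(passages)
    return f"{SYSTEM}\n\n---\n{block}\n---\n\nQuestion: {question}"

prompt = build_prompt(
    ["Refunds take five business days.", "Escalate after seven days."],
    "How long do refunds take?",
)
assert prompt.count("---") == 3  # opening fence, one separator, closing fence
```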
A compact indexing sketch (adapt names and paths):
```python
# index_build.py — illustrative; pin package versions in production
import chromadb
from chromadb.utils import embedding_functions

CHUNK_SIZE = 800
CHUNK_OVERLAP = 120


def chunk_text(text: str, size: int = CHUNK_SIZE, overlap: int = CHUNK_OVERLAP) -> list[str]:
    """Character-based chunks; swap for structure-aware splitters when format allows."""
    chunks: list[str] = []
    start = 0
    n = len(text)
    while start < n:
        end = min(start + size, n)
        chunks.append(text[start:end])
        if end == n:
            break
        start = max(end - overlap, start + 1)  # overlapping windows; never stall
    return chunks


def load_docs(paths: list[str]) -> list[str]:
    chunks: list[str] = []
    for p in paths:
        with open(p, encoding="utf-8") as f:
            chunks.extend(chunk_text(f.read()))
    return chunks


def main():
    ef = embedding_functions.SentenceTransformerEmbeddingFunction(
        model_name="sentence-transformers/all-MiniLM-L6-v2"
    )
    client = chromadb.PersistentClient(path="./chroma_db")
    col = client.get_or_create_collection("docs", embedding_function=ef)
    chunks = load_docs(["./corpus/a.md", "./corpus/b.md"])
    ids = [f"chunk-{i}" for i in range(len(chunks))]
    col.add(ids=ids, documents=chunks)
    print("indexed", len(chunks), "chunks")


if __name__ == "__main__":
    main()
```
What “reproducible” means here: anyone with the same corpus bytes, splitter parameters, embedder revision, and index path gets the same retrieval ids for the same queries. Change any of those without a version bump and you should expect eval drift.
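One way to make the version bump mechanical is to hash every knob that defines retrieval behavior; any change yields a new index name automatically. A sketch (field names and values are illustrative):

```python
import hashlib
import json

def index_fingerprint(config: dict) -> str:
    """Stable short hash over everything that defines retrieval behavior."""
    canonical = json.dumps(config, sort_keys=True)  # key order must not matter
    return hashlib.sha256(canonical.encode()).hexdigest()[:12]

config = {
    "chunk_size": 800,
    "chunk_overlap": 120,
    "embedder": "sentence-transformers/all-MiniLM-L6-v2",
    "embedder_revision": "main",  # pin a real commit hash in practice
    "corpus_checksums": ["9f2a", "b41c"],  # per-file content hashes (truncated here)
}

index_name = f"docs-{index_fingerprint(config)}"
# Changing any knob changes the name, so a stale index cannot masquerade as current.
assert index_fingerprint({**config, "chunk_size": 512}) != index_fingerprint(config)
```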
When to add hybrid search or metadata first
If users query IDs, error codes, or proper nouns, add BM25 or hybrid before chasing a larger embedder. If you have `tenant_id` or `product_line`, add metadata filters on the vector store before raising k. Both are cheaper than swapping a 7B generator to “fix” bad retrieval.
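One cheap hybrid recipe is reciprocal rank fusion (RRF): merge the BM25 and vector rankings by scoring each id with summed 1/(k + rank). A sketch over two hypothetical ranked id lists:

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score each id by sum of 1/(k + rank) per ranker."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

bm25_top = ["sku-err-42", "chunk-7", "chunk-3"]  # exact-code match wins lexically
vector_top = ["chunk-3", "chunk-7", "chunk-9"]   # semantic neighbors

fused = rrf([bm25_top, vector_top])
# Ids ranked highly by both lists float to the top of the fused ranking.
assert fused[0] in ("chunk-3", "chunk-7")
```

RRF needs no score calibration between the two retrievers, which is why it is a common first hybrid before tuning weighted score blends.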
Conclusion
RAG is systems engineering: ingestion quality, chunk boundaries, and retrieval metrics matter as much as the chat model’s charisma. The distilled LLMs fine-tuned later in this series only shine when the passages they read are the right passages.
RAG in Production: Re-ranking, HyDE, and Simple Evals picks up re-ranking, query reformulation, and lightweight evaluation so naive top-k does not become the ceiling.