Retrieval-Augmented Generation (RAG)

Retrieval-Augmented Generation (RAG) is a pattern where a language model conditions its responses on externally retrieved, query-relevant documents. A retriever selects context from indexes (sparse, dense, or hybrid), which is serialized into the prompt to ground generation, reduce hallucinations, and enable up-to-date, source-cited answers without retraining the model.

What is Retrieval-Augmented Generation (RAG)?

RAG inserts a retrieve step into the standard generate loop. At query time, systems normalize and expand the query (rewrites, synonyms), then fetch candidates via BM25/keyword, vector embeddings, or hybrid retrieval, often followed by cross-encoder reranking. Selected passages are chunked, deduplicated, and formatted with metadata and citations. The LLM consumes this context under a schema that encourages attribution and faithfulness. Advanced designs add query routing, multi-hop retrieval, and fusion strategies (e.g., HyDE, FiD) to aggregate evidence. Indexing pipelines compute embeddings, store text and metadata, and support filters for tenant, time, and security labels.

Why it matters and where it’s used

RAG improves factuality and explainability while keeping models small and costs predictable. It powers enterprise search/Q&A, customer support, developer docs, regulatory policy assistants, biomedical literature review, and analytics copilots that must cite sources and reflect rapid content updates.

Examples

Basic pipeline: query rewrite → hybrid retrieve → rerank → cite and answer.
Multi-hop RAG: retrieve chains of evidence across documents for complex questions.
Structured context: tables/JSON flattened with headers and units for numeric fidelity.
Guarded prompting: templates that separate instructions from untrusted retrieved text.

FAQs

Does RAG eliminate hallucinations? It reduces them but cannot guarantee truth; enforce citation and use answer verification.
How many chunks should I retrieve? Start with 5–10 passages, rerank, and prune aggressively to fit context.
Which embeddings and chunk sizes work best? Domain-tuned embeddings and 200–400 token chunks with overlap are common; evaluate empirically.
How do I prevent prompt overflow? Use aggressive deduplication, summarize long passages, and compress with rerankers.
Is hybrid better than pure dense? Hybrid often improves recall; validate on your corpus.
What about security and safety? Treat retrieved text as untrusted; mitigate prompt injection with isolation, schemas, and allow/deny lists.
How to measure quality? Track faithfulness, answer relevance, citation precision/recall, and end-task success.