RAG Architecture Patterns: From Naive to Production-Grade
Retrieval-Augmented Generation lets LLMs answer questions using your private data — internal documents, codebases, product catalogs — without fine-tuning the model. However, naive RAG (chunk documents, embed, retrieve top-K, generate) produces mediocre results in production. RAG architecture patterns have evolved significantly, and this guide covers the techniques that separate a demo from a production system: advanced chunking, reranking, hybrid search, and systematic evaluation.
Naive RAG: Why the Demo Works but Production Doesn’t
The basic RAG pipeline has four steps: split documents into chunks, create vector embeddings, retrieve similar chunks for a query, and pass them to an LLM for answer generation. This works surprisingly well for simple question-answering demos. Moreover, it takes only an afternoon to build with LangChain or LlamaIndex.
The problem emerges with real-world data. Fixed-size chunks split sentences mid-thought. Semantic search misses keyword-specific queries (“What’s our SLA for enterprise clients?”). The LLM hallucinates when retrieved chunks are tangentially relevant but don’t contain the actual answer. Additionally, there’s no way to measure if the system is actually improving.
```python
# Naive RAG — good for demos, insufficient for production
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Problem 1: Fixed-size chunks break context
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_documents(documents)
# "The SLA for enterprise clients is" | "99.99% with 4-hour response time"
# Query: "What's our enterprise SLA?" — might retrieve the wrong chunk

# Problem 2: Pure semantic search misses exact terms
vectorstore = Chroma.from_documents(chunks, OpenAIEmbeddings())
results = vectorstore.similarity_search("enterprise SLA", k=4)
# Returns chunks about "service agreements" and "uptime guarantees"
# but misses the chunk with the actual SLA number

# Problem 3: No evaluation — how do you know it's working?
# You don't. You demo it, it looks good, you ship it, users complain.
```

Advanced Chunking Strategies
Chunking strategy directly impacts retrieval quality. The goal is to create chunks that are self-contained units of information — each chunk should make sense on its own and contain a complete thought.
```python
# Strategy 1: Semantic chunking — split at natural boundaries
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Splits based on embedding similarity between sentences,
# grouping semantically related sentences together
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=80,  # Split when similarity drops below the 80th percentile
)
semantic_chunks = semantic_splitter.split_documents(documents)

# Strategy 2: Parent-child chunking — small chunks for retrieval, large for context
# Retrieve using small, specific chunks but pass the parent (larger context) to the LLM
class ParentChildRetriever:
    def __init__(self, vectorstore, docstore):
        self.vectorstore = vectorstore  # Contains small child chunks
        self.docstore = docstore        # Contains large parent chunks

    def retrieve(self, query, k=4):
        # Search with small chunks (better precision)
        child_chunks = self.vectorstore.similarity_search(query, k=k)
        # Return parent chunks (better context)
        parent_ids = {chunk.metadata["parent_id"] for chunk in child_chunks}
        return [self.docstore.get(pid) for pid in parent_ids]

# Strategy 3: Document-aware chunking — respect document structure
def chunk_by_structure(document):
    chunks = []
    current_section = {"title": "", "content": ""}
    for line in document.split("\n"):
        if line.startswith("# ") or line.startswith("## "):
            if current_section["content"]:
                chunks.append(current_section.copy())
            current_section = {"title": line.strip("# "), "content": ""}
        else:
            current_section["content"] += line + "\n"
    if current_section["content"]:
        chunks.append(current_section)
    return chunks
```

Hybrid Search and Reranking
Pure vector search fails on keyword-specific queries because embeddings capture semantic meaning, not exact terms. Hybrid search combines vector search (semantic understanding) with keyword search (BM25, exact matching) to handle both types of queries. Furthermore, reranking uses a cross-encoder model to re-score retrieved results, dramatically improving precision.
```python
# Hybrid search: combine vector + keyword search
from rank_bm25 import BM25Okapi
import numpy as np

class HybridSearcher:
    def __init__(self, vectorstore, documents):
        self.vectorstore = vectorstore
        # BM25 keyword index
        tokenized = [doc.split() for doc in documents]
        self.bm25 = BM25Okapi(tokenized)
        self.documents = documents

    def search(self, query, k=10, alpha=0.5):
        # Vector search (semantic)
        vector_results = self.vectorstore.similarity_search_with_score(query, k=k)
        # BM25 search (keyword)
        bm25_scores = self.bm25.get_scores(query.split())
        bm25_top = np.argsort(bm25_scores)[-k:][::-1]
        # Combine scores with weighted Reciprocal Rank Fusion (RRF constant 60)
        combined = {}
        for rank, (doc, score) in enumerate(vector_results):
            combined[doc.page_content] = alpha * (1 / (rank + 60))
        for rank, idx in enumerate(bm25_top):
            doc = self.documents[idx]
            combined[doc] = combined.get(doc, 0) + (1 - alpha) * (1 / (rank + 60))
        # Sort by combined score
        return sorted(combined.items(), key=lambda x: x[1], reverse=True)[:k]

# Reranking: re-score with a cross-encoder for better precision
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, documents, top_k=5):
    pairs = [(query, doc) for doc in documents]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(documents, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k]]
```

RAG Evaluation: Measuring Quality Systematically
Without evaluation, you’re guessing whether your RAG system is improving. Build an evaluation dataset with questions, expected answers, and the source documents that contain the answers. Then measure three dimensions: retrieval quality (did we find the right chunks?), generation quality (is the answer correct?), and faithfulness (is the answer grounded in the retrieved context?).
```python
# RAG evaluation framework
class RAGEvaluator:
    def __init__(self, rag_pipeline, eval_dataset):
        self.pipeline = rag_pipeline
        self.dataset = eval_dataset  # [{question, expected_answer, source_doc_ids}]

    def evaluate(self):
        results = {"retrieval_recall": [], "answer_correctness": [], "faithfulness": []}
        for item in self.dataset:
            # Run the RAG pipeline
            retrieved_docs, answer = self.pipeline.query(item["question"])
            retrieved_ids = [d.metadata["doc_id"] for d in retrieved_docs]
            # Metric 1: Retrieval recall — did we find the right documents?
            expected_ids = set(item["source_doc_ids"])
            found = len(expected_ids.intersection(retrieved_ids))
            results["retrieval_recall"].append(found / len(expected_ids) if expected_ids else 0)
            # Metric 2: Answer correctness — is the answer right?
            results["answer_correctness"].append(
                self.judge_correctness(answer, item["expected_answer"]))
            # Metric 3: Faithfulness — is the answer grounded in context?
            results["faithfulness"].append(
                self.judge_faithfulness(answer, retrieved_docs))
        return {k: sum(v) / len(v) for k, v in results.items()}

    def judge_correctness(self, answer, expected):
        # Placeholder heuristic: token overlap with the expected answer.
        # In production, replace with an LLM-as-judge prompt.
        expected_tokens = set(expected.lower().split())
        overlap = expected_tokens & set(answer.lower().split())
        return len(overlap) / len(expected_tokens) if expected_tokens else 0.0

    def judge_faithfulness(self, answer, docs):
        # Placeholder heuristic: fraction of answer tokens found in the context.
        # In production, replace with an LLM-as-judge or NLI-based check.
        context = " ".join(d.page_content for d in docs).lower()
        tokens = answer.lower().split()
        return sum(t in context for t in tokens) / len(tokens) if tokens else 0.0
```

Production RAG: Putting It All Together
A production RAG pipeline combines these techniques: semantic chunking with parent-child retrieval, hybrid search (vector + BM25), cross-encoder reranking, query expansion (rephrase the user’s query for better retrieval), and guardrails (detect when the system doesn’t have enough information to answer). Additionally, implement caching for repeated queries and monitoring for retrieval quality in production.
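The two pieces above not yet shown in code, query expansion and a "don't know" guardrail, can be sketched as follows. This is a minimal illustration under stated assumptions, not a fixed API: `llm` is assumed to be any callable mapping a prompt string to a completion string, `retriever` any callable returning `(text, relevance_score)` pairs, and the prompt wording and `min_score` threshold are placeholders to tune against your evaluation set.

```python
# Sketch: query expansion + retrieval-confidence guardrail.
# `llm` and `retriever` are assumed callables, not a specific library API.

def expand_query(query: str, llm) -> list[str]:
    """Ask the LLM for paraphrases; search with the original plus all variants."""
    prompt = (
        "Rewrite the following search query three different ways, "
        f"one per line, preserving its meaning:\n{query}"
    )
    variants = [line.strip() for line in llm(prompt).splitlines() if line.strip()]
    return [query] + variants

def answer_with_guardrail(query: str, retriever, llm, min_score: float = 0.3) -> str:
    """Refuse to answer when retrieval confidence is too low to ground a response."""
    docs_with_scores = retriever(query)  # [(doc_text, relevance_score), ...]
    if not docs_with_scores or max(s for _, s in docs_with_scores) < min_score:
        return "I don't have enough information in the knowledge base to answer that."
    context = "\n\n".join(doc for doc, _ in docs_with_scores)
    return llm(f"Answer using only this context:\n{context}\n\nQuestion: {query}")
```

Each expanded variant can be sent through the hybrid searcher and the results fused before reranking; the guardrail then runs on the reranked scores, which is where hallucination on tangentially relevant chunks is cheapest to stop.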
In conclusion, production RAG requires moving beyond naive chunk-and-retrieve. Semantic chunking, hybrid search, reranking, and systematic evaluation transform a demo into a reliable system. Start with evaluation first — you can’t improve what you can’t measure. Then iterate on chunking, retrieval, and generation until your metrics meet your quality bar.