Learn how to build production-grade RAG pipelines — chunking strategies, embeddings, vector databases, retrieval, reranking, and evaluation.
RAG (Retrieval-Augmented Generation) solves one of LLMs' biggest limitations: they don't know about your private data or recent events. RAG lets you give the model relevant context at query time — without expensive fine-tuning.
╔══════════════════════════════════════════════════════════════╗
║  INDEXING PIPELINE (runs once / on updates)                  ║
║                                                              ║
║  Documents → Chunk → Embed → Store in Vector DB              ║
║  (PDF, MD, HTML) (512 tokens) (text-embedding-3-small)       ║
╚══════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════╗
║  QUERY PIPELINE (runs on every user question)                ║
║                                                              ║
║  User Question                                               ║
║        │                                                     ║
║        ▼                                                     ║
║  Embed query ──► Vector Search ──► Top-K chunks              ║
║                                         │                    ║
║                                 Reranker (optional)          ║
║                                         │                    ║
║                                   LLM + context              ║
║                                         │                    ║
║                                  Answer + sources            ║
╚══════════════════════════════════════════════════════════════╝
## Step 1: Chunking
Chunking strategy has the largest impact on RAG quality: chunks that are too large bury the relevant passage in noise, while chunks that are too small lose the context needed to interpret it.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# RecursiveCharacterTextSplitter tries each separator in order:
# paragraphs first, then lines, then words, then single characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # measured in characters by default; use
                       # .from_tiktoken_encoder() to count tokens instead
    chunk_overlap=64,  # overlap prevents context loss at boundaries
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(document_text)

# Always store metadata with each chunk!
documents = [
    {
        "content": chunk,
        "metadata": {
            "source": "docs/api-reference.md",
            "page": 3,
            "section": "Authentication",
            "chunk_index": i,
        },
    }
    for i, chunk in enumerate(chunks)
]
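The overlap mechanics are easy to see in a dependency-free sketch. This `chunk_fixed` helper is hypothetical (not a LangChain API) and implements the naive fixed-size strategy from the table below:

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Naive fixed-size chunking with character overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1100))  # stand-in document
chunks = chunk_fixed(doc)
# The last 64 characters of each chunk reappear at the start of the next,
# so a sentence cut at a boundary still appears whole in one of the chunks.
```

Each new chunk starts `size - overlap` characters after the previous one, which is exactly why boundary context is never lost.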
| Strategy | Best for | Tradeoff |
|---|---|---|
| Fixed-size | Simple implementation | Breaks mid-sentence |
| Recursive character | General text | Good default choice |
| Semantic (sentence) | Q&A over articles | Higher quality, slower |
| Document structure | Markdown, HTML | Respects headers/sections |
| Parent-child | Dense docs | Better recall, more complex |
## Step 2: Embeddings
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts. Always batch — API calls are expensive."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small",    # 1536 dims, cheap
        # model="text-embedding-3-large",  # 3072 dims, better for complex docs
    )
    return [item.embedding for item in response.data]

# Embed all chunks in batches of 100
BATCH_SIZE = 100
all_embeddings = []
for i in range(0, len(documents), BATCH_SIZE):
    batch = [doc["content"] for doc in documents[i:i + BATCH_SIZE]]
    all_embeddings.extend(embed(batch))
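Retrieval later ranks chunks by how close their vectors are to the query vector; the standard metric is cosine similarity. A stdlib-only illustration (the `cosine_similarity` helper is for exposition; in practice the vector database computes this):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal ones score 0.0.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # 1.0
ortho = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # 0.0
```

OpenAI embeddings are normalized to unit length, so cosine similarity reduces to a plain dot product.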
Comparison of embedding models (2025):
| Model | Dimensions | Use case |
|---|---|---|
| text-embedding-3-small | 1536 | Good default, low cost |
| text-embedding-3-large | 3072 | Complex technical docs |
| Cohere embed-v3 | 1024 | Multilingual support |
| BGE-M3 | 1024 | Open-source, self-hosted |
## Step 3: Vector Storage
# Using pgvector (PostgreSQL extension) — great if you already use Postgres
# pip install psycopg2 pgvector
import json

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(DATABASE_URL)

# The vector extension must exist before register_vector() can find the type
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
register_vector(conn)

# Create table with a vector column
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id BIGSERIAL PRIMARY KEY,
            content TEXT,
            embedding VECTOR(1536),
            metadata JSONB
        );
        CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents USING hnsw (embedding vector_cosine_ops);
    """)

# Insert chunks with embeddings
with conn.cursor() as cur:
    for doc, embedding in zip(documents, all_embeddings):
        cur.execute(
            "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
            (doc["content"], embedding, json.dumps(doc["metadata"])),
        )
conn.commit()
Popular vector database options:
| Database | Hosted | Best for |
|---|---|---|
| pgvector | Self / Supabase | Already using Postgres |
| Pinecone | Yes | Managed, serverless |
| Weaviate | Self / Cloud | Rich filtering + hybrid search |
| Qdrant | Self / Cloud | High performance, Rust-based |
| Chroma | Self | Local dev and prototyping |
## Step 4: Retrieval
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Embed query and find the most similar chunks."""
    query_embedding = embed([query])[0]
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, top_k))
        return [
            {"content": row[0], "metadata": row[1], "score": row[2]}
            for row in cur.fetchall()
        ]
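Conceptually, the ORDER BY `<=>` query above performs nearest-neighbor search. This pure-Python sketch shows the exact computation that an HNSW index approximates in sub-linear time (toy 2-d vectors and hypothetical names; real embeddings have 1536 dimensions):

```python
import math

def brute_force_top_k(query_vec, index, k=5):
    """Exact nearest-neighbor search: score every chunk, sort, take k."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    scored = [(content, cos(query_vec, vec)) for content, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = [
    ("auth docs",    [0.9, 0.1]),
    ("billing docs", [0.1, 0.9]),
    ("api docs",     [0.7, 0.3]),
]
results = brute_force_top_k([1.0, 0.0], index, k=2)
# "auth docs" scores highest, "api docs" second
```

Exact search scales linearly with collection size, which is why production stores switch to approximate indexes like HNSW or IVF.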
## Step 5: Augmented Generation
def answer(question: str) -> str:
    chunks = retrieve(question, top_k=5)
    context = "\n\n---\n\n".join(
        f"Source: {c['metadata']['source']}\n{c['content']}"
        for c in chunks
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question "
                    "using ONLY the provided context. If the answer is not in "
                    "the context, say so. Always cite the source document for "
                    "your claims."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
## Hybrid Search
# Pure semantic search misses exact keyword matches (IDs, error codes, names).
# Hybrid search blends a keyword score into the ranking. Here pg_trgm's
# trigram similarity() stands in for the keyword signal; true BM25 ranking
# needs Postgres full-text search or an external search engine.
cur.execute("""
    SELECT content, metadata,
           (1 - (embedding <=> %s::vector)) * 0.7  -- semantic weight
           + similarity(content, %s) * 0.3         -- keyword weight
           AS hybrid_score
    FROM documents
    ORDER BY hybrid_score DESC
    LIMIT 5
""", (query_embedding, query))
## Reranking

# Retrieve 20 candidates, then rerank down to the top 5 with a cross-encoder.
# Cross-encoders score (query, document) pairs jointly, which is more accurate
# than embedding similarity but too slow to run over the full collection.
import cohere

co = cohere.Client(COHERE_API_KEY)

candidates = retrieve(query, top_k=20)
reranked = co.rerank(
    query=query,
    documents=[c["content"] for c in candidates],
    model="rerank-english-v3.0",
    top_n=5,
)
top_chunks = [candidates[r.index] for r in reranked.results]
## Evaluation
The four key RAG metrics (RAGAS framework):
| Metric | Measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevancy | Does the answer address the question? |
| Context Precision | Are the retrieved chunks relevant to the query? |
| Context Recall | Are all relevant chunks being retrieved? |
Use the ragas Python library or LangSmith to evaluate your pipeline before shipping to production.
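When you have ground-truth labels for which chunks are relevant, context precision and recall reduce to ordinary precision/recall over the retrieved set; RAGAS estimates these judgments with an LLM instead. A toy stdlib-only version (hypothetical chunk IDs):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of all relevant chunks that were retrieved."""
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4", "c5"]  # top-5 chunk IDs from retrieval
relevant = {"c1", "c3", "c9"}               # ground-truth relevant chunks

precision = context_precision(retrieved, relevant)  # 2/5 = 0.4
recall = context_recall(retrieved, relevant)        # 2/3
```

Low precision means the LLM gets noisy context; low recall means the answer is missing evidence. Tune chunking and top_k against both, not just one.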