Learn how to build production-grade RAG pipelines — chunking strategies, embeddings, vector databases, retrieval, reranking, and evaluation.
RAG (Retrieval-Augmented Generation) solves one of LLMs' biggest limitations: they don't know about your private data or recent events. RAG lets you give the model relevant context at query time — without expensive fine-tuning.
╔══════════════════════════════════════════════════════════════╗
║  INDEXING PIPELINE (runs once / on updates)                  ║
║                                                              ║
║  Documents → Chunk → Embed → Store in Vector DB              ║
║  (PDF, MD, HTML) (512 tokens) (text-embedding-3-small)       ║
╚══════════════════════════════════════════════════════════════╝
╔══════════════════════════════════════════════════════════════╗
║  QUERY PIPELINE (runs on every user question)                ║
║                                                              ║
║  User Question                                               ║
║        │                                                     ║
║        ▼                                                     ║
║  Embed query ──► Vector Search ──► Top-K chunks              ║
║                                         │                    ║
║                                 Reranker (optional)          ║
║                                         │                    ║
║                                   LLM + context              ║
║                                         │                    ║
║                                  Answer + sources            ║
╚══════════════════════════════════════════════════════════════╝
## Step 1: Chunking
Chunking strategy has the largest impact on RAG quality: chunks that are too large bury the relevant passage in noise, while chunks that are too small lose the context needed to interpret it.
from langchain.text_splitter import RecursiveCharacterTextSplitter

# RecursiveCharacterTextSplitter tries each separator in order:
# paragraphs first, then lines, then words, then single characters.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,    # measured in characters by default; use
                       # .from_tiktoken_encoder() to count tokens instead
    chunk_overlap=64,  # overlap prevents context loss at boundaries
    separators=["\n\n", "\n", " ", ""],
)
chunks = splitter.split_text(document_text)

# Always store metadata with each chunk!
documents = [
    {
        "content": chunk,
        "metadata": {
            "source": "docs/api-reference.md",
            "page": 3,
            "section": "Authentication",
            "chunk_index": i,
        },
    }
    for i, chunk in enumerate(chunks)
]
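The overlap mechanics are easy to see in a dependency-free sketch. This `chunk_fixed` helper is hypothetical (not a LangChain API) and implements the naive fixed-size strategy from the table below:

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Naive fixed-size chunking with character overlap."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, len(text), step)]

doc = "".join(str(i % 10) for i in range(1100))  # stand-in document
chunks = chunk_fixed(doc)
# The last 64 characters of each chunk reappear at the start of the next,
# so a sentence cut at a boundary still appears whole in one of the chunks.
```

Each new chunk starts `size - overlap` characters after the previous one, which is exactly why boundary context is never lost.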
| Strategy | Best for | Tradeoff |
|---|---|---|
| Fixed-size | Simple implementation | Breaks mid-sentence |
| Recursive character | General text | Good default choice |
| Semantic (sentence) | Q&A over articles | Higher quality, slower |
| Document structure | Markdown, HTML | Respects headers/sections |
| Parent-child | Dense docs | Better recall, more complex |
## Step 2: Embeddings
from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts. Always batch — API calls are expensive."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small",    # 1536 dims, cheap
        # model="text-embedding-3-large",  # 3072 dims, better for complex docs
    )
    return [item.embedding for item in response.data]

# Embed all chunks in batches of 100
BATCH_SIZE = 100
all_embeddings = []
for i in range(0, len(documents), BATCH_SIZE):
    batch = [doc["content"] for doc in documents[i:i + BATCH_SIZE]]
    all_embeddings.extend(embed(batch))
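Retrieval later ranks chunks by how close their vectors are to the query vector; the standard metric is cosine similarity. A stdlib-only illustration (the `cosine_similarity` helper is for exposition; in practice the vector database computes this):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """cos(theta) between two vectors: dot product over the product of norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Vectors pointing the same way score 1.0; orthogonal ones score 0.0.
same = cosine_similarity([1.0, 0.0], [2.0, 0.0])   # 1.0
ortho = cosine_similarity([1.0, 0.0], [0.0, 3.0])  # 0.0
```

OpenAI embeddings are normalized to unit length, so cosine similarity reduces to a plain dot product.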
Comparison of embedding models (2025):
| Model | Dimensions | Use case |
|---|---|---|
| text-embedding-3-small | 1536 | Good default, low cost |
| text-embedding-3-large | 3072 | Complex technical docs |
| Cohere embed-v3 | 1024 | Multilingual support |
| BGE-M3 | 1024 | Open-source, self-hosted |
## Step 3: Vector Storage
# Using pgvector (PostgreSQL extension) — great if you already use Postgres
# pip install psycopg2 pgvector
import json

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(DATABASE_URL)

# The vector extension must exist before register_vector() can find the type
with conn.cursor() as cur:
    cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
conn.commit()
register_vector(conn)

# Create table with a vector column
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id BIGSERIAL PRIMARY KEY,
            content TEXT,
            embedding VECTOR(1536),
            metadata JSONB
        );
        CREATE INDEX IF NOT EXISTS documents_embedding_idx
            ON documents USING hnsw (embedding vector_cosine_ops);
    """)

# Insert chunks with embeddings
with conn.cursor() as cur:
    for doc, embedding in zip(documents, all_embeddings):
        cur.execute(
            "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s, %s)",
            (doc["content"], embedding, json.dumps(doc["metadata"])),
        )
conn.commit()
Popular vector database options:
| Database | Hosted | Best for |
|---|---|---|
| pgvector | Self / Supabase | Already using Postgres |
| Pinecone | Yes | Managed, serverless |
| Weaviate | Self / Cloud | Rich filtering + hybrid search |
| Qdrant | Self / Cloud | High performance, Rust-based |
| Chroma | Self | Local dev and prototyping |
## Step 4: Retrieval
def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Embed query and find the most similar chunks."""
    query_embedding = embed([query])[0]
    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, top_k))
        return [
            {"content": row[0], "metadata": row[1], "score": row[2]}
            for row in cur.fetchall()
        ]
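Conceptually, the ORDER BY `<=>` query above performs nearest-neighbor search. This pure-Python sketch shows the exact computation that an HNSW index approximates in sub-linear time (toy 2-d vectors and hypothetical names; real embeddings have 1536 dimensions):

```python
import math

def brute_force_top_k(query_vec, index, k=5):
    """Exact nearest-neighbor search: score every chunk, sort, take k."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.sqrt(sum(x * x for x in a)) *
                      math.sqrt(sum(y * y for y in b)))
    scored = [(content, cos(query_vec, vec)) for content, vec in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

index = [
    ("auth docs",    [0.9, 0.1]),
    ("billing docs", [0.1, 0.9]),
    ("api docs",     [0.7, 0.3]),
]
results = brute_force_top_k([1.0, 0.0], index, k=2)
# "auth docs" scores highest, "api docs" second
```

Exact search scales linearly with collection size, which is why production stores switch to approximate indexes like HNSW or IVF.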
## Step 5: Augmented Generation
def answer(question: str) -> str:
    chunks = retrieve(question, top_k=5)
    context = "\n\n---\n\n".join(
        f"Source: {c['metadata']['source']}\n{c['content']}"
        for c in chunks
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": (
                    "You are a helpful assistant. Answer the user's question "
                    "using ONLY the provided context. If the answer is not in "
                    "the context, say so. Always cite the source document for "
                    "your claims."
                ),
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}",
            },
        ],
    )
    return response.choices[0].message.content
## Hybrid Search
# Pure semantic search misses exact keyword matches (IDs, error codes, names).
# Hybrid search blends a keyword score into the ranking. Here pg_trgm's
# trigram similarity() stands in for the keyword signal; true BM25 ranking
# needs Postgres full-text search or an external search engine.
cur.execute("""
    SELECT content, metadata,
           (1 - (embedding <=> %s::vector)) * 0.7  -- semantic weight
           + similarity(content, %s) * 0.3         -- keyword weight
           AS hybrid_score
    FROM documents
    ORDER BY hybrid_score DESC
    LIMIT 5
""", (query_embedding, query))
## Reranking

# Retrieve 20 candidates, then rerank down to the top 5 with a cross-encoder.
# Cross-encoders score (query, document) pairs jointly, which is more accurate
# than embedding similarity but too slow to run over the full collection.
import cohere

co = cohere.Client(COHERE_API_KEY)

candidates = retrieve(query, top_k=20)
reranked = co.rerank(
    query=query,
    documents=[c["content"] for c in candidates],
    model="rerank-english-v3.0",
    top_n=5,
)
top_chunks = [candidates[r.index] for r in reranked.results]
## Evaluation
The four key RAG metrics (RAGAS framework):
| Metric | Measures |
|---|---|
| Faithfulness | Is the answer grounded in the retrieved context? |
| Answer Relevancy | Does the answer address the question? |
| Context Precision | Are the retrieved chunks relevant to the query? |
| Context Recall | Are all relevant chunks being retrieved? |
Use the ragas Python library or LangSmith to evaluate your pipeline before shipping to production.
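When you have ground-truth labels for which chunks are relevant, context precision and recall reduce to ordinary precision/recall over the retrieved set; RAGAS estimates these judgments with an LLM instead. A toy stdlib-only version (hypothetical chunk IDs):

```python
def context_precision(retrieved: list[str], relevant: set[str]) -> float:
    """Share of retrieved chunks that are actually relevant."""
    return sum(1 for c in retrieved if c in relevant) / len(retrieved)

def context_recall(retrieved: list[str], relevant: set[str]) -> float:
    """Share of all relevant chunks that were retrieved."""
    return sum(1 for c in relevant if c in retrieved) / len(relevant)

retrieved = ["c1", "c2", "c3", "c4", "c5"]  # top-5 chunk IDs from retrieval
relevant = {"c1", "c3", "c9"}               # ground-truth relevant chunks

precision = context_precision(retrieved, relevant)  # 2/5 = 0.4
recall = context_recall(retrieved, relevant)        # 2/3
```

Low precision means the LLM gets noisy context; low recall means the answer is missing evidence. Tune chunking and top_k against both, not just one.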