
Table of Contents
  • How RAG Works
  • Step 1: Document Ingestion and Chunking
  • Chunking Strategies
  • Step 2: Embeddings
  • Step 3: Vector Database
  • Step 4: Retrieval
  • Step 5: Augmented Generation
  • Improving Retrieval Quality
  • Hybrid Search (Keyword + Semantic)
  • Reranking
  • Evaluating RAG Quality
  • Common RAG Mistakes
tutorials · #ai · #rag · #llm

Building RAG Systems: Retrieval-Augmented Generation for Developers

Learn how to build production-grade RAG pipelines — chunking strategies, embeddings, vector databases, retrieval, reranking, and evaluation.

Trong Ngo
February 25, 2026
6 min read

RAG (Retrieval-Augmented Generation) solves one of LLMs' biggest limitations: they don't know about your private data or recent events. RAG lets you give the model relevant context at query time — without expensive fine-tuning.

[Figure: RAG pipeline diagram — documents flow through indexing into a vector store, then are retrieved at query time to augment the LLM prompt]

How RAG Works

╔══════════════════════════════════════════════════════════════╗
║  INDEXING PIPELINE (runs once / on updates)                  ║
║                                                              ║
║  Documents → Chunk → Embed → Store in Vector DB             ║
║  (PDF, MD, HTML)   (512 tokens)  (text-embedding-3-small)   ║
╚══════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════╗
║  QUERY PIPELINE (runs on every user question)                ║
║                                                              ║
║  User Question                                               ║
║       │                                                      ║
║       ▼                                                      ║
║  Embed query ──► Vector Search ──► Top-K chunks             ║
║                                          │                   ║
║                                    Reranker (optional)       ║
║                                          │                   ║
║                                    LLM + context             ║
║                                          │                   ║
║                                    Answer + sources          ║
╚══════════════════════════════════════════════════════════════╝

Step 1: Document Ingestion and Chunking

Chunking strategy often has the single largest impact on RAG quality: chunks that are too large dilute relevance, while chunks that are too small lose context.

from langchain.text_splitter import RecursiveCharacterTextSplitter

# RecursiveCharacterTextSplitter tries to split on paragraph → sentence → word
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,        # measured in characters by default; pass a token-based length_function to count tokens
    chunk_overlap=64,      # overlap prevents context loss at boundaries
    separators=["\n\n", "\n", " ", ""],
)

chunks = splitter.split_text(document_text)

# Always store metadata with each chunk!
documents = [
    {
        "content": chunk,
        "metadata": {
            "source": "docs/api-reference.md",
            "page": 3,
            "section": "Authentication",
            "chunk_index": i,
        }
    }
    for i, chunk in enumerate(chunks)
]
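To see why overlap matters, here is a minimal fixed-size chunker (a hand-rolled sketch, not how LangChain implements it): each chunk repeats the last `overlap` characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk.

```python
def chunk_fixed(text: str, size: int = 512, overlap: int = 64) -> list[str]:
    """Split text into fixed-size character chunks, each overlapping the previous."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With `size=40, overlap=10`, chunk N starts 30 characters after chunk N-1, so the first 10 characters of each chunk duplicate the tail of the one before it.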

Chunking Strategies

┌─────────────────────┬───────────────────────┬─────────────────────────────┐
│ Strategy            │ Best for              │ Tradeoff                    │
├─────────────────────┼───────────────────────┼─────────────────────────────┤
│ Fixed-size          │ Simple implementation │ Breaks mid-sentence         │
│ Recursive character │ General text          │ Good default choice         │
│ Semantic (sentence) │ Q&A over articles     │ Higher quality, slower      │
│ Document structure  │ Markdown, HTML        │ Respects headers/sections   │
│ Parent-child        │ Dense docs            │ Better recall, more complex │
└─────────────────────┴───────────────────────┴─────────────────────────────┘
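The document-structure strategy can be sketched in a few lines: split on Markdown headers and carry the section title along as metadata. (This is a toy version; LangChain's MarkdownHeaderTextSplitter handles nesting and edge cases properly.)

```python
import re

def split_by_headers(markdown: str) -> list[dict]:
    """Split a Markdown document into one chunk per header section."""
    chunks: list[dict] = []
    current = {"section": "(intro)", "content": []}
    for line in markdown.splitlines():
        m = re.match(r"^(#{1,6})\s+(.*)", line)
        if m:
            # Flush the previous section before starting a new one
            if current["content"]:
                chunks.append({"section": current["section"],
                               "content": "\n".join(current["content"]).strip()})
            current = {"section": m.group(2), "content": []}
        else:
            current["content"].append(line)
    if current["content"]:
        chunks.append({"section": current["section"],
                       "content": "\n".join(current["content"]).strip()})
    return chunks
```

Each chunk now knows which section it came from, which feeds directly into the metadata pattern shown above.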

Step 2: Embeddings

from openai import OpenAI

client = OpenAI()

def embed(texts: list[str]) -> list[list[float]]:
    """Embed a batch of texts. Always batch — API calls are expensive."""
    response = client.embeddings.create(
        input=texts,
        model="text-embedding-3-small",  # 1536 dims, cheap
        # model="text-embedding-3-large" # 3072 dims, better for complex docs
    )
    return [item.embedding for item in response.data]

# Embed all chunks in batches of 100
BATCH_SIZE = 100
all_embeddings = []
for i in range(0, len(documents), BATCH_SIZE):
    batch = [doc["content"] for doc in documents[i:i+BATCH_SIZE]]
    all_embeddings.extend(embed(batch))

Embedding models comparison (2025):

┌────────────────────────┬────────────┬──────────────────────────┐
│ Model                  │ Dimensions │ Use case                 │
├────────────────────────┼────────────┼──────────────────────────┤
│ text-embedding-3-small │ 1536       │ Good default, low cost   │
│ text-embedding-3-large │ 3072       │ Complex technical docs   │
│ Cohere embed-v3        │ 1024       │ Multilingual support     │
│ BGE-M3                 │ 1024       │ Open-source, self-hosted │
└────────────────────────┴────────────┴──────────────────────────┘
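Embeddings are compared with cosine similarity, which is also the metric behind pgvector's `<=>` cosine-distance operator used below (distance = 1 − similarity). The math in one small function:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|). Range [-1, 1]; higher = more similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```

Identical directions score 1.0, orthogonal vectors score 0.0; real embedding pairs land somewhere in between.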

Step 3: Vector Database

# Using pgvector (PostgreSQL extension) — great if you already use Postgres
# pip install psycopg2-binary pgvector

import json

import psycopg2
from pgvector.psycopg2 import register_vector

conn = psycopg2.connect(DATABASE_URL)  # connection string, e.g. from an env var
register_vector(conn)

# Create table with vector column
with conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS documents (
            id BIGSERIAL PRIMARY KEY,
            content TEXT,
            embedding VECTOR(1536),
            metadata JSONB
        );
        CREATE INDEX ON documents USING hnsw (embedding vector_cosine_ops);
    """)

# Insert chunks with embeddings
with conn.cursor() as cur:
    for doc, embedding in zip(documents, all_embeddings):
        cur.execute(
            "INSERT INTO documents (content, embedding, metadata) VALUES (%s, %s::vector, %s)",
            (doc["content"], embedding, json.dumps(doc["metadata"]))
        )
conn.commit()

Popular vector database options:

┌──────────┬─────────────────┬────────────────────────────────┐
│ Database │ Hosted          │ Best for                       │
├──────────┼─────────────────┼────────────────────────────────┤
│ pgvector │ Self / Supabase │ Already using Postgres         │
│ Pinecone │ Yes             │ Managed, serverless            │
│ Weaviate │ Self / Cloud    │ Rich filtering + hybrid search │
│ Qdrant   │ Self / Cloud    │ High performance, Rust-based   │
│ Chroma   │ Self            │ Local dev and prototyping      │
└──────────┴─────────────────┴────────────────────────────────┘

Step 4: Retrieval

def retrieve(query: str, top_k: int = 5) -> list[dict]:
    """Embed query and find the most similar chunks."""
    query_embedding = embed([query])[0]

    with conn.cursor() as cur:
        cur.execute("""
            SELECT content, metadata, 1 - (embedding <=> %s::vector) AS similarity
            FROM documents
            ORDER BY embedding <=> %s::vector
            LIMIT %s
        """, (query_embedding, query_embedding, top_k))

        return [
            {"content": row[0], "metadata": row[1], "score": row[2]}
            for row in cur.fetchall()
        ]


Step 5: Augmented Generation

def answer(question: str) -> str:
    chunks = retrieve(question, top_k=5)

    context = "\n\n---\n\n".join(
        f"Source: {c['metadata']['source']}\n{c['content']}"
        for c in chunks
    )

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {
                "role": "system",
                "content": """You are a helpful assistant. Answer the user's question
using ONLY the provided context. If the answer is not in the context, say so.
Always cite the source document for your claims."""
            },
            {
                "role": "user",
                "content": f"Context:\n{context}\n\nQuestion: {question}"
            }
        ]
    )
    return response.choices[0].message.content

Improving Retrieval Quality

Hybrid Search (Keyword + Semantic)

# Pure semantic search misses exact keyword matches.
# Hybrid combines BM25 (keyword) + vector (semantic) scores.

# With pgvector + pg_trgm (requires CREATE EXTENSION pg_trgm):
cur.execute("""
    SELECT content, metadata,
           (1 - (embedding <=> %s::vector)) * 0.7   -- semantic weight
         + (similarity(content, %s)) * 0.3           -- keyword weight
           AS hybrid_score
    FROM documents
    ORDER BY hybrid_score DESC
    LIMIT 5
""", (query_embedding, query))
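The 0.7/0.3 weighted sum requires the two score scales to be comparable, which takes tuning. Reciprocal Rank Fusion (RRF) is a common alternative that combines ranked result lists without any score calibration; a minimal sketch:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d)).

    k=60 is the conventional constant; it damps the influence of top ranks.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Feed it the IDs from the vector search and the keyword search; documents that rank well in both lists rise to the top.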

Reranking

# Retrieve 20 candidates, rerank to top 5 using a cross-encoder
# Cross-encoders are more accurate but too slow for full collection search

import cohere
co = cohere.Client(COHERE_API_KEY)

candidates = retrieve(query, top_k=20)

reranked = co.rerank(
    query=query,
    documents=[c["content"] for c in candidates],
    model="rerank-english-v3.0",
    top_n=5,
)

top_chunks = [candidates[r.index] for r in reranked.results]

Evaluating RAG Quality

The four key RAG metrics (RAGAS framework):

┌─────────────────────┬────────────────────────────────────────────────┐
│ Metric              │ Measures                                        │
├─────────────────────┼────────────────────────────────────────────────┤
│ Faithfulness        │ Is the answer grounded in retrieved context?    │
│ Answer Relevancy    │ Does the answer address the question?           │
│ Context Precision   │ Are the retrieved chunks relevant to the query? │
│ Context Recall      │ Are all relevant chunks being retrieved?        │
└─────────────────────┴────────────────────────────────────────────────┘

Use ragas Python library or LangSmith to evaluate your pipeline before shipping to production.
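RAGAS computes these metrics with an LLM judge. As a rough illustration of what faithfulness measures, here is a crude lexical-overlap heuristic (a toy stand-in, not the RAGAS implementation): the fraction of answer sentences whose words mostly appear in the retrieved context.

```python
def rough_faithfulness(answer: str, context: str) -> float:
    """Fraction of answer sentences with majority word overlap against the context."""
    context_words = set(context.lower().split())
    sentences = [s.strip() for s in answer.split(".") if s.strip()]
    if not sentences:
        return 0.0
    grounded = 0
    for sentence in sentences:
        words = sentence.lower().split()
        overlap = sum(1 for w in words if w in context_words)
        if words and overlap / len(words) >= 0.5:
            grounded += 1
    return grounded / len(sentences)
```

A real judge model catches paraphrases and contradictions that word overlap misses, but the shape of the metric is the same: grounded claims divided by total claims.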

Common RAG Mistakes

  • Chunk size too large: the relevant sentence gets buried in a 2000-token block, diluting the retrieval score
  • No metadata filtering: retrieving chunks from the wrong document version
  • Missing overlap: context is lost at chunk boundaries
  • Query/document phrasing mismatch: the question "What is X?" and the answer "X is ..." are worded differently, so their embeddings match only weakly. HyDE (hypothetical document embeddings) helps: embed a generated hypothetical answer instead of the raw question.
  • Not citing sources: users cannot verify LLM claims without source attribution
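The HyDE fix in skeleton form: generate a hypothetical answer first, then embed that, so the query vector lives in the same "answer-shaped" space as the chunks. The generator and embedder are passed in as callables here; the function name and prompt wording are illustrative, not from any library.

```python
from typing import Callable

def hyde_query_embedding(
    question: str,
    generate: Callable[[str], str],        # an LLM call that drafts a short answer
    embed_one: Callable[[str], list[float]],
) -> list[float]:
    """HyDE: embed a hypothetical answer to the question, not the question itself."""
    hypothetical = generate(
        f"Write a short passage that answers the question: {question}"
    )
    return embed_one(hypothetical)
```

In a real pipeline, `generate` would be a chat-completion call and `embed_one` the embedding helper from Step 2; the returned vector then goes straight into the existing retrieve() query.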


Related Articles

What Are AI Agents? A Complete Guide for Developers

5 min read

Multi-Agent Systems: Orchestration Patterns and Real-World Examples

5 min read

How AI is Changing Developer Tools in 2025

4 min read


