Stage 04 — RAG (Retrieval-Augmented Generation)

RAG (Retrieval-Augmented Generation) · Comprehensive Technical Training · ⏱ 8–12 hours

Learning Objectives

By the end of this stage you will be able to:

Understand vector embeddings and why they work for semantic search
Build document chunking strategies for optimal retrieval
Set up vector databases (Chroma, FAISS) for embedding storage
Implement retrieval-augmented generation (RAG) pipelines
Evaluate RAG systems and diagnose failures
Implement hybrid search combining semantic and keyword matching
Handle large-scale document ingestion and reranking

Section 1: Vector Embeddings Fundamentals

Embeddings are dense vector representations of text that capture semantic meaning. Words or phrases with similar meanings cluster together in embedding space. This enables semantic search—finding relevant documents based on meaning rather than keyword matching.

How Embeddings Work

An embedding model (typically a transformer) encodes text into a fixed-size vector (e.g., 3072 dimensions for OpenAI's text-embedding-3-large, or 1536 for text-embedding-3-small). The model is trained to place semantically similar texts near each other in the vector space.

Example:

"What is machine learning?" → [0.12, -0.05, 0.89, ..., 0.34]
"How does machine learning work?" → [0.11, -0.04, 0.88, ..., 0.35]
"The capital of France" → [-0.45, 0.72, 0.01, ..., -0.23]

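The first two vectors point in nearly the same direction, so their cosine similarity is high (≈ 0.99 in this illustrative example). The third points elsewhere, so its similarity to the other two is low (≈ 0.2). The numbers are made up for illustration; real scores depend on the model.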
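The comparison above can be sketched in a few lines of plain Python. This is a minimal illustration, not how an embedding API computes scores internally: the vectors are cut down to the first three components shown in the example, so the resulting numbers will not match the illustrative values in the text.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional stand-ins for real embeddings (assumption: truncated, made-up values).
q1 = [0.12, -0.05, 0.89]   # "What is machine learning?"
q2 = [0.11, -0.04, 0.88]   # "How does machine learning work?"
q3 = [-0.45, 0.72, 0.01]   # "The capital of France"

sim_close = cosine_similarity(q1, q2)
sim_far = cosine_similarity(q1, q3)
print(f"q1 vs q2: {sim_close:.3f}")
print(f"q1 vs q3: {sim_far:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common default for text embeddings.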

Common Embedding Models

Model	Provider	Dims	Cost (per 1M tokens)	Best For
text-embedding-3-large	OpenAI	3072	$0.13	Highest quality, multilingual OK
text-embedding-3-small	OpenAI	1536	$0.02	Fast, cheap, good quality
intfloat/multilingual-e5-large	Hugging Face	1024	Free (self-hosted)	Multiple languages, self-hosted
BAAI/bge-small-en-v1.5	Hugging Face	384	Free (self-hosted)	Lightweight, fast
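The trade-off between dimensions and cost can be estimated with simple arithmetic. The sketch below assumes a hypothetical corpus of 100,000 chunks at about 400 tokens each, float32 storage (4 bytes per value), and the $0.02 per million tokens price listed for text-embedding-3-small. The function names are illustrative, and the size figure excludes index overhead and metadata.

```python
def embedding_cost_usd(total_tokens, price_per_million_tokens):
    """One-off API cost of embedding a corpus."""
    return total_tokens / 1_000_000 * price_per_million_tokens

def index_size_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw size of the stored vectors (float32 by default)."""
    return num_vectors * dims * bytes_per_value

# Hypothetical corpus: 100,000 chunks of roughly 400 tokens each.
num_chunks = 100_000
total_tokens = num_chunks * 400

cost = embedding_cost_usd(total_tokens, price_per_million_tokens=0.02)
size_1536 = index_size_bytes(num_chunks, dims=1536)
size_384 = index_size_bytes(num_chunks, dims=384)

print(f"API cost: ${cost:.2f}")
print(f"1536-dim vectors: {size_1536 / 1e6:.1f} MB")
print(f"384-dim vectors:  {size_384 / 1e6:.1f} MB")
```

Under these assumptions the embedding cost is small, while vector storage scales linearly with the dimension count.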

Section 2: Document Chunking Strategies

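Raw documents are usually too large to embed as single vectors: embedding models have an input length limit, and one vector for a long document blurs its individual topics together. Documents must therefore be split into chunks. Poorly chunked documents lead to broken context and bad retrieval.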

Simple Fixed-Size Chunking

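def chunk_text_fixed(text, chunk_size=512, overlap=50):
    """Split text into overlapping chunks of fixed size (measured in characters)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")
    step = chunk_size - overlap  # Each chunk starts `step` characters after the previous one
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
    return chunks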

Issue: Chunks may split sentences or paragraphs awkwardly.
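To see how chunk_size and overlap interact, the sketch below runs the fixed-size chunker on 1,000 characters of dummy text. The function is restated so the example runs on its own, and the dummy text is an assumption made purely for illustration.

```python
def chunk_text_fixed(text, chunk_size=512, overlap=50):
    """Fixed-size character chunking, restated so this example is self-contained."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# 1,000 characters of dummy text: "0123456789" repeated.
text = "".join(str(i % 10) for i in range(1000))
chunks = chunk_text_fixed(text, chunk_size=512, overlap=50)

for n, chunk in enumerate(chunks):
    print(f"chunk {n}: {len(chunk)} characters")

# Consecutive chunks share `overlap` characters at their boundary.
shared_boundary = chunks[0][-50:] == chunks[1][:50]
print("chunks 0 and 1 overlap by 50 characters:", shared_boundary)
```

Chunks start every chunk_size - overlap characters, so the final chunk is usually shorter than the rest.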

Semantic Chunking

import re

def chunk_text_semantic(text, target_chunk_size=512):
    """Split on sentence boundaries, packing sentences up to a target size.

    Note: this is sentence-aware chunking; it does not compare meanings.
    """
    # Split after ., ! or ? followed by whitespace, keeping the punctuation
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current_chunk = ""

    for sentence in sentences:
        if len(current_chunk) + len(sentence) < target_chunk_size:
            current_chunk += sentence + " "
        else:
            if current_chunk:
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if current_chunk.strip():
        chunks.append(current_chunk.strip())

    return chunks

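Better: Respects sentence boundaries, so each chunk reads as complete sentences. Limitation: a single sentence longer than the target size still becomes one oversized chunk.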

Recursive Chunking with LangChain

from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],  # Try these in order
)

chunks = splitter.split_text(document_text)

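Best practice: The splitter tries the separators in the order listed. It splits on paragraph breaks first, then line breaks, then sentences, then words, and falls back to individual characters only when a piece is still too large.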
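The fallback through a list of separators can be illustrated with a simplified, standard-library sketch. This is not LangChain's implementation: the function name recursive_split is hypothetical, there is no overlap handling, and separators are dropped at the points where the text is split.

```python
def recursive_split(text, chunk_size, separators):
    """Split on the first separator; recurse into oversized pieces with the rest."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: hard cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, remaining = separators[0], separators[1:]
    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate          # Keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)       # Current chunk is full; flush it
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # Piece is still too large: retry with the finer-grained separators.
            chunks.extend(recursive_split(piece, chunk_size, remaining))
    if current:
        chunks.append(current)
    return chunks

sample = "Para one is short.\n\nPara two is a bit longer. It has two sentences."
pieces = recursive_split(sample, chunk_size=30, separators=["\n\n", "\n", ". ", " ", ""])
for p in pieces:
    print(repr(p))
```

In this sample the first paragraph fits as one chunk, while the second is too long and is split again at the sentence separator.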

Section 3: Vector Database Setup

Chroma: Lightweight Vector Database

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Initialize Chroma
embedding_fn = OpenAIEmbeddingFunction(api_key="sk-...")
client = chromadb.EphemeralClient()  # In-memory; use PersistentClient for disk
collection = client.create_collection(
    name="documents",
    embedding_function=embedding_fn,
)

# Add documents with embeddings
collection.add(
    ids=["doc_1", "doc_2"],
    metadatas=[{"source": "file1.pdf"}, {"source": "file2.pdf"}],
    documents=["Chunk 1 text here...", "Chunk 2 text here..."],
)

# Query (Chroma handles embedding)
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=3,
)

# Chroma returns distances, so lower values mean closer matches
for i, doc in enumerate(results["documents"][0]):
    print(f"{i+1}. {doc} (distance: {results['distances'][0][i]:.3f})")
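Conceptually, a vector store keeps each chunk next to its embedding and returns the chunks whose embeddings are closest to the query embedding. The sketch below shows that idea with a brute-force scan in plain Python. TinyVectorStore is a hypothetical class, not part of Chroma or FAISS, and the 3-dimensional embeddings are made up. Production stores typically use approximate nearest-neighbor indexes rather than scanning every vector.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means same direction, larger means less similar."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

class TinyVectorStore:
    """Hypothetical in-memory store that scans every vector on each query."""

    def __init__(self):
        self._rows = []  # (id, embedding, document text)

    def add(self, ids, embeddings, documents):
        for doc_id, vec, text in zip(ids, embeddings, documents):
            self._rows.append((doc_id, vec, text))

    def query(self, query_embedding, n_results=3):
        scored = [
            (cosine_distance(query_embedding, vec), doc_id, text)
            for doc_id, vec, text in self._rows
        ]
        scored.sort(key=lambda row: row[0])  # Smallest distance first
        return scored[:n_results]

store = TinyVectorStore()
store.add(
    ids=["doc_1", "doc_2", "doc_3"],
    embeddings=[[0.9, 0.1, 0.0], [0.0, 0.2, 0.9], [0.8, 0.3, 0.1]],  # Made-up vectors
    documents=["ML basics", "French geography", "ML training tips"],
)

hits = store.query([1.0, 0.0, 0.0], n_results=2)
for distance, doc_id, text in hits:
    print(f"{doc_id}: {text} (distance: {distance:.3f})")
```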

Section 4: Complete RAG Pipeline

# Import paths follow the legacy `langchain` package layout; newer releases move these
# classes into langchain_community, langchain_openai, and langchain_text_splitters.
from langchain.document_loaders import PyPDFLoader  # Requires the pypdf package
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.embeddings.openai import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains import RetrievalQA
from langchain.chat_models import ChatOpenAI

# Load documents (one Document per PDF page)
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Create RAG chain (gpt-3.5-turbo is a chat model, so use the chat wrapper)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Simple concatenation
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)

# Query
answer = qa_chain.run("What does the document say about X?")
print(answer)
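The "stuff" strategy amounts to pasting the retrieved chunks into a single prompt. The sketch below shows the idea with a simple character budget. The function name, the prompt wording, and the 2,000-character limit are assumptions for illustration and do not reproduce LangChain's actual prompt template.

```python
def build_stuffed_prompt(question, retrieved_chunks, max_context_chars=2000):
    """Concatenate retrieved chunks into one prompt, stopping at a character budget."""
    context_parts = []
    used = 0
    for i, chunk in enumerate(retrieved_chunks, start=1):
        block = f"[{i}] {chunk}"
        if used + len(block) > max_context_chars:
            break  # Drop this and all lower-ranked chunks once the budget is hit
        context_parts.append(block)
        used += len(block)

    context = "\n\n".join(context_parts)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say you don't know.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

prompt = build_stuffed_prompt(
    "What is RAG?",
    [
        "RAG combines retrieval with generation.",
        "Chunks are embedded as vectors.",
        "x" * 5000,  # Oversized chunk that does not fit the budget
    ],
)
print(prompt)
```

Numbering the chunks makes it easier to ask the model to cite which chunk supports its answer.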

Section 5: RAG Evaluation and Debugging

Common Failure Modes

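Retrieval failure: The wrong chunks are retrieved. Fix: adjust chunk size, improve the chunking strategy, and verify embedding quality on a few known query/document pairs.
Context window exceeded: The retrieved chunks do not fit in the model's context window. Fix: retrieve fewer chunks (a lower k), use smaller chunks, or filter more selectively before generation.
Hallucination: The LLM invents information that is not in the retrieved documents. Fix: instruct the model to answer only from the provided context, allow it to say "I don't know" when the context lacks the answer, and use structured outputs that cite the supporting chunk.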
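To tell a retrieval failure apart from a generation failure, it helps to score retrieval on its own against a small labeled set. The sketch below computes hit rate at k and mean reciprocal rank (MRR) in plain Python. The query ids, document ids, and relevance labels are made-up example data.

```python
def hit_rate_at_k(results, relevant, k):
    """Fraction of queries with at least one relevant id among the top k results."""
    hits = sum(
        1
        for query, ranked_ids in results.items()
        if any(doc_id in relevant[query] for doc_id in ranked_ids[:k])
    )
    return hits / len(results)

def mean_reciprocal_rank(results, relevant):
    """Average of 1/rank of the first relevant result (0 if none is retrieved)."""
    total = 0.0
    for query, ranked_ids in results.items():
        for rank, doc_id in enumerate(ranked_ids, start=1):
            if doc_id in relevant[query]:
                total += 1.0 / rank
                break
    return total / len(results)

# Made-up retrieval output: ranked document ids per query.
results = {
    "q1": ["doc_3", "doc_1", "doc_7"],
    "q2": ["doc_2", "doc_9", "doc_4"],
    "q3": ["doc_8", "doc_6", "doc_5"],
}
# Made-up labels: which documents actually answer each query.
relevant = {"q1": {"doc_1"}, "q2": {"doc_2"}, "q3": {"doc_0"}}

hit_at_1 = hit_rate_at_k(results, relevant, k=1)
hit_at_3 = hit_rate_at_k(results, relevant, k=3)
mrr = mean_reciprocal_rank(results, relevant)
print(f"hit@1={hit_at_1:.2f} hit@3={hit_at_3:.2f} MRR={mrr:.2f}")
```

A low hit rate points at chunking or embeddings. A high hit rate with wrong answers points at the prompt or the model.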

Evaluation Metrics

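from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision

# Metrics are passed to evaluate(), not called directly. Column names follow the
# ragas 0.1-style schema and differ in other versions; check the installed release.
dataset = Dataset.from_dict({
    "question": [query],
    "answer": [generated_answer],
    "contexts": [retrieved_docs],  # List of retrieved chunk strings for this question
    "ground_truth": [reference_answer],
})
# context_precision: are relevant chunks ranked near the top of the retrieved context?
# answer_relevancy: does the generated answer address the question?
scores = evaluate(dataset, metrics=[context_precision, answer_relevancy])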

What's Next

Stage 5 teaches fine-tuning: adapting pre-trained models to your specific domain or task using QLoRA (a memory-efficient fine-tuning method), which can substantially improve quality for specialized applications.
