Stage 04 — RAG (Retrieval-Augmented Generation)
RAG (Retrieval-Augmented Generation) · Comprehensive Technical Training · ⏱ 8–12 hours
Learning Objectives
By the end of this stage you will be able to:
- Understand vector embeddings and why they work for semantic search
- Build document chunking strategies for optimal retrieval
- Set up vector databases (Chroma, FAISS) for embedding storage
- Implement retrieval-augmented generation (RAG) pipelines
- Evaluate RAG systems and diagnose failures
- Implement hybrid search combining semantic and keyword matching
- Handle large-scale document ingestion and reranking
Section 1: Vector Embeddings Fundamentals
Embeddings are dense vector representations of text that capture semantic meaning. Texts with similar meanings map to nearby points in embedding space, which enables semantic search: finding relevant documents by meaning rather than by exact keyword match.
How Embeddings Work
An embedding model (typically a transformer) encodes text into a fixed-size vector (e.g., 1536 dimensions for OpenAI's text-embedding-3-small, 3072 for text-embedding-3-large). The model is trained to place semantically similar texts near each other in the vector space.
Example:
- "What is machine learning?" → [0.12, -0.05, 0.89, ..., 0.34]
- "How does machine learning work?" → [0.11, -0.04, 0.88, ..., 0.35]
- "The capital of France" → [-0.45, 0.72, 0.01, ..., -0.23]
The first two vectors are close together (high cosine similarity, ≈ 0.99), while the third is far from both (low similarity, ≈ 0.2). These numbers are illustrative rather than real model output.
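Cosine similarity, the measure quoted above, can be computed directly. Here is a minimal sketch using only the standard library. The three 4-dimensional vectors are made-up stand-ins for real embeddings, which have hundreds or thousands of dimensions, so the exact scores are only illustrative.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

# Toy vectors (hypothetical values, not real model output)
ml_question = [0.12, -0.05, 0.89, 0.34]
ml_how = [0.11, -0.04, 0.88, 0.35]
france = [-0.45, 0.72, 0.01, -0.23]

sim_related = cosine_similarity(ml_question, ml_how)
sim_unrelated = cosine_similarity(ml_question, france)
print(f"related:   {sim_related:.3f}")
print(f"unrelated: {sim_unrelated:.3f}")
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common default for text embeddings.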
Common Embedding Models
| Model | Provider | Dims | Cost (per 1M tokens) | Best For |
|---|---|---|---|---|
| text-embedding-3-large | OpenAI | 3072 | $0.13 | Highest quality, multilingual OK |
| text-embedding-3-small | OpenAI | 1536 | $0.02 | Fast, cheap, good quality |
| Multilingual-E5-Large | Hugging Face | 1024 | Free | Multiple languages, self-hosted |
| BAAI/bge-small-en-v1.5 | Hugging Face | 384 | Free | Lightweight, fast |
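Dimension count also drives storage and search cost. As a rough sketch, the raw footprint of a collection can be estimated as below, assuming vectors are stored as uncompressed 32-bit floats (4 bytes per dimension) and ignoring index overhead and metadata. The function name and the one-million-chunk corpus are illustrative assumptions.

```python
def raw_vector_bytes(num_vectors, dims, bytes_per_value=4):
    """Raw storage for num_vectors float vectors, excluding index overhead."""
    return num_vectors * dims * bytes_per_value

# Dimensions taken from the table above; corpus size is a hypothetical example
models = [
    ("text-embedding-3-large", 3072),
    ("text-embedding-3-small", 1536),
    ("Multilingual-E5-Large", 1024),
    ("BAAI/bge-small-en-v1.5", 384),
]
for name, dims in models:
    gib = raw_vector_bytes(1_000_000, dims) / 2**30
    print(f"{name:28s} {gib:6.2f} GiB per million chunks")
```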
Section 2: Document Chunking Strategies
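Raw documents are usually too long to embed usefully as single vectors: embedding models have an input limit, and one vector for a whole document blurs its topics together. Documents must therefore be split into chunks. Poorly chunked documents lead to broken context and bad retrieval.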
Simple Fixed-Size Chunking
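def chunk_text_fixed(text, chunk_size=512, overlap=50):
    """Split text into overlapping chunks of chunk_size characters."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # each chunk starts `step` characters after the previous one
    chunks = []
    for i in range(0, len(text), step):
        chunks.append(text[i:i + chunk_size])
    return chunks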
Issue: Sizes are measured in characters, not tokens, and chunks may cut through words, sentences, or paragraphs.
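A small step toward cleaner boundaries is to count words instead of characters, so that a chunk never ends mid-word (sentences can still be split). The sketch below is an illustrative variant, not a replacement for sentence-aware splitting. `chunk_words_fixed` and its defaults are assumptions.

```python
def chunk_words_fixed(text, chunk_size=100, overlap=20):
    """Split text into overlapping chunks of at most chunk_size words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks

sample = "Embeddings map text to vectors so that similar meanings land close together"
word_chunks = chunk_words_fixed(sample, chunk_size=5, overlap=2)
for chunk in word_chunks:
    print(repr(chunk))
```

Each chunk repeats the last `overlap` words of the previous one, which gives the retriever some shared context across boundaries.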
Semantic Chunking
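import re

def chunk_text_semantic(text, target_chunk_size=512):
    """Split on sentence boundaries, packing sentences up to target_chunk_size characters."""
    # Split after ., ! or ? followed by whitespace; punctuation stays with its sentence
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if not sentence:
            continue
        if len(current_chunk) + len(sentence) + 1 <= target_chunk_size:
            current_chunk = f"{current_chunk} {sentence}".strip()
        else:
            if current_chunk:
                chunks.append(current_chunk)
            current_chunk = sentence  # a single over-long sentence becomes its own chunk
    if current_chunk:
        chunks.append(current_chunk)
    return chunks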
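Better: Respects sentence boundaries, so each chunk holds complete sentences. Note that this is sentence-aware splitting by length; it does not measure semantic similarity between sentences.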
Recursive Chunking with LangChain
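from langchain_text_splitters import RecursiveCharacterTextSplitter
# (older LangChain releases: from langchain.text_splitter import RecursiveCharacterTextSplitter)
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""],  # Tried in order: paragraph, line, sentence, word, character
)
chunks = splitter.split_text(document_text)  # document_text: the raw document string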
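Best practice: Tries the coarsest separator first (paragraph breaks), then falls back to line breaks, sentences, words, and finally single characters, but only for pieces that are still larger than chunk_size.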
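To make the fallback order concrete, here is a simplified, dependency-free sketch of recursive splitting. It is not LangChain's implementation: it omits overlap and drops the separator at chunk boundaries, and `recursive_split` is a hypothetical name.

```python
def recursive_split(text, chunk_size=512, separators=("\n\n", "\n", ". ", " ", "")):
    """Split text so every piece is at most chunk_size characters.

    Tries the coarsest separator first and recurses with finer ones
    only for pieces that are still too long.
    """
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators or separators[0] == "":
        # Last resort: hard character cuts
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks, current = [], ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small neighbouring pieces
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # This piece alone is too long: retry it with finer separators
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return [c for c in chunks if c.strip()]

doc = "Para one is short.\n\nPara two is a bit longer. It has two sentences.\n\nTiny."
pieces = recursive_split(doc, chunk_size=40)
for piece in pieces:
    print(repr(piece))
```

Only the second paragraph exceeds the limit, so only that paragraph is broken down further at the sentence level. The other paragraphs stay intact.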
Section 3: Vector Database Setup
Chroma: Lightweight Vector Database
import os

import chromadb
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction

# Initialize Chroma; read the API key from the environment rather than hardcoding it
embedding_fn = OpenAIEmbeddingFunction(api_key=os.environ["OPENAI_API_KEY"])
client = chromadb.EphemeralClient()  # In-memory; use PersistentClient for disk
collection = client.create_collection(
    name="documents",
    embedding_function=embedding_fn,
)

# Add documents (Chroma embeds them with embedding_fn)
collection.add(
    ids=["doc_1", "doc_2"],
    metadatas=[{"source": "file1.pdf"}, {"source": "file2.pdf"}],
    documents=["Chunk 1 text here...", "Chunk 2 text here..."],
)

# Query (Chroma embeds the query text too)
results = collection.query(
    query_texts=["What is machine learning?"],
    n_results=2,  # no more than the number of stored documents
)
# Chroma returns distances, not similarity scores: lower means closer
for i, doc in enumerate(results["documents"][0]):
    print(f"{i+1}. {doc} (distance: {results['distances'][0][i]:.3f})")
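Under the hood, a vector store maps ids to vectors and answers nearest-neighbor queries. The following is a minimal in-memory sketch of that idea using exhaustive cosine-distance search. It is not how Chroma is implemented (production stores typically use approximate indexes), and `TinyVectorStore` and its 3-dimensional vectors are hypothetical.

```python
import math

class TinyVectorStore:
    """Exhaustive nearest-neighbor search over unit-normalized vectors."""

    def __init__(self):
        self._ids, self._vectors, self._docs = [], [], []

    @staticmethod
    def _normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec))
        if norm == 0:
            raise ValueError("cannot use a zero vector")
        return [x / norm for x in vec]

    def add(self, doc_id, vector, document):
        self._ids.append(doc_id)
        self._vectors.append(self._normalize(vector))
        self._docs.append(document)

    def query(self, vector, n_results=3):
        """Return (id, document, cosine_distance) tuples, closest first."""
        q = self._normalize(vector)
        scored = []
        for doc_id, vec, doc in zip(self._ids, self._vectors, self._docs):
            similarity = sum(a * b for a, b in zip(q, vec))
            scored.append((doc_id, doc, 1.0 - similarity))
        scored.sort(key=lambda item: item[2])
        return scored[:n_results]

store = TinyVectorStore()
# Hand-written toy vectors standing in for real embeddings
store.add("doc_1", [0.9, 0.1, 0.0], "Intro to machine learning")
store.add("doc_2", [0.0, 0.2, 0.9], "Travel guide to Paris")
store.add("doc_3", [0.8, 0.3, 0.1], "Neural networks explained")

top = store.query([1.0, 0.2, 0.0], n_results=2)
for doc_id, doc, dist in top:
    print(f"{doc_id}: {doc} (distance: {dist:.3f})")
```

Exhaustive search is exact but scales linearly with collection size, which is the main reason dedicated vector databases exist.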
Section 4: Complete RAG Pipeline
# Import paths follow the split-package LangChain layout (langchain-community,
# langchain-openai, langchain-text-splitters); other releases may differ.
from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
from langchain.chains import RetrievalQA

# Load documents (PyPDFLoader needs the pypdf package; it returns one Document per page)
loader = PyPDFLoader("document.pdf")
documents = loader.load()

# Split into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=512, chunk_overlap=50)
chunks = splitter.split_documents(documents)

# Create embeddings and vector store
embeddings = OpenAIEmbeddings()
vectorstore = Chroma.from_documents(chunks, embeddings)

# Create RAG chain (gpt-3.5-turbo is a chat model, so use ChatOpenAI)
llm = ChatOpenAI(model="gpt-3.5-turbo", temperature=0)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # Simple concatenation of retrieved chunks into one prompt
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3}),
)

# Query
response = qa_chain.invoke({"query": "What does the document say about X?"})
print(response["result"])
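The "stuff" chain type concatenates the retrieved chunks into a single prompt. Here is a rough sketch of that assembly step, independent of LangChain. The prompt wording and `build_stuffed_prompt` are assumptions, not LangChain's actual template.

```python
def build_stuffed_prompt(question, chunks):
    """Concatenate retrieved chunks and the question into one prompt string."""
    numbered = [f"[{i}] {chunk.strip()}" for i, chunk in enumerate(chunks, start=1)]
    context = "\n\n".join(numbered)
    return (
        "Answer the question using only the context below. "
        "If the context does not contain the answer, say \"I don't know.\"\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )

# Hypothetical retrieved chunks, standing in for the retriever's output
retrieved = [
    "RAG retrieves relevant chunks before generation.",
    "Chunks are ranked by vector similarity to the query.",
]
prompt = build_stuffed_prompt("How does RAG pick its context?", retrieved)
print(prompt)
```

Numbering the chunks makes it possible to ask the model to cite which chunk supports each claim.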
Section 5: RAG Evaluation and Debugging
Common Failure Modes
- Retrieval failure: The wrong chunks are retrieved. Fix: adjust chunk size, improve the chunking strategy, verify embedding quality.
- Context window exceeded: The retrieved chunks do not fit in the model's context. Fix: retrieve fewer chunks (lower k, e.g., k=1 instead of k=5), use smaller chunks, or filter more selectively.
- Hallucination: The LLM invents information not present in the retrieved documents. Fix: instruct the model to answer only from the provided context, allow it to say "I don't know" when the context lacks the answer, and use structured outputs.
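For the context-window failure in particular, one common mitigation is to keep only as many top-ranked chunks as fit a fixed budget. The sketch below assumes chunks arrive sorted best-first and approximates token count as words × 1.3, a crude heuristic rather than a real tokenizer. `fit_to_budget` and `estimate_tokens` are hypothetical helpers.

```python
def estimate_tokens(text):
    """Crude token estimate (assumption); swap in a real tokenizer for production use."""
    return int(len(text.split()) * 1.3) + 1

def fit_to_budget(ranked_chunks, max_tokens):
    """Keep top-ranked chunks, in order, until the token budget is exhausted."""
    selected, used = [], 0
    for chunk in ranked_chunks:
        cost = estimate_tokens(chunk)
        if used + cost > max_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += cost
    return selected, used

# Three synthetic 50-word chunks, already ranked best-first
ranked = ["alpha " * 50, "beta " * 50, "gamma " * 50]
kept, used = fit_to_budget(ranked, max_tokens=150)
print(f"kept {len(kept)} of {len(ranked)} chunks using ~{used} tokens")
```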
Evaluation Metrics
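from datasets import Dataset
from ragas import evaluate
from ragas.metrics import answer_relevancy, context_precision
# Column names follow the ragas 0.1-style schema (an assumption; newer releases rename them)
data = Dataset.from_dict({
    "question": [query], "answer": [generated_answer],
    "contexts": [retrieved_docs], "ground_truth": [reference_answer],
})
# context_precision: are relevant chunks ranked near the top? answer_relevancy: does the answer address the query?
scores = evaluate(data, metrics=[context_precision, answer_relevancy])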
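Before reaching for LLM-judged metrics, retrieval quality can be checked with simple rank-based metrics over a small labeled set. The sketch below computes hit rate and mean reciprocal rank (MRR) at k. It is independent of ragas, and the document ids and `retrieval_metrics` helper are illustrative.

```python
def retrieval_metrics(results, relevant, k=3):
    """Compute hit rate@k and MRR@k.

    results:  list of ranked id lists, one per query
    relevant: list of sets of relevant ids, one per query
    """
    if not results or len(results) != len(relevant):
        raise ValueError("results and relevant must be non-empty and the same length")
    hits, reciprocal_ranks = 0, 0.0
    for ranked_ids, relevant_ids in zip(results, relevant):
        for rank, doc_id in enumerate(ranked_ids[:k], start=1):
            if doc_id in relevant_ids:
                hits += 1
                reciprocal_ranks += 1.0 / rank
                break  # only the first relevant hit counts
    n = len(results)
    return {"hit_rate": hits / n, "mrr": reciprocal_ranks / n}

# Hypothetical retrieval output for three queries, with hand-labeled relevant ids
results = [["d3", "d1", "d7"], ["d2", "d9", "d4"], ["d5", "d6", "d8"]]
relevant = [{"d1"}, {"d2"}, {"d0"}]
metrics = retrieval_metrics(results, relevant, k=3)
print(metrics)
```

A low hit rate points at retrieval (chunking, embeddings, k). A high hit rate with poor answers points at the generation step.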
What's Next
Stage 5 teaches fine-tuning: adapting pre-trained models to your specific domain or task using QLoRA (a memory-efficient fine-tuning method), which can substantially improve quality for specialized applications.