Stage 04 — Building RAG Systems with Vector Databases
Grounding AI in Your Own Documents · Technical + Architecture · ⏱ 8–12 hours
Learning Objectives
By the end of this stage you will be able to:
- Explain what RAG is and why it solves core LLM limitations
- Understand vector embeddings and semantic similarity
- Set up and query ChromaDB (local) and Pinecone (hosted) vector stores
- Build a complete document ingestion pipeline
- Implement chunking strategies and understand their trade-offs
- Build a production-ready Q&A system over custom documents
- Evaluate and improve RAG quality
Section 1: The RAG Problem Statement
Large language models have two fundamental limitations for production use:
- Knowledge cutoff — they don't know about events after training
- No access to private data — they can't answer questions about your documents, codebase, customer records, or internal systems
Retrieval-Augmented Generation (RAG) addresses both limitations by:
- Converting your documents into vector embeddings and storing them in a vector database
- At query time, embedding the query and retrieving the most relevant document chunks
- Injecting those chunks as context into the LLM's prompt
The LLM still does the reasoning and synthesis — it just has accurate, up-to-date facts to work from.
User Query → Embed Query → Search Vector DB → Retrieve Top-K Chunks
→ Build Prompt [chunks + query] → LLM → Grounded Answer
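The flow above can be sketched end to end without any external service. In this minimal sketch, word overlap stands in for embedding similarity (an assumption made only to keep the example runnable), and the names `score`, `retrieve`, and `build_prompt` are illustrative rather than part of any library.

```python
def score(query: str, chunk: str) -> float:
    """Toy relevance score: fraction of query words that appear in the chunk."""
    query_words = set(query.lower().split())
    chunk_words = set(chunk.lower().split())
    return len(query_words & chunk_words) / len(query_words) if query_words else 0.0


def retrieve(query: str, chunks: list[str], top_k: int = 2) -> list[str]:
    """Return the top_k chunks ranked by the toy score (a real system uses a vector DB)."""
    return sorted(chunks, key=lambda c: score(query, c), reverse=True)[:top_k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Place the retrieved chunks ahead of the question, numbered for citation."""
    context = "\n\n".join(f"[Source {i}] {c}" for i, c in enumerate(retrieved, 1))
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "Docker containers isolate applications with their dependencies.",
    "SQL injection exploits unsanitized database queries.",
    "HTTPS uses TLS to encrypt web traffic.",
]
top = retrieve("what is sql injection", chunks, top_k=1)
prompt = build_prompt("what is sql injection", top)  # this string would be sent to the LLM
```

Only the retrieval step changes when you move to real embeddings; the prompt-building step stays the same.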
Section 2: Vector Embeddings Deep Dive
An embedding is a dense vector (array of floating-point numbers) that represents the semantic meaning of text. Texts with similar meanings produce vectors that are geometrically close.
How Embeddings Work
Embedding models (distinct from generative LLMs) transform text into vectors:
# Using OpenAI embeddings (a common choice); expects OPENAI_API_KEY in the environment
from openai import OpenAI

client = OpenAI()

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Convert text to an embedding vector."""
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Example
vector = embed("machine learning for cybersecurity")
print(f"Vector dimensions: {len(vector)}")  # 1536 for text-embedding-3-small
print(f"First 5 values: {vector[:5]}")
Cosine Similarity
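Cosine similarity measures the angle between two vectors: 1 means they point in the same direction, 0 means they are orthogonal (unrelated), and -1 means they point in opposite directions.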
import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantic similarity test (uses embed() from the previous snippet)
v1 = embed("deep learning neural networks")
v2 = embed("artificial intelligence machine learning")
v3 = embed("cooking pasta recipes")
# Exact scores depend on the embedding model; compare them relative to each other
print(f"AI vs AI: {cosine_similarity(v1, v2):.3f}")  # Expect the higher score (related topics)
print(f"AI vs cooking: {cosine_similarity(v1, v3):.3f}")  # Expect a much lower score (unrelated topics)
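To see what the score means without calling an API, here is a small sketch using hand-picked two-dimensional vectors (chosen purely for illustration) and only the standard library. The helper name `cosine` is an assumption of this example.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # close to 1: same direction, different length
orthogonal = cosine([1.0, 0.0], [0.0, 3.0])       # 0: nothing in common
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # close to -1: opposite direction
```

Note that vector length does not matter: `[1, 2]` and `[2, 4]` score as identical because only the direction is compared.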
Embedding Model Comparison
| Model | Dimensions | Cost/1M tokens | Best For |
|---|---|---|---|
| text-embedding-3-small | 1536 | $0.02 | General use, cost-efficient |
| text-embedding-3-large | 3072 | $0.13 | Maximum accuracy |
| text-embedding-ada-002 | 1536 | $0.10 | Legacy, still common |
| voyage-2 (Voyage AI) | 1024 | $0.10 | RAG with Claude (Anthropic has no embedding model of its own and points to Voyage AI) |
| BAAI/bge-large (local) | 1024 | Free | Self-hosted, no API cost |
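The per-token prices in the table translate into a one-time ingestion cost that is easy to estimate. The sketch below uses the table's prices as given (they may change) and a hypothetical corpus of 10,000 chunks at roughly 400 tokens each; both the corpus size and the function name are assumptions for illustration.

```python
# Prices in USD per 1M tokens, copied from the comparison table above
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def embedding_cost(total_tokens: int, model: str) -> float:
    """Estimated USD cost of embedding total_tokens with the given model."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Hypothetical corpus: 10,000 chunks of roughly 400 tokens each = 4M tokens
corpus_tokens = 10_000 * 400
small_cost = embedding_cost(corpus_tokens, "text-embedding-3-small")  # about $0.08
large_cost = embedding_cost(corpus_tokens, "text-embedding-3-large")  # about $0.52
```

Query-time embedding adds a much smaller recurring cost, since each query is only a handful of tokens.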
Section 3: ChromaDB — Local Vector Store
ChromaDB is one of the easiest ways to start with RAG: it runs in-process on your machine, with no separate server or API key required.
Installation
pip install chromadb sentence-transformers
Basic Operations
import chromadb
from chromadb.utils import embedding_functions

# Initialize client (persisted to disk).
# Named chroma_client so it does not shadow the OpenAI `client` used by embed().
chroma_client = chromadb.PersistentClient(path="./chroma_db")

# Create embedding function (uses local model, no API key needed)
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create or get collection
collection = chroma_client.get_or_create_collection(
    name="technodex_docs",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}  # Use cosine distance
)

# Add documents
collection.add(
    documents=[
        "Python is a high-level programming language known for its readability.",
        "Kali Linux is a Debian-based distribution for penetration testing.",
        "SQL injection is a web attack that exploits unsanitized database queries.",
        "Docker containers isolate applications with their dependencies.",
        "HTTPS uses TLS to encrypt web traffic between client and server."
    ],
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    metadatas=[
        {"course": "python", "stage": 1},
        {"course": "kali", "stage": 2},
        {"course": "ethical-hacking", "stage": 7},
        {"course": "devops", "stage": 3},
        {"course": "networking", "stage": 4}
    ]
)

# Query
results = collection.query(
    query_texts=["how to protect web applications from attacks"],
    n_results=3
)

# Results are lists of lists: one inner list per query text, hence the [0]
for doc, meta, distance in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"Distance: {distance:.3f} | Course: {meta['course']}")
    print(f"  {doc[:100]}...")
    print()
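The query above reports a distance, not a similarity. With the collection configured for cosine space, the distance is 1 minus the cosine similarity, so smaller is better. The sketch below flattens a result into a ranked list; `sample` is a hand-written stand-in shaped like the output of `collection.query()`, and `to_ranked_hits` is a hypothetical helper, not a ChromaDB function.

```python
def to_ranked_hits(results: dict) -> list[dict]:
    """Flatten a single-query, Chroma-style result dict into a list of hits."""
    return [
        {"id": doc_id, "text": doc, "similarity": 1.0 - dist}
        for doc_id, doc, dist in zip(
            results["ids"][0],        # [0] selects the first (only) query
            results["documents"][0],
            results["distances"][0],
        )
    ]


# Hand-written stand-in for a query result (values are illustrative, not real output)
sample = {
    "ids": [["doc3", "doc5"]],
    "documents": [["SQL injection is a web attack...", "HTTPS uses TLS..."]],
    "distances": [[0.25, 0.5]],
}
hits = to_ranked_hits(sample)
```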
Section 4: Document Ingestion Pipeline
Chunking Strategies
Chunking splits large documents into pieces small enough to embed and retrieve precisely, but large enough to carry useful context on their own.
from dataclasses import dataclass
from typing import Optional
import re

@dataclass
class Chunk:
    text: str
    source: str
    chunk_index: int
    total_chunks: int
    metadata: dict

def chunk_by_size(
    text: str,
    source: str,
    chunk_size: int = 500,
    overlap: int = 50,
    metadata: Optional[dict] = None
) -> list[Chunk]:
    """Split text into fixed-size chunks with overlap (sizes are in words, not tokens)."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    chunks = []
    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk_text = " ".join(words[start:end])
        chunks.append(Chunk(
            text=chunk_text,
            source=source,
            chunk_index=len(chunks),
            total_chunks=-1,  # Filled in once all chunks exist
            metadata=metadata or {}
        ))
        if end >= len(words):
            break  # Last window reached; avoids a trailing chunk made only of overlap
        start += chunk_size - overlap
    for chunk in chunks:
        chunk.total_chunks = len(chunks)
    return chunks

def chunk_by_heading(text: str, source: str, metadata: Optional[dict] = None) -> list[Chunk]:
    """Split markdown at level 1-3 headings, preserving document structure."""
    # Drop empty sections first so chunk_index and total_chunks stay consistent
    sections = [s.strip() for s in re.split(r'\n(?=#{1,3} )', text) if s.strip()]
    return [
        Chunk(
            text=section,
            source=source,
            chunk_index=i,
            total_chunks=len(sections),
            metadata=metadata or {}
        )
        for i, section in enumerate(sections)
    ]

def chunk_by_sentence(
    text: str,
    source: str,
    max_sentences: int = 5,
    overlap_sentences: int = 1,
    metadata: Optional[dict] = None
) -> list[Chunk]:
    """Sentence-boundary chunking; better suited to factual Q&A."""
    if not 0 <= overlap_sentences < max_sentences:
        raise ValueError("overlap_sentences must be >= 0 and smaller than max_sentences")
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text.strip()) if s]
    chunks = []
    start = 0
    while start < len(sentences):
        end = min(start + max_sentences, len(sentences))
        chunk_text = " ".join(sentences[start:end])
        chunks.append(Chunk(
            text=chunk_text,
            source=source,
            chunk_index=len(chunks),
            total_chunks=-1,  # Filled in once all chunks exist
            metadata=metadata or {}
        ))
        if end >= len(sentences):
            break  # Last window reached
        start += max_sentences - overlap_sentences
    for chunk in chunks:
        chunk.total_chunks = len(chunks)
    return chunks
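To make the effect of overlap concrete, the following self-contained sketch applies a sliding window to an 11-word sentence, once with overlap and once without. The `word_windows` helper is a simplified stand-in written for this example, not one of the chunkers above.

```python
def word_windows(words: list[str], size: int, overlap: int) -> list[list[str]]:
    """Sliding windows of `size` words, each sharing `overlap` words with the previous one."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be >= 0 and smaller than size")
    windows, start = [], 0
    while start < len(words):
        windows.append(words[start:start + size])
        if start + size >= len(words):
            break  # the last window already reaches the end of the text
        start += size - overlap
    return windows


words = "the firewall blocks port 22 unless the admin VPN is active".split()
with_overlap = word_windows(words, size=5, overlap=2)
without_overlap = word_windows(words, size=5, overlap=0)
```

Without overlap, the rule ("blocks port 22") and its exception ("unless the admin VPN is active") land in different chunks, so neither chunk alone answers "when is port 22 allowed?". With an overlap of 2, the middle window reads "port 22 unless the admin", keeping the connection retrievable.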
Chunking Strategy Selection Guide
| Strategy | Use When | Avoid When |
|---|---|---|
| Fixed size | Fast ingestion, uniform documents | Meaning depends on intact sentences (chunks can split mid-sentence) |
| By heading | Markdown/structured docs | Plain prose without headings |
| By sentence | Q&A over facts | Very long paragraphs |
| Semantic | Retrieval quality matters most | Ingestion must be fast or no extra model is available |
| Recursive | A balanced default is needed | Document structure is known (split on it directly) |
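The recursive strategy in the table is not implemented elsewhere in this stage, so here is a minimal sketch of the idea: split on the coarsest separator first (paragraphs), and only fall back to finer ones (lines, sentences, words) for pieces that are still too long. This is a simplification under stated assumptions: sizes are in characters, separators are discarded, and small neighbouring pieces are not merged back together as fuller implementations typically do.

```python
def recursive_split(
    text: str,
    max_chars: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on the coarsest separator present; recurse into pieces that are still too long."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    for i, sep in enumerate(separators):
        if sep in text:
            pieces = []
            for part in text.split(sep):
                # Only finer separators can still occur inside each part
                pieces.extend(recursive_split(part, max_chars, separators[i + 1:]))
            return pieces
    # No separator left: fall back to a hard cut every max_chars characters
    return [text[j:j + max_chars] for j in range(0, len(text), max_chars)]


doc = "Intro line.\n\nFirst point is short. Second point is also short.\n\nEnd."
pieces = recursive_split(doc, max_chars=30)
```

Here the first and last paragraphs fit as they are, while the middle paragraph is too long and gets split at the sentence boundary instead.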
Full Ingestion Pipeline
import hashlib
from pathlib import Path
from typing import Optional

import chromadb
from chromadb.utils import embedding_functions

# Uses chunk_by_size, chunk_by_heading, and chunk_by_sentence defined above.

class DocumentIngester:
    """Ingest documents into a ChromaDB collection."""

    def __init__(self, collection_name: str, db_path: str = "./chroma_db"):
        self.db = chromadb.PersistentClient(path=db_path)
        self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="all-MiniLM-L6-v2"
        )
        self.collection = self.db.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_fn,
            metadata={"hnsw:space": "cosine"}
        )

    def ingest_file(
        self,
        file_path: str,
        chunk_strategy: str = "size",
        chunk_size: int = 500,
        extra_metadata: Optional[dict] = None
    ) -> int:
        """Ingest a single file. Returns number of chunks added."""
        path = Path(file_path)
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        metadata = {
            "filename": path.name,
            "file_type": path.suffix,
            **(extra_metadata or {})
        }
        # Any other strategy name (or a non-text file with "heading") falls back to fixed size
        if chunk_strategy == "heading" and path.suffix in [".md", ".txt"]:
            chunks = chunk_by_heading(text, str(path), metadata)
        elif chunk_strategy == "sentence":
            chunks = chunk_by_sentence(text, str(path), metadata=metadata)
        else:
            chunks = chunk_by_size(text, str(path), chunk_size, metadata=metadata)
        if not chunks:
            # Nothing to store for an empty file
            print(f"Skipped {path.name}: no text to ingest")
            return 0
        # Generate deterministic IDs so re-ingesting the same file overwrites instead of duplicating
        ids = []
        for chunk in chunks:
            content_hash = hashlib.md5(
                f"{file_path}:{chunk.chunk_index}:{chunk.text[:50]}".encode()
            ).hexdigest()[:12]
            ids.append(f"{path.stem}_{content_hash}")
        # Add to collection
        self.collection.upsert(
            documents=[c.text for c in chunks],
            ids=ids,
            metadatas=[{**c.metadata, "chunk_index": c.chunk_index, "source": c.source}
                       for c in chunks]
        )
        print(f"Ingested {path.name}: {len(chunks)} chunks")
        return len(chunks)

    def ingest_directory(self, directory: str, pattern: str = "*.txt", **kwargs) -> int:
        """Ingest all matching files in a directory."""
        total = 0
        for file_path in Path(directory).glob(pattern):
            total += self.ingest_file(str(file_path), **kwargs)
        return total

    def search(self, query: str, n_results: int = 5, where: Optional[dict] = None) -> list[dict]:
        """Search collection and return results."""
        kwargs = {"query_texts": [query], "n_results": n_results}
        if where:
            kwargs["where"] = where
        results = self.collection.query(**kwargs)
        return [
            {
                "text": doc,
                "metadata": meta,
                "distance": dist
            }
            for doc, meta, dist in zip(
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0]
            )
        ]
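The pipeline's ID scheme is what makes re-ingestion safe: the same file, chunk position, and chunk opening always produce the same ID, so an upsert overwrites rather than duplicates. The sketch below isolates that scheme in a hypothetical helper named `chunk_id` (the name and the example path are assumptions) so its behaviour can be checked on its own.

```python
import hashlib
from pathlib import Path


def chunk_id(file_path: str, chunk_index: int, text: str) -> str:
    """Deterministic ID built from the path, the chunk position, and the first 50 characters."""
    digest = hashlib.md5(
        f"{file_path}:{chunk_index}:{text[:50]}".encode()
    ).hexdigest()[:12]
    return f"{Path(file_path).stem}_{digest}"


first = chunk_id("docs/intro.md", 0, "Python is a high-level language.")
again = chunk_id("docs/intro.md", 0, "Python is a high-level language.")
other = chunk_id("docs/intro.md", 1, "Python is a high-level language.")
```

One consequence worth keeping in mind: if an edit changes the first 50 characters of a chunk, or a file shrinks to fewer chunks, the old IDs are never overwritten and stale chunks stay in the collection. Deleting a file's existing chunks before re-ingesting it is one way to handle that.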
Section 5: Building a RAG Q&A System
import anthropic
from typing import Optional

# Assumes DocumentIngester from Section 4 is defined or imported in this module.

class RAGSystem:
    """Complete RAG Q&A system over a document collection."""

    def __init__(self, ingester: DocumentIngester, llm_client: anthropic.Anthropic):
        self.ingester = ingester
        self.llm = llm_client
        self.conversation_history = []

    def answer(
        self,
        question: str,
        n_results: int = 5,
        course_filter: Optional[str] = None,
        show_sources: bool = True
    ) -> dict:
        """Answer a question using retrieved context."""
        # Retrieve relevant chunks
        where = {"course": course_filter} if course_filter else None
        retrieved = self.ingester.search(question, n_results=n_results, where=where)
        if not retrieved:
            return {
                "answer": "I couldn't find relevant information in the knowledge base.",
                "sources": [],
                "retrieved_chunks": 0
            }
        # Build context from retrieved chunks
        context_parts = []
        for i, result in enumerate(retrieved, 1):
            meta = result["metadata"]
            source = meta.get("filename", meta.get("source", "Unknown"))
            context_parts.append(f"[Source {i}: {source}]\n{result['text']}")
        context = "\n\n---\n\n".join(context_parts)
        # Build prompt
        system_prompt = """You are a knowledgeable technical assistant for TechNodeX, an online learning platform.
Answer questions based ONLY on the provided context. If the context doesn't contain enough information to answer confidently, say so clearly.
Rules:
- Base your answer exclusively on the provided context
- Cite source numbers when referencing specific information [Source N]
- If the context is contradictory, note the contradiction
- Keep answers focused and precise
- Use code examples when relevant"""
        user_message = f"""Context from knowledge base:
{context}
---
Question: {question}"""
        # Call LLM
        response = self.llm.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}]
        )
        answer = response.content[0].text
        sources = [
            {
                "file": r["metadata"].get("filename", "unknown"),
                "chunk": r["metadata"].get("chunk_index", 0),
                "relevance": 1 - r["distance"]  # cosine distance -> similarity
            }
            for r in retrieved
        ]
        return {
            "answer": answer,
            "sources": sources,
            "retrieved_chunks": len(retrieved),
            "tokens_used": response.usage.input_tokens + response.usage.output_tokens
        }

    def chat(self, question: str, **kwargs) -> str:
        """Answer one chat turn and record it in conversation_history."""
        # Retrieval uses only the current question. For follow-ups such as
        # "What about the previous topic?", consider folding recent turns from
        # self.conversation_history into the retrieval query.
        result = self.answer(question, **kwargs)
        self.conversation_history.append({"role": "user", "content": question})
        self.conversation_history.append({"role": "assistant", "content": result["answer"]})
        print(f"\n{result['answer']}")
        if result["sources"]:
            print(f"\n[Retrieved {result['retrieved_chunks']} chunks from: "
                  f"{', '.join(set(s['file'] for s in result['sources']))}]")
        return result["answer"]

# Usage example
def build_course_qa_system(docs_directory: str) -> RAGSystem:
    """Build a RAG system over TechNodeX course documents."""
    ingester = DocumentIngester("technodex_courses")
    ingester.ingest_directory(docs_directory, pattern="*.md", chunk_strategy="heading")
    llm = anthropic.Anthropic()  # Reads ANTHROPIC_API_KEY from the environment
    rag = RAGSystem(ingester, llm)
    return rag

# Interactive Q&A
if __name__ == "__main__":
    rag = build_course_qa_system("./course_docs")
    print("TechNodeX Course Assistant (type 'quit' to exit)")
    print("=" * 50)
    while True:
        question = input("\nYour question: ").strip()
        if question.lower() == "quit":
            break
        if question:
            rag.chat(question)
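A follow-up such as "How do I prevent it?" retrieves poorly on its own because the embedding has no idea what "it" refers to. One simple option, sketched below, is to prefix the retrieval query with the most recent user questions. The history format (a list of dicts with `role` and `content` keys) and the helper name `build_retrieval_query` are assumptions of this example; having an LLM rewrite the question into a standalone one is a common alternative.

```python
def build_retrieval_query(history: list[dict], question: str, max_turns: int = 2) -> str:
    """Prefix the question with up to max_turns earlier user questions."""
    earlier = [turn["content"] for turn in history if turn["role"] == "user"]
    recent = earlier[-max_turns:] if max_turns > 0 else []
    return " ".join(recent + [question])


history = [
    {"role": "user", "content": "What is SQL injection?"},
    {"role": "assistant", "content": "An attack that exploits unsanitized queries."},
]
query = build_retrieval_query(history, "How do I prevent it?")
```

Only the retrieval query is expanded; the question shown to the LLM can stay as the user typed it.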
Section 6: Pinecone — Production Vector Store
For production systems with large datasets, consider a managed vector database such as Pinecone.
pip install pinecone  # formerly published as pinecone-client
from pinecone import Pinecone, ServerlessSpec
import os

# Initialize
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create index (one-time setup)
index_name = "technodex-courses"
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # Must match the embedding model (text-embedding-3-small)
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )
index = pc.Index(index_name)

# Upsert vectors
# `chunks` is a list of Chunk objects produced by a Section 4 chunker
vectors_to_upsert = []
for chunk in chunks:
    embedding = embed(chunk.text)  # From Section 2
    vectors_to_upsert.append({
        "id": f"{chunk.source}:{chunk.chunk_index}",  # Include the source so IDs stay unique across files
        "values": embedding,
        "metadata": {
            "text": chunk.text,
            "source": chunk.source,
            "chunk_index": chunk.chunk_index
        }
    })

# Batch upsert (batches of about 100 keep each request comfortably small)
batch_size = 100
for i in range(0, len(vectors_to_upsert), batch_size):
    batch = vectors_to_upsert[i:i + batch_size]
    index.upsert(vectors=batch)

# Query
query_embedding = embed("how do I configure Kali Linux")
results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)
for match in results.matches:
    print(f"Score: {match.score:.3f}")  # Cosine similarity: higher is better
    print(f"Source: {match.metadata['source']}")
    print(f"Text: {match.metadata['text'][:100]}...")
    print()
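Two things commonly go wrong at the upsert step: vectors whose length does not match the index dimension, and requests that are too large. The sketch below shows a pre-flight check and a batching helper that run before any network call. Both helper names are hypothetical, and the record shape simply mirrors the dictionaries built above.

```python
from typing import Iterator


def check_dimensions(vectors: list[dict], expected_dim: int) -> None:
    """Raise early if any vector's length differs from the index dimension."""
    for record in vectors:
        actual = len(record["values"])
        if actual != expected_dim:
            raise ValueError(
                f"{record['id']}: expected {expected_dim} dimensions, got {actual}"
            )


def batches(items: list, size: int) -> Iterator[list]:
    """Yield consecutive slices of at most `size` items."""
    for i in range(0, len(items), size):
        yield items[i:i + size]


# Illustrative records with tiny 3-dimensional vectors
records = [{"id": f"doc:{n}", "values": [0.0, 0.1, 0.2]} for n in range(250)]
check_dimensions(records, expected_dim=3)
batch_sizes = [len(b) for b in batches(records, 100)]
```

A dimension mismatch usually means the index was created for one embedding model and the vectors came from another, which is cheaper to catch locally than through a failed API call.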
Section 7: RAG Quality Improvement
Common RAG Failure Modes
| Problem | Symptom | Solution |
|---|---|---|
| Chunk too small | Missing context | Increase chunk size or overlap |
| Chunk too large | LLM ignores most of chunk | Decrease chunk size |
| Wrong retrieval | Retrieved chunks don't address the question | Revisit chunking or the embedding model, try hybrid search |
| Hallucination despite retrieval | LLM adds information not in context | Stricter system prompt, temperature=0 |
| Irrelevant chunks | Relevant chunks are retrieved but ranked below irrelevant ones | Add reranker model (cross-encoder) |
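Before reaching for a reranker, a cheap first step against irrelevant chunks is to drop weak matches so they never enter the prompt. The sketch below filters hits shaped like the output of the `search` method in Section 4 (dicts with a `distance` key). The helper name and the 0.6 threshold are assumptions for illustration; a sensible cutoff depends on the embedding model and should be tuned on your own data.

```python
def filter_hits(hits: list[dict], max_distance: float = 0.6, max_hits: int = 5) -> list[dict]:
    """Keep only hits within max_distance, closest first, capped at max_hits."""
    kept = [hit for hit in hits if hit["distance"] <= max_distance]
    return sorted(kept, key=lambda hit: hit["distance"])[:max_hits]


# Illustrative hits (distances are made up)
hits = [
    {"text": "SQL injection exploits unsanitized queries.", "distance": 0.31},
    {"text": "Docker containers isolate applications.", "distance": 0.82},
    {"text": "HTTPS uses TLS to encrypt traffic.", "distance": 0.55},
]
kept = filter_hits(hits, max_distance=0.6)
```

If filtering leaves nothing, answering "not found in the knowledge base" is usually better than prompting the LLM with unrelated text.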
Hybrid Search
Combine semantic search (embeddings) with keyword search (BM25) for better recall:
# Conceptual hybrid search pattern.
# bm25_search() and get_doc_by_id() are placeholders for your keyword index and
# document store; neither is defined here.
def hybrid_search(query: str, collection, n_results: int = 5) -> list[dict]:
    """Combine semantic and keyword search."""
    # Semantic search: Chroma returns a dict of lists, one inner list per query
    semantic_results = collection.query(query_texts=[query], n_results=n_results * 2)
    semantic_ids = semantic_results["ids"][0]
    # BM25 keyword search (simplified); assumed to return ranked dicts with an "id" key
    # In production: use Elasticsearch, Weaviate hybrid, or Pinecone sparse-dense
    keyword_results = bm25_search(query, n_results=n_results * 2)
    keyword_ids = [result["id"] for result in keyword_results]
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document
    scores = {}
    k = 60  # RRF constant
    for ranked_ids in (semantic_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked_ids):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank + 1)
    # Sort by combined score and return top N
    sorted_ids = sorted(scores, key=scores.get, reverse=True)
    return [get_doc_by_id(doc_id) for doc_id in sorted_ids[:n_results]]
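The fusion step can be tried in isolation, without any search backend. This sketch takes already-ranked lists of document IDs (hand-written here for illustration) and merges them with Reciprocal Rank Fusion, using the same constant k = 60 as above. The function name is an assumption of this example.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60, top_n: int = 5) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)[:top_n]


semantic = ["doc3", "doc5", "doc1"]  # illustrative ranking from embedding search
keyword = ["doc5", "doc2", "doc3"]   # illustrative ranking from keyword search
fused = reciprocal_rank_fusion([semantic, keyword], top_n=3)
```

Documents that appear in both lists rise to the top: doc5 (ranks 2 and 1) edges out doc3 (ranks 1 and 3), and both beat documents found by only one method. Because RRF uses ranks rather than raw scores, it needs no normalization between the two search systems.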
Checkpoint Assessment
- What two problems does RAG solve that a plain LLM cannot?
- Explain cosine similarity in plain language. What does a similarity of 0 mean? Of 1?
- A user reports that the RAG system sometimes answers with information that contradicts the source documents. What is the most likely cause?
- Why is chunk overlap important? What happens without it?
- Your document collection is 50,000 pages of legal contracts. You need to answer questions about specific clauses. What chunking strategy would you use and why?
- What is hybrid search and when would you prefer it over pure semantic search?
Project: TechNodeX Course Q&A Bot
Build a complete RAG system that:
- Ingests the TechNodeX course markdown files (download from GitHub)
- Stores them in ChromaDB with course-aware metadata
- Handles multi-turn conversation (remember what was asked earlier)
- Filters results by course when the user specifies one
- Reports sources with each answer
- Evaluates itself: generate 10 test questions with known answers, measure how often the correct answer appears in the top-3 retrieved chunks (Recall@3)
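For the self-evaluation requirement, Recall@k can be computed with a few lines once each test question is paired with the ID of the chunk that contains its answer. The sketch below is one possible shape under stated assumptions: the test-case format, the `recall_at_k` name, and the stub retriever with canned rankings are all illustrative. In the project you would pass a function that queries your collection and returns ranked chunk IDs.

```python
from typing import Callable


def recall_at_k(
    test_cases: list[dict],
    retrieve: Callable[[str], list[str]],
    k: int = 3,
) -> float:
    """Fraction of test cases whose expected chunk ID appears in the top-k retrieved IDs."""
    if not test_cases:
        return 0.0
    hits = sum(
        1 for case in test_cases
        if case["expected_id"] in retrieve(case["question"])[:k]
    )
    return hits / len(test_cases)


# Stub retriever with canned rankings, standing in for a real vector search
canned = {
    "q1": ["a", "b", "c", "d"],
    "q2": ["x", "y", "z", "w"],
}
cases = [
    {"question": "q1", "expected_id": "c"},  # found at rank 3: counts as a hit
    {"question": "q2", "expected_id": "w"},  # found at rank 4: a miss at k=3
]
recall = recall_at_k(cases, lambda q: canned[q], k=3)
```

Tracking this number while you change chunk size, overlap, or embedding model turns tuning into a measurable comparison instead of guesswork.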
What's Next
Stage 5 covers AI Agents and Agentic Workflows — building systems where LLMs take sequences of actions, call tools, make decisions, and collaborate with other agents to complete complex tasks.