Stage 04 — Building RAG Systems with Vector Databases

Grounding AI in Your Own Documents · Technical + Architecture · ⏱ 8–12 hours

Learning Objectives

By the end of this stage you will be able to:

Explain what RAG is and why it solves core LLM limitations
Understand vector embeddings and semantic similarity
Set up and query ChromaDB (local) and Pinecone (hosted) vector stores
Build a complete document ingestion pipeline
Implement chunking strategies and understand their trade-offs
Build a production-ready Q&A system over custom documents
Evaluate and improve RAG quality

Section 1: The RAG Problem Statement

Large language models have two fundamental limitations for production use:

Knowledge cutoff — they don't know about events after training
No access to private data — they can't answer questions about your documents, codebase, customer records, or internal systems

Retrieval-Augmented Generation (RAG) addresses both limitations in three steps:

Converting your documents into vector embeddings and storing them in a vector database
At query time, retrieving the most relevant document chunks
Injecting those chunks as context into the LLM's prompt

The LLM still does the reasoning and synthesis. The difference is that it works from facts retrieved from your documents instead of relying only on what it memorized during training.

User Query → Embed Query → Search Vector DB → Retrieve Top-K Chunks
    → Build Prompt [chunks + query] → LLM → Grounded Answer
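
The flow above can be sketched end to end without any external service. This is a toy illustration only: `toy_embed` is a hypothetical stand-in that counts words instead of calling a real embedding model, the three sample chunks are made up, and the final LLM call is left out.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def toy_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Embed the query, score every chunk, and keep the top k."""
    query_vec = toy_embed(query)
    ranked = sorted(
        chunks,
        key=lambda chunk: toy_similarity(query_vec, toy_embed(chunk)),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query: str, retrieved: list[str]) -> str:
    """Inject the retrieved chunks as numbered context ahead of the question."""
    context = "\n\n".join(f"[Source {i}] {c}" for i, c in enumerate(retrieved, 1))
    return f"Context:\n{context}\n\nQuestion: {query}"


chunks = [
    "The refund window is 30 days from purchase.",
    "Support is available by email on weekdays.",
    "Refund requests need the original order number.",
]
query = "How do I request a refund?"
top = retrieve(query, chunks, k=2)
prompt = build_prompt(query, top)
print(prompt)  # this string is what would be sent to the LLM
```

Both refund-related chunks are retrieved and the support-hours chunk is left out. A real system swaps `toy_embed` for an embedding model and the linear scan for a vector database, but the control flow stays the same.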

Section 2: Vector Embeddings Deep Dive

An embedding is a dense vector (array of floating-point numbers) that represents the semantic meaning of text. Texts with similar meanings produce vectors that are geometrically close.

How Embeddings Work

Embedding models (distinct from generative LLMs) transform text into vectors:

# Using OpenAI embeddings (a common hosted choice; reads OPENAI_API_KEY from the environment)
from openai import OpenAI

client = OpenAI()

def embed(text: str, model: str = "text-embedding-3-small") -> list[float]:
    """Convert text to embedding vector."""
    response = client.embeddings.create(
        input=text,
        model=model
    )
    return response.data[0].embedding

# Example
vector = embed("machine learning for cybersecurity")
print(f"Vector dimensions: {len(vector)}")   # 1536 for text-embedding-3-small
print(f"First 5 values: {vector[:5]}")

Cosine Similarity

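Similarity is measured by the cosine of the angle between two vectors: values near 1 mean the vectors point in the same direction (similar meaning), and values near 0 mean the texts are unrelated.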

import numpy as np

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Calculate cosine similarity between two vectors."""
    a, b = np.array(a), np.array(b)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Semantic similarity test
v1 = embed("deep learning neural networks")
v2 = embed("artificial intelligence machine learning")
v3 = embed("cooking pasta recipes")

print(f"AI vs AI: {cosine_similarity(v1, v2):.3f}")  # expect the higher score (related topics)
print(f"AI vs cooking: {cosine_similarity(v1, v3):.3f}")  # expect a clearly lower score (unrelated topics)
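
To see what the formula does without calling an embedding API, here is a small worked example on hand-picked 2D vectors. The vectors are invented for illustration, and the function uses only the standard library in place of NumPy.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Dot product divided by the product of the vector lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 2.0], [2.0, 4.0])   # second vector is a scaled copy
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])       # nothing in common
opposite = cosine([1.0, 2.0], [-1.0, -2.0])       # mirrored direction

print(round(same_direction, 6), round(orthogonal, 6), round(opposite, 6))
# 1.0 0.0 -1.0
```

The first case shows that cosine similarity ignores vector length and compares direction only. Vector stores configured for cosine typically report a distance of 1 minus this value, so a smaller distance means a closer match.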

Embedding Model Comparison

Model	Dimensions	Cost/1M tokens	Best For
text-embedding-3-small	1536	$0.02	General use, cost-efficient
text-embedding-3-large	3072	$0.13	Maximum accuracy
text-embedding-ada-002	1536	$0.10	Legacy, still common
voyage-2 (Voyage AI)	1024	$0.10	RAG with Claude (Voyage AI is a separate provider that Anthropic's docs point to)
BAAI/bge-large (local)	1024	Free	Self-hosted, no API cost
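
The cost column translates into a quick budget estimate. The sketch below uses the per-million-token prices from the table above; the corpus size (20,000 chunks of about 500 tokens each) is an assumed example, and current pricing should be checked before relying on the numbers.

```python
# Prices in USD per 1M tokens, copied from the comparison table above.
PRICE_PER_MILLION_TOKENS = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
    "text-embedding-ada-002": 0.10,
}


def estimate_embedding_cost(num_chunks: int, avg_tokens_per_chunk: int, model: str) -> float:
    """One-time cost of embedding a corpus with the given model."""
    total_tokens = num_chunks * avg_tokens_per_chunk
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS[model]


# Assumed corpus: 20,000 chunks x 500 tokens = 10M tokens.
for model in PRICE_PER_MILLION_TOKENS:
    cost = estimate_embedding_cost(20_000, 500, model)
    print(f"{model}: ${cost:.2f}")
```

For this assumed corpus the estimates come to $0.20, $1.30, and $1.00 respectively. Embedding is usually a small part of a RAG budget compared with generation, so model choice tends to hinge on retrieval quality more than price.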

Section 3: ChromaDB — Local Vector Store

ChromaDB is one of the easiest ways to start with RAG: it runs locally and needs no external services or API keys.

Installation

pip install chromadb sentence-transformers

Basic Operations

import chromadb
from chromadb.utils import embedding_functions

# Initialize client (persisted to disk)
client = chromadb.PersistentClient(path="./chroma_db")

# Create embedding function (uses local model, no API key needed)
embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

# Create or get collection
collection = client.get_or_create_collection(
    name="technodex_docs",
    embedding_function=embedding_fn,
    metadata={"hnsw:space": "cosine"}  # Use cosine distance
)

# Add documents
collection.add(
    documents=[
        "Python is a high-level programming language known for its readability.",
        "Kali Linux is a Debian-based distribution for penetration testing.",
        "SQL injection is a web attack that exploits unsanitized database queries.",
        "Docker containers isolate applications with their dependencies.",
        "HTTPS uses TLS to encrypt web traffic between client and server."
    ],
    ids=["doc1", "doc2", "doc3", "doc4", "doc5"],
    metadatas=[
        {"course": "python", "stage": 1},
        {"course": "kali", "stage": 2},
        {"course": "ethical-hacking", "stage": 7},
        {"course": "devops", "stage": 3},
        {"course": "networking", "stage": 4}
    ]
)

# Query
results = collection.query(
    query_texts=["how to protect web applications from attacks"],
    n_results=3
)

for doc, meta, distance in zip(
    results["documents"][0],
    results["metadatas"][0],
    results["distances"][0]
):
    print(f"Distance: {distance:.3f} | Course: {meta['course']}")
    print(f"  {doc[:100]}...")
    print()
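
Chroma's query result is a dictionary of parallel, nested lists with one inner list per query text, which is why the loop above indexes `[0]`. A small helper can flatten that shape into one record per hit. The helper name `flatten_query_result` and the sample values are illustrative assumptions; the sample dictionary only mimics the result shape so the sketch runs without a database.

```python
def flatten_query_result(results: dict, query_index: int = 0) -> list[dict]:
    """Turn the parallel lists for one query into a list of per-hit records."""
    ids = results["ids"][query_index]
    docs = results["documents"][query_index]
    metas = results["metadatas"][query_index]
    dists = results["distances"][query_index]
    return [
        {
            "id": doc_id,
            "text": doc,
            "metadata": meta,
            "distance": dist,
            "similarity": 1.0 - dist,  # valid for collections using cosine distance
        }
        for doc_id, doc, meta, dist in zip(ids, docs, metas, dists)
    ]


# Hand-built sample in the same shape as a query result (values are made up).
sample = {
    "ids": [["doc3", "doc5"]],
    "documents": [["SQL injection is a web attack...", "HTTPS uses TLS..."]],
    "metadatas": [[{"course": "ethical-hacking"}, {"course": "networking"}]],
    "distances": [[0.25, 0.5]],
}

rows = flatten_query_result(sample)
for row in rows:
    print(row["id"], row["metadata"]["course"], row["similarity"])
```

Working with flat records makes later steps such as filtering by a distance threshold or building prompt context simpler to read.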

Section 4: Document Ingestion Pipeline

Chunking Strategies

Chunking splits large documents into pieces small enough for context windows but large enough to contain useful information.

from dataclasses import dataclass
from typing import Optional
import re

@dataclass
class Chunk:
    text: str
    source: str
    chunk_index: int
    total_chunks: int
    metadata: dict


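def chunk_by_size(
    text: str,
    source: str,
    chunk_size: int = 500,
    overlap: int = 50,
    metadata: Optional[dict] = None
) -> list[Chunk]:
    """Split text into fixed-size chunks (measured in words) with overlap."""
    # An overlap >= chunk_size would never advance and loop forever
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be non-negative and smaller than chunk_size")

    words = text.split()
    chunks = []

    start = 0
    while start < len(words):
        end = start + chunk_size
        chunk_text = " ".join(words[start:end])
        chunks.append(Chunk(
            text=chunk_text,
            source=source,
            chunk_index=len(chunks),
            total_chunks=-1,  # Filled in once all chunks exist
            metadata=metadata or {}
        ))
        if end >= len(words):
            break  # Last chunk reached; avoid a trailing chunk made only of overlap
        start += chunk_size - overlap

    for chunk in chunks:
        chunk.total_chunks = len(chunks)

    return chunks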


def chunk_by_heading(text: str, source: str, metadata: Optional[dict] = None) -> list[Chunk]:
    """Split markdown by headings (levels 1-3), preserving document structure."""
    # Split before each heading line and drop empty sections up front,
    # so chunk_index and total_chunks stay consistent with what is returned.
    sections = [s.strip() for s in re.split(r'\n(?=#{1,3} )', text) if s.strip()]

    return [
        Chunk(
            text=section,
            source=source,
            chunk_index=i,
            total_chunks=len(sections),
            metadata=metadata or {}
        )
        for i, section in enumerate(sections)
    ]


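def chunk_by_sentence(
    text: str,
    source: str,
    max_sentences: int = 5,
    overlap_sentences: int = 1,
    metadata: Optional[dict] = None
) -> list[Chunk]:
    """Sentence-boundary chunking, often better for factual Q&A."""
    # An overlap >= max_sentences would never advance and loop forever
    if not 0 <= overlap_sentences < max_sentences:
        raise ValueError("overlap_sentences must be non-negative and smaller than max_sentences")

    # Naive splitter: breaks after ., ! or ? followed by whitespace
    sentences = [s for s in re.split(r'(?<=[.!?])\s+', text) if s.strip()]
    chunks = []

    start = 0
    while start < len(sentences):
        end = min(start + max_sentences, len(sentences))
        chunk_text = " ".join(sentences[start:end])
        chunks.append(Chunk(
            text=chunk_text,
            source=source,
            chunk_index=len(chunks),
            total_chunks=-1,  # Filled in once all chunks exist
            metadata=metadata or {}
        ))
        if end == len(sentences):
            break  # Last chunk reached; avoid a trailing chunk made only of overlap
        start += max_sentences - overlap_sentences

    for chunk in chunks:
        chunk.total_chunks = len(chunks)

    return chunks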
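
The window arithmetic behind overlapping chunks is easier to check in isolation. The sketch below computes only the start and end positions that a sliding window would cover; `window_bounds` is a hypothetical helper written for this illustration, not part of the pipeline.

```python
def window_bounds(n_items: int, size: int, overlap: int) -> list[tuple[int, int]]:
    """Return (start, end) index pairs for overlapping windows over n_items."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    bounds = []
    start = 0
    while start < n_items:
        end = min(start + size, n_items)
        bounds.append((start, end))
        if end == n_items:
            break  # the window already reaches the end of the input
        start += size - overlap  # each step advances by size minus overlap
    return bounds


print(window_bounds(1200, 500, 50))
# [(0, 500), (450, 950), (900, 1200)]
```

Each window starts 450 items after the previous one, so the last 50 items of one chunk reappear at the start of the next. That repetition is what lets a sentence straddling a boundary show up whole in at least one chunk.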

Chunking Strategy Selection Guide

Strategy	Use When	Avoid When
Fixed size	Fast ingestion, uniform documents	Sentence boundaries matter (it can split mid-sentence)
By heading	Markdown/structured docs	Plain prose without headings
By sentence	Q&A over facts	Very long paragraphs
Semantic	Retrieval quality matters most	Ingestion must be fast (it needs an extra model)
Recursive	Balanced default choice	Document structure is already known
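
The table lists recursive chunking without showing it. A minimal sketch of the idea, under the assumption that chunk size is measured in characters: split on the coarsest separator first (blank lines), and fall back to finer separators (newlines, sentence ends, spaces) only for pieces that are still too large. The function name, separator list, and size limit here are illustrative choices.

```python
def recursive_chunk(
    text: str,
    max_chars: int = 200,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split on coarse separators first; recurse with finer ones where needed."""
    text = text.strip()
    if len(text) <= max_chars:
        return [text] if text else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = f"{current}{sep}{piece}" if current else piece
        if len(candidate) <= max_chars:
            current = candidate  # keep merging small neighbours
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_chars:
            current = piece
        else:
            # This piece alone is too big: retry it with finer separators
            chunks.extend(recursive_chunk(piece, max_chars, finer))
    if current:
        chunks.append(current)
    return chunks


sample = "Alpha beta gamma. Delta epsilon.\n\nZeta eta theta iota kappa lambda mu."
for chunk in recursive_chunk(sample, max_chars=30):
    print(repr(chunk))
```

With a 30-character limit, the first paragraph is split at its sentence boundary and the second, which has no inner sentence break, is split between words. Note that the separator at a cut point is dropped, so a chunk that ends at a sentence cut loses its trailing period in this simple version.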

Full Ingestion Pipeline

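import hashlib
from pathlib import Path
from typing import Optional

import chromadb
from chromadb.utils import embedding_functions

# Uses Chunk, chunk_by_size, chunk_by_heading and chunk_by_sentence defined above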


class DocumentIngester:
    """Ingest documents into a ChromaDB collection."""
    
    def __init__(self, collection_name: str, db_path: str = "./chroma_db"):
        self.db = chromadb.PersistentClient(path=db_path)
        self.embedding_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
            model_name="all-MiniLM-L6-v2"
        )
        self.collection = self.db.get_or_create_collection(
            name=collection_name,
            embedding_function=self.embedding_fn,
            metadata={"hnsw:space": "cosine"}
        )
    
    def ingest_file(
        self,
        file_path: str,
        chunk_strategy: str = "size",
        chunk_size: int = 500,
        extra_metadata: Optional[dict] = None
    ) -> int:
        """Ingest a single file. Returns number of chunks added."""
        path = Path(file_path)
        
        with open(path, "r", encoding="utf-8") as f:
            text = f.read()
        
        metadata = {
            "filename": path.name,
            "file_type": path.suffix,
            **(extra_metadata or {})
        }
        
        if chunk_strategy == "heading" and path.suffix in [".md", ".txt"]:
            chunks = chunk_by_heading(text, str(path), metadata)
        elif chunk_strategy == "sentence":
            chunks = chunk_by_sentence(text, str(path), metadata=metadata)
        else:
            chunks = chunk_by_size(text, str(path), chunk_size, metadata=metadata)
        
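        # Nothing to store for empty files; skip the upsert call entirely
        if not chunks:
            print(f"Skipped {path.name}: no content to ingest")
            return 0

        # Generate deterministic IDs so re-ingesting a file overwrites its chunks
        ids = []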
        for chunk in chunks:
            content_hash = hashlib.md5(
                f"{file_path}:{chunk.chunk_index}:{chunk.text[:50]}".encode()
            ).hexdigest()[:12]
            ids.append(f"{path.stem}_{content_hash}")
        
        # Add to collection
        self.collection.upsert(
            documents=[c.text for c in chunks],
            ids=ids,
            metadatas=[{**c.metadata, "chunk_index": c.chunk_index, "source": c.source} 
                      for c in chunks]
        )
        
        print(f"Ingested {path.name}: {len(chunks)} chunks")
        return len(chunks)
    
    def ingest_directory(self, directory: str, pattern: str = "*.txt", **kwargs) -> int:
        """Ingest all matching files in a directory."""
        total = 0
        for file_path in Path(directory).glob(pattern):
            total += self.ingest_file(str(file_path), **kwargs)
        return total
    
    def search(self, query: str, n_results: int = 5, where: Optional[dict] = None) -> list[dict]:
        """Search collection and return results."""
        kwargs = {"query_texts": [query], "n_results": n_results}
        if where:
            kwargs["where"] = where
        
        results = self.collection.query(**kwargs)
        
        return [
            {
                "text": doc,
                "metadata": meta,
                "distance": dist
            }
            for doc, meta, dist in zip(
                results["documents"][0],
                results["metadatas"][0],
                results["distances"][0]
            )
        ]
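
The deterministic IDs are what make re-ingestion safe, and they also have a limit worth seeing. The sketch below mirrors the pipeline's ID scheme (file stem plus a short hash of path, chunk index, and the first 50 characters) but uses a plain dictionary as a hypothetical stand-in for the collection, so it runs without ChromaDB.

```python
import hashlib
from pathlib import Path


def chunk_id(file_path: str, chunk_index: int, text: str) -> str:
    """Same scheme as the ingester: file stem plus a 12-character content hash."""
    digest = hashlib.md5(
        f"{file_path}:{chunk_index}:{text[:50]}".encode()
    ).hexdigest()[:12]
    return f"{Path(file_path).stem}_{digest}"


store: dict[str, str] = {}  # stand-in for a collection keyed by ID


def upsert(file_path: str, chunk_texts: list[str]) -> None:
    """Insert or overwrite one entry per chunk, keyed by its deterministic ID."""
    for index, text in enumerate(chunk_texts):
        store[chunk_id(file_path, index, text)] = text


upsert("docs/intro.md", ["Python is readable.", "Docker isolates apps."])
upsert("docs/intro.md", ["Python is readable.", "Docker isolates apps."])
size_after_reingest = len(store)  # unchanged content maps to the same IDs

upsert("docs/intro.md", ["Python is very readable.", "Docker isolates apps."])
size_after_edit = len(store)      # the edited chunk gets a new ID; the old entry remains

print(size_after_reingest, size_after_edit)
# 2 3
```

Re-ingesting identical content is idempotent, but editing the start of a chunk leaves the old version behind as a stale entry. One way to handle this, if it matters for your data, is to delete a file's existing chunks (for example by filtering on the `source` metadata) before re-ingesting it.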

Section 5: Building a RAG Q&A System

import anthropic
from typing import Optional


class RAGSystem:
    """Complete RAG Q&A system over a document collection."""
    
    def __init__(self, ingester: DocumentIngester, llm_client: anthropic.Anthropic):
        self.ingester = ingester
        self.llm = llm_client
        self.conversation_history = []
    
    def answer(
        self,
        question: str,
        n_results: int = 5,
        course_filter: Optional[str] = None,
        show_sources: bool = True
    ) -> dict:
        """Answer a question using retrieved context."""
        
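        # Retrieve relevant chunks. course_filter only matches chunks that were
        # ingested with a "course" metadata key (e.g. via extra_metadata).
        where = {"course": course_filter} if course_filter else None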
        retrieved = self.ingester.search(question, n_results=n_results, where=where)
        
        if not retrieved:
            # Same keys as the normal return value, so callers can rely on the shape
            return {
                "answer": "I couldn't find relevant information in the knowledge base.",
                "sources": [],
                "retrieved_chunks": 0,
                "tokens_used": 0
            }
        
        # Build context from retrieved chunks
        context_parts = []
        for i, result in enumerate(retrieved, 1):
            meta = result["metadata"]
            source = meta.get("filename", meta.get("source", "Unknown"))
            context_parts.append(f"[Source {i}: {source}]\n{result['text']}")
        
        context = "\n\n---\n\n".join(context_parts)
        
        # Build prompt
        system_prompt = """You are a knowledgeable technical assistant for TechNodeX, an online learning platform.

Answer questions based ONLY on the provided context. If the context doesn't contain enough information to answer confidently, say so clearly.

Rules:
- Base your answer exclusively on the provided context
- Cite source numbers when referencing specific information [Source N]
- If the context is contradictory, note the contradiction
- Keep answers focused and precise
- Use code examples when relevant"""
        
        user_message = f"""Context from knowledge base:
{context}

---

Question: {question}"""
        
        # Call LLM
        response = self.llm.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": user_message}]
        )
        
        answer = response.content[0].text
        
        sources = [
            {
                "file": r["metadata"].get("filename", "unknown"),
                "chunk": r["metadata"].get("chunk_index", 0),
                "relevance": 1 - r["distance"]
            }
            for r in retrieved
        ]
        
        return {
            "answer": answer,
            "sources": sources,
            "retrieved_chunks": len(retrieved),
            "tokens_used": response.usage.input_tokens + response.usage.output_tokens
        }
    
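    def chat(self, question: str, **kwargs) -> str:
        """Answer a question, print it with its sources, and record the turn."""
        # Retrieval is still single-turn here. For true multi-turn behavior,
        # fold recent turns from self.conversation_history into the retrieval
        # query so follow-ups ("What about the previous topic?") resolve correctly.
        result = self.answer(question, **kwargs)
        self.conversation_history.append({"role": "user", "content": question})
        self.conversation_history.append({"role": "assistant", "content": result["answer"]})
        print(f"\n{result['answer']}")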
        
        if result["sources"]:
            print(f"\n[Retrieved {result['retrieved_chunks']} chunks from: "
                  f"{', '.join(set(s['file'] for s in result['sources']))}]")
        
        return result["answer"]


# Usage example
def build_course_qa_system(docs_directory: str) -> RAGSystem:
    """Build a RAG system over TechNodeX course documents."""
    
    ingester = DocumentIngester("technodex_courses")
    ingester.ingest_directory(docs_directory, pattern="*.md", chunk_strategy="heading")
    
    llm = anthropic.Anthropic()
    rag = RAGSystem(ingester, llm)
    
    return rag

# Interactive Q&A
if __name__ == "__main__":
    rag = build_course_qa_system("./course_docs")
    
    print("TechNodeX Course Assistant (type 'quit' to exit)")
    print("=" * 50)
    
    while True:
        question = input("\nYour question: ").strip()
        if question.lower() == "quit":
            break
        if question:
            rag.chat(question)
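
Follow-up questions such as "How do I prevent it?" retrieve poorly on their own because the pronoun carries no searchable meaning. One simple option, sketched below, is to prepend the last few user turns to the retrieval query. The function name, the history format (a list of role/content dictionaries), and the two-turn default are assumptions for illustration.

```python
def build_retrieval_query(question: str, history: list[dict], max_turns: int = 2) -> str:
    """Prefix the question with recent user turns so follow-ups keep their topic."""
    if max_turns <= 0:
        return question
    user_turns = [turn["content"] for turn in history if turn["role"] == "user"]
    return " ".join(user_turns[-max_turns:] + [question])


history = [
    {"role": "user", "content": "What is SQL injection?"},
    {"role": "assistant", "content": "It is an attack on unsanitized database queries."},
]

query = build_retrieval_query("How do I prevent it?", history)
print(query)
# What is SQL injection? How do I prevent it?
```

Only the retrieval query is expanded; the original question is still what goes into the prompt. A more robust alternative is to ask the LLM to rewrite the follow-up as a standalone question before retrieval, at the cost of an extra call.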

Section 6: Pinecone — Production Vector Store

For production systems with large datasets, use a managed vector database.

pip install pinecone  # older releases were published as pinecone-client

from pinecone import Pinecone, ServerlessSpec
import os

# Initialize
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create index (one-time setup)
index_name = "technodex-courses"

if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=1536,  # text-embedding-3-small dimensions
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

index = pc.Index(index_name)

# Upsert vectors
# `chunks` is a list[Chunk] produced by one of the Section 4 chunkers
vectors_to_upsert = []
for chunk in chunks:
    embedding = embed(chunk.text)  # From Section 2
    vectors_to_upsert.append({
        # Include the source so chunk indexes from different files don't collide
        "id": f"{chunk.source}#{chunk.chunk_index}",
        "values": embedding,
        "metadata": {
            "text": chunk.text,
            "source": chunk.source,
            "chunk_index": chunk.chunk_index
        }
    })

# Upsert in batches of 100 to keep each request small
batch_size = 100
for i in range(0, len(vectors_to_upsert), batch_size):
    batch = vectors_to_upsert[i:i + batch_size]
    index.upsert(vectors=batch)

# Query
query_embedding = embed("how do I configure Kali Linux")

results = index.query(
    vector=query_embedding,
    top_k=5,
    include_metadata=True
)

for match in results.matches:
    print(f"Score: {match.score:.3f}")
    print(f"Source: {match.metadata['source']}")
    print(f"Text: {match.metadata['text'][:100]}...")
    print()
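
Pinecone reports a similarity score (higher is better), while the ChromaDB search in Section 4 reports a distance (lower is better). If the same downstream code should work with either store, a small adapter can convert matches into the text/metadata/distance records used earlier. The adapter name is an assumption, and `SimpleNamespace` objects stand in for real matches so the sketch runs without an API key.

```python
from types import SimpleNamespace


def pinecone_matches_to_results(matches) -> list[dict]:
    """Convert matches (id, score, metadata) into text/metadata/distance records."""
    results = []
    for match in matches:
        metadata = dict(match.metadata or {})  # copy so the match is not mutated
        text = metadata.pop("text", "")        # text was stored inside metadata
        results.append({
            "text": text,
            "metadata": metadata,
            "distance": 1.0 - match.score,     # assumes a cosine-metric index
        })
    return results


# Stand-in for results.matches (values are made up for illustration)
fake_matches = [
    SimpleNamespace(
        id="kali.md#0",
        score=0.75,
        metadata={"text": "Kali Linux is a Debian-based distribution.",
                  "source": "kali.md", "chunk_index": 0},
    ),
]

converted = pinecone_matches_to_results(fake_matches)
print(converted[0]["distance"], converted[0]["metadata"]["source"])
# 0.25 kali.md
```

Keeping one record shape for both stores means the prompt-building and source-reporting code does not need to know which database answered the query.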

Section 7: RAG Quality Improvement

Common RAG Failure Modes

Problem	Symptom	Solution
Chunk too small	Missing context	Increase chunk size or overlap
Chunk too large	LLM ignores most of chunk	Decrease chunk size
Wrong retrieval	Retrieved chunks don't contain the answer	Rewrite or expand the query, try hybrid search
Hallucination despite retrieval	LLM adds information not in context	Stricter system prompt, temperature=0
Irrelevant chunks	Relevant chunks are retrieved but ranked below off-topic ones	Add a reranker model (cross-encoder)
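
The reranking fix in the last row follows a retrieve-wide-then-narrow pattern: fetch more candidates than needed, rescore each one against the query, and keep the best few. The sketch below shows only that pattern. Its scoring function is a deliberately simple term-overlap stand-in, not a cross-encoder; in practice `score_fn` would wrap a real reranker model. All names are illustrative.

```python
import re
from typing import Callable


def term_overlap_score(query: str, text: str) -> float:
    """Stand-in scorer: fraction of query terms that appear in the text."""
    query_terms = set(re.findall(r"[a-z0-9]+", query.lower()))
    text_terms = set(re.findall(r"[a-z0-9]+", text.lower()))
    return len(query_terms & text_terms) / len(query_terms) if query_terms else 0.0


def rerank(
    query: str,
    candidates: list[dict],
    score_fn: Callable[[str, str], float] = term_overlap_score,
    top_n: int = 3,
) -> list[dict]:
    """Rescore candidates against the query and keep the top_n best."""
    scored = [{**c, "rerank_score": score_fn(query, c["text"])} for c in candidates]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]


candidates = [
    {"text": "Docker containers isolate applications."},
    {"text": "SQL injection exploits unsanitized database queries."},
    {"text": "Use parameterized queries to prevent SQL injection."},
]

best = rerank("prevent sql injection", candidates, top_n=2)
for item in best:
    print(f"{item['rerank_score']:.2f}  {item['text']}")
```

The chunk that actually explains prevention moves to the top even though it was last in the candidate list. A cross-encoder does the same job with far better judgment, because it reads the query and the chunk together.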

Hybrid Search

Combine semantic search (embeddings) with keyword search (BM25) for better recall:

# Conceptual hybrid search pattern.
# bm25_search is a placeholder for your own keyword index; it is assumed
# to return a ranked list of document IDs from the same collection.
def hybrid_search(query: str, collection, n_results: int = 5) -> list[dict]:
    """Combine semantic and keyword search with Reciprocal Rank Fusion."""

    # Semantic search: Chroma returns one ranked list of IDs per query text
    semantic_results = collection.query(query_texts=[query], n_results=n_results * 2)
    semantic_ids = semantic_results["ids"][0]

    # BM25 keyword search (simplified)
    # In production: use Elasticsearch, Weaviate hybrid, or Pinecone sparse-dense
    keyword_ids = bm25_search(query, n_results=n_results * 2)

    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document
    scores = {}
    k = 60  # RRF constant
    for ranked_ids in (semantic_ids, keyword_ids):
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0) + 1 / (k + rank)

    # Sort by combined score, keep the top N, then fetch their text
    sorted_ids = sorted(scores, key=scores.get, reverse=True)[:n_results]
    fetched = collection.get(ids=sorted_ids)
    text_by_id = dict(zip(fetched["ids"], fetched["documents"]))
    return [
        {"id": doc_id, "text": text_by_id.get(doc_id), "score": scores[doc_id]}
        for doc_id in sorted_ids
    ]
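
Reciprocal Rank Fusion is easier to trust once you see it on concrete numbers. The sketch below is a self-contained version of the fusion step alone, run on two made-up rankings; the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked ID lists: each list adds 1 / (k + rank) to a document's score."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


semantic = ["doc3", "doc5", "doc1"]  # ranked by embedding similarity
keyword = ["doc5", "doc2", "doc3"]   # ranked by keyword match

fused = reciprocal_rank_fusion([semantic, keyword])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.4f}")
# doc5: 0.0325
# doc3: 0.0323
# doc2: 0.0161
# doc1: 0.0159
```

`doc5` wins because it ranks first and second across the two lists, edging out `doc3` at first and third. Documents found by only one method still appear, but lower. Because RRF uses ranks and ignores raw scores, it can combine cosine similarities and BM25 scores without any normalization.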

Checkpoint Assessment

What two problems does RAG solve that a plain LLM cannot?
Explain cosine similarity in plain language. What does a similarity of 0 mean? Of 1?
A user reports that the RAG system sometimes answers with information that contradicts the source documents. What is the most likely cause?
Why is chunk overlap important? What happens without it?
Your document collection is 50,000 pages of legal contracts. You need to answer questions about specific clauses. What chunking strategy would you use and why?
What is hybrid search and when would you prefer it over pure semantic search?

Project: TechNodeX Course Q&A Bot

Build a complete RAG system that:

Ingests the TechNodeX course markdown files (download from GitHub)
Stores them in ChromaDB with course-aware metadata
Handles multi-turn conversation (remember what was asked earlier)
Filters results by course when the user specifies one
Reports sources with each answer
Evaluates itself: generate 10 test questions with known answers, measure how often the correct answer appears in the top-3 retrieved chunks (Recall@3)
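
For the self-evaluation requirement, Recall@3 can be computed with a few lines once each test question is paired with the ID of the chunk that contains its answer. The sketch below assumes exactly one correct chunk per question; the function name and the sample IDs are illustrative.

```python
def recall_at_k(retrieved_ids: list[list[str]], relevant_ids: list[str], k: int = 3) -> float:
    """Fraction of questions whose correct chunk ID appears in the top k results."""
    if len(retrieved_ids) != len(relevant_ids):
        raise ValueError("need one retrieved list per test question")
    if not relevant_ids:
        return 0.0
    hits = sum(
        1 for retrieved, relevant in zip(retrieved_ids, relevant_ids)
        if relevant in retrieved[:k]
    )
    return hits / len(relevant_ids)


# Four test questions: ranked IDs returned by the retriever for each one...
retrieved = [
    ["c1", "c7", "c9"],
    ["c4", "c2", "c8"],
    ["c5", "c6", "c3", "c0"],
    ["c9", "c8", "c7"],
]
# ...and the ID of the chunk that actually contains each answer.
expected = ["c1", "c2", "c0", "c7"]

score = recall_at_k(retrieved, expected, k=3)
print(f"Recall@3: {score:.2f}")
# Recall@3: 0.75
```

The third question counts as a miss because its correct chunk sits at position four. Tracking this number while changing chunk size, overlap, or the embedding model shows whether a change actually improved retrieval.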

What's Next

Stage 5 covers AI Agents and Agentic Workflows — building systems where LLMs take sequences of actions, call tools, make decisions, and collaborate with other agents to complete complex tasks.
