Stage 01 — How LLMs Work

How LLMs Work  ·  Comprehensive Technical Training  ·  ⏱ 4–6 hours

Learning Objectives

By the end of this stage you will be able to:

  • Explain what a Large Language Model is and understand the transformer architecture conceptually
  • Describe how tokens work and why context windows matter for model capabilities
  • Understand the training pipeline: pre-training, supervised fine-tuning (SFT), RLHF, and Constitutional AI
  • Compare major model families (GPT, Claude, Gemini, Llama, Mistral) and their use cases
  • Recognize common LLM failure modes (hallucinations, knowledge cutoffs, context drift) and mitigations
  • Evaluate models using benchmarks and vibes-based assessment
  • Build mental models that will serve as the foundation for the entire course

Section 1: What Is an LLM?

A Large Language Model is a neural network trained on massive amounts of text to predict the next token in a sequence. The "large" refers both to the scale of training data (trillions of tokens) and the number of model parameters (billions to trillions). Think of an LLM as a sophisticated autocomplete system—given a sequence of text, it predicts the most statistically likely continuation, one token at a time.

What is remarkable is that this simple objective produces behavior that looks like reasoning, summarizing, translating, coding, and conversing. None of this requires the model to "understand" anything in the human sense. Rather, it has learned the statistical relationships between words and concepts from its training data. When an LLM confidently produces false information (a hallucination), it is not lying: it is generating text that was probable under its training distribution, regardless of accuracy.
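The autocomplete analogy can be made concrete with a deliberately tiny model. The sketch below is not how an LLM is implemented; it is a word-level bigram counter (an assumption chosen for brevity) that shows the same idea of predicting the most likely continuation from training text, one step at a time. The corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

def train_bigram(text):
    """Count how often each word follows each other word."""
    words = text.split()
    follows = defaultdict(Counter)
    for current, nxt in zip(words, words[1:]):
        follows[current][nxt] += 1
    return follows

def autocomplete(follows, start, max_words=5):
    """Greedily append the most frequent next word, one word at a time."""
    out = [start]
    for _ in range(max_words):
        candidates = follows.get(out[-1])
        if not candidates:
            break  # this word was never followed by anything in training
        out.append(candidates.most_common(1)[0][0])
    return " ".join(out)

corpus = "the cat sat on the mat . the cat sat on the sofa ."
model = train_bigram(corpus)
print(autocomplete(model, "the", max_words=3))
```

The model continues "the" with whatever followed it most often in the corpus, whether or not the result is true. An LLM replaces the counting table with a neural network and words with tokens, but the "probable, not verified" property carries over.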

Why This Matters for Your Work

Understanding that LLMs are pattern-completion systems fundamentally changes how you use them. It explains why they hallucinate, why prompt engineering works, why context windows matter, and why you can't rely on them as sole sources of truth. But it also explains their incredible power—by modeling statistical relationships across billions of examples, they learn deep patterns about language, code, logic, and the world.


Section 2: The Transformer Architecture

Modern LLMs are built on the Transformer architecture, introduced in 2017 by Vaswani et al. in the paper "Attention Is All You Need." Transformers revolutionized sequence modeling by replacing recurrent layers (which process one token at a time) with self-attention mechanisms that process entire sequences in parallel.

Tokenization: Text as Numbers

Before an LLM can process text, it must convert text into numbers:

  1. Tokenization: Text is split into tokens (roughly 3–4 characters each on average for English)
  2. Token ID mapping: Each token is assigned an integer ID from a fixed vocabulary (GPT-4's tokenizer has roughly 100,000 entries)
  3. Embedding: Each token ID is converted to a high-dimensional vector (e.g., 4,096 dimensions in Llama 2 7B; GPT-4's embedding size has not been published)

Example (IDs from the GPT-2 tokenizer): "Hello, world!" → ["Hello", ",", " world", "!"] → [15496, 11, 995, 0] → [dense vectors]
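The three conversion steps can be sketched in a few lines. Everything here is a toy assumption: the vocabulary is hand-written, the IDs are arbitrary, greedy longest-match stands in for a learned scheme such as byte-pair encoding, and the embedding vectors are random rather than trained.

```python
import random

# Hypothetical toy vocabulary: token string -> integer ID.
# Real tokenizers learn their vocabulary from data.
VOCAB = {"Hello": 0, ",": 1, " world": 2, "!": 3, "un": 4, "believ": 5, "able": 6}
EMBED_DIM = 4  # real models use thousands of dimensions

def tokenize(text):
    """Greedy longest-match split of text into tokens from VOCAB."""
    tokens, i = [], 0
    while i < len(text):
        match = None
        for j in range(len(text), i, -1):  # try the longest piece first
            if text[i:j] in VOCAB:
                match = text[i:j]
                break
        if match is None:
            raise ValueError(f"no token covers {text[i]!r}")
        tokens.append(match)
        i += len(match)
    return tokens

rng = random.Random(0)
# One small random vector per token ID; in a real model these are learned.
EMBEDDINGS = [[rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for _ in VOCAB]

tokens = tokenize("Hello, world!")          # step 1: text -> tokens
ids = [VOCAB[t] for t in tokens]            # step 2: tokens -> integer IDs
vectors = [EMBEDDINGS[i] for i in ids]      # step 3: IDs -> dense vectors
print(tokens, ids, len(vectors[0]))
```

Note that the space belongs to the token " world" rather than being a separate token, which mirrors how many real tokenizers treat leading whitespace.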

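Critical insight: Tokens are not words. A word like "unbelievable" may be split into several pieces, for example ["un", "bel", "iev", "able"] (4 tokens); the exact split depends on the tokenizer. Non-English text often requires more tokens for the same content. Because both pricing and context limits are counted in tokens, long documents cost more and token limits constrain what models can process.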
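A rough back-of-the-envelope calculation shows how document length turns into tokens and cost. This is only a sketch: the 4-characters-per-token ratio is the heuristic from this section, and the price is a hypothetical placeholder, not any provider's actual rate. For real numbers, count tokens with the provider's own tokenizer.

```python
import math

def estimate_tokens(text, chars_per_token=4.0):
    """Rough token count from character length (heuristic, English-centric)."""
    return math.ceil(len(text) / chars_per_token)

def estimate_cost(n_tokens, usd_per_million_tokens):
    """Cost of processing n_tokens at a given (hypothetical) price."""
    return n_tokens / 1_000_000 * usd_per_million_tokens

document = "word " * 50_000          # 250,000 characters
n = estimate_tokens(document)        # 62,500 tokens at 4 chars/token
print(n, estimate_cost(n, usd_per_million_tokens=3.0))
```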

Self-Attention: The Core Innovation

Self-attention allows each token to "attend to" (look at) other tokens in the input, computing a relevance weight for each. In GPT-style decoder models, a token attends only to itself and the tokens before it. When processing the word "bank" in "I deposited money at the bank," the attention mechanism looks at the preceding words and determines that "deposited" and "money" are highly relevant (suppressing the river interpretation).

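This happens in parallel across multiple "attention heads," each learning different patterns: grammatical structure, semantic relationships, coreference, etc. Because attention over a whole sequence is computed with matrix operations rather than a step-by-step recurrence, transformers train efficiently on parallel hardware and scale better than recurrent models.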
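To make the relevance-weight idea concrete, here is a minimal single-head sketch of scaled dot-product attention in plain Python. It makes several simplifying assumptions: the query, key, and value vectors are all just the input vectors (real layers apply learned projection matrices first), the 2-dimensional vectors are hand-picked for illustration, and a causal mask is applied so each token only sees itself and earlier tokens, as in GPT-style models.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors, causal=True):
    """Single-head scaled dot-product attention with Q = K = V = vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for i, query in enumerate(vectors):
        # With a causal mask, token i may only look at tokens 0..i.
        visible = vectors[: i + 1] if causal else vectors
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in visible
        ]
        weights = softmax(scores)
        # Output is the weighted average of the visible value vectors.
        mixed = [sum(w * v[j] for w, v in zip(weights, visible)) for j in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Hand-picked toy vectors: the first axis loosely means "finance-related".
words = ["deposited", "money", "at", "bank"]
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 0.1], [1.0, 0.2]]
outputs, weights = self_attention(toy)
for word, w in zip(words, weights[3]):
    print(f"bank -> {word}: {w:.2f}")
```

In this toy setup "bank" gives "deposited" and "money" more weight than "at", because their vectors point in a similar direction. A multi-head layer runs several such computations side by side, each with its own projections.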

The Forward Pass (Simplified)

  1. Input text → tokenizer → token IDs
  2. Token IDs → embedding layer → dense vectors
  3. Vectors pass through N transformer blocks (stacked layers), each containing:
       • Self-attention: each token attends to the tokens it is allowed to see
       • Feed-forward network: per-token non-linear transformations
       • Layer normalization & residual connections: training stability
  4. Final layer produces logits (unnormalized scores) for every token in the vocabulary
  5. Softmax converts logits to probabilities
  6. Sample from the distribution (or take argmax for deterministic output)
  7. Append the sampled token and repeat

This loop repeats until a stop token is sampled or a maximum length is reached, so generation is sequential: one token per pass. Training is different. Because the full text is already known, the model predicts the next token at every position of a sequence in a single parallel pass.
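The generation loop above can be sketched end to end if the transformer itself is replaced by a stand-in. In the code below, `toy_logits` is a hypothetical function that plays the role of the embedding layer and transformer blocks; the four-entry vocabulary and the "<eos>" stop token are likewise invented for illustration. The parts that match the description are the softmax, the choice between argmax and sampling, and the append-and-repeat loop.

```python
import math
import random

VOCAB = ["<eos>", "the", "cat", "sat"]

def toy_logits(token_ids):
    """Hypothetical stand-in for the transformer: one score per vocabulary entry."""
    # Favor the vocabulary entry after the last token, wrapping around to <eos>.
    nxt = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == nxt else 0.0 for i in range(len(VOCAB))]

def softmax(logits):
    """Convert logits to probabilities that sum to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generate(prompt_ids, max_new_tokens=10, greedy=True, seed=0):
    rng = random.Random(seed)
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = softmax(toy_logits(ids))
        if greedy:
            next_id = max(range(len(probs)), key=probs.__getitem__)  # argmax
        else:
            next_id = rng.choices(range(len(probs)), weights=probs)[0]  # sample
        ids.append(next_id)
        if VOCAB[next_id] == "<eos>":  # stop token ends generation
            break
    return ids

print([VOCAB[i] for i in generate([1])])
```

With `greedy=True` the output is deterministic. With `greedy=False` the same prompt can produce different continuations, which is why the same question to an LLM does not always yield the same answer.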


Section 3: Training Phases

Phase 1: Pre-Training

A base model is trained on massive text corpora to predict the next token. This self-supervised approach requires no human labels, just "predict the next token." The model learns statistics from Common Crawl, books, code repositories, Wikipedia, and proprietary datasets. Training is enormously expensive: public estimates put GPT-3's training run at roughly $4–5M in compute and GPT-4's at $50–100M or more, though OpenAI has not published exact figures.

The result is a "foundation model"—powerful but raw, with no particular alignment toward being helpful or safe.
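The "predict the next token" objective is usually expressed as a cross-entropy loss: the model is penalized by the negative log of the probability it assigned to the token that actually came next. The sketch below computes that loss for hand-written probability tables; in real training the probabilities come from the model and the loss is minimized by gradient descent, which is not shown here.

```python
import math

def next_token_loss(predicted_probs, token_ids):
    """Average cross-entropy of the true next token at each position.

    predicted_probs[t] is the probability distribution over the vocabulary
    after seeing token_ids[: t + 1]; the target is token_ids[t + 1].
    """
    targets = token_ids[1:]
    losses = [-math.log(predicted_probs[t][target]) for t, target in enumerate(targets)]
    return sum(losses) / len(losses)

# Vocabulary of 3 tokens, sequence [0, 2, 1]: two prediction steps.
sequence = [0, 2, 1]
confident = [[0.05, 0.05, 0.90], [0.05, 0.90, 0.05]]  # puts 0.9 on the right token
uniform = [[1 / 3] * 3, [1 / 3] * 3]                  # has learned nothing yet
print(next_token_loss(confident, sequence))
print(next_token_loss(uniform, sequence))
```

A model that spreads probability evenly over the vocabulary gets a loss of log(vocabulary size); training drives the loss down by shifting probability toward the tokens that actually follow.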

Phase 2: Supervised Fine-Tuning (SFT)

The pre-trained model is fine-tuned on curated instruction-following examples:

User: Summarize this article in 3 bullet points...
Assistant: • Key point 1
• Key point 2
• Key point 3

SFT teaches the model to follow instructions, adopt a helpful tone, and format outputs usefully. The model learns the pattern of "assistant responding to user" and becomes more conversational.
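One way to picture an SFT training example is as a single token sequence plus a mask marking which tokens the model is trained to produce. The sketch below assumes a common (but not universal) setup in which only the assistant's tokens contribute to the loss. Whitespace splitting stands in for a real tokenizer, and the "User:" / "Assistant:" markers and "<eos>" token are a hypothetical chat template.

```python
def build_sft_example(user_text, assistant_text):
    """Return (tokens, loss_mask) for one instruction-following example."""
    prompt_tokens = ["User:"] + user_text.split() + ["Assistant:"]
    answer_tokens = assistant_text.split() + ["<eos>"]
    tokens = prompt_tokens + answer_tokens
    # 0 = context only, 1 = token the model is trained to predict.
    loss_mask = [0] * len(prompt_tokens) + [1] * len(answer_tokens)
    return tokens, loss_mask

tokens, loss_mask = build_sft_example(
    "Summarize this article in 3 bullet points",
    "• Key point 1 • Key point 2 • Key point 3",
)
for token, flag in zip(tokens, loss_mask):
    print(flag, token)
```

The training objective is still next-token prediction; what changes is the data. Every example has the shape "user asks, assistant answers," so that is the pattern the model learns to continue.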

Phase 3: RLHF (Reinforcement Learning from Human Feedback)

Humans compare pairs of model outputs: "This response is better than that one." A reward model is trained to predict these preferences. The LLM is then fine-tuned with reinforcement learning (commonly the PPO algorithm) to maximize the reward model's score.

RLHF pushes the instruction-tuned model toward responses that people rate as safe, helpful, and honest. It is a large part of why Claude, ChatGPT, and Gemini refuse harmful requests and maintain a consistent conversational style.
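The reward-model step can be illustrated with the pairwise loss that is commonly used for it. The sketch assumes a Bradley-Terry style formulation: the reward model outputs one scalar score per response, and the loss is small when the human-preferred response scores higher than the rejected one. The scores below are made-up numbers, and the reinforcement-learning stage itself is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    return math.log(1.0 + math.exp(-margin))

# Reward model agrees with the human ranking: low loss.
print(preference_loss(reward_chosen=2.0, reward_rejected=0.0))
# Reward model disagrees with the human ranking: high loss.
print(preference_loss(reward_chosen=0.0, reward_rejected=2.0))
```

Training the reward model means adjusting it so this loss goes down across many human-ranked pairs. Once trained, its score stands in for a human rater during the reinforcement-learning stage.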

Constitutional AI (Anthropic's Alternative)

Rather than relying purely on human feedback, Constitutional AI uses a written set of principles (a "constitution") in two ways:

  1. The model critiques and revises its own outputs based on constitutional principles
  2. AI feedback replaces some human feedback in preference ranking

This reduces the amount of human labeling required and makes the values being trained for explicit and inspectable.
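The critique-and-revise step can be sketched as a simple loop around a model call. This is an illustration of the idea only, not Anthropic's implementation: `stub_model` is a hypothetical stand-in that returns canned strings, and the prompt wording and the example principle are invented. In practice `model` would be a call to an actual LLM.

```python
def critique_and_revise(model, prompt, principles, rounds=1):
    """Have the model critique and rewrite its own draft against each principle.

    `model` is any callable that maps a prompt string to a response string.
    """
    draft = model(prompt)
    for _ in range(rounds):
        for principle in principles:
            critique = model(
                f"Critique this response against the principle '{principle}':\n{draft}"
            )
            draft = model(
                f"Rewrite the response to address the critique.\n"
                f"Critique: {critique}\nResponse: {draft}"
            )
    return draft

def stub_model(prompt):
    """Hypothetical stand-in for an LLM call; returns a canned string."""
    if prompt.startswith("Critique"):
        return "Too vague."
    if prompt.startswith("Rewrite"):
        return "A more specific answer."
    return "An answer."

print(critique_and_revise(stub_model, "Explain tokens.", ["Be specific"]))
```

The revised responses produced this way can then serve as fine-tuning data, so the principles shape the model without a human writing each improved answer.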


Section 4: Context Windows and Limitations

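The context window is the maximum number of tokens an LLM can process in one call, with input and output counted together. There is no persistent memory by default: anything the model should know must be inside the window. When a conversation exceeds the limit, the application has to drop, shorten, or summarize earlier messages, and the model no longer sees what was removed.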
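A chat application typically has to decide what to keep when the history no longer fits. The sketch below shows one simple policy, offered as an assumption rather than a standard: always keep the first (system) message, reserve room for the reply, and keep as many of the most recent messages as fit. The `count_tokens` argument is a hypothetical hook; here it is filled with a whitespace word count.

```python
def fit_to_context(messages, max_tokens, reserve_for_output, count_tokens):
    """Keep the system message plus the most recent messages that fit the budget."""
    system, history = messages[0], messages[1:]
    budget = max_tokens - reserve_for_output - count_tokens(system)
    kept = []
    for message in reversed(history):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break  # everything older than this is dropped
        kept.append(message)
        budget -= cost
    return [system] + kept[::-1]  # restore chronological order

def word_count(text):
    """Toy token counter: one token per whitespace-separated word."""
    return len(text.split())

conversation = ["sys prompt", "a b c", "d e", "f"]
print(fit_to_context(conversation, max_tokens=8, reserve_for_output=2, count_tokens=word_count))
```

With a budget of 8 tokens and 2 reserved for output, the oldest user message is dropped. From the model's point of view that message never existed, which is what "no persistent memory" means in practice.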

Current Context Window Sizes (2024–2025)

Model               Context   Cost Tier
GPT-4o              128K      Medium
Claude 3.5 Sonnet   200K      Medium
Claude 3 Opus       200K      Expensive
Gemini 1.5 Pro      1M        High
Llama 3.1 70B       128K      Self-hosted

The "Lost in the Middle" Problem

Models often perform worse on information placed in the middle of long contexts; information near the beginning and end tends to be retrieved more reliably. In a 200K-token prompt, put critical information at the start or end rather than burying it in the middle.
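One practical response is to control where material lands in the prompt. The sketch below assumes you already have passages sorted from most to least relevant (how they were ranked is out of scope here) and rearranges them so the strongest passages sit at the two edges and the weakest in the middle. The function name and the alternating scheme are illustrative choices, not an established recipe.

```python
def order_for_long_context(docs_by_relevance):
    """Place the most relevant items at the edges, least relevant in the middle.

    Input must be sorted most-relevant first.
    """
    front, back = [], []
    for rank, doc in enumerate(docs_by_relevance):
        # Alternate: rank 0 goes to the front, rank 1 to the back, and so on.
        (front if rank % 2 == 0 else back).append(doc)
    return front + back[::-1]

ranked = ["A", "B", "C", "D", "E"]  # "A" is the most relevant passage
print(order_for_long_context(ranked))
```

Here the two most relevant passages, "A" and "B", end up first and last, while the least relevant passage, "E", lands in the middle.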


Section 5: Major Model Families

OpenAI: GPT Series

  • GPT-4o: Multimodal (text, images, audio). 128K context. Excellent reasoning and code.
  • o1/o3: "Reasoning" models that think step-by-step internally. Better at math, code, complex logic.

Anthropic: Claude

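  • Claude 3.5 Sonnet: Fast, capable, balanced. Top choice for most tasks; Anthropic reported it outperforming Claude 3 Opus on most benchmarks.
  • Claude 3 Opus: Largest model in the Claude 3 family. Higher cost and slower, with strong reasoning.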

Google: Gemini

  • Gemini 1.5 Pro: 1M token context. Excellent for very long documents.

Open Source: Llama, Mistral, DeepSeek

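  • Llama 3.1 405B: Largest Llama 3.1 model and among the most capable openly available models at its release. Weights are downloadable, so it can be self-hosted.
  • Mistral Large: Developed by the French company Mistral AI; aimed at enterprise deployments.
  • DeepSeek: Strong reasoning (DeepSeek-R1), built on a mixture-of-experts architecture (DeepSeek-V3).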

Section 6: Failure Modes

Hallucinations

LLMs confidently generate false information because they optimize for plausible-sounding text, not truth. Hallucinations are common for:

  • Specific facts (dates, numbers, citations)
  • Obscure topics with little training data
  • Recent events after knowledge cutoff
  • Long lists of items

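Mitigation: Use retrieval-augmented generation (RAG) to ground models in verified sources, so that the answer is drawn from text placed in the prompt rather than from the model's memory. Ask models to cite sources, and check the citations. Never use LLMs as the sole source of truth.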
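The grounding idea can be sketched without any external services: pick the most relevant passages, paste them into the prompt, and instruct the model to answer only from them. The retriever below is a toy assumption that ranks by shared words; production systems typically use embedding-based search instead. The prompt wording and sample documents are invented for illustration, and the final call to an LLM is left out.

```python
def retrieve(question, documents, k=2):
    """Rank documents by how many question words they share (toy retriever)."""
    q_words = set(question.lower().split())
    ranked = sorted(
        documents,
        key=lambda d: len(q_words & set(d.lower().split())),
        reverse=True,
    )
    return ranked[:k]

def build_grounded_prompt(question, documents, k=2):
    """Build a prompt that asks for an answer based only on retrieved sources."""
    sources = retrieve(question, documents, k)
    numbered = "\n".join(f"[{i + 1}] {doc}" for i, doc in enumerate(sources))
    return (
        "Answer using only the sources below and cite them by number. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{numbered}\n\nQuestion: {question}"
    )

docs = [
    "The transformer paper was published in 2017",
    "Bananas are yellow",
    "Self-attention compares tokens",
]
print(build_grounded_prompt("when was the transformer paper published", docs, k=1))
```

Grounding does not make hallucination impossible, but it gives you something to check the answer against: each claim should trace back to a numbered source.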

Knowledge Cutoffs

Training data has a cutoff date. GPT-4o's cutoff is October 2023; Claude 3.5 Sonnet's is April 2024. Models know nothing about events after these dates unless that information is supplied in the prompt.

Context Drift

In very long conversations, models can forget earlier instructions or context. Instructions may be selectively followed—technically complying while violating intent.


Key Vocabulary

Term               Definition
Token              Basic unit of LLM input/output; ~3–4 characters on average
Embedding          Vector representation of a token in high-dimensional space
Self-attention     Mechanism where each token attends to other tokens in the sequence
Transformer        Neural network architecture using self-attention
Context window     Maximum tokens an LLM can process in one call
Pre-training       Self-supervised next-token prediction on a massive corpus
SFT                Supervised fine-tuning on instruction-following examples
RLHF               Training using human preference rankings to align behavior
Hallucination      Confident generation of false information
Foundation model   Large pre-trained model used as a base for downstream tasks

What's Next

Stage 2 explores choosing and running frontier models (GPT, Claude, Gemini) as well as open-source models locally with Ollama. You'll understand model selection criteria, benchmarks, and when to use which model for which task.
