Stage 01 — How LLMs Work

Concepts, Architecture & The Modern AI Landscape · Conceptual Foundation · ⏱ 4–6 hours

Learning Objectives

By the end of this stage you will be able to:

Explain what a Large Language Model is and how it differs from earlier AI approaches
Describe the transformer architecture at a conceptual level (without needing the math)
Understand tokens, context windows, and why they matter for how you use AI
Compare major model families: GPT-4, Claude, Gemini, Llama, Mistral
Identify the key training phases: pre-training, fine-tuning, RLHF, and constitutional AI
Recognize common failure modes: hallucinations, knowledge cutoffs, context loss
Know how to evaluate model outputs and when not to trust them

Section 1: What Is a Large Language Model?

A Large Language Model is a type of neural network trained on massive amounts of text to predict and generate human language. The "large" refers to both the scale of training data (often trillions of tokens) and the number of parameters in the model (billions to trillions of numerical weights).

The simplest mental model: an LLM is an extremely sophisticated autocomplete system. Given a sequence of text, it assigns a probability to every possible next token, picks one, and repeats, one token at a time. The emergent result is something that appears to reason, summarize, translate, code, and converse.

This distinction matters: LLMs don't "know" things the way humans do. They model the statistical relationships between words and concepts based on what appeared in their training data. When an LLM says something confidently, it's not because it verified it — it's because that pattern appeared frequently enough in training to become a strong prediction.
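The autocomplete picture can be made concrete with a toy sketch. This is not how an LLM is implemented (real models use neural networks over tokens, not word counts); it only illustrates "predict what comes next from statistics of the training text". The corpus and function names are made up for illustration.

```python
from collections import Counter, defaultdict

# Tiny made-up "training corpus" (illustrative only).
corpus = "the cat sat on the mat . the cat ate the fish . the dog sat on the rug ."

def train_bigram_counts(text):
    """Count how often each word follows each other word."""
    words = text.split()
    counts = defaultdict(Counter)
    for prev, nxt in zip(words, words[1:]):
        counts[prev][nxt] += 1
    return counts

def next_word_probs(counts, prev):
    """Turn raw follow-up counts into a probability distribution."""
    total = sum(counts[prev].values())
    return {word: n / total for word, n in counts[prev].items()}

counts = train_bigram_counts(corpus)
probs = next_word_probs(counts, "the")
print(probs)  # "cat" gets the highest probability because it followed "the" most often
```

The sketch "prefers" cat after "the" only because that pairing was frequent in the corpus, not because anything was verified.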

What LLMs Are NOT

Not a search engine — they don't look things up at inference time (unless explicitly given tools)
Not a database — they don't have reliable recall of specific facts
Not sentient — there is no understanding, experience, or awareness
Not deterministic — the same prompt can produce different outputs (by design)

Section 2: The Transformer Architecture

Modern LLMs are built on the Transformer architecture, introduced in the 2017 paper "Attention Is All You Need" by Vaswani et al. at Google. Before transformers, language models used recurrent networks (RNNs/LSTMs) that processed text sequentially — token by token. Transformers changed this by processing the entire input at once using a mechanism called self-attention.

Tokenization: How Text Becomes Numbers

Before an LLM can process text, it must be converted to numbers. This is done via tokenization:

Text is split into tokens — roughly 3–4 characters each on average (words, word parts, punctuation)
Each token is mapped to an integer ID from a vocabulary (GPT-4 uses ~100,000 tokens)
Each token ID is then converted to a high-dimensional vector called an embedding

Example tokenization (IDs shown are from the GPT-2 vocabulary; other tokenizers assign different IDs):

"Hello, world!" → ["Hello", ",", " world", "!"] → [15496, 11, 995, 0]

The word "unbelievable" might become ["un", "bel", "iev", "able"]: 4 tokens for a single word. Because models read and generate text token by token, API providers bill by token count, which is why long documents cost more.
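A minimal sketch of subword splitting, assuming a greedy longest-match rule over a small hypothetical vocabulary. Production tokenizers (for example byte-pair encoding) learn their vocabulary from data and use a different procedure; the vocabularies and IDs here are invented purely to show that the same word can split into a different number of tokens.

```python
# Hypothetical vocabularies: token string -> integer ID (IDs are invented).
VOCAB_A = {"un": 0, "bel": 1, "iev": 2, "able": 3}
VOCAB_B = {**VOCAB_A, "believ": 4}

def tokenize(text, vocab):
    """Greedy longest-match split of `text` into vocabulary pieces."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest possible piece first, then shrink.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no vocabulary entry covers {text[i:]!r}")
    return tokens

pieces_a = tokenize("unbelievable", VOCAB_A)
pieces_b = tokenize("unbelievable", VOCAB_B)
print(pieces_a, [VOCAB_A[p] for p in pieces_a])
print(pieces_b, [VOCAB_B[p] for p in pieces_b])
```

With the first vocabulary the word costs 4 tokens; adding one longer entry brings it down to 3. Token counts depend on the tokenizer, not just on the text.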

Key insight: Tokens are not words. Code, punctuation, and non-English text often split into more tokens than the same amount of English prose. Chinese text, for example, typically needs more tokens per character than English. A 1,000-word English document is roughly 1,300–1,500 tokens.
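The rule of thumb above (roughly 1.3–1.5 tokens per English word) turns into simple arithmetic. The sketch below is an estimate only; the price is a made-up placeholder, not any provider's real rate, and an exact count requires running the model's actual tokenizer.

```python
def estimate_tokens(word_count, tokens_per_word=(1.3, 1.5)):
    """Return a (low, high) token estimate for English prose."""
    low, high = tokens_per_word
    return round(word_count * low), round(word_count * high)

def estimate_cost(tokens, price_per_million_tokens):
    """Cost in whatever currency unit the price is expressed in."""
    return tokens / 1_000_000 * price_per_million_tokens

HYPOTHETICAL_PRICE = 2.00  # per million tokens; invented for illustration

low, high = estimate_tokens(1000)
print(low, high)
print(estimate_cost(high, HYPOTHETICAL_PRICE))
```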

Embeddings: The Geometry of Meaning

Each token is embedded as a vector in high-dimensional space (e.g., 12,288 dimensions for the largest GPT-3 model). The remarkable property of embeddings is that semantic relationships become geometric relationships.

Classic example:

King - Man + Woman ≈ Queen

Words with similar meaning cluster together. Analogies become vector arithmetic. The model learns this structure not by being told what things mean, but by observing how words appear together across billions of sentences.
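The analogy can be reproduced with hand-made vectors. In this sketch the two dimensions are labeled by hand as "royalty" and "masculinity" and the numbers are invented; real embeddings are learned from data, have far more dimensions, and their axes are not individually interpretable like this.

```python
import math

# Hand-made 2-D "embeddings": [royalty, masculinity]. Values are invented.
EMBEDDINGS = {
    "king":  [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man":   [0.1, 1.0],
    "woman": [0.1, -1.0],
    "apple": [-0.8, 0.0],
}

def cosine(a, b):
    """Cosine similarity: 1.0 means same direction, -1.0 means opposite."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def analogy(a, b, c, embeddings):
    """Return the word closest to a - b + c, excluding the three inputs."""
    target = [x - y + z for x, y, z in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = {w: v for w, v in embeddings.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(target, candidates[w]))

result = analogy("king", "man", "woman", EMBEDDINGS)
print(result)
```

Excluding the input words from the candidates is the usual convention for this kind of analogy test; without it, the nearest vector is often one of the inputs.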

Self-Attention: The Core Mechanism

Self-attention is what makes transformers powerful. For every token in the input, the model learns to attend to other tokens — computing how relevant each other token is when predicting or representing the current one.

Think of it like this: when processing the word "bank" in the sentence "I deposited money at the bank", the attention mechanism looks at all surrounding tokens and decides that "deposited" and "money" are highly relevant (river context is suppressed).

This happens in parallel across the entire input, in multiple "attention heads" simultaneously. Each head can learn to attend to different patterns — one might focus on grammatical structure, another on semantic relationships, another on coreference.
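A minimal sketch of the relevance computation, assuming a single head and using each token's vector directly as query, key, and value. Real transformers first apply learned projection matrices to produce separate queries, keys, and values, and run many heads in parallel; the token vectors below are made up.

```python
import math

def softmax(scores):
    """Convert raw scores into weights that are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(vectors):
    """Single-head scaled dot-product attention with Q = K = V = vectors."""
    d = len(vectors[0])
    outputs, all_weights = [], []
    for query in vectors:
        # Relevance of every token to the current one (scaled dot product).
        scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in vectors]
        weights = softmax(scores)
        # New representation: relevance-weighted mix of all token vectors.
        mixed = [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(d)]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Invented 2-D vectors; "money" and "bank" point in similar directions.
tokens = ["money", "the", "bank"]
vectors = [[1.0, 0.2], [0.0, 0.1], [0.9, 0.3]]
outputs, weights = self_attention(vectors)
for tok, row in zip(tokens, weights):
    print(tok, [round(w, 2) for w in row])
```

In the row for "bank", the weight on "money" comes out larger than the weight on "the", which is the toy version of the context effect described above.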

The Full Forward Pass (Simplified)

1. Input text → tokenizer → sequence of token IDs
2. Token IDs → embedding layer → sequence of vectors
3. Vectors pass through N transformer blocks (stacked layers):

- Self-attention: each token attends to the other tokens (in GPT-style models, only to earlier ones)

- Feed-forward network: non-linear transformation applied to each token

- Layer normalization + residual connections: stabilize training

4. Final layer → logits for every token in the vocabulary
5. Logits → softmax → probability distribution over the next token
6. Sample from the distribution (temperature controls randomness)
7. Append the sampled token → repeat from step 2 until done
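The last stages of the forward pass (logits → softmax → sample) can be sketched in a few lines. The three-token vocabulary and the logit values are made up, and treating temperature 0 as "always pick the highest-scoring token" is a common convention assumed here, not something every API guarantees.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then normalize into probabilities."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(logits, temperature, rng):
    """Greedy pick at temperature 0, otherwise sample from the softmax."""
    if temperature == 0:
        return max(range(len(logits)), key=lambda i: logits[i])
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

vocab = ["cat", "dog", "fish"]   # invented three-token vocabulary
logits = [2.0, 1.0, 0.1]         # invented scores from the final layer
rng = random.Random(0)

for t in (0.5, 1.0, 2.0):
    print(t, [round(p, 2) for p in softmax_with_temperature(logits, t)])
print(vocab[sample_next_token(logits, 0, rng)])
```

Lower temperatures concentrate probability on the top token; higher temperatures flatten the distribution, which is where run-to-run variation comes from.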

Section 3: Training Phases

Phase 1: Pre-Training

A base model is trained on a massive corpus of text (Common Crawl, books, code, Wikipedia, etc.) to predict the next token. This is self-supervised — no human labels needed. The model just learns to predict text from text.
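"No human labels needed" means the labels come from the text itself: every position supplies a (context, next token) training pair. A small sketch, with a hypothetical helper name and an arbitrarily chosen context size:

```python
def next_token_examples(token_ids, context_size):
    """Build (context, target) pairs where the target is simply the next token."""
    examples = []
    for i in range(1, len(token_ids)):
        context = token_ids[max(0, i - context_size):i]
        examples.append((context, token_ids[i]))
    return examples

# Token IDs for "Hello, world!" from the tokenization example earlier.
token_ids = [15496, 11, 995, 0]
examples = next_token_examples(token_ids, context_size=3)
for context, target in examples:
    print(context, "->", target)
```

A four-token text already yields three training examples, which is why raw text at web scale provides so much training signal without annotation.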

This phase is enormously expensive. GPT-3 training reportedly cost ~$4–5M in compute. GPT-4 reportedly cost $50–100M. The result is a "foundation model": powerful but raw, with no particular alignment toward being helpful.

Phase 2: Fine-Tuning (SFT)

The pre-trained model is fine-tuned on curated instruction-following examples:

Prompt: "Summarize this article in 3 bullet points..."
Response: [high-quality human-written summary]

This teaches the model to follow instructions, adopt a helpful tone, and format outputs usefully. The model is no longer just predicting text — it's learning the pattern of "assistant responding to user."

Phase 3: RLHF — Reinforcement Learning from Human Feedback

Humans rank pairs of model outputs: "This response is better than that one." A reward model is trained to predict these preferences. The LLM is then fine-tuned using reinforcement learning (PPO algorithm) to maximize the reward model's score.
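One common way to train a reward model on ranked pairs is a pairwise loss that is small when the preferred response scores higher than the rejected one. The sketch below shows that loss on plain numbers; it is an assumed, widely used formulation rather than a description of any specific lab's pipeline, and the reinforcement learning step itself is not shown.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Pairwise loss: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) is algebraically equal to log(1 + exp(-m)).
    return math.log1p(math.exp(-margin))

print(preference_loss(2.0, 0.0))  # reward model agrees with the human ranking
print(preference_loss(0.0, 0.0))  # reward model is undecided
print(preference_loss(0.0, 2.0))  # reward model disagrees with the human ranking
```

The loss shrinks as the reward model scores the human-preferred response further above the rejected one, so minimizing it pushes the reward model toward the human rankings.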

RLHF is what transforms a raw language model into something that feels safe, helpful, and harmless. It's why Claude, ChatGPT, and Gemini refuse certain requests and maintain helpful conversation styles.

Constitutional AI (Anthropic's Approach)

Anthropic developed an alternative/complement to RLHF called Constitutional AI (CAI). Rather than pure human feedback:

The model critiques and revises its own outputs based on a set of principles (the "constitution")
AI feedback replaces some human feedback in the preference ranking stage

This scales better than pure human labeling and allows more transparent value alignment.

Section 4: Context Windows

The context window is the maximum number of tokens an LLM can process in a single call — both input (your message, documents, conversation history) and output (the response).

Context Window Sizes by Model (2024–2025)

Model	Context Window
GPT-3.5-turbo	16K tokens
GPT-4o	128K tokens
Claude 3.5 Sonnet	200K tokens
Claude 3 Opus	200K tokens
Gemini 1.5 Pro	1M tokens
Llama 3.1 70B	128K tokens

Why Context Windows Matter

Everything the model "knows" during a conversation must fit in the context window. There is no persistent memory by default. When you have a long conversation:

The entire conversation history counts against the token limit
When the limit is reached, earlier messages are either truncated or summarized
Models often use information from the middle of a long context less reliably than information near the beginning or end (the "lost in the middle" problem)

Practical rule: Just because a model has a 200K context window doesn't mean it reasons equally well across all 200K tokens. Critical information should be at the beginning or end of long prompts.
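A sketch of the truncation strategy described above: keep the system prompt, reserve room for the reply, and drop the oldest messages first. The function names are hypothetical, and the 4-characters-per-token counter is a crude stand-in for the model's real tokenizer.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: about 4 characters per token."""
    return max(1, len(text) // 4)

def fit_to_budget(system_prompt, messages, max_tokens, reserved_for_output):
    """Keep the system prompt, then as many of the most recent messages as fit."""
    budget = max_tokens - reserved_for_output - count_tokens(system_prompt)
    kept = []
    for message in reversed(messages):  # walk from newest to oldest
        cost = count_tokens(message)
        if cost > budget:
            break
        kept.append(message)
        budget -= cost
    return [system_prompt] + list(reversed(kept))

system_prompt = "You are a helpful assistant."
history = ["a" * 40, "b" * 40, "c" * 40]  # three invented 10-token messages
trimmed = fit_to_budget(system_prompt, history, max_tokens=40, reserved_for_output=10)
print(len(trimmed) - 1, "of", len(history), "messages kept")
```

Here the oldest message is dropped silently, which is exactly how earlier instructions can vanish from a long conversation.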

Section 5: Major Model Families

OpenAI — GPT Series

GPT-4o: Most capable widely-available OpenAI model. Multimodal (text, images, audio). 128K context. Strong at code, reasoning, structured outputs.
GPT-4o mini: Faster, cheaper. Strong for most tasks that don't require maximum reasoning.
o1/o3: "Reasoning" models. Take longer to respond but work through problems step-by-step internally before answering. Dramatically better at math, code, and logic.

Anthropic — Claude Series

Claude 3.5 Sonnet: Excellent balance of speed, capability, and cost. Top choice for most coding and writing tasks.
Claude 3 Opus: Largest model in the Claude 3 generation. Strong at complex reasoning. Higher cost. Claude 3.5 Sonnet matches or beats it on many published benchmarks.
Claude 3 Haiku: Fastest, cheapest Claude model. Good for high-volume simple tasks.

Google — Gemini Series

Gemini 1.5 Pro: 1M token context window. Excellent for very long documents. Strong multimodal capabilities.
Gemini 1.5 Flash: Faster, cheaper Gemini. Good for latency-sensitive applications.

Meta — Llama Series (Open Source)

Llama 3.1 405B: Largest Llama 3.1 model, with openly available weights. Can be self-hosted.
Llama 3.1 70B/8B: Smaller, faster, widely deployed. Good for fine-tuning.

Mistral (Open Source)

Mistral Large: Strong European alternative. Good for enterprise use cases needing data residency.
Mistral 7B / Mixtral 8x7B: Excellent small models for self-hosting and fine-tuning.

Section 6: Failure Modes to Know

Hallucinations

LLMs confidently generate false information. This happens because the model is optimizing for plausible-sounding text, not accuracy. Hallucinations are particularly common for:

Specific facts (dates, numbers, citations)
Obscure topics with little training data
Recent events after the knowledge cutoff
Requests to "list all X" when the full list is long

Mitigation: Never use LLMs as your only source for factual claims. Use retrieval (RAG) to ground models in verified sources. Ask models to cite their sources, then check those citations yourself, since models can also invent references.
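The retrieval idea can be sketched without calling any model: pick the most relevant passages, then put them in the prompt with an instruction to stay within them. The word-overlap score is a deliberately naive stand-in for the embedding similarity that real systems typically use, and the passages, function names, and prompt wording are all invented for illustration.

```python
def score(query, passage):
    """Count shared lowercase words; a naive stand-in for embedding similarity."""
    return len(set(query.lower().split()) & set(passage.lower().split()))

def retrieve(query, passages, k=2):
    """Return the k passages with the highest overlap score."""
    return sorted(passages, key=lambda p: score(query, p), reverse=True)[:k]

def build_grounded_prompt(query, passages):
    """Assemble a prompt that restricts the answer to the retrieved sources."""
    sources = "\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {query}"
    )

PASSAGES = [
    "Refunds are issued within 14 days of purchase.",
    "Support is available Monday to Friday.",
    "Shipping takes 3 to 5 business days.",
]
query = "How many days until refunds are issued?"
top = retrieve(query, PASSAGES, k=1)
prompt = build_grounded_prompt(query, top)
print(prompt)
```

The resulting prompt would then be sent to the model; the numbered sources also give you something concrete to check its answer against.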

Knowledge Cutoffs

Training data has a cutoff date. Models don't know about events after that date. For example, GPT-4o's documented cutoff is October 2023, Claude 3.5 Sonnet's is April 2024, and Gemini 1.5's is November 2023.

Mitigation: Use search-augmented systems (web browsing, RAG) for time-sensitive queries.

Context Drift and Instruction Following

In long conversations, models can gradually lose track of earlier instructions or context. They may also follow instructions selectively, technically complying while violating the intent.

Sycophancy

Models trained on human feedback can learn to tell users what they want to hear rather than what's accurate. If you express a strong opinion, the model may agree with it even if it previously said the opposite.

Mitigation: Ask for counterarguments explicitly. Use evaluation prompts that test consistency.

Token Counting ≠ Understanding

Models process tokens, not sentences or concepts. Very long or repetitive inputs can degrade performance. The structure and position of information in the prompt matter significantly.

Section 7: How Models Are Evaluated

Benchmarks

MMLU (Massive Multitask Language Understanding): 57-subject multiple choice test covering STEM, humanities, professional domains
HumanEval: Code generation benchmark
MATH: Competition-level math problems
HellaSwag: Common-sense reasoning
GPQA (Graduate-Level Google-Proof Q&A): Expert-level science questions

Caution: Benchmarks can be gamed. Models are sometimes trained on or near benchmark data ("benchmark contamination"). Real-world performance doesn't always match benchmark rankings.
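A rough way to look for contamination is to check whether long word sequences from a benchmark item also appear in training text. The sketch below uses 5-word n-grams, an arbitrary choice, and made-up strings; it is a heuristic under those assumptions, not a standard or complete contamination test.

```python
def ngrams(text, n):
    """Set of all n-word sequences in the lowercased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def overlap_ratio(benchmark_item, training_text, n=5):
    """Fraction of the item's word n-grams that also occur in the training text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        return 0.0
    return len(item_grams & ngrams(training_text, n)) / len(item_grams)

item = "What is the capital of France and when was it founded"
leaked = "quiz: what is the capital of france and when was it founded answer paris"
clean = "france is a country in western europe with a long history"

print(overlap_ratio(item, leaked))
print(overlap_ratio(item, clean))
```

A ratio near 1.0 suggests the item appears almost verbatim in the training text; a ratio near 0.0 rules out only verbatim copies, not paraphrases.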

Vibes-Based Evaluation

Many practitioners evaluate models by testing on their own use cases. This is called "vibes-based eval" — informal but often more practically useful than benchmarks. In Stage 6, we'll cover systematic evaluation methods.

Checkpoint Assessment

Check your understanding before moving on:

What is a token, and roughly how many tokens are in a 1,000-word document?
Explain self-attention in plain language without using the word "attention."
What is the difference between a base model and an instruction-tuned model?
Why does a model with a 200K context window still struggle with very long documents?
Your team wants to use an LLM to answer questions about your company's internal documents. What failure mode is most relevant, and what architecture would you use?
A user reports that Claude gave different answers to the same question in two sessions. Is this a bug or expected behavior? Explain why.

Answers (write these out before reading):

A token is roughly 3–4 characters; a 1,000-word document is approximately 1,300–1,500 tokens.
Self-attention is the mechanism by which each word in a sentence can "look at" every other word to determine context. When processing any word, the model computes how relevant all other words in the input are, then uses that relevance to build a richer representation of the current word.
A base model is trained only to predict the next token in text — it has no particular behavior toward being helpful or following instructions. An instruction-tuned model has been fine-tuned (via SFT and RLHF) to follow instructions, maintain a helpful conversation style, and produce safe outputs.
The "lost in the middle" problem: models attend more strongly to information near the beginning and end of context. Information in the middle of very long inputs is often poorly retrieved. Additionally, processing and reasoning across very long contexts requires more computation and produces more errors.
Hallucinations and knowledge cutoffs are most relevant, because the model never saw your internal documents during training. The appropriate architecture is Retrieval-Augmented Generation (RAG): at query time the system retrieves relevant passages from the actual documents and passes them to the model as context, grounding its responses in verified source material.
Expected behavior. LLMs are non-deterministic by default — a "temperature" parameter controls randomness in the sampling process. The model isn't recalling a previous answer; each session starts fresh. This is by design to produce varied, creative responses. Temperature can be set to 0 for more deterministic outputs.

Hands-On Exercise: Model Comparison

No code required. Do this experiment:

Go to chat.openai.com, claude.ai, and gemini.google.com
Ask each the same prompt: "Explain quantum entanglement to a 10-year-old, then to a PhD physicist. Use a completely different vocabulary for each."
Then ask: "What is today's date and what happened in the news yesterday?"
Observe: How do they differ in tone, depth, formatting, and honesty about knowledge limits?

Write a 3–4 sentence observation for each model. These are not right/wrong — you're building intuition about model personalities.

Key Vocabulary

Term	Definition
Token	The basic unit of LLM input/output; roughly 3–4 characters
Embedding	Vector representation of a token in high-dimensional space
Self-attention	Mechanism allowing each token to attend to all other tokens
Transformer	Neural network architecture using self-attention; basis of modern LLMs
Context window	Maximum tokens an LLM can process in one call
Pre-training	Self-supervised next-token prediction on a massive text corpus
Fine-tuning (SFT)	Supervised training on instruction-following examples
RLHF	Training method using human preference rankings to align model behavior
Hallucination	Confident generation of false information
Temperature	Parameter controlling randomness in token sampling
Knowledge cutoff	Date after which a model has no training data
Foundation model	Large pre-trained model as a base for downstream tasks

What's Next

Stage 2 dives into Prompt Engineering Fundamentals — the practical art of getting LLMs to do exactly what you want. You'll learn zero-shot vs. few-shot prompting, chain-of-thought reasoning, system prompts, and how to structure prompts for reliability.
