Stage 05 — Fine-Tuning & Model Adaptation

Fine-Tuning & Model Adaptation · Comprehensive Technical Training · ⏱ 10–14 hours

Learning Objectives

By the end of this stage you will be able to:

Understand when and why to fine-tune vs. use prompting or RAG
Implement LoRA (Low-Rank Adaptation) for efficient fine-tuning
Use QLoRA to fine-tune 7B-70B models on a single GPU (consumer hardware for the smaller sizes)
Set up training hyperparameters and optimize convergence
Monitor training with Weights & Biases
Evaluate fine-tuned models and measure improvement
Deploy fine-tuned models locally or via API

Section 1: When to Fine-Tune

Fine-tuning is not always the answer. Evaluate alternatives first:

Fine-tuning vs. Other Approaches

Approach	Cost	Latency	Customization	When to Use
Better prompting	Low	Fast	Limited	Try first; often sufficient
RAG	Medium	Moderate	High	Need grounding in documents
Fine-tuning	High	Fast at inference (training runs offline)	Very High	Need domain-specific behavior
Training from scratch	Extreme	N/A	Complete	Almost never worth it

Fine-tune when: the base model still struggles with your domain or task format after serious prompt engineering.

Don't fine-tune if: better prompting or RAG could solve the problem more cheaply.

Section 2: LoRA and QLoRA Fundamentals

The Problem: Efficiency

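Full fine-tuning of a 7B-parameter model with Adam in mixed precision needs roughly 16 bytes per parameter for weights, gradients, and optimizer states. That is over 100GB of VRAM before activations are counted, which is not practical for most practitioners.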
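To make the memory problem concrete, here is a rough back-of-the-envelope estimate. It assumes the common mixed-precision Adam setup (fp16 weights and gradients plus fp32 master weights and two fp32 optimizer moments) and ignores activations, so treat it as a lower bound rather than a measurement. The function name is illustrative.

```python
# Bytes held per parameter during full fine-tuning with mixed-precision Adam.
# Assumption: fp16 weights/gradients, fp32 master copy, fp32 Adam moments.
BYTES_PER_PARAM = {
    "fp16 weights": 2,
    "fp16 gradients": 2,
    "fp32 master weights": 4,
    "Adam first moment (fp32)": 4,
    "Adam second moment (fp32)": 4,
}


def full_finetune_gb(num_params: float) -> float:
    """Estimate GB needed for weights, gradients, and optimizer state (no activations)."""
    return num_params * sum(BYTES_PER_PARAM.values()) / 1e9


print(f"7B model: about {full_finetune_gb(7e9):.0f} GB before activations")
```

Activations, which grow with batch size and sequence length, come on top of this figure.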

LoRA Solution

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. Instead of updating all 7B parameters, you update only a few million LoRA parameters:

Original model: frozen (no gradient updates)
LoRA adapters: trainable low-rank matrices
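At inference: optionally merge adapters into base weights, so serving adds no extra latency

Result: typically over 99% fewer trainable parameters. Gradients and optimizer states are kept only for the adapters, so training memory drops sharply; the exact saving depends on model size, rank, batch size, and sequence length.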
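The size of the saving can be checked with simple arithmetic. The sketch below counts LoRA parameters for the rank and target modules used in the configuration later in this section (r=16 on q_proj and v_proj). It assumes Llama-2-7B's published shape (hidden size 4096, 32 layers) and an approximate base parameter count of 6.74 billion. The helper name is illustrative.

```python
def lora_trainable_params(d_in: int, d_out: int, rank: int) -> int:
    """LoRA adds two matrices per adapted layer: A (rank x d_in) and B (d_out x rank)."""
    return rank * (d_in + d_out)


HIDDEN_SIZE = 4096       # Llama-2-7B hidden size
NUM_LAYERS = 32          # Llama-2-7B transformer layers
TARGETS_PER_LAYER = 2    # q_proj and v_proj, both 4096 x 4096 in this model
BASE_PARAMS = 6.74e9     # approximate total parameter count (assumption)

per_module = lora_trainable_params(HIDDEN_SIZE, HIDDEN_SIZE, rank=16)
total_lora = per_module * TARGETS_PER_LAYER * NUM_LAYERS
fraction = total_lora / BASE_PARAMS

print(f"LoRA parameters: {total_lora:,} ({fraction:.2%} of the base model)")
```

Under these assumptions the adapters hold about 8.4 million parameters, a little over 0.1% of the base model.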

QLoRA: Quantization + LoRA

QLoRA quantizes the base model to 4-bit precision, further reducing memory:

import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

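# LoRA adapter
lora_config = LoraConfig(
    r=16,  # LoRA rank (inner dimension of the adapter matrices)
    lora_alpha=32,  # scaling factor; the update is scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    lora_dropout=0.05,  # dropout applied on the adapter path
    bias="none",  # leave bias terms frozen
    task_type="CAUSAL_LM",
)

# Apply LoRA and report trainable vs. total parameter counts
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()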

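Total VRAM needed: roughly 8-12GB for a 7B model, which fits on a 24GB RTX 4090. A 70B model needs about 35GB for the 4-bit weights alone, so plan for a 48GB or 80GB GPU (e.g. an A100 80GB) or several smaller GPUs.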
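The weight footprint behind these figures follows directly from the bit width. The sketch below is an estimate for the frozen base weights only. The optional overhead_bits argument is an assumption that stands in for the quantization constants (the QLoRA paper reports roughly 0.127 bits per parameter with double quantization). Adapters, activations, and framework overhead add several GB on top.

```python
def quantized_weight_gb(num_params: float, bits_per_param: float = 4.0,
                        overhead_bits: float = 0.0) -> float:
    """Estimate GB used by the quantized base weights alone."""
    return num_params * (bits_per_param + overhead_bits) / 8 / 1e9


for name, n in [("7B", 7e9), ("70B", 70e9)]:
    plain = quantized_weight_gb(n)
    with_constants = quantized_weight_gb(n, overhead_bits=0.127)
    print(f"{name}: {plain:.1f} GB of 4-bit weights, "
          f"{with_constants:.1f} GB including quantization constants")
```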

Section 3: Training Setup with TRL

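from datasets import Dataset
from trl import SFTTrainer
from transformers import TrainingArguments

# Prepare training data (list of examples with "text" field)
train_data = [
    {"text": "instruction: Summarize this.\ninput: Long text...\noutput: Short summary"},
    # ... more examples
]

# SFTTrainer expects a datasets.Dataset, not a plain list.
# Hold out 10% for the per-epoch evaluation (needs more than a handful of examples).
dataset = Dataset.from_list(train_data).train_test_split(test_size=0.1)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size: 4 x 4 = 16 per device
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    learning_rate=2e-4,
    bf16=True,  # Use bfloat16 if available
    save_strategy="epoch",
    eval_strategy="epoch",  # requires the eval_dataset passed below
    report_to="wandb",  # Log to Weights & Biases
)

# Trainer (the model from Section 2 already has LoRA applied,
# so peft_config is not passed a second time)
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
)

# Train, then save the adapter to the path that Section 5 loads
trainer.train()
trainer.save_model("./lora_output/final_model")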
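The training examples above pack instruction, input, and output into a single "text" string. If your raw data is stored as separate fields, a small helper can build that string consistently. This is a minimal sketch: the record field names and the helper name are assumptions for illustration, and the template simply mirrors the example string shown above.

```python
def format_example(instruction: str, input_text: str, output: str) -> dict:
    """Build one training record with the single "text" field used above."""
    return {"text": f"instruction: {instruction}\ninput: {input_text}\noutput: {output}"}


# Hypothetical raw records; replace with your own data source
raw_records = [
    {"instruction": "Summarize this.", "input": "Long text...", "output": "Short summary"},
]

train_data = [
    format_example(r["instruction"], r["input"], r["output"]) for r in raw_records
]
```

Whatever template you choose, use the same one at inference time so prompts match what the model saw during training.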

Section 4: Hyperparameter Tuning

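Learning rate: 1e-4 to 5e-4 is typical for LoRA, higher than for full fine-tuning because only the small adapter matrices are updated. Too high → divergence. Too low → slow convergence.

Batch size: 4-8 per device, combined with gradient accumulation. The effective batch size is per-device batch size × accumulation steps × number of GPUs; a larger effective batch gives more stable gradient estimates.

LoRA rank (r): 8-32 typical. A larger rank adds capacity but also more trainable parameters, more memory, and a higher risk of overfitting on small datasets.

Epochs: 1-3 typical. More epochs can lead to overfitting, so watch for evaluation loss rising while training loss keeps falling.

Warmup: 5-10% of total steps. Ramping the learning rate up gradually keeps the first updates small, which stabilizes early training.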
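These settings interact, so it helps to compute the resulting schedule before launching a run. The sketch below derives effective batch size, total optimizer steps, and warmup steps from the values used in Section 3. The dataset size of 10,000 examples is an assumption, and the step count is an approximation because trainers differ in how they handle the last partial batch.

```python
import math


def training_schedule(num_examples: int, per_device_batch_size: int,
                      grad_accum_steps: int, num_devices: int = 1,
                      num_epochs: int = 3, warmup_ratio: float = 0.05) -> dict:
    """Approximate the optimizer-step schedule implied by the hyperparameters."""
    effective_batch = per_device_batch_size * grad_accum_steps * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * num_epochs
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": int(total_steps * warmup_ratio),
    }


# Assumed dataset size; batch size and accumulation match Section 3
schedule = training_schedule(num_examples=10_000, per_device_batch_size=4,
                             grad_accum_steps=4)
print(schedule)
```

With these inputs, a 5% warmup is 93 steps out of 1,875 in total. A fixed warmup_steps value such as 100 only matches the 5-10% guideline when the run is about this long, so recompute it when the dataset or batch size changes.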

Section 5: Evaluation and Deployment

Compare Base vs. Fine-Tuned

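from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load tokenizer and base model (same arguments as in Section 2)
base_model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
base_model = AutoModelForCausalLM.from_pretrained(
    base_model_name, quantization_config=bnb_config, device_map="auto"
)
# generate() expects token IDs, so tokenize the prompt string first
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)

# Generate with base model (before the adapter is attached)
base_ids = base_model.generate(**inputs, max_new_tokens=100)
base_output = tokenizer.decode(base_ids[0], skip_special_tokens=True)

# Load fine-tuned adapter
model = PeftModel.from_pretrained(base_model, "./lora_output/final_model")

# Generate with fine-tuned
finetuned_ids = model.generate(**inputs, max_new_tokens=100)
finetuned_output = tokenizer.decode(finetuned_ids[0], skip_special_tokens=True)

# Human evaluation or automated metrics
# BLEU, ROUGE, BERTScore, or task-specific metrics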
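As a starting point for automated comparison, the sketch below scores outputs against references with a simplified unigram-overlap F1. It is in the spirit of ROUGE-1 but is not the official ROUGE implementation, and the example strings are hypothetical placeholders rather than real model outputs. For reported results, prefer an established metrics library.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over lowercase whitespace tokens shared by prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_score(predictions: list, references: list) -> float:
    """Average unigram F1 across paired predictions and references."""
    return sum(unigram_f1(p, r) for p, r in zip(predictions, references)) / len(references)


# Hypothetical outputs for illustration only
references = ["short summary of the text"]
base_outputs = ["the text is long"]
finetuned_outputs = ["short summary of text"]

print(f"base: {mean_score(base_outputs, references):.2f}")
print(f"fine-tuned: {mean_score(finetuned_outputs, references):.2f}")
```

Run the same held-out prompts through both models so that any difference in score reflects the adapter rather than the data.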

Merge and Deploy

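import torch
# Reload the base model in bf16 rather than 4-bit so the merge is not applied to quantized weights
base = AutoModelForCausalLM.from_pretrained(base_model_name, torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "./lora_output/final_model")

# Merge LoRA adapters into base weights
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")

# Now deployable as standard model via Ollama, HuggingFace, Modal, etc.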

What's Next

Stage 6 covers evaluation, deployment, and production patterns for AI applications. You'll learn structured outputs, safety guardrails, serverless deployment, and how to build systems people can actually use.
