Stage 05 — Fine-Tuning & Model Adaptation
Fine-Tuning & Model Adaptation · Comprehensive Technical Training · ⏱ 10–14 hours
Learning Objectives
By the end of this stage you will be able to:
- Understand when and why to fine-tune vs. use prompting or RAG
- Implement LoRA (Low-Rank Adaptation) for efficient fine-tuning
- Use QLoRA to fine-tune 7B-70B models on a single GPU (consumer cards for 7B-class models)
- Set up training hyperparameters and optimize convergence
- Monitor training with Weights & Biases
- Evaluate fine-tuned models and measure improvement
- Deploy fine-tuned models locally or via API
Section 1: When to Fine-Tune
Fine-tuning is not always the answer. Evaluate alternatives first:
Fine-tuning vs. Other Approaches
| Approach | Cost | Latency | Customization | When to Use |
|---|---|---|---|---|
| Better prompting | Low | Fast | Limited | Try first; often sufficient |
| RAG | Medium | Moderate | High | Need grounding in documents |
| Fine-tuning | High | Fast at inference (training runs offline) | Very High | Need domain-specific behavior |
| Training from scratch | Extreme | N/A | Complete | Almost never worth it |
Fine-tune when: the base model still struggles with your specific domain or task format despite prompt engineering.
Don't fine-tune if: better prompting or RAG could solve the problem more cheaply.
Section 2: LoRA and QLoRA Fundamentals
The Problem: Efficiency
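Full fine-tuning of a 7B-parameter model needs far more than a 24GB consumer GPU can offer. With Adam in mixed precision, weights, gradients, and optimizer state take roughly 16 bytes per parameter, which is on the order of 100GB before activations are counted. That is not practical for most practitioners.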
LoRA Solution
LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. Instead of updating all 7B parameters, you update only a few million LoRA parameters:
- Original model: frozen (no gradient updates)
- LoRA adapters: trainable low-rank matrices
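- At inference: keep the adapter separate, or merge it into the base weights so there is no extra latency
Result: over 99% fewer parameters to train, and much less VRAM, because gradients and optimizer state are kept only for the adapters.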
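The parameter savings can be checked with a little arithmetic. The sketch below counts the trainable parameters for rank-16 adapters on two projection matrices per layer. The layer count, hidden size, and total parameter count are assumptions about a Llama-2-7B-style model and are not stated in this stage.

```python
def lora_param_count(d_in: int, d_out: int, r: int) -> int:
    """Trainable parameters in one LoRA adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed shapes for a Llama-2-7B-style model (hypothetical values for illustration):
# 32 transformer layers, hidden size 4096, square q_proj and v_proj matrices.
n_layers = 32
hidden = 4096
rank = 16
adapted_per_layer = 2  # q_proj and v_proj

trainable = n_layers * adapted_per_layer * lora_param_count(hidden, hidden, rank)
base_params = 6.74e9  # approximate total parameter count (assumed)
fraction = trainable / base_params

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model: {fraction:.4%}")
```

Under these assumptions the adapters add a few million trainable parameters, a small fraction of one percent of the base model.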
QLoRA: Quantization + LoRA
QLoRA quantizes the base model to 4-bit precision, further reducing memory:
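import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,         # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",              # NormalFloat4 data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # matmuls run in bf16
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    quantization_config=bnb_config,
    device_map="auto",
)

# Prepare the quantized model for training (casts norm layers, enables input gradients)
model = prepare_model_for_kbit_training(model)

# LoRA adapter
lora_config = LoraConfig(
    r=16,                                 # LoRA rank
    lora_alpha=32,                        # scaling factor; updates are scaled by alpha / r
    target_modules=["q_proj", "v_proj"],  # attention projections that receive adapters
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()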
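Total VRAM needed: roughly 8-12GB for a 7B model, which fits on an RTX 4090 or similar 24GB card. A 70B model is a different class: its 4-bit weights alone take about 35GB, so plan for a 48GB or 80GB GPU (for example an A100) rather than consumer hardware.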
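A back-of-the-envelope estimate of weight memory makes these figures easier to sanity-check. The sketch below covers the model weights only; the helper name `weight_memory_gb` and the round parameter counts are illustrative assumptions.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for the model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Round parameter counts, assumed for illustration.
for name, n_params in [("7B", 7e9), ("70B", 70e9)]:
    fp16 = weight_memory_gb(n_params, 16)
    nf4 = weight_memory_gb(n_params, 4)
    print(f"{name}: {fp16:.1f} GB in 16-bit, {nf4:.1f} GB in 4-bit")
```

Real training needs more than this. Quantization constants, adapter weights, gradients, optimizer state, and activations all come on top, and the overhead grows with sequence length and batch size.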
Section 3: Training Setup with TRL
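from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Prepare training data (list of examples with "text" field)
train_data = [
    {"text": "instruction: Summarize this.\ninput: Long text...\noutput: Short summary"},
    # ... more examples
]

# SFTTrainer expects a datasets.Dataset, and eval_strategy="epoch" needs a held-out split
dataset = Dataset.from_list(train_data).train_test_split(test_size=0.1)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size = 4 * 4 = 16 per GPU
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    learning_rate=2e-4,
    bf16=True,  # Use bfloat16 if available
    save_strategy="epoch",
    eval_strategy="epoch",
    report_to="wandb",  # Log to Weights & Biases
)

# Trainer. The model already carries the LoRA adapter from get_peft_model in Section 2,
# so peft_config is not passed a second time.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
)

# Train, then save the adapter to the path that Section 5 loads it from
trainer.train()
trainer.save_model("./lora_output/final_model")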
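The "text" field in the training data above can be built from structured records with a small helper. This is a minimal sketch that assumes the same "instruction / input / output" layout as the example row; the function name `format_example` and the `records` list are hypothetical.

```python
def format_example(instruction: str, input_text: str, output: str) -> dict:
    """Render one record into the single-string "text" format used for training."""
    text = f"instruction: {instruction}\ninput: {input_text}\noutput: {output}"
    return {"text": text}


# Hypothetical raw records as (instruction, input, output) tuples.
records = [
    ("Summarize this.", "Long text...", "Short summary"),
]

train_data = [format_example(*record) for record in records]
print(train_data[0]["text"])
```

Keeping the template in one function makes it easier to apply exactly the same layout to prompts at inference time, where only the output part is left for the model to complete.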
Section 4: Hyperparameter Tuning
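Learning rate: 1e-4 to 5e-4 is typical for LoRA. Too high → divergence. Too low → slow convergence.
Batch size: 4-8 per device with gradient accumulation. The effective batch size is per-device batch size × accumulation steps × number of GPUs. A larger effective batch gives more stable gradients.
LoRA rank (r): 8-32 typical. Larger rank = more adapter capacity, but also more trainable parameters and more memory.
Epochs: 1-3 typical. More epochs can lead to overfitting, so watch the evaluation loss.
Warmup: 5-10% of total steps. Ramping the learning rate up gradually avoids large, unstable updates early in training.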
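To turn these rules of thumb into concrete numbers, the sketch below computes the effective batch size, an approximate optimizer step count, and the 5-10% warmup range. The dataset size of 1,000 examples and the helper name `training_schedule` are assumptions for illustration, and the trainer's exact step count may differ slightly from this estimate.

```python
import math


def training_schedule(n_examples, per_device_batch, grad_accum, n_gpus, epochs):
    """Estimate effective batch size, total optimizer steps, and a 5-10% warmup range."""
    effective_batch = per_device_batch * grad_accum * n_gpus
    steps_per_epoch = math.ceil(n_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_low = round(0.05 * total_steps)
    warmup_high = round(0.10 * total_steps)
    return effective_batch, total_steps, (warmup_low, warmup_high)


# Batch settings taken from Section 3; the dataset size is a hypothetical value.
effective_batch, total_steps, warmup_range = training_schedule(
    n_examples=1000, per_device_batch=4, grad_accum=4, n_gpus=1, epochs=3
)
print(f"effective batch size: {effective_batch}")
print(f"total optimizer steps: {total_steps}")
print(f"warmup steps (5-10%): {warmup_range[0]} to {warmup_range[1]}")
```

A fixed setting such as warmup_steps=100 lands in the 5-10% range only for runs of roughly 1,000 to 2,000 optimizer steps, so on small datasets it is worth scaling the warmup down.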
Section 5: Evaluation and Deployment
Compare Base vs. Fine-Tuned
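from peft import PeftModel

# Assumes base_model, base_model_name, and prompt (a string) are already defined.
# generate() expects token IDs, so tokenize the prompt first.
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)

# Generate with base model
base_ids = base_model.generate(**inputs, max_new_tokens=100)
base_output = tokenizer.decode(base_ids[0], skip_special_tokens=True)

# Load fine-tuned adapter (the "..." stands for the loading arguments used in Section 2)
model = AutoModelForCausalLM.from_pretrained(base_model_name, ...)
model = PeftModel.from_pretrained(model, "./lora_output/final_model")

# Generate with fine-tuned
finetuned_ids = model.generate(**inputs, max_new_tokens=100)
finetuned_output = tokenizer.decode(finetuned_ids[0], skip_special_tokens=True)

# Human evaluation or automated metrics
# BLEU, ROUGE, BERTScore, or task-specific metrics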
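As a starting point for automated comparison, the sketch below scores outputs against a reference with a simple unigram-overlap F1. It is a simplified stand-in written for illustration, not the official ROUGE or BLEU implementation, and the example strings are hypothetical.

```python
from collections import Counter


def unigram_f1(prediction: str, reference: str) -> float:
    """F1 over whitespace-separated, lowercased tokens shared by prediction and reference."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it appears in both.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Hypothetical outputs for one prompt.
reference = "the cat sat on the mat"
base_score = unigram_f1("a dog ran", reference)
finetuned_score = unigram_f1("the cat sat", reference)
print(f"base: {base_score:.3f}  fine-tuned: {finetuned_score:.3f}")
```

In practice you would average such a score over a held-out set of prompts and pair it with human review, since token overlap says little about factual accuracy or style.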
Merge and Deploy
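# Merge LoRA adapters into base weights.
# For a clean merge, load the base model in bf16/fp16 rather than 4-bit before
# attaching the adapter; merging into quantized weights loses precision.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")
# Save the tokenizer to the same directory so the model can be loaded on its own.
# Now deployable as a standard model via Hugging Face Transformers, Modal, or Ollama
# (Ollama may require converting the weights to GGUF first).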
What's Next
Stage 6 covers evaluation, deployment, and production patterns for AI applications. You'll learn structured outputs, safety guardrails, serverless deployment, and how to build systems people can actually use.