Stage 05 — Fine-Tuning & Model Adaptation

Fine-Tuning & Model Adaptation  ·  Comprehensive Technical Training  ·  ⏱ 10–14 hours

Learning Objectives

By the end of this stage you will be able to:

  • Understand when and why to fine-tune vs. use prompting or RAG
  • Implement LoRA (Low-Rank Adaptation) for efficient fine-tuning
  • Use QLoRA to fine-tune 7B models on consumer GPUs and larger models (up to 70B) on a single high-memory GPU
  • Set up training hyperparameters and optimize convergence
  • Monitor training with Weights & Biases
  • Evaluate fine-tuned models and measure improvement
  • Deploy fine-tuned models locally or via API

Section 1: When to Fine-Tune

Fine-tuning is not always the answer. Evaluate alternatives first:

Fine-tuning vs. Other Approaches

Approach               Cost     Latency   Customization  When to Use
Better prompting       Low      Fast      Limited        Try first; often sufficient
RAG                    Medium   Moderate  High           Need grounding in documents
Fine-tuning            High     Offline   Very High      Need domain-specific behavior
Training from scratch  Extreme  N/A       Complete       Almost never worth it

Fine-tune when: the base model still struggles with your specific domain or task format after careful prompt engineering.

Don't fine-tune if: better prompting or RAG could solve the problem more cheaply.


Section 2: LoRA and QLoRA Fundamentals

The Problem: Efficiency

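Full fine-tuning of a 7B-parameter model needs far more memory than the weights alone. With Adam in mixed precision, weights, gradients, and optimizer state take roughly 16 bytes per parameter, which is on the order of 100GB before activations. That is not practical for most practitioners.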
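
A rough back-of-the-envelope estimate shows where the memory goes in full fine-tuning. The sketch below assumes mixed-precision training with Adam at about 16 bytes per parameter and ignores activations; the function name and the byte breakdown are illustrative assumptions, not measurements.

```python
def full_finetune_state_gb(num_params, bytes_per_param=16):
    """Approximate memory for weights, gradients, and Adam state (activations excluded)."""
    return num_params * bytes_per_param / 1e9


# Assumed breakdown of 16 bytes per parameter for mixed-precision Adam:
# 2 (fp16 weights) + 2 (fp16 gradients) + 4 (fp32 master weights) + 4 + 4 (Adam moments)
estimate_7b = full_finetune_state_gb(7e9)
print(f"7B full fine-tuning state: ~{estimate_7b:.0f} GB")
```

Real usage varies with the optimizer, precision, sequence length, and memory-saving techniques such as gradient checkpointing.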

LoRA Solution

LoRA (Low-Rank Adaptation) freezes the base model and trains small adapter matrices. Instead of updating all 7B parameters, you update only a few million LoRA parameters:

  • Original model: frozen (no gradient updates)
  • LoRA adapters: trainable low-rank matrices
  • At inference: merge adapters into base weights (or keep them separate and load them on top of the base model)

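Result: typically over 99% fewer trainable parameters and substantially less VRAM, because gradients and optimizer state are kept only for the adapters. Exact savings depend on the rank, the target modules, and the sequence length.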
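
To make the parameter savings concrete, here is a minimal sketch that counts trainable LoRA parameters. It assumes a Llama-2-7B-like shape (hidden size 4096, 32 layers, about 6.74B total parameters) with rank-16 adapters on two square projection matrices per layer; these figures are assumptions for illustration.

```python
def lora_param_count(d_in, d_out, r):
    """Trainable parameters for one adapter: A is (r x d_in), B is (d_out x r)."""
    return r * d_in + d_out * r


# Assumed model shape, for illustration only
HIDDEN = 4096                 # input and output width of each targeted projection
LAYERS = 32                   # number of transformer layers
RANK = 16                     # LoRA rank r
TARGETS_PER_LAYER = 2         # e.g. the query and value projections
BASE_PARAMS = 6_740_000_000   # approximate total parameter count

trainable = LAYERS * TARGETS_PER_LAYER * lora_param_count(HIDDEN, HIDDEN, RANK)
fraction = trainable / BASE_PARAMS

print(f"trainable LoRA parameters: {trainable:,}")
print(f"share of base model: {fraction:.4%}")
```

Under these assumptions the adapters hold about 8.4 million parameters, well under 1% of the base model.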

QLoRA: Quantization + LoRA

QLoRA quantizes the base model to 4-bit precision, further reducing memory:

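import torch
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "meta-llama/Llama-2-7b-hf"

# 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,  # also quantize the quantization constants
    bnb_4bit_quant_type="nf4",  # 4-bit NormalFloat data type
    bnb_4bit_compute_dtype=torch.bfloat16,  # dtype used for matrix multiplications
)

# Load model in 4-bit
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Prepare the quantized model for training
model = prepare_model_for_kbit_training(model)

# LoRA adapter
lora_config = LoraConfig(
    r=16,  # LoRA rank
    lora_alpha=32,  # scaling factor; updates are scaled by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # attention query and value projections
    lora_dropout=0.05,
    bias="none",
    task_type="CAUSAL_LM",
)

# Apply LoRA
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # reports trainable vs. total parameter counts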

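Total VRAM needed: roughly 8-12GB for a 7B model, which fits on a 24GB consumer GPU such as the RTX 4090. A 70B model needs about 35GB for the 4-bit weights alone, so plan for a 48GB-class card or an 80GB A100.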
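
As a sanity check on memory figures like these, the weight footprint of a quantized model can be estimated directly from the bit width. This is a minimal sketch with a hypothetical helper name; it covers only the stored weights, so adapters, activations, optimizer state, and quantization constants come on top.

```python
def quantized_weights_gb(num_params, bits_per_param=4):
    """Approximate memory for the stored weights alone at a given bit width."""
    return num_params * bits_per_param / 8 / 1e9


weights_7b = quantized_weights_gb(7e9)
weights_70b = quantized_weights_gb(70e9)

print(f"7B weights at 4-bit:  ~{weights_7b:.1f} GB")
print(f"70B weights at 4-bit: ~{weights_70b:.1f} GB")
```

The gap between this lower bound and actual usage grows with batch size and sequence length.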


Section 3: Training Setup with TRL

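from datasets import Dataset
from transformers import TrainingArguments
from trl import SFTTrainer

# Prepare training data (list of examples with "text" field)
train_data = [
    {"text": "instruction: Summarize this.\ninput: Long text...\noutput: Short summary"},
    # ... more examples
]

# SFTTrainer expects a datasets.Dataset, not a plain list.
# Hold out 10% for the per-epoch evaluation configured below
# (this needs more than a handful of examples).
dataset = Dataset.from_list(train_data).train_test_split(test_size=0.1, seed=42)

# Training arguments
training_args = TrainingArguments(
    output_dir="./lora_output",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    per_device_eval_batch_size=4,
    gradient_accumulation_steps=4,  # effective batch size of 16 per device
    warmup_steps=100,
    weight_decay=0.01,
    logging_steps=10,
    learning_rate=2e-4,
    bf16=True,  # Requires a GPU with bfloat16 support
    save_strategy="epoch",
    eval_strategy="epoch",  # Requires an eval_dataset
    report_to="wandb",  # Log to Weights & Biases
)

# Trainer. The model was already wrapped by get_peft_model in Section 2,
# so peft_config is not passed again here.
trainer = SFTTrainer(
    model=model,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    args=training_args,
)

# Train
trainer.train()

# Save the adapter weights to the path loaded in Section 5
trainer.save_model("./lora_output/final_model")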
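
The training examples above pack instruction, input, and output into a single "text" field. A minimal sketch of building that field from structured records might look like this; the helper name and the record keys are assumptions for illustration, and the template simply mirrors the example string above.

```python
def format_example(instruction, input_text, output):
    """Build the single "text" field from the three parts of a record."""
    return f"instruction: {instruction}\ninput: {input_text}\noutput: {output}"


# Hypothetical raw records; in practice these would come from your own dataset
raw_records = [
    {
        "instruction": "Summarize this.",
        "input": "Long text...",
        "output": "Short summary",
    },
]

train_data = [
    {"text": format_example(r["instruction"], r["input"], r["output"])}
    for r in raw_records
]

print(train_data[0]["text"])
```

Whatever template you choose, use the same one at inference time so the model sees prompts in the format it was trained on.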

Section 4: Hyperparameter Tuning

Learning rate: 1e-4 to 5e-4 is typical for LoRA. Too high → divergence. Too low → slow convergence.

Batch size: 4-8 per device with gradient accumulation. The effective batch size is per-device batch size × accumulation steps × number of devices; a larger effective batch gives more stable gradients.

LoRA rank (r): 8-32 typical. Larger rank = more capacity, but more trainable parameters and memory.

Epochs: 1-3 typical. More epochs can lead to overfitting, especially on small datasets.

Warmup: 5-10% of total steps. Increasing the learning rate gradually helps avoid unstable updates early in training.
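
These settings interact, so it helps to work out the effective batch size and step counts before launching a run. The sketch below is an approximation under assumed values (10,000 examples, one GPU); the function name is hypothetical, and a trainer's own step count can differ slightly depending on how it handles partial batches.

```python
import math


def schedule_summary(num_examples, per_device_batch, grad_accum,
                     num_devices, epochs, warmup_ratio):
    """Estimate effective batch size, optimizer steps, and warmup steps."""
    effective_batch = per_device_batch * grad_accum * num_devices
    steps_per_epoch = math.ceil(num_examples / effective_batch)
    total_steps = steps_per_epoch * epochs
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    return {
        "effective_batch": effective_batch,
        "steps_per_epoch": steps_per_epoch,
        "total_steps": total_steps,
        "warmup_steps": warmup_steps,
    }


# Assumed run: 10,000 examples, batch 4 with 4 accumulation steps on 1 GPU,
# 3 epochs, 5% warmup
summary = schedule_summary(10_000, 4, 4, 1, 3, 0.05)
for key, value in summary.items():
    print(f"{key}: {value}")
```

For a small dataset, a fixed warmup of 100 steps can cover a large share of training, so deriving it from the total step count is often safer.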


Section 5: Evaluation and Deployment

Compare Base vs. Fine-Tuned

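from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumes base_model, base_model_name, and prompt are already defined.
# generate() takes token IDs, not a raw string, so tokenize the prompt first.
tokenizer = AutoTokenizer.from_pretrained(base_model_name)
inputs = tokenizer(prompt, return_tensors="pt").to(base_model.device)

# Generate with base model
base_ids = base_model.generate(**inputs, max_new_tokens=100)
base_output = tokenizer.decode(base_ids[0], skip_special_tokens=True)

# Load fine-tuned adapter
model = AutoModelForCausalLM.from_pretrained(base_model_name, ...)  # "..." = same loading arguments as in Section 2
model = PeftModel.from_pretrained(model, "./lora_output/final_model")

# Generate with fine-tuned
finetuned_ids = model.generate(**inputs, max_new_tokens=100)
finetuned_output = tokenizer.decode(finetuned_ids[0], skip_special_tokens=True)

# Human evaluation or automated metrics
# BLEU, ROUGE, BERTScore, or task-specific metrics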
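
For a quick automated comparison, a simple overlap score can stand in for the metrics listed above. The sketch below computes a unigram F1 score in plain Python; it is a simplified stand-in in the spirit of ROUGE-1, not the official ROUGE implementation, and the example strings are made up for illustration.

```python
from collections import Counter


def unigram_f1(prediction, reference):
    """F1 over whitespace-separated, lowercased tokens shared by both texts."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


# Made-up outputs standing in for base_output and finetuned_output
reference = "the cat sat on the mat"
base_score = unigram_f1("a dog ran", reference)
finetuned_score = unigram_f1("the cat sat on a mat", reference)

print(f"base: {base_score:.3f}  fine-tuned: {finetuned_score:.3f}")
```

Average the score over a held-out set rather than judging single examples, and pair it with human review, since token overlap misses paraphrases and factual errors.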

Merge and Deploy

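# Merge LoRA adapters into base weights.
# If the base model was loaded in 4-bit (QLoRA), consider reloading it in 16-bit
# precision before attaching the adapter, since merging into quantized weights is lossy.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("./merged_model")

# Now deployable as a standard model via HuggingFace, Modal, etc.
# Some runtimes, such as Ollama, may need an extra conversion step (for example to GGUF).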

What's Next

Stage 6 covers evaluation, deployment, and production patterns for AI applications. You'll learn structured outputs, safety guardrails, serverless deployment, and how to build systems people can actually use.
