Stage 02 — Using Frontier & Open-Source Models

Comprehensive Technical Training · ⏱ 5–7 hours

Learning Objectives

By the end of this stage you will be able to:

Select the right model for your use case using evaluation criteria and benchmarks
Run open-source models locally using Ollama
Compare frontier models (GPT, Claude, Gemini) in practice
Understand model benchmarks (MMLU, GPQA, HumanEval) and their limitations
Evaluate the Chinchilla scaling law and training efficiency
Make cost-benefit trade-offs between frontier and open-source models

Section 1: Model Selection Strategy

Choosing a model requires understanding trade-offs: capability vs. cost, latency vs. quality, frontier vs. self-hosted. There is no universally "best" model—it depends on your constraints.

Decision Framework

Ask yourself:

Capability requirement: Do you need a dedicated reasoning model (o1/o3), or is a mid-tier model sufficient?
Context window: How much input do you need to process? Long documents suggest Gemini 1.5 or Claude.
Cost constraint: Can you afford frontier model API calls, or do you need self-hosted?
Latency requirement: Can you tolerate API latency, or do you need local inference?
Privacy: Can data leave your infrastructure, or is self-hosted required?
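
The five questions above can be sketched as a small routing function. This is only an illustration: the `Requirements` fields, the 128,000-token cut-off, and the returned labels are assumptions made for the example, not rules from any vendor.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of the five decision-framework questions."""
    needs_deep_reasoning: bool      # capability requirement
    max_input_tokens: int           # context window
    monthly_api_budget_usd: float   # cost constraint
    needs_offline_latency: bool     # latency requirement
    data_must_stay_local: bool      # privacy


def recommend(req: Requirements) -> str:
    """Return a coarse recommendation; all thresholds are illustrative."""
    # Privacy and offline operation are hard constraints, so check them first.
    if req.data_must_stay_local or req.needs_offline_latency:
        return "self-hosted open-source model"
    # No API budget at all also forces self-hosting.
    if req.monthly_api_budget_usd <= 0:
        return "self-hosted open-source model"
    if req.needs_deep_reasoning:
        return "frontier reasoning model (API)"
    # 128_000 is an assumed cut-off for "long" inputs, not a vendor limit.
    if req.max_input_tokens > 128_000:
        return "long-context frontier model (API)"
    return "mid-tier hosted model (API)"
```

Checking the hard constraints first reflects the idea that privacy and offline requirements rule options out, while capability and context length only rank the options that remain.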

Section 2: Running Models Locally with Ollama

Ollama is a tool that downloads and runs open-source LLMs locally on your machine. It distributes models in pre-quantized form, manages memory, and provides a simple command-line interface along with a local HTTP API.

Installation and First Run

# Download Ollama from ollama.com
# On Mac: drag the app to Applications
# On Windows: run the installer
# On Linux: curl -fsSL https://ollama.com/install.sh | sh

# Verify installation
ollama --version

# Run a small model
ollama run gemma:2b

# Run a larger model
ollama run llama2:7b

# Run a general-purpose 7B model
ollama run mistral:7b

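The first run downloads the model, which can take several minutes depending on model size and connection speed. Subsequent runs skip the download and only need to load the model from disk into memory, so they start much faster.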
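
Besides the interactive prompt, Ollama serves a local HTTP API, by default on port 11434. The following is a minimal sketch of calling it from Python with only the standard library. It assumes the Ollama server is running and that `gemma:2b` has already been downloaded; the helper names `build_request` and `generate` are ours, not part of Ollama.

```python
import json
import urllib.request

# Default address of the local Ollama server; adjust if yours differs.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (needs a running server)."""
    request = build_request(model, prompt)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # With "stream": False the reply arrives as one JSON object.
    return body["response"]
```

With the server running, `generate("gemma:2b", "Explain quantization in one sentence.")` should return the model's reply as a string. Setting `"stream": False` keeps the example short; streaming responses need line-by-line parsing instead.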

Common Local Models and Their Use Cases

Model	Approx. Download Size	Best For	Relative Speed
Gemma 2B	1.4 GB	Testing, low-resource devices	Very Fast
Phi 3.5	2.4 GB	Code, reasoning, efficient	Very Fast
Mistral 7B	4 GB	Balanced performance	Fast
Llama 2 13B	7 GB	Better reasoning, coding	Moderate
Llama 3.1 70B	40 GB	High capability, near-GPT-4	Slow
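
The sizes in the table follow roughly from parameter count times bits per weight. Here is a rough sketch of that arithmetic, assuming the listed downloads are about 4-bit quantized and ignoring runtime overhead such as the context cache; the function name is ours.

```python
def approx_weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    bytes_total = params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9


# Compare unquantized 16-bit weights with an assumed 4-bit quantization.
for name, params in [("7B", 7), ("13B", 13), ("70B", 70)]:
    fp16 = approx_weights_gb(params, 16)
    q4 = approx_weights_gb(params, 4)
    print(f"{name}: ~{fp16:.1f} GB at 16-bit, ~{q4:.1f} GB at 4-bit")
```

The 4-bit estimates (about 3.5, 6.5, and 35 GB) land a little under the table's figures, which is expected: practical quantization formats spend slightly more than 4 bits per weight on average, and the files carry extra metadata. Plan for additional memory beyond the file size when actually running a model.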

When to Use Local vs. Frontier Models

Use local (Ollama) when:

You need privacy (data stays on your machine)
You need latency guarantees (no network dependency)
You're doing development/testing and want free inference
Your use case doesn't require frontier-level reasoning

Use frontier models (API) when:

You need the best possible reasoning (GPT-4o, o1, Claude Opus)
You need multimodal capabilities (vision)
You're processing very long documents (hundreds of thousands of tokens; only some models, such as Gemini 1.5, accept 1M+)
You want to avoid local compute costs
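
To make the cost side of this trade-off concrete, the following sketch estimates monthly API spend from request volume and token counts. The prices used are placeholders, not any provider's actual rates; substitute current list prices before drawing conclusions.

```python
def monthly_api_cost_usd(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    usd_per_million_input: float,
    usd_per_million_output: float,
    days: int = 30,
) -> float:
    """Estimate monthly API spend from per-request token counts and prices."""
    per_request = (
        input_tokens * usd_per_million_input
        + output_tokens * usd_per_million_output
    ) / 1_000_000
    return per_request * requests_per_day * days


# Placeholder prices (USD per million tokens); replace with real rates.
cost = monthly_api_cost_usd(
    requests_per_day=1_000,
    input_tokens=2_000,
    output_tokens=500,
    usd_per_million_input=2.50,
    usd_per_million_output=10.00,
)
print(f"Estimated API cost: ${cost:,.2f} per month")
```

Comparing this figure with the hardware, electricity, and maintenance cost of a machine able to run a suitable local model gives a first-order break-even estimate. Output tokens are priced separately because providers typically charge more for them than for input tokens.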

Section 3: Model Benchmarks

Benchmarks measure model performance on standardized tasks. They're useful but not perfect—models can be trained on benchmark data, and benchmarks don't capture all real-world needs.

Major Benchmarks

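MMLU: 57-subject multiple-choice test covering STEM, humanities, and professional domains. ~88% for GPT-4o (vendor-reported at launch).
GPQA: Graduate-level science questions written by domain experts and designed to resist web search. Far harder than MMLU: roughly 50% for both GPT-4o and Claude 3 Opus at launch.
HumanEval: Python code generation, scored as pass@1 (the first generated solution passes the unit tests). ~90% for GPT-4o, ~85% for Claude 3 Opus. Reported scores vary with the prompting setup.
HellaSwag: Common-sense sentence completion. The model picks the most plausible continuation of a short description of an everyday situation.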
MATH: Competition-level math problems. Indicates reasoning capability.
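
HumanEval's pass@k metric is usually computed with the unbiased estimator introduced alongside the benchmark: generate n samples per problem, count the c samples that pass the tests, and estimate the chance that at least one of k randomly chosen samples passes. A minimal sketch of that estimator for a single problem (the function name and argument checks are ours):

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples of which c passed the tests."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Probability that a random size-k subset contains no passing sample.
    all_fail = comb(n - c, k) / comb(n, k)
    return 1.0 - all_fail
```

For k = 1 this reduces to c / n, the fraction of samples that pass. A benchmark score is the average of this value over all problems.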

Limitations of Benchmarks

Benchmark contamination: Models may be trained on or near benchmark data
Narrow evaluation: Benchmarks don't measure real-world task performance
Gaming: Models improve on benchmarks without improving on actual use cases

Your own tests on your specific task are more valuable than benchmarks.
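
One common way to probe for benchmark contamination is to look for long word sequences shared between a benchmark item and text a model may have been trained on. The sketch below is a simplified illustration of that idea, not any lab's actual procedure; the window size of 8 words and both function names are assumptions.

```python
def ngrams(text: str, n: int) -> set[tuple[str, ...]]:
    """Return the set of word-level n-grams in lower-cased text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_ratio(benchmark_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the benchmark item's n-grams that also occur in corpus_text."""
    item_grams = ngrams(benchmark_item, n)
    if not item_grams:
        # Item is shorter than n words, so there is nothing to compare.
        return 0.0
    shared = item_grams & ngrams(corpus_text, n)
    return len(shared) / len(item_grams)
```

A ratio near 1.0 suggests the item appears almost verbatim in the corpus. Since the training data of most models is not public, this check is mainly useful for your own test sets: keep them private and confirm they do not overlap with anything you fine-tune on.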

Section 4: The Chinchilla Scaling Law

The Chinchilla scaling law (DeepMind, 2022) describes the optimal balance between model size (parameters) and training data:

For a fixed compute budget, the optimal training data tokens ≈ 20x the number of parameters.

Historically, researchers over-parameterized models relative to the amount of training data. Chinchilla showed that, for the same compute, training a smaller model on more data yields a better model than training a larger model on less data. This has pushed later model development toward a better balance between parameter count and training tokens.

Implications:

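At nearly the same training compute, a 70B model trained on 1.4T tokens (the 20x ratio) is predicted to reach slightly lower loss than a 100B model trained on 1T tokens
The result concerns pre-training. For fine-tuning it is at most a loose analogy: better and more diverse data often helps more than a larger model
Chinchilla optimizes training compute, not inference cost. Many recent models, open-source ones included, deliberately train well past 20 tokens per parameter (Llama 3 8B used roughly 15T tokens) so that a smaller model is cheaper to serve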
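
The 20x rule can be turned into a quick calculation. The sketch below combines it with the widely used approximation that training compute is about 6 × parameters × tokens FLOPs. That approximation is an assumption brought in for this example and is not stated in this stage.

```python
from math import sqrt

TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb: tokens ~ 20x parameters


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute with the common C ~ 6 * N * D estimate."""
    return 6 * params * tokens


def compute_optimal(flops: float) -> tuple[float, float]:
    """Split a FLOP budget into (params, tokens) with tokens = 20 * params."""
    # Solve flops = 6 * N * (20 * N) for N.
    params = sqrt(flops / (6 * TOKENS_PER_PARAM))
    return params, TOKENS_PER_PARAM * params


budget = training_flops(70e9, 1.4e12)
n_opt, d_opt = compute_optimal(budget)
print(f"Budget: {budget:.2e} FLOPs")
print(f"Optimal split: {n_opt / 1e9:.0f}B params, {d_opt / 1e12:.1f}T tokens")
print(f"100B params on 1T tokens: {training_flops(100e9, 1e12):.2e} FLOPs")
```

Under these assumptions the 70B / 1.4T configuration sits exactly on the 20x line, while 100B / 1T costs about the same compute but has only 10 tokens per parameter, which is why the rule favors the smaller model.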

Section 5: Evaluating Models in Practice

Vibes-Based Evaluation

The most practical evaluation is testing models on your actual use cases. Ask each model the same questions and compare:

Answer quality and accuracy
Response tone and style
Handling of edge cases
Cost per response
Latency

Sample Comparison Prompt

System: You are a senior Python engineer. Explain complex concepts clearly.

User: Explain async/await in Python. Then show a concrete example that demonstrates
the difference between sequential and concurrent execution.

[Test with GPT-4o, Claude Opus, Gemini Pro, Llama 70B, Mistral Large]
[Compare answers for clarity, correctness, example quality]
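
A side-by-side comparison like this is easy to automate. The following sketch sends one prompt to several models and records each reply with its latency. The models are represented by plain Python callables, and the two stubs are placeholders so the example runs offline; in practice each callable would wrap a real API client or a local Ollama call.

```python
import time
from typing import Callable


def compare_models(
    prompt: str, models: dict[str, Callable[[str], str]]
) -> list[dict[str, object]]:
    """Run one prompt through each model callable; record reply and latency."""
    results = []
    for name, ask in models.items():
        start = time.perf_counter()
        reply = ask(prompt)
        elapsed = time.perf_counter() - start
        results.append({"model": name, "latency_s": elapsed, "reply": reply})
    return results


# Stand-in callables (hypothetical); swap in real API or Ollama calls.
stub_models = {
    "stub-a": lambda p: f"[stub-a] {p[:40]}",
    "stub-b": lambda p: f"[stub-b] {p[:40]}",
}

for row in compare_models("Explain async/await in Python.", stub_models):
    print(f"{row['model']}: {row['latency_s']:.4f}s -> {row['reply']}")
```

Keeping the prompt fixed and varying only the model makes differences in clarity, correctness, and latency easier to attribute. Judging answer quality still requires reading the replies or scoring them against your own rubric.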

Key Tools and Resources

Ollama.com: Download and documentation for running models locally
HuggingFace: Model hub with benchmarks and leaderboards
Artificial Analysis: Real-time model benchmarking and cost analysis
OpenRouter: Unified API across multiple models for easy comparison

What's Next

Stage 3 moves to building: you'll write Python code to call the Claude, OpenAI, and Gemini APIs, handling streaming, function calling, error handling, and building interactive applications.
