Stage 02 — Using Frontier & Open-Source Models
Using Frontier & Open-Source Models · Comprehensive Technical Training · ⏱ 5–7 hours
Learning Objectives
By the end of this stage you will be able to:
- Select the right model for your use case using evaluation criteria and benchmarks
- Run open-source models locally using Ollama
- Compare frontier models (GPT, Claude, Gemini) in practice
- Understand model benchmarks (MMLU, GPQA, HumanEval) and their limitations
- Evaluate the Chinchilla scaling law and training efficiency
- Make cost-benefit trade-offs between frontier and open-source models
Section 1: Model Selection Strategy
Choosing a model requires understanding trade-offs: capability vs. cost, latency vs. quality, frontier vs. self-hosted. There is no universally "best" model—it depends on your constraints.
Decision Framework
Ask yourself:
- Capability requirement: Do you need reasoning (o1/o3) or is a mid-tier model sufficient?
- Context window: How much input do you need to process? Long documents suggest Gemini 1.5 or Claude.
- Cost constraint: Can you afford frontier model API calls, or do you need self-hosted?
- Latency requirement: Can you tolerate API latency, or do you need local inference?
- Privacy: Can data leave your infrastructure, or is self-hosted required?
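One way to make the checklist above concrete is to encode it as a small function. The sketch below is illustrative only: the `Requirements` fields, the `recommend` function, and the 8,000-token threshold are hypothetical names and values, not part of any library, and a real decision would weigh these factors with more nuance.

```python
from dataclasses import dataclass


@dataclass
class Requirements:
    """Hypothetical summary of the answers to the five questions above."""
    needs_advanced_reasoning: bool
    max_input_tokens: int
    data_must_stay_local: bool
    needs_offline_inference: bool
    monthly_budget_usd: float


def recommend(req: Requirements, local_context_limit: int = 8_000) -> str:
    """Return a coarse recommendation; the threshold is an illustrative assumption."""
    # Privacy and offline requirements are hard constraints, so check them first.
    if req.data_must_stay_local or req.needs_offline_inference:
        return "self-hosted"
    # Inputs larger than a local model handles comfortably point to a long-context API.
    if req.max_input_tokens > local_context_limit:
        return "frontier (long-context)"
    if req.needs_advanced_reasoning:
        return "frontier (reasoning)"
    # With no budget for API calls, local inference is the only option left.
    if req.monthly_budget_usd <= 0:
        return "self-hosted"
    return "mid-tier API or self-hosted"


print(recommend(Requirements(True, 2_000, False, False, 200.0)))
```

Checking the hard constraints (privacy, offline use) before the soft ones (capability, budget) mirrors how the trade-offs usually resolve in practice: a constraint you cannot relax decides the question before cost or quality enters.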
Section 2: Running Models Locally with Ollama
Ollama is a tool that downloads and runs open-source LLMs locally on your machine. It serves pre-quantized model builds, manages memory, and provides a simple command-line interface plus a local HTTP API.
Installation and First Run
# Download Ollama from ollama.com
# On Mac: drag to Applications
# On Windows/Linux: standard install
# Verify installation
ollama --version
# Run a small model
ollama run gemma:2b
# Run a larger model
ollama run llama2:7b
# Run a general-purpose 7B model
ollama run mistral:7b
The first run downloads the model, which can take several minutes depending on model size and connection speed. Subsequent runs skip the download and start as soon as the model has loaded into memory.
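Once a model has been pulled, it can also be called from code through the HTTP API that Ollama serves locally. The sketch below uses only the Python standard library and assumes Ollama is running on its default address (`localhost:11434`) with `gemma:2b` already downloaded; the helper names `build_request` and `generate` are made up for this example.

```python
import json
import urllib.request

# Default address of the local Ollama server (assumes a standard install).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama server."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (Ollama must be running)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the server running, `print(generate("gemma:2b", "Explain a context window in one sentence."))` should print the model's reply. Setting `"stream": False` returns the whole answer as one JSON object, which keeps the example short at the cost of waiting for the full generation.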
Common Local Models and Their Use Cases
| Model | Size | Best For | Speed |
|---|---|---|---|
| Gemma 2B | 1.4 GB | Testing, low-resource devices | Very Fast |
| Phi 3.5 | 2.4 GB | Code, reasoning, efficient | Very Fast |
| Mistral 7B | 4 GB | Balanced performance | Fast |
| Llama 2 13B | 7 GB | Better reasoning, coding | Moderate |
| Llama 3.1 70B | 40 GB | High capability, near-GPT-4 | Slow |
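The download sizes in the table are roughly what you would expect from 4-bit quantized weights. As a back-of-the-envelope check, the sketch below multiplies parameter count by bits per weight. The 4.5 bits-per-weight default is an assumption (a typical effective rate for 4-bit quantization once per-block scale factors are included), and the function name is hypothetical. Actual memory use at runtime is higher because of the context cache and runtime overhead.

```python
def estimate_size_gb(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Approximate weight size in GB (10^9 bytes): parameters x bits / 8."""
    return params_billion * bits_per_weight / 8


# Nominal parameter counts for three rows of the table above.
for name, params in [("Mistral 7B", 7), ("Llama 2 13B", 13), ("Llama 3.1 70B", 70)]:
    print(f"{name}: ~{estimate_size_gb(params):.1f} GB at 4-bit, "
          f"~{estimate_size_gb(params, 16):.0f} GB at 16-bit")
```

The 16-bit column shows why quantization matters for local use: an unquantized 70B model needs about 140 GB for weights alone, well beyond consumer hardware.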
When to Use Local vs. Frontier Models
Use local (Ollama) when:
- You need privacy (data stays on your machine)
- You need latency guarantees (no network dependency)
- You're doing development/testing and want free inference
- Your use case doesn't require frontier-level reasoning
Use frontier models (API) when:
- You need the best possible reasoning (GPT-4o, o1, Claude Opus)
- You need multimodal capabilities (vision)
- You're processing long documents (up to 1M+ tokens on some models, such as Gemini 1.5)
- You want to avoid local compute costs
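The cost side of this trade-off is simple arithmetic once you know your traffic. The sketch below estimates a monthly API bill from request volume and per-token prices. The prices and the local hardware figure are placeholders chosen for illustration, not real quotes; check each provider's current pricing page before relying on any number.

```python
def monthly_api_cost(
    requests_per_day: int,
    input_tokens: int,
    output_tokens: int,
    price_in_per_million: float,
    price_out_per_million: float,
    days: int = 30,
) -> float:
    """Estimated monthly API spend in USD for a steady workload."""
    per_request = (
        input_tokens * price_in_per_million + output_tokens * price_out_per_million
    ) / 1_000_000
    return requests_per_day * days * per_request


# Placeholder workload and prices (hypothetical, for illustration only).
api_cost = monthly_api_cost(1_000, 1_500, 500, 2.50, 10.00)
local_cost = 150.0  # hypothetical monthly cost of hardware and electricity

print(f"API: ${api_cost:.2f}/month, local: ${local_cost:.2f}/month")
```

Output tokens are priced separately because providers typically charge more for them than for input tokens, so workloads with long answers cost more than the request count alone suggests.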
Section 3: Model Benchmarks
Benchmarks measure model performance on standardized tasks. They're useful but not perfect—models can be trained on benchmark data, and benchmarks don't capture all real-world needs.
Major Benchmarks
- MMLU: 57-subject multiple-choice test covering STEM, humanities, and professional domains. ~88% for GPT-4o.
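- GPQA: Graduate-level science questions written by domain experts. Much harder than MMLU: GPT-4o and Claude 3 Opus each scored roughly 50% at release, while reasoning models such as o1 score considerably higher.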
- HumanEval: Python code generation. Pass@1 (first-try success). ~90% for GPT-4o, ~85% for Claude 3 Opus.
- HellaSwag: Common-sense reasoning, tested by picking the most plausible continuation of a short text description of an everyday situation.
- MATH: Competition-level math problems. Indicates reasoning capability.
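The Pass@1 metric quoted for HumanEval is a special case of pass@k: the probability that at least one of k sampled solutions passes the unit tests. The sketch below implements the unbiased estimator described in the paper that introduced HumanEval, where n samples are generated per problem and c of them pass. The function name is ours.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples per problem, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # Fewer than k failing samples: every draw of k includes a passing one.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


print(f"pass@1 = {pass_at_k(10, 3, 1):.3f}")
print(f"pass@5 = {pass_at_k(10, 3, 5):.3f}")
```

For k = 1 the estimate reduces to c / n, the plain fraction of samples that pass. The benchmark score is this value averaged over all problems.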
Limitations of Benchmarks
- Benchmark contamination: Models may be trained on or near benchmark data
- Narrow evaluation: Benchmark scores may not reflect performance on your real-world tasks
- Gaming: Models can improve on benchmarks without improving on actual use cases
Your own tests on your specific task are more valuable than benchmarks.
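A task-specific test can be very small. The sketch below scores any model, represented as a plain function from prompt to reply, against a handful of prompt and expected-answer pairs. The substring check is a deliberately crude scoring rule, and `fake_model` is a stand-in so the example runs without an API key; in practice you would pass a function that calls a real model.

```python
from typing import Callable


def accuracy(model: Callable[[str], str], cases: list[tuple[str, str]]) -> float:
    """Fraction of cases whose expected answer appears in the model's reply."""
    if not cases:
        raise ValueError("need at least one test case")
    hits = sum(
        expected.lower() in model(prompt).lower() for prompt, expected in cases
    )
    return hits / len(cases)


def fake_model(prompt: str) -> str:
    """Placeholder model that always gives the same answer."""
    return "The answer is Paris."


cases = [
    ("What is the capital of France?", "Paris"),
    ("What is the capital of Japan?", "Tokyo"),
]
print(f"accuracy = {accuracy(fake_model, cases):.2f}")
```

Because the model is just a callable, the same test cases can be run against a local model and a frontier API to compare them on your own task.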
Section 4: The Chinchilla Scaling Law
The Chinchilla scaling law (DeepMind, 2022) describes the optimal balance between model size (parameters) and training data:
For a fixed compute budget, the optimal training data tokens ≈ 20x the number of parameters.
Historically, researchers over-parameterized models relative to their training data. Chinchilla showed that, for the same compute, training a smaller model on more data gives better results than training a larger model on less data. This has pushed recent model development toward a better balance between parameter count and training tokens.
Implications:
- A 70B model trained on 1.4T tokens (20 tokens per parameter) uses its compute better than a 100B model trained on 1T tokens (10 tokens per parameter), at a similar training cost
- Chinchilla studied pretraining, but by loose analogy, fine-tuning may benefit more from additional, more diverse data than from a larger model
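- The 20x ratio minimizes training compute only. Many recent open models (Llama 3, for example) deliberately train well past it, spending more on training to get a smaller model that is cheaper to run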
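The 70B versus 100B comparison above can be checked with a few lines of arithmetic. The sketch below applies the 20-tokens-per-parameter rule and estimates training compute with the common approximation C ≈ 6·N·D (about six floating-point operations per parameter per token). That approximation is an assumption brought in for this example, not something stated in the text above.

```python
TOKENS_PER_PARAM = 20  # Chinchilla rule of thumb from the text above


def optimal_tokens(params: float) -> float:
    """Compute-optimal number of training tokens for a given parameter count."""
    return TOKENS_PER_PARAM * params


def training_flops(params: float, tokens: float) -> float:
    """Approximate training compute using C ~ 6 * N * D (an assumed rule of thumb)."""
    return 6 * params * tokens


for params, tokens in [(70e9, 1.4e12), (100e9, 1.0e12)]:
    print(
        f"{params / 1e9:.0f}B params, {tokens / 1e12:.1f}T tokens: "
        f"{tokens / params:.0f} tokens/param, "
        f"{training_flops(params, tokens):.2e} FLOPs"
    )

print(f"Optimal tokens for 70B: {optimal_tokens(70e9) / 1e12:.1f}T")
```

Both configurations cost about 6 × 10^23 FLOPs, so the comparison is fair: at the same budget, the 70B run sits at the 20x ratio while the 100B run is undertrained at 10x.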
Section 5: Evaluating Models in Practice
Vibes-Based Evaluation
The most practical evaluation is testing models on your actual use cases. Ask each model the same questions and compare:
- Answer quality and accuracy
- Response tone and style
- Handling of edge cases
- Cost per response
- Latency
Sample Comparison Prompt
System: You are a senior Python engineer. Explain complex concepts clearly.
User: Explain async/await in Python. Then show a concrete example that demonstrates
the difference between sequential and concurrent execution.
[Test with GPT-4o, Claude Opus, Gemini Pro, Llama 70B, Mistral Large]
[Compare answers for clarity, correctness, example quality]
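A side-by-side comparison like this is easy to script once each model is wrapped in a function taking the system and user prompts. The sketch below records the answer and wall-clock latency per model. The two lambda "models" are placeholders so the example runs offline; real wrappers would call the relevant API or a local Ollama server.

```python
import time
from typing import Callable

SYSTEM = "You are a senior Python engineer. Explain complex concepts clearly."
USER = (
    "Explain async/await in Python. Then show a concrete example that demonstrates "
    "the difference between sequential and concurrent execution."
)

ModelFn = Callable[[str, str], str]  # (system prompt, user prompt) -> answer


def compare(models: dict[str, ModelFn], system: str, user: str) -> list[dict]:
    """Run the same prompt through each model, recording answer and latency."""
    rows = []
    for name, call in models.items():
        start = time.perf_counter()
        answer = call(system, user)
        latency = time.perf_counter() - start
        rows.append({"model": name, "latency_s": latency, "answer": answer})
    return rows


# Placeholder models (hypothetical); swap in real API or Ollama wrappers.
models = {
    "stub-a": lambda system, user: "async/await lets one thread interleave waiting tasks.",
    "stub-b": lambda system, user: "Coroutines yield control at each await.",
}

for row in compare(models, SYSTEM, USER):
    print(f"{row['model']}: {row['latency_s']:.4f}s, {len(row['answer'])} chars")
```

Latency and length are measured automatically, but clarity, correctness, and example quality still need a human read of each answer.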
Key Tools and Resources
- Ollama.com: Download and documentation for running models locally
- HuggingFace: Model hub with benchmarks and leaderboards
- Artificial Analysis: Real-time model benchmarking and cost analysis
- OpenRouter: Unified API across multiple models for easy comparison
What's Next
Stage 3 moves to building: you'll write Python code to call the Claude, OpenAI, and Gemini APIs, handling streaming, function calling, error handling, and building interactive applications.