Stage 03 — Using Claude, GPT & Gemini APIs with Python
From API Keys to Streaming Tool Use · Technical + Hands-On · ⏱ 8–10 hours
Learning Objectives
By the end of this stage you will be able to:
- Set up Python environments and API keys for Claude, OpenAI, and Gemini
- Send basic completions, handle responses, and stream output
- Build multi-turn conversation history correctly
- Implement proper error handling, retries, and rate limiting
- Use token counting and cost estimation
- Call functions/tools from LLM outputs
- Build a reusable LLM client wrapper class
Setup
Prerequisites
python --version # Python 3.9+
pip install anthropic openai google-generativeai python-dotenv
API Keys
Create a .env file in your project root (never commit this):
ANTHROPIC_API_KEY=sk-ant-...
OPENAI_API_KEY=sk-proj-...
GOOGLE_API_KEY=AI...
Load them in Python:
from dotenv import load_dotenv
import os
load_dotenv()
anthropic_key = os.getenv("ANTHROPIC_API_KEY")
openai_key = os.getenv("OPENAI_API_KEY")
google_key = os.getenv("GOOGLE_API_KEY")
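os.getenv returns None when a variable is missing, so a typo in .env tends to surface later as a confusing authentication error. A minimal sketch of a fail-fast check, assuming a hypothetical helper named require_env (it is not part of python-dotenv or any provider SDK):

```python
import os


def require_env(names: list[str]) -> dict[str, str]:
    """Return the requested environment variables, or raise if any are missing.

    require_env is a hypothetical helper for this course, not a library function.
    """
    values = {name: os.environ.get(name, "") for name in names}
    # Treat unset and blank values the same way: both are unusable as API keys.
    missing = [name for name, value in values.items() if not value.strip()]
    if missing:
        raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
    return values
```

Calling require_env(["ANTHROPIC_API_KEY", "OPENAI_API_KEY", "GOOGLE_API_KEY"]) right after load_dotenv() would stop the script at startup instead of at the first API call.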
Section 1: Anthropic (Claude) API
Basic Completion
import anthropic
client = anthropic.Anthropic() # Reads ANTHROPIC_API_KEY from env
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Explain gradient descent in 3 sentences."}
    ]
)
print(message.content[0].text)
print(f"\nTokens used: {message.usage.input_tokens} in, {message.usage.output_tokens} out")
With System Prompt
message = client.messages.create(
    model="claude-sonnet-4-5",
    max_tokens=1024,
    system="You are a senior Python developer. Be concise. Use type hints. Show examples.",
    messages=[
        {"role": "user", "content": "How do I read a CSV file?"}
    ]
)
print(message.content[0].text)
Multi-Turn Conversation
The API is stateless — you must send the entire conversation history every time.
def chat(client, conversation_history: list[dict], user_message: str, system: str = "") -> str:
    """Send a message and return the assistant's response."""
    conversation_history.append({
        "role": "user",
        "content": user_message
    })
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=system,
        messages=conversation_history
    )
    assistant_message = response.content[0].text
    conversation_history.append({
        "role": "assistant",
        "content": assistant_message
    })
    return assistant_message

# Usage
client = anthropic.Anthropic()
history = []
system = "You are a helpful Python tutor."
while True:
    user_input = input("You: ")
    if user_input.lower() in ["quit", "exit"]:
        break
    response = chat(client, history, user_input, system)
    print(f"Claude: {response}\n")
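Because the full history is resent on every call, input tokens (and cost) grow with each turn. One common mitigation is to keep only the most recent messages. A minimal sketch, assuming a hypothetical helper trim_history that caps by message count rather than by a true token budget:

```python
def trim_history(history: list[dict], max_messages: int = 20) -> list[dict]:
    """Return the most recent messages, starting on a user turn.

    Hypothetical helper: caps by message count, not by tokens.
    """
    if max_messages <= 0:
        return []
    trimmed = history[-max_messages:]
    # Claude's Messages API expects a conversation to open with a user turn,
    # so drop any leading assistant messages left over from the cut.
    while trimmed and trimmed[0]["role"] != "user":
        trimmed = trimmed[1:]
    return trimmed
```

Passing trim_history(conversation_history) as messages would bound the request size, at the cost of the model forgetting older turns. Summarizing dropped turns is another option, not shown here.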
Streaming Responses
For long outputs, streaming provides real-time token-by-token output:
def stream_chat(client, messages: list[dict], system: str = "") -> str:
    """Stream a response and collect it for history."""
    full_response = ""
    with client.messages.stream(
        model="claude-sonnet-4-5",
        max_tokens=2048,
        system=system,
        messages=messages
    ) as stream:
        for text in stream.text_stream:
            print(text, end="", flush=True)
            full_response += text
    print()  # newline after streaming
    return full_response
Tool Use (Function Calling)
Tools allow the model to call functions you define:
import json
# Define available tools
tools = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {
                    "type": "string",
                    "description": "The city name"
                },
                "unit": {
                    "type": "string",
                    "enum": ["celsius", "fahrenheit"],
                    "description": "Temperature unit"
                }
            },
            "required": ["city"]
        }
    }
]
# Simulated weather function
def get_weather(city: str, unit: str = "celsius") -> dict:
    # In real code, call a weather API here
    return {"city": city, "temperature": 22, "unit": unit, "condition": "Sunny"}

# Agentic loop: model decides when to call tools
def run_with_tools(client, user_message: str) -> str:
    messages = [{"role": "user", "content": user_message}]
    while True:
        response = client.messages.create(
            model="claude-sonnet-4-5",
            max_tokens=1024,
            tools=tools,
            messages=messages
        )
        # Check if model wants to use a tool
        if response.stop_reason == "tool_use":
            # Process all tool calls
            tool_results = []
            for content_block in response.content:
                if content_block.type == "tool_use":
                    tool_name = content_block.name
                    tool_input = content_block.input
                    tool_use_id = content_block.id
                    # Execute the tool
                    if tool_name == "get_weather":
                        result = get_weather(**tool_input)
                    else:
                        result = {"error": f"Unknown tool: {tool_name}"}
                    tool_results.append({
                        "type": "tool_result",
                        "tool_use_id": tool_use_id,
                        "content": json.dumps(result)
                    })
            # Add assistant response and tool results to history
            messages.append({"role": "assistant", "content": response.content})
            messages.append({"role": "user", "content": tool_results})
        else:
            # Model is done: return the final text response
            return response.content[0].text

result = run_with_tools(client, "What's the weather like in Tokyo and London?")
print(result)
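The if/else chain inside the loop grows with every tool you add. One way to keep it flat is a name-to-function registry. A minimal sketch, assuming hypothetical names TOOL_REGISTRY and execute_tool (neither comes from the Anthropic SDK); get_weather is repeated here only so the block runs on its own:

```python
import json
from typing import Any, Callable


def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in for the simulated weather function above."""
    return {"city": city, "temperature": 22, "unit": unit, "condition": "Sunny"}


# Hypothetical registry: maps tool names (as declared in `tools`) to callables.
TOOL_REGISTRY: dict[str, Callable[..., Any]] = {"get_weather": get_weather}


def execute_tool(name: str, tool_input: dict) -> str:
    """Run a registered tool and return a JSON string for the tool_result block."""
    func = TOOL_REGISTRY.get(name)
    if func is None:
        return json.dumps({"error": f"Unknown tool: {name}"})
    try:
        return json.dumps(func(**tool_input))
    except Exception as exc:  # e.g. TypeError from unexpected arguments
        # Report the failure to the model instead of crashing the loop.
        return json.dumps({"error": f"{type(exc).__name__}: {exc}"})
```

Inside run_with_tools, the "Execute the tool" branch could then shrink to a single execute_tool(tool_name, tool_input) call whose return value goes straight into the tool_result content.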
Section 2: OpenAI (GPT) API
Basic Completion
from openai import OpenAI
client = OpenAI() # Reads OPENAI_API_KEY from env
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain gradient descent in 3 sentences."}
    ],
    max_tokens=512,
    temperature=0.7
)
print(response.choices[0].message.content)
print(f"Tokens: {response.usage.prompt_tokens} in, {response.usage.completion_tokens} out")
Structured Output (JSON Mode)
import json
response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},
    messages=[
        {"role": "system", "content": "Extract entities from the text. Return JSON with keys: people (list), organizations (list), locations (list), dates (list)."},
        {"role": "user", "content": "Elon Musk founded SpaceX in 2002 in Hawthorne, California. He later acquired Twitter in October 2022."}
    ]
)
data = json.loads(response.choices[0].message.content)
print(data)
# {'people': ['Elon Musk'], 'organizations': ['SpaceX', 'Twitter'],
# 'locations': ['Hawthorne, California'], 'dates': ['2002', 'October 2022']}
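JSON mode is meant to yield syntactically valid JSON, but it does not by itself guarantee that the keys or value types match what the prompt asked for. A minimal sketch of a defensive parser, assuming a hypothetical helper parse_entities whose key names mirror the system prompt above:

```python
import json

EXPECTED_KEYS = ("people", "organizations", "locations", "dates")


def parse_entities(raw: str) -> dict[str, list[str]]:
    """Parse a JSON-mode reply, defaulting missing keys to empty lists.

    Hypothetical helper; the expected keys are taken from the system prompt.
    """
    data = json.loads(raw)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object at the top level")
    result: dict[str, list[str]] = {}
    for key in EXPECTED_KEYS:
        value = data.get(key, [])
        if not isinstance(value, list):
            raise ValueError(f"Expected a list for {key!r}")
        # Normalize items to strings so downstream code sees one type.
        result[key] = [str(item) for item in value]
    return result
```

Replacing the bare json.loads call with parse_entities(response.choices[0].message.content) would turn a silently malformed reply into an explicit error you can log or retry.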
Function Calling
tools = [
    {
        "type": "function",
        "function": {
            "name": "search_database",
            "description": "Search the product database by query",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"},
                    "category": {"type": "string", "enum": ["laptops", "phones", "tablets"]},
                    "max_price": {"type": "number", "description": "Maximum price in USD"}
                },
                "required": ["query"]
            }
        }
    }
]
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Find me a laptop under $1000"}],
    tools=tools,
    tool_choice="auto"
)
# Check if a tool was called
if response.choices[0].finish_reason == "tool_calls":
    tool_call = response.choices[0].message.tool_calls[0]
    function_name = tool_call.function.name
    function_args = json.loads(tool_call.function.arguments)
    print(f"Model wants to call: {function_name}({function_args})")
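The snippet above stops once the model has requested a call. To get a final answer, you run the function yourself and send the result back in a follow-up request. A minimal sketch of the message list for that second request, assuming a hypothetical helper build_tool_followup and a made-up result payload:

```python
import json


def build_tool_followup(messages: list[dict], tool_call_id: str, function_name: str,
                        arguments_json: str, result: dict) -> list[dict]:
    """Return a new message list with the tool call and its result appended.

    Hypothetical helper; the dict shapes follow the Chat Completions message format.
    """
    assistant_turn = {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": tool_call_id,
            "type": "function",
            "function": {"name": function_name, "arguments": arguments_json},
        }],
    }
    tool_turn = {
        "role": "tool",
        "tool_call_id": tool_call_id,  # must match the id the model issued
        "content": json.dumps(result),
    }
    # Build a new list so the caller's history is not mutated.
    return messages + [assistant_turn, tool_turn]
```

You would pass the returned list as messages to a second client.chat.completions.create call (with the same tools), using tool_call.id, tool_call.function.name, and tool_call.function.arguments from the first response.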
Streaming
stream = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "Write a haiku about recursion."}],
    stream=True
)
for chunk in stream:
    # Some chunks can arrive with an empty choices list, so guard before indexing
    if chunk.choices and chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
Section 3: Google Gemini API
Basic Completion
import google.generativeai as genai
import os
genai.configure(api_key=os.getenv("GOOGLE_API_KEY"))
model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Explain the CAP theorem in simple terms.")
print(response.text)
With System Instruction
model = genai.GenerativeModel(
    model_name="gemini-1.5-flash",
    system_instruction="You are a database expert. Use examples from PostgreSQL and MongoDB."
)
response = model.generate_content("What's the difference between ACID and BASE consistency models?")
print(response.text)
Multi-Turn Chat
model = genai.GenerativeModel("gemini-1.5-flash")
chat = model.start_chat(history=[])
# Send messages
response = chat.send_message("What is Kubernetes?")
print("Gemini:", response.text)
response = chat.send_message("How does it compare to Docker Swarm?")
print("Gemini:", response.text)
# History is maintained automatically
print("\nConversation length:", len(chat.history), "turns")
Processing Long Documents
Gemini's 1M token context is especially useful for large documents:
with open("long_document.txt", "r") as f:
    document = f.read()
model = genai.GenerativeModel("gemini-1.5-pro")
response = model.generate_content([
    f"Document:\n{document}\n\nQuestions: 1) What is the main argument? 2) List the 5 key claims. 3) What evidence is weakest?"
])
print(response.text)
Section 4: Error Handling and Resilience
Production code must handle API failures gracefully.
Common Errors
| Error | Cause | Solution |
|---|---|---|
| AuthenticationError | Invalid API key | Check key, reload env |
| RateLimitError | Too many requests | Exponential backoff |
| APIConnectionError | Network issue | Retry with backoff |
| BadRequestError | Invalid parameters | Log and fix the request |
| APIStatusError with status_code >= 500 | Server-side error | Retry with backoff |
Exponential Backoff Pattern
import time
import random
import anthropic
def call_with_retry(
    client: anthropic.Anthropic,
    messages: list[dict],
    system: str = "",
    max_retries: int = 5,
    base_delay: float = 1.0
) -> str:
    """Call Claude API with exponential backoff retry."""
    for attempt in range(max_retries):
        try:
            response = client.messages.create(
                model="claude-sonnet-4-5",
                max_tokens=2048,
                system=system,
                messages=messages
            )
            return response.content[0].text
        except anthropic.RateLimitError as e:
            if attempt == max_retries - 1:
                raise
            # Get retry-after header if available
            delay = float(e.response.headers.get("retry-after", base_delay * (2 ** attempt)))
            jitter = random.uniform(0, delay * 0.1)
            print(f"Rate limited. Waiting {delay + jitter:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(delay + jitter)
        except anthropic.APIConnectionError:
            if attempt == max_retries - 1:
                raise
            delay = base_delay * (2 ** attempt) + random.uniform(0, 1)
            print(f"Connection error. Retrying in {delay:.1f}s...")
            time.sleep(delay)
        except anthropic.APIStatusError as e:
            if e.status_code >= 500:
                if attempt == max_retries - 1:
                    raise
                delay = base_delay * (2 ** attempt)
                time.sleep(delay)
            else:
                raise  # 4xx errors should not be retried
    raise RuntimeError("Max retries exceeded")
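To see the schedule this pattern produces, it helps to compute the delays in isolation. A minimal sketch, assuming a hypothetical helper backoff_delay. Two things differ from the function above and are assumptions of this sketch: max_delay caps the wait, and the jitter is "full jitter" (uniform over the whole interval) rather than the 10% jitter used above.

```python
import random


def backoff_delay(attempt: int, base_delay: float = 1.0, max_delay: float = 60.0,
                  jitter: bool = True) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    Hypothetical helper; max_delay is an assumed cap that the code above does not apply.
    """
    delay = min(max_delay, base_delay * (2 ** attempt))
    if jitter:
        # Full jitter: pick uniformly in [0, delay] so clients do not retry in lockstep.
        return random.uniform(0, delay)
    return delay


print([backoff_delay(a, jitter=False) for a in range(8)])
# [1.0, 2.0, 4.0, 8.0, 16.0, 32.0, 60.0, 60.0]
```

Without a cap, attempt 10 would already wait over 17 minutes (1024 seconds), which is rarely what an interactive application wants.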
Section 5: Token Counting and Cost Estimation
Understanding token usage is essential for cost control in production.
import anthropic
client = anthropic.Anthropic()
def estimate_cost(input_tokens: int, output_tokens: int, model: str = "claude-sonnet-4-5") -> float:
    """Estimate Claude API cost in USD."""
    # Prices in USD per million tokens; these change, so verify against the current pricing page
    pricing = {
        "claude-opus-4-6": {"input": 5.00, "output": 25.00},
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
    }
    if model not in pricing:
        return 0.0
    p = pricing[model]
    cost = (input_tokens / 1_000_000 * p["input"]) + (output_tokens / 1_000_000 * p["output"])
    return cost

def count_tokens(client: anthropic.Anthropic, messages: list[dict], system: str = "") -> int:
    """Count tokens before making a request."""
    response = client.messages.count_tokens(
        model="claude-sonnet-4-5",
        system=system,
        messages=messages
    )
    return response.input_tokens

# Example: estimate before sending
messages = [{"role": "user", "content": "Explain the Raft consensus algorithm in detail with examples."}]
token_count = count_tokens(client, messages)
estimated_cost = estimate_cost(token_count, 1000)  # Assume 1000 output tokens
print(f"Estimated input tokens: {token_count}")
print(f"Estimated cost: ${estimated_cost:.4f}")
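Per-request costs look tiny until you multiply them by traffic. A minimal sketch of a monthly projection, assuming a hypothetical helper project_monthly_cost, the Sonnet rates listed above ($3 input / $15 output per million tokens), and a made-up workload of 1,000 requests per day:

```python
def project_monthly_cost(requests_per_day: int, input_tokens: int, output_tokens: int,
                         input_price: float = 3.00, output_price: float = 15.00,
                         days: int = 30) -> float:
    """Projected spend in USD; prices are per million tokens (assumed Sonnet rates)."""
    # Cost of one request: tokens times price, scaled from per-million to per-token.
    per_request = (input_tokens * input_price + output_tokens * output_price) / 1_000_000
    return per_request * requests_per_day * days


# 2,000 input + 500 output tokens per request is $0.0135 each.
print(f"${project_monthly_cost(1_000, 2_000, 500):.2f}")
# $405.00
```

Note that output tokens dominate here: 500 output tokens cost more than 2,000 input tokens, so capping max_tokens is often the quickest lever.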
Section 6: Building a Reusable LLM Client
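Here's a starting point for a wrapper that exposes one interface across providers. Only the Claude client includes retry logic; the OpenAI client would need the same treatment before production use: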
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional
import anthropic
from openai import OpenAI
import time
import random
@dataclass
class Message:
    role: str  # "user" or "assistant"
    content: str

@dataclass
class LLMResponse:
    content: str
    input_tokens: int
    output_tokens: int
    model: str
    cost_usd: float
class BaseLLMClient(ABC):
    """Abstract base class for LLM clients."""

    @abstractmethod
    def complete(
        self,
        messages: list[Message],
        system: str = "",
        max_tokens: int = 2048,
        temperature: float = 0.7
    ) -> LLMResponse:
        pass

    def chat(self, conversation: list[Message], user_message: str, **kwargs) -> LLMResponse:
        """Add user message and get response, updating conversation in place."""
        conversation.append(Message(role="user", content=user_message))
        response = self.complete(conversation, **kwargs)
        conversation.append(Message(role="assistant", content=response.content))
        return response
class ClaudeClient(BaseLLMClient):
    """Anthropic Claude client with retry logic."""

    # USD per million tokens; verify against the current pricing page
    PRICING = {
        "claude-sonnet-4-5": {"input": 3.00, "output": 15.00},
        "claude-haiku-4-5": {"input": 1.00, "output": 5.00},
    }

    def __init__(self, model: str = "claude-sonnet-4-5", max_retries: int = 3):
        self.client = anthropic.Anthropic()
        self.model = model
        self.max_retries = max_retries

    def complete(
        self,
        messages: list[Message],
        system: str = "",
        max_tokens: int = 2048,
        temperature: float = 0.7
    ) -> LLMResponse:
        api_messages = [{"role": m.role, "content": m.content} for m in messages]
        for attempt in range(self.max_retries):
            try:
                response = self.client.messages.create(
                    model=self.model,
                    max_tokens=max_tokens,
                    system=system,
                    messages=api_messages,
                    temperature=temperature
                )
                input_tok = response.usage.input_tokens
                output_tok = response.usage.output_tokens
                # Unknown models fall back to Sonnet pricing
                p = self.PRICING.get(self.model, {"input": 3.0, "output": 15.0})
                cost = (input_tok / 1e6 * p["input"]) + (output_tok / 1e6 * p["output"])
                return LLMResponse(
                    content=response.content[0].text,
                    input_tokens=input_tok,
                    output_tokens=output_tok,
                    model=self.model,
                    cost_usd=cost
                )
            except anthropic.RateLimitError:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(2 ** attempt + random.random())
            except anthropic.APIConnectionError:
                if attempt == self.max_retries - 1:
                    raise
                time.sleep(2 ** attempt)
        # Only reachable if max_retries is 0; fail loudly instead of returning None
        raise RuntimeError("Max retries exceeded")
class OpenAIClient(BaseLLMClient):
    """OpenAI GPT client."""

    PRICING = {
        "gpt-4o": {"input": 2.50, "output": 10.00},
        "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    }

    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def complete(
        self,
        messages: list[Message],
        system: str = "",
        max_tokens: int = 2048,
        temperature: float = 0.7
    ) -> LLMResponse:
        api_messages = []
        if system:
            api_messages.append({"role": "system", "content": system})
        api_messages.extend([{"role": m.role, "content": m.content} for m in messages])
        response = self.client.chat.completions.create(
            model=self.model,
            messages=api_messages,
            max_tokens=max_tokens,
            temperature=temperature
        )
        input_tok = response.usage.prompt_tokens
        output_tok = response.usage.completion_tokens
        p = self.PRICING.get(self.model, {"input": 2.5, "output": 10.0})
        cost = (input_tok / 1e6 * p["input"]) + (output_tok / 1e6 * p["output"])
        return LLMResponse(
            content=response.choices[0].message.content,
            input_tokens=input_tok,
            output_tokens=output_tok,
            model=self.model,
            cost_usd=cost
        )
# Usage example
def run_benchmark(prompt: str) -> None:
    """Compare responses and costs across providers."""
    clients = {
        "Claude Sonnet": ClaudeClient("claude-sonnet-4-5"),
        "GPT-4o": OpenAIClient("gpt-4o"),
    }
    for name, client in clients.items():
        try:
            messages = [Message(role="user", content=prompt)]
            response = client.complete(messages)
            print(f"\n{'='*50}")
            print(f"Model: {name}")
            print(f"Tokens: {response.input_tokens} in / {response.output_tokens} out")
            print(f"Cost: ${response.cost_usd:.4f}")
            print(f"Response: {response.content[:200]}...")
        except Exception as e:
            print(f"{name} error: {e}")

run_benchmark("What are the three most important concepts in distributed systems? Be concise.")
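Since every LLMResponse carries its own token counts and cost, aggregating spend across calls takes very little code. A minimal sketch, assuming a hypothetical UsageTracker class that is not part of the wrapper above; it takes plain values so it runs on its own:

```python
from dataclasses import dataclass, field


@dataclass
class UsageTracker:
    """Hypothetical accumulator for per-model call counts, tokens, and cost."""
    totals: dict[str, dict[str, float]] = field(default_factory=dict)

    def record(self, model: str, input_tokens: int, output_tokens: int, cost_usd: float) -> None:
        """Add one response's usage to the running totals for its model."""
        entry = self.totals.setdefault(
            model, {"calls": 0, "input_tokens": 0, "output_tokens": 0, "cost_usd": 0.0}
        )
        entry["calls"] += 1
        entry["input_tokens"] += input_tokens
        entry["output_tokens"] += output_tokens
        entry["cost_usd"] += cost_usd

    def total_cost(self) -> float:
        """Total spend in USD across all models."""
        return sum(entry["cost_usd"] for entry in self.totals.values())
```

With the wrapper above, you might call tracker.record(response.model, response.input_tokens, response.output_tokens, response.cost_usd) after each complete() call and print tracker.total_cost() at the end of a run.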
Section 7: Image and Vision
Both Claude and GPT-4o support image inputs.
import base64
from pathlib import Path
def encode_image(image_path: str) -> tuple[str, str]:
    """Encode image to base64 and detect media type."""
    path = Path(image_path)
    suffix = path.suffix.lower()
    media_types = {".jpg": "image/jpeg", ".jpeg": "image/jpeg",
                   ".png": "image/png", ".gif": "image/gif", ".webp": "image/webp"}
    media_type = media_types.get(suffix, "image/jpeg")
    with open(image_path, "rb") as f:
        data = base64.standard_b64encode(f.read()).decode("utf-8")
    return data, media_type
# Claude vision
def analyze_image_claude(client: anthropic.Anthropic, image_path: str, question: str) -> str:
    image_data, media_type = encode_image(image_path)
    response = client.messages.create(
        model="claude-sonnet-4-5",
        max_tokens=1024,
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "source": {
                            "type": "base64",
                            "media_type": media_type,
                            "data": image_data,
                        }
                    },
                    {"type": "text", "text": question}
                ]
            }
        ]
    )
    return response.content[0].text
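For GPT-4o the same base64 data travels in a different shape: the Chat Completions API takes the image as a data URL inside an image_url content part. A minimal sketch of the message construction, assuming a hypothetical helper named build_openai_image_message:

```python
def build_openai_image_message(image_b64: str, media_type: str, question: str) -> dict:
    """Build a Chat Completions user message carrying a base64 image as a data URL.

    Hypothetical helper; it only builds the dict and makes no API call.
    """
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {
                "type": "image_url",
                # Data URL format: data:<media type>;base64,<payload>
                "image_url": {"url": f"data:{media_type};base64,{image_b64}"},
            },
        ],
    }
```

The two values returned by encode_image map onto the first two arguments, and the resulting dict goes into the messages list of a client.chat.completions.create call with model="gpt-4o".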
Checkpoint Assessment
- Why must you send the entire conversation history with each API call?
- What is the difference between temperature=0 and temperature=1?
- When should you use streaming vs. standard completions?
- A production app is hitting rate limits under load. Name two architectural solutions.
- You need to call a weather API from within a Claude conversation. What API feature enables this, and sketch the call sequence in pseudocode.
- Estimate the cost of processing 10,000 documents, each with 500 tokens of content and generating 200 tokens of output, using Claude Sonnet.
Project: Multi-Provider Summarizer
Build a Python script that:
- Accepts a URL or file path as a command-line argument
- Extracts the text content (use requests + beautifulsoup4 for URLs)
- Sends it to Claude for a structured summary: title, key points (3–5), sentiment, reading time estimate
- Formats the output as clean markdown
- Logs token usage and cost to a CSV file for tracking
Bonus: Add a --compare flag that runs all three providers and outputs a side-by-side comparison.
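For the usage log, the standard library's csv module is enough. A minimal sketch of the logging step only, assuming a hypothetical helper log_usage and a column layout of our own choosing (the project does not prescribe one):

```python
import csv
from datetime import datetime, timezone
from pathlib import Path

# Assumed column layout; adjust to whatever you want to track.
FIELDS = ["timestamp", "model", "input_tokens", "output_tokens", "cost_usd"]


def log_usage(csv_path: str, model: str, input_tokens: int,
              output_tokens: int, cost_usd: float) -> None:
    """Append one usage row, writing the header first if the file is new or empty."""
    path = Path(csv_path)
    write_header = not path.exists() or path.stat().st_size == 0
    with path.open("a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow({
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "model": model,
            "input_tokens": input_tokens,
            "output_tokens": output_tokens,
            "cost_usd": f"{cost_usd:.6f}",
        })
```

Opening the file in append mode keeps earlier runs intact, so the CSV doubles as a running cost history.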
Key Vocabulary
| Term | Definition |
|---|---|
| Stateless API | Each call is independent; no memory of previous calls |
| Function calling / Tool use | LLM generating structured function call requests |
| Streaming | Receiving response tokens in real-time as they're generated |
| Rate limiting | API throttling to prevent abuse; requires backoff |
| Exponential backoff | Retry strategy doubling wait time between attempts |
| JSON mode | API feature forcing JSON output format |
| Vision / multimodal | Passing images alongside text in API calls |
What's Next
Stage 4 covers Retrieval-Augmented Generation (RAG) — how to give LLMs access to your own documents, databases, and real-time information using vector embeddings and semantic search.