The single highest-leverage cost optimization in any agentic system is not prompt compression or caching — it is choosing the right model for each task. Routing a task that requires simple text extraction to Claude 3 Opus is like hiring a senior architect to fill in a form. The work gets done, but you've massively overpaid. Conversely, routing complex multi-step reasoning to a budget model introduces errors that cost far more to fix than the token savings.
Treating model selection as an optimization discipline means two things: a systematic framework for mapping task characteristics to model capabilities, and routing logic that applies that framework automatically at scale.
The Model Tier Landscape
Modern LLM providers offer model lineups that map roughly onto three tiers:
Tier 1: Frontier/Flagship Models
High capability, high cost. Best for tasks requiring nuanced reasoning, complex code generation, ambiguous instructions, or creative synthesis.
| Provider | Model | Relative Cost Index |
|---|---|---|
| OpenAI | GPT-4o, o1 | 100x |
| Anthropic | Claude 3.5 Sonnet, Claude 3 Opus | 60x–200x |
| Google | Gemini 1.5 Pro | 47x |
| AWS Bedrock | Meta Llama 3.1 405B | 32x |
Tier 2: Balanced/Mid-Tier Models
Good capability at moderate cost. The sweet spot for most production tasks that are well-defined but non-trivial.
| Provider | Model | Relative Cost Index |
|---|---|---|
| OpenAI | GPT-4o mini, o3-mini | 4x–15x |
| Anthropic | Claude 3.5 Haiku | 11x |
| Google | Gemini 2.0 Flash | 1.3x |
| AWS Bedrock | Meta Llama 3.1 70B | 10x |
Tier 3: Economy/Edge Models
Low capability, very low cost. Best for classification, extraction, templated generation, and routing decisions themselves.
| Provider | Model | Relative Cost Index |
|---|---|---|
| OpenAI | GPT-3.5 Turbo | 2x |
| Anthropic | Claude 3 Haiku | 1x (baseline) |
| Google | Gemini 1.5 Flash | 1x |
| AWS Bedrock | Amazon Titan Text Lite | 0.3x |
The "relative cost index" above uses Claude 3 Haiku as a baseline (1x). These ratios illustrate the magnitude of the difference: choosing Claude 3 Opus over Claude 3 Haiku for a simple task is roughly a 60x cost decision ($15 vs. $0.25 per million input tokens).
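To make the gap concrete, the per-task cost difference can be computed directly from list prices. A minimal sketch; the prices below are illustrative assumptions, not live quotes:

```python
# Illustrative per-million-token input prices (assumptions, not current quotes).
PRICE_PER_M = {
    "claude-3-haiku": 0.25,   # Tier 3 baseline
    "claude-3-5-sonnet": 3.00,
    "claude-3-opus": 15.00,   # Tier 1
}

def task_cost(model: str, tokens: int) -> float:
    """Input-token cost of a single task on the given model."""
    return tokens / 1_000_000 * PRICE_PER_M[model]

# A 2,000-token extraction task on each tier:
cheap = task_cost("claude-3-haiku", 2_000)  # $0.0005
dear = task_cost("claude-3-opus", 2_000)    # $0.0300
print(f"cost ratio: {dear / cheap:.0f}x")   # cost ratio: 60x
```

Fractions of a cent either way, but multiplied across millions of tasks the ratio, not the absolute number, is what dominates the bill.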
Tip: When building a new agentic workflow, start by profiling your task mix. Categorize each task type the agent performs into Tier 1, 2, or 3 based on failure cost (how bad is a wrong answer?), reasoning complexity (how many inferential steps are required?), and ambiguity (how underspecified is the input?). Most workflows have 60–70% of tasks in Tier 3 and only 10–20% genuinely requiring Tier 1.
Task-to-Model Mapping Framework
Not all tasks are equal. Here is a practical mapping of common agentic subtasks to appropriate model tiers:
Tier 3 Tasks (Economy Models)
These tasks have well-defined inputs and outputs, low ambiguity, and a high tolerance for minor errors that can be caught downstream:
- Classification: "Is this ticket a bug or a feature request?" — Binary or categorical with clear criteria
- Data extraction: "Extract the date, amount, and vendor from this invoice" — Structured extraction from consistent formats
- Routing decisions: "Which tool should handle this request?" — Decision trees with defined options
- Format conversion: "Convert this JSON to CSV" — Deterministic transformation
- Simple summarization: "List the three main points of this paragraph" — Low-stakes, short-form
- Keyword extraction: "Extract all technical terms from this text"
- Template filling: "Fill this email template with these values"
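A common trait of these tasks is that the output can be validated cheaply downstream. For example, a ticket-classification prompt can constrain the economy model to a fixed label set so responses are checkable with an exact string match. A minimal sketch; the category list and prompt wording are illustrative assumptions:

```python
CATEGORIES = ["bug", "feature_request", "question", "other"]

def build_classification_prompt(ticket_text: str) -> str:
    """Constrain an economy model to one known label so the
    response can be validated with an exact string match."""
    return (
        "Classify the support ticket into exactly one category.\n"
        f"Categories: {', '.join(CATEGORIES)}\n"
        "Respond with only the category name, nothing else.\n\n"
        f"Ticket: {ticket_text}"
    )

def validate_label(response: str) -> bool:
    """Cheap downstream check that catches most economy-model slips."""
    return response.strip().lower() in CATEGORIES

print(validate_label("bug"))          # True
print(validate_label("It's a bug."))  # False -> retry or escalate
```

When validation fails, the task is a natural candidate for the escalation patterns discussed later.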
Tier 2 Tasks (Mid-Tier Models)
Moderate complexity with defined success criteria but requiring real language understanding:
- Code review: "Identify bugs and suggest improvements in this function"
- Test generation: "Write unit tests for this API endpoint"
- Documentation writing: "Generate API documentation from this code"
- Structured reasoning: "Analyze this bug report and identify the root cause"
- Multi-step tool orchestration: Standard ReAct loops with well-defined tools
- Medium-length summarization: Summarizing a 20-page report
- Requirements clarification: "Turn this user story into acceptance criteria"
Tier 1 Tasks (Frontier Models)
High complexity, high stakes, or significant ambiguity:
- Architecture decisions: "Design the data model for this system given these constraints"
- Novel problem solving: Tasks with no clear prior solution
- Complex debugging: Multi-system failures with incomplete information
- Code generation for new systems: Building something from scratch with many unknowns
- Ambiguous instruction interpretation: When the user input is vague and stakes are high
- Strategic analysis: Business or technical analysis requiring synthesis across many factors
- Long-horizon reasoning: Tasks requiring 20+ step planning
Tip: Build a "task taxonomy" document for your specific domain and share it across your engineering, QA, and PM teams. When everyone uses the same classification vocabulary, routing decisions become team-level conventions rather than individual judgment calls. This prevents the common pattern of engineers defaulting to the most capable model "just to be safe."
Cascading and Fallback Patterns
A single-tier routing decision is limiting. Cascading patterns — where you try a cheaper model first and escalate only on failure — can achieve frontier-model reliability at mid-tier costs.
Confidence-Based Cascading
async def cascade_request(prompt: str, task_type: str) -> str:
# First attempt with economy model
economy_response = await call_model(
model="claude-3-haiku-20240307",
prompt=prompt,
max_tokens=500
)
# Evaluate confidence using a fast classifier
confidence = await evaluate_confidence(economy_response, task_type)
if confidence >= 0.85:
return economy_response.content
# Escalate to mid-tier
mid_response = await call_model(
model="claude-3-5-haiku-20241022",
prompt=prompt,
max_tokens=500
)
confidence = await evaluate_confidence(mid_response, task_type)
if confidence >= 0.85:
return mid_response.content
# Final escalation to frontier
frontier_response = await call_model(
model="claude-3-5-sonnet-20241022",
prompt=prompt,
max_tokens=1000
)
return frontier_response.content
The key is the confidence evaluator. It can itself be a cheap model, or even deterministic code, that checks whether the response matches the output schema, contains required fields, or passes basic sanity checks.
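A minimal evaluator along those lines, using purely deterministic checks and no model call at all. The `invoice_extraction` schema and the fallback heuristics are hypothetical examples:

```python
import json

# Hypothetical per-task-type required fields.
REQUIRED_FIELDS = {
    "invoice_extraction": {"date", "amount", "vendor"},
}

async def evaluate_confidence(response, task_type: str) -> float:
    """Score a response 0.0-1.0 with cheap structural checks."""
    text = response.content.strip()
    if not text:
        return 0.0
    required = REQUIRED_FIELDS.get(task_type)
    if required:
        try:
            payload = json.loads(text)
        except json.JSONDecodeError:
            return 0.0
        # Fraction of required fields actually present.
        return len(required & payload.keys()) / len(required)
    # Fallback sanity check: non-empty and not a refusal.
    return 0.2 if text.lower().startswith(("i'm sorry", "i cannot")) else 0.9
```

For schema-style tasks this costs nothing per call, which is exactly what makes the cascade's economics work.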
Cost Profile of Cascading
Assume 70% of tasks succeed at the economy tier, 20% stop at the mid tier, and 10% escalate all the way to the frontier. Escalated tasks also pay for their failed lower-tier attempts, so the economy tier processes all 100 tasks and the mid tier processes 30:
Economy tier cost: 100 tasks × 2,000 tokens × $0.25/1M = $0.05
Mid-tier cost: 30 tasks × 2,000 tokens × $0.80/1M = $0.048
Frontier cost: 10 tasks × 2,000 tokens × $3.00/1M = $0.06
Total: $0.158
vs. all-frontier: 100 tasks × 2,000 tokens × $3.00/1M = $0.60
Savings: roughly 74% cost reduction
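This can be reproduced with a short calculator. Escalated tasks pay for every tier they pass through, so the mid tier sees 30 calls rather than 20; the per-million-token prices are the same illustrative assumptions as above:

```python
# (tier name, price per 1M input tokens, tasks that *reach* this tier).
# Escalated tasks pay for each tier they pass through.
TIERS = [
    ("economy", 0.25, 100),  # all tasks start here
    ("mid", 0.80, 30),       # 20 succeed here + 10 that escalate further
    ("frontier", 3.00, 10),
]
TOKENS_PER_TASK = 2_000

cascade_cost = sum(n * TOKENS_PER_TASK * price / 1e6 for _, price, n in TIERS)
all_frontier = 100 * TOKENS_PER_TASK * 3.00 / 1e6
print(f"cascade: ${cascade_cost:.3f}")                    # cascade: $0.158
print(f"all-frontier: ${all_frontier:.2f}")               # all-frontier: $0.60
print(f"savings: {1 - cascade_cost / all_frontier:.0%}")  # savings: 74%
```

This also omits the confidence-evaluator cost; with deterministic checks that term is effectively zero.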
Tip: Instrument your cascade to log which tier handled each task and the confidence score. After two weeks of production data, you'll have empirical evidence for tuning your confidence thresholds — often you can raise the economy tier threshold to capture more tasks there without meaningful quality loss.
Specialization vs. Generalization Tradeoffs
Not all model selection decisions are about cost tiers. Some decisions involve specialized models vs. general-purpose flagships.
Fine-Tuned Models
Fine-tuned models (available on OpenAI and AWS Bedrock) can match or exceed frontier model performance on narrow tasks at Tier 2 prices. If your agentic workflow repeatedly performs the same task type — say, classifying customer support tickets into 50 categories — a fine-tuned GPT-4o mini can outperform base GPT-4o at a small fraction of the cost.
Fine-tuning investment threshold: Fine-tuning is worth considering when:
- You have 500+ high-quality labeled examples
- The task is narrow and well-defined
- You run the task >1,000 times/month
- Current accuracy is a bottleneck
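The volume threshold can be turned into a rough break-even check. A sketch; the one-time tuning cost and per-call prices below are hypothetical placeholders, not quotes:

```python
def months_to_break_even(
    tuning_cost: float,             # one-time fine-tuning + labeling spend (hypothetical)
    calls_per_month: int,
    cost_per_call_frontier: float,
    cost_per_call_tuned: float,
) -> float:
    """Months until per-call savings repay the one-time tuning cost."""
    monthly_savings = calls_per_month * (cost_per_call_frontier - cost_per_call_tuned)
    if monthly_savings <= 0:
        return float("inf")  # tuned model isn't cheaper; never breaks even
    return tuning_cost / monthly_savings

# 50,000 classifications/month, $0.006/call on frontier vs $0.0004 fine-tuned,
# $500 assumed one-time tuning and labeling cost:
print(months_to_break_even(500, 50_000, 0.006, 0.0004))  # ~1.8 months
```

At low volumes the same arithmetic flips: a few hundred calls a month can take years to repay the labeling effort.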
Embedding Models
Many agentic tasks involve retrieval (RAG). Using an LLM for embedding is a costly mistake — dedicated embedding models are 100x cheaper and faster:
| Model | Cost per 1M tokens | Use case |
|---|---|---|
| OpenAI text-embedding-3-small | $0.02 | General RAG |
| OpenAI text-embedding-3-large | $0.13 | High-accuracy RAG |
| AWS Bedrock Titan Embed | $0.10 | AWS-native RAG |
| Google Vertex textembedding-gecko | $0.025 | Google Cloud RAG |
Never use a generative model for embeddings in production.
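The division of labor is: embedding model for retrieval, generative model only for the final answer. A minimal cosine-similarity ranking sketch over precomputed vectors; the toy 3-dimensional vectors stand in for real embedding-API output:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k(query_vec: list[float], doc_vecs: dict, k: int = 2) -> list[str]:
    """Rank document IDs by cosine similarity to the query embedding."""
    ranked = sorted(doc_vecs, key=lambda d: cosine(query_vec, doc_vecs[d]), reverse=True)
    return ranked[:k]

# Toy "embeddings"; real ones come from a dedicated embedding model.
docs = {"refunds": [0.9, 0.1, 0.0], "shipping": [0.1, 0.9, 0.0], "api": [0.0, 0.1, 0.9]}
print(top_k([0.8, 0.2, 0.0], docs))  # ['refunds', 'shipping']
```

In production the similarity search lives in a vector store, but the cost structure is the same: embedding and lookup are cheap; only the final generation call touches an expensive model.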
Reasoning Models (o1, o3)
OpenAI's o-series and similar "thinking" models (like extended thinking mode in Claude) perform internal chain-of-thought reasoning before outputting. They are dramatically more expensive but excel at tasks that trip up standard models: complex math, multi-step logical deduction, and algorithm design.
Use reasoning models when:
- The task has failed on standard models with careful prompting
- You can afford 5–10x the standard model cost
- Latency is acceptable (reasoning models are slower)
Tip: Before escalating to a reasoning model, try adding a simple "think step by step and verify your answer before responding" instruction to a Tier 2 model. This pseudo-chain-of-thought captures 60–70% of the quality improvement at zero extra cost.
Routing Architecture for Multi-Model Systems
In production agentic systems, model selection is implemented as a routing layer. Here is a practical architecture:
from enum import Enum
from dataclasses import dataclass
class TaskComplexity(Enum):
SIMPLE = "simple" # extraction, classification, templating
MODERATE = "moderate" # code review, test gen, structured reasoning
COMPLEX = "complex" # architecture, novel problems, ambiguous
CRITICAL = "critical" # high-stakes decisions, irreversible actions
@dataclass
class ModelConfig:
provider: str
model_id: str
max_tokens: int
temperature: float
ROUTING_TABLE = {
TaskComplexity.SIMPLE: ModelConfig(
provider="anthropic",
model_id="claude-3-haiku-20240307",
max_tokens=512,
temperature=0.0
),
TaskComplexity.MODERATE: ModelConfig(
provider="anthropic",
model_id="claude-3-5-haiku-20241022",
max_tokens=2048,
temperature=0.1
),
TaskComplexity.COMPLEX: ModelConfig(
provider="anthropic",
model_id="claude-3-5-sonnet-20241022",
max_tokens=4096,
temperature=0.2
),
TaskComplexity.CRITICAL: ModelConfig(
provider="anthropic",
model_id="claude-3-5-sonnet-20241022",
max_tokens=8192,
temperature=0.1
),
}
def classify_task(task_description: str, context: dict) -> TaskComplexity:
"""
Rule-based + ML hybrid classifier.
Start with rules, add ML classification once you have labeled data.
"""
# Rule-based signals
if context.get("is_irreversible_action"):
return TaskComplexity.CRITICAL
if any(kw in task_description.lower() for kw in
["extract", "classify", "format", "convert", "list"]):
return TaskComplexity.SIMPLE
if any(kw in task_description.lower() for kw in
["review", "test", "document", "analyze", "debug"]):
return TaskComplexity.MODERATE
if any(kw in task_description.lower() for kw in
["design", "architect", "novel", "complex", "unknown"]):
return TaskComplexity.COMPLEX
# Default to moderate if uncertain
return TaskComplexity.MODERATE
def route_to_model(task: str, context: dict) -> ModelConfig:
complexity = classify_task(task, context)
return ROUTING_TABLE[complexity]
Tip: Log every routing decision with the task description, assigned complexity, model used, and outcome quality score. After 500+ samples, use this data to train a small binary classifier (logistic regression or a fine-tuned economy model) to replace the rule-based classifier. This creates a self-improving routing system.
Measuring Model Selection Effectiveness
Model selection optimization should be measured, not just implemented. Track these metrics:
| Metric | Definition | Target |
|---|---|---|
| Cost per successful task | Total tokens / successful completions | Decreasing MoM |
| Economy tier capture rate | % tasks handled by Tier 3 models | >60% |
| Escalation rate | % tasks that cascade to higher tiers | <25% |
| Quality parity score | Quality of routed output vs. all-frontier baseline | >95% |
| Misrouting rate | % tasks where routing was incorrect | <5% |
A good model selection framework should deliver 50–70% cost savings vs. all-frontier routing while maintaining quality parity above 95% of baseline.
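Given per-task routing logs, the capture and escalation metrics fall out of simple counting. A sketch; the record shape is a hypothetical example:

```python
def routing_metrics(records: list[dict]) -> dict:
    """Compute tier-capture and escalation rates from routing logs.
    Each record: {"tier_used": "economy"|"mid"|"frontier", "escalated": bool}."""
    total = len(records)
    economy = sum(r["tier_used"] == "economy" for r in records)
    escalated = sum(r["escalated"] for r in records)
    return {
        "economy_capture_rate": economy / total,
        "escalation_rate": escalated / total,
    }

# The 70/20/10 distribution from the cascading example:
sample = (
    [{"tier_used": "economy", "escalated": False}] * 70
    + [{"tier_used": "mid", "escalated": True}] * 20
    + [{"tier_used": "frontier", "escalated": True}] * 10
)
print(routing_metrics(sample))  # {'economy_capture_rate': 0.7, 'escalation_rate': 0.3}
```

Against the targets in the table above, this example passes the capture-rate bar but sits above the 25% escalation target, which is exactly the kind of signal that prompts threshold tuning.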
Tip: Run a monthly A/B experiment where 5% of traffic is sent to the next-tier-up model for comparison. This gives you continuous ground truth about whether your economy and mid-tier models are still adequate, since model quality shifts with version updates and your task distribution changes over time.
Summary
Model selection is the highest-ROI optimization in any agentic cost strategy. By mapping task characteristics systematically to model tiers, implementing cascading fallback patterns, and using specialized models (fine-tuned, embedding, reasoning) where appropriate, teams can typically achieve 50–75% cost reduction with minimal quality loss. The key is building routing infrastructure that applies these decisions consistently, automatically, and measurably — not relying on individual developers to make the right model choice in each feature.