
The single highest-leverage cost optimization in any agentic system is not prompt compression or caching — it is choosing the right model for each task. Routing a task that requires simple text extraction to Claude 3 Opus is like hiring a senior architect to fill in a form. The work gets done, but you've massively overpaid. Conversely, routing complex multi-step reasoning to a budget model introduces errors that cost far more to fix than the token savings.

Treating model selection as an optimization discipline means two things: a systematic framework for mapping task characteristics to model capabilities, and routing logic that applies that framework automatically at scale.


The Model Tier Landscape

Modern LLM providers offer model lineups that map roughly to three tiers:

Tier 1: Frontier/Flagship Models

High capability, high cost. Best for tasks requiring nuanced reasoning, complex code generation, ambiguous instructions, or creative synthesis.

Provider      Model                               Relative Cost Index
OpenAI        GPT-4o, o1                          100x
Anthropic     Claude 3.5 Sonnet, Claude 3 Opus    60x–200x
Google        Gemini 1.5 Pro                      47x
AWS Bedrock   Meta Llama 3.1 405B                 32x

Tier 2: Balanced/Mid-Tier Models

Good capability at moderate cost. The sweet spot for most production tasks that are well-defined but non-trivial.

Provider      Model                    Relative Cost Index
OpenAI        GPT-4o mini, o3-mini     4x–15x
Anthropic     Claude 3.5 Haiku         11x
Google        Gemini 2.0 Flash         1.3x
AWS Bedrock   Meta Llama 3.1 70B       10x

Tier 3: Economy/Edge Models

Low capability, very low cost. Best for classification, extraction, templated generation, and routing decisions themselves.

Provider      Model                    Relative Cost Index
OpenAI        GPT-3.5 Turbo            2x
Anthropic    Claude 3 Haiku            1x (baseline)
Google        Gemini 1.5 Flash         1x
AWS Bedrock   Amazon Titan Text Lite   0.3x

The "relative cost index" above uses Claude 3 Haiku as a baseline (1x). These ratios illustrate the magnitude of difference — choosing Claude 3 Opus over Claude 3 Haiku for a simple task is a 75x cost decision.

Tip: When building a new agentic workflow, start by profiling your task mix. Categorize each task type the agent performs into Tier 1, 2, or 3 based on failure cost (how bad is a wrong answer?), reasoning complexity (how many inferential steps are required?), and ambiguity (how underspecified is the input?). Most workflows have 60–70% of tasks in Tier 3 and only 10–20% genuinely requiring Tier 1.
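
The profiling pass described in the tip can be as mechanical as scoring each task type on those three dimensions and summing. The sketch below is a minimal illustration; the scores, cut-offs, and task names are placeholders to adapt to your own domain.

def assign_tier(failure_cost: int, reasoning_complexity: int, ambiguity: int) -> int:
    """Map a task type's profile (each dimension scored 1-3) to a model tier.
    Returns 1 (frontier), 2 (mid-tier), or 3 (economy)."""
    score = failure_cost + reasoning_complexity + ambiguity  # ranges from 3 to 9
    if score >= 7:
        return 1  # high stakes, deep reasoning, or vague inputs
    if score >= 5:
        return 2  # well-defined but non-trivial
    return 3      # simple, low-risk, verifiable downstream

# Hypothetical task mix for a support-triage agent
task_profile = {
    "classify_ticket":  assign_tier(1, 1, 1),  # -> 3
    "summarize_thread": assign_tier(1, 2, 1),  # -> 3
    "draft_kb_article": assign_tier(2, 2, 2),  # -> 2
    "refund_decision":  assign_tier(3, 2, 2),  # -> 1
}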


Task-to-Model Mapping Framework

Not all tasks are equal. Here is a practical mapping of common agentic subtasks to appropriate model tiers:

Tier 3 Tasks (Economy Models)

These tasks have well-defined inputs and outputs, low ambiguity, and a high tolerance for minor errors that can be caught downstream:

  • Classification: "Is this ticket a bug or a feature request?" — Binary or categorical with clear criteria
  • Data extraction: "Extract the date, amount, and vendor from this invoice" — Structured extraction from consistent formats
  • Routing decisions: "Which tool should handle this request?" — Decision trees with defined options
  • Format conversion: "Convert this JSON to CSV" — Deterministic transformation
  • Simple summarization: "List the three main points of this paragraph" — Low-stakes, short-form
  • Keyword extraction: "Extract all technical terms from this text"
  • Template filling: "Fill this email template with these values"

Tier 2 Tasks (Mid-Tier Models)

Moderate complexity with defined success criteria but requiring real language understanding:

  • Code review: "Identify bugs and suggest improvements in this function"
  • Test generation: "Write unit tests for this API endpoint"
  • Documentation writing: "Generate API documentation from this code"
  • Structured reasoning: "Analyze this bug report and identify the root cause"
  • Multi-step tool orchestration: Standard ReAct loops with well-defined tools
  • Medium-length summarization: Summarizing a 20-page report
  • Requirements clarification: "Turn this user story into acceptance criteria"

Tier 1 Tasks (Frontier Models)

High complexity, high stakes, or significant ambiguity:

  • Architecture decisions: "Design the data model for this system given these constraints"
  • Novel problem solving: Tasks with no clear prior solution
  • Complex debugging: Multi-system failures with incomplete information
  • Code generation for new systems: Building something from scratch with many unknowns
  • Ambiguous instruction interpretation: When the user input is vague and stakes are high
  • Strategic analysis: Business or technical analysis requiring synthesis across many factors
  • Long-horizon reasoning: Tasks requiring 20+ step planning

Tip: Build a "task taxonomy" document for your specific domain and share it across your engineering, QA, and PM teams. When everyone uses the same classification vocabulary, routing decisions become team-level conventions rather than individual judgment calls. This prevents the common pattern of engineers defaulting to the most capable model "just to be safe."


Cascading and Fallback Patterns

A single-tier routing decision is limiting. Cascading patterns — where you try a cheaper model first and escalate only on failure — can achieve frontier-model reliability at mid-tier costs.

Confidence-Based Cascading

async def cascade_request(prompt: str, task_type: str) -> str:
    """Try cheaper models first and escalate only when the confidence check fails.
    Assumes `call_model` and `evaluate_confidence` are provided by the surrounding system."""
    # First attempt with the economy model
    economy_response = await call_model(
        model="claude-3-haiku-20240307",
        prompt=prompt,
        max_tokens=500
    )

    # Evaluate confidence using a fast classifier
    confidence = await evaluate_confidence(economy_response, task_type)

    if confidence >= 0.85:
        return economy_response.content

    # Escalate to mid-tier
    mid_response = await call_model(
        model="claude-3-5-haiku-20241022",
        prompt=prompt,
        max_tokens=500
    )

    confidence = await evaluate_confidence(mid_response, task_type)

    if confidence >= 0.85:
        return mid_response.content

    # Final escalation to frontier
    frontier_response = await call_model(
        model="claude-3-5-sonnet-20241022",
        prompt=prompt,
        max_tokens=1000
    )

    return frontier_response.content

The key is the confidence evaluator — this can itself be a simple/cheap model that checks whether the response meets the output schema, contains required fields, or passes basic sanity checks.
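
A first version of that evaluator can be purely mechanical, with no model call at all. The sketch below assumes the response object exposes a .content string and that each task type has a known list of required output fields; both the field registry and the scoring rule are assumptions, not part of the cascade above.

import json

# Hypothetical per-task-type registry of required output fields
REQUIRED_FIELDS = {
    "invoice_extraction": ["date", "amount", "vendor"],
    "ticket_classification": ["category"],
}

async def evaluate_confidence(response, task_type: str) -> float:
    """Cheap mechanical check: does the output parse as JSON and contain the
    fields this task type requires? Returns a score in [0, 1]."""
    try:
        payload = json.loads(response.content)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return 0.0  # not valid JSON at all: escalate
    if not isinstance(payload, dict):
        return 0.0  # parsed, but not the expected object shape
    required = REQUIRED_FIELDS.get(task_type, [])
    if not required:
        return 0.5  # unknown task type: below threshold, forces escalation
    present = sum(1 for field in required if payload.get(field) not in (None, ""))
    return present / len(required)

Only when mechanical checks like these cannot separate good output from bad is it worth paying for a model-based grader at this step.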

Cost Profile of Cascading

Assume 70% of tasks succeed at economy tier, 20% escalate to mid-tier, 10% escalate to frontier:

Economy tier cost:  100 tasks × 2,000 tokens × $0.25/1M = $0.05
Mid-tier cost:       30 tasks × 2,000 tokens × $0.80/1M = $0.048
                     (the 10 tasks that eventually reach the frontier also pass through the mid-tier attempt)
Frontier cost:       10 tasks × 2,000 tokens × $3.00/1M = $0.06

Total: ~$0.16

vs. all-frontier: 100 tasks × 2,000 tokens × $3.00/1M = $0.60

Savings: ~74% cost reduction
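
Generalizing that arithmetic into a small helper makes it easy to re-run the estimate as your escalation rates or prices change. The rates and per-million-token prices below are the illustrative figures from the example, not measured values.

def cascade_cost(n_tasks: int, tokens_per_task: int,
                 frac_mid: float, frac_frontier: float,
                 price_economy: float, price_mid: float, price_frontier: float) -> float:
    """Expected spend (USD) for a sequential cascade. frac_mid / frac_frontier are the
    fractions of tasks resolved at the mid and frontier tiers; prices are $ per 1M tokens."""
    mtok_per_pass = n_tasks * tokens_per_task / 1_000_000
    return (mtok_per_pass * price_economy                             # every task tries economy first
            + mtok_per_pass * (frac_mid + frac_frontier) * price_mid  # escalated tasks also pay for a mid-tier attempt
            + mtok_per_pass * frac_frontier * price_frontier)         # only the hardest reach the frontier

# Figures from the example above: 70% / 20% / 10% split, 2,000 tokens per attempt
print(cascade_cost(100, 2_000, 0.20, 0.10, 0.25, 0.80, 3.00))  # ~0.158
print(100 * 2_000 / 1_000_000 * 3.00)                          # all-frontier baseline: 0.60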

Tip: Instrument your cascade to log which tier handled each task and the confidence score. After two weeks of production data, you'll have empirical evidence for tuning your confidence thresholds: often you can lower the economy tier's confidence threshold to keep more tasks there without meaningful quality loss.


Specialization vs. Generalization Tradeoffs

Not all model selection decisions are about cost tiers. Some decisions involve specialized models vs. general-purpose flagships.

Fine-Tuned Models

Fine-tuned models (available on OpenAI and AWS Bedrock) can match or exceed frontier model performance on narrow tasks at Tier 2 prices. If your agentic workflow repeatedly performs the same task type — say, classifying customer support tickets into 50 categories — a fine-tuned GPT-4o mini can outperform base GPT-4o at a small fraction of the cost.

Fine-tuning investment threshold: it is worth considering when all of the following hold (a rough break-even sketch follows the list):
- You have 500+ high-quality labeled examples
- The task is narrow and well-defined
- You run the task >1,000 times/month
- Current accuracy is a bottleneck
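
A quick way to sanity-check the volume criterion is a break-even estimate: one-time tuning cost against per-task savings at your monthly volume. Every figure below is a placeholder; substitute your provider's current fine-tuning and inference prices.

# Hypothetical break-even estimate for replacing a frontier model with a
# fine-tuned mid-tier model on one narrow task. All prices are placeholders.
TUNING_COST_USD = 150.0          # assumed one-time training run
FRONTIER_COST_PER_TASK = 0.006   # e.g., 2,000 tokens at $3.00/1M (from the earlier example)
TUNED_COST_PER_TASK = 0.0016     # assumed 2,000 tokens at $0.80/1M
TASKS_PER_MONTH = 50_000

monthly_savings = TASKS_PER_MONTH * (FRONTIER_COST_PER_TASK - TUNED_COST_PER_TASK)
breakeven_months = TUNING_COST_USD / monthly_savings
print(f"Saves ${monthly_savings:.0f}/month; pays for itself in {breakeven_months:.1f} months")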

Embedding Models

Many agentic tasks involve retrieval (RAG). Using a generative model where a dedicated embedding model would do is a costly mistake; embedding models are on the order of 100x cheaper and faster:

Model                               Cost per 1M tokens   Use case
OpenAI text-embedding-3-small       $0.02                General RAG
OpenAI text-embedding-3-large       $0.13                High-accuracy RAG
AWS Bedrock Titan Embed             $0.10                AWS-native RAG
Google Vertex textembedding-gecko   $0.025               Google Cloud RAG

Never use a generative model for embeddings in production.
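
For illustration, indexing document chunks with one of the dedicated models from the table; this sketch assumes the openai Python SDK (v1+) and an OPENAI_API_KEY in the environment.

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Embed chunks once at index time with a cheap dedicated model; reserve
# generative models for answering over the retrieved chunks.
chunks = ["Refund policy: items may be returned within 30 days...",
          "Shipping: orders over $50 ship free..."]
response = client.embeddings.create(model="text-embedding-3-small", input=chunks)
vectors = [item.embedding for item in response.data]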

Reasoning Models (o1, o3)

OpenAI's o-series and similar "thinking" models (like extended thinking mode in Claude) perform internal chain-of-thought reasoning before outputting. They are dramatically more expensive but excel at tasks that trip up standard models: complex math, multi-step logical deduction, and algorithm design.

Use reasoning models when:
- The task has failed on standard models with careful prompting
- You can afford 5–10x the standard model cost
- Latency is acceptable (reasoning models are slower)

Tip: Before escalating to a reasoning model, try adding a simple "think step by step and verify your answer before responding" instruction to a Tier 2 model. This pseudo-chain-of-thought often captures 60–70% of the quality improvement at negligible extra cost.


Routing Architecture for Multi-Model Systems

In production agentic systems, model selection is implemented as a routing layer. Here is a practical architecture:

from enum import Enum
from dataclasses import dataclass

class TaskComplexity(Enum):
    SIMPLE = "simple"          # extraction, classification, templating
    MODERATE = "moderate"      # code review, test gen, structured reasoning
    COMPLEX = "complex"        # architecture, novel problems, ambiguous
    CRITICAL = "critical"      # high-stakes decisions, irreversible actions

@dataclass
class ModelConfig:
    provider: str
    model_id: str
    max_tokens: int
    temperature: float

ROUTING_TABLE = {
    TaskComplexity.SIMPLE: ModelConfig(
        provider="anthropic",
        model_id="claude-3-haiku-20240307",
        max_tokens=512,
        temperature=0.0
    ),
    TaskComplexity.MODERATE: ModelConfig(
        provider="anthropic",
        model_id="claude-3-5-haiku-20241022",
        max_tokens=2048,
        temperature=0.1
    ),
    TaskComplexity.COMPLEX: ModelConfig(
        provider="anthropic",
        model_id="claude-3-5-sonnet-20241022",
        max_tokens=4096,
        temperature=0.2
    ),
    TaskComplexity.CRITICAL: ModelConfig(
        provider="anthropic",
        model_id="claude-3-5-sonnet-20241022",
        max_tokens=8192,
        temperature=0.1
    ),
}

def classify_task(task_description: str, context: dict) -> TaskComplexity:
    """
    Rule-based + ML hybrid classifier.
    Start with rules, add ML classification once you have labeled data.
    """
    # Rule-based signals
    if context.get("is_irreversible_action"):
        return TaskComplexity.CRITICAL

    if any(kw in task_description.lower() for kw in 
           ["extract", "classify", "format", "convert", "list"]):
        return TaskComplexity.SIMPLE

    if any(kw in task_description.lower() for kw in 
           ["review", "test", "document", "analyze", "debug"]):
        return TaskComplexity.MODERATE

    if any(kw in task_description.lower() for kw in 
           ["design", "architect", "novel", "complex", "unknown"]):
        return TaskComplexity.COMPLEX

    # Default to moderate if uncertain
    return TaskComplexity.MODERATE

def route_to_model(task: str, context: dict) -> ModelConfig:
    complexity = classify_task(task, context)
    return ROUTING_TABLE[complexity]
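
Using the router from agent code is then a single call per step (the task string and context here are hypothetical):

config = route_to_model(
    "Extract the date, amount, and vendor from this invoice",
    context={"is_irreversible_action": False},
)
print(config.model_id)  # claude-3-haiku-20240307, via the "extract" rule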

Tip: Log every routing decision with the task description, assigned complexity, model used, and outcome quality score. After 500+ samples, use this data to train a small binary classifier (logistic regression or a fine-tuned economy model) to replace the rule-based classifier. This creates a self-improving routing system.
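
A minimal shape for those log records might look like the following; the field names and the JSONL sink are assumptions, and any structured store works as well.

import json
from dataclasses import asdict, dataclass
from typing import Optional

@dataclass
class RoutingLogRecord:
    task_description: str
    assigned_complexity: str
    model_id: str
    outcome_quality: Optional[float]  # filled in later by evals or human review
    timestamp: float

def log_routing_decision(record: RoutingLogRecord, path: str = "routing_log.jsonl") -> None:
    # Append-only JSONL keeps the data trivial to load for later analysis or classifier training.
    with open(path, "a") as f:
        f.write(json.dumps(asdict(record)) + "\n")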


Measuring Model Selection Effectiveness

Model selection optimization should be measured, not just implemented. Track these metrics:

Metric                      Definition                                           Target
Cost per successful task    Total model spend / successful completions           Decreasing MoM
Economy tier capture rate   % of tasks handled by Tier 3 models                  >60%
Escalation rate             % of tasks that cascade to higher tiers              <25%
Quality parity score        Quality of routed output vs. all-frontier baseline   >95%
Misrouting rate             % of tasks where routing was incorrect               <5%

A good model selection framework should deliver 50–70% cost savings vs. all-frontier routing while maintaining quality parity above 95% of baseline.
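
The first three metrics fall out of a simple aggregation over the routing log. The sketch below assumes each record carries the tier that finally handled the task, the spend in dollars, and a success flag; those field names are assumptions to adapt to your own logging schema.

import json

def selection_metrics(path: str = "routing_log.jsonl") -> dict:
    """Aggregate cost and capture metrics from a JSONL routing log."""
    with open(path) as f:
        records = [json.loads(line) for line in f]
    n = max(len(records), 1)
    successes = [r for r in records if r["success"]]
    total_spend = sum(r["cost_usd"] for r in records)
    return {
        "cost_per_successful_task": total_spend / max(len(successes), 1),
        "economy_capture_rate": sum(r["final_tier"] == "economy" for r in records) / n,
        "escalation_rate": sum(r["final_tier"] != "economy" for r in records) / n,
    }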

Tip: Run a monthly A/B experiment where 5% of traffic is sent to the next-tier-up model for comparison. This gives you continuous ground truth about whether your economy and mid-tier models are still adequate, since model quality shifts with version updates and your task distribution changes over time.


Summary

Model selection is the highest-ROI optimization in any agentic cost strategy. By mapping task characteristics systematically to model tiers, implementing cascading fallback patterns, and using specialized models (fine-tuned, embedding, reasoning) where appropriate, teams can typically achieve 50–75% cost reduction with minimal quality loss. The key is building routing infrastructure that applies these decisions consistently, automatically, and measurably — not relying on individual developers to make the right model choice in each feature.