Tokens are the fundamental unit of computation for every large language model. They are not words, not characters, and not bytes — they are variable-length chunks of text that a model's vocabulary maps to integer IDs. Understanding tokens at a mechanical level is the prerequisite to optimizing everything else in an agentic system. This topic goes deep: from byte-pair encoding to how token consumption drives inference latency and billing.
How Tokenization Works — From Text to Integer IDs
When you send a prompt to an LLM, the very first operation the system performs is tokenization: splitting your raw text string into a sequence of tokens and mapping each token to an integer in the model's vocabulary. This integer sequence — not the original text — is what the model actually processes.
Most modern LLMs use Byte-Pair Encoding (BPE) or a variant of it (SentencePiece, WordPiece). The algorithm works like this:
- Start with a base vocabulary of individual bytes (256 entries, one for every possible byte value).
- Scan the training corpus and find the most frequent pair of adjacent symbols.
- Merge that pair into a new symbol and add it to the vocabulary.
- Repeat until the vocabulary reaches the target size (GPT-4 uses ~100,000 tokens; Claude's tokenizer is similar in scale).
The result is a vocabulary where common English words are single tokens (" the", including its leading space, is one token), rare words are split across multiple tokens (" pneumonoultramicroscopic" → 5+ tokens), and non-English text is tokenized less efficiently because the training corpus was English-dominant.
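As a rough illustration of the merge loop described above, here is a toy character-level sketch (illustrative only; real tokenizers start from raw bytes, pre-split text, and train on far larger corpora):

from collections import Counter

def toy_bpe_merges(corpus: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn a few BPE-style merges from a tiny corpus (character-level toy)."""
    symbols = list(corpus)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))   # count adjacent symbol pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]            # most frequent pair wins
        merges.append(best)
        merged, i = [], 0
        while i < len(symbols):                      # replace every occurrence of that pair
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges

print(toy_bpe_merges("the theme of the thesis", 4))  # prints the learned merges in order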
Concrete example — the same meaning, very different token counts:
English: "Please summarize this document." → 6 tokens
Spanish: "Por favor resume este documento." → 9 tokens
Chinese: "请总结这份文件。" → 11 tokens (each CJK char ≈ 1.5 tokens avg)
This has a direct implication: if your agentic system processes multilingual content, your effective context budget shrinks when handling non-English text, even if the semantic density is identical.
Whitespace, punctuation, and casing matter:
"Hello world" → 2 tokens (" Hello", " world")
"hello world" → 2 tokens (" hello", " world")
"HELLO WORLD" → 4 tokens (" HE", "LLO", " WOR", "LD") — all-caps is token-expensive
"Hello world" → 3 tokens (" Hello", " ", " world") — double space = extra token
Tip: Normalize your inputs before sending them to the model. Strip redundant whitespace, avoid ALL-CAPS text in prompts, and standardize punctuation. For code-heavy prompts, be aware that variable names with underscores (user_account_id) often tokenize as multiple tokens whereas camelCase (userAccountId) may tokenize more compactly depending on the tokenizer.
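A minimal normalization helper along those lines might look like this (a sketch; tune the rules to your own content so you do not strip meaningful formatting such as code indentation):

import re

def normalize_prompt(text: str) -> str:
    """Collapse whitespace that costs tokens without adding meaning."""
    text = re.sub(r"[ \t]+", " ", text)      # collapse runs of spaces and tabs
    text = re.sub(r"\n{3,}", "\n\n", text)   # cap consecutive blank lines
    return text.strip()

print(normalize_prompt("Please   summarize\n\n\n\nthis  document."))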
The Tokenizer Vocabulary and Model-Specific Differences
Not all tokenizers are equal. The tokenizer is baked into the model at training time, so switching models means switching tokenizers — and the same string can produce different token counts on different models.
| Model Family | Tokenizer | Vocab Size | Notes |
|---|---|---|---|
| GPT-3.5 / GPT-4 | cl100k_base (tiktoken) | ~100,256 | Numbers 0–9 are single tokens |
| GPT-4o | o200k_base (tiktoken) | ~200,019 | Better multilingual efficiency |
| Claude (Anthropic) | Custom BPE | ~100k range | Similar to cl100k, not publicly released |
| Llama 2 | SentencePiece BPE | 32,000 | Much smaller vocab → more tokens per text |
| Llama 3 | tiktoken-like BPE | 128,256 | Major improvement over Llama 2 |
| Mistral | SentencePiece | 32,000 | Same as Llama 2 tokenizer |
| Gemini | SentencePiece | ~256k | Excellent multilingual efficiency |
Practical implication for engineers: If you are comparing costs between providers, do not assume token counts transfer. A 10,000-token prompt on GPT-4 might be 11,500 tokens on Llama 2 for the same text. Always measure with the target model's tokenizer.
How to count tokens programmatically:
import tiktoken
def count_tokens(text: str, model: str = "gpt-4o") -> int:
enc = tiktoken.encoding_for_model(model)
return len(enc.encode(text))
prompt = "Analyze the following Python function and suggest three improvements:"
print(count_tokens(prompt, "gpt-4o")) # → 12
print(count_tokens(prompt, "gpt-4")) # → 12 (same tokenizer: cl100k)
print(count_tokens(prompt, "gpt-3.5-turbo")) # → 12
For Claude models, use Anthropic's token-counting endpoint, which counts with Anthropic's own tokenizer:
import anthropic
client = anthropic.Anthropic()
response = client.messages.count_tokens(
model="claude-opus-4-5",
system="You are a senior software engineer.",
messages=[{"role": "user", "content": prompt}]
)
print(response.input_tokens)
For Llama and other open-source models, use the Hugging Face tokenizer:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
tokens = tokenizer.encode(prompt)
print(len(tokens))
Tip: Build token counting into your development workflow as early as possible. Create a utility function that wraps the appropriate tokenizer for each model you use, and call it in your prompt-building logic so you can catch context window overflows before they become runtime errors.
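One possible shape for such a wrapper, shown as a sketch (the routing by model-name prefix and the Hugging Face fallback are assumptions; adapt them to the models you actually run):

def count_tokens_for(model: str, text: str) -> int:
    """Route to an appropriate tokenizer for the given model (illustrative dispatcher)."""
    if model.startswith("gpt-"):
        import tiktoken
        return len(tiktoken.encoding_for_model(model).encode(text))
    if model.startswith("claude-"):
        import anthropic
        response = anthropic.Anthropic().messages.count_tokens(
            model=model,
            messages=[{"role": "user", "content": text}],
        )
        return response.input_tokens   # includes message formatting overhead
    # Fall back to a Hugging Face tokenizer for open-weight models
    from transformers import AutoTokenizer
    return len(AutoTokenizer.from_pretrained(model).encode(text))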
How the Model Processes Tokens — The Transformer Attention Mechanism
Understanding why tokens matter requires a brief look at how the transformer processes them. This is not an academic detour — it explains why token count has non-linear effects on performance and cost.
A transformer processes the input token sequence through multiple attention layers. In each layer, every token attends to every other token in the context via the self-attention mechanism. The computational complexity of self-attention is O(n²) with respect to sequence length n. This means:
- Doubling your token count roughly quadruples the attention computation per layer.
- At 10,000 tokens, attention is ~100× more expensive than at 1,000 tokens (per layer).
- Providers mitigate this with optimizations (FlashAttention, multi-query attention, sparse attention), but the fundamental scaling pressure remains.
In practice, inference cost and latency scale roughly linearly with total tokens (input + output) for typical workloads, because throughput optimizations are effective at the scales most APIs operate at. However, very long contexts (above roughly 50k tokens) can exhibit noticeable latency increases even with modern KV caching.
The KV cache is the critical optimization that makes multi-turn conversations economical. When you send a second message in a conversation, the provider does not re-process the entire history from scratch. Instead, the key-value pairs computed during the first pass are cached. You pay full price only for the new tokens; cached tokens are billed at a much lower cache-hit rate, which is why Anthropic, OpenAI, and other providers offer prompt caching as a pricing tier.
First request: [system prompt: 500 tokens] + [user message 1: 50 tokens] = 550 input tokens billed
Second request: [system prompt: 500 cached] + [user message 1: 50 cached] + [user message 2: 60 tokens]
→ 60 tokens billed at full price + 550 at cache-hit price (~10–20% of full)
Tip: Structure your prompts so that stable content (system instructions, static context, tool definitions) appears at the beginning of the input. This maximizes cache hit rates. Anthropic's prompt caching requires marking cache breakpoints explicitly with cache_control parameters; OpenAI's is automatic. Both reward you for front-loading stable context.
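For Anthropic, marking a cache breakpoint on the stable prefix looks roughly like this (a sketch; LONG_STABLE_INSTRUCTIONS is a placeholder for your own system text, and the model name simply reuses the one from the earlier token-counting example):

import anthropic

client = anthropic.Anthropic()
LONG_STABLE_INSTRUCTIONS = "You are a senior software engineer. ..."  # placeholder for a long, stable system prompt

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STABLE_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},   # everything up to this block is cached
        }
    ],
    messages=[{"role": "user", "content": "Review this module for race conditions."}],
)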
Token Cost Mechanics — How Billing Actually Works
Every major LLM provider bills separately for input tokens and output tokens, and output tokens are almost always more expensive than input tokens (typically 3–5× more expensive per token). This asymmetry exists because output generation is autoregressive — each output token requires a forward pass through the model, while input tokens are processed in a single parallel forward pass.
Current pricing reference (May 2026 approximate — always verify with provider):
| Model | Input (per 1M tokens) | Output (per 1M tokens) | Cached Input |
|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | $1.25 |
| GPT-4o mini | $0.15 | $0.60 | $0.075 |
| Claude Opus 4 | $15.00 | $75.00 | $1.50 |
| Claude Sonnet 4 | $3.00 | $15.00 | $0.30 |
| Claude Haiku 3.5 | $0.80 | $4.00 | $0.08 |
| Gemini 2.0 Flash | $0.10 | $0.40 | — |
Cost calculation formula:
total_cost = (uncached_input_tokens / 1,000,000 × input_price_per_million)
           + (output_tokens / 1,000,000 × output_price_per_million)
           + (cached_input_tokens / 1,000,000 × cache_price_per_million)
Example: A code review agent
def estimate_cost(
input_tokens: int,
output_tokens: int,
cached_tokens: int = 0,
model: str = "claude-sonnet-4"
) -> float:
pricing = {
"claude-opus-4": {"input": 15.00, "output": 75.00, "cache": 1.50},
"claude-sonnet-4": {"input": 3.00, "output": 15.00, "cache": 0.30},
"claude-haiku-3-5": {"input": 0.80, "output": 4.00, "cache": 0.08},
"gpt-4o": {"input": 2.50, "output": 10.00, "cache": 1.25},
"gpt-4o-mini": {"input": 0.15, "output": 0.60, "cache": 0.075},
}
p = pricing[model]
billable_input = input_tokens - cached_tokens
cost = (billable_input / 1_000_000 * p["input"]
+ output_tokens / 1_000_000 * p["output"]
+ cached_tokens / 1_000_000 * p["cache"])
return cost
print(f"${estimate_cost(8000, 1200, cached_tokens=6000, model='claude-sonnet-4'):.4f}")
Tip: For QA engineers and product managers: translate your token usage into business terms. "We use 50M tokens per month" is abstract. "Our test automation agent costs $150/month on Claude Sonnet or $750/month on Claude Opus" is actionable. Always include a cost estimate in your agentic workflow design documents so stakeholders can make informed model-tier decisions.
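The arithmetic behind those figures, as a quick sketch (it assumes the 50M tokens are all uncached input, priced per the table above):

monthly_tokens = 50_000_000
print(f"Sonnet: ${monthly_tokens / 1_000_000 * 3.00:.0f}/month")   # → $150/month
print(f"Opus:   ${monthly_tokens / 1_000_000 * 15.00:.0f}/month")  # → $750/month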
Special Token Categories — System, Tool, and Invisible Tokens
Beyond the text you explicitly write, LLM APIs inject additional tokens that you never see in your prompt string but absolutely pay for:
1. Special delimiter tokens
Every conversation turn is wrapped in special tokens that mark role boundaries. For example, ChatML format (used by OpenAI) adds tokens like <|im_start|>, <|im_end|>, role names, and newlines. These add roughly 3–7 tokens per message, which compounds across long conversations.
2. Tool/function definitions
When you provide function definitions to the model (for tool use / function calling), the full JSON schema of every tool is tokenized and included in the input. A typical tool definition runs 50–200 tokens. An agent with 20 tools has 1,000–4,000 tokens of tool overhead on every single request, even if most tools are never used.
3. System prompt tokens
System prompts are input tokens, charged at full input price (or cache price if caching is enabled). A 2,000-token system prompt that is re-sent with every request in a 100-turn conversation costs 200,000 additional input tokens over the session — roughly $0.60 on Claude Sonnet. With prompt caching, this drops to ~$0.06.
4. Image tokens
Vision models charge for images based on resolution. OpenAI charges a flat ~85 tokens for a low-detail image; for high-detail images it charges 170 tokens per 512×512 tile plus an 85-token base, after the image is resized to fit within 2048×2048 and scaled so its shortest side is 768 pixels. A full-page screenshot at 1920×1080 in high-detail mode costs approximately 1,105 tokens, more than many entire text prompts.
import math

def image_token_cost_openai(width: int, height: int, detail: str = "high") -> int:
    """Estimate OpenAI vision token cost for a single image."""
    if detail == "low":
        return 85
    # High detail: fit within 2048x2048, scale the shortest side to 768,
    # then count 512x512 tiles at 170 tokens each plus an 85-token base.
    scale = min(2048 / max(width, height), 1.0)
    w, h = width * scale, height * scale
    scale = min(768 / min(w, h), 1.0)
    w, h = w * scale, h * scale
    tiles = math.ceil(w / 512) * math.ceil(h / 512)
    return 170 * tiles + 85

print(image_token_cost_openai(1920, 1080))  # → 1105 tokens (resized to ~1365×768 → 6 tiles)
print(image_token_cost_openai(3840, 2160))  # → 1105 tokens (same aspect ratio, same resized size)
Tip: Audit all the "invisible" token sources in your agentic system before optimizing the visible prompt text. In many production systems, tool definitions and system prompts account for 30–60% of total input tokens. These are often the highest-leverage targets for optimization.
Hands-On Exercise — Build a Token Profiler
This exercise applies across all personas. Engineers build it; QA validates the numbers; PMs use the output to justify infrastructure decisions.
Goal: Create a utility that profiles the full token breakdown of any API request before it is sent.
import tiktoken
import json
from typing import Any
def profile_request_tokens(
system: str,
messages: list[dict],
tools: list[dict] | None = None,
model: str = "gpt-4o"
) -> dict:
"""
Profile token consumption of a complete API request.
Returns a breakdown by component.
"""
enc = tiktoken.encoding_for_model(model)
def count(text: str) -> int:
return len(enc.encode(text))
# Count system prompt
system_tokens = count(system)
# Count each message (including role overhead: ~4 tokens per message)
message_breakdown = []
total_message_tokens = 0
for msg in messages:
content = msg.get("content", "")
if isinstance(content, list): # multi-modal
content = " ".join(
block.get("text", "") for block in content
if block.get("type") == "text"
)
msg_tokens = count(content) + 4 # 4 for role/delimiter overhead
message_breakdown.append({
"role": msg["role"],
"tokens": msg_tokens,
"preview": content[:80] + "..." if len(content) > 80 else content
})
total_message_tokens += msg_tokens
# Count tool definitions
tool_tokens = 0
tool_breakdown = []
if tools:
for tool in tools:
t_tokens = count(json.dumps(tool))
tool_breakdown.append({"name": tool.get("name", "unknown"), "tokens": t_tokens})
tool_tokens += t_tokens
total = system_tokens + total_message_tokens + tool_tokens
return {
"total_input_tokens": total,
"breakdown": {
"system_prompt": system_tokens,
"system_pct": f"{system_tokens/total*100:.1f}%",
"messages": total_message_tokens,
"messages_pct": f"{total_message_tokens/total*100:.1f}%",
"tools": tool_tokens,
"tools_pct": f"{tool_tokens/total*100:.1f}%",
},
"message_detail": message_breakdown,
"tool_detail": tool_breakdown,
}
system_prompt = """You are a senior QA engineer assistant.
You help write test cases, analyze bug reports, and review test coverage.
Always respond in structured JSON format. Be concise and precise."""
conversation = [
{"role": "user", "content": "Review this pull request and identify testing gaps."},
{"role": "assistant", "content": "I'll analyze the PR for testing gaps. Please share the diff."},
{"role": "user", "content": "Here is the diff: [... 500 lines of code ...]"},
]
tools_list = [
{
"name": "create_test_case",
"description": "Creates a structured test case in the test management system",
"parameters": {
"type": "object",
"properties": {
"title": {"type": "string"},
"steps": {"type": "array", "items": {"type": "string"}},
"expected_result": {"type": "string"},
"priority": {"type": "string", "enum": ["P0", "P1", "P2", "P3"]}
},
"required": ["title", "steps", "expected_result"]
}
}
]
result = profile_request_tokens(system_prompt, conversation, tools_list)
print(json.dumps(result, indent=2))
Sample output:
{
"total_input_tokens": 287,
"breakdown": {
"system_prompt": 51,
"system_pct": "17.8%",
"messages": 125,
"messages_pct": "43.5%",
"tools": 111,
"tools_pct": "38.7%"
}
}
Notice that a single tool definition consumes almost as many tokens as three conversation turns. In a real agent with 15 tools, tool definitions alone would dominate the input.
Tip: Run this profiler on your existing agentic systems before making any optimizations. The output often reveals surprises — many teams discover that their tool schemas account for 40%+ of input tokens, and trimming tool descriptions (not removing tools, just writing tighter descriptions) cuts input cost by 15–25% with zero impact on agent behavior.
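As an illustration of that last point, you can measure the effect of a tighter description directly (the two schemas below are made up for comparison, and the exact counts depend on the tokenizer):

import json
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

verbose_tool = {
    "name": "create_test_case",
    "description": ("This tool can be used whenever you would like to create a brand new "
                    "structured test case entry inside of the team's test management system "
                    "so that it can be tracked, prioritized, and executed at a later time."),
}
tight_tool = {
    "name": "create_test_case",
    "description": "Create a structured test case in the test management system.",
}

for tool in (verbose_tool, tight_tool):
    print(tool["description"][:30], "->", len(enc.encode(json.dumps(tool))), "tokens")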