Few-shot prompting — providing examples of the desired behavior directly in the prompt — is one of the most powerful tools in the prompt engineer's toolkit. It's also one of the most expensive. Each example you include is a block of tokens you pay for on every single API call. The decision of whether to use zero-shot or few-shot prompting is fundamentally a cost-quality trade-off that requires deliberate analysis, not a default assumption that "more examples = better."
Understanding the Token Cost of Examples
Before analyzing trade-offs, you need to internalize what few-shot examples actually cost in tokens.
A typical few-shot example for a classification task might look like:
```
Example:
Input: "The checkout button doesn't work on mobile Safari"
Output: {"category": "bug", "component": "checkout", "platform": "mobile", "severity": "high"}
```
That example costs approximately 35–40 tokens. Add 3 examples and you've added ~115 tokens to every prompt call. At 100,000 calls per day, that's 11.5 million extra input tokens daily — purely for the examples.
For GPT-4o at $2.50 per million input tokens, those 3 examples cost $28.75/day. At 30 days that's ~$860/month — just for the example tokens, on top of all other prompt and output costs.
This math scales with the complexity of examples. For code generation tasks, a single example might be 200–300 tokens. Three code examples = 600–900 tokens per call. The cost calculus changes dramatically.
| Example type | Tokens per example | 3 examples (tokens) | Cost per day at 100K calls (GPT-4o input) | Cost per month (30 days) |
|---|---|---|---|---|
| Classification | ~35 | ~105 | ~$26/day | ~$790/month |
| JSON extraction | ~60 | ~180 | ~$45/day | ~$1,350/month |
| Code generation | ~250 | ~750 | ~$187/day | ~$5,625/month |
| Complex reasoning | ~400 | ~1,200 | ~$300/day | ~$9,000/month |
Tip: Before adding few-shot examples to any prompt, calculate the monthly token cost of those examples at your expected call volume. Then estimate the quality improvement. If the quality improvement can be achieved through better zero-shot instruction design, that's almost always the lower-cost path. Use few-shot as a tool of last resort after optimizing your zero-shot prompt.
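To make that calculation concrete, here is a minimal back-of-the-envelope sketch; the token counts, call volume, and price are assumptions you should replace with your own numbers.

```python
# Back-of-the-envelope cost of few-shot examples at volume.
# All constants below are assumptions -- substitute your own task and pricing.
EXAMPLE_TOKENS = 3 * 35            # three ~35-token classification examples
CALLS_PER_DAY = 100_000
PRICE_PER_M_INPUT_TOKENS = 2.50    # USD per 1M input tokens (assumed GPT-4o rate)

daily_tokens = EXAMPLE_TOKENS * CALLS_PER_DAY
daily_cost = daily_tokens / 1_000_000 * PRICE_PER_M_INPUT_TOKENS
monthly_cost = daily_cost * 30

print(f"{daily_tokens:,} extra input tokens/day")
print(f"${daily_cost:.2f}/day, ${monthly_cost:.2f}/month for the example tokens alone")
# 10,500,000 extra input tokens/day
# $26.25/day, $787.50/month for the example tokens alone
```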
When Zero-Shot Works — and Why It's Usually Your First Choice
Zero-shot prompting — no examples, just instructions — works surprisingly well with modern frontier models, and it should be your default starting point. Here's why:
Frontier models have extensive pre-training. GPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, and similar models have seen enormous quantities of human tasks and outputs during training. For common task patterns (sentiment classification, JSON extraction, code review, summarization), these models have strong priors that don't need reinforcement from examples.
Instruction quality matters more than example quantity. A precisely written zero-shot instruction often outperforms a vague instruction with three examples. The examples compensate for unclear instructions — but well-written instructions eliminate the need for examples.
Zero-shot performance benchmarks. Empirical evidence from production systems:
- Sentiment classification (3 classes): Zero-shot accuracy on frontier models is typically 90–95% with a well-written instruction.
- Intent classification (10 classes): Zero-shot is typically 80–90% accurate.
- JSON field extraction from structured text: Zero-shot with a schema definition is typically 95%+ accurate.
- Code generation (standard algorithms): Zero-shot quality matches or exceeds few-shot in most evaluations.
When zero-shot struggles:
- Novel output formats the model hasn't seen in training (custom DSLs, proprietary data formats)
- Domain-specific terminology that deviates from standard usage
- Edge cases with non-obvious correct answers
- Tone and style matching (when you need output that mimics a specific voice)
Tip: Always benchmark zero-shot first. Build a representative test set of 30–50 inputs for your task, define a quality rubric, and measure zero-shot performance before writing a single example. You may find zero-shot is sufficient. You'll certainly learn which failure modes actually need examples versus which are solvable by refining the instruction.
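As a concrete starting point, a zero-shot benchmark can be as simple as the sketch below; `call_model`, the test-set structure, and the exact-match scoring rubric are all assumptions to adapt to your task.

```python
import json

def evaluate_zero_shot(test_cases, instruction, call_model):
    """Score a zero-shot instruction against a labeled test set.

    test_cases: list of {"input": str, "expected": dict} (assumed structure).
    call_model: your own wrapper around the chat API, returning the raw text output.
    """
    correct = 0
    for case in test_cases:
        output = call_model(system=instruction, user=case["input"])
        predicted = json.loads(output)          # assumes the task returns JSON
        if predicted.get("category") == case["expected"]["category"]:
            correct += 1
    return correct / len(test_cases)

# accuracy = evaluate_zero_shot(test_cases, ZERO_SHOT_INSTRUCTION, call_model)
```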
The Diminishing Returns of Additional Examples
When few-shot does improve quality, the improvement is not linear with the number of examples. Research and production data consistently show diminishing returns beyond 2–3 examples for most tasks, and the token cost increases linearly.
Typical quality curve:
| Examples | Quality improvement (vs zero-shot) | Marginal gain from previous |
|---|---|---|
| 0 (zero-shot) | baseline | — |
| 1 | +8–15% | +8–15% |
| 2 | +12–20% | +2–8% |
| 3 | +14–22% | +1–4% |
| 4 | +15–23% | +0–2% |
| 5 | +15–23% | ~0% |
The first example provides the most signal. By the third example, marginal gains are usually minimal. Beyond five examples, additional examples rarely improve quality and can occasionally hurt it (by introducing noise or constraining the model too narrowly).
Exception: few-shot with diverse examples. If your examples cover distinctly different cases (e.g., edge cases, different tones, different complexity levels), each additional example continues to provide value up to the point of full coverage. The key word is diverse: three examples that all demonstrate the same pattern teach the model little more than one would.
The one-shot sweet spot. For many real-world tasks, one carefully chosen example provides 70–80% of the quality benefit of three examples at one-third the token cost. When you decide to use few-shot, consider starting with one example and only adding more if evaluation shows genuine improvement.
Tip: When building few-shot prompts, conduct an ablation study: test with 1 example, 2 examples, and 3 examples against your evaluation set. Calculate the quality-per-token ratio for each. In most cases, you'll find that 1–2 examples are the optimum. This data-driven approach prevents the common mistake of defaulting to 3–5 examples without measuring whether they're earning their token cost.
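A sketch of such an ablation is below; `build_prompt`, `evaluate_prompt`, and `count_tokens` stand in for your own prompt builder, evaluation harness, and tokenizer, and are not a real API.

```python
def few_shot_ablation(examples, test_cases, build_prompt, evaluate_prompt, count_tokens):
    """Measure accuracy and token cost for 0-3 examples and compute quality per token."""
    results = []
    for n in range(4):
        prompt = build_prompt(examples[:n])             # n == 0 is the zero-shot baseline
        results.append({
            "n_examples": n,
            "accuracy": evaluate_prompt(prompt, test_cases),
            "prompt_tokens": count_tokens(prompt),
        })
    baseline = results[0]
    for r in results[1:]:
        extra_tokens = r["prompt_tokens"] - baseline["prompt_tokens"]
        gain = r["accuracy"] - baseline["accuracy"]
        # Quality gained per 1K extra tokens: the number that decides whether
        # each additional example is earning its keep.
        r["gain_per_1k_tokens"] = gain / extra_tokens * 1000 if extra_tokens else 0.0
    return results
```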
Crafting High-Efficiency Examples
When you have determined that few-shot examples are justified, the quality of those examples determines whether they earn their token cost. A poorly chosen example wastes tokens and may hurt quality.
Characteristics of high-efficiency examples:
- Representative of the common case — not edge cases. The model needs to learn the standard pattern, not your most unusual input.
- Minimal but complete — includes only what's necessary to demonstrate the input-output mapping, no padding.
- Consistent in format — the delimiter structure and format should match exactly what you expect in production.
- Targeted at known failure modes — if zero-shot fails on a specific pattern, design an example that demonstrates the correct behavior for that pattern.
Example compression. Raw examples often contain prose explanation and context that the model doesn't need. Compare:
Verbose example format (~80 tokens):
```
Here is an example of how I want you to classify support tickets:
User message: "I've been waiting 3 days for a refund and nobody has gotten back to me."
In this case, the correct classification is billing, because the user is asking about a financial transaction, specifically a refund. The priority should be high because of the wait time mentioned.
Output: {"category": "billing", "priority": "high"}
```
Compressed example format (~30 tokens):
```
Q: "I've been waiting 3 days for a refund and nobody has gotten back to me."
A: {"category": "billing", "priority": "high"}
```
The compressed format loses the explanation but retains the input-output mapping — which is what actually teaches the model. A 62% token reduction with no quality impact.
Tip: After writing your examples, apply the same compression techniques you use on instructions: remove prose transitions, explanations, and any text that isn't part of the input-output demonstration. Then verify with evaluation that the compressed examples produce equivalent quality. Compression of ~50% is usually achievable without quality loss.
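As an illustration of the compressed Q/A format in practice, a prompt builder might look like the sketch below; the function name and example structure are assumptions, not a fixed API.

```python
def build_few_shot_prompt(user_input, examples, instruction):
    """Assemble a prompt from compressed Q/A examples plus the live input.

    examples: list of {"input": str, "output": str} dicts (assumed structure).
    """
    lines = [instruction, ""]
    for ex in examples:
        lines.append(f'Q: "{ex["input"]}"')
        lines.append(f"A: {ex['output']}")
    # The live input goes last, in exactly the same format as the examples.
    lines.append(f'Q: "{user_input}"')
    lines.append("A:")
    return "\n".join(lines)
```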
Dynamic Few-Shot Selection: Paying Only When It Helps
Static few-shot examples in system prompts are the most common pattern but not the most efficient. Dynamic few-shot selection — choosing examples at runtime based on the current input — is a technique that can improve both quality and token efficiency simultaneously.
The concept: maintain an example library. For each incoming request, retrieve the 1–2 most semantically similar examples using embedding-based similarity search (using tools like OpenAI's embedding API, Cohere's embed endpoint, or a vector store like Pinecone, Weaviate, or pgvector). Inject only these relevant examples.
Benefits over static few-shot:
- Better quality: the example actually resembles the input, providing stronger signal
- Lower average token count: you only include examples that matter (0 examples for inputs that zero-shot handles well)
- Scalable example library: you can maintain hundreds of examples without putting them all in the prompt
Implementation sketch:
```python
from openai import OpenAI
import numpy as np

client = OpenAI()

def get_embedding(text):
    """Embed a piece of text with OpenAI's embedding API."""
    response = client.embeddings.create(
        model="text-embedding-3-small",
        input=text
    )
    return response.data[0].embedding

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def get_relevant_examples(query, example_library, top_k=2, threshold=0.75):
    """Return the top_k library examples most similar to the query, if any clear the threshold."""
    query_embedding = get_embedding(query)
    scored = []
    for example in example_library:
        sim = cosine_similarity(query_embedding, example["embedding"])
        if sim >= threshold:
            scored.append((sim, example))
    # Sort by similarity only -- sorting the raw tuples would compare the example
    # dicts on ties and raise a TypeError.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [example["example"] for _, example in scored[:top_k]]

# Include examples only when the library contains something relevant;
# otherwise fall back to the cheaper zero-shot prompt.
examples = get_relevant_examples(user_input, example_library)
if examples:
    prompt = build_few_shot_prompt(user_input, examples)
else:
    prompt = build_zero_shot_prompt(user_input)
```
This approach means you pay for examples only when they improve quality — and you pay for the most relevant examples, not a generic static set.
Tip: When implementing dynamic few-shot selection, pre-compute and cache embeddings for all examples in your library. Recomputing embeddings at inference time adds latency and API cost. Use a vector store like pgvector (built into PostgreSQL) or a dedicated service like Pinecone to enable fast similarity search against your example library without per-request embedding computation.
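A sketch of that pre-computation step, reusing `get_embedding` from the snippet above; the library contents and the choice to keep it in memory rather than in a vector store are assumptions.

```python
# Build the example library once at startup so no per-request embedding calls
# are needed for the library itself (the incoming query is still embedded per request).
RAW_EXAMPLES = [
    {"input": "I've been waiting 3 days for a refund and nobody has gotten back to me.",
     "output": '{"category": "billing", "priority": "high"}'},
    # ... the rest of your curated library
]

example_library = [
    {"example": ex, "embedding": get_embedding(ex["input"])}
    for ex in RAW_EXAMPLES
]
```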
Few-Shot for Format Adherence vs. Semantic Learning
Not all few-shot use cases are equal. Understanding what examples are teaching helps you design them more efficiently.
Format adherence — teaching the model the exact output structure you need — is the most token-efficient use of few-shot. A single example demonstrating a custom JSON schema, a specific table format, or a proprietary output structure costs only a small number of tokens and has a high payoff if zero-shot struggles to produce the format consistently.
Semantic learning — teaching the model what constitutes a correct answer for a non-obvious judgment call — is the most expensive and often least reliable use of few-shot. If the task requires nuanced domain knowledge that wasn't in training data, examples help but may not be sufficient. Fine-tuning is often a better investment.
A decision framework:
| Situation | Recommended approach |
|---|---|
| Common task (classification, summarization) | Zero-shot with precise instruction |
| Standard format needed | Zero-shot with schema/format spec |
| Custom output format | One example + format spec |
| Edge case correction | One targeted example |
| Consistent style/tone matching | 2–3 style examples |
| Novel domain, non-obvious correctness | Dynamic few-shot or fine-tuning |
| High volume, quality-critical | Fine-tune — amortize examples over millions of calls |
Fine-tuning as the endgame for few-shot. If you find yourself using 5+ examples consistently and the task runs at high volume, fine-tuning eliminates the runtime example cost entirely. Fine-tuned models internalize the examples during training, so at inference time you get few-shot quality at zero-shot token cost. This is covered in depth in Module 5, but the decision point belongs here: when few-shot costs exceed fine-tuning amortization cost, it's time to fine-tune.
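To make that decision point concrete, here is a rough break-even sketch; every price and volume below is a placeholder assumption to replace with your provider's actual rates and your measured prompt sizes.

```python
# Rough break-even between runtime few-shot example cost and fine-tuning.
EXAMPLE_TOKENS_PER_CALL = 750      # e.g. three ~250-token code examples (assumed)
OTHER_PROMPT_TOKENS = 400          # rest of the prompt, paid either way (assumed)
BASE_INPUT_PRICE = 2.50            # USD per 1M input tokens, base model (assumed)
TUNED_PRICE_PREMIUM = 1.25         # extra USD per 1M input tokens on a tuned model (assumed)
TRAINING_COST = 500.00             # one-off fine-tuning job cost (assumed)

def monthly_few_shot(calls_per_month):
    tokens = (EXAMPLE_TOKENS_PER_CALL + OTHER_PROMPT_TOKENS) * calls_per_month
    return tokens / 1e6 * BASE_INPUT_PRICE

def monthly_fine_tuned(calls_per_month, amortize_months=12):
    tokens = OTHER_PROMPT_TOKENS * calls_per_month       # example tokens are gone
    return tokens / 1e6 * (BASE_INPUT_PRICE + TUNED_PRICE_PREMIUM) + TRAINING_COST / amortize_months

for calls in (100_000, 1_000_000, 10_000_000):
    print(f"{calls:>10,} calls/mo: few-shot ${monthly_few_shot(calls):>9,.2f}  "
          f"fine-tuned ${monthly_fine_tuned(calls):>9,.2f}")
```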
Tip: Categorize every few-shot use case in your application as "format adherence" or "semantic learning." For format adherence, try zero-shot with a detailed schema first — modern models follow JSON Schema definitions well without examples. Reserve examples for genuinely semantic learning tasks, and for those, evaluate whether the volume justifies fine-tuning over runtime few-shot.
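For the format-adherence case, a sketch of schema-driven zero-shot extraction is below. The field names are illustrative, and the structured-outputs `response_format` parameter shown is the shape recent OpenAI SDKs use for JSON Schema enforcement; verify it against your SDK version.

```python
from openai import OpenAI

client = OpenAI()

# Illustrative schema -- field names and enums are assumptions for this example.
ticket_schema = {
    "name": "ticket_classification",
    "strict": True,
    "schema": {
        "type": "object",
        "properties": {
            "category": {"type": "string", "enum": ["bug", "billing", "feature_request", "other"]},
            "priority": {"type": "string", "enum": ["low", "medium", "high"]},
        },
        "required": ["category", "priority"],
        "additionalProperties": False,
    },
}

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Classify the support ticket. Respond with JSON only."},
        {"role": "user", "content": "The checkout button doesn't work on mobile Safari"},
    ],
    response_format={"type": "json_schema", "json_schema": ticket_schema},
)
print(response.choices[0].message.content)
```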