
Input token efficiency (crafting lean prompts) is only half the equation. Output tokens are billed at a higher rate on every major LLM API — on GPT-4o, output tokens cost 4x more than input tokens; on Claude 3.5 Sonnet, they cost 5x more. Output shaping — directing the model to produce exactly the information you need in the most compact form — is one of the highest-leverage optimizations available to you.

The Economics of Output Tokens

Before diving into techniques, it's worth grounding the discussion in real numbers. As of mid-2025, representative pricing:

Model               Input (per 1M tokens)   Output (per 1M tokens)   Output multiplier
GPT-4o              $2.50                   $10.00                   4x
Claude 3.5 Sonnet   $3.00                   $15.00                   5x
Gemini 1.5 Pro      $1.25                   $5.00                    4x
GPT-4o-mini         $0.15                   $0.60                    4x

These multipliers mean that a system generating 1,000 tokens of output per request pays 4–5x more for those tokens than for the equivalent input. At scale, output length becomes your primary cost driver. An agent that consistently generates 800-word responses when 200-word responses would serve the user equally well is burning 4x the output budget unnecessarily.
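To make the scale concrete, a rough sketch, assuming GPT-4o output pricing from the table above and a hypothetical volume of one million requests per month (counting tokens rather than words for simplicity):

requests_per_month = 1_000_000               # hypothetical volume
output_price_per_token = 10.00 / 1_000_000   # GPT-4o: $10 per 1M output tokens

verbose_cost = requests_per_month * 800 * output_price_per_token  # $8,000/month at ~800 output tokens
concise_cost = requests_per_month * 200 * output_price_per_token  # $2,000/month at ~200 output tokens
print(f"monthly difference: ${verbose_cost - concise_cost:,.0f}")  # $6,000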

Output shaping techniques give you explicit control over this cost. The goal is not to make responses worse — it's to eliminate the gap between "what the model generates by default" and "what the user actually needs."

Tip: Log the actual output token counts for every LLM call in your application. Most platforms return usage data in the API response (usage.completion_tokens in OpenAI, usage.output_tokens in Anthropic). Aggregate these metrics by endpoint or agent type. You'll quickly find which workflows are generating the most output tokens and where optimization will have the highest ROI.
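A minimal logging sketch, assuming the OpenAI Python SDK; the endpoint label and logging setup are placeholders you would adapt to your own metrics pipeline:

import logging
from openai import OpenAI

client = OpenAI()

def call_and_log(messages, endpoint_label):
    # endpoint_label is a hypothetical tag used to aggregate token metrics per workflow
    response = client.chat.completions.create(model="gpt-4o", messages=messages)
    usage = response.usage
    logging.info(
        "endpoint=%s prompt_tokens=%d completion_tokens=%d",
        endpoint_label, usage.prompt_tokens, usage.completion_tokens,
    )
    return response.choices[0].message.content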

Direct Length Constraints

The most direct output-shaping technique is telling the model exactly how long the response should be. This works better than you might expect — frontier models follow length constraints reliably when the instructions are specific.

Word/sentence constraints. Explicit word counts are highly effective:

Summarize this bug report in exactly 2 sentences.
Explain the trade-off in 50 words or fewer.
Provide a one-paragraph assessment (4–6 sentences maximum).

Step/item constraints. For list-format outputs:

List the top 3 risks only. Do not list more than 3.
Provide exactly 5 acceptance criteria. No more, no fewer.

Character constraints for structured outputs. When generating content that feeds into a UI with character limits (like mobile push notifications or card titles), specify the constraint:

Write a push notification message. Maximum 100 characters including spaces.

Before/after comparison for a code review task:

Before (no length constraint) — typical output: ~350 tokens:

Review this Python function and provide your assessment.

Typical output: A thorough multi-paragraph review covering style, logic, performance, testing, documentation, and edge cases.

After (constrained) — typical output: ~90 tokens:

Review this Python function. Return: (1) one critical issue if any, (2) one improvement suggestion, (3) a pass/fail verdict. Three items max.

The constrained prompt produces a focused, actionable output at roughly 25% of the token cost.

Tip: Different constraint formulations have different effectiveness. "Be concise" is weak — models interpret this loosely. "Maximum 100 words" is stronger. "Respond in exactly 3 bullet points, each bullet no more than 15 words" is the strongest. When token efficiency is critical (high-volume pipelines, cost-sensitive applications), use the most specific constraint formulation.

Format Shaping: Structured Outputs Over Prose

Prose is the most token-expensive output format. When you need specific information, requesting it in a structured format almost always produces the same information in fewer tokens.

JSON output is typically 20–40% more token-efficient than equivalent prose for structured information, while being directly machine-parseable:

Instead of asking:

Describe the sentiment of this review, whether it mentions a specific product feature, and the overall rating implied by the text.

Which produces ~80 tokens of prose like: "The review expresses a positive sentiment overall. The reviewer specifically mentions the battery life as a standout feature. Based on the tone and language used, the implied rating would be approximately 4 or 5 out of 5 stars."

Ask:

Analyze the review. Respond in JSON: {"sentiment": "positive|neutral|negative", "feature_mentioned": "<string or null>", "implied_rating": <1-5>}

Which produces ~15 tokens: {"sentiment": "positive", "feature_mentioned": "battery life", "implied_rating": 5}

This is an 80% token reduction with no information loss.

OpenAI's JSON mode and structured outputs (response_format: {"type": "json_object"} or the newer Structured Outputs feature with JSON Schema) enforce valid JSON output while also allowing you to define a schema that constrains what fields are generated. This prevents the model from adding explanatory prose around the JSON.
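A sketch using JSON mode with the OpenAI Python SDK; the schema text mirrors the review-analysis example above, and review_text is an example input:

from openai import OpenAI

client = OpenAI()
review_text = "Battery life is fantastic, easily two days per charge."  # example input

response = client.chat.completions.create(
    model="gpt-4o",
    response_format={"type": "json_object"},  # JSON mode: output is guaranteed to be valid JSON
    messages=[
        {"role": "system", "content": (
            "Analyze the review. Respond only with JSON matching: "
            '{"sentiment": "positive|neutral|negative", '
            '"feature_mentioned": "<string or null>", "implied_rating": <1-5>}'
        )},
        {"role": "user", "content": review_text},
    ],
)
print(response.choices[0].message.content)  # e.g. {"sentiment": "positive", ...}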

Anthropic's tool use / function calling similarly produces structured output. Defining a tool with a tight schema forces the model to populate only the defined fields.
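A comparable sketch with the Anthropic Python SDK, forcing the model to call a single tightly scoped tool; the tool name and fields are illustrative:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=200,
    tools=[{
        "name": "record_review_analysis",
        "description": "Record the structured analysis of a product review.",
        "input_schema": {
            "type": "object",
            "properties": {
                "sentiment": {"type": "string", "enum": ["positive", "neutral", "negative"]},
                "feature_mentioned": {"type": ["string", "null"]},
                "implied_rating": {"type": "integer", "minimum": 1, "maximum": 5},
            },
            "required": ["sentiment", "feature_mentioned", "implied_rating"],
        },
    }],
    tool_choice={"type": "tool", "name": "record_review_analysis"},  # force the tool call
    messages=[{"role": "user", "content": "Battery life is fantastic, easily two days per charge."}],
)
print(response.content[0].input)  # only the defined fields, no surrounding prose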

YAML for human-readable structured output. YAML is often more compact than JSON (no quotes on keys, no commas) and still machine-parseable:

severity: high
component: auth-service
suggested_fix: rotate JWT secret

vs.

{"severity": "high", "component": "auth-service", "suggested_fix": "rotate JWT secret"}

YAML is ~15% more compact in typical cases.
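If you want to check such claims against your own payloads rather than trust rules of thumb, a quick sketch with tiktoken (o200k_base is the encoding GPT-4o uses):

import tiktoken

enc = tiktoken.get_encoding("o200k_base")

yaml_out = "severity: high\ncomponent: auth-service\nsuggested_fix: rotate JWT secret"
json_out = '{"severity": "high", "component": "auth-service", "suggested_fix": "rotate JWT secret"}'

# compare token counts for the two serializations of the same data
print(len(enc.encode(yaml_out)), len(enc.encode(json_out)))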

Tip: For any workflow where the output is consumed by code rather than displayed to humans, always use structured output (JSON, YAML, or function calling). Never parse prose programmatically — it's fragile and more expensive. The switch from prose to structured output is one of the most impactful single changes you can make in an LLM pipeline.

Verbosity Control via Persona and Style Instructions

Beyond explicit length constraints, you can shape verbosity by instructing the model to adopt a low-verbosity communication style:

Anti-preamble instruction:

Do not restate the question. Do not explain what you are about to do. Begin your response with the answer directly.

This single instruction eliminates the pattern where models open with "Certainly! I'd be happy to help you with that. Let me analyze the code you've provided..." — which can be 20–40 tokens of pure overhead on every response.

Anti-conclusion instruction:

Do not add a closing summary or offer to help further.

Models frequently close responses with "I hope this helps! Let me know if you have any other questions." — useless tokens in an automated pipeline.

Style persona for low verbosity:

Respond like a senior engineer in a code review: terse, specific, no pleasantries.

Before/after with verbosity control:

Before (default style) — ~120 tokens:

Classify the sentiment of this customer message.

Output: "Certainly! After carefully analyzing the customer message you've provided, I can determine that the sentiment expressed is negative. The customer appears to be frustrated with the response time of the support team, and their language suggests a high level of dissatisfaction. I hope this classification is helpful for your analysis!"

After (verbosity controlled) — ~5 tokens:

Classify sentiment. No explanation. Reply with one word: positive, neutral, or negative.

Output: "negative"

A 96% reduction in output tokens for the same information.

Tip: Build verbosity control instructions into your system prompt rather than the user turn. This ensures they apply consistently across all interactions without relying on every individual prompt to include them. In multi-agent systems, each agent's system prompt should define its communication style explicitly.
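One way to wire these controls into the system prompt, sketched with the OpenAI Python SDK; the style rules are lifted from the instructions above:

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "Respond like a senior engineer in a code review: terse, specific, no pleasantries. "
    "Do not restate the question or explain what you are about to do. "
    "Do not add a closing summary or offer to help further."
)

def classify_sentiment(message_text):
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        max_tokens=5,  # hard cap: the answer should be a single word
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": (
                "Classify sentiment. No explanation. Reply with one word: "
                f"positive, neutral, or negative.\n\n{message_text}"
            )},
        ],
    )
    return response.choices[0].message.content.strip()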

Controlling Reasoning and Explanation Depth

Some tasks benefit from the model showing its reasoning; most pipeline tasks do not. The default behavior of frontier models is to provide thorough explanations, which is appropriate for a human-facing chatbot but wasteful in an automated workflow.

Suppress explanation for classification tasks:

Classify the following support ticket. Return only the category label. Do not explain.
Categories: billing, technical, account, general

Request explanation only when needed:

Is this code snippet safe to execute in a production environment? Answer yes or no. If no, provide one sentence explaining the critical risk only.

Graduated explanation depth for QA workflows:

QA engineers reviewing AI test generation outputs may need different detail levels:

  • For pass/fail verdicts on 500 automated checks: "Return PASS or FAIL. No explanation."
  • For the 10 failing checks: "Return FAIL with a one-sentence root cause."
  • For the 2 critical failures: "Return full analysis with reproduction steps."

This tiered approach applies expensive, verbose output only where it creates value.
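A sketch of how these tiers might be encoded in a QA pipeline; the tier names and templates are illustrative:

# hypothetical mapping from review tier to the explanation-depth instruction
EXPLANATION_TIERS = {
    "bulk_verdict": "Return PASS or FAIL. No explanation.",
    "failing_check": "Return FAIL with a one-sentence root cause.",
    "critical_failure": "Return full analysis with reproduction steps.",
}

def build_check_prompt(check_description, tier):
    # append the tier-appropriate instruction to the base task
    return f"{check_description}\n\n{EXPLANATION_TIERS[tier]}"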

Reasoning models (o1, o3, Gemini 2.0 Flash Thinking) require special consideration. These models generate "thinking" tokens — internal reasoning steps that are not returned in the visible response but are generally still billed as output tokens — and they can run to thousands of tokens per request. Use reasoning models only for tasks that genuinely require multi-step logical deduction, not for straightforward classification or extraction tasks where a standard model suffices.

Tip: Create a decision matrix for your team: which task types require explanation in outputs, and which do not. For each task type that doesn't require explanation, add "Do not explain. Return [format] only." to the prompt template. This discipline, applied consistently, can reduce output token counts by 40–60% on classification and extraction workloads.

Negative Constraints and What Not to Generate

Telling the model what not to include is as powerful as telling it what to include, and often more compact:

Standard negative constraints:

Do not include:
- Preamble or restatement of the task
- Explanation of your reasoning
- Caveats or disclaimers
- Closing remarks
- Suggestions for follow-up questions

This five-item list in a system prompt eliminates common sources of output bloat across all interactions.

Content-specific exclusions for code generation:

Generate the function implementation only. Do not include:
- Import statements (assume all imports exist)
- Type hints (not used in this codebase)
- Docstrings
- Example usage

Before/after for a code generation task:

Before (no exclusions) — typical output: ~180 tokens:

Write a Python function to validate an email address.

Output includes: imports, function with type hints, docstring, 3 examples of usage, explanation paragraph.

After (exclusions applied) — typical output: ~35 tokens:

Write a Python function to validate an email address. Return the function body only. No imports, no docstring, no examples.

Output: just the function body.

Tip: Maintain a "negative constraint library" for common output patterns you want to suppress. Over time, you'll identify patterns specific to your use case (e.g., "don't suggest database migration as a fix" for a no-migration codebase). Store these as named constraint blocks that can be included or excluded from prompts by reference.
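A minimal sketch of such a library; block names and contents are illustrative:

# named constraint blocks, included in prompts by reference
CONSTRAINT_BLOCKS = {
    "no_preamble": "Do not restate the task or explain what you are about to do.",
    "no_closing": "Do not add a closing summary or offer further help.",
    "code_only": "Return the function body only. No imports, no docstring, no examples.",
}

def apply_constraints(prompt, block_names):
    constraints = "\n".join(CONSTRAINT_BLOCKS[name] for name in block_names)
    return f"{prompt}\n\n{constraints}"

prompt = apply_constraints(
    "Write a Python function to validate an email address.",
    ["no_preamble", "code_only"],
)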

Streaming and Progressive Disclosure

For user-facing applications, output shaping intersects with UX. Streaming (available on OpenAI, Anthropic, and Google APIs) allows you to begin rendering output before the full response is complete. This changes the user's perceived latency without changing actual token count.

However, streaming can also be used strategically to limit output: you can implement a token budget on the client side and stop streaming once a threshold is reached. This is useful for summary widgets or preview panels where you only want the first 100 tokens of a potentially longer response.

from openai import OpenAI

client = OpenAI()

chunk_count = 0
max_chunks = 100  # client-side display budget; each streamed chunk is roughly one token

stream = client.chat.completions.create(
    model="gpt-4o",
    messages=messages,  # built elsewhere
    stream=True,
)
for chunk in stream:
    text = chunk.choices[0].delta.content or ""
    print(text, end="")
    chunk_count += 1  # approximate; use tiktoken on the accumulated text for precision
    if chunk_count >= max_chunks:
        stream.close()
        break

Note: Tokens consumed before you close the stream are still billed, so this technique is about UX (showing users a preview) rather than cost reduction. True cost reduction comes from setting max_tokens on the API call itself.

max_tokens parameter: Every major LLM API accepts a max_tokens (or max_completion_tokens) parameter. This is the most direct output length control available. Use it aggressively:

  • Set it based on your actual data: if P99 of valid responses is 300 tokens, set max_tokens to 400 as a safety buffer.
  • Monitor truncation rates: if the model is hitting the limit frequently, either your limit is too low or your prompt needs better length constraints.
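A sketch of the parameter in practice (OpenAI Python SDK; the 400-token cap follows the P99 example above, and the bug-report text is a placeholder):

from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",
    max_tokens=400,  # P99 of valid responses is ~300 tokens; 400 leaves a safety buffer
    messages=[{"role": "user", "content": "Summarize this bug report in exactly 2 sentences. <report text>"}],
)
if response.choices[0].finish_reason == "length":
    # the cap truncated the output; track this rate to decide whether to raise the limit
    # or tighten the prompt's length constraints
    print("truncated at max_tokens")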

Tip: Do not rely on the API default for max_tokens. Where the parameter is optional, the default is typically the model's full output limit (Anthropic's Messages API requires you to set it explicitly). Always set a max_tokens that reflects your use case. Review this value quarterly as you gather data on actual output lengths in your application.