
The system prompt is the most expensive piece of text you write. It is paid for on every single API call in a session, it persists without modification unless you deliberately rebuild it, and it is the layer the model trusts most. A bloated system prompt is a tax on every interaction. A lean system prompt is a compound investment — you pay less per call while the model performs at least as well, and often better, because it has clearer, denser signal to work from.

This topic gives you a systematic method for designing system prompts that pack maximum instructional value into minimum tokens.


Why System Prompts Become Bloated

System prompt bloat follows a predictable lifecycle. The initial prompt is reasonably concise. Then the model makes a mistake — it generates output in the wrong format, or ignores a constraint. The response is to add a sentence to the system prompt clarifying that constraint. The model makes a different mistake. Another sentence is added. Three months later, the system prompt has grown from 300 tokens to 2,400 tokens and resembles a legal disclaimer more than an instruction set.

The most common bloat patterns are:

Defensive repetition. The same instruction is stated multiple times in slightly different wording because the team is not confident a single statement will hold. Example: "Always respond in JSON. Your output must be valid JSON. Never include prose outside the JSON structure. The response format is JSON." This is four sentences where one would do.

Narrative explanation in place of instruction. "It's important to remember that our users are technical professionals, so when they ask about code, you should provide detailed technical explanations because they prefer depth over simplicity." This can be compressed to: "Users: senior engineers. Code responses: technical depth, no simplified explanations."

Legacy context. Instructions added to fix problems that no longer exist, rules that applied to an older version of the model, or constraints that were added speculatively and never validated as necessary.

Politeness padding. "Please always remember to..." and "It would be helpful if you could..." add tokens without adding instruction value. The model responds identically to "Always..." and "Please always remember to always..."

Tip: Run a "bloat audit" on any system prompt over 500 tokens. Print it out or put it in a document and highlight every sentence that: (a) repeats something already said, (b) explains why rather than stating the instruction, or (c) you cannot point to a concrete case where removing it caused a problem. Highlighted sentences are candidates for deletion or compression.
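If you want to automate part of that audit, a short script can flag hedge phrases and near-duplicate sentences as deletion candidates. This is a rough sketch: the phrase list and similarity threshold below are illustrative assumptions, not a definitive rule set.

import re
from difflib import SequenceMatcher

# Illustrative hedge phrases; tune for your own prompts
HEDGE_PHRASES = ["please", "remember to", "it is important", "make sure", "try to", "where appropriate"]

def audit_prompt(prompt: str, similarity_threshold: float = 0.75) -> None:
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", prompt) if s.strip()]
    for i, sentence in enumerate(sentences):
        lowered = sentence.lower()
        hedges = [p for p in HEDGE_PHRASES if p in lowered]
        if hedges:
            print(f"[hedge]  {sentence!r} -> {hedges}")
        # Flag sentences that closely resemble an earlier one (defensive repetition)
        for earlier in sentences[:i]:
            if SequenceMatcher(None, lowered, earlier.lower()).ratio() > similarity_threshold:
                print(f"[repeat] {sentence!r} ~ {earlier!r}")
                break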


The Principle of Maximum Instruction Density

Instruction density is the ratio of behavioral change induced per token spent. A high-density instruction causes the model to behave differently across many situations. A low-density instruction is a token cost with minimal behavioral impact.

Compare these two instructions:

Low density (38 tokens):
"When users ask you questions about the codebase, it is important to make sure that you look at the relevant files carefully before providing your answer, since this will help ensure your answer is accurate."

High density (14 tokens):
"Before answering codebase questions: read relevant files first."

Both communicate the same behavior. The high-density version uses 63% fewer tokens. Over 1,000 API calls, that single compression saves 24,000 tokens.
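The exact numbers depend on the tokenizer, but they are easy to check. A quick sketch using tiktoken's cl100k_base encoding as a stand-in tokenizer (counts for your model's actual tokenizer may differ slightly):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # rough proxy; model tokenizers vary

low = ("When users ask you questions about the codebase, it is important to make sure "
       "that you look at the relevant files carefully before providing your answer, "
       "since this will help ensure your answer is accurate.")
high = "Before answering codebase questions: read relevant files first."

low_n, high_n = len(enc.encode(low)), len(enc.encode(high))
print(low_n, high_n, f"{(1 - high_n / low_n):.0%} reduction")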

The core techniques for increasing instruction density are:

1. Imperative over narrative. Use direct commands: "Do X." not "You should try to X when possible."

2. Structured lists over prose paragraphs. A bulleted list of five constraints takes fewer tokens than five sentences describing the same constraints, and is more reliably followed because each constraint is visually separated.

3. Abbreviate the obvious. If your agent is a Python code reviewer, you do not need to say "When reviewing Python code, focus on Python-specific issues." Just state the focus areas. The domain is established by context.

4. Eliminate hedge words. "Generally," "usually," "in most cases," "where appropriate" — these hedges cost tokens and make instructions weaker, not stronger. Replace them with explicit conditions if you genuinely need conditional behavior.

5. Use reference over description. "Follow PEP 8" costs 4 tokens and communicates an entire style guide. "Ensure proper indentation of 4 spaces, use snake_case for variables, limit lines to 79 characters..." costs 30 tokens to communicate a fragment of the same information.

Tip: After writing or revising a system prompt, run every sentence through a single-question filter: "If I removed this sentence, would the model behave differently in a meaningful way?" If the answer is "probably not," the sentence is a candidate for deletion. Keep only sentences where you can articulate a concrete behavioral failure that would occur without them.


Structural Patterns for Lean System Prompts

The structure of a system prompt is as important as its content. A well-structured prompt allows the model to parse it efficiently and follow it reliably, which means you do not need redundant restatement.

Recommended structure for engineering and QA agents:

[ROLE]
{1–2 sentences: who the agent is and its primary purpose}

[CONTEXT]
{3–8 bullet points: invariant facts about the environment}

[BEHAVIOR]
{Numbered list: ordered priorities for decision-making}

[CONSTRAINTS]
{Bullet list: hard rules the agent must never violate}

[OUTPUT FORMAT]
{Explicit format specification with a minimal example if needed}

Example — lean system prompt for a code review agent:

[ROLE]
Senior code reviewer for a TypeScript/Node.js backend team. Focus: correctness, security, and maintainability.

[CONTEXT]
- Stack: Node.js 20, TypeScript 5.4, Express 4, PostgreSQL via Prisma
- Style guide: Airbnb TypeScript, enforced by ESLint
- Test framework: Vitest, 80% branch coverage required
- CI: GitHub Actions, PRs must pass lint + tests before merge

[BEHAVIOR]
1. Identify bugs and logic errors first
2. Flag security issues (injection, auth bypass, secrets exposure)
3. Note style violations only if ESLint would fail
4. Suggest improvements only after covering issues above

[CONSTRAINTS]
- Never approve PRs with unhandled promise rejections
- Never approve PRs that expose credentials or PII in logs
- Do not request changes for style if ESLint passes

[OUTPUT FORMAT]
Return a structured review:
ISSUES: (numbered, severity: critical/high/medium/low)
SUGGESTIONS: (numbered, optional improvements)
VERDICT: APPROVE | REQUEST_CHANGES | NEEDS_DISCUSSION

This prompt is approximately 220 tokens. Many teams write equivalent prompts at 800–1,200 tokens of narrative prose. The structured version is more reliably followed, easier to update, and costs roughly a third to a fifth as much per call.

Tip: Adopt a standard system prompt schema for your team and enforce it via a template. When every engineer on the team writes system prompts using the same [ROLE]/[CONTEXT]/[BEHAVIOR]/[CONSTRAINTS]/[OUTPUT FORMAT] structure, it becomes easy to spot redundancy, compare prompts across agents, and enforce a token budget per section.


Output Format Specifications: Getting Precision Without Verbosity

Output format instructions are frequently the most bloated section of a system prompt. Teams write lengthy narrative descriptions of what they want when a compact specification would work better.

Bloated format instruction (95 tokens):
"Please make sure to always return your response in a JSON format. The JSON should have a field called 'summary' that contains a brief text summary of your findings, and a field called 'items' that is a list of objects where each object has a 'title', 'description', and 'severity' field. The severity should be one of: critical, high, medium, or low."

Lean format specification (45 tokens):

Output JSON only:
{"summary": "string", "items": [{"title": "string", "description": "string", "severity": "critical|high|medium|low"}]}

The JSON schema example is self-documenting. It specifies field names, types, and allowed values in fewer than half the tokens. The model reads the schema and knows exactly what to produce.

For more complex formats, use a concise schema notation:

Output format (JSON):
{
  "verdict": "pass|fail|warn",
  "checks": [{"id": "string", "passed": boolean, "detail": "string|null"}],
  "summary": "string (max 100 chars)"
}

Constraints like "max 100 chars" embedded in the schema save a separate constraint sentence.

Tip: For API integrations where you have full control of the response parsing, use structured output features when available — OpenAI's response_format: { type: "json_schema" }, Anthropic's tool use with a single "respond" tool, or LangChain's with_structured_output(). These approaches enforce the format at the API level, which means you do not need to use tokens in the system prompt to explain it.
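As a minimal sketch of the Anthropic variant (the single "respond" tool name and its input schema here are illustrative choices, not a prescribed pattern):

import anthropic

client = anthropic.Anthropic()

# Illustrative "respond" tool whose input schema doubles as the output format
respond_tool = {
    "name": "respond",
    "description": "Return the structured review.",
    "input_schema": {
        "type": "object",
        "properties": {
            "summary": {"type": "string"},
            "items": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "title": {"type": "string"},
                        "description": {"type": "string"},
                        "severity": {"type": "string", "enum": ["critical", "high", "medium", "low"]},
                    },
                    "required": ["title", "description", "severity"],
                },
            },
        },
        "required": ["summary", "items"],
    },
}

response = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=[respond_tool],
    tool_choice={"type": "tool", "name": "respond"},  # force the structured response
    messages=[{"role": "user", "content": "Review this diff: ..."}],
)
structured = next(b.input for b in response.content if b.type == "tool_use")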


Token Budget Allocation for System Prompt Sections

A practical way to maintain lean system prompts is to set explicit token budgets per section and enforce them as you write. Here are recommended budgets based on production system prompt analysis across dozens of engineering and product teams:

Section                       Recommended Budget    Danger Zone
Role definition               20–50 tokens          >100 tokens
Context/environment facts     50–150 tokens         >300 tokens
Behavioral priorities         50–100 tokens         >200 tokens
Constraints / hard rules      50–150 tokens         >300 tokens
Output format                 30–100 tokens         >200 tokens
Total system prompt           200–550 tokens        >800 tokens

These are guidelines, not hard rules. A complex agent with many tools and intricate behavior may legitimately need 700 tokens. But if you find yourself exceeding 800 tokens, it is a strong signal that you are narrating instead of instructing, repeating yourself, or including context that belongs in the persistent layer rather than the system prompt.

Tip: Measure your system prompt token count as part of your CI/CD pipeline or as a pre-commit hook. If you use a CLAUDE.md file, add a GitHub Action that counts its tokens (or runs wc -w as a rough word-count proxy) and comments on the PR if it exceeds your budget. This keeps system prompt size a team concern, not an invisible individual decision.
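A minimal token-based check, assuming tiktoken's cl100k_base encoding as a proxy tokenizer and the 800-token total budget from the table above (both illustrative choices), might look like this:

import sys
import tiktoken

BUDGET = 800  # illustrative total budget from the table above
enc = tiktoken.get_encoding("cl100k_base")  # proxy; real model tokenizers differ slightly

def check_budget(path: str) -> int:
    text = open(path, encoding="utf-8").read()
    tokens = len(enc.encode(text))
    print(f"{path}: {tokens} tokens (budget {BUDGET})")
    return 1 if tokens > BUDGET else 0

if __name__ == "__main__":
    # Exit non-zero if any of the files passed on the command line is over budget
    sys.exit(max((check_budget(p) for p in sys.argv[1:]), default=0))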


Testing System Prompt Efficiency

Writing a lean system prompt is not just an editing exercise — you need to verify that compression does not degrade quality. Use this testing protocol:

Step 1: Establish a golden test set.
Collect 15–20 real inputs your agent receives, covering the most important use cases. Write the expected output for each.

Step 2: Measure baseline performance.
Run your current system prompt against the golden set. Score each output (pass/fail or 1–5 quality score). Record the token count of the system prompt.

Step 3: Apply compression techniques.
Use the methods from the previous sections to compress the system prompt. Target a 30–50% token reduction.

Step 4: Re-test and compare.
Run the compressed prompt against the same golden set. Compare quality scores and token counts.

Step 5: Iterate.
If quality dropped on specific cases, add back the minimum instruction necessary to recover those cases. If quality was maintained, you have a validated improvement.

This protocol should take 1–2 hours for a typical agent. The result is a system prompt that is both smaller and more reliably tested than what most teams deploy.

import anthropic
import tiktoken

client = anthropic.Anthropic()
# cl100k_base is a rough proxy; Anthropic's own tokenizer counts slightly differently
enc = tiktoken.get_encoding("cl100k_base")

def count_tokens(text: str) -> int:
    return len(enc.encode(text))

def test_system_prompt(system_prompt: str, test_cases: list[dict]) -> dict:
    token_count = count_tokens(system_prompt)
    scores = []

    for case in test_cases:
        response = client.messages.create(
            model="claude-opus-4-5",
            max_tokens=1024,
            system=system_prompt,
            messages=[{"role": "user", "content": case["input"]}]
        )
        output = response.content[0].text
        # evaluate_output is your own scoring function (exact-match check, rubric,
        # or LLM-as-judge) returning a 1-5 quality score
        score = evaluate_output(output, case["expected"])
        scores.append(score)

    return {
        "token_count": token_count,
        "avg_quality": sum(scores) / len(scores),
        "pass_rate": sum(1 for s in scores if s >= 4) / len(scores)
    }

# original_prompt, compressed_prompt, and golden_set (a list of
# {"input": ..., "expected": ...} dicts) are defined elsewhere
original_result = test_system_prompt(original_prompt, golden_set)
compressed_result = test_system_prompt(compressed_prompt, golden_set)

token_reduction = (1 - compressed_result["token_count"] / original_result["token_count"]) * 100
print(f"Token reduction: {token_reduction:.1f}%")
print(f"Quality delta: {compressed_result['avg_quality'] - original_result['avg_quality']:.2f}")

Tip: Track the results of each system prompt version in a simple log file or spreadsheet — date, token count, quality score, and a one-line description of what changed. This history is invaluable when a future change degrades quality and you need to understand what trade-off was made when.
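A minimal version of that log, sketched as a CSV append after each test run (the file name and columns are illustrative):

import csv
from datetime import date

def log_prompt_version(result: dict, description: str, path: str = "prompt_history.csv") -> None:
    # Appends one row per tested prompt version; columns mirror the tip above
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([
            date.today().isoformat(),
            result["token_count"],
            f"{result['avg_quality']:.2f}",
            description,
        ])

# Assumes compressed_result from the testing code above
log_prompt_version(compressed_result, "Compressed CONTEXT section; merged duplicate JSON rules")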