What Are Tokens and Why Do They Matter for QA Prompts?
When you type a message to an LLM, the model never reads it as characters or words the way you do. It reads tokens — sub-word units produced by a process called tokenization. Understanding tokens is not a theoretical exercise: it directly determines how much you can fit in one prompt, how much it costs to run an AI session, and why prompts sometimes behave differently than you expect.
Tokenization Basics
A tokenizer splits text into chunks that the model was trained to recognize. Common patterns:
- Short common words are usually one token: `test`, `pass`, `fail`, `bug`
- Longer or less-common words split: `refactoring` → `re` + `factor` + `ing` (three tokens)
- Whitespace and punctuation count: a newline or a comma is often a separate token
- Code identifiers split unpredictably: `getUserById` → `get` + `User` + `By` + `Id` (four tokens)
- Numbers in log files tokenize digit-by-digit in some models: `404` → `4` + `0` + `4`
You can inspect tokenization directly using tools like the OpenAI tokenizer or Anthropic's token counting API. For practical QA work, use this rough guide:
| Content type | Approximate ratio |
|---|---|
| Plain English text | ~750 words per 1,000 tokens |
| JSON API response (compact) | ~500–600 tokens per 1,000 characters |
| Prettified JSON | ~300–400 tokens per 1,000 characters |
| Python / JS source code | ~500 tokens per 1,000 characters |
| Minified code / stack traces | ~600–700 tokens per 1,000 characters |
| Markdown with headers and lists | ~650 tokens per 1,000 characters |
These ratios matter because every model has a token budget — a hard ceiling on how much text it can process in one request. You'll hit it faster with log dumps than with prose instructions.
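If you want to check these ratios against your own artifacts, here is a minimal sketch using the tiktoken library. It approximates OpenAI tokenizers; Claude, Gemini, and Llama tokenize differently, so treat the counts as estimates rather than exact budgets. The `test-output.log` path is just an example.

```python
# Rough token-count and ratio check for QA artifacts.
# tiktoken approximates OpenAI tokenizers; other models' tokenizers differ,
# so use these numbers as estimates, not exact budgets.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "identifier": "getUserById",
    "status code": "404",
    "prose": "The login form should reject expired tokens and show an error.",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")

# Ratio check for a larger artifact, e.g. a CI log or a spec file (example path).
with open("test-output.log", encoding="utf-8", errors="replace") as f:
    content = f.read()
print(f"test-output.log: {len(content)} chars -> ~{len(enc.encode(content))} tokens")
```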
Why QA Prompts Are Token-Intensive
QA prompts tend to be large. Think about what a thorough test generation prompt contains:
- System instructions and role framing: 200–400 tokens
- User story with acceptance criteria: 300–600 tokens
- Relevant source code snippet: 500–2,000 tokens
- Existing test examples for style reference: 500–1,500 tokens
- API schema or data model: 300–1,000 tokens
- The actual instruction: 50–150 tokens
Add those up and you're easily at 2,000–5,000 tokens before the model generates its response. For most current models with 100k–200k token windows, this is fine. But for log analysis or full test suite review tasks, you can hit limits fast.
Token Budget Allocation as a QA Practice
Treat your context window like a container with a finite capacity. Allocate it deliberately:
| Section | Recommended budget allocation |
|---|---|
| Task instruction | 5–10% |
| Role and constraint framing | 5% |
| Specification / requirements context | 15–25% |
| Code or system under test context | 20–30% |
| Existing test examples (for style guidance) | 10–20% |
| Output format instructions | 5–10% |
| Buffer for model response | 20–30% |
When you violate these allocations — for example, pasting an entire 10,000-line codebase to ask one question — you waste most of your budget on irrelevant content and leave less room for the response.
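As a concrete illustration, here is a small sketch that turns the allocation table into absolute token budgets for a given window size. The shares are one reasonable point within each recommended range, not fixed values.

```python
# Translate the percentage allocations above into absolute token budgets.
# Each share is one reasonable point within the recommended range; adjust per task.
WINDOW = 200_000  # adjust to your model's context window

allocations = {
    "task instruction": 0.10,
    "role and constraint framing": 0.05,
    "spec / requirements context": 0.20,
    "code or system under test": 0.25,
    "existing test examples": 0.15,
    "output format instructions": 0.05,
    "buffer for model response": 0.20,
}

for section, share in allocations.items():
    print(f"{section:<30} ~{int(WINDOW * share):>7,} tokens")
```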
Practical Token Estimation Without Counting
You don't need to count tokens precisely before every prompt. Use these heuristics:
- A single user story with acceptance criteria: about 300–400 tokens
- A 200-line Python test file: about 800–1,200 tokens
- A 50-line API endpoint handler: about 300–500 tokens
- A typical CI log failure section (last 100 lines): about 600–900 tokens
- An OpenAPI spec for one resource (5 endpoints): about 1,000–2,000 tokens
If your assembled context feels like "a lot," measure it. Claude's API includes a count_tokens endpoint. CLI tools like ttok let you count tokens on the command line before submitting.
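For example, here is a minimal sketch of counting an assembled prompt before sending it, assuming the Anthropic Python SDK exposes a messages.count_tokens method (verify against your installed SDK version, as the exact call may differ). The `assembled-prompt.txt` path stands in for whatever context you have put together.

```python
# Count the tokens in an assembled prompt before sending it.
# Assumes the Anthropic Python SDK's messages.count_tokens method;
# check your installed SDK version, as the call may differ.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("assembled-prompt.txt", encoding="utf-8") as f:
    prompt = f.read()

count = client.messages.count_tokens(
    model="claude-3-5-sonnet-latest",
    messages=[{"role": "user", "content": prompt}],
)
print(f"input tokens: {count.input_tokens}")
```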
Learning Tip: Build a habit of checking token size for your largest recurring prompts. Find the ones that consistently bloat — usually CI log dumps and full spec files — and create trimmed templates that extract only the relevant sections. A prompt that fits in 3,000 tokens and contains exactly the right context will outperform a 15,000-token prompt that includes everything-just-in-case.
How Do Context Windows Work and What Happens When Your Input Is Too Large?
The context window is the total number of tokens an LLM can process in a single request — both input and output combined. Every model has a published context window size, and understanding what happens at and near that limit is essential for building reliable QA workflows.
Current Model Context Window Sizes
| Model | Context window (approximate) |
|---|---|
| Claude 3.5 Sonnet | 200,000 tokens |
| Claude 3 Opus | 200,000 tokens |
| GPT-4o | 128,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens |
| Gemini 2.0 Flash | 1,000,000 tokens |
| Llama 3.1 70B | 128,000 tokens |
These numbers sound large, but they can be consumed quickly on real QA tasks. A large codebase with 500 files is hundreds of thousands of tokens. A week of CI run logs can exceed a million tokens.
What the Context Window Actually Contains
The context window holds everything the model processes in one request:
- System prompt — your instructions, role framing, constraints
- Conversation history — all prior turns in a multi-turn session
- Injected context — pasted code, specs, logs, test files
- The current user message — your actual request
- The model's response — the output counts against the window too
In an agentic workflow, the context also includes tool call results — every file the agent reads, every command output it processes — because those results get appended to the conversation. An agent that reads 10 large files before generating output can consume 50,000+ tokens in tool results alone.
What Happens When You Exceed the Context Window
Hard truncation is the most common behavior: when the window fills up, the beginning of the context is cut off. This is called left truncation or a sliding-window approach; the model processes the most recent tokens and discards the oldest.
For QA work, this is dangerous because:
- Your system prompt (which contains your role framing and constraints) is at the beginning — it gets cut first
- Your detailed task instructions from earlier in a session disappear
- The model may start giving output that ignores constraints it was given but can no longer see
- In agent loops, early tool results — which may contain critical spec information — get evicted
Some models and APIs throw an explicit error when you exceed the window. Others silently truncate. Never assume your full context is being processed just because the API returned a response.
Soft Degradation Before the Hard Limit
Research consistently shows that LLM performance degrades on information that appears in the middle of a very large context — a phenomenon called the "lost in the middle" effect. Even with a 200k token window, if you paste a 150,000-token codebase and bury your critical requirements in the middle, the model will give worse output than if those requirements were near the beginning or end.
Practical implication: Put the most important context last, right before your instruction. This is the position the model attends to most reliably.
[System prompt: role, constraints, output format]
[Background context: lower priority — older specs, general codebase overview]
[High-priority context: the specific file, the specific diff, the specific spec section]
[Instruction: what you want the model to do]
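A sketch of assembling context in that order is below; the parameter names are placeholders for whatever artifacts you have on hand.

```python
# Assemble context so the highest-priority material sits closest to the instruction.
# The parameter names are placeholders for whatever artifacts you have on hand.
def assemble_prompt(background: str, high_priority: str, instruction: str) -> str:
    return "\n\n".join([
        "## Background context (lower priority)",
        background,      # older specs, general codebase overview
        "## High-priority context",
        high_priority,   # the specific file, diff, or spec section
        "## Task",
        instruction,     # what you want the model to do
    ])

# The system prompt (role, constraints, output format) is passed separately via your
# API's system parameter, so it stays at the very top of the assembled window.
```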
Managing Context Window Budget in Long Agent Sessions
For agentic workflows that run many tool calls, context grows continuously. Strategies to manage it:
Summarization turns: Periodically ask the model to summarize what it has learned so far, then start a new conversation with that summary instead of the full history. This compresses context without losing essential information.
Chunked analysis: Break large analysis tasks into chunks — analyze 10 files, summarize findings, discard the full file contents, then analyze the next 10. The agent carries forward a compact summary, not the full raw content.
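A sketch of the chunked-analysis loop follows. The ask_model helper is hypothetical, standing in for whatever API or CLI you actually use.

```python
# Chunked analysis: carry a compact running summary forward, not the raw file contents.
# ask_model() is a hypothetical helper standing in for whatever API or CLI you use.
from pathlib import Path

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wrap your model API or CLI of choice here")

def analyze_in_chunks(test_files: list[Path], chunk_size: int = 10) -> str:
    running_summary = "(nothing analyzed yet)"
    for i in range(0, len(test_files), chunk_size):
        chunk = test_files[i : i + chunk_size]
        contents = "\n\n".join(f"### {path}\n{path.read_text()}" for path in chunk)
        running_summary = ask_model(
            f"Coverage summary so far:\n{running_summary}\n\n"
            "Summarize what these additional test files cover, then merge that into an "
            f"updated coverage summary of at most 500 words:\n\n{contents}"
        )
    return running_summary  # compact summary; the raw contents are discarded each round
```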
Selective retrieval: Rather than pasting an entire codebase, use semantic search to retrieve only the relevant files. Tools like Claude Code and Cursor implement this automatically — they don't paste your entire repo, they search and retrieve relevant sections.
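A toy version of selective retrieval is sketched below: score files by keyword overlap with the task and keep only the top few. Real tools use embeddings and semantic search, so this only illustrates the retrieve-rather-than-paste principle.

```python
# Toy selective retrieval: keep only the files most lexically related to the task.
# Real tools (Claude Code, Cursor) use semantic search over embeddings; this
# keyword-overlap version only illustrates the retrieve-don't-paste principle.
from pathlib import Path

def retrieve_relevant(task: str, root: str, top_k: int = 5) -> list[Path]:
    task_terms = set(task.lower().split())
    scored = []
    for path in Path(root).rglob("*.py"):
        words = set(path.read_text(errors="replace").lower().split())
        scored.append((len(task_terms & words), path))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for score, path in scored[:top_k] if score > 0]

# Paste only these files into the prompt instead of the whole repo:
# relevant = retrieve_relevant("login should reject expired tokens", "src/")
```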
Learning Tip: When an agent suddenly starts ignoring instructions it was following earlier, context window eviction is the likely cause. Check how many tokens your session has consumed. If you're over 80% of the window, start a new session with a fresh context that includes the most important instructions at the top and a compact summary of work done so far.
What Does the AI Actually See vs. What You Think You Sent?
This is one of the most common sources of poor AI output for QA engineers: a mismatch between what you believe you sent and what the model actually processed. Understanding this gap prevents hours of frustrated prompt debugging.
The Processing Pipeline
Between "you click send" and "the model processes your input," several transformations happen:
- Tokenization: Your text is split into tokens (described above)
- System prompt prepending: Your tool or API prepends a system prompt you may not have written or may not be fully aware of
- Message formatting: The tool wraps your message in a structured format ([HUMAN], [ASSISTANT] tags or equivalent)
- Conversation history injection: All prior turns are prepended in order
- Context injection: If you're using an agentic tool, it may inject retrieved files, tool results, or workspace context automatically
- Encoding: The token sequence is passed to the model's embedding layer
What the model "sees" is the entire assembled token sequence, not your message in isolation.
Implicit Context Injections
When you use tools like Claude Code, Cursor, GitHub Copilot, or a custom RAG system, the tool injects context you didn't manually type. Examples:
- Claude Code automatically injects your CLAUDE.md project instructions, the list of open files, and relevant code context
- Cursor injects semantically similar code from your codebase
- GitHub Copilot injects open editor tabs and recently opened files
- ChatGPT with memory injects your personal memory context
For QA workflows, this implicit injection is sometimes exactly what you want — but it can also inject stale or irrelevant context that misleads the model.
Example of a harmful implicit injection: You're analyzing a bug in service B, but your editor has files from service A open. An IDE-based AI assistant injects service A's code as context, and the model's bug analysis confidently references service A's data model instead of service B's.
What Gets Truncated or Ignored
Even when content is within the window, some content gets less attention:
- Very long, unbroken blocks of code — the model may not attend carefully to lines 200–800 of a 1,000-line file
- Repetitive content — if you paste the same file twice by accident, the model's attention dilutes across both copies
- Content with low relevance signals — the model's attention mechanism gives less weight to sections that don't semantically relate to the task instruction
- Minified or obfuscated code — tokenizes poorly and is comprehended even worse
Inspecting What the Model Actually Received
For high-stakes QA tasks (test plan for a critical feature, regression analysis before a major release), you should verify what context the model has:
Technique 1 — Ask the model to summarize its context:
**Prompt:**
Before generating the test plan, briefly list the artifacts you have access to in this
conversation: what files, specs, or documentation have been provided, and what the key
constraints are. Then proceed with the test plan.
The model's summary tells you if it registered your key inputs. If it lists the wrong spec version or doesn't mention a file you pasted, your context assembly has a problem.
Technique 2 — Ask about specific facts it should know:
**Prompt:**
Based on the spec I provided, what is the expected HTTP status code when an unauthenticated
user attempts to access the /admin/users endpoint?
If the answer is wrong, the spec either wasn't included, was truncated, or was formatted in a way the model couldn't parse.
Technique 3 — Use structured context acknowledgment blocks:
Start important prompts with a context inventory:
**Prompt:**
Context provided:
- Feature spec: [ATTACHED: user-authentication-spec.md, 450 lines]
- PR diff: [ATTACHED: pr-1234.diff, 120 lines]
- Existing test file: [ATTACHED: auth.test.ts, 200 lines]
Confirm you can see all three artifacts by summarizing one key point from each.
Then generate integration test scenarios for the spec's acceptance criteria.
This forces an explicit acknowledgment before the model proceeds.
Common Mismatches and How to Catch Them
| What you think you sent | What the model may have processed |
|---|---|
| A 500-line spec doc | The spec was in an attached file; the model received a file reference, not content |
| The latest version of a requirements doc | An earlier version from your clipboard history |
| The full test file | Only the visible portion of the file if you copy-pasted from your IDE |
| An API response in JSON | JSON with truncated lines because your terminal wrapped them |
| A stack trace | A stack trace cut off at a character limit |
Learning Tip: Add a "context check" step at the start of your most important recurring prompts. It takes 15 seconds and prevents you from getting a confident, detailed, and completely wrong answer based on incorrect context. Think of it as the QA equivalent of verifying your test environment before running a test suite — you don't skip that check, so don't skip this one.
How Does Context Window Size Affect Test Suite and Log Analysis Tasks?
The two most context-hungry QA tasks are test suite analysis and log/trace analysis. Each has a distinct set of constraints and strategies.
Test Suite Analysis at Scale
A typical mid-size web application has 500–2,000 test files. Even a modest test suite of 300 files averaging 150–200 lines each represents roughly 45,000–60,000 lines of test code — well over 200,000 tokens. No single prompt can hold the entire test suite.
The naive approach (and why it fails):
Paste the entire test suite directory into a prompt and ask "find coverage gaps." The model runs out of context, processes a subset, and produces a gap analysis that's incomplete at best and misleading at worst.
Effective strategies for large test suite analysis:
Strategy 1 — Hierarchical summarization
Process the test suite in layers:
1. For each test file, generate a compact summary: file name, module under test, test count, key scenarios covered (1–2 sentences per file)
2. Assemble all summaries into a single coverage map document (typically 5,000–15,000 tokens)
3. Analyze the coverage map against the requirements
This reduces 200,000 tokens to 15,000 while preserving the information needed for gap analysis.
**Prompt:**
Summarize this test file in 3–5 bullet points. For each bullet point, note:
- The function/component under test
- The key scenario(s) covered
- Any notable edge cases or negative paths tested
File: [paste test file content]
Keep your summary to 200 words maximum.
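A sketch of the driver loop for Strategy 1: it runs the summary prompt above over every test file and assembles the results into a single coverage map. The ask_model helper is hypothetical, and the *.test.ts glob is only an example, so match it to your suite's naming convention.

```python
# Driver loop for Strategy 1: per-file summaries assembled into one coverage map.
# ask_model() is a hypothetical helper standing in for whatever API or CLI you use;
# the *.test.ts glob is an example, so match it to your suite's naming convention.
from pathlib import Path

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wrap your model API or CLI of choice here")

SUMMARY_PROMPT = """Summarize this test file in 3-5 bullet points. For each bullet point, note:
- The function/component under test
- The key scenario(s) covered
- Any notable edge cases or negative paths tested

File: {content}

Keep your summary to 200 words maximum."""

def build_coverage_map(test_dir: str) -> str:
    sections = []
    for path in sorted(Path(test_dir).rglob("*.test.ts")):
        summary = ask_model(SUMMARY_PROMPT.format(content=path.read_text()))
        sections.append(f"## {path}\n{summary}")
    return "\n\n".join(sections)  # typically 5,000-15,000 tokens for a mid-size suite

# coverage_map = build_coverage_map("tests/")
# Analyze coverage_map against the requirements in a single follow-up prompt.
```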
Strategy 2 — Targeted module analysis
Rather than analyzing the full suite, analyze one module at a time aligned with the feature under current development. This keeps prompts focused and small.
Strategy 3 — Coverage diff analysis
Don't analyze all tests — analyze the delta. Focus only on test files that changed or were added in the current sprint, and the features those tests claim to cover. This is the most practical approach for ongoing sprint work.
Log and Trace Analysis at Scale
CI run logs and distributed system traces are often enormous. A single test run for a large application might produce 50,000–500,000 lines of log output. Failure analysis on that raw volume is impossible in a single prompt.
Key constraints to understand:
- Most interesting failure information in a CI log is in the last 10–20% of the output (where failures surface after successful setup and build steps)
- Stack traces are usually complete within 50–100 lines
- Test runner output (test names, pass/fail, timings) is compact and high-signal
- Verbose request/response logging can be multi-megabyte and mostly useless for failure analysis
Effective log analysis strategies:
Strategy 1 — Pre-filter before AI ingestion
Use grep or awk before passing to AI. Extract only failure lines, ERROR lines, and the 20 lines surrounding each failure. Reduce a 50,000-line log to 200–500 lines.
grep -n "FAIL\|● \|Error:" test-output.log | head -200
grep -n "ERROR\|FATAL" server.log | head -100
Strategy 2 — Two-stage analysis
- Stage 1: Ask the model to classify the types of failures present (assertion failures, setup errors, network timeouts, etc.)
- Stage 2: For each failure type, provide the specific relevant log section for detailed analysis
This keeps each individual prompt focused while building a complete picture.
Stage 1 prompt:
**Prompt:**
I'm providing the last 300 lines of a CI run log. Your job is to:
1. List every distinct failure you can identify (test name + brief failure type)
2. Categorize the failures: assertion failures / setup errors / timeout / dependency issues / other
3. Identify if any failures share a common root cause
Do NOT explain the failures yet. Just classify and list.
[paste log excerpt]
Stage 2 prompt (per failure category):
**Prompt:**
Here is the full stack trace and context for the "database connection timeout" failures
you identified. Analyze the root cause and suggest:
1. The most likely cause based on the evidence
2. What additional information would confirm the diagnosis
3. Recommended fix or investigation steps
[paste relevant log section]
Strategy 3 — Structured log extraction for recurring pipelines
For CI pipelines you run regularly, build a log extraction script that outputs a standardized failure summary JSON. Feed this structured summary to the AI instead of raw logs.
{
"run_id": "ci-4521",
"total_tests": 847,
"failed_tests": 12,
"failures": [
{
"test": "UserAuthService > login > should reject expired tokens",
"file": "src/auth/auth.service.test.ts",
"error": "Expected 401, received 200",
"stack_top": "at Object.<anonymous> (auth.service.test.ts:145:5)"
}
]
}
At 12 failures, this JSON is under 500 tokens — vastly more efficient than the raw log while containing everything needed for analysis.
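Here is a sketch of such an extraction script. The regex assumes Jest-style output, where each failure block starts with a "● Suite > test name" line; that format and the script name are assumptions, so adapt both to whatever your runner and CI actually emit.

```python
#!/usr/bin/env python3
# Extract a compact failure-summary JSON from a raw CI log on stdin.
# The regex assumes Jest-style output, where each failure block starts with a
# "● Suite > test name" line; adapt the pattern to whatever your runner emits.
import json
import re
import sys

def extract_failures(log_text: str, run_id: str) -> dict:
    failures = []
    for match in re.finditer(r"● (?P<test>.+)\n(?P<body>(?:.+\n)+?)\n", log_text):
        body_lines = match.group("body").splitlines()
        failures.append({
            "test": match.group("test").strip(),
            "error": body_lines[0].strip() if body_lines else "",
            "stack_top": next(
                (line.strip() for line in body_lines if line.strip().startswith("at ")), ""
            ),
        })
    return {"run_id": run_id, "failed_tests": len(failures), "failures": failures}

if __name__ == "__main__":
    run_id = sys.argv[1] if len(sys.argv) > 1 else "local"
    print(json.dumps(extract_failures(sys.stdin.read(), run_id), indent=2))
```

Run it as, for example, `python qa_log_extract.py ci-4521 < test-output.log`; the output mirrors the JSON shape above, and fields like total_tests or file paths can be added from other log lines your pipeline emits.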
Right-Sizing Your Context for Each Task
| Task | Optimal context size | What to include | What to exclude |
|---|---|---|---|
| Test case gen for one story | 2,000–5,000 tokens | Story + AC + relevant code + one test example | Unrelated test files, full codebase |
| Regression scope for a PR | 3,000–8,000 tokens | PR diff + affected module tests + risk areas | Full test suite, unrelated modules |
| Single failure analysis | 1,000–3,000 tokens | Stack trace + test code + relevant source | Full CI log, passing test output |
| Test suite coverage audit | 5,000–15,000 tokens | Coverage summary map (see strategy above) | Raw test files at scale |
| Full sprint test planning | 5,000–10,000 tokens | All sprint stories + acceptance criteria + domain context | Historical tests from unrelated modules |
Learning Tip: For log analysis, build a `qa-log-extract` shell script or alias that takes a CI log file and outputs a clean failure summary. Spending 10 minutes on preprocessing to distill a 100,000-line log into 300 lines of high-signal content will consistently produce better AI analysis than pasting the raw log and hoping the model finds the needle in the haystack. Automation of context preparation is a force multiplier: build it once, reuse it on every CI run.