What Are Tokens and Why Do They Matter for QA Prompts?
When you type a message to an LLM, the model never reads it as characters or words the way you do. It reads tokens — sub-word units produced by a process called tokenization. Understanding tokens is not a theoretical exercise: it directly determines how much you can fit in one prompt, how much it costs to run an AI session, and why prompts sometimes behave differently than you expect.
Tokenization Basics
A tokenizer splits text into chunks that the model was trained to recognize. Common patterns:
- Short common words are usually one token: `test`, `pass`, `fail`, `bug`
- Longer or less-common words split: `refactoring` → `re` + `factor` + `ing` (three tokens)
- Whitespace and punctuation count: a newline or a comma is often a separate token
- Code identifiers split unpredictably: `getUserById` → `get` + `User` + `By` + `Id` (four tokens)
- Numbers in log files tokenize digit-by-digit in some models: `404` → `4` + `0` + `4`
You can inspect tokenization directly using tools like the OpenAI tokenizer or Anthropic's token counting API. For practical QA work, use this rough guide:
| Content type | Approximate ratio |
|---|---|
| Plain English text | ~750 words per 1,000 tokens |
| JSON API response (compact) | ~500–600 tokens per 1,000 characters |
| Prettified JSON | ~300–400 tokens per 1,000 characters |
| Python / JS source code | ~500 tokens per 1,000 characters |
| Minified code / stack traces | ~600–700 tokens per 1,000 characters |
| Markdown with headers and lists | ~650 tokens per 1,000 characters |
These ratios matter because every model has a token budget — a hard ceiling on how much text it can process in one request. You'll hit it faster with log dumps than with prose instructions.
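If you want to check these ratios against your own artifacts, here is a minimal sketch using the tiktoken library. It approximates OpenAI tokenizers; Claude, Gemini, and Llama tokenize differently, so treat the counts as estimates rather than exact budgets. The `test-output.log` path is just an example.

```python
# Rough token-count and ratio check for QA artifacts.
# tiktoken approximates OpenAI tokenizers; other models' tokenizers differ,
# so use these numbers as estimates, not exact budgets.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

samples = {
    "identifier": "getUserById",
    "status code": "404",
    "prose": "The login form should reject expired tokens and show an error.",
}

for label, text in samples.items():
    tokens = enc.encode(text)
    print(f"{label}: {len(text)} chars -> {len(tokens)} tokens")

# Ratio check for a larger artifact, e.g. a CI log or a spec file (example path).
with open("test-output.log", encoding="utf-8", errors="replace") as f:
    content = f.read()
print(f"test-output.log: {len(content)} chars -> ~{len(enc.encode(content))} tokens")
```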
Why QA Prompts Are Token-Intensive
QA prompts tend to be large. Think about what a thorough test generation prompt contains:
- System instructions and role framing: 200–400 tokens
- User story with acceptance criteria: 300–600 tokens
- Relevant source code snippet: 500–2,000 tokens
- Existing test examples for style reference: 500–1,500 tokens
- API schema or data model: 300–1,000 tokens
- The actual instruction: 50–150 tokens
Add those up and you're easily at 2,000–5,000 tokens before the model generates its response. For most current models with 100k–200k token windows, this is fine. But for log analysis or full test suite review tasks, you can hit limits fast.
Token Budget Allocation as a QA Practice
Treat your context window like a container with a finite capacity. Allocate it deliberately:
| Section | Recommended budget allocation |
|---|---|
| Task instruction | 5–10% |
| Role and constraint framing | 5% |
| Specification / requirements context | 15–25% |
| Code or system under test context | 20–30% |
| Existing test examples (for style guidance) | 10–20% |
| Output format instructions | 5–10% |
| Buffer for model response | 20–30% |
When you violate these allocations — for example, pasting an entire 10,000-line codebase to ask one question — you waste most of your budget on irrelevant content and leave less room for the response.
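As a concrete illustration, here is a small sketch that turns the allocation table into absolute token budgets for a given window size. The shares are one reasonable point within each recommended range, not fixed values.

```python
# Translate the percentage allocations above into absolute token budgets.
# Each share is one reasonable point within the recommended range; adjust per task.
WINDOW = 200_000  # adjust to your model's context window

allocations = {
    "task instruction": 0.10,
    "role and constraint framing": 0.05,
    "spec / requirements context": 0.20,
    "code or system under test": 0.25,
    "existing test examples": 0.15,
    "output format instructions": 0.05,
    "buffer for model response": 0.20,
}

for section, share in allocations.items():
    print(f"{section:<30} ~{int(WINDOW * share):>7,} tokens")
```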
Practical Token Estimation Without Counting
You don't need to count tokens precisely before every prompt. Use these heuristics:
- A single user story with acceptance criteria: about 300–400 tokens
- A 200-line Python test file: about 800–1,200 tokens
- A 50-line API endpoint handler: about 300–500 tokens
- A typical CI log failure section (last 100 lines): about 600–900 tokens
- An OpenAPI spec for one resource (5 endpoints): about 1,000–2,000 tokens
If your assembled context feels like "a lot," measure it. Claude's API includes a count_tokens endpoint. CLI tools like ttok let you count tokens on the command line before submitting.
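For example, here is a minimal sketch of counting an assembled prompt before sending it, assuming the Anthropic Python SDK exposes a messages.count_tokens method (verify against your installed SDK version, as the exact call may differ). The `assembled-prompt.txt` path stands in for whatever context you have put together.

```python
# Count the tokens in an assembled prompt before sending it.
# Assumes the Anthropic Python SDK's messages.count_tokens method;
# check your installed SDK version, as the call may differ.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("assembled-prompt.txt", encoding="utf-8") as f:
    prompt = f.read()

count = client.messages.count_tokens(
    model="claude-3-5-sonnet-latest",
    messages=[{"role": "user", "content": prompt}],
)
print(f"input tokens: {count.input_tokens}")
```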
Learning Tip: Build a habit of checking token size for your largest recurring prompts. Find the ones that consistently bloat — usually CI log dumps and full spec files — and create trimmed templates that extract only the relevant sections. A prompt that fits in 3,000 tokens and contains exactly the right context will outperform a 15,000-token prompt that includes everything-just-in-case.
How Do Context Windows Work and What Happens When Your Input Is Too Large?
The context window is the total number of tokens an LLM can process in a single request — both input and output combined. Every model has a published context window size, and understanding what happens at and near that limit is essential for building reliable QA workflows.
Current Model Context Window Sizes
| Model | Context window (approximate) |
|---|---|
| Claude 3.5 Sonnet | 200,000 tokens |
| Claude 3 Opus | 200,000 tokens |
| GPT-4o | 128,000 tokens |
| Gemini 1.5 Pro | 1,000,000 tokens |
| Gemini 2.0 Flash | 1,000,000 tokens |
| Llama 3.1 70B | 128,000 tokens |
These numbers sound large, but they can be consumed quickly on real QA tasks. A large codebase with 500 files is hundreds of thousands of tokens. A week of CI run logs can exceed a million tokens.
What the Context Window Actually Contains
The context window holds everything the model processes in one request:
- System prompt — your instructions, role framing, constraints
- Conversation history — all prior turns in a multi-turn session
- Injected context — pasted code, specs, logs, test files
- The current user message — your actual request
- The model's response — the output counts against the window too
In an agentic workflow, the context also includes tool call results — every file the agent reads, every command output it processes — because those results get appended to the conversation. An agent that reads 10 large files before generating output can consume 50,000+ tokens in tool results alone.
What Happens When You Exceed the Context Window
Hard truncation is the most common behavior: when the window fills up, the beginning of the context is cut off. This is called left truncation or a sliding-window approach; the model processes the most recent tokens and discards the oldest.
For QA work, this is dangerous because:
- Your system prompt (which contains your role framing and constraints) is at the beginning — it gets cut first
- Your detailed task instructions from earlier in a session disappear
- The model may start giving output that ignores constraints it was given but can no longer see
- In agent loops, early tool results — which may contain critical spec information — get evicted
Some models and APIs throw an explicit error when you exceed the window. Others silently truncate. Never assume your full context is being processed just because the API returned a response.
Soft Degradation Before the Hard Limit
Research consistently shows that LLM performance degrades on information that appears in the middle of a very large context — a phenomenon called the "lost in the middle" effect. Even with a 200k token window, if you paste a 150,000-token codebase and bury your critical requirements in the middle, the model will give worse output than if those requirements were near the beginning or end.
Practical implication: Put the most important context last, right before your instruction. This is the position the model attends to most reliably.
[System prompt: role, constraints, output format]
[Background context: lower priority — older specs, general codebase overview]
[High-priority context: the specific file, the specific diff, the specific spec section]
[Instruction: what you want the model to do]
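A sketch of assembling context in that order is below; the parameter names are placeholders for whatever artifacts you have on hand.

```python
# Assemble context so the highest-priority material sits closest to the instruction.
# The parameter names are placeholders for whatever artifacts you have on hand.
def assemble_prompt(background: str, high_priority: str, instruction: str) -> str:
    return "\n\n".join([
        "## Background context (lower priority)",
        background,      # older specs, general codebase overview
        "## High-priority context",
        high_priority,   # the specific file, diff, or spec section
        "## Task",
        instruction,     # what you want the model to do
    ])

# The system prompt (role, constraints, output format) is passed separately via your
# API's system parameter, so it stays at the very top of the assembled window.
```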
Managing Context Window Budget in Long Agent Sessions
For agentic workflows that run many tool calls, context grows continuously. Strategies to manage it:
Summarization turns: Periodically ask the model to summarize what it has learned so far, then start a new conversation with that summary instead of the full history. This compresses context without losing essential information.
Chunked analysis: Break large analysis tasks into chunks — analyze 10 files, summarize findings, discard the full file contents, then analyze the next 10. The agent carries forward a compact summary, not the full raw content.
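A sketch of the chunked-analysis loop follows. The ask_model helper is hypothetical, standing in for whatever API or CLI you actually use.

```python
# Chunked analysis: carry a compact running summary forward, not the raw file contents.
# ask_model() is a hypothetical helper standing in for whatever API or CLI you use.
from pathlib import Path

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wrap your model API or CLI of choice here")

def analyze_in_chunks(test_files: list[Path], chunk_size: int = 10) -> str:
    running_summary = "(nothing analyzed yet)"
    for i in range(0, len(test_files), chunk_size):
        chunk = test_files[i : i + chunk_size]
        contents = "\n\n".join(f"### {path}\n{path.read_text()}" for path in chunk)
        running_summary = ask_model(
            f"Coverage summary so far:\n{running_summary}\n\n"
            "Summarize what these additional test files cover, then merge that into an "
            f"updated coverage summary of at most 500 words:\n\n{contents}"
        )
    return running_summary  # compact summary; the raw contents are discarded each round
```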
Selective retrieval: Rather than pasting an entire codebase, use semantic search to retrieve only the relevant files. Tools like Claude Code and Cursor implement this automatically — they don't paste your entire repo, they search and retrieve relevant sections.
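A toy version of selective retrieval is sketched below: score files by keyword overlap with the task and keep only the top few. Real tools use embeddings and semantic search, so this only illustrates the retrieve-rather-than-paste principle.

```python
# Toy selective retrieval: keep only the files most lexically related to the task.
# Real tools (Claude Code, Cursor) use semantic search over embeddings; this
# keyword-overlap version only illustrates the retrieve-don't-paste principle.
from pathlib import Path

def retrieve_relevant(task: str, root: str, top_k: int = 5) -> list[Path]:
    task_terms = set(task.lower().split())
    scored = []
    for path in Path(root).rglob("*.py"):
        words = set(path.read_text(errors="replace").lower().split())
        scored.append((len(task_terms & words), path))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [path for score, path in scored[:top_k] if score > 0]

# Paste only these files into the prompt instead of the whole repo:
# relevant = retrieve_relevant("login should reject expired tokens", "src/")
```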
Learning Tip: When an agent suddenly starts ignoring instructions it was following earlier, context window eviction is the likely cause. Check how many tokens your session has consumed. If you're over 80% of the window, start a new session with a fresh context that includes the most important instructions at the top and a compact summary of work done so far.
What Does the AI Actually See vs. What You Think You Sent?
This is one of the most common sources of poor AI output for QA engineers: a mismatch between what you believe you sent and what the model actually processed. Understanding this gap prevents hours of frustrated prompt debugging.
The Processing Pipeline
Between "you click send" and "the model processes your input," several transformations happen:
- Tokenization: Your text is split into tokens (described above)
- System prompt prepending: Your tool or API prepends a system prompt you may not have written or may not be fully aware of
- Message formatting: The tool wraps your message in a structured format ([HUMAN], [ASSISTANT] tags or equivalent)
- Conversation history injection: All prior turns are prepended in order
- Context injection: If you're using an agentic tool, it may inject retrieved files, tool results, or workspace context automatically
- Encoding: The token sequence is passed to the model's embedding layer
What the model "sees" is the entire assembled token sequence, not your message in isolation.
Implicit Context Injections
When you use tools like Claude Code, Cursor, GitHub Copilot, or a custom RAG system, the tool injects context you didn't manually type. Examples:
- Claude Code automatically injects your CLAUDE.md project instructions, the list of open files, and relevant code context
- Cursor injects semantically similar code from your codebase
- GitHub Copilot injects open editor tabs and recently opened files
- ChatGPT with memory injects your personal memory context
For QA workflows, this implicit injection is sometimes exactly what you want — but it can also inject stale or irrelevant context that misleads the model.
Example of a harmful implicit injection: You're analyzing a bug in service B, but your editor has files from service A open. An IDE-based AI assistant injects service A's code as context, and the model's bug analysis confidently references service A's data model instead of service B's.
What Gets Truncated or Ignored
Even when content is within the window, some content gets less attention:
- Very long, unbroken blocks of code — the model may not attend carefully to lines 200–800 of a 1,000-line file
- Repetitive content — if you paste the same file twice by accident, the model's attention dilutes across both copies
- Content with low relevance signals — the model's attention mechanism gives less weight to sections that don't semantically relate to the task instruction
- Minified or obfuscated code — tokenizes poorly and is comprehended even worse
Inspecting What the Model Actually Received
For high-stakes QA tasks (test plan for a critical feature, regression analysis before a major release), you should verify what context the model has:
Technique 1 — Ask the model to summarize its context:
**Prompt:**
Before generating the test plan, briefly list the artifacts you have access to in this
conversation: what files, specs, or documentation have been provided, and what the key
constraints are. Then proceed with the test plan.
The model's summary tells you if it registered your key inputs. If it lists the wrong spec version or doesn't mention a file you pasted, your context assembly has a problem.
Technique 2 — Ask about specific facts it should know:
**Prompt:**
Based on the spec I provided, what is the expected HTTP status code when an unauthenticated
user attempts to access the /admin/users endpoint?
If the answer is wrong, the spec either wasn't included, was truncated, or was formatted in a way the model couldn't parse.
Technique 3 — Use structured context acknowledgment blocks:
Start important prompts with a context inventory:
**Prompt:**
Context provided:
- Feature spec: [ATTACHED: user-authentication-spec.md, 450 lines]
- PR diff: [ATTACHED: pr-1234.diff, 120 lines]
- Existing test file: [ATTACHED: auth.test.ts, 200 lines]
Confirm you can see all three artifacts by summarizing one key point from each.
Then generate integration test scenarios for the spec's acceptance criteria.
This forces an explicit acknowledgment before the model proceeds.
Common Mismatches and How to Catch Them
| What you think you sent | What the model may have processed |
|---|---|
| A 500-line spec doc | The spec was in an attached file; the model received a file reference, not content |
| The latest version of a requirements doc | An earlier version from your clipboard history |
| The full test file | Only the visible portion of the file if you copy-pasted from your IDE |
| An API response in JSON | JSON with truncated lines because your terminal wrapped them |
| A stack trace | A stack trace cut off at a character limit |
Learning Tip: Add a "context check" step at the start of your most important recurring prompts. It takes 15 seconds and prevents you from getting a confident, detailed, and completely wrong answer based on incorrect context. Think of it as the QA equivalent of verifying your test environment before running a test suite — you don't skip that check, so don't skip this one.
How Does Context Window Size Affect Test Suite and Log Analysis Tasks?
The two most context-hungry QA tasks are test suite analysis and log/trace analysis. Each has a distinct set of constraints and strategies.
Test Suite Analysis at Scale
A typical mid-size web application has 500–2,000 test files. Even a modest test suite of 300 files averaging 150–200 lines each represents roughly 45,000–60,000 lines of test code — well over 200,000 tokens. No single prompt can hold the entire test suite.
The naive approach (and why it fails):
Paste the entire test suite directory into a prompt and ask "find coverage gaps." The model runs out of context, processes a subset, and produces a gap analysis that's incomplete at best and misleading at worst.
Effective strategies for large test suite analysis:
Strategy 1 — Hierarchical summarization
Process the test suite in layers:
1. For each test file, generate a compact summary: file name, module under test, test count, key scenarios covered (1–2 sentences per file)
2. Assemble all summaries into a single coverage map document (typically 5,000–15,000 tokens)
3. Analyze the coverage map against the requirements
This reduces 200,000 tokens to 15,000 while preserving the information needed for gap analysis.
**Prompt:**
Summarize this test file in 3–5 bullet points. For each bullet point, note:
- The function/component under test
- The key scenario(s) covered
- Any notable edge cases or negative paths tested
File: [paste test file content]
Keep your summary to 200 words maximum.
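A sketch of the driver loop for Strategy 1: it runs the summary prompt above over every test file and assembles the results into a single coverage map. The ask_model helper is hypothetical, and the *.test.ts glob is only an example, so match it to your suite's naming convention.

```python
# Driver loop for Strategy 1: per-file summaries assembled into one coverage map.
# ask_model() is a hypothetical helper standing in for whatever API or CLI you use;
# the *.test.ts glob is an example, so match it to your suite's naming convention.
from pathlib import Path

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wrap your model API or CLI of choice here")

SUMMARY_PROMPT = """Summarize this test file in 3-5 bullet points. For each bullet point, note:
- The function/component under test
- The key scenario(s) covered
- Any notable edge cases or negative paths tested

File: {content}

Keep your summary to 200 words maximum."""

def build_coverage_map(test_dir: str) -> str:
    sections = []
    for path in sorted(Path(test_dir).rglob("*.test.ts")):
        summary = ask_model(SUMMARY_PROMPT.format(content=path.read_text()))
        sections.append(f"## {path}\n{summary}")
    return "\n\n".join(sections)  # typically 5,000-15,000 tokens for a mid-size suite

# coverage_map = build_coverage_map("tests/")
# Analyze coverage_map against the requirements in a single follow-up prompt.
```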
Strategy 2 — Targeted module analysis
Rather than analyzing the full suite, analyze one module at a time aligned with the feature under current development. This keeps prompts focused and small.
Strategy 3 — Coverage diff analysis
Don't analyze all tests — analyze the delta. Focus only on test files that changed or were added in the current sprint, and the features those tests claim to cover. This is the most practical approach for ongoing sprint work.
Log and Trace Analysis at Scale
CI run logs and distributed system traces are often enormous. A single test run for a large application might produce 50,000–500,000 lines of log output. Failure analysis on that raw volume is impossible in a single prompt.
Key constraints to understand:
- Most interesting failure information in a CI log is in the last 10–20% of the output (where failures surface after successful setup and build steps)
- Stack traces are usually complete within 50–100 lines
- Test runner output (test names, pass/fail, timings) is compact and high-signal
- Verbose request/response logging can be multi-megabyte and mostly useless for failure analysis
Effective log analysis strategies:
Strategy 1 — Pre-filter before AI ingestion
Use grep or awk before passing to AI. Extract only failure lines, ERROR lines, and the 20 lines surrounding each failure. Reduce a 50,000-line log to 200–500 lines.
grep -n "FAIL\|● \|Error:" test-output.log | head -200
grep -n "ERROR\|FATAL" server.log | head -100
Strategy 2 — Two-stage analysis
- Stage 1: Ask the model to classify the types of failures present (assertion failures, setup errors, network timeouts, etc.)
- Stage 2: For each failure type, provide the specific relevant log section for detailed analysis
This keeps each individual prompt focused while building a complete picture.
Stage 1 prompt:
**Prompt:**
I'm providing the last 300 lines of a CI run log. Your job is to:
1. List every distinct failure you can identify (test name + brief failure type)
2. Categorize the failures: assertion failures / setup errors / timeout / dependency issues / other
3. Identify if any failures share a common root cause
Do NOT explain the failures yet. Just classify and list.
[paste log excerpt]
Stage 2 prompt (per failure category):
**Prompt:**
Here is the full stack trace and context for the "database connection timeout" failures
you identified. Analyze the root cause and suggest:
1. The most likely cause based on the evidence
2. What additional information would confirm the diagnosis
3. Recommended fix or investigation steps
[paste relevant log section]
Strategy 3 — Structured log extraction for recurring pipelines
For CI pipelines you run regularly, build a log extraction script that outputs a standardized failure summary JSON. Feed this structured summary to the AI instead of raw logs.
{
"run_id": "ci-4521",
"total_tests": 847,
"failed_tests": 12,
"failures": [
{
"test": "UserAuthService > login > should reject expired tokens",
"file": "src/auth/auth.service.test.ts",
"error": "Expected 401, received 200",
"stack_top": "at Object.<anonymous> (auth.service.test.ts:145:5)"
}
]
}
At 12 failures, this JSON is under 500 tokens — vastly more efficient than the raw log while containing everything needed for analysis.
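Here is a sketch of such an extraction script. The regex assumes Jest-style output, where each failure block starts with a "● Suite > test name" line; that format and the script name are assumptions, so adapt both to whatever your runner and CI actually emit.

```python
#!/usr/bin/env python3
# Extract a compact failure-summary JSON from a raw CI log on stdin.
# The regex assumes Jest-style output, where each failure block starts with a
# "● Suite > test name" line; adapt the pattern to whatever your runner emits.
import json
import re
import sys

def extract_failures(log_text: str, run_id: str) -> dict:
    failures = []
    for match in re.finditer(r"● (?P<test>.+)\n(?P<body>(?:.+\n)+?)\n", log_text):
        body_lines = match.group("body").splitlines()
        failures.append({
            "test": match.group("test").strip(),
            "error": body_lines[0].strip() if body_lines else "",
            "stack_top": next(
                (line.strip() for line in body_lines if line.strip().startswith("at ")), ""
            ),
        })
    return {"run_id": run_id, "failed_tests": len(failures), "failures": failures}

if __name__ == "__main__":
    run_id = sys.argv[1] if len(sys.argv) > 1 else "local"
    print(json.dumps(extract_failures(sys.stdin.read(), run_id), indent=2))
```

Run it as, for example, `python qa_log_extract.py ci-4521 < test-output.log`; the output mirrors the JSON shape above, and fields like total_tests or file paths can be added from other log lines your pipeline emits.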
Right-Sizing Your Context for Each Task
| Task | Optimal context size | What to include | What to exclude |
|---|---|---|---|
| Test case gen for one story | 2,000–5,000 tokens | Story + AC + relevant code + one test example | Unrelated test files, full codebase |
| Regression scope for a PR | 3,000–8,000 tokens | PR diff + affected module tests + risk areas | Full test suite, unrelated modules |
| Single failure analysis | 1,000–3,000 tokens | Stack trace + test code + relevant source | Full CI log, passing test output |
| Test suite coverage audit | 5,000–15,000 tokens | Coverage summary map (see strategy above) | Raw test files at scale |
| Full sprint test planning | 5,000–10,000 tokens | All sprint stories + acceptance criteria + domain context | Historical tests from unrelated modules |
Learning Tip: For log analysis, build a `qa-log-extract` shell script or alias that takes a CI log file and outputs a clean failure summary. Spending 10 minutes on preprocessing to distill a 100,000-line log into 300 lines of high-signal content will consistently produce better AI analysis than pasting the raw log and hoping the model finds the needle in the haystack. Automation of context preparation is a force multiplier: build it once, reuse it on every CI run.