Understanding how LLMs read and prioritize your input lets you write prompts that actually work — and avoid the subtle failures that come from feeding them the wrong context in the wrong order.
What Is a Token — and Why Should You Care as an Engineer?
When you send a message to an LLM, the model does not read your text the way you do. It does not see words, sentences, or lines of code. It sees tokens — small chunks of text produced by a process called tokenization. A token is roughly 3–4 characters of English text on average, but code behaves differently.
Consider this Python snippet:
```python
def calculate_discount(price, percent):
    return price * (1 - percent / 100)
```
That short two-line snippet is roughly 20–25 tokens. Each identifier (calculate_discount, price, percent), each operator (*, -, /), each parenthesis, and each keyword (def, return) may become its own token or share one with adjacent characters. Whitespace and indentation also consume tokens — which means deeply nested code or files with long lines cost more context than you might expect.
Why does this matter? Because every model has a token budget — a hard ceiling on how much it can process in a single interaction. Once you exceed it, the model either truncates your input, summarizes it, or fails. And even before you hit the ceiling, quality degrades as the context fills up. Knowing the token cost of what you're sending helps you make deliberate trade-offs: should you paste the whole 800-line service class, or just the 40-line method that's actually relevant?
A practical rule of thumb: 1,000 tokens ≈ 750 words ≈ 30–50 lines of average code. Use this to estimate whether a file fits comfortably or whether you need to be selective.
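If you want more than a mental estimate, you can measure token counts directly. The sketch below uses the open-source tiktoken library as an approximation; it mirrors OpenAI-style tokenizers, so treat the numbers as rough guidance rather than exact costs for your specific model.

```python
# A minimal sketch for estimating token cost before pasting code into a prompt.
# Assumes the tiktoken package is installed; counts are approximate for other models.
import tiktoken

def estimate_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    """Return an approximate token count for a block of text."""
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))

snippet = """def calculate_discount(price, percent):
    return price * (1 - percent / 100)
"""

print(estimate_tokens(snippet))           # roughly 20-25 tokens for this snippet
print(estimate_tokens(snippet) / 1000)    # fraction of a 1K-token budget
```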
Learning tip: Before pasting code into a prompt, estimate its token cost mentally. A single large file (500+ lines) can consume 15–20% of a typical context window. If you are sending multiple files, you may be spending most of your budget before you even write your question.
Context Windows — The Working Memory of an LLM
A context window is the total number of tokens a model can process in one call — both input (your prompt, files, conversation history) and output (the model's response). Common sizes today range from 8K tokens to 200K tokens depending on the model.
Think of the context window as the model's working memory. Unlike a database or a file system, the model has no persistent state between calls. Every time you send a new message, the entire conversation history is re-sent from scratch. If you are building an agentic system that reads code, executes tools, and accumulates results over many steps, the context window fills up fast.
For large codebases, this creates a genuine engineering problem. A typical backend service might have dozens of files totaling hundreds of thousands of tokens — far more than any model can process at once. This forces you to make explicit decisions about what context to include. You cannot just "give the model your whole codebase" and expect it to work well. You need to be selective, structured, and deliberate.
Retrieval strategies help here: rather than loading all files, you identify the most relevant files for the current task and load only those. This is a skill in itself, and it is one of the core competencies of context engineering covered throughout this module.
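To make that concrete, here is a minimal sketch of budget-aware file selection. It assumes you already have an estimated token count and a task-relevance score for each file; the scoring itself (embedding similarity, keyword matching, or something else) is out of scope and represented here by a plain number.

```python
# A minimal sketch of budget-aware context selection, assuming per-file token
# counts and task-relevance scores are already available. The relevance numbers
# below are illustrative placeholders, not output from a real scoring library.
from dataclasses import dataclass

@dataclass
class SourceFile:
    path: str
    tokens: int       # estimated token cost of the file contents
    relevance: float  # task-specific relevance score in [0, 1]

def select_context(files: list[SourceFile], budget: int) -> list[SourceFile]:
    """Greedily pick the most relevant files that still fit within the token budget."""
    selected, spent = [], 0
    for f in sorted(files, key=lambda f: f.relevance, reverse=True):
        if spent + f.tokens <= budget:
            selected.append(f)
            spent += f.tokens
    return selected

candidates = [
    SourceFile("services/user_service.py", tokens=12_000, relevance=0.9),
    SourceFile("repositories/user_repo.py", tokens=4_000, relevance=0.8),
    SourceFile("utils/formatting.py", tokens=9_000, relevance=0.2),
]
for f in select_context(candidates, budget=20_000):
    print(f.path)
```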
Learning tip: Treat the context window like RAM in a memory-constrained environment. You would not load every library into memory to run one function. Apply the same thinking to prompts: load only what the model actually needs to do the task at hand.
How Attention Works — A Mental Model for Engineers
You do not need to understand the mathematics of attention mechanisms to use LLMs effectively — but having a mental model for how attention works helps you predict model behavior and diagnose failures.
Here is the core idea: when a transformer processes your input, every token can "look at" every other token in the context. The model learns to assign higher attention weight to tokens that are more relevant to the current token being processed. In practical terms, this means related concepts reinforce each other — the word price in your code is likely to attend strongly to the word discount nearby, because the model has learned that they co-occur in financial contexts.
The implication for engineers is this: relationships matter, not just presence. Dumping a file into context gives the model the tokens, but does not guarantee it will understand how those tokens relate to your task. If you add a 500-line utility file to provide one helper function, the model now has to "find" that function in 500 lines of noise. Attention is real, but it is not magic — the model does better when you surface what matters rather than hoping it discovers it.
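One practical way to surface what matters is to extract just the helper you need rather than pasting the whole file. Below is a minimal sketch using only Python's standard ast module (Python 3.8+); the file and function names are illustrative.

```python
# A minimal sketch: pull one function's source out of a large module so the
# prompt contains only the relevant helper, not 500 lines of surrounding noise.
# Standard library only; ast.get_source_segment requires Python 3.8+.
import ast
from typing import Optional

def extract_function(source: str, name: str) -> Optional[str]:
    """Return the source of a single function definition, or None if not found."""
    tree = ast.parse(source)
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and node.name == name:
            return ast.get_source_segment(source, node)
    return None

with open("utils.py") as f:  # illustrative 500-line utility module
    module_source = f.read()

helper = extract_function(module_source, "calculate_discount")
print(helper)  # paste this into the prompt instead of the whole file
```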
Think of attention like a spotlight that sweeps across your context. The spotlight is smart — it learns where to look — but the brightness falls off with distance and density. Sparse, relevant context lets the spotlight find what matters. Dense, unfocused context forces it to work harder, and it will make mistakes.
Learning tip: If the model gives a wrong or off-target answer, consider whether the relevant information was buried inside a large block of unrelated code or text. Try isolating just the relevant portion and re-running. The answer often improves dramatically.
Context Position Affects Quality — Recency Bias and the Lost-in-the-Middle Problem
One of the most practically important things to know about LLMs is that where you put information in the context window affects how well the model uses it. This is not a theoretical concern — it is a well-documented empirical pattern.
Two effects are worth knowing:
Recency bias means that models tend to give higher weight to tokens near the end of the context. If you put your question at the end and your relevant code near the beginning, the model "remembers" the question more strongly than the code. This sounds counterintuitive — you might expect the model to weigh everything equally — but it does not.
The lost-in-the-middle problem refers to a documented finding that information placed in the middle of a long context is retrieved less reliably than information placed at the beginning or end. If you have a 100K-token context and your most important function is buried at position 50K, the model is more likely to miss or misinterpret it than if you had placed it at the beginning.
For engineers building agentic workflows, these effects are not just curiosities — they are bugs waiting to happen. A retrieval step that prepends a retrieved file to the context might accidentally push your task instructions to the middle. A conversation that accumulates tool outputs in order might bury the most critical constraint between irrelevant results.
The practical rule is simple: put the most important information first. Put your task description, constraints, and critical code at the top. Let supporting context follow. If you have a long conversation history, summarize or truncate old turns rather than letting them crowd out new relevant information.
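A lightweight way to enforce that ordering is to assemble prompts from named sections instead of concatenating strings ad hoc. The sketch below is illustrative: the section headings and the summarize_history placeholder are assumptions, not a prescribed format.

```python
# A minimal sketch of position-aware prompt assembly: task and constraints first,
# supporting code next, older conversation history condensed and placed last.
# summarize_history is a hypothetical placeholder for real summarization.

def summarize_history(turns: list[str], max_turns: int = 3) -> str:
    """Keep only the most recent turns verbatim; stand-in for real summarization."""
    return "\n".join(turns[-max_turns:])

def build_prompt(task: str, constraints: list[str], code: str, history: list[str]) -> str:
    sections = [
        "## Task\n" + task,
        "## Constraints\n" + "\n".join(f"- {c}" for c in constraints),
        "## Relevant code\n" + code,
        "## Earlier conversation (condensed)\n" + summarize_history(history),
    ]
    return "\n\n".join(sections)

prompt = build_prompt(
    task="Explain why getUserById can return stale data after updateUser runs.",
    constraints=["Answer in under 200 words", "Reference the cache interface"],
    code="...the two relevant methods...",
    history=["turn 1", "turn 2", "turn 3", "turn 4"],
)
print(prompt)
```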
Learning tip: Audit the structure of your prompts as carefully as you would audit the structure of a function. Ask: what is the most important thing the model needs to know? Is it at the top? Is it separated from noise? Would a human skimming this quickly find the key information? If not, restructure.
Why Long Contexts Degrade — and What You Can Do About It
Even within the context window limit, longer contexts generally produce lower-quality outputs. This happens for several compounding reasons:
First, attention becomes diluted. As the context grows, each token competes with more other tokens for the model's attention. Important signals get weaker relative to the background noise of irrelevant text.
Second, the model may anchor on wrong information. A longer context gives the model more opportunities to latch onto something irrelevant that superficially looks related to the task. This produces confident-sounding but incorrect responses.
Third, instruction following degrades. Studies have found that models following multi-step instructions are more likely to miss or skip steps when the instructions are surrounded by large amounts of context. The model's ability to track state and follow a plan weakens as context grows.
What this means practically: shorter, more focused contexts produce better results than longer, less focused ones — even when the longer context technically contains all the information the model needs. This is the central insight of context engineering: it is not about giving the model more, it is about giving it the right thing.
Learning tip: If a complex prompt is producing unreliable results, try splitting it into smaller calls. Break the task into phases: use one call to gather information, a second to analyze, and a third to generate output. Each call operates on a focused context and produces more reliable results than one monolithic prompt trying to do everything.
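Here is a minimal sketch of that phased approach. The call_llm function is a hypothetical stand-in for whichever client library you use; the point is that each phase sees only the context it needs.

```python
# A minimal sketch of splitting one monolithic prompt into focused phases.
# call_llm is a hypothetical stand-in for your model client, not a real API.

def call_llm(prompt: str) -> str:
    raise NotImplementedError("wire this to your model client of choice")

def review_change(diff: str, related_code: str) -> str:
    # Phase 1: gather. Identify what the change touches, from the diff alone.
    touched = call_llm(f"List the functions and behaviors this diff changes:\n\n{diff}")

    # Phase 2: analyze. Focused context: only the touched behaviors plus related code.
    risks = call_llm(
        "Given these changed behaviors and the related code, list concrete risks:\n\n"
        f"{touched}\n\n{related_code}"
    )

    # Phase 3: generate. Produce the review from the distilled analysis,
    # not from the raw diff plus the whole codebase.
    return call_llm(f"Write a concise code review covering these risks:\n\n{risks}")
```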
Hands-On: Diagnosing and Improving a Bloated Prompt
This exercise walks you through a concrete workflow for identifying when a prompt is suffering from context overload and restructuring it to improve quality.
Prerequisites: Access to any modern LLM interface (Claude, GPT-4, etc.) and a code file of at least 200 lines from a real project.
- Select a real engineering task. Choose something you have actually needed help with recently — for example, understanding why a particular function behaves unexpectedly, or getting a review of a complex method.
- Write a first-draft prompt by pasting everything. Include the entire file, the full error message or context, and your question at the end. This is the "naive" approach most engineers start with.
```prompt
Here is my entire UserService class (800 lines). It uses a repository pattern with dependency injection. I'm seeing an issue where `getUserById` returns stale data after an update. Can you help me debug this?
[paste entire 800-line file here]
```
- Note the quality of the response. Does the model identify the actual problem? Does it get distracted by unrelated methods? Does it give generic advice rather than specific guidance?
- Rewrite the prompt with surgical context. Extract only the specific method causing the issue, the method that calls it, and any relevant interface definitions. Put the task description first, then the code.
```prompt
I need to debug a stale data issue in a repository-pattern service. The method `getUserById` is returning a cached/stale value after `updateUser` is called. Here are the two relevant methods and the cache interface they use:
[paste getUserById — ~20 lines]
[paste updateUser — ~15 lines]
[paste CacheInterface definition — ~10 lines]
What is causing the stale read, and how should I fix it?
```
- Compare the responses. The focused prompt should produce a more specific diagnosis, reference actual variable names and logic from the pasted code, and give actionable fix recommendations.
- Iterate by adding context incrementally. If the focused response is incomplete because it needs more context, add one piece at a time — the cache implementation, the repository base class — and observe when the quality peaks and when it starts to decline.
- Document your findings. Note the point at which adding more context stopped helping; this gives you a calibrated intuition for how much context this type of task actually needs. A lightweight way to log these comparisons is sketched after this list.
- Apply the pattern to a new task. Take a different engineering problem and start immediately with the focused approach rather than the naive approach. Compare the time it takes you to write the prompt and the quality of the result.
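One lightweight way to document these findings is to log the token cost of each prompt variant next to your quality notes. A minimal sketch, assuming tiktoken is installed; the file paths and notes are illustrative.

```python
# A minimal sketch for recording how prompt size relates to response quality.
# Assumes tiktoken is installed; paths, labels, and notes below are illustrative.
import csv
import tiktoken

def estimate_tokens(text: str) -> int:
    return len(tiktoken.get_encoding("cl100k_base").encode(text))

def log_experiment(path: str, label: str, prompt: str, quality_note: str) -> None:
    """Append one row: prompt variant, its token cost, and your quality assessment."""
    with open(path, "a", newline="") as f:
        csv.writer(f).writerow([label, estimate_tokens(prompt), quality_note])

naive_prompt = open("prompts/naive.txt").read()      # whole 800-line file pasted
focused_prompt = open("prompts/focused.txt").read()  # ~45 lines of surgical context

log_experiment("context_experiments.csv", "naive", naive_prompt, "generic advice, missed cache bug")
log_experiment("context_experiments.csv", "focused", focused_prompt, "identified stale cache read")
```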
Hands-On: Testing Position Effects in Your Own Prompts
This exercise lets you directly observe how context position affects model output, building empirical intuition you can apply to every prompt you write.
- Create a test prompt with three sections: a task instruction, a code block, and a constraint. Write the task instruction as a simple question like "What is the time complexity of this function and why?"
- Version A — constraint at the end. Structure the prompt with the task first, then code, then add a constraint at the very end.
```prompt
What is the time complexity of the following function and why?
def find_duplicates(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates
Answer in one sentence. Do not explain set lookup complexity — assume I know it.
```
- Version B — constraint buried in the middle. Restructure so the constraint is sandwiched between the task and the code.
```prompt
What is the time complexity of the following function and why?
Note: Answer in one sentence. Do not explain set lookup complexity — assume I know it.
def find_duplicates(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates
```
- Run both versions and compare. Does Version B produce a longer response that explains set lookup despite the instruction? Does Version A more reliably follow the constraint?
- Scale up the test with a longer context. Add 200–300 lines of unrelated code between the instruction and the function. Observe whether the model still correctly answers the question about the target function or drifts toward something else in the noise.
- Record your observations as a reference for how this specific model handles position. Different models have different profiles — some are more robust to position effects than others. Knowing your model's behavior lets you structure prompts accordingly. A small harness for automating this comparison is sketched after this list.
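If you want to automate the comparison, the sketch below builds both variants programmatically and pads the middle with unrelated lines. The send_prompt function is a hypothetical stand-in for your model client; here it only returns a placeholder string.

```python
# A minimal sketch of a position-effect test harness. send_prompt is a hypothetical
# stand-in for your model client; padding simulates unrelated code in the middle.

TASK = "What is the time complexity of the following function and why?"
CONSTRAINT = "Answer in one sentence. Do not explain set lookup complexity; assume I know it."
CODE = '''\
def find_duplicates(items):
    seen = set()
    duplicates = []
    for item in items:
        if item in seen:
            duplicates.append(item)
        else:
            seen.add(item)
    return duplicates
'''

def send_prompt(prompt: str) -> str:
    """Stand-in for a real model call; replace with your client library."""
    return "<model response here>"

def build_variants(padding_lines: int = 0) -> dict[str, str]:
    """Version A puts the constraint last; Version B buries it before the code."""
    padding = "\n".join(f"# unrelated line {i}" for i in range(padding_lines))
    return {
        "A_constraint_last": "\n\n".join([TASK, padding, CODE, CONSTRAINT]),
        "B_constraint_middle": "\n\n".join([TASK, CONSTRAINT, padding, CODE]),
    }

for name, prompt in build_variants(padding_lines=250).items():
    print(name, "->", send_prompt(prompt))
```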
Key Takeaways
- A token is not a word — code is often tokenized at the symbol level, making it more expensive per line than prose. Always estimate token cost before pasting large files.
- The context window is the model's entire working memory for a call. It does not persist between calls, and filling it with irrelevant content hurts quality even before you hit the limit.
- Attention is directional and distance-sensitive: the model is better at using information that is nearby, clearly related, and not buried in noise. Surface what matters rather than hoping the model finds it.
- Information in the middle of a long context is retrieved less reliably than information at the beginning or end. Put the most critical content first — task description, key constraints, most relevant code.
- Focused contexts outperform bloated ones. When a prompt is underperforming, the fix is usually to remove context, not add it.