When you ask an AI to analyze a bug from logs and traces, the quality of your output is determined almost entirely by how you structure your input. Pasting a raw, unedited log file into a chat window is the fastest way to get a vague, hedged, low-value response. Structured, trimmed, labeled evidence produces specific, actionable analysis.
This topic is about the craft of evidence structuring. You already know how to gather evidence from your years of bug investigation. The skill gap here is not collection — it's preparation. How do you decide what to include? How do you trim a 200,000-line log to the 60 lines that matter? How do you format distributed trace spans for an AI that reads linearly? How do you combine evidence from four different sources into a single prompt that doesn't exceed the context window or confuse the AI with unlabeled artifacts?
By the end of this topic, you'll have a repeatable evidence-structuring workflow and a set of prompt templates ready to apply to your most common bug categories.
Which logs should you include in an AI bug analysis prompt, and how should you trim them?
The answer to "which logs" is not "all of them." It is not even "all logs from the affected service." It is: the minimum log volume that contains the full execution context of the failing operation, starting from the last known-good state.
That definition is precise on purpose. Let's unpack each part.
"Minimum log volume"
Every line of log you include consumes context window space that could be used for reasoning. Context windows are large but not infinite — and more importantly, AI attention degrades as irrelevant content increases. A 2,000-line log full of [INFO] heartbeat OK entries dilutes the 15 lines that contain the actual failure signal.
Your target is a log excerpt that a senior engineer could read in five minutes and understand completely. That is typically 50–300 lines for most application failures.
"Full execution context of the failing operation"
A single error line is rarely enough. You need the log context that shows:
- What operation was being performed when the error occurred
- What state was established before the operation started
- What the system tried to do at each step
- Where the first anomaly appeared (which is often before the visible error)
"Starting from the last known-good state"
Many bugs have a root cause that occurs earlier in the log than the error line. A common pattern: a configuration value is loaded incorrectly at startup (buried in INFO lines), and that misconfiguration causes a NullPointerException (NPE) 47 seconds later. If you only include the NPE context, you'll never find the root cause.
Log trimming procedure
Use this procedure every time you prepare logs for AI analysis:
Step 1: Find the error anchor. Identify the exact timestamp and line of the primary error — the exception, the 5xx response, the assertion failure. This is your anchor.
Step 2: Walk backward from the anchor. Look back through the logs for the start of the operation that failed. For a web request, this is the incoming request log line. For a batch job, it's the job start line. For a test, it's the test setup or the first test step. This is your start line.
Step 3: Include 10–20 lines before the start line. This captures context from the previous operation or system state that may have set up the failure.
Step 4: Include everything from the start line through 10–20 lines after the error. The lines after the error often show recovery attempts, retry behavior, or cascading failures that are diagnostic.
Step 5: Remove repetitive noise lines. If a line pattern appears more than three times consecutively, replace the repetitions with a comment:
[... 847 lines of INFO health check omitted ...]
Step 6: Mark the error line. Add a comment so the AI knows exactly where the fault is visible:
2024-03-15 14:23:41.882 ERROR PaymentService - Refund calculation failed # <-- PRIMARY ERROR
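Steps 2–6 of this procedure can be partly scripted. Here's a minimal Python sketch, assuming plain-text logs with one entry per line; the ANCHOR and START patterns and the window sizes are placeholders you'd adjust for your own application.

```python
import re
import sys

# Patterns are illustrative; adjust them to your application's log format.
ANCHOR = re.compile(r"\bERROR\b|Exception")            # primary error signal
START = re.compile(r"Incoming request|Job started")    # start of the failing operation
TS_PREFIX = re.compile(r"^\S+(\s+\S+)?\s+")            # timestamp prefix, ignored when de-duping

def trim(lines, pad=15, max_repeats=3):
    # Step 1: find the error anchor (last line matching the error pattern).
    anchor = max(i for i, line in enumerate(lines) if ANCHOR.search(line))
    # Step 2: walk backward from the anchor to the start of the failing operation.
    start = next((i for i in range(anchor, -1, -1) if START.search(lines[i])), anchor)
    # Steps 3-4: pad before the start line and after the error line.
    lo, hi = max(0, start - pad), min(len(lines), anchor + pad + 1)
    out, prev_key, repeats = [], None, 0
    for i in range(lo, hi):
        # Step 5: collapse runs of near-identical lines (timestamps stripped for comparison).
        key = TS_PREFIX.sub("", lines[i])
        repeats = repeats + 1 if key == prev_key else 1
        prev_key = key
        if repeats == max_repeats + 1:
            out.append("[... repeated lines omitted ...]")
        if repeats > max_repeats:
            continue
        # Step 6: mark the primary error line.
        out.append(lines[i].rstrip("\n") + ("   # <-- PRIMARY ERROR" if i == anchor else ""))
    return out

if __name__ == "__main__":
    with open(sys.argv[1]) as f:
        print("\n".join(trim(f.readlines())))
```

The output still needs a human pass, but it gets you from a raw file to a reviewable excerpt in seconds.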
Prompt:
I'm providing trimmed application logs for a failed [operation name]. I've:
- Trimmed to the relevant time window: [start timestamp] to [end timestamp]
- Removed [N] repeated health check / INFO lines (noted with comments)
- Marked the primary error line with "# <-- PRIMARY ERROR"
## Application Logs
[paste trimmed, annotated logs]
Based on these logs, identify:
1. The first anomaly that appears before the primary error (if any)
2. The exact operation that was executing when the error occurred
3. Any log patterns that suggest the failure mechanism (missing expected log lines, unexpected retries, state transitions)
4. What additional log entries would you need to see to confirm root cause?
Learning Tip: Build the habit of grepping for your error anchor before opening an AI session. Run something like grep -n "ERROR\|WARN\|Exception" application.log | grep -A 2 -B 2 "14:23" to find candidate anchor lines quickly. The 60 seconds you spend trimming saves 10 minutes of AI analysis churn on irrelevant log noise.
How to format log context so AI can read it without wasting the context window?
Beyond trimming, the formatting of your log excerpt affects how well AI can parse and reference specific lines. Raw application logs often have format inconsistencies, long prefixes that carry little information, and missing structural labels that make it hard for the AI to navigate.
Use header labels for each log section
If you're including logs from multiple sources (application log, access log, database slow query log), clearly separate and label each section:
Prompt:
## Evidence Block
### [1] Application Log — payments-service (2024-03-15 14:22:55 – 14:23:50 UTC)
[log content]
### [2] Database Slow Query Log — payments-db (same time window)
[log content]
### [3] HTTP Access Log — api-gateway (same time window)
[log content]
Each section is from a different system component. The timestamps should align. Correlate across sections to identify where in the request lifecycle the failure occurred.
This labeling prevents the AI from conflating log sources and makes it possible to ask follow-up questions like "In section [2], is there a slow query that corresponds to the time of the error in section [1]?"
Normalize log timestamps
If your logs use different time formats or time zones across services, normalize them before pasting. Mismatched timestamps are one of the most common causes of AI failing to correlate events correctly. Add a normalization note if you've changed the format:
[Note: All timestamps converted to UTC. Original format was epoch milliseconds in sections [1] and [3].]
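If one source logs epoch milliseconds, as in the note above, a few lines of Python handle the conversion. A minimal sketch; the assumption that the timestamp is the first whitespace-separated field is purely illustrative.

```python
from datetime import datetime, timezone

def to_utc(epoch_ms: str) -> str:
    """Convert an epoch-milliseconds string to a UTC timestamp with millisecond precision."""
    dt = datetime.fromtimestamp(int(epoch_ms) / 1000, tz=timezone.utc)
    return dt.strftime("%Y-%m-%d %H:%M:%S.%f")[:-3]

def normalize_line(line: str) -> str:
    # Rewrite the first field of each log line if it looks like epoch milliseconds.
    first, _, rest = line.partition(" ")
    return f"{to_utc(first)} {rest}" if first.isdigit() else line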
Replace long log prefixes with shorthand
Some logging frameworks produce very long prefixes:
2024-03-15T14:23:41.882+00:00 [main] INFO com.company.payments.service.impl.RefundServiceImpl - Starting refund calculation
If every line has the same 80-character prefix pattern, the prefix consumes token budget without adding information. You can truncate with a note:
[Format: TIMESTAMP [THREAD] LEVEL LOGGER - MESSAGE. Abbreviated below.]
14:23:41.882 INFO RefundServiceImpl - Starting refund calculation
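For more than a handful of lines, a regex substitution does the abbreviation in one pass. A minimal sketch, assuming the verbose Logback/Log4j-style layout shown above; adjust the pattern to your framework's actual format.

```python
import re

# Matches: 2024-03-15T14:23:41.882+00:00 [main] INFO com.company...RefundServiceImpl - message
VERBOSE = re.compile(
    r"^\d{4}-\d{2}-\d{2}T(?P<time>\d{2}:\d{2}:\d{2}\.\d{3})\S*\s+"  # keep only HH:MM:SS.mmm
    r"\[\S+\]\s+(?P<level>\w+)\s+"                                   # drop the thread, keep the level
    r"(?:[\w$]+\.)*(?P<logger>[\w$]+)\s+-\s+(?P<msg>.*)$"            # keep only the simple class name
)

def abbreviate(line: str) -> str:
    m = VERBOSE.match(line)
    return f"{m['time']} {m['level']} {m['logger']} - {m['msg']}" if m else line
```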
Use inline annotations for key events
Add short inline comments to highlight state transitions, anomalies, or causally significant lines:
14:23:41.882 INFO RefundServiceImpl - Starting refund calculation for TXN-8821
14:23:41.901 INFO RefundServiceImpl - Retrieved original transaction: amount=150.00, currency=USD
14:23:41.903 INFO RefundServiceImpl - Partial refund requested: amount=75.00 # <-- PARTIAL REFUND PATH
14:23:41.904 DEBUG RefundServiceImpl - Calculating refund: originalAmount=150.00 # <-- USES ORIGINAL, NOT PARTIAL
14:23:41.905 ERROR RefundServiceImpl - Refund validation failed: amount 150.00 exceeds available balance 75.00 # <-- ERROR
Inline comments like this let the AI immediately see what you already noticed, so it can build on your analysis rather than rediscovering it.
Chunk very long evidence into sequential prompts
If your trimmed log is still over ~500 lines, split the analysis across prompts rather than pasting everything at once:
Prompt:
I'm going to share evidence in two parts. Please acknowledge Part 1 and hold it in context while I share Part 2, then analyze both together.
## Part 1: System Context and Symptom Description
[system context and symptom]
## Part 1: Application Logs (first half of relevant window)
[logs]
Please acknowledge you've received Part 1 and are ready for Part 2.
Then follow up with Part 2. This explicit sequencing prevents context confusion on large evidence sets.
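If you want to split mechanically rather than by hand, a small helper can produce the parts. A minimal sketch; the 400-line cutoff is an arbitrary placeholder, not a model-specific limit.

```python
def chunk_evidence(text: str, max_lines: int = 400):
    """Split a long evidence block into sequential, labeled parts for multi-turn prompting."""
    lines = text.splitlines()
    parts = [lines[i:i + max_lines] for i in range(0, len(lines), max_lines)]
    for n, part in enumerate(parts, 1):
        yield f"## Part {n} of {len(parts)}\n" + "\n".join(part)
```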
Learning Tip: Create a personal log formatting cheat sheet for your primary application. Document the logging format, which fields appear in every line, and which fields are diagnostic. The cheat sheet becomes the preamble of your AI prompts — paste it at the start of any log analysis session so the AI can parse the format correctly from the first line.
How to summarize distributed traces effectively for AI analysis?
Modern applications produce distributed traces — execution paths that span multiple services, each recording spans with timing, metadata, and status. A full distributed trace for a single user request can contain dozens or hundreds of spans across five or more services. Pasting a raw Jaeger or Zipkin trace JSON into an AI prompt is almost never the right approach.
What to extract from a distributed trace
For AI analysis, you need a trace narrative — a structured summary that preserves the timing, service boundaries, and failure signal without the JSON verbosity.
Extract and format the following:
1. Trace summary header:
Trace ID: abc123def456
Root operation: POST /api/v2/payments/refund
Total duration: 2,341 ms (expected: ~150 ms)
Status: FAILED (5xx at payments-service)
Span count: 47 spans across 6 services
2. Span waterfall — critical path only:
Omit successful fast spans. Include only spans that are:
- On the critical path (direct ancestors of the failed span)
- Unusually slow (> 2× their expected duration)
- Errored or timed out
- The last span before a gap of > 100ms
Service Call Waterfall (critical path only):
├── [0ms] api-gateway: route_request (12ms) ✓
├── [12ms] auth-service: validate_token (8ms) ✓
├── [20ms] payments-service: process_refund (2,321ms) ✗ ERROR
│ ├── [21ms] payments-service: validate_request (3ms) ✓
│ ├── [24ms] payments-service: fetch_transaction (850ms) ⚠ SLOW
│ │ └── [24ms] payments-db: SELECT transactions (847ms) ⚠ SLOW
│ ├── [874ms] payments-service: calculate_refund (2ms) ✓
│ └── [876ms] payments-service: write_refund (1,445ms) ✗ TIMEOUT
│ └── [876ms] payments-db: INSERT refunds (1,445ms) ✗ TIMEOUT
3. Error span detail:
For each errored span, include the span's metadata and error event:
Span: payments-service/write_refund
Duration: 1,445ms (timeout threshold: 1,000ms)
Error event: DB connection timeout after 1000ms
Tags: db.type=postgresql, db.statement=INSERT INTO refunds...
Prompt:
Below is a distributed trace summary for a failed refund operation. I've extracted the critical path spans and omitted 38 successful fast spans.
## Trace Summary
[paste trace summary header]
## Critical Path Waterfall
[paste formatted waterfall]
## Error Span Details
[paste error span details]
Analyze this trace for:
1. The primary failure point and its direct cause
2. Whether there are secondary anomalies (latency spikes, unexpected retries) that may have contributed
3. Whether the failure is isolated to one service or suggests a cascading failure pattern
4. What you'd want to check in the database or application logs for the slow spans
Reading trace timing patterns
AI is particularly good at identifying timing patterns in trace data that humans miss. Ask explicitly for timing analysis:
Prompt:
Looking at the timing in this trace waterfall:
1. Are there gaps between spans that suggest async processing, queuing, or lock waiting?
2. Is the slow span's latency consistent with a database lock, a cold cache, or network I/O?
3. Does the timeout duration suggest a configured timeout (round number) or a natural failure duration?
Learning Tip: Most distributed tracing tools (Jaeger, Zipkin, DataDog APM, Honeycomb) can export individual traces as JSON. Build a small script that takes a trace JSON and outputs the critical-path-only waterfall format shown above. You'll use this formatting step every time you investigate a performance-related or timeout bug, so automating it pays for itself within a week.
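As a starting point, here's a minimal Python sketch of that script. It assumes a pre-flattened span list (id, parent, service, operation, start_ms, duration_ms, error) rather than a raw Jaeger or Zipkin export, so you'd write a small adapter for your tracer's actual JSON schema; slow-span flagging is omitted because it needs per-operation baselines.

```python
import json
import sys

# Assumed input: a JSON array of spans, each like
# {"id": "...", "parent": "...", "service": "...", "operation": "...",
#  "start_ms": 0.0, "duration_ms": 0.0, "error": false}

def critical_path(spans):
    """Return the failed span and its chain of ancestors, root first."""
    by_id = {s["id"]: s for s in spans}
    failed = next(s for s in spans if s["error"])
    path, cur = [], failed
    while cur:
        path.append(cur)
        cur = by_id.get(cur.get("parent"))
    return list(reversed(path))

def waterfall(spans):
    path = critical_path(spans)
    t0 = path[0]["start_ms"]
    lines = []
    for depth, s in enumerate(path):
        status = "✗ ERROR" if s["error"] else "✓"
        lines.append(
            f"{'│   ' * depth}├── [{s['start_ms'] - t0:.0f}ms] "
            f"{s['service']}: {s['operation']} ({s['duration_ms']:.0f}ms) {status}"
        )
    return "\n".join(lines)

if __name__ == "__main__":
    print(waterfall(json.load(open(sys.argv[1]))))
```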
How to combine multiple evidence sources into a single actionable AI prompt?
Real bugs rarely have a single evidence source. A complete investigation typically combines application logs, a stack trace, a distributed trace, database query results, and possibly environment configuration data. Combining these into a single coherent prompt without overwhelming the AI requires deliberate structure.
The combined evidence prompt structure
Use a numbered evidence registry at the top of your prompt, then reference evidence sections by number in your analysis request:
Prompt:
## Bug Investigation Context
**System**: [system description]
**Symptom**: [precise symptom]
**Frequency**: [always / intermittent]
**Scope**: [affected users, environments, operations]
## Evidence Registry
I'm providing the following evidence. Each section is numbered for reference.
[1] Application Logs — payments-service — 2024-03-15 14:22–14:24 UTC
[2] Stack Trace — payments-service exception
[3] Distributed Trace — critical path only (trace ID: abc123)
[4] Database Query Result — payments table, transaction TXN-8821
[5] Environment Config — relevant env vars at time of failure
---
## [1] Application Logs
[trimmed, annotated logs]
## [2] Stack Trace
[full stack trace with caused-by chain]
## [3] Distributed Trace Summary
[formatted waterfall as described above]
## [4] Database Query Result
```sql
-- Query: SELECT * FROM payments WHERE id = 'TXN-8821'
-- Result:
id       | status  | amount | refund_amount | updated_at
TXN-8821 | PENDING | 150.00 | NULL          | 2024-03-15 14:23:41
```
## [5] Environment Configuration
NODE_ENV=production
DB_CONNECTION_TIMEOUT_MS=1000
REFUND_MAX_RETRY=3
FEATURE_PARTIAL_REFUNDS=true
## Analysis Request
Using all five evidence sections, perform a root cause analysis:
1. Correlate evidence [1], [2], and [3] to identify the primary failure mechanism
2. Use evidence [4] to determine whether the database state is consistent with the failure
3. Use evidence [5] to identify whether any configuration contributed
4. State a root cause as a precise, falsifiable claim
5. List what is still uncertain and what you'd collect to resolve it
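If you assemble this kind of block repeatedly, a small helper that stitches labeled evidence files together is worth keeping around. A minimal sketch; the file paths and section titles are placeholders.

```python
from pathlib import Path

# Each entry: (section title, path to the evidence file). Names are illustrative.
EVIDENCE = [
    ("[1] Application Logs", "evidence/app.log"),
    ("[2] Stack Trace", "evidence/stacktrace.txt"),
    ("[3] Distributed Trace Summary", "evidence/trace_waterfall.txt"),
    ("[4] Database Query Result", "evidence/db_result.txt"),
    ("[5] Environment Configuration", "evidence/env.txt"),
]

def build_evidence_block() -> str:
    registry = "\n".join(title for title, _ in EVIDENCE)
    sections = "\n\n".join(
        f"## {title}\n{Path(path).read_text().strip()}" for title, path in EVIDENCE
    )
    return f"## Evidence Registry\n{registry}\n\n---\n\n{sections}"

if __name__ == "__main__":
    print(build_evidence_block())
```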
Managing evidence volume in combined prompts
When your total evidence volume approaches the context limit, use this prioritization:
1. **Keep all of**: stack traces (full), error-specific log lines, database query results, configuration variables
2. **Trim aggressively**: INFO-level logs from healthy spans, verbose request/response bodies (show headers and status, trim body to relevant fields)
3. **Summarize instead of pasting**: If a service's logs are clean and show no anomalies, note "Service X logs reviewed — no anomalies in this window" instead of including them
Iterative evidence addition
For complex bugs, start with your highest-priority evidence and add more in follow-up turns:
Prompt (initial):
Here is my initial evidence for a bug investigation. I'll add more evidence in follow-up messages as I collect it.
[Evidence block — stack trace and application logs only]
Initial question: Based on this evidence alone, what are your top hypotheses, and what additional evidence would most help you narrow them down?
Then in follow-up turns:
Based on your initial analysis, I collected the distributed trace you requested. Here it is:
[Evidence — trace]
Does this confirm or change your hypothesis ranking? What do you need next?
This iterative approach is more effective than front-loading all evidence, because the AI helps you decide what to collect next rather than you having to guess.
Learning Tip: Create a bug analysis template file in your team's wiki or personal notes — a blank version of the combined evidence prompt structure above with placeholders. When a bug lands in your queue, open the template, fill in what you have, and paste it into your AI session immediately. The discipline of filling the template before starting AI analysis eliminates the lazy habit of just pasting a stack trace and hoping for the best.
Effective evidence structuring is what separates an AI-assisted investigation that concludes in 30 minutes from one that circles for two hours. The investment is small — trimming, labeling, and formatting evidence takes 10–15 minutes for most bugs — and the return is consistently higher-quality analysis, fewer follow-up questions, and a much more useful evidence trail for the bug report you'll write afterward.