Walking through a real incident end-to-end with an AI assistant teaches you the discipline of structured root cause analysis faster than any amount of theory.
The Scenario
It is 3:47 PM. Fifteen minutes ago your alerting system fired: the checkout service is returning HTTP 503 errors intermittently. The error rate is 8% of requests, which is high enough to affect a meaningful percentage of customers but low enough that the service has not tripped your circuit breaker. The deploy happened two hours ago. Three services are involved: checkout, which orchestrates the purchase flow; inventory, which reserves stock; and payment, which processes the charge.
This is the kind of incident that is easy to get wrong. The 503 could originate in checkout, in one of the downstream services, or in the infrastructure connecting them. Without a structured approach, investigation time is dominated by hypothesis-driven log scrolling — expensive, slow, and prone to confirmation bias. This is exactly where an AI assistant, given the right context, becomes a force multiplier.
The goal of this capstone is not just to solve the scenario — it is to learn the structured incident investigation workflow so you can apply it to any production incident on any stack.
Learning tip: Before you start any AI-assisted debugging session for a real incident, open a scratch document and commit to writing down every hypothesis you form and whether it was confirmed or ruled out. This discipline prevents the AI from nudging you toward a plausible-sounding but incorrect explanation.
Assembling the Incident Context Package
The quality of your AI-assisted RCA is directly proportional to the quality of the context you provide. A vague prompt ("my checkout service is broken") produces vague output. A structured context package produces actionable analysis.
Your context package should contain four components:
1. Error log samples — Paste three to five representative error log lines, not fifty. Include timestamps, service names, trace IDs, and the raw error message. Strip PII.
2. Distributed trace samples — One trace from a failing request and one from a successful request, both from the same time window. The contrast is what tells the story.
3. Recent deploy diff — The relevant portions of the diff from the deploy that happened two hours ago. If the diff is large, include only the files that touch the code paths involved in the failing flow.
4. Service dependency map — A plain-text or inline description of which services call which, what protocols they use, and what timeout/retry settings are configured. If this does not exist, write it from memory now — the act of writing it often surfaces the answer.
For this scenario, your context package looks like this:
ERROR LOG SAMPLES (checkout service, last 15 minutes):
[15:32:41] checkout ERROR TraceID=abc123 checkout.processOrder failed: upstream timeout after 3000ms calling inventory.reserveStock
[15:34:12] checkout ERROR TraceID=def456 checkout.processOrder failed: upstream timeout after 3000ms calling inventory.reserveStock
[15:35:50] checkout ERROR TraceID=ghi789 checkout.processOrder failed: upstream timeout after 3000ms calling payment.chargeCard
TRACE COMPARISON:
Successful trace (TraceID=jkl012):
checkout.processOrder: 1240ms total
-> inventory.reserveStock: 340ms
-> payment.chargeCard: 890ms
Failing trace (TraceID=abc123):
checkout.processOrder: 3001ms total (timeout)
-> inventory.reserveStock: 3001ms (no response received)
RECENT DEPLOY DIFF (inventory service, deployed 13:45):
- timeout: 5000
+ timeout: 2500
retries: 0
SERVICE MAP:
checkout -> inventory (HTTP, timeout: 3000ms, retries: 0)
checkout -> payment (HTTP, timeout: 5000ms, retries: 2)
inventory -> warehouse-db (PostgreSQL, timeout: 4000ms, pool: 10)
Learning tip: If you are missing any component of the context package during a real incident, say so explicitly in your prompt. Write "MISSING: no trace data available, working from logs only." The AI will adjust its confidence level and avoid over-indexing on what it cannot see.
Structured AI RCA Session
With the context package assembled, you are ready to begin the RCA session. The session has three phases: initial analysis, hypothesis ranking, and confirmation.
Phase 1: Initial Analysis
Open your AI session with the full context package and this structured prompt:
I am investigating a production incident. Here is the context package:
[PASTE YOUR FULL CONTEXT PACKAGE]
Based only on the evidence above:
1. What are the top three hypotheses for the root cause, ranked by likelihood?
2. For each hypothesis, what additional evidence would confirm or rule it out?
3. What is the most likely blast radius — which user flows are affected and which are not?
4. Is there any evidence in the logs or traces that points AWAY from the most obvious hypothesis?
Do not recommend fixes yet. Focus only on analysis and evidence gaps.
Expected output: Three ranked hypotheses with supporting evidence citations. For this scenario, the top hypothesis should involve the inventory service timeout reduction, since the diff shows the inventory timeout dropping to 2500ms while checkout calls inventory with a 3000ms client timeout. When a stock query takes between 2500ms and 3000ms, which complex queries legitimately do, inventory drops the request before checkout's timeout fires; checkout then waits out the full 3000ms and returns a 503.
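To make the failure mode concrete before comparing against your own AI session, it can help to model how the two timeouts interact. The following is a minimal, illustrative sketch in plain Python, not real service code; the numbers come from the scenario, and the silent-abort behavior is the assumption the hypothesis rests on.

```python
# Illustrative model of how the server-side and client-side timeouts interact.
# Numbers come from the scenario; the behavior is a simplifying assumption.

CHECKOUT_CLIENT_TIMEOUT_MS = 3000   # checkout -> inventory call timeout
INVENTORY_TIMEOUT_OLD_MS = 5000     # inventory server timeout before the deploy
INVENTORY_TIMEOUT_NEW_MS = 2500     # inventory server timeout after the deploy

def outcome(query_ms: int, inventory_timeout_ms: int) -> str:
    """What checkout sees for a stock query that takes query_ms to complete."""
    if query_ms > inventory_timeout_ms:
        # Inventory aborts silently; checkout never hears back, waits out its own
        # client timeout, and returns a 503.
        return f"503 at {CHECKOUT_CLIENT_TIMEOUT_MS}ms (inventory aborted at {inventory_timeout_ms}ms)"
    if query_ms > CHECKOUT_CLIENT_TIMEOUT_MS:
        return f"503 at {CHECKOUT_CLIENT_TIMEOUT_MS}ms (checkout timed out first)"
    return f"200 in {query_ms}ms"

for query_ms in (340, 2600, 2800, 3200):
    before = outcome(query_ms, INVENTORY_TIMEOUT_OLD_MS)
    after = outcome(query_ms, INVENTORY_TIMEOUT_NEW_MS)
    print(f"{query_ms}ms query | before deploy: {before:<45} | after deploy: {after}")
```

The 2600ms and 2800ms rows are the ones that flipped from success to failure after the deploy, which is exactly the population of complex stock queries driving the 8% error rate.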
Phase 2: Hypothesis Ranking and Testing
Take the top hypothesis and probe it:
Focus on Hypothesis 1: the inventory service timeout was reduced from 5000ms to 2500ms in the recent deploy, but checkout's call timeout is 3000ms. This means inventory can silently abort requests that checkout is still waiting for.
Help me design a test to confirm or rule out this hypothesis using only what is available during the incident (no code changes, no deploys):
1. What should I look for in inventory service logs to confirm it is terminating requests at the 2500ms mark?
2. What should I grep for in checkout logs to confirm the 503s correlate with requests that took exactly 3000ms (checkout's timeout) rather than some other duration?
3. Is there a safe way to reproduce this with synthetic traffic in production without impacting real customers?
4. If this hypothesis is correct, what is the fastest safe mitigation that does not require a code change?
Expected output: A concrete investigation checklist with exact log grep patterns, a synthetic traffic approach using a known test account, and a mitigation recommendation such as increasing the inventory timeout back to 5000ms via a config change.
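As a rough illustration of what those log searches reduce to, here is a sketch that buckets request durations from both services' logs. It assumes newline-delimited JSON logs under /var/log/services/ with duration_ms, status, upstream, and event fields; the file names and field names are assumptions, so adapt them to your real log schema.

```python
# Sketch only: confirm the two duration signatures of the timeout mismatch.
# File names and field names (duration_ms, status, upstream, event) are assumptions.
import json
from collections import Counter
from pathlib import Path

def duration_histogram(log_path, predicate):
    """Bucket matching log entries by duration, rounded to the nearest 100ms."""
    buckets = Counter()
    for line in Path(log_path).read_text().splitlines():
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip any non-JSON lines
        if predicate(entry):
            buckets[round(entry["duration_ms"], -2)] += 1
    return buckets

# Signature 1: checkout 503s on the inventory call should cluster at ~3000ms.
checkout_503s = duration_histogram(
    "/var/log/services/checkout.log",
    lambda e: e.get("status") == 503 and e.get("upstream") == "inventory.reserveStock",
)
# Signature 2: inventory-side request terminations should cluster at ~2500ms.
inventory_drops = duration_histogram(
    "/var/log/services/inventory.log",
    lambda e: e.get("event") == "request_timeout",
)
print("checkout 503 durations:", checkout_503s.most_common(5))
print("inventory drop durations:", inventory_drops.most_common(5))
```

If the two histograms peak at 3000ms and 2500ms respectively, the hypothesis is effectively confirmed before you run any synthetic traffic.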
Phase 3: Confirming the Root Cause
After running the investigation steps, you confirm: inventory logs show requests being dropped at exactly 2500ms, checkout logs show 503s from calls that lasted exactly 3000ms, and the error rate correlates perfectly with requests to SKUs that require complex multi-warehouse stock aggregation queries — which legitimately take 2500–3500ms.
Now close the loop:
The hypothesis is confirmed. Inventory service logs show requests terminated at 2500ms. Checkout 503s correlate exactly with the 3000ms timeout boundary. Root cause: the inventory service timeout was reduced from 5000ms to 2500ms without validating against the actual P99 latency of complex stock queries (which is ~2800ms).
The missing piece: there is no retry configured on the checkout -> inventory call. A single timeout causes a 503 with no recovery attempt.
Summarize the confirmed root cause in 3 sentences suitable for inclusion in a postmortem. Then list the contributing factors in order of impact.
Expected output: A clean RCA summary and a prioritized contributing factors list — both ready to paste into a postmortem document.
Learning tip: Always ask the AI what evidence would point AWAY from the leading hypothesis. This is the most important discipline in AI-assisted debugging. Without it, the model's tendency to be helpful will bias it toward confirming whatever you seem to believe rather than stress-testing it.
Hands-On: Full Incident Walkthrough
Work through these steps using the scenario above, substituting your own context package if you have a real incident to analyze.
Step 1: Assemble the context package
Gather your four components: error log samples, trace comparison, deploy diff, and service map. Write them into a single markdown document. Time-box this step to 10 minutes. If you cannot find all four components, note what is missing.
Step 2: Run the initial analysis prompt
I am investigating a production incident. Here is my context package:
INCIDENT SUMMARY:
- Checkout service returning 503s intermittently (8% error rate)
- Started approximately 2 hours after a deploy to the inventory service
- Three services involved: checkout (orchestrator), inventory (stock reservation), payment (charge processing)
ERROR LOG SAMPLES (checkout service):
[15:32:41] checkout ERROR TraceID=abc123 processOrder failed: upstream timeout after 3000ms calling inventory.reserveStock
[15:34:12] checkout ERROR TraceID=def456 processOrder failed: upstream timeout after 3000ms calling inventory.reserveStock
TRACE COMPARISON:
Successful: checkout=1240ms | inventory.reserveStock=340ms | payment.chargeCard=890ms
Failing: checkout=3001ms timeout | inventory.reserveStock=3001ms (no response)
DEPLOY DIFF (inventory service, 13:45 today):
- timeout: 5000
+ timeout: 2500
retries: 0
SERVICE MAP:
checkout -> inventory: HTTP, timeout 3000ms, retries 0
checkout -> payment: HTTP, timeout 5000ms, retries 2
inventory -> warehouse-db: PostgreSQL, timeout 4000ms, pool 10
Give me the top 3 hypotheses ranked by likelihood. For each: cite the evidence, describe what would confirm it, and describe what would rule it out. Do not recommend fixes yet.
Expected output: Hypothesis 1 should be the timeout mismatch. Hypotheses 2 and 3 might include warehouse-db slowness (plausible given pool size 10) and an application-level regression in inventory's query logic.
Step 3: Investigate the leading hypothesis
Take the AI's top hypothesis and run:
I want to confirm Hypothesis 1 (inventory timeout 2500ms < checkout call timeout 3000ms causing silent request termination). Give me:
1. The exact log pattern I should search for in inventory logs to confirm it is dropping requests at the 2500ms mark
2. The exact log pattern I should search for in checkout logs to confirm 503s have exactly 3000ms duration
3. A one-line shell command to grep for these patterns given that logs are JSON and stored in /var/log/services/
4. A safe production test that confirms the hypothesis without impacting real customers
Expected output: Two grep patterns (one per service), a jq-based shell command for JSON logs, and a synthetic test approach using a dedicated test account or internal flag.
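For the synthetic test, one low-risk pattern is to send a handful of paced requests for a known slow SKU through a dedicated test account and watch the status code and latency. The endpoint, header, and SKU below are hypothetical placeholders; substitute whatever test-traffic mechanism your checkout service already supports.

```python
# Sketch of a synthetic production probe; endpoint, header, and SKU are hypothetical.
import time
import requests  # assumes the requests library is available

CHECKOUT_URL = "https://checkout.internal.example.com/orders"  # hypothetical endpoint
HEADERS = {"X-Test-Account": "incident-probe"}                 # hypothetical test flag
COMPLEX_SKU = "SKU-MULTI-WAREHOUSE-001"                        # hypothetical slow SKU

for attempt in range(5):
    start = time.monotonic()
    resp = requests.post(CHECKOUT_URL, json={"sku": COMPLEX_SKU, "qty": 1},
                         headers=HEADERS, timeout=10)
    elapsed_ms = (time.monotonic() - start) * 1000
    # Under the timeout-mismatch hypothesis, failures should be 503s that take
    # roughly 3000ms, the checkout client timeout.
    print(f"attempt {attempt}: HTTP {resp.status_code} in {elapsed_ms:.0f}ms")
    time.sleep(2)  # pace the probes so they add no meaningful load
```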
Step 4: Rule out the alternatives
Assume Hypothesis 1 is confirmed. Help me rule out Hypotheses 2 and 3:
Hypothesis 2: warehouse-db is slow (connection pool size 10 may be causing queue buildup under load)
Hypothesis 3: application regression in inventory's stock aggregation query logic
For each, what single metric or log pattern would definitively rule it out using only data I can pull without a code change or deploy?
Expected output: For Hypothesis 2, check warehouse-db query duration percentiles in inventory logs: if P99 (including time spent waiting for a connection from the pool) is well below 2500ms, the pool is not the constraint. For Hypothesis 3, compare complex stock aggregation queries against simple SKU lookups and against the pre-deploy baseline: if only complex queries fail and their latency profile is unchanged (still ~2800ms P99), the query logic has not regressed and the timeout change alone explains the errors.
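If you want to run those rule-out checks yourself, a sketch like the following works once you have the relevant durations extracted (for example with the log scan shown earlier). The sample values here are placeholders, not real data.

```python
# Sketch of the two rule-out checks. The lists are placeholder values; in practice
# they come from inventory logs or your metrics system.
from statistics import quantiles

def p99(samples):
    # quantiles(n=100) returns the 1st..99th percentile cut points; index 98 is P99.
    return quantiles(samples, n=100)[98] if len(samples) >= 2 else None

db_query_ms = [120, 140, 180, 210, 240, 400, 95, 130]       # warehouse-db call durations
complex_query_ms = [2550, 2610, 2700, 2780, 2820, 2640]     # post-deploy aggregation queries
baseline_complex_ms = [2500, 2590, 2650, 2710, 2760, 2810]  # same metric, pre-deploy

# Hypothesis 2: if warehouse-db P99 sits well below 2500ms, the pool is not the bottleneck.
print("warehouse-db P99:", p99(db_query_ms))
# Hypothesis 3: if post-deploy latency matches the pre-deploy baseline (~2800ms P99),
# the aggregation logic has not regressed; the timeout change alone explains the 503s.
print("complex query P99, post-deploy:", p99(complex_query_ms))
print("complex query P99, pre-deploy: ", p99(baseline_complex_ms))
```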
Step 5: Apply the mitigation
Root cause confirmed: inventory service timeout set to 2500ms; checkout calls inventory with timeout 3000ms and zero retries. When inventory handles complex stock queries (~2800ms P99), it self-terminates before checkout's timeout, causing checkout to wait the full 3000ms and return 503 with no recovery.
Generate the mitigation steps in priority order:
1. Immediate: what can be changed without a code deploy to stop the bleeding in the next 10 minutes?
2. Short-term: what code change closes the root cause and what should the PR description say?
3. Long-term: what architectural changes prevent this class of problem in the future?
For each step, include the specific config value or code change required.
Expected output: Immediate = roll back inventory timeout to 5000ms via config; Short-term = add retries: 2 to checkout's inventory client with exponential backoff; Long-term = implement contract testing between services to validate timeout compatibility on every deploy.
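To give the short-term change some shape, here is a minimal retry-with-backoff wrapper for the checkout-to-inventory call. It is a sketch, not checkout's real client code: the URL is hypothetical, and it assumes reserveStock is idempotent (or deduplicated server-side) so a retried reservation cannot double-book stock.

```python
# Sketch of the short-term fix: retry the inventory call with exponential backoff.
# The URL is hypothetical; mirror whatever HTTP client checkout actually uses, and
# confirm reserveStock is safe to retry before shipping this.
import time
import requests

INVENTORY_URL = "http://inventory.internal:8080/reserveStock"  # hypothetical
CALL_TIMEOUT_S = 3.0   # existing checkout -> inventory timeout
MAX_RETRIES = 2        # the recommended retries: 2

def reserve_stock(payload: dict) -> requests.Response:
    for attempt in range(MAX_RETRIES + 1):
        try:
            return requests.post(INVENTORY_URL, json=payload, timeout=CALL_TIMEOUT_S)
        except requests.Timeout:
            if attempt == MAX_RETRIES:
                raise  # out of retries; the caller surfaces the 503
            time.sleep(0.2 * (2 ** attempt))  # exponential backoff: 200ms, then 400ms
    raise AssertionError("unreachable")
```

Keep the total retry budget (up to three 3-second attempts plus backoff here) inside whatever deadline sits above checkout, or the retries simply move the timeout one level upstream.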
Step 6: Write the postmortem
Write a production postmortem for this incident using the following confirmed facts:
- Impact: 8% of checkout requests returned 503 for approximately 2 hours
- Timeline: Deploy at 13:45, alerts fired at 15:32, root cause identified at 16:10, mitigation applied at 16:18, full recovery at 16:20
- Root cause: Inventory service timeout reduced from 5000ms to 2500ms without validating against P99 latency of complex stock queries (~2800ms). No retry configured on checkout -> inventory calls, so every timeout became a user-facing error.
- Contributing factors: (1) No contract test validating timeout compatibility between services. (2) Timeout change not flagged as a high-risk configuration change in the deploy checklist. (3) P99 latency not monitored per-query-complexity in inventory.
Format: Title, Impact Summary, Timeline, Root Cause, Contributing Factors, Action Items (with owner and due date fields).
Expected output: A complete postmortem document ready for stakeholder review, with action items that are specific, assignable, and time-bounded.
Step 7: Generate prevention action items
Based on the postmortem above, generate a prevention checklist for the team. For each action item:
1. Write it as a specific, testable engineering task (not a vague aspiration)
2. Assign it to one of: Platform team, Service team, or Both
3. Estimate effort as S/M/L
4. Specify the definition of done
Focus on changes that would have caught this issue before it reached production.
Expected output: Five to eight action items including things like: "Add contract test that validates inventory service timeout > checkout client timeout — Service team — S — Done when test runs in checkout's CI pipeline and fails if the mismatch occurs."
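To make that first action item concrete, here is a sketch of what such a contract test could look like, assuming both services keep their timeout settings in YAML config files in the repo. The paths and keys are hypothetical and should be adapted to your actual config layout.

```python
# Sketch of a contract test for timeout compatibility, runnable under pytest.
# Config file paths and keys are hypothetical placeholders.
import yaml  # PyYAML, assumed available in CI

def load_timeout_ms(path, *keys):
    with open(path) as f:
        node = yaml.safe_load(f)
    for key in keys:
        node = node[key]
    return int(node)

def test_inventory_timeout_exceeds_checkout_client_timeout():
    # Server-side request timeout configured in the inventory service.
    inventory_timeout = load_timeout_ms("inventory/config.yaml", "server", "timeout")
    # Client-side timeout checkout uses when calling inventory.
    checkout_timeout = load_timeout_ms("checkout/config.yaml", "clients", "inventory", "timeout")
    # The downstream timeout must exceed the caller's timeout; otherwise inventory
    # can silently abort requests checkout is still waiting on (this incident).
    assert inventory_timeout > checkout_timeout, (
        f"inventory timeout {inventory_timeout}ms must exceed "
        f"checkout's client timeout of {checkout_timeout}ms"
    )
```

Run in checkout's CI pipeline, this test fails the build on exactly the mismatch that caused the incident, which satisfies the definition of done in the example action item above.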
Step 8: Extract the reusable pattern
Based on this incident, write a one-page runbook titled "Diagnosing Timeout Mismatch Incidents in Service Meshes." Include: trigger conditions, initial investigation steps, key log patterns to search, common mitigations, and a postmortem template link. Write it so that an on-call engineer who was not involved in this incident can use it at 3 AM.
Expected output: A runbook in Markdown format that your team can add to your incident response wiki — turning this single incident into permanent institutional knowledge.
Learning tip: The last step — extracting the reusable pattern — is the most important habit to build. Every incident you close with AI assistance should produce at least one runbook, one checklist item, or one monitoring rule that makes the next similar incident faster to resolve. The AI is your documentation co-author, not just your debugger.
Key Takeaways
- Structured context packages (logs, traces, deploy diff, service map) are the single biggest lever on AI-assisted RCA quality. Invest 10 minutes assembling good context before starting the AI session.
- Ask the AI what evidence would rule out the leading hypothesis, not just what would confirm it. This is the discipline that separates fast, accurate RCA from AI-amplified confirmation bias.
- A complete incident session produces five artifacts: confirmed root cause, mitigation steps, postmortem, prevention action items, and a reusable runbook. Do not stop at the fix.
- The timeout mismatch was the proximate cause, the missing retry turned every timeout into a user-facing error, and the missing contract test was the systemic cause. AI can help you see all of these levels if you ask explicitly about contributing factors.
- Every incident resolved with AI assistance is an opportunity to build institutional knowledge. Prompt the AI to convert findings into runbooks and checklists immediately, while the context is still in the session.