AI-Assisted Root Cause Analysis

Production incidents are expensive precisely when root cause analysis is slow — AI compresses the investigation timeline without replacing the judgment that interprets the findings.

Structured Incident Investigation Frameworks

Root cause analysis frameworks like the 5 Whys and the fishbone (Ishikawa) diagram exist because human intuition is a poor guide during the adrenaline-fueled chaos of an active incident. The 5 Whys technique — repeatedly asking "why did this happen?" until you reach a systemic cause rather than a proximate one — is powerful but difficult to execute rigorously under pressure. Fishbone diagrams force you to consider causes across multiple dimensions (people, process, technology, environment) before locking onto a narrative.

AI can act as a structured facilitator for both techniques. Unlike a postmortem meeting where participants may defend their own work or anchor too quickly on the most obvious cause, AI applies the frameworks consistently and without ego. The critical skill is learning to feed it the right inputs and challenge its outputs — because AI will produce a coherent-sounding root cause analysis even when the inputs are incomplete, and a coherent-sounding wrong answer can be more dangerous than an obvious gap.

For 5 Whys with AI, the prompt pattern is to provide the incident statement at the top, then ask AI to walk through each "why" and show its reasoning at each level. You can interrupt the chain at any "why" where you have evidence that contradicts the AI's inference, and the conversation naturally documents your actual investigation chain rather than a post-hoc rationalized one.

For fishbone diagrams, AI is particularly useful for generating the branches you might not think of. A backend engineer investigating a latency incident will naturally focus on code and database causes; AI will also surface environment causes (DNS resolution time, TLS handshake overhead), process causes (recent deploy without performance baseline), and measurement causes (is the latency real or is the monitoring instrumentation itself slow?).

Learning tip: Run the 5 Whys exercise with AI before your postmortem meeting, not after. It surfaces the questions your postmortem needs to answer rather than just documenting the narrative you've already agreed on.
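
One way to keep the chain honest after the chat ends is to record each "why" with its evidence status as data rather than prose. A minimal sketch in Python (the structure and status labels are my own convention, not a standard format):

from dataclasses import dataclass

@dataclass
class Why:
    question: str
    proposed_cause: str   # what AI (or you) currently believes
    evidence: str         # what supports or contradicts the belief
    status: str           # "verified" | "contradicted" | "needs-verification"

chain = [
    Why("Why did checkout latency spike?",
        "Payment requests were timing out",
        "p99 latency rose ~20x in service metrics",
        "verified"),
    Why("Why were payment requests timing out?",
        "The upstream payment provider slowed down",
        "provider status page shows a matching window",
        "needs-verification"),
]

# Anything not verified becomes an agenda item for the postmortem meeting.
open_questions = [w.question for w in chain if w.status != "verified"]
print(open_questions)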

Feeding Incident Data to AI

The quality of AI-assisted RCA is entirely determined by the quality of the incident data you provide. AI cannot access your live systems, your monitoring dashboards, or your deployment pipeline — everything it knows about your incident must come from what you paste into the prompt. This constraint forces a discipline that is valuable in itself: it requires you to assemble a coherent, time-ordered picture of the incident before AI can help you interpret it.

The most useful incident data types to include, in rough order of diagnostic value:

Timeline of events: A chronological list of what happened — when the alert fired, when engineers were paged, when the first mitigation was attempted, when the incident was resolved. Even a rough timeline with 5-minute granularity is more useful than paragraphs of prose description.

Error logs: Paste representative samples, not entire log files. Three to five well-chosen error lines are more useful than five hundred lines of mixed output. Focus on the first errors that appeared, not just the most recent. Timestamps are mandatory. (A small extraction script follows this list.)

Metrics at the time of the incident: Key numbers — request error rate, p95 and p99 latency, CPU and memory utilization, database connection pool saturation, queue depth. Before-and-after numbers are more useful than just peak values.

Recent changes: Any deployment, config change, feature flag toggle, dependency update, or infrastructure change within 72 hours of the incident. This is the most frequently omitted input and one of the most diagnostic.

System topology: A brief description of which services are involved, how they communicate, and any relevant dependencies. AI cannot infer your architecture from log samples alone.
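
Pulling those first occurrences out of a large log file is easy to script. A minimal sketch, assuming plain-text logs where each line carries a timestamp and a level tag (the file name and log format are illustrative):

import re

def signature(line: str) -> str:
    """Collapse variable parts (ids, numbers) so lines with the same shape group together."""
    line = re.sub(r"0x[0-9a-fA-F]+", "<hex>", line)
    return re.sub(r"\d+", "<n>", line)

def first_occurrences(path: str, level: str = "ERROR", limit: int = 5) -> list[str]:
    """Return the first line seen for each distinct error signature, oldest first."""
    seen: dict[str, str] = {}
    with open(path) as f:
        for raw in f:
            if level in raw:
                sig = signature(raw)
                if sig not in seen:  # keep only the first occurrence of each shape
                    seen[sig] = raw.rstrip()
    return list(seen.values())[:limit]

# Paste the output directly into the context package.
for line in first_occurrences("payment-service.log"):
    print(line)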

Learning tip: Build a "context package" template in your incident response runbook and fill it in as the incident unfolds, not after. Teams that document in real time produce dramatically better postmortems.

Assembling a High-Quality Incident Context Package

The incident context package is the artifact you hand to AI (and to human investigators) to begin structured RCA. A well-assembled package transforms a chaotic collection of Slack messages and partial observations into a coherent input that AI can reason over.

A good context package has a standard structure:

INCIDENT CONTEXT PACKAGE

Incident ID: [identifier]
Severity: [P1/P2/P3]
Duration: [start time] to [end time or "ongoing"]
Affected systems: [list of services/components]
User impact: [what users experienced]

TIMELINE
[HH:MM] [Event description]
[HH:MM] [Event description]
...

METRICS SNAPSHOT
[Metric name]: [baseline value] → [peak incident value] → [current value]
[Metric name]: [baseline value] → [peak incident value] → [current value]

ERROR LOG SAMPLES
[timestamp] [service] [log line — first occurrence of the error]
[timestamp] [service] [log line — representative repeating error]

RECENT CHANGES (last 72 hours)
[date/time] [description of change] [who made it]

HYPOTHESES ALREADY INVESTIGATED
[Hypothesis] — [Eliminated because / Still open]

Filling in this template before your first AI prompt forces you to answer the two questions whose answers most often contain the root cause: "What changed recently?" and "What was the exact sequence of events?"
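
If your incident tooling lives in Python, the template can also be made checkable: a small dataclass that reports which sections are still empty before you prompt. A sketch (field names mirror the template above; none of this is a standard API):

from dataclasses import dataclass, field, fields

@dataclass
class ContextPackage:
    incident_id: str = ""
    severity: str = ""            # P1/P2/P3
    duration: str = ""            # "start to end" or "ongoing"
    affected_systems: str = ""
    user_impact: str = ""
    timeline: list[str] = field(default_factory=list)        # "HH:MM event" strings
    metrics: list[str] = field(default_factory=list)         # "name: baseline → peak → current"
    error_samples: list[str] = field(default_factory=list)
    recent_changes: list[str] = field(default_factory=list)  # last 72 hours
    hypotheses: list[str] = field(default_factory=list)      # "hypothesis — eliminated/open"

    def missing(self) -> list[str]:
        """Names of sections still empty. Check this before prompting the AI."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]

pkg = ContextPackage(incident_id="INC-1234", severity="P2")
print(pkg.missing())  # e.g. ['duration', 'affected_systems', ...]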

Learning tip: The "Recent Changes" section alone resolves 40-60% of production incidents during initial investigation. If AI's analysis doesn't immediately point to a recent change, go back and verify the completeness of that section first.
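
The code half of that section is also the easiest to automate. A sketch that lists recent commits with plain git (deploys, config changes, and feature-flag toggles live in your own tooling and still need to be added by hand; the repository path is illustrative):

import subprocess

def recent_commits(repo_path: str, since: str = "72 hours ago") -> list[str]:
    """Commits in the incident window, formatted for the Recent Changes section."""
    out = subprocess.run(
        ["git", "-C", repo_path, "log", f"--since={since}",
         "--date=iso", "--pretty=format:%ad %s [%an]"],
        capture_output=True, text=True, check=True,
    )
    return out.stdout.splitlines()

for change in recent_commits("services/payment-service"):
    print(change)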

AI's Role in Postmortem Writing

Postmortems serve two audiences: the engineers who need to understand what happened so they can prevent recurrence, and the organization that needs to understand the systemic factors that allowed the incident to occur. Good postmortems are hard to write because they require synthesizing a chaotic incident timeline into a clear narrative while remaining analytically honest about causes — not just proximate causes, but contributing factors and systemic weaknesses.

AI is exceptionally useful for the mechanical parts of postmortem writing: reconstructing a coherent timeline from Slack threads and alert histories, grouping contributing factors into categories, and drafting action items that are specific and assignable. It is less useful for — and should not be the primary author of — the section that identifies systemic organizational or process failures, which requires human judgment and organizational context.

A useful postmortem workflow with AI: provide the incident context package, ask AI to draft the timeline and contributing factors sections, review and correct those drafts, then write the systemic factors and action items sections yourself using AI's draft as a starting point rather than a final output. The final postmortem should read like it was written by a thoughtful engineer, not like it was generated from a template.

I'm writing a postmortem for the following incident. Please help me draft the timeline and contributing factors sections.

Here is the incident context package:
[paste your completed context package]

Please:
1. Reconstruct a clean, chronological timeline from the events I've described, noting any gaps where the sequence is unclear
2. List the contributing factors in two categories:
   - Immediate technical causes (what broke and why)
   - Contributing conditions (what made the system vulnerable to this failure)
3. For each contributing factor, suggest a specific, actionable follow-up item with an owner role (not a person name) and a rough effort estimate

Do not invent facts. If something is unclear from the context I've provided, mark it as "[NEEDS VERIFICATION]" rather than guessing.

Learning tip: Ask AI to mark uncertainties explicitly with "[NEEDS VERIFICATION]" rather than letting it fill gaps with plausible-sounding inferences. Postmortems with unverified assertions create incorrect organizational memory that outlasts the people who know they were uncertain.

Limitations of AI in Root Cause Analysis

AI has real and important limitations in incident investigation that every engineer using it must understand.

AI cannot access your live systems. It cannot read your current logs, query your database, or look at your monitoring dashboards. Everything it knows comes from what you provide. This is a feature in some respects — it forces information gathering discipline — but it means AI cannot discover the root cause independently; it can only reason about the evidence you've collected.

AI may hallucinate causes. Given an incomplete incident context, AI will fill the gaps with the most statistically plausible explanation given its training data. For common incident patterns (database connection pool exhaustion, memory leak in a long-running service, N+1 query under load), this is usually helpful. For unusual failure modes specific to your system's architecture, AI may confidently suggest a cause that is literally impossible given how your system is built. Always validate AI's hypotheses against your actual architecture.

AI anchors on what you give it. If your incident context package emphasizes certain symptoms, AI will produce root cause theories centered on those symptoms. Confirmation bias built into how you assembled the context package will be amplified, not corrected, by AI analysis. Deliberately include information that challenges your current hypothesis.

AI has no memory of your incident history. It cannot tell you "this looks like the same kind of failure you had three months ago" unless you explicitly provide that historical context. Your team's institutional memory is not accessible to AI without you surfacing it.

Learning tip: After AI produces its RCA, ask it to play devil's advocate: "What is the strongest argument against your proposed root cause, and what evidence would you need to see to rule it out?" This surfaces the gaps in the analysis.

Hands-On: Investigating a Payment Processing Latency Spike

This exercise walks through a complete AI-assisted RCA for a realistic production incident — a payment processing latency spike that caused a 12-minute degraded checkout experience.

Scenario: On a Monday morning at 09:47 UTC, your payment service p99 latency spiked from a baseline of 220ms to 4.2 seconds. The checkout error rate increased from 0.3% to 8.1%. The incident lasted 12 minutes before auto-resolving. No alerts fired during the incident — it was caught by a customer support ticket.

  1. Assemble the incident context package. Take the scenario details and add plausible supporting data:
I need help with root cause analysis for the following production incident. Here is my context package:

INCIDENT: Payment service latency spike
Duration: 09:47–09:59 UTC, Monday
User impact: Checkout p99 latency 220ms → 4.2s; checkout error rate 0.3% → 8.1%

TIMELINE
09:47 — Payment service p99 latency begins rising (first observed in metrics)
09:49 — Checkout error rate crosses 5%
09:52 — On-call engineer paged by customer support ticket (no automated alert)
09:55 — Engineer begins investigation
09:59 — Latency returns to baseline without intervention
10:08 — Incident declared resolved; RCA investigation begins

METRICS SNAPSHOT
Payment service p99 latency: 220ms → 4,200ms → 225ms
Checkout error rate: 0.3% → 8.1% → 0.4%
Payment service CPU: 12% → 14% (no significant change)
Payment service memory: 68% → 69% (no significant change)
PostgreSQL connection pool: 18/50 → 47/50 → 19/50 (spike to near saturation)
Stripe API p99 (from Stripe dashboard): 180ms → 3,900ms → 190ms (Stripe-side spike confirmed)

ERROR LOG SAMPLES
09:47:03 payment-service [ERROR] Stripe API timeout after 3000ms — idempotency_key=pay_8f3k2j
09:47:11 payment-service [ERROR] Stripe API timeout after 3000ms — idempotency_key=pay_9d1m4n
09:48:42 payment-service [WARN] DB connection pool at 90% capacity — waiting for available connection
09:51:17 payment-service [ERROR] DB connection pool exhausted — request queued for 2100ms

RECENT CHANGES (last 72 hours)
Friday 17:30 UTC — payment-service: bumped Stripe SDK from 10.12.0 to 10.14.0 (routine dependency update)
Friday 17:30 UTC — payment-service: reduced Stripe API timeout from 5000ms to 3000ms (performance improvement ticket)
Saturday 09:00 UTC onward — no further changes

HYPOTHESES ALREADY INVESTIGATED
Spike was internal to our service (CPU/memory) — Eliminated: resource metrics unchanged

  2. Run the 5 Whys with AI:
Using the incident context package above, walk me through a 5 Whys analysis. For each "Why", state what you believe the cause was and what evidence from the context package supports that belief. If a "Why" requires evidence I haven't provided, mark it as [NEEDS VERIFICATION] rather than assuming.

  3. Review AI's 5 Whys chain and identify which steps are well-evidenced versus which are inferred. The chain should lead from "checkout latency spiked" → "Stripe API was slow" → "our timeout was too short to survive the slowdown" → "the reduced timeout was deployed Friday" → "systemic: timeout changes were not validated against Stripe's historical p99 distribution." (A code sketch of this failure mechanism follows the exercise.)

  4. Generate the fishbone diagram input:

Using the same incident context, generate a fishbone diagram for this incident. Organize contributing causes into these branches: Technology (code/infrastructure), Process (how we work), Measurement (how we detect problems), and Dependencies (external systems). List 2-3 items under each branch where the evidence supports it.

  5. Draft action items from the findings:
Based on the 5 Whys and fishbone analysis, generate a list of action items for the postmortem. Each action item should:
- Address a specific contributing factor (not just "improve monitoring" but exactly what to monitor)
- Have an assigned owner role (e.g., "payment team engineer", "SRE on-call")
- Have a rough effort estimate (hours / days / weeks)
- Be marked as either "prevents recurrence" or "improves detection speed"

  6. Validate the output by checking each action item against the actual evidence in the context package. Remove or flag any action item that AI generated without clear evidential support.

  7. Write the postmortem summary using AI's structured output as a first draft, adding your team's specific context, correcting any technical inaccuracies, and making the systemic observations that require organizational knowledge AI doesn't have.

Expected result: A complete, evidence-based postmortem draft in 30-45 minutes rather than 2-3 hours, with clear action items that directly address the actual causes rather than generic reliability improvements.
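
The pool-exhaustion step in the 5 Whys chain has a simple mechanical explanation worth seeing in code. A sketch of the suspected anti-pattern, assuming the handler checks out a database connection and holds it across the Stripe call (the function and object names are illustrative, not actual service code):

# Suspected anti-pattern: the DB connection stays checked out for the full
# duration of the external call, so Stripe's latency becomes pool pressure.
def charge_holding_connection(pool, stripe, order_id):
    with pool.connection() as conn:              # connection checked out here...
        order = conn.fetch_order(order_id)
        stripe.charge(order, timeout=3.0)        # ...and held up to 3s while Stripe is slow
        conn.mark_paid(order_id)

# Rough arithmetic from the metrics snapshot: at a ~200ms hold time the pool
# sat at 18/50 busy; a ~3,000ms hold time is a ~15x increase, far more than
# the 50-connection cap can absorb, which matches the observed 47/50 spike.

# Safer shape: never hold a connection across the slow external call.
def charge_with_short_holds(pool, stripe, order_id):
    with pool.connection() as conn:
        order = conn.fetch_order(order_id)       # first hold: read only
    stripe.charge(order, timeout=3.0)            # no connection held here
    with pool.connection() as conn:
        conn.mark_paid(order_id)                 # second hold: write only

Whether the real handler actually does this is exactly the kind of [NEEDS VERIFICATION] item step 3 should flag.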

Key Takeaways

  • AI can facilitate structured RCA frameworks (5 Whys, fishbone) faster and more consistently than a group under incident pressure, but requires high-quality, time-ordered input to produce reliable output.
  • The incident context package — timeline, metrics, error log samples, recent changes — is the fundamental unit of AI-assisted RCA; a complete "Recent Changes" section alone points to the cause in roughly half of all incidents.
  • AI's role in postmortem writing is to handle mechanical reconstruction (timeline, contributing factors, action items), not to identify systemic organizational failures, which require human judgment.
  • AI cannot access live systems and will hallucinate causes when context is incomplete — always validate its hypotheses against your actual architecture and ask it to explicitly mark uncertainties.
  • Deliberately including information that challenges your current hypothesis in the context package counteracts the confirmation bias amplification that AI analysis can produce.