Investigate & Document a Complex Production Bug

Every preceding topic in this module has given you a piece of the investigation toolkit: how to scaffold a bug investigation prompt, how to structure evidence, how to write AI-enhanced bug reports, how to analyze flaky tests, how to do log analysis. This topic puts it all together into a single, end-to-end walkthrough.

The scenario: a production incident has been reported. The initial report is incomplete and ambiguous. You have access to logs, a distributed trace, and some source code. Your task is to take this bug from "something went wrong in checkout" to a complete, developer-ready bug report with a confirmed root cause — using AI as your systematic analysis partner throughout.

This topic is written as a step-by-step investigation walkthrough. Each section corresponds to a phase of the investigation, with the exact prompts used at each step. You can follow this workflow against your own incidents by substituting your system's context and evidence.

How to start a bug investigation from an incomplete or ambiguous report?

The report that lands in your queue looks like this:

"Customer says checkout isn't working. They couldn't complete a purchase. Happened around 2 PM today. Order ID might be ORD-9412. Urgency: high."

This is approximately the worst kind of bug report: no error message, no environment, no reproduction steps, one data point, and a vague timestamp. It is also the kind of report that is most common in the wild.

Your first job is to convert this into an investigation brief — a structured document that captures what you know, what you don't know, and what you need to find out. AI helps you do this in two ways: it helps you identify the right questions to ask, and it helps you query the right systems once you know what to look for.

Step 1: Generate an investigation brief from a sparse report

Prompt:

I've received this bug report:

"Customer says checkout isn't working. They couldn't complete a purchase. Happened around 2 PM today. Order ID might be ORD-9412. Urgency: high."

This is the complete report. I need to turn this into a structured investigation brief.

My system context:
- Application: an e-commerce platform
- Checkout flow involves: frontend (React), API gateway, order-service, payment-service, inventory-service
- Order IDs are formatted as ORD-NNNN and are created in the order-service

Help me create an investigation brief that includes:
1. What we know (confirmed facts from the report)
2. What we don't know (critical missing information)
3. What we need to find (the specific data points to collect before analysis)
4. Where to look (which systems, logs, or teams to query for each missing data point)
5. Initial hypothesis categories (which layers of the system could produce "checkout isn't working" — frontend, API, payment, inventory, other)

Format this as an investigation brief I can share with my team.

The AI will produce an investigation brief that immediately makes the investigation structured and team-legible. Critically, it will remind you of missing information you might not have noticed was missing — for example, the report doesn't say whether the error was visible to the user (a UI error message) or silent (the button appeared to work but no order was placed).

Step 2: Gather initial evidence guided by the brief

From the investigation brief, you know you need to:
- Look up Order ORD-9412 in the order-service database or admin panel
- Check the API gateway access logs for 14:00–14:30 UTC for the customer's user ID or session
- Check the payment-service error logs for the same window
- Check the frontend error monitoring for the same session

You collect what you can in the first 15 minutes. You find:
- Order ORD-9412 does not exist in the database — the order was never created
- The API gateway access log shows a POST /api/v1/orders request at 14:17:22 UTC that returned HTTP 500
- The payment-service has no record of this transaction
- The frontend error monitoring shows a JavaScript console error: Uncaught TypeError: Cannot read properties of undefined (reading 'orderId')

Step 3: Refine the investigation brief with initial evidence

Prompt:

I've gathered initial evidence for the checkout investigation. Here is what I found:

1. Order ORD-9412 does NOT exist in the database — the order was never created
2. API gateway log shows: POST /api/v1/orders returned HTTP 500 at 14:17:22 UTC
3. Payment service has no record — consistent with order never being created
4. Frontend error: "Uncaught TypeError: Cannot read properties of undefined (reading 'orderId')"

Based on this initial evidence:
1. Refine the investigation brief — which hypotheses from the original list are now more or less likely?
2. What is the most probable failure location? (Frontend, API layer, order-service, or earlier in the call chain?)
3. The HTTP 500 from the API gateway is our strongest signal — what are the most likely causes of a 500 on POST /api/v1/orders?
4. The frontend TypeError appears after the 500 — is it likely a symptom or a cause?
5. What should I look at next? Prioritize by diagnostic value.

At this point, the AI will likely suggest that the 500 from the API gateway is the root event, the frontend TypeError is a downstream symptom of the API error response, and that the order-service logs for the specific request are the highest-priority next evidence item.

Learning Tip: In every investigation, distinguish between "origin faults" and "cascading symptoms" as early as possible. In this scenario, the frontend TypeError looks alarming but is probably a symptom of the API 500 — the frontend code tried to access response.data.orderId on a 500 error response that doesn't have that field. Investigating the TypeError first would lead you down the wrong path. AI is good at flagging this distinction when you ask for it explicitly.

How to build your investigation context iteratively with AI?

You now have a partial evidence set — enough to form strong hypotheses, but not yet enough for root cause confirmation. The goal of this phase is to build context incrementally: each AI analysis session tells you what evidence to collect next, you collect it, and you feed it back. The investigation is a loop, not a linear sequence.

Step 4: Pull the order-service logs and present them

You query the order-service logs for the 14:16:00–14:18:30 UTC window, filter for the request path /api/v1/orders, and find the following (trimmed):

14:17:21.345 INFO  OrderService - Incoming request: POST /api/v1/orders, user_id=U-4421
14:17:21.348 INFO  OrderService - Order validation started
14:17:21.352 INFO  OrderService - Validating cart: cart_id=CART-9921, items=3
14:17:21.390 INFO  InventoryService - Inventory check: item_id=SKU-441 quantity=2 → available
14:17:21.412 INFO  InventoryService - Inventory check: item_id=SKU-882 quantity=1 → available
14:17:21.415 INFO  InventoryService - Inventory check: item_id=SKU-118 quantity=1 → available  # <-- note: 3rd item is a digital product
14:17:21.416 INFO  OrderService - Inventory validation passed
14:17:21.417 INFO  OrderService - Calling PaymentService to create payment intent
14:17:21.892 ERROR OrderService - PaymentService call failed: HTTP 422 Unprocessable Entity
14:17:21.893 ERROR OrderService - Response body: {"error":"INVALID_CURRENCY","message":"Currency 'USD' not supported for digital products. Use 'DIGITAL_CREDIT' instead."}
14:17:21.894 ERROR OrderService - Order creation failed, rolling back
14:17:21.895 INFO  OrderService - Cart CART-9921 restored to active state
14:17:21.895 ERROR OrderService - Returning 500 to API gateway

Prompt:

I now have the order-service logs. Here is what they show (trimmed):

[paste logs above]

This is new, critical information. Analyze:
1. What is the actual root cause of the order failure?
2. Is this a bug in order-service, payment-service, or a configuration/data issue?
3. The payment service returned HTTP 422 with error INVALID_CURRENCY for digital products. What are the plausible causes of this?
   a. Is this a new validation rule in payment-service that order-service doesn't know about?
   b. Is this an order-service bug where it's sending the wrong currency for digital products?
   c. Is this a data issue with the product catalog (SKU-118 might be miscategorized)?
   d. Is this a recent code change that introduced or removed digital product currency handling?
4. Update our hypothesis ranking. What is the most likely root cause now?
5. What do I need to look at next to confirm?

The AI will likely identify that we have the root cause for why the order failed (payment-service rejects USD for digital products), but we still need to determine whether this is a new rule, a code regression, or a data problem.

Step 5: Investigate the payment service contract

Prompt:

Based on the payment-service error INVALID_CURRENCY for digital products, I need to understand whether this is expected behavior or a regression.

Please help me form specific investigation questions for:
1. The payment-service team: what would I ask them to determine if this validation is new, always-existed, or was recently changed?
2. The payment-service API contract: what should I look for in the API docs or OpenAPI spec to understand expected currency handling for digital vs physical products?
3. The order-service code: what specifically should I look for in the order-service code to find where currency is set for a payment intent?
4. The product catalog: what database query would help me verify whether SKU-118 is correctly classified as a digital product?
5. Recent deployments: given that this appeared today, what deployment timeline question would help me narrow whether this is a regression?

You then conduct those investigations. You find: (a) the payment-service added a new validation 3 days ago requiring DIGITAL_CREDIT currency for digital products; (b) the order-service code does not have any special currency handling for digital products — it always sends USD; (c) SKU-118 is correctly marked as type=digital in the product catalog.

The root cause is confirmed: order-service sends USD for all products regardless of type, and payment-service now requires DIGITAL_CREDIT for digital products. The order-service was not updated to handle the new payment-service validation.

Learning Tip: When you find a root cause that's an integration contract mismatch (Service A changed its contract, Service B wasn't updated), immediately ask: "How many other service pairs could have the same problem?" Contract mismatches due to uncoordinated deployments are systemic — fixing one instance without addressing the communication/coordination gap means the same class of bug will recur.

How to run a full AI-assisted root cause analysis step by step?

With the full evidence set in hand, you now run a formal root cause analysis. The goal is to produce a root cause statement that is precise, falsifiable, and sufficient to guide the fix — without speculation.

Step 6: Formal root cause analysis

Prompt:

I've completed my evidence gathering. Here is the full context for this production bug.

## Summary of Findings
- Symptom: Customer checkout fails for cart containing digital product SKU-118
- User impact: No order created, customer sees generic error
- Evidence:
  1. API gateway: POST /api/v1/orders returns HTTP 500 at 14:17:21 UTC
  2. Order-service logs: PaymentService returned HTTP 422 INVALID_CURRENCY for digital product
  3. Payment-service error detail: "Currency 'USD' not supported for digital products. Use 'DIGITAL_CREDIT' instead."
  4. Payment-service change log: New digital product currency validation added 3 days ago (deployment v2.41.0)
  5. Order-service code review: Order-service always sends currency='USD' in payment intent creation — no special handling for digital products
  6. Product catalog: SKU-118 is correctly typed as type=digital

## Root Cause Analysis Request
Based on this evidence, write a formal root cause analysis:

1. **Primary Root Cause**: The specific code/system/configuration failure
2. **Contributing Factor**: What organizational or process condition allowed this to occur
3. **Why It Didn't Surface Earlier**: What detection gap allowed this to reach production
4. **Causal Chain**: The sequence from cause to user impact (use the format: [cause] → [intermediate effect] → [user impact])
5. **Confirmation**: What specific evidence confirms this is the root cause (and rules out alternatives)
6. **Fix Required**: The minimal change that would prevent this specific failure
7. **Systemic Fix**: The process or architectural change that would prevent this class of failure

Be precise. Reference specific evidence items by number. Do not include speculation.

Step 7: Assess scope and impact

Prompt:

Given the root cause — order-service sends currency='USD' for all products, but payment-service v2.41.0 now requires 'DIGITAL_CREDIT' for digital products — assess the scope and impact:

1. Scope: is this failure limited to the specific scenario (digital product + USD checkout) or could it affect other scenarios?
2. Affected users: estimate the percentage of checkout attempts that include at least one digital product (assume a product catalog with 15% digital products and average cart size of 2.3 items)
3. Duration: the payment-service v2.41.0 was deployed 3 days ago — have all digital product checkouts failed for 3 days?
4. Revenue impact: if affected, every checkout with a digital product has failed silently — how should I frame this for severity assessment?
5. Immediate mitigation options: are there server-side workarounds available without a code deployment? (Consider: feature flags, payment-service rollback, order-service emergency patch)
6. Detection gap: why wasn't this caught by automated tests? What test should have caught this integration failure?

Learning Tip: Scope assessment is as important as root cause identification for production bugs. "Single user affected" versus "all users with digital products affected for 3 days" drives completely different response priorities. Always ask the AI to assess scope and duration before you write the bug report — the impact section determines how the team prioritizes the fix.

How to produce a complete, developer-ready bug report using AI?

You now have everything needed to write the definitive bug report: a confirmed root cause, full evidence trail, scope assessment, and a proposed fix. This is the synthesis step.

Step 8: Generate the developer-grade bug report

Prompt:

Using the complete investigation findings below, write a developer-ready bug report.

## Complete Investigation Findings

**Root Cause**: Order-service does not differentiate currency by product type when creating payment intents. It always sends currency='USD'. Payment-service v2.41.0 (deployed 2024-03-12) introduced a new validation requiring currency='DIGITAL_CREDIT' for digital products. The two services were not coordinated on this change, causing all checkout attempts containing digital products to fail with HTTP 500 since v2.41.0 was deployed.

**Scope**: All customers attempting to checkout with any cart containing at least one digital product have experienced checkout failure for the past 3 days. Based on product catalog composition, this affects an estimated 22–28% of all checkout attempts.

**Evidence**: [summarize the 6 evidence items from Step 6]

## Bug Report Requirements
- Format: Jira markdown
- Audience: Order-service backend developer
- Severity: Critical (major feature failure, wide user scope, revenue impact)
- Include all sections: Title, Summary, Environment, Reproduction Steps, Expected Behavior, Actual Behavior, Root Cause Analysis, Code Pointer, Impact, Evidence, Suggested Fix
- Title should be component-specific and describe the behavior, not just "checkout broken"
- Reproduction steps must be followable by a developer who doesn't have the customer's account
- Code pointer: reference the order-service payment intent creation code
- Suggested fix: minimal code change in order-service to send correct currency for digital products

Step 9: Generate the stakeholder summary

Prompt:

Write a separate stakeholder summary for the product manager and customer success team. This is NOT the technical bug report — it's a plain-language impact summary.

Requirements:
- Maximum 250 words
- No technical jargon
- Lead with user impact: what couldn't customers do?
- State scope clearly: approximately what percentage of customers were affected
- State duration: how long has this been occurring
- State current status: is it fixed, being worked on, or in investigation?
- State ETA: if known, when will it be resolved?
- Avoid blame — describe what happened, not who caused it
- End with: what we're doing to prevent this class of issue in the future

Current status: Root cause confirmed, fix being developed, ETA [4 hours / next deployment window].

Step 10: Create the post-fix verification checklist

A bug report isn't complete until there's a way to verify the fix. Ask AI to generate the verification checklist:

Prompt:

The proposed fix for this bug is: modify order-service to check product type and send currency='DIGITAL_CREDIT' instead of 'USD' when the cart contains digital products.

Generate a QA verification checklist for this fix. Include:
1. Positive test cases: scenarios that should now succeed (were failing before the fix)
2. Regression test cases: scenarios that worked before and should still work after (physical product checkout, mixed cart, edge cases)
3. Integration test cases: verify the correct currency value is being sent to payment-service in the payment intent
4. Edge cases: what unusual scenarios should be specifically tested? (empty cart, all-digital cart, mixed cart, free digital product with $0 price, digital product with coupon applied, digital subscription product)
5. Environment: should this be verified in staging before prod deployment, and what data setup is needed in staging?
6. Rollback criteria: if the fix is deployed but causes new failures, what would you observe that would trigger a rollback decision?

Format as a numbered checklist grouped by category.

The complete investigation output

After this 10-step workflow, you've produced:

A structured investigation brief (from the sparse initial report)
An iterative evidence collection log (showing what you gathered and when)
A formal root cause analysis document
A scope and impact assessment
A developer-grade bug report in Jira format
A stakeholder summary for non-technical audiences
A QA verification checklist for the fix

The entire workflow, with AI assistance, takes 30–60 minutes for a complex production bug. Without AI, the same investigation takes 2–4 hours, and the documentation is typically less complete.

Prompt (final synthesis):

I've completed an AI-assisted investigation of a production bug. Here is a summary of what we produced:

[paste summary of findings, root cause, reports generated]

Before I close this investigation session:
1. Is there any step in a complete bug investigation that I haven't covered?
2. Is there any evidence I collected that I didn't use in the analysis — and should I use it?
3. Is there any claim in the bug report that isn't fully supported by the evidence I provided?
4. What follow-up investigations or monitoring should I set up to confirm the fix is effective?
5. What would a good incident retrospective include for this bug? List the retro action items.

This final synthesis prompt is the AI equivalent of a peer review — it catches gaps before the report is filed and ensures the investigation is truly complete.

Learning Tip: Build the habit of running the "before I close this session" prompt at the end of every investigation. Even experienced QA engineers regularly discover that they've made an unsupported assumption in the root cause section, or that they collected evidence they forgot to use. The prompt takes 30 seconds to run and consistently improves final report quality.

This hands-on investigation walkthrough is a reusable template. The system context changes with every bug; the investigation structure stays the same. Master this structure, adapt the prompts to your stack, and you'll have a consistent, repeatable process that produces high-quality bug reports regardless of the complexity of the incident — and that makes you one of the most effective investigators on your team.