Root cause analysis for flaky tests

Flaky tests are the broken windows of a test suite. A test that sometimes passes and sometimes fails — without any code change — erodes trust in the entire test suite. Once engineers start seeing "oh, that's just a flaky test" as a normal response to a red build, they stop investigating failures seriously. Coverage degrades. Bugs ship. The flaky test problem is not a technical annoyance; it is an organizational risk.

Diagnosing flaky tests has historically been expensive. A test that fails 10% of the time requires multiple observation cycles to gather enough data for pattern analysis. The failure condition is often race-dependent, timing-dependent, or state-dependent in ways that don't surface on a single inspection. QA engineers often quarantine flaky tests — tag them and suppress their failures — rather than investigating root cause, because investigation takes longer than the team can afford.

AI changes this calculus. With the right evidence and prompting strategy, you can compress flaky test diagnosis from days of observation to hours, identify root causes that would have taken days of manual log correlation to find, and make fix-or-quarantine decisions based on evidence rather than intuition.


What are the most common causes of flaky tests and how does AI detect them?

Before you can prompt AI to find the root cause of a specific flaky test, you need to understand the taxonomy of flaky test causes. Different causes leave different evidence signatures, and your evidence-gathering strategy depends on which category you're investigating.

The flaky test cause taxonomy

1. Timing and concurrency issues

The test assumes that an asynchronous operation completes within a fixed time window. When system load is high, the window is exceeded. Evidence signature: test passes consistently in isolation, fails under parallel test execution; failure timing clusters around the same step; logs show assertions executing before the awaited operation completes.

AI detection pattern: Compare pass and fail run timestamps — clustering at high-load periods is diagnostic. Compare parallel versus serial execution results.
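To make the mechanism concrete, here is a minimal sketch of the pattern in Python. The class, names, and timings are hypothetical, but the shape is typical: a fixed wait racing a background operation whose latency varies with load.

```python
import random
import threading
import time


class OrderProcessor:
    """Hypothetical system under test: completes its work on a background thread."""

    def __init__(self):
        self.status = "pending"

    def process_async(self):
        def work():
            time.sleep(random.uniform(0.01, 0.1))  # latency varies with load
            self.status = "complete"

        threading.Thread(target=work).start()


def test_order_completes():
    processor = OrderProcessor()
    processor.process_async()
    time.sleep(0.05)                       # fixed window: usually long enough, sometimes not
    assert processor.status == "complete"  # races the worker -> intermittent failures
```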

2. Shared state contamination

A test passes or fails depending on the order in which tests run, because a previous test leaves global state, database records, or mock configurations that the current test's setup doesn't clean up. Evidence signature: test fails only when run in a full suite but passes in isolation; failure rate correlates with specific predecessor tests.

AI detection pattern: Analyze test execution order in failing vs. passing runs. Look for database state or mock assertion residue from preceding tests.
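A minimal pytest sketch of the pattern, with hypothetical names: a module-level cache leaks between tests, and an autouse fixture is one way to guarantee cleanup.

```python
import pytest

_user_cache = {}  # hypothetical module-level state shared across tests


def get_user(user_id):
    # Simplified lookup that memoizes results in module-level state.
    return _user_cache.setdefault(user_id, {"id": user_id, "active": True})


def test_deactivating_a_user():
    get_user(42)["active"] = False  # mutates shared state and never restores it
    assert get_user(42)["active"] is False


def test_new_user_defaults_to_active():
    # Without the fixture below, this passes in isolation but fails whenever
    # the test above happens to run first in the same process.
    assert get_user(42)["active"] is True


@pytest.fixture(autouse=True)
def reset_user_cache():
    # One possible fix: guarantee a clean slate around every test.
    _user_cache.clear()
    yield
    _user_cache.clear()
```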

3. External dependency instability

The test depends on an external service (a sandbox API, a third-party service, a real database) that has its own availability or response variability. Evidence signature: failure messages are network errors or timeout errors; failure rate correlates with external service SLA incidents; the test does not own the setup and teardown of its dependencies.

AI detection pattern: Correlate test failure timestamps with external service monitoring data. Look for timeout patterns in logs.
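If you have incident windows from the service's status page or monitoring tool, the overlap check itself is trivial to script before you even prompt. A sketch with illustrative timestamps:

```python
from datetime import datetime

# Illustrative data only: failure timestamps from CI runs and incident windows
# pulled from the third-party service's status page.
failures = [datetime(2024, 3, 15, 2, 14), datetime(2024, 3, 18, 9, 22)]
incidents = [
    (datetime(2024, 3, 15, 2, 0), datetime(2024, 3, 15, 3, 0)),
    (datetime(2024, 3, 20, 12, 0), datetime(2024, 3, 20, 13, 0)),
]

overlapping = [
    ts for ts in failures
    if any(start <= ts <= end for start, end in incidents)
]
print(f"{len(overlapping)} of {len(failures)} failures fall inside incident windows")
```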

4. Environment and resource dependencies

The test behaves differently depending on the execution environment: available memory, available disk space, OS-level file locking, environment variables, or platform-specific APIs. Evidence signature: test fails on CI but passes locally; failure rate correlates with resource utilization peaks on CI agents.

AI detection pattern: Compare CI agent specifications against local dev specs. Look for resource limit errors in CI build logs.

5. Test data dependency

The test depends on specific data being present in the database, assumes a clean state that isn't guaranteed, or generates data with non-deterministic identifiers that sometimes collide. Evidence signature: test fails with "record not found" or "duplicate key" errors; failure rate is low but non-zero even without code changes.

AI detection pattern: Analyze database error messages and test data setup/teardown logic.
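The usual fix is to make every piece of test data unique per invocation. A small sketch (the payload shape is hypothetical):

```python
import uuid


def build_user_payload():
    # Fragile version: a hard-coded identifier collides with leftovers from an
    # earlier run whose teardown failed, or with a parallel worker's insert.
    # return {"email": "test@example.com"}

    # Safer version: a per-invocation unique identifier never collides, and
    # teardown can target exactly the rows this test created.
    return {"email": f"test-{uuid.uuid4().hex}@example.com"}


def test_payloads_never_collide():
    assert build_user_payload()["email"] != build_user_payload()["email"]
```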

6. Assertion precision issues

The test asserts on values that legitimately vary: timestamps, ordering of unordered collections, floating-point calculations, or time-dependent computed fields. Evidence signature: failure messages show "expected X but got Y" where Y is a plausible variant of X; the actual value differs on each failure but always stays close to the expected value.

AI detection pattern: Compare the expected and actual values across multiple failure instances — variation in the "actual" value is diagnostic.
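The fixes here are usually tolerant assertions rather than exact equality. A few pytest-flavored sketches (values and field names are illustrative):

```python
from datetime import datetime, timedelta, timezone

import pytest


def test_total_price_with_tolerance():
    total = sum([0.1, 0.2, 0.3])                  # floating point: 0.6000000000000001
    assert total == pytest.approx(0.6)            # tolerance instead of exact equality


def test_tags_ignore_ordering():
    returned = ["beta", "alpha"]                  # unordered collection from an API
    assert sorted(returned) == ["alpha", "beta"]  # compare contents, not order


def test_created_at_is_recent():
    created_at = datetime.now(timezone.utc)       # stand-in for a server-set timestamp
    assert datetime.now(timezone.utc) - created_at < timedelta(seconds=5)
```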

Prompt:

I have a flaky test. Here is its test code and a description of what it tests:

## Test Code

[paste test code]


## What This Test Covers
[describe the feature, operation, or behavior being tested]

Based on the flaky test cause taxonomy (timing, shared state, external dependency, environment, test data, assertion precision), which categories are plausible for this test? For each plausible category:
1. Explain what the specific failure mechanism would look like for this test
2. List what evidence would confirm or rule out this category
3. Rate the likelihood based on what you can infer from the test code alone (before seeing failure data)

Learning Tip: Paste this taxonomy into a reference card or team wiki page. When a flaky test is first reported, assign it a "suspected category" based on its failure message before you start gathering evidence. The category assignment drives your evidence collection strategy — if you suspect shared state, you need execution-order data; if you suspect timing, you need duration histograms. Starting with a hypothesis cuts collection time significantly.


How to feed flaky test history and CI logs to AI for pattern recognition?

Flaky test analysis requires more data than a single failure. You need a failure history — multiple runs across different conditions — to distinguish the signal (the pattern) from the noise (the random variation). The more structured your failure history, the better AI can pattern-match.

Structuring your failure history for AI analysis

Collect a run history table with the following columns for each test execution over the last 20–50 runs:

| Run ID | Timestamp | Pass/Fail | Failure Message (truncated) | Parallel? | CI Agent | Branch | Test Suite | Execution Order Position |
|--------|-----------|-----------|-----------------------------|-----------|----------|--------|------------|--------------------------|

You don't need to manually build this — most CI systems expose this data through their API or UI. GitHub Actions, GitLab CI, and Jenkins all have test result aggregation. Export or copy the data into a structured format.
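If your CI publishes JUnit XML reports as artifacts, a short script can turn a folder of downloaded reports into run history rows. A sketch, assuming one XML file per run and a hypothetical test name:

```python
import glob
import xml.etree.ElementTree as ET

TEST_NAME = "test_checkout_applies_discount"  # substitute your flaky test's name

rows = []
for path in sorted(glob.glob("results/run-*.xml")):  # one JUnit XML report per CI run
    root = ET.parse(path).getroot()
    suite = root if root.tag == "testsuite" else root.find(".//testsuite")
    case = next((tc for tc in root.iter("testcase") if tc.get("name") == TEST_NAME), None)
    if case is None:
        continue  # the test did not run in this report
    failure = case.find("failure")
    rows.append({
        "run": path,
        "timestamp": suite.get("timestamp", "") if suite is not None else "",
        "result": "fail" if failure is not None else "pass",
        "message": (failure.get("message") or "")[:120] if failure is not None else "",
    })

for row in rows:
    print(row)
```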

Prompt:

Here is the run history for a flaky test over the last 30 CI runs.

## Test: [test name and file path]
## Run History

| Run ID | Timestamp (UTC) | Result | Failure Message | Parallel | Agent | Order Position |
|--------|-----------------|--------|-----------------|----------|-------|----------------|
[paste run history table]

Analyze this history for patterns:
1. Is there a temporal pattern? (Time of day, day of week, clustering)
2. Is there a concurrency pattern? (Fails more when parallel=true)
3. Is there a position pattern? (Fails more when run later in the suite order)
4. Is there an agent pattern? (Fails on specific CI agents)
5. Is the failure message consistent across failures, or does it vary?
6. What is the estimated true flake rate (% of runs that fail)?
7. Based on these patterns, what is the most likely root cause category?

Analyzing CI logs for flaky test failure context

Beyond the run history table, you need the actual log output for a sample of failing runs. Collect logs for 3–5 failing runs and structure them for comparison:

Prompt:

Below are log excerpts from 4 failing runs of the flaky test "[test name]". I've included the last 20 log lines before the failure assertion and the failure message itself for each run.

## Failing Run 1 (Run ID: #1234, 2024-03-15 02:14 UTC)

[log excerpt + failure message]


## Failing Run 2 (Run ID: #1251, 2024-03-16 14:38 UTC)

[log excerpt + failure message]


## Failing Run 3 (Run ID: #1287, 2024-03-18 09:22 UTC)

[log excerpt + failure message]


## Failing Run 4 (Run ID: #1301, 2024-03-20 17:45 UTC)

[log excerpt + failure message]


And here is a log excerpt from a passing run for comparison:

## Passing Run (Run ID: #1290, 2024-03-18 11:05 UTC)

[log excerpt]


Compare the failing runs to the passing run:
1. What is consistently present in the failing runs but absent in the passing run?
2. What is consistently absent in the failing runs but present in the passing run?
3. Are the failure messages consistent or variable? What does variation suggest?
4. Is there any timing difference in the log sequence between failing and passing runs?
5. What root cause does this comparison most support?
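Extracting those excerpts doesn't need to be manual. A small helper (the file name and failure marker are placeholders for whatever your runner produces) pulls the last 20 lines before the failure:

```python
def excerpt_before_failure(log_path, marker, lines_before=20):
    # Return the `lines_before` log lines preceding the first line containing
    # the failure marker, plus the failure line itself.
    with open(log_path, errors="replace") as f:
        lines = f.readlines()
    for i, line in enumerate(lines):
        if marker in line:
            return "".join(lines[max(0, i - lines_before):i + 1])
    return ""


# Hypothetical usage: adjust the marker to your test runner's failure output.
print(excerpt_before_failure("ci-run-1234.log", "AssertionError"))
```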

Using test execution order as a diagnostic variable

If you suspect shared state contamination, you need to test with controlled execution order:

Prompt:

I'm testing the hypothesis that this test fails due to shared state contamination from a preceding test.

Here are the tests that ran before the flaky test in the runs where it failed:
Run #1234 (fail): [list test names in execution order]
Run #1251 (fail): [list test names in execution order]
Run #1290 (pass): [list test names in execution order]

Identify:
1. Which test(s) appear in both failing runs but not (or later) in the passing run?
2. These candidate predecessor tests are likely contaminating shared state. What shared state could they each affect? (Consider: database tables, cache state, mock setups, global variables, environment variables)
3. What specific teardown or cleanup would each candidate test need to add to eliminate the contamination?

Learning Tip: The highest-signal pattern for shared state contamination is this: the flaky test passes 100% of the time when run in isolation, and fails consistently when a specific other test runs before it. Before doing any AI analysis, run the flaky test in isolation ten times. If it passes all ten times, shared state is almost certainly the category.
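A quick way to run that isolation check, sketched for pytest with a hypothetical test node ID:

```python
import subprocess

NODE_ID = "tests/test_checkout.py::test_checkout_applies_discount"  # your flaky test
RUNS = 10

passes = 0
for _ in range(RUNS):
    # Each run is a fresh process, so no in-process state survives between runs.
    result = subprocess.run(["pytest", NODE_ID], capture_output=True)
    passes += result.returncode == 0

print(f"{passes}/{RUNS} isolated runs passed")
```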


How to use AI to validate root cause hypotheses for flaky tests?

Pattern recognition narrows the hypothesis space, but it rarely produces a definitive root cause. After pattern analysis, you'll have one or two high-confidence hypotheses. Validating them requires generating specific, falsifiable predictions that you can test.

The hypothesis validation loop

For each candidate root cause hypothesis, AI can generate:
1. A specific test or observation that would confirm the hypothesis
2. A specific test or observation that would rule it out
3. A code or configuration change that should eliminate the flakiness if the hypothesis is correct
4. A description of what you'd expect to see in the test results after the change

Prompt:

I have two candidate root cause hypotheses for this flaky test:

**Hypothesis A**: The test fails due to a race condition — the test assertion executes before an async event handler completes. Supporting evidence: failure message shows the target element not yet updated; failures cluster in the 2–5% of runs where CI agent load is highest.

**Hypothesis B**: The test fails due to test data contamination from TestSuite_UserManagement tests that run before it. Supporting evidence: fails more often when UserManagement suite runs first; the test queries a users table that UserManagement tests write to.

For each hypothesis:
1. Design a controlled experiment that would confirm or falsify it (something I can run in the next hour)
2. Predict exactly what I should observe if the hypothesis is true
3. Describe what code or configuration change would fix the flakiness if true
4. Estimate how confident you are in this hypothesis based on the evidence (High/Medium/Low)

Which hypothesis should I test first and why?

Controlled experiment design for timing hypotheses

Timing and concurrency hypotheses require specific experimental designs:

Prompt:

I'm testing the hypothesis that this test fails due to a timing race condition.

The test code is:

[paste test code]


Design a controlled experiment to confirm or falsify this hypothesis:
1. What should I add to the test code to surface the race condition explicitly? (e.g., artificial delays, logging of timing)
2. What should I observe in the logs or test output if the race condition exists?
3. If I add a [waitFor / explicit await / sleep] at [specific location], should the failure rate drop? What would it drop to if the hypothesis is correct vs. incorrect?
4. What's a better fix than a sleep — and what would correctly fix this race condition without masking it?

Code-level root cause confirmation

Once you have a strong hypothesis with experimental support, use AI to confirm it at the code level:

Prompt:

I have high confidence (from experimental evidence) that this flaky test fails due to [specific mechanism].

Here is the relevant test code and the source code of the system under test:

## Test Code

[paste test code]


## Source Code — Relevant Section

[paste source code of the async/shared-state/data-setup section suspected]


Confirm or refine the root cause at the code level:
1. Walk through the execution of the test code step by step, identifying the specific line where the race condition / state dependency / data assumption occurs
2. Identify the exact condition that must be true for the test to fail
3. Is there a single code location where the fix should be applied? Which is it?
4. Write the corrected test code (or the corrected setup/teardown code) that eliminates the root cause

Learning Tip: For timing-related flaky tests, always ask AI to suggest two solutions: the "quick fix" (usually adding a wait or retry) and the "correct fix" (addressing the async design). Quick fixes mask the problem and can hide real bugs; correct fixes eliminate the underlying design issue. Bring both to the team so the decision is explicit, not accidental.
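For reference, here is what the two options tend to look like in a Python test; the fixture and attribute names are hypothetical:

```python
import time


def wait_until(condition, timeout=5.0, interval=0.05):
    # The "correct fix" building block: poll the real completion signal with a
    # hard ceiling, so the test waits only as long as needed and fails loudly
    # on timeout instead of masking the race.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if condition():
            return True
        time.sleep(interval)
    return False


def test_report_is_generated(report_job):  # report_job: hypothetical fixture
    report_job.start()

    # Quick fix (masks the race and slows every run):
    # time.sleep(3)

    # Correct fix (waits on the actual condition):
    assert wait_until(lambda: report_job.status == "done")
```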


How to decide whether to fix or quarantine a flaky test using AI?

Not every flaky test should be fixed immediately. The fix decision is a resource allocation question: how much engineering time does this test cost (in CI noise, investigation time, false blocked merges) versus how much does fixing it cost? Quarantining a flaky test — tagging it so it doesn't block CI — is a legitimate short-term strategy, but it needs a plan and a timeline.

The fix-or-quarantine decision framework

| Factor | Fix | Quarantine |
|--------|-----|------------|
| The test covers critical functionality | Yes | No |
| The root cause is identified | Yes | Root cause unknown |
| The fix is straightforward (< 2h) | Yes | Fix is complex |
| The failure rate is high (> 20%) | Yes | Failure rate is low (< 5%) |
| The test has caused recent missed defects | Yes | No |
| The flakiness is in an external dependency (not your code) | Consider | Likely |
| The test would be expensive to rewrite | Consider | Possibly |

Prompt:

I need to decide whether to fix or quarantine this flaky test. Here is the context:

## Test Profile
- Test name: [name]
- What it covers: [description of feature or behavior — include criticality]
- Estimated fix effort: [from the root cause analysis above]
- Current failure rate: [% from run history]
- Failures in last 30 days that caused: blocked PRs [N], false bug investigations [N], developer interruptions [N]

## Root Cause Status
- Root cause identified: [Yes / Partially / No]
- Fix complexity: [describe the code change needed]
- Fix confidence: [how confident are we the proposed fix will resolve the flakiness]

## Team Context
- Sprint capacity: [available hours this sprint]
- Upcoming deadlines or feature freeze: [any relevant timeline context]

Given this information, recommend fix or quarantine, with reasoning. If quarantine, specify:
1. The exact quarantine tag / label to use
2. What evidence to attach to the quarantine ticket (so future engineers have context)
3. A recommended review date to revisit the decision
4. What observable event (a specific failure, a deadline, a code change) should trigger re-investigation even before the review date

Generating the quarantine ticket

If the decision is to quarantine, AI can draft the quarantine ticket efficiently:

Prompt:

Write a quarantine ticket for this flaky test.

The ticket should include:
1. Title: "[FLAKY] [test name] — [one-line root cause description or "unknown"]"
2. Context: what the test covers, when it started failing, flake rate
3. Evidence summary: what investigation was done, what was found
4. Root cause status: identified / suspected / unknown
5. Quarantine impact: what coverage is now suppressed
6. Re-evaluation criteria: under what conditions should this be re-investigated (time-based, code-change-based, or incident-based)
7. Fix acceptance criteria: what would a correct fix need to demonstrate to close this ticket

Measuring the cost of quarantine over time

If your team quarantines frequently, track the cumulative cost:

Prompt:

I have a list of currently quarantined tests with their quarantine dates and suspected causes:

[paste list of quarantined tests]

Analyze this quarantine inventory:
1. Which tests have been quarantined longest? Should any be re-evaluated now?
2. Are there patterns in the root cause categories — are we quarantining the same type of problem repeatedly?
3. Estimate the total coverage gap from all quarantined tests (which functional areas are unmonitored)
4. Prioritize the top 3 tests to fix this sprint based on: coverage criticality, estimated fix effort, and how long they've been quarantined

Learning Tip: Set a team rule: no test stays in quarantine for more than one full sprint without a re-evaluation. A "quarantine review" as a standing agenda item in sprint planning takes 5 minutes and prevents the quarantine list from becoming a permanent dead zone. Use the quarantine inventory prompt above to prep for that meeting in under 10 minutes.
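To prep that review, a small sketch that flags quarantined tests older than one sprint, assuming the inventory lives in a CSV with hypothetical column names:

```python
import csv
from datetime import date

SPRINT_LENGTH_DAYS = 14  # assumption: two-week sprints

with open("quarantine.csv", newline="") as f:  # columns: test, quarantined_on, category
    for row in csv.DictReader(f):
        age = (date.today() - date.fromisoformat(row["quarantined_on"])).days
        flag = "REVIEW OVERDUE" if age > SPRINT_LENGTH_DAYS else ""
        print(f"{row['test']:50} {age:4d} days  {row['category']:20} {flag}")
```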


Flaky test diagnosis with AI works best when you approach it as a structured data analysis problem, not a debugging expedition. The engineers who reduce their flaky test backlog fastest are the ones who systematically collect run history, run controlled experiments, and use the hypothesis validation loop. The prompts in this topic give you the scaffolding for every step of that process.