
AI-guided exploratory testing in the agentic loop

Why does exploratory testing come after scripted testing in the agentic loop?

This ordering surprises some QA engineers. The instinct is often to explore first — go find the bugs before writing scripts, so the scripts cover what you already know is risky. In the agentic loop, the sequence is deliberately reversed: scripted tests run first, then exploratory testing follows. The sequencing is not arbitrary: exploration consumes the scripted results, and exploratory findings feed back into the test plan, so each activity benefits from the other's output.

The Information Problem with Exploring First

When you explore before scripted tests run, you are operating without the most valuable signal the loop can provide: which scripted scenarios passed and which failed. This creates two problems:

Problem 1 — Redundant exploration: You spend exploratory time investigating paths that the automated tests have already validated. Without knowing the scripted test results, you cannot efficiently focus your exploration on the gaps.

Problem 2 — Uninformed risk prioritization: The highest-risk areas for exploration are often the edges of what scripted tests cover — the adjacent paths, the unchecked state combinations, the error recovery scenarios. Without scripted test results, you don't know where the scripted coverage ends, so you can't efficiently target the adjacencies.

What Scripted Test Results Tell the Explorer

After scripted tests execute, the QA engineer has access to a rich set of signals:

Coverage signal: Which scenarios from the test plan have been validated and which have not. Exploration should prioritize uncovered scenarios, not re-cover validated ones.

Failure signal: Which scripted tests failed. Failed tests indicate unstable areas that deserve deeper exploration — if the happy path is failing, the error paths and edge cases adjacent to that failure are high-priority exploration targets.

Flakiness signal: Which tests are flaky (passing on retry). Flaky tests signal non-deterministic behavior — race conditions, state leakage, timing dependencies — that is prime territory for exploratory investigation.

Coverage delta signal: Compared to the previous test run or the baseline, which coverage metrics changed? Coverage regression often indicates unintended behavior changes in adjacent code.
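These four signals are most useful when collected into a single artifact the agent can read before charter generation. The sketch below shows that gathering step; every path and grep pattern is illustrative and depends on what your CI and coverage tooling actually emit.

# Collect the exploratory input signals into one context file.
# All paths and patterns are illustrative -- adapt them to your CI's output.
SIGNALS=qa-artifacts/exploratory-input-signals.md
{
  echo "## Failed scripted tests"
  grep -i 'fail' qa-artifacts/ci-results-*.md || echo "(none)"
  echo "## Flaky tests (passed on retry)"
  grep -i 'retry' qa-artifacts/ci-results-*.md || echo "(none)"
  echo "## Coverage delta vs. baseline"
  diff qa-artifacts/coverage-baseline.txt qa-artifacts/coverage-current.txt || true
} > "$SIGNALS"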

The Sequencing in the Loop

Scripted tests execute
         │
         ▼
┌─────────────────────────────────────────┐
│  EXPLORATORY INPUT SIGNALS              │
│                                         │
│  • Failed tests → explore adjacent paths│
│  • Flaky tests → explore timing/state   │
│  • Coverage gaps → explore uncovered AC │
│  • New code areas → risk-based charters │
└─────────────────────────────────────────┘
         │
         ▼
AI generates risk-ranked exploratory charters
         │
         ▼
QA engineer executes charters
         │
         ▼
Findings feed bug triage and coverage report

This sequence turns the exploratory step into a precision instrument rather than an open-ended investigation. The explorer knows exactly where scripted coverage ends and where exploration should begin — and the AI generates charters that target precisely those boundaries.

When to Explore First (The Exception)

There is one valid exception to the scripted-first sequence: when the feature is so new that no automated tests exist yet and the risk is high enough that you need exploratory findings to inform the test plan. In this case, a brief exploratory spike (one to two hours maximum) can precede the scripted cycle. But this is a planning-phase exploration — closer to an acceptance testing session than the systematic exploratory step in the loop.

Learning Tip: Time-box the post-scripted exploratory step tightly: no more than 20–25% of total QA time for a given feature. When exploratory testing expands to fill available time, it stops being targeted risk investigation and starts being unstructured testing without a clear stop condition. The scripted test results give you a natural scoping mechanism — use them to define explicit charter targets and stop conditions before the session begins.


How do you use code-change context to generate risk-based charters automatically?

In Module 7, you learned to generate exploratory test charters from feature specs and test plans. In the agentic loop, charter generation adds a third input: the code-change diff. This is the most powerful input for risk-based exploration because it tells the agent not just what the feature is supposed to do, but exactly which code paths changed — and therefore which paths are most likely to contain unintended side effects.

The Charter Generation Prompt with Code-Change Context

TEST_PLAN="qa-artifacts/test-plan-checkout-validation-20240113.md"
CI_RESULTS="qa-artifacts/ci-results-$(git rev-parse --short HEAD).md"

claude --print "
You are an experienced QA engineer generating risk-based exploratory test charters.

## CODE CHANGE (primary signal for risk areas)
---
$(git diff origin/main...HEAD)
---

## TEST PLAN (defines what's been scripted, what's been scoped)
---
$(cat $TEST_PLAN)
---

## CI EXECUTION RESULTS (what scripted tests found)
---
$(cat $CI_RESULTS 2>/dev/null || echo 'No CI results available — use test plan scope only')
---

## CHARTER GENERATION INSTRUCTIONS

Generate exploratory test charters that TARGET what scripted tests cannot cover:
1. State combinations not covered by single-scenario tests
2. Error recovery paths (what happens AFTER an error, not just when it occurs)
3. Adjacent features that the code change could have inadvertently affected
4. User flows that cross multiple changed components
5. Race conditions or timing-dependent behavior in async code changes

For each charter:
- ID: EC-{N}
- Title: one sentence describing the exploration target
- Risk rationale: why this area is high-risk based on the diff
- Scope: what to explore (specific pages, flows, states)
- Out of scope: what to explicitly avoid in this charter
- Duration: estimated time in minutes (be realistic — 30-60 min per charter)
- Oracles: how to recognize a bug vs. expected behavior in this area
- Entry state: what system state to start from
- Stop condition: what signal indicates this area has been sufficiently explored

Prioritize charters by risk level. Generate 4-6 charters total.
" > qa-artifacts/exploratory-charters-$(date +%Y%m%d).md

Example Generated Charter

A well-generated charter from this prompt looks like this:

### EC-001: Address validation state during rapid form editing
**Risk Rationale**: The diff shows the validation fires on blur events with a 500ms debounce
and an async API call. If a user edits a field immediately after the debounce fires but before
the API responds, the validation state machine may show a stale result. This is a race condition
not covered by any scripted test (all of which wait for the async response to complete).

**Scope**: Checkout form address fields, specifically during rapid sequential editing
**Out of Scope**: The validation API itself — only the UI state management
**Duration**: 45 minutes

**Entry State**: Logged-in user on /checkout with one item in cart, address section visible

**Oracles**:
- Bug: Green checkmark visible while an invalid address is being typed
- Bug: Error message persists after correcting a field
- Bug: Validation error from a previous field appears on a different field
- Expected: Validation state is always consistent with the last-committed field value

**Exploration Path**:
1. Start by typing slowly — does validation fire reliably on blur?
2. Type and immediately tab — does the debounce hold or does it fire early?
3. Type a valid value, then immediately overtype with invalid value before 500ms — what state?
4. Edit multiple fields in rapid sequence — does each field validate independently?
5. Edit city after zip validated — does zip re-validate?

**Stop Condition**: You have either found a reproducible inconsistency in validation state,
or you have exercised all 5 exploration paths with no anomaly found.

This charter is actionable because it gives the explorer a specific mechanism to investigate (the debounce race condition), concrete oracles (what to look for), and a stop condition that prevents aimless exploration.

Adapting Charters for Different Test Disciplines

For backend/API-focused QA engineers, the same context with a discipline-specific instruction block yields API-oriented charters:

claude --print "
[same context as above]

Generate charters specifically for BACKEND/API exploration:
- Focus on concurrent request scenarios
- Error response body consistency across error types
- Database state after failed partial operations
- Authentication token edge cases during session transitions
- Logging and audit trail correctness
"

For mobile QA engineers:

claude --print "
[same context as above]

Generate charters specifically for MOBILE exploration:
- App state restoration after background/foreground cycle during validation
- Network interruption during the async validation API call
- Form state persistence when phone call interrupts checkout
- Keyboard dismissal behavior at various validation states
- Accessibility: VoiceOver/TalkBack behavior during validation error state
"

Learning Tip: Generate one broad set of charters and one set targeted at your specific discipline (frontend, API, mobile). Run the broad charters in a team session and the discipline-specific charters individually. The cross-discipline findings from the broad session often surface integration-layer bugs that discipline-specific exploration misses, while the individual discipline sessions find deeper stack-specific issues that broad exploration does not reach.
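A minimal sketch of generating the broad set and the discipline-specific sets in one pass, assuming the shared prompt context (diff, test plan, CI results) has already been assembled into a file; the /tmp/charter-context.md path and focus labels are hypothetical.

# Generate one broad charter set plus one per discipline from shared context.
# /tmp/charter-context.md is a hypothetical file holding the diff, test plan,
# and CI results sections shown earlier in this module.
for focus in broad frontend api mobile; do
  claude --print "$(cat /tmp/charter-context.md)

Generate exploratory charters with a ${focus} focus." \
    > "qa-artifacts/exploratory-charters-${focus}-$(date +%Y%m%d).md"
done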


How do you feed exploratory findings back into the test plan and bug pipeline?

Exploratory testing that produces only verbal observations adds no lasting value. In the agentic loop, exploratory findings are structured artifacts that feed two downstream processes: the bug triage pipeline and the test plan coverage record. The QA engineer's job during and after an exploratory session is to capture findings in a format that the agent can process.

Real-Time Note Capture Format

During an exploratory session, take notes in a semi-structured format optimized for AI processing; avoid free-form prose that forces the agent to interpret intent:

**Charter**: EC-001 — Address validation state during rapid form editing
**Date**: 2024-01-15
**Engineer**: [name]
**Session duration**: 47 minutes

## Observations

### OBS-001 [BUG-CANDIDATE]
**Action**: Typed "02101" in zip field, tabbed immediately (before debounce, ~200ms)
**Observed**: Green checkmark appeared for 800ms, then switched to error state "Enter a valid ZIP code"
**Expected**: No checkmark until validation completes
**Reproducibility**: Reproduced 3/3 attempts
**Environment**: Chrome 121, macOS, test env
**Screenshot**: [filename]

### OBS-002 [OBSERVATION - NOT BUG]
**Action**: Edited city after zip was validated
**Observed**: Zip field re-validated (expected behavior based on spec AC1)
**Note**: This is correct — all fields validate when any field changes

### OBS-003 [BUG-CANDIDATE]
**Action**: Typed "PO Box 123, Boston, MA 02101" in street field (Chrome autofill)
**Observed**: No warning appeared for 3 seconds, then warning flashed and immediately disappeared
**Expected**: Warning persists until address is changed
**Reproducibility**: Reproduced 2/2 (requires Chrome autofill — not reproducible with manual typing)
**Environment**: Chrome 121 with autofill history, test env
**Screenshot**: [filename]

## Summary
- Total observations: 3
- Bug candidates: 2 (OBS-001, OBS-003)
- Interesting observations: 1
- Charter coverage: 4/5 exploration paths completed (path 3 incomplete due to time)
- Recommended follow-up: Run OBS-003 on Safari autofill (different autofill mechanism)
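Because the synthesis step in the next section parses these notes mechanically, a quick structural check before handing them to the agent catches missing fields early. This is a hedged sketch keyed to the field names in the template above; the file path is illustrative.

# Sanity-check a session-notes file before running the synthesis prompt.
# Field names match the note-capture template above; the path is illustrative.
NOTES=qa-artifacts/session-notes-EC-001-20240115.md
echo "Observations:   $(grep -c '^### OBS-' "$NOTES")"
echo "Bug candidates: $(grep -c 'BUG-CANDIDATE' "$NOTES")"
# Every bug candidate needs a Reproducibility line to become a bug report.
CANDIDATES=$(grep -c 'BUG-CANDIDATE' "$NOTES")
REPRO=$(grep -c '^\*\*Reproducibility\*\*' "$NOTES")
if [ "$REPRO" -lt "$CANDIDATES" ]; then
  echo "WARNING: some bug candidates lack a Reproducibility line"
fi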

Converting Session Notes to Bug Reports

After the session, run the synthesis prompt to convert raw notes into structured bug reports:

claude --print "
You are a QA engineer converting exploratory session notes into structured bug reports and test plan updates.

SESSION NOTES:
---
$(cat qa-artifacts/session-notes-EC-001-20240115.md)
---

TEST PLAN:
---
$(cat qa-artifacts/test-plan-checkout-validation-20240113.md)
---

SPEC:
---
$(cat specs/checkout-address-validation.md)
---

Your tasks:

## Task 1: Generate Bug Reports
For each BUG-CANDIDATE observation, generate a complete bug report:

### Bug Report Format:
**ID**: BUG-{date}-{N}
**Summary**: [one sentence]
**Severity**: CRITICAL / HIGH / MEDIUM / LOW
**Affected AC**: [AC number(s)]
**Steps to Reproduce**: [numbered steps, specific enough to reproduce]
**Expected Behavior**: [from spec]
**Actual Behavior**: [what was observed]
**Environment**: [from session notes]
**Reproducibility**: [X/N attempts]
**Root Cause Hypothesis**: [your analysis]
**Suggested Fix Direction**: [for developer context]

## Task 2: Test Plan Updates
For each bug found:
1. Identify which test plan scenario should have caught this bug
2. If the scenario exists: explain why the scripted test missed it
3. If the scenario doesn't exist: add it to the test plan as a new scenario
4. Recommend whether to add an automated regression test for this bug after it's fixed

## Task 3: Coverage Record Update
Update the coverage record to show:
- Which charter was executed
- Which areas were covered
- Which AC were validated through exploration
- What follow-up is recommended
" > qa-artifacts/session-synthesis-EC-001-$(date +%Y%m%d).md

The Bug Pipeline Integration

Once bug reports are generated, they flow directly into the team's bug tracking system. For Jira-integrated workflows:


# Create one issue from the first bug report in the synthesis file.
jira issue create \
  --project QA \
  --type Bug \
  --priority High \
  --summary "$(grep -m1 '\*\*Summary\*\*' qa-artifacts/session-synthesis-EC-001-*.md | cut -d: -f2- | sed 's/^ *//')" \
  --description "$(cat qa-artifacts/session-synthesis-EC-001-*.md)"

Learning Tip: Capture session notes during the session — not after. Memory degrades rapidly after a focused testing session, and the specific timing details (200ms before debounce, 3 seconds before warning) that make a bug reproducible are lost if you rely on post-session recall. A Markdown file open in a split screen during the exploratory session is the minimum viable note-taking setup. Some engineers dictate observations aloud via voice transcription and post-process the transcript with AI.


How do you balance exploration time with sprint constraints using AI prioritization?

Exploratory testing is open-ended by nature, and that conflicts with sprint time-boxing. Without explicit time allocation and charter prioritization, exploratory sessions either consume too much time (exploring low-risk areas in depth) or too little (a rushed 30-minute session that misses critical paths). The agentic loop uses AI-driven prioritization to time-box exploration precisely.

The Sprint Capacity Calculation

Before generating charters, establish how much time is available for exploration:

Sprint QA capacity calculation:
- Total QA engineer time available: [N hours]
- Time already committed to manual test execution: [X hours]
- Time committed to CI monitoring and triage: [Y hours]
- Time available for exploratory testing: N - X - Y hours

Maximum charter count = available hours / 1 hour per charter
(Use 1 hour per charter as a conservative baseline; adjust after the first sprint.)
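The calculation is simple enough to script so it runs at the top of the charter generation step; all hour values below are illustrative.

# Worked example of the capacity calculation; every value is illustrative.
TOTAL_HOURS=12     # total QA engineer time available this sprint
MANUAL_HOURS=6     # committed to manual test execution
TRIAGE_HOURS=3     # committed to CI monitoring and triage
EXPLORE_HOURS=$((TOTAL_HOURS - MANUAL_HOURS - TRIAGE_HOURS))
MAX_CHARTERS=$EXPLORE_HOURS   # conservative baseline of 1 hour per charter
echo "Exploratory budget: ${EXPLORE_HOURS}h -> at most ${MAX_CHARTERS} charters"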

Feed this capacity into the charter generation prompt:

claude --print "
[standard charter generation context]

SPRINT CAPACITY CONSTRAINT:
Available exploratory testing time: 3 hours (180 minutes)
Maximum charters: 3 (one per hour, including note capture)

Given this constraint, generate EXACTLY 3 charters.
Prioritize ruthlessly based on:
1. Risk level from the code diff (highest change volume = highest risk)
2. Gap in scripted test coverage (AC with no automated test = higher priority)
3. Business impact of failure (payment flow > informational display)

For each charter, justify why it was selected over other possible charters.
Include one LOW-risk charter if time permits, labeled as 'Optional — run if time allows.'
"

Dynamic Re-Prioritization Mid-Sprint

If new information arrives mid-sprint (a developer mentions an implementation shortcut, a related bug is found in production), re-run the charter prioritization with the new context:

cat > /tmp/new-context.md << 'EOF'
## Mid-Sprint Update
Developer disclosed that the address validation API has a known timeout issue under load.
The API sometimes returns HTTP 408 after 3 seconds with no error body.
This case is not handled in the current implementation.
EOF

claude --print "
Re-prioritize the exploratory charters given new information.

CURRENT CHARTER SET:
---
$(cat qa-artifacts/exploratory-charters-20240115.md)
---

NEW INFORMATION:
---
$(cat /tmp/new-context.md)
---

1. Does this new information change the risk level of any existing charter?
2. Should a new charter be added to cover the timeout scenario?
3. If a new charter is added and time is fixed at 3 hours, which existing charter should be deprioritized?

Output: updated prioritized charter list with reasoning.
"

Using Time-Boxing Signals During Sessions

The most effective time-management technique during an exploratory session is the stop condition defined in each charter. When you have executed all listed exploration paths or hit the stop condition (found a reproducible bug or exhausted the path space), the session is complete — even if the time-box has not expired. Stopping early is correct behavior.

When the stop condition is not met at time-box expiry:
- Document which exploration paths were not executed
- Flag the incomplete charter for follow-up in the next sprint
- Do not extend the session beyond the time-box in the current sprint

## Charter Completion Record
**Charter**: EC-001
**Time-boxed**: 60 minutes
**Time used**: 47 minutes
**Status**: Complete (stop condition met — found reproducible bug OBS-001)
**Paths completed**: 5/5
**Follow-up needed**: None

**Charter**: EC-002
**Time-boxed**: 60 minutes
**Time used**: 60 minutes (full time-box exhausted)
**Status**: Incomplete — 3/5 paths explored
**Paths NOT completed**: Path 4 (concurrent users), Path 5 (session expiry during validation)
**Follow-up needed**: Create follow-up ticket for Path 4 and 5 in next sprint

Learning Tip: After three sprints of using the agentic exploratory loop, analyze the charter data: how many bugs did each charter find? What was the average session time? Which charter categories (race conditions, error recovery, mobile-specific) had the highest bug yield? Use this data to tune your charter generation prompt — prioritizing the categories that historically yield the most bugs on your specific application. Exploratory testing ROI varies significantly by application type, and your historical data is more accurate than any heuristic.
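A minimal sketch of that retrospective analysis, assuming session notes have accumulated in qa-artifacts/ using the naming and field conventions from the templates above:

# Tally bug candidates per charter across all recorded sessions.
# Assumes the session-notes naming and field conventions used above.
for f in qa-artifacts/session-notes-EC-*.md; do
  charter=$(grep -m1 '^\*\*Charter\*\*' "$f" | cut -d: -f2- | sed 's/^ *//')
  bugs=$(grep -c 'BUG-CANDIDATE' "$f")
  echo "$bugs bug candidate(s) found in charter: $charter"
done | sort -rn   # highest-yield charters first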