
Agentic test planning

How do you generate a complete test plan by combining spec and code-change context?

Agentic test planning is the first high-value stage of the agentic QA loop, and it is also where the quality of your context engineering determines the quality of everything that follows. A weak test plan at this stage produces weak test cases, incomplete automation, and blind exploratory sessions. A strong test plan cascades precision through every subsequent stage.

The foundation of strong agentic test planning is the dual-context model described in Module 2: feeding both the spec context (what the feature is supposed to do) and the code-change context (what was actually implemented) into a single planning prompt. This combination lets the agent cross-reference intent against implementation before a single test is written.

Preparing the Dual-Context Input

Before running the planning prompt, gather and structure your two context sources:

cat > /tmp/feature-spec.md << 'EOF'
## Feature: Checkout Address Validation

### User Story
As a customer, I want my delivery address validated in real-time so that I don't complete checkout with an undeliverable address.

### Acceptance Criteria
- AC1: Address fields validate on blur (when user leaves each field)
- AC2: Invalid zip code shows inline error: "Enter a valid ZIP code"
- AC3: Invalid city/state combination shows: "City does not match state"
- AC4: PO Box addresses show warning: "PO Boxes not accepted for same-day delivery"
- AC5: Valid address shows green checkmark confirmation
- AC6: Form cannot be submitted if any address field has a validation error
- AC7: Validation must complete within 500ms
- AC8: Works on mobile viewports (320px–428px width)
EOF

# Capture what was actually implemented, relative to main
git diff origin/main...HEAD > /tmp/feature-diff.txt

# Quick size check on both context inputs
wc -l /tmp/feature-spec.md /tmp/feature-diff.txt
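
One failure mode worth guarding against: if the base ref is wrong or the remote is stale, the diff comes back empty and the planning prompt silently loses half its context. A minimal check, as a sketch:

[ -s /tmp/feature-diff.txt ] || echo "WARNING: empty diff; check your branch and base ref"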

The Complete Agentic Test Planning Prompt

claude --print "
You are a senior QA engineer generating a complete test plan for a feature.

## SPEC CONTEXT
---
$(cat /tmp/feature-spec.md)
---

## CODE CHANGE CONTEXT
---
$(cat /tmp/feature-diff.txt)
---

## CODEBASE CONTEXT
Read the following files to understand the testing framework and conventions:
- CLAUDE.md (for test stack and conventions)
- tests/e2e/ (for existing test patterns)
- src/components/checkout/ (for component structure)

## YOUR TASK
Generate a complete, prioritized test plan with the following sections:

### 1. Feature Summary (QA perspective)
Describe the feature's testable scope in 2-3 sentences.

### 2. Spec-Implementation Alignment Check
List any discrepancies between the spec and the code change:
- Spec requirements not implemented in the diff
- Code changes not reflected in the spec
- Ambiguous AC that could be interpreted multiple ways

### 3. Risk Assessment
For each risk, classify as HIGH / MEDIUM / LOW and explain why.
Focus on: data loss, security, payment flow impact, regression risk on adjacent features.

### 4. Test Scope
#### In Scope
[explicit list]
#### Out of Scope
[explicit list with rationale]
#### Dependencies / Blockers
[anything that must be ready before testing can complete]

### 5. Test Scenarios (Risk-Ordered)
For each scenario, include:
- Scenario ID: TS-{N}
- Title
- Type: E2E / API / Manual / Exploratory
- Risk level: HIGH / MEDIUM / LOW
- AC reference (e.g., AC3)
- Description

### 6. Coverage Goals
Define pass criteria for each test level:
- E2E automation: target % of HIGH and MEDIUM scenarios
- API: which endpoints need contract coverage
- Manual: which scenarios require human judgment
- Exploratory: which risk areas to charter

### 7. Entry and Exit Criteria
- Entry: what must be true before QA begins
- Exit: what must be true to approve the feature for merge

Output the complete test plan. Be specific and project-aware.
" > qa-artifacts/test-plan-checkout-validation-$(date +%Y%m%d).md

This prompt produces a structured, persisted test plan that you can review, approve, and feed into the next stages of the loop. The key is the three-part context: spec, diff, and live codebase exploration. Without all three, the plan will have gaps.

Example Output: Risk Assessment Section

A well-executed version of this prompt produces a risk assessment like this (partial example):

### 3. Risk Assessment

**HIGH — Payment form submission bypass (TS-003)**
The diff modifies the form validation guard on line 247 of CheckoutForm.tsx.
If validation state is not correctly propagated to the submit handler,
a user could submit with an invalid address. This directly impacts order
fulfillment. Existing test `checkout.spec.ts#L89` does not cover the
async validation path.

**HIGH — Mobile viewport regression (TS-011)**
The new inline error components use absolute positioning. The diff shows
no responsive breakpoint overrides. On 320px viewports, error messages
may overflow and be clipped. This is a regression risk vs. current behavior.

**MEDIUM — 500ms performance target (TS-008)**
AC7 specifies 500ms. The diff introduces a third-party address validation
API call with no caching layer. Under load or with slow network conditions,
this could exceed the threshold. Performance test needed.

**LOW — PO Box warning on existing addresses (TS-015)**
Edge case: users with saved addresses that are PO Boxes. The warning
logic runs on blur but the save flow runs asynchronously. Low likelihood
of customer impact but worth one manual test case.

This level of specificity — referencing exact line numbers, existing test IDs, and real AC numbers — is what makes the plan immediately actionable rather than generic.

Learning Tip: Run the test planning prompt twice with slightly different phrasings and compare the outputs. The sections that are consistent across both runs are solid. The sections that differ significantly are areas of genuine ambiguity in your spec or code change — and those are exactly the areas that need a human QA engineer to clarify before generation begins. Use the divergent sections as a list of clarifying questions to bring to your product owner.
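
A minimal way to run the two-pass comparison, assuming you have saved the two prompt phrasings to files (the file names are illustrative):

claude --print "$(cat /tmp/plan-prompt-a.txt)" > /tmp/plan-run-a.md
claude --print "$(cat /tmp/plan-prompt-b.txt)" > /tmp/plan-run-b.md

# Stable sections across both runs are solid; divergent ones are your clarifying questions
diff --side-by-side --width=200 /tmp/plan-run-a.md /tmp/plan-run-b.md | less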


How does AI generate risk prioritization, scope, and coverage goals in one pass?

Understanding how the AI produces risk prioritization helps you evaluate and calibrate the output rather than accepting it blindly. The agent applies a multi-factor reasoning process when it assesses risk, and knowing this process lets you prompt it more effectively and catch cases where it is reasoning incorrectly.

How AI Derives Risk Priority

The agent cross-references four signals to assign risk levels:

Signal 1 — Change Impact Radius: The agent reads the diff and identifies which modules, functions, and components were modified. Changes to shared utilities, payment processors, authentication handlers, and data persistence layers automatically carry higher risk than changes to UI rendering components or static copy.

Signal 2 — Acceptance Criteria Complexity: AC items with multiple conditions, timing requirements, or cross-device behavior carry higher risk than single-condition ACs. "Validation must complete within 500ms" (AC7) is higher risk than "Invalid zip code shows error" (AC2) because it has a performance dimension that is hard to verify manually.

Signal 3 — Existing Coverage Gap: The agent looks at existing test files and checks whether the changed code paths are currently covered. Uncovered code paths that are being modified carry much higher risk than modified paths with solid existing coverage.

Signal 4 — Domain Context from CLAUDE.md: If your CLAUDE.md includes domain model information (payment states, user roles, session handling), the agent uses it to flag scenarios where domain-critical state transitions are involved. This is why a well-maintained CLAUDE.md dramatically improves risk assessment quality.
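
As an illustration, a CLAUDE.md fragment carrying this kind of domain context might look like the following (the specifics are hypothetical):

## Domain Notes for QA
- Payment states: cart → pending → authorized → captured; transitions are one-way
- Checkout, auth, and order persistence are business-critical; default regressions there to HIGH risk
- Saved addresses are shared with the subscriptions module, so address-handling changes have a wide blast radius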

Prompting for Explicit Risk Reasoning

To get traceable risk reasoning (not just labels), add this to your planning prompt:

For each HIGH and MEDIUM risk scenario, provide:
- Which signal drove the high-risk classification (change impact / AC complexity / coverage gap / domain criticality)
- The specific code location or AC that triggers the risk
- The test scenario that most directly addresses this risk

This forces the agent to show its work, which lets you verify the reasoning and catch cases where it has mislabeled a risk.

How AI Defines Scope

Scope definition in an AI-generated test plan has two failure modes: over-scoping (planning to test everything including stable, unrelated features) and under-scoping (planning to test only the literal changed lines, missing regression risk in adjacent code).

To correct for over-scoping:

Limit the test scope to:
1. Features directly modified in the code change
2. Features that call or depend on the modified components
3. Data flows that pass through the modified code paths

Explicitly exclude: features in other modules with no dependency on the changed files.

To correct for under-scoping:

Expand the scope analysis to include:
1. All callers of the modified functions (grep the codebase for usage)
2. Integration points where the modified component connects to other modules
3. User journeys that include the modified feature as a sub-step
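
For step 1 of the under-scoping correction, a quick way to gather the caller list before pasting it into the prompt (the function name here is hypothetical):

# Find every call site of the modified validator to feed the scope analysis
grep -rn "validateAddress(" src/ --include="*.ts" --include="*.tsx"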

How AI Sets Coverage Goals

Coverage goals in an AI-generated plan are not arbitrary percentages. They are derived from the risk assessment:

| Risk Level   | Expected Coverage                                                  |
|--------------|--------------------------------------------------------------------|
| HIGH         | 100% scenario coverage, automated where possible                   |
| MEDIUM       | 80%+ scenario coverage, automate critical path                     |
| LOW          | Best effort; cover in exploratory, skip automation unless trivial  |
| Out of scope | Exclude entirely                                                   |

Add this reasoning explicitly to your planning prompt:

Set coverage goals using this formula:
- HIGH risk scenarios: must have automated test coverage before merge
- MEDIUM risk scenarios: manual test case required; automate if simple
- LOW risk scenarios: include in exploratory charter; no automation required
- Generate the total estimated QA effort in hours based on this breakdown

Learning Tip: When you review an AI-generated test plan, the first thing to check is whether the HIGH-risk items match your team's intuition about the most dangerous parts of this change. If the agent assigned MEDIUM to something your tech lead would consider HIGH, that is a signal that your CLAUDE.md is missing domain context — specifically around which parts of the system are most business-critical. Add that context immediately and re-run the plan.


How do you review and approve an AI-generated test plan before execution?

An AI-generated test plan is a draft, not a deliverable. The approval process is a structured review that catches the gaps, misclassifications, and missing context that the agent cannot resolve on its own. This review should take 15–30 minutes for a typical sprint feature, not an hour.

The Five-Point Review Checklist

1. Spec Coverage Check
Every acceptance criterion should map to at least one test scenario. Open the test plan alongside the spec and run through each AC:

Review prompt (run after initial plan generation):
"Cross-reference the test plan you just generated against these acceptance criteria:
[paste ACs]

For each AC, confirm: is there at least one test scenario in the plan that directly
validates this criterion? List any ACs with no corresponding scenario."

If any AC is uncovered, add a scenario before approving the plan.

2. Risk Sanity Check
Read each HIGH-risk scenario and ask: "If this scenario fails in production, what is the business impact?" If the answer is "minor inconvenience," the risk classification is wrong. If the answer is "revenue loss, data corruption, or security breach," the classification is right.

3. Test Level Appropriateness
For each scenario classified as E2E, ask: "Could this be tested at the API level instead?" E2E tests are expensive to maintain. If the scenario validates a business rule that lives in the API layer, an API test is preferable. The agent tends to over-classify scenarios as E2E because it cannot reason about maintenance cost.

Prompt to recalibrate test levels:

Review the test level assignments in this plan. For each scenario classified as E2E,
evaluate whether the same coverage could be achieved with an API test or unit test.
Recommend downgrades where appropriate, with reasoning.

4. Entry and Exit Criteria Realism
The agent's entry criteria often include items that are genuinely not available at QA start time (e.g., "all API endpoints implemented"). Review each entry criterion and flag anything that would block you immediately. If the entry criteria cannot be met, the test plan needs a phased execution strategy.

5. Effort Estimation Calibration
Compare the estimated effort to your team's velocity. If the plan says "12 hours of QA effort" for a two-day sprint, something is wrong — either the scope is too large, or the risk classifications have inflated LOW items to HIGH. Use this as a forcing function to cut scope or negotiate sprint capacity.

The Approval Gate Workflow

1. Generate plan → claude generates test-plan.md
2. Run spec coverage check → fix any uncovered ACs
3. Run risk sanity check → adjust any mis-classified risks
4. Run test level review → downgrade over-classified E2E scenarios
5. Review entry/exit criteria → flag blockers
6. Check estimated effort → cut scope if over capacity
7. Commit approved plan → git add qa-artifacts/test-plan-*.md && git commit -m "Add approved QA test plan: [feature]"
8. Share with team → post link in sprint Slack channel or Jira ticket

Committing the approved plan to the repository is important. It creates a versioned artifact that future iterations of the loop can reference, and it makes the plan reviewable by developers and product owners.

Learning Tip: Run the plan review with a developer, not just with the QA team. Developers often immediately spot when the agent has misunderstood the change radius — because they wrote the code, they know whether a function change affects ten callers or one. A 10-minute plan review with the feature developer before execution regularly catches scope errors that would otherwise surface as missing coverage in the final report.


How do you adapt an AI-generated test plan as the feature evolves mid-sprint?

Features change mid-sprint. Acceptance criteria get revised. Developers discover implementation constraints. Product owners add scope. When this happens, an AI-generated test plan that is not updated becomes a liability — teams test against the old spec while the feature drifts.

The agentic approach to mid-sprint adaptation is to treat plan updates as a regeneration task, not a manual editing task.

Detecting That the Plan Needs an Update

Three signals indicate the test plan is out of date:

Signal 1 — Spec delta: The product owner updates an acceptance criterion or adds a new AC. Check your test management tool and Jira tickets daily during active sprints.

Signal 2 — Code change delta: A new commit lands on the feature branch after the plan was generated. Check:

# List feature-branch commits from the last two days (GNU date syntax; on macOS use: date -v-2d +%Y-%m-%d)
git log --oneline origin/main..HEAD --after="$(date -d '2 days ago' +%Y-%m-%d)"

If there are new commits after the plan date, the plan may be stale.
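
A small helper makes the staleness check explicit, assuming the date-suffixed file naming described under Managing Plan Versions below:

# Compare the newest plan file against the newest commit on the branch
ls -t qa-artifacts/test-plan-*.md | head -1
git log -1 --format="last commit: %ad %h %s" --date=short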

Signal 3 — Failed test revealing scope gap: A test fails for a reason not anticipated in the plan. This is a retroactive signal that the plan's scope or risk assessment missed something.

Updating the Plan: Delta Regeneration

Do not re-run the full planning prompt on the full spec. Instead, run a delta prompt that takes the existing plan and the change as input:

cat > /tmp/spec-delta.md << 'EOF'
## Spec Change (2024-01-15)
AC4 has been updated:
- OLD: "PO Box addresses show warning: 'PO Boxes not accepted for same-day delivery'"
- NEW: "PO Box addresses BLOCK checkout with error: 'PO Boxes are not accepted for any delivery method'"

AC9 has been added (new):
- "Address validation must work with international addresses (CA, UK, AU)"
EOF

claude --print "
You are a QA engineer updating an existing test plan based on a spec change.

EXISTING TEST PLAN:
---
$(cat qa-artifacts/test-plan-checkout-validation-20240113.md)
---

SPEC DELTA (changes since the plan was written):
---
$(cat /tmp/spec-delta.md)
---

Your task:
1. Identify which existing test scenarios need to be updated due to the spec change
2. Identify which existing scenarios are now invalid and should be removed
3. Identify new scenarios that need to be added
4. Update the risk assessment if any new HIGH risks are introduced
5. Update the coverage goals and effort estimate accordingly

Output: A delta change log followed by the complete updated test plan.
" > qa-artifacts/test-plan-checkout-validation-$(date +%Y%m%d)-updated.md

Managing Plan Versions

Commit each version of the test plan as a distinct file with a date suffix:

qa-artifacts/
├── test-plan-checkout-validation-20240113.md    # Original
├── test-plan-checkout-validation-20240115-updated.md  # After spec change
└── test-plan-checkout-validation-20240117-final.md    # Pre-merge final

This versioning approach lets you diff between versions to see exactly what changed in scope, and it provides an audit trail if questions arise post-release about what was and was not tested.
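
For example, to see exactly what changed in scope between the original and updated plans:

diff qa-artifacts/test-plan-checkout-validation-20240113.md \
     qa-artifacts/test-plan-checkout-validation-20240115-updated.md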

Mid-Sprint Triage for Late Scope Additions

When scope is added late in a sprint (a common reality), the test plan update must also include a triage decision: can the new scope be tested within the sprint, or does it need to be deferred?

Triage prompt for late scope additions:

Given this new scope addition: [describe new AC or feature change]
And given the current test execution status: [describe what has been run so far]
And the remaining sprint capacity: [N hours remaining]

Recommend one of:
A) Include in this sprint: add scenarios, estimate effort, update plan
B) Partial coverage: automate the critical path, defer edge cases to next sprint
C) Defer: add as tech debt, create a follow-up ticket, document the gap

Provide reasoning for your recommendation.

Learning Tip: Set up a Slack bot or simple GitHub Action that notifies the QA channel whenever the spec document or Jira ticket is updated. This makes spec drift visible in real time rather than discoverable only at the retrospective. Pair this with a standing practice of re-running the plan delta prompt within one day of any spec change. The 20-minute investment in keeping the test plan current saves two hours of re-work when you discover a coverage gap in the final coverage report.
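
The notification half can be a single CI step. A rough shell sketch, where the spec path and the SLACK_WEBHOOK_URL secret are placeholders to adapt:

# Post to Slack when the spec file changed in the last push (run from CI)
if git diff --name-only HEAD~1 HEAD | grep -q "docs/feature-spec.md"; then
  curl -s -X POST -H 'Content-type: application/json' \
    --data '{"text":"Spec changed: re-run the test plan delta prompt"}' \
    "$SLACK_WEBHOOK_URL"
fi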