Hands-on: Run the full agentic QA workflow

How to set up the agentic workflow — spec, feature branch diff, and codebase context?

This topic is a complete end-to-end walkthrough of the agentic QA workflow applied to a realistic feature. The feature is a password reset flow — a self-contained, moderately complex feature that exercises all layers of the agentic loop: business logic validation, UI interaction, API behavior, security edge cases, and state management. Work through each step in order, substituting your actual project's feature where instructed.

The Feature: Password Reset with Email Verification

Scenario: A user who has forgotten their password can request a reset link via email, click the link, enter a new password, and regain access to their account. The implementation includes a backend token-based flow and a frontend form.

Acceptance Criteria (save this as qa-artifacts/session-spec.md):

## Feature: Password Reset with Email Verification

### User Story
As a user who has forgotten my password, I want to reset it via email so that I can regain access to my account without contacting support.

### Acceptance Criteria
- AC1: Reset request form accepts email address and shows confirmation: "If an account exists, we'll send a reset link"
- AC2: Reset email is sent within 60 seconds of request
- AC3: Reset link expires after 24 hours
- AC4: Reset link is single-use — clicking it a second time shows: "This link has already been used or expired"
- AC5: New password must be at least 8 characters and contain one number
- AC6: Password and confirm-password fields must match before submission is enabled
- AC7: On successful reset, user is automatically logged in and redirected to dashboard
- AC8: User receives a confirmation email after successful password change
- AC9: After 5 failed password validation attempts, the form locks for 30 minutes
- AC10: Existing active sessions are invalidated after password change
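
One way to capture this spec as the session artifact is a quick heredoc (a minimal sketch — paste the user story and AC1–AC10 above into the body rather than retyping them):

mkdir -p qa-artifacts
cat > qa-artifacts/session-spec.md << 'EOF'
## Feature: Password Reset with Email Verification
[paste the user story and AC1-AC10 from above]
EOF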

Step 1: Verify Environment Ready

claude --version || { echo "Claude Code not installed"; exit 1; }
echo "ANTHROPIC_API_KEY: ${ANTHROPIC_API_KEY:0:10}..."

ls CLAUDE.md qa-context/ tests/ src/ || echo "Warning: Expected directories not found"

git branch --show-current
git diff origin/main...HEAD --stat

Expected output: feature branch name, list of changed files with insertion/deletion counts. If the diff is empty, you need code changes to work with — create a sample branch or use any existing feature PR.
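If you need a sample branch, a minimal sketch (the branch name and commit message are placeholders):

git checkout -b feature/password-reset
# make the feature's code changes on this branch, then:
git add -A
git commit -m "feat: password reset flow"
git diff origin/main...HEAD --stat   # should now list the changed files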

Step 2: Capture All Context Artifacts

SESSION_DATE=$(date +%Y%m%d)
FEATURE_SLUG="password-reset"
mkdir -p qa-artifacts


git diff origin/main...HEAD > qa-artifacts/session-diff-${FEATURE_SLUG}.txt
echo "Diff size: $(wc -l < qa-artifacts/session-diff-${FEATURE_SLUG}.txt) lines"

ls qa-context/
cat CLAUDE.md | head -30  # Preview to confirm it's populated

ls tests/e2e/ tests/api/ tests/fixtures/

echo "Context setup complete. Proceeding to test planning."

Step 3: Validate the Dual-Context Setup

Before running the full loop, validate that the combined context is coherent:

claude --print "
Read the following and confirm your understanding:

SPEC:
$(cat qa-artifacts/session-spec.md)

CODE CHANGE:
$(cat qa-artifacts/session-diff-${FEATURE_SLUG}.txt | head -200)

CODEBASE CONTEXT:
$(cat CLAUDE.md)

Summarize in 3 bullet points what you understand about:
1. What this feature does
2. What the most significant code changes are
3. What the project's test conventions require

This is a context validation check — do not generate tests or analysis yet.
"

If the validation output correctly describes the feature, the code change, and the test conventions, your context setup is correct. If any section is wrong or generic, fix the relevant context file before proceeding.

Learning Tip: The context validation step saves time. Running a full test planning pass on bad context produces a bad test plan that wastes review time. A 30-second context validation catches the most common issues — wrong framework referenced, domain entities confused, diff misread — before they propagate through the entire workflow.


How to execute each step of the agentic QA loop with AI assistance?

With context validated, execute each stage of the loop sequentially. Each step produces a persisted artifact that feeds into the next step.

Stage 1: Generate the Test Plan

claude --print "
$(cat CLAUDE.md)
$(cat qa-context/*.md 2>/dev/null)

---

You are a senior QA engineer generating a complete test plan.

SPEC:
$(cat qa-artifacts/session-spec.md)

CODE CHANGE:
$(cat qa-artifacts/session-diff-${FEATURE_SLUG}.txt)

Generate a complete test plan following the structure:

## Test Plan: Password Reset with Email Verification
**Date**: $(date +%Y-%m-%d)
**Feature branch**: $(git branch --show-current)
**Risk level**: [overall feature risk assessment]

### Spec-Implementation Alignment Check
[misalignments found, or 'Alignment confirmed']

### Risk Assessment
[HIGH / MEDIUM / LOW for each identified risk with rationale]

### Test Scope
#### In Scope
#### Out of Scope
#### Dependencies

### Test Scenarios (Risk-Ordered)
[ID, title, type, risk level, AC reference for each scenario]

### Coverage Goals
[pass criteria per test level]

### Entry and Exit Criteria
[entry: what must be true before QA; exit: what must be true to ship]

### Automation Decision
[for each scenario: E2E / API / Manual / Exploratory with rationale]

### Estimated Effort
[hours breakdown by test type]
" > qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md

echo "Test plan generated. Lines: $(wc -l < qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md)"
echo "Scenarios generated: $(grep -c 'TS-' qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md || echo 'n/a')"

Review gate: Read the test plan. Verify the alignment check, review the 3 highest-risk scenarios, confirm the automation decisions make sense. Make edits directly in the file. This review should take 10–15 minutes.
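Two quick commands can speed up this review (a sketch — assumes the generated plan keeps the section headings and HIGH/MEDIUM/LOW labels requested in the prompt):

# Print the alignment-check section for a fast read
sed -n '/### Spec-Implementation Alignment Check/,/### Risk Assessment/p' qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md
# List everything flagged HIGH so the riskiest scenarios are reviewed first
grep -n "HIGH" qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md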

Stage 2: Generate Manual Test Cases

claude --print "
$(cat CLAUDE.md)

---

You are generating manual test cases from an approved test plan.

TEST PLAN:
$(cat qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md)

SPEC:
$(cat qa-artifacts/session-spec.md)

Generate manual test cases for all scenarios where Automation Decision = Manual.
Follow the test case format from CLAUDE.md.
Use specific test data values — no generic placeholders.
Assign each case an ID in format: TC-TS-{N}
" > qa-artifacts/manual-tests-${FEATURE_SLUG}-${SESSION_DATE}.md

echo "Manual test cases generated: $(grep -c '^### TC-' qa-artifacts/manual-tests-${FEATURE_SLUG}-${SESSION_DATE}.md)"

Review gate: Verify AC coverage — every AC should have at least one manual test case. Check for generic test data (replace with real values from your test environment).
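A quick sanity check for the AC coverage part of this gate (assumes the generated cases reference criteria as AC1–AC10, as in the spec):

for i in $(seq 1 10); do
  grep -qw "AC${i}" qa-artifacts/manual-tests-${FEATURE_SLUG}-${SESSION_DATE}.md \
    || echo "No manual test case references AC${i}"
done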

Stage 3: Generate Automated Tests

claude --print "
$(cat CLAUDE.md)

---

You are generating Playwright E2E tests. Read the existing tests in tests/e2e/ and
follow the exact same patterns, imports, and Page Object usage.

TEST PLAN (E2E scenarios only):
$(cat qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md)

SPEC:
$(cat qa-artifacts/session-spec.md)

CODE CHANGE:
$(cat qa-artifacts/session-diff-${FEATURE_SLUG}.txt)

EXISTING TEST PATTERNS (read these):
$(cat tests/e2e/*.spec.ts | head -100)

AVAILABLE PAGE OBJECTS:
$(ls tests/e2e/pages/)

Generate the complete test file for: tests/e2e/password-reset.spec.ts
Include all E2E scenarios from the test plan.
Every test must be immediately runnable without modification.
" > tests/e2e/password-reset.spec.ts

claude --print "
$(cat CLAUDE.md)

---

Generate Jest + Supertest API tests for the password reset flow.

TEST PLAN (API scenarios only):
$(cat qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md)

EXISTING API TEST PATTERNS:
$(cat tests/api/*.test.ts | head -100)

Generate the complete test file for: tests/api/password-reset.test.ts
" > tests/api/password-reset.test.ts

echo "E2E tests generated: $(grep -c 'test(' tests/e2e/password-reset.spec.ts)"
echo "API tests generated: $(grep -c 'test(' tests/api/password-reset.test.ts)"

Run immediately after generation:

npx playwright test tests/e2e/password-reset.spec.ts --reporter=line 2>&1 | tail -20

npx jest tests/api/password-reset.test.ts --verbose 2>&1 | tail -20

Review gate: If more than 2 tests fail due to structural errors (imports, undefined methods), the codebase context was insufficient. Add the missing context (the specific Page Object, the specific factory function) and re-generate. For application failures (the feature is broken), proceed to Stage 4 triage.
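One quick way to separate structural errors from application failures is a compile-only pass before triage (a sketch — assumes the repo's tsconfig covers the test directories):

npx tsc --noEmit 2>&1 | grep -E "tests/(e2e|api)/password-reset" \
  || echo "No compile errors in the generated test files"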

Stage 4: Triage Failures

npx playwright test tests/e2e/password-reset.spec.ts --reporter=json 2>/dev/null > /tmp/e2e-results.json || true
npx jest tests/api/password-reset.test.ts --json --outputFile=/tmp/api-results.json || true

claude --print "
$(cat CLAUDE.md)
$(cat qa-context/known-issues.md 2>/dev/null)

---

Classify each test failure.

E2E RESULTS:
$(cat /tmp/e2e-results.json | python3 -c 'import json,sys; d=json.load(sys.stdin); [print(spec["title"], "->", t.get("status"), "|", (t.get("errors") or [{}])[0].get("message", "")) for s in d.get("suites", []) for spec in s.get("specs", []) for t in spec.get("tests", []) if t.get("status") != "passed"]' 2>/dev/null || head -100 /tmp/e2e-results.json)

API RESULTS:
$(cat /tmp/api-results.json | python3 -c 'import json,sys; d=json.load(sys.stdin); [print(t["fullName"], ": FAIL |", (t.get("failureMessages") or [""])[0][:200]) for r in d.get("testResults", []) for t in r.get("assertionResults", []) if t.get("status") != "passed"]' 2>/dev/null || head -100 /tmp/api-results.json)

RECENT CODE CHANGES:
$(git diff origin/main...HEAD -- src/ | head -300)

For each failure:
1. APPLICATION_BUG / TEST_CODE_ERROR / ENVIRONMENT_ISSUE / FLAKY
2. Specific root cause
3. Fix recommendation
4. Is this blocking merge?
" > qa-artifacts/triage-${FEATURE_SLUG}-${SESSION_DATE}.md

cat qa-artifacts/triage-${FEATURE_SLUG}-${SESSION_DATE}.md
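For a quick tally of the triage outcome before reading it in detail (assumes the classification labels appear verbatim in the report):

for c in APPLICATION_BUG TEST_CODE_ERROR ENVIRONMENT_ISSUE FLAKY; do
  echo "${c}: $(grep -c "${c}" qa-artifacts/triage-${FEATURE_SLUG}-${SESSION_DATE}.md || true)"
done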

Stage 5: Exploratory Testing

claude --print "
$(cat CLAUDE.md)

---

Generate exploratory test charters for the password reset feature.

CODE CHANGE (primary risk signal):
$(cat qa-artifacts/session-diff-${FEATURE_SLUG}.txt)

TEST RESULTS (what scripted tests found):
$(cat qa-artifacts/triage-${FEATURE_SLUG}-${SESSION_DATE}.md)

TEST PLAN (what's been scripted):
$(cat qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md)

Generate 3-4 risk-based exploratory charters.
Focus on: state management, security edge cases, email flow behavior,
and adjacent features affected by session invalidation.

Available exploration time: 2 hours total.
" > qa-artifacts/exploratory-charters-${FEATURE_SLUG}-${SESSION_DATE}.md

echo "Charters generated. Review before executing:"
cat qa-artifacts/exploratory-charters-${FEATURE_SLUG}-${SESSION_DATE}.md

Execute the charters (this is the human step — no automation here):
- Open the application in your browser
- Execute each charter using the generated scope and oracles
- Capture session notes in qa-artifacts/session-notes-{charter-id}-{date}.md (a minimal skeleton is sketched below)
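A minimal note-taking skeleton you can create per charter (the file name matches the synthesis step below; the headings are a suggestion, not a required format):

cat > qa-artifacts/session-notes-EC-001-${SESSION_DATE}.md << 'EOF'
# Exploratory Session Notes - EC-001
Charter:
Time spent:
Observations:
Bugs / anomalies (with reproduction steps):
Questions for the team:
Follow-up test ideas:
EOF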

After each session, synthesize findings:

claude --print "
Convert these exploratory session notes to bug reports and test plan updates.

SESSION NOTES:
$(cat qa-artifacts/session-notes-EC-001-${SESSION_DATE}.md)

TEST PLAN:
$(cat qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md)

SPEC:
$(cat qa-artifacts/session-spec.md)

[Use the synthesis prompt format from Topic 5]
" >> qa-artifacts/triage-${FEATURE_SLUG}-${SESSION_DATE}.md

Stage 6: Generate the Coverage Report

claude --print "
$(cat CLAUDE.md)

---

Generate the final coverage report for this QA cycle.

TEST PLAN:
$(cat qa-artifacts/test-plan-${FEATURE_SLUG}-${SESSION_DATE}.md)

TRIAGE REPORT (automated + exploratory failures):
$(cat qa-artifacts/triage-${FEATURE_SLUG}-${SESSION_DATE}.md)

MANUAL TEST EXECUTION STATUS:
[List which manual tests were executed and their results — enter manually]

SPEC:
$(cat qa-artifacts/session-spec.md)

Generate the coverage report with:
1. AC coverage matrix (which ACs are covered, tested, and validated)
2. Test execution summary (automated and manual)
3. Bug status (blocking vs. non-blocking open issues)
4. Go / No-Go recommendation with explicit reasoning
5. Accepted risks (open issues you are knowingly shipping with)
6. Next steps (follow-up tickets, regression additions)
" > qa-artifacts/coverage-report-${FEATURE_SLUG}-${SESSION_DATE}.md

echo "Coverage report generated:"
cat qa-artifacts/coverage-report-${FEATURE_SLUG}-${SESSION_DATE}.md

Learning Tip: The first time you run this workflow end-to-end, it will take 3–4 hours for a mid-complexity feature. By the third time, the same feature complexity should take 90–120 minutes — because the persistent context improves with each cycle, the generated artifacts require less review and editing, and the workflow steps become routine. Time the workflow explicitly on your first three features to measure this improvement rate.
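A rough way to time a run end-to-end (a sketch using shell timestamps):

WORKFLOW_START=$(date +%s)
# ...run Stages 1-6...
echo "Workflow elapsed: $(( ($(date +%s) - WORKFLOW_START) / 60 )) minutes"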


How to review outputs at each stage and make go/no-go decisions?

Every stage produces an output that requires a review gate before proceeding. The review gates are the human-in-the-loop checkpoints that prevent the agentic loop from running on autopilot in ways that erode quality and trust.

Review Gate 1: Test Plan Approval

What to review: The test plan document.

Time budget: 15 minutes.

Review checklist:

□ Every AC has at least one corresponding test scenario
□ Risk classifications match your intuition about the feature
□ HIGH-risk scenarios are designated for automation or explicit manual coverage
□ Spec-implementation alignment gaps (if any) are understood and addressed
□ Entry criteria can actually be met before QA starts
□ Effort estimate is within sprint capacity

Decision: APPROVE (proceed to Stage 2), REVISE (fix specific issues and re-generate), or ESCALATE (bring spec ambiguities to product owner before proceeding).

Review Gate 2: Generated Test Quality

What to review: The generated test files (E2E and API) and manual test cases.

Time budget: 20 minutes.

Review checklist:

□ E2E tests run without import errors or TypeScript compilation errors
□ E2E tests use correct Page Object methods (no invented methods)
□ API tests target correct endpoints with correct HTTP methods
□ Manual test cases have specific test data (no "enter valid email" placeholders)
□ Test coverage maps correctly to HIGH-risk scenarios in the test plan
□ No tests are asserting generic expected results not tied to AC

Decision: APPROVE (run tests), FIX AND RE-RUN (fix structural errors in test code), or REGENERATE (context was insufficient — add missing context and re-generate).

Review Gate 3: Triage Classification

What to review: The triage report classifying all failures.

Time budget: 20 minutes.

Review checklist:

□ Each failure classification makes sense (you agree with APPLICATION_BUG vs. TEST_CODE_ERROR)
□ APPLICATION_BUG classifications have enough evidence to file a defect report
□ No APPLICATION_BUGs are being dismissed as test errors
□ Known pre-existing issues are correctly excluded
□ Severity classifications match your team's severity rubric

Decision: For each APPLICATION_BUG: BLOCK MERGE or DEFER (with explicit rationale). For TEST_CODE_ERRORs: FIX TEST. For ENVIRONMENT_ISSUEs: FIX ENVIRONMENT.

Review Gate 4: Exploratory Charter Prioritization

What to review: The generated exploratory charters before executing them.

Time budget: 10 minutes.

Review checklist:

□ Charter risk rationales make sense (the agent correctly identified risky areas)
□ Charter scope is bounded (not "explore the whole checkout flow")
□ Oracles are clear (you know what constitutes a bug vs. expected behavior)
□ Total estimated time fits your available exploration window

Decision: EXECUTE AS-IS, MODIFY (adjust scope or oracles), or SKIP (if test results from Stage 3 were clean and the risk is genuinely low).

Review Gate 5: Coverage Report and Ship Decision

What to review: The final coverage report.

Time budget: 15 minutes.

Review checklist:

□ All HIGH-risk ACs show test coverage
□ All blocking bugs have been fixed or explicitly deferred with tickets
□ Coverage report accurately reflects the test execution that occurred
□ Go/No-Go recommendation matches your team's quality bar
□ Accepted risks are documented and acknowledged by product owner

Decision: GO (approve merge), NO-GO (block merge, list specific blocking issues), or CONDITIONAL GO (approve merge with specific conditions and follow-up tickets).

Document the decision in a PR comment using the template from Topic 6.
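If your team uses the GitHub CLI, one way to post that comment (a sketch — assumes gh is installed and authenticated, and that you have first written the decision into a file using the Topic 6 template; the file name here is hypothetical):

# hypothetical file: write the go/no-go decision using the Topic 6 template first
gh pr comment --body-file qa-artifacts/ship-decision-${FEATURE_SLUG}-${SESSION_DATE}.md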

Learning Tip: Make the go/no-go decision explicit and document the reasoning every time — even when it is obviously GO. The documentation accumulates into a decision history that reveals patterns over time: which feature types consistently produce NO-GO decisions, which QA engineers consistently approve too aggressively, which test types consistently catch the bugs that would otherwise reach production. This history is invaluable for continuous process improvement and for building evidence-based arguments for QA investment.


What does the agentic workflow catch automatically vs. what still needs human judgment?

After running the full agentic QA workflow, it is important to have a clear-eyed understanding of what the system catches reliably without human intervention and what it cannot — or should not — handle autonomously. This is not a limitation to apologize for; it is the architecture of a trustworthy system.

What the Agentic Workflow Catches Automatically

Regression failures in existing code: When a code change breaks behavior that was working before, automated tests catch it immediately. The agentic loop runs these tests on every PR without human intervention and surfaces the failure with a root cause hypothesis. This is the highest-value autonomous capability.

Coverage gaps relative to the test plan: The coverage report automatically identifies which test plan scenarios were executed and which were not. If a developer shipped code for a scenario that was planned but not yet tested, the coverage report surfaces this gap automatically.

Spec-implementation alignment gaps: The dual-context planning prompt reliably detects when implemented code does not match the spec — scenarios specified but not implemented, or code changes with no corresponding spec requirement. This is valuable beyond testing: it surfaces product and development communication breakdowns.

Structural test code errors: When generated tests have wrong imports, missing methods, or incorrect assertions, the CI run catches this immediately. The AI triage layer distinguishes these from application bugs so they are not confused.

Known pre-existing issues: The known-issues.md persistent context prevents the system from re-reporting the same bugs in every CI run. This noise reduction is significant in older codebases with accumulated known issues.

Flaky test patterns across multiple runs: The multi-run comparison detects non-deterministic failures that single-run analysis misses. Flakiness is a structural quality issue that is easy to miss manually and reliably surfaced by the loop.

What Still Requires Human Judgment

The go/no-go ship decision: The coverage report makes a recommendation, but the final decision to approve a merge requires human accountability. The QA engineer has context the AI does not: upcoming release deadlines, the business impact of delays, the team's risk tolerance for specific types of accepted risk. This decision should never be delegated to the AI.

Ambiguous AC interpretation: When an acceptance criterion is ambiguous ("address validation must be fast"), only a human can determine whether the implemented behavior satisfies the intent. The AI will generate a test for what it infers, but if the inference is wrong, the test validates the wrong behavior.

Accessibility and usability evaluation: Automated tests verify whether elements are present and interactive. Whether those interactions meet accessibility standards for users with disabilities, or whether the UX is genuinely intuitive, requires human evaluation with real users or accessibility tools that go beyond what the agentic loop provides.

Security-critical code review: The agentic loop identifies security risk areas based on the diff and generates test scenarios for known vulnerability patterns. But a proper security review of authentication and payment code requires security-specialized human judgment, not just automated test coverage.

Novel failure patterns: The agentic loop is pattern-matching against known failure categories. A genuinely novel failure mode — one the system has never seen before — may be misclassified or missed entirely. Experienced QA engineers recognize novel failure patterns that don't fit expected categories. This is the unique value of the exploratory testing stage, but the exploration itself requires human intelligence to execute.

Business logic correctness at the domain level: A test can verify that an API returns status 200 and a JSON body. It cannot verify that the business logic is correct without knowing the business rules — rules that may live in a product manager's head, not in any spec document. Domain expert validation of complex business logic remains a human responsibility.

The judgment call on edge cases: When a test fails on a rare edge case that the team has already accepted as a known limitation, a human QA engineer decides whether to file a bug, add a disclaimer to the docs, or close without action. This triage of edge case priority is a judgment call that depends on knowing the product strategy, user impact estimates, and engineering cost — context that no persistent context file fully captures.

The Operating Model in Summary

┌──────────────────────────────────────────────────────────────────┐
│  AUTOMATED (no human required)                                   │
│  ─────────────────────────────                                   │
│  • Run regression tests on every PR                             │
│  • Classify failures by category                                │
│  • Generate test plans, test cases, scripts                     │
│  • Detect coverage gaps                                         │
│  • Surface spec-implementation misalignments                    │
│  • Generate risk-based exploratory charters                     │
│  • Synthesize exploratory findings into structured bug reports  │
└──────────────────────────────────────────────────────────────────┘

┌──────────────────────────────────────────────────────────────────┐
│  HUMAN JUDGMENT REQUIRED (AI assists, human decides)            │
│  ────────────────────────────────────────────────────           │
│  • Go/No-Go ship decisions                                      │
│  • Ambiguous AC interpretation                                  │
│  • Accessibility and UX quality evaluation                      │
│  • Security code review                                         │
│  • Novel failure pattern recognition                            │
│  • Business domain correctness validation                       │
│  • Edge case priority triage                                    │
│  • Persistent context maintenance and calibration               │
└──────────────────────────────────────────────────────────────────┘

This division of labor is the mature operating model of an agentic QA team. The AI handles the volume, repeatability, and synthesis tasks. The QA engineer handles the judgment, accountability, and domain-expertise tasks. Neither operates effectively without the other, and a team that understands this division ships with more speed and higher quality than either a fully autonomous or a purely manual approach can achieve alone.

Learning Tip: After every release, spend 10 minutes asking: "What slipped through the agentic workflow that reached production?" Document the answer. After six months, you will have a clear picture of the specific failure patterns that your workflow consistently misses — and those patterns are the highest-priority targets for improving your persistent context, adding new charter categories, or investing in additional test tooling. The agentic loop is not finished when you set it up; it improves continuously as you feed experience back into its configuration.