
Integrating automated E2E and API test generation

How to decide what to automate vs. keep manual using the AI-generated test plan?

One of the most consequential decisions in the agentic QA workflow is the automate vs. manual decision. Get it wrong in either direction and you pay: over-automating creates a brittle, expensive-to-maintain test suite; under-automating leaves regressions undetected in future sprints. The AI-generated test plan, used correctly, makes this decision systematic rather than intuitive.

The Four-Factor Automation Decision Framework

The agentic loop uses a four-factor framework to classify each test scenario at the planning stage:

Factor 1 — Repeatability: Can this exact scenario be run identically every time without human judgment? Scenarios with deterministic inputs and verifiable outputs are strong candidates for automation. Scenarios that require visual assessment, usability judgment, or business interpretation are poor candidates.

Factor 2 — Regression Risk: Will this behavior need to be re-validated on every subsequent PR? High-frequency regression risk scenarios (core happy paths, payment flows, authentication) justify automation cost. One-time scenarios (a specific edge case that will never change) may not.

Factor 3 — Stability: Is the UI or API contract for this scenario stable? Automating against a UI that changes every sprint creates more maintenance work than it saves. API tests against versioned contracts are far more stable.

Factor 4 — Implementation Depth: Is this testing application behavior (what the system does) or user experience (how it feels to use)? Application behavior automates well. User experience requires human evaluation.
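
To make the framework concrete, the four factors can be reduced to a simple scoring rubric. The sketch below is a minimal, illustrative TypeScript encoding; the type names, 1-to-5 scale, and thresholds are assumptions, not part of any generated artifact.

// Illustrative encoding of the four-factor framework; names and thresholds are assumptions.
type FactorScores = {
  repeatability: number;   // 1-5: deterministic inputs and verifiable outputs
  regressionRisk: number;  // 1-5: how often this behavior must be re-validated
  stability: number;       // 1-5: stability of the UI selectors or API contract
  depth: number;           // 1-5: application behavior (high) vs. user-experience feel (low)
};

type Decision = "Automate E2E" | "Automate API" | "Manual" | "Exploratory";

function classify(scores: FactorScores, apiLevel: boolean): Decision {
  // Anything that needs human judgment or is not deterministically repeatable stays manual.
  if (scores.repeatability <= 2 || scores.depth <= 2) return "Manual";
  // Stable, frequently re-validated behavior is worth the automation cost.
  if (scores.regressionRisk >= 4 && scores.stability >= 4) {
    return apiLevel ? "Automate API" : "Automate E2E";
  }
  // Everything else is left to human exploration in this simplified rubric.
  return "Exploratory";
}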

Adding the Decision Layer to the Planning Prompt

Embed this framework into the test planning prompt so the decision is made at plan generation time, not re-litigated scenario by scenario later:

claude --print "
[standard test planning prompt - see Topic 2]

### AUTOMATION DECISION RULES
For each test scenario, apply these rules to determine test type:

Automate as E2E if ALL are true:
- Has clear, deterministic pass/fail criteria
- Will need to run on every future PR touching this feature
- UI selectors are based on stable data-testid attributes (not CSS classes)
- No subjective visual or UX judgment required

Automate as API test if ANY are true:
- Testing business logic that lives at the API layer
- Testing data validation, error codes, or response schemas
- UI is unstable but API contract is versioned

Keep as Manual if ANY are true:
- Requires human judgment (visual design, UX feel, accessibility)
- One-time scenario unlikely to recur
- Would require complex test setup that exceeds the value of automation
- Mobile device-specific behavior requiring physical device

Mark as Exploratory if:
- The risk is high but the exact failure mode is unknown
- Adjacent to a change but not directly in its path
- Best discovered by human exploration rather than scripted steps
"

The Decision Matrix in Practice

The output test plan will include an explicit decision for each scenario. A well-generated decision matrix looks like this:

| Scenario | Risk | Factor Scores | Decision | Rationale |
|----------|------|---------------|----------|-----------|
| TS-003: Submit blocked on invalid zip | HIGH | R:5 Reg:5 Stab:5 Dep:5 | Automate E2E | Deterministic, high regression risk, stable selectors |
| TS-007: PO Box warning rendering | MEDIUM | R:5 Reg:3 Stab:3 Dep:2 | Manual | Visual rendering judgment required |
| TS-009: 500ms validation timing | MEDIUM | R:5 Reg:4 Stab:5 Dep:5 | Automate API | Performance SLA is measurable, API-level test preferred |
| TS-011: Mobile viewport overflow | HIGH | R:4 Reg:4 Stab:2 Dep:2 | Manual | Requires visual assessment on real device |
| TS-012: Adjacent nav menu behavior | LOW | R:2 Reg:5 Stab:5 Dep:5 | Automate E2E | Low risk but high regression value for quick check |
| TS-015: International address edge case | LOW | R:3 Reg:2 Stab:4 Dep:4 | Exploratory | Edge case, best explored manually |

The rationale column is critical — it makes the decision transparent and reviewable. A team that disagrees with a decision can override it with explicit reasoning rather than gut feeling.

Learning Tip: Track your automation decisions over two or three sprints and measure the stability of the decisions. If you find that 40% of your E2E-classified scenarios end up being manual because of selector instability, your stability scoring is miscalibrated. Adjust the framework: for your specific codebase, what is the actual stability rate of data-testid selectors vs. CSS class selectors? Use measured data to tune your automation decision rules.
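
A minimal sketch of that measurement, assuming you log each decision and its eventual outcome to a file (the decisions-log.json name and entry shape are hypothetical):

// Sketch: measure how often automation decisions were later reversed.
// Assumes a hypothetical decisions-log.json with entries like:
//   { "id": "TS-003", "planned": "Automate E2E", "actual": "Automate E2E" }
import { readFileSync } from "fs";

type DecisionEntry = { id: string; planned: string; actual: string };

const entries: DecisionEntry[] = JSON.parse(readFileSync("decisions-log.json", "utf8"));

const buckets = new Map<string, { total: number; reversed: number }>();
for (const entry of entries) {
  const bucket = buckets.get(entry.planned) ?? { total: 0, reversed: 0 };
  bucket.total += 1;
  if (entry.actual !== entry.planned) bucket.reversed += 1;
  buckets.set(entry.planned, bucket);
}

buckets.forEach(({ total, reversed }, planned) => {
  const rate = ((reversed / total) * 100).toFixed(0);
  console.log(`${planned}: ${reversed}/${total} reversed (${rate}%)`);
});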


How to trigger E2E and API test generation using combined codebase and spec context?

E2E and API test generation in the agentic loop is not a standalone task — it is triggered by the approved test plan and fed by three context sources: the spec, the code change, and the live codebase. Without all three, the generated tests will either fail to run (wrong imports, unknown utilities) or fail to test the right behavior (missing domain knowledge).

Context Preparation for Generation

Before running generation, prepare four context artifacts:

# 1. The approved test plan
TEST_PLAN="qa-artifacts/test-plan-checkout-validation-20240113.md"

# 2. The feature spec
SPEC="specs/checkout-address-validation.md"

# 3. The code change under test
git diff origin/main...HEAD > /tmp/feature-diff.txt

# 4. Codebase context: existing tests, Page Objects, and test data factories
cat tests/e2e/checkout.spec.ts > /tmp/existing-e2e-context.ts
cat tests/e2e/pages/CheckoutPage.ts >> /tmp/existing-e2e-context.ts
cat tests/fixtures/factories.ts >> /tmp/existing-e2e-context.ts

The codebase context (artifact 4) is the single biggest factor in whether the generated tests are immediately runnable. When the agent sees your actual Page Object Model, it generates tests that call your real methods rather than inventing methods that do not exist.
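
For orientation, the kind of Page Object the agent needs to see might look like the simplified sketch below. This is a hypothetical stand-in, not your actual tests/e2e/pages/CheckoutPage.ts, but it shows the methods and data-testid selectors that generated tests are expected to reuse.

// Hypothetical, simplified Page Object illustrating the context the agent reuses.
import { Page, Locator, expect } from "@playwright/test";

export class CheckoutPage {
  readonly zipInput: Locator;
  readonly submitButton: Locator;
  readonly validationError: Locator;

  constructor(private readonly page: Page) {
    // Stable data-testid selectors, never CSS classes.
    this.zipInput = page.getByTestId("checkout-zip");
    this.submitButton = page.getByTestId("checkout-submit");
    this.validationError = page.getByTestId("zip-validation-error");
  }

  async goto(): Promise<void> {
    await this.page.goto("/checkout");
  }

  async submitZip(zip: string): Promise<void> {
    await this.zipInput.fill(zip);
    await this.submitButton.click();
  }

  async expectValidationError(message: string): Promise<void> {
    await expect(this.validationError).toContainText(message);
  }
}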

The E2E Generation Prompt

claude --print "
You are a senior QA automation engineer generating Playwright E2E tests.

## TEST PLAN — SCENARIOS TO AUTOMATE
Extract all scenarios from the test plan below where Decision = 'Automate E2E':
---
$(cat $TEST_PLAN)
---

## SPEC CONTEXT
---
$(cat $SPEC)
---

## CODE CHANGE CONTEXT
---
$(cat /tmp/feature-diff.txt)
---

## CODEBASE CONTEXT — EXISTING TESTS AND PAGE OBJECTS
Study these files carefully. Your generated tests must follow these exact patterns:
---
$(cat /tmp/existing-e2e-context.ts)
---

## GENERATION RULES
1. Import from the same paths shown in the existing tests
2. Use the CheckoutPage Page Object — do NOT access DOM elements directly in tests
3. Use test data factories from tests/fixtures/factories.ts
4. Use data-testid selectors ONLY (no CSS classes, no XPath)
5. Every test must be independently runnable (no shared state between tests)
6. Use Playwright's built-in waitForResponse or locator.waitFor() — no hard sleeps
7. Each test must have: arrange (setup), act (action), assert (expectation)

## OUTPUT FORMAT
Generate complete, runnable TypeScript Playwright test file.
File should be placed at: tests/e2e/checkout-validation.spec.ts

Include:
- Proper imports
- Test suite description matching the feature
- beforeEach and afterEach hooks for setup/teardown
- All scenarios marked 'Automate E2E' in the test plan
- Comments above each test referencing the scenario ID and AC number
" > tests/e2e/checkout-validation.spec.ts

echo "Generated tests:"
grep -c "^  test(" tests/e2e/checkout-validation.spec.ts
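
For reference, a test generated under these rules should have roughly the shape of the sketch below. It assumes the hypothetical CheckoutPage shown earlier and a hypothetical buildAddress factory; your actual output will use whatever the codebase context contains.

// Illustrative shape of a generated test; the factory and Page Object methods are hypothetical.
import { test } from "@playwright/test";
import { CheckoutPage } from "./pages/CheckoutPage";
import { buildAddress } from "../fixtures/factories";

test.describe("Checkout address validation", () => {
  let checkout: CheckoutPage;

  test.beforeEach(async ({ page }) => {
    checkout = new CheckoutPage(page);
    await checkout.goto();
  });

  // TS-003: Submit blocked on invalid zip
  test("blocks submit on invalid zip", async () => {
    const address = buildAddress({ zip: "00000" });                 // arrange
    await checkout.submitZip(address.zip);                          // act
    await checkout.expectValidationError("Enter a valid zip code"); // assert
  });
});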

The API Test Generation Prompt

claude --print "
You are a QA automation engineer generating API tests using Supertest and Jest.

## TEST PLAN — SCENARIOS TO AUTOMATE
Extract all scenarios from the test plan below where Decision = 'Automate API':
---
$(cat $TEST_PLAN)
---

## API CONTEXT
Read the OpenAPI spec or infer endpoints from the code change:
---
$(cat api/openapi.yaml 2>/dev/null || cat /tmp/feature-diff.txt)
---

## EXISTING API TEST PATTERNS
---
$(cat tests/api/checkout.test.ts 2>/dev/null || echo 'No existing API tests found')
---

## GENERATION RULES
1. Use Supertest for HTTP requests, Jest for assertions
2. Test setup/teardown uses the test database (see tests/setup.ts)
3. Use factory functions for test data creation
4. Test HTTP status codes, response body schema, and error message text
5. Generate both positive path and explicit error case tests
6. Do not mock the validation service — call it via the test API instance

## OUTPUT FORMAT
Generate a complete Jest API test file for: tests/api/address-validation.test.ts
Include all API-level scenarios from the test plan.
" > tests/api/address-validation.test.ts

Validating That Generated Tests Actually Run

After generation, immediately run the tests and fix any structural errors:

npx playwright test tests/e2e/checkout-validation.spec.ts --reporter=list 2>&1 | head -100


npx jest tests/api/address-validation.test.ts --verbose 2>&1 | head -100

If more than 20% of generated tests fail to run due to structural errors (import errors, type errors, undefined methods), the codebase context provided to the agent was insufficient. The fix is to provide more context — specifically the exact file that contains the missing method or import.

Learning Tip: Run the generation prompt twice and compare the two outputs. They will often differ in how they handle test setup and teardown. The version with more explicit beforeEach/afterEach setup is generally more reliable because it makes state isolation explicit. Use the more explicit version and commit it. If both versions have the same setup pattern, the codebase context you provided was good enough that the agent converged on the correct approach.


How to run generated tests immediately and feed results back into the agentic loop?

Generated tests that sit unrun are no better than no tests. The agentic loop runs generated tests immediately after generation — in CI if the feature branch has been pushed, or locally if still in active development.

Local Immediate Validation

Before pushing to CI, run a local validation pass. This catches structural errors that would waste CI compute time:

npx playwright test tests/e2e/checkout-validation.spec.ts \
  --workers=4 \
  --reporter=line \
  2>&1 | tee /tmp/local-results.txt

# Parse the counts from Playwright's summary lines (e.g. "12 passed (30s)", "2 failed")
PASSED=$(grep -oE '[0-9]+ passed' /tmp/local-results.txt | tail -1 | awk '{print $1}')
FAILED=$(grep -oE '[0-9]+ failed' /tmp/local-results.txt | tail -1 | awk '{print $1}')
PASSED=${PASSED:-0}
FAILED=${FAILED:-0}
echo "Local run: $PASSED passed, $FAILED failed"

if [ "$FAILED" -gt "0" ]; then
  claude --print "
  Review these test failures from freshly generated tests.

  TEST FAILURE OUTPUT:
  ---
  $(cat /tmp/local-results.txt)
  ---

  GENERATED TEST FILE:
  ---
  $(cat tests/e2e/checkout-validation.spec.ts)
  ---

  For each failure:
  1. Is this a test generation error (wrong selector, wrong method call, wrong assertion)?
  2. Is this a real application bug (the feature doesn't work as specified)?
  3. Is this a test environment issue (service not running, missing test data)?

  For generation errors: provide the exact fix needed in the test file.
  For application bugs: describe the bug with evidence.
  For environment issues: describe the required environment state.
  "
fi

CI Integration for Immediate Execution

After pushing the generated tests, CI picks them up automatically. Structure your CI pipeline to run new generated tests in a dedicated job:

name: Run Generated Tests

on:
  pull_request:
    paths:
      - 'tests/e2e/**'
      - 'tests/api/**'

jobs:
  run-e2e-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Setup Node and install dependencies
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci

      - name: Install Playwright browsers
        run: npx playwright install --with-deps chromium

      - name: Run E2E tests
        run: |
          npx playwright test \
            --reporter=json,html \
            --output=test-results/ \
            2>&1 | tee e2e-run-output.txt
        continue-on-error: true  # Don't fail job — we want to capture and analyze

      - name: Run API tests
        run: |
          npx jest tests/api/ \
            --json --outputFile=api-test-results.json \
            2>&1 | tee api-run-output.txt
        continue-on-error: true

      - name: AI results interpretation
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npm install -g @anthropic-ai/claude-code
          claude --print "
          Analyze these CI test results and produce a QA status report.

          E2E RESULTS:
          $(cat e2e-run-output.txt | tail -100)

          API RESULTS:
          $(cat api-run-output.txt | tail -100)

          TEST PLAN:
          $(cat qa-artifacts/test-plan-*.md | tail -200)

          Report must include:
          1. Pass/fail counts and percentages
          2. For each failure: brief root cause classification (app bug / test error / env issue)
          3. Coverage delta: what scenarios from the test plan are now covered
          4. Risk assessment: are any HIGH-risk scenarios still uncovered or failing?
          5. Recommended next step: fix tests, fix app, or proceed to exploratory
          " > ci-qa-report.md
          cat ci-qa-report.md

      - name: Post results to PR
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const report = fs.readFileSync('ci-qa-report.md', 'utf8');
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body: `## CI Test Results\n\n${report}`
            });

      - name: Upload test artifacts
        uses: actions/upload-artifact@v4
        with:
          name: test-results-${{ github.run_id }}
          path: |
            test-results/
            ci-qa-report.md
            e2e-run-output.txt
            api-test-results.json

Feeding CI Results Back into the Loop

The CI results feed back into the agentic loop in two ways:

Immediate feedback: Failures trigger the bug triage pipeline (Stage 6). The CI results file becomes the input to the triage prompt, which classifies failures and generates bug reports.

Loop update: The coverage report for Stage 7 reads the CI results directly. The QA engineer does not manually populate the coverage report — the agent reads ci-qa-report.md and the raw results files to produce the consolidated coverage picture.

cp ci-qa-report.md qa-artifacts/ci-results-$(git rev-parse --short HEAD).md

git add qa-artifacts/ci-results-*.md
git commit -m "ci: add test results for $(git rev-parse --short HEAD)"

Learning Tip: When generated tests fail in CI, resist the urge to immediately fix the test code. First run the AI triage prompt to classify each failure. About 30% of failures in freshly generated tests are application bugs that the test correctly identified — if you fix the test to pass, you are suppressing a real defect. The triage step takes two minutes and potentially saves you from a production incident.


How to commit and maintain AI-generated automated tests as part of the workflow?

AI-generated tests become team assets only when they are treated with the same care as hand-written tests: reviewed, committed with clear attribution, and maintained as the application evolves. The agentic loop defines a specific commit and maintenance workflow to ensure generated tests are first-class citizens in the repository.

The Commit Convention for Generated Tests

Use a consistent commit message format that identifies AI-generated tests and their context:

git add tests/e2e/checkout-validation.spec.ts tests/api/address-validation.test.ts

git commit -m "$(cat <<'EOF'
test(checkout): add AI-generated E2E and API tests for address validation

Generated via agentic QA loop from:
- Test plan: qa-artifacts/test-plan-checkout-validation-20240113.md
- Spec: specs/checkout-address-validation.md
- Diff: origin/main...feature/checkout-address-validation

Coverage:
- E2E: TC-TS-001 through TC-TS-006 (HIGH risk scenarios)
- API: TC-TS-009 (500ms performance SLA), TC-TS-010 through TC-TS-012

Human review: Approved by [QA engineer name]
AI model: claude-opus-4-6
EOF
)"

The commit message records: what was generated, from what input, what it covers, who approved it, and which model produced it. This creates an audit trail for every test in the suite.

The CLAUDE.md Maintenance Section

Add a section to your project's CLAUDE.md that documents the generated test files and their generation context:

## AI-Generated Test Files

### E2E Tests
| File | Generated | From | Coverage |
|------|-----------|------|----------|
| tests/e2e/checkout-validation.spec.ts | 2024-01-13 | test-plan-checkout-validation-20240113.md | TS-001–TS-006 |
| tests/e2e/payment-flow.spec.ts | 2024-01-08 | test-plan-payment-flow-20240108.md | TS-001–TS-012 |

### API Tests
| File | Generated | From | Coverage |
|------|-----------|------|----------|
| tests/api/address-validation.test.ts | 2024-01-13 | test-plan-checkout-validation-20240113.md | TS-009–TS-012 |

### Maintenance Notes
- When a generated test fails after an unrelated code change, run the update prompt before modifying the test manually
- Page Object changes require regenerating the tests that use the modified POM
- Do not refactor AI-generated tests without updating the generation context files
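
The second maintenance note above is straightforward to operationalize. A minimal sketch, assuming the conventional layout used in this chapter (spec files in tests/e2e/ importing Page Objects from tests/e2e/pages/), that lists the spec files importing a changed Page Object:

// Sketch: list generated spec files that reference a changed Page Object.
// Assumes spec files live in tests/e2e/ and reference the Page Object by its class/file name.
import { readdirSync, readFileSync } from "fs";
import { join, basename } from "path";

const changedPom = process.argv[2] ?? "CheckoutPage.ts"; // e.g. taken from `git diff --name-only`
const pomName = basename(changedPom, ".ts");
const e2eDir = "tests/e2e";

const affected = readdirSync(e2eDir)
  .filter((file) => file.endsWith(".spec.ts"))
  .filter((file) => readFileSync(join(e2eDir, file), "utf8").includes(pomName));

console.log(`Spec files importing ${pomName} (candidates for regeneration):`);
affected.forEach((file) => console.log(`  ${join(e2eDir, file)}`));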

Maintaining Generated Tests After Application Changes

When the application changes in a way that breaks generated tests, use a targeted update prompt rather than manually editing the test:

BROKEN_TEST="tests/e2e/checkout-validation.spec.ts"
FAILING_TESTS="TC-TS-003, TC-TS-005"

claude --print "
These generated E2E tests are failing after a recent code change.
Failing scenarios: $FAILING_TESTS

FAILING TEST FILE:
---
$(cat $BROKEN_TEST)
---

RECENT CODE CHANGE THAT BROKE THEM:
---
$(git diff HEAD~1 HEAD -- src/components/checkout/)
---

TEST FAILURE OUTPUT:
---
$(npx playwright test $BROKEN_TEST --reporter=list 2>&1 | grep -A 5 'Error\|Failed')
---

CURRENT PAGE OBJECT (may have changed):
---
$(cat tests/e2e/pages/CheckoutPage.ts)
---

Your task:
1. Identify the exact cause of each failure (selector changed? method signature changed? behavior changed?)
2. Generate the minimal update to the test file to fix each failure
3. Do NOT change what is being tested — only fix the test to work with the updated application
4. If a test is failing because the application behavior changed (not just the implementation),
   flag it as a potential behavioral regression rather than a test fix.

Output: a list of specific changes to make to the test file, with line numbers.
"

This approach preserves the test's intent while fixing its implementation — the critical distinction between "fixing a broken test" and "suppressing a real failure."

Learning Tip: Create a rule in your team's working agreement: AI-generated tests require one human reviewer before merge, the same as production code. The reviewer's job is not to verify correctness line by line — it is to confirm that the tests actually test what the test plan intended and that no real failures are being masked. A 10-minute review of a generated test file at merge time is a much better investment than discovering a masked defect in production two sprints later.