
Continuous feedback, autonomous execution, and regression analysis

How does autonomous test execution work and how do results surface to the team?

Autonomous test execution is the stage where the agentic QA loop runs without a QA engineer actively driving it. Tests run on a schedule or on a trigger, results are interpreted by an AI agent, and the team receives a synthesized report rather than raw test output. Understanding how this machinery works — and where it can fail — is critical for configuring it reliably.

The Autonomous Execution Architecture

Trigger Event
     │ (PR push, scheduled cron, manual dispatch)
     ▼
┌─────────────────────────────────────────────────────┐
│  CI/CD Pipeline                                     │
│                                                     │
│  1. Checkout code at trigger commit                 │
│  2. Install dependencies and test tooling           │
│  3. Load persistent context (CLAUDE.md + qa-context)│
│  4. Execute automated test suite                    │
│     ├── Playwright E2E (all feature specs)          │
│     ├── Jest API tests                              │
│     └── Unit tests (if included)                    │
│  5. Capture raw results (JSON + logs)               │
│  6. Run AI interpretation layer                     │
│  7. Generate summary report                         │
│  8. Surface to team via PR comment / Slack / email  │
└─────────────────────────────────────────────────────┘
     │
     ▼
Team receives: structured report, not raw test output

The AI interpretation layer (step 6) is what separates autonomous execution from traditional CI. Without it, the team receives a wall of raw test output — pass counts, stack traces, timing data — and must manually synthesize it into actionable information. With it, the team receives a structured report that classifies each failure, identifies patterns, and recommends next steps.
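
A minimal sketch of that interpretation step in isolation (file names and prompt wording here are illustrative; the scheduled workflow below shows the fuller version):

npx playwright test --reporter=json 2>/dev/null > raw-results.json || true

# Feed the persistent context plus the raw results to the agent and keep its synthesis
claude --print "
$(cat CLAUDE.md)

Interpret this test run for the team: classify each failure, flag likely flaky tests,
and list concrete next steps.

$(tail -c 20000 raw-results.json)
" > interpreted-report.md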

Configuring the Autonomous Execution Schedule

Set up a scheduled run that executes the full test suite daily against the main branch, independently of PR activity:

name: Scheduled Regression — Full Suite

on:
  schedule:
    - cron: '0 6 * * 1-5'    # 6 AM UTC, weekdays
  workflow_dispatch:           # Manual trigger

jobs:
  full-regression:
    runs-on: ubuntu-latest
    timeout-minutes: 60

    steps:
      - uses: actions/checkout@v4

      - name: Setup and install
        uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm ci
      - run: npx playwright install --with-deps chromium firefox

      - name: Run full E2E suite
        run: |
          npx playwright test \
            --reporter=json,html \
            --workers=4 \
            2>&1 | tee e2e-output.txt
        continue-on-error: true  # Capture failures without stopping the job
        env:
          BASE_URL: ${{ secrets.STAGING_URL }}
          TEST_USER_TOKEN: ${{ secrets.TEST_USER_TOKEN }}

      - name: Run API test suite
        run: |
          npx jest tests/api/ \
            --forceExit \
            --json --outputFile=api-results.json \
            --verbose 2>&1 | tee api-output.txt
        continue-on-error: true
        env:
          API_BASE_URL: ${{ secrets.STAGING_API_URL }}

      - name: Generate AI interpretation
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: |
          npm install -g @anthropic-ai/claude-code

          # Load persistent context
          cat CLAUDE.md qa-context/*.md > /tmp/persistent-context.md

          claude --print "
          $(cat /tmp/persistent-context.md)

          ---

          You are interpreting the results of a scheduled full regression run.
          Date: $(date -u +%Y-%m-%d)
          Branch: main

          E2E RESULTS:
          $(cat e2e-output.txt | tail -200)

          API RESULTS:
          $(cat api-output.txt | tail -200)

          Generate a daily regression report with:
          1. Executive summary: pass/fail/skip counts, overall health (GREEN/AMBER/RED)
          2. New failures since last run (compare with known baseline if available)
          3. Persistent failures (failing for 2+ consecutive runs)
          4. Flaky tests detected (passed on retry)
          5. Coverage health: any test coverage regressions
          6. Action required: specific issues the team must address today
          7. Trend: is the test suite health improving, stable, or degrading?

          Format as: ## Daily Regression Report — {date}
          " > daily-regression-report.md

          cat daily-regression-report.md

      - name: Upload artifacts
        uses: actions/upload-artifact@v4
        with:
          name: regression-results-${{ github.run_id }}   # with: fields do not run shell substitution
          path: |
            daily-regression-report.md
            e2e-output.txt
            api-results.json
            playwright-report/

      - name: Prepare Slack summary
        id: slack_summary
        if: always()
        run: |
          # with: fields cannot run $(cat ...), so expose the report head as a step output
          {
            echo "summary<<SUMMARY_EOF"
            head -20 daily-regression-report.md 2>/dev/null || echo "Report generation failed"
            echo "SUMMARY_EOF"
          } >> "$GITHUB_OUTPUT"

      - name: Post to Slack
        if: always()
        uses: slackapi/slack-github-action@v1
        with:
          payload: |
            {
              "text": "Daily Regression Report",
              "blocks": [
                {
                  "type": "section",
                  "text": {
                    "type": "mrkdwn",
                    "text": ${{ toJSON(steps.slack_summary.outputs.summary) }}
                  }
                }
              ]
            }
        env:
          SLACK_WEBHOOK_URL: ${{ secrets.SLACK_QA_WEBHOOK }}

What the Team Sees: Report Design

The report the team receives should be scannable in 60 seconds. Design the AI report output with this constraint in mind:

## Daily Regression Report — 2024-01-15

### Health Status: AMBER

**Summary**: 127 tests run. 124 passed (97.6%). 3 failed. 2 flaky.

---

### Failures Requiring Action

**[NEW FAILURE]** `checkout-validation.spec.ts — TC-TS-003`  
Error: Element `[data-testid="zip-error"]` not found  
Hypothesis: Selector change in CheckoutForm.tsx (PR #247 merged yesterday)  
Action needed: QA or dev review — likely selector update required  

**[PERSISTENT — Day 3]** `payment-flow.spec.ts — TC-PAY-019`  
Error: Timeout after 30s waiting for confirmation page  
Hypothesis: Intermittent payment gateway latency in staging  
Action needed: Escalate to infra team if persists tomorrow  

---

### Flaky Tests (Investigate, Not Urgent)

`auth.spec.ts — TC-AUTH-007` — Passed on retry (2nd attempt)  
Hypothesis: Race condition in token refresh  
Action needed: Add to flaky test backlog  

---

### No Action Needed
- 124 tests passing consistently
- No coverage regression detected
- Performance metrics within baseline

### Next Steps
1. Fix selector in checkout-validation spec (15 min, dev or QA)
2. Escalate payment gateway timeout to infra if it occurs tomorrow
3. Schedule flaky test investigation for TC-AUTH-007 (add to sprint backlog)

This format delivers: the health status, the specific failures with action owners, and explicit next steps — all in under one minute of reading time.

Learning Tip: Track your team's time-to-action on daily regression reports. The goal is that every failure flagged as "Action needed" is triaged within four hours of the report appearing. If failures are going unaddressed for days, the report format is not compelling enough, the right people are not receiving it, or the action items are too vague. Fix the report format before assuming the team is ignoring it.


How to use AI to interpret test results — failures, flakiness, and coverage deltas?

Raw test output is data. Interpreted test output is intelligence. The interpretation layer is where the agentic loop converts pass/fail counts and stack traces into the risk assessments, root cause hypotheses, and priority rankings that a QA engineer would produce manually.

Failure Interpretation: Root Cause Classification

Every test failure falls into one of four categories, and the category determines the correct response:

| Category | Definition | Correct Response |
| --- | --- | --- |
| Application bug | Test is correct; application behavior is wrong | File bug report, block merge |
| Test code error | Application is correct; test has wrong assertion/selector | Fix the test, re-run |
| Environment issue | Test and app are correct; test environment is broken | Fix environment, re-run |
| Flaky failure | Non-deterministic failure, not reproducible consistently | Investigate timing/state |

The AI classification prompt for failure triage:

claude --print "
Classify each test failure in the following results.

PERSISTENT CONTEXT:
$(cat CLAUDE.md)
$(cat qa-context/known-issues.md)
$(cat qa-context/known-flaky-tests.md)

---

TEST FAILURE OUTPUT:
$(cat test-results/failures.txt)

RECENT CODE CHANGES (last 48 hours):
$(git log --oneline -20 --since='48 hours ago')
$(git diff HEAD~5 HEAD -- src/ | head -300)

CURRENT TEST FILES:
$(cat tests/e2e/checkout-validation.spec.ts)

For each failure, provide:
1. Failure ID: [test name + line]
2. Category: APPLICATION_BUG / TEST_CODE_ERROR / ENVIRONMENT_ISSUE / FLAKY
3. Evidence for this classification
4. Is this a known issue? (check qa-context/known-issues.md)
5. Specific fix recommendation

Output as a classified failure report, grouped by category.
"

Flakiness Detection and Analysis

Flaky tests are identified by comparing results across multiple runs:

# Run the suite three times, reducing each JSON report to one "title: status" line per test
for run in 1 2 3; do
  npx playwright test --reporter=json 2>/dev/null > "run-${run}-results.json" || true
  python3 - "run-${run}-results.json" <<'PYEOF' > "run-${run}-summary.txt"
import json, sys
data = json.load(open(sys.argv[1]))
def walk(suite):
    for spec in suite.get("specs", []):
        for test in spec.get("tests", []):
            print(f"{spec['title']}: {test.get('status')}")
    for child in suite.get("suites", []):
        walk(child)
for s in data.get("suites", []):
    walk(s)
PYEOF
done

claude --print "
Compare these three consecutive test run results and identify flaky tests.

RUN 1 RESULTS:
$(cat run-1-summary.txt)

RUN 2 RESULTS:
$(cat run-2-summary.txt)

RUN 3 RESULTS:
$(cat run-3-summary.txt)

For each test that has different results across runs:
1. Is it CONSISTENTLY_FLAKY (different results in 2+ of 3 runs)?
2. Is it OCCASIONALLY_FLAKY (different in 1 of 3 runs)?
3. What is the most likely cause? (Timing, shared state, network, test order dependency)
4. Recommend: FIX_NOW / INVESTIGATE / QUARANTINE

Also: are there any tests that consistently fail? Those are not flaky — they are broken.
"

Coverage Delta Analysis

Coverage delta analysis identifies when test coverage decreases across commits. This is a regression signal independent of test failures:

# Reduce the JSON report to headline counts (the Playwright JSON reporter counts passes as "expected")
npx playwright test --reporter=json 2>/dev/null | \
  python3 -c 'import json,sys; s=json.load(sys.stdin).get("stats",{}); print("Expected:", s.get("expected",0), "Unexpected:", s.get("unexpected",0), "Flaky:", s.get("flaky",0), "Skipped:", s.get("skipped",0))' > current-coverage.txt

claude --print "
Analyze this coverage delta between the current run and the baseline.

BASELINE (from last week's main branch run):
$(cat qa-artifacts/baseline-coverage-2024-01-08.txt 2>/dev/null || echo 'No baseline available')

CURRENT COVERAGE:
$(cat current-coverage.txt)

CURRENT TEST PLAN:
$(cat "$(ls -t qa-artifacts/test-plan-*.md 2>/dev/null | head -1)")

Identify:
1. Which test scenarios from the test plan are no longer being tested?
2. Is the coverage decrease intentional (tests deleted) or a regression?
3. Which modules have the largest coverage decrease?
4. Risk assessment: does this coverage decrease represent a risk to the current release?

Output: a coverage delta report with specific actions.
"

Learning Tip: Set up a coverage baseline file that is updated weekly on the main branch, not on every commit. Using weekly baselines prevents false alarms from short-term coverage fluctuations (a test is temporarily disabled for a refactor, for example). Compare against the weekly baseline to identify genuine coverage regressions while tolerating short-term fluctuations.
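
A minimal sketch of refreshing that weekly baseline, assuming the current-coverage.txt produced above and the qa-artifacts/ layout used elsewhere in this section:

# Run on the main branch once a week (for example, from the Monday baseline job)
cp current-coverage.txt "qa-artifacts/baseline-coverage-$(date +%Y-%m-%d).txt"
git add qa-artifacts/baseline-coverage-*.txt
git commit -m "chore: update weekly coverage baseline $(date +%Y-%m-%d)" || true
git push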


How to use AI to distinguish new failures from pre-existing issues?

The most common complaint about CI test reports is that they surface pre-existing issues alongside new failures, making it impossible to quickly identify what the current PR actually broke. The agentic loop solves this with an explicit new-vs-pre-existing classification step.

The Classification Data Requirements

To classify failures accurately, the agent needs two inputs:
1. The current failure set (from the CI run on this PR)
2. The historical failure record (what was failing before this PR)

The historical failure record is built from two sources:

Source 1 — known-issues.md: Manually maintained list of accepted pre-existing issues that have not been fixed yet.

Source 2 — baseline run results: Automated record of which tests were failing on the main branch before this PR was created.

Building the Baseline Record

name: Update Test Failure Baseline

on:
  push:
    branches: [main]
  schedule:
    - cron: '0 5 * * 1'  # Every Monday at 5 AM UTC

jobs:
  update-baseline:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npx playwright install --with-deps chromium

      - name: Run tests and capture baseline
        run: |
          npx playwright test --reporter=json 2>/dev/null > baseline-results.json || true

          # Extract just the failing tests for the baseline file
          cat baseline-results.json | python3 -c "
          import json, sys
          data = json.load(sys.stdin)
          failures = []
          def extract_failures(suite):
              for spec in suite.get('specs', []):
                  for test in spec.get('tests', []):
                      # The JSON reporter marks passing tests as 'expected'
                      if test.get('status') not in ('expected', 'skipped', 'flaky'):
                          failures.append(f\"{spec['title']}: {test['status']}\")
              for s in suite.get('suites', []):
                  extract_failures(s)
          for suite in data.get('suites', []):
              extract_failures(suite)
          print('\n'.join(failures))
          " > qa-artifacts/baseline-failures-$(date +%Y%m%d).txt

      - name: Commit baseline
        run: |
          git config user.name "QA Bot"
          git config user.email "[email protected]"
          git add qa-artifacts/baseline-failures-*.txt
          git commit -m "chore: update test failure baseline $(date +%Y%m%d)" || true
          git push

The New-vs-Pre-Existing Classification Prompt

On each PR, compare current failures against the baseline:
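
The prompt reads a current-pr-failures.txt file; a minimal sketch of producing it from the PR's own test run, reusing the same JSON-to-text reduction as the baseline job (file names are assumptions):

# Run the suite against the PR branch and keep one line per failing test
npx playwright test --reporter=json 2>/dev/null > pr-results.json || true
python3 - pr-results.json <<'PYEOF' > current-pr-failures.txt
import json, sys
data = json.load(open(sys.argv[1]))
def walk(suite):
    for spec in suite.get("specs", []):
        for test in spec.get("tests", []):
            # The JSON reporter marks passing tests as "expected"
            if test.get("status") not in ("expected", "skipped"):
                print(f"{spec['title']}: {test.get('status')}")
    for child in suite.get("suites", []):
        walk(child)
for s in data.get("suites", []):
    walk(s)
PYEOF

With the current failures and the baseline in hand, run the classification prompt: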

claude --print "
You are classifying test failures as new (introduced by this PR) or pre-existing (existed before this PR).

CURRENT PR FAILURES:
$(cat current-pr-failures.txt)

BASELINE FAILURES (before this PR):
$(cat qa-artifacts/baseline-failures-latest.txt 2>/dev/null || echo 'No baseline available — treat all failures as potentially new')

KNOWN ISSUES (accepted pre-existing problems):
$(cat qa-context/known-issues.md)

RECENT COMMITS IN THIS PR:
$(git log origin/main..HEAD --oneline)

For each current failure, classify as:
- NEW_FAILURE: Not in baseline, likely caused by this PR
- PRE_EXISTING: Present in baseline before this PR  
- KNOWN_ISSUE: Listed in known-issues.md
- UNCERTAIN: Cannot determine from available context

For NEW_FAILURE classifications:
- Which specific change in the PR is the likely cause?
- Is this blocking (should prevent merge) or non-blocking?

For PRE_EXISTING/KNOWN_ISSUE classifications:
- Note that these are not regressions caused by this PR

Output format:
## Failure Classification Report

### New Failures (introduced by this PR)
[list with root cause hypothesis]

### Pre-Existing Failures (not caused by this PR)  
[list — these should not block this PR]

### Blocking Recommendation
[Should this PR be blocked? Why or why not?]
"

This classification output is what the team needs to make a merge decision. "3 tests failing" is ambiguous. "1 new failure likely caused by this PR (blocking), 2 pre-existing failures not caused by this PR (non-blocking)" is an actionable go/no-go signal.
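
One way to surface this verdict where the merge decision happens is to post it back to the PR; a sketch assuming the classification output was saved as failure-classification.md and the GitHub CLI is authenticated in the runner (both are assumptions):

# Post the classification report on the PR under review (PR_NUMBER supplied by the CI trigger)
gh pr comment "$PR_NUMBER" --body-file failure-classification.md

# Optional heuristic: fail the job when the report contains a NEW_FAILURE classification
if grep -q "NEW_FAILURE" failure-classification.md; then
  echo "New failures attributed to this PR. Review before merging."
  exit 1
fi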

Learning Tip: Maintain a qa-artifacts/baseline-failures-latest.txt symlink that always points to the most recent baseline file. This prevents every CI job from needing to know the exact date of the last baseline update. When the baseline is updated, update the symlink. Any CI job that reads baseline-failures-latest.txt automatically uses the current baseline without configuration changes.


How to close the agentic loop by feeding CI results back into planning?

The loop is not closed until CI results inform the next iteration of planning. Without this feedback step, the agentic QA loop is a waterfall with extra steps — context flows forward through the stages but never feeds back to improve the plan for the next cycle. Closing the loop is what makes the system self-improving.

The Feedback Data That Closes the Loop

Three categories of CI results feed back into planning:

Category 1 — New failures and their root causes: Each new failure reveals a risk area that the test plan either underestimated (risk was classified MEDIUM but produced a real bug) or missed entirely (no scenario covered this path). This information updates the risk taxonomy in the persistent context.

Category 2 — Persistent flaky tests: Flaky tests reveal non-deterministic behavior — timing dependencies, state leakage, race conditions. These are structural quality risks that should be tracked in the test plan for future features that touch the same code areas.

Category 3 — Coverage gaps confirmed by failures: When an exploratory finding reveals a bug that no scripted test covered, that gap must be documented and addressed before the next cycle on a related feature.

The Feedback Integration Prompt

At the end of each feature's QA cycle, run the feedback integration prompt to update the persistent context:

claude --print "
You are a QA lead closing the agentic QA loop by extracting lessons from this cycle.

FEATURE: [feature name]
CYCLE DATES: [start] to [end]

TEST PLAN (what was planned):
$(cat qa-artifacts/test-plan-checkout-validation-20240113.md)

FINAL COVERAGE REPORT (what was found):
$(cat qa-artifacts/coverage-report-checkout-validation-20240115.md)

CI RESULTS SUMMARY:
$(cat qa-artifacts/ci-results-latest.md)

BUGS FOUND (all sources):
$(cat qa-artifacts/bugs-checkout-validation-*.md)

Your task:

## 1. Risk Model Calibration
- Which risks were underestimated? (MEDIUM risk that produced HIGH impact bugs)
- Which risks were overestimated? (HIGH risk that produced no issues)
- Recommended update to the risk weights in CLAUDE.md:

## 2. Coverage Pattern Analysis
- Which test types found the most bugs? (E2E / API / Manual / Exploratory)
- Which areas of the codebase were highest-yield for bugs?
- What test scenarios should be added to the standard test plan template?

## 3. Persistent Context Updates Needed
List specific additions to make to:
- CLAUDE.md (risk weights, new conventions)
- qa-context/known-issues.md (new pre-existing issues to document)
- qa-context/known-flaky-tests.md (new flaky tests discovered)
- qa-context/risk-taxonomy.md (new risk categories)

## 4. Process Improvements
What should change in the agentic loop itself for the next cycle?

Output: A retrospective report with specific, actionable context file updates.
"

Applying the Updates

The retrospective report's "Persistent Context Updates Needed" section is a direct input to maintaining the context files. Apply the updates immediately after each feature cycle:

# Example: record a newly accepted pre-existing issue surfaced during this cycle
echo "## Known Issue — [date]
**ID**: KI-$(date +%Y%m%d)-001
**Description**: Payment gateway timeout in staging under load — not reproducible in prod
**Status**: Accepted — tracked in infra backlog
**Do not report as**: New test failure
" >> qa-context/known-issues.md

# Commit the context updates (apply any CLAUDE.md risk-weight edits from the retrospective first)
git add CLAUDE.md qa-context/known-issues.md
git commit -m "chore: update QA agent persistent context after checkout validation cycle"

The Compounding Value Over Time

Each closed loop makes the next iteration better in measurable ways:

After 1 cycle: Basic risk weights calibrated for the project's specific technology choices.
After 3 cycles: Domain-specific risk patterns established. Agent consistently identifies HIGH-risk changes that match team intuition.
After 6 cycles: Coverage patterns established. Agent generates test plans that require minimal human correction before approval.
After 12 cycles: The agent's test plans, risk assessments, and bug classifications are accurate enough to dramatically reduce QA review time. The team's value shifts from doing the work to validating the AI's work — a fundamentally different and more leveraged operating model.

Learning Tip: Schedule a monthly "context engineering session" — 60 minutes to review the feedback data from the past month's agentic QA cycles and apply systematic updates to the persistent context. This session is an investment in the compounding value described above. Teams that skip this maintenance step find their agentic loop gradually drifting — the context files become stale, agent accuracy decreases, and trust erodes. Teams that treat context maintenance as a standing engineering practice consistently report improved output quality over time.