Iterating and refining AI outputs

How to Recognize a Poor AI Output — Vague, Incomplete, or Off-Target?

The ability to quickly and accurately evaluate AI output is as important as the ability to write good prompts. If you can't recognize a bad output, you'll either discard good work (overcritical) or ship poor test cases because you rubber-stamped AI output without review (undercritical). The goal is calibrated judgment — fast, accurate, evidence-based.

The Three Failure Modes

Poor AI output falls into three categories, each with distinct symptoms.

Failure Mode 1: Vague
The output contains correct-sounding language but no specific, verifiable assertions. It reads like a description of testing strategy rather than executable test scenarios.

Symptoms:
- Test steps say "verify the response is correct" instead of "verify response status is 201"
- Scenarios say "test error handling" instead of "enter a password shorter than 8 characters and verify the error message reads 'Password must be at least 8 characters'"
- Expected results say "the user should see an appropriate message" instead of naming the message
- Risk assessments say "this area is important to test" without specifying which inputs or conditions create risk

A vague output is not usable as a test plan — it requires a human to convert generalities into specifics, which defeats the purpose of AI generation.
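
To make the contrast concrete, here is what the same scenario looks like once the vagueness reaches test code. This is a minimal sketch assuming a hypothetical Jest + Supertest setup; the route, the app import, and the 400 status code are illustrative assumptions, and only the error text comes from the example above:

```ts
import request from 'supertest';

// Hypothetical app import and /users route, for illustration only.
import { app } from '../src/app';

test('rejects passwords shorter than 8 characters', async () => {
  const response = await request(app)
    .post('/users')
    .send({ email: 'user@example.com', password: 'short1' });

  // Vague (what "verify the response is correct" turns into):
  // expect(response.ok).toBeFalsy(); // passes for any 4xx/5xx, hides regressions

  // Specific (verifiable assertions pinned to the spec; 400 is an assumed code):
  expect(response.status).toBe(400);
  expect(response.body.error).toBe('Password must be at least 8 characters');
});
```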

Failure Mode 2: Incomplete
The output covers some of what was requested but systematically misses categories of scenarios. Common incompleteness patterns:

  • Happy paths covered, negative paths absent
  • API status codes for 200 covered, 4xx and 5xx missing
  • Core feature scenarios present, edge cases and boundary conditions absent
  • New behavior covered, regression scenarios for dependent functionality absent
  • UI behavior tested, API call assertions missing

Incompleteness is harder to detect than vagueness because the parts that are present look correct — you have to notice what's absent, which requires active comparison against a mental model of what "complete" should look like.

Failure Mode 3: Off-Target
The output addresses a different problem than the one you posed. Common off-target patterns:

  • Generated tests for the wrong endpoint or component (model confused by context)
  • Generated tests in the wrong framework or language
  • Generated test design scenarios when you asked for executable test code (or vice versa)
  • Applied the wrong business rules (used assumptions from a similar feature instead of your spec)
  • Addressed a historical version of the feature based on an outdated spec

A Practical Output Evaluation Checklist

Run this checklist on every significant AI output before accepting it:

□ SPECIFICITY CHECK
  - Do all expected results contain specific, observable values (not vague descriptions)?
  - Do all API test scenarios include exact status codes?
  - Are all test data values concrete (not "enter a valid email" but "enter user@example.com")?

□ COMPLETENESS CHECK
  - Happy paths: at least 2 scenarios?
  - Negative paths: at least 3 scenarios (invalid input, auth failure, resource not found)?
  - Boundary conditions: covered for numeric inputs, string lengths, and quantity limits?
  - Error states: all documented error responses tested?

□ ACCURACY CHECK
  - Do field names match the actual API spec / codebase?
  - Do status codes match the spec?
  - Do business rules match your actual product behavior, or do they merely sound generic?

□ SCOPE CHECK
  - Does this cover the feature I asked about, not a similar or hypothetical one?
  - Does this use the correct framework/language for my project?
  - Are there scenarios that test things I explicitly said to exclude?

□ USABILITY CHECK
  - Can I copy this into my test management tool without manual reformatting?
  - If code: does it compile/run as-is, or does it need significant editing?

If you fail three or more checks, the output needs substantial revision — it may be faster to re-prompt than to manually fix it.
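
Parts of the specificity check can even be automated. The sketch below is a hypothetical lint pass over generated scenarios; the vague-phrase list and the scenario shape are assumptions for illustration, not a standard tool:

```ts
// Hypothetical specificity lint for AI-generated scenarios (illustrative only).
interface Scenario {
  id: string;
  expectedResult: string;
}

const VAGUE_PHRASES = [
  'appropriate message',
  'correct response',
  'works as expected',
  'handles the error',
];

// Flag scenarios whose expected result uses vague wording or contains no
// concrete value (no HTTP status code and no quoted literal).
function flagVagueScenarios(scenarios: Scenario[]): string[] {
  return scenarios
    .filter((s) => {
      const text = s.expectedResult.toLowerCase();
      const hasVaguePhrase = VAGUE_PHRASES.some((p) => text.includes(p));
      const hasConcreteValue =
        /\b[1-5]\d{2}\b/.test(text) || /["'][^"']+["']/.test(text);
      return hasVaguePhrase || !hasConcreteValue;
    })
    .map((s) => s.id);
}

// Example run: TC-002 is flagged, TC-001 passes.
console.log(
  flagVagueScenarios([
    { id: 'TC-001', expectedResult: 'Status is 201, body contains "orderId"' },
    { id: 'TC-002', expectedResult: 'User sees an appropriate message' },
  ]),
); // -> ['TC-002']
```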

Calibrating Your Judgment Over Time

New AI users tend toward one of two errors: accepting everything the model produces (overconfident) or rejecting everything and doing it manually (underconfident). Neither is efficient.

Build calibration by doing a structured review for your first 20 AI-generated outputs:
1. Accept, reject, or revise each output
2. Note the failure mode (vague / incomplete / off-target) if you revised or rejected
3. After 20 reviews, identify your most common failure mode

Most engineers discover one dominant failure mode that accounts for 60–70% of their output quality issues. Fix that pattern first — in your prompt architecture, your context selection, or your output format specification.

Learning Tip: Add a "QA on the QA" step to your high-stakes AI sessions. After generating a test plan, run this follow-up prompt: "Review the test plan you just generated. What scenarios are missing that a thorough QA engineer would include? What assertions are too vague to be executable?" This self-critique prompt consistently surfaces 2–5 improvements on outputs that initially looked complete.


What Refinement Techniques Produce Better QA Outputs from AI?

Refinement is the process of improving an AI output through targeted follow-up instructions. It is not re-prompting from scratch — it is surgical intervention that fixes specific deficiencies while preserving what's working.

Technique 1: Targeted Correction

Identify specific items that are wrong and instruct the model to fix exactly those items, leaving the rest unchanged.

**Prompt:**
In your previous test scenario table:

1. TC-007 has the wrong expected status code. The spec says 422 for payment processing
   failures, not 400. Update TC-007.

2. TC-011 through TC-015 have vague expected results ("user sees error message"). Replace
   with specific expected result text from the spec:
   - payment declined: "Your payment was declined. Please check your card details."
   - card expired: "Your card has expired. Please use a different payment method."

3. TC-019 is testing admin functionality I explicitly excluded. Remove TC-019.

Output: the updated table with only these three changes applied. All other scenarios
should remain exactly as they were.

Targeted correction is the fastest refinement technique when the output is mostly correct but has specific errors. The "all other scenarios remain unchanged" instruction prevents the model from reformatting or rearranging content you already approved.

Technique 2: Scope Extension

The output is correct but incomplete — extend it to cover missing categories.

**Prompt:**
Your test scenarios cover the happy path and basic validation errors well. Now extend
the table with:

1. Boundary conditions for the quantity field (min value = 1, max value = 100)
   - Add scenarios for quantity = 0, 1, 50, 100, 101
2. Concurrent request scenarios (two users attempting to purchase the last item simultaneously)
3. Timeout and network error scenarios (payment service unavailable, returns 503)

Add these as new rows continuing from TC-025 (your last scenario). Maintain the same
table format.
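
As a sense check on what you are asking for, the boundary rows from point 1 translate almost mechanically into a parameterized test. Here is a sketch assuming Jest + Supertest and a hypothetical /orders endpoint; the expected status codes are assumptions about a typical validation contract:

```ts
import request from 'supertest';

// Hypothetical app import and /orders endpoint, for illustration only.
import { app } from '../src/app';

// Boundary values from the prompt: min = 1, max = 100.
test.each([
  { quantity: 0, expectedStatus: 400 },   // just below minimum
  { quantity: 1, expectedStatus: 201 },   // at minimum
  { quantity: 50, expectedStatus: 201 },  // mid-range
  { quantity: 100, expectedStatus: 201 }, // at maximum
  { quantity: 101, expectedStatus: 400 }, // just above maximum
])('quantity=$quantity returns $expectedStatus', async ({ quantity, expectedStatus }) => {
  const response = await request(app)
    .post('/orders')
    .send({ itemId: 'item-1', quantity });

  expect(response.status).toBe(expectedStatus);
});
```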

Technique 3: Depth Drilling

The output has the right scenarios but each one is too shallow — drill down to get more detail on specific scenarios.

**Prompt:**
Expand TC-008 (concurrent order conflict) into a full scenario group. The current
description is too brief to be testable. For this scenario, provide:

1. Exact precondition: database state, inventory count, user states
2. Exact concurrent action: what both users do, in what order (or simultaneously)
3. Exact assertions: what does User A see? What does User B see? What does the database
   state look like after both requests complete?
4. Test implementation notes: what concurrency mechanism to use in the test framework
   (Jest: Promise.all; Playwright: page.on('request') interception)
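
The Jest note in point 4 looks like this in practice. A minimal sketch of the concurrency pattern, assuming Supertest and a hypothetical /orders endpoint; the 201/409 outcome split is an assumption about how the conflict should resolve:

```ts
import request from 'supertest';

// Hypothetical app import and /orders endpoint, for illustration only.
import { app } from '../src/app';

test('TC-008: two users buy the last item, one succeeds and one conflicts', async () => {
  // Precondition (assumed helper): seed inventory so exactly one unit remains.
  // await seedInventory({ itemId: 'item-1', stock: 1 });

  // Fire both purchase requests simultaneously via Promise.all.
  const [resA, resB] = await Promise.all([
    request(app).post('/orders').send({ userId: 'user-a', itemId: 'item-1', quantity: 1 }),
    request(app).post('/orders').send({ userId: 'user-b', itemId: 'item-1', quantity: 1 }),
  ]);

  // Exactly one request should win, regardless of which user it is.
  expect([resA.status, resB.status].sort()).toEqual([201, 409]);
});
```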

Technique 4: Format Transformation

The content is correct but in the wrong format for your workflow. Convert without regenerating.

**Prompt:**
Transform the test scenario table you just generated into JSON format for import to
TestRail. Each scenario should become a JSON object with these fields:
{
  "title": "string",
  "section": "string (use the test type column as section name)",
  "priority": 1-4 (where 1=Critical, 2=High, 3=Medium, 4=Low — map from your Risk column),
  "steps": [{"content": "string", "expected": "string"}],
  "references": "string (use AC-ID if present)"
}

Output a JSON array. Include all 24 scenarios.
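
Before importing, it is worth sanity-checking the transformed JSON against the shape you asked for. Below is a minimal validation sketch; the field rules mirror the prompt above, and nothing here is TestRail's actual import API:

```ts
// Shape check for the AI-transformed JSON before a TestRail import.
// Mirrors the fields requested in the prompt; not TestRail's API.
interface Step {
  content: string;
  expected: string;
}

interface TestCase {
  title: string;
  section: string;
  priority: number;
  steps: Step[];
  references: string;
}

function validateCases(raw: string, expectedCount: number): TestCase[] {
  const cases = JSON.parse(raw) as TestCase[];

  if (!Array.isArray(cases) || cases.length !== expectedCount) {
    throw new Error(`Expected ${expectedCount} scenarios, got ${cases?.length ?? 0}`);
  }
  for (const c of cases) {
    if (!c.title || !c.section) {
      throw new Error(`Missing title or section: ${JSON.stringify(c)}`);
    }
    if (!Number.isInteger(c.priority) || c.priority < 1 || c.priority > 4) {
      throw new Error(`Priority out of range 1-4 in "${c.title}"`);
    }
    if (!c.steps.length || !c.steps.every((s) => s.content && s.expected)) {
      throw new Error(`Empty or incomplete step in "${c.title}"`);
    }
  }
  return cases;
}
```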

Technique 5: Adversarial Review

After generating test cases, ask the model to try to break them — find ways the tests could be wrong or incomplete.

**Prompt:**
You have just generated a test plan for the password reset flow. Now switch roles: you
are a senior developer reviewing this test plan to find gaps before implementation.

Identify:
1. Scenarios where the assertion is testable only if the implementation works in a specific
   way (brittle assertions)
2. Missing scenarios that a developer who knew the implementation details would add
3. Assumptions in the test plan that may not hold in all environments
4. Any test that could give a false positive (passes even when the feature is broken)

This adversarial review will improve the test plan before it goes to the test suite.

This technique surfaces the exact gaps that would turn into false confidence during regression runs.

Choosing the Right Refinement Technique

| Output problem | Best technique |
| --- | --- |
| Specific wrong values (wrong status code, wrong field name) | Targeted correction |
| Missing scenarios in specific categories | Scope extension |
| Scenarios present but too vague to execute | Depth drilling |
| Correct content, wrong format | Format transformation |
| Output looks complete but might have hidden gaps | Adversarial review |
| Fundamentally wrong direction | Re-prompt from scratch (refining won't help) |

Learning Tip: Develop a personal "refinement vocabulary" — a set of refinement instructions you apply regularly. Examples: "Make all expected results more specific," "Add boundary conditions for numeric inputs," "Add error path scenarios for each network dependency." Once you have 10–15 refinement patterns, applying them becomes automatic. You'll scan an output, identify which refinements apply, and issue them in a single follow-up — turning a mediocre first output into a production-ready artifact in one additional turn.


When to Use Multi-Turn Conversations vs. Single-Shot Prompts for QA Tasks?

Single-shot prompts are faster and cheaper. Multi-turn conversations are more powerful for complex tasks. Choosing correctly is about understanding which tasks benefit from iteration and which are better served by a comprehensive upfront prompt.

Single-Shot Prompts: When They're the Right Choice

Use single-shot prompts when:

  • The task is well-defined and bounded (you know exactly what you want)
  • The output format is fixed and familiar (a table, a JSON object, a Gherkin scenario)
  • The context is small enough to include completely in one prompt
  • You've done this task before and have a proven prompt template
  • The stakes are low enough that first-pass quality is acceptable
  • Speed matters more than exhaustiveness

Single-shot examples:

**Prompt:**
Generate a test case table for this user story. [story] Format: [format]. Constraints: [list].
**Prompt:**
Summarize this CI failure log in the format: failure type | affected test | likely cause.
[paste log excerpt]

Single-shot prompts are the workhorses of a QA prompt library. The goal of building a good prompt template is to make complex tasks achievable in a single shot — by front-loading all the structure and constraints that would otherwise require iteration.

Multi-Turn Conversations: When They Add Value

Use multi-turn conversations when:

  • The task requires exploration before you know what you want (discovery phase)
  • The output is too complex to specify fully upfront
  • You're debugging a problem and need to follow hypotheses iteratively
  • The task involves building on previous results (Step 1 output informs Step 2)
  • You need to ask clarifying questions before generating
  • The model needs to demonstrate understanding before you trust its generation

Multi-turn examples:

Turn 1 — Discovery:

**Prompt:**
I'm planning tests for a new payment retry mechanism. What are the key risk areas I should
cover? Don't generate scenarios yet — just identify the risk categories.

Turn 2 — Confirmation:

**Prompt:**
You identified: timeout behavior, idempotency, partial failure recovery, and notification
reliability. I'd also add: race conditions during retry. Confirm these 5 categories and
then generate 3 test scenarios per category.

Turn 3 — Refinement:

**Prompt:**
The idempotency scenarios are good. The timeout scenarios are too vague — expand them with
specific timeout values from the spec (1000ms for payment service, 5000ms for notification
service).

This three-turn sequence produces a better result than a single complex prompt because Turn 2 incorporates your domain knowledge correction (race conditions) and Turn 3 fixes the specific weakness.

A Framework for Choosing

Ask these questions about your task:

  1. Do I know exactly what I want the output to look like?
     - Yes → single-shot with a detailed template
     - No → multi-turn: discovery first, generation second

  2. Is the context small and complete?
     - Yes → single-shot
     - No → multi-turn: build context incrementally or summarize between turns

  3. Is this a recurring task I've done before?
     - Yes → single-shot with a proven template
     - No → multi-turn until you understand the task well enough to build a template

  4. Does Step 1's output affect what I'll ask in Step 2?
     - Yes → multi-turn is required
     - No → single-shot is fine

  5. Is quality more important than speed for this output?
     - Quality → multi-turn with refinement
     - Speed → single-shot and accept first-pass quality

Managing Multi-Turn Context Effectively

Long multi-turn conversations degrade because context fills with conversational filler, intermediate outputs, and corrected drafts — all of which consume tokens that could hold better content.

Technique — context compaction between turns:

When a conversation gets long, periodically issue a compaction turn:

**Prompt:**
Before we continue: summarize the current state of our work. What test categories have
we finalized, and what remains to be done? Include the final version of each confirmed
test scenario. I will start a new conversation from your summary.

Then start a new conversation pasting the summary as your initial context. This removes all the conversational overhead and leaves only the essential work product.

Learning Tip: Build a personal heuristic for choosing between single-shot and multi-turn. A simple rule: if you're writing the prompt and you already know every constraint, every format detail, and every scope boundary — single-shot. If you're writing the prompt and making even one assumption about what "good output" looks like — multi-turn, clarify the assumption first. This rule prevents the most common multi-turn failure: spending three turns iterating on a misunderstood task direction.


How to Build a Personal Prompt Library for Recurring QA Tasks?

A prompt library is the compound interest of prompt engineering. The first time you write a good prompt for a task, it takes 20 minutes. The tenth time you run that task, it takes 2 minutes because you have the template. After six months, your library is the difference between a QA engineer who can produce AI-assisted work at scale and one who starts from scratch every session.

What Goes in a Prompt Library

A prompt library contains reusable prompt templates — not a log of specific prompts, but abstracted, parameterizable templates with placeholders for task-specific content.

Example template entry:

## Template: API Test Case Generation

**Use when**: Generating test scenarios for a REST API endpoint from an OpenAPI spec or AC

**Template:**
---
You are a senior backend QA engineer with expertise in REST API testing, boundary value
analysis, and security testing.

SPEC CONTEXT:
[PASTE: acceptance criteria or OpenAPI spec for the endpoint]

SYSTEM CONTEXT:
[PASTE: endpoint handler function — 30-80 lines]
[PASTE: relevant data model / type definitions]

GENERATE:
A complete test scenario table covering all acceptance criteria. Include:
- At minimum 2 happy path scenarios
- All documented error responses (4xx, 5xx)
- Boundary conditions for all numeric inputs and string lengths
- At least 1 security/auth scenario

CONSTRAINTS:
- Maximum 25 scenarios
- Map each scenario to an AC-ID
- Do NOT generate test code — scenarios only
- Use this test framework: [SPECIFY: test framework; assume Jest + Supertest if none given]

FORMAT:
Markdown table: ID | Scenario | Input | Expected Status | Expected Response | AC-ID | Risk
---

**Last updated**: [date]
**Works best with**: [model name, if you've found a specific model performs better]
**Known limitations**: [anything this template doesn't handle well]

Library Structure

Organize your library by task type, not by feature or project. Templates are reusable across projects — task-type organization makes them findable:

prompt-library/
  test-generation/
    api-test-generation.md
    ui-e2e-generation.md
    unit-test-generation.md
    mobile-test-generation.md
  analysis/
    bug-root-cause-analysis.md
    ci-failure-classification.md
    regression-scope-analysis.md
    coverage-gap-identification.md
  planning/
    sprint-test-planning.md
    exploratory-charter-generation.md
    risk-based-prioritization.md
  review/
    pr-impact-analysis.md
    test-suite-audit.md
    requirement-ambiguity-review.md
  role-frames/
    backend-api-qa.md
    frontend-ui-qa.md
    mobile-qa.md
    ctv-qa.md
    security-qa.md

Building Your Library Incrementally

Don't try to build the library before you need it. Build it as you work:

After every successful prompt: If the output was notably good, save the prompt as a template. Immediately — while you remember what made it work.

After every refinement cycle: The refined final prompt is better than your first attempt. Save the final version, not the first draft.

Weekly review: Once a week, spend 10 minutes reviewing what you saved and converting informal notes into clean templates.

After pattern recognition: When you notice you've written the same prompt structure three times for different features, abstract it into a template.

Template Metadata That Saves Time

Each template should include:
- Title and use case: What task this serves and when to use it
- Required context: What the [PASTE] placeholders need — so you know what to gather before starting
- Optional context: What additional context improves quality if available
- Known failure modes: When this template produces poor output (helps calibrate expectations)
- Last updated date: Templates that haven't been updated in 6+ months may reference outdated frameworks or constraints

Team Library vs. Personal Library

A personal library is yours alone — fast to build, idiosyncratic to your workflow. A team library is a shared asset that multiplies prompt quality across the whole QA function.

To convert personal templates to team templates:
1. Remove project-specific details, keeping the structure and the reasoning
2. Add team-specific defaults (your team's tech stack, naming conventions, test management tool format)
3. Document the purpose and use case clearly for teammates who didn't write it
4. Store in a shared location (Notion, Confluence, a docs/prompt-library/ directory in your repo)
5. Treat template updates like code changes — review before merging

A team library with 20 well-maintained templates is a productivity multiplier that scales to every QA engineer on the team. The time investment to build it (8–10 hours over a quarter) returns dividends indefinitely.

Learning Tip: Your first library entry should be the prompt that produced the best output you've generated so far in this course. Don't wait until the library feels "ready" — start with one entry today. The library that exists is infinitely more useful than the perfect library you're planning to build. Add five entries by the end of this module and you have the foundation of a tool you'll use every day.