
Writing effective prompts for QA

What Makes a Good QA Prompt? Task, Context, Constraints, and Expected Output

Every effective QA prompt is built from the same four ingredients. Missing any one of them consistently degrades output quality in predictable ways. This is not a framework to memorize — it's a diagnostic: when an AI output disappoints you, trace it back to which ingredient was absent or weak.

The Four Ingredients

1. Task — What you want the model to do

The task is the action you're requesting. It should be a single, unambiguous directive.

Weak: "Help me with my tests."
Strong: "Generate a comprehensive test scenario table for the user story below."

Weak: "Look at this log and tell me what's wrong."
Strong: "Identify the root cause of the test failure in the log excerpt, and recommend the minimum change needed to fix it."

The task statement should include:
- An action verb (generate, identify, analyze, classify, draft, compare)
- A clear deliverable (test scenario table, bug report, risk matrix, regression scope list)

2. Context — What the model needs to know to do the task

Context is the information the model cannot infer from the task alone. For QA tasks, context typically includes one or more of:

  • The feature or system being tested (requirements, acceptance criteria, API spec)
  • The code or implementation being analyzed (source file, PR diff, OpenAPI spec)
  • The failure evidence (log excerpt, stack trace, test output)
  • The existing test artifacts (test file for style reference, current coverage summary)
  • Domain-specific rules (compliance requirements, business logic that isn't obvious from code)

Every piece of context should pass the "would the model produce worse output without this?" test. If the answer is no, the context is noise.

3. Constraints — What limits the model should operate within

Constraints prevent the most common output failure modes:
- Scope creep (model generates 80 scenarios when you needed 15)
- Format violation (model gives prose when you needed a table)
- Framework mismatch (model writes pytest tests when your team uses Jest)
- Assumption errors (model assumes admin access is available when it isn't)

Good constraints are specific and restrictive: "do not...", "maximum...", "only include...", "assume...".

4. Expected Output — The exact shape of what you want back

Define the format, length, and structure of the output before the model generates. Show a template if possible. Name the format explicitly (markdown table, JSON, Given/When/Then, XRAY-compatible).
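
If the output will feed a script or a test-management import rather than a human reader, the least ambiguous way to name the shape is to paste a type definition or schema into the OUTPUT FORMAT block. A minimal sketch in TypeScript; the interface name and fields are illustrative, not a standard:

```typescript
// Illustrative only: paste a shape like this into the OUTPUT FORMAT block
// and ask for "a JSON array of objects matching this interface, nothing else".
interface TestScenarioRow {
  id: string;              // e.g. "TC-001"
  scenario: string;        // one-line description of the behavior under test
  input: string;           // concrete input values
  expectedResult: string;  // observable outcome, not an implementation detail
  risk: "high" | "medium" | "low";
}
```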

Diagnosing Weak Prompts

| Symptom | Missing ingredient |
|---|---|
| Output is too generic or shallow | Missing context — the model didn't have enough to work with |
| Output ignores your tech stack | Missing context — framework and tooling not specified |
| Output is too long / includes irrelevant stuff | Missing constraints |
| Output is in the wrong format | Missing expected output specification |
| Output makes up facts about your system | Missing context — domain knowledge not provided |
| Output is correct but unusable without reformatting | Missing expected output specification |

A Minimal Viable QA Prompt Template

Use this structure as your baseline for any QA task:

**Prompt:**
[ROLE: who you are — if not established in system prompt]

[CONTEXT]
[paste the relevant spec, code, or failure evidence]

[TASK]
[action verb] + [deliverable]

[CONSTRAINTS]
- [constraint 1]
- [constraint 2]

[OUTPUT FORMAT]
Format as: [specify exactly]

You don't need all five blocks for every prompt. Simple tasks with a clear output format need only context, task, and format. Complex generation or analysis tasks benefit from all five.
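
If you find yourself pasting the same skeleton repeatedly, it can live as a small helper instead of a document you copy from. A minimal sketch with hypothetical names (buildQaPrompt is not a library function) that assembles only the blocks you supply:

```typescript
// Hypothetical helper: assembles whichever of the five blocks you provide
// into a single prompt string, omitting the ones you leave out.
interface QaPromptParts {
  role?: string;
  context: string;
  task: string;
  constraints?: string[];
  outputFormat?: string;
}

function buildQaPrompt(p: QaPromptParts): string {
  const blocks = [
    p.role,
    `CONTEXT:\n${p.context}`,
    `TASK:\n${p.task}`,
    p.constraints?.length
      ? `CONSTRAINTS:\n${p.constraints.map((c) => `- ${c}`).join("\n")}`
      : undefined,
    p.outputFormat ? `OUTPUT FORMAT:\nFormat as: ${p.outputFormat}` : undefined,
  ];
  return blocks.filter(Boolean).join("\n\n");
}

// Usage: the "quick mode" version of a prompt is just the helper with fewer parts.
const prompt = buildQaPrompt({
  context: 'User story: "Users can filter the product list by price range."',
  task: "Generate a test case table with positive, negative, and boundary cases.",
  outputFormat: "Markdown table: ID | Scenario | Input | Expected Result",
});
```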

Worked Example: Full vs. Minimal Prompts for the Same Task

Minimal prompt (often sufficient for simple tasks):

**Prompt:**
User story: "Users can filter the product list by price range using a min/max input."

Generate a test case table with columns: ID | Scenario | Input | Expected Result.
Include positive, negative, and boundary cases.

Full prompt (for complex or critical tasks):

**Prompt:**
You are a senior frontend QA engineer. You have deep expertise in React UI testing and
have reviewed hundreds of filter component specs.

CONTEXT:
Feature: Product list price range filter
Component: PriceRangeFilter (frontend React component calling GET /products?min=X&max=Y)
Constraints from spec:
- Min and max are both optional; if neither provided, all products returned
- Min must be >= 0 (no negative prices)
- Max must be > min if both provided
- Non-numeric values should be rejected with field validation error
- Filter should fire on blur, not on every keystroke
Existing API: GET /products accepts query params min (number) and max (number)

TASK:
Generate a comprehensive test scenario table for the PriceRangeFilter component.

CONSTRAINTS:
- Cover UI behavior (validation, error display) AND API call behavior (correct params sent)
- Maximum 20 scenarios
- Prioritize: boundary conditions > validation errors > happy paths
- Do NOT generate test code — scenarios only
- Assume Playwright for E2E and React Testing Library for unit tests (note test type per scenario)

OUTPUT FORMAT:
Markdown table: ID | Scenario | Input Values | Expected UI Behavior | Expected API Call | Test Type | Risk

Both prompts address the same feature. The minimal prompt is appropriate if you're doing a quick first-pass exploration. The full prompt is appropriate when you're generating the authoritative test plan for a sprint.
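
For a sense of where one row from that table eventually lands, here is a hedged sketch of the boundary scenario "min = 0, filter fires on blur" as a Playwright test. The behavior comes from the spec in the prompt, but the route, test IDs, and selectors are assumptions about a hypothetical implementation; the prompt itself deliberately asks for scenarios only, and code like this is written later by whoever executes the plan.

```typescript
import { test, expect } from "@playwright/test";

// Sketch of one generated scenario, assuming a hypothetical /products page
// whose filter inputs expose data-testid attributes.
test("min=0 is accepted and sent to the API on blur", async ({ page }) => {
  await page.goto("/products"); // relies on a configured baseURL

  // The spec says the filter fires on blur, so watch for the resulting API call.
  const apiCall = page.waitForRequest(
    (req) => req.url().includes("/products?") && req.url().includes("min=0")
  );

  await page.getByTestId("price-min").fill("0");
  await page.getByTestId("price-min").blur();

  await apiCall; // the correct query param was sent
  await expect(page.getByTestId("price-error")).toBeHidden(); // no validation error at the boundary
});
```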

Learning Tip: Build two versions of your most-used prompts: a "quick mode" (minimal prompt for speed) and a "thorough mode" (full prompt for critical tasks). The quick mode gets you 80% of the value in 20% of the time. The thorough mode gets you the remaining 20% when it matters — before a major release, for high-risk features, when QA is the last gate. Knowing which mode to use is itself a QA skill.


How to Write Prompts for Test Case Generation, Bug Analysis, and Exploratory Planning?

These three tasks each have a distinct purpose, evidence type, and optimal prompt pattern. Treating them the same produces uniformly mediocre results. Here is the specialized pattern for each.

Test Case Generation

Test case generation is the most structured of the three tasks. The model needs a clear picture of expected behavior and produces a list of test scenarios that verify it.

Core principle: The quality of generated test cases directly correlates with the quality of the spec/requirements context you provide. A vague spec produces vague tests.

The generation prompt pattern:

**Prompt:**
You are a senior QA engineer applying systematic test design techniques (boundary value
analysis, equivalence partitioning, state transition testing as applicable).

FEATURE SPEC:
[paste acceptance criteria, user story, or spec excerpt]

SYSTEM DETAILS:
[paste relevant code, API schema, or data model]

GENERATE:
A test case table covering:
1. Happy path scenarios (at least 2)
2. Negative path scenarios (invalid input, error states)
3. Boundary conditions (min/max values, limits)
4. Edge cases (empty state, concurrent access, large input)
5. [for APIs] Status code verification for all specified response codes
6. [for UI] Accessibility and keyboard navigation if interaction is involved

CONSTRAINTS:
- Map each test case to a specific acceptance criterion (use the AC ID if present)
- Flag high-risk scenarios (data loss potential, security implications)
- Do not include test cases that cannot be automated without special infrastructure

FORMAT: [your preferred format here]

For data-driven test generation, request test data alongside scenarios:

**Prompt:**
For each test scenario, also generate one concrete test data example showing the exact
input values and expected output values that would be used in execution.
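
When the model returns concrete data rows alongside scenarios, they map naturally onto a parameterized test. A minimal sketch using Jest's test.each; validateRange is a hypothetical stand-in for whatever unit is actually under test, included only to keep the example self-contained:

```typescript
import { describe, expect, test } from "@jest/globals";

// Hypothetical stand-in for the real validation logic under test.
function validateRange(min?: number, max?: number): boolean {
  if (min !== undefined && min < 0) return false; // no negative prices
  if (min !== undefined && max !== undefined && max <= min) return false; // max must exceed min
  return true;
}

describe("price range validation (data-driven)", () => {
  // Each row pairs a generated scenario with its concrete test data example.
  test.each([
    { name: "both bounds valid", min: 10, max: 50, valid: true },
    { name: "min at zero boundary", min: 0, max: 50, valid: true },
    { name: "negative min rejected", min: -1, max: 50, valid: false },
    { name: "max equal to min rejected", min: 20, max: 20, valid: false },
    { name: "no bounds provided", min: undefined, max: undefined, valid: true },
  ])("$name", ({ min, max, valid }) => {
    expect(validateRange(min, max)).toBe(valid);
  });
});
```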

Bug Analysis

Bug analysis works in reverse: you start with failure evidence and reason backward to a cause. The model needs precise, complete failure data, not summaries.

Core principle: Summaries of bugs produce vague analysis. Exact failure text produces specific analysis.

The bug analysis prompt pattern:

**Prompt:**
You are a senior QA engineer and debugging specialist. You analyze failure evidence
systematically and distinguish between confirmed facts and working hypotheses.

SYSTEM CONTEXT:
[brief description of the system component, its purpose, and its expected behavior
in the scenario that failed]

FAILURE EVIDENCE:
[paste the EXACT test failure output, stack trace, or error log — do not paraphrase]

[If available, also paste:]
[The test code that produced the failure]
[The source code of the function/method under test]
[Any related log output from the system]

ANALYZE:
1. What does the failure evidence directly confirm? (facts)
2. What is the most likely root cause? (hypothesis with justification)
3. What conditions would need to be true for this failure to occur?
4. What additional evidence (logs, data, state) would confirm or rule out the hypothesis?
5. What is the recommended fix?
6. What regression test(s) should be added to prevent recurrence?

For flaky test analysis, add the occurrence pattern:

**Prompt:**
Additional context: This failure occurs approximately 1 in 20 CI runs. It has been seen
on our Linux CI runner but not in local development. The test involves asynchronous
operations. Analyze with these characteristics in mind — the failure is likely
intermittent/timing-related.
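
The fix such an analysis usually points toward is replacing a fixed wait with a condition-based wait. A hedged sketch of the before/after in Playwright; the route, selector, and expected text are hypothetical:

```typescript
import { test, expect } from "@playwright/test";

test("order status updates after submission", async ({ page }) => {
  await page.goto("/orders/new"); // hypothetical route, relies on a configured baseURL
  await page.getByRole("button", { name: "Submit" }).click();

  // Flaky pattern: a fixed sleep passes locally but loses the race on a
  // slower CI runner roughly 1 run in 20.
  //   await page.waitForTimeout(500);
  //   expect(await page.getByTestId("status").textContent()).toBe("Confirmed");

  // Stable pattern: the web-first assertion retries until the asynchronous
  // update lands or the test times out, so timing variance stops mattering.
  await expect(page.getByTestId("status")).toHaveText("Confirmed");
});
```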

Exploratory Test Planning

Exploratory planning is the most open-ended of the three. You're not generating verification scenarios for known requirements — you're identifying what to investigate in a feature or area where behavior is incompletely understood.

Core principle: The model needs to understand both the intent (what the feature is supposed to do) and the risks (what could go wrong that the spec doesn't address).

The exploratory planning prompt pattern:

**Prompt:**
You are a senior QA engineer designing an exploratory testing program for a new feature.
Your goal is not to verify known requirements — that's covered by automated tests. Your
goal is to discover unknown unknowns: behaviors the spec didn't anticipate.

FEATURE OVERVIEW:
[brief description of the feature, its user flows, and its integration points]

KNOWN RISKS (already covered by automated tests — do NOT include):
[list the test types or scenarios already automated]

GENERATE:
A set of exploratory test charters targeting areas of likely undiscovered risk.
For each charter:
- Charter mission (what question are you investigating?)
- Target area (component, flow, or integration)
- Risk hypothesis (what specific failure might exist here?)
- Investigation approaches (how would you probe this area?)
- Time box (estimated session length: short/15 min, medium/30 min, long/60 min)

FOCUS ON:
- Integration boundaries (where this feature touches other services or systems)
- State transitions (what happens between states, not just within them)
- Concurrency and timing (simultaneous users, race conditions)
- Data variations (unusual user data, edge case data)
- Error recovery (what happens when upstream dependencies fail mid-flow)

Generate 8–12 charters.

Learning Tip: For exploratory planning, start every prompt by explicitly listing what's already covered by automated tests. This is the single most powerful constraint you can add — it forces the AI to generate genuinely novel charters instead of re-describing your existing automated scenarios. The best exploratory sessions discover what automation can't.


When Should Your Prompt Be Specific vs. Open-Ended?

The specificity dial is one of the most consequential prompt decisions you make. Most QA engineers default to over-specific prompts when they would benefit from open-ended ones, and to under-specific prompts when they need precision. Here's how to calibrate.

The Specificity Spectrum

OPEN-ENDED ←————————————————————————————————→ SPECIFIC

"What could go wrong with this payment flow?"
                ↓
"What edge cases am I missing in this checkout test suite?"
                ↓
"Review these test cases against the spec and identify gaps."
                ↓
"Generate test scenarios for the /POST /orders endpoint using the schema below."
                ↓
"Generate exactly 20 test scenarios for POST /orders covering all 8 acceptance criteria,
 formatted as JSON, prioritized by risk, excluding performance tests."

There is no universally correct position on this spectrum. The right specificity depends on your goal.

When Open-Ended Prompts Win

Use open-ended prompts when:

  • You want the model to surface risks, gaps, or considerations you haven't thought of
  • You're exploring a feature you don't fully understand yet
  • You're trying to discover the right questions before writing the specific prompt
  • The problem is genuinely ambiguous and premature specification constrains the solution

Example — good use of open-ended:

**Prompt:**
I'm about to start test planning for a new real-time collaboration feature (multiple users
editing the same document simultaneously). What are the highest-risk areas I should
focus my testing efforts on, and what testing approaches should I consider that might
not be obvious?

This prompt benefits from being open-ended because you're using the AI as a knowledgeable peer who might see angles you haven't considered. Over-specifying here would constrain the model to your existing mental model — which is exactly the bias you're trying to overcome.

When Specific Prompts Win

Use specific prompts when:

  • You have a concrete deliverable (a test case table, a bug report, a coverage assessment)
  • You've done the exploratory open-ended work and now need execution
  • The output needs to meet a specific format or convention (team standards, tool requirements)
  • You need to audit or validate AI output against clear criteria

Example — good use of specific:

**Prompt:**
Generate exactly 15 test scenarios for the document locking mechanism, covering:
- Single-user lock acquisition (2 scenarios)
- Lock expiration behavior (3 scenarios)
- Concurrent lock attempts (4 scenarios)
- Lock release and re-acquisition (3 scenarios)
- Error handling when lock service is unavailable (3 scenarios)

Format as GitHub Issues (title + body with Steps to Reproduce), labeled with "testing"
and "locking-feature".

The Two-Phase Approach

For many QA tasks, the optimal workflow is open-ended first, then specific:

Phase 1 — Discovery (open-ended):

**Prompt:**
What are the risk areas and edge cases I should cover in testing for this feature?
[paste feature description]

Phase 2 — Generation (specific):

**Prompt:**
Based on your risk analysis, generate test scenarios for the top 5 risk areas you
identified. For each risk area, provide 3–5 test scenarios. [constraints] [format]

Phase 1 gives you the AI's risk perspective. Phase 2 turns that perspective into executable test cases. The two-phase approach produces broader, better-targeted coverage than jumping straight to specific generation.

Calibrating Specificity in Practice

Ask yourself these questions before deciding on specificity:

  1. Do I know the right answer shape? (If no → open-ended first)
  2. Is there a team standard or tool format I must hit? (If yes → specific, include the format)
  3. Am I trying to catch things I haven't thought of? (If yes → open-ended)
  4. Does this output go directly to a stakeholder or tool? (If yes → specific)
  5. Is this exploratory planning or execution? (Exploratory → open-ended; Execution → specific)

Learning Tip: Develop the habit of the "two-prompt workflow" for unfamiliar features. The first prompt is always: "What should I be thinking about when testing X?" The second prompt is task-specific based on what the first reveals. This two-step pattern consistently produces better test plans than jumping straight to generation — and it forces you to review the AI's risk thinking before committing to a test scope.


How to Diagnose a Bad AI Output and Improve Your Prompt?

Bad AI output is not a reason to distrust AI. It is a diagnostic signal about your prompt. Every poor output has a traceable cause, and every traceable cause has a specific fix.

The Five Categories of Bad Output

Category 1 — Hallucinated specifics
The model invents facts about your system: wrong method names, incorrect status codes, non-existent API parameters, made-up business rules.

Cause: Insufficient system context. The model filled gaps in its knowledge with plausible-sounding fabrications.

Fix: Add the specific source of truth (the spec, the code, the API schema) as explicit context. Don't describe your system in the prompt — paste the relevant artifact.

Category 2 — Generic/shallow output
The model produces test cases or analysis that could apply to any similar system, not specifically to yours.

Cause: Missing domain context, or the context provided was too abstract.

Fix: Add one or more of: the specific tech stack, the specific data model, the specific user types, the specific business rules that distinguish your system.

Category 3 — Wrong scope
The model generates too many or too few scenarios, covers the wrong risk areas, or goes in a completely different direction than you intended.

Cause: Ambiguous task statement or missing constraints.

Fix: Add an explicit scope constraint: "Generate exactly N scenarios covering [X, Y, Z] only." Also check your task statement — if it could be interpreted multiple ways, the model will choose one interpretation. Make the interpretation explicit.

Category 4 — Wrong format
The output is in prose when you needed a table, in one language when you needed another, structured differently than your tool expects.

Cause: Missing or weak output format specification.

Fix: Add an explicit output format block with a template example. If format is critical, add: "Output ONLY the formatted table. No introductory text, no conclusion."
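
The cost of a format violation is easiest to see when the output feeds a script rather than a person. A minimal sketch of the consuming side, assuming the prompt asked for a bare JSON array of scenario objects; the field names are illustrative:

```typescript
// Sketch of a consumer that breaks the moment the model wraps the JSON in
// prose, which is exactly why "output ONLY the JSON array" belongs in the prompt.
interface Scenario {
  id: string;
  scenario: string;
  expectedResult: string;
}

function parseScenarios(modelOutput: string): Scenario[] {
  const parsed: unknown = JSON.parse(modelOutput); // throws on any leading or trailing prose
  if (!Array.isArray(parsed)) {
    throw new Error("Expected a JSON array of scenarios");
  }
  return parsed.filter(
    (row): row is Scenario =>
      typeof row === "object" &&
      row !== null &&
      typeof (row as Scenario).id === "string" &&
      typeof (row as Scenario).scenario === "string" &&
      typeof (row as Scenario).expectedResult === "string"
  );
}
```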

Category 5 — Reasoning error
The model's analysis reaches an incorrect conclusion because of a reasoning step that went wrong. The output looks confident and well-formatted but is substantively wrong.

Cause: The model made an inference error, often from ambiguous context or from applying a general rule that doesn't fit your specific case.

Fix: Request chain-of-thought reasoning: "Work through your reasoning step by step before giving your conclusion." This externalizes the reasoning and makes errors visible and correctable.

A Prompt Debugging Checklist

When an output is bad, run through this checklist in order:

□ 1. Is the task statement clear and unambiguous?
     → If ambiguous: rewrite with specific action verb + specific deliverable

□ 2. Did I provide the right spec/requirements context?
     → If hallucinations: paste the actual spec/schema/code, don't describe it

□ 3. Did I provide system-specific context (tech stack, naming conventions)?
     → If output is generic: add framework names, file naming conventions, data model

□ 4. Did I specify constraints (scope, count, exclusions)?
     → If scope is wrong: add explicit "maximum N", "only cover X", "exclude Y"

□ 5. Did I specify the output format?
     → If format is wrong: add template, format name, and "output only the table"

□ 6. Is the context I provided accurate?
     → Verify: does the pasted spec match the actual current spec?
     → Verify: is the pasted code the version I think it is?

The Iterative Refinement Loop

For complex tasks, prompting is not a one-shot activity. Use multi-turn refinement:

Turn 1 — Generate first draft:

**Prompt:**
[full prompt as designed]

Turn 2 — Identify and fix specific issues:

**Prompt:**
In your previous output:
- Test case TC-003 is incorrect because [specific reason]
- You missed the rate limiting scenario from acceptance criteria AC-7
- The format for "Expected Result" should be the HTTP status code, not a description

Regenerate TC-003 with the correct behavior, add a rate limiting scenario after TC-012,
and update all Expected Result cells to show status codes (e.g., "200 OK", "400 Bad Request").

Turn 3 — Final polish:

**Prompt:**
Output the complete final table with all corrections applied. Output only the table —
no explanation needed.

Using "Ask Before You Generate" for High-Stakes Prompts

For the most important tasks (test plan for a critical feature, regression scope for a major release), add a pre-generation check:

**Prompt:**
Before generating the test plan, tell me:
1. What artifacts do you have in context for this task?
2. What are the key acceptance criteria you plan to cover?
3. Are there any ambiguities in the spec that would affect your output?

After I confirm, generate the test plan.

This turns a potentially wrong single-shot generation into a validated output. The 30-second confirmation check is worth it for anything that feeds into your sprint test plan.

Learning Tip: Keep a "bad outputs" log for one sprint — every time AI output is wrong or unusable, note the task, the likely cause from the checklist above, and the fix you applied. At the end of the sprint, you'll have a personal catalogue of your most common prompt failure modes and their cures. Most engineers discover they have 2–3 recurring failure patterns that account for 80% of bad outputs. Fix those systematically and your average output quality jumps sharply.