Hands-on: Domain-specific test suite for your stack

How to assemble the right domain-specific context for AI test generation?

The quality of everything that follows in this hands-on session depends entirely on the quality of the context you give AI before you prompt. Context assembly is not busywork — it is the highest-leverage investment you make in the AI-assisted test generation process. Bad context produces generic test cases you have to throw away. Rich, domain-specific context produces test cases you can run within minutes.

This section gives you a repeatable context-assembly process you can use for any domain, any stack, and any feature.

The context layers model

Think of context as four layers, each adding specificity:

Layer 1: Technology context     (what tools/frameworks/platforms are in use)
Layer 2: Structural context     (what is the architecture, what does the code look like)
Layer 3: Domain context         (what does this feature do, what are its business rules)
Layer 4: Test context           (what patterns do your tests follow, what exists already)

Each layer builds on the previous. You can often generate useful output with only layers 1 and 3, but layers 2 and 4 are what transform "decent AI test cases" into "test cases I could commit without heavy editing."

Layer 1: Technology context

Technology context tells AI which libraries, frameworks, runners, and platforms it's working with. This prevents it from generating Selenium tests when you use Playwright, or Python fixtures when you're on TypeScript.

What to include:
- Language and version (TypeScript 5.3, Python 3.12, Swift 5.9, Kotlin 1.9)
- Test framework and runner (Jest, Pytest, XCTest, JUnit 5, Espresso)
- Platform specifics (iOS 16+, Android API 29+, Fire TV OS 7, tvOS 17)
- Key test libraries (Playwright v1.40, Appium 2.x, Maestro, WireMock, Pact)
- Project structure conventions (where test files live, naming conventions)

How to collect it: Open your package.json, build.gradle, Podfile, requirements.txt, or equivalent. Copy the testing-relevant dependencies section.

Quick technology context prompt wrapper:

Tech stack:
- Language: TypeScript 5.3
- Test runner: Jest 29 with ts-jest
- E2E framework: Playwright 1.40
- Mobile automation: Maestro (for E2E) + Appium 2 (for integration)
- Platform: iOS 16+ (primary), Android 10+ (secondary)
- Test helper libraries: @faker-js/faker, Fishery (factories), nock (HTTP mocking)
- CI: GitHub Actions with Xcode Cloud for iOS builds
- File convention: test files colocated with source at *.test.ts; E2E at /e2e/*.spec.ts

Layer 2: Structural context

Structural context shows AI the actual shape of your code — the interfaces, class signatures, API routes, database schema, and existing test helpers it needs to produce idiomatic tests.

What to include:
- The interface or type definitions for the subject under test
- Relevant API endpoint definitions (route, method, request/response shape)
- Database schema (the relevant tables, not your entire schema)
- Existing test helpers, factories, and fixtures AI should reuse
- Page Object classes or screen abstractions (for UI/E2E tests)

How to collect it: You don't need to paste entire files. Identify the 3–5 most relevant definitions and paste them. For large files, paste only the relevant section and add a note: "This is an excerpt from [filename] — I can provide more if needed."

Example structural context paste:

Relevant types (from src/types/order.ts):
---
type Order = {
  id: string;
  userId: string;
  status: 'pending' | 'confirmed' | 'shipped' | 'delivered' | 'cancelled' | 'refunded';
  lineItems: LineItem[];
  totalCents: number;
  currency: string;
  createdAt: Date;
  updatedAt: Date;
}

type LineItem = {
  productId: string;
  quantity: number;
  unitPriceCents: number;
}
---

Existing test factory (from tests/factories/order.factory.ts):
---
export const buildOrder = (overrides: Partial<Order> = {}): Order => ({
  id: faker.string.uuid(),
  userId: faker.string.uuid(),
  status: 'pending',
  lineItems: [buildLineItem()],
  totalCents: 4999,
  currency: 'USD',
  createdAt: new Date(),
  updatedAt: new Date(),
  ...overrides,
});
---

API route to test:
POST /api/orders/:id/cancel
Auth: Bearer JWT (userId must match order.userId)
Request body: { reason: string (required, max 500 chars) }
Success: 200 { order: Order (status changed to 'cancelled') }
Errors: 400 (validation), 401 (unauthenticated), 403 (not owner), 404 (not found),
        409 (already cancelled/delivered/refunded — cannot cancel)

Layer 3: Domain context

Domain context explains the business rules and domain knowledge that are not obvious from the code structure. This is where QA engineers add the most value — you understand the domain.

What to include:
- Business rules governing the feature (if X, then Y)
- Edge cases that matter to the business (high-value transactions, expired states)
- Known historical bugs or regression areas (this tends to produce very targeted tests)
- User types and personas that have different behaviors (admin vs. end user, free vs. paid)
- Integration dependencies that matter for the feature (what external services does this touch)

Example domain context:

Domain rules for order cancellation:
1. Orders can only be cancelled if status is 'pending' or 'confirmed'
2. Orders that have been shipped (status: 'shipped') cannot be cancelled — 
   customer must use the return flow instead
3. Cancellation reason is logged to our audit system (not just stored on the order)
4. When an order is cancelled, inventory must be immediately restocked 
   (synchronous call to InventoryService)
5. If InventoryService fails during cancellation, the cancellation should still 
   succeed but an error alert must be triggered to ops team
6. A cancellation confirmation email is sent asynchronously (non-blocking)
7. Users with an active Pro subscription get a 15-minute grace window to cancel 
   after an order moves to 'shipped' status (this is a special rule for Pro tier)

Known regression areas:
- The Pro 15-minute grace window rule has had 3 bugs in the past 6 months — 
  tests for this rule are high priority
- InventoryService timeout during cancellation once caused data inconsistency — 
  the partial failure path needs thorough testing

Layer 4: Test context

Test context tells AI what your existing tests look like — so it can match your patterns, reuse your helpers, and not re-invent the wheel.

What to include:
- A representative existing test file (or key excerpts)
- Your test naming conventions
- How you handle setup/teardown (beforeEach, fixtures, @BeforeEach)
- How you mock external dependencies in this codebase
- Any patterns to avoid (e.g., "we don't use jest.spyOn, we use nock for HTTP mocking")

Example test context:

Representative test pattern from our codebase 
(from tests/api/orders/create-order.test.ts):
---
describe('POST /api/orders', () => {
  let testUser: User;

  beforeEach(async () => {
    testUser = await createUser({ plan: 'pro' });
    nock('https://inventory.internal').post('/reserve').reply(200, { reserved: true });
  });

  afterEach(async () => {
    await cleanupTestData([testUser.id]);
    nock.cleanAll();
  });

  it('returns 201 with created order when request is valid', async () => {
    const response = await request(app)
      .post('/api/orders')
      .set('Authorization', `Bearer ${testUser.token}`)
      .send({ items: [{ productId: 'prod_123', quantity: 2 }] });

    expect(response.status).toBe(201);
    expect(response.body.order).toMatchObject({ status: 'pending', userId: testUser.id });
  });
});
---

Conventions:
- Test files: describe() groups by endpoint; it() uses sentence format
- Setup: user + mock setup in beforeEach; cleanup in afterEach
- HTTP mocking: nock for external service calls
- Auth: testUser.token is a pre-signed JWT (createUser helper handles this)
- Assertion style: toMatchObject() for partial object matching; toBe() for primitives

Assembling context into a single prompt briefing

Once you have all four layers, assemble them into a structured briefing at the top of your generation prompt:

## Context Briefing

### Tech Stack
[Layer 1 content]

### Structural Context
[Layer 2 content]

### Domain Context
[Layer 3 content]

### Existing Test Patterns
[Layer 4 content]

---
## Test Generation Request

Given the above context, generate [specific test suite request here].

Learning Tip: Context assembly takes 15–30 minutes for an unfamiliar feature, but drops to 5–10 minutes for features in a codebase you know well. Keep a test-context.md file in your working directory as a running context document that you update as you learn the codebase. Maintaining this document across sessions means you start each AI session with rich context already assembled — rather than re-collecting it every time.


How to generate a complete domain-specific test suite with AI in one session?

With rich context assembled, generating a complete test suite is a matter of sequencing your prompts correctly. This section walks through a full end-to-end session structure — from blank slate to a commit-ready test suite — using a realistic backend example, with notes on how the workflow varies for each domain type.

The one-session generation workflow

A complete test suite generation session has five phases:

Phase 1: Scope definition     (what to test, what not to test)
Phase 2: Test case generation (generate all test cases as a list first)
Phase 3: Gap review           (identify missing coverage with AI)
Phase 4: Code generation      (generate executable test code)
Phase 5: Edge case sweep      (prompt specifically for edge cases)

Running these phases sequentially — rather than jumping straight to code — produces substantially better output.

Phase 1: Scope definition

Before writing a single test, get AI to help you define scope:

I'm about to generate a test suite for the Order Cancellation feature. 
Before we generate tests, help me define scope.

Context: [paste full context briefing]

Generate a scope definition that answers:
1. What is IN scope for this test suite? (which behaviors to test)
2. What is explicitly OUT of scope? (what to defer to other test suites or layers)
3. Which testing layer covers each behavior? 
   (Unit / Integration / E2E / Manual — list all in-scope items with their test layer)
4. What are the 3 highest-priority test cases that must exist?
5. What are 3 areas where this feature is likely to have hidden complexity?

This produces a scope document that serves as a checklist as you proceed through the remaining phases.

Phase 2: Test case generation (list form first)

Generate test cases as a structured list before generating any code. This is faster to review and iterate on:

Using the scope we defined, generate a complete test case list for the 
Order Cancellation API endpoint (POST /api/orders/:id/cancel).

Format each test case as:
| ID | Category | Test Name | Input/Precondition | Expected Outcome | Priority | Layer |

Categories to cover:
- Happy path
- Authentication/authorization
- Validation
- Business rules (each rule from domain context = at least one test)
- External service behavior (InventoryService scenarios)
- Edge cases

Do not write code yet — just the test case list.

Review this list before moving to code generation. Add, remove, or rewrite test cases at this stage — it's far faster than editing generated code.
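
Two hypothetical rows show the level of detail to aim for; the IDs, names, and values below are illustrative, not taken from a real spec:

| TC-CANCEL-001 | Happy path | Cancels a pending order with a valid reason | Order status 'pending'; owner's JWT; reason "changed my mind" | 200; order.status is 'cancelled'; audit entry recorded | P0 | Integration |
| TC-CANCEL-012 | Business rules | Rejects cancellation of an already-cancelled order | Order status 'cancelled'; owner's JWT | 409 with an "already cancelled" error | P0 | Integration |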

Phase 3: Gap review

After reviewing the list, use AI to find coverage gaps:

Review the test case list we generated. Identify:
1. Any business rules from the domain context that have no corresponding test case
2. Any error codes from the API spec that have no corresponding test case
3. Any combination scenarios we haven't considered 
   (e.g., Pro user + already-shipped order + InventoryService timeout)
4. Any security-relevant test cases we should add 
   (auth bypass, IDOR, injection in the reason field)

After identifying gaps, suggest 5–10 additional test cases to add.

Phase 4: Code generation

Now generate the executable code from the validated test case list:

Generate the Jest + TypeScript test file for the test cases in our validated list.
[paste the finalized test case list]

Requirements:
- Use the existing test patterns from our context briefing
- Reuse the createUser() helper and nock for HTTP mocking
- Group test cases by category using nested describe() blocks
- Use test IDs as part of the it() description: it('[TC-CANCEL-001] returns 200...')
- Add a comment above each test case explaining the business rule being tested
- For InventoryService scenarios, generate both nock stubs at the top of the file 
  as named constants (INVENTORY_SUCCESS_STUB, INVENTORY_TIMEOUT_STUB)
- Do not mock the database — use real test database transactions

Generate one complete test file. Do not split across multiple files.
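
What this prompt asks for, structurally, is a file shaped roughly like the sketch below. It is a skeleton, not the full output: the import paths and helper names are assumptions carried over from the earlier context examples, and the real file would contain every case from the validated list.

---
import request from 'supertest';
import nock from 'nock';
import { app } from '../../../src/app';                       // assumed app entry point
import { createUser, cleanupTestData } from '../../helpers';  // assumed shared helpers

// Named InventoryService stubs, as requested in the prompt
const INVENTORY_SUCCESS_STUB = () =>
  nock('https://inventory.internal').post('/restock').reply(200, { restocked: true });
const INVENTORY_TIMEOUT_STUB = () =>
  nock('https://inventory.internal').post('/restock').delayConnection(10_000).reply(504);

describe('POST /api/orders/:id/cancel', () => {
  describe('Happy path', () => {
    // Business rule 1: pending and confirmed orders can be cancelled
    it('[TC-CANCEL-001] returns 200 and sets status to cancelled for a pending order', async () => {
      // arrange order + user, call the endpoint, assert response body and DB state
    });
  });

  describe('External service behavior', () => {
    // Business rule 5: cancellation still succeeds when InventoryService fails
    it('[TC-CANCEL-009] cancels the order and triggers an ops alert when restock times out', async () => {
      INVENTORY_TIMEOUT_STUB();
      // assert 200, status 'cancelled', and that the alert path was exercised
    });
  });
});
---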

Phase 5: Edge case sweep

After the main test file is generated, run a targeted edge case pass:

Review the test file we just generated. Run an edge case sweep and identify:

1. String/numeric boundaries not yet tested:
   - cancellation reason at exactly 500 characters (limit)
   - cancellation reason at 501 characters (over limit)
   - cancellation reason as empty string vs. whitespace-only

2. Concurrency scenarios:
   - Two simultaneous cancellation requests for the same order — race condition?

3. State machine transitions:
   - Any order status transition we haven't tested (e.g., 'confirmed' → 'cancelled')

4. Data integrity:
   - After cancellation, verify inventory was restocked (DB assertion, not just mock assertion)

Generate the additional test cases for any gaps found. 
Add them to the existing test file structure.
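
The concurrency scenario is worth spelling out explicitly, because AI rarely generates it unprompted. A minimal sketch in this codebase's Jest + supertest style; the createPendingOrderFor helper and the exact status code for the losing request are assumptions to verify against the spec:

---
it('[TC-CANCEL-EDGE-02] only one of two simultaneous cancellation requests succeeds', async () => {
  const order = await createPendingOrderFor(testUser); // hypothetical helper

  // Fire both requests without awaiting either, then resolve them together
  const [first, second] = await Promise.all([
    request(app)
      .post(`/api/orders/${order.id}/cancel`)
      .set('Authorization', `Bearer ${testUser.token}`)
      .send({ reason: 'duplicate tap' }),
    request(app)
      .post(`/api/orders/${order.id}/cancel`)
      .set('Authorization', `Bearer ${testUser.token}`)
      .send({ reason: 'duplicate tap' }),
  ]);

  // Exactly one request should win; the other should hit the 409 already-cancelled path
  expect([first.status, second.status].sort()).toEqual([200, 409]);
});
---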

Domain-specific generation patterns

Each domain has a variation on this workflow:

Frontend/UI: Replace Phase 4 with Playwright component test generation. Use Phase 5 to add visual regression snapshot captures and accessibility assertions via axe-playwright (a sketch follows at the end of these variations).

Backend/API: As described above. Phase 5 focus is concurrency, data integrity, and external service failure modes.

Mobile: Phase 2 explicitly includes device/OS variation as a test case dimension. Phase 4 generates Maestro flow files or XCTest/Espresso files. Phase 5 sweeps for platform-specific OS version behaviors.

CTV: Phase 2 explicitly includes platform (Fire TV, Apple TV, Roku) as a test case dimension. Phase 4 generates playback event assertion scripts. Phase 5 sweeps for ad insertion edge cases and ABR boundary conditions.
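
For the Frontend/UI variation, the Phase 5 additions often reduce to a few extra assertions per flow. A minimal sketch, assuming the axe-playwright helpers injectAxe and checkA11y plus Playwright's built-in screenshot comparison; route names and selectors here are illustrative:

---
import { test, expect } from '@playwright/test';
import { injectAxe, checkA11y } from 'axe-playwright';

test('order cancellation dialog is visually stable and accessible', async ({ page }) => {
  await page.goto('/orders/ord_123');                                // hypothetical route
  await page.getByRole('button', { name: 'Cancel order' }).click();

  // Visual regression: compare the dialog against the stored baseline screenshot
  await expect(page.getByRole('dialog')).toHaveScreenshot('cancel-order-dialog.png');

  // Accessibility: fail the test on any axe violation scoped to the dialog
  await injectAxe(page);
  await checkA11y(page, '[role="dialog"]');
});
---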

Learning Tip: The five-phase workflow is not bureaucracy — skipping directly to code generation produces lower-quality code that takes longer to review and fix than the time you saved. The list-first approach (Phase 2) specifically prevents a common failure mode: AI generates code for 15 test cases, you review the code, realize 5 of the test cases are wrong, and spend 30 minutes untangling generated code. Reviewing a test case list takes 5 minutes; reviewing and refactoring generated code takes 30.


How to review AI output for domain accuracy and edge case coverage?

AI-generated test output requires structured review — not because AI makes frequent errors, but because the errors it does make are systematic and predictable. Understanding those failure modes turns review from "read everything carefully" into a targeted inspection checklist.

The four categories of AI test generation failure

1. Domain rule omission
AI generates tests for the rules it found in your domain context but misses rules that are implicit (in your head, not in the spec) or spread across multiple documents.

Detection: Compare each domain rule in your context briefing against the test cases. If a rule has no test case, it was omitted.

2. Business logic hallucination
AI invents plausible-sounding but incorrect business logic. This is more common when your domain context is thin — AI fills gaps with assumptions.

Detection: Read each test's expected result and ask "is this actually what the system should do?" If you're not sure, go back to the spec. Never trust an expected result you haven't verified.

3. Framework/API misuse
AI uses your test framework or libraries in a way that is syntactically valid but semantically incorrect: mocks that don't work as expected, async patterns that don't properly await, assertion methods used on the wrong object type.

Detection: Run the tests. Framework misuse typically produces runtime errors or false positives (tests pass but don't actually test what they claim). If a test passes trivially, investigate.
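
A classic false positive in Jest codebases is an un-awaited async assertion: the test can pass because the expectation never actually runs before the test ends. A minimal illustration (cancelOrder is a stand-in for whatever async function is under test):

---
// Can pass even when the rejection never happens: the assertion promise is not awaited,
// so Jest may finish the test before the check is evaluated.
it('rejects cancellation of a shipped order', () => {
  expect(cancelOrder('ord_shipped')).rejects.toThrow('cannot cancel'); // missing await
});

// Correct version: awaiting the assertion means a wrong outcome actually fails the test.
it('rejects cancellation of a shipped order (fixed)', async () => {
  await expect(cancelOrder('ord_shipped')).rejects.toThrow('cannot cancel');
});
---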

4. Coverage gaps in edge cases
AI tends toward complete coverage of the happy path and the obvious error paths, but undercovers:
- Concurrent operations
- State machine edge transitions (e.g., a jump straight from status A to status C that skips B)
- Large payload / boundary conditions
- Clock-sensitive logic (expiry, grace windows, token refresh)

Detection: Use the edge case sweep (Phase 5) explicitly. Also manually enumerate: what happens at limits? What happens when time moves forward?
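
Clock-sensitive logic like the Pro 15-minute grace window is far easier to pin down with fake timers than with real waits. A minimal sketch using Jest's modern fake timers; graceWindowExpired is a hypothetical stand-in for the real grace-window check:

---
it('closes the Pro grace window exactly 15 minutes after shipping', () => {
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2024-01-15T10:00:00Z'));

  const shippedAt = new Date(); // 10:00:00 under the fake clock

  // One second before the boundary: still inside the window
  jest.setSystemTime(new Date('2024-01-15T10:14:59Z'));
  expect(graceWindowExpired(shippedAt)).toBe(false);

  // One second after the boundary: window closed
  jest.setSystemTime(new Date('2024-01-15T10:15:01Z'));
  expect(graceWindowExpired(shippedAt)).toBe(true);

  jest.useRealTimers();
});
---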

Domain review checklist

Use this checklist when reviewing AI-generated test output:

## Domain Review Checklist

### Domain Rules Coverage
- [ ] Every business rule in the domain context has ≥1 test case
- [ ] Every API error code has ≥1 test case that triggers it
- [ ] Every conditional branch (if Pro user, if already cancelled, etc.) has ≥1 test case

### Correctness of Expected Results
- [ ] Each expected result matches the actual spec (not AI's assumption)
- [ ] Success responses check the right fields (not just status code)
- [ ] Error responses check the error code AND the message (not just status code)
- [ ] Side effects are asserted, not just the response (e.g., database state, email sent)

### Test Independence
- [ ] Each test can run in isolation without depending on previous tests
- [ ] Setup/teardown is complete (no test leaves state that would affect others)
- [ ] Mocks are reset between tests

### Framework Correctness
- [ ] All async operations are properly awaited
- [ ] Mocks are correctly configured (mock target, return value, call count)
- [ ] Assertions are on the right objects (common mistake: asserting on the mock instead of the response)

### Edge Case Coverage
- [ ] String/numeric boundary values tested
- [ ] Null/undefined handling tested
- [ ] Concurrent access scenario considered
- [ ] Clock-sensitive logic tested (if applicable)
- [ ] Permission/auth edge cases tested

### Test Quality
- [ ] Test names are descriptive (reader can understand expected behavior from test name alone)
- [ ] No logic in tests (no if/else in test body — tests should be deterministic)
- [ ] No magic numbers (use named constants for expected values)

Using AI for peer review of AI output

Counterintuitively, AI is good at reviewing AI-generated tests — because it can apply the checklist systematically at scale:

Review the following test file for domain accuracy and coverage gaps.

Domain context (source of truth):
---
[paste your domain context from Phase 1]
---

Test file to review:
---
[paste generated test file]
---

For each issue found:
1. Category: Domain Rule Omission / Business Logic Error / Framework Misuse / Coverage Gap
2. Specific problem: which test case, what's wrong
3. Severity: Will miss a real bug / False positive risk / Style issue
4. Fix: exact corrected test case or suggestion

Also provide:
- A coverage percentage estimate: how many of the domain rules have test coverage?
- Top 3 missing test cases that would provide the most bug-catching value

Learning Tip: Build the review checklist into your team's pull request template for test PRs. When a PR contains AI-generated tests, the reviewer should run through the checklist — not because AI output is untrustworthy, but because the reviewer's job shifts from "did the author write good tests?" to "did the author review and validate the AI's output?" These are different cognitive tasks with different failure modes.


How to run the suite and interpret domain-specific failures?

A test suite that passes is good. A test suite that fails is better — if it fails because it caught a real bug. The challenge is that AI-generated test suites sometimes fail for the wrong reasons: test setup issues, incorrect mock configuration, or genuine AI generation errors. This section covers how to triage domain-specific test failures efficiently.

The three classes of test failure

When a freshly generated test suite fails, the failure is almost always in one of three classes:

Class A: Infrastructure failure — The test is correct but the environment isn't set up right. The test database isn't seeded, a required service isn't running, an environment variable is missing.

Class B: Test generation error — AI made a mistake in the test itself. Wrong assertion value, incorrect mock setup, wrong API endpoint, incorrect data factory usage.

Class C: Real bug — The test is correct and the application has a bug. This is the intended outcome.

Your triage priority is: confirm it's not Class A, then determine if it's Class B or C.

Rapid triage workflow

Step 1: Read the error message completely
  - Is it a connection error, import error, or type error? → Likely Class A (infrastructure)
  - Is it an assertion failure? → Could be Class B or C
  - Is it a runtime error in the application code? → Likely Class C

Step 2: For assertion failures
  - What was expected vs. what was actual?
  - Is the expected value something AI invented, or something from the spec?
  - Run the same call manually (curl, Postman, direct DB query) — does it match actual or expected?
  - If manual call matches actual: the application is correct, the test is wrong (Class B)
  - If manual call matches expected: the application has a bug (Class C)

Step 3: For Class B failures
  - Fix the test; don't file a bug
  - Note the pattern — if AI made this error once, it may have made the same mistake elsewhere

Step 4: For Class C failures
  - Do not fix the test — file a bug
  - The test is now your regression test

Using AI to triage test failures

Feed failing test output to AI for rapid triage:

I'm running a newly generated test suite. Some tests are failing. 
Help me triage each failure as Class A (infrastructure), Class B (test error), 
or Class C (real bug).

For each failure:
1. Classify: Class A / B / C
2. Root cause hypothesis
3. If Class A: what to fix in the environment
4. If Class B: exact fix to the test
5. If Class C: description of the bug for the bug report

Domain context (to help distinguish expected vs. unexpected behavior):
---
[paste domain context]
---

Test failures:

Failure 1:
Test: "cancels order with valid reason for pending order"
Error: "Cannot read properties of undefined (reading 'status')"
Stack: at tests/api/orders/cancel-order.test.ts:45

Failure 2:
Test: "returns 409 when order is already cancelled"
Error: "Expected: 409, Received: 200"
Stack: assertion at tests/api/orders/cancel-order.test.ts:89

Failure 3:
Test: "restocks inventory when order is cancelled"
Error: "Nock: No match for request: POST https://inventory.internal/restock"
Stack: nock timeout at tests/api/orders/cancel-order.test.ts:112
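
Failures like the third one are often settled faster locally than by prompting: nock can report exactly which request the application made that nothing intercepted. A small sketch, assuming nock's 'no match' emitter event (the exact request shape passed to the callback varies by nock version):

---
// Temporarily log any request nock refuses to intercept; the mismatch
// (path, host, trailing slash, query string) is usually obvious from the output.
nock.emitter.on('no match', (req) => {
  console.log('Unmatched request:', req.method, req.path);
});

// Also useful after a failing run: which stubs were defined but never hit?
console.log('Pending mocks:', nock.pendingMocks());
---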

Domain-specific failure patterns

Each domain has characteristic failure patterns in AI-generated tests:

Frontend/UI tests:
- False positives from animation timing (screenshot taken before animation completes)
- Selector failures when AI uses fragile CSS selectors instead of accessible attributes
- Visual regression false positives from font rendering differences across environments

Backend/API tests:
- Mock not intercepting the correct URL (trailing slash, query string, HTTP vs. HTTPS)
- Foreign key constraint failures when test data is created in wrong order
- Race condition failures in tests that rely on async side effects (email sending, inventory updates)

Mobile tests:
- Maestro/Appium element not found errors when AI uses incorrect element identifiers
- Timing failures on slow devices (AI assumes desktop-like speeds)
- Permission state failures when permission state doesn't match test precondition

CTV tests:
- Player event timing failures (AI assumes events fire sooner than the actual player implementation emits them)
- Platform-specific API failures (AI generates Fire TV code that doesn't work on Roku)
- Network simulation failures when throttling doesn't perfectly reproduce assumed conditions

Building a failure log for continuous improvement

Track AI-generation failures systematically to improve future prompts:

Generate a "test generation failure log" template that our team will use to 
track patterns in AI-generated test failures. The log should capture:

1. Date and prompt used
2. Test framework / domain type
3. Failure class (A/B/C)
4. Root cause (what AI got wrong, specifically)
5. Fix applied
6. Prompt improvement: what context would have prevented this error?

After logging 20 failures, we'll use this log to identify the top 3 prompt 
improvements that would prevent the most common failure patterns.
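
A filled-in entry makes the template concrete. One hypothetical example (all details invented for illustration):

1. Date and prompt: 2024-03-04, Phase 4 code-generation prompt for order cancellation
2. Framework / domain: Jest + supertest, backend API
3. Failure class: B (test error)
4. Root cause: nock stub registered for /reserve, but the cancellation flow calls a different endpoint
5. Fix applied: corrected the stub URL and added a pendingMocks() check to teardown
6. Prompt improvement: include the exact InventoryService endpoints in the structural context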

Interpreting domain-specific test results as a signal

Beyond individual failures, test suite results carry domain-level signals:

Frontend: If visual regression tests produce many false positives after a design token change, your visual regression threshold may be too tight — or your baseline needs updating.

Backend: If data integrity tests fail against production-like data but pass in your test environment, your fixture data doesn't represent the complexity of production data.

Mobile: If tests fail on specific OS versions but not others, you have a platform-specific bug — look at that OS version's changelog for breaking changes in relevant APIs.

CTV: If QoS tests fail on the 1Mbps network condition but pass on 5Mbps, you have a real ABR issue — your low-bitrate rendition ladder may be poorly configured.

Prompt for test result interpretation:

I've run our domain-specific test suite after a recent release. 
Here are the results summary. Analyze the failure pattern and provide:
1. Most likely root cause of the failure cluster
2. Whether this is a test problem or an application problem
3. What to investigate first
4. Any domain-specific concern raised by this pattern

Results summary:
---
Total tests: 186
Passed: 171 (92%)
Failed: 15 (8%)

Failure breakdown:
- Visual regression failures: 8 (all on mobile viewport, all Button component variants)
- API contract test failures: 4 (all in UserService contract, subscription.plan field)
- Performance test failures: 3 (all on video start time, 1Mbps network condition only)
---

Learning Tip: The best measure of a domain-specific test suite's quality is not its pass rate on the day it's written — it's how many bugs it catches in the 6 months after it's written. Keep a "caught in CI" log: every time a test catches a real bug before it reaches production, note it. After 6 months, this log tells you which parts of your test suite are pulling weight and which are mostly noise. Use it to prioritize test maintenance and expansion with AI.