How to Generate Realistic, Domain-Accurate Test Data Sets with AI?
Generic test data — "John Doe", "john@example.com", "123 Main St" — produces tests that pass but don't catch real bugs. Production bugs come from real data patterns: names with apostrophes, emails with plus-sign aliases, addresses with long street names, phone numbers in every national format. AI can generate domain-accurate, realistic data sets that surface these issues before production does.
The Domain Context Prompt
The key to realistic test data is giving AI context about your domain, your users, and your data constraints:
Generate a test data set for our e-commerce platform.
USER DOMAIN CONTEXT:
- Our users are located primarily in the US, UK, Canada, and Australia
- Names include a significant percentage of non-Latin characters (we serve diaspora communities)
- About 15% of users have multiple email addresses — they use + aliases heavily
- Payment methods: credit card (60%), PayPal (25%), bank transfer (15%)
- Average order value: $65–$180 USD
- Product categories: electronics, clothing, homeware, groceries
GENERATE 20 ROWS of user test data in this format:
| first_name | last_name | email | phone | country | postal_code | payment_method |
CONSTRAINTS:
- Include at least 3 names with non-ASCII characters (accents, diacritics, CJK characters)
- Include at least 2 email addresses with + aliases
- Include at least 2 phone numbers in international format with country code
- Include valid postal codes for each country represented
- Include edge cases: very short name (1 char), very long name (40+ chars),
hyphenated last name, name with apostrophe (O'Brien, D'Souza)
- All data should be fictional but structurally realistic
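A quick way to confirm the output actually honors these constraints is a short validation script. The sketch below is a minimal example, assuming the generated markdown table was saved verbatim to a file named users.md (the file name and the exact column names are assumptions taken from the format line in the prompt):

```python
# Minimal sketch: parse the generated markdown table and spot-check the constraints.
with open("users.md") as f:
    rows = [
        [cell.strip() for cell in line.strip().strip("|").split("|")]
        for line in f
        if line.lstrip().startswith("|") and "---" not in line
    ]

header, data = rows[0], rows[1:]
records = [dict(zip(header, row)) for row in data]

emails = [r["email"] for r in records]
names = [r["first_name"] + r["last_name"] for r in records]

assert len(records) == 20, f"expected 20 rows, got {len(records)}"
assert sum("+" in e.split("@")[0] for e in emails) >= 2, "expected at least 2 plus-alias emails"
assert sum(not n.isascii() for n in names) >= 3, "expected at least 3 non-ASCII names"
```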
Generating Structured Test Data for Specific Entities
For more complex entities like orders or financial transactions:
Generate a test data set for order processing.
ORDER ENTITY SCHEMA:
- order_id: UUID
- user_id: UUID (reference to user)
- status: enum [pending, processing, shipped, delivered, cancelled, refunded]
- items: array of { product_id: UUID, quantity: int, unit_price: decimal, discount_pct: decimal }
- subtotal: decimal (sum of items)
- discount_total: decimal
- shipping_cost: decimal
- tax_amount: decimal (varies by shipping state)
- total: decimal (subtotal - discount_total + shipping_cost + tax_amount)
- created_at: ISO 8601 timestamp
- shipped_at: ISO 8601 timestamp or null
- delivery_address: object
GENERATE 15 orders covering:
1. One in each status
2. At least 2 orders with multiple items (3-5 items)
3. At least 1 order with a discount applied
4. At least 1 order where discount makes total < $1 (tests minimum charge logic)
5. At least 1 order with international shipping address
6. At least 1 order created over a year ago (for archive/retention testing)
7. At least 1 order with zero shipping cost (digital product / free shipping threshold)
IMPORTANT: All decimal totals must be mathematically consistent — subtotal - discount_total + shipping_cost + tax_amount must equal total.
Provide the data as a JSON array.
The instruction to maintain mathematical consistency is critical — without it, AI will generate totals that don't add up, producing seed data that fails application validation or database CHECK constraints the moment you load it.
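Because this is exactly the detail that slips, it's worth a programmatic check before seeding. A minimal sketch, assuming the generated JSON array was saved as orders.json and uses the field names from the schema above:

```python
import json
from decimal import Decimal

def d(value):
    """Convert to Decimal via str to avoid binary-float rounding noise."""
    return Decimal(str(value))

with open("orders.json") as f:
    orders = json.load(f)

for order in orders:
    expected = d(order["subtotal"]) - d(order["discount_total"]) \
        + d(order["shipping_cost"]) + d(order["tax_amount"])
    if d(order["total"]) != expected:
        print(f"{order['order_id']}: total {order['total']} != expected {expected}")
```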
Generating Data for Localization Testing
Generate test data for localization and internationalization edge cases.
CONTEXT: Our app supports 12 locales: en-US, en-GB, fr-FR, de-DE, ja-JP, zh-CN, zh-TW,
ar-SA, he-IL, pt-BR, es-MX, ru-RU
For each locale, generate one test user record that includes:
- Full name in the local script
- Local phone number format
- Local address format (including postal code format)
- Local currency symbol and sample price (e.g., ¥1,200 for ja-JP)
- A sample date formatted in local convention
- Text direction note (LTR / RTL)
SPECIFIC EDGE CASES TO INCLUDE:
- RTL locales (ar-SA, he-IL): test that reversing text direction doesn't break layout
- CJK locales (ja-JP, zh-CN, zh-TW): names are typically 2-3 characters, no spaces
- German (de-DE): compound words can be 30+ characters — test long word wrapping
- Brazilian Portuguese: CPF number format (tax ID) is different from other countries
Generating CSV/JSON Data Files Directly
For large data sets used in data-driven tests:
Generate a CSV file with 50 rows of product test data for our inventory system.
COLUMNS:
sku, product_name, category, price_usd, stock_quantity, weight_grams,
dimensions_cm (format: "LxWxH"), is_active, tags (comma-separated in quotes),
supplier_id, last_restocked_at
INCLUDE THESE SPECIFIC CASES:
- 5 products with stock_quantity = 0 (out of stock)
- 3 products with stock_quantity = 1 (low stock edge case)
- 2 products with price = 0.00 (free items / samples)
- 2 products with price > 1000 (high-value items)
- 3 products with very long names (50+ characters)
- 2 products with special characters in name (ampersand, apostrophe, em-dash)
- 5 inactive products (is_active = false)
- The rest: varied realistic product data across all categories
Output: valid CSV with header row, properly quoted strings, no trailing whitespace.
Learning Tip: After generating a test data set, paste it into a schema validator or run it against a quick DB insert script before using it in tests. AI-generated data frequently has subtle inconsistencies: dates that are out of range for the status, totals that don't add up, or postal codes that don't match the stated country. A 5-minute validation pass prevents hours of confusing test failures caused by invalid seed data.
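That validation pass can be as small as the sketch below. The file name, boolean format, and column order are assumptions; adjust them to whatever the model actually produced:

```python
import csv

EXPECTED_COLUMNS = ["sku", "product_name", "category", "price_usd", "stock_quantity",
                    "weight_grams", "dimensions_cm", "is_active", "tags",
                    "supplier_id", "last_restocked_at"]

with open("products.csv", newline="") as f:
    reader = csv.DictReader(f)
    assert reader.fieldnames == EXPECTED_COLUMNS, f"unexpected header: {reader.fieldnames}"
    for line_no, row in enumerate(reader, start=2):
        assert float(row["price_usd"]) >= 0, f"line {line_no}: negative price"
        assert int(row["stock_quantity"]) >= 0, f"line {line_no}: negative stock"
        assert row["is_active"].lower() in ("true", "false"), f"line {line_no}: bad is_active"
```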
How to Prompt AI to Generate Adversarial Boundary and Edge Case Data?
Boundary value analysis (BVA) is a testing discipline, but it's also a prompt discipline. AI will not generate boundary data by default — it generates "typical" data. You have to explicitly enumerate the boundary conditions you want explored, and the more precisely you specify the boundaries, the more useful the output.
The Boundary Specification Prompt Pattern
For any field with constraints, provide the constraint and then ask for all relevant boundary values:
Generate boundary value test data for these fields in our user registration form:
FIELD: username
- Min length: 3 characters
- Max length: 30 characters
- Allowed characters: a-z, 0-9, underscore, hyphen
- Must start with a letter
- Cannot contain consecutive hyphens or underscores
GENERATE:
1. Valid boundary values:
- Exactly 3 chars (minimum valid)
- Exactly 30 chars (maximum valid)
- Only lowercase letters
- Contains single hyphen (valid special char)
- Contains single underscore (valid special char)
- Starts with letter, ends with number (valid)
2. Invalid boundary values:
- 2 chars (one below minimum)
- 31 chars (one above maximum)
- Starts with number (violates start rule)
- Starts with underscore (violates start rule)
- Contains space (invalid character)
- Contains @ symbol (invalid character)
- Contains consecutive hyphens: "user--name"
- Contains consecutive underscores: "user__name"
- All uppercase letters
- Empty string
3. Unicode and encoding edge cases:
- Username with accented character: "café" (should be rejected — only ASCII)
- Username that is valid ASCII but displays differently in different fonts
- Username that looks like another username (homoglyph attack: "userl" vs "user1")
For each value: provide the test input and expected validation result (VALID / INVALID).
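The generated values drop straight into a parametrized test. A sketch in pytest, where validate_username stands in for whatever validation your application actually exposes (the import path is hypothetical):

```python
import pytest

from myapp.validation import validate_username  # hypothetical import path

BOUNDARY_CASES = [
    ("abc", True),            # exactly 3 chars — minimum valid
    ("a" * 30, True),         # exactly 30 chars — maximum valid
    ("ab", False),            # one below minimum
    ("a" * 31, False),        # one above maximum
    ("1user", False),         # starts with a number
    ("user--name", False),    # consecutive hyphens
    ("", False),              # empty string
    ("café", False),          # non-ASCII, should be rejected
]

@pytest.mark.parametrize("username,expected_valid", BOUNDARY_CASES)
def test_username_boundaries(username, expected_valid):
    assert validate_username(username) == expected_valid
```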
Generating Numeric Boundary Data
Generate boundary value test data for a financial transaction amount field.
CONSTRAINTS:
- Type: decimal with 2 decimal places
- Minimum: $0.50 (payment processor minimum)
- Maximum: $999,999.99 (single transaction limit)
- Must be positive (no negative values)
- Must have exactly 2 decimal places when stored
GENERATE boundary values:
- Below minimum: 0.00, 0.49, 0.01
- At minimum: 0.50
- Above minimum: 0.51, 1.00
- Typical values: 10.00, 99.99, 500.00
- Near maximum: 999,999.98, 999,999.99
- Above maximum: 1,000,000.00, 1,000,000.01
- Invalid formats: "10.999" (3 decimal places), "10." (trailing decimal),
".99" (leading decimal), "10,00" (European format), "ten dollars" (string)
- Special values: 0, -1, -0.50, null, undefined, NaN, Infinity
For the "invalid formats" category, also specify which HTTP status code
(400 vs 422) your API should return for each.
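The same constraints can be mirrored in a small local validator, which is handy for sanity-checking the generated values before they go into API tests. A hypothetical sketch using Python's decimal module:

```python
from decimal import Decimal, InvalidOperation

def is_valid_amount(raw: str) -> bool:
    """True if raw is an amount with exactly 2 decimal places within the limits above."""
    try:
        amount = Decimal(raw)
    except InvalidOperation:
        return False                              # "ten dollars", "10,00", etc.
    return (
        amount.as_tuple().exponent == -2          # exactly two decimal places
        and Decimal("0.50") <= amount <= Decimal("999999.99")
    )

assert is_valid_amount("0.50") and is_valid_amount("999999.99")
assert not is_valid_amount("0.49") and not is_valid_amount("10.999")
assert not is_valid_amount("ten dollars")
```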
Date and Time Boundary Data
Date fields are a particularly rich source of edge cases that AI excels at generating:
Generate date/time boundary test data for a subscription start/end date field.
CONSTRAINTS:
- Format accepted: ISO 8601 (YYYY-MM-DDTHH:mm:ssZ)
- Start date cannot be in the past (must be today or future)
- End date must be after start date
- Maximum subscription duration: 3 years
- System stores in UTC; users submit in their local timezone
GENERATE:
1. Valid dates:
- Today's date at midnight UTC
- Tomorrow's date
- Exactly 3 years from today (max duration)
- Dates in all 12 months (to catch month-length bugs)
- Feb 29 on a leap year (2028-02-29)
2. Invalid dates:
- Yesterday's date (past)
- Exactly 3 years + 1 day from today (exceeds max)
- End date equal to start date (zero-duration, should be invalid)
- End date before start date
3. Timezone edge cases:
- Date submitted as "2025-03-09T23:00:00-05:00" (EST) — converts to next day UTC
- Date at DST transition time (March 9, 2025 2:00 AM EST doesn't exist)
- Date submitted as "+14:00" timezone (Kiribati — furthest ahead)
- Invalid timezone: "+25:00"
4. Calendar edge cases:
- Feb 29 on a non-leap year (2025-02-29 — invalid date)
- Month 13 (2025-13-01)
- Day 32 (2025-01-32)
- Day 0 (2025-01-00)
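One caveat with date boundary data: values like "tomorrow" and "exactly 3 years from today" go stale if hard-coded into fixtures, so it is usually better to have the test suite compute them at run time. A minimal sketch:

```python
from datetime import datetime, timedelta, timezone

now = datetime.now(timezone.utc).replace(microsecond=0)
today = now.replace(hour=0, minute=0, second=0)

boundaries = {
    "today_midnight_utc": today.isoformat(),
    "tomorrow":           (today + timedelta(days=1)).isoformat(),
    "yesterday_invalid":  (today - timedelta(days=1)).isoformat(),
    # .replace(year=...) raises ValueError on Feb 29; handle leap days separately if needed
    "max_duration":       today.replace(year=today.year + 3).isoformat(),
    "just_over_max":      (today.replace(year=today.year + 3) + timedelta(days=1)).isoformat(),
}
print(boundaries)
```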
File Upload Boundary Data
For file upload endpoints, boundary data covers more than just file size:
Generate boundary test data for a document upload endpoint.
CONSTRAINTS:
- Accepted types: PDF, DOCX, PNG, JPG, JPEG
- Maximum file size: 10 MB
- Minimum file size: 1 byte (not empty)
- Maximum filename length: 255 characters
- Filename cannot contain: / \ : * ? " < > |
GENERATE:
1. Size boundaries:
- 0 bytes (empty file)
- 1 byte (minimum valid)
- 1 MB (well within limit)
- 9,999,999 bytes (just under 10 MB)
- 10,000,000 bytes (exactly 10 MB — should this be accepted?)
- 10,000,001 bytes (just over 10 MB)
2. File type edge cases:
- Valid PDF with .pdf extension
- PDF file renamed with .jpg extension (MIME type vs extension mismatch)
- .exe file renamed with .pdf extension (polyglot/malicious)
- Valid PNG with no extension
- Text file with .pdf extension (wrong content for extension)
3. Filename edge cases:
- Filename with spaces: "my document.pdf"
- Filename with Unicode: "café résumé.pdf"
- Filename with only dots: "...pdf"
- Filename at exactly 255 characters
- Filename at 256 characters
For each test case: filename, content description, expected HTTP response code,
and expected error message if rejected.
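For the size boundaries, the fixture files themselves can be generated rather than checked into the repo. A sketch that writes files of exact byte counts — these exercise size limits only; the MIME-type and polyglot cases still need genuine PDF/PNG/executable content:

```python
import os

# File names are assumptions; sizes come from the boundary list above.
SIZES = {
    "empty.pdf": 0,
    "one_byte.pdf": 1,
    "just_under_limit.pdf": 9_999_999,
    "exactly_limit.pdf": 10_000_000,
    "just_over_limit.pdf": 10_000_001,
}

os.makedirs("upload-fixtures", exist_ok=True)
for name, size in SIZES.items():
    with open(os.path.join("upload-fixtures", name), "wb") as f:
        if size:
            f.seek(size - 1)   # seek then write one byte => file of exactly `size` bytes
            f.write(b"\0")
```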
Learning Tip: Build a "boundary matrix" for your most complex forms — a simple spreadsheet where rows are fields and columns are boundary types (min valid, min-1, max valid, max+1, empty, null, wrong type). Fill it with AI help, then use it as the source for generating test data prompts. This matrix is reusable every time the form changes — just update the constraint column and regenerate. Teams that maintain a boundary matrix catch more regression bugs with less re-work.
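The matrix itself can also live as data next to the test suite, so regenerating prompts (or parametrized cases) after a form change is mechanical. A hypothetical slice:

```python
# rows = fields, columns = boundary types; extend with empty / null / wrong-type as needed
BOUNDARY_MATRIX = {
    # field:        (min_valid, below_min,  max_valid,     above_max)
    "username":     ("abc",     "ab",       "a" * 30,      "a" * 31),
    "amount_usd":   ("0.50",    "0.49",     "999999.99",   "1000000.00"),
}

for field, (min_ok, below, max_ok, above) in BOUNDARY_MATRIX.items():
    print(f"{field}: valid boundaries {min_ok!r}, {max_ok!r}; "
          f"invalid boundaries {below!r}, {above!r}")
```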
How to Use AI to Generate Reusable Data Factories and Test Fixtures?
Individual test data values are useful, but reusable data factories are what enable clean, maintainable, isolated tests at scale. AI can generate factory classes and fixture functions that produce unique, valid test data on demand — with full control over overrides for specific test scenarios.
Generating a TypeScript Test Data Factory
Generate a TypeScript test data factory for our User entity.
USER ENTITY TYPE:
interface User {
id: string; // UUID
email: string; // valid email
password_hash: string; // bcrypt hash of password
first_name: string;
last_name: string;
role: 'user' | 'admin' | 'viewer';
is_active: boolean;
created_at: Date;
updated_at: Date;
profile: {
bio: string | null;
avatar_url: string | null;
timezone: string; // IANA timezone
}
}
REQUIREMENTS:
- Factory function accepts Partial<User> overrides
- Generates unique emails using a counter or timestamp to prevent collisions
- Provides sensible defaults for all fields
- Exports both a `buildUser()` function (in-memory object) and a `createUser()`
function (creates in DB via our ApiClient)
- Includes builder variants: buildAdminUser(), buildInactiveUser(), buildViewerUser()
DEPENDENCIES AVAILABLE:
- import { ApiClient } from '../helpers/api-client'
- import { v4 as uuidv4 } from 'uuid'
- Use faker from @faker-js/faker for realistic data values
Expected output shape:
import { faker } from '@faker-js/faker';
import { v4 as uuidv4 } from 'uuid';
import { ApiClient } from '../helpers/api-client';
import type { User } from '../types/user'; // adjust the path to wherever the User interface lives
let emailCounter = 0;
function generateUniqueEmail(prefix = 'user'): string {
return `test+${prefix}${++emailCounter}+${Date.now()}@example.com`;
}
export function buildUser(overrides: Partial<User> = {}): User {
return {
id: uuidv4(),
email: generateUniqueEmail(),
password_hash: '$2b$10$mockhashfortestingpurposes',
first_name: faker.person.firstName(),
last_name: faker.person.lastName(),
role: 'user',
is_active: true,
created_at: new Date(),
updated_at: new Date(),
profile: {
bio: null,
avatar_url: null,
timezone: 'America/New_York',
},
...overrides,
};
}
export function buildAdminUser(overrides: Partial<User> = {}): User {
return buildUser({ role: 'admin', ...overrides });
}
export function buildInactiveUser(overrides: Partial<User> = {}): User {
return buildUser({ is_active: false, ...overrides });
}
export async function createUser(
client: ApiClient,
overrides: Partial<User> = {}
): Promise<User> {
const userData = buildUser(overrides);
return await client.createUser(userData);
}
Generating Pytest Fixtures for API Tests
Generate pytest fixtures for our API test suite.
CONTEXT:
- Framework: Pytest
- HTTP client: requests library
- Auth: Bearer token via POST /api/v1/auth/login
- API base URL: from environment variable API_BASE_URL
GENERATE THESE FIXTURES:
1. `base_url` (session scope): returns API_BASE_URL from environment
2. `api_client` (function scope): authenticated requests.Session with Bearer token
3. `admin_api_client` (function scope): authenticated as admin user
4. `test_user` (function scope): creates a user via API, yields the user object,
deletes the user after the test
5. `test_product` (function scope): creates a product with a known price and stock,
yields the product, deletes after test
6. `test_order` (function scope, depends on `test_user` and `test_product`):
creates an order in "pending" status for the test user, yields the order,
deletes after test
Output for the test_user fixture:
import time
import warnings

import pytest

@pytest.fixture
def test_user(api_client):
    """Creates a test user before the test, deletes it after."""
payload = {
"email": f"test+user{int(time.time())}@example.com",
"password": "TestPassword1!",
"first_name": "Test",
"last_name": "User",
"role": "user"
}
response = api_client.post("/api/v1/users", json=payload)
response.raise_for_status()
user = response.json()
yield user
# Cleanup: delete user regardless of test outcome
try:
api_client.delete(f"/api/v1/users/{user['id']}")
except Exception as e:
warnings.warn(f"Failed to clean up test user {user['id']}: {e}")
The try/except around cleanup is important — AI won't include it unless you ask. Without it, a cleanup failure causes the next test to fail with a misleading error. Always request robust cleanup that doesn't fail the test on teardown errors.
Generating Playwright Fixtures with Database Seeding
Generate a Playwright fixture that seeds an authenticated user session.
CONTEXT:
- Playwright version: 1.44, TypeScript
- We use storageState for session management
- Auth endpoint: POST /api/v1/auth/login
- Our db-helpers module exports: createTestUser(overrides), deleteTestUser(email)
- We want the fixture to:
1. Create a user in the DB
2. Log in via API to get a token
3. Set browser storage state with the token
4. Yield the page object ready to use
5. Delete the user after the test
Make the fixture accept an optional `role` parameter (default 'user').
Learning Tip: Write your factory's `buildX()` (pure in-memory) and `createX()` (writes to the DB) functions separately. Tests that only need to assert on data structure should use `buildX()` — it's 100x faster than hitting the database. Tests that need the entity to actually exist in the system should use `createX()`. If all your factories write to the DB by default, your test suite will be much slower than it needs to be.
What Test Data Should You Never Send to an AI Tool?
This is the most important security section in the module. AI tools — including Claude Code, Gemini CLI, GitHub Copilot, and all web-based AI assistants — transmit your input to external cloud services. Sending the wrong data creates legal liability, violates data protection regulations, and can expose your customers or organization to real harm.
Categories of Data Never to Send
Personal Identifiable Information (PII) from production
- Real customer names, email addresses, phone numbers, physical addresses
- Real government IDs (SSN, passport numbers, national ID numbers, driving licence numbers)
- Real IP addresses of actual users
- Real device IDs or tracking identifiers linked to real people
- Real dates of birth tied to named individuals
Payment and Financial Data
- Real credit card numbers (even test transactions from real cards)
- Real bank account numbers, routing numbers, IBAN numbers
- Real CVV codes (these should never even be stored — if you have them, that's a PCI-DSS violation separate from the AI issue)
- Real transaction records that would identify a customer's spending
Healthcare and Sensitive Personal Data
- Real patient records, medical diagnoses, medication information
- Real mental health records
- Real biometric data (fingerprints, face scan IDs, voiceprints)
- Any data under HIPAA jurisdiction in the US, or health data under GDPR
Authentication Credentials
- Real passwords (plaintext or hash)
- Production API keys, access tokens, refresh tokens
- SSH private keys, certificate private keys
- OAuth client secrets
- Database connection strings with real credentials
Business-Sensitive Data
- Unreleased product roadmaps or feature specs (if covered by NDAs)
- Customer contract terms and pricing
- Internal financial data
How to Anonymize Data Before Sending
If you have a real data structure that you want to use as a template for AI test data generation, anonymize it first:
cat real-api-response.json | jq '
.email = "[email protected]" |
.phone = "+1-555-000-0000" |
.first_name = "Test" |
.last_name = "User" |
.address = "123 Test Street" |
.ssn = "XXX-XX-XXXX" |
.credit_card = "4111 1111 1111 1111"
'
For database query results with multiple rows:
import json

def anonymize_row(row):
    """Replace PII fields with safe placeholders."""
    pii_fields = {'email', 'phone', 'first_name', 'last_name',
                  'address', 'ssn', 'ip_address', 'device_id'}
    return {k: f"REDACTED_{k.upper()}" if k in pii_fields else v
            for k, v in row.items()}

# production_rows: the list of dicts returned by your database query
anonymized = [anonymize_row(row) for row in production_rows[:3]]
print(json.dumps(anonymized, indent=2))
Safe Alternatives for Realistic Test Data
Instead of real production data:
| Instead of | Use |
|---|---|
| Real customer emails | Generated emails: test+{uuid}@example.com |
| Real phone numbers | Fake phone libraries: faker.js, Faker (Python) |
| Real SSNs | Test SSNs with invalid prefix: 000-00-0000 |
| Real credit cards | Published test card numbers: 4242 4242 4242 4242 |
| Real names | Faker library or AI-generated fictional names |
| Real addresses | Test addresses from USPS/Royal Mail published test sets |
| Real API responses | Sanitized copies with PII fields replaced |
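Most of these alternatives can be produced locally with the Faker library, which keeps the data fictional by construction. A minimal sketch (the phone value uses the 555-01xx range reserved for fictional use in the North American numbering plan):

```python
from uuid import uuid4
from faker import Faker

fake = Faker()

def safe_test_user() -> dict:
    """A fictional user record that is safe to include in an AI prompt."""
    return {
        "first_name": fake.first_name(),
        "last_name": fake.last_name(),
        "email": f"test+{uuid4().hex[:8]}@example.com",   # example.com is IANA-reserved
        "phone": "+1-555-0100",                           # fictional NANP number
        "credit_card": "4242 4242 4242 4242",             # published test card number
    }

print(safe_test_user())
```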
Organizational Policy Prompts
When prompting AI for test data, include a reminder to generate only fictional data:
Generate test data for this scenario.
IMPORTANT DATA POLICY:
- All generated data must be completely fictional
- Do not generate realistic-looking SSNs, passport numbers, or government IDs
(use format like "XXX-XX-0000" or "PASS-TEST-001")
- Use fictional names and addresses only
- Use published test credit card numbers (4242 4242 4242 4242) not realistic-looking fake numbers
- All emails should use the @example.com domain (IANA-reserved for examples)
Learning Tip: Create a one-page "AI Data Policy" document for your QA team that specifically lists: what data categories are never sent to AI, how to anonymize data before sending, and what the approved alternatives are for each PII category. Post it in your team's Confluence or Notion. A single incident of real customer data being sent to an AI tool can trigger regulatory notification requirements under GDPR, CCPA, or HIPAA. Five minutes of policy clarity prevents that risk entirely.