
Property-Based Testing and Mutation Testing

Example-based tests verify that your code is right on the cases you thought of — property-based tests try to break it on cases you did not, and mutation testing tells you whether your tests would even notice.

The Limits of Example-Based Tests

Every test you write with a specific input and expected output is an example-based test. They are easy to reason about, fast to run, and straightforward to debug when they fail. They are also fundamentally bounded by your imagination. You can only test the inputs you thought to test, which means your test suite has gaps wherever your mental model of the system has gaps.

This limitation is especially significant when AI writes your production code. AI-generated code can contain subtle logic errors that look correct at a glance and pass all your example-based tests because the examples you chose happen to avoid the problematic edge case. An AI might implement a sorting function that fails when all elements are equal, a date calculation that is wrong in leap years, or a discount formula that produces negative values when the discount exceeds the order total. If your examples do not exercise these conditions, your tests will not catch them.

Property-based testing addresses this by generating thousands of random inputs and verifying that a property of the output holds — not a specific value, but an invariant that must be true for all valid inputs. Instead of "given input 5, the sorted output is [1, 2, 5]," you assert "for any list of integers, the sorted output has the same length as the input, every element in the input appears in the output, and no element in the output is greater than the element after it." This property must hold for all inputs, and the testing framework will try thousands of random cases to find one that violates it.
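The sorting property above can be sketched as a hand-rolled check in plain JavaScript. This is illustrative only: a real suite would let fast-check or Hypothesis generate the inputs and drive the loop, and the helper names here are invented for the example.

```javascript
// Hand-rolled property check for "sorted output has the same length, the same
// elements, and is ordered". A library like fast-check replaces this loop.

function randomIntList(maxLen = 20) {
  const len = Math.floor(Math.random() * maxLen);
  return Array.from({ length: len }, () => Math.floor(Math.random() * 201) - 100);
}

function checkSortProperties(sortFn, runs = 1000) {
  for (let i = 0; i < runs; i++) {
    const input = randomIntList();
    const output = sortFn([...input]);
    // Property 1: same length as the input.
    if (output.length !== input.length) return { ok: false, input };
    // Property 2: no element is greater than the element after it.
    for (let j = 1; j < output.length; j++) {
      if (output[j - 1] > output[j]) return { ok: false, input };
    }
    // Property 3: same elements (multiset equality via signed counts).
    const counts = new Map();
    for (const x of input) counts.set(x, (counts.get(x) || 0) + 1);
    for (const x of output) counts.set(x, (counts.get(x) || 0) - 1);
    if ([...counts.values()].some(v => v !== 0)) return { ok: false, input };
  }
  return { ok: true };
}

console.log(checkSortProperties(a => a.sort((x, y) => x - y)).ok); // true
// A "sort" that silently drops duplicates violates property 1 almost immediately:
console.log(checkSortProperties(a => [...new Set(a)].sort((x, y) => x - y)).ok);
```

Note that the failing case comes back with the input that broke the property, which is exactly the debugging handle a real framework gives you (plus shrinking, covered below).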

Learning tip: When you find yourself writing 6 or more example-based tests for the same function to cover different cases, that is a signal that property-based testing might serve you better. You are already encoding an invariant — just write it as a property instead.

Identifying Properties from Business Invariants

The hardest part of property-based testing is writing good properties. A property is a statement about what is always true, not what is specifically true for one case. Business logic is usually rich in invariants, but they are often unstated — they are assumed to be obvious. AI is useful here: given a description of the business logic, it can enumerate the implied invariants that your code must satisfy.

Common categories of useful properties:

Idempotency: Applying an operation twice produces the same result as applying it once. Relevant for: formatters, normalizers, deduplicated collections, cache warming.

Roundtrip / inverse: An encode operation followed by a decode operation returns the original value. Relevant for: serialization, encoding, encryption (with key), compression.

Conservation: A quantity is preserved through a transformation. Relevant for: financial calculations (the total before a split must equal the sum of the parts after it), data transformations (rows routed into several partitions must sum to the original row count).

Monotonicity: As the input increases, the output increases (or consistently decreases). Relevant for: pricing tiers (a larger order total should never be charged a smaller total fee than a smaller order), rankings, scoring.

Commutativity: The order of inputs does not change the output. Relevant for: set operations, additive calculations.

Boundary preservation: Outputs stay within valid ranges regardless of inputs. Relevant for: percentage calculations (never goes below 0 or above 100), date arithmetic (resulting date is a valid date), proration (discount never exceeds the original price).
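Two of the categories above, sketched as hand-rolled checks in plain JavaScript (the function names and value ranges are invented for illustration, not from any library):

```javascript
// Roundtrip: decode(encode(x)) preserves x for JSON-safe values.
function roundtripHolds(value) {
  return JSON.stringify(JSON.parse(JSON.stringify(value))) === JSON.stringify(value);
}
console.log(roundtripHolds({ plan: 'pro', seats: [1, 2, 3] })); // true

// Boundary preservation: a percentage stays in [0, 100] whatever the inputs.
function percentOf(part, whole) {
  if (whole <= 0) return 0; // guard the degenerate case explicitly
  return Math.min(100, Math.max(0, (part / whole) * 100));
}

for (let i = 0; i < 1000; i++) {
  const part = Math.random() * 2000 - 500;  // deliberately out-of-range inputs
  const whole = Math.random() * 1000 - 100; // includes zero and negatives
  const p = percentOf(part, whole);
  if (p < 0 || p > 100) throw new Error(`boundary violated: ${p}`);
}
console.log('boundary property held on 1000 random inputs');
```

The boundary loop deliberately generates nonsense inputs (negative parts, zero wholes): a boundary property earns its keep precisely on the inputs no one would write an example test for.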

To find properties in your domain: write the invariants in plain English first, then translate them into code. AI can generate the translation once you have the plain-English version.

Learning tip: Ask AI to list business invariants for a domain before asking it to write properties. "What must always be true about a proration calculation for a subscription billing system?" produces better properties than "write property-based tests for this function."

Tools: fast-check, Hypothesis, and PropTest

fast-check (JavaScript/TypeScript) is the most widely used property-based testing library for the JS ecosystem. It integrates with Jest and Vitest and provides a rich library of arbitraries (generators for specific types) as well as the ability to define custom arbitraries. When a property fails, fast-check automatically shrinks the failing case to the simplest possible input that reproduces the failure.
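Shrinking is what makes random failures debuggable. fast-check does it automatically inside fc.assert; the hand-rolled binary search below only illustrates the idea for a single integer input (the property and numbers are made up for the example):

```javascript
// Toy illustration of shrinking: once some failing input is found, search
// for the smallest input that still fails the property.

const property = n => n * 2 < 1000; // fails for every n >= 500

function shrinkInt(failing, prop) {
  let lo = 0, hi = failing;                  // smallest failing value lies in (lo, hi]
  while (lo + 1 < hi) {
    const mid = Math.floor((lo + hi) / 2);
    if (prop(mid)) lo = mid; else hi = mid;  // keep narrowing toward the boundary
  }
  return hi;
}

// Suppose random search stumbled on 873 as a counterexample:
console.log(shrinkInt(873, property)); // 500 — the smallest failing input
```

Reporting 500 instead of 873 tells you immediately that the failure is about a threshold, not about some property of the particular random value.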

Hypothesis (Python) is the Python equivalent, with an arguably more mature shrinking engine and a powerful @given decorator API. It integrates with pytest and maintains a database of previously failing examples so it reruns them on every test run, providing regression protection.

PropTest (Rust) and QuickCheck (the original, in Haskell, with ports to many languages) follow the same model. Most languages have at least one actively maintained property-based testing library.

The AI workflow for each tool is similar: describe the business invariants, provide the function signature and any constraints on input ranges, and ask AI to generate both the arbitrary (the input generator) and the property assertion.

Learning tip: Start with fc.integer(), fc.string(), and fc.record() — the most basic fast-check arbitraries. Once you are comfortable reading failing cases, move to domain-specific custom arbitraries that generate realistic business objects.

Mutation Testing to Evaluate Test Suite Strength

Coverage metrics tell you which lines of code are executed during tests. They do not tell you whether your tests would catch a defect in that code. A test that covers a line but makes no meaningful assertion about it is providing false coverage.

Mutation testing answers the more useful question: if we introduce a defect into the code, will the test suite catch it? A mutation testing framework (Stryker for JavaScript, mutmut for Python, PITest for Java) makes small, systematic changes to the production code — flipping a > to >=, changing + to -, negating a conditional — and runs the test suite after each change. A mutation that causes a test to fail is "killed." A mutation that all tests pass on is "survived." The ratio of killed mutations to total mutations is your mutation score.
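The loop a mutation tester runs can be sketched in a few lines. This toy version mutates the source text with string replacement and eval, purely for illustration — real tools like Stryker mutate the AST and run your actual test runner:

```javascript
// Minimal sketch of the mutation-testing loop: apply a small edit, re-run the
// tests, count kills. The function, mutations, and "suite" are all invented.

const source = `(function isAdult(age) { return age >= 18; })`;

const mutations = [
  { find: '>=', replace: '>' }, // boundary mutation
  { find: '>=', replace: '<' }, // negated comparison
];

// A weak suite: it never tests the boundary value 18.
const suite = fn => fn(30) === true && fn(5) === false;

let killed = 0;
for (const m of mutations) {
  const mutant = eval(source.replace(m.find, m.replace));
  if (!suite(mutant)) killed++; // a failing suite "kills" the mutant
  else console.log(`survived: ${m.find} -> ${m.replace}`);
}
console.log(`mutation score: ${killed}/${mutations.length}`);
```

The `>=` to `>` mutant survives because no test exercises age 18 — exactly the "coverage without teeth" situation mutation testing exists to expose.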

A high mutation score (80%+) indicates a test suite that is sensitive to the kinds of defects mutations simulate. A low mutation score indicates a test suite with coverage but no teeth — tests that run code without asserting strongly enough to catch changes.

Mutation testing is computationally expensive (it runs the test suite once per mutation), so it is typically run on CI on a schedule rather than on every commit, or scoped to the most critical modules.

Learning tip: Run mutation testing on your highest-risk module first, not the whole codebase. The findings from one module will teach you which patterns create surviving mutations, and you can apply that learning everywhere.

Using AI to Interpret Mutation Testing Results and Improve Tests

Mutation testing reports can be verbose and difficult to parse. AI is effective at interpreting them: given the list of surviving mutations and the corresponding test file, AI can identify the pattern of weakness (are tests never asserting on return values? are tests using .toContain instead of .toEqual?) and suggest specific improvements.

The most useful AI workflow for mutation testing:

  1. Run mutation testing on a module and export the surviving mutations list.
  2. Paste the list into an AI conversation with the test file.
  3. Ask AI to categorize the survivors by type (boundary confusion, missing assertion, missing test case).
  4. Ask AI to generate replacement or additional tests that would kill each category of survivor.
  5. Re-run mutation testing after adding the new tests to verify the score improved.

AI can also help write the initial property-based tests specifically to kill mutations that example-based tests consistently miss — mutations that change boundary conditions and mathematical operators are the hardest to kill with hand-written example tests and easiest to kill with property-based tests.

Learning tip: A surviving mutation that changes a > to >= almost always indicates a missing boundary test. These are the easiest mutations to kill — add one test that asserts the specific behavior at the boundary.
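A sketch of that one boundary test, using a hypothetical free-shipping rule (orders strictly over $50 ship free) rather than any code from this chapter:

```javascript
// The single assertion that kills a `>` vs `>=` mutant is the one that pins
// down behavior exactly at the boundary.

const freeShipping = total => total > 50;

console.log(freeShipping(50));    // false — exactly $50 does not qualify
console.log(freeShipping(50.01)); // true

// The mutant `total >= 50` returns true at exactly 50, so asserting
// freeShipping(50) === false kills it; tests at 30 or 80 never would.
```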

How Property-Based Tests Catch AI Code Hallucinations

AI-generated code has a specific failure mode: it looks correct, follows the structure of the problem, and passes the example tests you provide, but contains a subtle logical error in an edge case the examples did not exercise. This is a "hallucination of correctness" — the code looks right, feels right, and tests right on the obvious inputs.

Property-based tests are particularly effective at exposing this failure mode because they do not rely on the engineer's mental model of which inputs to test. The testing framework generates thousands of inputs, many of which the engineer would not have thought of, and verifies the invariant holds on all of them.

Concrete examples of AI code hallucinations that property-based tests catch: a discount calculator that underflows to a large positive number when passed a coupon with a large fixed-amount discount on a small order (integer underflow in a language with unsigned integers), a string tokenizer that fails when the input contains consecutive delimiters, a pagination function that returns duplicate items when the page boundary aligns with a deleted item.

None of these would be caught by typical example-based tests unless the engineer specifically anticipated them. All of them are caught immediately by property-based tests asserting basic invariants (total cannot exceed original price, all tokens are non-empty strings, no item appears twice in paginated results).
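The tokenizer case is easy to reproduce. A hand-rolled property loop (illustrative; fast-check would normally generate the strings) finds the empty-token violation almost immediately:

```javascript
// "All tokens are non-empty" as a property. A naive split violates it on
// consecutive delimiters, leading/trailing delimiters, and the empty string.

const naiveTokenize = s => s.split(',');
const fixedTokenize = s => s.split(',').filter(t => t.length > 0);

function allTokensNonEmpty(tokenize, runs = 1000) {
  const alphabet = 'ab,';
  for (let i = 0; i < runs; i++) {
    const len = Math.floor(Math.random() * 10);
    const input = Array.from({ length: len },
      () => alphabet[Math.floor(Math.random() * alphabet.length)]).join('');
    if (tokenize(input).some(t => t.length === 0)) return false;
  }
  return true;
}

console.log(allTokensNonEmpty(naiveTokenize)); // false — "a,,b" or "" turns up fast
console.log(allTokensNonEmpty(fixedTokenize)); // true
```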

Learning tip: After any AI generates a non-trivial algorithm, write at least one property-based test before shipping. The investment is small and the protection against subtle hallucinations is high.


Hands-On: Write Properties and Run Mutation Testing for a Proration Calculator

This exercise uses a subscription billing proration function: given a plan's monthly price, the day of the month the user upgrades, and the number of days in the billing month, calculate the prorated charge for the remainder of the month.

Step 1: Elicit business invariants from AI.

I have a subscription billing proration function with this signature:

function calculateProration(monthlyPrice: number, upgradeDay: number, daysInMonth: number): number

It should return the prorated charge for the remaining days of the billing month after an upgrade.

Before writing any tests: list every business invariant that must hold for this function. Think about:
- Valid and invalid input ranges
- What the minimum and maximum return values can be
- Relationships between inputs and outputs
- Conservation properties (does the math add up?)
- Edge cases that could produce nonsensical results

Write invariants in plain English. Do not write code yet.

Expected output: 8-12 invariants such as: result is always non-negative, result never exceeds monthlyPrice, if upgradeDay is 1 the result equals monthlyPrice, if upgradeDay equals daysInMonth the result is zero or near-zero, the ratio (result / monthlyPrice) equals (daysInMonth - upgradeDay + 1) / daysInMonth, and so on.

Step 2: Convert invariants to fast-check property tests.

Convert these invariants into fast-check property tests. Use the following arbitraries:
- monthlyPrice: fc.double({ min: 0.01, max: 9999.99, noNaN: true })
- upgradeDay: fc.integer({ min: 1, max: 31 })
- daysInMonth: fc.integer({ min: 28, max: 31 })
- Add a filter: upgradeDay <= daysInMonth

For each invariant, write one property test using fc.assert(fc.property(...)) with:
- A clear test description
- The correct arbitrary composition
- A meaningful assertion

Use Jest + fast-check.

Expected output: 6-10 property tests, each asserting a different invariant with appropriately composed arbitraries.

Step 3: Run the properties against a known-good and a broken implementation.

Here is one implementation of calculateProration. Run these property tests mentally against it:

// Implementation A (correct):
function calculateProration(monthlyPrice, upgradeDay, daysInMonth) {
  const remainingDays = daysInMonth - upgradeDay + 1;
  return Math.round((monthlyPrice * remainingDays / daysInMonth) * 100) / 100;
}

// Implementation B (subtly wrong):
function calculateProration(monthlyPrice, upgradeDay, daysInMonth) {
  const remainingDays = daysInMonth - upgradeDay; // off-by-one: missing + 1
  return Math.round((monthlyPrice * remainingDays / daysInMonth) * 100) / 100;
}

Which of our property tests would catch the bug in Implementation B? Which would miss it?
Suggest additional properties that would catch this specific off-by-one error.

Expected output: Analysis showing which properties catch the off-by-one, which do not, and new properties specifically targeting the +1 boundary (e.g., "on day 1 of the month, the result should equal the full monthly price").
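For illustration, here is Step 3 actually executed: both implementations from the prompt, run against the day-1 boundary property with a hand-rolled loop in place of fast-check. Prices are generated as whole cents so the floating-point equality is exact.

```javascript
function prorationA(monthlyPrice, upgradeDay, daysInMonth) {
  const remainingDays = daysInMonth - upgradeDay + 1;
  return Math.round((monthlyPrice * remainingDays / daysInMonth) * 100) / 100;
}

function prorationB(monthlyPrice, upgradeDay, daysInMonth) {
  const remainingDays = daysInMonth - upgradeDay; // off-by-one: missing + 1
  return Math.round((monthlyPrice * remainingDays / daysInMonth) * 100) / 100;
}

// Property: upgrading on day 1 charges the full monthly price.
function fullMonthOnDayOne(proration, runs = 500) {
  for (let i = 0; i < runs; i++) {
    const cents = 100 + Math.floor(Math.random() * 999900); // $1.00 .. $9999.99
    const price = cents / 100;
    const days = 28 + Math.floor(Math.random() * 4);        // 28 .. 31
    if (proration(price, 1, days) !== price) return false;
  }
  return true;
}

console.log(fullMonthOnDayOne(prorationA)); // true  — day 1 charges the full price
console.log(fullMonthOnDayOne(prorationB)); // false — the off-by-one undercharges
```

A generic "result never exceeds monthlyPrice" property would pass both implementations, which is why the boundary-targeting property matters here.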

Step 4: Set up mutation testing with Stryker and interpret results.

I ran Stryker mutation testing on calculateProration and got these surviving mutations:

1. Changed `daysInMonth - upgradeDay + 1` to `daysInMonth - upgradeDay - 1` (ArithmeticOperator mutant)
2. Changed `*` to `/` in `monthlyPrice * remainingDays` (ArithmeticOperator mutant)
3. Changed `Math.round` to `Math.floor` (MethodExpression mutant)

My current tests are:
[paste test file]

For each surviving mutation:
1. Explain why the current tests are not killing it
2. Write a specific test (example-based or property-based) that would kill it
3. Indicate whether this is a real risk or a mutation that is equivalent to the correct behavior

Expected output: Analysis of three surviving mutations with targeted tests and assessment of whether each represents a real bug risk.
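As a sketch of what Step 4's output might look like, here are three example tests, one per surviving mutant, written against the correct implementation (repeated so the snippet is self-contained; the specific dollar amounts are illustrative):

```javascript
function calculateProration(monthlyPrice, upgradeDay, daysInMonth) {
  const remainingDays = daysInMonth - upgradeDay + 1;
  return Math.round((monthlyPrice * remainingDays / daysInMonth) * 100) / 100;
}

const check = (cond, msg) => { if (!cond) throw new Error(msg); };

// Kills the off-by-one mutant: on day 1 the full month is charged.
check(calculateProration(300, 1, 30) === 300, 'day-1 boundary');

// Kills the `*` -> `/` mutant: a half-month upgrade costs half the price,
// a magnitude the divided version cannot reproduce.
check(calculateProration(300, 16, 30) === 150, 'half-month magnitude');

// Kills the Math.round -> Math.floor mutant: 100 * 20 / 30 = 66.666...,
// which rounds up to 66.67 but floors to 66.66.
check(calculateProration(100, 11, 30) === 66.67, 'rounding direction');

console.log('all mutant-killing tests passed');
```

Each test pins the exact value a specific mutant would change, which is the general recipe: read the surviving mutation, then choose inputs where the original and the mutant disagree.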

Step 5: Evaluate test suite improvement.

Before adding the new tests, my mutation score was 62%. After adding them, what mutation score would you estimate and why? What is the practical meaning of moving from 62% to a higher score for this particular function?

Expected output: Estimated new score with reasoning, and a plain-English explanation of what the improvement means for confidence in the function's correctness.


Key Takeaways

  • Property-based tests verify invariants across thousands of random inputs, catching defects that example-based tests miss because the examples were never adversarial enough.
  • The hardest part of property-based testing is writing good properties — start by listing invariants in plain English and having AI translate them to code.
  • Mutation testing measures whether your test suite can actually detect defects, not just whether it achieves coverage metrics.
  • AI is effective at interpreting mutation testing results and generating targeted tests to kill specific surviving mutations.
  • Property-based tests are particularly effective at catching the "hallucinations of correctness" that AI-generated code produces — subtle logic errors on edge cases that look correct on typical inputs.