
Test Driven Development With AI

TDD with AI does not make the thinking faster — it makes the implementation step nearly instant, which means your tests need to be better than ever.

Why TDD Changes When AI Is Your Pair

Traditional TDD's hidden benefit was always that writing tests first forced you to think about the interface before the implementation. The process of making a test pass from scratch — figuring out the function signature, the edge cases, the state transitions — taught you something about the problem. The test was a specification, and the act of writing it was a design activity.

When AI implements the code, this benefit does not disappear — it becomes more urgent. The AI will implement whatever your test specifies, quickly and convincingly. If your test is vague, the AI's implementation will also be vague. If your test has a subtle flaw in its preconditions, the AI will write code that passes the test and fails in production. The tests are still the specification language; you are still responsible for writing them well.

The practical difference is velocity. In traditional TDD, the bottleneck is implementation. You write a test, you sit with the problem, you implement step by step. With AI, the bottleneck shifts entirely to test design. The AI can implement a function in 30 seconds. That means you can run 10 red-green cycles per hour instead of 1 or 2. This speed is an advantage only if you are writing good tests.

Learning tip: Budget the same amount of time for writing each test as you did before using AI. The implementation is faster, not the thinking. Rushing the test design step is the fastest way to build something that works on paper and fails in the field.

The AI-Augmented TDD Loop

The red-green-refactor cycle adapts cleanly to AI assistance, with a modified loop:

  1. Write a failing test that specifies one behavior precisely.
  2. Run the test and confirm it fails for the right reason (not a syntax error or missing import).
  3. Paste the failing test into AI with context about the system and ask it to implement the minimum code to make the test pass.
  4. Run the tests. If they pass, proceed. If they fail, check whether the AI misunderstood the spec or introduced a regression.
  5. Ask AI to refactor the implementation for clarity and simplicity, then run tests again to confirm they still pass.
  6. Write the next failing test.

The refactor step is where AI adds unexpected value. AI is good at simplifying implementations that grew organically through iteration. It can suggest cleaner data structures, extract functions, and eliminate duplication. But you must ask for this explicitly: when its goal is simply to make tests pass, AI will not refactor unprompted.

A critical check at step 4: verify that the AI made the failing test pass by implementing the correct logic, not by hardcoding the expected value or special-casing the test input. This is a real failure mode. An AI that sees expect(add(2, 3)).toBe(5) may return if (a === 2 && b === 3) return 5. Run a second test with different inputs immediately to eliminate this possibility.

Learning tip: After each AI implementation, add a quick "negative test" — a test with different inputs that should produce a different result. This rules out hardcoding in about 10 seconds.
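For example, if the suspect is the add function above, the negative test is a minimal sketch like this:

// Different inputs from the original example; this fails if the AI special-cased (2, 3).
test('add works for inputs other than the original example', () => {
  expect(add(2, 3)).toBe(5);   // the original example still holds
  expect(add(10, 7)).toBe(17); // a hardcoded implementation fails here
  expect(add(-1, 1)).toBe(0);  // and a sign-crossing case for good measure
});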

Using Tests as a Specification Language for AI

The most powerful application of TDD with AI is using tests as the sole specification for a function, bypassing prose documentation entirely. Instead of writing "the rate limiter should allow N requests per window and reject subsequent requests," you write a test that sets up a time window, makes N requests, verifies they succeed, then makes one more and verifies it fails.

This specification-by-test approach has significant advantages for AI-assisted development. Tests are unambiguous in a way prose is not. When you write expect(limiter.check('user-1')).toBe(true) five times and then expect(limiter.check('user-1')).toBe(false) on the sixth call (with a 5-per-minute limit), the AI cannot misinterpret that requirement. There is exactly one correct implementation.
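Spelled out as code, that requirement might look like this (a sketch using the RateLimiter interface from the hands-on exercise below):

test('allows 5 requests per minute and rejects the 6th', () => {
  const limiter = new RateLimiter({ maxRequests: 5, windowMs: 60000 });
  for (let i = 0; i < 5; i++) {
    expect(limiter.check('user-1')).toBe(true); // the first five calls succeed
  }
  expect(limiter.check('user-1')).toBe(false); // the sixth is rejected
});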

The technique requires writing tests in the right order, from the simplest behavior to the most complex. Start with the most trivial case, then add behaviors one test at a time. Each test should add exactly one new constraint to the implementation. If a test adds two constraints at once, split it.

A useful framing: each test is a sentence in a programming language that describes one thing the system must do. The complete test suite is the complete specification. If the AI can satisfy all the tests, it has built the right thing.
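As an illustration of adding one constraint per test, here is a hypothetical slugify function specified from simplest behavior to most complex (the function and its behavior are invented for this sketch):

// Test 1: the trivial case. A lowercase word passes through unchanged.
test('returns a lowercase word unchanged', () => {
  expect(slugify('hello')).toBe('hello');
});

// Test 2: adds exactly one constraint. Spaces become hyphens.
test('replaces spaces with hyphens', () => {
  expect(slugify('hello world')).toBe('hello-world');
});

// Test 3: adds one more constraint. Mixed case is lowered.
test('lowercases mixed-case input', () => {
  expect(slugify('Hello World')).toBe('hello-world');
});

Each test forces one new decision in the implementation, so a failure points at exactly one missing behavior.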

Learning tip: When you find yourself writing a comment in a test to explain what the code should do, stop and ask whether that comment belongs in the test name or in a new test. Comments in test bodies are usually a sign that one test is doing the job of two.

Outside-In TDD with AI (Starting from Acceptance Tests)

Outside-in TDD, also called acceptance-test-driven development or "London school" TDD, starts with an end-to-end acceptance test that describes the full behavior the user experiences, then drives the implementation inward by writing unit tests for each component the acceptance test requires.

With AI, this approach is particularly powerful. You write one high-level acceptance test describing the complete behavior. The acceptance test fails because nothing is implemented. You then work inward with AI: "what components does the system need to make this acceptance test pass?" AI generates a reasonable component breakdown. You write a unit test for the outermost component, have AI implement it with mocked dependencies, then move inward to the next component.

The advantage of this approach over bottom-up TDD is that you always have the acceptance test as the ultimate arbiter. When the acceptance test passes, you are done. Unit tests tell you which component broke; the acceptance test tells you whether the system works.

The practical challenge is that acceptance tests are slow and require more setup. For a web service, an acceptance test might require a running server, a test database, and a simulated HTTP client. This setup investment pays off when the acceptance test catches interactions between components that no unit test covers.

Learning tip: Write the acceptance test first in plain English as a comment, then convert it to code. The plain English version becomes the test description, which serves as readable documentation for the entire feature.
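For the rate limiter built in the hands-on exercise below, that might look like the following sketch, where createTestServer and sendRequest are hypothetical helpers standing in for whatever HTTP test setup your project already uses:

// Plain English first: "A client that sends more than 3 requests in one minute
// receives a 429 response; requests from other clients are unaffected."
test('a client exceeding 3 requests per minute gets 429, other clients are unaffected', async () => {
  const server = await createTestServer({ maxRequests: 3, windowMs: 60000 }); // hypothetical helper
  for (let i = 0; i < 3; i++) {
    expect((await sendRequest(server, 'client-a')).status).toBe(200);
  }
  expect((await sendRequest(server, 'client-a')).status).toBe(429); // over the limit
  expect((await sendRequest(server, 'client-b')).status).toBe(200); // an independent client
  await server.close();
});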

Pitfalls: Hardcoding, Skipping Refactor, and Spec Drift

Hardcoding to pass tests. AI will sometimes pass a test by producing an implementation that special-cases the test's inputs rather than implementing the general logic. This is most likely when: the test only has one input/output example, the function name is generic, or the test's expected value is a "magic number" with no obvious derivation. Mitigate by always writing at least two tests with different inputs before asking AI to implement.

Skipping the refactor step. AI implements code that passes tests. Passing tests does not mean clean code. AI-generated implementations are often verbose, use nested conditionals where a data structure would be cleaner, or duplicate logic across methods. If you skip the refactor step consistently, technical debt accumulates fast. Make "refactor for clarity" an explicit step in your loop prompt.

Spec drift. When you are iterating fast with AI, the test suite can drift away from the actual requirements. A test is modified to accommodate an AI implementation, then another test is weakened, and after 20 cycles the tests no longer specify what the system was originally supposed to do. Periodically review the test suite against the original requirements to check for drift.

Over-trusting green tests. If all tests pass after an AI implementation, the temptation is to move on immediately. Always do a brief manual review of what the AI implemented. Green tests mean the implementation satisfies your test specification — they do not guarantee the implementation is correct. The specification might be incomplete.

Learning tip: Every 5-7 TDD cycles, pause and read the implementation from top to bottom without looking at tests. Ask: does this do what I intended? AI-generated code that passes tests can still contain logic errors that no test currently covers.


Hands-On: Build a Rate Limiter with TDD and AI

This exercise walks through building a sliding window rate limiter using the AI-augmented TDD loop. A rate limiter allows up to N requests per time window for a given key, and rejects requests that exceed the limit.

Step 1: Write the first failing test — the simplest case.

I am doing TDD. Here is my first failing test. Do NOT implement the solution yet. Just confirm you understand what behavior this test specifies and what the minimum interface needs to look like:

test('allows first request under the limit', () => {
  const limiter = new RateLimiter({ maxRequests: 5, windowMs: 60000 });
  const result = limiter.check('user-1');
  expect(result).toBe(true);
});

Expected output: AI confirms it understands the interface (constructor with config, .check(key) returning boolean) without implementing anything.
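If you want that interface written down before moving on, the skeleton might look like this (one plausible sketch; no behavior yet, matching the "do not implement" instruction):

class RateLimiter {
  constructor({ maxRequests, windowMs }) {
    // configuration only; no rate-limiting logic yet
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
  }

  check(key) {
    // will return true if the request identified by `key` is allowed
    throw new Error('not implemented');
  }
}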

Step 2: Add the rejection test, then ask for implementation.

Here are two tests for a sliding window rate limiter. Implement the minimum code to make both pass:

test('allows requests under the limit', () => {
  const limiter = new RateLimiter({ maxRequests: 3, windowMs: 60000 });
  expect(limiter.check('user-1')).toBe(true);
  expect(limiter.check('user-1')).toBe(true);
  expect(limiter.check('user-1')).toBe(true);
});

test('rejects request that exceeds the limit', () => {
  const limiter = new RateLimiter({ maxRequests: 3, windowMs: 60000 });
  limiter.check('user-1');
  limiter.check('user-1');
  limiter.check('user-1');
  expect(limiter.check('user-1')).toBe(false);
});

Requirements:
- Use a sliding window algorithm (not fixed window)
- Store timestamps of requests, not just a count
- Do not hardcode the return values — check the actual request count

Expected output: A RateLimiter class implementation with real sliding window logic, timestamp storage, and the correct boolean return.
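One plausible shape for what comes back is sketched below. Your AI's version will differ in details, but it should count timestamps inside the window rather than special-casing these inputs:

class RateLimiter {
  constructor({ maxRequests, windowMs }) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.requests = new Map(); // key -> array of request timestamps (ms)
  }

  check(key) {
    const now = Date.now();
    const cutoff = now - this.windowMs;
    const timestamps = this.requests.get(key) || [];
    // Count only requests inside the current window (sliding, not fixed).
    const inWindow = timestamps.filter((t) => t > cutoff);
    if (inWindow.length >= this.maxRequests) {
      return false;
    }
    timestamps.push(now);
    this.requests.set(key, timestamps);
    return true;
  }
}

Note that expired timestamps are counted out but never removed from storage; that cleanup is a natural target for the refactor step later.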

Step 3: Add the sliding window behavior test and iterate.

My existing tests pass. Now add this test and update the implementation to make all three pass:

test('allows requests again after the window slides', () => {
  // Use fake timers to control time; enable them before constructing the limiter
  // so every timestamp it records comes from the controlled clock.
  jest.useFakeTimers();
  jest.setSystemTime(new Date('2024-01-01T00:00:00.000Z'));

  const limiter = new RateLimiter({ maxRequests: 2, windowMs: 1000 });

  limiter.check('user-1'); // t=0ms
  limiter.check('user-1'); // t=0ms (now at limit)
  expect(limiter.check('user-1')).toBe(false); // over limit

  jest.advanceTimersByTime(1001); // slide the window past the first two requests
  expect(limiter.check('user-1')).toBe(true); // the old requests have expired

  jest.useRealTimers();
});

Do not change the existing tests. Only extend the implementation.

Expected output: Updated implementation that correctly expires old timestamps and allows new requests after the window slides.

Step 4: Ask for the refactor.

All tests pass. Now refactor the implementation for clarity:
- Extract any magic numbers into named constants or constructor config
- Simplify the timestamp cleanup logic if it can be expressed more clearly
- Add JSDoc comments for public methods
- Ensure the class handles concurrent calls correctly (hint: is there a race condition in the cleanup step?)

Run the tests after refactoring to verify nothing broke. Show me only the refactored code and the test results.

Expected output: Cleaner implementation with named constants, clear method names, and a note about whether the concurrency issue was addressed.
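The refactored class might come back looking roughly like this (a sketch; the exact shape depends on what the earlier cycles produced):

class RateLimiter {
  /**
   * @param {{ maxRequests: number, windowMs: number }} config
   */
  constructor({ maxRequests, windowMs }) {
    this.maxRequests = maxRequests;
    this.windowMs = windowMs;
    this.requestLog = new Map(); // key -> timestamps (ms) of requests still inside the window
  }

  /**
   * Returns true if a request for `key` is allowed under the sliding window limit.
   * @param {string} key
   * @returns {boolean}
   */
  check(key) {
    const now = Date.now();
    const recent = this.pruneExpired(key, now);
    if (recent.length >= this.maxRequests) return false;
    recent.push(now);
    return true;
  }

  /** Drops timestamps older than the window and returns the surviving array. */
  pruneExpired(key, now) {
    const cutoff = now - this.windowMs;
    const live = (this.requestLog.get(key) || []).filter((t) => t > cutoff);
    this.requestLog.set(key, live);
    return live;
  }
}

Because check runs synchronously on a single Node.js event loop, there is no classic race condition here, but the AI's answer to the concurrency hint is still worth reading.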

Step 5: Add isolation — different keys should not interfere.

Add a test that verifies two different keys have independent rate limit counters. Then verify the existing implementation handles this correctly, or fix it if it does not.

The test: user-1 reaches the limit, then user-2 should still be allowed through.

Expected output: New test added, implementation verified to store per-key state correctly.
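The test the prompt describes might look like this:

test('different keys have independent counters', () => {
  const limiter = new RateLimiter({ maxRequests: 2, windowMs: 60000 });
  limiter.check('user-1');
  limiter.check('user-1');
  expect(limiter.check('user-1')).toBe(false); // user-1 has hit the limit
  expect(limiter.check('user-2')).toBe(true);  // user-2 is unaffected
});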

Step 6: Review the complete test suite for spec completeness.

Here is the complete test suite for our rate limiter. Review it against this specification:

Spec: A sliding window rate limiter that allows maxRequests requests per windowMs milliseconds per key. Excess requests are rejected. Once the window slides, a key that was over the limit is allowed through again. Different keys have independent counters. The limiter should handle high-volume concurrent use.

Identify:
1. Which spec requirements have no corresponding test
2. Which tests have assertions weak enough to not catch a real defect
3. What additional edge cases should be tested before shipping

Expected output: Gap analysis identifying 2-4 missing behaviors (common misses: zero maxRequests config, negative window times, keys with Unicode characters, behavior when the clock goes backward).
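One of those edge cases, written as a test, might look like this (treating a zero limit as "reject everything" is an assumption; decide what your spec actually requires before adopting it):

test('a maxRequests of 0 rejects every request', () => {
  // Assumed behavior: a zero limit means no requests are ever allowed.
  const limiter = new RateLimiter({ maxRequests: 0, windowMs: 60000 });
  expect(limiter.check('user-1')).toBe(false);
});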


Key Takeaways

  • The AI-augmented TDD loop makes implementation nearly instant — invest the time you save in writing better tests, not in writing more code.
  • Tests are the specification language for AI; vague tests produce vague implementations.
  • Always verify that AI passed a test by implementing real logic, not by special-casing the test inputs — run at least two tests with different inputs before committing.
  • The refactor step does not happen automatically; you must explicitly prompt AI to simplify and clean up after making tests pass.
  • Outside-in TDD with AI combines well — write an acceptance test first, then drive inward component by component with AI implementing each layer.