AI-generated code requires a different review lens than human-written code — not because the standard is lower, but because the failure modes are different and you need to know where to look.
Why AI-Generated Code Needs a Different Review Lens
When you review code written by a colleague, you bring context that shapes your attention. You know what problems that engineer has run into before, what their strengths are, where they tend to skip edge case handling, and whether their test coverage is usually solid. You read their code against a mental model of their habits and blind spots.
AI-generated code comes without that context — but it comes with its own characteristic failure modes, and those modes are consistent enough that you can build a review lens specifically calibrated to them. The risk profile of AI-generated code is predictably different from that of human-written code.
Human engineers make mistakes that reflect their knowledge gaps, time pressure, and assumptions about the task. AI models make mistakes that reflect their training distribution: they hallucinate APIs, they apply common patterns to uncommon situations, they optimize for looking correct over being correct, and they frequently underhandle edge cases that are rare in training data but common in production. AI models also tend to produce code that is more verbose and plausible-looking than human code, which means reviewers can miss problems by being lulled into confidence by the overall cleanliness of the output.
The effective mindset shift is from "find mistakes in this code" to "verify that this code is actually correct for my specific situation." The default assumption changes. You are not looking for whether the code looks reasonable — you are confirming that every assumption the model made is correct for your codebase, your API versions, your data shapes, and your edge cases.
Learning tip: Before starting a review of AI-generated code, take ten seconds to list three things the AI could not have known from your prompt alone — your exact library versions, your production data shapes, your team's error handling conventions. Then make sure the code handles those correctly.
A Structured Review Checklist
A structured checklist serves two purposes: it ensures nothing important is skipped, and it makes your review auditable. If the code causes a problem later, you can show what you checked. If you miss something on the checklist, you know exactly where your process broke down.
The following checklist is organized by concern, moving from "does this do what was asked" to "will this behave correctly in production."
Spec adherence. Does the code actually do what the prompt or ticket asked? AI models sometimes solve a slightly different problem than the one stated — especially when the prompt was ambiguous. Read the requirements, then read the code. Confirm the mapping explicitly.
Edge case handling. What are the boundary conditions? Empty inputs, null/undefined values, zero, maximum values, concurrent calls, inputs of the wrong type. AI models are trained on common cases and frequently underhandle rare ones. Make a list of five plausible unusual inputs and reason through what the code will do with each.
Error handling. Does the code handle failure modes explicitly? Look for: network errors, missing files, malformed data, third-party API failures, database connection loss. AI models often generate happy-path code wrapped in error handling that looks defensive but does not actually recover from anything.
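As a concrete illustration, here is the shape this often takes, sketched in TypeScript with a hypothetical fetchOrders call standing in for anything that can fail. The catch block reads as error handling, but a failed fetch quietly becomes a total of zero:

interface Order { amount: number }

// Stand-in for any network call that can fail.
async function fetchOrders(userId: string): Promise<Order[]> {
  throw new Error("connection reset");
}

async function getOrderTotal(userId: string): Promise<number> {
  let orders: Order[] = [];
  try {
    orders = await fetchOrders(userId);
  } catch (err) {
    console.error("Failed to fetch orders", err); // looks handled...
  }
  // ...but when the fetch fails, execution continues with an empty list and
  // the function silently reports a total of 0 instead of surfacing the failure.
  return orders.reduce((sum, order) => sum + order.amount, 0);
}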
Security. Look for: unvalidated input being passed to SQL queries or shell commands, sensitive data being logged or exposed in error messages, authentication checks that can be bypassed, use of deprecated or insecure cryptographic functions, hardcoded credentials or secrets. Security checks in AI code deserve extra attention because they often look correct at a glance while containing subtle bypasses.
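For example, the difference between an injectable query and a safe one can be a single line. A sketch, assuming the node-postgres (pg) client and a hypothetical users table:

import { Pool } from "pg";

const pool = new Pool();

// Pattern to flag: user input interpolated directly into the SQL string.
// It reads cleanly and works in tests, and it is injectable.
async function findUserUnsafe(email: string) {
  return pool.query(`SELECT id, email FROM users WHERE email = '${email}'`);
}

// What to look for instead: a parameterized query, where the driver binds
// the value and the input can never change the structure of the query.
async function findUserSafe(email: string) {
  return pool.query("SELECT id, email FROM users WHERE email = $1", [email]);
}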
Performance. Does the code have O(n^2) patterns inside loops that will be fine in tests but catastrophic at scale? Does it make N+1 database queries? Does it load an entire dataset into memory to filter it? AI models do not know your data volumes and frequently generate correct-but-unscalable code.
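A typical example of correct-but-unscalable output, using hypothetical user lists:

// Pattern to flag: a linear scan inside a filter. Correct, fast on small
// test fixtures, and O(n * m) once both lists are large in production.
function findInactiveUsersSlow(allUsers: string[], activeUsers: string[]): string[] {
  return allUsers.filter((user) => !activeUsers.includes(user));
}

// Same behavior in O(n + m): build a Set once, then do constant-time lookups.
function findInactiveUsersFast(allUsers: string[], activeUsers: string[]): string[] {
  const active = new Set(activeUsers);
  return allUsers.filter((user) => !active.has(user));
}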
Test quality. Are the tests actually testing behavior, or are they testing implementation details? Do the tests cover the edge cases you identified? A common AI pattern is generating tests that are technically passing but do not actually verify anything meaningful — for example, tests that check a mock was called but do not verify the outcome.
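The contrast looks like this in practice, in a sketch assuming a Jest-style runner and a hypothetical sendInvoice function:

// Hypothetical function under test: computes a total and sends a message.
async function sendInvoice(amount: number, send: (msg: string) => Promise<void>): Promise<number> {
  const total = amount + 5; // flat $5 fee
  await send(`Invoice total: ${total}`);
  return total;
}

// Pattern to flag: this only proves the mock was called. It still passes
// if the fee calculation is completely wrong.
test("sends the invoice", async () => {
  const send = jest.fn().mockResolvedValue(undefined);
  await sendInvoice(100, send);
  expect(send).toHaveBeenCalled();
});

// A meaningful test pins down the observable behavior.
test("computes the total and sends it", async () => {
  const send = jest.fn().mockResolvedValue(undefined);
  const total = await sendInvoice(100, send);
  expect(total).toBe(105);
  expect(send).toHaveBeenCalledWith("Invoice total: 105");
});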
Learning tip: Print or save this checklist and keep it next to your review workflow. The first few times it will feel slow. After ten reviews, it will be a fast mental scan that you can do in two to three minutes without looking at the list.
Using AI as a Review Assistant
One of the most effective and underused review techniques is asking the AI to review its own output. This sounds paradoxical — if the AI made a mistake, will it catch it? Sometimes yes, often partially. More importantly, the AI's response tells you which parts of the code it is uncertain about, which is useful information even when the self-review is incomplete.
The key is asking specific, targeted questions rather than "does this code have any bugs?" Specific questions yield specific answers. Vague questions yield vague reassurances.
There are a few particularly effective review prompts. The first targets hallucinations directly: ask the model to list every external method or library call it used and confirm that each one matches the actual documented API. The second targets edge cases: give the model a list of specific unusual inputs and ask it to trace through what the code does for each. The third targets security: ask the model to act as an adversarial security reviewer and identify any inputs that could cause the code to behave incorrectly.
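For example, the hallucination-check prompt can be as simple as:

List every function, method, and library call in this code that comes from outside the file. For each one, name the library it belongs to and confirm that the name, arguments, and return type match the documented API for the version I am using. Flag anything you are not fully certain about.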
The AI will miss things. Your job is to treat its self-review output as a first pass that reduces the space of issues you need to investigate manually — not as a substitute for your own review.
Learning tip: Develop a small set of standard "review assistant" prompts that you use every time you review AI-generated code. Having a consistent set of prompts means you are testing for the same things across different code, which makes your review process repeatable and improvable.
Reviewing for "Plausible Wrongness"
Plausible wrongness is the category of AI bugs that is most dangerous precisely because it is hardest to catch. The code looks correct. It follows the right patterns. It handles the obvious cases. And then, in a specific edge case that your tests did not cover, it does exactly the wrong thing — quietly, without an error.
Plausible wrongness shows up in specific forms:
Off-by-one in boundary conditions. The code correctly processes items 1 through N-1 but skips or double-processes item N. The test suite covers the middle of the range; the boundary condition was not tested.
Silent truncation or coercion. The code accepts a number but the variable is typed as an integer, so floating-point inputs are silently truncated. The test inputs were all integers. Production data contains floats.
Wrong operator precedence in boolean logic. if (a || b && c) does not mean if ((a || b) && c). The AI generates plausible-looking boolean logic that evaluates differently than intended for specific input combinations.
Incorrect handling of timezone-naive datetimes. The code works correctly when run locally, but in production where the server is in UTC and the user is in a different timezone, dates are off by hours or days.
Race conditions in async code. Two async operations are executed independently, but the code assumes a specific ordering that holds most of the time but not always. The bug is rare in tests and appears intermittently in production.
The review technique for plausible wrongness is to reason adversarially: assume the code is wrong and try to find the specific input that proves it. Ask "what would need to be true about an input for this code to fail silently?" Then construct that input and test it.
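Here is what that looks like for the operator-precedence case above, using a hypothetical access check:

// The intent: allow access only when the account is active AND the caller
// is either an admin or the owner. Because && binds tighter than ||, this
// actually lets admins in even when the account is inactive.
function canAccess(isAdmin: boolean, isOwner: boolean, isActive: boolean): boolean {
  return isAdmin || isOwner && isActive; // parses as isAdmin || (isOwner && isActive)
}

// The adversarial input that proves it: an inactive admin.
// Intended result: false. Actual result: true.
console.log(canAccess(true, false, false)); // true, silently and plausibly wrong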
Learning tip: Pick one category of plausible wrongness (start with boundary conditions) and deliberately test for it in every AI-generated code review for the next two weeks. Building the habit for one category at a time is more effective than trying to check everything at once.
Using Diffing Tools and Tests as Review Aids
Beyond reading the code, two mechanical aids make your review more reliable.
Diff-based review. If the AI is making changes to existing code rather than generating new code from scratch, a clean diff that shows only what changed is often more informative than reading the full file. Ask the AI to show its changes as a diff, or use your version control system to generate one. Changes to existing logic are where regressions hide; the diff makes them visible.
Running existing tests first. Before examining AI-generated changes to existing code, run the full existing test suite. If any tests fail, investigate those failures before reviewing the new code. A failing test on code the AI "didn't touch" is a signal of an unintended side effect.
Writing tests before reviewing implementation. For new code, write the test cases yourself before reading the implementation carefully. Think about what the function should do, what edge cases should be covered, and what failure modes should be handled. Then run the tests. Failures tell you immediately where the AI's assumptions about the problem did not match yours.
Property-based testing for algorithmic code. If the AI generated an algorithm (sorting, parsing, transformation), consider writing a few property-based tests: invariants that must hold for all inputs, not just specific test cases. "The output array must have the same length as the input array," "all output values must be greater than zero," "calling the function twice with the same input must produce the same output." These catch plausible-wrong algorithmic implementations that pass example-based tests.
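A sketch of what this looks like in TypeScript, assuming the fast-check library (any property-testing tool works) and a hypothetical sort function under review:

import fc from "fast-check";

// Hypothetical AI-generated function under review.
function sortAscending(xs: number[]): number[] {
  return [...xs].sort((a, b) => a - b);
}

// Properties that must hold for every input, not just hand-picked examples.
fc.assert(
  fc.property(fc.array(fc.integer()), (xs) => {
    const out = sortAscending(xs);
    // Invariant 1: same length as the input, so nothing is dropped or duplicated.
    if (out.length !== xs.length) return false;
    // Invariant 2: the output is non-decreasing.
    return out.every((value, i) => i === 0 || out[i - 1] <= value);
  })
);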
Learning tip: Get in the habit of running tests before reviewing code, not just before merging. Test results give you a fast, objective signal about what is working before you invest time in reading logic.
How Long Reviews Should Take and What "Done" Looks Like
One of the questions engineers ask most often about AI-assisted review is: how long is enough? The honest answer is: it depends on the code, but there are meaningful lower bounds.
A 50-line AI-generated function that uses external APIs should take at least five to ten minutes to review properly, including verifying the API calls and testing one or two edge cases. A 200-line AI-generated module with complex business logic and state management should take 20–40 minutes, including running the tests, checking edge cases, and reading the docs for any unfamiliar library calls.
If your reviews are consistently taking less than two minutes, you are rubber-stamping, not reviewing. The velocity pressure that makes short reviews feel acceptable is real, but it is not a good trade. A five-minute review that catches a critical bug saves hours of incident response.
"Done" for a review means you can answer yes to each of the following:
- I can explain what this code does in plain language.
- I have verified that every external API call matches the documented API for the version we use.
- I have tested at least the happy path and two edge cases.
- I have checked for the security issues relevant to this type of code.
- I understand what happens when errors occur.
- I would be comfortable being woken up at 2am to debug this code in production.
That last question is the most useful gut-check. If the answer is no, you are not done reviewing.
Learning tip: Track your review times for AI-generated code for two weeks. If the average is under five minutes for non-trivial code, use that data to have a conversation with yourself or your team about what a realistic review standard looks like.
Hands-On: Running a Full Review on AI-Generated Code
This exercise takes you through a complete, structured review from prompt to approval decision.
Step 1: Generate a non-trivial function.
Use this prompt:
Write a TypeScript function called parseAndValidateWebhookPayload that:
1. Takes a raw string body and an HMAC-SHA256 signature header (both strings)
2. Validates the signature against a SECRET_KEY environment variable using Node.js crypto module
3. Parses the body as JSON
4. Validates that the parsed object has a "type" field (string) and a "data" field (object)
5. Returns the parsed payload if valid
6. Throws descriptive errors for: invalid signature, malformed JSON, missing fields
Use the Node.js built-in crypto module. Do not use any third-party libraries.
Step 2: Run through the spec adherence check.
Read the requirements in the prompt one by one. For each requirement, find the corresponding code. If you cannot locate code for a requirement, note it as a gap.
Step 3: Run the security checklist.
Specifically check:
- Is the HMAC comparison timing-safe? (Node.js crypto.timingSafeEqual should be used, not ===)
- Is the secret key sourced correctly from the environment?
- Could a malformed input cause an unhandled exception that leaks information in the error message?
Step 4: Ask the AI to self-review security.
Review the webhook validation function you just wrote with a security focus. Specifically:
1. Is the HMAC signature comparison timing-safe? What attack does timing-safe comparison prevent?
2. Are there any inputs to this function that could cause it to throw an error with information useful to an attacker?
3. What happens if the SECRET_KEY environment variable is not set?
List any issues you find and suggest corrections.
Step 5: Test edge cases manually.
Run or trace through the function for each of the following (a sketch of how two of these might look as tests appears after the list):
- An empty string body
- A body with valid JSON but no type field
- A body with valid JSON but data is a string, not an object
- A correct signature that was computed with a different key
- A missing SECRET_KEY environment variable
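If you prefer to capture these as tests rather than trace them by hand, here is a sketch for two of the cases. It assumes a Jest-style runner, a hex-encoded signature, and that the generated function is synchronous, exported under the name in the prompt, and throws errors whose messages mention the failing check; adjust to match what the model actually produced:

import crypto from "node:crypto";
import { parseAndValidateWebhookPayload } from "./webhook"; // wherever the generated code lives

// Helper: compute a hex-encoded HMAC-SHA256 signature for a body.
function sign(body: string, key: string): string {
  return crypto.createHmac("sha256", key).update(body).digest("hex");
}

test("rejects a signature computed with a different key", () => {
  process.env.SECRET_KEY = "correct-key";
  const body = JSON.stringify({ type: "order.created", data: {} });
  const badSignature = sign(body, "wrong-key");
  expect(() => parseAndValidateWebhookPayload(body, badSignature)).toThrow(/signature/i);
});

test("rejects valid JSON with no type field", () => {
  process.env.SECRET_KEY = "correct-key";
  const body = JSON.stringify({ data: {} });
  expect(() => parseAndValidateWebhookPayload(body, sign(body, "correct-key"))).toThrow();
});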
Step 6: Check the crypto API usage.
Look up Node.js crypto.createHmac and crypto.timingSafeEqual in the official Node.js documentation. Verify that:
- The arguments match the documented signatures
- The encoding used for comparison is correct
- The digest and the decoded signature are both Buffers of the same length, since timingSafeEqual does not accept plain strings and throws when the byte lengths differ
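For reference, a minimal sketch of what correct usage of these two calls can look like, assuming the signature header carries a hex-encoded digest (adjust the decoding if your provider uses base64 or a "sha256=" prefix):

import crypto from "node:crypto";

// Timing-safe HMAC-SHA256 check over a raw webhook body.
function isSignatureValid(rawBody: string, signatureHex: string, secret: string): boolean {
  const expected = crypto.createHmac("sha256", secret).update(rawBody).digest(); // Buffer
  const received = Buffer.from(signatureHex, "hex");
  // timingSafeEqual throws if the buffers differ in length, so check that first.
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}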
Step 7: Write your review conclusion.
Write a three-sentence review comment that states:
- What security properties you verified (or failed to verify)
- Any issues you found and whether they are blocking
- Whether you would approve this code for production use
This is your ownership documentation for this review.
Key Takeaways
- AI-generated code has a predictable failure profile — hallucinated APIs, underhandled edge cases, plausible-but-wrong logic — and your review checklist should be calibrated specifically to those failure modes.
- A structured checklist (spec adherence, edge cases, error handling, security, performance, test quality) ensures consistent coverage and makes your review auditable.
- Using AI as a review assistant — asking it specific questions about its own output — is an effective first pass that narrows down where to focus your manual review.
- Plausible wrongness (code that looks right but fails specific edge cases) requires adversarial reasoning: assume the code is wrong and try to find the input that proves it.
- Review completion has a minimum time floor. Reviews that consistently take under two minutes for non-trivial code are rubber-stamping, not reviewing.