
Verifying Correctness

Shipping AI-generated code without verifying its logic is not moving fast — it is deferring a debugging session to the worst possible moment: production.

Why AI Logic Needs Active Verification

AI models generate code that is syntactically plausible and structurally familiar. It compiles. It passes a quick smoke test. It looks like code a competent engineer would write. That surface credibility is exactly what makes it dangerous. The model has no runtime, no debugger, and no memory of what broke last Tuesday. It assembles code from patterns — and patterns can be confidently wrong.

The failure modes are predictable once you know what to look for: off-by-one errors in boundary handling, silent assumptions about input shape, async flows where error handling was omitted because "the happy path doesn't need it," and business rule implementations that are correct for the example in the prompt but break for the case that was never mentioned. None of these are obvious on a first read. They require deliberate reasoning.

The techniques in this topic treat AI-generated code the way a careful reviewer treats any unfamiliar code: assume nothing, trace everything, and let tests be the final word. The difference is that with AI-generated code, you cannot ask the author why they made a particular decision — so you need a more systematic approach than you might apply to a colleague's PR.

Learning tip: Before reading any AI-generated function, write down in one sentence what you expect it to do, what the input is, and what the output should be. Then read the code. The gap between what you expected and what the code actually does is where bugs live.


Tracing Execution Manually: The Happy Path and Beyond

Manual tracing is the most reliable form of review you can do without running the code. It is slow, which is why engineers skip it — but for AI-generated code handling critical logic, it is non-negotiable.

Start with the happy path. Pick a representative, valid input and walk through the code line by line, tracking variable values in your head (or on paper). Confirm that the output matches what you expect. This step catches obvious structural errors and gives you a baseline mental model of how the code is supposed to work.

Once the happy path passes, stress-test the edge cases. For any function that accepts numeric input, ask: what happens at zero? What happens at the maximum value the type allows? What if the caller passes a negative number when the spec says "count"? For functions that accept strings, ask: what happens with an empty string? With a string containing only whitespace? With a string that is 10,000 characters long? For functions that accept arrays or lists, ask: what happens with an empty collection? With a single element? With duplicate values?

For functions that call other functions, ask: what happens if the dependency throws? What happens if it returns null or undefined instead of the expected shape? AI-generated code frequently handles the case where everything succeeds and implicitly trusts that dependencies behave. That trust is often misplaced.
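
For example, here is a minimal TypeScript sketch of the implicit-trust pattern to look for. The `fetchUser` dependency is hypothetical; assume the real service can resolve to null when the user does not exist.

// Hypothetical dependency: may resolve to null for a missing user.
declare function fetchUser(userId: string): Promise<{ name: string } | null>;

// Generated-style code: trusts the dependency's shape. A null here throws a
// TypeError that only a trace of the "user not found" path will catch.
async function getUserDisplayName(userId: string): Promise<string> {
  const user = await fetchUser(userId);
  return user!.name.trim();
}

// Safer version: the assumption is made explicit instead of implicit.
async function getUserDisplayNameSafe(userId: string): Promise<string> {
  const user = await fetchUser(userId);
  if (user === null) {
    throw new Error(`No user found for id ${userId}`);
  }
  return user.name.trim();
}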

Learning tip: Keep a personal checklist of edge cases for the data types you work with most. Paste it into your review process for any AI-generated function that handles user input or external data. The list pays for itself the first time it catches a null-dereference that would have reached a customer.


Asking AI to Explain Its Own Reasoning — and Why That Explanation May Also Be Wrong

One useful technique is to ask the AI to explain the logic it just generated. This forces the model to re-examine its output and often surfaces assumptions that were implicit in the code. The explanation can reveal that the model misunderstood the spec, that a variable name is misleading, or that a guard clause is doing something subtly different from what the surrounding comment claims.

However, you must treat this explanation with the same skepticism you apply to the code. The model is generating a plausible explanation for its own output — it is not executing the code and observing behavior. The explanation can be confident, well-written, and wrong. It can rationalize a bug rather than expose it. The explanation is a starting point for your own reasoning, not a substitute for it.

The most useful follow-up after getting an explanation is to probe the cases where the explanation glosses over detail. If the model says "and then it handles the error," ask it exactly what handling means in this case — does it rethrow? Return a default? Swallow the error silently? Those three behaviors have completely different consequences for callers, and "handles the error" covers all of them.
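
To make the difference concrete, here is a TypeScript sketch of the three behaviors that "handles the error" can hide. The `saveOrder` dependency is hypothetical.

declare function saveOrder(order: object): Promise<void>;

// 1. Rethrow: the caller sees the failure and can react to it.
async function saveOrRethrow(order: object): Promise<void> {
  try {
    await saveOrder(order);
  } catch (err) {
    throw new Error("Order could not be saved", { cause: err });
  }
}

// 2. Return an error result: the caller must check the return value.
async function saveAndReport(order: object): Promise<{ ok: boolean }> {
  try {
    await saveOrder(order);
    return { ok: true };
  } catch {
    return { ok: false };
  }
}

// 3. Swallow: the caller believes the save succeeded. Rarely what you want.
async function saveAndSwallow(order: object): Promise<void> {
  try {
    await saveOrder(order);
  } catch (err) {
    console.error(err); // logged, but invisible to the caller
  }
}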

Learning tip: After asking for an explanation, follow up with: "What input would cause this function to return an incorrect result or throw an unexpected error?" This prompt forces the model to take an adversarial stance on its own code, which is significantly more useful than asking it to confirm that the code is correct.


Using Tests as Executable Specifications

Tests are the most reliable form of verification because they are runnable. An explanation can be wrong; a passing test suite cannot lie about what the code does with a specific input (though it can lie about coverage). The goal in this section is not just to run existing tests but to write targeted tests that encode your understanding of the spec and then use failures to surface discrepancies.

Before writing tests, translate the spec into a list of behaviors. For a function that calculates a discount based on user tier and cart total, the behaviors might be: a free-tier user gets no discount; a premium user gets 10% on carts over $50 and no discount on carts at or below $50; a cart total of exactly $50 is the boundary — determine which side it falls on and confirm the code agrees. Each behavior becomes a test case. This is test-as-specification: you are not testing the code, you are testing whether the code implements the spec.
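
As a sketch, the behaviors above might translate into Jest tests like this. The `calculateDiscount` module and signature are assumptions made for illustration.

import { calculateDiscount } from "./discount"; // hypothetical module

describe("calculateDiscount behaves as the spec describes", () => {
  test("free-tier user gets no discount", () => {
    expect(calculateDiscount(80, "free")).toBe(0);
  });

  test("premium user gets 10% on carts over $50", () => {
    expect(calculateDiscount(60, "premium")).toBeCloseTo(6);
  });

  test("premium user gets no discount at exactly $50 (the boundary)", () => {
    // Encodes the reading that "over $50" means strictly greater than $50.
    expect(calculateDiscount(50, "premium")).toBe(0);
  });
});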

When a test fails, resist the urge to modify the test to match the code. Instead, read the failure as a specification violation. The code is wrong, not the expectation. The exception is when you wrote the test incorrectly — but that is a separate verification step.

Learning tip: Write the tests before reading the AI-generated implementation. You will catch more bugs because your expectations are uncontaminated by the implementation details. This is the closest thing to TDD you can achieve when reviewing AI output.


Boundary Condition Analysis: The Systematic Edge Case Checklist

Boundary conditions are where AI-generated code fails disproportionately. The model learns from examples that are well within normal ranges and is less reliable at the edges. Systematic boundary analysis is a force multiplier for review time.

For every parameter in every function, apply this checklist:

  • Numeric: zero, negative, maximum integer, floating point near zero, NaN, Infinity
  • String: empty string, whitespace-only, maximum length, Unicode edge cases, injection strings (SQL, HTML, shell)
  • Array/Collection: empty, single element, maximum size, elements that are null or undefined, duplicate values
  • Object: null, undefined, missing required keys, extra unexpected keys, deeply nested structures
  • Date/Time: epoch, far future, timezone edge cases, daylight saving transitions, invalid dates
  • Async: what if the promise never resolves? What if it rejects immediately?

You do not need to test every combination — that is combinatorial explosion. Focus on the boundaries that are most likely to be hit in production and most likely to cause data corruption or security issues if wrong.
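
One way to make the checklist cheap to apply is a parameterized test that sweeps the relevant boundary values. A sketch in Jest, assuming a hypothetical `parseQuantity` function that should accept only positive integers and throw on everything else:

// Hypothetical function under test: returns a positive integer or throws.
declare function parseQuantity(input: unknown): number;

describe("parseQuantity rejects boundary and malformed inputs", () => {
  test.each([
    [0, "zero"],
    [-1, "negative"],
    [Number.NaN, "NaN"],
    [Number.POSITIVE_INFINITY, "Infinity"],
    ["", "empty string"],
    ["   ", "whitespace-only string"],
    [null, "null"],
    [undefined, "undefined"],
  ])("throws for %p (%s)", (input) => {
    expect(() => parseQuantity(input)).toThrow();
  });
});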

Learning tip: Pay extra attention to boundaries that cross a threshold in business logic. A discount that kicks in at $50, a rate limit at 100 requests per minute, a pagination size of 20 items — these thresholds are where off-by-one errors concentrate. Check whether "at the boundary" means the old behavior or the new behavior, and verify the code agrees with the spec.


Spotting Logic Errors in Async Code

Async code is where AI-generated logic fails in ways that are hardest to catch in review and hardest to reproduce in testing. The failure modes are specific enough to warrant their own section.

Missing await. The model generates a call to an async function without awaiting it. The function appears to work because the happy path completes before the unawaited promise matters. Under load, or when the async operation takes longer than expected, the behavior becomes undefined. Search for every async function call in the generated code and confirm it is awaited or that the caller explicitly handles the returned promise.
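
A minimal sketch of the pattern, with hypothetical `removeAccountRecord` and `auditLog` dependencies:

declare function removeAccountRecord(userId: string): Promise<void>;
declare function auditLog(event: string): Promise<void>;

async function deleteAccount(userId: string): Promise<void> {
  await removeAccountRecord(userId);
  // Bug: the returned promise is discarded. A rejection here is unhandled,
  // and the caller can observe "success" before the audit entry exists.
  auditLog(`account-deleted:${userId}`);
}

// Fixed: the call is awaited, so failures propagate to the caller.
async function deleteAccountFixed(userId: string): Promise<void> {
  await removeAccountRecord(userId);
  await auditLog(`account-deleted:${userId}`);
}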

Error swallowing. The model wraps an async call in a try/catch and catches the error with a log statement or an empty handler. The function returns a success response even though the underlying operation failed. Search for every catch block and confirm it either rethrows, returns an error result, or explicitly documents why swallowing is correct behavior.
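
In its most dangerous form, the swallowed error sits directly in front of a success response. A sketch with a hypothetical `chargeCard` dependency:

declare function chargeCard(orderId: string): Promise<void>;

async function handleCheckout(orderId: string): Promise<{ status: number }> {
  try {
    await chargeCard(orderId);
  } catch (err) {
    console.error("charge failed", err); // logged, then swallowed
  }
  // Reached on both success and failure: the caller sees 200 either way.
  return { status: 200 };
}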

Race conditions. The model generates code that makes two async calls and uses both results, without considering that the second call might complete before the first, or that a state mutation in one callback affects the other. Look for any pattern where multiple async operations share mutable state.
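
The tell-tale shape is an await sandwiched between reads or writes of state that another call can also touch. A sketch with hypothetical fetch functions:

declare function fetchProfile(id: string): Promise<{ name: string }>;
declare function fetchSettings(id: string): Promise<{ theme: string }>;

let currentUserId: string | null = null; // shared mutable state

async function loadDashboard(id: string) {
  currentUserId = id;
  const profile = await fetchProfile(id);
  const settings = await fetchSettings(id);
  // If loadDashboard was called again while the awaits above were pending,
  // currentUserId now names a different user than profile and settings describe.
  return { userId: currentUserId, profile, settings };
}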

Unhandled rejection. A promise is created but no rejection handler is attached. In Node.js, this is a crash or a warning depending on version. Look for new Promise() calls, .then() chains without .catch(), and async IIFE patterns.
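
Sketches of the shapes worth searching for, using a hypothetical `syncInventory` function:

declare function syncInventory(): Promise<void>;

// .then() with no .catch(): any rejection becomes an unhandled rejection.
syncInventory().then(() => console.log("synced"));

// async IIFE with no try/catch around its awaits: same problem.
(async () => {
  await syncInventory();
})();

// Fixed: attach a rejection handler (or await inside try/catch).
syncInventory()
  .then(() => console.log("synced"))
  .catch((err) => console.error("sync failed", err));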

Learning tip: Copy the async portions of AI-generated code into a tool like the Node.js REPL or a quick test script and run it with deliberately slow or failing dependencies. Simulating a dependency that takes 2 seconds to respond or rejects with a network error will expose most async logic errors faster than reading the code.
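
A throwaway script along these lines is usually enough; the dependency stubs here are made up for illustration.

// Stand-ins for the real dependencies: one slow, one failing.
const slowDependency = (ms: number) =>
  new Promise<string>((resolve) => setTimeout(() => resolve("ok"), ms));
const failingDependency = () =>
  Promise.reject(new Error("simulated network error"));

async function main() {
  console.time("slow path");
  console.log(await slowDependency(2000)); // does anything upstream time out or double-fire?
  console.timeEnd("slow path");

  try {
    await failingDependency();
  } catch (err) {
    console.log("rejection reached the caller:", (err as Error).message);
  }
}

main();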


Worked Example: Verifying a Complex Business Rule

Consider an AI-generated function that calculates a user's invoice total with the following spec: apply a 15% discount for enterprise users; apply an additional 5% for invoices over $1000 after the enterprise discount; never allow the final total to go below $10 (minimum billing amount); return the total rounded to two decimal places.

The AI generates the following logic (pseudocode for language-agnostic clarity):

function calculateInvoiceTotal(baseAmount, isEnterprise):
  total = baseAmount
  if isEnterprise:
    total = total * 0.85
  if total > 1000:
    total = total * 0.95
  total = max(total, 10)
  return round(total, 2)

Manual trace on happy path: base = $1200, enterprise = true. After enterprise discount: $1020. Is $1020 > $1000? Yes. After volume discount: $969. Is $969 > $10? Yes. Return $969.00. Looks correct.

Now stress-test the threshold. Base = $1177, enterprise = true. After enterprise discount: $1000.45. Is $1000.45 > $1000? Yes. After volume discount: $950.43. Seems fine.

Now the exact boundary. Base = $1176.47, enterprise = true. After enterprise discount: $999.9995, which rounds to $1000.00 but is strictly less than $1000. Is $999.9995 > $1000? No, so the volume discount is not applied, even though the invoice looks like a $1000 invoice once rounded. Worse, no two-decimal base amount lands exactly on $1000.00 after a 15% discount (that would require a base of $1176.4705...), and floating point arithmetic this close to the threshold can flip the comparison either way. The spec is also ambiguous: does "over $1000" mean strictly greater than? This is a real boundary problem, and it only surfaces when you trace an input chosen to land near the threshold.

Now the minimum billing check. Base = $5, enterprise = true. After enterprise discount: $4.25. Is $4.25 > $1000? No. Final total: max($4.25, $10) = $10. Correct.

What about base = $0? After enterprise discount: $0. After volume check: $0. After minimum: $10. Is this correct per the spec? The spec says "never allow the final total to go below $10" but does not address whether charging $10 for a $0 base amount is intended. This is a spec clarification needed, not a code bug — but manual tracing surfaced it.
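
To lock these findings in, the traced cases can be turned directly into tests. A sketch, assuming a straightforward TypeScript port of the pseudocode and Jest as the test runner:

function calculateInvoiceTotal(baseAmount: number, isEnterprise: boolean): number {
  let total = baseAmount;
  if (isEnterprise) total *= 0.85;
  if (total > 1000) total *= 0.95;
  total = Math.max(total, 10);
  return Math.round(total * 100) / 100;
}

describe("calculateInvoiceTotal boundary behavior", () => {
  test("happy path: enterprise, $1200 base", () => {
    expect(calculateInvoiceTotal(1200, true)).toBeCloseTo(969.0, 2);
  });

  test("a hair under the volume threshold after the enterprise discount", () => {
    // 1176.47 * 0.85 = 999.9995, so the volume discount must not apply;
    // with it, the result would be roughly $950 instead of roughly $1000.
    expect(calculateInvoiceTotal(1176.47, true)).toBeGreaterThan(999);
  });

  test("minimum billing amount applies", () => {
    expect(calculateInvoiceTotal(5, true)).toBe(10);
  });

  test("zero base amount: pins current behavior until the spec is clarified", () => {
    expect(calculateInvoiceTotal(0, true)).toBe(10);
  });
});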

Learning tip: A worked trace like this takes 10–15 minutes for a moderately complex function and has a higher return on investment than any other review technique. When you find one boundary bug this way, there are usually more. Keep tracing.


Hands-On: Verify an AI-Generated Discount Calculator

Step 1: Generate the initial implementation

Ask the AI to implement a business rule you know well enough to verify.

Implement a TypeScript function `calculateOrderDiscount(orderTotal: number, customerTier: 'free' | 'pro' | 'enterprise', couponCode?: string): number` that:
- Returns the discount amount (not the final price)
- Free tier: no discount
- Pro tier: 10% on orders over $100, no discount otherwise
- Enterprise tier: 20% on all orders
- Coupon code "SAVE10" adds a flat $10 discount on top of any tier discount
- The discount can never exceed the order total
- Return value rounded to 2 decimal places
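
For comparison later, here is one implementation that appears to satisfy this spec. Treat it as a sketch to check your own reasoning against, not as an answer key, and expect the AI's version to differ.

type Tier = 'free' | 'pro' | 'enterprise';

function calculateOrderDiscount(
  orderTotal: number,
  customerTier: Tier,
  couponCode?: string
): number {
  let discount = 0;

  if (customerTier === 'pro' && orderTotal > 100) {
    discount = orderTotal * 0.1;
  } else if (customerTier === 'enterprise') {
    discount = orderTotal * 0.2;
  }

  if (couponCode === 'SAVE10') {
    discount += 10; // flat coupon on top of any tier discount
  }

  // The discount can never exceed the order total.
  discount = Math.min(discount, orderTotal);

  return Math.round(discount * 100) / 100;
}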

Step 2: Ask for an explanation before reviewing the code

Before I review this implementation, explain step by step what the function does for each tier. Walk through the coupon code logic specifically and explain what happens if a free-tier user uses the coupon code on a $5 order.

Expected result: The explanation will either confirm correct logic or reveal a misunderstanding. Watch for whether the model correctly identifies that a $10 coupon on a $5 order should be capped at $5.

Step 3: Write the boundary test cases yourself

Before running any tests, write out your expected results manually:

  • calculateOrderDiscount(50, 'free') should return 0
  • calculateOrderDiscount(100, 'pro') should return 0 (boundary — is $100 "over $100"?)
  • calculateOrderDiscount(100.01, 'pro') should return 10.00
  • calculateOrderDiscount(200, 'enterprise') should return 40.00
  • calculateOrderDiscount(5, 'free', 'SAVE10') should return 5.00 (capped)
  • calculateOrderDiscount(15, 'pro', 'SAVE10') should return 10.00 (coupon only, no tier discount)

Step 4: Ask AI to generate tests for these exact cases

Write Jest test cases for `calculateOrderDiscount` that verify exactly these scenarios:
1. Free tier, $50 order — expect 0 discount
2. Pro tier, exactly $100 order — expect 0 discount (boundary: "over $100" means strictly greater)
3. Pro tier, $100.01 order — expect $10.00 discount
4. Enterprise tier, $200 order — expect $40.00 discount
5. Free tier, $5 order with coupon SAVE10 — expect $5.00 discount (capped at order total)
6. Pro tier, $15 order with coupon SAVE10 — expect $10.00 discount (coupon only, no tier discount since under $100)

Use `toBeCloseTo` for floating point comparisons where appropriate.
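
The generated tests might look roughly like the sketch below (the import path is an assumption). Compare them line by line against the expectations you wrote in Step 3 before trusting them.

import { calculateOrderDiscount } from './calculateOrderDiscount'; // hypothetical path

describe('calculateOrderDiscount', () => {
  test('free tier, $50 order: no discount', () => {
    expect(calculateOrderDiscount(50, 'free')).toBe(0);
  });

  test('pro tier, exactly $100: boundary, no discount', () => {
    expect(calculateOrderDiscount(100, 'pro')).toBe(0);
  });

  test('pro tier, $100.01: 10% discount', () => {
    expect(calculateOrderDiscount(100.01, 'pro')).toBeCloseTo(10.0, 2);
  });

  test('enterprise tier, $200: 20% discount', () => {
    expect(calculateOrderDiscount(200, 'enterprise')).toBeCloseTo(40.0, 2);
  });

  test('free tier, $5 order with SAVE10: capped at the order total', () => {
    expect(calculateOrderDiscount(5, 'free', 'SAVE10')).toBeCloseTo(5.0, 2);
  });

  test('pro tier, $15 order with SAVE10: coupon only', () => {
    expect(calculateOrderDiscount(15, 'pro', 'SAVE10')).toBeCloseTo(10.0, 2);
  });
});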

Step 5: Run the tests and investigate failures

Run the generated tests. For any failure, do not immediately ask AI to fix the code.

This test is failing: [paste test output]. Before you fix anything, explain why the current implementation produces this result and whether the implementation or the test expectation is wrong according to the spec I gave you.

Step 6: Verify the async version

If the function needs to look up the coupon code from a database:

Refactor `calculateOrderDiscount` so that coupon code validation is done via an async function `validateCoupon(code: string): Promise<boolean>`. Ensure that if `validateCoupon` rejects, the discount calculation continues without the coupon discount rather than throwing to the caller. Show me explicitly how you handle the rejection case.

Review the returned code specifically for: is the rejection handled? Does the error get swallowed silently, or is it logged? What does the caller receive if the coupon service is down?
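
The part of the refactor worth the closest scrutiny is the catch block. One reasonable shape for it is sketched below; the `resolveCouponDiscount` helper name is made up, and `validateCoupon` is assumed to exist as specified in the prompt.

declare function validateCoupon(code: string): Promise<boolean>;

async function resolveCouponDiscount(couponCode?: string): Promise<number> {
  if (couponCode !== 'SAVE10') return 0;
  try {
    const valid = await validateCoupon(couponCode);
    return valid ? 10 : 0;
  } catch (err) {
    // Per the requirement: continue without the coupon instead of throwing.
    // The failure is logged so "coupon service down" is observable, not silent.
    console.error('coupon validation failed; skipping coupon discount', err);
    return 0;
  }
}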


Key Takeaways

  • Manual execution tracing — walking through code line by line with real input values — catches logic errors that code review by reading alone misses, especially at boundary conditions.
  • Asking AI to explain its own code is useful as a starting point, but the explanation can rationalize bugs rather than expose them. Follow up with adversarial probing.
  • Write your test cases from the spec before reading the AI implementation. Uncontaminated expectations catch more bugs.
  • Async logic errors (missing await, error swallowing, race conditions) are the highest-risk failure mode in AI-generated code and require targeted inspection of every async call site.
  • Boundary conditions — zero, null, max, empty, exact threshold values — are where AI-generated logic fails disproportionately. Apply a systematic checklist rather than relying on intuition.