
Security vulnerability scanning and penetration test planning


How to Use AI to Generate Security Test Cases Based on OWASP Top 10?

The OWASP Top 10 is a framework, not a test plan. "A03: Injection" does not tell you which endpoints to test, what payloads to use, or how to assert whether a vulnerability exists in your specific application. Translating OWASP categories into concrete, executable test cases for your system is where the bulk of security test planning effort lives — and where AI provides the highest acceleration.

The translation problem: framework to test case

A generic OWASP A01 (Broken Access Control) test case says: "Verify that access control enforces policies on trusted server-side code, not just in the UI." For your application, that means:

  • Which specific endpoints have access control logic?
  • What roles exist, and what is each role permitted to access?
  • What does "accessing as the wrong role" look like for each endpoint?
  • What is the expected response (403? 404? redirect?) for unauthorized access?
  • What HTTP manipulation (changing user IDs in URLs, modifying request headers) should be tested?

AI can generate all of this from your API spec or codebase description. This is the level of specificity you need.
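
Once AI has produced the role/endpoint/expected-status answers, the checks are mechanical to execute. Here is a minimal Python sketch using only the standard library; the endpoint path, project ID, role names, and expected statuses are illustrative assumptions, not taken from a real spec:

```python
import urllib.error
import urllib.request

# Hypothetical role-to-status matrix for one Owner-only endpoint;
# the path, statuses, and project ID below are illustrative.
EXPECTED = {
    "DELETE /api/projects/{project_id}": {
        "Owner": 204, "Admin": 403, "Member": 403, "Viewer": 403,
    },
}

def classify(expected_status: int, observed_status: int) -> str:
    """Compare an observed response against the access-control matrix."""
    if observed_status == expected_status:
        return "PASS"
    # A 2xx where a denial was expected is the broken-access-control signal.
    if expected_status in (401, 403, 404) and 200 <= observed_status < 300:
        return "FAIL: unauthorized access succeeded"
    return "REVIEW: unexpected status"

def run_check(base_url: str, token: str, role: str) -> str:
    """Execute the DELETE attempt as the given role (staging only)."""
    req = urllib.request.Request(
        f"{base_url}/api/projects/proj_123",  # hypothetical project ID
        method="DELETE",
        headers={"Authorization": f"Bearer {token}"},
    )
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code  # 4xx responses raise; the code is what we classify
    return classify(EXPECTED["DELETE /api/projects/{project_id}"][role], status)
```

Running run_check once per role turns the matrix into four executable test cases with unambiguous pass/fail results.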

Prompt template for OWASP-based test case generation

You are a security QA engineer generating test cases for an authorized 
internal security test. Our application is a SaaS project management 
tool with the following characteristics:

## Application Context
- Type: Multi-tenant SaaS, REST API (JSON)
- Authentication: JWT tokens, 15-minute expiry, refresh token pattern
- Authorization: RBAC with 4 roles: Owner, Admin, Member, Viewer
- Sensitive operations: project deletion, user management, billing, 
  file downloads, API key generation

## Relevant Endpoints (from OpenAPI spec)
- DELETE /api/projects/{project_id} — Owner only
- GET /api/projects/{project_id}/files/{file_id} — All roles with project access
- POST /api/organizations/{org_id}/members — Owner or Admin only
- GET /api/billing/invoices — Owner only
- POST /api/api-keys — Admin or Owner only
- GET /api/users/{user_id}/activity — Admin+ or own user only

For OWASP categories A01 (Broken Access Control) and A02 (Cryptographic 
Failures), generate specific test cases for our application.

For each test case:
1. Test case ID (e.g., A01-001)
2. OWASP category and sub-type (e.g., A01 — IDOR, A01 — Privilege Escalation)
3. Endpoint and HTTP method under test
4. Precondition: What test user state is required? 
   (e.g., "User authenticated as Member in Project A, not a member of Project B")
5. Test steps: Exact HTTP request to make, including which parts to manipulate
6. Expected result: What response indicates the control is working correctly
7. Failure indicator: What response indicates a vulnerability exists
8. Severity: Critical/High/Medium/Low with justification

Generate at least 3 test cases per OWASP category. 
Focus on test cases that require no special tooling — just HTTP requests 
that a QA engineer with curl or Postman can execute.

Generating test cases for OWASP A03: Injection

Injection testing requires payload sets, not just test case structures. AI can generate both:

For our application (described above), generate SQL injection and 
NoSQL injection test cases for OWASP A03.

The backend uses PostgreSQL for user data and MongoDB for project 
activity logs.

For each injection test:
1. The specific endpoint and parameter to test
2. A set of 5 test payloads appropriate for this parameter type 
   (string field, numeric ID, JSON body field)
3. What the expected safe response looks like (sanitized error, 400 status)
4. What the vulnerable response looks like (stack trace, 500 error with 
   DB message, unexpected data returned)
5. Whether this test requires authentication (and at what role level)

Focus on parameters that appear in user-controlled inputs: 
search fields, filter parameters, ID fields in URL paths, and 
JSON body string fields.

IMPORTANT: Generate payloads that test for the vulnerability (whether 
input sanitization exists) without being destructive. Use SELECT-based 
payloads and time-based detection, not DROP or UPDATE payloads. 
This is an authorized test on our own staging environment.

Example AI-generated injection test cases:

A03-001: SQL Injection in Search Field
Endpoint: GET /api/projects?search={value}
Payloads to test:
  1. search=' OR '1'='1  (classic tautology — returns all projects if vulnerable)
  2. search=test' AND SLEEP(5)--  (time-based blind, MySQL syntax; PostgreSQL needs the pg_sleep variant in payload 3)
  3. search=test'; SELECT pg_sleep(3)--  (PostgreSQL-specific time-based)
  4. search=%27%20UNION%20SELECT%20null--  (URL-encoded UNION detection)
  5. search=test\  (backslash — triggers syntax error in some DB drivers)

Expected safe response: 
  - Returns only projects matching literal search string
  - On malformed input: 400 Bad Request with "Invalid search parameter"
  - No 500 errors, no stack traces in response body

Vulnerable response indicators:
  - Returns ALL projects regardless of membership (tautology success)
  - Response delayed by ~5 seconds (time-based blind success)
  - 500 error with PostgreSQL error message in body
  - Empty result set when the authenticated user has projects (injected syntax silently altered the query)
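
The vulnerable-response indicators above can be evaluated mechanically once a request's status, latency, and body are captured. A minimal Python sketch (standard library only; the delay threshold, error-string heuristics, and endpoint path are illustrative assumptions):

```python
import time
import urllib.error
import urllib.parse
import urllib.request

# Non-destructive payloads from the test case above.
PAYLOADS = [
    "' OR '1'='1",
    "test' AND SLEEP(5)--",
    "test'; SELECT pg_sleep(3)--",
    "' UNION SELECT null--",
    "test\\",
]

def injection_indicators(status, elapsed_s, body, baseline_count, result_count):
    """Map one captured response to the vulnerable-response indicators."""
    hits = []
    if elapsed_s > 3.0:
        hits.append("time-based delay")
    if status == 500 and ("pg_" in body or "syntax error" in body.lower()):
        hits.append("database error leaked")
    if result_count > baseline_count:
        hits.append("tautology returned extra rows")
    return hits

def probe(base_url, payload, token):
    """Fire one payload at the search endpoint; capture status, latency, body."""
    url = f"{base_url}/api/projects?search={urllib.parse.quote(payload)}"
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})
    start = time.monotonic()
    try:
        with urllib.request.urlopen(req, timeout=15) as resp:
            return resp.status, time.monotonic() - start, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, time.monotonic() - start, err.read().decode()
```

A clean run returns no indicators for every payload; any non-empty indicator list is a candidate finding to verify manually before filing.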

Learning Tip: The hardest part of OWASP-based test generation is writing test cases that are specific enough to execute but not so specific that they only catch textbook vulnerabilities. When you prompt AI to generate security test cases, explicitly ask for both "textbook" payloads (classic SQL injection strings) and "application-specific" variants (payloads that target your specific technology stack and data model). The application-specific cases are more likely to find real vulnerabilities in a well-hardened application.


How to Identify Injection, Auth, and Data Exposure Vulnerabilities in Your Test Scope with AI?

Security test scope management is a critical discipline. Testing everything is impossible; testing too little leaves high-risk areas uncovered. AI helps you map your application's attack surface systematically and identify where injection, authentication, and data exposure risks are highest — before you write a single test.

Building an attack surface map from your API spec

I need to identify the highest-risk areas for security testing in my 
application before I write test cases. Here is our complete OpenAPI spec:

---
{PASTE OPENAPI SPEC OR ENDPOINT LIST WITH DESCRIPTIONS}
---

Analyze this API spec and produce an attack surface map that:

1. Groups endpoints by security risk category:
   - Authentication surface (login, token refresh, password reset, MFA)
   - Authorization surface (RBAC, ABAC, tenant isolation, scope enforcement)
   - Injection surface (parameters that likely touch DB, file system, or commands)
   - Data exposure surface (endpoints returning PII, financial data, credentials)
   - State mutation surface (write operations that modify sensitive state)

2. For each endpoint, assign a security risk score (1–5) based on:
   - Sensitivity of data accessed or modified
   - Number of authorization checks required
   - Complexity of input handling
   - Exposure in multi-tenant context

3. Identify the top 10 endpoints to prioritize for security testing, 
   ranked by risk score

4. Flag any endpoints that appear to be missing standard security controls 
   based on their behavior description (e.g., password reset without rate 
   limiting, file download without auth check)
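
To sanity-check the AI's ranking, the four scoring factors can be reduced to a simple heuristic. A sketch, assuming each factor has already been rated 1-5 per endpoint; the field names and the ceiling-average aggregation are illustrative choices, not a standard:

```python
# The four risk factors from the prompt above, each rated 1-5.
FACTORS = ("data_sensitivity", "authz_checks", "input_complexity", "tenant_exposure")

def risk_score(endpoint: dict) -> int:
    """Average the four factor ratings, rounded up to stay conservative."""
    total = sum(endpoint[f] for f in FACTORS)
    return -(-total // len(FACTORS))  # ceiling division

def prioritize(endpoints: list, top_n: int = 10) -> list:
    """Return the top-N endpoint paths ranked by risk score, highest first."""
    ranked = sorted(endpoints, key=risk_score, reverse=True)
    return [e["path"] for e in ranked[:top_n]]
```

Comparing this deterministic ranking against the AI's top 10 quickly surfaces endpoints where the model's scoring reasoning deserves a closer look.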

Identifying IDOR (Insecure Direct Object Reference) risks

IDOR is consistently one of the most impactful vulnerability classes found in real API security tests. AI can identify IDOR-prone patterns from your API design:

Review this endpoint list for IDOR (Insecure Direct Object Reference) 
vulnerability patterns:

GET /api/users/{user_id}/profile
GET /api/users/{user_id}/documents
GET /api/projects/{project_id}/members
GET /api/orders/{order_id}
GET /api/invoices/{invoice_id}/download
PUT /api/users/{user_id}/settings
DELETE /api/comments/{comment_id}
GET /api/notifications/{notification_id}

For each endpoint that carries IDOR risk:
1. Describe the specific IDOR scenario 
   (e.g., "Authenticated user A requests user B's invoice by guessing invoice_id")
2. What authorization check must exist to prevent the vulnerability
3. What test to execute to verify the authorization check works:
   - Test user setup (e.g., "User A owns order 1001, User B owns order 1002")
   - Exact request to make as User A (e.g., GET /api/orders/1002 with User A's token)
   - Expected response (403 Forbidden, not 200 with User B's order data)
4. What additional variants to test 
   (e.g., sequential IDs, type confusion, path traversal in ID field)
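
The cross-account pattern in step 3 is mechanical enough to generate in code: every path parameter that looks like a resource ID gets a "User A's token against User B's resource" test. A Python sketch whose output fields mirror the structure requested above (the wording of the generated fields is illustrative):

```python
import re

def idor_tests(endpoints: list) -> list:
    """For each "METHOD /path" string with an {id}-style parameter, emit a
    cross-account IDOR test: access with the wrong user's token must be
    denied with 403 or 404, never served with 200."""
    tests = []
    for ep in endpoints:
        method, path = ep.split(" ", 1)
        for param in re.findall(r"\{(\w+)\}", path):
            tests.append({
                "endpoint": ep,
                "setup": f"User A owns one {param}, User B owns another",
                "request": f"{method} {path} with B's {param}, sent with A's token",
                "expect": "403 or 404, never 200 with B's data",
            })
    return tests
```

Endpoints with no path parameter produce no test, and endpoints with two parameters (e.g., nested resources) correctly produce one test per parameter.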

Identifying authentication bypass risks

Our authentication system has these components:

1. POST /auth/login — returns access_token (JWT, 15 min) and refresh_token (30 days)
2. POST /auth/refresh — takes refresh_token, returns new access_token
3. POST /auth/logout — invalidates refresh_token on server
4. POST /auth/password-reset/request — sends reset link to email
5. POST /auth/password-reset/confirm — takes reset_token + new password
6. GET /auth/me — returns current user info (validates token)

Known implementation details:
- JWTs are signed with HS256
- JWT payload contains: user_id, email, roles, tenant_id, exp
- Refresh tokens are stored in database and invalidated on logout
- Password reset tokens expire after 1 hour

Generate authentication security test cases covering:
1. JWT manipulation attempts (algorithm confusion, signature stripping, 
   claim modification)
2. Token replay after logout
3. Password reset token security (predictability, reuse, expiry enforcement)
4. Refresh token security (reuse after rotation, cross-tenant use)
5. Brute force and rate limiting on login and password reset

For each test: specify exact HTTP request, expected secure behavior, 
and what a vulnerability looks like in the response.
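
Two of the JWT manipulation attempts in item 1, signature stripping and algorithm confusion via alg=none, can be crafted with nothing but base64 and JSON. A minimal sketch; the claim names follow the payload description above, and a secure validator must reject both forged tokens with 401:

```python
import base64
import json

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWTs require."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def strip_signature(jwt: str) -> str:
    """Signature stripping: keep header and payload, drop the signature."""
    header, payload, _sig = jwt.split(".")
    return f"{header}.{payload}."

def alg_none_token(claims: dict) -> str:
    """Algorithm confusion: forge an unsigned token whose header claims
    alg=none. Libraries that honor the attacker-controlled alg field
    will accept it; correctly configured validators reject it."""
    header = b64url(json.dumps({"alg": "none", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    return f"{header}.{payload}."
```

Send each forged token to GET /auth/me; any response other than 401 is a finding worth verifying manually.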

Identifying sensitive data exposure risks

Our API returns the following data in various responses. 
Review for PII and sensitive data exposure risks:

GET /api/users/{id} response:
{
  "id": "usr_abc123",
  "email": "[email protected]", 
  "full_name": "John Smith",
  "phone": "+1-555-0123",
  "created_at": "2024-01-15",
  "last_login": "2024-03-20T14:32:11Z",
  "password_hash": "...",
  "reset_token": "...",
  "internal_flags": {...}
}

GET /api/orders/{id} response:
{
  "order_id": "ord_xyz789",
  "user_id": "usr_abc123",
  "payment_last_four": "4242",
  "payment_processor_token": "tok_stripe_abc...",
  "shipping_address": {...},
  "billing_address": {...}
}

For each response:
1. Identify fields that should NEVER be returned in an API response
2. Identify fields that should only be returned to specific roles
3. Identify fields that may cause GDPR/CCPA compliance issues if leaked
4. Write a test case verifying each sensitive field is absent or masked 
   in the response (e.g., assert response body does NOT contain "password_hash")
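
Item 4's assertion, that a sensitive field is absent anywhere in the response, is easy to automate once the forbidden field names are agreed. A Python sketch; the field list is illustrative and should be derived from your actual API contract:

```python
# Illustrative deny-list drawn from the example responses above.
NEVER_RETURN = {"password_hash", "reset_token", "internal_flags",
                "payment_processor_token"}

def find_exposed_fields(response: dict, forbidden=NEVER_RETURN) -> list:
    """Recursively collect forbidden keys anywhere in a JSON response body,
    including nested objects and arrays."""
    exposed = []
    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                if key in forbidden:
                    exposed.append(key)
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
    walk(response)
    return exposed
```

A passing test is an empty list; any returned key names exactly which field leaked and can go straight into the finding report.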

Learning Tip: IDOR is underrepresented in AI-generated test suites when the prompt doesn't explicitly ask for it, because IDOR requires cross-account testing rather than single-user manipulation. Build a habit of always including "specifically test for IDOR on any endpoint with a resource ID in the path or body" in your security test generation prompts. IDOR, cataloged as Broken Object Level Authorization (BOLA), sits at #1 in the OWASP API Security Top 10 and is the class of vulnerability most commonly found in authorized bug bounty programs on well-maintained applications.


How Does AI-Assisted Threat Modeling and Attack Surface Mapping Work?

Threat modeling is the systematic process of identifying how an attacker could compromise your system, what they would target, and what controls exist (or should exist) to prevent it. Traditionally, threat modeling requires a trained security architect and a multi-hour whiteboard session. AI does not replace that session — but it dramatically reduces the preparation time and helps you identify threats your team might not have considered.

The STRIDE framework and AI

STRIDE (Spoofing, Tampering, Repudiation, Information Disclosure, Denial of Service, Elevation of Privilege) is the standard framework for threat modeling software systems. AI has strong STRIDE knowledge and can apply it systematically to your architecture description.

I need to perform a threat model for our order processing service before 
a security penetration test. Use the STRIDE framework.

## System Description
- Service: Node.js REST API (order-service)
- Consumers: Web frontend (React, browser), Mobile app (iOS/Android), 
  Internal admin panel
- Dependencies: 
  - PostgreSQL (user data, orders, inventory)
  - Redis (session cache, rate limiting state)
  - RabbitMQ (async events: order-created, payment-processed)
  - Stripe API (external, payment processing)
  - SendGrid (external, email notifications)
- Authentication: JWT (issued by auth-service, validated by order-service)
- Deployment: Kubernetes, behind AWS ALB, VPC-isolated from public internet
- Data sensitivity: Stores PII (names, addresses), payment metadata 
  (last 4, processor token), order history

Apply STRIDE to this system and produce:

For each STRIDE category:
1. Identify the 3 most likely threats specific to this architecture
2. For each threat: describe the attack scenario in plain language
3. Identify what control should prevent this threat
4. Rate the risk if the control is absent or misconfigured: 
   Critical/High/Medium/Low
5. Generate a security test case that verifies the control is working

This is for an authorized security review of our own system.

Data flow diagram analysis for security gaps

Here is the data flow for our checkout process, described as a sequence:

1. Browser → ALB → order-service: POST /checkout with JWT + order data
2. order-service → auth-service: validate JWT (internal gRPC call)
3. order-service → PostgreSQL: BEGIN TRANSACTION; check inventory; reserve stock
4. order-service → Stripe API: POST /charges (HTTPS, Stripe secret key from env var)
5. order-service → PostgreSQL: insert order record; COMMIT TRANSACTION
6. order-service → RabbitMQ: publish "order-created" event (plaintext AMQP)
7. notification-service (consumer) → SendGrid: POST /mail/send

Analyze this data flow for security issues:

1. Identify all trust boundaries being crossed (where data moves 
   between trust zones)
2. At each trust boundary, identify what validation should occur 
   and what could be attacked if validation is absent
3. Identify any data in transit that is not encrypted
4. Identify any credentials or secrets in the flow and how they 
   should be managed
5. Flag the RabbitMQ step specifically: what are the security 
   implications of plaintext AMQP for internal messaging in this context?
6. For each identified issue, write a security test case or 
   configuration check that QA can verify
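
Point 3, unencrypted data in transit, often reduces to a configuration check on connection URLs: amqp:// (port 5672) is plaintext while amqps:// (port 5671) is TLS, which is exactly the gap flagged in step 6 of the flow. A minimal sketch; the secure-scheme list is an illustrative assumption:

```python
from urllib.parse import urlparse

# Schemes considered encrypted-in-transit for this check (illustrative).
SECURE_SCHEMES = {"amqps", "https", "rediss", "wss"}

def transport_findings(connection_urls: dict) -> list:
    """Flag every dependency whose connection URL uses a plaintext scheme."""
    findings = []
    for name, url in connection_urls.items():
        scheme = urlparse(url).scheme.lower()
        if scheme not in SECURE_SCHEMES:
            findings.append(f"{name}: plaintext scheme '{scheme}'")
    return findings
```

Run against the service's environment configuration, this is a ten-minute check that turns a theoretical trust-boundary concern into a concrete pass/fail result.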

Generating a penetration test scope document

Before an authorized penetration test, QA typically helps produce the scope document that defines what testers can and cannot do. AI accelerates this:

Generate a penetration test scope document for an authorized security 
test of our order-service API in the staging environment.

Include:
1. Scope definition: 
   - In-scope systems, IPs/URLs, and test accounts
   - Out-of-scope systems (production, third-party services)
2. Permitted test types and prohibited test types:
   - What automated scanning is permitted
   - What manual testing techniques are permitted
   - What is explicitly prohibited (DoS against staging infra, 
     testing with real customer data, testing Stripe in live mode)
3. Test environment setup requirements 
   (what test data must exist, what mock services to use)
4. Rules of engagement:
   - How to report critical findings immediately
   - What constitutes a stop-test trigger
   - Communication channels and escalation path
5. Success criteria:
   - What deliverables are expected from the penetration test
   - What a clean result looks like (no criticals/highs)

Our staging environment constraints:
- Shared with other teams' QA testing
- Connected to real Stripe test mode (no actual charges)
- Email delivery disabled
- Max load: don't exceed 20 requests/second automated scanning
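
The 20 requests/second cap is the constraint most easily violated by scripted scanning. A minimal pacing sketch that keeps an automated scan under the cap; the class name and interface are assumptions, not a standard tool:

```python
import time

class ScanThrottle:
    """Pace outgoing requests so a scan stays under a requests/second cap,
    matching the staging constraint of 20 req/s."""
    def __init__(self, max_rps: float = 20.0):
        self.min_interval = 1.0 / max_rps
        self._last = float("-inf")  # first call never waits

    def wait(self, now=None, sleep=time.sleep):
        """Block until the next request is allowed; returns the delay applied.
        `now` and `sleep` are injectable for testing."""
        now = time.monotonic() if now is None else now
        delay = self.min_interval - (now - self._last)
        if delay > 0:
            sleep(delay)
            now += delay
        self._last = now
        return max(delay, 0.0)
```

Calling throttle.wait() before each scan request is enough; the injectable clock and sleep make the pacing logic verifiable without real waiting.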

Learning Tip: Threat modeling sessions fail when they become abstract. When you use AI to generate a STRIDE threat model, immediately follow up with: "For the top 5 threats you identified, what specific configuration or code check would confirm whether the control exists?" This forces the output from theoretical threat lists to actionable verification steps. A QA engineer can execute a configuration check in 10 minutes. A theoretical threat requires a week of architecture review. Operationalize the output immediately.


How to Review AI-Generated Security Test Cases for False Positives?

AI-generated security test cases suffer from a specific failure mode: they are structurally correct but contextually wrong. A test case that checks for SQL injection in a field that uses an ORM with parameterized queries will never find a vulnerability — but it looks valid, inflates your test count, and creates false confidence. The flip side is also true: AI sometimes flags safe behavior as a vulnerability indicator because the response pattern matches a known vulnerability signature without understanding the application's design.

Categories of false positives in AI-generated security test cases

Type 1: Testing controls that don't exist in your stack
- AI generates XSS test cases for API endpoints that only return JSON (no HTML rendering)
- AI generates CSRF test cases for a stateless JWT API (CSRF requires session cookies)
- AI generates host header injection tests for a service that ignores the Host header

Type 2: Wrong vulnerability indicator
- AI marks a 400 response to a malformed input as "potentially vulnerable" because some frameworks return 400 instead of 200 on injection attempts (but 400 is actually the correct safe behavior)
- AI flags a detailed error message as "information disclosure" without noting that this endpoint is internal-only and the error message is intentional for debugging

Type 3: Correct vulnerability class, wrong expected behavior for your design
- AI expects password reset tokens to expire after 1 hour (a common default it assumed), but your design documents specify 24-hour expiry with one-time use
- AI marks a 200 response to an invalid JWT as vulnerable without noting that a specific endpoint is intentionally public
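
Type 1 false positives can be caught with a mechanical applicability filter before any test is executed. A Python sketch; the context flags and test-class names are illustrative and should mirror whatever metadata your generated test cases carry:

```python
def applicability_issues(test: dict, ctx: dict) -> list:
    """Flag tests that target controls the application stack doesn't have.
    `ctx` holds boolean facts about the stack (illustrative keys)."""
    issues = []
    cls = test["class"].lower()
    if cls == "csrf" and not ctx.get("uses_session_cookies"):
        issues.append("CSRF requires session cookies; API is stateless JWT")
    if cls == "xss" and not ctx.get("renders_html"):
        issues.append("reflected XSS requires HTML rendering; API is JSON-only")
    if cls == "host-header" and not ctx.get("uses_host_header"):
        issues.append("service ignores the Host header")
    return issues
```

Running every generated test case through this filter first means the manual review effort is spent only on tests that could plausibly apply.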

Prompt for reviewing generated security test cases

Review these AI-generated security test cases for false positives 
and contextual errors.

## Application Context
- API type: JSON REST API (no HTML responses except /docs)
- Authentication: Stateless JWT (no session cookies, no CSRF tokens)
- ORM: TypeORM with parameterized queries for all DB operations
- Error handling policy: Return generic error messages in production, 
  detailed errors in staging (this is intentional)
- Password reset token expiry: 24 hours (intentional, per design doc)
- Public endpoints (no auth required): GET /health, GET /api/products, 
  GET /api/categories

## Generated Test Cases to Review

TC-001: CSRF test on POST /api/orders/checkout
- Test: Remove CSRF token from request, expect 403
- Issue with this test case: ?

TC-002: XSS test on POST /api/search
- Payload: <script>alert(1)</script> in search field
- Expected vulnerable: 200 response contains unescaped script tag
- Issue with this test case: ?

TC-003: SQL injection on GET /api/products?search=
- Payload: test' OR 1=1--
- Expected safe: 400 or sanitized response
- Issue: AI flagged 400 response as "potentially vulnerable"

TC-004: Password reset token expiry
- Test: Use reset token after 2 hours, expect 400
- Context: Our design specifies 24-hour tokens
- Issue with this test case: ?

TC-005: Error message information disclosure
- Test: POST /api/orders/checkout with malformed JSON
- AI flags: 400 response with stack trace is a vulnerability
- Context: This is staging environment, detailed errors are intentional

For each test case:
1. Is this a false positive, a valid test, or a test that needs modification?
2. If false positive: explain exactly why it doesn't apply to our system
3. If valid: confirm the test is correctly specified
4. If needs modification: rewrite the test case with correct expected behavior

Building a security test review checklist

Use AI to generate a reusable review checklist that your team applies to all AI-generated security test suites:

Generate a quality review checklist for AI-generated security test cases.

The checklist should help a QA engineer identify:
1. Tests that don't apply to the application's technology stack
2. Tests with incorrect expected behavior (wrong status code, wrong 
   response field to check)
3. Tests that overlap in coverage (testing the same vulnerability twice 
   with different payloads but identical expected outcomes)
4. Tests that require a specific environment setup not noted in the 
   test case preconditions
5. Tests where the "vulnerability indicator" is actually valid application 
   behavior
6. Tests missing a prerequisite (e.g., requires two test accounts but 
   only one is mentioned)
7. Tests that could cause unintended side effects in a shared staging 
   environment (e.g., locking accounts, generating excessive notifications)

Format as a checklist that can be applied in under 30 minutes to a set 
of 20–30 security test cases.

Calibrating AI security test output over time

The most effective approach is a feedback loop:

Here are 5 security test cases that AI generated for us last month 
that turned out to be false positives after execution. 

Review them and extract the patterns that caused the false positives, 
so I can add guardrails to future security test generation prompts.

[FP-001]: {description + reason it was a false positive}
[FP-002]: {description + reason}
...

From these false positives:
1. What application context information was missing from my original prompt?
2. What assumptions did AI make that don't match our stack?
3. Write 3-5 "anti-patterns to exclude" instructions I can prepend 
   to future security test generation prompts for this application

Learning Tip: False positives in security testing are more damaging than false positives in functional testing because they create a documented vulnerability record that developers and management will act on. A false positive SQL injection report triggers investigation, developer time, and sometimes emergency patches — all for a vulnerability that doesn't exist. Before you file any security finding, verify it by executing the test manually, observing the actual response, and confirming the vulnerability indicator is real. AI helps you generate test cases fast; you are the verification gate that prevents noise from becoming organizational overhead.