Intuition is a poor guide for prompt optimization. What seems like a cleaner, more concise prompt often produces worse results, while verbose prompts sometimes perform better than expected. A/B testing removes opinion from the equation and replaces it with evidence. This topic covers how to design statistically rigorous experiments on prompts and context strategies, how to run them safely in production, and how to interpret results to make durable improvements.
The Case for Rigorous A/B Testing in LLM Systems
Most teams optimize prompts informally: an engineer tries a few variations in a playground, picks the one that "looks good," and deploys it. This approach has three critical failure modes:
Survivorship bias: You only evaluate the variants you thought to try. The best variant might not have occurred to you.
Sample bias: A handful of test cases is not representative of production distribution. Edge cases, domain-specific inputs, and rare patterns are invisible in manual testing.
Confounding: When you change the prompt and the quality metric changes, you cannot be sure whether the change caused the improvement without controlling for other variables.
Rigorous A/B testing solves all three problems. It forces explicit hypothesis formation, uses representative production traffic, and isolates variables.
What to Test
In the context of token optimization, A/B testing has two objectives:
- Efficiency tests: Does Variant B use fewer tokens than Variant A while maintaining quality?
- Quality-under-constraint tests: If we reduce tokens by X%, how much does quality degrade, and is that degradation acceptable?
The variables you can test fall into three categories:
Prompt variables: System prompt length/structure, instruction verbosity, example count and placement, output format instructions, chain-of-thought elicitation.
Context variables: Retrieval chunk size, number of retrieved chunks, context ordering (most relevant first vs. chronological), inclusion/exclusion of metadata, context compression ratios.
Architecture variables: Model choice (GPT-4o vs. Claude Sonnet vs. Haiku), temperature settings, maximum token limits, tool definition verbosity.
Tip: Focus your first A/B tests on system prompt compression and output format instructions. These two variables consistently yield the highest token savings (often 20–40%) with the least risk to output quality. System prompts are frequently over-engineered, and output format instructions often implicitly invite verbose responses.
Designing a Statistically Sound A/B Test
A poorly designed A/B test wastes time and produces misleading conclusions. The following steps produce a sound experimental design.
Step 1: Write a Hypothesis
A valid hypothesis has three parts: the change, the expected effect, and the rationale.
Bad hypothesis: "Shorter prompt will save tokens."
Good hypothesis: "Removing the 'Reasoning guidelines' section from the PR review system prompt (reducing it from 1,200 to 650 tokens) will reduce average prompt tokens per review by 35–45% while maintaining a PR summary quality score ≥ 4.2/5.0 (current baseline: 4.4/5.0)."
The good hypothesis specifies the exact change, the expected magnitude, and an acceptable quality floor.
Step 2: Define Your Metrics
For each A/B test, define three metrics:
Primary metric (efficiency): Average total tokens per workflow run. This is what you are trying to reduce.
Guard metric (quality): A quality signal that must not degrade beyond a threshold. Options include:
- Human evaluation score (most reliable, expensive)
- LLM-as-judge score (scalable, needs calibration)
- Task completion rate (binary, easy to measure)
- Downstream outcome metric (e.g., PR merge rate, test pass rate)
Secondary metric (cost): Actual USD cost per run, accounting for model pricing and cached tokens.
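These three metric roles can be written down as a small spec that every experiment fills in before launch. The sketch below is illustrative: the `ExperimentSpec` class and its field names are hypothetical, not a library API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExperimentSpec:
    """Declares metrics and thresholds before any traffic is assigned."""
    name: str
    primary_metric: str            # efficiency, e.g. total tokens per run
    expected_reduction_pct: float  # hypothesized effect, e.g. 0.35
    guard_metric: str              # quality signal that must not degrade
    guard_floor: float             # minimum acceptable quality
    secondary_metric: str = "usd_cost_per_run"

    def guard_passed(self, observed_quality: float) -> bool:
        return observed_quality >= self.guard_floor


spec = ExperimentSpec(
    name="pr_review_system_prompt_v2",
    primary_metric="total_tokens_per_run",
    expected_reduction_pct=0.35,
    guard_metric="llm_judge_score",
    guard_floor=4.2,
)
print(spec.guard_passed(4.4))  # True: quality holds above the floor
```

Freezing the thresholds in code before the test starts prevents the common failure of moving the quality goalposts after seeing the results.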
Step 3: Calculate Required Sample Size
Token count distributions are rarely normal — they have heavy right tails from complex inputs. Use the Mann-Whitney U test (non-parametric) rather than a t-test, and calculate sample size accordingly.
For a rough rule of thumb:
- If you expect a 20% reduction in mean tokens with 80% power and α = 0.05, you need approximately 150–200 samples per variant
- If you expect a 10% reduction, you need approximately 500–600 samples per variant
- If you expect a 5% reduction (fine-grained optimization), you need 2,000+ samples per variant
Use the scipy.stats package for sample size estimation:
```python
from scipy import stats
import numpy as np


def estimate_sample_size(
    baseline_mean: float,
    baseline_std: float,
    expected_reduction_pct: float,
    power: float = 0.80,
    alpha: float = 0.05,
) -> int:
    """
    Estimate required sample size per variant for a token reduction test.
    Uses a two-sided test assumption.
    """
    effect_size = (baseline_mean * expected_reduction_pct) / baseline_std
    # Normal-approximation formula for a two-sample comparison
    # (an approximation for non-normal data)
    z_alpha = stats.norm.ppf(1 - alpha / 2)
    z_beta = stats.norm.ppf(power)
    n = 2 * ((z_alpha + z_beta) / effect_size) ** 2
    # Inflate by ~5% when the primary test is Mann-Whitney U: its asymptotic
    # relative efficiency vs. the t-test is 3/pi (about 0.955).
    return int(np.ceil(n / 0.955))


baseline_mean = 12_500     # tokens
baseline_std = 4_200       # tokens
expected_reduction = 0.25  # 25% reduction

n_per_variant = estimate_sample_size(baseline_mean, baseline_std, expected_reduction)
print(f"Required samples per variant: {n_per_variant}")
```
Step 4: Control for Confounders
Key confounders in token A/B tests:
- Input complexity: Long PRs will always use more tokens than short ones. Stratify your traffic assignment by input length buckets to ensure both variants receive similar complexity distributions.
- Time of day/week: Traffic composition and model behavior can drift over time. Run tests for at least 7 days to cover weekly patterns.
- User behavior changes: If users know about the test, they may change their behavior. Use server-side assignment based on session ID hashing, not user-visible flags.
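A minimal sketch of the stratification idea, assuming hash-based assignment as described above: tag every observation with a complexity bucket at logging time, then verify that both variants received a similar mix before trusting any aggregate comparison. The `complexity_bucket` thresholds here are arbitrary examples.

```python
import hashlib
from collections import Counter


def complexity_bucket(text: str) -> str:
    """Bucket inputs by length (thresholds are illustrative)."""
    n = len(text)
    if n < 2_000:
        return "small"
    if n < 10_000:
        return "medium"
    return "large"


def assign(session_id: str, experiment: str, traffic_split: float = 0.5) -> str:
    """Deterministic hash-based assignment, as in the infrastructure section."""
    digest = hashlib.md5(f"{experiment}:{session_id}".encode()).hexdigest()
    return "control" if int(digest, 16) / 2**128 < traffic_split else "treatment"


# Before analyzing, confirm both variants saw a similar complexity mix.
observations = [(f"s{i}", "x" * (i * 37 % 15_000)) for i in range(1000)]
mix = Counter(
    (assign(sid, "exp1"), complexity_bucket(text)) for sid, text in observations
)
print(mix)
```

If the per-bucket counts diverge badly between variants, either re-run with per-bucket assignment or report results stratified by bucket rather than in aggregate.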
Tip: Always run an A/A test before your first real A/B test. An A/A test assigns traffic to two identical variants. If your testing infrastructure is working correctly, there should be no statistically significant difference between them. If there is, you have a measurement problem that will corrupt all subsequent experiments.
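An A/A check can reuse the same analysis path as a real test. The sketch below simulates one with synthetic log-normal token counts standing in for logged production data; with real infrastructure you would split actual logged traffic the same way.

```python
import hashlib
import numpy as np
from scipy import stats

# Synthetic stand-in for logged production token counts
# (log-normal: heavy right tail, like real token distributions).
rng = np.random.default_rng(0)
tokens = rng.lognormal(mean=9.4, sigma=0.3, size=2_000)
session_ids = [f"session-{i}" for i in range(len(tokens))]


def pseudo_variant(session_id: str) -> str:
    """Split identical traffic into two pseudo-variants, as a real test would."""
    digest = hashlib.md5(f"aa_test:{session_id}".encode()).hexdigest()
    return "a1" if int(digest, 16) % 2 == 0 else "a2"


a1 = [t for sid, t in zip(session_ids, tokens) if pseudo_variant(sid) == "a1"]
a2 = [t for sid, t in zip(session_ids, tokens) if pseudo_variant(sid) == "a2"]

_, p_value = stats.mannwhitneyu(a1, a2, alternative="two-sided")
# The variants are identical by construction, so a consistently significant
# result here points to a broken measurement or assignment pipeline.
print(f"A/A p-value: {p_value:.3f}")
```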
Implementing A/B Testing Infrastructure
Simple Traffic Splitting with Feature Flags
For most teams, a feature flag system is the fastest path to A/B testing LLM prompts.
```python
import hashlib
from enum import Enum


class PromptVariant(Enum):
    CONTROL = "control"
    TREATMENT = "treatment"


def assign_variant(session_id: str, experiment_name: str, traffic_split: float = 0.5) -> PromptVariant:
    """
    Deterministic variant assignment based on session ID hash.
    Same session always gets the same variant.
    """
    hash_input = f"{experiment_name}:{session_id}"
    hash_value = int(hashlib.md5(hash_input.encode()).hexdigest(), 16)
    normalized = hash_value / (2 ** 128)  # 0.0 to 1.0
    if normalized < traffic_split:
        return PromptVariant.CONTROL
    return PromptVariant.TREATMENT


SYSTEM_PROMPTS = {
    PromptVariant.CONTROL: """
You are a senior software engineer conducting a thorough code review.
Your role is to examine code changes for correctness, maintainability,
performance issues, security vulnerabilities, and adherence to best practices.

When reviewing, consider:
- Logic correctness and edge cases
- Performance implications
- Security vulnerabilities (injection, auth, data exposure)
- Code style and readability
- Test coverage adequacy
- Documentation completeness

Provide structured feedback with specific line references where applicable.
Format your response as: Summary, Critical Issues, Suggestions, Positive Observations.
""",
    PromptVariant.TREATMENT: """
You are a senior software engineer. Review the code changes for:
- Correctness, edge cases, and logic errors
- Security vulnerabilities
- Performance issues
- Maintainability concerns

Format: Summary | Critical Issues | Suggestions | Positives
Be specific with line references. Be concise.
""",
}


def run_pr_review(session_id: str, pr_diff: str) -> dict:
    variant = assign_variant(session_id, "pr_review_system_prompt_v2")
    system_prompt = SYSTEM_PROMPTS[variant]

    # llm_client: an OpenAI-compatible client configured elsewhere
    response = llm_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": f"Review this PR:\n\n{pr_diff}"},
        ],
    )

    # Log experiment data
    log_experiment_result(
        experiment="pr_review_system_prompt_v2",
        variant=variant.value,
        session_id=session_id,
        input_tokens=response.usage.prompt_tokens,
        output_tokens=response.usage.completion_tokens,
        total_tokens=response.usage.total_tokens,
    )

    return {
        "review": response.choices[0].message.content,
        "variant": variant.value,
    }
```
Collecting Quality Metrics
Token efficiency is only half the story. You must measure quality alongside efficiency.
Option 1: LLM-as-Judge for Quality Scoring
```python
import re


def evaluate_review_quality(review_text: str, pr_diff: str) -> float:
    """
    Use a separate LLM call to score review quality.
    Returns a score from 1.0 to 5.0.
    Note: Use a cheaper/faster model for evaluation.
    """
    # Truncate both inputs for cost control before building the prompt.
    # (An inline "#" comment inside the f-string would leak into the prompt text.)
    judge_prompt = f"""
You are evaluating the quality of a code review. Score it from 1 to 5:
5 = Comprehensive, actionable, specific line references, covers security/perf/logic
4 = Good coverage, mostly specific, minor gaps
3 = Adequate but generic or missing important areas
2 = Superficial, lacks specificity
1 = Unhelpful or incorrect

Code diff:
{pr_diff[:2000]}

Review to evaluate:
{review_text[:1500]}

Respond with ONLY a number from 1 to 5, then one sentence of justification.
"""
    response = fast_llm_client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": judge_prompt}],
        max_tokens=50,
    )
    score_text = response.choices[0].message.content.strip()
    # Parse the leading number rather than the first character,
    # so responses like "4.5 - solid review" are handled correctly.
    match = re.match(r"(\d+(?:\.\d+)?)", score_text)
    if not match:
        raise ValueError(f"Judge returned unparseable score: {score_text!r}")
    return float(match.group(1))
```
Option 2: Downstream Outcome Tracking
Track whether engineers who received AI-reviewed PRs merged them faster or with fewer subsequent bug reports. This is the gold standard quality signal — it measures real impact, not proxy metrics.
```python
def log_pr_outcome(session_id: str, pr_id: str, outcome: dict):
    """
    Called when a PR is merged, rejected, or a bug is filed later.
    Links back to the experiment variant via session_id.
    """
    experiment_db.update_outcome(
        session_id=session_id,
        pr_id=pr_id,
        merged=outcome["merged"],
        days_to_merge=outcome["days_to_merge"],
        post_merge_bugs=outcome["post_merge_bugs_30d"],
        reviewer_time_minutes=outcome["reviewer_time_minutes"],
    )
```
Tip: When using LLM-as-judge for quality evaluation, use a model that is different from and ideally more capable than the model being tested. An LLM evaluating its own outputs tends to rate them higher than they deserve. Cross-model evaluation is more objective.
Analyzing A/B Test Results
Statistical Analysis Workflow
```python
import numpy as np
import pandas as pd
from scipy import stats


def analyze_ab_test(experiment_name: str, min_samples: int = 100):
    """
    Full A/B test analysis: statistical significance, effect size,
    and practical significance.
    """
    # Load experiment data
    df = load_experiment_results(experiment_name)
    control = df[df["variant"] == "control"]["total_tokens"].values
    treatment = df[df["variant"] == "treatment"]["total_tokens"].values

    print(f"Sample sizes: Control={len(control)}, Treatment={len(treatment)}")
    if len(control) < min_samples or len(treatment) < min_samples:
        print(f"WARNING: Insufficient samples. Need {min_samples} per variant.")
        return

    # Non-parametric test (Mann-Whitney U) - appropriate for skewed token distributions
    statistic, p_value = stats.mannwhitneyu(control, treatment, alternative="two-sided")

    # Effect size: percent change in median
    control_median = np.median(control)
    treatment_median = np.median(treatment)
    pct_change = (treatment_median - control_median) / control_median * 100

    # Cohen's d (for reference, though Mann-Whitney is primary);
    # ddof=1 gives the sample standard deviation
    pooled_std = np.sqrt((np.std(control, ddof=1) ** 2 + np.std(treatment, ddof=1) ** 2) / 2)
    cohens_d = (np.mean(treatment) - np.mean(control)) / pooled_std

    # Quality metrics
    control_quality = df[df["variant"] == "control"]["quality_score"].mean()
    treatment_quality = df[df["variant"] == "treatment"]["quality_score"].mean()
    quality_delta = treatment_quality - control_quality

    print(f"\n=== {experiment_name} Results ===")
    print("Token Efficiency:")
    print(f"  Control median:   {control_median:.0f} tokens")
    print(f"  Treatment median: {treatment_median:.0f} tokens")
    print(f"  Change: {pct_change:+.1f}%")
    print(f"  p-value: {p_value:.4f} ({'SIGNIFICANT' if p_value < 0.05 else 'NOT SIGNIFICANT'})")
    print(f"  Cohen's d: {cohens_d:.3f}")
    print("\nQuality:")
    print(f"  Control quality:   {control_quality:.2f}/5.0")
    print(f"  Treatment quality: {treatment_quality:.2f}/5.0")
    print(f"  Quality delta: {quality_delta:+.2f}")

    # Decision framework
    token_improvement = pct_change < -10  # >10% reduction
    quality_acceptable = treatment_quality >= (control_quality - 0.2)  # Within 0.2 points
    statistically_significant = p_value < 0.05

    if token_improvement and quality_acceptable and statistically_significant:
        print("\nRECOMMENDATION: SHIP treatment variant")
        # estimate_monthly_savings: helper that projects the token reduction
        # onto current traffic volume and model pricing
        print(f"  Estimated monthly savings: ${estimate_monthly_savings(pct_change):.2f}")
    elif not statistically_significant:
        print("\nRECOMMENDATION: CONTINUE TEST (insufficient evidence)")
    elif not quality_acceptable:
        print("\nRECOMMENDATION: REJECT (quality degradation exceeds threshold)")
    else:
        print("\nRECOMMENDATION: INVESTIGATE (marginal improvement, consider further iteration)")
```
Common A/B Test Failure Patterns
The novelty effect: When a new prompt style is deployed, users may engage differently just because it is new. Mitigation: run tests for at least 7 days and look for stabilization of the effect.
The Simpson's paradox trap: If input complexity is distributed differently between variants (e.g., treatment gets more complex PRs by chance), the aggregate comparison is misleading. Always report results stratified by input complexity bucket.
The quality lag problem: Some quality signals (like downstream bug rates) take weeks to materialize. Do not ship based solely on immediate quality proxies if the long-term signal matters more. Use guardrail metrics aggressively.
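A stratified report makes the Simpson's paradox trap visible. The sketch below (with illustrative data; `stratified_report` is a hypothetical helper) breaks the comparison down by complexity bucket, so a complexity imbalance between variants cannot hide inside the aggregate numbers.

```python
import pandas as pd


def stratified_report(df: pd.DataFrame) -> pd.DataFrame:
    """Sample size and median tokens per (complexity bucket, variant)."""
    return (
        df.groupby(["complexity_bucket", "variant"])["total_tokens"]
        .agg(n="count", median_tokens="median")
        .reset_index()
    )


# Illustrative data: compare variants within each bucket, not just overall.
df = pd.DataFrame({
    "variant": ["control"] * 4 + ["treatment"] * 4,
    "complexity_bucket": ["small", "small", "large", "large"] * 2,
    "total_tokens": [800, 900, 9000, 9500, 700, 750, 8000, 8600],
})
print(stratified_report(df))
```

If the treatment wins overall but loses (or barely wins) within every bucket, suspect an imbalanced traffic mix rather than a real improvement.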
Tip: Keep an experiment log — a simple spreadsheet or wiki page — with every A/B test you run: hypothesis, dates, sample sizes, result, and decision. After running 20+ experiments, patterns emerge. You will discover that certain prompt modifications (e.g., adding "be concise" to the output instructions) reliably reduce output tokens by 15–25% with minimal quality impact across ALL your agents. These become your default optimization techniques.
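One way to keep that log queryable is a small structured record per experiment. This is a sketch: the `ExperimentLogEntry` fields mirror the list above, and the entry shown is entirely illustrative.

```python
import csv
import io
from dataclasses import dataclass, asdict, fields


@dataclass
class ExperimentLogEntry:
    name: str
    hypothesis: str
    start_date: str
    end_date: str
    n_control: int
    n_treatment: int
    token_change_pct: float
    quality_delta: float
    decision: str  # "ship" | "reject" | "iterate"


log = [
    ExperimentLogEntry(
        name="pr_review_system_prompt_v2",
        hypothesis="Dropping the reasoning-guidelines section cuts prompt tokens 35-45%",
        start_date="2024-05-01",
        end_date="2024-05-08",
        n_control=412,
        n_treatment=398,
        token_change_pct=-38.2,
        quality_delta=-0.1,
        decision="ship",
    ),
]

# Serialize to CSV so the log can live in an ordinary spreadsheet.
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=[f.name for f in fields(ExperimentLogEntry)])
writer.writeheader()
writer.writerows(asdict(entry) for entry in log)
print(buf.getvalue())
```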
Context Strategy A/B Testing
Context management is often higher-leverage than prompt compression. These are the most impactful context variables to test.
RAG Chunk Size Experiments
The chunk size used in retrieval directly affects token consumption and answer quality.
```python
CHUNK_SIZE_EXPERIMENTS = {
    "control": {"chunk_size": 1000, "chunk_overlap": 200, "top_k": 5},
    "treatment_a": {"chunk_size": 500, "chunk_overlap": 100, "top_k": 8},   # Smaller, more chunks
    "treatment_b": {"chunk_size": 2000, "chunk_overlap": 400, "top_k": 3},  # Larger, fewer chunks
}


def run_rag_experiment(query: str, session_id: str) -> dict:
    # get_variant_config (helper): assigns a variant by session hash and
    # returns its config dict with the variant name attached under "name".
    variant_config = get_variant_config(session_id, "chunk_size_v3", CHUNK_SIZE_EXPERIMENTS)

    chunks = retrieve_chunks(
        query=query,
        chunk_size=variant_config["chunk_size"],
        chunk_overlap=variant_config["chunk_overlap"],
        top_k=variant_config["top_k"],
    )
    context_tokens = count_tokens(chunks)
    response = generate_answer(query, chunks)

    log_rag_experiment(
        variant=variant_config["name"],
        query_tokens=count_tokens(query),
        context_tokens=context_tokens,
        response_tokens=count_tokens(response),
        answer_relevance_score=evaluate_answer(query, response),
    )
    return response
```
Context Ordering Tests
Research shows that LLMs recall information from the beginning and end of context better than from the middle (the "lost in the middle" problem). Test whether reordering context chunks improves quality without adding tokens:
- Control: Chunks ordered by relevance score (highest first)
- Treatment A: Most relevant chunk first, then least relevant, filling middle with others
- Treatment B: Most recent chunks first (temporal ordering)
- Treatment C: Relevance score ordering but with the top chunk repeated at the end
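One way to implement an edge-weighted ordering in the spirit of Treatment A is to alternate relevance-sorted chunks between the front and back of the context, leaving the weakest in the middle. This is a sketch of one such ordering, not a tuned strategy.

```python
def edge_weighted_order(chunks_by_relevance: list) -> list:
    """Place the strongest chunks at the start and end of the context,
    where recall is best, and the weakest in the middle."""
    front, back = [], []
    for i, chunk in enumerate(chunks_by_relevance):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


chunks = ["c1", "c2", "c3", "c4", "c5"]  # already sorted, most relevant first
print(edge_weighted_order(chunks))      # ['c1', 'c3', 'c5', 'c4', 'c2']
```

Here the top two chunks land at the first and last positions, and the least relevant chunk sits in the middle, where "lost in the middle" effects are strongest.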
Tip: Run your context strategy experiments with fixed prompts. If you change the prompt and the context strategy simultaneously, you cannot attribute the result to either change. Lock one variable while testing the other. This single discipline — changing one variable at a time — is what separates teams that learn systematically from teams that make random progress.
Building a Prompt A/B Testing Pipeline with LangSmith
LangSmith has native support for dataset-based evaluation that can serve as a structured A/B testing pipeline:
```python
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

dataset = client.create_dataset("pr_review_eval_set_v1")
for sample in production_samples[:200]:
    client.create_example(
        inputs={"pr_diff": sample["pr_diff"]},
        outputs={"expected_quality": sample["human_quality_score"]},
        dataset_id=dataset.id,
    )


def control_pr_reviewer(inputs: dict) -> dict:
    return {"review": run_with_control_prompt(inputs["pr_diff"])}


def treatment_pr_reviewer(inputs: dict) -> dict:
    return {"review": run_with_treatment_prompt(inputs["pr_diff"])}


results = evaluate(
    treatment_pr_reviewer,
    data=dataset.name,  # evaluate() accepts a dataset name or ID
    evaluators=[quality_evaluator, token_efficiency_evaluator],
    experiment_prefix="pr_review_prompt_v2_treatment",
)

compare_results = evaluate(
    control_pr_reviewer,
    data=dataset.name,
    evaluators=[quality_evaluator, token_efficiency_evaluator],
    experiment_prefix="pr_review_prompt_v2_control",
)
```
This approach gives you reproducible, dataset-grounded comparisons that are fully auditable in the LangSmith UI.
Summary
A/B testing transforms prompt optimization from art to engineering. The key practices are:
- Write explicit hypotheses with expected magnitude and acceptable quality floors
- Define primary (efficiency), guard (quality), and secondary (cost) metrics before running any test
- Calculate required sample sizes using statistical power analysis — underpowered tests produce misleading results
- Use deterministic hash-based traffic assignment for reproducibility
- Use non-parametric statistical tests (Mann-Whitney U) for token distribution comparison
- Test context strategies (chunk size, ordering) separately from prompt variables
- Maintain an experiment log to identify patterns across your entire optimization history