Validating Assumptions Experiments

Overview

Every product decision is a bet. The bet has a set of assumptions underneath it: assumptions about who the customer is, what they need, whether they value what you plan to build, whether you can build it at the required cost and quality, and whether the business model will work. Most of these assumptions are never made explicit, which means they are also never systematically tested. Instead, teams discover they were wrong about a critical assumption after they have shipped the product, analyzed the usage data, and realized the thing they built is not being used in the way they expected — or at all.

The discipline of assumption mapping and experiment design is about making the implicit explicit, systematically identifying which assumptions carry the most risk, and designing the most efficient possible evidence-gathering activities to test those assumptions before making irreversible resource commitments. This is not the same as being risk-averse — it is about being risk-intelligent. A team that validates its riskiest assumptions early and cheaply can take on more ambitious bets with greater confidence, because they have reduced uncertainty on the dimensions that matter most.

AI makes assumption mapping and experiment design significantly more accessible and more thorough. Experienced practitioners know how to run these activities, but they are rarely done with full rigor because they are time-consuming. AI can generate a comprehensive assumption map in minutes, suggest experiment designs with detailed specifications, calculate sample sizes, and help interpret results. What used to be a half-day workshop becomes an hour of focused AI-assisted work followed by team review.

This topic covers the complete validation workflow: surfacing all categories of assumptions underlying a product bet, designing appropriate experiments for each assumption category, defining success metrics and sample sizes with statistical rigor, and interpreting results to make a confident next-step decision. The goal is to give you a complete, executable validation capability that you can apply to any product decision, from a tactical UX change to a major strategic bet.


How to Use AI to Identify and List Critical Assumptions Behind a Product Bet

Every product bet rests on a stack of assumptions across three categories: desirability (customers want and value this), feasibility (we can build it to the required standard), and viability (this will generate enough value to be worth the investment). The desirability assumptions are typically the ones most in need of validation — they concern customer behavior, which is notoriously hard to predict from intuition or analogy. Feasibility and viability assumptions often have more accessible validation paths through technical spikes and financial modeling respectively.

The most common failure mode in assumption mapping is incompleteness: teams identify the obvious assumptions (customers want this feature) but miss the less obvious ones (customers will discover and use this feature given our current navigation structure, customers will be willing to change their current workflow to use this feature, customers' organizations will permit use of this feature given their security policies). These second-order assumptions are often the ones that cause product failures, precisely because they were never identified and never tested.

A useful mental model for assumption completeness is the "how would this fail?" analysis. For each assumption category, ask: what would have to be false for this product bet to fail despite perfect execution? This adversarial framing tends to surface the assumptions that are most taken for granted, because the failure modes that come to mind are often linked to assumptions the team has not consciously acknowledged.

AI is particularly effective at assumption surfacing because it can systematically apply multiple frameworks and failure mode lenses without the cognitive fatigue that makes human assumption mapping sessions incomplete. When you provide AI with a detailed product brief, it can generate a comprehensive assumption register covering all three categories, differentiate between high-risk and low-risk assumptions, and suggest validation priority order — all in a single well-structured prompt.

Hands-On Steps

  1. Prepare a detailed product brief for the bet you want to map assumptions for: the problem you are solving, the target user, the proposed solution approach, the intended behavior change, and the success metrics.
  2. Run the assumption mapping prompt to generate a complete assumption register across desirability, feasibility, and viability dimensions.
  3. Review the AI-generated register with your team and add any assumptions that are missing based on your domain knowledge and organizational context.
  4. Score each assumption on two dimensions: certainty (how confident are you that this is true?) and criticality (how catastrophic would it be if this assumption were false?).
  5. Create a 2x2 priority matrix: High Criticality + Low Certainty assumptions are your validation priorities (see the sketch after this list).
  6. For each high-priority assumption, run the assumption investigation prompt to identify the most efficient validation path.

Prompt Examples

Prompt: Comprehensive Assumption Mapping

You are conducting assumption mapping for a product investment decision.

**Product Brief:**
[PASTE YOUR DETAILED PRODUCT BRIEF — include the problem statement, target user, proposed solution, intended user behavior change, and business model]

Generate a comprehensive assumption register across three categories. For each assumption:
- State the assumption explicitly as a declarative sentence ("We assume that...")
- Identify the assumption type (Desirability, Feasibility, or Viability)
- Rate the assumption's current certainty: Unknown (no evidence), Low (weak indirect evidence), Medium (directional evidence), High (strong direct evidence)
- Rate the assumption's criticality: Critical (product fails if false), Significant (major rework required if false), Minor (addressable with small adjustments)
- Suggest the simplest possible test that could validate or refute this assumption

**DESIRABILITY ASSUMPTIONS:**
Cover: customer awareness of the problem, motivation to solve it, willingness to change current behavior, perceived value of the solution, frequency of need, accessibility of the solution in context, social acceptability, emotional drivers

**FEASIBILITY ASSUMPTIONS:**
Cover: technical capabilities required, data availability and quality, third-party dependencies, team skills and capacity, performance and scalability requirements, security and compliance requirements, integration requirements

**VIABILITY ASSUMPTIONS:**
Cover: willingness to pay, price sensitivity, business model fit, channel assumptions, customer acquisition cost, retention and engagement assumptions, competitive defensibility, regulatory environment

After generating the register:
1. Identify the 5 assumptions with the highest combination of criticality and uncertainty — these are your validation priorities.
2. Identify any assumptions that are unknowable without building and shipping the product — these must be accepted as residual risk.
3. Identify any assumptions that are already validated by existing research or data — mark these as "pre-validated."

Expected output: A comprehensive assumption register with 20–35 assumptions across all three categories, each with certainty and criticality ratings, a simple test suggestion, and an overall prioritization of the top 5 validation targets.


Prompt: Hidden Assumption Surfacing

I want to identify the assumptions that are so deeply embedded in my product brief that they are not immediately obvious — the ones my team might be taking for granted.

**Product Brief:** [PASTE YOUR PRODUCT BRIEF]
**Assumption Register Generated So Far:** [PASTE YOUR CURRENT ASSUMPTION REGISTER]

Look for hidden assumptions in the following categories:

1. **Behavioral assumptions:** We are assuming users will do X (discover, adopt, change behavior, pay, return). Are these behavior assumptions realistic given what we know about user behavior change?

2. **Context assumptions:** We are assuming this product is used in a specific context (device, environment, time, with specific other tools). Are these context assumptions validated?

3. **Organizational assumptions (for B2B):** We are assuming the buyer, user, and influencer align. We are assuming procurement, security, and IT will not block adoption. Are these organizational dynamics assumptions tested?

4. **Counter-narrative assumptions:** We are assuming there is no better alternative emerging, no regulatory threat, no market shift that would make this solution obsolete. Are these "no-threat" assumptions explicit?

5. **Timing assumptions:** We are assuming the market is ready for this now, that customers are experiencing pain intensely enough to adopt now, and that our team can execute in the required timeframe. Are these timing assumptions grounded?

For each hidden assumption surfaced, add it to the assumption register with a criticality and certainty rating.

Expected output: An extended assumption register that surfaces 5–15 additional hidden assumptions across behavioral, contextual, organizational, counter-narrative, and timing dimensions. These are often the assumptions most worth testing, precisely because they were not noticed.

Learning Tip: Keep a "killed assumptions" log alongside your assumption register. Every time you validate an assumption (proving it true or false), record the assumption, the evidence used, the result, and the date. Over time this log becomes one of the most valuable documents in your product organization — it records what you have learned from past bets, prevents repeating the same validation work, and lets new team members understand why certain decisions were made the way they were.


Generating Experiment Designs — A/B Tests, Fake Door Tests, Concierge MVPs — with AI

Experiment design is a technical discipline that most product managers and business analysts are underprepared for. The result is experiments that are poorly specified, generate ambiguous results, or test the wrong thing — and then lead to either false confidence in the assumption ("the test passed") or false rejection of it ("the test failed"). AI can help you design well-specified, appropriately rigorous experiments for each validation challenge, matching the method to the assumption, the risk level, and the available resources.

The three most widely applicable experiment types in product discovery are A/B tests (comparing two versions of an existing feature or flow in a live product), fake door tests (measuring demand for a feature before building it by presenting the feature entry point and tracking click-through or sign-up), and concierge MVPs (manually delivering the value proposition to a small set of real customers before building the automated product). Each type has a specific application profile, and choosing the wrong type for your assumption produces a less efficient validation.

A/B tests are best for validating behavioral assumptions about an existing user base with measurable outcomes. They require a live product, sufficient traffic volume, and a behavioral metric that can be tracked cleanly. They are poor for validating demand for fundamentally new capabilities or value propositions, because you cannot A/B test something that does not yet exist.

Fake door tests (also called "painted door" or "smoke screen" tests) are best for validating demand before building. A typical implementation: add a button or link for a new feature, track the click-through rate, and present an "under construction" or "coming soon" page to users who click. The click-through rate is a behavioral signal of demand — it measures whether users notice and choose the option, which is a stronger signal than survey-stated intent. Fake door tests require very little engineering and can run in days.
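If you want to sanity-check the demand signal yourself, a short sketch like the one below turns raw fake-door counts into a click-through rate with an uncertainty range. The counts, decision thresholds, and use of a Wilson interval from statsmodels are illustrative assumptions, not fixed values.

```python
# Turn fake-door counts into a demand signal with a confidence interval,
# then compare the whole interval to pre-registered decision thresholds.
from statsmodels.stats.proportion import proportion_confint

viewers = 4200   # unique users who saw the fake-door entry point (placeholder)
clickers = 310   # unique users who clicked it (placeholder)

ctr = clickers / viewers
low, high = proportion_confint(clickers, viewers, alpha=0.05, method="wilson")
print(f"Click-through rate: {ctr:.1%} (95% CI {low:.1%} to {high:.1%})")

# Example thresholds: "validated" above 5%, "not validated" below 2%, ambiguous in between.
VALIDATED_ABOVE, REFUTED_BELOW = 0.05, 0.02
if low >= VALIDATED_ABOVE:
    print("Demand validated")
elif high <= REFUTED_BELOW:
    print("Demand not validated")
else:
    print("Ambiguous: extend the test or refine the entry point")
```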

Concierge MVPs are best for validating the value proposition and understanding the customer workflow before automating it. A concierge MVP delivers the outcome manually: if you are building an AI-powered data analysis tool, the concierge MVP has your team manually analyze the data and deliver the output to a small set of customers, before any product is built. This approach is invaluable for discovering what "success" actually looks like from the customer's perspective before committing to an automated implementation.

Hands-On Steps

  1. Take your top 3–5 high-priority assumptions from the assumption register.
  2. For each assumption, identify the experiment type that best fits: A/B test (for behavioral hypotheses on existing product with sufficient traffic), fake door test (for demand hypotheses on new capabilities), or concierge MVP (for value proposition hypotheses on new products or major pivots).
  3. Run the experiment design prompt for each assumption, specifying the experiment type you have selected.
  4. Review the experiment designs for completeness: hypothesis, control, treatment, success metric, sample size, duration, analysis plan, decision criteria.
  5. Identify any ethical considerations — particularly for fake door tests, where customers are temporarily misled about feature availability. Define how you will handle customers who engage with the fake door.
  6. Sequence the experiments based on dependency, cost, and risk: run the cheapest, fastest disconfirmation experiments first.

Prompt Examples

Prompt: Fake Door Test Design

I need to design a fake door test to validate the following assumption:

**Assumption:** [PASTE YOUR SPECIFIC DEMAND ASSUMPTION — e.g., "Enterprise users will actively seek out and attempt to use a bulk data export capability if it is presented in the product interface"]

**Product context:**
- Current product: [Brief description of existing product]
- User base: [Description of users who will encounter the test]
- Monthly active users in the relevant flow: [Number]
- Current navigation structure relevant to the test: [Brief description]

Design a complete fake door test with:

1. **Entry point design:** Where exactly in the product will the fake door appear? What form will it take (button, menu item, feature card, etc.)? What copy/label will it use?

2. **Behind-the-door experience:** What happens when a user clicks the fake door? (Options: coming soon page with email signup, waiting list form, brief survey, direct contact with PM team) Recommend the approach that balances data quality with user experience impact.

3. **Primary metric:** What user behavior is being measured, and how? Define the metric precisely (e.g., "click-through rate on the Export button in the Settings > Data section, measured as unique users who clicked / unique users who viewed the Settings > Data section").

4. **Secondary signals:** What other data points would enrich the primary metric? (e.g., segment breakdown of who clicked, time-of-day patterns, prior session behavior before clicking)

5. **Duration:** How long should the fake door run to collect sufficient data, given the traffic levels provided?

6. **Decision threshold:** What click-through rate would count as "demand validated"? What would count as "demand not validated"? What range is ambiguous?

7. **Ethics and user experience:** How will you handle users who engage with the fake door? What is your follow-up plan for users who sign up for a waiting list?

Expected output: A fully specified fake door test design with entry point specification, behind-the-door experience design, primary and secondary metrics, duration, decision threshold, and ethics plan. Ready to hand to an engineer or designer for implementation.


Prompt: Concierge MVP Design

I need to design a concierge MVP to validate the following value proposition assumption:

**Assumption:** [PASTE YOUR SPECIFIC VALUE PROPOSITION ASSUMPTION — e.g., "If users receive AI-generated weekly insights from their product usage data, they will find enough value to incorporate this into their weekly decision-making workflow and express willingness to pay for it as a premium feature"]

**Proposed product concept:**
[BRIEF DESCRIPTION OF THE EVENTUAL AUTOMATED PRODUCT]

**Available resources for the concierge:**
- Team members who can deliver the value manually: [Who and their available time]
- Access to the customer data needed: [Yes/No and any constraints]
- Customer segment you can recruit: [Target number and how you will reach them]

Design a complete concierge MVP:

1. **Concierge delivery design:** What exactly will you manually produce and deliver to customers? Be specific about the content, format, delivery channel, and delivery cadence.

2. **Customer recruitment:** How will you recruit the right customers for the concierge? What screening criteria ensure they match the target segment?

3. **Delivery protocol:** What is the exact process for producing and delivering each concierge interaction? Who does what, by when? What quality standards apply?

4. **Learning instrument:** How will you collect structured feedback from concierge customers? Design 4-6 specific questions you will ask after each delivery cycle to assess value, behavior change, and willingness to pay.

5. **Success criteria:** What customer behaviors or statements would validate the value proposition assumption? What would refute it?

6. **Duration and scope:** How many customers, over how many delivery cycles, is needed to generate enough evidence to make a go/no-go decision on building the automated product?

7. **Key learning questions:** Beyond validating the core assumption, what 3 things about the customer experience are you most hoping to learn from this concierge that will inform the product design?

Expected output: A complete concierge MVP design specification covering delivery design, recruitment, protocol, feedback collection, success criteria, scope, and key learning questions. This specification can be handed to the team to execute the concierge study.

Learning Tip: For fake door tests, always design the "behind the door" experience as thoughtfully as the fake door entry point itself. A poorly designed "coming soon" page that just says "this feature isn't available yet" teaches you almost nothing beyond click-through rate. A well-designed page that includes 2-3 clarifying questions about what the user was trying to do, or a brief sign-up form that captures email and use case, turns a one-bit signal (clicked / didn't click) into a rich dataset about demand characteristics and use case distribution. The incremental effort is small; the learning value is substantial.


Using AI to Define Success Metrics and Minimum Sample Sizes for Experiments

Success metric definition and sample size calculation are the two most frequently skipped steps in informal product experimentation. Teams run A/B tests that were never properly powered, declare results after one week when the true difference between variants is not statistically detectable in that time, and make product decisions based on data that cannot support the conclusions drawn from it. AI can help you get these technical elements right quickly, without requiring deep statistical expertise.

Success metrics for experiments must satisfy three criteria: measurability (you can actually observe and record this with your current tools), sensitivity (the metric will change by a detectable amount if your hypothesis is true), and proximity (the metric is close enough to the behavior of interest that it reflects it cleanly, rather than being a distant proxy that could move for many reasons). The most common metric failure mode is using distant proxy metrics — overall revenue, DAU, NPS — to evaluate specific feature experiments, when these metrics are so influenced by other factors that the experiment's effect is undetectable.

The distinction between leading and lagging success metrics is important for experiment design. Lagging metrics (revenue impact, retention rate, NPS change) reflect the ultimate outcome you care about but take months to manifest clearly. Leading metrics (feature activation rate, workflow completion rate, time-on-task) are closer in time to the user behavior you are influencing and respond faster — but they are only valuable if you have evidence that they actually predict the lagging outcomes you care about. Good experiment design uses a leading metric as the primary decision metric (fast, sensitive signal) with a lagging metric as a longer-term confirmation (ultimate business impact).
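
A quick way to pressure-test that pairing is to check, on historical cohorts, whether the candidate leading metric actually tracks the lagging outcome. The sketch below assumes a pandas workflow; the cohort values, column names, and the 0.5 correlation bar are illustrative, not prescriptive.

```python
# Check whether a candidate leading metric predicts the lagging outcome across past cohorts.
import pandas as pd

# One row per historical cohort: share who activated the feature in week 1,
# and the same cohort's retention rate at week 12 (placeholder data).
cohorts = pd.DataFrame({
    "cohort":           ["2024-01", "2024-02", "2024-03", "2024-04", "2024-05"],
    "week1_activation": [0.22, 0.31, 0.27, 0.35, 0.29],
    "week12_retention": [0.41, 0.49, 0.44, 0.53, 0.46],
})

corr = cohorts["week1_activation"].corr(cohorts["week12_retention"])
print(f"Correlation between leading and lagging metric: {corr:.2f}")

# A weak or negative correlation means the "leading" metric is not a safe
# primary decision metric for experiments aimed at the lagging outcome.
if corr < 0.5:
    print("Weak link: treat the leading metric with caution as a decision metric")
```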

Sample size calculation is not optional for A/B tests — it is the calculation that tells you whether your experiment can actually detect a real effect. An underpowered experiment that runs for two weeks with 500 users per variant cannot reliably detect a 5% improvement in conversion rate; the result will appear non-significant regardless of whether the effect is real or not. AI can calculate the required sample size given your baseline conversion rate, expected lift, statistical significance threshold, and desired statistical power — and then tell you whether your traffic levels support running the experiment in a reasonable timeframe.
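
If you want to verify an AI-generated calculation independently, the standard two-proportion sample size formula is easy to run yourself. The sketch below assumes Python with scipy and reuses the 38% baseline and 5-point lift from the example prompt further down; the weekly traffic figure is a placeholder.

```python
# Standard sample size formula for comparing two proportions (two-sided test).
import math
from scipy.stats import norm

def sample_size_per_variant(p1, p2, alpha=0.05, power=0.80):
    """Per-variant n needed to detect a shift from p1 to p2 at the given alpha and power."""
    z_alpha = norm.ppf(1 - alpha / 2)   # critical value for the significance level
    z_beta = norm.ppf(power)            # critical value for the desired power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p1 - p2) ** 2)

baseline, target = 0.38, 0.43            # 38% baseline, 5-point minimum detectable effect
n = sample_size_per_variant(baseline, target)
print(f"Required sample per variant: {n}")            # roughly 1,500 per variant

weekly_traffic = 3_000                   # users entering the flow per week (placeholder)
print(f"Estimated duration: {2 * n / weekly_traffic:.1f} weeks")  # two variants share the traffic

# Sensitivity: a smaller detectable effect or stricter confidence inflates the requirement.
print(sample_size_per_variant(0.38, 0.41))                 # smaller effect, much larger n
print(sample_size_per_variant(0.38, 0.43, alpha=0.10))     # 90% confidence, smaller n
```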

Hands-On Steps

  1. For each experiment in your validation plan, define the primary metric before writing any other part of the experiment design.
  2. Check the primary metric against the three criteria: measurability, sensitivity, and proximity to the behavior of interest.
  3. Run the metric evaluation prompt to have AI assess your metric choice and suggest improvements or alternatives.
  4. Run the sample size calculation prompt with your baseline conversion rate, expected effect size, and traffic data.
  5. Review the calculated sample size against your available traffic and experiment duration constraints.
  6. If required sample size exceeds available traffic, use the underpowered experiment options prompt to identify alternatives: reduce the number of variants, increase the effect size you are testing for, or accept a lower confidence level with explicit acknowledgment.
  7. Define both your primary metric (leading indicator, fast signal) and a confirmatory metric (lagging indicator, long-term validation) for each experiment.

Prompt Examples

Prompt: Success Metric Evaluation and Design

I am designing an experiment to validate the following hypothesis:

**Hypothesis:** [PASTE YOUR TESTABLE HYPOTHESIS]

**Proposed primary metric:** [PASTE YOUR PROPOSED METRIC]

Evaluate my proposed metric against the following criteria:

1. **Measurability:** Can this metric be measured with standard product analytics tools (e.g., Mixpanel, Amplitude, Google Analytics, SQL query on product database)? If not, what would be required to measure it?

2. **Sensitivity:** If my hypothesis is true and the expected behavior change occurs, will this metric move by a detectable amount? What range of change would I expect to see if the hypothesis is validated?

3. **Proximity:** How directly does this metric reflect the specific behavior I am trying to influence? What other factors (besides my experiment) could cause this metric to move in the same direction? How could I isolate the experiment's effect?

4. **Metric alternatives:** Suggest 2-3 alternative or complementary metrics that might be more sensitive or more proximate to the behavior of interest. For each alternative, explain its advantages and disadvantages relative to my proposed metric.

5. **Leading vs. lagging:** Is my proposed metric a leading indicator (fast-moving, close to behavior) or lagging indicator (slow-moving, reflects ultimate business outcome)? Recommend a pairing of one leading and one lagging metric for this experiment.

Conclude with a metric recommendation: either confirm my proposed metric with any adjustments needed, or recommend a better alternative with rationale.

Expected output: A structured metric evaluation with assessment on all three criteria, alternative metric suggestions, leading/lagging classification, and a final metric recommendation with any refinements.


Prompt: Sample Size Calculation

I need to calculate the minimum sample size required for the following A/B test experiment.

**Experiment parameters:**
- Hypothesis: [PASTE YOUR HYPOTHESIS]
- Primary metric: [PASTE YOUR CHOSEN METRIC]
- Baseline conversion rate: [PASTE YOUR CURRENT BASELINE — e.g., "Our current onboarding completion rate is 38%"]
- Minimum detectable effect: [PASTE THE MINIMUM IMPROVEMENT THAT WOULD BE MEANINGFUL — e.g., "We need at least a 5 percentage point improvement (from 38% to 43%) for this experiment to justify the implementation cost"]
- Statistical significance threshold: 95% confidence (alpha = 0.05) [adjust if different]
- Statistical power: 80% (beta = 0.20) [adjust if different]
- Number of variants: [2 = standard A/B, or more for A/B/n tests]

Calculate:
1. The minimum sample size required per variant to detect the specified effect with the specified confidence and power.
2. Given our current traffic to this flow ([PASTE DAILY OR WEEKLY TRAFFIC VOLUME]), how many days or weeks would this experiment need to run to reach the required sample size?
3. What happens to the required sample size if I reduce my minimum detectable effect to [LOWER NUMBER]? Show the calculation.
4. What happens to the required sample size if I accept 90% confidence instead of 95%? Show the calculation.
5. If we have seasonal traffic patterns (e.g., lower volume on weekends, spikes during month-end), what adjustments should I make to the experiment duration?

Provide the calculations with the formula used, so I can verify and adjust independently.

Expected output: A complete sample size calculation with the required per-variant sample, experiment duration estimate, and sensitivity analyses showing how the requirements change with different effect size and confidence assumptions. The formula will be included so you can verify and run your own calculations.

Learning Tip: When presenting experiment results to stakeholders, always include the sample size and confidence level in the headline number. "The experimental variant increased completion rate by 6%" is a claim that means very different things at 95% confidence with n=2,000 vs. 72% confidence with n=200. Stakeholders who are not statisticians will treat both as equally valid, and may make significant product decisions based on underpowered experiments. Building the habit of leading with confidence and sample size in experiment readouts creates a more statistically literate product culture over time.


How to Interpret Experiment Results and Decide Next Steps with AI Assistance

Experiment result interpretation is where many product teams make their worst decisions. The most common failure mode is "confirmation bias by result selection": running an experiment, seeing a result that supports the prior belief, declaring success, and moving on — without examining whether the result was actually statistically significant, whether the effect size is meaningful in practical terms, whether the result holds across segments, or whether there are concerning side effects in other metrics. AI can help you interpret results more rigorously by applying a structured analysis framework that checks all these dimensions systematically.

The decision after an experiment should rarely be a simple binary proceed/kill. More often, the right decision is proceed with modifications, conduct a follow-up experiment, proceed for a specific segment only, or shelve for now and revisit when conditions change. Framing experiment decisions as a decision tree rather than a binary choice simplifies the analysis and produces more calibrated outcomes: you are asking "what does this result tell us and what should we do next?" rather than "did we win or lose?"

The interpretation challenge that AI is most useful for is segment analysis. An experiment that appears non-significant in aggregate may contain a strongly significant effect in a specific user segment. Conversely, an experiment that shows a significant positive effect in aggregate may be hiding a strong negative effect in a high-value minority segment. AI can guide you through the segment breakdown analysis systematically, flagging segments that warrant closer examination.
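
As a rough illustration, the same significance test can be repeated per segment. The segment names and counts below are placeholders, and with many segments any significant sub-result should be treated as a hypothesis to re-test rather than a confirmed finding, since multiple comparisons inflate false positives.

```python
# Run the same two-proportion test per segment to spot divergent or harmful effects.
from statsmodels.stats.proportion import proportions_ztest

# (control_successes, control_n, treatment_successes, treatment_n) per segment (placeholders)
segments = {
    "admin users":     (310, 800, 365, 810),
    "end users":       (1020, 2700, 1018, 2690),
    "enterprise plan": (95, 240, 131, 250),
}

for name, (c_conv, c_n, t_conv, t_n) in segments.items():
    z, p = proportions_ztest(count=[t_conv, c_conv], nobs=[t_n, c_n])
    lift = t_conv / t_n - c_conv / c_n
    print(f"{name:<16} lift {lift:+.1%}  p = {p:.3f}")
```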

Statistical significance tells you whether the observed difference is likely to be real (not random noise). Practical significance tells you whether the real difference is large enough to matter. These are independent questions, and good experiment interpretation addresses both. A statistically significant improvement from 38% to 39.1% completion rate (p=0.03) may not be worth implementing if the business case requires a 5-point lift. AI can help you calculate and communicate both dimensions clearly.
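
The sketch below illustrates that separation with a two-proportion z-test in Python (statsmodels). The counts are constructed placeholders chosen to produce a significant but small lift, and the 5-point business threshold stands in for the minimum detectable effect you would have set in the experiment design.

```python
# Separate statistical significance (is the difference real?) from
# practical significance (is it large enough to matter?).
from statsmodels.stats.proportion import proportions_ztest

control_n, control_conv = 18500, 7030      # 38.0% completion (placeholder)
treatment_n, treatment_conv = 18500, 7234  # ~39.1% completion (placeholder)

z, p_value = proportions_ztest(
    count=[treatment_conv, control_conv],
    nobs=[treatment_n, control_n],
)

lift_pp = treatment_conv / treatment_n - control_conv / control_n
MINIMUM_MEANINGFUL_LIFT = 0.05  # business case requires a 5-point lift (assumption)

print(f"Observed lift: {lift_pp:+.1%} (p = {p_value:.3f})")
print(f"Statistically significant at 95%: {p_value < 0.05}")
print(f"Practically significant for the business case: {lift_pp >= MINIMUM_MEANINGFUL_LIFT}")
```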

Hands-On Steps

  1. Collect your experiment results: primary metric performance for control and treatment groups, sample sizes, observed difference, p-value (or confidence interval), and any available segment breakdowns.
  2. Run the experiment result interpretation prompt with your full results data.
  3. Review the AI interpretation for the statistical significance check, practical significance assessment, segment analysis, and side effect check.
  4. Run the decision framework prompt to get a structured recommendation on next steps.
  5. Document the experiment, its results, and your decision in your assumption register and validation log.
  6. If the decision is to proceed, update your hypothesis register to reflect the validated assumptions.
  7. If the decision is to kill or pivot, document the learning and the impact on your product strategy explicitly — this prevents the same failed bet from being re-proposed without the benefit of this learning.

Prompt Examples

Prompt: Experiment Result Interpretation

I have completed an experiment and need to interpret the results rigorously. Please analyze the following data:

**Experiment summary:**
- Hypothesis tested: [PASTE HYPOTHESIS]
- Experiment type: [A/B test / fake door / concierge MVP / etc.]
- Duration: [Start date to end date]
- Primary metric: [Metric name and definition]

**Results:**
- Control group: [N users, X% primary metric rate]
- Treatment group: [N users, Y% primary metric rate]
- Observed difference: [Y - X percentage points]
- Statistical significance: [p-value or confidence interval if available]
- Secondary metric results: [Any other metrics tracked]

**Segment data available:**
[PASTE ANY SEGMENT BREAKDOWNS — by user role, plan type, cohort, geography, etc.]

Perform the following analysis:

1. **Statistical validity check:** Is the observed difference statistically significant at the 95% confidence level? If significance data was not provided, what additional calculation would be needed? Are there any data quality issues that could invalidate the results (e.g., sample ratio mismatch, novelty effect)?

2. **Practical significance check:** Is the observed difference large enough to be meaningful for our business goals? Compare the observed effect size to the minimum detectable effect defined in the experiment design. Does this lift justify the implementation cost?

3. **Segment analysis:** Do any segments show notably different results from the aggregate? Are there segments where the effect is significantly larger or smaller? Are there any segments where the treatment appears harmful?

4. **Side effects check:** Did any secondary metrics move in unexpected directions? Could the improvement in the primary metric come at the cost of another important metric?

5. **Alternative explanations:** What alternative explanations (besides "our hypothesis was correct") could account for these results? How likely are each of these alternatives?

Expected output: A structured result interpretation covering statistical validity, practical significance, segment analysis, side effects, and alternative explanations. This analysis is the rigorous basis for a go/no-go decision.


Prompt: Next-Step Decision Framework

Based on the experiment result interpretation above, help me decide on the right next step using a structured decision framework.

**Interpretation summary:** [PASTE KEY FINDINGS FROM INTERPRETATION PROMPT]

**Available decisions:**
A. Proceed — implement the tested change for all users
B. Proceed for segment — implement for a specific segment where results were strongest
C. Iterate — run a follow-up experiment with a modified treatment
D. Extend the experiment — continue running to reach required confidence or sample size
E. Kill — do not proceed; this assumption is not validated
F. Pivot — the experiment results suggest a different, more promising direction

Evaluate each decision option:
1. Which decision is most supported by the evidence?
2. What are the risks of that decision (what could go wrong if we proceed with it)?
3. What conditions would need to be true for an alternative decision to be more appropriate?
4. If you recommend "Iterate," what specifically should change in the next experiment design?
5. If you recommend "Kill," what should we learn from this result and how should it change our problem framing?

Conclude with a clear recommendation — one specific next step with a brief rationale — that I can bring to my product team for alignment.

Expected output: A structured decision analysis evaluating all options against the evidence, with a clear final recommendation and rationale. This is the output you bring to your product team to align on next steps after an experiment.

Learning Tip: Build an "experiment results gallery" for your product team — a shared document or wiki page where every completed experiment is recorded with its hypothesis, design, results, interpretation, and decision. Teams that maintain this gallery benefit in three ways: they avoid re-running experiments that were already run, they can identify patterns across experiments (e.g., "every time we add friction for power users in order to help new users, power users disengage — this has happened 3 times"), and they build the team's collective statistical literacy over time. The gallery also becomes powerful evidence of a discovery discipline when presenting to executives or investors.


Key Takeaways

  • Every product decision rests on a stack of desirability, feasibility, and viability assumptions; making them explicit is the precondition for validating them.
  • Hidden assumptions — behavioral, contextual, organizational, counter-narrative, and timing — are often the most critical ones to validate, precisely because they are most taken for granted.
  • Experiment type selection should match the assumption being tested: A/B tests for behavioral hypotheses on existing products, fake door tests for demand validation before building, concierge MVPs for value proposition validation of new products.
  • The "behind the door" experience in a fake door test should be designed as carefully as the entry point — it turns a one-bit signal into rich demand-characterization data.
  • Success metrics must satisfy measurability, sensitivity, and proximity criteria; using distant proxy metrics to evaluate specific experiments is a primary cause of ambiguous results.
  • Sample size calculation is not optional — underpowered experiments produce non-significant results whether or not a real effect exists, leading teams to kill ideas that would actually work.
  • Experiment result interpretation must address statistical significance, practical significance, segment breakdown, side effects, and alternative explanations — not just whether the primary metric moved.
  • Post-experiment decisions are rarely binary proceed/kill; framing them as a decision tree (proceed, proceed for segment, iterate, extend, kill, pivot) produces more calibrated and useful outcomes.
  • Maintaining a shared experiment results gallery builds team statistical literacy, prevents repeat experiments, and surfaces patterns that inform product strategy.