Sample Size Planner

Multi-arm A/B tests

Plan per-arm and total sample size for experiments with one control and multiple variants (A/B/C/…) on either conversion rates or means. Choose whether you care most about detecting a minimum lift vs. control for each arm or about any meaningful difference across all arms.

TEST OVERVIEW & EQUATIONS

This planner extends a two-arm A/B test to k-arm experiments with one control group and multiple variants (for example, one control subject line and three new subject lines). You can test either proportions (conversion or response rates) or means (such as average order value or revenue).

For a given outcome type, the underlying math uses pairwise comparisons between each variant and the control. When your goal is to detect a minimum lift vs. control, the planner computes the required per-arm sample size for each variant–control comparison and then takes the maximum, so that every lift you care about is adequately powered. When your goal is an omnibus difference, the planner uses the same effect pattern but asks a different question: is there enough power to flag that at least one arm differs meaningfully from control?
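As a rough illustration of the “compute each comparison, then take the maximum” step, the sketch below uses the standard normal-approximation formula for two proportions with an unpooled variance. The planner's exact formula is not stated here, so the formula choice, the function names `n_per_arm_proportions` and `n_per_arm_max`, and the example rates are all assumptions. No multiplicity adjustment is applied in this sketch.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_proportions(p_control, p_variant, alpha=0.05, power=0.80):
    """Per-arm n for one variant-vs-control comparison (two-sided, equal arms).

    Assumed formula: n = (z_{1-alpha/2} + z_{power})^2 * (p1*q1 + p2*q2) / (p2 - p1)^2
    """
    if p_control == p_variant:
        raise ValueError("control and variant rates must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_variant - p_control) ** 2)


def n_per_arm_max(p_control, variant_rates, alpha=0.05, power=0.80):
    """Largest per-arm n across all variant-vs-control comparisons."""
    return max(
        n_per_arm_proportions(p_control, p, alpha, power) for p in variant_rates
    )


# Hypothetical example: 10% control rate, three variants.
n_required = n_per_arm_max(0.10, [0.12, 0.13, 0.15])
```

In this example the smallest lift (10% vs. 12%) dominates, so it alone sets the per-arm n; the larger lifts are then powered at more than the requested level.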

Notes on alpha, power, and multiple arms

Adding more arms spreads traffic thinner and increases the chance of spurious findings if you do not adjust your decision rules. In the “lift vs. control” goal, this planner uses a simple Bonferroni-style adjustment to keep the overall false-positive rate across all variant–control comparisons close to your chosen alpha. In the “omnibus” goal, it uses the nominal alpha level and asks whether the design has enough power to detect that at least one arm differs from control by the specified amount.
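A minimal sketch of a Bonferroni-style adjustment, assuming the overall alpha is divided evenly across the variant–control comparisons (the planner's exact rule is not spelled out in this description). The helper names `bonferroni_alpha` and `two_sided_critical_z` are hypothetical.

```python
from statistics import NormalDist


def bonferroni_alpha(alpha, n_variants):
    """Split the overall alpha evenly across variant-vs-control comparisons."""
    if n_variants < 1:
        raise ValueError("need at least one variant")
    return alpha / n_variants


def two_sided_critical_z(alpha):
    """Critical z value for a two-sided test at the given alpha."""
    return NormalDist().inv_cdf(1 - alpha / 2)


overall_alpha = 0.05
adjusted_alpha = bonferroni_alpha(overall_alpha, n_variants=3)  # three comparisons

z_nominal = two_sided_critical_z(overall_alpha)    # unadjusted threshold
z_adjusted = two_sided_critical_z(adjusted_alpha)  # stricter threshold per comparison
```

The stricter per-comparison threshold is what pushes the required per-arm n upward as variants are added in the “lift vs. control” goal.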

INPUTS & SETTINGS

Design the multi-arm test

Arms & target conversion rates

Specify the expected conversion or response rate for the control and each variant. The planner will compute the minimum per-arm sample size to detect the differences you have entered at your chosen confidence and power.

Arm Label Target rate
Control (A)
Variant B
Variant C
Variant D

Arms, target means, and variability

Specify the expected mean outcome for the control and each variant, along with a common standard deviation. The planner assumes roughly equal variability across arms.

Arm Label Target mean
Control (A)
Variant B
Variant C
Variant D

Standard deviation: a rough measure of spread around each arm’s mean, in the outcome’s own units (for example, dollars or points).
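For mean outcomes with a common standard deviation, a per-arm sample size can be sketched with the usual normal-approximation formula for two independent means. This is an assumed formula, not necessarily the one the planner uses, and the function name `n_per_arm_means` and the example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_means(mean_control, mean_variant, sd, alpha=0.05, power=0.80):
    """Per-arm n for one variant-vs-control comparison of means (two-sided).

    Assumed formula: n = 2 * sd^2 * (z_{1-alpha/2} + z_{power})^2 / delta^2,
    with the same standard deviation in both arms.
    """
    delta = mean_variant - mean_control
    if delta == 0:
        raise ValueError("control and variant means must differ")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * (sd * (z_alpha + z_beta) / delta) ** 2)


# Hypothetical example: average order value in dollars.
control_mean = 50.0
variant_means = [52.0, 53.0, 55.0]
common_sd = 20.0

# Take the maximum so the smallest difference is still adequately powered.
n_required = max(n_per_arm_means(control_mean, m, common_sd) for m in variant_means)
```

Because n scales with the square of the standard deviation, doubling the assumed spread roughly quadruples the required per-arm sample.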

Lift vs. control focuses on detecting a meaningful improvement for each variant separately. Omnibus focuses on detecting that at least one arm differs meaningfully from control.

Confidence = 1 - alpha for a two-sided test.

Common choices are 80% or 90% power.

Advanced settings

Two-sided tests are standard when any increase or decrease matters. One-sided tests can be used when you only care about improvements relative to control, but they cannot flag a decrease, no matter how strong the evidence.
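The practical difference between the two settings shows up in the critical value: a two-sided test splits alpha across both tails, while a one-sided test puts all of it in one tail. A small sketch, with `critical_z` as a hypothetical helper name:

```python
from statistics import NormalDist


def critical_z(alpha, two_sided=True):
    """Critical z value of the standard normal for the chosen test direction."""
    tail_area = alpha / 2 if two_sided else alpha
    return NormalDist().inv_cdf(1 - tail_area)


z_two_sided = critical_z(0.05, two_sided=True)   # about 1.96
z_one_sided = critical_z(0.05, two_sided=False)  # about 1.645
```

The lower one-sided threshold is why a one-sided design needs fewer observations per arm at the same alpha and power.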

PLANNING SCENARIOS

Use presets to explore common marketing multi-arm tests, such as three competing subject lines or four hero images on a landing page. Each scenario sets control and variant targets, a goal, and default confidence/power.

VISUAL OUTPUT

Required per-arm sample vs. effect size

This chart shows how the required per-arm sample size changes as the smallest lift vs. control you care about grows. For a fixed confidence and power, larger lifts need fewer observations per arm.
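The shape of this curve can be approximated by sweeping the minimum lift while holding alpha and power fixed. The sketch below assumes a conversion-rate outcome, an unpooled normal-approximation formula, and a hypothetical 10% baseline; it is not the planner's own plotting code.

```python
from math import ceil
from statistics import NormalDist


def n_for_lift(baseline, lift, alpha=0.05, power=0.80):
    """Per-arm n to detect an absolute lift over the baseline rate (two-sided)."""
    p_variant = baseline + lift
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = baseline * (1 - baseline) + p_variant * (1 - p_variant)
    return ceil(z_total ** 2 * variance / lift ** 2)


baseline_rate = 0.10
lifts = [0.01, 0.02, 0.03, 0.05]  # absolute lifts in conversion rate

# Pairs of (lift, required per-arm n), i.e. the points the chart would plot.
curve = [(lift, n_for_lift(baseline_rate, lift)) for lift in lifts]
```

Since the lift enters the formula squared in the denominator, halving the minimum lift roughly quadruples the required n.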

Required per-arm sample vs. power

This chart plots the required per-arm sample size against desired power, holding the effect pattern and alpha fixed. Higher power always requires a larger per-arm n.
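A comparable sweep over power, holding the effect and alpha fixed, can be sketched with a standardized effect size d (difference in means divided by the common standard deviation). The formula, the function name `n_for_power`, and the value d = 0.2 are assumptions for illustration.

```python
from math import ceil
from statistics import NormalDist


def n_for_power(effect_size_d, power, alpha=0.05):
    """Per-arm n for a two-sided, two-arm comparison at standardized effect d.

    Assumed formula: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2
    """
    z_total = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * z_total ** 2 / effect_size_d ** 2)


powers = [0.70, 0.80, 0.90, 0.95]

# Pairs of (power, required per-arm n) for a fixed small effect.
power_curve = [(p, n_for_power(0.2, p)) for p in powers]
```

The curve steepens toward high power: moving from 80% to 90% costs noticeably more observations than moving from 70% to 80%.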

DESIGN SUMMARY

Required per-arm sample size (n):
Total required sample size (N total):
Outcome type:
Goal:
Design alpha / confidence:
Design power:
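The relationship between the first two summary figures can be sketched as below, assuming equal allocation across arms (an assumption; the description does not say whether unequal splits are supported). The function name `total_sample_size` is hypothetical.

```python
def total_sample_size(n_per_arm, n_variants):
    """Total N for one control plus n_variants variants with equal allocation."""
    if n_per_arm < 1 or n_variants < 1:
        raise ValueError("need a positive per-arm n and at least one variant")
    n_arms = n_variants + 1  # the control counts as an arm
    return n_per_arm * n_arms


# Hypothetical example: 1,000 observations per arm, control plus three variants.
n_total = total_sample_size(1000, n_variants=3)
```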

Per-arm summary vs. control

Enter control and variant targets above to see a summary of lifts and required n per arm.

Statistical Planning Statement

Provide control and variant targets, choose a goal (lift vs. control or omnibus), and set alpha and power to generate a planning statement for this multi-arm design.

Managerial Interpretation

This panel translates the design into plain language: how many observations per arm you need, how that depends on the minimum lift you care about, and how often the experiment would succeed in flagging a real winner vs. looking inconclusive just because of noise.