Vol. XII · No. 04 · Apr 2026
Jake Cuth.

The math is the aesthetic.

A live A/B test simulator. Design an experiment, watch two Beta posteriors update frame-by-frame as users flow in, and see exactly how often peeking early turns noise into a "winner." All math runs in your browser — nothing phoned home, no preloaded answers.


Most failed experiments aren't failed by the data — they're failed by the decision. Tests get peeked at every morning, winners called the first time p < 0.05 flickers on the dashboard, and null results quietly buried because "we didn't get enough traffic."

The page below runs an honest test against itself. Set a baseline rate, set the true lift (or zero, if you're curious), and let the users flow. Two Beta posteriors update in real time. A pair of decision panels — frequentist and Bayesian — tell you what each framework would do right now. And a separate panel further down proves, in a thousand synthetic tests, what peeking actually costs.

All simulations are reproducible from a seed. The companion Python script runs the same peeking Monte Carlo in Python + scipy; the reference number is shown alongside the live number for sanity.


[Designer panel — controls: Required sample / arm · Speed · Preset]
Fig. J.1 — Beta posteriors · shaded = P(B > A)
[Live counters — Variant A (control) and Variant B (test): users, conversions, observed rate · Live metrics: observed lift, p-value (freq), P(B > A) (bayes)]
Fig. J.1b — running decision metrics (P(B>A), observed lift)

Fig. A

Beta-Bernoulli conjugacy

Start with a uniform Beta(1, 1) prior on each variant's conversion rate. Each user is a Bernoulli trial: converted (+1 to α) or didn't (+1 to β). The posterior stays Beta at every step — closed-form, no numerical integration. That's why the distributions update smoothly at sixty frames per second.
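The page's engine is JavaScript, but the arithmetic is small enough to sketch in a few lines of Python (function names here are my own, not the engine's):

```python
# Minimal sketch of the Beta-Bernoulli conjugate update the text describes.
# A uniform Beta(1, 1) prior plus Bernoulli outcomes stays Beta at every step:
# alpha tracks conversions + 1, beta tracks non-conversions + 1.

def update(alpha, beta, converted):
    """One Bernoulli observation; returns the new Beta posterior parameters."""
    return (alpha + 1, beta) if converted else (alpha, beta + 1)

a, b = 1, 1                      # uniform Beta(1, 1) prior
for outcome in [True, False, False, True, False]:
    a, b = update(a, b, outcome)

print(a, b)                      # Beta(3, 4): 2 conversions, 3 misses
print(a / (a + b))               # posterior mean = 3/7 ≈ 0.4286
```

No integration, no special functions — which is why the page can afford to redraw both posteriors every frame.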

Fig. B

Sample size, solved

The required-sample readout in the designer panel uses the two-proportion z-test formula. It's what a traditional power calculator would hand you, given baseline, minimum detectable effect, α, and 1 − β. Change any slider; the number updates instantly. Run fewer trials than that and your "null result" is almost certainly under-powered noise.
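For reference, the standard two-proportion power calculation looks like this (a sketch with my own parameter names, not the page's actual code; `NormalDist` is from the Python standard library):

```python
# Classical required-sample-size formula for a two-proportion z-test:
# n per arm = (z_{1-alpha/2} + z_{1-beta})^2 * (p1(1-p1) + p2(1-p2)) / (p2-p1)^2
import math
from statistics import NormalDist

def required_n_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Users per arm to detect an absolute lift `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided alpha
    z_b = NormalDist().inv_cdf(power)           # 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * variance / mde ** 2)

# e.g. 5% baseline, +1 percentage point absolute lift, alpha = 0.05, power = 0.8
print(required_n_per_arm(0.05, 0.01))           # thousands of users per arm
```

Note the quadratic dependence on the effect size: halving the minimum detectable effect roughly quadruples the required traffic.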

Fig. C

P(B > A), honestly

For the Bayesian side, we draw 5,000 samples from each Beta posterior every frame and count the fraction where B's draw exceeds A's. That takes roughly 2 ms and yields a probability that needs no p-hacked threshold to interpret — "the test has P(B > A) = 0.91" is a statement a product manager can actually act on.
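The per-frame Monte Carlo is a few lines in any language; here is a hedged Python sketch with hypothetical names (5,000 draws keeps the estimate's sampling noise under about ±0.007):

```python
# Estimate P(B > A) by sampling both Beta posteriors and comparing draws.
import numpy as np

rng = np.random.default_rng(42)          # seeded, like the page's simulations

def prob_b_beats_a(a_alpha, a_beta, b_alpha, b_beta, draws=5000):
    a = rng.beta(a_alpha, a_beta, draws)
    b = rng.beta(b_alpha, b_beta, draws)
    return (b > a).mean()

# 50/1000 conversions on A vs 65/1000 on B, uniform prior folded in
print(prob_b_beats_a(1 + 50, 1 + 950, 1 + 65, 1 + 935))
```

Swapping the arms flips the answer to its complement, which makes for a cheap sanity check.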


Set the true variant rate equal to the baseline (no real lift). Then peek at the test every few dozen users and stop the moment p < 0.05 flickers. Do that a thousand times. Count how often you declared a winner.

Nominal false-positive rate is 5%. What you'll actually get is closer to 30–40% — the inflation comes entirely from the decision process, not the data.
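The experiment above is easy to reproduce offline. This is a sketch under assumed settings (batch size, traffic cap, and function names are mine, not the companion script's), not the page's actual Monte Carlo:

```python
# Peeking simulation: both arms share the same true rate, so any declared
# "winner" is a false positive. Test after every batch, stop at first p < 0.05.
import numpy as np
from statistics import NormalDist

def z_test_p(c1, n1, c2, n2):
    """Two-sided two-proportion z-test p-value (pooled standard error)."""
    p_pool = (c1 + c2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(c1 / n1 - c2 / n2) / se
    return 2 * (1 - NormalDist().cdf(z))

def peeking_fpr(rate=0.05, batch=50, max_n=2000, runs=1000, seed=1):
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(runs):
        ca = cb = n = 0
        while n < max_n:
            ca += rng.binomial(batch, rate)   # arm A conversions this batch
            cb += rng.binomial(batch, rate)   # same true rate on arm B
            n += batch
            if z_test_p(ca, n, cb, n) < 0.05: # the peek
                false_positives += 1
                break
    return false_positives / runs

print(peeking_fpr())    # well above the nominal 5%
```

Each extra peek is another chance for noise to cross the threshold, which is why the realized rate climbs with the number of looks.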


Same simulation, two decision frameworks. The frequentist stops when p < α at the pre-registered sample size. The Bayesian stops when P(B > A) crosses a fixed threshold (default 95%). When the effect is large and the data are clear, they usually agree. In ambiguous regions they don't — which is where the page earns its keep.

Frequentist · rule: stop at the required N, reject H₀ if p < α · readouts: p-value, 95% CI
Bayesian · rule: stop when P(B > A) > 0.95 or < 0.05 · readouts: P(B > A), expected loss
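The expected-loss readout on the Bayesian side falls out of the same posterior draws. A plausible sketch (my names, my interpretation of "expected loss" as the posterior mean of the conversion rate you give up by shipping B when A was actually better):

```python
# If we ship B, how much conversion rate do we expect to lose in the
# posterior worlds where A turns out to be better? Zero when B's draw wins.
import numpy as np

rng = np.random.default_rng(7)

def expected_loss_ship_b(a_alpha, a_beta, b_alpha, b_beta, draws=5000):
    a = rng.beta(a_alpha, a_beta, draws)
    b = rng.beta(b_alpha, b_beta, draws)
    return np.maximum(a - b, 0).mean()

# Same counts as before: 50/1000 conversions on A, 65/1000 on B
print(expected_loss_ship_b(51, 951, 66, 936))   # small but nonzero
```

Unlike a p-value, this number is denominated in conversion rate, so it can be weighed directly against the cost of keeping the test running.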


Engine

Pure JavaScript. Each frame generates a batch of Bernoulli trials (rate scaled by the Speed dial), updates the two Beta posteriors, recomputes the running z-test p-value and a Monte Carlo estimate of P(B > A). Sixty frames per second on any device since the iPhone 8.

Validation script

notebooks/ab_test_model.py ↗ runs the same peeking Monte Carlo in Python + scipy, writes the reference numbers to methodology.json. The live page fetches them and prints both side by side.

Reading list

Evan Miller · How Not to Run an A/B Test ↗
Kohavi et al. · Online Controlled Experiments at Large Scale ↗
Stucchio · Bayesian A/B Testing at VWO ↗

Limitations

One metric, two variants — no multi-arm or multi-metric tests. No sequential-testing adjustments (SPRT, α-spending); they'd be the honest answer to peeking, but the point of the Monte Carlo panel is to show the problem, not paper over it. Fixed uniform prior; real programs use empirical priors fit to historical tests.