Vol. XII · No. 04 · Apr 2026
Jake Cuth.

The math is the aesthetic.

A live A/B test simulator. Design an experiment, watch two Beta posteriors update frame-by-frame as users flow in, and see exactly how often peeking early turns noise into a "winner." All math runs in your browser — nothing phoned home, no preloaded answers.


Most failed experiments aren't failed by the data — they're failed by the decision. Tests get peeked at every morning, winners called the first time p < 0.05 flickers on the dashboard, and null results quietly buried because "we didn't get enough traffic."

The page below runs an honest test against itself. Set a baseline rate, set the true lift (or zero, if you're curious), and let the users flow. Two Beta posteriors update in real time. A pair of decision panels — frequentist and Bayesian — tell you what each framework would do right now. And a separate panel further down proves, in a thousand synthetic tests, what peeking actually costs.

All simulations are reproducible from a seed. The companion Python script runs the same peeking Monte Carlo in Python + scipy; the reference number is shown alongside the live number for sanity.


[Designer panel — controls: Required sample / arm · Speed · Preset]
Fig. J.1 — Beta posteriors · shaded = P(B > A)
[Live counters — Variant A (control) and Variant B (test): users, conversions, observed rate · Live metrics: observed lift, p-value (freq), P(B > A) (bayes)]
Fig. J.1b — running decision metrics (P(B>A), observed lift)

Fig. A

Beta-Bernoulli conjugacy

Start with a uniform Beta(1, 1) prior on each variant's conversion rate. Each user is a Bernoulli trial: converted (+1 to α) or didn't (+1 to β). The posterior stays Beta at every step — closed-form, no numerical integration. That's why the distributions update smoothly at sixty frames per second.
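The page's engine is JavaScript, but the arithmetic is small enough to sketch in a few lines of Python (function names here are my own, not the engine's):

```python
# Minimal sketch of the Beta-Bernoulli conjugate update the text describes.
# A uniform Beta(1, 1) prior plus Bernoulli outcomes stays Beta at every step:
# alpha tracks conversions + 1, beta tracks non-conversions + 1.

def update(alpha, beta, converted):
    """One Bernoulli observation; returns the new Beta posterior parameters."""
    return (alpha + 1, beta) if converted else (alpha, beta + 1)

a, b = 1, 1                      # uniform Beta(1, 1) prior
for outcome in [True, False, False, True, False]:
    a, b = update(a, b, outcome)

print(a, b)                      # Beta(3, 4): 2 conversions, 3 misses
print(a / (a + b))               # posterior mean = 3/7 ≈ 0.4286
```

No integration, no special functions — which is why the page can afford to redraw both posteriors every frame.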

Fig. B

Sample size, solved

The required-sample readout in the designer panel uses the two-proportion z-test formula. It's what a traditional power calculator would hand you, given baseline, minimum detectable effect, α, and 1 − β. Change any slider; the number updates instantly. Run fewer trials than that and your "null result" is almost certainly under-powered noise.
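For reference, the standard two-proportion power calculation looks like this (a sketch with my own parameter names, not the page's actual code; `NormalDist` is from the Python standard library):

```python
# Classical required-sample-size formula for a two-proportion z-test:
# n per arm = (z_{1-alpha/2} + z_{1-beta})^2 * (p1(1-p1) + p2(1-p2)) / (p2-p1)^2
import math
from statistics import NormalDist

def required_n_per_arm(baseline, mde, alpha=0.05, power=0.8):
    """Users per arm to detect an absolute lift `mde` over `baseline`."""
    p1, p2 = baseline, baseline + mde
    z_a = NormalDist().inv_cdf(1 - alpha / 2)   # two-sided alpha
    z_b = NormalDist().inv_cdf(power)           # 1 - beta
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_a + z_b) ** 2 * variance / mde ** 2)

# e.g. 5% baseline, +1 percentage point absolute lift, alpha = 0.05, power = 0.8
print(required_n_per_arm(0.05, 0.01))           # thousands of users per arm
```

Note the quadratic dependence on the effect size: halving the minimum detectable effect roughly quadruples the required traffic.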

Fig. C

P(B > A), honestly

For the Bayesian side, we draw 5,000 samples from each Beta posterior every frame and count the fraction where B's draw exceeds A's. That takes roughly 2 ms and yields a probability that needs no p-hacked threshold to interpret — "the test has P(B > A) = 0.91" is a statement a product manager can actually act on.
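The per-frame Monte Carlo is a few lines in any language; here is a hedged Python sketch with hypothetical names (5,000 draws keeps the estimate's sampling noise under about ±0.007):

```python
# Estimate P(B > A) by sampling both Beta posteriors and comparing draws.
import numpy as np

rng = np.random.default_rng(42)          # seeded, like the page's simulations

def prob_b_beats_a(a_alpha, a_beta, b_alpha, b_beta, draws=5000):
    a = rng.beta(a_alpha, a_beta, draws)
    b = rng.beta(b_alpha, b_beta, draws)
    return (b > a).mean()

# 50/1000 conversions on A vs 65/1000 on B, uniform prior folded in
print(prob_b_beats_a(1 + 50, 1 + 950, 1 + 65, 1 + 935))
```

Swapping the arms flips the answer to its complement, which makes for a cheap sanity check.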


Set the true variant rate equal to the baseline (no real lift). Then peek at the test every few dozen users and stop the moment p < 0.05 flickers. Do that a thousand times. Count how often you declared a winner.

Nominal false-positive rate is 5%. What you'll actually get is closer to 30–40% — the inflation comes entirely from the decision process, not the data.
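The experiment above is easy to reproduce offline. This is a sketch under assumed settings (batch size, traffic cap, and function names are mine, not the companion script's), not the page's actual Monte Carlo:

```python
# Peeking simulation: both arms share the same true rate, so any declared
# "winner" is a false positive. Test after every batch, stop at first p < 0.05.
import numpy as np
from statistics import NormalDist

def z_test_p(c1, n1, c2, n2):
    """Two-sided two-proportion z-test p-value (pooled standard error)."""
    p_pool = (c1 + c2) / (n1 + n2)
    se = (p_pool * (1 - p_pool) * (1 / n1 + 1 / n2)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(c1 / n1 - c2 / n2) / se
    return 2 * (1 - NormalDist().cdf(z))

def peeking_fpr(rate=0.05, batch=50, max_n=2000, runs=1000, seed=1):
    rng = np.random.default_rng(seed)
    false_positives = 0
    for _ in range(runs):
        ca = cb = n = 0
        while n < max_n:
            ca += rng.binomial(batch, rate)   # arm A conversions this batch
            cb += rng.binomial(batch, rate)   # same true rate on arm B
            n += batch
            if z_test_p(ca, n, cb, n) < 0.05: # the peek
                false_positives += 1
                break
    return false_positives / runs

print(peeking_fpr())    # well above the nominal 5%
```

Each extra peek is another chance for noise to cross the threshold, which is why the realized rate climbs with the number of looks.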


Same simulation, two decision frameworks. The frequentist stops when p < α at the pre-registered sample size. The Bayesian stops when P(B > A) crosses a fixed threshold (default 95%). When the effect is large and the data are clear, they usually agree. In ambiguous regions they don't — which is where the page earns its keep.

Frequentist · rule: stop at the required N, reject H₀ if p < α · readouts: p-value, 95% CI
Bayesian · rule: stop when P(B > A) > 0.95 or < 0.05 · readouts: P(B > A), expected loss
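The expected-loss readout on the Bayesian side falls out of the same posterior draws. A plausible sketch (my names, my interpretation of "expected loss" as the posterior mean of the conversion rate you give up by shipping B when A was actually better):

```python
# If we ship B, how much conversion rate do we expect to lose in the
# posterior worlds where A turns out to be better? Zero when B's draw wins.
import numpy as np

rng = np.random.default_rng(7)

def expected_loss_ship_b(a_alpha, a_beta, b_alpha, b_beta, draws=5000):
    a = rng.beta(a_alpha, a_beta, draws)
    b = rng.beta(b_alpha, b_beta, draws)
    return np.maximum(a - b, 0).mean()

# Same counts as before: 50/1000 conversions on A, 65/1000 on B
print(expected_loss_ship_b(51, 951, 66, 936))   # small but nonzero
```

Unlike a p-value, this number is denominated in conversion rate, so it can be weighed directly against the cost of keeping the test running.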


Engine

Pure JavaScript. Each frame generates a batch of Bernoulli trials (rate scaled by the Speed dial), updates the two Beta posteriors, recomputes the running z-test p-value and a Monte Carlo estimate of P(B > A). Sixty frames per second on any device since the iPhone 8.

Validation script

notebooks/ab_test_model.py ↗ runs the same peeking Monte Carlo in Python + scipy, writes the reference numbers to methodology.json. The live page fetches them and prints both side by side.

Reading list

Evan Miller · How Not to Run an A/B Test ↗
Kohavi et al. · Online Controlled Experiments at Large Scale ↗
Stucchio · Bayesian A/B Testing at VWO ↗

Limitations

One metric, two variants — no multi-arm or multi-metric tests. No sequential-testing adjustments (SPRT, α-spending); they'd be the honest answer to peeking, but the point of the Monte Carlo panel is to show the problem, not paper over it. Fixed uniform prior; real programs use empirical priors fit to historical tests.