A/B Test Guide · Chapter 4 of 4

How sure do you need to be?

We now understand baseline and minimum detectable effect (lift). In this final chapter, we discuss how to decide when a result is definitive enough to act on. After that, you'll be ready to use the sample size calculator with full confidence.


Middle is normal, tails are rare

Let's think about why the bell shape is so common. It arises whenever you average a pile of yes/no outcomes, like the signup case study we started with, where each visitor either signed up or didn't. Each individual outcome is noisy, so when you run the experiment again and again you get a spread of results. Most of the time, the average lands somewhere in the middle. Rarely, just by chance, you get a more extreme result (the tails). The width of the bell, its standard deviation, tells you how wide that spread is. Hover or click on the regions in the chart. How many experiments would be expected to land in each area?

[Interactive: a bell curve divided into regions at -2 SD, -1 SD, the mean, +1 SD, and +2 SD. Hover or click a region to see how much of the distribution falls there.]
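
If you prefer code to charts, here is a minimal simulation sketch in Python. The 10% baseline, 1,000 visitors per run, and 10,000 repeated runs are made-up numbers for illustration; any yes/no outcome averaged this way behaves the same.

```python
import random

# Repeat the same experiment many times and see where the measured
# conversion rate lands relative to the overall mean.
BASELINE = 0.10   # assumed true conversion rate (10%)
VISITORS = 1_000  # assumed visitors per experiment run
RUNS = 10_000     # how many times we repeat the experiment

rates = []
for _ in range(RUNS):
    signups = sum(random.random() < BASELINE for _ in range(VISITORS))
    rates.append(signups / VISITORS)

mean = sum(rates) / RUNS
sd = (sum((r - mean) ** 2 for r in rates) / RUNS) ** 0.5

within_1sd = sum(abs(r - mean) <= sd for r in rates) / RUNS
within_2sd = sum(abs(r - mean) <= 2 * sd for r in rates) / RUNS
print(f"mean {mean:.4f}, standard deviation {sd:.4f}")
print(f"within 1 SD: {within_1sd:.0%}, within 2 SD: {within_2sd:.0%}")
# Expect roughly 68% within 1 SD and 95% within 2 SD:
# the middle is crowded, the tails are nearly empty.
```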

The key takeaway: samples deep in the tails are rare. If A and B were identical and you ran the experiment a hundred times, you would get a spread that is centered around A's mean. Only a handful would land out in the right tail. That rarity is what lets you call a winner.

Drawing the line: how far out is too rare to happen by chance?

Calling a winner means picking a point past A's usual range and saying: if B's result lands beyond it, I'll believe B really is better; if it falls short, I'll stay with A.

Move the slider in the widget. At what point is the conversion rate you are seeing from B so rare on the green curve (A, your control) that you would say: that can't be a coincidence?

[Interactive: a slider sets the confidence threshold (default 95%) between A's and B's curves, showing the false positives and false negatives on either side of the line.]

Pull the threshold left: more of B's real improvements get called as wins, but A's noise slips through more often. Push it right: A's noise gets filtered out, but real improvements in B start getting missed.

Without getting into the full statistics, the intuition is: the threshold is a cutoff on A's usual noise. If B's result lands past it, that kind of result would be rare if A and B were really identical (our null hypothesis). That's why you'd call it a win at this confidence level.

The threshold has a cost either way you move it.

If you are less confident (you pull the threshold toward A), you catch more real wins, because your sensitivity is higher. You are basically calling a winner earlier, even when the result is not that rare. The catch: more noise slips through. You are more likely to call a winner when there isn't one, simply because you are seeing a rare result that A produced by chance through sampling error. We call that a false positive.

On the other hand, if you want more confidence and move the threshold toward B, you filter out more noise: you are less likely to call a winner when there isn't one. The catch is that you also miss more real wins, because less of B's distribution now clears the line, so its genuinely better results are more likely to fall short (a false negative). Increasing your confidence also requires a bigger sample size, as you'll see in the calculator. You are basically calling a winner only when the gap is larger, so you miss some of the smaller but still real lifts.
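
To make this trade-off concrete, here is a small simulation sketch. The numbers are assumptions chosen for illustration: A truly converts at 10%, B truly converts at 12%, each experiment sees 1,000 visitors per variant, and the three thresholds are arbitrary cutoffs on B's measured rate.

```python
import random

# How moving the threshold trades false positives for false negatives.
# Assumed numbers: A truly converts at 10%, B truly converts at 12%,
# 1,000 visitors per variant, 5,000 simulated experiments per case.
RATE_A, RATE_B = 0.10, 0.12
VISITORS, RUNS = 1_000, 5_000

def measured_rate(true_rate):
    """One simulated experiment: the conversion rate we actually observe."""
    return sum(random.random() < true_rate for _ in range(VISITORS)) / VISITORS

for threshold in (0.105, 0.115, 0.125):  # cutoffs on B's measured rate
    # False positive: B is really identical to A, yet its measured rate clears the line.
    false_pos = sum(measured_rate(RATE_A) > threshold for _ in range(RUNS)) / RUNS
    # False negative: B really is better, yet its measured rate falls short of the line.
    false_neg = sum(measured_rate(RATE_B) <= threshold for _ in range(RUNS)) / RUNS
    print(f"threshold {threshold:.1%}: "
          f"false positives {false_pos:.1%}, false negatives {false_neg:.1%}")
```

With the lowest threshold almost no real wins are missed but A's noise clears the line often; with the highest one the false positives nearly vanish but a large share of B's genuine improvements are missed.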

If A and B were the same, I accept being wrong 5% of the time

A 95% confidence level is the standard default. You're saying: I'll accept being fooled about 1 time in 20. The threshold is drawn so that only 5% of A's distribution sits above it. If A and B were really identical and you ran this experiment over and over, about 5% of the time A's result would land above the threshold on its own and look like a winner that wasn't there.

The threshold sits in A's right tail only. We're hunting a lift: B is either enough better than A to clear the line, or it isn't. A suspiciously low result from B doesn't count as a win, so the left tail stays out of the decision.

A winner is a gap so large that you would see it only very rarely if A and B were really identical.
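
The same simulation idea shows where a 95% threshold actually sits. The 10% baseline and 1,000 visitors per run are again assumed numbers, not anything from your product.

```python
import random

# Where does a 95% threshold sit for A alone?  Simulate A repeatedly and
# take the 95th percentile of its measured rates: only 5% of A's results
# land above that line purely by chance.
BASELINE, VISITORS, RUNS = 0.10, 1_000, 10_000  # assumed numbers

rates = sorted(
    sum(random.random() < BASELINE for _ in range(VISITORS)) / VISITORS
    for _ in range(RUNS)
)
threshold = rates[int(0.95 * RUNS)]  # right tail only: we're hunting a lift
print(f"95% threshold for A: {threshold:.2%}")
# A measured rate from B above this line would happen less than
# about 1 time in 20 if A and B were really identical.
```

With these assumed numbers the line lands at roughly 11.5%: a measured rate from B above that would be called a win at 95% confidence.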

What confidence actually costs

Stricter confidence pushes the threshold further from A's mean. Fewer false wins sneak through, which is the point. The catch: less of B's distribution now clears the threshold, so a real lift has to be that much larger before it reliably does.

At a fixed baseline and a fixed lift, more separation comes from one place: more visitors. More data tightens both bells — less spread — so a smaller real gap can still poke clear of a strict threshold.

That's the last lever. Baseline is given by your product. Lift is the smallest improvement worth chasing. Confidence is how strict you want to be about calling it. Put all three in and sample size is what falls out.
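
For reference, here is one standard textbook formula (the two-proportion normal approximation) that turns the three levers into a per-variant sample size. It is a sketch, not necessarily the exact formula the calculator uses, and the 80% power default is an assumption standing in for the false-negative side of the trade-off discussed above.

```python
from statistics import NormalDist

def sample_size_per_arm(baseline, lift, confidence=0.95, power=0.80):
    """Visitors needed per variant (one-sided two-proportion test, normal approximation).

    baseline:   current conversion rate of A, e.g. 0.10 for 10%
    lift:       smallest relative improvement worth detecting, e.g. 0.20 for +20%
    confidence: how strict we are about false positives
    power:      how often we want to catch a real lift (guards against false negatives)
    """
    p1 = baseline
    p2 = baseline * (1 + lift)
    z_alpha = NormalDist().inv_cdf(confidence)  # threshold on A's noise
    z_beta = NormalDist().inv_cdf(power)        # margin so a real lift still clears it
    p_bar = (p1 + p2) / 2
    numerator = (z_alpha * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_beta * (p1 * (1 - p1) + p2 * (1 - p2)) ** 0.5) ** 2
    return numerator / (p2 - p1) ** 2

# Example: 10% baseline, +20% relative lift, 95% confidence, 80% power.
print(round(sample_size_per_arm(0.10, 0.20)))  # about 3,000 visitors per variant
```

Tighten the confidence or shrink the lift and the required sample size climbs, exactly the trade-offs described in this chapter.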

Guide complete

You now understand the three levers

  • Baseline — your starting conversion rate, given by your product.
  • Minimum detectable effect — the smallest lift worth catching.
  • Confidence — how strict you want to be about calling a winner.

Put all three in and sample size is what falls out. The calculator does the math — you now know exactly what you're telling it and why.

Open the calculator →