A/B Test Guide · Chapter 1 of 4
Why small samples lie
A/B testing needs a big enough sample. But how big? Most A/B test calculators hit you with a wall of jargon. In this guide, you'll learn how to think about sample size intuitively, so you can make informed decisions about your tests.
This guide has interactive widgets you can play with to build your intuition. By the end, you'll master the three levers that matter — baseline, expected lift, and confidence — so you can find exactly how many visitors you need to run a test you can trust.
Where you see this icon: interactive, there's a widget you can play with.
Let's start with the problem: small samples lie to you.
Case study
Let's imagine you have a website with a sign-up button and you want more sign-ups. So you're testing a copy change on the button by running an A/B test. Version A is your original copy (“Start your free trial”), which you showed to 100 visitors. Version B is the new copy (“Get started for free”), which you showed to another 100 visitors.
A couple days later, here is what you see in your dashboard:
Case study
Version A
“Start your free trial”
Version B
“Get started for free”
Sign-ups
- Group A
- 20
- Group B
- 25
Static illustration (not interactive).
B seems to be better — but is it really?
Intuitively, group B looks better. After all, it has more sign-ups. So should you make the change and ship it?
This is a common instinct. But it's a trap.
With samples this small (100 visitors per group), the difference you see might be due to random chance.
At 100 visitors per group, even two identical versions would routinely show a gap like this by pure chance.
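You don't have to take that claim on faith; you can simulate it. Here's a minimal Python sketch (the names, numbers of trials, and seed are my own, not from the guide) that runs many pretend A/B tests where both versions are truly identical, and counts how often a gap at least as large as 20 vs. 25 appears anyway:

```python
import random

random.seed(0)  # fixed seed so the run is reproducible

def simulate_group(n_visitors, true_rate):
    """Count sign-ups when each visitor converts with probability true_rate."""
    return sum(random.random() < true_rate for _ in range(n_visitors))

# Both versions are IDENTICAL here: a 20% true sign-up rate in each group.
TRUE_RATE = 0.20
VISITORS = 100
TRIALS = 10_000

big_gaps = sum(
    abs(simulate_group(VISITORS, TRUE_RATE) - simulate_group(VISITORS, TRUE_RATE)) >= 5
    for _ in range(TRIALS)
)
print(f"A gap of 5+ sign-ups appeared in {big_gaps / TRIALS:.0%} of {TRIALS} tests")
```

Run it and you'll see a gap of five or more sign-ups shows up in a large share of the simulated tests, even though the two versions convert at exactly the same rate.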
Let me explain.
Imagine every potential visitor is a marble in a jar. Green means they signed up; gray means they didn't. In the jar below, 1 in every 5 marbles is green. So the jar's true rate of green marbles is 20%. We can't see inside the jar, but we can draw samples from it. We draw a handful of marbles and count how many are green. That gives us an estimate of the true rate of green marbles in the jar. If we draw enough samples, we can get a good picture of what the jar looks like on the inside. Now, in reality we wouldn't know what the exact true rate is, but let's pretend we do.
Static illustration (not interactive).
Ready for the first interactive demonstration?
We will draw samples from the jar. Every draw takes 10 marbles out of the jar. We know the true rate in the jar is 20%. So on average, 2 in every 10 marbles will be green. But what does a single sample look like? And what happens to the average if you draw again and again?
Click “Draw a sample” to see for yourself.
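If you're reading this without the widget, you can recreate the draws in a few lines of Python (the function name and seed are my own, purely for illustration):

```python
import random

random.seed(1)  # fixed seed for reproducibility
TRUE_RATE = 0.20  # 1 in every 5 marbles in the jar is green

def draw_sample(n=10):
    """Draw n marbles from the jar; True means green."""
    return [random.random() < TRUE_RATE for _ in range(n)]

for i in range(5):
    greens = sum(draw_sample())
    print(f"Draw {i + 1}: {greens} green out of 10")
```

Each run of the loop is one handful of marbles, and you'll notice the green count jumps around from draw to draw.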
What you're really looking at: sampling error
What you just saw is sampling error: the natural spread in outcomes you get from a small random sample, even when nothing about the jar changed.
The jar's truth didn't move. Your samples did.
Some samples will have more green marbles than others.
On average, you'd expect 2 green marbles in each sample of 10. But sometimes you get 0, sometimes 1, and sometimes 3 or more. The more samples you draw, the closer your average will be to the true rate.
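To see the average settling down, here's a small sketch (my own setup, not from the guide) that keeps drawing handfuls of 10 marbles and reports the running average rate:

```python
import random

random.seed(2)  # fixed seed for reproducibility
TRUE_RATE = 0.20
SAMPLE_SIZE = 10

total_green = 0
total_marbles = 0
for n_draws in (10, 100, 1_000, 10_000):
    while total_marbles < n_draws * SAMPLE_SIZE:
        total_green += sum(random.random() < TRUE_RATE for _ in range(SAMPLE_SIZE))
        total_marbles += SAMPLE_SIZE
    print(f"after {n_draws:>6} draws: average green rate = {total_green / total_marbles:.3f}")
```

After a handful of draws the average can sit well away from 20%, but after thousands of draws it hugs the true rate closely.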
In our A/B test, the same thing happened. We had 100 visitors per group, not 10 (so 10 times more). Still, getting 20 sign-ups on one “draw” and 25 on another is quite likely. The difference could easily be produced by sampling error alone, even if visitors respond identically to the two button versions.
Small samples are wobbly; large samples are more stable. If you take samples of 10 people and measure their height, a sample drawn from a basketball team might show that 9 in 10 people are extremely tall, while a sample of 10 drawn from a random street might show that 0 in 10 are. The variability is huge. The real answer, the true rate in the population, lies somewhere in between. The bigger your sample, the closer your result is likely to be to the true rate.
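That shrinking wobbliness can be measured. The sketch below (my own, for illustration) repeatedly estimates the jar's green rate from samples of different sizes and reports how spread out those estimates are:

```python
import random
import statistics

random.seed(3)  # fixed seed for reproducibility
TRUE_RATE = 0.20

def sample_rate(n):
    """Observed green rate in one sample of n marbles."""
    return sum(random.random() < TRUE_RATE for _ in range(n)) / n

spreads = {}
for n in (10, 100, 1_000):
    rates = [sample_rate(n) for _ in range(2_000)]
    spreads[n] = statistics.stdev(rates)
    print(f"sample size {n:>4}: spread of estimates (std dev) = {spreads[n]:.3f}")
```

The spread falls steeply as the sample grows: estimates from samples of 1,000 cluster far more tightly around 20% than estimates from samples of 10.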
Stack the draws: a shape appears
As we saw, single draws bounce around. That's sampling error. But draw again. And again. Keep going and something happens: the chaos settles into a pattern.
Below is the same marble jar with a true rate of 20% green marbles. This time, instead of seeing each marble in a row, the count of green marbles gets stacked onto the chart. Keep adding draws and a shape fills in.
What do you think the shape will be? (Draw 10 or 100 samples to see it appear faster)
true rate: 20% → 2 per 10
That shape has a name: a sampling distribution. The middle is where most outcomes land; the edges are rare. At 10 marbles per draw the histogram is jagged, but the tendency is already there: results cluster around the jar's true rate and thin out toward the extremes.
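You can build the same histogram in code. This sketch (my own layout, not the widget's) stacks 10,000 draws of 10 marbles each and prints a text histogram of the green counts:

```python
import random
from collections import Counter

random.seed(4)  # fixed seed for reproducibility
TRUE_RATE = 0.20
DRAWS = 10_000

counts = Counter()
for _ in range(DRAWS):
    greens = sum(random.random() < TRUE_RATE for _ in range(10))
    counts[greens] += 1

# Text histogram: one '#' per 100 draws that landed on each green count
for greens in range(11):
    print(f"{greens:>2} green: {'#' * (counts[greens] // 100)}")
```

The tallest bars sit near 2 green marbles (the jar's true rate of 20% of 10), and the bars shrink quickly toward 0 and toward 6 or more.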
We can use this knowledge to make better decisions about our A/B tests: some outcomes are rare enough under pure chance that, when we see them, we can be confident they aren't just a fluke.
The rest of this guide will show you how to apply these principles to real-world A/B testing scenarios. Ready to learn more? →
What we learned
- Small samples produce noisy results. If you see 20 vs. 25 sign-ups out of 100 per group, you can't tell whether B is really better or you just got lucky with your sample.
- Draw more samples (or larger samples) and you'll get closer to the true rate in the jar. In A/B testing, that means you can be more confident that your results reflect reality, not just random chance.
Now we know that small samples lie. So we want bigger samples. But how big? It depends on three things. Ready to find out what they are? →
Next: Your baseline matters →