A/B Test Guide · Chapter 2 of 4

Your baseline matters

In chapter 1, we learned that we can't trust small samples. For our A/B tests to be reliable, we need larger samples. How big a sample you need depends on three things.

Let's look at the first of them: your baseline conversion rate. In other words, how many visitors convert right now, before you make any changes?


What's a baseline?

Your baseline conversion rate is how many visitors sign up right now, before any change.

If 100 people hit your signup page and 4 sign up, your baseline is 4%. That's it.

It sounds like background information. It isn't. Your baseline tells you how much signal each visitor brings, and that shapes how much data you need more than almost anything else.

In the case study, version A got 20 signups from 100 visitors, so the baseline is 20% — your best estimate of the page's true rate. That's on the high side. Many real signup flows might sit somewhere between 1% and 5%, and the difference matters a lot.

Low baseline hides the lift

Let's imagine that in our A/B test, version B lifts signups by 10%. Whether we see that lift clearly depends on the baseline. At a 5% baseline, a 10% relative lift is a 0.5 percentage point increase, from 5% to 5.5%. If your baseline is 50%, however, the same 10% lift is a 5 percentage point increase, from 50% to 55%.

The higher the baseline, the bigger the absolute difference for the same relative lift, and the easier it is to detect that difference with a given sample size.

Now it's your turn again. Move the slider to change the baseline (the conversion rate of version A). Notice the blue dotted line: it marks a 10% improvement over the baseline.

When is it clearly different from your average? And when does it hide in your data?

Interactive widget: relative likelihood of each conversion rate per 100-visitor sample, with the baseline slider at 5% (average: 5%; +10% lift: 5.5%). Theoretical shape (not a random draw).

At 5% baseline, the lift line moves by ~0.5 points, but single samples still bounce around (1%–9%). You can tell broad differences apart, not small ones.

At a low baseline, the lift line hides inside a small signal. If the blue line were the reading from your B group, it would be hard to tell whether it's a real improvement or just random variation in your data.

At a high baseline, the lift line is further away from the average, and it would be easier to spot as a real improvement.
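One way to put numbers on this is to compare the size of the lift with the sample-to-sample spread of a 100-visitor sample. Assuming a simple binomial model (my assumption; the guide doesn't give this formula), the spread of the observed rate is sqrt(p(1-p)/n):

```python
import math

def sample_std(p, n=100):
    """Std. deviation of the observed conversion rate across
    repeated n-visitor samples, under a binomial model."""
    return math.sqrt(p * (1 - p) / n)

for p in (0.05, 0.20, 0.50):
    lift = p * 0.10        # absolute gap from a 10% relative lift
    noise = sample_std(p)  # typical sample-to-sample wobble
    print(f"baseline {p:.0%}: lift {lift * 100:.1f} pts, "
          f"noise ±{noise * 100:.1f} pts (ratio {lift / noise:.2f})")
```

At a 5% baseline the lift is roughly a quarter of the noise, which is why the blue line hides; at 50% the lift matches the noise and is far easier to see.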

The signal-to-noise problem

When your baseline is low, most visitors leave without converting. They give you no signal. You're waiting for a rare event, and rare events need a lot of observations before the count stabilises enough to detect a real difference.

At 2% baseline, 100 visitors gives you roughly 2 conversions to work with. At 20%, the same 100 visitors gives you 20. Ten times the signal, same traffic.

The lower your baseline, the more data you need to hear the signal above the noise.

That gap compounds fast. A landing page at 2% baseline typically needs 10 to 20 times more visitors than a checkout flow at 30% to run the same test at the same confidence. Same relative improvement, very different data requirement, because the signal is that much rarer.
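That "10 to 20 times" figure can be sanity-checked with the textbook sample-size approximation for comparing two proportions. This is a sketch, not the guide's own method; the z-values are hard-coded for the common choice of a two-sided 5% significance level and 80% power:

```python
import math

def visitors_per_group(baseline, relative_lift=0.10):
    """Rough per-group sample size for a two-proportion z-test
    (normal approximation, alpha=0.05 two-sided, power=0.80)."""
    z_alpha, z_beta = 1.96, 0.84  # hard-coded for the stated alpha/power
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    top = (z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
           + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return math.ceil(top / (p2 - p1) ** 2)

landing = visitors_per_group(0.02)   # low-baseline landing page
checkout = visitors_per_group(0.30)  # high-baseline checkout flow
print(landing, checkout, round(landing / checkout))
```

Under these assumptions the 2% page needs on the order of 20 times as many visitors per group as the 30% flow, consistent with the range quoted above.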

That's why there's no universal answer to “how many visitors do I need?” You start with your own baseline.

We've been using 100 visitors throughout because the percentage math is easy to do in your head. Real A/B tests rarely run at that scale. By the time you finish this guide, you'll see that depending on your baseline, a well-designed experiment might need thousands of visitors per group, not hundreds. The examples are simple on purpose. The actual numbers rarely are.

A more standard visualization

Smooth curves

Before we wrap up this chapter, let's visualize the two groups in a simplified way. This will help us in the remaining chapters. In chapter 1, you stacked counts of marbles; stack them often enough and the pile resembles a bell shape. Since you counted a finite number of marbles, we showed it as a discrete distribution: there were gaps between the dots, and each dot represented an actual count of marbles.

Going forward, let's smooth the curve out by drawing the shape of the distribution instead of individual marbles. The shape is the same; it just no longer has the individual dots.

Illustration (static, not interactive).

Two bell shapes

The other change is to draw a bell shape for the B group as well. In the last interactive widget, you already saw the bell shape for the A group; the B group appeared only as a blue dotted line. In reality, both groups in an A/B test have their own bell shape. If we take repeated samples from both groups, we can draw a bell for each of them: most samples will sit somewhere around the true rate, with some higher and some lower. Going forward, we will draw the bell for the B group too.

The widget below shows the same data as before, but with a bell for both groups. The more they overlap, the harder it is to tell the groups apart: is B really better than A? The more they pull apart, the easier it is to spot a difference. The distance between the two means is the lift (often called the "minimum detectable effect").
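To make the overlap concrete, here's a small simulation (function name and trial count are mine, not the guide's). It draws repeated 100-visitor samples for both groups and counts how often the B sample looks no better than the A sample, even though B's true rate is 10% higher:

```python
import random

random.seed(42)

def overlap_rate(baseline, relative_lift=0.10, n=100, trials=5_000):
    """Fraction of trials where B's sample rate is <= A's,
    despite B's true rate being higher (binomial simulation)."""
    b_true = baseline * (1 + relative_lift)
    no_better = 0
    for _ in range(trials):
        a = sum(random.random() < baseline for _ in range(n))
        b = sum(random.random() < b_true for _ in range(n))
        if b <= a:
            no_better += 1
    return no_better / trials

for p in (0.02, 0.20):
    print(f"baseline {p:.0%}: B looks no better than A in "
          f"{overlap_rate(p):.0%} of 100-visitor tests")
```

At a 2% baseline a single 100-visitor comparison is close to a coin flip; at 20% the bells pull apart and B wins noticeably more often.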

Interactive widget: likelihood of each conversion rate per 100-visitor sample, with the baseline slider at 20% (A: 20%; B: 22.0%). Theoretical shapes. Gap between means: 2.0 pts.

The bells are starting to pull apart — about 2.0 points between the means. There's still a band in the middle where a sample from A and a sample from B look the same.

Use the slider to move the baseline. At 2%, the absolute gap between the two means shrinks to a tenth of what it was at 20%, and the bells almost sit on top of each other. Push to 20% and they pull apart. Same relative lift, very different picture — the same thread from earlier in this chapter, now with both distributions in view.

What we learned

  • Your baseline is your current conversion rate — and it's the biggest driver of how much data you need.
  • Baseline controls the shape. The lower the baseline, the more the spread swamps small differences.
  • Run an A/B test and both groups have their own bell. Whether the test calls a winner depends on how far apart those bells sit.

Next: the bells are set. Now decide what gap is worth chasing — and how small a lift you actually need to detect.

Next: Effect size →