5 Things You Should Know When Doing A/B Testing

A/B testing is fundamental to learning how something could be better, whether that is your donation page, online ads, direct mail, or anything else. I’m going to talk about 5 things you need to understand to do it properly.

Question and hypothesis

The question and hypothesis are essential to the process, and they are often skipped.

The question helps you develop the test. If you want to know whether changing an ask amount changes how much you will receive, then you can develop a test specifically to answer that question.

The hypothesis gives you the framework to understand the results. “Let’s see which one does better” and “I believe that increasing the ask will cause donors to give more” are technically different tests: the first looks for a difference in either direction, while the second only looks for an improvement.

It is important to understand, too, that you will not be proving one is better. What you are actually doing is trying to rule out the possibility that the effect you see is just random happenstance. That possibility, that there is no real difference, is called the null hypothesis, and it is what you are actively trying to disprove (or reject).

Sample and Population

When you run a test, you are trying to see how a small group of people reacts, and using that as a basis for what you expect anyone who sees the piece will do.

What if you wanted to see how Comic Sans performs on your piece, and you sent it to one person, and that person responded positively? Can you say that it will work for everyone? Maybe you found one of the people who really like Comic Sans.

For every person you add to the sample, your results have a better chance of reflecting how the general population would respond, but the value of that extra person diminishes. Adding 1 person to a 2-person sample drastically changes things. Adding 1 person to a 1,000-person sample does not change much at all. This is why you will see surveys of large populations only require a few hundred responses: beyond that point, you need to add many times more people to noticeably improve precision.

But how many will YOU need? Well, that’s going to depend on what you are looking for.
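To put rough numbers on those diminishing returns, here is a minimal sketch using the standard margin-of-error approximation for a proportion. The function name, the 50% response assumption, and the 1.96 multiplier (roughly 95% confidence) are my own illustrative choices, not something a particular tool requires.

```python
import math

def margin_of_error(n, p=0.5, z=1.96):
    """Approximate margin of error for a proportion p measured on n people.

    Assumes a simple random sample; z=1.96 corresponds to roughly 95% confidence.
    """
    return z * math.sqrt(p * (1 - p) / n)

# Compare how much one extra person helps at small and large sample sizes.
for n in (2, 3, 400, 1000, 1001, 1600):
    print(f"n={n:>5}: +/- {margin_of_error(n):.1%}")
```

Going from 2 people to 3 moves the margin by many percentage points, while going from 1,000 to 1,001 barely moves it. To cut the margin in half you need about four times the sample.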

Effect Size

When you are determining what you want to test, you need to decide on an effect size. That is, how much change are you looking for?

There are two ways to look at effect size: absolute and relative. Let’s say you have a mailing that you expect to have a 10% response rate. A 2% absolute effect size means a change of 2 percentage points, so you would be looking for a response rate of at least 12%. A 2% relative effect size means a change of 2% of the baseline, so you would be looking for a 10.2% response rate.
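The arithmetic for the two readings of "2%" can be sketched in a few lines. The variable and function names here are just for illustration.

```python
def absolute_target(baseline, effect):
    """Target rate when the effect is in percentage points (10% + 2 points)."""
    return baseline + effect

def relative_target(baseline, effect):
    """Target rate when the effect is a fraction of the baseline (10% * 1.02)."""
    return baseline * (1 + effect)

baseline = 0.10  # expected response rate
effect = 0.02    # the "2%" effect size

print(f"absolute: {absolute_target(baseline, effect):.1%}")
print(f"relative: {relative_target(baseline, effect):.1%}")
```

The same "2%" leads to very different targets, so it is worth writing down which one you mean before the test goes out.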

The smaller the effect size you want to be able to see, the more you need in your sample.

Confidence/Significance

Significance is really about how confident you are that the effect that you see is due to the change being tested rather than simple randomness. That is, how confidently can you reject the null hypothesis based on the effect you saw?
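As a rough illustration of how that confidence gets calculated, here is a sketch of a two-proportion z-test, which is one common way to compare two response rates. It is not the only option, and the response counts below are made up purely for the example.

```python
import math
from statistics import NormalDist

def two_proportion_p_value(responses_a, n_a, responses_b, n_b):
    """Two-sided p-value for the difference between two response rates.

    Uses the pooled normal approximation, which assumes reasonably large groups.
    """
    rate_a = responses_a / n_a
    rate_b = responses_b / n_b
    pooled = (responses_a + responses_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (rate_b - rate_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical results: version A got 100 of 1,000, version B got 130 of 1,000.
p_value = two_proportion_p_value(100, 1000, 130, 1000)
print(f"p-value: {p_value:.3f}")
```

The p-value is the chance of seeing a gap at least this large if the null hypothesis were true. The smaller it is, the more comfortable you can be rejecting the null hypothesis.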

When you choose a sample, there is a chance that you randomly select a group of people who do not represent the main population that you are trying to infer information about.

Depending on how small an effect you are trying to spot, and how confident you want to be that the results are not due to randomness, your sample size will change.
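Here is a sketch of how effect size and confidence feed into sample size, using the standard normal-approximation formula for comparing two proportions. It also needs a "power" setting (the chance of catching a real effect), which I have assumed to be 80%. That number is a common convention and my assumption, not something covered above.

```python
import math
from statistics import NormalDist

def sample_size_per_group(p_control, p_variant, alpha=0.05, power=0.80):
    """Approximate people needed in EACH group to detect p_control vs p_variant.

    alpha=0.05 corresponds to 95% confidence (two-sided); power=0.80 is assumed.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p_control * (1 - p_control) + p_variant * (1 - p_variant)
    difference = p_variant - p_control
    return math.ceil((z_alpha + z_power) ** 2 * variance / difference ** 2)

# 10% baseline: a 2-point absolute lift vs a 2% relative lift.
print("10% -> 12%:  ", sample_size_per_group(0.10, 0.12))
print("10% -> 10.2%:", sample_size_per_group(0.10, 0.102))
# Asking for 99% confidence instead of 95% raises the requirement.
print("10% -> 12% at 99%:", sample_size_per_group(0.10, 0.12, alpha=0.01))
```

Under these assumptions the 2-point lift needs a few thousand people per group, while the 2% relative lift needs hundreds of thousands. That gap is why it pays to settle on the effect size before you pull your list.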

While significance and confidence are technically different terms with precise definitions, what we are really trying to do is figure out how often we could be wrong even with the results we have. If I say my confidence level is 95%, I am accepting that if there were truly no difference and we repeated the test 20 times, about 1 of those 20 would still show an effect by pure chance.
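One way to get a feel for what a 95% confidence level means is to simulate a lot of tests where A and B are truly identical, and count how often the test gets fooled into calling a winner. This is a sketch only: the 10% response rate, the group size, the number of simulated tests, and the z-test itself are all my assumptions.

```python
import math
import random
from statistics import NormalDist

def p_value(responses_a, n_a, responses_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (responses_a + responses_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0
    z = (responses_b / n_b - responses_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rate(true_rate=0.10, group_size=400, n_tests=1000,
                        alpha=0.05, seed=1):
    """Share of simulated A/A tests that look 'significant' by chance alone."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    fooled = 0
    for _ in range(n_tests):
        # Both groups respond at the same true rate: there is no real effect.
        a = sum(rng.random() < true_rate for _ in range(group_size))
        b = sum(rng.random() < true_rate for _ in range(group_size))
        if p_value(a, group_size, b, group_size) < alpha:
            fooled += 1
    return fooled / n_tests

rate = false_positive_rate()
print(f"Tests fooled by chance: {rate:.1%}")
```

The share should land somewhere near 5%, or about 1 test in 20, even though nothing was actually different between the two groups.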

Multivariate Testing

Sometimes you have a large enough sample that you might want to try multiple tests at once. There really are two types of multivariate tests.

A/B/n tests are when you are testing multiple things directly against each other. For example, if I wanted to test 3 different designs for an envelope, I am doing an A/B/C test. In this case, we are directly comparing each result against the others, so we are testing A vs B, B vs C, and C vs A. Every extra comparison is another chance to be fooled by randomness, which means that our sample size needs to change to account for the extra tests.
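Here is a quick sketch of why the extra comparisons matter. It counts the head-to-head comparisons and applies a Bonferroni correction, which is one common and fairly conservative way to adjust. Which adjustment to use is my assumption for the example, and other methods exist.

```python
from itertools import combinations

def pairwise_comparisons(variants):
    """Every head-to-head pairing among the variants being tested."""
    return list(combinations(variants, 2))

def bonferroni_alpha(alpha, n_comparisons):
    """Stricter per-comparison threshold so the overall error rate stays near alpha."""
    return alpha / n_comparisons

pairs = pairwise_comparisons(["A", "B", "C"])
per_test_alpha = bonferroni_alpha(0.05, len(pairs))

print("Comparisons:", pairs)
print(f"Per-comparison significance threshold: {per_test_alpha:.4f}")
```

With 3 envelopes there are 3 comparisons, so each one has to clear a stricter bar than a single A/B test would. A stricter bar means a larger sample for the same effect size.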

There are also factorial tests. This would be, for example, testing two different envelopes and two different letters at the same time. While the sample size doesn’t change drastically, it’s important to know that you should only be drawing conclusions about the items you are directly testing against each other. So, you can show that Envelope A does better than Envelope B, or that Letter Y does better than Letter Z. If you want to compare each combination of envelope and letter, you need to treat it more like an A/B/n test, which requires a larger sample.
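A small sketch of how a factorial test gets read: each envelope is judged across both letters, and each letter across both envelopes. The response counts are invented purely for illustration, and the function name is my own.

```python
# Hypothetical results: (envelope, letter) -> (responses, pieces mailed)
cells = {
    ("A", "Y"): (60, 500),
    ("A", "Z"): (50, 500),
    ("B", "Y"): (45, 500),
    ("B", "Z"): (40, 500),
}

ENVELOPE, LETTER = 0, 1  # position of each factor in the cell key

def pooled_rate(cells, factor_index, level):
    """Response rate for one level of one factor, pooled across the other factor."""
    responses = sum(r for key, (r, n) in cells.items() if key[factor_index] == level)
    mailed = sum(n for key, (r, n) in cells.items() if key[factor_index] == level)
    return responses / mailed

for level in ("A", "B"):
    print(f"Envelope {level}: {pooled_rate(cells, ENVELOPE, level):.1%}")
for level in ("Y", "Z"):
    print(f"Letter {level}: {pooled_rate(cells, LETTER, level):.1%}")
```

Each envelope comparison here draws on all 1,000 pieces per envelope, which is why the sample size does not have to grow much. Each individual combination only has 500 pieces behind it, so comparing combinations needs more mail.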

AUTHOR: Kirk Schmidt, data metrics mad scientist.

Ben Johnson