Understanding A/B Test Results

When you have drawn conclusions from an A/B test, you can deploy some or all of its test group experiences to the site. Typically, you deploy the experiences of the winning test group, but you might sometimes want to deploy experiences for multiple groups. Deploy these experiences manually by creating a new campaign or adding them to an existing one. Before you can draw conclusions, however, you must determine how long tests should run, when they should end, and how you know that they have run long enough. The key concepts to understand are confidence level and statistical significance.

Note: Salesforce B2C Commerce doesn't update A/B test statistics for orders that originally failed and then were manually opened later within Business Manager.

Statistical Significance

The point at which a test reaches statistical significance depends on several factors. Typically, the more similar your test subjects, the more metrics you test, and the lower your traffic, the longer it takes to reach statistical significance. But what is statistical significance?

In statistics, a result is considered statistically significant if it's unlikely to have occurred by chance. For example, suppose an A/B test that included 500 customers showed that the average order value for September was 30% higher with an autumn-colored banner than with a dark-colored banner. Even if that result is statistically significant, is the difference important? Tests of significance should always be accompanied by effect-size statistics, which approximate the size, and thus the practical importance, of the difference. The amount of evidence required to accept that an event is unlikely to have arisen by chance is known as the confidence level.
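To make the distinction concrete, here is a minimal sketch of significance versus effect size for the banner example. It isn't part of B2C Commerce, and the order values and group sizes are hypothetical.

```python
import math
from statistics import mean, stdev

from scipy import stats

# Hypothetical order values (in dollars) for each banner; a real test
# would have far more observations.
autumn = [112.0, 98.5, 131.0, 104.2, 120.8, 99.9, 127.3, 110.4]
dark = [84.0, 91.3, 79.9, 102.7, 88.4, 95.1, 82.6, 90.2]

# Significance: Welch's t-test estimates how likely a gap this large
# would be under pure chance.
result = stats.ttest_ind(autumn, dark, equal_var=False)

# Effect size: Cohen's d approximates how large, and therefore how
# practically important, the difference is.
pooled_sd = math.sqrt((stdev(autumn) ** 2 + stdev(dark) ** 2) / 2)
cohens_d = (mean(autumn) - mean(dark)) / pooled_sd

print(f"p-value:   {result.pvalue:.4f}")  # low value -> unlikely by chance
print(f"Cohen's d: {cohens_d:.2f}")       # size of the difference
```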

You can compare the metrics and values of the control and test groups on the A/B Testing page. B2C Commerce calculates the confidence level, which indicates the likelihood that these differences result from your change in site experience rather than from random chance. When the confidence level reaches 90%, the result is deemed statistically significant.
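B2C Commerce computes the confidence level for you; its exact algorithm isn't documented here. Purely as an illustration of the general idea, this sketch derives a comparable figure for a conversion-style metric using a standard two-proportion z-test, where the confidence level is one minus the two-sided p-value. All traffic numbers are hypothetical.

```python
import math
from statistics import NormalDist

def confidence_level(conversions_a, visitors_a, conversions_b, visitors_b):
    """Two-proportion z-test: returns 1 - p-value, i.e., the likelihood
    that the observed difference isn't random chance."""
    p_a = conversions_a / visitors_a
    p_b = conversions_b / visitors_b
    pooled = (conversions_a + conversions_b) / (visitors_a + visitors_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / visitors_a + 1 / visitors_b))
    z = abs(p_b - p_a) / se
    return 2 * NormalDist().cdf(z) - 1  # = 1 - two-sided p-value

# Hypothetical traffic: control converts 200/5,000; test converts 240/5,000.
level = confidence_level(200, 5_000, 240, 5_000)
print(f"confidence level: {level:.1%}")  # ~94.9%, above the 90% threshold
```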

Note: The statistical significance testing procedure assumes that you set the sample size before starting the test. Because of this, early results can fluctuate until the sample size is reached. Keep this in mind when you receive an email informing you that your confidence level has reached 90%.
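A quick simulation shows why early peeks are unreliable. It runs an A/A test, where both groups share the same true conversion rate, and checks the running confidence level repeatedly before the planned sample size is reached; a sizable share of runs spuriously cross 90%. This is purely illustrative and unrelated to B2C Commerce internals; all parameters are hypothetical.

```python
import random
from statistics import NormalDist

random.seed(7)
N, RATE, STEP, TRIALS = 10_000, 0.04, 500, 200
crossed_early = 0

for _ in range(TRIALS):
    conv_a = conv_b = 0
    for k in range(1, N + 1):
        conv_a += random.random() < RATE  # both groups share the same
        conv_b += random.random() < RATE  # true rate: an A/A test
        if k % STEP:                      # "peek" every STEP visitors
            continue
        pooled = (conv_a + conv_b) / (2 * k)
        se = (pooled * (1 - pooled) * 2 / k) ** 0.5
        if se == 0:
            continue
        z = abs(conv_a - conv_b) / (k * se)
        if 2 * NormalDist().cdf(z) - 1 >= 0.90:  # spuriously "significant"
            crossed_early += 1
            break

print(f"{crossed_early / TRIALS:.0%} of A/A runs hit 90% confidence early")
```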

In the banner-color example, if the B2C Commerce-computed confidence level reaches 90%, the test result can be considered statistically significant. The merchandising team should use an autumn-colored banner in September, with a high degree of confidence that it drives better results than a dark-colored banner.

Test Length

How long a test should run depends on your average number of daily visitors, the percentage of visitors included in the test, and other external factors. In general, run a test until it reaches statistical significance, or until it's clear that it won't. Note, though, that a B2C Commerce A/B test can run for a maximum of 90 days.
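As a rough planning aid, you can estimate the sample size a two-group conversion test needs and translate that into days of traffic. The sketch below uses a standard power-analysis approximation with hypothetical inputs; B2C Commerce doesn't provide this calculation.

```python
import math
from statistics import NormalDist

def days_to_significance(baseline_rate, relative_lift, daily_visitors,
                         pct_in_test, alpha=0.10, power=0.80):
    """Approximate days for a two-proportion test to detect the given lift.

    alpha=0.10 corresponds to the 90% confidence threshold; power is the
    chance of detecting the lift if it's real.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    p_bar = (p1 + p2) / 2
    n_per_group = (2 * p_bar * (1 - p_bar)
                   * (z_alpha + z_power) ** 2 / (p2 - p1) ** 2)
    enrolled_per_day = daily_visitors * pct_in_test  # split across 2 groups
    return math.ceil(2 * n_per_group / enrolled_per_day)

# 4% baseline conversion, detect a 10% relative lift, 8,000 daily visitors,
# 50% of traffic enrolled in the test: roughly two weeks.
print(days_to_significance(0.04, 0.10, 8_000, 0.50), "days (rough estimate)")
```

With these inputs the estimate is about 16 days, comfortably inside the 90-day maximum; smaller lifts or lower traffic push the estimate up quickly.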

An A/B test ends automatically on its end date unless a user disables it earlier. If the test reaches a 95% confidence level for its key metric, an email is sent to the recipients configured on the A/B test. The test continues to run after the email is sent, and the confidence level can dip below 95% again afterward. The email goes out only once, to avoid repeated notifications if the confidence level drops from 95% to 94% and then climbs back over 95%. You can deploy segment experiences even if a 95% confidence level isn't reached. For example, you can deploy at a confidence level of 90%, 85%, and so on.

Related Links

A/B Testing

A/B Test Metrics