AdWords ad testing: Simpson’s paradox and aggregating stats

Typically with A/B ad testing it is necessary to aggregate statistics across ad groups in order to come up with statistically significant conclusions. Unfortunately, when doing this, if your ads are not being displayed evenly, you can run into Simpson’s Paradox.

According to Wikipedia, Simpson’s Paradox is “a paradox in which a trend that appears in different groups of data disappears when these groups are combined, and the reverse trend appears for the aggregate data.” Simply put, when you combine data across ad groups, you may draw the opposite conclusion from the one you should, picking poorer performing ads over better performing ones.

As an example, say you are testing ads for greater clickthrough rates. You created two variations, with different description 1 lines, and have applied them to two different ad groups (four ads total). At the end of the test you aggregate data for the two variations to find a winner:

Aggregate Data

Ad     Impressions    Clicks    CTR
Ad1    10,600         125       1.18%
Ad2    7,500          125       1.67%

From this data, you’d conclude that Ad2 is the winning ad because it has the best CTR. But say you decide to look at individual statistics by ad group:

Ad Group 1 data

Ad     Impressions    Clicks    CTR
Ad1    600            25        4.17%
Ad2    2,500          100       4.00%

Ad Group 2 data

Ad     Impressions    Clicks    CTR
Ad1    10,000         100       1.00%
Ad2    5,000          25        0.50%

Now you see that Ad1 has the best CTR in both ad groups. Because Ad Group 2 has lower CTRs overall and most of Ad1’s impressions come from there, Ad1’s aggregate CTR is dragged down. Ad2 is not the best ad even though it appears to be when you look at aggregate stats.
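The reversal can be checked in a few lines of Python. This is only a sketch using the made-up numbers from the tables above; the names `stats`, `winners_by_group`, and `aggregate_winner` are illustrative and not part of any AdWords tool.

```python
# Impressions and clicks from the example tables: (ad group, ad) -> (impressions, clicks)
stats = {
    ("Ad Group 1", "Ad1"): (600, 25),
    ("Ad Group 1", "Ad2"): (2_500, 100),
    ("Ad Group 2", "Ad1"): (10_000, 100),
    ("Ad Group 2", "Ad2"): (5_000, 25),
}


def ctr(impressions, clicks):
    """Clickthrough rate as a fraction (clicks / impressions)."""
    return clicks / impressions


def winners_by_group(stats):
    """Return {ad group: ad with the highest CTR in that group}."""
    best = {}
    for (group, ad), (imps, clicks) in stats.items():
        rate = ctr(imps, clicks)
        if group not in best or rate > best[group][1]:
            best[group] = (ad, rate)
    return {group: ad for group, (ad, _) in best.items()}


def aggregate_winner(stats):
    """Sum impressions and clicks per ad, then return the ad with the highest pooled CTR."""
    totals = {}
    for (_, ad), (imps, clicks) in stats.items():
        t_imps, t_clicks = totals.get(ad, (0, 0))
        totals[ad] = (t_imps + imps, t_clicks + clicks)
    return max(totals, key=lambda ad: ctr(*totals[ad]))


per_group = winners_by_group(stats)
overall = aggregate_winner(stats)
print(per_group)  # {'Ad Group 1': 'Ad1', 'Ad Group 2': 'Ad1'}
print(overall)    # Ad2
```

Whenever the per-group winners all disagree with the aggregate winner, as they do here, the aggregate number is reflecting where the impressions landed rather than which ad is better.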

This is made-up data, so it’s worth asking whether numbers like these can actually show up in AdWords. Based on experience, you are most likely to see them with optimized ad serving turned on (“optimize for clicks” or “optimize for conversions”). Optimized ad serving shows an ad more often when it has a greater CTR; this is the case for Ad1 in Ad Group 2. It is also possible for the ad with the greater CTR to be shown less often within an ad group, as in Ad Group 1. This can happen if the data is not statistically significant, and occasionally if the ads’ quality scores differ so much that one ad is shown much more frequently for generalized queries. When optimized ad serving is turned on, aggregate statistics can be misleading in reporting.

If you’d like to see which ad is winning overall, you can still do so by looking at which ads are being shown more frequently, since optimized serving favors the better performers. Be aware, though, that a poorer performing ad can end up with greater overall distribution if the ad groups in the test are not similar. Another method is to look at ad groups individually and count which ad wins in the most ad groups; however, stats are aggregated in the first place because individual ad groups do not provide statistically significant data. The best solution is to look at overall stats and ad group stats side by side in pivot tables to check whether there are any anomalies or whether there really does appear to be a winner across the board. It’s common to see one variation winning in some of the ad groups but not others; it’s rarer to see very dominant winners unless you haven’t followed best practices in the initial drafting of your ads. Seeing mixed results across ad groups can mean that your ads all perform very similarly. It can also mean that you aren’t looking at significant data in a lot of your ad groups.
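To judge whether a single ad group’s result is significant, some test has to be chosen. The post does not name one, so the sketch below assumes a two-proportion z-test, which is one common choice for comparing CTRs; the function name and the 0.05 threshold mentioned afterwards are assumptions for illustration.

```python
import math


def two_proportion_z_test(clicks_a, imps_a, clicks_b, imps_b):
    """Two-sided z-test for a difference between two CTRs.

    Returns (z, p_value). Uses the pooled-proportion standard error and the
    normal approximation, which is reasonable only when click counts are not tiny.
    """
    rate_a = clicks_a / imps_a
    rate_b = clicks_b / imps_b
    pooled = (clicks_a + clicks_b) / (imps_a + imps_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / imps_a + 1 / imps_b))
    z = (rate_a - rate_b) / std_err
    # Two-sided p-value from the standard normal distribution, via math.erfc.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Ad1 vs. Ad2 in each ad group of the made-up example (clicks, impressions).
z1, p1 = two_proportion_z_test(25, 600, 100, 2_500)     # Ad Group 1
z2, p2 = two_proportion_z_test(100, 10_000, 25, 5_000)  # Ad Group 2
print(f"Ad Group 1: z={z1:.2f}, p={p1:.3f}")
print(f"Ad Group 2: z={z2:.2f}, p={p2:.4f}")
```

With the example numbers, the difference in Ad Group 1 is nowhere near significant at the usual 0.05 level, while the difference in Ad Group 2 is. That is the kind of mixed picture described above: a “win” in an ad group with thin data should not count the same as a win in one with plenty of data.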

Reducing the Risks of Simpson’s Paradox

If your ads are distributed evenly across ad groups, and your ad groups are closely themed, there’s no risk of falling into the Simpson’s Paradox trap. Google has an ad serving option for “rotating ads indefinitely.” This option can be useful if you’re interested in aggregating statistics to draw broad conclusions about what types of ads work for your account. There are a few drawbacks to doing this:

Optimized ad serving automatically adjusts serving to maximize performance at the ad group level. You lose this automation when you turn on even rotation.
An ad may perform better in some ad groups but not in others, while another ad does better overall. When you apply ad changes universally across ad groups, you lose the ad group-level optimization that takes place automatically with optimized serving.
Even rotation doesn’t actually rotate ads evenly, only approximately. If two ads perform very differently, they will have different quality scores. The ad with the higher quality score will be shown more often because it will be eligible for more auctions.

Speaking to this last point, here is real data from a campaign with ads set to rotate evenly:

Ad     Impressions    Clicks    CTR
Ad1    9,284          307       3.3%
Ad2    3,235          58        1.8%

These ads performed so differently in terms of CTR that one got distributed a lot more. The bids were set too low for the second ad to be eligible for the same number of auctions.

While even rotation may seem like the way to go, you can still run into distribution problems, lose more granular optimization, and lose out on automatic optimization.
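To illustrate the claim at the start of this section that evenly distributed ads avoid the paradox, the sketch below takes the per-ad-group CTRs from the made-up example earlier in the post and asks what the pooled CTRs would be if each ad received exactly half of each ad group’s impressions. It assumes each ad’s CTR within an ad group would stay the same under even rotation, which is an assumption for illustration and not something the real data guarantees. The function name `evenly_rotated_ctr` is hypothetical.

```python
# Per-ad-group CTRs from the made-up example: {ad group: {ad: CTR}}
ctr_by_group = {
    "Ad Group 1": {"Ad1": 25 / 600, "Ad2": 100 / 2_500},
    "Ad Group 2": {"Ad1": 100 / 10_000, "Ad2": 25 / 5_000},
}

# Total impressions per ad group in the example (600 + 2,500 and 10,000 + 5,000),
# here hypothetically split evenly between the two ads.
group_impressions = {"Ad Group 1": 3_100, "Ad Group 2": 15_000}


def evenly_rotated_ctr(ctr_by_group, group_impressions):
    """Pooled CTR per ad if every ad got an equal share of each group's impressions."""
    clicks = {}
    impressions = {}
    for group, ad_ctrs in ctr_by_group.items():
        share = group_impressions[group] / len(ad_ctrs)  # equal share per ad
        for ad, rate in ad_ctrs.items():
            clicks[ad] = clicks.get(ad, 0.0) + rate * share  # expected clicks
            impressions[ad] = impressions.get(ad, 0.0) + share
    return {ad: clicks[ad] / impressions[ad] for ad in clicks}


pooled = evenly_rotated_ctr(ctr_by_group, group_impressions)
for ad, rate in pooled.items():
    print(f"{ad}: {rate:.2%}")
```

Under an even split, Ad1 comes out ahead in the pooled numbers, matching its win in both ad groups. The same reweighting can be applied after the fact to unevenly served data as a sanity check on an aggregate result, with the caveat that it relies on per-ad-group CTRs that may themselves be noisy.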

Using Conversion Metrics for Optimization

Most of this post talks about using CTR as the metric for optimizing ads. It’s possible to use alternative metrics like conversions per impression or profit per impression. Most of these concepts still apply when using those metrics, perhaps more so, because you’ll be more likely to group data when analyzing less frequently occurring events like conversions and sales. Using these other metrics does not solve Simpson’s Paradox (and in fact, they have many drawbacks, but that is outside the scope of this post).

When aggregating statistics across very different entities, be careful that you are reporting the results correctly. Otherwise, besides wasting a lot of testing time, you can even make your account worse off!