Is A/B testing with Big Data still relevant?
Updated: Jul 17
It's tempting to conclude that with Big Data, we don't need A/B testing anymore. Why should we? With big data, we should be able to see at a glance which ad is better, and the sheer volume of data should give us statistical confidence.
Let's take the following two ads, A and B:
We have over 40,000 impressions in total. That's far from big data, but it shows the concept. Based on the conversion rate alone, ad A is better. However, because the conversion rate is "small" compared to the total number of clicks, the sample size needed to reliably estimate the conversion rate is bigger than we might expect.
The A/B scenario is based on the fantastic book "Practical Statistics for Data Science" (second edition); the code implementation is mine. While the authors write their own permutation function in base R, I'm using the R package "infer."

Traditional statistics vs. modern statistics

We can generate the null distribution in one of two ways: with theory-based methods, or with simulation, where we permute the response and explanatory variables. The "infer" package follows a growing trend (modern statistics) of emphasizing data and simulation over classical probability theory and complex statistical tests. The modern-statistics approach is closer to how a Data Scientist works.

What's the advantage of the modern-statistics approach? Random variation can fool the human brain, and resampling processes (bootstrapping or permutation) reduce the risk of being fooled by randomness. This thinking is largely based on Prof. Allen Downey's famous blog post "There is only one test!" (i.e., for hypothesis tests we resample via permutation tests, and for confidence intervals we bootstrap).
Let's get started. First, we need to generate replicates with the R package infer:
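A minimal sketch of that step (the counts below are hypothetical, since the article's raw data isn't shown; the infer verbs specify, hypothesize, and generate are the package's standard pipeline):

```r
library(infer)

# Hypothetical impression-level data (illustrative counts only):
# one row per ad impression
ads <- data.frame(
  ad = rep(c("A", "B"), times = c(23000, 22000)),
  outcome = c(rep(c("conversion", "no conversion"), times = c(200, 22800)),
              rep(c("conversion", "no conversion"), times = c(182, 21818)))
)

# Shuffle (permute) the outcome across ads to simulate the null
# hypothesis that the ad label has no effect on conversion
set.seed(1)
replicates <- ads |>
  specify(outcome ~ ad, success = "conversion") |>
  hypothesize(null = "independence") |>
  generate(reps = 100, type = "permute")

nrow(replicates)  # 100 replicates x 45,000 rows = 4,500,000
```

Each replicate is a full copy of the data with the outcome column reshuffled, which is why the generated data frame gets so large.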
We get 4,639,000 rows, i.e., each of the original observations repeated once per replicate. In general, reps should be around 1,000, but 100 suffice in this case.
Next, we calculate the summary statistic and get a simulated difference between A and B:
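As a sketch (again with hypothetical counts, since the article's data isn't shown), infer's calculate() collapses each permuted replicate into a single difference in conversion proportions:

```r
library(infer)

# Hypothetical counts (illustrative only): A = 200/23,000, B = 182/22,000
ads <- data.frame(
  ad = rep(c("A", "B"), times = c(23000, 22000)),
  outcome = c(rep(c("conversion", "no conversion"), times = c(200, 22800)),
              rep(c("conversion", "no conversion"), times = c(182, 21818)))
)

set.seed(1)
null_dist <- ads |>
  specify(outcome ~ ad, success = "conversion") |>
  hypothesize(null = "independence") |>
  generate(reps = 100, type = "permute") |>
  # one simulated "prop(A) - prop(B)" per replicate
  calculate(stat = "diff in props", order = c("A", "B"))

head(null_dist)  # columns: replicate, stat
```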
We can visualize it with the following R code:
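A sketch of that plotting step, using infer's visualize() and shade_p_value() helpers (hypothetical counts; the article's actual code isn't shown):

```r
library(infer)

# Hypothetical counts (illustrative only): A = 200/23,000, B = 182/22,000
ads <- data.frame(
  ad = rep(c("A", "B"), times = c(23000, 22000)),
  outcome = c(rep(c("conversion", "no conversion"), times = c(200, 22800)),
              rep(c("conversion", "no conversion"), times = c(182, 21818)))
)

# Observed difference in conversion proportions, prop(A) - prop(B)
obs_diff <- ads |>
  specify(outcome ~ ad, success = "conversion") |>
  calculate(stat = "diff in props", order = c("A", "B"))

set.seed(1)
null_dist <- ads |>
  specify(outcome ~ ad, success = "conversion") |>
  hypothesize(null = "independence") |>
  generate(reps = 100, type = "permute") |>
  calculate(stat = "diff in props", order = c("A", "B"))

# Histogram of the null distribution with the observed difference marked
visualize(null_dist) +
  shade_p_value(obs_stat = obs_diff, direction = "two-sided")
```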
The visualization of the simulation-based null distribution:
One of the great things about the infer package is that we can jump seamlessly between conducting a hypothesis test and constructing confidence intervals:
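The jump is essentially two changes to the pipeline: drop hypothesize() (we are now estimating, not testing a null) and switch generate() from permutation to bootstrap resampling. A sketch with hypothetical counts:

```r
library(infer)

# Hypothetical counts (illustrative only): A = 200/23,000, B = 182/22,000
ads <- data.frame(
  ad = rep(c("A", "B"), times = c(23000, 22000)),
  outcome = c(rep(c("conversion", "no conversion"), times = c(200, 22800)),
              rep(c("conversion", "no conversion"), times = c(182, 21818)))
)

# Resample with replacement (bootstrap) instead of permuting
set.seed(1)
boot_dist <- ads |>
  specify(outcome ~ ad, success = "conversion") |>
  generate(reps = 100, type = "bootstrap") |>
  calculate(stat = "diff in props", order = c("A", "B"))

# 95% percentile confidence interval for prop(A) - prop(B)
get_confidence_interval(boot_dist, level = 0.95, type = "percentile")
```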
Based on the confidence intervals (bootstrapped, i.e., resampled with replacement), we can already see that there is hardly a difference between ads A and B. Why? Because the interval is centered around zero.
A proportion test in R gives us a p-value of 0.34, which confirms our expectation: we cannot reject the null hypothesis. In other words, we can't find a statistically significant difference between ads A and B.
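In base R this is a one-liner with prop.test(); the counts here are hypothetical, so the exact p-value will differ from the article's 0.34:

```r
# Two-sample test for equal conversion proportions (base R)
# Hypothetical counts: A = 200 conversions / 23,000 impressions,
#                      B = 182 conversions / 22,000 impressions
res <- prop.test(x = c(200, 182), n = c(23000, 22000))
res$p.value  # large p-value: no evidence of a difference
```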
Last but not least, a power test in R, solving for n, tells us that we need more ad impressions for a statistically valid sample (around 661 total impressions to make a statistical claim).
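In base R, power.prop.test() solves for whichever argument is left unspecified, here n. With two hypothetical conversion rates that sit close together, the required sample per group comes out large:

```r
# Solve for n: sample size per group needed to detect the difference
# between two hypothetical conversion rates at 80% power, 5% significance
pw <- power.prop.test(p1 = 0.0087, p2 = 0.0083,
                      power = 0.80, sig.level = 0.05)
ceiling(pw$n)  # impressions needed per ad
```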
Power and Sample Size

As the difference in conversion rates between ad A and ad B is relatively small, the required sample size is larger. This makes intuitive sense: the smaller the difference we want to detect, the bigger the sample we need to show that it is significant.
Our A/B test with resampling used a permutation test for the hypothesis test and bootstrapping for the confidence intervals. Based on those resampling tests, we cannot (yet) make a scientific claim about which ad is better. We need more data (i.e., more clicks).
This A/B test can improve the ROI of a campaign dramatically. We can confidently drop a losing ad and stick with a winning ad. The A/B test, therefore, should reduce wasted money and increase profitable ad spend.
I used to run A/B testing the "old" way as well. And from what I've seen, this modern approach is being used by many high-tech companies nowadays (summer 2020).
Franco