The Monte Carlo simulation is arguably one of the most important algorithms of the last century. Coincidentally, one of the greatest mathematicians ever lived was involved in developing the method: John von Neumann.
20 years ago, one needed a Ph.D. in statistics in order to run a Monte Carlo simulation properly. I remember it well when we worked with quants on the trading floor and valued crazy financial products for which there was no closed formula.
However, today, some basic coding skills are sufficient to run a Monte Carlo simulation. And, we can do crazy useful things such as:
- Sample size determination (even for highly skewed data).
- Create a larger sample size via Bootstrapping.
- Hypothesis tests via Monte Carlo simulation.
Basic Monte Carlo Simulation in R
A beautiful aspect of probability is that it is often possible to study problems via simulation. The basic function and its arguments are:
sample ( n, k, replace = TRUE, prob = NULL )
n = the numbers to be simulated.
k = the number of simulations
replace = TRUE if you replace the selected number, otherwise FALSE.
prob = probability for each n, otherwise NULL.
"MONTE CARLO SIMULATION IS A TECHNIQUE USED TO APPROXIMATE THE PROBABILITY OF AN EVENT BY RUNNING THE SAME SIMULATION MULTIPLE TIMES AND AVERAGING THE RESULTS". Prof. John V. Guttag
To some, the function with its arguments might look simple, However, Monte Carlo simulation can get quickly insanely complicated.
The following is a simple example: we flip a fair coin 10 times:
We can see that the expected value is 0.70. If we increase the number of simulations (k) to 1,000, we would get closer to the expected value of 0.50. In fact, with 1,000 simulations, we get 0.56.
In my doctoral thesis, I used Monte Carlo simulations for sample size determination and permutation-based hypothesis tests. Learning Monte Carlo simulations can be tricky; most books are full of math combined with little or no code. If you're an applied Data Scientist, I would ignore books without code. Why? You might lose too much valuable time with math. One of my all-time favorite Data Science books by Rafael A. Irizarry "Introduction to Data Science: Data Analysis and Prediction Algorithms with R" covers Monte Carlo simulations in an extremely accessible way using R. Don't be deterred by the few reviews on Amazon, the online version to the book has over 500,000 students (Harvard's Data Science in R).
Probability with Monte Carlo Simulations
Essentially, Monte Carlo simulations allow us to calculate distributions where we don't have a closed formula. However, once you understand Monte Carlo simulations, you probably going us it, even though a close formula exists.
For example, how often can we expect a sum of 18 with 6 dice?
Assume you are playing a game in which you with probability 2/3, you with probability 1/12, and you lose with probability 1/4. How often are you going to win in 10,000 games?
If you enjoy Monte Carlo simulations in R, I highly recommend the following books:
- Probability: With Applications and R
- Introduction to Data Science: Data Analysis and Prediction Algorithms with R
- Statistical Inference via Data Science
The bootstrap is a really cool statistical method that allows us to approximate the population with just a sample. The inventor is one of the most influential statisticians ever: Bradley Efron.
The bootstrap is the building block for "modern hypothesis testing." With the bootstrap, rather than using a static dataset, we use a simulation. I show a simple example, however, this can get extremely technical (e.g., A/B Testing in R). The bootstrap does belong to the Monte Carlo family. Important to note: the bootstrap does not create new samples. We simply take samples from the left WITH REPLACEMENT.
Let's say you got the following conversion rates today: 0.02%, 0.03%, 0.04%, 0.05%, 0.02%, 0.04%, 0.04%, 0.01%, 0.12%, and 0.15%. At a 95% confidence interval, you want to know the conversion range you can expect. Based on the Central Limit Theorem (CLT), your sample size (10 conversions) is most likely too small. Additionally, we can't calculate a meaningful confidence interval from the small sample (the mean is 5.2% and the standard deviation is 4.5% which would result in a negative confidence interval!).
Below is a very simple Monte Carlo simulation that takes randomly 1,000 samples (N) from the sample with replacement. At a 95% confidence interval (1.96 z-score) we get a meaningful confidence interval of 3.7% - 6.5%. In other words, at a 95% confidence interval, we can expect the conversion rate to be within 3.7% - 6.5%
This is an area where R shines over Python. In my opinion, R dominates statistical methods because it offers the libraries AND knowledge. In other words, for bootstrap, there are great libraries with contributors who are willing to share their knowledge. For example, the difficulty in bootstrap lies less in its code, but more in its application. Excellent sources are "Modern Data Science with R" and "Statistical Inference via Data Science" which offers a book and a brilliant library called infer.
Accuracy of the Central Limit Theorem (CLT)
The Central Limit Theorem (CLT to its friends) tells us that when the sample size is "large," the probability distribution of the sum of the independent draws are approximately normal. The usual rule of thumb, found in most textbooks, is that the CLT is reasonably accurate if n >30 unless the data is quite skewed. Because sampling is so important in statistics, the CLT is considered one of the most important mathematical insights in history.
- The samples need to be random + independent. This can be extremely hard (see vote prediction).
- If the population is normally distributed, the CLT can hold for 10 - 100 samples.
- If the population is heavily skewed, this could result in >1000s of samples.
- How close to "normal" is normal? The Shapiro-Wilk test can help.
I love statistics. However, I feel that most statistics books are focused on explaining statistics better, without any concern about progress. For example, explaining the CLT with just text is futile to me. We need to formalize statistics with either math or code. I strongly support the latter.
A permutation is a mathematical term for "shuffling": taking a series of values and reordering them randomly. Below are all the ways we can choose two numbers from a list consisting of 1,2, and 3:
Order matters with permutations: 3,1 is different from 1,3. In fact, permutations are another form of resampling, like bootstrap. While the bootstrap method involves resampling with replacement, permutation methods involve resampling without replacement (e.g., the hypothesis test with permutation).
A/B Testing of LinkedIn Ads
Below you'll find the R code for a quick and dirty A/B testing.
Attention: A/B testing gets advanced quickly, if you really want to dig deep, the best course I know is DataCamp's A/B testing in R (link).
Question #1: Are my results statistically significant?
Question #2: Is my sample size big enough?
To run this test on your own, you need R and the library(pwr) installed.
Ad "A" = 600 clicks with 32 conversions
Ad "B" = 400 clicks with 44 conversions
Consequently, we add the clicks and conversions in R:
As you can see, the results are statistically significant at a p-value of 0.001418. Generally, a p-value below 0.05 is desired.
Assumption: we assume the distribution is identical and normally distributed (ad A vs. ad B).
Next, we want to know how many samples we need:
If you're unsure, just enter the total clicks in n1 = 1000 and leave the rest as is.
For ad "B", we have 400 samples (clicks). Thus, we can reasonably conclude that we have enough samples.
As we have enough samples, the hypothesis test can be regarded as valid. In other words, ad "A" is statistically significantly better than ad "B.".
I.e., you can dump ad "B", statistically speaking :-)
Hypothesis Testing: Are we promoting more men than women?
Let's say human resources approaches us. Then want to know whether our company favors men in promotions. Over the last 12 months, they have promoted 117 women and 203 men. They want to know whether we favor men over women. Is the result statistically significant? To answer this question, we need to run a hypothesis test.
The classical approach would be to calculate the p-value based on a t-test. In "modern statistics", we simulate the confidence interval by using a Monte Carlo simulation technique called "permutation-based hypothesis test." For this test, I'm using the amazing R package infer. Anyone interested in learning statistical inference with R, I can highly recommend the book "Statistical Inference via Data Science." At first, we set a p-value of 0.05.
With a permutation-based hypothesis test, we sort of mixed the two outcomes (promotions men and women) as there was no difference. Then, based on the histogram, we check how rare the difference in promotions between men and women is. In this case, we can see from the histogram that a difference of 3% in promotions is not statistically significant. In other words, statistically we cannot find proof that we are favoring men over women in promotions.