1

Hi there, I'm Franco.


The aim of this video tutorial is, that you can use Benford's Law with Excel, as fast as possible.

Years ago, it took me several weeks, to learn how to apply Benford's Law with Excel. With this video tutorial, I want you to learn it in less than an hour.


With the goal of application in mind, we keep the theory to a minimum. However, where necessary, I’ll show you a few technical slides.


1b
For example, after this course, I want you to take a dataset from your company and test it, whether it follows Benford's Law.

​Who knows, you might save your company a multi-million dollar fraud.

2
Below the video you're watching, you'll find the download link for the Excel file. The Excel file contains all the formulas and datasets.

However, I suggest you first watch the video and then download the Excel file. Without the knowledge from the video, it's probably hard to use the Excel file on your own.

3
Every year, companies lose billions to fake invoices. For example, in 2019, CNBC reported that Google and Facebook paid 123 million dollars to a single company. All invoices were fake.

In fact, Facebook alone paid around 100 million dollars in fake invoices. In our case study, we pretend to be working for Facebook and trying to detect the 100-million-dollar fraud.

4
In the Excel sheet "invoices," you'll find a dataset of invoices from companies A, B, and C. Similarly, to what happened to Facebook, one of those vendors sent us fake invoices of around 100 million dollars.

Our ultimate goal is to find out, which company is cheating us? Is it A, B, or C?

5
To be clear, those invoices were simulated during my Ph.D. thesis. In other words, we don't have the exact invoices Facebook received. If you want to (wanna) learn more about the research, please visit BenfordAnalytics dot com slash research.

6
Now it's time to speak about Benford's Law. Benford's Law is attributed to Frank Benford, an American engineer. Benford’s Law is often used in fraud analytics.

Benford's Law essentially states that the sequence of numbers from real-life sources is likely to be distributed in a specific way.

The formula on the top can help us detecting data that deviates from the expected distribution. For example, the digits 49 highlighted in red differ the most from Benford's Law. Thus, it indicates a potential fraud.

7
Here, we have the formula to calculate the expected Benford distribution.

For example, in a dataset, we expect numbers to start with one, 30.1% of the time. This calculation is based on the logarithm with a base 10, and a formula, with the corresponding digit, in this case, 1, in the denominator. In other words, we always use the same formula but divide by the corresponding digits such as 2, 3, 4, and so on.

Essentially, the log transformation gives us the expected probability for each digit.

8
Now I would like to highlight one critical part; there are different variations of Benford's Law. If you Google "Benford's Law," you're most likely gonna read about the first digit version. This version covers the digits 1 to 9.

However, in the Excel file, we're going to cover a more advanced version. The reason for this is not to make it more complicated, but because of the following:

9
The Association of Certified Fraud Examiners, or ACFE in short, is arguably the global standard of fraud investigation.

And, in a recent paper by the same association, they stated that they favor the 10 to 99 version, or the advanced version, over the simpler 1 to 9 version.

It's a bit more work for us, but there are three reasons why I believe it's beneficial for us. One, once you've understood the first version, the more advanced version is pretty straightforward.

Two, many of you might be certified fraud examiners, so it's natural to follow the ACFE guidelines. And last but not least, the ACFE is 100% right. The simpler version is really useless, at least for visualization purposes.

 

10
So, here's the calculation for the advanced Benford variation. As promised, this should be fairly simple for you. The advanced version is merely an extension of the first calculation.

Again, the log transformation gives us the expected probability for each digit. However, we're now analyzing the digits 10 to 99. That's all.

For example, invoices starting with the digits "10" are expected to occur 4.14% of the time. As you can see, the more advanced version is more granular. Because the advanced version is more granular, we can drill down better into potential anomalies.

11
Let's see the advanced version in Excel and start visually.

On the x-axis, we have the numbers 10 – 99. On the y-axes, we have the expected distribution in percentages.

In data science, this is one of the most important visualization types. It's called a histogram. A histogram has a categorical data type on the x-axis and a measure of the y-axis. Why is it so important? One reason is that it can easily visualize anomalies in our data.

In our case, on the x-axis, we have the categorical data of the first digits 10,11, 12, and up to 99. This kind of data, often referred to as bins, or buckets, are categorical because they can only be 10 or 11, but nothing in between.

On the y-axis, we see the expected Benford distribution. This distribution is also indicated by the red line. The blue bars refer to the observed distribution.

For example, invoices starting with the digits 38 are expected to appear 1.13% of the time. As you can see, the observed distribution is almost perfect. In other words, the expected distribution, indicated by the red line is pretty close to the observed distribution, indicated by the blue bars.

12
Of course, the visualization we have just seen is based on our formula, in Excel.

Staying with invoices starting with the digits "38", here we count how often we observe invoices starting with the first two digits 38.

Our count is 112. Then, we divided the count by the total number of invoices. In other words, we divide 112 by 10,000 and get 1.12%. And finally, we compare our observed distribution with Benford's expected distribution. In this case, Benford's Law expects the digits "38" to appear 1.13%.

Unfortunately, in the real world, we hardly ever get a perfect Benford distribution. Therefore, we need a formula that helps us in determining how much deviation is still acceptable.

 

13
Professor Mark Nigrini is arguably the world’s foremost expert on Benford’s Law. Thanks to his extensive research, we have a threshold.

In other words, his research gives us a threshold level to what deviation is acceptable, and what deviation is not acceptable.

14
In Professor Nigrini’s research, he analyzed natural science data and compared it to Benford's Law. Why natural science data, you might ask? Well, essentially, he needed data that was not tampered with by humans. Now, Prof. Nigrini compared the natural science data to Benford's expectation. With this analysis, we have a baseline to evaluate whether a dataset is conforming or not.

15
Finally, we need is a method to calculate conforming or not. As you can see, this dataset follows Benford’s expected distribution almost to a t.

However, we cannot use traditional measures such as mean or standard deviation. The reason for this is that Benford's distribution is highly skewed. Remember, we expect lower digits to have a higher distribution than higher digits. Visually, this gives a tail to the right. Statistically speaking, the distribution is non-normal and non-parametric.

16
According to Prof. Nigrini, the best way to calculate whether a dataset is conforming or not is the MEAN ABSOLUTE DEVIATION, or MAD. MAD simply takes the sum of all deviations from the expected deviations, expressed as an absolute value, and divides it by the number of observations. In our case, for the digits 10 to 99, which is 89.

17
Based on Prof. Nigrini’s research on natural science data, empirically, a MAD of above 0.0022 indicates nonconformity. A value below 0.0022 indicates conformity to Benford's distribution.

There are other measures to calculate conformity for those among you with a statistics background, such as Chi-square. The problem with Chi-square is that the threshold is somewhat arbitrary, while with MAD, the threshold is backed up by empirical research.

18
Of course, we don't have to memorize those levels. All the critical values are listed in the Excel file. And, if the critical values are above nonconformity, it turns red.

19
Let's see what this looks like in Excel.

For the first two digits, "14", we count it 427 times. With 10,000 invoices, that's a proportion of 4.27%. In other words, we observed a distribution of 4.27% while we expected 3.00%, based on Benford’s Law. The calculation is simply 4.27% minus 3.00%, which equals 1.27%. It's important to note here, that the difference is always expressed in absolute terms. It can only be a positive number.

20
In the Excel file, the calculation for the Mean Absolute Deviation is calculated automatically. As we can see here, if the value is above the threshold of 0.0022, the value is displayed in red. In other words, the invoices from this vendor are not conforming.

Now we're armed with the basic Benford knowledge to run datasets with Excel. Let’s jump into the Excel file (open Excel file).

21
Something we have touched upon yet is the following: MAD defines if a vendor is suspicious or not. Once we declared a vendor as suspicious, we can use Z-statistic to pinpoint those invoices which deviate the most.

22
As we can see, the z-statistic for digit 62 is exceptionally high, indicated by the red color-coding. Thus, in this case, we want to check first all invoices starting the digits 62.

At this stage, you might wonder what is more critical; conformity expressed as MAD level or z-statistic? The answer is that conformity always comes first. Only once we have identified a dataset as nonconform, we refer to the z-statistic.

23
Regarding the usage of the Excel file, here are a few technical details: make sure that your dataset comprises numbers higher than ten or lower than minus ten. Your dataset should be large. The exact number of samples depends on your required confidence level, such as 90%, 95%, and 99%.

 

As a rule of thumb, aim for a dataset with at least 1000 variables. And last but not least, be careful when you copy and paste data. You want to avoid mixing datasets. As a general rule, I would recommend keeping the original Excel file and save new tests under a different name.

24
At this stage, you might be wondering where Benford's Law works best. In general, Benford's Law works best with financial numbers such as invoice fraud, reimbursement fraud, accounting fraud, or tax fraud.


26
If you’re in doubt whether Benford’s Law is applicable to your case, think of the following: does a person have to come with those numbers? Does a person have to enter those numbers? If the answer is yes, then Benford’s Law is likely to work.

27
As we give away the Excel training, you might be wondering how we make a living. The answer is the following: if you are happy with the free version of BenfordAnalytics for Excel, great. However, some companies need an automated process that works in real-time. Automation means scanning thousands of invoices in real-time, which is hard to achieve with Excel.

In this case, we offer our data science skills for a consulting fee. For example, our real-time solution could have caught the Facebook fraud much earlier. Facebook could have saved around 90 million dollars.

28
We are now at the end of this video tutorial. I hope you learned a lot. I suggest you go ahead now and download the Excel file. If you have a question, feel free to drop me a line at franco at BenfordAnalytics dot com. Good luck!