Forecasting in Tableau (without R/Python)

Tableau's built-in "simple" exponential smoothing for forecasting is a powerful alternative to more advanced models. In my personal opinion, we can probably cover 80% or more of use cases without falling back on more complicated models such as ARIMA or Facebook's Prophet.

The MASE (Mean Absolute Scaled Error) is one of the most important indicators for our forecasting model. We can think of MASE as the equivalent of R-squared for regression models. A MASE value of 1.00 means that our model is no better than simply taking the last value as the forecast (a.k.a. the naive approach to forecasting).

A MASE value of 0.50 means that our forecast error is half that of the naive approach; in other words, we doubled the prediction accuracy. Tableau rates such predictive power as "OK."
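To make the comparison with the naive approach concrete, here is a minimal Python sketch of the idea behind MASE. It is a simplified in-sample version: the model's mean absolute error is divided by the mean absolute error of the "last value" forecast. The function name `mase` and the argument layout are assumptions for illustration; Tableau computes its own value internally and may differ in detail.

```python
def mase(actual, forecast):
    """Simplified in-sample MASE.

    forecast[t] is the model's prediction for actual[t]. The first
    observation is skipped because the naive forecast (previous value)
    does not exist for it.
    """
    if len(actual) != len(forecast) or len(actual) < 2:
        raise ValueError("need two equally long series with at least 2 points")

    # Mean absolute error of the model
    model_errors = [abs(a - f) for a, f in zip(actual[1:], forecast[1:])]
    model_mae = sum(model_errors) / len(model_errors)

    # Mean absolute error of the naive forecast (previous value)
    naive_errors = [abs(actual[t] - actual[t - 1]) for t in range(1, len(actual))]
    naive_mae = sum(naive_errors) / len(naive_errors)
    if naive_mae == 0:
        raise ValueError("naive error is zero; MASE is undefined for a flat series")

    return model_mae / naive_mae
```

A forecast that simply repeats the previous value gives 1.0 by construction, and a forecast whose errors are half as large gives 0.5.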

Resources:
- "Forecasting: Principles and Practice" (website) is probably the most well-known book on the topic. Requires R.
- "Forecasting and Time Series Analysis in Tableau" (Udemy) is an amazing video course that uses only a bit of R.

What percentage of customers reordered?

Several advanced analytics techniques (e.g., customer order frequency, cohort analysis, or customer acquisition) are relatively simple to implement in Tableau compared to R or Python. One of them is customer order frequency, which can be useful in marketing or promotions.

For example, we want to know how many customers ordered a second time, split by gender. In Tableau lingo, this requires a simple LOD (Level of Detail) expression combined with bins.

Visually, this analysis is not aesthetically pleasing. However, the content can have a massive impact. Our analysis uncovered that only 24% of male customers and only 25% of female customers reordered. Again, this analysis would be quite hard to accomplish in R or Python (unless you do it every day).
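For readers who want to cross-check such a number outside Tableau, the logic can be sketched in a few lines of Python: count orders per customer (the part an LOD expression would handle), then compute the share of customers with two or more orders per group. The function name and the `(customer_id, group)` input layout are assumptions for illustration, not the structure of the data set used above.

```python
from collections import Counter, defaultdict


def reorder_rate_by_group(orders):
    """Share of customers with at least two orders, per group.

    orders: iterable of (customer_id, group) tuples, one tuple per order.
    """
    order_counts = Counter()
    group_of = {}
    for customer_id, group in orders:
        order_counts[customer_id] += 1
        group_of[customer_id] = group

    customers = defaultdict(int)   # customers per group
    reorderers = defaultdict(int)  # customers with 2+ orders per group
    for customer_id, n_orders in order_counts.items():
        group = group_of[customer_id]
        customers[group] += 1
        if n_orders >= 2:
            reorderers[group] += 1

    return {group: reorderers[group] / customers[group] for group in customers}
```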

Resources:
- "Advanced Tableau for Business Analytics" (Udemy) is an amazing course with techniques rarely seen anywhere else.
- "Tableau Certified Professional" (website) is probably the first online course to teach this material.

Machine Learning directly in Tableau

To implement a Machine Learning algorithm (e.g., Logistic Regression) in Tableau, one possible technique is to use TabPy, an external API. However, many companies don't allow installing third-party libraries on their Tableau server. Those policies are understandable. One workaround is to "convert" the prediction directly into a Tableau Calculated Field.

For example, the following Calculated Field predicts the probability that a customer churns (i.e., leaves us as a customer).

Similarly to linear regression, the coefficients from a logistic regression can be extracted and converted into a Calculated Field. Below is an example of the intercept and the coefficients from a logistic regression (Python).
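As a rough sketch of the conversion step: once the intercept and coefficients are known, the prediction is just the logistic function applied to a weighted sum, and that expression can be written out as a formula string to paste into a Calculated Field. The function names, field names, and coefficient values below are hypothetical, not taken from the model above; `EXP` is Tableau's exponential function.

```python
import math


def churn_probability(intercept, coefficients, features):
    """Logistic regression prediction: 1 / (1 + e^-(b0 + sum(b_i * x_i)))."""
    z = intercept + sum(coef * features[name] for name, coef in coefficients.items())
    return 1.0 / (1.0 + math.exp(-z))


def to_tableau_formula(intercept, coefficients):
    """Write the same expression as text for a Tableau Calculated Field."""
    terms = " + ".join(f"({coef} * [{name}])" for name, coef in coefficients.items())
    return f"1 / (1 + EXP(-({intercept} + {terms})))"


# Hypothetical intercept and coefficients, for illustration only
example_formula = to_tableau_formula(-1.2, {"Tenure": -0.8, "Monthly Charges": 1.5})
```

The features passed to `churn_probability` must be scaled the same way as the training data, which is why normalization matters in the next step.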

The huge advantage of this technique? No need for any external code such as TabPy. Additionally, using a machine learning algorithm in production is not that simple. One nasty challenge is that you have to normalize new data with the same parameters that were used for the training data. With Python, one solution is to pickle the fitted scaler.

With this proposed technique, we "only" need to normalize the data directly in Tableau. In this case, we are normalizing the data with the MinMax scaler (there are different techniques though).

Below is the MinMax scaling in Tableau which brings each measure to a range of 0 - 1:
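The arithmetic behind MinMax scaling is short enough to sketch in Python as well. The important detail is that the minimum and maximum come from the training data, so new values are scaled consistently with the model. The function and parameter names are assumptions for illustration.

```python
def min_max_scale(value, train_min, train_max):
    """Scale a value to the 0-1 range using the training minimum and maximum.

    New values outside the training range fall below 0 or above 1,
    which is a hint that the model is being asked to extrapolate.
    """
    if train_max == train_min:
        raise ValueError("training min and max are equal; cannot scale")
    return (value - train_min) / (train_max - train_min)
```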

How to add a Multiple-Linear Regression in Tableau

In Tableau, adding a simple linear regression (a.k.a. trend line) with two variables is easy to implement and visualize. Adding a multiple linear regression with more than two variables requires a bit of help from R or Python. In my example below, we extract the intercept and coefficients from R, implement them in a Calculated Field, and link them with Parameters.

Of course, visualizing a multiple linear regression is almost impossible. However, we can use multiple linear regression to achieve advanced calculations such as expected returns from an advertising campaign or dynamic pricing. Some call this prediction. However, unlike forecasting, with multiple linear regression, we should only interpolate and not extrapolate. In other words, with multiple linear regression, we can only stay within the range in which we trained our algorithm.
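The "interpolate, don't extrapolate" rule can be expressed as a small guard around the prediction. The sketch below assumes the intercept, coefficients, and training ranges have already been extracted from a fitted model; all names are hypothetical.

```python
def predict(intercept, coefficients, inputs, training_ranges):
    """Multiple linear regression prediction: b0 + sum(b_i * x_i).

    Raises ValueError when an input lies outside the range seen during
    training, because the model should only be used for interpolation.
    """
    total = intercept
    for name, coef in coefficients.items():
        value = inputs[name]
        low, high = training_ranges[name]
        if not low <= value <= high:
            raise ValueError(
                f"{name}={value} is outside the training range [{low}, {high}]"
            )
        total += coef * value
    return total
```

In Tableau, the same effect can be approximated by restricting the allowable range of each Parameter to the range of the training data.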

Hypothesis Testing: Are we promoting more men than women?

Let's say human resources approaches us. They want to know whether our company favors men in promotions. Over the last 12 months, they have promoted 117 women and 203 men. Is the result statistically significant? To answer this question, we need to run a hypothesis test.

The classical approach would be to calculate the p-value based on a t-test. In "modern statistics", we simulate the distribution under the null hypothesis by using a Monte Carlo simulation technique called a "permutation-based hypothesis test." For this test, I'm using the amazing R package infer. To anyone interested in learning statistical inference with R, I can highly recommend the book "Statistical Inference via Data Science." First, we set a significance level of 0.05.

With a permutation-based hypothesis test, we shuffle the two outcomes (promotions of men and women) as if there were no difference. Then, based on the histogram, we check how rare the observed difference in promotions between men and women is. In this case, we can see from the histogram that a difference of 3% in promotions is not statistically significant. In other words, statistically, we cannot find evidence that we are favoring men over women in promotions.
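The shuffling idea can be sketched in plain Python (the analysis above uses the R package infer; this is only an illustration of the same principle). Each employee is represented as 1 (promoted) or 0 (not promoted). The article does not state the headcounts per gender, so the function is shown without the real data, and its name and defaults are assumptions.

```python
import random


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=42):
    """Two-sided permutation test for a difference in proportions.

    group_a, group_b: lists of 0/1 outcomes (1 = promoted).
    Returns the share of shuffles whose absolute difference is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    n_a = len(group_a)
    observed = abs(sum(group_a) / n_a - sum(group_b) / len(group_b))

    pooled = list(group_a) + list(group_b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        # Shuffle the labels as if gender made no difference
        rng.shuffle(pooled)
        shuffled_a, shuffled_b = pooled[:n_a], pooled[n_a:]
        diff = abs(sum(shuffled_a) / n_a - sum(shuffled_b) / len(shuffled_b))
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1

    return at_least_as_extreme / n_permutations
```

If the returned p-value is above the chosen significance level (0.05 here), the observed difference is not statistically significant.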

Empowering employees to make data-driven decisions.

Dual-axis and combination charts open a wide range of possibilities for mixing mark types and levels of detail for unique insights. From a Data Science perspective, those visualizations are simple. However, visualizations that are used daily are often simple, in my experience. If they were complicated, fewer people would probably use them.

For example, in the visual below, the grey bars represent the total sales while the lines represent the sales for each category. It's immediately visible that the Technology category is currently losing some momentum. Maybe it's just a seasonality effect? Maybe there's more behind it ...

Info icons

Info icons in Tableau are particularly useful for providing detailed complementary information within a dashboard. Info icons are typically created using a combination of custom shapes and tooltips. How it works: whenever a user hovers over the info icon, a pop-up (tooltip) appears.

Resources:
- "Advanced Tableau for Business Analytics" (Udemy) is an amazing course and covers info icons and many more advanced techniques.

Smart design: think like a designer

Well-designed data visualization - like well-designed objects - is easy to interpret and understand. One theory is that readers take in information on a dashboard in a zigzag "z" pattern, at least in the Western world.

I personally have never been into design. "If I see the key number, that's all I need," was my thinking. Time taught me another lesson though. Users pay a lot of attention to design, sometimes down to the pixel. In other words, content alone is often not enough. I learned it the hard way. Nowadays, I spend a considerable part of my energy on the layout, low cognitive load, colors, location (x and y-axis), and pixels.

After a while, you don't just have your favorite color, but your favorite color is #5e7986 as a background with a white font.

Basic Monte Carlo Simulation in R

A beautiful aspect of probability is that it is often possible to study problems via simulation. The basic function and its arguments are:

sample(n, k, replace = TRUE, prob = NULL)

n = the numbers to sample from.
k = the number of simulations (draws).
replace = TRUE if each selected number is put back before the next draw, otherwise FALSE.
prob = the probability for each number in n, otherwise NULL (equal probabilities).

To some, the function with its arguments might look simple. However, Monte Carlo simulations can quickly get insanely complicated.

The following is a simple example: we flip a fair coin 10 times:

We can see that the observed mean is 0.70. If we increase the number of simulations (k) to 1,000, we get closer to the expected value of 0.50. In fact, with 1,000 simulations, we get 0.56.
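For readers who prefer Python, the coin-flip experiment can be sketched with the standard library; `random.choices` draws with replacement, much like `sample` with `replace = TRUE`. The function name and the 0/1 coding of the coin sides are assumptions, and the exact results depend on the random seed, so they will not match the numbers above.

```python
import random


def mean_of_coin_flips(k, seed=None):
    """Flip a fair coin k times (coded 0 and 1) and return the mean."""
    rng = random.Random(seed)
    flips = rng.choices([0, 1], k=k)  # sampling with replacement
    return sum(flips) / k


few_flips = mean_of_coin_flips(10, seed=1)        # can be far from 0.5
many_flips = mean_of_coin_flips(100_000, seed=1)  # should be close to 0.5
```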

In my doctoral thesis, I used Monte Carlo simulations for sample size determination and permutation-based hypothesis tests. Learning Monte Carlo simulations can be tricky; most books are full of math combined with little or no code. If you're an applied Data Scientist, I would ignore books without code. Why? You might lose too much valuable time with math. One of my all-time favorite Data Science books, "Introduction to Data Science: Data Analysis and Prediction Algorithms with R" by Rafael A. Irizarry, covers Monte Carlo simulations in an extremely accessible way using R. Don't be deterred by the few reviews on Amazon: the online version of the book has over 500,000 students (Harvard's Data Science in R).

The 4 most important visualization types?

With the following four visualization types, we probably cover 80% of the visualizations. Those are:

1. BAR CHART (categorical vs. numerical)
2. SCATTER PLOT (numerical vs. numerical)
3. HISTOGRAM (numerical)
4. LINE CHART (numerical vs. time)