Franco Arda

Breast Cancer Prediction with Machine Learning in Tableau using Python and Scikit-Learn.

On the Internet, I've seen many attempts to implement a Machine Learning algorithm in Tableau. Most of them are simply wrong.

One example is an attempt to implement the Logistic Regression algorithm without normalizing the underlying data. The person got a prediction accuracy of 47% (worse than guessing!) and concluded that the algorithm did not fit the problem.

That hurts my heart. Logistic Regression is an incredibly powerful classification algorithm. Its "beauty" can probably only be appreciated by programmers who have implemented a lot of Machine Learning algorithms. Why? If Logistic Regression achieves a satisfactorily high accuracy, it's incredibly robust. In Machine Learning lingo, this is called low variance.

Note that this is a bare-minimum Machine Learning model. The focus is on the integration with Tableau in production, not on the nitty-gritty details of building the algorithm:

Breast cancer dataset (source: Kaggle)

With another algorithm (XGBoost), I found the three most important features for predicting breast cancer:

Many Machine Learning algorithms, such as Logistic Regression or Deep Learning models, require data normalization. If you want to dive deeper, check out Sebastian Raschka's blog post. Data normalization (min-max scaling) rescales the data from its original range so that all values lie between 0 and 1. It requires that you know, or are able to estimate, the minimum and maximum observable values (that's super important). The calculation is as follows: Y = (X - Min) / (Max - Min).
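As a quick illustration of the formula, here is a minimal sketch in plain Python. The function name `min_max_scale` and the sample values are made up for this example; they are not taken from the breast cancer dataset.

```python
def min_max_scale(x, x_min, x_max):
    """Rescale x with Y = (X - Min) / (Max - Min)."""
    if x_max == x_min:
        raise ValueError("Max and Min must differ, otherwise we divide by zero")
    return (x - x_min) / (x_max - x_min)


# Made-up sample values, for illustration only
values = [10.0, 15.0, 20.0, 30.0]
lo, hi = min(values), max(values)

scaled = [min_max_scale(v, lo, hi) for v in values]
print(scaled)  # [0.0, 0.25, 0.5, 1.0]

# A value outside the known range lands outside [0, 1]
print(min_max_scale(35.0, lo, hi))  # 1.25
```

The last line shows why knowing (or estimating) the true minimum and maximum matters: a value larger than the assumed maximum is scaled to more than 1.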

Adding MinMax scaling in Python code is easy. Just make sure you avoid data leakage: fit the scaler on the training data only, then apply that same fitted scaler to the test data.
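One common way to avoid that leakage with scikit-learn is sketched below. It is not the original notebook code. It assumes scikit-learn's built-in copy of the Wisconsin breast cancer data (`load_breast_cancer`) as a stand-in for the Kaggle file, uses all features instead of three, and the split size and `random_state` are arbitrary choices.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Assumption: scikit-learn's bundled dataset stands in for the Kaggle CSV
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Learn Min and Max from the training data only ...
scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)

# ... and reuse them for the test data (transform, not fit_transform)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000)
model.fit(X_train_scaled, y_train)

accuracy = model.score(X_test_scaled, y_test)
print(f"Test accuracy: {accuracy:.3f}")
```

Because the test data is scaled with the training Min and Max, some scaled test values can fall slightly below 0 or above 1. That is expected.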

The tricky part with MinMax scaling comes in production, when we have to predict on new input values: they must be scaled with the same min and max as the training data (see last Python code).

If you're like me and are more Data Scientist than Software Developer, the following part is hard. We have to get the Python function right: apply the min and max for MinMax scaling, pull the correct prediction, and deliver it to Tableau in production. This is really tricky, but with it, we have an end-to-end Machine Learning solution.
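A minimal sketch of such a prediction function might look like the following. It is an illustration under assumptions, not the notebook's code: the three features are placeholders picked from scikit-learn's bundled dataset (not necessarily the three found with XGBoost), the function name `predict_malignant_probability` is hypothetical, the evaluation split is omitted for brevity, and clipping out-of-range inputs is just one possible design choice.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

data = load_breast_cancer()

# Placeholder features, chosen for illustration only
feature_names = ["mean radius", "mean texture", "mean concave points"]
cols = [list(data.feature_names).index(name) for name in feature_names]
X = data.data[:, cols]
y = data.target  # in scikit-learn's copy: 0 = malignant, 1 = benign

# Store the training Min and Max so production inputs can reuse them
train_min = X.min(axis=0)
train_max = X.max(axis=0)

model = LogisticRegression(max_iter=1000)
model.fit((X - train_min) / (train_max - train_min), y)


def predict_malignant_probability(radius, texture, concave_points):
    """Scale one new observation with the stored Min/Max and score it."""
    new = np.array([[radius, texture, concave_points]], dtype=float)
    scaled = (new - train_min) / (train_max - train_min)
    # Assumption: clip inputs that fall outside the training range
    scaled = np.clip(scaled, 0.0, 1.0)
    # Pick the probability column that belongs to class 0 (malignant)
    malignant_index = list(model.classes_).index(0)
    return float(model.predict_proba(scaled)[0, malignant_index])


print(predict_malignant_probability(14.0, 19.0, 0.05))
```

The point to get right is the class index: `predict_proba` returns one column per class, so the function looks up which column is "malignant" instead of assuming it.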

This is my code to call the Python function in Tableau using TabPy:
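For readers without access to the original code, here is a hedged sketch of the general shape. TabPy passes each argument to the Python function as a list (one entry per row) and expects a list of the same length back. Everything named here is hypothetical: the function name `predict_cancer`, the server URL, the Tableau parameter names, and the model numbers, which are unfitted placeholders standing in for a trained model.

```python
import math

# Placeholder numbers standing in for a trained model: NOT fitted values
TRAIN_MIN = [0.0, 0.0, 0.0]
TRAIN_MAX = [10.0, 10.0, 10.0]
WEIGHTS = [1.0, 1.0, 1.0]
INTERCEPT = -1.5


def score_one(values):
    """Min-max scale one row, then apply a logistic function."""
    scaled = [(v - lo) / (hi - lo) for v, lo, hi in zip(values, TRAIN_MIN, TRAIN_MAX)]
    z = INTERCEPT + sum(w * s for w, s in zip(WEIGHTS, scaled))
    return 1.0 / (1.0 + math.exp(-z))


def predict_cancer(feature_1, feature_2, feature_3):
    """TabPy-style entry point: lists in, list of the same length out."""
    return [score_one(row) for row in zip(feature_1, feature_2, feature_3)]


def deploy_to_tabpy(url="http://localhost:9004/"):
    """Publish the function; needs the tabpy package and a running TabPy server."""
    from tabpy.tabpy_tools.client import Client

    client = Client(url)
    client.deploy("predict_cancer", predict_cancer,
                  "Returns a probability per row", override=True)


# A Tableau calculated field could then query the deployed function, e.g.:
#   SCRIPT_REAL("return tabpy.query('predict_cancer', _arg1, _arg2, _arg3)['response']",
#               [Feature 1], [Feature 2], [Feature 3])
# where [Feature 1..3] are hypothetical Tableau Parameters.

print(predict_cancer([5.0, 10.0], [5.0, 10.0], [5.0, 10.0]))
```

`deploy_to_tabpy` is deliberately not called here, so the block runs without a TabPy server; the local call at the end only checks the list-in, list-out contract.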

The scatterplot helps immensely in creating an interactive Tableau Dashboard:

With Tableau Parameters, we can calculate the breast cancer probability on the fly. There are at least two different ways of incorporating the Parameters.

This is the power of combining Tableau and Python: we can create end-to-end Machine Learning models for some problems. The underlying data should be mostly numeric or categorical. Examples:
- Cancer Prediction
- Customer Churn Prediction
- Employee Churn Prediction
- Predictive Maintenance
- Fraud Detection
- ...

The complete Jupyter Notebook can be found in my GitHub.


Franco Arda, Frankfurt am Main (Germany), franco.arda@hotmail.com