
Outlier Detection: Moving Percentile based on Bins 

 

Let's say you launch rockets, and you want to detect outliers (99th percentile) based on bins. Each bin corresponds to 10 kilometers (X-axis).

 

Within each bin, you want to detect outliers on the Y-axis. This is not so simple because the traditional Tableau method WINDOW_PERCENTILE() calculates a rolling percentile (e.g., always for the last 10 kilometers). This results in erratic outliers while a moving percentile based on bins does not.
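One possible (hedged) sketch of the bin-based approach, assuming a hypothetical bin field [Distance (bin)] and a measure [Y], and assuming your data source supports PERCENTILE inside LOD expressions (extracts do):

// 99th percentile of Y per 10-km bin (hypothetical field names)
{ FIXED [Distance (bin)] : PERCENTILE([Y], 0.99) }

// Outlier flag: compare each row's Y against its bin's 99th percentile (the calc above, saved as [99th Percentile per Bin])
[Y] > [99th Percentile per Bin]

Because the percentile is FIXED per bin, it does not "roll" along the axis the way WINDOW_PERCENTILE() does.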

New: Predictive Modeling Functions in Tableau

"Smoothing is a very powerful technique designed to detect trends in the presence of noisy data in which the shape of the trend is unknown" (Harvard Data Science Prof. Irizarry). With version 2020.4, Tableau added some advanced statistical calculations that were previously only possible with R or Python.

Below, I've played around with the function MODEL_QUANTILE(), the new date axis range extension, and the new block comments:

In short: for historical data, we calculate the median, and for future data, we calculate the median based on linear regression (other options are regularized linear regression and Gaussian process regression).

To make predictions possible, we need to turn on "Infer Properties from Missing Values" and extend the date range ("Extend Date Range").
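As a hedged illustration (the field names come from the Superstore sample, not from my workbook), a median prediction per month might look like:

// Median (0.5 quantile) of sales, modeled against the month
MODEL_QUANTILE(0.5, SUM([Sales]), ATTR(DATETRUNC('month', [Order Date])))

// Same idea with Gaussian process regression instead of the default linear model
MODEL_QUANTILE("model=gp", 0.5, SUM([Sales]), ATTR(DATETRUNC('month', [Order Date])))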

 

R still offers a greater range of statistical tools. However, for visualizations in "production," Tableau is catching up quickly.

 

 

Forecasting in Tableau (without R/Python)

Tableau's built-in "simple" exponential smoothing for forecasting is a powerful alternative to more advanced models. In my personal opinion, it can probably cover 80% or more of use cases without falling back on more complicated models such as ARIMA or Facebook's Prophet.

The MASE (Mean Absolute Scaled Error) is one of the most important indicators for our forecasting model. We can think of MASE as the rough equivalent of R-squared for regression models. A MASE value of 1.00 means that our model is no better than simply taking the last value as the forecast (a.k.a., the naive approach to forecasting).

A MASE value of 0.50 means that our forecast error is half that of the naive approach, i.e., we roughly doubled the prediction accuracy. Tableau labels such predictive power as "OK."
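Below is a minimal R sketch of how MASE is computed, using hypothetical numbers (the scaling denominator is the in-sample error of the naive forecast):

# Hypothetical actuals, model forecasts, and training series
actual   <- c(102, 110, 108, 115, 120, 118)
forecast <- c(100, 108, 111, 113, 119, 121)
train    <- c(90, 95, 93, 98, 101, 100)

mae_model <- mean(abs(actual - forecast))   # model's mean absolute error
mae_naive <- mean(abs(diff(train)))         # naive one-step-ahead error on the training data

mase <- mae_model / mae_naive
mase                                        # < 1 means better than the naive forecast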

Resources:
- "Forecasting: Principles and Practice" (website) is probably the most well-known book on the topic. Requires R.
- "Forecasting and Time Series Analysis in Tableau" (Udemy) is an amazing video course that uses only a bit of R.

Machine Learning directly in Tableau via R or Python

To implement a machine learning algorithm (e.g., logistic regression) in Tableau, one possible technique is to use TabPy, Tableau's external Python server. However, many companies don't allow installing third-party services on their Tableau Server, and those policies are understandable. An alternative is to "convert" the prediction directly into a Tableau Calculated Field.

For example, the following Calculated Field predicts the probability that a customer churns (i.e., leaves us as a customer).

Similar to linear regression, the intercept and coefficients from a logistic regression can be extracted and converted into a Calculated Field. Below is an example using the intercept and coefficients from a logistic regression fitted in Python.
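A hedged sketch of such a Calculated Field, with a made-up intercept and coefficients and hypothetical (already scaled) fields:

// Churn probability = sigmoid(intercept + b1*x1 + b2*x2); numbers exported from Python
1 / (1 + EXP(-(-1.23 + 2.97 * [Tenure (scaled)] - 1.48 * [Monthly Charges (scaled)])))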

The huge advantage of this technique? No need for any external code such as TabPy. Additionally, using a machine learning algorithm in production is not that simple. One nasty challenge is that you have to normalize new data with the same parameters that were fitted on the training data. With Python, one solution is to pickle the fitted scaler.

With this proposed technique, we "only" need to normalize the data directly in Tableau. In this case, we are normalizing the data with the MinMax scaler (there are different techniques though). 

Below is the MinMax scaling in Tableau which brings each measure to a range of 0 - 1:
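A minimal sketch of that MinMax calculation, assuming a hypothetical measure [Tenure] (the table-scoped LODs provide the overall minimum and maximum):

// MinMax scaling: (x - min) / (max - min), evaluated per row
([Tenure] - { MIN([Tenure]) }) / ({ MAX([Tenure]) } - { MIN([Tenure]) })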

How to add a Multiple-Linear Regression in Tableau

In Tableau, adding a simple linear regression (a.k.a., trend line) with two variables is easy to implement and visualize. Adding a multiple linear regression with more than two variables requires a bit of help from R or Python. In my example below, we extract the intercept and coefficients from R, implement them in a Calculated Field, and link them with Parameters.
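A hedged sketch of such a Calculated Field, assuming hypothetical parameters ([Intercept], [Coef TV], ...) filled with the values exported from R's lm():

// Predicted sales = intercept + b1*TV + b2*Radio + b3*Print (hypothetical names)
[Intercept]
+ [Coef TV]    * [TV Spend]
+ [Coef Radio] * [Radio Spend]
+ [Coef Print] * [Print Spend]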

Of course, visualizing a multiple linear regression with many predictors is almost impossible. However, we can use multiple linear regression for advanced calculations such as expected returns from an advertising campaign or dynamic pricing. Some call this prediction. However, unlike forecasting, with multiple linear regression we should only interpolate and not extrapolate. In other words, we should stay within the range on which we trained our model.

Hypothesis Testing: Are we promoting more men than women?

Let's say human resources approaches us. They want to know whether our company favors men in promotions. Over the last 12 months, the company has promoted 117 women and 203 men. Is the difference statistically significant? To answer this question, we need to run a hypothesis test.

The classical approach would be to calculate the p-value based on a t-test. In "modern statistics," we simulate the null distribution with a Monte Carlo technique called a "permutation-based hypothesis test." For this test, I'm using the amazing R package infer. For anyone interested in learning statistical inference with R, I can highly recommend the book "Statistical Inference via Data Science." First, we set a significance level of 0.05.

With a permutation-based hypothesis test, we shuffle the two groups (promotions of men and women) as if there were no difference. Then, based on the histogram of the null distribution, we check how rare the observed difference in promotions between men and women is. In this case, we can see from the histogram that a difference of 3% in promotions is not statistically significant. In other words, statistically, we cannot find proof that we are favoring men over women in promotions.
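A minimal sketch with the infer package, assuming a hypothetical data frame promo with a column gender ("men"/"women") and a column promoted ("yes"/"no"); this is not the author's exact code:

library(dplyr)
library(infer)

# Observed difference in promotion rates (men minus women)
obs_diff <- promo %>%
  specify(promoted ~ gender, success = "yes") %>%
  calculate(stat = "diff in props", order = c("men", "women"))

# Null distribution: shuffle (permute) the group labels 1,000 times
null_dist <- promo %>%
  specify(promoted ~ gender, success = "yes") %>%
  hypothesize(null = "independence") %>%
  generate(reps = 1000, type = "permute") %>%
  calculate(stat = "diff in props", order = c("men", "women"))

# How rare is the observed difference under the null?
get_p_value(null_dist, obs_stat = obs_diff, direction = "two-sided")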

Empowering employees to make data-driven decisions.

Dual-axis and combination charts open a wide range of possibilities for mixing mark types and levels of detail for unique insights. From a Data Science perspective, those visualizations are simple. However, visualizations that are used daily are often simple, in my experience. If they were complicated, probably fewer people would use them.

For example, in the visual below, the grey bar chart represents total sales while the line charts represent the sales for each category. It's immediately visible that the Technology category is currently losing some momentum. Maybe it's just a seasonality effect? Maybe there's more behind it ...

Info icons

Info icons in Tableau are particularly useful for providing detailed complementary information within a dashboard. They are typically created using a combination of custom shapes and tooltips. How it works: whenever a user hovers over the info icon, a pop-up (tooltip) appears.

Resources:
- "Advanced Tableau for Business Analytics" (Udemy) is an amazing course and covers info icons and many more advanced techniques.

Smart design: think like a designer

Well-designed data visualization - like a well-designed object - is easy to interpret and understand. One theory is the zigzag "Z" pattern in which we take in information on a dashboard, at least in the Western world.

I personally have never been into design. "If I see the key number, that's all I need" was my thinking. Time taught me another lesson, though. Users pay a lot of attention to sometimes pixel-perfect design. In other words, content alone is often not enough. I learned it the hard way. Nowadays, I spend a considerable part of my energy on layout, low cognitive load, colors, location (x- and y-axis), and pixels.

After a while, you don't just have your favorite color, but your favorite color is #5e7986 as a background with a white font.

Basic Monte Carlo Simulation in R

A beautiful aspect of probability is that it is often possible to study problems via simulation. The basic R function and its arguments are:

                                                    sample(x, size, replace = FALSE, prob = NULL)

x = the vector of values to sample from.

size = the number of draws (simulations).

replace = TRUE to put each drawn value back before the next draw; the default is FALSE.
prob = an optional vector of probability weights for the elements of x; the default is NULL (equal weights).

"MONTE CARLO SIMULATION IS A TECHNIQUE USED TO APPROXIMATE THE PROBABILITY OF AN EVENT BY RUNNING THE SAME SIMULATION MULTIPLE TIMES AND AVERAGING THE RESULTS". Prof. John V. Guttag

To some, the function with its arguments might look simple. However, Monte Carlo simulations can quickly get insanely complicated.

The following is a simple example: we flip a fair coin 10 times:
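A minimal R sketch (the exact numbers depend on the random seed, so they will not match the figures below exactly):

set.seed(2021)                                        # hypothetical seed, for reproducibility

flips <- sample(c(0, 1), size = 10, replace = TRUE)   # 0 = tails, 1 = heads
mean(flips)                                           # proportion of heads in 10 flips

flips_1k <- sample(c(0, 1), size = 1000, replace = TRUE)
mean(flips_1k)                                        # moves closer to the expected value of 0.50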

We can see that the observed mean is 0.70, not the expected value of 0.50. If we increase the number of draws to 1,000, we get closer to the expected value; in fact, with 1,000 draws, we get 0.56.

In my doctoral thesis, I used Monte Carlo simulations for sample size determination and permutation-based hypothesis tests. Learning Monte Carlo simulations can be tricky; most books are full of math combined with little or no code. If you're an applied Data Scientist, I would ignore books without code. Why? You might lose too much valuable time on the math. One of my all-time favorite Data Science books, Rafael A. Irizarry's "Introduction to Data Science: Data Analysis and Prediction Algorithms with R," covers Monte Carlo simulations in an extremely accessible way using R. Don't be deterred by the few reviews on Amazon; the online version of the book (Harvard's Data Science in R) has over 500,000 students.

The 4 most important visualization types?

With the following four visualization types, we probably cover 80% of the visualizations. Those are:

1. BAR CHART (categorical vs. numerical)

2. SCATTER PLOT (numerical vs. numerical)

3. HISTOGRAM (numerical)

4. LINE CHART (numerical vs. time)

Data-driven email alerts

I think email alerts are one of the most underrated tools on Tableau Server. As soon as we have a measure on the y-axis, we can create an email alert. For example, below I set an email alert for volatility > 32%. In other words, as soon as the volatility goes above 32%, I get an email.

Email alerts have many applications, such as alerting on sales or profits when a certain level is reached (high or low), or flagging results in a statistical analysis.

What type of ranking?

Tableau's table calculation RANK() offers several options:

- Competition (1, 2, 2, 4)

- Modified Competition (1, 3, 3, 4)

- Unique (1, 2, 3, 4)

And there is RANK_DENSE(), which is illustrated in the image on the right.

I personally don't recommend memorizing the different ranking versions. However, I highly recommend keeping the RANK() family in mind. The RANK_DENSE() function assigns identical values the identical rank, but it does not skip the next rank. For example, I had several values of "62" in a bin with a rank of #1; the bin "63" was rank #2. Without RANK_DENSE(), it would be impossible to visualize this ranking correctly.

How to read a parameterized LOD?

The LOD (Level of Detail) calculation below labels rows that fulfill a certain condition with an X:

If the MAD or the COUNT based on a certain aggregate fulfills its condition, the row is labeled with an X. As we know from formal logic, this is an OR operation, not to be confused with AND or NOR. Lastly, two parameters give the user control over the LOD.
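A simplified, hypothetical sketch of such a calculation (the author's version uses MAD; here a plain SUM stands in for it, and the field and parameter names are made up):

IF { FIXED [Product Name] : SUM([Sales]) } > [Sales Threshold]
   OR { FIXED [Product Name] : COUNTD([Order ID]) } > [Count Threshold]
THEN "X"
ELSE ""
END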

Addressing & Partitioning

With Tableau's table calculations, we can define the addressing fields to change the direction of the (table) calculation.

 

 

Partitioning (a.k.a., scope) defines the level or group at which table calculations are evaluated (e.g., cell, pane, or table). In other words, partitioning defines which data we use.

Addressing (a.k.a., direction) defines the direction in which the calculation moves (e.g., down, across, or across then down).

Percent Change from a Reference Date

With a reference date parameter, we can compare the performance of several categorical values within a pre-defined range. For example, we can compare the performance of Ford, Toyota Motor, and Tesla within a certain time range.

This advanced analytics function is often taught in courses for the preparation of the Tableau Certified Desktop Professional.

Percent change from a reference date can be used in different business settings. For example, a user wants to compare different sales regions or products based on a reference date. To make this work, we need to follow three steps:

I personally believe it's very hard to memorize those steps. Unlike in a traditional programming language such as R or Python, it's challenging to make them reproducible. My strategy is to keep a workbook with each step annotated. The example is taken from Kirill Eremenko's Certified Professional video training (source).

Common Baseline

To me, a common baseline is a visualization technique that can give powerful insights that no single statistical measure provides. In other words, a common baseline often shows off the power of visualization (which is not always the case).

Let's take a dataset from the Kickstarter project (Kaggle). Since not all projects start on the same day, we don't care about the date a project is launched, but rather how long it lasted. This creates our x-axis with the number of days. 

On the y-axis, we have the amounts raised for each individual project. Now we can gain potential insights from the behavior of Kickstart projects. 

Creating a common baseline is not so easy in Tableau, R, or Python. 

Source: Kirill Eremenko "Tableau Certified Desktop Professional."

Table Calculation Functions: WINDOW_AVG()

In this example, we're looking at salaries in different industries (source: Glassdoor). Analytically, we want to know whether an industry is above or below average. Because a user might exclude certain industries, we cannot hard-code the average. In other words, the average must be calculated based on the view (or scope).

Thus, we need a function that dynamically updates based on the view. In Tableau lingo, this is called a table calculation. In particular, the calculation in the image is based on WINDOW_AVG(), which is simply:

WINDOW_AVG(SUM([Median Base Pay]))

In summary, the table calculation WINDOW_AVG() allows us to calculate the average dynamically, based on the user's selections. We could not achieve this with a hard-coded average formula.
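For example, a hedged above/below flag built on the same table calculation:

// TRUE if an industry sits above the dynamic average of the current view
SUM([Median Base Pay]) > WINDOW_AVG(SUM([Median Base Pay]))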

International football results from 1872 - 2020

In team sports, the term home advantage refers to the benefit that the home team has over the visiting team. Thanks to Kaggle, we now have a dataset with over 40,000 international football results.

For the analysis, we ignore all games on neutral ground. Then we group the results into:

- Home win

- Home loss

- Home tie

In a rare instance, we are using a pie chart. From this simple analysis, we can see immediately that the home advantage seems to be real. In other words, the probability of losing is only around 25% if the corresponding team plays at home.

That's a pretty powerful insight gained from data.

Relationships vs. Joins

Relationships are part of Tableau's new dynamic data model interface, which automatically assigns join conditions between related tables without materializing those tables into a single data source.

 

The advantages? Relationships should (1) reduce the upfront data preparation work by utilizing flexible, relationship-based joins; (2) reduce the number of data sources pulled in to answer analytic questions; and, potentially most important, (3) ensure measures will not duplicate due to join clauses (a.k.a., join risk).


In other words, relationships (logical layer) keep tables separate but define a relationship between them, while joins (physical layer) physically combine tables into a single table.

Optimizing Relationship Queries: CARDINALITY

In mathematics, cardinality means the number of distinct elements in a set. For example, if we have a column "Country" containing the USA, Germany, and Switzerland, then the set (or column) contains 3 elements. High cardinality can become a nasty problem in Machine Learning and in visualizations.

In data modeling, the cardinality of the join between two tables is the numerical relationship between rows of one table and rows in the other. Tableau offers one-to-one, one-to-many, many-to-one, and many-to-many.

In general, I would not change the default settings unless you have deep knowledge of your database. The default setting is many-to-many; if "one" is chosen but the records are not actually unique, duplicates will occur.

Optimizing Relationship Queries: REFERENTIAL INTEGRITY

Referential integrity is a property of data stating that all its references are valid. It requires that if a value of one attribute of relation references a value of another attribute, then the referred value must exist (Wiki). 

Unless you really know what you're doing, I recommend keeping the default setting at "some records match." In other words, we don't assume referential integrity.

With the "all records match" option, if all values in selected fields have a match in both tables. Joins are optimized with Inner Joins, but records can be removed(missing due to unmatched values. In other words, "all records match" treats records as if there are no null records in matching fields.

Nested LOD's (Level of Detail)

As in traditional programming languages, we can nest several layers of LOD's to answer more complicated questions. There are a few things to look out for. For example, the inner expression inherits the dimensionality of the outer expression. The relationship between the nested expression and the filters is determined by the parent expression (e.g., INCLUDE and EXCLUDE are evaluated after the dimension filter).

The best way to read a nested LOD is from the inside out.

In my experience, a more common but simpler nested LOD is something like AVG({ FIXED [State], [Customer Name] : SUM([Sales]) }), where we can wrap almost anything "around" the LOD expression. An amazing course on Udemy, "Advanced Tableau - Level of Detail Expressions / LOD," goes meticulously into the details of LODs and nested LODs.
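Reading that example from the inside out, the comments below mark the evaluation order:

// 1. Inner FIXED LOD: total sales per state and customer
// 2. Outer AVG: average of those customer totals at the view's level of detail
AVG({ FIXED [State], [Customer Name] : SUM([Sales]) })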

"LOD planning"

The general idea of "LOD planning" is that we write a LOD (or any other complex function) and test the results first in a tabular dataset. The idea has been around for a while, but to my knowledge, Kirill Eremenko was the first person to give this concept a name.

As we can see, the function above gives us the average of the Week of Date.

Once we have tested the function and know it works properly, we can use it in visualizations. The reason for this cautious approach is that errors in a visualization can be hard to spot. In other words, code in Tableau is hard to "read," unlike in traditional programming languages such as Python.

Visualizing changes in ranking

When we want to visualize changes in ranking, slope and bump charts can be extremely useful. Slope charts are useful for comparing changes in ranking between two points (e.g., two dates). Bump charts, on the other hand, extend this concept across more than two points.

On the right, we can see a bump chart that shows the sales ranking for some US states.

 

Note the simplicity of the view that allows the user to easily see the degree of change as indicated by the angle of the slope.

For example, we can see that "Idaho" improved its ranking from #7 to #4.

The example is from the amazing book "Learning Tableau 2020."

Do we have seasonality?

Seasonality can be hard to visualize (see the trend line in the red box). But what is seasonality anyway? First, seasonality is generally caused by weather or human behavior. Second, seasonality is a characteristic of a time series in which the data experiences regular and predictable changes that recur yearly. Any predictable fluctuation or pattern that recurs or repeats over a one-year period is said to be seasonal.

A potential solution for visualizing seasonality is the cycle plot. For example, in the visualization on the right, we compare each individual month across a range of years.

In other words, for "September", we compare the sales numbers for September in 2014, 2015, 2016, and 2017. The average is displayed with an orange dotted line.

We can immediately see that September has the highest average sales. Other strong seasonal months are December and June.

Algorithmic complexity (Big O notation) in Tableau terms

For Data Scientists and data analysts, the first three kinds of algorithmic complexity below are probably all you need. Computer scientists call this Big O notation, which describes how an algorithm's runtime grows as the input size grows.

The following are approximations. Let's say you have 10 million rows and it takes your dashboard 10 seconds to load.

O(1)

Stands for constant time and is the fastest version (though unlikely in the real world). In our case, even with 100 million rows, your dashboard still needs only 10 seconds to render (i.e., no change, thus constant).

O(log(n))

Complexity grows logarithmically. Let's say you go from 10 million rows to 100 million rows; now your dashboard takes around 30 seconds to render. It's slower than constant but faster than linear (next).

O(n)

Complexity grows linearly. Loading 10 million rows takes 10 seconds and loading 100 million rows takes 100 seconds.
 

O(n^2)

Anything more complex than linear is probably not going to work in Tableau. For example, quadratic is most likely going to crash your dashboard.  

Bootstrapping

The bootstrap is a really cool statistical method that allows us to approximate the population with just a sample. Its inventor is one of the most influential statisticians ever: Bradley Efron.

The bootstrap is a building block of "modern hypothesis testing." With the bootstrap, rather than relying on a single static dataset, we use simulation. I show a simple example below; however, this can get extremely technical (e.g., A/B testing in R). The bootstrap belongs to the Monte Carlo family. Important to note: the bootstrap does not create new data. We simply resample from the original sample WITH REPLACEMENT.

 

Let's say you got the following conversion rates today: 0.02, 0.03, 0.04, 0.05, 0.02, 0.04, 0.04, 0.01, 0.12, and 0.15. You want to know, at a 95% confidence level, what conversion range you can expect. Based on the Central Limit Theorem (CLT), your sample size (10 observations) is most likely too small. Additionally, we can't calculate a meaningful confidence interval directly from the small sample (the mean is 5.2% and the standard deviation is 4.5%, which would result in a negative lower bound!).

Below is a very simple Monte Carlo simulation that randomly draws 1,000 bootstrap samples from the original sample with replacement. At a 95% confidence level (1.96 z-score), we get a meaningful confidence interval of 3.7% - 6.5%. In other words, at a 95% confidence level, we can expect the conversion rate to be within 3.7% - 6.5%.
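A minimal R sketch of the idea (the exact interval depends on the seed and on whether you use the percentile or the z-score method, so it will not reproduce the 3.7% - 6.5% above exactly):

set.seed(1)
rates <- c(0.02, 0.03, 0.04, 0.05, 0.02, 0.04, 0.04, 0.01, 0.12, 0.15)

# 1,000 bootstrap samples: resample the 10 rates WITH replacement, take the mean of each
boot_means <- replicate(1000, mean(sample(rates, replace = TRUE)))

# Percentile bootstrap confidence interval
quantile(boot_means, probs = c(0.025, 0.975))

# Normal-approximation interval (mean +/- 1.96 * bootstrap standard error)
mean(rates) + c(-1.96, 1.96) * sd(boot_means)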

This is an area where R shines over Python. In my opinion, R dominates statistical methods because it offers the libraries AND the knowledge. In other words, for the bootstrap, there are great libraries with contributors who are willing to share their knowledge. The difficulty with the bootstrap lies less in the code and more in its application. Excellent sources are "Modern Data Science with R" and "Statistical Inference via Data Science"; the latter comes with a brilliant library called infer.

Accuracy of the Central Limit Theorem (CLT)

Textbook explanation:

The Central Limit Theorem (CLT to its friends) tells us that when the sample size is "large," the probability distribution of the sum of independent draws is approximately normal. The usual rule of thumb, found in most textbooks, is that the CLT is reasonably accurate if n > 30, unless the data is quite skewed. Because sampling is so important in statistics, the CLT is considered one of the most important mathematical insights in history.

Real-world:
- The samples need to be random + independent. This can be extremely hard (see vote prediction).

- If the population is normally distributed, the CLT can hold with as few as 10 - 100 observations.

- If the population is heavily skewed, it could take thousands of observations.

- How close to "normal" is normal? The Shapiro-Wilk test can help.

I love statistics. However, I feel that most statistics books focus on explaining statistics in words, without much concern for practical progress. For example, explaining the CLT with just text is futile to me. We need to formalize statistics with either math or code. I strongly support the latter.
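Since I prefer code, here is a minimal R sketch of the CLT in action on a skewed population (hypothetical data):

set.seed(7)
population <- rexp(1e5, rate = 1)     # heavily right-skewed population

# Distribution of sample means for n = 30
sample_means <- replicate(1000, mean(sample(population, size = 30)))

hist(sample_means)                    # roughly bell-shaped despite the skewed population
shapiro.test(sample_means)            # formal check of how close to "normal" we are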

Can we combine live + extract data in Tableau?

Challenge: you need your data live. However, you also need historical data in your Tableau dashboard. Problem: your historical data is really big (e.g., 20 GB, which is big if it has to fit in memory). One potential solution is to combine live and extract data. It's simply a matter of connecting to both: a live connection and a .hyper extract.

Tableau Hyper API

If your company deals with huge data - terabytes (1,000 GB) or petabytes (1,000 terabytes) - and you need to bring some of it into Tableau, the following might be a solution. For example, Databricks offers Hyperleaup, which moves data from a Delta Lake to Tableau. It is similar to using Tableau's Hyper API directly, which allows pushing .hyper files to Tableau Server.

Let's assume the created hyper file has a size of 100 GB and is updated daily (e.g., at 2:00 AM). As Tableau developers, we can use this "mother" hyper file to create "hyper children." For example, based on the 100 GB hyper file, we might create another hyper file with a size of 1 GB.

This mother-and-children hyper file architecture can be very effective and efficient with large datasets. With incremental refreshes, we could even display near real-time data (e.g., refreshed every hour). Additionally, as the hyper files and the dashboards usually live on the same server, this approach can alleviate some of the burden on the server.

Big data is a vague term that I'm trying to define here:

- SMALL DATA: We can do anything with it. For example, a 1 MB dataset in .csv. We can easily store it and work with it in memory.

- MEDIUM DATA: We can store it locally, but can't bring it all into memory. For example, a 20 GB dataset is probably too large for most computers' memory.

- BIG DATA: It's so large that we can't even store it on a regular computer. For example, 1 Terabyte (1,000 GB). 

Industry 4.0 - Predictive Maintenance in Tableau

One serious challenge for businesses is implementing Machine Learning in production. Tableau as a platform is almost unrivaled in giving access to data in production. For example, once in production, users can access dashboards from desktops, tablets, and even mobile phones. Dashboards can be easily modified for specific needs including data-driven email alerts. 

Unfortunately, implementing a classification algorithm (Logistic Regression, Decision Tree, Random Forest ...) is not straightforward. Below is an image from an Industry 4.0 algorithm implemented in Tableau using Python in Google Colab. "Risk" refers to the probability of machine failure. 

The upside of implementing a machine learning algorithm in Tableau is tremendous, though. For example, we can empower mechanics in the field with algorithmic predictions - a feat that was impossible a decade ago, let alone without a Ph.D. Today, mechanics armed only with their mobile devices can do things that only a Ph.D. could do a while back.

Cancel & Apply Button

The cancel and apply button somehow gets little attention in the Tableau community.

 

However, if you speak to Tableau professionals (e.g., Dustin), it's an important tool, in particular for reports.

In general, we can use apply buttons when there are more than 2 - 3 options. The apply button allows users to select the items they wish to keep or exclude before the view refreshes. Most often, those filters are set to context, which can enable a "drill-down."

 

Without the cancel and apply buttons, the view refreshes after every single selection, which can be really annoying.

 

And, if you have a lot of data or complex calculations, it can be a real performance killer since each refresh has to execute a new query.

Table-Scoped LOD (Level of Detail)

Sometimes we need to define a level of detail expression at the table level without using any scoping keywords. For example, in the following view we want to see each country's GDP divided by the total GDP. In the case of China (as of 2018), China contributes around 16% of the total GDP. However, as I'm drilling down on the Asia region (not visible in the image below), I can see that China's contribution is over 50%.

Fun fact: LODs depend on dimension filters, but table-scoped LODs do not. In the case above, my table-scoped LOD completely ignored my date filter, even though it's a dimension filter.
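A hedged sketch of the share-of-total calculation, assuming a hypothetical [GDP] measure (the table-scoped LOD { SUM([GDP]) } returns the grand total for every row, so it has to be wrapped in an aggregate such as MIN before dividing):

// Each country's share of total GDP
SUM([GDP]) / MIN({ SUM([GDP]) })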

Extracting email addresses with RegEx

Regular expressions (RegEx) use special characters to define search patterns and find/replace text. One example of such a use case is extracting the email address from the text. RegEx is available in many programming languages and whole books have been written on the subject. 

Unless you're using RegEx regularly, you probably don't want to learn it in detail. One nice trick is to go to www.regexr.com and test out our RegEx pattern (a tip from Dustin's awesome Advanced Tableau training). The image below shows regexr.com: (1) I paste in a sample text, (2) I test the RegEx pattern, and (3) I see the highlighted text (the correct match).

Permutations

A permutation is a mathematical term for "shuffling": taking a series of values and reordering them randomly. Below are all the ways we can choose two numbers from a list consisting of 1, 2, and 3:

Order matters with permutations: 3,1 is different from 1,3. In fact, permutation methods are another form of resampling, like the bootstrap. While the bootstrap involves resampling with replacement, permutation methods involve resampling without replacement (e.g., the permutation-based hypothesis test).
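A minimal R sketch of both ideas:

x <- 1:3

# All ordered pairs from 1, 2, 3 (order matters, no repetition): 6 permutations
pairs <- expand.grid(first = x, second = x)
pairs[pairs$first != pairs$second, ]

# A random permutation ("shuffle") of the whole vector: sampling WITHOUT replacement
sample(x)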

Boolean Algebra

I think that the study of Boolean algebra can help tremendously in thinking smarter. In essence: "The idea of Boolean algebra is to formalize a mathematical approach to logic." In Boolean algebra, values are either true or false (1 or 0).

There is a very important relationship between Boolean algebra and set theory. If we interpret A and B as sets, then we can think of "A and B" as the set of elements that are in A and in B, i.e., the intersection A ∩ B.

In Tableau, we have three Boolean operators:
- AND
- OR
- NOT

Because there are only two possible values (true and false), we can record all Boolean operations in a truth table. We can use Booleans for everything from simple visualizations (e.g., highlighting) to algorithms (e.g., decision trees).
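For reference, the table for the three operators:

A       B       A AND B    A OR B    NOT A
TRUE    TRUE    TRUE       TRUE      FALSE
TRUE    FALSE   FALSE      TRUE      FALSE
FALSE   TRUE    FALSE      TRUE      TRUE
FALSE   FALSE   FALSE      FALSE     TRUE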

Cannot mix aggregate ...

 

In Tableau, one of the most annoying errors is "Cannot mix aggregate and non-aggregate arguments with this function."

Sometimes, a potential solution is to add a Fixed LOD to get around the error message.

On the right is a potential fix where we add a Fixed LOD in order to make the calculation valid.
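A minimal sketch of the pattern (hypothetical fields; not necessarily the exact calculation in the image):

// Throws "Cannot mix aggregate and non-aggregate arguments":
IF [Sales] > AVG([Sales]) THEN "Above" ELSE "Below" END

// Potential fix: the FIXED LOD behaves like a row-level value, so the comparison is valid
IF [Sales] > { FIXED : AVG([Sales]) } THEN "Above" ELSE "Below" END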

Set theory

Set theory is a branch of mathematical logic that studies sets, which informally are collections of objects; the topic is often covered in discrete mathematics books.

In Tableau we have:
- Fixed sets

- Dynamic sets

- Combined sets

"Fixed sets" are kind of useless as they are hardcoded members. "Dynamic sets" are most commonly used with top-N or in combination with parameters.

Intellectually, I find "combined sets" the most interesting. For example, in the image on the right, we have a set with sales >5,000 and a set with profit >1,500. The "combined set" is highlighted in red, where observations fulfill the conditions of both sets.

In other words, the combined sets highlight observations which have sales >5,000 AND profit >1,500.

Functional programming in R (for Data Science)

The following example is from Hadley Wickham. In short, he argues that for loops aren't necessarily bad, but duplicated code can conceal important differences. 

On the left is a for loop. It's pretty hard to see the difference while with the functional approach (on the right), the difference is immediately visible.
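A minimal sketch of the idea from R for Data Science (hypothetical data frame):

library(purrr)

df <- data.frame(a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10))

# For loop: the intent (a median per column) is buried in bookkeeping
output <- vector("double", ncol(df))
for (i in seq_along(df)) {
  output[[i]] <- median(df[[i]])
}
output

# Functional approach: changing median to mean is immediately visible
map_dbl(df, median)
map_dbl(df, mean)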

Scatter plot:  country vs. country

Visually, this scatter plot with X vs. Y can be pretty useful in a drill-down.

Technically, it's "simply" a combined set.

Country Set 1 and Country Set 2 are filtered with a combined set (full join). The part I spent the most time on: the two sets must be added via "Show Set" and NOT "Show Filter." The X vs. Y approach does not work if they are used as filters.

Special thanks to Tableau Zen Master Toan Hoang from Tableau Magic.

Filter Menus

Filter menus enable users to show/hide filters on a dashboard. Using filter menus helps conserve dashboard space, enhances usability, and creates a more interactive user experience. We can use the Tableau default icons (pretty nice) or create our own designs.

Important: in the dashboard, we can only drag the filter to a FLOATING CONTAINER.

Dynamic Sorting (not to be confused with parameter sorting)

Dynamic sorting allows us to sort by any variable.

Attention: if we use ATTR(), we lose the dynamic sort feature.

What is ATTR()? My personal explanation ...

There are different use cases for ATTR(), and Tableau offers in-depth explanations. However, in my personal use, I most often use ATTR() as "turning a dimension into a measure (aggregated)."

The following example was used by Tableau Tim:

IF ISNULL(ATTR([City])) THEN NULL

ELSE SUM([Sales])

END

Without ATTR(), we would have gotten the "cannot mix aggregate and non-aggregate arguments" error from Tableau.

With ATTR(), we converted the dimension "City" into an aggregated measure. Thus, the function works.

Another example where we want to take the first two digits:

IF ATTR(LEFT(STR([Amount]), 2)) = "10" THEN ....

In this case, we converted the first two digits, which are a string, into an aggregated value.

The TOTAL() function:

The TOTAL() function can help us get totals. For example, TOTAL(AVG([Sales])) gives us the average based on the selected partitioning.

In my sample case, the average was calculated based on the category. Attention, the average for each sub-category would be different.

Tableau Tim offers an awesome video tutorial on this.

RANK(INDEX())

 

 

 

A great idea from Tableau Tim: let's say we always want to show the last 6 months. We can do this with RANK(INDEX()).
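A hedged sketch of one way to do it (not necessarily Tableau Tim's exact formula): compute the calculation along the month axis, drop it onto the Filters shelf, and keep "True."

// Keep the 6 most recent months
RANK(INDEX(), 'desc') <= 6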

 

Sloppy analysis and thinking ...

The visualization on the right was presented by a "Tableau expert." The expert argued that the linear regression is statistically significant because the p-value is extremely low.

Well, a better answer would have been: while the linear regression is statistically significant, it's significant at predicting ... NOTHING!

Why? The R-squared is at 3%. In other words, the linear regression explains only 3% of the variance in Y.

I'm annoyed because, in the data analysis world in particular, fast thinking is predominant.

 

A quick glance at the scatterplot would have been sufficient to see that the linear regression cannot predict anything. In general, you won't see this kind of sloppy thinking in the R community.

Nothing and zero are NOT the same

If we have 100 rows, 90 with a value of 5 and 10 with NULL (no value), and we take the average, then Tableau calculates the average based on the available data: (90 x 5) / 90 = 5. However, if we replace the NULLs with 0, we get an average of (90 x 5 + 10 x 0) / 100 = 4.5.

We can control this with Tableau's ZN() function, which converts NULL to zero. It all depends on what we want to achieve.
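A minimal sketch of the difference, assuming a [Sales] measure:

// Nulls ignored: (90 x 5) / 90 = 5
AVG([Sales])

// Nulls counted as zero: (90 x 5 + 10 x 0) / 100 = 4.5
AVG(ZN([Sales]))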

 

END

Franco Arda, Frankfurt am Main (Germany) - franco.arda@hotmail.com