Franco Arda

Data Science with Tableau: 200+ Questions & Answers.

Updated: Mar 23

My book is a work in progress.


(1) Can you give me one or two examples where seasonality is common? I can think of weather and human activity.

(2)

If your data has a positive skew, is the long tail of your data on the left or the right? With positively skewed data, the tail is on the right.

(3)

The IQ average is 100 and the standard deviation is 15. What’s the probability that a random person has an IQ above 115? 115 is 1 standard deviation from the mean, and 68% of the distribution lies within one standard deviation. So 32% is either below 85 or above 115. Thus, 32%/2 or 16% have an IQ above 115. Mathematically, this is:

Figure 1: Formula for the probability of randomly picking a person with an IQ above 115.
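As a quick check, here is a minimal sketch of the same calculation in Python with SciPy (not from the book, just a verification):

from scipy.stats import norm

# P(IQ > 115) for IQ ~ Normal(mean=100, sd=15)
p = 1 - norm.cdf(115, loc=100, scale=15)
print(round(p, 4))  # ~0.1587, i.e. about 16%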

(4) Let’s say you have the stock price of [Tesla] on the Y-axis and time on the X-axis. You want to add an additional axis that shows the Standard Deviation (a.k.a. volatility in finance) for the last 30 days. How can you achieve that?

The following function calculates the Standard Deviation of [Tesla] for the last 30 days:

WINDOW_STDEV(SUM([Tesla]), -30, 0). Now I add this Calculated Field (function) as a Dual Axis (Y-axis on the right). Ideally, I display the standard deviation in percentages. Having the standard deviation on the Y-axis allows me to create an email alert (e.g. send an email alert if the standard deviation is >32%).

Figure 2: Calculated Field for Standard Deviation (STDEV). The number of days used to calculate STDEV is parameterized and can be modified by the users.

Figure 3: Dual Axis with Tesla and Standard Deviation.

(5) You have 100 days of IoT machine [Temperature] data. You want to get the maximum temperature for the last 10 days. How can you write it?


The following function returns the maximum temperature over the last 10 days including today: WINDOW_MAX(MAX([Temperature]), -10, 0)

(6) Can you explain to me what a nonlinear problem is? Use anything such as age 20-80 years. For example, the age groups 20-30 and 40-50 love Yoga, but the other age groups not so much.

(7) How would you define an outlier? There are many ways; one popular one is Tukey's method: 1.5 times the IQR (Interquartile Range), which is the data range between the 25th and 75th percentile.


(8) If you use 1.5 IQR as an outlier rule, how do you calculate the upper range? The upper range is Q3 + 1.5 IQR. In other words, any number higher is an outlier.
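A minimal sketch of both Tukey fences in Python (toy data, not from the book):

import numpy as np

data = np.array([2, 3, 4, 5, 6, 7, 50])  # 50 is the suspect
q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
print(upper, data[(data < lower) | (data > upper)])  # flags 50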

(9)

According to one study, the correlation between total SAT (a test for college admission in the U.S.) scores and IQ scores is 0.72. How much of the variance in IQ does the SAT score explain?


The calculation is 0.72 * 0.72 = 0.5184. In words, the SAT score accounts for only 51.84% of the variance in IQ.

(10) What’s a spurious relationship? A relationship between two variables that is caused by a statistical artifact or by a factor, not included in the model, that is related to both variables. According to Wikipedia, an example of a spurious relationship can be seen by examining a city’s ice cream sales. These sales are highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship between the two.

(11) You have [Age] between 20 and 80. How do you convert it with a function to a dimension?

IF [Age] >= 60 THEN '60-80'
ELSEIF [Age] >= 35 THEN '35-60'
ELSE '20-35'
END

(12)

What does skewness mean?

Skewness is a property that describes the shape of a distribution. If the distribution is symmetric around its central tendency, it is unskewed. If the tail extends farther to the right, it is “right-skewed,” and if it extends farther to the left, it is “left-skewed.”

(13) Can you write a FIXED LOD that takes the average, over [City] and [Customer Name], of the sum of [Sales]?

AVG({ FIXED [City], [Customer Name] : SUM([Sales]) })

(14) What is the Confidence Interval (CI)? A range that includes a given fraction of the sampling distribution. For example, a 90% confidence interval is the range from the 5th to the 95th percentile. Sadly, people often think that there’s a 95% probability that the actual parameter falls in the 95% confidence interval. It provides both a lower and upper bound of likelihood:
- Smaller Confidence Interval: a more precise estimate.
- Larger Confidence Interval: a less precise estimate.
Some argue that confidence intervals may be preferred in practice over the use of statistical significance tests. The reason is that they are easier for practitioners and stakeholders to relate directly to the domain.

(15a) What are logarithms?


The logarithm is the inverse of the power function. The best way to understand logarithms is through a simple example. If we take 10 to the fourth power (10 x 10 x 10 x 10), the result is 10,000. The logarithm (base 10 – see next point) of 10,000 is the power of 10 that gives 10,000. So the logarithm of 10,000 is 4. In other words, if we multiply 10 by itself four times, we get 10,000.

(15b) What’s Tableau’s default base for the Logarithmic scale? It’s log10. For example, if you have 1, 10, 100 and 1,000, then the steps are equal in log10. In other words, the step from 10 to 100 is the same size as the step from 100 to 1,000.

Figure 4: Logarithmic scale with base 10.

(16) What’s the beauty of the normal distribution? The beauty of the normal distribution stems from the fact that we know exactly what proportion of the observations lies within any given number of standard deviations. For example, 68.2% lie within one standard deviation.

(17) We need some random numbers, and we connect Tableau with TabPy. The following script is taken from the book “Mastering Tableau 2019.1” (page 503). You want to parameterize (add two Parameters for) mu and sigma. How can you do that?

Figure 5: Random number generator.

Awesome question. Parameters are super powerful, and we can pass them into SQL or Python functions as well. First, I create two Parameters in Tableau, such as ‘mu_parameter’ and ‘sigma_parameter’. Then, I add them after the helper function:

Figure 6: Random number generator with Parameters.
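Since the figure may not render here, a minimal sketch of what such a parameterized call could look like (the exact script in the book differs; the parameter names match the ones created above, and SUM([Sales]) is only there to supply the row count):

SCRIPT_REAL("
import numpy as np
# _arg2 / _arg3 carry the mu and sigma parameters, one value per row
return np.random.normal(_arg2[0], _arg3[0], len(_arg1)).tolist()
",
SUM([Sales]), [mu_parameter], [sigma_parameter])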

(18)

I’ve seen Data Scientists run Logistic Regression (a Machine Learning algorithm) straight in a Calculated Field. Below is part of the code. The accuracy of the model was really bad; do you know why?

Figure 7: Logistic Regression straight in Tableau’s Calculated Field.

This approach is plain wrong for two reasons: (1) Logistic Regression almost always requires us to normalize our data. Without it, the accuracy is mostly random (i.e. totally bad), and (2) there’s no training and test split of the dataset.

Machine Learning is not rocket science, but it’s still quite hard. In general, running a Machine Learning algorithm in production (which is what we want with Tableau) is quite challenging. In particular, normalizing is hard. We need to normalize for training and then use the same scale with new data. Achieving this often requires us to pickle (store) the normalization scaler. This is quite challenging for numerical data – for categorical data, it gets even harder.
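A minimal sketch of the two missing steps with scikit-learn (synthetic data; the scaler is fit on the training set only and reused for scoring):

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

scaler = StandardScaler().fit(X_train)  # fit the scale on training data only
model = LogisticRegression().fit(scaler.transform(X_train), y_train)
print(model.score(scaler.transform(X_test), y_test))  # accuracy on unseen data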


(21) Solving the problem of missing data is a core skill for data science. What is the technical word?

Filling in missing values is known as imputation.

(22) What’s the goal of Classical Hypothesis Testing? The fundamental question we want to address is whether the effects we see in a sample are likely to appear in a larger population. In other words, “given a sample and an apparent effect, what is the probability of seeing such an effect by chance?”

(23) Can you give me an example of Classical Hypothesis Testing? Our CEO wants to test if increasing marketing spending increases the ROE (Return on Equity). To test such a hypothesis, we temporarily assume that it does not. That’s the null hypothesis. Based on that assumption, we compute the probability of the apparent effect. That’s the p-value. If the p-value is low, we conclude that the null hypothesis is unlikely to be true. The logic of this process is similar to a proof by contradiction: to prove a mathematical statement, A, you assume temporarily that A is false (for a toy numerical example, see the sketch after question 24).

(24) What does skewed data mean? A normal distribution is not skewed. Skewed means the data tends to have a long tail on one side or the other.
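A toy version of question 23 in Python (made-up daily sales numbers; the two-sample t-test is one common choice):

from scipy.stats import ttest_ind

before = [100, 98, 103, 97, 101, 99]    # daily sales before the campaign
after = [108, 104, 110, 103, 107, 106]  # daily sales after the campaign
t, p = ttest_ind(after, before)
print(p)  # a small p-value means the "no effect" null hypothesis is unlikely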

(25)

You see in a presentation the following: “The mean sales per customer is $12,000 per year. But half of our customers spend less than $10,000.” What do you say? This is entirely possible. The first number is the mean while the second number is the median.

(26)

Why do you separate between mathematics and statistics? I believe the former is about certainty while the latter is about uncertainty. For example, 4 out of 5 is much different from 400 out of 500 to a statistician, even though both fractions equal 80 percent. The latter represents a much more precise result because it’s based on a much larger sample size. This is my best answer to show that mathematics and statistics are different.

(27) You have 2% missing data in one measure. Your data is not normally distributed. Should you replace the missing data with the mean or the median? Replacing the missing data with the mean would be correct if the data were normally distributed. Since the data is not normally distributed, replacing with the median makes much more sense.

(28) You want to replace NaNs in [Age] with 24. How do you do this in Tableau? We can do this with the following Calculated Field: IFNULL([Age], 24)

(29)

You have two measures with a high correlation of 0.9. Name a few potential problems. We should always view correlation with skepticism. Correlation is not causation. When two variables X and Y are correlated (meaning they increase together, decrease together, or one goes up as the other goes down), there are four possible explanations:

A. X causes Y
B. Y causes X
C. A third variable, Z, affects both X and Y
D. X and Y are completely unrelated

30 Can you give me an example of the Law of Large Numbers with a fair coin (heads/tails)? With a few tosses, you might get 5x heads and 1x tails. With 1,000 tosses, we can expect the result to be close to 50/50. In other words, the law of large numbers is the process by which the sample mean of a set of random variables tends towards the population mean (see the simulation after question 32).

31 Can you give me an example of reversion to the mean with sales data? Let’s say that on average, we make $100 for each customer. Currently, the average is $120. Potentially, we revert to the mean soon.

32 As a rule of thumb, the sample size must be at least 30 for the central limit theorem to hold. True or false? That’s generally considered to be true, provided the samples are random and representative.
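A minimal simulation of question 30 in Python (the sample mean of fair-coin tosses drifts towards 0.5 as N grows):

import numpy as np

rng = np.random.default_rng(0)
for n in (6, 100, 1_000, 100_000):
    tosses = rng.integers(0, 2, size=n)  # 1 = heads, 0 = tails
    print(n, tosses.mean())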

33 In a court of law, the starting assumption should be that the defendant is innocent. Would you call that the null hypothesis or the alternative hypothesis? The defendant being innocent should be the null hypothesis. A bit more formally:

Hypothesis 0 (H0): assumption that the defendant is innocent.
Hypothesis 1 (H1): assumption that innocence is rejected at some level of significance.

The significance level is often referred to by the Greek lower-case letter alpha (α). A common value used for alpha is 5% or 0.05. A smaller alpha, such as 1% or 0.1%, suggests a more robust interpretation of the result. A result is generally considered to be statistically significant when the p-value is less than alpha. This signifies that we detected an effect: the default or null hypothesis can be rejected.

p-value < alpha: significant result, reject the null hypothesis (in favor of H1).
p-value >= alpha: not significant result, fail to reject the null hypothesis (H0).

34

What’s the difference between “Reject” vs. “Failure to Reject”? The p-value is probabilistic. The danger in natural language is saying that we “accept” the null hypothesis, because that suggests the null hypothesis is true. Instead, it’s safer to say that we fail to reject the null hypothesis, as in: there is insufficient statistical evidence to reject it.

35 What does the p-value refer to? The p-value refers to the probability that an effect could occur by chance. Let’s say we get a p-value of 2%. This means that in this test, there’s only a 2% chance that the effect could occur by chance. Common alpha values include:
- 1% (alpha = 0.01)
- 5% (alpha = 0.05)
- 10% (alpha = 0.10)

36 So the p-value gives us the probability of the hypothesis being true? No, that’s false. The p-value does not give us the probability of the null hypothesis being true or false given the data. Mathematically, that would be Probability( hypothesis | data ), which is incorrect. Instead, the p-value can be thought of as the probability of the data given the pre-specified assumption embedded in the statistical test. Using the probability notation again, this could be written as: Probability( data | hypothesis )

In my words, I would say: “With this data, given the hypothesis, we get this probability.” It allows us to reason whether or not the data fits the hypothesis, not the other way around. The p-value is a measure of how likely the data sample would be observed if the null hypothesis were true.

37

Do you always work with a statistical significance level of 5%? Significance levels such as 5% or 1% are quite common in many fields of science, but physics and finance often use more aggressive levels. In finance, it’s common to use a significance level of 3 x 10^-7, often referred to as 5-sigma. This means that the findings would be due to chance in roughly 1 in 3.5 million independent repeats of the experiment. To use a threshold like this requires a large data sample.

Below is an example with the Superstore dataset from Tableau. We can easily see that even with little data, we already get a few 5-sigma events (i.e. Sales values above 5 standard deviations).

38 What does statistically significant mean? An effect is statistically significant if it is unlikely to occur by chance. In general, we take a p-value below 5% as statistically significant. We might also observe some difference due to luck or random variation, so statisticians speak of statistically significant differences when the difference is larger than could easily be produced by luck.

39 What’s a false positive? The conclusion that an effect is real when it is not.

40

What’s a false negative? The conclusion that an effect is due to chance when it is not.

41 Can you explain the difference between the Probability Density Function (PDF) and the Cumulative Distribution Function (CDF)? The PDF gives the probability of observing a given value, while the CDF gives the probability of an observation equal to or less than a value. Let’s say you have a 50% chance of scoring a goal. If you kick three times, the probability of scoring at least one goal is 1 - (1 - 0.5)^3 = 87.5%. This relates to the CDF, as you might score once or more often. The PDF calculation is a bit more complicated, but the result is 3/8.
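A minimal check of both numbers in Python (the goal count follows a Binomial(3, 0.5) distribution):

from scipy.stats import binom

print(binom.pmf(1, n=3, p=0.5))      # P(exactly 1 goal) = 3/8 = 0.375
print(1 - binom.cdf(0, n=3, p=0.5))  # P(at least 1 goal) = 0.875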

42 In a box plot, Q1 and Q3 represent what percentiles? The 25th and 75th percentiles. The median splits the box into two parts. If the two parts are not equal, the larger part contains data that’s more variable than the other part, in terms of its range of values. However, there is still the same amount of data (25%) in the larger part of the box as there is in the smaller part. In other words, the length of the box represents the variability in the data, not the number of data values.

43

In a box plot, what do the whiskers refer to? Either to the min and max of your data, unless defined otherwise, such as 1.5 times the interquartile range (Tukey's fences).

44 In a dataset, the population mean is 100 and the median is 200. Is the data normally distributed? With a normally distributed dataset, the mean and median are equal. If they are not, as in this case, the data is most likely not normally distributed.

45

If you take a sample from a population, what are two requirements? The two main requirements are that the sample is chosen at random and that it is representative of the population.

46 What's ordinal data? Categorical data that is ordered: S/M/L/XL, mild/hot/very hot ...

47 You bin age into groups 20-40, 40-60 and >60. What do you get, a measure or dimension?

When we bin a measure, we create a dimension. This is also known as discretization.

48 You need a function that returns "OFFICE HOURS" or "NOT OFFICE HOURS." Office hours are between 07:00 and 19:00. Use [DATE] as a dimension.

IF DATEPART('hour', [DATE]) >= 7 AND DATEPART('hour', [DATE]) < 19
THEN "OFFICE HOURS"
ELSE "NOT OFFICE HOURS"
END

49

You have customer age 20 years - 80 years. Age is obviously a measure. How could you make it discrete? Create age bins such as 20-40, 40-60 and 60-80.

50

Can you explain to me the difference between numerical and categorical data? Numerical data can be measured while categorical data can only be counted.

51

Your dataset has 100,000 rows. 500 are extreme outliers. What can you do? Outliers only account for 0.5%. In general, when outliers make up less than 5% of the data, we can consider deleting them.

51 You have 100,000 credit applications. There are 1,000 NaNs and you suspect that the NaNs might indicate a pattern. What can you do? We can create a new column and label NaNs with 1, otherwise 0. This can help us catch potential patterns (it’s a technique used in Machine Learning as well). The function is ISNULL([Measure]), which returns a Boolean value, TRUE or FALSE.

52

You have 100,000 emails. You want to extract the topics from the emails. What technique would you use?


NER (Named Entity Recognition) might be a solution. For example, if the majority of emails mention "Tesla", I can find this pattern quickly with NER. I can't think of a faster technique.
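A minimal NER sketch in Python with spaCy (assumes the small English model is installed via python -m spacy download en_core_web_sm):

import spacy

nlp = spacy.load("en_core_web_sm")
doc = nlp("Tesla delivered a record number of cars this quarter.")
for ent in doc.ents:
    print(ent.text, ent.label_)  # e.g. "Tesla" tagged as an organization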

53

If your data has a positive skew, the mean is usually greater than the median. True or false?


With a positive skew, the extreme values are to the right while most of the data is to the left. So the answer is true.


54

How would you describe 1, 2 and 3 standard deviations in terms of likelihood?


- LIKELY to be within 1 standard deviation (68 out of 100)

- VERY LIKELY to be within 2 standard deviations (95 out of 100)

- ALMOST CERTAIN to be within 3 standard deviations (99 out of 100)

55

Discrete data is counted and continuous data is measured. True or false?


True.

56

What’s a common mistake with hypothesis testing?


We should never say that we accept the null hypothesis. If the p-value is greater than alpha, we should conclude that we cannot reject the null hypothesis. But that does not prove that the null hypothesis is true.

57

Can you give me a few of your favorite “Data Fallacy” examples?


(1) Cherry picking is probably the worst and most harmful example of being dishonest with data. Cherry picking is the practice of selecting results that fit your claim and excluding those that don’t.


(2) False causality is to falsely assume that when two events occur together, one must have caused the other. In other words, we should not assume causation because of correlation alone and should always gather more evidence.


(3) Survivorship bias is drawing conclusions from an incomplete set of data, because that data has “survived” some selection criteria.


(4) Sampling bias is drawing conclusions from a set of data that isn’t representative of the population we’re trying to understand.


(5) The danger of summary metrics, or why visualizing (a.k.a. Exploratory Data Analysis) is so important. Anscombe’s quartet visualizes this nicely (see next).

58

So what’s this Anscombe’s quartet?

For Data Science, the famous statistician John Tukey summarized it best: “The greatest value of a picture (visualization) is when it forces us to notice what we never expected to see …”. Anscombe’s quartet “proves” this nicely.


The danger of numbers or summary metrics alone is that they can be misleading. To demonstrate this effect, the statistician Francis Anscombe put together four example data sets in the 1970s. Known as Anscombe’s quartet, each data set has the same mean, variance and correlation. However, when graphed, it’s clear that the data sets are totally different. The point Anscombe most likely wanted to make is that the shape is as important as the summary metrics and cannot be ignored in data analytics (see the sketch after question 60).

59 Can you explain the difference between univariate and multivariate? Univariate is one variable measured over time. Multivariate measures multiple variables over time.

60 Can you explain geocoding to me? Geocoding is the process of converting addresses (like a street address) into geographic coordinates (like latitude and longitude), which you can use to place markers on a map or position the map. Reverse geocoding is the process of converting geographic coordinates into a human-readable address.
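A minimal sketch of question 58 in Python (seaborn ships a copy of Anscombe's quartet; it is fetched over the network on first use; the per-group summary statistics match, the plots do not):

import seaborn as sns

df = sns.load_dataset("anscombe")
print(df.groupby("dataset")[["x", "y"]].agg(["mean", "var"]))  # near-identical stats
sns.lmplot(data=df, x="x", y="y", col="dataset", col_wrap=2, ci=None)  # four very different shapes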

61 What defines a good graph?
- Self-explanatory: a good graph doesn’t require much explanation from the analyst nor much thinking from the user.
- Contains only the necessary information.
- Focuses only on 1-2 problems.
- Supports fast understanding and fast decision making.

62 Is the following function a recursion or an iteration?

IF LAST() = 0
THEN (SUM([Sales]) - LOOKUP(SUM([Sales]), -1)) / LOOKUP(SUM([Sales]), -1)
END

We can’t write an iteration, such as a for loop, in Tableau. A recursion in computer science is a method of solving a problem where the solution depends on solutions to smaller instances of the same problem (as opposed to iteration). Thus, this is recursion.

63

How could you bin “hour”? Sweet question! This is called “time discretization.” We could segregate into morning, afternoon, evening and night. Fun fact: did you know that there are discretization algorithms?

64 Generally speaking, how large does a sample dataset have to be in order to be representative?

Numerous studies have been conducted, and their results suggest that, in general, the Central Limit Theorem holds for n > 30. However, we should not apply this rule blindly. If the population is heavily skewed, the sampling distribution will probably still be skewed even for samples of 50 or 60.

65 What does statistically significant mean? The idea of statistical significance is quite straightforward: if a p-value is small enough (below alpha), then we say that the results are statistically significant.

66 How does the average line work? The average line works the same way as a constant line; however, we don’t have to enter a value. Tableau automatically calculates the average of the measure. The average can be calculated based on the table, pane or cell. In general, I like the average line to be in color (such as orange) and to use a dotted line.

67 What does the Average with 95% CI (Confidence Interval) present? The CI quantifies the uncertainty of the average in the future. The margin for 95% confidence is 1.96 * Standard Deviation / √N. Thus, the larger our sample (N), the higher our confidence and therefore the smaller our range. For example, for our sales numbers we might get an upper confidence of 80,000 and a lower confidence of 40,000. With 95% confidence, we expect our average sales number to be between those two numbers (see the sketch after question 68).

68 Can you tell me what Boolean logic is? Boolean logic can be used to compare and manipulate sets using just three operators: AND, OR and NOT. For example:
- You can have Strawberry OR Vanilla OR Chocolate.
- You can have Strawberry AND Vanilla AND Chocolate.
- You can have Strawberry OR Vanilla NOT Chocolate.
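A minimal sketch of the 95% CI from question 67 in Python (synthetic sales numbers):

import numpy as np

sales = np.random.default_rng(1).normal(60_000, 20_000, size=400)
mean, sd, n = sales.mean(), sales.std(ddof=1), len(sales)
margin = 1.96 * sd / np.sqrt(n)
print(mean - margin, mean + margin)  # lower and upper confidence bounds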

69 What is the difference between a Type I and a Type II error?

Type I error (false positive): we reject the null hypothesis when there is in fact no significant effect. The p-value in this case is optimistically small. A better name for a Type I error would probably be “false alarm.”

Type II error (false negative): we do not reject the null hypothesis when there is in fact a significant effect. In this case, the p-value is pessimistically large. A better name for a Type II error would probably be “missed detection.”

70

What is statistical power? Statistical power, or the power of a hypothesis test, is the probability that the test correctly rejects the null hypothesis. On Wikipedia, we find the following formula: Statistical Power = Probability( reject H0 | H1 is true ). Thus, the higher the statistical power for a given experiment, the lower the probability of making a Type II (false negative) error. More intuitively, statistical power can be thought of as the probability of accepting the alternative hypothesis when the alternative hypothesis is true. When interpreting statistical power, we seek experimental setups that have high statistical power.

- Low statistical power: large risk of committing Type II errors, i.e. false negatives.
- High statistical power: small risk of committing Type II errors.

71

How can you test statistical power?

The ability to detect when H0 is truly false is what we try to measure with the power of a test. Testing statistical power is a pretty complicated issue, but what’s important for us practitioners is that the higher the sample size, the more powerful a test is. A powerful test has a small chance of a Type II error (a “missed detection”). In other words, the way to minimize the chances of a Type II error is to select a large sample size, ensuring that any differences that really exist won’t be missed (see the sketch after question 73).

72 What are the z-values for 95% and 99% confidence levels? 95% = 1.96, 99% = 2.58.

73 Can you explain the Big O notation in plain words? In plain words, Big O notation describes the time and space complexity of your code using algebraic terms.
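A minimal sketch of the sample-size effect from question 71 in Python, using statsmodels (the effect size and alpha are made-up values):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for n in (20, 50, 200):
    # power of a two-sample t-test for a medium effect (0.5) at alpha = 0.05
    print(n, analysis.power(effect_size=0.5, nobs1=n, alpha=0.05))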

74 What’s the time complexity of a simple Boolean function in Tableau? For example, IF SUM([Sales]) > 500000 THEN “Great” ELSE “Bad” END. This is called O(1) or “constant time.” It has the least complexity, as Tableau has to calculate the Calculated Field (function) only once.

75 What’s a major mistake with statistics?

We don’t expect variability to depend on sample size. In other words, there is more random variation within small groups than within large groups. This simple principle might be logical to you, yet it is not to many people.

76

What’s a major difference between Probability and Statistics? In general, probability calculations go from population to sample. Statistical calculations work in the opposite direction: from sample to population. For example, probabilistically we could say that “in our industry, we have a churn rate of 20% p.a. and we can expect the same.” Statistically, we might find that we make the most money with customers aged 25-35: let’s focus on this cohort and see if the pattern continues for a larger cohort.

77

What’s a discrete random variable? When observations on a quantitative random variable can assume only a countable number of values, the variable is called a discrete random variable. For example, a customer’s credit rating ranging from 1 to 100.

78 What’s a continuous random variable? When observations on a quantitative random variable can assume any one of an uncountable number of values, the variable is called a continuous random variable. For example, the total spending of a customer can be continuous (technically, it can be any number). The distinction between discrete and continuous random variables is pertinent when we are seeking the probabilities associated with specific values of a random variable.

79

Which clustering algorithm is available out of the box in Tableau? In Tableau, under Analytics, the k-means clustering algorithm is available for two continuous variables (Measures). K-means clustering is one of the most common clustering techniques.

80 How does it work, in one sentence? In k-means clustering, the algorithm attempts to group observations into k groups, with each group having roughly equal variance. A bit flashier: the clustering algorithm helps us uncover hidden patterns in our data (see the sketch below).
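A minimal k-means sketch in Python with scikit-learn (synthetic blobs stand in for two measures):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # two features
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])  # cluster assignment per observation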

81

How does this algorithm fit in the group of Machine Learning algorithms? Fundamentally, regression is about predicting a quantity, classification is about predicting a label, forecasting is about predicting future values, and clustering is about predicting a group.

82

What’s a particular feature of clustering algorithms? Clustering algorithms are also called unsupervised models. In supervised Machine Learning (such as Logistic Regression or a Decision Tree – see later), we have both the features and the target.

83 I don’t understand supervised learning. Can you give me an example? Let’s say we have a list of customers. We want to predict the probability that a customer is going to churn. In this simplified table, we have to manually fill in the column “Churned” indicating whether the customer has churned.


84 Let’s stay with Logistic Regression. You seem to be a big fan of this algorithm. Why? Logistic Regression is a statistical model (today often called a Machine Learning algorithm) for classification. For example: what’s the probability that this customer is going to churn? What’s your probability of suffering from Diabetes Type 2? The algorithm returns either 1 or 0, or a probability between 0% and 100% (technically it’s never exactly 0% or 100%). The algorithm fails with nonlinear problems, but it’s pretty robust for linear problems – even in higher-dimensional problems. In Machine Learning lingo, the algorithm has high bias and low variance. This means that in real-world applications, the algorithm is most likely not extremely accurate, but robust (consistent) in predicting new data.

85

What’s this thing about normalizing data? Let’s say you want to predict the probability that you get Diabetes Type II. You only have two features: weight and age. If you weigh 200 kg and are 20 years old, the algorithm might think that weight (a 10 times bigger number) is much more important than age. Thus, we want to bring both features onto the same scale. One technique for normalizing data is called Min-Max scaling. With Min-Max scaling, we bring both features onto a 0-1 scale.

86 Do you know the formula for Min-Max scaling? (X - Xmin) / (Xmax - Xmin)
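The same formula in Python with scikit-learn (toy weight/age rows):

import numpy as np
from sklearn.preprocessing import MinMaxScaler

X = np.array([[200.0, 20.0], [80.0, 60.0], [120.0, 40.0]])  # weight kg, age years
print(MinMaxScaler().fit_transform(X))  # both columns now on a 0-1 scale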

87 Tell me an algorithm where you need scaling such as Min-Max scaling, and one where you don't.

For a Decision Tree, no; for Logistic Regression and Neural Networks, yes.

88 How do you code Min-Max scaling directly in Tableau? One solution is:

(SUM([Measure]) - TOTAL(MIN([Measure]))) / (TOTAL(MAX([Measure])) - TOTAL(MIN([Measure])))

In this case, where you normalize (Min-Max scale) your data directly in Tableau, you don’t have to pickle the Min-Max scaler, just the model. In any other case, you have to pickle both the model and the scaler. I find the latter quite complicated. That’s why I prefer normalizing the data directly in Tableau.

89 Let’s stick with pickle. WTF is it? Let’s say you trained a Machine Learning algorithm in Python. Now you use the same model in production, where Tableau calls the model to make a prediction. Without pickle, we would train the model every time we make a prediction. This would take way too long. With pickle, we rather call the serialized model. In the case of Logistic Regression, all the weights (or coefficients) are stored in the pickle in order to make the prediction.

90

What does pickle look like in code? Simplified, let’s say you have:

…
model = LogisticRegression(penalty='l2', …)
model.fit(X_train, y_train)

Now you dump the model:

pickle.dump(model, open('/Users/francoarda/…', 'wb'))

Of course, if you run TabPy on your server, your pickled model needs to be on your server as well. The pickled model is now used to make predictions in Tableau.

91 Can you tell me what that looks like? Let’s say we trained a Logistic Regression model that predicts money laundering with seven variables:

def Money_Laundering(_arg1, _arg2, _arg3, _arg4, _arg5, _arg6, _arg7):
    df_ = np.column_stack((_arg1, _arg2, _arg3, _arg4, _arg5, _arg6, _arg7))

    import pickle
    model = pickle.load(open('/Users/francoarda/…', 'rb'))
    predict = model.predict_proba(df_)
    pred = predict[0][0]
    return pred.item()

In your code, you need to add a connection to Tableau using TabPy.

92 How does the code for connecting with Tableau look? In the “Money Laundering” case, like this:

import tabpy_client
connection = tabpy_client.Client('http://localhost:9004/')
connection.deploy('Money_Laundering', Money_Laundering, 'Money_Laundering', override=True)

In words, we import the tabpy_client library and deploy the function “Money_Laundering.” The last step is to add the function (Calculated Field) to the Tableau workbook.

93 What does this function look like in Tableau’s Calculated Field? Let’s say we have seven features in Tableau. One is “Balance”; we normalized it directly in Tableau and call the normalized feature “Balance_norm”.

SCRIPT_REAL("return tabpy.query('Money_Laundering', _arg1, _arg2, _arg3, _arg4, _arg5, _arg6, _arg7)['response']",
SUM([Balance_norm]), SUM([Risk_0_norm]), SUM([Risk_1_norm]), SUM([Transactions_norm]), SUM([Risk_2_norm]), SUM([Activity_norm]), SUM([Amount_norm]))

We can call this Calculated Field “Probability”. In the case of Logistic Regression, we should get a probability output for each row. For example, row 1 might get a “Probability” of 5% (quite low).

94 Adding a Machine Learning algorithm in Tableau sounds quite complicated? True, it’s quite hard. But don’t forget that adding a Machine Learning algorithm to Tableau means that you run it in production. And running a Machine Learning algorithm in production is quite challenging – no matter if you use Google, Amazon or Tableau. In Tableau’s case, you have several amazing advantages: you don’t pay additional fees for running the Machine Learning algorithm, and you have a near real-time Machine Learning algorithm in the hands of your employees. For example, imagine a sales representative accessing Tableau on his iPad on the way to a customer. He can immediately see the churn risk of that particular customer.

95

Can you show me a use case of Logistic Regression in Tableau? I don’t understand why Tableau isn’t more popular in the IoT / Predictive Maintenance space. For example, we can train a Machine Learning algorithm to predict when a machine needs maintenance. With the power of Tableau, we can equip engineers with the Machine Learning algorithm. It can be accessed on an iPad or iPhone.

96

If you’re predicting fraud or cancer, it’s very important that you catch as many true positives as possible. What would you look for? In such a case, you don’t mind a small proportion of false positives (false alarms) and focus on getting as many true positives as possible. This ratio is called sensitivity. Sensitivity is the term used in medicine, while in Machine Learning (such as in the Scikit-Learn library) it is called recall. Recall is the true positive rate of a diagnostic test. That is, P(test is positive | disease is present). We read the vertical bar (|) as “given.” For example, if the total of true positives is 20 and we caught 15, then our recall is 0.75 (15/20).
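The same 15-out-of-20 example in Python with scikit-learn:

from sklearn.metrics import recall_score

y_true = [1] * 20            # 20 actual positives
y_pred = [1] * 15 + [0] * 5  # we caught 15 of them
print(recall_score(y_true, y_pred))  # 0.75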


97

Let's say I need help from a Data Scientist regarding Machine Learning. What's a sign of a lack of experience?

Using no arguments (i.e. leaving all hyperparameters at their defaults) is a strong sign of no experience:

Sign of an expert:
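The two code screenshots are missing here; a hypothetical illustration of the contrast in Python with scikit-learn (the hyperparameter values are made up):

from sklearn.linear_model import LogisticRegression

model = LogisticRegression()  # no arguments: every hyperparameter at its default
model = LogisticRegression(penalty='l2', C=0.1, class_weight='balanced', max_iter=1000, random_state=42)  # deliberate, documented choices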






Franco Arda, Frankfurt am Main (Germany) | franco.arda@hotmail.com