The goal of this article is to give an overview of the SHAP values that are generated from Qlik AutoML model predictions. SHAP values measure variable importance: how much each feature influences the model's predicted value.
SHAP Importance explained
SHAP Importance represents how a feature influences the prediction of a single row relative to the other features in that row and to the average outcome in the dataset.
The goal of SHAP is to explain the prediction of an instance x by computing the contribution of each feature to the prediction. The SHAP explanation method computes Shapley values from coalitional game theory. The feature values of a data instance act as players in a coalition. Shapley values tell us how to fairly distribute the "payout" (the prediction) among the features. A player can be an individual feature value or a group of feature values.
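The coalition idea above can be made concrete with a small sketch. The code below computes exact Shapley values for a toy three-player game by averaging each player's marginal contribution over every ordering. The payout function `toy_value` and its numbers are hypothetical and chosen only for illustration; they are not taken from the insurance model, and this brute-force enumeration is not the method Qlik AutoML uses.

```python
from itertools import permutations


def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when it joins the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: total / len(orderings) for p, total in totals.items()}


def toy_value(coalition):
    """Hypothetical payout for a coalition of feature values (illustration only)."""
    payout = 0.0
    if "smoker" in coalition:
        payout += 100.0
    if "age" in coalition:
        payout += 40.0
    if "smoker" in coalition and "bmi" in coalition:
        payout += 60.0  # interaction effect, shared between smoker and bmi
    return payout


phi = shapley_values(["smoker", "age", "bmi"], toy_value)
print(phi)  # {'smoker': 130.0, 'age': 40.0, 'bmi': 30.0}
```

The three values add up to 200, the payout of the full coalition, which is the "fair distribution" property described above. The interaction bonus of 60 is split evenly between the two players that create it.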
For more information and mathy fun please reference this chapter from Interpretable Machine Learning:
Medical Cost Personal dataset: https://www.kaggle.com/datasets/mirichoi0218/insurance
Note: I added an ID column but did not include it as a feature.
Features: age, sex, bmi, children (number of), smoker, region
I uploaded this dataset into Qlik Cloud and generated 4 models. Random Forest Regression was the champion model.
From the UI, we see the SHAP Importance visualization. It shows that smoker, age, and bmi are the top 3 prediction influencers, meaning their values have the most effect on the predicted charges.
Understanding how the values are calculated
I deployed the model and generated predictions from the Qlik Cloud interface. At this point you can open the data as a Qlik Sense app and combine the predicted output table with the original dataset (see Qlik AutoML: How to join predicted output to original trained dataset).
This is an example of the original table combined with the SHAP values by record.
Example interpretation of record 1001:
The Smoker_SHAP value is 19315, which answers the following question:
How much does Smoker=Yes affect the amount of charges, given that the account holder is female, 19 years old, has a bmi of 27.9, has no children, and is in the Southwest region?
The sum of the Shapley values in a row is how much that row's prediction differs from the average prediction.
Average Predicted Charges (across all records) = 13511.5
Sum of SHAP values = 3396
Predicted charges manual SHAP calculation = 16908
sumSHAPS is a calculated column of the sum of the SHAP values in the record.
f(x) = age_SHAP+sex_SHAP+bmi_SHAP+smoker_SHAP+children_SHAP+region_SHAP
shaps_avgpredcharges is sumSHAPS + average(charges_predicted)
f(x) = sumSHAPS+13511.5
Charges is from the original dataset
Charges_predicted is the model predicted value
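The two calculated columns above can be sketched in a few lines of Python. This is a minimal illustration and not Qlik expression syntax: the helper names `sum_shaps` and `reconstruct_prediction` are hypothetical, and the only numbers used are the average predicted charges and the record 1001 SHAP sum quoted in this article.

```python
def sum_shaps(row):
    """Sum every column whose name ends in '_SHAP' (the sumSHAPS column)."""
    return sum(value for name, value in row.items() if name.endswith("_SHAP"))


def reconstruct_prediction(sum_of_shaps, average_prediction):
    """shaps_avgpredcharges: the row's SHAP sum added to the average prediction."""
    return sum_of_shaps + average_prediction


# Values quoted in the article for record 1001.
AVERAGE_PREDICTED_CHARGES = 13511.5
RECORD_1001_SUM_SHAPS = 3396

reconstructed = reconstruct_prediction(RECORD_1001_SUM_SHAPS, AVERAGE_PREDICTED_CHARGES)
print(reconstructed)  # 16907.5, shown as 16908 after rounding
```

With a full record available, `sum_shaps(row)` would supply the first argument by adding up the six `_SHAP` columns for that row.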
Value of generated SHAP values
The _SHAP values can be used in visualizations and further analysis to understand which features are driving the model predictions. For record 1001, being a smoker increased the predicted charges, while for non-smokers the smoker feature reduced them.
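As one example of further analysis, per-record SHAP values can be rolled up into a feature ranking. The sketch below uses the mean absolute SHAP value per feature, a common convention. Whether the SHAP Importance chart in Qlik AutoML uses exactly this aggregation is an assumption, and the two rows are invented placeholders, not records from the insurance dataset.

```python
def mean_abs_shap(rows):
    """Rank *_SHAP columns by their mean absolute value across rows."""
    shap_columns = [name for name in rows[0] if name.endswith("_SHAP")]
    scores = {
        column: sum(abs(row[column]) for row in rows) / len(rows)
        for column in shap_columns
    }
    # Largest average magnitude first.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder rows invented for illustration only.
rows = [
    {"smoker_SHAP": 200.0, "age_SHAP": -50.0, "bmi_SHAP": 10.0},
    {"smoker_SHAP": -100.0, "age_SHAP": 30.0, "bmi_SHAP": -20.0},
]

ranking = mean_abs_shap(rows)
print(ranking)  # [('smoker_SHAP', 150.0), ('age_SHAP', 40.0), ('bmi_SHAP', 15.0)]
```

Taking the absolute value keeps positive and negative contributions from cancelling out, so a feature that pushes some predictions up and others down still ranks as influential.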
- I rounded the numeric values in the combined table to the nearest whole number for readability in the article.
Ex: f(x) = round(bmi_SHAP,1)
- Qlik AutoML Random Forest uses approximate Shapley values. This is why, in our example, shaps_avgpredcharges does not exactly equal charges_predicted, although the two are fairly close.
- Average Predicted Charges: f(x) = average(charges_predicted)
The information in this article is provided as-is and to be used at own discretion. Depending on tool(s) used, customization(s), and/or other factors ongoing support on the solution below may not be provided by Qlik Support.