Explainable Machine Learning:
– Opening up the black box with SHAP
This notebook illustrates how any Machine Learning (ML) model can be made more transparent with SHAP (SHapley Additive exPlanations) values.
The notebook walks through the process of training a regression model, making predictions on a test set, and interpreting those predictions with SHAP values. The focus is strictly on explaining the output with SHAP rather than on other parts of the workflow.
Motivation
ML models are often used as black boxes, meaning that their internal workings remain a mystery. This makes it challenging to understand and explain a model's behavior, and that lack of transparency is one of the biggest hurdles ML faces today in terms of public trust and acceptance.
Solution (SHAP)
SHAP addresses the black box problem of ML by providing a game-theoretic approach to explain the output of any machine learning model. In practice, it is a method for assigning each feature an importance value for a particular prediction.
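In practical terms, the SHAP values $\phi_i$ of a single prediction satisfy the additivity property $f(x) = E[f(X)] + \sum_{i=1}^{M} \phi_i$, where $E[f(X)]$ is the model's expected output, $M$ is the number of features, and each $\phi_i$ is the Shapley value of feature $i$, i.e. its average marginal contribution over all possible feature coalitions. (This is the standard SHAP formulation; the notation is not taken from this notebook.) The waterfall plot later in the notebook visualizes exactly this decomposition.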
Example Application: Finnish house price prediction
In this notebook we showcase the power of SHAP in the context of predicting house prices in Finland. A predictive model for house price estimation can, for example, add significant value to realtors by automating parts of the job and reducing errors.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
import catboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# shap
import shap
shap.initjs()
The data in use is a 12-month snapshot (from 2020 and 2021) of the Finnish realtors’ database of apartment selling prices. It is a dataset scraped from asuntojen.hintatiedot.fi and holds $\sim$ 40k observations.
Target Variable $y$: selling price per square meter (€/m2)
Features $X$: postal code, building type, size ($m^2$), age, own lot, floor, and condition
# dataset
df = pd.read_csv('datasets/fin_house_clea.csv',
                 sep=",",
                 dtype={'Postal_code': 'str'},
                 encoding="utf-8",
                 index_col=0)
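As a quick sanity check (a small sketch, not part of the original notebook), we can confirm that the loaded frame matches the $\sim$ 40k figure quoted above and peek at a few rows:
# quick look at the loaded data (sketch)
print(df.shape)               # expect roughly 40k rows
print(df.columns.tolist())    # should include the feature and target columns used below
df.head()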
# features
X = df[['Postal_code', 'Building_type', 'm2', 'Age', 'Own_lot', 'Floor', 'Condition']]
# target variable
y = df['€/m2']
# data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
We use CatBoost, a gradient boosting model on decision trees. It is advertised as delivering great quality without parameter tuning, and it can handle categorical features without pre-processing.
# initiating the model
cb = catboost.CatBoostRegressor(loss_function='RMSE')
# training the model
cb.fit(X=X_train,
       y=y_train,
       cat_features=['Postal_code', 'Building_type'],
       silent=True)
# performance
print('Training MSE: ' + str(round(mean_squared_error(y_train, cb.predict(X_train)), 0)))
print('Test MSE: ' + str(round(mean_squared_error(y_test, cb.predict(X_test)), 0)))
print('\nR-squared: ' + str(round(r2_score(y_test, cb.predict(X_test)), 3)))
Training MSE: 218935.0
Test MSE: 255711.0

R-squared: 0.921
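Since MSE is expressed in squared units of the target, $(€/m^2)^2$, the corresponding RMSE is easier to interpret. A small sketch using the functions already imported above:
# RMSE in the target's own units (€/m2)
print('Training RMSE: ' + str(round(np.sqrt(mean_squared_error(y_train, cb.predict(X_train))), 1)))
print('Test RMSE: ' + str(round(np.sqrt(mean_squared_error(y_test, cb.predict(X_test))), 1)))
# with the MSE values printed above, these come out to roughly 468 and 506 €/m2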
Now it is finally time to showcase the power of SHAP! Instead of just taking the predictions as given, we can use SHAP to analyze how each feature value affects the output.
Let us take one instance from the test set and see how well our model predicts the price per square meter $(€/m^2)$.
print("Features:\n\n", X_test.iloc[100])
print("\n\nTrue label:\n\n", y_test.iloc[100], "€/m2")
print("\n\nPrediction:\n\n", cb.predict(X_test.iloc[100:101])[0], "€/m2")
Features:

Postal_code      90120
Building_type       kt
m2                61.5
Age                 58
Own_lot          False
Floor                2
Condition            4
Name: 21667, dtype: object


True label:

1902 €/m2


Prediction:

1920.5714261325602 €/m2
So our model predicted the price for this apartment to be about 1920 $€/m^2$ while the true price is 1902 $€/m^2$. Traditionally we would have no idea how our model arrived at this outcome, but with SHAP we can assign each feature an importance value:
# compute SHAP values
explainer = shap.TreeExplainer(cb)
shap_values = explainer(X_test)
# the waterfall plot shows how we get from shap_values.base_values to the model output
plt.title("SHAP for a single prediction", fontsize=15)
shap.plots.waterfall(shap_values[100])
The waterfall plot above visualizes the SHAP values for each of the features. The base value $E[f(X)] = 2643.574$ is the expected prediction if we did not have any features to help. Features with a positive SHAP value increase the prediction, while negative ones decrease it. For this particular prediction we can see that the apartment age (SHAP of -367.57) and postal code (SHAP of -244.71) have the biggest (negative) influence on the model output. So instead of blindly relying on the predicted value, we now know that the old age and the less expensive living district are driving the price per square meter down for this apartment.
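One useful property to verify (a small sketch, not part of the original analysis) is SHAP's additivity: the base value plus the sum of the per-feature SHAP values reproduces the model prediction for this instance.
# additivity check: base value + sum of SHAP values equals the model output
print('Base value + sum of SHAP values:',
      shap_values[100].base_values + shap_values[100].values.sum())
print('Model prediction:               ', cb.predict(X_test.iloc[100:101])[0])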
To understand our model's decision making more generally, we can simultaneously visualize all of the SHAP values in the test set:
plt.title("SHAP explaining test set predictions", fontsize=15)
shap.summary_plot(shap_values, X_test, max_display=12, show=False)
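As a complementary global view (a sketch, not in the original analysis), the mean absolute SHAP value per feature can also be drawn as a simple bar chart:
# mean(|SHAP value|) per feature as a global importance measure
shap.plots.bar(shap_values, max_display=12)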
Interpreting the summary plot above: features are ordered by overall importance, and the coloring shows whether high or low feature values push the prediction up or down.
Now that we have an overview of the model's decision making (how each feature affects the outcome), we can take a closer look at the features that are most important for the model. In the case of Finnish house price prediction these are postal code, age, and $m^2$. To reveal the effect of a particular feature across the whole dataset we can visualize its SHAP values in a dependence plot. Let's do this for age and $m^2$.
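# recompute the SHAP values as a plain numpy array for the dependence_plot call below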
shap_values = explainer.shap_values(X_test)
shap.dependence_plot("Age", shap_values, X_test,
interaction_index='m2', xmax=150)
The age plot shows that our model assigns an increasingly positive impact on the output for buildings older than roughly 60 years. For these older buildings, smaller apartments (low $m^2$) seem to have an even higher positive impact. This may indicate that the model captures the positive effect that renovations and valuable districts (e.g. small flats in downtown Helsinki) have on the prices of older houses.
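For completeness, the corresponding dependence plot for $m^2$ can be produced in the same way (a sketch; only the age plot is part of the original analysis):
# dependence plot for apartment size, colored by the age interaction
shap.dependence_plot("m2", shap_values, X_test,
                     interaction_index='Age')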
Such interactions are hard for humans to identify with this precision. Simpler models, e.g. linear ones, are also unable to capture these kinds of patterns, as they would naively predict the price to decrease linearly with age. Thus SHAP allows us to achieve high performance with complex models while keeping the model behavior transparent.
Throughout this notebook we have explored how SHAP values can help users interpret the predictions of complex models in the context of house price prediction in Finland. SHAP does this by assigning each feature an importance value for a particular prediction.
We showed how SHAP can (1) explain the output of a single prediction, (2) give an overview of the model's decision making, and (3) show the effect of a single feature across the whole dataset.
In addition to improving the user experience and trust in predictions through interpretability, SHAP can help in implementing the right to explanation. By opening up the black box, SHAP provides an explanation for the model output, something that may be required for decisions that significantly affect an individual.
Let's keep in touch. I'm always happy to receive messages, whether it is a general question, suggestion, or if you just want to say hi.