Explainable Machine Learning:
– Opening up the black box with SHAP
This notebook illustrates how any Machine Learning (ML) model can be made more transparent with SHAP (SHapley Additive exPlanations) values.
The notebook walks through the process of training a regression model, making predictions on a test set, and interpreting those predictions with SHAP values. The focus is strictly on explaining the output with SHAP rather than on other parts of the workflow.
Motivation
ML models are often used as black boxes, meaning that their internal workings remain a mystery. This makes it challenging to understand and explain a model's behavior, and that lack of transparency is one of the biggest hurdles ML faces today in terms of public trust and acceptance.
Solution (SHAP)
SHAP addresses the black box problem of ML by providing a game-theoretic approach to explain the output of any machine learning model. In practice, it is a method for assigning each feature an importance value for a particular prediction.
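In practical terms, the SHAP values $\phi_i$ of a single prediction satisfy the additivity property $f(x) = E[f(X)] + \sum_{i=1}^{M} \phi_i$, where $E[f(X)]$ is the model's expected output, $M$ is the number of features, and each $\phi_i$ is the Shapley value of feature $i$, i.e. its average marginal contribution over all possible feature coalitions. (This is the standard SHAP formulation; the notation is not taken from this notebook.) The waterfall plot later in the notebook visualizes exactly this decomposition.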
Example Application: Finnish house price prediction
In this notebook we showcase the power of SHAP in the context of predicting house prices in Finland. A predictive model for house price estimation can, for example, add significant value to realtors by automating parts of the job and reducing errors.
# data analysis and wrangling
import pandas as pd
import numpy as np
# visualization
import matplotlib.pyplot as plt
%matplotlib inline
# machine learning
import catboost
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score
# shap
import shap
shap.initjs()
The data in use is a 12-month snapshot (from 2020 and 2021) of the Finnish realtors’ database of apartment selling prices. It is a dataset scraped from asuntojen.hintatiedot.fi and holds $\sim$ 40k observations.
Target Variable $y$: selling price per square meter (€/m2)
Features $X$: postal code, building type, size ($m^2$), age, own lot, floor, and condition
# dataset
df = pd.read_csv('datasets/fin_house_clea.csv',
                 sep=",",
                 dtype={'Postal_code': 'str'},
                 encoding="utf-8",
                 index_col=0)
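As a quick sanity check (a small sketch, not part of the original notebook), we can confirm that the loaded frame matches the $\sim$ 40k figure quoted above and peek at a few rows:
# quick look at the loaded data (sketch)
print(df.shape)               # expect roughly 40k rows
print(df.columns.tolist())    # should include the feature and target columns used below
df.head()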
# features
X = df[['Postal_code', 'Building_type', 'm2', 'Age', 'Own_lot', 'Floor', 'Condition']]
# target variable
y = df['€/m2']
# data split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state=42)
We use CatBoost, a gradient boosting model on decision trees. It is advertised as delivering great quality without parameter tuning, and it can handle categorical features without pre-processing.
# initiating the model
cb = catboost.CatBoostRegressor(loss_function='RMSE')
# training the model
cb.fit(X=X_train,
       y=y_train,
       cat_features=['Postal_code', 'Building_type'],
       silent=True)
# performance
print('Training MSE: ' + str(round(mean_squared_error(y_train, cb.predict(X_train)), 0)))
print('Test MSE: ' + str(round(mean_squared_error(y_test, cb.predict(X_test)), 0)))
print('\nR-squared: ' + str(round(r2_score(y_test, cb.predict(X_test)), 3)))
Training MSE: 218935.0
Test MSE: 255711.0

R-squared: 0.921
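Since MSE is expressed in squared units of the target, $(€/m^2)^2$, the corresponding RMSE is easier to interpret. A small sketch using the functions already imported above:
# RMSE in the target's own units (€/m2)
print('Training RMSE: ' + str(round(np.sqrt(mean_squared_error(y_train, cb.predict(X_train))), 1)))
print('Test RMSE: ' + str(round(np.sqrt(mean_squared_error(y_test, cb.predict(X_test))), 1)))
# with the MSE values printed above, these come out to roughly 468 and 506 €/m2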
Now it is finally time to showcase the power of SHAP! Instead of just taking the predictions as given, we can use SHAP to analyze how each feature value affects the output.
Let us take one instance from the test set and see how well our model predicts the price per square meter $(€/m^2)$.
print("Features:\n\n", X_test.iloc[100])
print("\n\nTrue label:\n\n", y_test.iloc[100], "€/m2")
print("\n\nPrediction:\n\n", cb.predict(X_test.iloc[100:101])[0], "€/m2")
Features:

Postal_code      90120
Building_type       kt
m2                61.5
Age                 58
Own_lot          False
Floor                2
Condition            4
Name: 21667, dtype: object


True label:

1902 €/m2


Prediction:

1920.5714261325602 €/m2
So our model predicted the price for this apartment to be about 1920 $€/m^2$ while the true price is 1902 $€/m^2$. Traditionally we would have no idea how our model arrived at this outcome, but with SHAP we can assign each feature an importance value:
# compute SHAP values
explainer = shap.TreeExplainer(cb)
shap_values = explainer(X_test)
# the waterfall plot shows how we get from shap_values.base_values to the model output
plt.title("SHAP for a single prediction", fontsize=15)
shap.plots.waterfall(shap_values[100])
The waterfall plot above visualizes the SHAP values for each of the features. The base value $E[f(X)] = 2643.574$ is the expected prediction if we did not have any features to help. Features with a positive SHAP value increase the prediction, while negative ones decrease it. For this particular prediction we can see that the apartment age (SHAP of -367.57) and postal code (SHAP of -244.71) have the biggest (negative) influence on the model output. So instead of blindly relying on the predicted value, we now know that the old age and the less expensive living district are driving the price per square meter down for this apartment.
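One useful property to verify (a small sketch, not part of the original analysis) is SHAP's additivity: the base value plus the sum of the per-feature SHAP values reproduces the model prediction for this instance.
# additivity check: base value + sum of SHAP values equals the model output
print('Base value + sum of SHAP values:',
      shap_values[100].base_values + shap_values[100].values.sum())
print('Model prediction:               ', cb.predict(X_test.iloc[100:101])[0])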
To understand our model's decision making more generally, we can simultaneously visualize all of the SHAP values in the test set:
plt.title("SHAP explaining test set predictions", fontsize=15)
shap.summary_plot(shap_values, X_test, max_display=12, show=False)
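As a complementary global view (a sketch, not in the original analysis), the mean absolute SHAP value per feature can also be drawn as a simple bar chart:
# mean(|SHAP value|) per feature as a global importance measure
shap.plots.bar(shap_values, max_display=12)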
Interpreting the summary plot above: features are ordered by overall importance, and the coloring shows whether high or low feature values push the prediction up or down.
Now that we have an overview of the model's decision making (how each feature affects the outcome), we can take a closer look at the features that are most important for the model. In the case of Finnish house price prediction these are postal code, age, and $m^2$. To reveal the effect of a particular feature across the whole dataset we can visualize its SHAP values in a dependence plot. Let's do this for age and $m^2$.
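# recompute the SHAP values as a plain numpy array for the dependence_plot call below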
shap_values = explainer.shap_values(X_test)
shap.dependence_plot("Age", shap_values, X_test,
interaction_index='m2', xmax=150)
The age plot shows that our model assigns an increasingly positive impact on the output for buildings older than roughly 60 years. For these older buildings, smaller apartments (low $m^2$) seem to have an even higher positive impact. This may indicate that the model captures the positive effect that renovations and valuable districts (e.g. small flats in downtown Helsinki) have on the prices of older houses.
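For completeness, the corresponding dependence plot for $m^2$ can be produced in the same way (a sketch; only the age plot is part of the original analysis):
# dependence plot for apartment size, colored by the age interaction
shap.dependence_plot("m2", shap_values, X_test,
                     interaction_index='Age')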
Such interactions are hard for humans to identify with this precision. Simpler models, e.g. linear ones, are also unable to capture these kinds of patterns, as they would naively predict the price to decrease linearly with age. Thus SHAP allows us to achieve high performance with complex models while keeping the model behavior transparent.
Throughout this notebook we have explored how SHAP values can help users interpret the predictions of complex models in the context of house price prediction in Finland. SHAP does this by assigning each feature an importance value for a particular prediction.
We showed how SHAP can (1) explain the output of a single prediction, (2) give an overview of the model's decision making, and (3) show the effect of a single feature across the whole dataset.
In addition to improving the user experience and trust in predictions through interpretability, SHAP can help in implementing the right to explanation. By opening up the black box, SHAP provides an explanation for the model output, something that may be required for decisions that significantly affect an individual.
Let's keep in touch. I'm always happy to receive messages, whether it is a general question, suggestion, or if you just want to say hi.