# Use Cases

## What Aspects of a Model Are Interpretable?

• Which features does the model consider most important?
• For a single prediction, how does each feature affect the final result?
• Looking across a large set of records, how does each feature affect the model's predictions overall?

Interpretability is useful for:

• Debugging models
• Guiding engineers in feature engineering
• Directing future data collection
• Informing human decision-making
• Building trust between models and people

# Permutation Importance

• Fast to compute
• Widely used and understood
• Consistent with properties we would want a feature importance measure to have

## How It Works

• Train the model
• Take one feature column and randomly shuffle its values, then re-run the model's predictions and see how much your metric or loss function changes
• Restore the shuffled column, move on to the next column, and repeat until every column has been scored
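The steps above can be sketched by hand in a few lines. This is a minimal illustration on a synthetic dataset (`make_classification` here is a stand-in, not the FIFA data used in the example below); the eli5 library used later does the same thing with repeated shuffles and nicer output:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Toy data: only a couple of columns carry signal
X, y = make_classification(n_samples=500, n_features=5, n_informative=2,
                           n_redundant=0, random_state=0)
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

base_score = accuracy_score(val_y, model.predict(val_X))

rng = np.random.RandomState(1)
importances = []
for col in range(val_X.shape[1]):
    X_perm = val_X.copy()           # work on a copy so the data is "restored"
    rng.shuffle(X_perm[:, col])     # shuffle just this one column
    score = accuracy_score(val_y, model.predict(X_perm))
    importances.append(base_score - score)  # drop in accuracy = importance

print(importances)
```

Shuffling an informative column should cost noticeably more accuracy than shuffling a noise column.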

## Code Example

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
```

```python
import eli5
from eli5.sklearn import PermutationImportance

perm = PermutationImportance(my_model, random_state=1).fit(val_X, val_y)
eli5.show_weights(perm, feature_names=val_X.columns.tolist())
```

# Partial Dependence Plots

• Among all the features of a house, how do longitude and latitude affect its price? Put differently, how similarly are houses of the same size priced across different areas?
• When predicting people's health, how much difference does diet make, or are there other, more important factors?
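Before reaching for the pdpbox library below, it helps to see how a 1D partial dependence plot is computed: fix one feature to each value on a grid for every row, predict, and average the predictions. A minimal sketch on synthetic data (`make_regression` is a hypothetical stand-in dataset, not the FIFA data):

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=300, n_features=3, n_informative=3,
                       random_state=0)
model = RandomForestRegressor(random_state=0).fit(X, y)

feature = 0
grid = np.linspace(X[:, feature].min(), X[:, feature].max(), 20)

pdp_values = []
for value in grid:
    X_mod = X.copy()
    X_mod[:, feature] = value                       # force every row to this value
    pdp_values.append(model.predict(X_mod).mean())  # average prediction at this value
```

Plotting `pdp_values` against `grid` gives the partial dependence curve for that feature.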

## Code Example

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
tree_model = DecisionTreeClassifier(random_state=0, max_depth=5, min_samples_split=5).fit(train_X, train_y)
```

```python
from sklearn import tree
import graphviz

tree_graph = tree.export_graphviz(tree_model, out_file=None, feature_names=feature_names)
graphviz.Source(tree_graph)
```

```python
from matplotlib import pyplot as plt
from pdpbox import pdp, get_dataset, info_plots

# prepare the data needed for plotting
pdp_goals = pdp.pdp_isolate(model=tree_model, dataset=val_X, model_features=feature_names, feature='Goal Scored')

# draw the partial dependence plot for the 'Goal Scored' feature
pdp.pdp_plot(pdp_goals, 'Goal Scored')
plt.show()
```

```python
feature_to_plot = 'Distance Covered (Kms)'
pdp_dist = pdp.pdp_isolate(model=tree_model, dataset=val_X, model_features=feature_names, feature=feature_to_plot)

pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()
```

```python
# Build Random Forest model
rf_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)

pdp_dist = pdp.pdp_isolate(model=rf_model, dataset=val_X, model_features=feature_names, feature=feature_to_plot)

pdp.pdp_plot(pdp_dist, feature_to_plot)
plt.show()
```

## 2D Partial Dependence Plots

```python
# Similar to previous PDP plots, except we use pdp_interact instead of
# pdp_isolate and pdp_interact_plot instead of pdp_plot
features_to_plot = ['Goal Scored', 'Distance Covered (Kms)']
inter1 = pdp.pdp_interact(model=tree_model, dataset=val_X, model_features=feature_names, features=features_to_plot)

pdp.pdp_interact_plot(pdp_interact_out=inter1, feature_names=features_to_plot, plot_type='contour')
plt.show()
```

# SHAP Values

## How They Work

The NIPS paper is A Unified Approach to Interpreting Model Predictions; the blog post One Feature Attribution Method to (Supposedly) Rule Them All: Shapley Values is also worth reading. The idea behind SHAP values comes from the Shapley value in game theory (named after Lloyd Stowell Shapley, winner of the 2012 Nobel Memorial Prize in Economic Sciences). The math runs deep, but the rough idea is this: compute the marginal contribution a feature makes when it is added to the model, then consider that feature's marginal contribution under every possible ordering of the features, and take the average. That average is the feature's SHAP value.

```
sum(SHAP values for all features) = pred_for_team - pred_for_baseline_values
```
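The averaging-over-orderings idea can be made concrete with a brute-force computation on a toy linear "model" (the `predict` function, baseline, and row values below are made up purely for illustration). For a linear model, each feature's SHAP value reduces to its coefficient times the feature's deviation from the baseline, and the additivity property above holds exactly:

```python
import itertools
import math
import numpy as np

# Toy "model": a linear function of three features, with coefficients 2, 1, 0
def predict(x):
    return 2 * x[0] + 1 * x[1] + 0 * x[2]

baseline = np.array([0.0, 0.0, 0.0])  # reference values for "absent" features
row = np.array([1.0, 1.0, 1.0])       # the instance we want to explain

n = 3
shap_vals = np.zeros(n)
for order in itertools.permutations(range(n)):
    x = baseline.copy()
    prev = predict(x)
    for i in order:
        x[i] = row[i]                  # add feature i to the coalition
        curr = predict(x)
        shap_vals[i] += curr - prev    # marginal contribution in this ordering
        prev = curr
shap_vals /= math.factorial(n)         # average over all n! orderings
```

Here `shap_vals` comes out to `[2, 1, 0]`, and the values sum to `predict(row) - predict(baseline)`, exactly the additivity equation above. Real implementations like the shap library avoid this factorial blow-up with model-specific shortcuts.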

## Code Example

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
```

```python
import shap  # package used to calculate Shap values

row_to_show = 5
data_for_prediction = val_X.iloc[row_to_show]  # use 1 row of data here. Could use multiple rows if desired
data_for_prediction_array = data_for_prediction.values.reshape(1, -1)

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# Calculate Shap values
shap_values = explainer.shap_values(data_for_prediction)
```

```python
shap.initjs()
shap.force_plot(explainer.expected_value[1], shap_values[1], data_for_prediction)
```

• shap.DeepExplainer is for explaining deep learning models
• shap.KernelExplainer can explain any model; it runs slower than the other explainers and only gives approximate SHAP values, but it is model-agnostic.

```python
# use Kernel SHAP to explain test set predictions
k_explainer = shap.KernelExplainer(my_model.predict_proba, train_X)
k_shap_values = k_explainer.shap_values(data_for_prediction)
shap.force_plot(k_explainer.expected_value[1], k_shap_values[1], data_for_prediction)
```

# Advanced Uses of SHAP Values

## Recap

SHAP values tell us how much each feature contributed to a single prediction.

The shap library is a handy tool: it computes the concrete SHAP values and provides good visualizations of them.

## Summary Plots

Permutation importance is useful because it captures a feature's importance to the model in a single, simple number. But it cannot distinguish between two situations that both produce a medium permutation importance: 1) the feature has a large effect on a few predictions but little effect overall, or 2) the feature has a medium effect on all predictions.

• The vertical position shows which feature a point belongs to
• The color shows whether that feature's value is high or low for that row of the data
• The horizontal position shows whether that value pushed the prediction higher or lower

• The model almost entirely ignores the **Yellow & Red** and **Red** features
• Yellow Card usually has little effect, but in a few extreme cases it lowered the prediction substantially
• Goal Scored is, broadly, positively correlated with the prediction

## Code Example: Summary Plots

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

data = pd.read_csv('../input/fifa-2018-match-statistics/FIFA 2018 Statistics.csv')
y = (data['Man of the Match'] == "Yes")  # Convert from string "Yes"/"No" to binary
feature_names = [i for i in data.columns if data[i].dtype in [np.int64]]
X = data[feature_names]
train_X, val_X, train_y, val_y = train_test_split(X, y, random_state=1)
my_model = RandomForestClassifier(random_state=0).fit(train_X, train_y)
```

```python
import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# calculate shap values. This is what we will plot.
# Calculate shap_values for all of val_X rather than a single row, to have more data for plot.
shap_values = explainer.shap_values(val_X)

# Make plot. Index of [1] is explained in text below.
shap.summary_plot(shap_values[1], val_X)
```

• When plotting, we pass `shap_values[1]`. For classification problems, `shap_values` is a list of arrays, one per class, each holding that class's SHAP values. Here `shap_values[1]` holds the values for the positive (True) class.
• Computing SHAP values can be slow. That is not a problem here because this dataset is small, but be careful not to use too large a dataset when you apply this later. One exception: when using xgboost, SHAP has optimizations that make it faster.

## Code Example: SHAP Dependence Contribution Plots

```python
import shap  # package used to calculate Shap values

# Create object that can calculate shap values
explainer = shap.TreeExplainer(my_model)

# calculate shap values. This is what we will plot.
shap_values = explainer.shap_values(X)

# make plot.
shap.dependence_plot('Ball Possession %', shap_values[1], X, interaction_index="Goal Scored")
```