I have the following dataframe:

```python
import pandas as pd
import random
import xgboost
import shap

foo = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']})
```
12
For which I want to run a classification algorithm in order to predict the 3 classes.

So I split my dataset into train and test and I run an XGBoost model:
```python
from sklearn.model_selection import train_test_split

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

model = xgboost.XGBClassifier(objective="binary:logistic")
model.fit(X_train, y_train)
```
9
Now I would like to get the mean SHAP values for each class, instead of the mean of the absolute SHAP values generated by this code:

```python
shap_values = shap.TreeExplainer(model).shap_values(X_test)
shap.summary_plot(shap_values, X_test)
```
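For context, the summary-plot bars are built from mean *absolute* SHAP values, which is why the sign information disappears. The signed per-class means can be computed directly from the raw arrays — a minimal sketch with made-up numbers, assuming `shap_values` is a list with one `(n_samples, n_features)` array per class, as `TreeExplainer.shap_values` returns for multiclass models:

```python
import numpy as np
import pandas as pd

# Hypothetical SHAP output: one (n_samples, n_features) array per class
shap_values = [np.array([[0.2, -0.1],
                         [0.4, -0.3]]),
               np.array([[-0.2, 0.1],
                         [-0.4, 0.3]])]
features = ['var1', 'var2']

# Signed mean (not mean of absolutes) per feature, one column per class
signed_means = pd.DataFrame(
    {f'class_{k}': vals.mean(axis=0) for k, vals in enumerate(shap_values)},
    index=features)
print(signed_means)
```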
Also, the plot labels the classes as 0, 1 and 2. How can I know which of the original classes 0, 1 and 2 correspond to?
Because this code:

```python
shap.summary_plot(shap_values, X_test,
                  class_names=['a', 'b', 'c'])
```
gives a plot with one legend, and this code:

```python
shap.summary_plot(shap_values, X_test,
                  class_names=['b', 'c', 'a'])
```

gives a plot with a different legend.
So I am not sure about the legend anymore. Any ideas?
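(For what it's worth, the `class_names` argument only relabels the legend; it does not reorder anything. The actual index-to-label mapping comes from the label encoding, which in scikit-learn-style encoders sorts the labels, and `XGBClassifier` exposes the fitted ordering via its `classes_` attribute. A small sketch of the sorted-label behavior using scikit-learn's `LabelEncoder` — which I believe matches the encoding the wrapper applies, though that is an assumption:)

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
encoded = le.fit_transform(['a', 'a', 'b', 'c', 'c'])

# Classes are stored in sorted order: index 0 -> 'a', 1 -> 'b', 2 -> 'c'
print(list(le.classes_))  # ['a', 'b', 'c']
print(list(encoded))      # [0, 0, 1, 2, 2]
```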
Answer
By doing some research and with the help of this post and @Alessandro Nesti's answer, here is my solution:
```python
import numpy as np
import plotly_express as px

foo = pd.DataFrame({'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
                    'var1': random.sample(range(1, 100), 10),
                    'var2': random.sample(range(1, 100), 10),
                    'var3': random.sample(range(1, 100), 10),
                    'class': ['a', 'a', 'a', 'a', 'a', 'b', 'b', 'c', 'c', 'c']})

cl_cols = foo.filter(regex='var').columns
X_train, X_test, y_train, y_test = train_test_split(foo[cl_cols],
                                                    foo[['class']],
                                                    test_size=0.33, random_state=42)

model = xgboost.XGBClassifier(objective="multi:softmax")
model.fit(X_train, y_train)

# Recompute the SHAP values for the multiclass model: one array per class
shap_values = shap.TreeExplainer(model).shap_values(X_test)


def get_ABS_SHAP(df_shap, df):
    # Make a copy of the input data
    shap_v = pd.DataFrame(df_shap)
    feature_list = df.columns
    shap_v.columns = feature_list
    df_v = df.copy().reset_index().drop('index', axis=1)

    # Determine the correlation in order to plot with different colors
    corr_list = list()
    for i in feature_list:
        b = np.corrcoef(shap_v[i], df_v[i])[1][0]
        corr_list.append(b)
    corr_df = pd.concat([pd.Series(feature_list), pd.Series(corr_list)], axis=1).fillna(0)

    # Make a data frame. Column 1 is the feature, and Column 2 is the correlation coefficient
    corr_df.columns = ['Variable', 'Corr']
    corr_df['Sign'] = np.where(corr_df['Corr'] > 0, 'red', 'blue')

    # Mean absolute SHAP value per feature, signed by the correlation direction
    shap_abs = np.abs(shap_v)
    k = pd.DataFrame(shap_abs.mean()).reset_index()
    k.columns = ['Variable', 'SHAP_abs']
    k2 = k.merge(corr_df, left_on='Variable', right_on='Variable', how='inner')
    k2 = k2.sort_values(by='SHAP_abs', ascending=True)

    k2_f = k2[['Variable', 'SHAP_abs', 'Corr']].copy()
    k2_f['SHAP_abs'] = k2_f['SHAP_abs'] * np.sign(k2_f['Corr'])
    k2_f.drop(columns='Corr', inplace=True)
    k2_f.rename(columns={'SHAP_abs': 'SHAP'}, inplace=True)

    return k2_f


foo_all = pd.DataFrame()

# model.classes_ holds the original labels in the order SHAP uses,
# so each per-class SHAP array can be tagged with its real class name
for k, v in enumerate(model.classes_):
    foo = get_ABS_SHAP(shap_values[k], X_test)
    foo['class'] = v
    foo_all = pd.concat([foo_all, foo])

px.bar(foo_all, x='SHAP', y='Variable', color='class')
```
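The signing trick inside `get_ABS_SHAP` can be checked on toy numbers (made-up data, not from the model above): a feature whose SHAP values rise with the feature value keeps a positive bar, while one whose SHAP values fall with the feature value gets a negative bar:

```python
import numpy as np
import pandas as pd

# Toy values: var1's SHAP grows with the feature (positive correlation),
# var2's shrinks with it (negative correlation)
shap_df = pd.DataFrame({'var1': [0.1, 0.2, 0.3], 'var2': [0.3, 0.2, 0.1]})
feat_df = pd.DataFrame({'var1': [1, 2, 3], 'var2': [1, 2, 3]})

signed = {}
for col in shap_df.columns:
    corr = np.corrcoef(shap_df[col], feat_df[col])[1][0]
    signed[col] = shap_df[col].abs().mean() * np.sign(corr)

print(signed)  # var1 -> +0.2, var2 -> -0.2
```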