Background of the Problem
I want to explain the output of machine learning (ML) models using SHapley Additive exPlanations (SHAP), implemented in the shap library for Python. As a parameter of the function shap.Explainer(), I need to pass an ML model (e.g. XGBRegressor()). However, in each iteration of Leave One Out Cross Validation (LOOCV), the ML model will be different, since in each iteration I am training on a different dataset (one participant's data is left out). The model will also differ because I am doing feature selection in each iteration.
My Question
In LOOCV, how can I use the shap.Explainer() function of the shap library to present the performance of a machine learning model? Note that I have checked several tutorials (e.g. this one, this one) and several questions (e.g. this one) on SO, but I failed to find an answer to this problem.
Thanks for reading!
Update
I know that in LOOCV, the model found in each iteration can be explained by shap.Explainer(). However, as there are 250 participants' data, if I apply shap here for each model, there will be 250 outputs! Thus, I want to get a single output that presents the performance of the 250 models.
Answer
You seem to train a model on 250 data points while doing LOOCV. That exercise is about choosing a model, with hyperparameters, that will ensure the best generalization ability.
Model explanation is different from training in that you don't sift through different sets of hyperparameters (note, 250-fold LOOCV is already overkill; would you do that with 250,000 rows?). Rather, you are trying to understand which features influence the output, in what direction, and by how much.
Training has its own limitations (availability of data, whether new data resembles the data the model was trained on, whether the model is good enough to pick up the peculiarities of the data and generalize well, etc.), but don't overestimate the explanation exercise either. It's still an attempt to understand how inputs influence outputs. You may be willing to average 250 different matrices of SHAP values, but do you expect the result to be much different from a single random train/test split?
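If you do want a single output from the LOOCV loop, a minimal sketch could collect the SHAP row for each held-out participant and summarize once. This assumes a feature matrix X (pandas DataFrame) and a target array y, which are illustrative names, not from your post; it also ignores the per-fold feature selection for now (see the second sketch below).

```python
# Hedged sketch: stack per-fold SHAP values for the held-out participants
# into one matrix, then summarize once. Assumes X (pandas DataFrame) and
# y (numpy array) already exist; these names are illustrative.
import numpy as np
import shap
from sklearn.model_selection import LeaveOneOut
from xgboost import XGBRegressor

rows = []  # one SHAP row per held-out participant
for train_idx, test_idx in LeaveOneOut().split(X):
    model = XGBRegressor().fit(X.iloc[train_idx], y[train_idx])
    # Use the training fold as background data for the explainer
    explainer = shap.Explainer(model, X.iloc[train_idx])
    rows.append(explainer(X.iloc[test_idx]).values)  # shape (1, n_features)

shap_matrix = np.vstack(rows)  # (n_participants, n_features): one output

# A single summary across all 250 folds, e.g. mean |SHAP| per feature:
mean_abs_shap = np.abs(shap_matrix).mean(axis=0)
```

Each participant's row is produced by the model that never saw that participant, so the stacked matrix is one out-of-sample explanation for the whole cohort rather than 250 separate outputs.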
Note as well:
However, in each iteration of Leave One Out Cross Validation (LOOCV), the ML model will be different, since in each iteration I am training on a different dataset (one participant's data is left out).
In each iteration of LOOCV the model is still the same (same features; hyperparameters may be different, depending on your definition of an iteration). It's still the same dataset (same features).
The model will also differ because I am doing feature selection in each iteration.
That doesn't matter. Feed the resulting model to the SHAP explainer and you'll get what you want.
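If the per-fold feature selection leaves each fold's SHAP row over a different column subset, one hedged way to still combine them is to re-embed each row into the full feature space before stacking. The names all_features and selected below are illustrative assumptions, not from your post.

```python
# Hedged sketch: features not selected in a fold contributed nothing to
# that fold's model, so they get 0 in the re-embedded row.
import numpy as np

def expand_to_full(shap_row, selected, all_features):
    """Map a fold's SHAP values (over its selected features) onto the
    full feature list; unselected features are filled with 0."""
    full = np.zeros(len(all_features))
    for name, value in zip(selected, shap_row):
        full[all_features.index(name)] = value
    return full
```

With rows expanded this way, the stacking and summary step from the earlier sketch applies unchanged.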