I’m reading about decision trees and bagging classifiers, and I’m trying to show the first decision tree that is used in the bagging classifier. I’m confused about the output.
```python
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn import tree
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import export_graphviz
from graphviz import Source

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

bag_clf = BaggingClassifier(
    DecisionTreeClassifier(), n_estimators=500,
    max_samples=100, bootstrap=True, n_jobs=-1)
bag_clf.fit(X_train, y_train)

Source(tree.export_graphviz(bag_clf.estimators_[0], out_file=None))
```
Here’s a snippet of the output:
It’s been my understanding that `value` is supposed to show how many of the samples are classified as each category. In that case, shouldn’t the numbers in the `value` field add up to the `samples` field? Why is that not the case here?
Answer
Nice catch.
It would seem that the extra bootstrap samples are included in `value`, but not in the total `samples`; repeating your code verbatim but changing to `bootstrap=False` eliminates the discrepancy.
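If you want to see this without reading the exported graph, a rough sketch (not part of the original answer) is to compare the fitted tree’s unweighted and weighted node counts directly; as far as I can tell, the `samples` field in the plot follows the former, while `value` follows the latter. Exact numbers will vary with the random draws and your scikit-learn version:

```python
from sklearn.datasets import make_moons
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_moons(n_samples=500, noise=0.30, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

for bootstrap in (True, False):
    bag_clf = BaggingClassifier(
        DecisionTreeClassifier(), n_estimators=500,
        max_samples=100, bootstrap=bootstrap, n_jobs=-1)
    bag_clf.fit(X_train, y_train)

    first_tree = bag_clf.estimators_[0].tree_
    root = 0  # node id of the root node

    # n_node_samples counts the distinct training rows reaching the node
    # (what the exported graph reports as `samples`); weighted_n_node_samples
    # also counts bootstrap duplicates via sample weights (what `value`
    # reflects in the version used in the question).
    print(f"bootstrap={bootstrap}: "
          f"samples={first_tree.n_node_samples[root]}, "
          f"weighted samples={first_tree.weighted_n_node_samples[root]}")
```

The intuition is that when the base estimator accepts sample weights, the bagging step appears to pass the bootstrap counts as weights rather than duplicating rows, so a sample drawn several times only adds weight, not extra rows; with `bootstrap=False` every drawn row has weight one and the two counts agree.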
The behavior is similar in Random Forest, both classifier and regressor – see respectively: