I’m currently working on a model to predict the probability of fatality once a person is infected with the coronavirus. I’m using a Dutch dataset with categorical variables: date of infection, fatality or cured, gender, age group, etc. It was suggested that I use a decision tree, which I’ve already built. Since I’m new to decision trees, I would like some assistance. I would like the prediction (target variable) expressed as a probability (%), not as a binary output. How can I achieve this? I also want to experiment with samples by entering the data myself and seeing what the outcome is. For instance: take someone who is 40, male, etc., and calculate what their survival chance is. How can I achieve this? I’ve attached the code below:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd

filename = '/Users/sef/Downloads/pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]
Y = array[:,8]
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)

model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
# Output echoed after fitting (the model's parameters):
# DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
#                        max_features=None, max_leaf_nodes=None,
#                        min_impurity_decrease=0.0, min_impurity_split=None,
#                        min_samples_leaf=1, min_samples_split=2,
#                        min_weight_fraction_leaf=0.0, presort=False,
#                        random_state=None, splitter='best')

rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = YHat
print(df)
Answer
You can use the predict_proba method of DecisionTreeClassifier to get class probabilities instead of the binary class labels. It returns one value between 0 and 1 per class for each row; multiply by 100 if you want a percentage.
To test an individual sample that you create by hand, build an array with the same columns as your X_test data but only a single row. You can then pass it to model.predict(array) or model.predict_proba(array).
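As a rough illustration of the two points above, here is a minimal sketch that builds one sample by hand and prints its class probabilities as percentages. It makes several assumptions that are not in the question: the data is synthetic (generated with make_classification as a stand-in for the real file), the feature values in sample are made up, and max_depth=3 is an arbitrary choice.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic stand-in data with 8 numeric features and a binary target.
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Assumption: max_depth=3 is an arbitrary illustrative value.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X, y)

# One hand-made sample: a 2D array with a single row and the same 8 columns.
# The values are invented for illustration.
sample = np.array([[0.5, -1.2, 0.3, 0.0, 1.1, -0.7, 0.2, 0.9]])

# predict_proba returns one row per sample and one column per class.
proba = model.predict_proba(sample)[0]

# model.classes_ gives the class label that belongs to each column.
for label, p in zip(model.classes_, proba):
    print(f"class {label}: {p * 100:.1f}%")
```

With real data, the row has to list the features in the same order, and with the same encoding, as the columns the model was trained on.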
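By the way, your tree is currently not useful for retrieving probabilities. With the default settings (no depth limit, min_samples_leaf=1) the tree keeps splitting until its leaves are pure, so predict_proba will generally return only 0 or 1. There is an article that explains the problem very well: https://rpmcruz.github.io/machine%20learning/2018/02/09/probabilities-trees.html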
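To see the effect, the following sketch compares the distinct probabilities produced by an unrestricted tree with those of a depth-limited one. It runs on synthetic data (an assumption, with some label noise added through flip_y), and max_depth=3 is again just an illustrative value.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: synthetic data with about 10% noisy labels as a stand-in for real data.
X, y = make_classification(n_samples=600, n_features=8, flip_y=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Unrestricted tree: grows until every leaf contains a single class.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Depth-limited tree: at most 2**3 = 8 leaves, which usually stay mixed.
shallow_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

# Distinct predicted probabilities for the positive class on the test set.
full_values = np.unique(full_tree.predict_proba(X_test)[:, 1])
shallow_values = np.unique(shallow_tree.predict_proba(X_test)[:, 1])

print("unlimited depth:", full_values)
print("max_depth=3:    ", shallow_values)
```

The unrestricted tree reports only hard 0/1 values, while the shallow tree reports the class fraction of each leaf, which is what makes its output usable as a probability estimate.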
So you can fix your code by setting the max_depth parameter of your tree:
from pandas import read_csv
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
import pandas as pd
import random as rnd

filename = 'pima-indians-diabetes.csv'
names = ['preg', 'plas', 'pres', 'skin', 'test', 'mass', 'pedi', 'age', 'class']
dataframe = read_csv(filename, names=names)
array = dataframe.values
X = array[:,0:8]   # the eight feature columns
Y = array[:,8]     # the target column ('class')
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.3, random_state=1234)

# Limit the depth so that leaves stay mixed and predict_proba can return
# values other than 0 and 1. All other parameters keep their default values.
model = DecisionTreeClassifier(max_depth=1)
model.fit(X_train, Y_train)

# Pick one random row and reshape it to a single-row 2D array.
rnd.seed(123458)
X_new = X[rnd.randrange(X.shape[0])]
X_new = X_new.reshape(1,8)

# YHat has shape (1, 2): one probability per class.
YHat = model.predict_proba(X_new)
df = pd.DataFrame(X_new, columns = names[:-1])
df["predicted"] = list(YHat)   # store both class probabilities in one cell
print(df)