Decision tree with a probability target

I’m currently working on a model to predict the probability of fatality once a person is infected with the coronavirus. I’m using a Dutch dataset with categorical variables: date of infection, fatality or cured, gender, age-group, etc. It was suggested to use a decision tree, which I’ve already built. Since I’m new to decision trees I would like some assistance. I would like to have the prediction (target variable) expressed as a probability (%), not as a binary output. How can I achieve this? I also want to play around with samples by inputting the data myself and seeing the predicted probability.
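A fitted sklearn DecisionTreeClassifier already exposes this through predict_proba, which returns the class frequencies of the leaf a sample lands in rather than a hard 0/1 label. A minimal sketch, assuming the data sits in a DataFrame with placeholder column names gender, age_group and fatality:

import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Invented stand-in for the Dutch dataset; column names are placeholders.
df = pd.DataFrame({
    "gender":    ["M", "F", "F", "M", "F", "M", "M", "F"],
    "age_group": ["70+", "0-69", "70+", "0-69", "70+", "70+", "0-69", "0-69"],
    "fatality":  [1, 0, 1, 0, 0, 1, 0, 0],   # 1 = fatality, 0 = cured
})

X = pd.get_dummies(df[["gender", "age_group"]])   # one-hot encode the categorical features
y = df["fatality"]

clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)

# predict_proba returns a probability per class instead of a hard 0/1 label;
# column 1 is the estimated probability of fatality for each input row.
new_case = X.iloc[[0]]          # or build your own row with the same columns
print(clf.predict_proba(new_case)[:, 1])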

I keep getting ValueError: Shapes (10, 1) and (10, 3) are incompatible when training my model

Changing the number of inputs when I call makeModel from 3 to 1 allows the program to run without errors, but no training actually happens and the accuracy doesn’t change.

Answer: LabelEncoder transforms the input to an array of integer-encoded values, i.e. if your input is ["paris", "paris", "tokyo", "amsterdam"] then it can be encoded as [0, 0, 1, 2]. This is not the one-hot encoding scheme expected by the categorical_crossentropy loss. If you have an integer encoding you have to use sparse_categorical_crossentropy instead. Fix: change the loss in your code to sparse_categorical_crossentropy.
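A minimal sketch of the fix, assuming a Keras model with a 3-class softmax output and integer labels from LabelEncoder (the layer sizes and dummy features are illustrative):

import numpy as np
import tensorflow as tf
from sklearn.preprocessing import LabelEncoder

labels = ["paris", "paris", "tokyo", "amsterdam"]
y = LabelEncoder().fit_transform(labels)   # integer class labels (sorted alphabetically): [1, 1, 2, 0]
X = np.random.rand(len(y), 4)              # dummy features, shape (n_samples, 4)

model = tf.keras.Sequential([
    tf.keras.layers.Dense(8, activation="relu", input_shape=(4,)),
    tf.keras.layers.Dense(3, activation="softmax"),   # one output unit per class
])

# sparse_categorical_crossentropy accepts integer class labels directly,
# so the (10, 1) vs (10, 3) shape mismatch disappears.
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X, y, epochs=3, verbose=0)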

sklearn roc_auc_score with multi_class="ovr" should have average=None available

I’m trying to compute the AUC score for a multiclass problem using sklearn’s roc_auc_score() function. I have a prediction matrix of shape [n_samples, n_classes] and a ground-truth vector of shape [n_samples], named np_pred and np_label respectively. What I’m trying to achieve is the set of AUC scores, one per class. To do so I would like to set the average parameter to None and the multi_class parameter to "ovr", but when I run it I get an error back.

Answer: This error is expected from the sklearn function in the multiclass case; the workaround is to compute the per-class scores yourself by binarizing the ground truth and scoring each class one-vs-rest.
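A sketch of that workaround, with dummy np_pred and np_label just to make it runnable:

import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.preprocessing import label_binarize

# np_pred: (n_samples, n_classes) probability matrix, np_label: (n_samples,) integer labels.
rng = np.random.default_rng(0)
np_pred = rng.dirichlet(np.ones(3), size=100)
np_label = rng.integers(0, 3, size=100)

# Binarize the ground truth and score each class one-vs-rest, which is
# the set of values average=None would have returned.
y_bin = label_binarize(np_label, classes=np.arange(np_pred.shape[1]))
per_class_auc = [roc_auc_score(y_bin[:, c], np_pred[:, c])
                 for c in range(np_pred.shape[1])]
print(per_class_auc)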

How to split parallel corpora while keeping alignment?

I have two text files containing parallel text in two languages (potentially millions of lines). I am trying to generate random train/validate/test files from those files, the way train_test_split does in sklearn. However, when I try to import them into pandas using read_csv I get errors on many of the lines because of erroneous data, and it would be far too much work to fix the broken lines. If I set error_bad_lines=False, it will skip some lines in one of the files and possibly not the other, which would ruin the alignment.
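One way to sidestep read_csv entirely is to treat each file as a plain list of lines and split both sides with the same shuffled indices, so alignment is preserved by construction. A sketch, with the file names as placeholders (for truly huge corpora you would shuffle indices and stream the lines instead of reading everything into memory):

import random

def split_parallel(src_path, tgt_path, ratios=(0.8, 0.1, 0.1), seed=42):
    # Read both sides as raw lines; no CSV parsing, so "broken" lines cannot desync the pair.
    with open(src_path, encoding="utf-8") as f:
        src = f.readlines()
    with open(tgt_path, encoding="utf-8") as f:
        tgt = f.readlines()
    assert len(src) == len(tgt), "parallel files must have the same number of lines"

    idx = list(range(len(src)))
    random.Random(seed).shuffle(idx)

    n_train = int(ratios[0] * len(idx))
    n_valid = int(ratios[1] * len(idx))
    splits = {
        "train": idx[:n_train],
        "valid": idx[n_train:n_train + n_valid],
        "test":  idx[n_train + n_valid:],
    }

    # The same indices are applied to both sides, so line i of <name>.src
    # still aligns with line i of <name>.tgt.
    for name, ids in splits.items():
        with open(f"{name}.src", "w", encoding="utf-8") as f:
            f.writelines(src[i] for i in ids)
        with open(f"{name}.tgt", "w", encoding="utf-8") as f:
            f.writelines(tgt[i] for i in ids)

# Example usage (placeholder file names):
# split_parallel("corpus.en", "corpus.nl")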

Why does TfidfVectorizer.fit_transform() change the number of samples and labels for my text data?

I have a data set that contains 3 columns and 310 rows. The columns are all text. One column is text entered by a user into an inquiry form and a second column holds the labels (one of six) that say which inquiry category the input falls into. I am doing the following preprocessing to my data before I run it through the KMeans algorithm from sklearn.cluster. From where I’m looking I seem to have lost data: I no longer have my 310 samples. I believe the shape of vectorized refers to [n_samples, n_features], so why has the number of samples changed?
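Without the exact preprocessing code I can’t be certain, but the usual cause is that fit_transform is fed something other than the 310-row column of documents, for example the whole DataFrame, whose iteration yields only the column names. A sketch with an invented stand-in for the data:

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented stand-in for the 310-row, 3-column inquiry DataFrame.
df = pd.DataFrame({
    "inquiry": [f"sample inquiry text number {i}" for i in range(310)],
    "label":   ["billing" if i % 2 else "account" for i in range(310)],
    "date":    ["2020-05-01"] * 310,
})

# Correct: pass the column of documents, one row per sample.
vectorized = TfidfVectorizer().fit_transform(df["inquiry"])
print(vectorized.shape)   # (310, n_features)

# Common mistake: passing the whole DataFrame iterates over the column
# names, so only 3 "documents" are vectorized and the 310 samples seem lost.
wrong = TfidfVectorizer().fit_transform(df)
print(wrong.shape)        # (3, n_features)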

Imbalanced-Learn’s FunctionSampler throws ValueError

I want to use the class FunctionSampler from imblearn to create my own custom class for resampling my dataset. I have a one-dimensional feature Series containing paths for each subject and a label Series containing the labels for each subject. Both come from a pd.DataFrame. I know that I have to reshape the feature array first since it is one-dimensional. When I use the class RandomUnderSampler directly, everything works fine. However, if I pass the features and labels to the fit_resample method of a FunctionSampler whose function creates a RandomUnderSampler and calls fit_resample on it, I get the ValueError.
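Without the exact traceback this is an assumption, but by default FunctionSampler validates X as a numeric 2-D array, which fails for a column of path strings; passing validate=False forwards the raw arrays to your function instead. A sketch under that assumption, with invented paths and labels:

import numpy as np
import pandas as pd
from imblearn import FunctionSampler
from imblearn.under_sampling import RandomUnderSampler

def resample(X, y):
    # Custom resampling logic: here it simply delegates to RandomUnderSampler,
    # which accepts non-numeric data because it only samples row indices.
    return RandomUnderSampler(random_state=0).fit_resample(X, y)

# Invented stand-ins for the subject paths and labels.
features = pd.Series([f"/data/subject_{i}.nii" for i in range(10)])
labels = pd.Series([0] * 7 + [1] * 3)

X = features.to_numpy().reshape(-1, 1)   # make the 1-D Series a 2-D array

# validate=False stops FunctionSampler from trying to coerce the string
# paths into a numeric array, which is what raises the ValueError.
sampler = FunctionSampler(func=resample, validate=False)
X_res, y_res = sampler.fit_resample(X, labels.to_numpy())
print(X_res.shape, np.bincount(y_res))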

Cross validation with grid search returns worse results than default

I’m using scikit-learn in Python to run some basic machine learning models. Using the built-in GridSearchCV() function, I determined the “best” parameters for different techniques, yet many of these perform worse than the defaults. I include the default parameters as an option, so I’m surprised this would happen. For example: this is the same as the defaults, except max_depth is 3. When I use these parameters, I get an accuracy of 72%, compared to 78% from the defaults. One thing I did, which I will admit is suspicious, is that I used my entire dataset for the cross-validation.
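The suspicious part is the likely culprit: if the grid search is run on the whole dataset and the resulting model is then scored on data it has already seen during tuning, the comparison against the defaults is not apples to apples. A sketch of keeping a held-out test set, with a generic classifier and an illustrative parameter grid:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Keep a test set the search never sees; tune only on the training split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

param_grid = {"max_depth": [None, 3, 5, 10]}   # include the default (None) as an option
search = GridSearchCV(DecisionTreeClassifier(random_state=0),
                      param_grid, cv=5, scoring="accuracy")
search.fit(X_train, y_train)

# Compare tuned vs. default on the same held-out data.
default_acc = DecisionTreeClassifier(random_state=0).fit(X_train, y_train).score(X_test, y_test)
print(search.best_params_, search.score(X_test, y_test), default_acc)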

list of all classification algorithms

I have a classification problem and I would like to test all the available algorithms to compare their performance on the problem. If you know any classification algorithm other than those listed below, please add it here. Your help is highly appreciated.

Answer: The other answers did not provide the full list of classifiers, so I have listed them below.
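Rather than maintaining such a list by hand, scikit-learn can enumerate every classifier it ships via sklearn.utils.all_estimators; a short sketch:

from sklearn.utils import all_estimators

# type_filter="classifier" restricts the listing to estimators that implement
# the classifier interface; each entry is a (name, class) pair.
classifiers = all_estimators(type_filter="classifier")
for name, cls in classifiers:
    print(name)
print(f"{len(classifiers)} classifiers available")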

Merging results from model.predict() with original pandas DataFrame?

I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object. To merge these predictions back with the original df, I try this, but it raises ValueError: Length of values does not match length of index. I know I could split the df into train_df and test_df and the problem would be solved, but in reality I need to follow the path above to create the matrices X and y (my actual problem is a text classification problem in which I normalize the entire feature matrix before splitting into train and test sets).
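The length mismatch comes from assigning predictions for only the test rows to a column of the full df. Since train_test_split preserves the pandas index, you can align the predictions by index instead; a sketch with invented column names:

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Invented stand-in for the original DataFrame.
rng = np.random.default_rng(0)
df = pd.DataFrame({"feat1": rng.random(100),
                   "feat2": rng.random(100),
                   "target": rng.integers(0, 2, 100)})

X = df[["feat1", "feat2"]]
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)

# Wrap the predictions in a Series that carries the test rows' index and assign;
# rows that were not in the test set simply get NaN instead of raising the ValueError.
df["prediction"] = pd.Series(model.predict(X_test), index=X_test.index)
print(df["prediction"].notna().sum(), "rows received a prediction")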

sklearn Logistic Regression “ValueError: Found array with dim 3. Estimator expected <= 2.”

I am attempting to solve problem 6 in this notebook: https://github.com/tensorflow/examples/blob/master/courses/udacity_deep_learning/1_notmnist.ipynb. The question is to train a simple model on this data using 50, 100, 1000 and 5000 training samples with the LogisticRegression model from sklearn.linear_model. This is the code I am trying to run, and it gives me the error ValueError: Found array with dim 3. Estimator expected <= 2. Any ideas? UPDATE 1: updated the link to the Jupyter notebook.

Answer: scikit-learn’s fit expects a 2-D NumPy array for the training data. The dataset you are passing in is a 3-D array, so you need to reshape it to two dimensions first.
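A minimal sketch of that reshape, assuming notMNIST-style arrays of shape (n_samples, 28, 28) as in the notebook (the variable names and random data here are stand-ins):

import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative stand-ins for the notebook's arrays: (n_samples, 28, 28) images.
train_dataset = np.random.rand(1000, 28, 28)
train_labels = np.random.randint(0, 10, size=1000)

n_samples = 50  # repeat for 100, 1000, 5000
X = train_dataset[:n_samples].reshape(n_samples, -1)   # flatten each image to 784 features
y = train_labels[:n_samples]

clf = LogisticRegression(max_iter=1000)
clf.fit(X, y)                                          # 2-D input, no "dim 3" error
print(clf.score(train_dataset[:200].reshape(200, -1), train_labels[:200]))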