Tag: scikit-learn

Creating a new column for predicted cluster: SettingWithCopyWarning

This question will unfortunately be a duplicate, but I could not fix the issue in my code, even after looking at other similar questions and their answers. I need to split my dataset into a train set and a test set. However, it seems I am making a mistake when I add a new column for the predicted cluster. The
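A minimal sketch of one common way to avoid this warning, assuming it comes from assigning a column to a slice of the original DataFrame. The toy data, the column names `x1`, `x2` and `cluster`, and the choice of `KMeans` are assumptions, since the excerpt does not show the original code.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.model_selection import train_test_split

# Hypothetical toy data; the dataset from the question is not shown.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(40, 2)), columns=["x1", "x2"])

train, test = train_test_split(df, test_size=0.25, random_state=0)

# Take explicit copies so that new columns are added to independent frames,
# not to objects pandas may treat as slices of df.
train = train.copy()
test = test.copy()

features = ["x1", "x2"]
model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(train[features])

# Add the predicted cluster as a new column on each copy.
train["cluster"] = model.labels_
test["cluster"] = model.predict(test[features])
```

The original `df` is left untouched, and the feature list is kept separate so the new `cluster` column is never fed back into the model by accident.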

Python: Develop Multiple Linear Regression Model From Scratch

linear-regression machine-learning python scikit-learn

I am trying to create a multiple linear regression model from scratch in python. Dataset used: Boston Housing Dataset from Sklearn. Since my focus was on the model building I did not perform any pre-processing steps on the data. However, I used an OLS model to calculate p-values and dropped 3 features from the data. After that, I used a
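For reference, a minimal sketch of a from-scratch multiple linear regression fitted by ordinary least squares with NumPy. The excerpt does not show which fitting method the asker used, so this is only one possible approach. The synthetic data and the helper names `fit_linear_regression` and `predict` are assumptions for illustration.

```python
import numpy as np

# Synthetic data standing in for the real dataset (an assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 4.0 + rng.normal(scale=0.01, size=200)


def fit_linear_regression(X, y):
    """Return (intercept, weights) minimising the squared error."""
    # Prepend a column of ones so the intercept is estimated with the weights.
    Xb = np.column_stack([np.ones(len(X)), X])
    # lstsq solves the least-squares problem without forming an explicit inverse.
    coef, *_ = np.linalg.lstsq(Xb, y, rcond=None)
    return coef[0], coef[1:]


def predict(X, intercept, weights):
    return intercept + X @ weights


intercept, weights = fit_linear_regression(X, y)
mse = np.mean((predict(X, intercept, weights) - y) ** 2)
```

Using `np.linalg.lstsq` rather than inverting `X.T @ X` directly is a design choice that tends to behave better when features are strongly correlated.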

imblearn.over_sampling SMOTENC ValueError

data-science data-science-experience pandas python scikit-learn

This is my first time using SMOTENC to upsample my categorical data. However, I’ve been getting an error. Can you please advise what I should pass for categorical_features in SMOTENC? ERROR: Answer As per documentation: So, just replace the line with the line

Suppress scientific notation in sklearn.metrics.plot_confusion_matrix

matplotlib python scikit-learn

I was trying to plot a confusion matrix nicely, so I used the built-in plot_confusion_matrix function added in scikit-learn 0.22. However, one value in my confusion matrix is 153, but it appears as 1.5e+02 in the confusion matrix plot: In the scikit-learn documentation, I spotted a parameter called values_format, but I do not know how to manipulate
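A small sketch of what a format specification does to the value 153, in plain Python. It assumes the plot renders cells with a `.2g`-style format, which would be consistent with the `1.5e+02` seen above, and that `values_format` accepts an ordinary Python format spec such as `"d"`.

```python
value = 153

# ".2g" keeps two significant digits, so 153 falls back to exponent notation.
as_general = format(value, ".2g")

# "d" formats the value as a plain integer.
as_integer = format(value, "d")

print(as_general)
print(as_integer)
```

Under that assumption, passing `values_format="d"` to the plotting function should display integer counts without scientific notation.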

How to load SVMlight format files in compressed form to pandas?

pandas python scikit-learn svmlight

I have data in SVMlight format (label feature1:value1 feature2:v2 …). I tried sklearn.datasets.load_svmlight_file, but it doesn’t seem to work with categorical string features and labels. I am trying to store the data in a pandas DataFrame. Any pointers would be appreciated. Answer You can do it by hand… One way to convert the file into a DataFrame:
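The answer's code is not included in the excerpt, so here is a minimal sketch of parsing such a file by hand. The helper name `svmlight_to_frame`, the sample rows, and the use of gzip as the compression format are assumptions. All values are kept as strings, which is what allows categorical features and labels.

```python
import gzip
import io

import pandas as pd


def svmlight_to_frame(lines):
    """Parse 'label name:value name:value' lines into a DataFrame of strings."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        label, *pairs = line.split()
        row = {"label": label}
        for pair in pairs:
            name, value = pair.split(":", 1)
            row[name] = value
        rows.append(row)
    # Features missing from a row become NaN in the resulting frame.
    return pd.DataFrame(rows)


# Hypothetical sample, compressed in memory to stand in for a .gz file on disk.
raw = "spam color:red size:3\nham color:blue\n"
compressed = gzip.compress(raw.encode("utf-8"))

with gzip.open(io.BytesIO(compressed), mode="rt", encoding="utf-8") as handle:
    frame = svmlight_to_frame(handle)
```

For a real file, `gzip.open` can be given the file path instead of the in-memory buffer.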

sklearn roc_auc_score with multi_class="ovr" should have None average available

auc machine-learning python scikit-learn

I’m trying to compute the AUC score for a multiclass problem using sklearn’s roc_auc_score() function. I have a prediction matrix of shape [n_samples,n_classes] and a ground truth vector of shape [n_samples], named np_pred and np_label respectively. What I’m trying to achieve is the set of AUC scores, one for each class that I have. To do so I would like
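One way to get one AUC per class without relying on the `average` argument is to compute a binary one-vs-rest score for each column. This is a sketch on made-up data; `np_pred` and `np_label` reuse the names from the question, but their values here are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth and class probabilities (each row sums to 1).
np_label = np.array([0, 1, 2, 0, 1, 2])
np_pred = np.array([
    [0.8, 0.1, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.5, 0.3],
])

n_classes = np_pred.shape[1]

# One-vs-rest: class k is the positive class, all other classes are negative.
per_class_auc = [
    roc_auc_score((np_label == k).astype(int), np_pred[:, k])
    for k in range(n_classes)
]

# The macro-averaged OvR score should equal the mean of the per-class scores.
macro_auc = roc_auc_score(np_label, np_pred, multi_class="ovr", average="macro")
```

Each class needs at least one positive and one negative sample, otherwise the binary score for that class is undefined.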

How to split parallel corpora while keeping alignment?

dataset pandas python scikit-learn unix

I have two text files containing parallel text in two languages (potentially millions of lines). I am trying to generate random train/validate/test files from that single file, as train_test_split does in sklearn. However, when I try to import it into pandas using read_csv, I get errors from many of the lines because of erroneous data, and it would
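A sketch of one way to keep the two sides aligned without pandas: shuffle a single list of line indices and apply it to both corpora. The function name `split_parallel`, the split fractions, and the sample lines are assumptions for illustration.

```python
import random


def split_parallel(src_lines, tgt_lines, valid_frac=0.1, test_frac=0.1, seed=0):
    """Split two aligned lists of lines with one shared random permutation."""
    if len(src_lines) != len(tgt_lines):
        raise ValueError("corpora must have the same number of lines")

    indices = list(range(len(src_lines)))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    n_valid = int(len(indices) * valid_frac)
    n_test = int(len(indices) * test_frac)
    parts = {
        "valid": indices[:n_valid],
        "test": indices[n_valid:n_valid + n_test],
        "train": indices[n_valid + n_test:],
    }
    # The same index list selects from both sides, so pairs stay aligned.
    return {
        name: ([src_lines[i] for i in idx], [tgt_lines[i] for i in idx])
        for name, idx in parts.items()
    }


# Hypothetical corpora where line i of src corresponds to line i of tgt.
src = [f"en {i}" for i in range(20)]
tgt = [f"de {i}" for i in range(20)]
splits = split_parallel(src, tgt)
```

With real files, the lists could come from `readlines()` on each file. Because lines are never parsed as CSV, malformed rows cannot cause read errors.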

How to reverse Label Encoder from sklearn for multiple columns?

categorical-data python scikit-learn

I would like to use the inverse_transform function for LabelEncoder on multiple columns. This is the code I use for more than one column when applying LabelEncoder on a dataframe: Is there a way to modify the code so that it can be used to invert the labels from the encoder? Thanks Answer In order to inverse transform
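The code from the question and answer is not included in the excerpt. A minimal sketch of one approach is to keep one fitted `LabelEncoder` per column in a dictionary, so each column can later be inverted with its own encoder. The DataFrame contents and the `encoders` dictionary name are assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical categorical data.
df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "L"]})

# Fit and store one encoder per column.
encoders = {}
encoded = df.copy()
for col in df.columns:
    encoders[col] = LabelEncoder()
    encoded[col] = encoders[col].fit_transform(df[col])

# Invert each column with the encoder that was fitted on it.
decoded = encoded.copy()
for col, encoder in encoders.items():
    decoded[col] = encoder.inverse_transform(encoded[col])
```

Reusing a single `LabelEncoder` across columns would overwrite its learned classes on every fit, which is why the fitted encoders are stored separately here.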

Why does TfidfVectorizer.fit_transform() change the number of samples and labels for my text data?

python scikit-learn tfidfvectorizer

I have a data set that contains 3 columns and 310 samples. The columns are all text. One column is text input by a user into an inquiry form, and the second column contains the labels (one of six labels) that say which inquiry category the input falls into. I am doing the following preprocessing to my data before I

Sklearn PCA explained variance and explained variance ratio difference

covariance pca python scikit-learn

I’m trying to get the variances from the eigenvectors. What is the difference between explained_variance_ratio_ and explained_variance_ in PCA? Answer The percentage of the explained variance is: The variances, i.e. the eigenvalues of the covariance matrix, are: Formula: explained_variance_ratio_ = explained_variance_ / np.sum(explained_variance_) Example: Also based on the above formula: 7.93954312 / (7.93954312 + 0.06045688) = 0.99244289 From the documentation:
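The formula can be checked with a short sketch. The six-point array below is an assumption chosen because it yields the two variances quoted above. The equality with `np.sum(explained_variance_)` is shown with all components kept; if fewer components are retained, the ratio is still taken relative to the total variance of the data.

```python
import numpy as np
from sklearn.decomposition import PCA

# Small 2-D example; both components are kept.
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
pca = PCA(n_components=2).fit(X)

# Eigenvalues of the covariance matrix of X.
variances = pca.explained_variance_

# Share of the total variance captured by each component.
ratio_from_formula = variances / np.sum(variances)

print(variances)
print(pca.explained_variance_ratio_)
print(ratio_from_formula)
```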