I am trying to merge the results of a predict method back with the original data in a pandas.DataFrame object. To merge these predictions back with the original df, I try this: But that raises: "ValueError: Length of values does not match length of index". I know I could split the df into train_df and test_df and this problem would
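The length mismatch in this question typically appears when predictions for a subset of rows are assigned to the full frame. A minimal sketch of one way around it, assuming the predictions come from a held-out split, is to keep the test rows' index on the predictions and join on it. The toy data and the column names `x1`, `x2`, `y` and `prediction` are hypothetical.

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Hypothetical toy data; column names are assumptions for illustration.
df = pd.DataFrame({
    "x1": [0.1, 0.4, 0.35, 0.8, 0.9, 0.2, 0.7, 0.6],
    "x2": [1.0, 0.9, 0.8, 0.2, 0.1, 0.95, 0.3, 0.4],
    "y": [0, 0, 0, 1, 1, 0, 1, 1],
})
features = ["x1", "x2"]

# Splitting the DataFrame itself keeps the original index on both parts.
train_df, test_df = train_test_split(
    df, test_size=0.25, random_state=0, stratify=df["y"]
)

model = LogisticRegression().fit(train_df[features], train_df["y"])

# Wrap the predictions in a Series that carries the test rows' index,
# so pandas can align them instead of comparing raw lengths.
preds = pd.Series(
    model.predict(test_df[features]), index=test_df.index, name="prediction"
)

# Rows that were not predicted (the training rows) get NaN.
merged = df.join(preds)
print(merged)
```

Because the join aligns on the index, the prediction column can be shorter than the frame without raising the ValueError.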
Tag: scikit-learn
Retrieve list of training features names from classifier
Is there a way to retrieve the list of feature names used to train a classifier once it has been trained with the fit method? I would like to get this information before applying it to unseen data. The data used for training is a pandas DataFrame and in my case, the classifier is a RandomForestClassifier. Answer Based on the
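A minimal sketch of two ways to keep track of the training feature names, assuming the model is fitted on a DataFrame. Recent scikit-learn releases (1.0 and later) record the column names in `feature_names_in_`; on older versions that attribute does not exist, so the sketch falls back to a list saved at fit time. The columns `age` and `income` are hypothetical.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical training data.
X = pd.DataFrame({"age": [25, 32, 47, 51], "income": [30, 45, 80, 62]})
y = [0, 0, 1, 1]

# Option 1: remember the column order yourself at fit time.
feature_names = list(X.columns)

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# Option 2: read the names stored on the fitted estimator when available,
# falling back to the saved list on versions without the attribute.
fitted_names = list(getattr(clf, "feature_names_in_", feature_names))

# Pair each name with its importance; the order matches the training columns.
importances = dict(zip(fitted_names, clf.feature_importances_))
print(fitted_names)
print(importances)
```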
How to use `Dirichlet Process Gaussian Mixture Model` in Scikit-learn? (n_components?)
My understanding of “an infinite mixture model with the Dirichlet Process as a prior distribution on the number of clusters” is that the number of clusters is determined by the data as the model converges to a certain number of clusters. This R implementation https://github.com/jacobian1980/ecostates decides on the number of clusters in this way. Although, the R implementation uses a Gibbs
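For reference, a minimal sketch using `BayesianGaussianMixture` from `sklearn.mixture` with a Dirichlet process prior. In this estimator `n_components` acts as an upper bound (a truncation level) and not as the final number of clusters; components the data do not support end up with weights close to zero. The synthetic two-blob data and the truncation level of 10 are assumptions for illustration.

```python
import numpy as np
from sklearn.mixture import BayesianGaussianMixture

# Synthetic data: two well-separated 2D blobs (assumption for illustration).
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(loc=-5.0, scale=1.0, size=(100, 2)),
    rng.normal(loc=5.0, scale=1.0, size=(100, 2)),
])

dpgmm = BayesianGaussianMixture(
    n_components=10,  # upper bound on the number of components
    weight_concentration_prior_type="dirichlet_process",
    random_state=0,
).fit(X)

labels = dpgmm.predict(X)

# Number of components that actually receive points.
n_used = len(np.unique(labels))
print("mixture weights:", np.round(dpgmm.weights_, 3))
print("components in use:", n_used)
```

Inspecting `weights_` after fitting shows which of the 10 candidate components carry non-negligible mass.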
How to predict new values using statsmodels.formula.api (python)
I trained the logistic model using the following, from breast cancer data and ONLY using one feature ‘mean_area’. There is a built-in predict method in the trained model. However, that gives the predicted values of all the training samples. As follows. Suppose I want the prediction for a new value, say 30. How do I use the trained model
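A minimal sketch of predicting for a new value with a formula-based statsmodels model: `predict` accepts a DataFrame whose column names match the variables in the formula. The data here are synthetic (not the breast cancer set from the question) and the response column name `target` is hypothetical.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the question's data (assumption for illustration).
rng = np.random.default_rng(0)
mean_area = rng.uniform(0, 100, size=200)
prob_true = 1.0 / (1.0 + np.exp(-(mean_area - 50.0) / 10.0))
df = pd.DataFrame({
    "mean_area": mean_area,
    "target": rng.binomial(1, prob_true),
})

# Fit a logistic regression on the single feature.
model = smf.logit("target ~ mean_area", data=df).fit(disp=0)

# New observations go in a DataFrame with the same column name as in the formula.
new_data = pd.DataFrame({"mean_area": [30]})
prob = model.predict(new_data)
print(prob)
```

The result is a Series of predicted probabilities, one per row of `new_data`.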
label-encoder encoding missing values
I am using the label encoder to convert categorical data into numeric values. How does LabelEncoder handle missing values? Output: For the above example, label encoder changed NaN values to a category. How would I know which category represents missing values? Answer Don’t use LabelEncoder with missing values. I don’t know which version of scikit-learn you’re using, but in 0.17.1
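One way to keep the missing-value category identifiable, sketched here under the assumption that an explicit placeholder is acceptable, is to fill NaN with a sentinel string before encoding and then look up the code assigned to it. The sentinel `"missing"` and the sample values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

s = pd.Series(["red", "blue", np.nan, "red", np.nan])

# Replace NaN with an explicit, recognizable category before encoding.
filled = s.fillna("missing")

le = LabelEncoder()
codes = le.fit_transform(filled)

# LabelEncoder sorts its classes, so the code of the sentinel can be looked up.
missing_code = le.transform(["missing"])[0]

print(list(le.classes_))  # the learned categories, in sorted order
print(list(codes))
print("code used for missing values:", missing_code)
```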
Using NearestNeighbors and word2vec to detect sentence similarity
I have calculated a word2vec model using python and gensim in my corpus. Then I calculated the mean word2vec vector for each sentence (averaging all the vectors for all the words in the sentence) and stored it in a pandas data frame. The columns of the pandas data frame df are: sentence Book title (the book where the sentence comes
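A minimal sketch of querying mean sentence vectors with `NearestNeighbors`, assuming each row of the frame holds one fixed-length vector. The tiny 3-dimensional vectors stand in for real word2vec averages, and the column names `sentence` and `vector` are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors

# Hypothetical stand-ins for averaged word2vec vectors.
df = pd.DataFrame({
    "sentence": ["s0", "s1", "s2", "s3"],
    "vector": [
        np.array([1.0, 0.0, 0.0]),
        np.array([0.9, 0.1, 0.0]),
        np.array([0.0, 1.0, 0.0]),
        np.array([0.0, 0.9, 0.1]),
    ],
})

# Stack the per-row vectors into one (n_sentences, n_dims) matrix.
X = np.vstack(df["vector"].tolist())

# Cosine distance is a common choice for embedding vectors.
nn = NearestNeighbors(n_neighbors=2, metric="cosine").fit(X)

# Query with the first sentence; it is its own nearest neighbour.
distances, indices = nn.kneighbors(X[[0]])
nearest = df["sentence"].iloc[indices[0]].tolist()
print(nearest, distances[0])
```

When querying with a sentence that is already in the index, the first hit is the sentence itself, so the second hit is the closest other sentence.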
Backpropagation with Momentum using Scikit-Learn
I’m trying to use Scikit-Learn’s neural network to classify my dataset using backpropagation with momentum. I need to specify these parameters: hidden neurons, hidden layers, training set, learning rate and momentum. I found MLPClassifier in the sklearn.neural_network package. The problem is that this package is part of Scikit-learn V0.18 which is a dev version. Is there a way I could
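A minimal sketch of how the listed parameters map onto `MLPClassifier`, assuming a scikit-learn release that includes `sklearn.neural_network` (0.18 or later). Momentum only applies with the `sgd` solver. The layer sizes, learning rate and synthetic data are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

# Synthetic training set (assumption for illustration).
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

clf = MLPClassifier(
    hidden_layer_sizes=(20, 10),  # two hidden layers with 20 and 10 neurons
    solver="sgd",                 # momentum is only used by the SGD solver
    learning_rate_init=0.01,      # learning rate
    momentum=0.9,                 # momentum term
    max_iter=500,
    random_state=0,
)
clf.fit(X, y)

train_accuracy = clf.score(X, y)
print("training accuracy:", train_accuracy)
```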
Increase the size of /dev/shm in Azure ML Studio
I’m trying to execute the following code in Azure ML Studio notebook: and I’m getting this error: With n_jobs=1 it works fine. I think this is because joblib library tries to save my data to /dev/shm. The problem is that it has only 64M capacity: I can’t change this folder by setting JOBLIB_TEMP_FOLDER environment variable (export doesn’t work). Thanks for
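When a shell `export` is not available in a notebook, one thing to try is pointing joblib at a different temporary directory from Python itself. The sketch below sets `JOBLIB_TEMP_FOLDER` through `os.environ` and also passes `temp_folder` to `joblib.Parallel` directly. It has not been checked on Azure ML Studio, and using the system temp directory as the target is an assumption.

```python
import os
import tempfile

from joblib import Parallel, delayed

# Assumption: the system temp directory has more room than /dev/shm.
tmp_dir = tempfile.gettempdir()

# Set the variable from Python, before the parallel work starts.
os.environ["JOBLIB_TEMP_FOLDER"] = tmp_dir

# When calling joblib directly, the folder can also be passed explicitly.
results = Parallel(n_jobs=2, temp_folder=tmp_dir)(
    delayed(pow)(i, 2) for i in range(5)
)
print(results)
```

scikit-learn estimators that take `n_jobs` run through joblib, so the environment variable is the route that applies to them.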
How to use sklearn fit_transform with pandas and return dataframe instead of numpy array?
I want to apply scaling (using StandardScaler() from sklearn.preprocessing) to a pandas dataframe. The following code returns a numpy array, so I lose all the column names and indices. This is not what I want. A “solution” I found online is: It appears to work, but leads to a DeprecationWarning: /usr/lib/python3.5/site-packages/sklearn/preprocessing/data.py:583: DeprecationWarning: Passing 1d arrays as data is deprecated in
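A minimal sketch of one common approach: scale the whole frame in a single call (a 2D input, which avoids the 1d-array warning) and wrap the resulting array in a new DataFrame that reuses the original columns and index. The sample frame is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical input frame with a non-default index.
df = pd.DataFrame(
    {"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]},
    index=["r1", "r2", "r3"],
)

scaler = StandardScaler()

# fit_transform returns a NumPy array; rebuild the DataFrame around it
# so the column names and index are preserved.
scaled = pd.DataFrame(
    scaler.fit_transform(df), columns=df.columns, index=df.index
)
print(scaled)
```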
sklearn Logistic Regression “ValueError: Found array with dim 3. Estimator expected <= 2.”
I am attempting to solve problem 6 in this notebook. The question is to train a simple model on this data using 50, 100, 1000 and 5000 training samples by using the LogisticRegression model from sklearn.linear_model. This is the code I am trying, and it gives me the error. Any idea? Answer scikit-learn expects 2d num arrays for the
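A minimal sketch of the usual fix for this error, assuming the input is a stack of images with shape (n_samples, height, width): flatten each image into one row so the estimator receives a 2D array. The random 28x28 data and binary labels are hypothetical placeholders for the notebook's dataset.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical image stack: 100 samples of 28x28 pixels.
rng = np.random.default_rng(0)
images = rng.normal(size=(100, 28, 28))
labels = rng.integers(0, 2, size=100)

# Flatten each image to a single row: (100, 28, 28) -> (100, 784).
X = images.reshape(len(images), -1)

clf = LogisticRegression(max_iter=200).fit(X, labels)
print(X.shape, clf.coef_.shape)
```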