
Tag: scikit-learn

Difference between StandardScaler and MinMaxScaler

What is the difference between MinMaxScaler() and StandardScaler()? mms = MinMaxScaler(feature_range=(0, 1)) (used in one machine learning model) sc = StandardScaler() (in another machine learning model they used StandardScaler and not MinMaxScaler) Answer From the scikit-learn site: StandardScaler removes the mean and scales the data to unit variance. However, outliers have an influence when computing the empirical mean
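A minimal sketch contrasting the two on toy data (the array values are made up for illustration): MinMaxScaler squeezes every feature into the requested range, while StandardScaler centres to zero mean and unit variance, so a single outlier distorts each differently.

import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [100.0]])  # 100 acts as an outlier

mms = MinMaxScaler(feature_range=(0, 1))
print(mms.fit_transform(X).ravel())  # all values in [0, 1]; the outlier compresses the rest

sc = StandardScaler()
print(sc.fit_transform(X).ravel())   # zero mean, unit variance; the outlier pulls the mean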

Interpreting logistic regression feature coefficient values in sklearn

I have fit a logistic regression model to my data. Imagine I have four features: 1) which condition the participant received, 2) whether the participant had any prior knowledge/background about the phenomenon tested (binary response in a post-experimental questionnaire), 3) time spent on the experimental task, and 4) participant age. I am trying to predict whether participants ultimately chose option A
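A hedged sketch of the usual way to inspect such coefficients; the feature names mirror the question, but the data below is random and purely illustrative.

import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = pd.DataFrame({
    "condition": rng.integers(0, 2, 100),
    "prior_knowledge": rng.integers(0, 2, 100),
    "time_on_task": rng.normal(60, 10, 100),
    "age": rng.integers(18, 65, 100),
})
y = rng.integers(0, 2, 100)  # 1 = chose option A (random stand-in labels)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Each coefficient is the change in log-odds of choosing A per unit increase
# in that feature, holding the others fixed; np.exp turns it into an odds ratio.
for name, coef in zip(X.columns, clf.coef_[0]):
    print(name, coef, np.exp(coef))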

GridSearchCV.best_score_ not the same as cross_val_score(GridSearchCV.best_estimator_)

Consider the following grid search: grid = GridSearchCV(clf, parameters, n_jobs=-1, iid=True, cv=5) grid_fit = grid.fit(X_train1, y_train1) According to sklearn's documentation, grid_fit.best_score_ returns "the mean cross-validated score of the best_estimator_". To me that would mean that the average of cross_val_score(grid_fit.best_estimator_, X_train1, y_train1, cv=5) should be exactly the same as grid_fit.best_score_. However, I am getting a 10% difference
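One common source of such a gap is that the two runs do not score on the same fold splits, or that the estimator itself is randomized. A minimal sketch, assuming a fixed splitter and seeds, where the two numbers should agree (iid=True is omitted here, as that argument was removed in later scikit-learn releases):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score

X, y = make_classification(n_samples=500, random_state=0)
cv = KFold(n_splits=5, shuffle=True, random_state=0)  # identical splits for both runs

clf = RandomForestClassifier(random_state=0)  # fixed seed so refits are reproducible
grid = GridSearchCV(clf, {"n_estimators": [50, 100]}, n_jobs=-1, cv=cv)
grid.fit(X, y)

rerun = cross_val_score(grid.best_estimator_, X, y, cv=cv).mean()
print(grid.best_score_, rerun)  # these should now match up to float noise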

Scaling / Normalizing pandas column

I have a dataframe like: I'd like to create a new scaled column in the dataframe called SIZE, where SIZE is a number between 5 and 50. For example: I've tried, but got: "Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample." I've tried other things,
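A minimal sketch of the usual fix, assuming the raw values live in a column named VALUE (the question's actual column name is not shown): the scaler expects 2-D input, so select the column with a double-bracket DataFrame lookup rather than a 1-D Series.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

df = pd.DataFrame({"VALUE": [3, 7, 1, 9, 4]})  # illustrative data

scaler = MinMaxScaler(feature_range=(5, 50))
# Passing the 1-D df["VALUE"] is what triggers the "Reshape your data" error;
# df[["VALUE"]] keeps the 2-D shape the scaler wants.
df["SIZE"] = scaler.fit_transform(df[["VALUE"]]).ravel()
print(df)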

Python SKLearn: How to Get Feature Names After OneHotEncoder?

I would like to get the feature names of a data set after it has been transformed by SKLearn OneHotEncoder. In the active_features_ attribute of OneHotEncoder one can find a very good explanation of how the attributes n_values_, feature_indices_ and active_features_ get filled after transform() is executed. My question is: for e.g. DataFrame-based input data, what would the code look like
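A hedged sketch using the modern API: active_features_ belongs to long-removed scikit-learn versions, and newer releases expose get_feature_names_out() instead.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "red"], "size": ["S", "M", "S"]})

enc = OneHotEncoder(sparse_output=False)  # on older versions: sparse=False
encoded = enc.fit_transform(df)

# Column names are derived as <feature>_<category>.
print(enc.get_feature_names_out())  # e.g. ['color_blue' 'color_red' 'size_M' 'size_S']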

LightGBMError “Check failed: num_data > 0” with Sklearn RandomizedSearchCV

I'm trying LightGBMRegressor parameter tuning with Sklearn's RandomizedSearchCV and got the error below. error: I cannot tell why, or which specific parameters caused this error. Was any of the params_dist below unsuitable for train_x.shape: (1630, 1565)? Please share any hints or solutions. Thank you. LightGBM version: '2.0.12' function that caused this error: Too long to put full stack trace,
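For what it's worth, "Check failed: num_data > 0" typically fires when a sampled parameter combination leaves LightGBM with an empty data subset, e.g. a bagging/subsample fraction near zero or an oversized min_child_samples. A hedged sketch that keeps those ranges sane; the exact bounds and the synthetic data are assumptions, not the question's params_dist:

import numpy as np
from lightgbm import LGBMRegressor
from sklearn.datasets import make_regression
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    "num_leaves": [31, 63, 127],
    "min_child_samples": [5, 10, 20],       # keep well below the number of rows
    "subsample": np.linspace(0.5, 1.0, 6),  # avoid fractions near zero
    "colsample_bytree": np.linspace(0.5, 1.0, 6),
}

X, y = make_regression(n_samples=200, n_features=50, random_state=0)
search = RandomizedSearchCV(
    LGBMRegressor(subsample_freq=1),  # subsample only applies when freq >= 1
    param_dist, n_iter=20, cv=3, random_state=0,
)
search.fit(X, y)
print(search.best_params_)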

Pipeline for RandomOverSampler, RandomForestClassifier & GridSearchCV

I am working on a binary text classification problem. As the classes are highly imbalanced, I am using sampling techniques like RandomOverSampler(). For classification I would then use RandomForestClassifier(), whose parameters need to be tuned using GridSearchCV(). I am trying to create a pipeline to chain these in order but have failed so far; it throws "invalid parameters". Answer The parameters
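A hedged sketch of one working arrangement: scikit-learn's own Pipeline refuses samplers, so imblearn.pipeline.Pipeline is used instead, and grid keys are prefixed with the (assumed) step names so GridSearchCV can route them; bare names like n_estimators are exactly what produce the "invalid parameter" error.

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=300, weights=[0.9, 0.1], random_state=0)

pipe = Pipeline([
    ("sampler", RandomOverSampler(random_state=0)),  # oversampling happens per CV fold
    ("clf", RandomForestClassifier(random_state=0)),
])

# Parameter names must be "<step name>__<parameter>".
grid = GridSearchCV(pipe, {"clf__n_estimators": [50, 100]}, cv=3)
grid.fit(X, y)
print(grid.best_params_)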

KMeans clustering: top terms in a cluster

I am using the Python KMeans clustering algorithm to cluster documents. I have created a term-document matrix, then applied KMeans clustering using the following code. My next task is to see the top terms in every cluster; searching on Google suggested that many people have used km.cluster_centers_.argsort()[:, ::-1] for finding the top terms in the clusters using the
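A minimal sketch of that argsort idiom on a tiny made-up corpus (the vectorizer and cluster count are assumptions):

from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["cats purr softly", "dogs bark loudly", "cats and dogs play", "birds sing songs"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)

km = KMeans(n_clusters=2, random_state=0, n_init=10).fit(X)

terms = vec.get_feature_names_out()
# Each row of cluster_centers_ holds the centroid's weight per term; sorting the
# column indices in descending order ranks terms by importance within a cluster.
order = km.cluster_centers_.argsort()[:, ::-1]
for i in range(2):
    print("cluster", i, [terms[j] for j in order[i, :3]])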
