
Tag: scikit-learn

Machine Learning Classifier using past predictions as features

I want to build a binary classifier machine learning model. I want to use the model’s previous predictions as features for future predictions, to take into account that my training samples are not independent. Is there a framework to achieve this with scikit-learn, or any other Python ML library? I know this problem could be solved with memory-based Neural …
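One possible pattern, sketched below with synthetic data, is to add a lagged-prediction column to the feature matrix and feed each prediction back in at inference time. Nothing here is from the original post; the lag-1 feedback scheme and all names are illustrative.

```python
# A minimal lag-1 feedback sketch with synthetic data (illustrative only).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                          # base features
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)

# Train with the previous *true* label standing in for the past prediction.
prev = np.zeros((len(X), 1))
prev[1:, 0] = y[:-1]
model = LogisticRegression().fit(np.hstack([X, prev]), y)

# At inference, feed each prediction back in as the next sample's extra feature.
preds, last = [], 0.0
for x in X:
    p = model.predict(np.hstack([x, [last]]).reshape(1, -1))[0]
    preds.append(p)
    last = p
```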

TypeError during resampling

I am trying to apply resampling to my dataset, which has unbalanced classes. What I have done is the following: … Unfortunately, I am having some problems at this step: X = pd.concat([X_train, y_train], axis=1), i.e. … You can think of the Text column as … I hope you can help me handle it. Answer You have to convert X_train to a …
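The truncated answer points at the type of X_train. A minimal sketch of the fix, assuming X_train came out of an earlier step as a NumPy array (the Text and label column names are hypothetical):

```python
import numpy as np
import pandas as pd

X_train = np.array([["foo"], ["bar"], ["baz"]])  # e.g. the output of an earlier step
y_train = np.array([0, 1, 0])

# pd.concat only accepts pandas objects, so wrap the arrays first.
train = pd.concat(
    [pd.DataFrame(X_train, columns=["Text"]), pd.Series(y_train, name="label")],
    axis=1,
)
```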

sklearn.compose.make_column_transformer(): using SimpleImputer() and OneHotEncoder() in one step on one dataframe column

I have a dataframe containing a column with categorical variables, which also includes NaNs. I’d like to use sklearn.compose.make_column_transformer() to prepare the df in a clean way. I tried to impute the NaN values and one-hot encode the column with the following code: … Running the transformer on my training data raises ValueError: Input contains NaN. The desired output would be something …
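A common way to run both steps on the same column is to chain them in a Pipeline and hand that pipeline to the column transformer. A minimal sketch (the colour column is hypothetical):

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"colour": ["red", "blue", np.nan, "red"]})

cat_pipe = make_pipeline(
    SimpleImputer(strategy="most_frequent"),  # fill the NaNs first...
    OneHotEncoder(handle_unknown="ignore"),   # ...then one-hot encode
)
ct = make_column_transformer((cat_pipe, ["colour"]), remainder="passthrough")
encoded = ct.fit_transform(df)                # no more "Input contains NaN"
```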

OneHotEncoding Protein Sequences

I have an original dataframe of sequences, listed below, and am trying to one-hot encode them and store the result in a new dataframe. I am trying to do it with the following code, but get an error and am not able to store the result: … Answer You get that strange array because it treats …
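A likely cause of that strange array is passing whole sequence strings to the encoder, which then treats each full sequence as a single category. A minimal sketch of per-residue encoding, on toy sequences rather than the poster's data:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"sequence": ["MKV", "MAV"]})

# Split each sequence into one column per position before encoding.
chars = df["sequence"].apply(list).apply(pd.Series)

enc = OneHotEncoder()
encoded = pd.DataFrame(
    enc.fit_transform(chars).toarray(),        # dense, so it stores cleanly in a df
    columns=enc.get_feature_names_out(),
)
```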

Decision tree with a probability target

I’m currently working on a model to predict the probability of fatality once a person is infected with the coronavirus. I’m using a Dutch dataset with categorical variables: date of infection, fatality or cured, gender, age group, etc. It was suggested to use a decision tree, which I’ve already built. Since I’m new to decision trees, I would like some …
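For reference, a classification tree trained on the binary fatality label can still output a probability: predict_proba returns the fraction of training samples of each class in a leaf. A minimal sketch with synthetic, pre-encoded data (not the Dutch dataset):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.DataFrame({
    "age_group": [0, 1, 2, 2, 1, 0],   # categorical variables, already encoded
    "gender":    [0, 0, 1, 1, 0, 1],
    "fatality":  [0, 0, 1, 1, 0, 0],
})
X, y = df[["age_group", "gender"]], df["fatality"]

tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X, y)
proba = tree.predict_proba(X)[:, 1]    # per-leaf fraction of fatal cases
```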

I keep getting ValueError: Shapes (10, 1) and (10, 3) are incompatible when training my model

Turning the number of inputs from 3 to 1 when I call makeModel allows the program to run without errors, but no training actually happens and the accuracy doesn’t change. Answer LabelEncoder transforms the input into an array of encoded values, i.e. if your input is [“paris”, “paris”, “tokyo”, “amsterdam”] it is encoded as [1, 1, 2, 0], since classes are sorted alphabetically.
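In other words, integer-encoded labels have shape (n, 1) while a 3-unit softmax head expects (n, 3). A minimal sketch of the two usual fixes, assuming a TensorFlow/Keras model as the error message suggests:

```python
from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

labels = ["paris", "paris", "tokyo", "amsterdam"]
encoded = LabelEncoder().fit_transform(labels)    # -> [1, 1, 2, 0], shape (4,)

# Option 1: one-hot the targets to match the (n, 3) output layer.
onehot = to_categorical(encoded, num_classes=3)   # -> shape (4, 3)

# Option 2: keep integer targets and compile the model with
# loss="sparse_categorical_crossentropy" instead.
```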

Isolation Forest vs Robust Random Cut Forest in outlier detection

I am examining different methods of outlier detection. I came across sklearn’s implementation of Isolation Forest and Amazon SageMaker’s implementation of RRCF (Robust Random Cut Forest). Both are ensemble methods based on decision trees, aiming to isolate every single point. The more isolation steps a point requires, the more likely it is to be an inlier, and the opposite is …
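For the scikit-learn half of the comparison, a minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one obvious outlier

iso = IsolationForest(n_estimators=100, random_state=0).fit(X)
labels = iso.predict(X)         # -1 = outlier, 1 = inlier
scores = iso.score_samples(X)   # lower = more anomalous (isolated sooner)
```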

How to get the centroids in DBSCAN sklearn?

I am using DBSCAN for clustering. Now I want to pick a point from each cluster that represents it, but I realized that DBSCAN does not have centroids as k-means does. However, I observed that DBSCAN has something called core points. I am wondering whether it is possible to use these core points, or some other alternative, to obtain …
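One workable recipe, sketched below on synthetic blobs: for each cluster, pick the core sample closest to the cluster's mean. This is an illustrative approach, not the accepted answer.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
db = DBSCAN(eps=0.8, min_samples=5).fit(X)

is_core = np.zeros(len(X), dtype=bool)
is_core[db.core_sample_indices_] = True

# Representative per cluster: the core point nearest the cluster mean.
representatives = {}
for label in set(db.labels_) - {-1}:               # skip the noise label (-1)
    pts = X[(db.labels_ == label) & is_core]
    centroid = pts.mean(axis=0)
    representatives[label] = pts[np.argmin(np.linalg.norm(pts - centroid, axis=1))]
```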
