Can I use PyTorch tensors instead of NumPy arrays while working with scikit-learn? I tried some methods from scikit-learn, like train_test_split and StandardScaler, and they seem to work just fine, but is there anything I should know when I’m using PyTorch tensors instead of NumPy arrays? According to this entry in the scikit-learn FAQ (https://scikit-learn.org/stable/faq.html#how-can-i-load-my-own-datasets-into-a-format-usable-by-scikit-learn): numpy arrays or scipy sparse matrices. Other
Tag: scikit-learn
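A minimal sketch of the safe pattern, assuming torch and scikit-learn are installed: convert tensors to NumPy explicitly rather than relying on array-like duck typing.

```python
# Sketch (assumes torch and scikit-learn are installed). Most scikit-learn
# utilities accept any array-like, so CPU tensors often pass through, but
# converting explicitly to NumPy is the safe, documented path.
import numpy as np
import torch
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

X = torch.randn(100, 4)
y = torch.randint(0, 2, (100,))

# .detach().cpu() handles tensors that require grad or live on a GPU
X_np = X.detach().cpu().numpy()
y_np = y.detach().cpu().numpy()

X_train, X_test, y_train, y_test = train_test_split(X_np, y_np, random_state=0)
X_scaled = StandardScaler().fit_transform(X_train)  # output is a NumPy array
```

The `.detach().cpu()` step matters: a plain `np.asarray(tensor)` raises for tensors that carry gradients or sit on a CUDA device.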
Determine whether the Columns of a Dataset are invariant under any given Scikit-Learn Transformer
Given an sklearn transformer t, is there a way to determine whether t changes the columns/column order of any given input dataset X, without applying it to the data? For example, with t = sklearn.preprocessing.StandardScaler there is a 1-to-1 mapping between the columns of X and t.transform(X), namely X[:, i] -> t.transform(X)[:, i], whereas this is obviously not the case for
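scikit-learn exposes no general flag for this, so one pragmatic workaround is an empirical probe on small synthetic data. A sketch (the `preserves_columns` helper is hypothetical, not a scikit-learn API):

```python
# Empirical probe: fit-transform small random data and check that output
# column i is a rescaling of input column i. Heuristic, not a proof.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def preserves_columns(t, n_features=5, n_samples=50, seed=0):
    """True if t appears to keep a 1-to-1, in-order column mapping."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_samples, n_features))
    Xt = t.fit_transform(X)
    if Xt.shape[1] != n_features:
        return False
    # Each output column should be perfectly correlated with its input column.
    return all(
        abs(np.corrcoef(X[:, i], Xt[:, i])[0, 1]) > 1 - 1e-9
        for i in range(n_features)
    )

scaler_invariant = preserves_columns(StandardScaler())   # column-preserving
pca_invariant = preserves_columns(PCA(n_components=5))   # mixes columns
```

This answers the question only empirically; a transformer could in principle behave differently on other data, so treat it as a sanity check rather than a guarantee.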
Get names of the most important features for Logistic Regression after transformation
I want to get the names of the most important features for logistic regression after transformation. I know that I can do this: But with this I’m getting feature1, feature2, feature3, etc., and after transformation I have around 45k features. How can I get the list of the most important features (before transformation)? I want to know what the best features are for
How to plot density estimation contours of a model with 20 features?
I am following this sample to do density estimation for the Bayesian Gaussian mixture model below: in which data (as a dataframe) includes 20 columns of numeric data. I can simply plot the model for two features of bgmm by But, how can I plot all the clusters in the form of density contours? Answer I believe you need to
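Contours are inherently two-dimensional, so one standard approach is to project the 20 features to 2D (e.g. with PCA), fit the mixture in that plane, and evaluate its log-density on a grid. A sketch, with random data standing in for the real 20-column dataframe:

```python
# Sketch: PCA projection to 2D, then a density grid for contour plotting.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
data = rng.normal(size=(300, 20))   # stand-in for the real dataframe

X2 = PCA(n_components=2).fit_transform(data)
bgmm = BayesianGaussianMixture(n_components=3, random_state=0).fit(X2)

xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min(), X2[:, 0].max(), 50),
    np.linspace(X2[:, 1].min(), X2[:, 1].max(), 50),
)
zz = -bgmm.score_samples(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
# plt.contour(xx, yy, zz) would then draw the density contours
```

Fitting on the projection (rather than projecting a 20-dimensional fit) keeps the density and the plot in the same space; the trade-off is that the contours describe the projected data, not the full model.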
Suspect overfitting binary classification toy problem with scikit-learn RandomForestClassifier
I’m trying to train a Random Forest to classify the species of a set of flowers from the iris dataset. However, the validation looks odd to me: the results are perfect, which is something I would not expect. Since I would like to perform a binary classification, I exclude from the training dataset the
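For context, perfect scores on this task are expected rather than a sign of overfitting: setosa is linearly separable from the other iris species. A quick check, assuming the binary problem keeps setosa (0) vs. versicolor (1):

```python
# Cross-validated Random Forest on a two-class subset of iris.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
mask = y < 2                      # drop virginica to get a binary problem
scores = cross_val_score(RandomForestClassifier(random_state=0),
                         X[mask], y[mask], cv=5)
# Near-perfect accuracy here indicates an easy, separable problem, not
# overfitting -- the held-out folds are never seen during training.
```

If the excluded class were setosa instead, versicolor vs. virginica overlap and scores would drop below 1.0, which is a good way to confirm the validation itself is sound.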
Mismatch between manual computation of evaluation metrics and Sklearn functions
I wanted to compare manual computations of precision and recall with the scikit-learn functions. However, scikit-learn’s recall_score() and precision_score() gave me different results. Not sure why! Could you please give me some advice on why I am getting different results? Thanks! My confusion matrix: Answer It should be (check the return value’s ordering): Please refer: here
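The usual culprit is the layout of scikit-learn's confusion matrix: rows are true labels and columns are predictions, i.e. `[[tn, fp], [fn, tp]]`, so reading tp from the top-left corner gives wrong manual metrics. A worked check on a small made-up example:

```python
# Manual precision/recall from the correctly unpacked confusion matrix.
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 0]

# ravel() of the 2x2 matrix yields (tn, fp, fn, tp) in that order.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision_manual = tp / (tp + fp)   # 3 / (3 + 1) = 0.75
recall_manual = tp / (tp + fn)      # 3 / (3 + 2) = 0.6

assert precision_manual == precision_score(y_true, y_pred)
assert recall_manual == recall_score(y_true, y_pred)
```

Swapping tp and tn in the unpacking is exactly the kind of mistake that makes the manual numbers disagree with `precision_score()` and `recall_score()`.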
What should be the format of one-hot-encoded features for scikit-learn?
I am trying to use the regressor/classifiers of scikit-learn library. I am a bit confused about the format of the one-hot-encoded features since I can send dataframe or numpy arrays to the model. Say I have categorical features named ‘a’, ‘b’ and ‘c’. Should I give them in separate columns (with pandas.get_dummies()), like below: a b c 1 1 1
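Either layout works: scikit-learn estimators only see a numeric matrix, so one 0/1 column per category level (what pandas.get_dummies produces) is a valid input. A small sketch with a made-up 'color' feature:

```python
# One-hot encoding via get_dummies, fed straight into a scikit-learn model.
import pandas as pd
from sklearn.linear_model import LogisticRegression

df = pd.DataFrame({"color": ["red", "green", "blue", "red"],
                   "y": [1, 0, 0, 1]})

# One 0/1 column per level; column names are prefixed with the original name.
X = pd.get_dummies(df[["color"]])   # color_blue, color_green, color_red
model = LogisticRegression().fit(X, df["y"])
```

For unregularized linear models you may want `drop_first=True` to avoid the redundant column; tree-based models are indifferent to either layout.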
Cache only a single step in sklearn’s Pipeline
I want to use UMAP in my sklearn Pipeline, and I would like to cache that step to speed things up. However, since I have a custom Transformer, the suggested method doesn’t work. Example code: If you run this, you will get a PicklingError, saying it cannot pickle the custom transformer. But I only need to cache the UMAP step. Any
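Pipeline(memory=...) caches by pickling whole fitted transformers, which is why a custom transformer breaks it. One workaround is to cache only the expensive computation inside the transformer with joblib.Memory. A sketch, where `expensive_embed` is a hypothetical stand-in for the costly UMAP call:

```python
# Sketch: cache one expensive function with joblib.Memory instead of using
# Pipeline(memory=...), which would have to pickle the whole transformer.
import tempfile
import numpy as np
from joblib import Memory
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

memory = Memory(tempfile.mkdtemp(), verbose=0)

@memory.cache
def expensive_embed(X):
    # Placeholder for e.g. umap.UMAP(...).fit_transform(X)
    return X @ np.ones((X.shape[1], 2))

class CachedEmbedder(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # Repeated pipeline runs hit the on-disk cache instead of recomputing.
        return expensive_embed(np.asarray(X))

pipe = Pipeline([("scale", StandardScaler()), ("embed", CachedEmbedder())])
out = pipe.fit_transform(np.random.default_rng(0).normal(size=(10, 5)))
```

Because only the function's inputs and outputs are hashed and stored, the transformer itself never needs to be picklable.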
RandomizedSearchCV: All estimators failed to fit
I am currently working on the “French Motor Claims Datasets freMTPL2freq” Kaggle competition (https://www.kaggle.com/floser/french-motor-claims-datasets-fremtpl2freq). Unfortunately I get a “NotFittedError: All estimators failed to fit” error whenever I am using RandomizedSearchCV and I cannot figure out why that is. Any help is much appreciated. The first five rows of the original dataframe data_freq look like this: The error I get is
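"All estimators failed to fit" hides the per-fold exception; passing error_score="raise" to the search makes the underlying error surface, which is usually the fastest way to debug it. A minimal sketch on synthetic data (not the freMTPL2freq set):

```python
# error_score="raise" re-raises the first failing fit's real exception
# instead of collapsing everything into NotFittedError.
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=100, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={"n_estimators": randint(5, 20)},
    n_iter=3,
    cv=3,
    error_score="raise",   # surface per-fold exceptions, don't swallow them
    random_state=0,
)
search.fit(X, y)
```

With the default `error_score=np.nan`, failing fits are silently scored as NaN until every candidate has failed, at which point only the unhelpful summary error remains.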
GaussianProcessRegressor ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size
I am running the following code: The shape of my input is (19142, 21); dtypes are each float64. Added in edit: X and y are Pandas DataFrames; after .values they’re each numpy arrays. And I get the error: ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size. I can’t imagine a dataset of 20000
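The size is plausible once you account for the kernel matrix: GaussianProcessRegressor builds an n × n covariance matrix, so 19142 samples already need roughly 2.7 GiB for a single float64 copy. A back-of-the-envelope check:

```python
# The kernel (covariance) matrix alone is n x n float64 values.
n = 19142
kernel_bytes = n * n * 8            # 8 bytes per float64
gib = kernel_bytes / 2**30          # ~2.7 GiB for one copy
# The Cholesky factorization and intermediate copies raise the peak further,
# and on a 32-bit NumPy build this already exceeds the maximum array size
# that the "array is too big" ValueError complains about.
```

So the dataset itself is small; it is the quadratic kernel matrix that blows up, which is why exact Gaussian processes are usually limited to a few thousand samples.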