I want to use the class FunctionSampler from imblearn to create my own custom class for resampling my dataset.
I have a one-dimensional feature Series containing paths for each subject and a label Series containing the labels for each subject. Both come from a pd.DataFrame. I know that I have to reshape the feature array first since it is one-dimensional.
When I use the class RandomUnderSampler directly, everything works fine. However, if I pass both the features and labels to the fit_resample method of FunctionSampler, which then creates an instance of RandomUnderSampler and calls fit_resample on that instance, I get the following error:
    ValueError: could not convert string to float: 'path_1'
Here’s a minimal example producing the error:
    import pandas as pd
    from imblearn.under_sampling import RandomUnderSampler
    from imblearn import FunctionSampler

    # Create one-dimensional feature and label arrays X and y.
    # X has to be converted to a numpy array and then reshaped to 2D.
    X = pd.Series(['path_1', 'path_2', 'path_3'])
    X = X.values.reshape(-1, 1)
    y = pd.Series([1, 0, 0])
FIRST METHOD (works)

    rus = RandomUnderSampler()
    X_res, y_res = rus.fit_resample(X, y)

SECOND METHOD (doesn't work)

    def resample(X, y):
        return RandomUnderSampler().fit_resample(X, y)

    sampler = FunctionSampler(func=resample)
    X_res, y_res = sampler.fit_resample(X, y)
Does anyone know what goes wrong here? It seems as if the fit_resample method of FunctionSampler does not behave the same way as the fit_resample method of RandomUnderSampler…
Answer
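Your implementation of FunctionSampler is correct. The problem is with your dataset.
RandomUnderSampler seems to work for text data as well, because it does not apply the numeric check_X_y validation to X.
FunctionSampler, however, runs this check by default (its validate parameter defaults to True). You can reproduce the check on its own: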
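    import pandas as pd
    from sklearn.utils import check_X_y

    X = pd.Series(['path_1', 'path_2', 'path_2'])
    X = X.values.reshape(-1, 1)
    y = pd.Series([1, 0, 0])

    # With its default settings, check_X_y tries to convert X to a numeric dtype.
    check_X_y(X, y)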
This will throw an error:

    ValueError: could not convert string to float: 'path_1'
Strings that can be converted to numbers pass the check, so the following example works:
    import pandas as pd
    from imblearn import FunctionSampler
    from imblearn.under_sampling import RandomUnderSampler

    X = pd.Series(['1', '2', '2'])
    X = X.values.reshape(-1, 1)
    y = pd.Series([1, 0, 0])

    def resample(X, y):
        return RandomUnderSampler().fit_resample(X, y)

    sampler = FunctionSampler(func=resample)
    X_res, y_res = sampler.fit_resample(X, y)
    X_res, y_res
    # (array([[2.],
    #         [1.]]), array([0, 1], dtype=int64))