Imbalanced-Learn’s FunctionSampler throws ValueError


I want to use the class FunctionSampler from imblearn to create my own custom class for resampling my dataset.

I have a one-dimensional feature Series containing paths for each subject and a label Series containing the labels for each subject. Both come from a pd.DataFrame. I know that I have to reshape the feature array first since it is one-dimensional.

When I use RandomUnderSampler directly, everything works fine. However, if I pass the features and labels to the fit_resample method of a FunctionSampler whose function creates a RandomUnderSampler and calls fit_resample on it, I get the following error:

ValueError: could not convert string to float: 'path_1'

Here’s a minimal example producing the error:

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler
from imblearn import FunctionSampler

# create one dimensional feature and label arrays X and y
# X has to be converted to numpy array and then reshaped. 
X = pd.Series(['path_1','path_2','path_3'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

FIRST METHOD (works)

rus = RandomUnderSampler()
X_res, y_res = rus.fit_resample(X,y)

SECOND METHOD (doesn’t work)

def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)

sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)

Does anyone know what goes wrong here? It seems that the fit_resample method of FunctionSampler does not behave the same way as the fit_resample method of RandomUnderSampler.

Answer

Your use of FunctionSampler is correct. The problem is with your dataset.

RandomUnderSampler works for text data as well, because it does not force X to be numeric.

FunctionSampler, on the other hand, validates its input with check_X_y by default, and that check requires numeric data:

import pandas as pd
from sklearn.utils import check_X_y

X = pd.Series(['path_1','path_2','path_2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

check_X_y(X, y)

This raises the following error:

ValueError: could not convert string to float: 'path_1'
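
To see why the check fails: check_X_y expects numeric data by default, so for an object array it attempts a conversion to float. The sketch below reproduces just that conversion with plain NumPy. It is an illustration of the behavior, not the actual scikit-learn code path, and the variable names are made up for the example.

```python
import numpy as np

# Object arrays of strings, like the ones obtained from pd.Series(...).values
paths = np.array(['path_1', 'path_2', 'path_2'], dtype=object).reshape(-1, 1)
digits = np.array(['1', '2', '2'], dtype=object).reshape(-1, 1)

# Strings that look like numbers can be cast to float
as_float = digits.astype(np.float64)

# Arbitrary strings cannot, which raises a ValueError
try:
    paths.astype(np.float64)
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(as_float.ravel())
print(error_message)
```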

The following example works, because these strings can be converted to floats:

X = pd.Series(['1','2','2'])
X = X.values.reshape(-1,1)
y = pd.Series([1,0,0])

def resample(X, y):
    return RandomUnderSampler().fit_resample(X, y)

sampler = FunctionSampler(func=resample)
X_res, y_res = sampler.fit_resample(X, y)

X_res, y_res 
# (array([[2.],
#        [1.]]), array([0, 1], dtype=int64))


Source: stackoverflow