I am trying to apply resampling for my dataset which has unbalanced classes. What I have done is the following:
    from sklearn.utils import resample

    y = df.Label
    vectorizer = CountVectorizer()
    X = vectorizer.fit_transform(df['Text'].replace(np.NaN, ""))
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.30, stratify=y)

    # concatenate our training data back together
    X = pd.concat([X_train, y_train], axis=1)

    # separate minority and majority classes
    not_df = X[X.Label==0]
    df = X[X.Label==1]

    # upsample minority
    df_upsampled = resample(df, replace=True, n_samples=len(not_df), random_state=27)

    # combine majority and upsampled minority
    upsampled = pd.concat([not_df, df_upsampled])
Unfortunately, this step fails:
    X = pd.concat([X_train, y_train], axis=1)
with the following error:
    /anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in concat(objs, axis, join, ignore_index, keys, levels, names, verify_integrity, sort, copy)
        279         verify_integrity=verify_integrity,
        280         copy=copy,
    --> 281         sort=sort,
        282     )
        283

    /anaconda3/lib/python3.7/site-packages/pandas/core/reshape/concat.py in __init__(self, objs, axis, join, keys, levels, names, ignore_index, verify_integrity, copy, sort)
        355                     "only Series and DataFrame objs are valid".format(typ=type(obj))
        356                 )
    --> 357             raise TypeError(msg)
        358
        359         # consolidate

    TypeError: cannot concatenate object of type '<class 'scipy.sparse.csr.csr_matrix'>'; only Series and DataFrame objs are valid
You can think of the Text column as:
    Text
    Have a non-programming question?
    More helpful links
    I am trying to apply...
I hope you can help me resolve this.
Answer
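You have to convert X_train to a DataFrame before calling concat. Because X_train is a SciPy sparse matrix, convert it to a dense array first, and reuse y_train's index so the rows line up:
    X = pd.concat([pd.DataFrame(X_train.toarray(), index=y_train.index), y_train], axis=1)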
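For reference, here is a minimal end-to-end sketch of the upsampling workflow from the question. It runs on a small toy dataset, which is hypothetical and only stands in for the real df. Two details are assumptions on my part rather than something stated in the question: the sparse matrix is densified with .toarray() before being wrapped in a DataFrame, and y_train's index is passed to the new DataFrame so that concat along axis=1 aligns rows instead of producing NaN. The random_state on the split is also added here only to make the run repeatable.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical toy data standing in for the question's df (14 majority, 6 minority rows).
df = pd.DataFrame({
    "Text": ["plain ordinary message"] * 14 + ["rare unusual message"] * 6,
    "Label": [0] * 14 + [1] * 6,
})

y = df["Label"]
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(df["Text"].fillna(""))  # SciPy sparse matrix

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, stratify=y, random_state=27
)

# Densify the sparse matrix and reuse y_train's index so rows align in concat.
X_train_df = pd.DataFrame(X_train.toarray(), index=y_train.index)
train = pd.concat([X_train_df, y_train], axis=1)

# Separate majority and minority classes.
majority = train[train["Label"] == 0]
minority = train[train["Label"] == 1]

# Upsample the minority class (with replacement) to the majority size.
minority_upsampled = resample(
    minority, replace=True, n_samples=len(majority), random_state=27
)

# Combine the majority rows with the upsampled minority rows.
upsampled = pd.concat([majority, minority_upsampled])
print(upsampled["Label"].value_counts())
```

Calling .toarray() can use a lot of memory when the vocabulary is large. In that case pd.DataFrame.sparse.from_spmatrix(X_train, index=y_train.index) may be worth trying instead, since it keeps the columns sparse.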