I am using SMOTE to balance the output (y) for model training only, but I want to test the model on the original data, since it does not make sense to evaluate the model on SMOTE-created outputs. Please ask for clarification if I haven't explained this well; it's my first question on Stack Overflow.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)

# Splitting dataset into train and test (SMOTE)
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=42)
Here I applied a Random Forest classifier to my data:
import math
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# RF = RandomForestClassifier(n_estimators=100)
# RF.fit(X_train, y_train.values.ravel())
# y_pred = RF.predict(X)
# print(metrics.classification_report(y, y_pred))

RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())
If I do this, X also contains the rows that were already used for training. How can I exclude the data that was used to train the model?
y_pred = RF.predict(X)
print(metrics.classification_report(y, y_pred))
Answer
I have used SMOTE in the past, and it is suboptimal: researchers have demonstrated flaws in the distribution generated by the Synthetic Minority Oversampling Technique (SMOTE). I know we sometimes have no choice about unbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier, whose class_weight parameter lets you handle the class-imbalance problem directly. Check the scikit-learn documentation: