I am using SMOTE to balance the output (y) for model training only, but I want to test the model on the original data, since it does not make sense to evaluate the model on SMOTE-created outputs. Please ask for clarification if I haven't explained this well; it's my first question on Stack Overflow.
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split

oversample = SMOTE()
X_sm, y_sm = oversample.fit_resample(X, y)

# Splitting dataset into train and test (SMOTE)
X_train, X_test, y_train, y_test = train_test_split(X_sm, y_sm, test_size=0.2, random_state=42)
Here I applied a Random Forest classifier to my data:
import math
from sklearn.metrics import accuracy_score, confusion_matrix
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier
from sklearn import metrics

# RF = RandomForestClassifier(n_estimators=100)
# RF.fit(X_train, y_train.values.ravel())
# y_pred = RF.predict(X)
# print(metrics.classification_report(y, y_pred))

RF = RandomForestClassifier(n_estimators=10)
RF.fit(X_train, y_train.values.ravel())
If I do this, X also contains the rows that were already used for training. How can I exclude the data that was used to train the model?
y_pred = RF.predict(X)
print(metrics.classification_report(y, y_pred))
Answer
I have used SMOTE in the past, and it is suboptimal: researchers have demonstrated flaws in the distribution generated by the Synthetic Minority Oversampling Technique (SMOTE). I know we sometimes have no choice about unbalanced classes, but you can use sklearn.ensemble.RandomForestClassifier, whose class_weight parameter lets you handle the class-imbalance problem directly. Check the scikit-learn documentation: