I am attempting to see how well I can classify books according to genre using TfidfVectorizer. I am using five moderately imbalanced genre labels, and I want to use multilabel classification to assign each document one or more genres. Initially my performance was middling, so I tried to fix this by re-balancing the classes with RandomOverSampler, and my cross-validated f1_macro score shot up from 0.415 to 0.842.
I have read here that improperly combining resampling with cross-validation can cause your model to overfit, so I want to make sure I'm not doing that here.
import re

import nltk
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn import compose, multiclass, pipeline, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score


def preprocess_text(text):
    # Strip non-letters, lowercase, drop stopwords, and lemmatize.
    try:
        text = re.sub('[^a-zA-Z]', ' ', text)
        text = text.lower().split()
        text = [word for word in text if word not in set(nltk.corpus.stopwords.words('english'))]
        text = [nltk.stem.WordNetLemmatizer().lemmatize(word) for word in text if len(word) > 1]
        return ' '.join(text)
    except TypeError:
        return ''


def preprocess_series(series):
    texts = []
    for i in range(len(series)):
        texts.append(preprocess_text(series[i]))
    return pd.Series(texts)


books_data = pd.DataFrame([
    ["A_Likely_Story.txt", "fantasy fiction:science fiction", "If you discovered a fantastic power like thi..."],
    ["All_Cats_Are_Gray.txt", "science fiction", "An odd story, made up of oddly assorted elem..."],
], columns=["title", "genre", "text"])

X = pd.DataFrame(preprocess_series(books_data["text"]), columns=["text"])
Y = pd.Series([genres.split(":")[0] for genres in books_data["genre"]])

# Oversample the whole dataset before cross-validation.
oversampler = RandomOverSampler()
x_ros, y_ros = oversampler.fit_resample(X, Y)

column_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1, 3)), "text")
)
ovr_svc_clf = multiclass.OneVsRestClassifier(svm.LinearSVC())
pipe = pipeline.make_pipeline(column_trans, ovr_svc_clf)

# Cross-validated scores on the original and on the oversampled data.
print(cross_val_score(pipe, X, Y, cv=3, scoring="f1_macro").mean())
print(cross_val_score(pipe, x_ros, y_ros, cv=3, scoring="f1_macro").mean())
Here is the distribution of my class labels. Is it small and imbalanced enough to cause overfitting?
Answer
Oversampling doesn’t cause overfitting.
Oversampling before splitting for cross-validation causes data leakage, and the scores you’re seeing are indeed not usable as estimates of future performance. Your test folds (probably) contain copies of the same data points included in training folds.
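A quick way to see the leak, assuming x_ros from your code above (run on your full dataset, not the two-row sample), is to count how many test rows of each split also appear verbatim among the training rows. Plain KFold is used here just for illustration:

# Sketch: after oversampling the whole dataset, ordinary CV splits put exact
# duplicates of some test rows into the training rows of the same split.
from sklearn.model_selection import KFold

for train_idx, test_idx in KFold(n_splits=3, shuffle=True, random_state=0).split(x_ros):
    train_texts = set(x_ros.iloc[train_idx]["text"])
    leaked = sum(text in train_texts for text in x_ros.iloc[test_idx]["text"])
    print(f"{leaked} of {len(test_idx)} test rows also appear in the training rows")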
You can add the oversampling as a first step in the pipeline (and use the imblearn version of a pipeline, if you aren't already) to alleviate this issue.
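As a rough sketch with your column transformer and classifier (the step names are just illustrative): imblearn's Pipeline calls fit_resample only on the training portion of each fold, so the test folds are never duplicated into training. RandomOverSampler only duplicates rows, so it can sit in front of the TF-IDF step and operate on the raw text column.

# Sketch: oversampling happens inside each training fold, never in the test fold.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbPipeline
from sklearn import compose, multiclass, svm
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score

column_trans = compose.make_column_transformer(
    (TfidfVectorizer(ngram_range=(1, 3)), "text")
)

resampled_pipe = ImbPipeline(steps=[
    ("oversample", RandomOverSampler()),  # duplicates minority rows of the training fold only
    ("tfidf", column_trans),
    ("clf", multiclass.OneVsRestClassifier(svm.LinearSVC())),
])

# Pass the original, un-resampled X and Y from the question.
print(cross_val_score(resampled_pipe, X, Y, cv=3, scoring="f1_macro").mean())

The score you get this way will likely drop back toward the un-resampled number, which is the honest estimate.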
All that said, try modeling without balancing, using a custom decision threshold or a threshold-independent metric.
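For the custom-threshold idea, a hedged sketch reusing your pipe fitted on unbalanced data: pick a per-genre cut-off from the precision-recall curve instead of LinearSVC's default decision boundary at 0. The train/test split, the genre name, and maximizing F1 are illustrative assumptions; in practice, tune the threshold on a validation set rather than the final test set.

# Sketch: choose a per-genre decision threshold from the precision-recall curve.
import numpy as np
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.3, stratify=Y, random_state=0)
pipe.fit(X_train, y_train)

genre = "science fiction"                        # hypothetical example class
col = list(pipe.classes_).index(genre)
scores = pipe.decision_function(X_test)[:, col]  # one-vs-rest margin for that genre

precision, recall, thresholds = precision_recall_curve(y_test == genre, scores)
f1 = 2 * precision * recall / (precision + recall + 1e-12)
print("best threshold for", genre, ":", thresholds[np.argmax(f1[:-1])])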