SVM working well on test subset fails on whole dataset

Question

I trained a SVM iterativly on large chunks of data using sklearn. Each csv file is a part of an image. I made those with a sliding window aproach. I used partial_fit() for fitting the SVM as well as the scaler. The features are the RGBN values of an image, I want to classify the image in two different groups

Accepted Answer

If the accuracy of our model is very high, that doesn&#8217;t necessarily mean that it&#8217;s doing very well all the time(it might on train_data but not on test_data).Example: Let&#8217;s say we have some 100 rows of data where label1 = 90 #rows and label0 = 10 #rows. We do the usual test_train_split and train our model(in our case, it&#8217;s a classification model) and see that the accuracy is 95%. Since our model is soo perfect, we deploy it and a few days later we notice that it&#8217;s not predicting as good as it did during our training. The reason is because is data is skewed. Even when the model predicts the &#8220;label0&#8221; wrong, the accuracy still increases as the &#8220;label1&#8221; prediction dominates the other and I think this is exactly what&#8217;s happening with your data.We can check this using Confusion Matrix once we did the prediction. Confusion matrix evaluates the performance of our model. We also have other metrics such as F1 score, Roc curve which are quite useful as well.We can solve this by getting our data into right distribution. But most of the time we are short of data and all the data which we have is this skewed data. In such cases we can do something called Oversampling and Undersampling.

Advertisement

Answer