I trained an SVM iteratively on large chunks of data using sklearn. Each CSV file is a part of an image, created with a sliding-window approach. I used partial_fit() for fitting the SVM as well as the scaler. The features are the RGBN values of the image pixels, and I want to classify them into two groups, 0 and 1.
```python
def trainSVMIterative(directory):
    clf = SGDClassifier(learning_rate='constant', eta0=0.1, shuffle=False,
                        n_iter_no_change=5, warm_start=True)
    sc = StandardScaler()
    firstIter = True
    iter = 0
    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            pixels = pd.read_csv(os.path.join(directory, filename), sep=',')
            # drop columns containing irrelevant information
            pixels = pixels.drop('x', axis=1)
            pixels = pixels.drop('y', axis=1)
            # dataset
            X = pixels.drop('label', axis=1)
            # labels
            Y = pixels['label']
            # prepare training data
            X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
            # fit scaler
            sc = sc.partial_fit(X)
            # scale input
            X_train = sc.transform(X_train)
            X_test = sc.transform(X_test)
            # train svm
            if firstIter:
                clf.partial_fit(X_train, y_train, classes=np.unique(Y))
                firstIter = False
            else:
                clf.partial_fit(X_train, y_train)

            testPred = clf.predict(X_test)
            print(classification_report(y_test, testPred))
            iter += 1
            print(iter)

    return clf, sc
```
When I print the classification report after each iteration it looks fine; the accuracy goes up to 98%. I therefore assume my classifier is training properly. For testing, I extracted a new dataframe from my original image. This time there is no label column. I pass the classifier as well as the scaler to my testing function:
```python
def testClassifier(path, classifier, scaler):
    # opening the original image, same process as in creating the training data
    raster = gdal.Open(path)
    array = tifToImgArray(raster, 'uint8')
    # select a part of the image to test on
    windowSize = 1000
    y = 19000
    x = 0
    window = array[y:y + windowSize, x:x + windowSize]
    # create the dataframe
    arrayData = []
    for i in range(window.shape[0]):
        for j in range(window.shape[1]):
            arrayData.append([i, j, array[i, j, 0], array[i, j, 1], array[i, j, 2], array[i, j, 3]])
    dfData = pd.DataFrame(arrayData, columns=['x', 'y', 'R', 'G', 'B', 'N'])
    # again drop position information
    pixels = dfData.drop('x', axis=1)
    pixels = pixels.drop('y', axis=1)
    # use scaler
    pixels = scaler.transform(pixels)
    # make prediction
    prediction = classifier.predict(pixels)
    image = visualizePrediction(prediction, window, dfData)

    return image
```
My problem now is that the classifier predicts label "1" for every pixel. The dataframe X I used for testing is the same one I used in one of the training runs; there is just no split into training and test data, I used the whole set. I don't really understand what I'm doing wrong, since the classifier worked pretty well on a subset of X. I was thinking it might be a problem that there are more data points labeled "1" than labeled "0" and I don't apply any weights to the data. But then again, why does it work when I split the dataset into X_train and X_test, since this is also the case there? I would appreciate help on this issue. Regards
Answer
If the accuracy of our model is very high, that doesn't necessarily mean it's doing well all the time (it might do well on the training data but not on the test data).
Example: let's say we have 100 rows of data, where label 1 has 90 rows and label 0 has 10 rows. We do the usual train_test_split, train our model (in our case a classification model), and see that the accuracy is 95%. Since our model looks so good, we deploy it, and a few days later we notice that it's not predicting as well as it did during training. The reason is that the data is skewed: even when the model gets "label 0" wrong, the accuracy still looks high because the "label 1" predictions dominate. I think this is exactly what's happening with your data. We can check this with a confusion matrix once we have made the predictions; the confusion matrix evaluates the per-class performance of the model. Other metrics such as the F1 score and the ROC curve are quite useful as well.
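Here is a minimal, self-contained sketch (with made-up numbers, not the asker's data) of how that plays out: a dummy classifier that always predicts the majority class reaches 90% accuracy on a 90/10 split, while the confusion matrix and the per-class report show that class 0 is never predicted correctly.

```python
# Sketch: high accuracy on skewed data can hide a useless classifier.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 90 rows of label 1, 10 rows of label 0, with a single dummy feature
y = np.array([1] * 90 + [0] * 10)
X = np.random.rand(100, 1)

clf = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
clf.fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))         # 0.9 -- looks fine
print(confusion_matrix(y, pred))       # [[ 0 10]
                                       #  [ 0 90]] -- every label-0 row is wrong
print(classification_report(y, pred))  # F1 for class 0 is 0.0
```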
We can solve this by getting our data into the right distribution. But most of the time we are short of data, and all the data we have is this skewed data. In such cases we can do something called oversampling and undersampling.
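As an illustration of the oversampling idea, here is one possible sketch using sklearn.utils.resample. The helper oversample_minority is hypothetical (not from the question); it assumes a DataFrame with a 'label' column like the chunks in the question. Libraries such as imbalanced-learn also provide ready-made over- and undersamplers.

```python
# Sketch of naive random oversampling of the minority class inside one chunk,
# assuming a DataFrame `pixels` with a 'label' column as in the question.
import pandas as pd
from sklearn.utils import resample

def oversample_minority(pixels, label_col='label', random_state=42):
    counts = pixels[label_col].value_counts()
    majority_label = counts.idxmax()
    minority_label = counts.idxmin()

    majority = pixels[pixels[label_col] == majority_label]
    minority = pixels[pixels[label_col] == minority_label]

    # draw minority rows with replacement until both classes are the same size
    minority_upsampled = resample(minority,
                                  replace=True,
                                  n_samples=len(majority),
                                  random_state=random_state)

    balanced = pd.concat([majority, minority_upsampled])
    # shuffle so partial_fit does not see one class after the other
    return balanced.sample(frac=1, random_state=random_state)
```

In the training loop from the question, something like this could be applied to each chunk right after reading the CSV and before the train/test split.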