I trained an SVM iteratively on large chunks of data using sklearn. Each CSV file is a part of an image, created with a sliding-window approach. I used partial_fit() for fitting the SVM as well as the scaler. The features are the RGBN values of the image pixels, and I want to classify them into two groups, 0 and 1.
```python
def trainSVMIterative(directory):
    clf = SGDClassifier(learning_rate='constant', eta0=0.1, shuffle=False,
                        n_iter_no_change=5, warm_start=True)
    sc = StandardScaler()
    firstIter = True
    iter = 0
    for filename in os.listdir(directory):
        if filename.endswith('.csv'):
            pixels = pd.read_csv(os.path.join(directory, filename), sep=',')
            # drop columns containing irrelevant information
            pixels = pixels.drop('x', axis=1)
            pixels = pixels.drop('y', axis=1)
            # dataset
            X = pixels.drop('label', axis=1)
            # labels
            Y = pixels['label']
            # prepare training data
            X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=42)
            # fit scaler
            sc = sc.partial_fit(X)
            # scale input
            X_train = sc.transform(X_train)
            X_test = sc.transform(X_test)
            # train svm
            if firstIter:
                clf.partial_fit(X_train, y_train, classes=np.unique(Y))
                firstIter = False
            else:
                clf.partial_fit(X_train, y_train)

            testPred = clf.predict(X_test)
            print(classification_report(y_test, testPred))
            iter += 1
            print(iter)

    return clf, sc
```
When I print the classification report after each iteration it looks fine; the accuracy goes up to 98%. I therefore assume my classifier is training properly. For testing, I extracted a new dataframe from my original image. This time there is no label column. I pass the classifier as well as the scaler to my testing function:
```python
def testClassifier(path, classifier, scaler):
    # opening the original image, same process as in creating the training data
    raster = gdal.Open(path)
    array = tifToImgArray(raster, 'uint8')
    # select a part of the image to test on
    windowSize = 1000
    y = 19000
    x = 0
    window = array[y:y + windowSize, x:x + windowSize]
    # create the dataframe
    arrayData = []
    for i in range(window.shape[0]):
        for j in range(window.shape[1]):
            arrayData.append([i, j, array[i, j, 0], array[i, j, 1], array[i, j, 2], array[i, j, 3]])
    dfData = pd.DataFrame(arrayData, columns=['x', 'y', 'R', 'G', 'B', 'N'])
    # again drop position information
    pixels = dfData.drop('x', axis=1)
    pixels = pixels.drop('y', axis=1)
    # use scaler
    pixels = scaler.transform(pixels)
    # make prediction
    prediction = classifier.predict(pixels)
    image = visualizePrediction(prediction, window, dfData)

    return image
```
My problem now is that the classifier predicts label "1" for every pixel. The dataframe X I used for testing is the same one I used in one of the training runs; there is just no split into training and test data, I used the whole set. I don't really understand what I'm doing wrong, since the classifier worked pretty well on a subset of X. I was thinking it might be a problem that there are more data points labeled "1" than labeled "0" and I don't apply any weights to the data. But then again, why does it work when I split the dataset into X_train and X_test, since this is also the case there? I would appreciate help on this issue. Regards
Answer
If the accuracy of our model is very high, that doesn't necessarily mean it's doing well all the time (it might do well on the training data but not on the test data).
Example: let's say we have 100 rows of data, where label 1 has 90 rows and label 0 has 10 rows. We do the usual train_test_split, train our model (in our case a classification model), and see that the accuracy is 95%. Since our model looks so good, we deploy it, and a few days later we notice that it's not predicting as well as it did during training. The reason is that the data is skewed: even when the model gets "label 0" wrong, the accuracy still looks high because the "label 1" predictions dominate. I think this is exactly what's happening with your data. We can check this with a confusion matrix once we have made the predictions; the confusion matrix evaluates the per-class performance of the model. Other metrics such as the F1 score and the ROC curve are quite useful as well.
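Here is a minimal, self-contained sketch (with made-up numbers, not the asker's data) of how that plays out: a dummy classifier that always predicts the majority class reaches 90% accuracy on a 90/10 split, while the confusion matrix and the per-class report show that class 0 is never predicted correctly.

```python
# Sketch: high accuracy on skewed data can hide a useless classifier.
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 90 rows of label 1, 10 rows of label 0, with a single dummy feature
y = np.array([1] * 90 + [0] * 10)
X = np.random.rand(100, 1)

clf = DummyClassifier(strategy='most_frequent')  # always predicts the majority class
clf.fit(X, y)
pred = clf.predict(X)

print(accuracy_score(y, pred))         # 0.9 -- looks fine
print(confusion_matrix(y, pred))       # [[ 0 10]
                                       #  [ 0 90]] -- every label-0 row is wrong
print(classification_report(y, pred))  # F1 for class 0 is 0.0
```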
We can solve this by getting our data into the right distribution. But most of the time we are short of data, and all the data we have is this skewed data. In such cases we can do something called oversampling and undersampling.
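As an illustration of the oversampling idea, here is one possible sketch using sklearn.utils.resample. The helper oversample_minority is hypothetical (not from the question); it assumes a DataFrame with a 'label' column like the chunks in the question. Libraries such as imbalanced-learn also provide ready-made over- and undersamplers.

```python
# Sketch of naive random oversampling of the minority class inside one chunk,
# assuming a DataFrame `pixels` with a 'label' column as in the question.
import pandas as pd
from sklearn.utils import resample

def oversample_minority(pixels, label_col='label', random_state=42):
    counts = pixels[label_col].value_counts()
    majority_label = counts.idxmax()
    minority_label = counts.idxmin()

    majority = pixels[pixels[label_col] == majority_label]
    minority = pixels[pixels[label_col] == minority_label]

    # draw minority rows with replacement until both classes are the same size
    minority_upsampled = resample(minority,
                                  replace=True,
                                  n_samples=len(majority),
                                  random_state=random_state)

    balanced = pd.concat([majority, minority_upsampled])
    # shuffle so partial_fit does not see one class after the other
    return balanced.sample(frac=1, random_state=random_state)
```

In the training loop from the question, something like this could be applied to each chunk right after reading the CSV and before the train/test split.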