
Cross Validation with COCO data format JSON files

I am a newbie ML learner trying semantic image segmentation on Google Colab, with COCO data format JSON annotations and lots of images on Google Drive.

Update

I borrowed this code as a starting point, so my code on Colab is pretty much like this notebook: https://github.com/akTwelve/tutorials/blob/master/mask_rcnn/MaskRCNN_TrainAndInference.ipynb

Every time I receive new annotation data, I split the exported JSON file into two JSONs (train/validate with an 80/20 ratio). But this is getting tiring, since I have more than 1000 annotations in a file and I do it manually with the replace function of VS Code.

Is there a better way to do this programmatically on Google Colab?

So what I would like to do is rotate the annotation data without splitting the JSON file manually.

Say I have 1000 annotations in ONE JSON file on my Google Drive. I would like to use annotations 1-800 for training and 801-1000 for validation in the first training session; then, for the next session, use annotations 201-1000 for training and 1-200 for validation. In other words, select a part of the data in the JSON from code on Colab.

Or if I can rotate the data during one training session (K-Fold Cross Validation?), that is even better, but I have no clue how to do this.
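(For context, a rotation like this can be scripted with sklearn's KFold. The following is only a sketch with a hypothetical file path, assuming the standard COCO layout where the top-level "images" and "annotations" lists are linked by image id; it splits by images so that each image's annotations travel with it.)

    import json
    from sklearn.model_selection import KFold

    # Hypothetical path to the exported COCO-format file on Google Drive.
    with open('/content/drive/MyDrive/annotations.json') as f:
        coco = json.load(f)

    images = coco['images']
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(kf.split(images)):
        for name, idx in (('train', train_idx), ('val', val_idx)):
            subset_images = [images[i] for i in idx]
            subset_ids = {img['id'] for img in subset_images}
            subset = dict(coco)  # shallow copy keeps 'categories', 'info', etc.
            subset['images'] = subset_images
            subset['annotations'] = [a for a in coco['annotations']
                                     if a['image_id'] in subset_ids]
            with open(f'fold{fold}_{name}.json', 'w') as out:
                json.dump(subset, out)

Each fold writes a train/val JSON pair that the existing loading code can consume unchanged.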

Here are parts of my code on Colab.

Loading json files (snippet omitted)

Initializing model (snippet omitted)

train (snippet omitted)

validate (snippet omitted)

json (snippet omitted)

FYI, my workflow is:

  1. Label images with the VIA annotation tool

  2. Export the annotations as COCO-format JSON

  3. Modify the JSON and save it to my Google Drive

  4. Load the JSON on Colab and start training


Answer

There’s a very good utility function in the sklearn library for doing exactly what you want here. It’s called train_test_split.
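For a first look, here is a minimal, self-contained example (the data here is just a stand-in for your 1000 annotations):

    from sklearn.model_selection import train_test_split

    data = list(range(1000))  # stand-in for 1000 annotations
    train, val = train_test_split(data, test_size=0.2, random_state=42)
    print(len(train), len(val))  # 800 200

Passing random_state makes the shuffle reproducible; omit it if you want a different split each call.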

Now, it’s hard to understand what your data structures are, but I am assuming that your loading code populates dataset_train with some kind of array of images, or else an array of the paths to the images. sklearn’s train_test_split function is able to accept pandas DataFrames as well as numpy arrays.

I am usually very comfortable with pandas DataFrames, so I would suggest you combine the training and validation data into one DataFrame using the pandas function concat, then create a random split using the sklearn function train_test_split at the beginning of every training epoch. It would look something like the following:

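(Sketch only; it assumes dataset_train and dataset_val are pandas DataFrames with one row per image, as loaded by your code above.)

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Pool all labelled data into one DataFrame.
    all_data = pd.concat([dataset_train, dataset_val], ignore_index=True)

    num_epochs = 10  # example value; use your own setting
    for epoch in range(num_epochs):
        # Fresh random 80/20 split at the start of every epoch.
        train_df, val_df = train_test_split(all_data, test_size=0.2)
        # ... rebuild your train/val datasets from train_df / val_df
        # and run one epoch of training here ...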

Just one last note: ideally, you should have three sets – train, test, and validation. So separate out a testing set beforehand, and then do the train_test_split at the beginning of every iteration of the training loop to obtain your train-validation split from the remaining data.
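Concretely, that could look like this (again a sketch; the 10% hold-out size is arbitrary, and all_data is the pooled DataFrame from above):

    from sklearn.model_selection import train_test_split

    # Hold out a fixed test set once, up front.
    trainval_df, test_df = train_test_split(all_data, test_size=0.1,
                                            random_state=42)

    num_epochs = 10  # example value
    for epoch in range(num_epochs):
        # Per-epoch train/validation split from the remaining 90%.
        train_df, val_df = train_test_split(trainval_df, test_size=0.2)

Fixing random_state on the first split keeps the test set identical across runs, while the per-epoch train/validation split stays random.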
