
Cross Validation with COCO data format JSON files

I am a newbie ML learner trying semantic image segmentation on Google Colab, with COCO data format JSON annotations and lots of images on Google Drive.

Update

I borrowed this code as a starting point, so my code on Colab is pretty much like this notebook: https://github.com/akTwelve/tutorials/blob/master/mask_rcnn/MaskRCNN_TrainAndInference.ipynb

Every time I receive new annotation data, I split the exported JSON file into two JSONs (train/validate with an 80/20 ratio). But this is getting tiring, since I have more than 1000 annotations in a file and I do it manually with the replace function of VS Code.

Is there a better way to do this programmatically on Google Colab?

So what I would like to do is rotate the annotation data without splitting the JSON file manually.

Say I have 1000 annotations in ONE JSON file on my Google Drive. I would like to use annotations 1-800 for training and 801-1000 for validation in the first training session; then, for the next session, use annotations 201-1000 for training and 1-200 for validation. In other words, select a part of the data in the JSON from code on Colab.

Or if I can rotate the data during one training session (K-Fold Cross Validation?), that is even better, but I have no clue how to do this.
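(For context, a rotation like this can be scripted with sklearn's KFold. The following is only a sketch with a hypothetical file path, assuming the standard COCO layout where the top-level "images" and "annotations" lists are linked by image id; it splits by images so that each image's annotations travel with it.)

    import json
    from sklearn.model_selection import KFold

    # Hypothetical path to the exported COCO-format file on Google Drive.
    with open('/content/drive/MyDrive/annotations.json') as f:
        coco = json.load(f)

    images = coco['images']
    kf = KFold(n_splits=5, shuffle=True, random_state=42)

    for fold, (train_idx, val_idx) in enumerate(kf.split(images)):
        for name, idx in (('train', train_idx), ('val', val_idx)):
            subset_images = [images[i] for i in idx]
            subset_ids = {img['id'] for img in subset_images}
            subset = dict(coco)  # shallow copy keeps 'categories', 'info', etc.
            subset['images'] = subset_images
            subset['annotations'] = [a for a in coco['annotations']
                                     if a['image_id'] in subset_ids]
            with open(f'fold{fold}_{name}.json', 'w') as out:
                json.dump(subset, out)

Each fold writes a train/val JSON pair that the existing loading code can consume unchanged.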

Here are parts of my code on Colab.

Loading json files (snippet omitted)

Initializing model (snippet omitted)

train (snippet omitted)

validate (snippet omitted)

json (snippet omitted)

FYI, my workflow is:

  1. Label images with the VIA annotation tool

  2. Export the annotations as COCO-format JSON

  3. Modify the JSON and save it to my Google Drive

  4. Load the JSON on Colab and start training


Answer

There’s a very good utility function in the sklearn library for doing exactly what you want here. It’s called train_test_split.
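For a first look, here is a minimal, self-contained example (the data here is just a stand-in for your 1000 annotations):

    from sklearn.model_selection import train_test_split

    data = list(range(1000))  # stand-in for 1000 annotations
    train, val = train_test_split(data, test_size=0.2, random_state=42)
    print(len(train), len(val))  # 800 200

Passing random_state makes the shuffle reproducible; omit it if you want a different split each call.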

Now, it’s hard to understand what your data structures are, but I am assuming that your loading code populates dataset_train with some kind of array of images, or else an array of the paths to the images. sklearn’s train_test_split function is able to accept pandas DataFrames as well as numpy arrays.

I am usually very comfortable with pandas DataFrames, so I would suggest you combine the training and validation data into one DataFrame using the pandas function concat, then create a random split using the sklearn function train_test_split at the beginning of every training epoch. It would look something like the following:

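(Sketch only; it assumes dataset_train and dataset_val are pandas DataFrames with one row per image, as loaded by your code above.)

    import pandas as pd
    from sklearn.model_selection import train_test_split

    # Pool all labelled data into one DataFrame.
    all_data = pd.concat([dataset_train, dataset_val], ignore_index=True)

    num_epochs = 10  # example value; use your own setting
    for epoch in range(num_epochs):
        # Fresh random 80/20 split at the start of every epoch.
        train_df, val_df = train_test_split(all_data, test_size=0.2)
        # ... rebuild your train/val datasets from train_df / val_df
        # and run one epoch of training here ...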

Just one last note: ideally, you should have three sets – train, test, and validation. So separate out a testing set beforehand, and then do the train_test_split at the beginning of every iteration of the training loop to obtain your train-validation split from the remaining data.
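Concretely, that could look like this (again a sketch; the 10% hold-out size is arbitrary, and all_data is the pooled DataFrame from above):

    from sklearn.model_selection import train_test_split

    # Hold out a fixed test set once, up front.
    trainval_df, test_df = train_test_split(all_data, test_size=0.1,
                                            random_state=42)

    num_epochs = 10  # example value
    for epoch in range(num_epochs):
        # Per-epoch train/validation split from the remaining 90%.
        train_df, val_df = train_test_split(trainval_df, test_size=0.2)

Fixing random_state on the first split keeps the test set identical across runs, while the per-epoch train/validation split stays random.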
