[Object Detection] RetinaNet/FCOS loss optimization, multi-scale training, mosaic data augmentation


All code has been uploaded to my GitHub repository: https://github.com/zgcr/pytorch-ImageNet-CIFAR-COCO-VOC-training

If you find it useful, please give it a star!

All of the following code has been tested with PyTorch 1.4 and confirmed to work correctly.

RetinaNet/FCOS loss optimization

In the previous implementation of RetinaNet/FCOS, the loss was computed by first calculating the classification/regression loss of each image separately and then averaging over the images. This treats every image as having the same weight. However, what an object detection model actually learns from are the annotated boxes in each image, and the number of annotated boxes varies from image to image, so each image should not carry the same weight. I therefore changed every loss term of RetinaNet/FCOS to be computed over the whole batch at once and then averaged, where the denominator is the number of positive anchors/points in the batch, which is linearly and positively correlated with the number of annotated boxes. After this modification, the loss computation is more parallel, the resulting gradients are more reasonable, the training time per epoch is reduced by about 40%, and the trained model performs better.
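A minimal sketch of the two averaging schemes (the function and variable names here are only illustrative, not the actual loss code in the repository):

def per_image_averaged_loss(per_image_losses, per_image_num_positives):
    # old scheme: each image's summed loss is divided by its own number of
    # positive samples, then the per-image results are averaged, so every
    # image carries the same weight regardless of how many boxes it has
    losses = [
        loss / max(float(num_pos), 1.0)
        for loss, num_pos in zip(per_image_losses, per_image_num_positives)
    ]
    return sum(losses) / len(losses)


def batch_averaged_loss(batch_loss_sum, batch_num_positives):
    # new scheme: the loss is summed over the whole batch and divided once
    # by the total number of positive anchors/points in the batch, which is
    # linearly and positively correlated with the number of annotated boxes
    return batch_loss_sum / max(float(batch_num_positives), 1.0)


# example: image A has 3 positives with summed loss 1.2,
#          image B has 9 positives with summed loss 9.0
print(per_image_averaged_loss([1.2, 9.0], [3, 9]))  # 0.7
print(batch_averaged_loss(1.2 + 9.0, 3 + 9))        # 0.85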

The training results of each model are as follows:

| Network | resize | batch | gpus | apex | syncbn | epochs | mAP | mAR | loss | training time (hours) |
|---|---|---|---|---|---|---|---|---|---|---|
| ResNet50-RetinaNet | 667 | 20 | 2 RTX2080Ti | yes | no | 12 | 0.305 | 0.421 | 0.56 | 17.43 |
| ResNet101-RetinaNet | 667 | 16 | 2 RTX2080Ti | yes | no | 12 | 0.306 | 0.420 | 0.55 | 22.06 |
| ResNet50-RetinaNet | 1000 | 16 | 4 RTX2080Ti | yes | no | 12 | 0.332 | 0.458 | 0.57 | 26.25 |
| ResNet50-FCOS | 667 | 24 | 2 RTX2080Ti | yes | no | 12 | 0.318 | 0.452 | 1.09 | 14.17 |
| ResNet101-FCOS | 667 | 20 | 2 RTX2080Ti | yes | no | 12 | 0.342 | 0.475 | 1.07 | 19.20 |
| ResNet50-FCOS | 1000 | 20 | 4 RTX2080Ti | yes | no | 12 | 0.361 | 0.502 | 1.10 | 18.92 |
| ResNet50-FCOS | 1333 | 12 | 4 RTX2080Ti | yes | no | 24 | 0.381 | 0.534 | 1.03 | 37.73 |

The yolov3-style resize method is used for training here. In terms of FLOPs, my resize=667 is roughly equivalent to resize=400 in the RetinaNet paper, resize=1000 to resize=600, and resize=1333 to resize=800. As the table shows, FCOS at resize=1333 reaches an mAP of 0.381, which is very close to the 0.386 mAP reported in the original paper after integrating all of its improvements.

Multi-scale training

Multi-scale training is widely used in object detection. The multi-scale method from yolov3/yolov5 is used here. First, determine a minimum-to-maximum size range: for example, if resize=416 is used without multi-scale, a scale range of [0.5, 1.5] can be chosen, and multiplying it by 416 gives the minimum and maximum sizes. Then choose a stride (32 for yolov3/yolov5) and collect all sizes between the minimum and maximum that are divisible by the stride; with resize=416, range [0.5, 1.5], and stride 32, the implementation below selects from the candidate sizes 224, 256, ..., 640. Finally, randomly select one of these sizes, resize all images in the batch so that their longest side equals this size, pad them into square images, and feed them to the network for training.

The code implementation of multi-scale training is as follows (note that when multi-scale is enabled, the resize operation is no longer applied in the dataset's transform, since the collater does the resizing):

import random

import cv2
import torch


class MultiScaleCollater():
    def __init__(self,
                 resize=512,
                 multi_scale_range=[0.5, 1.5],
                 stride=32,
                 use_multi_scale=False):
        self.resize = resize
        self.multi_scale_range = multi_scale_range
        self.stride = stride
        self.use_multi_scale = use_multi_scale

    def next(self, data):
        if self.use_multi_scale:
            min_resize = int(
                ((self.resize + self.stride) * self.multi_scale_range[0]) //
                self.stride * self.stride)
            max_resize = int(
                ((self.resize + self.stride) * self.multi_scale_range[1]) //
                self.stride * self.stride)

            final_resize = random.choice(
                range(min_resize, max_resize, self.stride))
        else:
            final_resize = self.resize

        imgs = [s['img'] for s in data]
        annots = [s['annot'] for s in data]
        scales = [s['scale'] for s in data]

        padded_img = torch.zeros((len(imgs), final_resize, final_resize, 3))

        for i, image in enumerate(imgs):
            height, width, _ = image.shape
            max_image_size = max(height, width)
            resize_factor = final_resize / max_image_size
            resize_height, resize_width = int(height * resize_factor), int(
                width * resize_factor)

            image = cv2.resize(image, (resize_width, resize_height))
            padded_img[i, 0:resize_height,
                       0:resize_width] = torch.from_numpy(image)

            annots[i][:, :4] *= resize_factor
            scales[i] = scales[i] * resize_factor

        # pad the annotations of every image in the batch to the same length;
        # -1 marks padded (invalid) entries
        max_num_annots = max(annot.shape[0] for annot in annots)

        if max_num_annots > 0:
            annot_padded = torch.ones((len(annots), max_num_annots, 5)) * (-1)
            for idx, annot in enumerate(annots):
                if annot.shape[0] > 0:
                    annot_padded[idx, :annot.shape[0], :] = torch.from_numpy(
                        annot)
        else:
            annot_padded = torch.ones((len(annots), 1, 5)) * (-1)

        padded_img = padded_img.permute(0, 3, 1, 2)

        return {'img': padded_img, 'annot': annot_padded, 'scale': scales}
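
A minimal usage sketch (the dummy dataset below is only a stand-in for a real dataset such as the CocoDetection class in the next section, and the batch size and image sizes are assumptions): the collater's next method is passed to the DataLoader as its collate_fn, so every image in a batch is resized and padded to the same randomly chosen square size.

import numpy as np
from torch.utils.data import DataLoader, Dataset


class DummyDetectionDataset(Dataset):
    # stand-in dataset: every sample is a dict with 'img' (HxWx3 float array),
    # 'annot' (Nx5 array of [x1, y1, x2, y2, class]) and 'scale'
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        img = np.random.rand(480, 640, 3).astype(np.float32)
        annot = np.array([[10., 20., 200., 300., 0.]], dtype=np.float32)
        return {'img': img, 'annot': annot, 'scale': 1.}


collater = MultiScaleCollater(resize=416,
                              multi_scale_range=[0.5, 1.5],
                              stride=32,
                              use_multi_scale=True)

train_loader = DataLoader(DummyDetectionDataset(),
                          batch_size=4,
                          shuffle=True,
                          collate_fn=collater.next)

for batch in train_loader:
    # e.g. torch.Size([4, 3, 608, 608]) torch.Size([4, 1, 5])
    print(batch['img'].shape, batch['annot'].shape)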

Mosaic data augmentation

In a previous article I wrote a version of mosaic data augmentation based on my own understanding, but when reading the yolov3/yolov5 code recently, I found that the official mosaic augmentation is somewhat different. Suppose resize=416 when mosaic is not used, so the image fed into the network is 416×416; after enabling mosaic augmentation, the image fed into the network becomes 832×832 (height and width are both doubled). First, 4 images are taken from the dataset at random and each is resized so that its longest side is 416. A new 832×832 image is then created and filled with the gray value 114 (divided by 255 in the code below, since load_image normalizes images to [0, 1]), and a center point is randomly picked in the range 416×0.5 to 416×1.5 for both x and y; this center point divides the new image into four regions: top-left, top-right, bottom-right and bottom-left. The four resized images are aligned to the center point by their bottom-right, bottom-left, top-left and top-right corners respectively, any part that would fall outside the corresponding region is cropped off, and the remaining part is pasted into that region of the new image. The annotation boxes are shifted by the same offsets, clipped to the new image, and boxes whose remaining width or height is no more than 1 pixel are discarded.

The code implementation of mosaic data augmentation is as follows:

import os
import cv2
import torch
import numpy as np
import random
import math
from torch.utils.data import Dataset
from pycocotools.coco import COCO
import torch.nn.functional as F

class CocoDetection(Dataset):
    def __init__(self,
                 image_root_dir,
                 annotation_root_dir,
                 set='train2017',
                 resize=416,
                 use_mosaic=False,
                 mosaic_center_range=[0.5, 1.5],
                 transform=None):
        self.image_root_dir = image_root_dir
        self.annotation_root_dir = annotation_root_dir
        self.set_name = set
        self.resize = resize
        self.use_mosaic = use_mosaic
        self.mosaic_center_range = mosaic_center_range
        self.transform = transform

        self.coco = COCO(
            os.path.join(self.annotation_root_dir,
                         'instances_' + self.set_name + '.json'))

        self.load_classes()

    def load_classes(self):
        self.image_ids = self.coco.getImgIds()
        self.cat_ids = self.coco.getCatIds()
        self.categories = self.coco.loadCats(self.cat_ids)
        self.categories.sort(key=lambda x: x['id'])

        # category_id is the original COCO id; coco_label is remapped to 0~79
        self.category_id_to_coco_label = {
            category['id']: i
            for i, category in enumerate(self.categories)
        }
        self.coco_label_to_category_id = {
            v: k
            for k, v in self.category_id_to_coco_label.items()
        }

    def __len__(self):
        return len(self.image_ids)

    def __getitem__(self, idx):
        if self.use_mosaic:
            # mosaic center x, y
            x_ctr, y_ctr = [
                int(
                    random.uniform(self.resize * self.mosaic_center_range[0],
                                   self.resize * self.mosaic_center_range[1]))
                for _ in range(2)
            ]
            # indices of the 4 images: the current image plus 3 randomly chosen ones
            imgs_indices = [idx] + [
                random.randint(0, len(self.image_ids) - 1) for _ in range(3)
            ]

            final_annots = []
            # combined image by 4 images
            # combined image built from 4 images; load_image returns images
            # normalized to [0, 1], so the gray fill value 114 is also divided by 255
            combined_img = np.full((self.resize * 2, self.resize * 2, 3),
                                   114 / 255.,
                                   dtype=np.float32)

            for i, index in enumerate(imgs_indices):
                img = self.load_image(index)
                annot = self.load_annotations(index)

                origin_height, origin_width, _ = img.shape
                resize_factor = self.resize / max(origin_height, origin_width)
                resize_height, resize_width = int(
                    origin_height * resize_factor), int(origin_width *
                                                        resize_factor)

                img = cv2.resize(img, (resize_width, resize_height))
                annot[:, :4] *= resize_factor

                # top left img
                if i == 0:
                    # combined image coordinates
                    x1a, y1a, x2a, y2a = max(x_ctr - resize_width,
                                             0), max(y_ctr - resize_height,
                                                     0), x_ctr, y_ctr
                    # single img chosen area
                    x1b, y1b, x2b, y2b = resize_width - (
                        x2a - x1a), resize_height - (
                            y2a - y1a), resize_width, resize_height
                # top right img
                elif i == 1:
                    x1a, y1a, x2a, y2a = x_ctr, max(y_ctr - resize_height,
                                                    0), min(
                                                        x_ctr + resize_width,
                                                        self.resize * 2), y_ctr
                    x1b, y1b, x2b, y2b = 0, resize_height - (y2a - y1a), min(
                        resize_width, x2a - x1a), resize_height
                # bottom left img
                elif i == 2:
                    x1a, y1a, x2a, y2a = max(x_ctr - resize_width,
                                             0), y_ctr, x_ctr, min(
                                                 self.resize * 2,
                                                 y_ctr + resize_height)
                    x1b, y1b, x2b, y2b = resize_width - (x2a - x1a), 0, max(
                        x_ctr, resize_width), min(y2a - y1a, resize_height)
                # bottom right img
                elif i == 3:
                    x1a, y1a, x2a, y2a = x_ctr, y_ctr, min(
                        x_ctr + resize_width,
                        self.resize * 2), min(self.resize * 2,
                                              y_ctr + resize_height)
                    x1b, y1b, x2b, y2b = 0, 0, min(
                        resize_width, x2a - x1a), min(y2a - y1a, resize_height)

                # combined_img[ymin:ymax, xmin:xmax]
                combined_img[y1a:y2a, x1a:x2a] = img[y1b:y2b, x1b:x2b]
                padw, padh = x1a - x1b, y1a - y1b

                # shift annotation coordinates into the combined image
                if annot.shape[0] > 0:
                    annot[:, 0] = annot[:, 0] + padw
                    annot[:, 1] = annot[:, 1] + padh
                    annot[:, 2] = annot[:, 2] + padw
                    annot[:, 3] = annot[:, 3] + padh

                final_annots.append(annot)

            final_annots = np.concatenate(final_annots, axis=0)
            final_annots[:, 0:4] = np.clip(final_annots[:, 0:4], 0,
                                           self.resize * 2)

            final_annots = final_annots[final_annots[:, 2] -
                                        final_annots[:, 0] > 1]
            final_annots = final_annots[final_annots[:, 3] -
                                        final_annots[:, 1] > 1]

            sample = {'img': combined_img, 'annot': final_annots, 'scale': 1.}

        else:
            img = self.load_image(idx)
            annot = self.load_annotations(idx)
            scale = 1.

            origin_height, origin_width, _ = img.shape
            resize_factor = self.resize / max(origin_height, origin_width)
            resize_height, resize_width = int(
                origin_height * resize_factor), int(origin_width *
                                                    resize_factor)

            img = cv2.resize(img, (resize_width, resize_height))
            annot[:, :4] *= resize_factor
            scale *= resize_factor

            sample = {'img': img, 'annot': annot, 'scale': scale}

        if self.transform:
            sample = self.transform(sample)

        return sample

    def load_image(self, image_index):
        image_info = self.coco.loadImgs(self.image_ids[image_index])[0]
        path = os.path.join(self.image_root_dir, image_info['file_name'])
        img = cv2.imread(path)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)

        return img.astype(np.float32) / 255.

    def load_annotations(self, image_index):
        # get ground truth annotations
        annotations_ids = self.coco.getAnnIds(
            imgIds=self.image_ids[image_index], iscrowd=False)
        annotations = np.zeros((0, 5))

        # some images appear to miss annotations
        if len(annotations_ids) == 0:
            return annotations

        # parse annotations
        coco_annotations = self.coco.loadAnns(annotations_ids)
        for _, a in enumerate(coco_annotations):
            # some annotations have basically no width / height, skip them
            if a['bbox'][2] < 1 or a['bbox'][3] < 1:
                continue

            annotation = np.zeros((1, 5))
            if a['bbox'][2] > 0 and a['bbox'][3] > 0:
                annotation[0, :4] = a['bbox']
                annotation[0, 4] = self.find_coco_label_from_category_id(
                    a['category_id'])

                annotations = np.append(annotations, annotation, axis=0)

        # transform from [x_min, y_min, w, h] to [x_min, y_min, x_max, y_max]
        annotations[:, 2] = annotations[:, 0] + annotations[:, 2]
        annotations[:, 3] = annotations[:, 1] + annotations[:, 3]

        return annotations

    def find_coco_label_from_category_id(self, category_id):
        return self.category_id_to_coco_label[category_id]

    def find_category_id_from_coco_label(self, coco_label):
        return self.coco_label_to_category_id[coco_label]

    def num_classes(self):
        return 80

    def image_aspect_ratio(self, image_index):
        image = self.coco.loadImgs(self.image_ids[image_index])[0]
        return float(image['width']) / float(image['height'])
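
A minimal usage sketch (the COCO paths below are placeholders): with use_mosaic=True, each sample returned by __getitem__ already contains a (2×resize)×(2×resize) mosaic image together with the shifted and clipped annotation boxes.

# paths are placeholders; point them to your local COCO2017 layout
train_dataset = CocoDetection(
    image_root_dir='/path/to/COCO2017/images/train2017',
    annotation_root_dir='/path/to/COCO2017/annotations',
    set='train2017',
    resize=416,
    use_mosaic=True,
    mosaic_center_range=[0.5, 1.5],
    transform=None)

sample = train_dataset[0]
# with resize=416 and use_mosaic=True the mosaic image is 832x832x3
print(sample['img'].shape, sample['annot'].shape, sample['scale'])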
