YOLO (You Only Look Once) study notes, part 3: YOLOv3 (including FPN network analysis)
Paper address: YOLOv3: An Incremental Improvement. (readpaper.com)
1. Introduction
The YOLOv3 model is considerably more complex than its predecessors. It lets you trade speed against accuracy by changing the model structure, and it retains many features of v1 and v2. Because the v3 paper is written very casually, you need a solid grasp of v1 and v2 first. My earlier write-ups on those algorithms are "YOLOv1 Analysis" and "YOLOv2 Analysis".
2. YOLOv3 model analysis
2.1 Features retained from earlier versions
YOLOv3 retains the following features from v1 and v2:
 Since YOLOv1, YOLO has performed detection by dividing the image into grid cells; only the number of divisions differs between versions.
 Leaky ReLU is used as the activation function.
The Leaky ReLU function gives a negative input x a small linear component (for example 0.01x) instead of zero, which addresses ReLU's zero-gradient problem for negative values.
The leak extends the range of ReLU; the slope α is usually around 0.01 (Darknet's own leaky activation uses 0.1).
The range of Leaky ReLU is negative infinity to positive infinity.
$$\mathrm{LeakyReLU}(x)=\begin{cases} x, & x>0 \\ \alpha x, & x\le 0 \end{cases}$$
 Training is end-to-end and unified into a regression problem. A single loss function drives the training, so you only need to focus on the input and the output.
 Since YOLOv2, YOLO has used batch normalization as a regularizer to accelerate convergence and reduce overfitting. Every convolutional layer is followed by a BN layer and a Leaky ReLU layer.
 Multi-scale training: if you want more speed you can sacrifice some accuracy, and if you want higher accuracy you can sacrifice a little speed.
 The bounding-box prediction method of v2 is kept (see sections 2.2, 2.3, and 2.4 of "YOLOv2 Analysis" for details).
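The Leaky ReLU definition given in the list above can be sketched in a few lines of plain Python. This is only an illustration: the function names and the default slope of 0.01 are assumptions chosen for the example, not taken from any YOLO implementation.

```python
def leaky_relu(x, alpha=0.01):
    """Return x for positive inputs and alpha * x otherwise."""
    return x if x > 0 else alpha * x


def leaky_relu_grad(x, alpha=0.01):
    """Gradient of Leaky ReLU: 1 on the positive side, alpha elsewhere."""
    return 1.0 if x > 0 else alpha


samples = [-2.0, -0.5, 0.0, 0.5, 2.0]
outputs = [leaky_relu(v) for v in samples]
grads = [leaky_relu_grad(v) for v in samples]

for v, out, g in zip(samples, outputs, grads):
    print(f"x={v:+.1f}  leaky_relu={out:+.3f}  gradient={g}")
```

The point to notice is that the gradient for negative inputs is small but never zero, which is what keeps those units trainable.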
2.2 v3 Improvements
Each generation of YOLO improves largely through its backbone network, from Darknet-19 in v2 to Darknet-53 in v3. YOLOv3 also offers a lightweight backbone for speed, tiny Darknet.
[Figure: speed comparison]
The main improvements in this version are the following three points:
 Multi-scale prediction (introducing FPN)
 Better backbone (Darknet-53, which introduces a residual structure similar to ResNet)
 The classifier no longer uses softmax (as Darknet-19 did); the classification loss uses binary cross-entropy
3. Multi-scale prediction (introducing FPN)
3.1 Multi-scale prediction
Each grid cell is given a preset set of boxes with different sizes and aspect ratios, so that together they cover different positions and scales across the whole image. Three boxes are predicted at each scale. The anchors are still designed by clustering (see sections 2.2, 2.3, and 2.4 of "YOLOv2 Analysis" for details): 9 cluster centers are obtained and divided evenly among the 3 scales according to their size. The feature layers of these three scales are then used to predict the bounding boxes.
The figure above shows the YOLOv3-416 model: the input size is 416×416 and the three predicted feature layers have sizes 13, 26, and 52. (Convolutional refers to Conv2d + BN + LeakyReLU.)
A visualization of the figure above:
The three blue boxes in the figure above represent the three basic components of YOLOv3:
 CBL: the smallest component in the YOLOv3 network structure, consisting of Conv + BN + Leaky ReLU activation.
 Res unit: borrows the residual structure from ResNet so that the network can be built deeper.
 ResX: consists of one CBL and X residual units and is the large component in YOLOv3. The CBL in front of each Res module performs downsampling, so after 5 Res modules the feature map size goes 416 -> 208 -> 104 -> 52 -> 26 -> 13.
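Other basic operations:
 Concat: tensor concatenation, which stacks the channels of two tensors of the same spatial size.
 Add: element-wise tensor addition, which leaves the dimensions unchanged.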
The feature maps of the three scales produced by these operations are as follows:
 Scale 1: some convolutional layers are added after the base network, outputting a 13×13 feature map.
 Scale 2: the output of the second-to-last convolutional layer of scale 1 is upsampled (×2) and concatenated with the last 26×26 feature map of the backbone; after several more convolutions, a 26×26 feature map is output.
 Scale 3: the output of the second-to-last convolutional layer of scale 2 is upsampled (×2) and concatenated with the last 52×52 feature map of the backbone; after several more convolutions, a 52×52 feature map is output.
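The three-scale construction can be followed by tracking tensor shapes only. The sketch below assumes that the upsampled branch is merged with the backbone feature map by channel concatenation, as the YOLOv3 paper describes. The helper names and the channel counts are illustrative assumptions, not values stated in this article.

```python
def upsample_shape(shape, factor=2):
    """Shape (height, width, channels) after nearest-neighbor upsampling."""
    h, w, c = shape
    return (h * factor, w * factor, c)


def concat_shape(a, b):
    """Shape after concatenating two feature maps along the channel axis."""
    if a[:2] != b[:2]:
        raise ValueError("spatial sizes must match to concatenate")
    return (a[0], a[1], a[2] + b[2])


# Channel counts are assumptions chosen for illustration.
branch_13 = (13, 13, 256)     # taken from the scale-1 head before its output layer
backbone_26 = (26, 26, 512)   # last 26x26 feature map of the backbone
backbone_52 = (52, 52, 256)   # last 52x52 feature map of the backbone

merged_26 = concat_shape(upsample_shape(branch_13), backbone_26)
branch_26 = (26, 26, 128)     # after further convolutions reduce the channels
merged_52 = concat_shape(upsample_shape(branch_26), backbone_52)

print(merged_26, merged_52)
```

Concatenation keeps both sets of channels side by side, which is why the channel count grows at each merge while the spatial size matches the finer backbone map.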
3.2 Introduction to FPN
The Feature Pyramid Network (FPN), proposed in 2017, mainly addresses the multi-scale problem in object detection. It only changes how the network layers are connected, adds very little computation to the original model, and greatly improves the detection of small objects.
Low-level features carry less semantic information but locate targets accurately; high-level features carry richer semantic information but locate targets only roughly. Some other algorithms also fuse multi-scale features, but they generally make predictions from the fused features only. FPN differs in that predictions are made independently on each feature level.

The first type of pyramid: the featurized image pyramid. The original image is rescaled several times to build an image pyramid, and features are computed separately for each scale to produce a feature pyramid. Its advantage is that every level, including the high-resolution ones, contains strong semantic features, so accuracy is relatively high. The drawback is equally obvious: it needs a lot of computation and memory. Image pyramids are therefore usually used only in the testing phase, which creates an inconsistency between training and testing.

The second type of pyramid: a single feature map. Hand-engineered features were later replaced by features computed by deep convolutional networks, which not only represent higher-level semantics but are also more robust to scale changes, so features computed from a single input scale can be used for recognition. For various reasons, the default configurations of Fast R-CNN and Faster R-CNN do not use image pyramids and take only the output of the last convolutional layer. Different layers of a convolutional network produce feature maps of different spatial resolutions, but there is a large semantic gap between them: high-resolution maps hold good low-level features that are weak for recognizing objects, while low-resolution maps hold good high-level features that are weak for detecting small objects.

The third type of pyramid: the pyramidal feature hierarchy used in SSD. SSD builds a feature pyramid from the feature maps computed by different layers of the convolutional network. To avoid using low-level features, however, the pyramid starts from a fairly late layer and several new layers are added on top. The high-resolution feature maps are therefore lost, which hurts the detection of small targets.

The fourth type of pyramid: FPN, the protagonist. Its purpose is to build a feature pyramid with strong semantic features at all levels from a single input scale. It consists of a bottom-up path, a top-down path, and lateral connections. Bottom-up is the ordinary forward feature extraction of the deep convolutional network; top-down upsamples the feature map of the last convolutional layer; the lateral connections fuse the features of deep convolutional layers with those of shallow ones. This is why FPN also detects small objects well: it combines the high-level features of deep layers with the low-level features of shallow layers.
3.3 FPN calculation process

Step 1, the bottom-up path. Take the output of each stage of the deep convolutional (backbone) network as one level of the pyramid. With ResNet as the backbone, for example, the last outputs of the residual blocks of conv2, conv3, conv4, and conv5 form a feature pyramid, which is the pyramid on the left of the figure above. They are denoted {C2, C3, C4, C5}, with corresponding strides {4, 8, 16, 32}. The output of conv1 is not included because it would take up too much memory.

Step 2, the top-down path. First apply a 1×1 convolution to {C2, C3, C4, C5} to reduce the channel dimension. Then upsample by a factor of 2 the output of the deeper level, which has less spatial information but stronger semantic information. Upsampling inserts new pixels between the existing ones using an interpolation algorithm; the FPN paper uses nearest-neighbor upsampling. The pyramid levels produced by this path are denoted {P2, P3, P4, P5}.

Step 3, the lateral connections. Use lateral connections to merge the results of the first and second steps. Because the feature maps of adjacent stages differ in size by a factor of 2, the upsampled map from the level above, $P_{n+1}$, has the same size as this level's $C_n$, so the two can be added element by element.

Step 4: apply a 3×3 convolution to each merged result to reduce the aliasing effect of upsampling. (The aliasing arises because the gray levels of an interpolated image are discontinuous, so places where the gray level changes sharply can show jagged edges.)
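The top-down path and the lateral connection can be sketched on toy single-channel maps. This is a simplified illustration rather than a full FPN: the 1×1 and 3×3 convolutions are omitted, and the function names and values are assumptions made for the example.

```python
def upsample_nearest_2x(fmap):
    """Double the height and width of a 2D map by repeating each value."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def add_maps(a, b):
    """Element-wise addition of two 2D maps of identical size."""
    if len(a) != len(b) or any(len(ra) != len(rb) for ra, rb in zip(a, b)):
        raise ValueError("feature maps must have the same size")
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Toy maps standing in for the channel-reduced C5 (2x2) and C4 (4x4).
c5 = [[1.0, 2.0],
      [3.0, 4.0]]
c4 = [[0.5] * 4 for _ in range(4)]

p5 = c5                                       # top level: nothing to merge
p4 = add_maps(upsample_nearest_2x(p5), c4)    # upsample, then lateral add

for row in p4:
    print(row)
```

Each value of the coarse map is copied into a 2×2 block before the addition, so every position of the finer map receives the semantic signal of the coarse cell that covers it.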
4. Better backbone (Darknet-53)
The figure above is drawn for the 256×256 input used in pretraining; the size commonly used for detection is 416×416. Both are multiples of 32. The network consists mainly of a series of 1×1 and 3×3 convolutional layers, each followed by a BN layer and a LeakyReLU layer. The author calls it Darknet-53 because it has 53 convolutional layers: 2 + 1×2 + 1 + 2×2 + 1 + 8×2 + 1 + 8×2 + 1 + 4×2 + 1 = 53 in order, where the final Connected layer is a fully connected layer that is counted as a convolutional layer.
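The layer count can be checked with a few lines of arithmetic. The grouping below (one initial convolution, one downsampling convolution per stage, two convolutions per residual unit, one Connected layer) is one assumed way of reading the sum; the variable names are made up for the example.

```python
# Number of residual units in each of the five stages.
residual_units = [1, 2, 8, 8, 4]

first_conv = 1                                   # initial 3x3 convolution
downsample_convs = len(residual_units)           # one stride-2 conv before each stage
residual_convs = sum(2 * n for n in residual_units)  # a 1x1 and a 3x3 conv per unit
connected = 1                                    # final Connected layer

total_layers = first_conv + downsample_convs + residual_convs + connected
print(total_layers)
```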
The full v3 detection structure contains no pooling layers and no fully connected layers. During forward propagation, the tensor size is changed by the stride of the convolution kernel: stride=(2, 2), for example, halves the side length of the feature map (that is, reduces its area to 1/4 of the original). YOLOv3 downsamples 5 times, which shrinks the feature map to 1/32 of the input size: a 416×416 input gives a 13×13 output (416/32 = 13).
In YOLOv2, the tensor size is changed during the forward pass by max pooling, 5 times in total. v3 instead uses stride-2 convolutions, also 5 times.
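The effect of five stride-2 convolutions can be checked with the standard output-size formula for a convolution. The sketch assumes a 3×3 kernel with padding 1 for every downsampling layer; that choice and the function name are assumptions made for the example.

```python
def conv_out_size(size, kernel=3, stride=2, padding=1):
    """Output side length of a convolution: floor((n + 2p - k) / s) + 1."""
    return (size + 2 * padding - kernel) // stride + 1


sizes = [416]
for _ in range(5):                 # five stride-2 convolutions
    sizes.append(conv_out_size(sizes[-1]))

print(" -> ".join(str(s) for s in sizes))
```

With stride 1 and the same kernel and padding, the formula returns the input size unchanged, which is why only the stride-2 layers change the resolution.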
Note: multi-channel convolution
When a multi-channel image is convolved with several convolution kernels, the computation works as follows:
Suppose the input has 3 channels and there are 2 convolution kernels. Each kernel convolves the 3 input channels separately and then sums the 3 results to produce one output channel. So for a convolutional layer, no matter how many channels the input has, the number of output channels always equals the number of convolution kernels!
A 1×1 convolution on a multi-channel image simply multiplies each input channel by its own coefficient and sums the results, which in effect links the originally independent channels together.
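A 1×1 convolution is small enough to write out with nested lists. The sketch below uses 3 input channels and 2 kernels, as in the note above; the function name, the data layout, and the toy values are assumptions made for the example.

```python
def conv1x1(image, kernels):
    """1x1 convolution.

    image:   nested list indexed as [channel][row][column]
    kernels: nested list indexed as [output_channel][input_channel]
    """
    in_channels = len(image)
    height, width = len(image[0]), len(image[0][0])
    output = []
    for weights in kernels:
        if len(weights) != in_channels:
            raise ValueError("each kernel needs one weight per input channel")
        # Weighted sum over the input channels at every pixel.
        channel = [[sum(weights[c] * image[c][y][x] for c in range(in_channels))
                    for x in range(width)]
                   for y in range(height)]
        output.append(channel)
    return output


image = [
    [[1, 2], [3, 4]],            # channel 0
    [[10, 20], [30, 40]],        # channel 1
    [[100, 200], [300, 400]],    # channel 2
]
kernels = [[1, 1, 1], [1, 0, -1]]   # 2 kernels, so 2 output channels

result = conv1x1(image, kernels)
print(len(result), result)
```

The output has as many channels as there are kernels, regardless of the 3 input channels, and each output pixel mixes values from all input channels at the same position.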
5. The classification loss uses binary cross-entropy
5.1 Classifying each box without softmax
 Object classes are now predicted with logistic outputs. This supports multi-label objects (for example, a person can carry both the labels Woman and Person).
 Softmax assigns each box exactly one class (the one with the largest score). In datasets such as Open Images, targets may have overlapping class labels (Person and Woman), so softmax is not suitable for multi-label classification.
 Softmax can be replaced by multiple independent logistic classifiers without loss of accuracy.
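The difference between softmax and independent logistic classifiers shows up on a small example. The labels, logits, and the 0.5 decision threshold below are made-up values for illustration only.

```python
import math


def softmax(logits):
    """Softmax with the usual max-subtraction for numerical stability."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


labels = ["person", "woman", "car"]
logits = [3.0, 2.5, -4.0]            # hypothetical raw class scores for one box

softmax_scores = softmax(logits)                     # compete, sum to 1
sigmoid_scores = [sigmoid(v) for v in logits]        # scored independently
multi_label = [name for name, s in zip(labels, sigmoid_scores) if s > 0.5]

print([round(s, 3) for s in softmax_scores])
print([round(s, 3) for s in sigmoid_scores])
print(multi_label)
```

Because the softmax scores must sum to 1, at most one class can exceed 0.5, whereas the logistic outputs let "person" and "woman" both score high for the same box.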
5.2 Loss function
The v3 paper does not state the loss function explicitly. We can infer its form from the analysis of the earlier versions and from source code.
v1 uses a loss called sum-squared error, which is simply a sum of squared differences; see my "YOLOv1 Analysis" for more. In an object detection task, several key pieces of information need to be determined:
$$(x, y),\ (w, h),\ \text{class},\ \text{confidence}$$
By their characteristics, the key pieces of information fall into the four categories above, and the loss for each should be chosen accordingly. Adding the four terms together gives the final loss function, a single function that enables end-to-end training. The v3 loss can be read from the code; it covers the same four categories, with some adjustments compared with the plain sum-squared error of v1:
```python
# Excerpt from the body of a Keras YOLOv3 loss function. K is the Keras backend
# (from keras import backend as K). The masks, the raw_true_* tensors, raw_pred,
# box_loss_scale, loss and mf (a normalizing factor) are defined earlier in that function.
xy_loss = object_mask * box_loss_scale * K.binary_crossentropy(
    raw_true_xy, raw_pred[..., 0:2], from_logits=True)
wh_loss = object_mask * box_loss_scale * 0.5 * K.square(
    raw_true_wh - raw_pred[..., 2:4])
confidence_loss = (
    object_mask * K.binary_crossentropy(
        object_mask, raw_pred[..., 4:5], from_logits=True)
    + (1 - object_mask) * K.binary_crossentropy(
        object_mask, raw_pred[..., 4:5], from_logits=True) * ignore_mask)
class_loss = object_mask * K.binary_crossentropy(
    true_class_probs, raw_pred[..., 5:], from_logits=True)

# Sum each term over all boxes and normalize.
xy_loss = K.sum(xy_loss) / mf
wh_loss = K.sum(wh_loss) / mf
confidence_loss = K.sum(confidence_loss) / mf
class_loss = K.sum(class_loss) / mf
loss += xy_loss + wh_loss + confidence_loss + class_loss
```
The above is the YOLOv3 loss function written with the Keras framework. Ignoring the constant coefficients, the code shows that the w, h term still uses squared error, while every other term uses binary cross-entropy; the terms are added together at the end. (binary_crossentropy is the simplest cross-entropy, generally used for binary classification.)
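To make the four terms concrete, here is a minimal pure-Python sketch of the loss for a single positive box. It omits the masks and box_loss_scale from the Keras excerpt, and every number and helper name is a hypothetical value chosen for illustration.

```python
import math


def bce_with_logits(target, logit):
    """Binary cross-entropy on a raw logit, in a numerically stable form.

    Equivalent to -[t*log(s) + (1-t)*log(1-s)] with s = sigmoid(logit).
    """
    return max(logit, 0.0) - logit * target + math.log1p(math.exp(-abs(logit)))


def squared_error(target, pred):
    return 0.5 * (target - pred) ** 2


# Hypothetical targets and raw predictions for one box that contains an object.
true_xy, pred_xy_logits = [0.3, 0.7], [-0.8, 0.9]
true_wh, pred_wh = [0.2, -0.1], [0.25, -0.3]
true_obj, pred_obj_logit = 1.0, 2.0
true_cls, pred_cls_logits = [1.0, 0.0, 1.0], [3.0, -2.0, 0.5]

xy_loss = sum(bce_with_logits(t, z) for t, z in zip(true_xy, pred_xy_logits))
wh_loss = sum(squared_error(t, p) for t, p in zip(true_wh, pred_wh))
confidence_loss = bce_with_logits(true_obj, pred_obj_logit)
class_loss = sum(bce_with_logits(t, z) for t, z in zip(true_cls, pred_cls_logits))

total_loss = xy_loss + wh_loss + confidence_loss + class_loss
print(round(total_loss, 4))
```

Only the width and height use squared error; the centre offsets, the confidence, and every class are each treated as an independent binary prediction.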
6. Training and prediction of YOLOv3
6.1 Training
During training, with a 416×416 input, the model outputs 10647 prediction boxes: three boxes per grid cell on feature maps of three sizes, (13×13 + 26×26 + 52×52) × 3 = 10647. Each prediction box is labeled according to the ground truth in the training set (positive: the box whose IoU with a ground truth is the largest; negative: IoU below the 0.5 threshold; ignored: boxes that overlap an object with IoU above the threshold but are not the best match, which would be removed by NMS at prediction time). The loss function is then used to optimize and update the network parameters.
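The box count follows directly from the three strides. In this sketch the function name and the default strides of 32, 16, and 8 are assumptions matching the 13, 26, and 52 grids described in this article.

```python
def num_predictions(input_size, strides=(32, 16, 8), boxes_per_cell=3):
    """Total number of predicted boxes over all scales for a square input."""
    return sum((input_size // s) ** 2 * boxes_per_cell for s in strides)


boxes_416 = num_predictions(416)
boxes_608 = num_predictions(608)
print(boxes_416, boxes_608)
```

A larger input multiplies the number of grid cells at every scale, which is one reason higher-resolution inference is slower.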
6.2 Prediction
As shown in the figure above, for each input image YOLOv3 predicts three 3D tensors of different sizes, corresponding to three different scales. The three scales are designed to detect objects of different sizes.
Take the 13×13 tensor as an example. At this scale the input image is divided into 13×13 grid cells, and each grid cell corresponds to a 1×1×255 voxel in the 3D tensor, where 255 = 3×(4+1+80). In the formula $N\times N\times[3\times(4+1+80)]$ from the figure above, $N\times N$ is the scale size, such as the $13\times13$ mentioned above; 3 means each grid cell predicts 3 boxes; 4 is the number of coordinate values $(t_x, t_y, t_w, t_h)$; 1 is the confidence; and 80 is the number of COCO classes.
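The 255 values of one grid cell can be split back into three boxes of 85 values each. The layout assumed here (boxes stored one after another, each ordered as 4 coordinates, 1 confidence, 80 class scores) mirrors the slicing in the Keras loss excerpt in section 5.2; the function and key names are made up for the example.

```python
NUM_ANCHORS = 3
NUM_CLASSES = 80
BOX_LEN = 4 + 1 + NUM_CLASSES      # 85 values per predicted box


def split_cell(vector):
    """Split the 255 values of one grid cell into per-box fields."""
    if len(vector) != NUM_ANCHORS * BOX_LEN:
        raise ValueError("expected %d values" % (NUM_ANCHORS * BOX_LEN))
    boxes = []
    for a in range(NUM_ANCHORS):
        chunk = vector[a * BOX_LEN:(a + 1) * BOX_LEN]
        boxes.append({
            "coords": chunk[0:4],      # t_x, t_y, t_w, t_h
            "objectness": chunk[4],    # confidence
            "classes": chunk[5:],      # 80 class scores
        })
    return boxes


cell = list(range(255))             # dummy values that reveal the indices
boxes = split_cell(cell)
print(boxes[1]["coords"], boxes[1]["objectness"], len(boxes[1]["classes"]))
```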
 If the center of the bounding box of a ground truth in the training set falls within a grid cell of the input image, that grid cell is responsible for predicting the object's bounding box, so its confidence target is 1 and that of the other grid cells is 0. Each grid cell is assigned 3 prior boxes of different sizes, and during learning it must learn which prior box to use. The author's rule is to select the prior box with the highest IoU with the ground truth.
 How are the sizes of these three preset prior boxes computed? Before training, all bounding boxes in the COCO dataset are clustered into 9 groups with k-means; each set of 3 groups corresponds to one scale, giving three scales in total. This prior information about box size helps the network predict the offsets and coordinates of each box accurately. Intuitively, a box of suitable size lets the network learn more precisely. (See 2.4 Dimension Cluster in "YOLOv2 Analysis".)
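Choosing the prior box with the highest IoU can be sketched by comparing widths and heights only, as if the boxes shared the same center. The nine anchor sizes are the COCO clusters listed in the YOLOv3 paper; the mapping of anchors to grids and the helper names are assumptions made for this example.

```python
# COCO anchor sizes (width, height) in pixels, from smallest to largest.
ANCHORS = [(10, 13), (16, 30), (33, 23),
           (30, 61), (62, 45), (59, 119),
           (116, 90), (156, 198), (373, 326)]

# Assumed assignment: the three largest anchors go to the coarsest 13x13 grid.
SCALE_OF_ANCHOR = {i: ("52x52", "26x26", "13x13")[i // 3] for i in range(9)}


def wh_iou(box, anchor):
    """IoU of two boxes that share a center, given as (width, height)."""
    inter = min(box[0], anchor[0]) * min(box[1], anchor[1])
    union = box[0] * box[1] + anchor[0] * anchor[1] - inter
    return inter / union


def best_anchor(box):
    """Index of the anchor with the highest IoU with the ground-truth size."""
    return max(range(len(ANCHORS)), key=lambda i: wh_iou(box, ANCHORS[i]))


for gt in [(120, 95), (12, 14)]:
    idx = best_anchor(gt)
    print(gt, "->", ANCHORS[idx], "on the", SCALE_OF_ANCHOR[idx], "grid")
```

A large ground-truth box ends up on the coarse grid and a small one on the fine grid, which is how the three scales split the work by object size.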
Feed the image into the trained network. It first outputs the prediction-box information $(obj, t_x, t_y, t_w, t_h, cls)$. Each box's class-specific confidence score is then computed (conf_score = obj × cls), a threshold is set to filter out low-scoring boxes, and NMS is applied to the remaining boxes to obtain the final detections. (For details, see 3. IOU and NMS processing flow in "YOLOv1 Analysis".)

Thresholding: remove most of the background boxes that do not contain predicted objects

NMS processing: remove redundant bounding boxes to prevent repeated prediction of the same object
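The two post-processing steps can be sketched for a single class as follows. This is a plain greedy NMS; the thresholds of 0.5 and 0.45, the box format (x1, y1, x2, y2), and the sample detections are assumptions made for the example.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def filter_and_nms(detections, score_threshold=0.5, iou_threshold=0.45):
    """detections: list of (box, objectness, class_probability) for one class."""
    # Class-specific confidence score: conf_score = obj * cls.
    scored = [(box, obj * cls) for box, obj, cls in detections]
    # Thresholding: drop low-scoring (mostly background) boxes.
    candidates = sorted((d for d in scored if d[1] >= score_threshold),
                        key=lambda d: d[1], reverse=True)
    # Greedy NMS: keep a box only if it does not overlap a better kept box.
    kept = []
    for box, score in candidates:
        if all(iou(box, k[0]) <= iou_threshold for k in kept):
            kept.append((box, score))
    return kept


detections = [
    ((10, 10, 110, 110), 0.9, 0.9),     # strong detection
    ((12, 12, 112, 112), 0.8, 0.9),     # near-duplicate of the first box
    ((200, 200, 260, 260), 0.7, 0.8),   # separate object
    ((300, 50, 340, 90), 0.3, 0.5),     # low score, likely background
]
kept = filter_and_nms(detections)
for box, score in kept:
    print(box, round(score, 2))
```

In a multi-class setting the same procedure would be run once per class, so that overlapping boxes of different classes do not suppress each other.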