Foreword
The YOLOv5 version used in this article is v6.1. Readers who are not familiar with the YOLOv5 6.x network structure can refer to: YOLOv5 6.x Network Model & Source Code Analysis.
The experimental environment used in this article is a GTX 1080 GPU, the dataset is VOC2007, the hyperparameter file is hyp.scratch-low.yaml, and training runs for 200 epochs. All other parameters are the default values in the source code.
The general steps to modify the network structure in YOLOv5:
- models/common.py: add the code of the new module to the common.py file
- models/yolo.py: add the name of the new module to the parse_model function in yolo.py
- models/new_model.yaml: create the .yaml file for the new model in the models folder
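As an illustration of the second step, parse_model essentially maps each yaml row [from, number, module, args] to a class and its arguments. The sketch below is torch-free and hypothetical (build_layer and the stand-in Shuffle_Block are not YOLOv5 source code); it only shows the dispatch idea:

```python
# Minimal, illustrative sketch of how a yaml row becomes layers.
# The real parse_model resolves names with eval() and infers input channels.
class Shuffle_Block:  # stand-in for the real nn.Module
    def __init__(self, inp, oup, stride):
        self.inp, self.oup, self.stride = inp, oup, stride

REGISTRY = {'Shuffle_Block': Shuffle_Block}

def build_layer(row, ch_in):
    _from, number, module, args = row
    cls = REGISTRY[module]
    # like parse_model: prepend the inferred input channels to the yaml args
    return [cls(ch_in, *args) for _ in range(number)]

layers = build_layer([-1, 3, 'Shuffle_Block', [128, 1]], ch_in=128)
print(len(layers), layers[0].oup)  # 3 128
```

This is why registering the class name is required: an unregistered name simply cannot be resolved when the yaml is parsed.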
1. ShuffleNetV2
[Cite] Ma, Ningning, et al. "ShuffleNet V2: Practical Guidelines for Efficient CNN Architecture Design." Proceedings of the European Conference on Computer Vision (ECCV). 2018.
Introduction to the paper
The lightweight convolutional neural network ShuffleNetV2 proposes, based on extensive experiments, four practical guidelines for lightweight network design. The paper analyzes in detail how the input/output channel counts, the number of groups in grouped convolution, the degree of network fragmentation, and element-wise operations affect speed and memory access cost (MAC) on different hardware:
- Guideline 1: MAC is minimized when the numbers of input and output channels are equal
  - MobileNetV2 violates this: its inverted residual structure has unequal input and output channel counts
- Guideline 2: group convolution with too many groups increases MAC
  - ShuffleNetV1 violates this: it uses group convolution (GConv)
- Guideline 3: fragmented operations (many parallel branches, making the network very "wide") are unfriendly to parallel acceleration
  - e.g. networks of the Inception family
- Guideline 4: the memory and time cost of element-wise operations (such as ReLU, shortcut add, etc.) cannot be ignored
  - ShuffleNetV1 violates this: it uses the add operation
In response to these four guidelines, the authors propose ShuffleNetV2, which replaces group convolution with a Channel Split operation. It satisfies all four design guidelines and achieves an optimal trade-off between speed and accuracy.
Model overview
ShuffleNetV2 has two structures: the basic unit and the spatial downsampling unit (2×):
- basic unit: the numbers of input and output channels are equal, and the spatial size is unchanged
- spatial downsampling unit: doubles the number of output channels and halves the spatial size (downsampling)
The overall design of ShuffleNetV2 follows the four lightweight-design guidelines proposed in the paper closely; violations are essentially avoided, except partially for the fourth guideline.
Group convolution (GConv) allows no information exchange between groups, so features are only extracted within each group. To solve this, ShuffleNetV2 uses the Channel Shuffle operation to rearrange channels and exchange information across groups.
```python
class ShuffleBlock(nn.Module):
    def __init__(self, groups=2):
        super(ShuffleBlock, self).__init__()
        self.groups = groups

    def forward(self, x):
        '''Channel shuffle: [N,C,H,W] -> [N,g,C/g,H,W] -> [N,C/g,g,H,W] -> [N,C,H,W]'''
        N, C, H, W = x.size()
        g = self.groups
        return x.view(N, g, C // g, H, W).permute(0, 2, 1, 3, 4).reshape(N, C, H, W)
```
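To see what this permutation does to the channel order, here is a torch-free sanity check of the same reshape → transpose → flatten pattern applied to channel indices (the shuffle_indices helper is illustrative, not part of YOLOv5):

```python
def shuffle_indices(C, g):
    """Channel order produced by view(g, C//g) -> transpose -> flatten."""
    # lay the C channels out as a (g, C//g) grid, row by row
    grid = [[gi * (C // g) + ci for ci in range(C // g)] for gi in range(g)]
    # transpose the grid, then flatten: channels from different groups interleave
    return [grid[gi][ci] for ci in range(C // g) for gi in range(g)]

print(shuffle_indices(6, 2))  # [0, 3, 1, 4, 2, 5]
```

With 6 channels in 2 groups, channels 0-2 (group 1) and 3-5 (group 2) end up interleaved, so the next grouped operation sees information from both groups.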
Adding it to YOLOv5
 common.py file modification: add the following code directly at the bottom
```python
# ---------------------------- ShuffleBlock start -------------------------------
# Channel rearrangement: exchange information across groups
def channel_shuffle(x, groups):
    batchsize, num_channels, height, width = x.data.size()
    channels_per_group = num_channels // groups

    # reshape
    x = x.view(batchsize, groups, channels_per_group, height, width)
    x = torch.transpose(x, 1, 2).contiguous()

    # flatten
    x = x.view(batchsize, -1, height, width)
    return x


class conv_bn_relu_maxpool(nn.Module):
    def __init__(self, c1, c2):  # ch_in, ch_out
        super(conv_bn_relu_maxpool, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(c1, c2, kernel_size=3, stride=2, padding=1, bias=False),
            nn.BatchNorm2d(c2),
            nn.ReLU(inplace=True),
        )
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)

    def forward(self, x):
        return self.maxpool(self.conv(x))


class Shuffle_Block(nn.Module):
    def __init__(self, inp, oup, stride):
        super(Shuffle_Block, self).__init__()

        if not (1 <= stride <= 3):
            raise ValueError('illegal stride value')
        self.stride = stride

        branch_features = oup // 2
        assert (self.stride != 1) or (inp == branch_features << 1)

        if self.stride > 1:
            self.branch1 = nn.Sequential(
                self.depthwise_conv(inp, inp, kernel_size=3, stride=self.stride, padding=1),
                nn.BatchNorm2d(inp),
                nn.Conv2d(inp, branch_features, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(branch_features),
                nn.ReLU(inplace=True),
            )

        self.branch2 = nn.Sequential(
            nn.Conv2d(inp if (self.stride > 1) else branch_features,
                      branch_features, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(branch_features),
            nn.ReLU(inplace=True),
            self.depthwise_conv(branch_features, branch_features, kernel_size=3, stride=self.stride, padding=1),
            nn.BatchNorm2d(branch_features),
            nn.Conv2d(branch_features, branch_features, kernel_size=1, stride=1, padding=0, bias=False),
            nn.BatchNorm2d(branch_features),
            nn.ReLU(inplace=True),
        )

    @staticmethod
    def depthwise_conv(i, o, kernel_size, stride=1, padding=0, bias=False):
        return nn.Conv2d(i, o, kernel_size, stride, padding, bias=bias, groups=i)

    def forward(self, x):
        if self.stride == 1:
            x1, x2 = x.chunk(2, dim=1)  # split along the channel dimension
            out = torch.cat((x1, self.branch2(x2)), dim=1)
        else:
            out = torch.cat((self.branch1(x), self.branch2(x)), dim=1)

        out = channel_shuffle(out, 2)
        return out
# ---------------------------- ShuffleBlock end --------------------------------
```

- yolo.py file modification: in the parse_model function of yolo.py, register the two new modules conv_bn_relu_maxpool and Shuffle_Block (as shown in the red box in the figure below)
- Create a new yaml file: create yolov5-shufflenetv2.yaml under the models folder and copy the following code into it
```yaml
# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 20  # number of classes
depth_multiple: 1.0  # model depth multiple
width_multiple: 1.0  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]      # P3/8
  - [30,61, 62,45, 59,119]     # P4/16
  - [116,90, 156,198, 373,326] # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # [from, number, module, args]
  # Shuffle_Block: [out, stride]
  [[-1, 1, conv_bn_relu_maxpool, [32]],  # 0-P2/4
   [-1, 1, Shuffle_Block, [128, 2]],     # 1-P3/8
   [-1, 3, Shuffle_Block, [128, 1]],     # 2
   [-1, 1, Shuffle_Block, [256, 2]],     # 3-P4/16
   [-1, 7, Shuffle_Block, [256, 1]],     # 4
   [-1, 1, Shuffle_Block, [512, 2]],     # 5-P5/32
   [-1, 3, Shuffle_Block, [512, 1]],     # 6
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 4], 1, Concat, [1]],   # cat backbone P4
   [-1, 1, C3, [256, False]],   # 10

   [-1, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 2], 1, Concat, [1]],   # cat backbone P3
   [-1, 1, C3, [128, False]],   # 14 (P3/8-small)

   [-1, 1, Conv, [128, 3, 2]],
   [[-1, 11], 1, Concat, [1]],  # cat head P4
   [-1, 1, C3, [256, False]],   # 17 (P4/16-medium)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 7], 1, Concat, [1]],   # cat head P5
   [-1, 1, C3, [512, False]],   # 20 (P5/32-large)

   [[14, 17, 20], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
```
2. MobileNetV3
[Cite] Howard, Andrew, et al. "Searching for MobileNetV3." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.
Introduction to the paper
MobileNetV3 is a lightweight network architecture proposed by Google on March 21, 2019. Building on the previous two versions, it adds neural architecture search (NAS) and the h-swish activation function, and introduces the SE channel attention mechanism. Its excellent accuracy and speed made it popular in both academia and industry.
Main features:
- The paper presents two versions, Large and Small, suitable for different scenarios
- The network architecture is based on MnasNet, which was found by NAS and outperforms MobileNetV2; the parameters are obtained by NAS search
- Inherits the depthwise separable convolution of MobileNetV1
- Inherits the inverted residual structure with linear bottleneck of MobileNetV2
- Introduces a lightweight attention module (SE) based on the squeeze-and-excitation structure
- Uses a new activation function, h-swish(x)
- Combines two techniques in the architecture search: resource-constrained NAS (platform-aware NAS) and NetAdapt
- Redesigns the expensive final stage of the MobileNetV2 network
Model overview
Depthwise Separable Convolution
MobileNetV1 proposed depthwise separable convolution, which splits an ordinary convolution into a depthwise convolution and a pointwise convolution:
- Depthwise convolution: the kernel is split into single-channel form and a separate convolution is performed on each channel, without changing the depth of the input feature map, so the output has the same number of channels as the input. This raises a question: with so few channels, i.e. such a low-dimensional feature map, can enough effective information be obtained?
- Pointwise convolution: a 1×1 convolution whose main function is to increase or reduce the dimension of the feature map. Suppose the depthwise convolution produces an 8×8×3 feature map; applying 256 1×1×3 convolution kernels to it yields an 8×8×256 output, just like a standard convolution.
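A quick parameter count makes the saving concrete. The two helper functions below are illustrative (bias terms ignored), using the 3-channel-in, 256-channel-out, 3×3-kernel sizes from the example above:

```python
def conv_params(k, c_in, c_out):
    # standard convolution: every output channel mixes all input channels
    return k * k * c_in * c_out

def dw_separable_params(k, c_in, c_out):
    depthwise = k * k * c_in          # one k*k filter per input channel
    pointwise = 1 * 1 * c_in * c_out  # 1x1 convolution mixes the channels
    return depthwise + pointwise

std = conv_params(3, 3, 256)          # 6912
sep = dw_separable_params(3, 3, 256)  # 27 + 768 = 795
print(std, sep, round(std / sep, 1))  # 6912 795 8.7
```

So for these sizes the depthwise separable version uses roughly 8.7× fewer weights; the ratio grows with larger kernel sizes and channel counts.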
Inverted residual structure
Depthwise convolution itself cannot change the number of channels: the output has exactly as many channels as the input. If there are few channels, the DW depthwise convolution can only work in a low-dimensional space and the results are poor, so the channels need to be "expanded" first.
Since PW pointwise convolution, i.e. 1×1 convolution, can increase and reduce dimensions, a PW convolution can be used to expand the dimension before the DW depthwise convolution (with an expansion factor t, t=6 in the paper), and the convolution then extracts features in the higher-dimensional space. This way, regardless of the number of input channels, after the first PW expansion the depthwise convolution works in a space roughly 6× higher-dimensional.
Inverted residuals: to reuse features as in ResNet, a shortcut structure is introduced, also following the 1×1 → 3×3 → 1×1 pattern, but with a difference:
- ResNet first reduces the dimension (0.25×), convolves, then increases the dimension
- MobileNetV2 first increases the dimension (6×), convolves, then reduces the dimension
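The contrast in channel dimensions can be sketched numerically; the 64 input/output channels below are chosen purely for illustration:

```python
def bottleneck_channels(c, factor):
    # channel counts through a 1x1 -> 3x3 -> 1x1 bottleneck
    mid = int(c * factor)
    return [c, mid, mid, c]

print(bottleneck_channels(64, 0.25))  # ResNet bottleneck:      [64, 16, 16, 64]
print(bottleneck_channels(64, 6))     # MobileNetV2 (inverted): [64, 384, 384, 64]
```

The channel profile is narrow-wide-narrow for MobileNetV2 instead of wide-narrow-wide, which is exactly why the structure is called "inverted".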
SE channel attention
SE channel attention comes from the paper "Squeeze-and-Excitation Networks", which studies the construction of informative features in convolutional neural networks. The authors propose a component called "Squeeze-and-Excitation (SE)":
- The SE component enhances channel-level feature responses by explicitly modeling the interdependence between channels; in other words, it learns a set of weights and assigns one to each channel to improve the feature representation, so that important features are strengthened and unimportant ones are weakened
- Concretely, it automatically learns the importance of each feature channel, then uses this importance to enhance useful features and suppress features that are not useful for the current task
```python
class SELayer(nn.Module):
    def __init__(self, channel, reduction=4):
        super(SELayer, self).__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            h_sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x)
        y = y.view(b, c)
        y = self.fc(y).view(b, c, 1, 1)  # learned per-channel weights
        return x * y
```
h-swish activation function
The h-swish and h-sigmoid functions approximate swish and sigmoid using ReLU6:

$$h\_swish(x) = x \cdot \frac{ReLU6(x+3)}{6}, \qquad h\_sigmoid(x) = \frac{ReLU6(x+3)}{6}$$
```python
class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)

    def forward(self, x):
        return self.relu(x + 3) / 6


class h_swish(nn.Module):
    def __init__(self, inplace=True):
        super(h_swish, self).__init__()
        self.sigmoid = h_sigmoid(inplace=inplace)

    def forward(self, x):
        return x * self.sigmoid(x)
```
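Since ReLU6(x) = min(max(x, 0), 6), the piecewise behavior is easy to verify without torch: h-swish is 0 for x ≤ −3 and equals x for x ≥ 3. The plain-Python re-implementation below is just for illustration:

```python
def relu6(x):
    return min(max(x, 0.0), 6.0)

def h_sigmoid(x):
    return relu6(x + 3) / 6

def h_swish(x):
    return x * h_sigmoid(x)

# saturated on the left, identity on the right, smooth in between
print(h_swish(-4.0), h_swish(3.0), h_swish(4.0))
```

Unlike the true swish x·sigmoid(x), every piece here is cheap to compute and quantization-friendly, which is why the paper uses it on mobile hardware.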
Adding it to YOLOv5
 common.py file modification: add the following code directly at the bottom
```python
# ---------------------------- MobileBlock start -------------------------------
class h_sigmoid(nn.Module):
    def __init__(self, inplace=True):
        super(h_sigmoid, self).__init__()
        self.relu = nn.ReLU6(inplace=inplace)

    def forward(self, x):
        return self.relu(x + 3) / 6


class h_swish(nn.Module):
    def __init__(self, inplace=True):
        super(h_swish, self).__init__()
        self.sigmoid = h_sigmoid(inplace=inplace)

    def forward(self, x):
        return x * self.sigmoid(x)


class SELayer(nn.Module):
    def __init__(self, channel, reduction=4):
        super(SELayer, self).__init__()
        # Squeeze operation
        self.avg_pool = nn.AdaptiveAvgPool2d(1)
        # Excitation operation (FC + ReLU + FC + h_sigmoid)
        self.fc = nn.Sequential(
            nn.Linear(channel, channel // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channel // reduction, channel),
            h_sigmoid()
        )

    def forward(self, x):
        b, c, _, _ = x.size()
        y = self.avg_pool(x)
        y = y.view(b, c)
        y = self.fc(y).view(b, c, 1, 1)  # learned per-channel weights
        return x * y


class conv_bn_hswish(nn.Module):
    """
    Equivalent to:
    def conv_3x3_bn(inp, oup, stride):
        return nn.Sequential(
            nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
            nn.BatchNorm2d(oup),
            h_swish()
        )
    """
    def __init__(self, c1, c2, stride):
        super(conv_bn_hswish, self).__init__()
        self.conv = nn.Conv2d(c1, c2, 3, stride, 1, bias=False)
        self.bn = nn.BatchNorm2d(c2)
        self.act = h_swish()

    def forward(self, x):
        return self.act(self.bn(self.conv(x)))

    def fuseforward(self, x):
        return self.act(self.conv(x))


class MobileNet_Block(nn.Module):
    def __init__(self, inp, oup, hidden_dim, kernel_size, stride, use_se, use_hs):
        super(MobileNet_Block, self).__init__()
        assert stride in [1, 2]

        self.identity = stride == 1 and inp == oup

        # If input channels == expansion channels, skip the channel expansion
        if inp == hidden_dim:
            self.conv = nn.Sequential(
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2,
                          groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # Squeeze-and-Excite
                SELayer(hidden_dim) if use_se else nn.Sequential(),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )
        else:
            # otherwise do channel expansion first
            self.conv = nn.Sequential(
                # pw
                nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False),
                nn.BatchNorm2d(hidden_dim),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # dw
                nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2,
                          groups=hidden_dim, bias=False),
                nn.BatchNorm2d(hidden_dim),
                # Squeeze-and-Excite
                SELayer(hidden_dim) if use_se else nn.Sequential(),
                h_swish() if use_hs else nn.ReLU(inplace=True),
                # pw-linear
                nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                nn.BatchNorm2d(oup),
            )

    def forward(self, x):
        y = self.conv(x)
        return x + y if self.identity else y
# ---------------------------- MobileBlock end ---------------------------------
```

- yolo.py file modification: in the parse_model function of yolo.py, register the five new modules h_sigmoid, h_swish, SELayer, conv_bn_hswish, MobileNet_Block
- Create a new yaml file: create yolov5-mobilenetv3-small.yaml under the models folder and copy the following code into it
```yaml
# YOLOv5 🚀 by Ultralytics, GPL-3.0 license

# Parameters
nc: 20  # number of classes
depth_multiple: 1.0  # model depth multiple
width_multiple: 1.0  # layer channel multiple
anchors:
  - [10,13, 16,30, 33,23]      # P3/8
  - [30,61, 62,45, 59,119]     # P4/16
  - [116,90, 156,198, 373,326] # P5/32

# YOLOv5 v6.0 backbone
backbone:
  # MobileNetV3-small 11 layers
  # [from, number, module, args]
  # MobileNet_Block: [out_ch, hidden_ch, kernel_size, stride, use_se, use_hs]
  # hidden_ch: number of expansion channels in the inverted residual
  # use_se: whether to use SELayer; use_hs: whether to use h_swish (1) or ReLU (0)
  [[-1, 1, conv_bn_hswish, [16, 2]],                # 0-p1/2
   [-1, 1, MobileNet_Block, [16, 16, 3, 2, 1, 0]],  # 1-p2/4
   [-1, 1, MobileNet_Block, [24, 72, 3, 2, 0, 0]],  # 2-p3/8
   [-1, 1, MobileNet_Block, [24, 88, 3, 1, 0, 0]],  # 3-p3/8
   [-1, 1, MobileNet_Block, [40, 96, 5, 2, 1, 1]],  # 4-p4/16
   [-1, 1, MobileNet_Block, [40, 240, 5, 1, 1, 1]], # 5-p4/16
   [-1, 1, MobileNet_Block, [40, 240, 5, 1, 1, 1]], # 6-p4/16
   [-1, 1, MobileNet_Block, [48, 120, 5, 1, 1, 1]], # 7-p4/16
   [-1, 1, MobileNet_Block, [48, 144, 5, 1, 1, 1]], # 8-p4/16
   [-1, 1, MobileNet_Block, [96, 288, 5, 2, 1, 1]], # 9-p5/32
   [-1, 1, MobileNet_Block, [96, 576, 5, 1, 1, 1]], # 10-p5/32
   [-1, 1, MobileNet_Block, [96, 576, 5, 1, 1, 1]], # 11-p5/32
  ]

# YOLOv5 v6.0 head
head:
  [[-1, 1, Conv, [256, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 8], 1, Concat, [1]],   # cat backbone P4
   [-1, 1, C3, [256, False]],   # 15

   [-1, 1, Conv, [128, 1, 1]],
   [-1, 1, nn.Upsample, [None, 2, 'nearest']],
   [[-1, 3], 1, Concat, [1]],   # cat backbone P3
   [-1, 1, C3, [128, False]],   # 19 (P3/8-small)

   [-1, 1, Conv, [128, 3, 2]],
   [[-1, 16], 1, Concat, [1]],  # cat head P4
   [-1, 1, C3, [256, False]],   # 22 (P4/16-medium)

   [-1, 1, Conv, [256, 3, 2]],
   [[-1, 12], 1, Concat, [1]],  # cat head P5
   [-1, 1, C3, [512, False]],   # 25 (P5/32-large)

   [[19, 22, 25], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
  ]
```
3. GhostNet
[Cite] Han, Kai, et al. "GhostNet: More Features from Cheap Operations." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.
Introduction to the paper
GhostNet comes from Huawei's Noah's Ark Lab. The authors observed that traditional deep networks contain many redundant feature maps, which are nevertheless critical to model accuracy. These feature maps are produced by convolution operations and then fed into the next convolutional layer, a process that involves a large number of parameters and consumes considerable computing resources.
The authors argue that this redundancy may be an important ingredient of a successful model: it is precisely the redundant information that guarantees a comprehensive understanding of the input data. So when designing a lightweight model, instead of trying to remove these redundant feature maps, they try to obtain them at a lower computational cost.
Model overview
The Ghost convolution splits the traditional convolution operation into two steps:
- First, a convolution with a reduced number of kernels (for example 32 instead of the usual 64, halving the computation)
- Second, a channel-by-channel (depthwise) convolution with 3×3 or 5×5 kernels, the "cheap operations"
Finally, the output of the first step is kept as an identity mapping and concatenated (Concat) with the result of the second step.
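The saving can be estimated with simple multiply-accumulate (MAC) arithmetic. The sizes below (a 56×56 feature map, 3×3 primary kernels, a 5×5 depthwise cheap operation, 64 channels in and out) are illustrative assumptions, not values from the paper:

```python
def standard_conv_macs(h, w, k, c_in, c_out):
    # every output pixel of every output channel mixes all input channels
    return h * w * k * k * c_in * c_out

def ghost_conv_macs(h, w, k, d, c_in, c_out):
    primary = h * w * k * k * c_in * (c_out // 2)  # half the kernels
    cheap = h * w * d * d * (c_out // 2)           # depthwise d*d cheap op
    return primary + cheap

std = standard_conv_macs(56, 56, 3, 64, 64)
ghost = ghost_conv_macs(56, 56, 3, 5, 64, 64)
print(round(std / ghost, 2))  # 1.92
```

The ratio approaches the ideal 2× because the cheap depthwise step costs far less than the convolution kernels it replaces.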
The GhostBottleneck has two structures:
- stride=1 (no downsampling): two Ghost convolutions are applied directly
- stride=2 (downsampling): an additional depthwise convolution with stride 2 is inserted between them
Adding it to YOLOv5
In the latest YOLOv5 v6.1 source code, the authors have already added the Ghost modules, and the yolov5s-ghost.yaml file is provided under the models/hub/ folder, so it can be used directly.
```python
class GhostConv(nn.Module):
    # Ghost Convolution https://github.com/huawei-noah/ghostnet
    def __init__(self, c1, c2, k=1, s=1, g=1, act=True):  # ch_in, ch_out, kernel, stride, groups
        super().__init__()
        c_ = c2 // 2  # hidden channels
        self.cv1 = Conv(c1, c_, k, s, None, g, act)   # primary conv with half the kernels
        self.cv2 = Conv(c_, c_, 5, 1, None, c_, act)  # cheap depthwise op on the primary features

    def forward(self, x):
        y = self.cv1(x)
        return torch.cat([y, self.cv2(y)], 1)  # finally concat the two parts


class GhostBottleneck(nn.Module):
    # Ghost Bottleneck https://github.com/huawei-noah/ghostnet
    def __init__(self, c1, c2, k=3, s=1):  # ch_in, ch_out, kernel, stride
        super().__init__()
        c_ = c2 // 2
        self.conv = nn.Sequential(GhostConv(c1, c_, 1, 1),  # pw
                                  # dw (only enabled when stride=2)
                                  DWConv(c_, c_, k, s, act=False) if s == 2 else nn.Identity(),
                                  GhostConv(c_, c2, 1, 1, act=False))  # pw-linear
        self.shortcut = nn.Sequential(DWConv(c1, c1, k, s, act=False),
                                      Conv(c1, c2, 1, 1, act=False)) if s == 2 else nn.Identity()

    def forward(self, x):
        return self.conv(x) + self.shortcut(x)


class C3Ghost(C3):
    # C3 module with GhostBottleneck()
    def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
        super().__init__(c1, c2, n, shortcut, g, e)  # inherit attributes of C3 (parent class)
        c_ = int(c2 * e)  # hidden channels
        self.m = nn.Sequential(*(GhostBottleneck(c_, c_) for _ in range(n)))
```