# [Modifying YOLOv5-6.x (Part 1)] Combining the lightweight networks Shufflenetv2, Mobilenetv3 and Ghostnet

## Foreword

This article uses YOLOv5 v6.1. Readers who are not familiar with the YOLOv5-6.x network structure can first read: [YOLOv5-6.x] Network Model & Source Code Analysis

The experiments in this article were run on a GTX 1080 GPU with the VOC2007 dataset, the hyp.scratch-low.yaml hyperparameters, and 200 training epochs. All other parameters keep the default values from the source code.

The general steps for modifying the network structure in YOLOv5 are:

• models/common.py: add the code of the new module to this file
• models/yolo.py: add the class name of the new module to the parse_model function
• models/new_model.yaml: create a .yaml file in the models folder that describes the new network
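
To see why the second step matters, here is a minimal pure-Python sketch of what registering a module in parse_model achieves. It is a simplified illustration, not the YOLOv5 source: `CHANNEL_MODULES` and `resolve_args` are hypothetical names, and modules are represented by plain strings. The idea is that, for registered modules, the input channel count is prepended to the args written in the yaml file.

```python
import math

# Hypothetical registry (assumption): modules whose yaml args start with the
# output channel count and whose constructor expects (c_in, c_out, ...).
CHANNEL_MODULES = {"Conv", "C3", "conv_bn_relu_maxpool", "Shuffle_Block"}


def make_divisible(x, divisor=8):
    """Round x up to the nearest multiple of divisor."""
    return math.ceil(x / divisor) * divisor


def resolve_args(name, args, c_in, width_multiple=1.0):
    """Return (constructor_args, c_out) for one row of the yaml file."""
    if name in CHANNEL_MODULES:
        c_out = make_divisible(args[0] * width_multiple)
        return [c_in, c_out, *args[1:]], c_out
    # Unregistered modules get their args unchanged and keep c_in channels
    return list(args), c_in


# The yaml row "[-1, 1, Shuffle_Block, [128, 2]]" after a 32-channel layer
print(resolve_args("Shuffle_Block", [128, 2], c_in=32))  # ([32, 128, 2], 128)
```

If a new module is not registered, its constructor would not receive the input channel count, which is why the yaml files in this article only list the output channels and the remaining arguments.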

## 1. Shufflenetv2

[Cite]Ma, Ningning, et al. “Shufflenet v2: Practical guidelines for efficient cnn architecture design.” Proceedings of the European conference on computer vision (ECCV). 2018.


### Introduction to the paper

Shufflenetv2 is a lightweight convolutional neural network from Megvii. Based on a large number of experiments, the paper proposes four design guidelines for lightweight networks. It analyzes in detail how the number of input and output channels, the number of groups in group convolution, the degree of network fragmentation, and element-wise operations affect speed on different hardware and the memory access cost (MAC):

• Guideline 1: for a given amount of computation, MAC is smallest when the number of input channels equals the number of output channels
  • Mobilenetv2 violates this: its inverted residual structure uses 1×1 convolutions with unequal input and output channel counts
• Guideline 2: group convolution with too many groups increases MAC
  • Shufflenetv1 violates this: it relies on group convolution (GConv)
• Guideline 3: fragmented operations (multiple parallel paths that make the network very wide) are unfriendly to parallel acceleration
  • Networks of the Inception family violate this
• Guideline 4: the memory and time cost of element-wise operations (such as ReLU and shortcut add) cannot be ignored
  • Shufflenetv1 violates this: it uses the add operation

In response to these four guidelines, the authors propose the Shufflenetv2 model, which introduces Channel Split in place of group convolution, follows the four design guidelines, and achieves a very good trade-off between speed and accuracy.
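
Guidelines 1 and 2 can be checked with a few lines of arithmetic. The sketch below uses the memory access cost of a 1×1 convolution as given in the ShuffleNetV2 paper, MAC = hw(c1 + c2) + c1·c2/g; the function names and the 56×56 feature map size are assumptions chosen for illustration.

```python
def mac_1x1(h, w, c1, c2, groups=1):
    """Memory access cost of a 1x1 (group) convolution:
    read the input map, write the output map, read the weights."""
    return h * w * (c1 + c2) + c1 * c2 // groups


def flops_1x1(h, w, c1, c2, groups=1):
    """Multiply-accumulate count of the same convolution."""
    return h * w * c1 * c2 // groups


# Guideline 1: same FLOPs, different input/output channel ratios
for c1, c2 in [(128, 128), (64, 256), (32, 512)]:
    print(c1, c2, flops_1x1(56, 56, c1, c2), mac_1x1(56, 56, c1, c2))

# Guideline 2: same FLOPs and input channels, more groups (wider output)
for g, c2 in [(1, 128), (2, 256), (4, 512)]:
    print(g, c2, flops_1x1(56, 56, 128, c2, g), mac_1x1(56, 56, 128, c2, g))
```

In both loops the FLOPs stay constant, while MAC is lowest for the balanced 128 to 128 case and grows as the channel ratio becomes more skewed or the number of groups increases.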

### Model overview

Shufflenetv2 has two building blocks: the basic unit and the unit for spatial down sampling (2×)

• basic unit: the number of input and output channels stays the same, and the spatial size does not change
• unit for spatial down sampling: the number of output channels doubles and the spatial size is halved (downsampling)

The overall design of Shufflenetv2 closely follows the four lightweight design guidelines proposed in the paper: apart from guideline 4, the problems they describe are basically all avoided.

To address the lack of information exchange between groups caused by GConv (Group Convolution), where features are only extracted within each group, Shufflenetv2 uses the Channel Shuffle operation to rearrange channels and exchange information across groups.

    import torch.nn as nn


    class ShuffleBlock(nn.Module):
        def __init__(self, groups=2):
            super(ShuffleBlock, self).__init__()
            self.groups = groups

        def forward(self, x):
            '''Channel shuffle: [N,C,H,W] -> [N,g,C/g,H,W] -> [N,C/g,g,H,W] -> [N,C,H,W]'''
            N, C, H, W = x.size()
            g = self.groups
            return x.view(N, g, C // g, H, W).permute(0, 2, 1, 3, 4).reshape(N, C, H, W)
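
The effect of the view / permute / reshape sequence is easier to see on channel indices alone. The following pure-Python sketch (the helper name `channel_shuffle_order` is an assumption for illustration) returns the order in which the original channels appear after the shuffle.

```python
def channel_shuffle_order(num_channels, groups):
    """Return the original channel index found at each position after a shuffle."""
    assert num_channels % groups == 0, "channels must be divisible by groups"
    per_group = num_channels // groups
    # View as [groups, per_group], transpose to [per_group, groups], then flatten
    return [g * per_group + i for i in range(per_group) for g in range(groups)]


print(channel_shuffle_order(8, 2))  # [0, 4, 1, 5, 2, 6, 3, 7]
```

With 8 channels and 2 groups, channels from the first group (0 to 3) and the second group (4 to 7) end up interleaved, so the next layer sees information from both groups.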

### Adding to YOLOv5

• common.py file modification: add the following code directly at the bottom of the file

    # ---------------------------- ShuffleBlock start -------------------------------
    # torch and torch.nn (as nn) are already imported at the top of common.py

    # Channel rearrangement, cross-group information exchange
    def channel_shuffle(x, groups):
        batchsize, num_channels, height, width = x.size()
        channels_per_group = num_channels // groups

        # reshape: [N, C, H, W] -> [N, groups, C/groups, H, W]
        x = x.view(batchsize, groups, channels_per_group, height, width)

        # swap the group and per-group channel dimensions
        x = torch.transpose(x, 1, 2).contiguous()

        # flatten back to [N, C, H, W]
        x = x.view(batchsize, -1, height, width)

        return x


    class conv_bn_relu_maxpool(nn.Module):
        def __init__(self, c1, c2):  # ch_in, ch_out
            super(conv_bn_relu_maxpool, self).__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(c1, c2, kernel_size=3, stride=2, padding=1, bias=False),
                nn.BatchNorm2d(c2),
                nn.ReLU(inplace=True),
            )
            self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)

        def forward(self, x):
            return self.maxpool(self.conv(x))


    class Shuffle_Block(nn.Module):
        def __init__(self, inp, oup, stride):
            super(Shuffle_Block, self).__init__()

            if not (1 <= stride <= 3):
                raise ValueError('illegal stride value')
            self.stride = stride

            branch_features = oup // 2
            # the basic unit (stride=1) requires inp == oup
            assert (self.stride != 1) or (inp == branch_features << 1)

            if self.stride > 1:
                # downsampling unit: the shortcut branch also needs a strided depthwise conv
                self.branch1 = nn.Sequential(
                    self.depthwise_conv(inp, inp, kernel_size=3, stride=self.stride, padding=1),
                    nn.BatchNorm2d(inp),
                    nn.Conv2d(inp, branch_features, kernel_size=1, stride=1, padding=0, bias=False),
                    nn.BatchNorm2d(branch_features),
                    nn.ReLU(inplace=True),
                )

            self.branch2 = nn.Sequential(
                nn.Conv2d(inp if (self.stride > 1) else branch_features,
                          branch_features, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(branch_features),
                nn.ReLU(inplace=True),
                self.depthwise_conv(branch_features, branch_features, kernel_size=3, stride=self.stride, padding=1),
                nn.BatchNorm2d(branch_features),
                nn.Conv2d(branch_features, branch_features, kernel_size=1, stride=1, padding=0, bias=False),
                nn.BatchNorm2d(branch_features),
                nn.ReLU(inplace=True),
            )

        @staticmethod
        def depthwise_conv(i, o, kernel_size, stride=1, padding=0, bias=False):
            return nn.Conv2d(i, o, kernel_size, stride, padding, bias=bias, groups=i)

        def forward(self, x):
            if self.stride == 1:
                x1, x2 = x.chunk(2, dim=1)  # channel split along dimension 1
                out = torch.cat((x1, self.branch2(x2)), dim=1)
            else:
                out = torch.cat((self.branch1(x), self.branch2(x)), dim=1)

            out = channel_shuffle(out, 2)

            return out

    # ---------------------------- ShuffleBlock end --------------------------------

• yolo.py file modification: in the parse_model function of yolo.py, add the two modules conv_bn_relu_maxpool and Shuffle_Block (marked by the red box in the figure below)

• Create a new yaml file: create a new file yolov5-shufflenetv2.yaml in the models folder and copy the following code into it

    # YOLOv5 🚀 by Ultralytics, GPL-3.0 license

    # Parameters
    nc: 20  # number of classes
    depth_multiple: 1.0  # model depth multiple
    width_multiple: 1.0  # layer channel multiple
    anchors:
      - [10, 13, 16, 30, 33, 23]  # P3/8
      - [30, 61, 62, 45, 59, 119]  # P4/16
      - [116, 90, 156, 198, 373, 326]  # P5/32

    # YOLOv5 v6.0 backbone
    backbone:
      # [from, number, module, args]
      # Shuffle_Block: [out, stride]
      [[-1, 1, conv_bn_relu_maxpool, [32]],  # 0-P2/4
       [-1, 1, Shuffle_Block, [128, 2]],  # 1-P3/8
       [-1, 3, Shuffle_Block, [128, 1]],  # 2
       [-1, 1, Shuffle_Block, [256, 2]],  # 3-P4/16
       [-1, 7, Shuffle_Block, [256, 1]],  # 4
       [-1, 1, Shuffle_Block, [512, 2]],  # 5-P5/32
       [-1, 3, Shuffle_Block, [512, 1]],  # 6
      ]

    # YOLOv5 v6.0 head
    head:
      [[-1, 1, Conv, [256, 1, 1]],
       [-1, 1, nn.Upsample, [None, 2, 'nearest']],
       [[-1, 4], 1, Concat, [1]],  # cat backbone P4
       [-1, 1, C3, [256, False]],  # 10

       [-1, 1, Conv, [128, 1, 1]],
       [-1, 1, nn.Upsample, [None, 2, 'nearest']],
       [[-1, 2], 1, Concat, [1]],  # cat backbone P3
       [-1, 1, C3, [128, False]],  # 14 (P3/8-small)

       [-1, 1, Conv, [128, 3, 2]],
       [[-1, 11], 1, Concat, [1]],  # cat head P4
       [-1, 1, C3, [256, False]],  # 17 (P4/16-medium)

       [-1, 1, Conv, [256, 3, 2]],
       [[-1, 7], 1, Concat, [1]],  # cat head P5
       [-1, 1, C3, [512, False]],  # 20 (P5/32-large)

       [[14, 17, 20], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
      ]
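
When the backbone is replaced, the layer indices used by the Concat rows in the head have to point at the right feature maps. As a rough sanity check, the sketch below recomputes the cumulative stride of each backbone row of the yaml above; the `BACKBONE` list and the helper name are written by hand for illustration and are not read from the yaml file.

```python
# (module, stride) for each backbone row; conv_bn_relu_maxpool downsamples
# twice (conv with stride 2 followed by max-pool with stride 2), hence 4.
BACKBONE = [
    ("conv_bn_relu_maxpool", 4),
    ("Shuffle_Block", 2), ("Shuffle_Block", 1),
    ("Shuffle_Block", 2), ("Shuffle_Block", 1),
    ("Shuffle_Block", 2), ("Shuffle_Block", 1),
]


def cumulative_strides(layers):
    """Total downsampling factor at the output of every layer."""
    strides, total = [], 1
    for _, stride in layers:
        total *= stride
        strides.append(total)
    return strides


strides = cumulative_strides(BACKBONE)
last_index = {s: i for i, s in enumerate(strides)}  # last layer at each stride
print(strides)     # [4, 8, 8, 16, 16, 32, 32]
print(last_index)  # {4: 0, 8: 2, 16: 4, 32: 6}
```

The last layers at stride 8 and stride 16 are layers 2 and 4, which matches the `[[-1, 2], ...]` and `[[-1, 4], ...]` Concat rows in the head.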

## 2. Mobilenetv3

[Cite] Howard, Andrew, et al. “Searching for mobilenetv3.” Proceedings of the IEEE/CVF International Conference on Computer Vision. 2019.


### Introduction to the paper

MobileNetV3 is a lightweight network architecture proposed by Google on March 21, 2019. Building on the previous two versions, it adds neural architecture search (NAS) and the h-swish activation function, and introduces the SE channel attention mechanism. Its strong accuracy and speed have made it popular in both academia and industry.

Main features:

1. The paper provides two versions, Large and Small, suited to different scenarios
2. The network architecture is based on MnasNet, which was found by NAS (and performs better than MobileNetV2); the parameters are obtained by NAS search
3. It adopts the depthwise separable convolution of MobileNetV1
4. It adopts the inverted residual structure with linear bottleneck of MobileNetV2
5. It introduces a lightweight attention module (SE) based on the squeeze-and-excitation structure
6. It uses a new activation function, h-swish(x)
7. The architecture search combines two techniques: resource-constrained NAS (platform-aware NAS) and NetAdapt
8. It redesigns the last stage at the end of the MobileNetV2 network

### Model overview

#### Depthwise Separable Convolution

Mobilenetv1 proposes the depthwise separable convolution, which splits an ordinary convolution into a depthwise convolution (Depthwise Convolutional Filters) and a pointwise convolution (Pointwise Convolution):

• Depthwise Convolutional Filters: the convolution kernel is split into single-channel kernels, and each input channel is convolved separately without changing the depth of the input feature map, so the output has the same number of channels as the input. This raises a question: if the number of channels is small, the feature map has too few dimensions. Can it still carry enough useful information?
• Pointwise Convolution: a pointwise convolution is a 1×1 convolution whose main role is to increase or reduce the number of channels of the feature map. Suppose the depthwise convolution outputs an 8×8×3 feature map. Applying 256 convolution kernels of size 1×1×3 to it yields an 8×8×256 output feature map, the same shape a standard convolution would produce.
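
The saving from this split can be estimated by counting weights. The sketch below compares a standard convolution with a depthwise separable one for the 3-channel to 256-channel example above, assuming a 3×3 kernel and no bias terms; the function names are chosen for illustration.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of an ordinary k x k convolution (bias ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k conv followed by a pointwise 1x1 conv."""
    depthwise = k * k * c_in  # one k x k filter per input channel
    pointwise = c_in * c_out  # 1x1 convolution that mixes the channels
    return depthwise + pointwise


std = standard_conv_params(3, 256, 3)
sep = depthwise_separable_params(3, 256, 3)
print(std, sep, round(sep / std, 3))  # 6912 795 0.115
```

The ratio is 1/c_out + 1/k², so for large output channel counts a 3×3 depthwise separable convolution needs roughly one ninth of the weights.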

#### Inverted residual structure

A depthwise convolution cannot change the number of channels: it outputs as many channels as it receives. If the input has few channels, the DW convolution can only work in a low-dimensional space, where it performs poorly, so the channels need to be “expanded” first.

Since the PW convolution (1×1 convolution) can increase and reduce dimensions, a PW convolution is applied before the DW convolution to raise the dimension (by an expansion factor t, with t=6), and features are then extracted in this higher-dimensional space. Whatever the number of input channels, after the first PW expansion the depthwise convolution works in a space with 6 times as many channels.

Inverted residuals: to reuse features as ResNet does, a shortcut structure is introduced, and the same 1×1 -> 3×3 -> 1×1 pattern is used. The difference is:

• ResNet first reduces the dimension (to 0.25 times), then convolves, then increases the dimension
• Mobilenetv2 first increases the dimension (to 6 times), then convolves, then reduces the dimension
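
The contrast between the two patterns can be written down as the channel counts seen by each of the three convolutions. This is only a small illustrative sketch; the function names and the example widths (256 and 24) are assumptions.

```python
def resnet_bottleneck_channels(c, reduction=0.25):
    """Channels along 1x1 reduce -> 3x3 conv -> 1x1 expand."""
    mid = int(c * reduction)
    return [c, mid, mid, c]


def inverted_residual_channels(c, t=6):
    """Channels along 1x1 expand -> 3x3 depthwise conv -> 1x1 project."""
    mid = c * t
    return [c, mid, mid, c]


print(resnet_bottleneck_channels(256))  # [256, 64, 64, 256]
print(inverted_residual_channels(24))   # [24, 144, 144, 24]
```

ResNet is wide at the ends and narrow in the middle, while the inverted residual is narrow at the ends and wide in the middle, which is where the name comes from.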

#### SE channel attention

SE channel attention comes from the paper “Squeeze-and-excitation networks.” The paper studies how informative features are constructed in convolutional neural networks and proposes a component called “Squeeze-Excitation (SE)”:

• The SE component strengthens channel-level feature responses by explicitly modeling the interdependence between channels. In other words, it learns a set of weights and applies one to each channel to improve the feature representation, so that important features are strengthened and unimportant ones are weakened
• Concretely, the importance of each feature channel is learned automatically, and this importance is then used to enhance useful features and suppress features that are of little use for the current task

    class SELayer(nn.Module):
        def __init__(self, channel, reduction=4):
            super(SELayer, self).__init__()
            # Squeeze: global average pooling, one value per channel
            self.avg_pool = nn.AdaptiveAvgPool2d(1)
            # Excitation: FC + ReLU + FC + h_sigmoid (h_sigmoid is defined in the next subsection)
            self.fc = nn.Sequential(
                nn.Linear(channel, channel // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channel // reduction, channel),
                h_sigmoid()
            )

        def forward(self, x):
            b, c, _, _ = x.size()
            y = self.avg_pool(x)
            y = y.view(b, c)
            y = self.fc(y).view(b, c, 1, 1)  # learn the weight of each channel
            return x * y
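
To make the squeeze, excitation and scale steps concrete without any framework, here is a small pure-Python sketch on nested lists. The weights `w1`, `b1`, `w2`, `b2` are hand-picked values for illustration only (in the real layer they are learned), and the helper names are assumptions.

```python
def hard_sigmoid(v):
    return min(max(v + 3.0, 0.0), 6.0) / 6.0


def linear(vec, weight, bias):
    """Fully connected layer; weight has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, vec)) + b for row, b in zip(weight, bias)]


def se_scale(feature_map, w1, b1, w2, b2):
    """feature_map is a list of channels, each a 2-D list [H][W]."""
    # Squeeze: global average pooling gives one scalar per channel
    squeezed = [sum(map(sum, ch)) / (len(ch) * len(ch[0])) for ch in feature_map]
    # Excitation: FC -> ReLU -> FC -> hard sigmoid gives one weight per channel
    hidden = [max(h, 0.0) for h in linear(squeezed, w1, b1)]
    weights = [hard_sigmoid(z) for z in linear(hidden, w2, b2)]
    # Scale: multiply every value of a channel by that channel's weight
    scaled = [[[v * s for v in row] for row in ch] for ch, s in zip(feature_map, weights)]
    return scaled, weights


fmap = [[[1.0, 3.0]], [[2.0, 2.0]]]  # 2 channels, each 1x2
scaled, weights = se_scale(fmap, w1=[[0.5, 0.5]], b1=[0.0], w2=[[1.5], [-1.5]], b2=[0.0, 0.0])
print(weights)  # [1.0, 0.0]
print(scaled)   # [[[1.0, 3.0]], [[0.0, 0.0]]]
```

With these toy weights the first channel is kept unchanged and the second is suppressed completely, which is the extreme case of "strengthen important channels, weaken unimportant ones".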

#### h-swish activation function

h-swish and h-sigmoid are piecewise-linear approximations of swish and sigmoid built from ReLU6, given by the following formulas:

$$\text{h\_swish}(x) = x \cdot \frac{\text{ReLU6}(x + 3)}{6}, \qquad \text{h\_sigmoid}(x) = \frac{\text{ReLU6}(x + 3)}{6}$$

    class h_sigmoid(nn.Module):
        def __init__(self, inplace=True):
            super(h_sigmoid, self).__init__()
            self.relu = nn.ReLU6(inplace=inplace)

        def forward(self, x):
            return self.relu(x + 3) / 6


    class h_swish(nn.Module):
        def __init__(self, inplace=True):
            super(h_swish, self).__init__()
            self.sigmoid = h_sigmoid(inplace=inplace)

        def forward(self, x):
            return x * self.sigmoid(x)
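
How close the approximation is can be checked numerically. The following scalar sketch uses only the standard library; the function names `hard_sigmoid`, `hard_swish` and `swish` are chosen here for illustration and mirror the formulas above.

```python
import math


def relu6(x):
    return min(max(x, 0.0), 6.0)


def hard_sigmoid(x):
    return relu6(x + 3.0) / 6.0


def hard_swish(x):
    return x * hard_sigmoid(x)


def swish(x):
    return x / (1.0 + math.exp(-x))  # x * sigmoid(x)


for x in (-4.0, -1.0, 0.0, 1.0, 4.0):
    print(f"x={x:+.1f}  swish={swish(x):+.4f}  h_swish={hard_swish(x):+.4f}")
```

For x ≤ -3 the hard version is exactly 0 and for x ≥ 3 it is exactly x, while in between it stays close to swish without needing an exponential.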

### Adding to YOLOv5

• common.py file modification: add the following code directly at the bottom of the file

    # ---------------------------- MobileBlock start -------------------------------
    class h_sigmoid(nn.Module):
        def __init__(self, inplace=True):
            super(h_sigmoid, self).__init__()
            self.relu = nn.ReLU6(inplace=inplace)

        def forward(self, x):
            return self.relu(x + 3) / 6


    class h_swish(nn.Module):
        def __init__(self, inplace=True):
            super(h_swish, self).__init__()
            self.sigmoid = h_sigmoid(inplace=inplace)

        def forward(self, x):
            return x * self.sigmoid(x)


    class SELayer(nn.Module):
        def __init__(self, channel, reduction=4):
            super(SELayer, self).__init__()
            # Squeeze operation
            self.avg_pool = nn.AdaptiveAvgPool2d(1)
            # Excitation operation (FC+ReLU+FC+Sigmoid)
            self.fc = nn.Sequential(
                nn.Linear(channel, channel // reduction),
                nn.ReLU(inplace=True),
                nn.Linear(channel // reduction, channel),
                h_sigmoid()
            )

        def forward(self, x):
            b, c, _, _ = x.size()
            y = self.avg_pool(x)
            y = y.view(b, c)
            y = self.fc(y).view(b, c, 1, 1)  # learn the weight of each channel
            return x * y


    class conv_bn_hswish(nn.Module):
        """
        This equals to
        def conv_3x3_bn(inp, oup, stride):
            return nn.Sequential(
                nn.Conv2d(inp, oup, 3, stride, 1, bias=False),
                nn.BatchNorm2d(oup),
                h_swish()
            )
        """

        def __init__(self, c1, c2, stride):
            super(conv_bn_hswish, self).__init__()
            self.conv = nn.Conv2d(c1, c2, 3, stride, 1, bias=False)
            self.bn = nn.BatchNorm2d(c2)
            self.act = h_swish()

        def forward(self, x):
            return self.act(self.bn(self.conv(x)))

        def fuseforward(self, x):
            return self.act(self.conv(x))


    class MobileNet_Block(nn.Module):
        def __init__(self, inp, oup, hidden_dim, kernel_size, stride, use_se, use_hs):
            super(MobileNet_Block, self).__init__()
            assert stride in [1, 2]

            # the shortcut is only used when the input and output shapes match
            self.identity = stride == 1 and inp == oup

            # If the number of input channels = the number of expansion channels, channel expansion will not be performed
            if inp == hidden_dim:
                self.conv = nn.Sequential(
                    # dw
                    nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2,
                              groups=hidden_dim, bias=False),
                    nn.BatchNorm2d(hidden_dim),
                    h_swish() if use_hs else nn.ReLU(inplace=True),
                    # Squeeze-and-Excite
                    SELayer(hidden_dim) if use_se else nn.Sequential(),
                    # pw-linear
                    nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                    nn.BatchNorm2d(oup),
                )
            else:
                # else do channel expansion first
                self.conv = nn.Sequential(
                    # pw
                    nn.Conv2d(inp, hidden_dim, 1, 1, 0, bias=False),
                    nn.BatchNorm2d(hidden_dim),
                    h_swish() if use_hs else nn.ReLU(inplace=True),
                    # dw
                    nn.Conv2d(hidden_dim, hidden_dim, kernel_size, stride, (kernel_size - 1) // 2,
                              groups=hidden_dim, bias=False),
                    nn.BatchNorm2d(hidden_dim),
                    # Squeeze-and-Excite
                    SELayer(hidden_dim) if use_se else nn.Sequential(),
                    h_swish() if use_hs else nn.ReLU(inplace=True),
                    # pw-linear
                    nn.Conv2d(hidden_dim, oup, 1, 1, 0, bias=False),
                    nn.BatchNorm2d(oup),
                )

        def forward(self, x):
            y = self.conv(x)
            if self.identity:
                return x + y
            else:
                return y

    # ---------------------------- MobileBlock end ---------------------------------

• yolo.py file modification: in the parse_model function of yolo.py, add the five modules h_sigmoid, h_swish, SELayer, conv_bn_hswish and MobileNet_Block

• Create a new yaml file: create a new file yolov5-mobilenetv3-small.yaml in the models folder and copy the following code into it

    # YOLOv5 🚀 by Ultralytics, GPL-3.0 license

    # Parameters
    nc: 20  # number of classes
    depth_multiple: 1.0  # model depth multiple
    width_multiple: 1.0  # layer channel multiple
    anchors:
      - [10, 13, 16, 30, 33, 23]  # P3/8
      - [30, 61, 62, 45, 59, 119]  # P4/16
      - [116, 90, 156, 198, 373, 326]  # P5/32

    # YOLOv5 v6.0 backbone
    backbone:
      # MobileNetV3-small 11 layers
      # [from, number, module, args]
      # MobileNet_Block: [out_ch, hidden_ch, kernel_size, stride, use_se, use_hs]
      # hidden_ch indicates the number of expansion channels in Inverted residuals
      # use_se indicates whether to use SELayer, use_hs indicates whether to use h_swish or ReLU
      [[-1, 1, conv_bn_hswish, [16, 2]],  # 0-p1/2
       [-1, 1, MobileNet_Block, [16, 16, 3, 2, 1, 0]],  # 1-p2/4
       [-1, 1, MobileNet_Block, [24, 72, 3, 2, 0, 0]],  # 2-p3/8
       [-1, 1, MobileNet_Block, [24, 88, 3, 1, 0, 0]],  # 3-p3/8
       [-1, 1, MobileNet_Block, [40, 96, 5, 2, 1, 1]],  # 4-p4/16
       [-1, 1, MobileNet_Block, [40, 240, 5, 1, 1, 1]],  # 5-p4/16
       [-1, 1, MobileNet_Block, [40, 240, 5, 1, 1, 1]],  # 6-p4/16
       [-1, 1, MobileNet_Block, [48, 120, 5, 1, 1, 1]],  # 7-p4/16
       [-1, 1, MobileNet_Block, [48, 144, 5, 1, 1, 1]],  # 8-p4/16
       [-1, 1, MobileNet_Block, [96, 288, 5, 2, 1, 1]],  # 9-p5/32
       [-1, 1, MobileNet_Block, [96, 576, 5, 1, 1, 1]],  # 10-p5/32
       [-1, 1, MobileNet_Block, [96, 576, 5, 1, 1, 1]],  # 11-p5/32
      ]

    # YOLOv5 v6.0 head
    head:
      [[-1, 1, Conv, [256, 1, 1]],
       [-1, 1, nn.Upsample, [None, 2, 'nearest']],
       [[-1, 8], 1, Concat, [1]],  # cat backbone P4
       [-1, 1, C3, [256, False]],  # 15

       [-1, 1, Conv, [128, 1, 1]],
       [-1, 1, nn.Upsample, [None, 2, 'nearest']],
       [[-1, 3], 1, Concat, [1]],  # cat backbone P3
       [-1, 1, C3, [128, False]],  # 19 (P3/8-small)

       [-1, 1, Conv, [128, 3, 2]],
       [[-1, 16], 1, Concat, [1]],  # cat head P4
       [-1, 1, C3, [256, False]],  # 22 (P4/16-medium)

       [-1, 1, Conv, [256, 3, 2]],
       [[-1, 12], 1, Concat, [1]],  # cat head P5
       [-1, 1, C3, [512, False]],  # 25 (P5/32-large)

       [[19, 22, 25], 1, Detect, [nc, anchors]],  # Detect(P3, P4, P5)
      ]
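
The MobileNet_Block rows are easier to read once traced. The sketch below is a hand-written illustration (the `BLOCKS` list copies the args from the yaml above; the `trace` helper is hypothetical) that reports, for each block, the cumulative stride, whether the 1×1 expansion is performed, and whether the identity shortcut is active, using the same conditions as the MobileNet_Block code.

```python
# args of each MobileNet_Block row: [out_ch, hidden_ch, kernel, stride, use_se, use_hs]
BLOCKS = [
    [16, 16, 3, 2, 1, 0], [24, 72, 3, 2, 0, 0], [24, 88, 3, 1, 0, 0],
    [40, 96, 5, 2, 1, 1], [40, 240, 5, 1, 1, 1], [40, 240, 5, 1, 1, 1],
    [48, 120, 5, 1, 1, 1], [48, 144, 5, 1, 1, 1], [96, 288, 5, 2, 1, 1],
    [96, 576, 5, 1, 1, 1], [96, 576, 5, 1, 1, 1],
]


def trace(blocks, c_in=16, stride=2):
    """c_in and stride describe the output of layer 0 (conv_bn_hswish)."""
    rows = []
    for layer, (c_out, hidden, _k, s, _se, _hs) in enumerate(blocks, start=1):
        stride *= s
        rows.append({
            "layer": layer,
            "stride": stride,
            "expands": c_in != hidden,             # expansion skipped when equal
            "residual": s == 1 and c_in == c_out,  # identity shortcut condition
        })
        c_in = c_out
    return rows


rows = trace(BLOCKS)
for row in rows:
    print(row)
```

The trace shows that layers 3, 8 and 11 are the last layers at strides 8, 16 and 32, which is why the head concatenates with layers 3 and 8 and starts from layer 11.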

## 3. Ghostnet

Han, Kai, et al. “Ghostnet: More features from cheap operations.” Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2020.


### Introduction to the paper

Ghostnet comes from Huawei’s Noah’s Ark Lab. The authors found that the feature maps in traditional deep learning networks contain a lot of redundancy, yet these feature maps are critical to the accuracy of the model. They are produced by convolution operations and then fed into the next convolutional layer, a process that involves a large number of network parameters and consumes a lot of computing resources.

The authors argue that the redundant information in these feature maps may be an important ingredient of a successful model, because it is this redundancy that ensures a comprehensive understanding of the input data. So when designing the lightweight model, they do not try to remove the redundant feature maps, but instead try to obtain them at a lower computational cost.

### Model overview

The Ghost convolution splits a traditional convolution into two steps:

• Step 1: run an ordinary convolution with fewer kernels (for example, 32 instead of the usual 64, which halves the computation)
• Step 2: run a channel-by-channel (depthwise) convolution with 3×3 or 5×5 kernels on the result of step 1 (the cheap operations)

Finally, the output of step 1 is kept as an identity mapping (Identity) and concatenated (Concat) with the output of step 2
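
A rough weight count shows how much the two-step scheme saves. The sketch below assumes half of the output channels come from the ordinary convolution and the other half from a 5×5 depthwise convolution (the kernel size used by the YOLOv5 GhostConv); bias and BatchNorm parameters are ignored, and the function names are chosen for illustration.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weights of a k x k convolution (bias ignored)."""
    return k * k * (c_in // groups) * c_out


def ghost_conv_params(c_in, c_out, k=1, cheap_k=5):
    """Weights of a Ghost convolution: primary conv + cheap depthwise conv."""
    half = c_out // 2
    primary = conv_params(c_in, half, k)                   # step 1: fewer kernels
    cheap = conv_params(half, half, cheap_k, groups=half)  # step 2: depthwise
    return primary + cheap


print(conv_params(64, 128, 3))        # 73728
print(ghost_conv_params(64, 128, 3))  # 38464
```

For this 64 to 128 channel example the Ghost convolution needs a little more than half of the weights of the ordinary 3×3 convolution, because the depthwise step adds only 1600 weights.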

The GhostBottleneck has two structures:

• stride=1 (no downsampling): two Ghost convolutions are applied directly
• stride=2 (downsampling): an additional depthwise convolution with stride 2 is inserted between them

### Adding to YOLOv5

The latest YOLOv5-6.1 source code already includes the Ghost modules, and the file yolov5s-ghost.yaml is provided under the models/hub/ folder, so they can be used directly.

    # Conv, DWConv and C3 are the existing modules defined in models/common.py

    class GhostConv(nn.Module):
        # Ghost Convolution https://github.com/huawei-noah/ghostnet
        def __init__(self, c1, c2, k=1, s=1, g=1, act=True):  # ch_in, ch_out, kernel, stride, groups
            super().__init__()
            c_ = c2 // 2  # hidden channels
            self.cv1 = Conv(c1, c_, k, s, None, g, act)  # ordinary convolution producing half of the output channels
            self.cv2 = Conv(c_, c_, 5, 1, None, c_, act)  # cheap 5x5 depthwise convolution on those feature maps

        def forward(self, x):
            y = self.cv1(x)
            return torch.cat((y, self.cv2(y)), 1)  # concat identity part and cheap part


    class GhostBottleneck(nn.Module):
        # Ghost Bottleneck https://github.com/huawei-noah/ghostnet
        def __init__(self, c1, c2, k=3, s=1):  # ch_in, ch_out, kernel, stride
            super().__init__()
            c_ = c2 // 2
            self.conv = nn.Sequential(
                GhostConv(c1, c_, 1, 1),  # pw
                # dw: DWConv is only enabled when stride=2
                DWConv(c_, c_, k, s, act=False) if s == 2 else nn.Identity(),
                GhostConv(c_, c2, 1, 1, act=False))  # pw-linear
            self.shortcut = nn.Sequential(DWConv(c1, c1, k, s, act=False),
                                          Conv(c1, c2, 1, 1, act=False)) if s == 2 else nn.Identity()

        def forward(self, x):
            return self.conv(x) + self.shortcut(x)


    class C3Ghost(C3):
        # C3 module with GhostBottleneck()
        def __init__(self, c1, c2, n=1, shortcut=True, g=1, e=0.5):
            super().__init__(c1, c2, n, shortcut, g, e)  # inherit the attributes of C3 (parent class)
            c_ = int(c2 * e)  # hidden channels
            self.m = nn.Sequential(*(GhostBottleneck(c_, c_) for _ in range(n)))
