YOLOv5 Small Target Detection, UAV Perspective Small Target Detection

1. Brief description

In recent years, with the rapid development of drones, general-purpose drones have been widely used in photography, agriculture, surveillance and other fields. For example, to monitor traffic on a city's main roads, a drone can send back images in real time, and we can analyze those images with artificial intelligence to count the flow of pedestrians and cars.

However, there are also difficulties: (1) some targets are too small: when the drone shoots from far away, a pedestrian appears very small in the distant view and is easily missed; (2) aerial footage contains a large number of objects, so dozens or even hundreds of targets may appear at the same time, and they often occlude or overlap one another, which also makes detection much harder.

Here I use the YOLOv5 algorithm and the VisDrone2021 dataset to implement my own small target detection task.

2. Dataset processing

(1) Data set download

The VisDrone2021 dataset is a dataset for UAV visual target detection. VisDrone2021 and VisDrone2019 are the same dataset, which contains many small targets. The dataset does not require registration and can be downloaded for free. The authors provide download links on Baidu Netdisk and Google Drive.

Download address: Object Detection – VisDrone

(2) Dense area filtering

When sorting out the dataset, I found a category called “ignored regions”. These are regions that contain targets too dense and too small to be labeled, so we have to ignore them. Here I simply use OpenCV to cover them up.

(before occlusion)

(After dense small target occlusion)
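The masking step can be sketched roughly as follows. This is a minimal sketch, not the exact script used here: it assumes the standard VisDrone-DET annotation layout (`bbox_left,bbox_top,bbox_width,bbox_height,score,object_category,truncation,occlusion`, with category 0 meaning an ignored region) and assumes the regions are painted black. The function names `ignored_regions` and `mask_ignored_regions` are hypothetical.

```python
def ignored_regions(annotation_lines):
    """Return (left, top, width, height) for every box whose category is 0."""
    boxes = []
    for line in annotation_lines:
        # Some annotation files end each line with a trailing comma.
        fields = [f for f in line.strip().split(",") if f != ""]
        if len(fields) < 6:
            continue
        left, top, width, height = (int(v) for v in fields[:4])
        if int(fields[5]) == 0:  # 0 = ignored region
            boxes.append((left, top, width, height))
    return boxes


def mask_ignored_regions(image_path, annotation_path, out_path):
    """Paint every ignored region black and save the result."""
    import cv2  # imported here so the parser above works without OpenCV

    image = cv2.imread(image_path)
    if image is None:
        raise FileNotFoundError(image_path)
    with open(annotation_path) as f:
        boxes = ignored_regions(f)
    for left, top, width, height in boxes:
        # thickness=-1 fills the rectangle instead of drawing its outline
        cv2.rectangle(image, (left, top), (left + width, top + height),
                      (0, 0, 0), thickness=-1)
    cv2.imwrite(out_path, image)
```

The ignored-region rows themselves are then left out when the training labels are written.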

(3) Image segmentation and label generation

Because the image we want to detect has a very high resolution (for example, a drone image may be 5630×4314) while some targets are very small, directly scaling the image to 640×640 for training works poorly, and many small targets cannot be detected.

The reason is that YOLOv5 downsamples the input five times and predicts on three feature maps with strides 32, 16 and 8; for a 640×640 input their sizes are 20×20, 40×40 and 80×80.

The 80×80 map is responsible for detecting small targets. Relative to the 640×640 input, each cell of this map covers 640/80 = 8 pixels per side, i.e. an 8×8 region. Mapped back to the original image, taking the long side as an example, 5630/640 × 8 ≈ 70.4, so a target smaller than about 71 pixels in the original image occupies less than one cell and can hardly learn effective features.
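The arithmetic can be checked with a few lines of Python. This is only a sketch of the calculation in the text; `pixels_per_cell` is a hypothetical helper name, and the defaults (640 input, 80×80 grid) are the values quoted above.

```python
import math


def pixels_per_cell(original_side, input_size=640, grid=80):
    """Original-image pixels covered by one cell of the finest feature map."""
    stride = input_size / grid          # 640 / 80 = 8 input pixels per cell
    scale = original_side / input_size  # how much the original side is shrunk
    return stride * scale


long_side = 5630
print(pixels_per_cell(long_side))             # 70.375
print(math.ceil(pixels_per_cell(long_side)))  # 71
```

Passing the long side of a smaller crop instead of 5630 shows how cutting the image into pieces lowers this threshold.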

Therefore, we need to divide the original image into multiple small images for detection. Here, I divide the image into 2 rows and 3 columns, that is, 6 small images.

It is worth noting that some objects lie right on the boundary between two small images and get cut in two, which may make them undetectable. To avoid this, we set an overlap region between adjacent small images; the overlap I set here accounts for 20% of the total area.
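One way to lay out the 2×3 grid with overlap is sketched below. It assumes the 20% is measured along each side of a small image, since the text does not spell out exactly how the overlap is defined. `axis_starts` and `chip_windows` are hypothetical helper names.

```python
import math


def axis_starts(length, n, overlap):
    """Return (chip_size, start offsets) for n overlapping chips on one axis."""
    # n chips that each share `overlap` of their size with the next one must
    # span the whole axis: chip_size * (n - (n - 1) * overlap) = length
    chip = math.ceil(length / (n - (n - 1) * overlap))
    if n == 1:
        return chip, [0]
    # Spread the starts evenly so the last chip ends exactly at the border.
    starts = [round(i * (length - chip) / (n - 1)) for i in range(n)]
    return chip, starts


def chip_windows(width, height, rows=2, cols=3, overlap=0.2):
    """Return (x0, y0, x1, y1) windows in row-major order."""
    chip_w, xs = axis_starts(width, cols, overlap)
    chip_h, ys = axis_starts(height, rows, overlap)
    return [(x, y, x + chip_w, y + chip_h) for y in ys for x in xs]


for window in chip_windows(5630, 4314):
    print(window)
```

For the 5630×4314 example this gives six windows of 2166×2397 pixels, with neighbouring windows sharing 434 pixels horizontally and 480 pixels vertically.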

Then the coordinates of the boxes in the label are also transformed accordingly, and saved to the VisDrone_chip folder. At the same time, the label format required for yolov5 training is generated.
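The coordinate transformation for a single box might look like the sketch below. It assumes boxes come in VisDrone's `(left, top, width, height)` pixel layout and that a box is kept only if enough of it falls inside the small image; the `min_visible` threshold of 0.3 is an illustrative assumption, not a value taken from this project, and `box_to_chip_label` is a hypothetical name.

```python
def box_to_chip_label(box, window, class_id, min_visible=0.3):
    """Map a full-image box into one small image as a YOLO label row.

    box:    (left, top, width, height) in full-image pixels
    window: (x0, y0, x1, y1) of the small image in full-image pixels
    Returns "cls xc yc w h" normalised to the small image, or None if the
    box is degenerate or too little of it lies inside the window.
    """
    left, top, w, h = box
    x0, y0, x1, y1 = window
    if w <= 0 or h <= 0:
        return None
    # Clip the box to the window.
    cx0, cy0 = max(left, x0), max(top, y0)
    cx1, cy1 = min(left + w, x1), min(top + h, y1)
    if cx1 <= cx0 or cy1 <= cy0:
        return None
    visible = (cx1 - cx0) * (cy1 - cy0) / float(w * h)
    if visible < min_visible:
        return None
    chip_w, chip_h = x1 - x0, y1 - y0
    # Shift to window coordinates, then normalise to the 0..1 range.
    xc = ((cx0 + cx1) / 2 - x0) / chip_w
    yc = ((cy0 + cy1) / 2 - y0) / chip_h
    bw = (cx1 - cx0) / chip_w
    bh = (cy1 - cy0) / chip_h
    return f"{class_id} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}"


print(box_to_chip_label((1100, 600, 200, 100), (1000, 500, 2000, 1500), 3))
```

Each row returned this way is one line of the `.txt` label file that YOLOv5 expects next to the corresponding image.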

The training directory generated after the original image is segmented:

3. Model training

(1) Create your own data

We first use git to pull the YOLOv5 code. GitHub address:

https://github.com/ultralytics/yolov5.git

Then create your own dataset configuration file VisDrone_data.yaml in the yolov5/data/ directory:

# VisDrone2019-DET dataset https://github.com/VisDrone/VisDrone-Dataset

# Train/val/test sets as 1) dir: path/to/imgs, 2) file: path/to/imgs.txt, or 3) list: [path/to/imgs1, path/to/imgs2, ..]
train: ./data/VisDrone_chip/images/train  # train images (relative to 'path')  6471 images
val: ./data/VisDrone_chip/images/val  # val images (relative to 'path')  548 images
test: ./data/VisDrone_chip/images/test  # test images (optional)  1610 images

# Classes
nc: 10  # number of classes
names: ['pedestrian', 'people', 'bicycle', 'car', 'van', 'truck', 'tricycle', 'awning-tricycle', 'bus', 'motor']

Here we remove the category “others”, leaving 10 categories in total.
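When converting the annotations, the category ids have to be shifted to match the `names` list. The sketch below assumes the usual VisDrone numbering (0 = ignored regions, 1 to 10 = the ten classes in the order listed above, 11 = others); `visdrone_to_yolo_class` is a hypothetical helper name.

```python
NAMES = ['pedestrian', 'people', 'bicycle', 'car', 'van', 'truck',
         'tricycle', 'awning-tricycle', 'bus', 'motor']


def visdrone_to_yolo_class(category):
    """Return the 0-based YOLO class index, or None for dropped categories."""
    if 1 <= category <= len(NAMES):
        return category - 1
    return None  # 0 (ignored regions) and 11 (others) are not trained on


print(NAMES[visdrone_to_yolo_class(4)])  # car
```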

(2) Create your own model

Go to the yolov5/models/ directory and copy a model as your own model. I tried both the small and the large version of YOLOv5; here I use the small version to explain.

cp yolov5s.yaml yolov5s_visdrone.yaml

Then modify the number of categories: there are 10 categories here, so change the nc parameter to 10:

(3) Model training

Download the pretrained YOLOv5s model and store it in the yolov5 directory.

Start training:

python train.py --img 640 --batch 16 --epochs 100 --data ./data/VisDrone_data.yaml --weights yolov5s.pt

Here I train for 100 epochs. After more than 9 hours of training, the results are as follows:

mAP@0.5 reached 0.591, which is very good.

The YOLOv5l version took a much longer 26 hours to train and reached an mAP@0.5 of 0.648, a clear improvement over the small model.

4. Inference merge

During model inference, the input is an original image taken by a drone. We again need to cut the original image into multiple small images for inference, merge the inference results of the small images back into the original image, and then apply NMS once over all of them.


(1) Run the model inference with a small image and get the inference result pred;

(2) Perform coordinate transformation on the position of boxes in the result of pred, and convert it to the position corresponding to the original image;

(3) Use torch.cat to merge the inference results of all the small images;

(4) Use non-maximum suppression (NMS) to filter out duplicate boxes.
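Steps (2) to (4) can be illustrated with the dependency-free sketch below. It is a plain-Python stand-in for the torch.cat plus NMS approach described above, not the actual inference script: detections are assumed to be `(x1, y1, x2, y2, conf, cls)` tuples already rescaled to each small image's pixel coordinates, NMS is applied per class, and the 0.45 IoU threshold is an assumed value. `iou` and `merge_chip_predictions` are hypothetical names.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2, ...) boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def merge_chip_predictions(chip_preds, iou_thres=0.45):
    """chip_preds: list of ((offset_x, offset_y), detections) per small image."""
    merged = []
    for (off_x, off_y), dets in chip_preds:
        for x1, y1, x2, y2, conf, cls in dets:
            # Step (2): shift the box back into original-image coordinates.
            merged.append((x1 + off_x, y1 + off_y,
                           x2 + off_x, y2 + off_y, conf, cls))
    # Steps (3) and (4): pool everything, then greedy per-class NMS.
    merged.sort(key=lambda d: d[4], reverse=True)
    kept = []
    for det in merged:
        if all(k[5] != det[5] or iou(k, det) <= iou_thres for k in kept):
            kept.append(det)
    return kept


preds = [
    ((0, 0), [(1800, 100, 1900, 160, 0.9, 3)]),
    ((1732, 0), [(70, 101, 168, 160, 0.8, 3), (500, 500, 520, 540, 0.7, 0)]),
]
for det in merge_chip_predictions(preds):
    print(det)
```

In the example, the same car is detected in two overlapping small images; after merging, only the higher-confidence box survives, together with the pedestrian seen in the second image.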

Model detection effect after training:

Display the effect of the label:

It can be seen that the effect is very good. Whether it is for large targets or small targets such as pedestrians, whether at night or during the day, the accuracy performance is quite good.

Please indicate the source of the reprinted article: [YOLOv5 Small Target Detection_liguiyuan’s Blog – CSDN Blog]


