YOLOv1: Principle and Implementation


The following section is translated from the original paper.

1. Introduction

Current detection systems repurpose classifiers to perform detection. To detect an object, these systems take a classifier for that object and evaluate it at various locations and scales in the test image. Systems like deformable parts models (DPM) use a sliding-window approach, with the classifier run at evenly spaced locations over the entire image. More recent approaches such as R-CNN use region proposal methods to first generate potential bounding boxes in an image and then run a classifier on these proposed boxes. After classification, post-processing is used to refine the bounding boxes, eliminate duplicate detections, and rescore the boxes based on other objects in the scene. These complex pipelines are slow and hard to optimize because each component must be trained separately.

We reframe object detection as a single regression problem, straight from image pixels to bounding box coordinates and class probabilities. Using our system, You Only Look Once (YOLO), a single look at an image is enough to tell what objects are present and where they are. YOLO is refreshingly simple: a single convolutional network simultaneously predicts multiple bounding boxes and their class probabilities. YOLO trains on full images and directly optimizes detection performance. This unified model has several advantages over traditional object detection methods, listed below.

First, YOLO is extremely fast. Since we frame detection as a regression problem, we do not need a complex pipeline; at test time we simply run our neural network on a new image to predict detections. The base version of YOLO runs at 45 frames per second with no batch processing on a Titan X GPU, and the fast version runs at more than 150 fps. This means we can process streaming video in real time with less than 25 ms of latency. Furthermore, YOLO achieves more than twice the mean average precision of other real-time systems.

Second, YOLO reasons globally about the image. Unlike techniques based on sliding windows and region proposals, YOLO sees the entire image during training and testing, so it implicitly encodes contextual information about classes and their appearance. Fast R-CNN, a top detection method, mistakes background patches in an image for objects because it cannot see the larger context. Compared with Fast R-CNN, YOLO makes less than half the number of background errors.

Third, YOLO learns generalizable representations of objects. When trained on natural images and tested on artwork, YOLO outperforms top detection methods such as DPM and R-CNN by a wide margin. Because YOLO is highly generalizable, it is less likely to break down when applied to new domains or unexpected inputs.

YOLO still lags behind state-of-the-art detection systems in accuracy. While it can quickly identify objects in images, it struggles to precisely localize some objects, especially small ones. We examine this accuracy/speed tradeoff further in our experiments. All of our training and testing code is open source, and a variety of pretrained models are available for download.

Paper: You Only Look Once: Unified, Real-Time Object Detection (arXiv:1506.02640)

2. Detection

We unify the separate components of object detection into a single neural network. Our network uses features from the entire image to predict each bounding box, and it predicts all bounding boxes across all classes for an image simultaneously. This means the network reasons globally about the full image and all the objects in it. The YOLO design enables end-to-end training and real-time speeds while maintaining high average precision. Our system divides the input image into an S×S grid. If the center of an object falls within a grid cell, that grid cell is responsible for detecting the object.
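For example, a minimal sketch (assuming the ground-truth box center is given in coordinates normalized to [0, 1]) of how an object's center is mapped to its responsible grid cell:

# Minimal sketch: map a ground-truth box center to the grid cell responsible for it.
# Assumes x_center and y_center are normalized to [0, 1] relative to the image.
def responsible_cell(x_center, y_center, S=7):
    col = min(int(x_center * S), S - 1)   # grid column (x direction)
    row = min(int(y_center * S), S - 1)   # grid row (y direction)
    return row, col

# An object centered at (0.52, 0.31) falls into row 2, column 3 of the 7x7 grid.
print(responsible_cell(0.52, 0.31))   # (2, 3)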

Each grid cell predicts B bounding boxes and confidence scores for those boxes. These confidence scores reflect how confident the model is that the box contains an object and also how accurate it thinks the predicted box is. Formally, we define confidence as Pr(Object) × IOU_pred^truth. If no object exists in that cell, the confidence score should be zero; otherwise, we want the confidence score to equal the intersection over union (IOU) between the predicted box and the ground truth. Each bounding box consists of 5 predictions: x, y, w, h, and confidence. The (x, y) coordinates represent the center of the box relative to the bounds of the grid cell, while the width and height are predicted relative to the whole image. The confidence prediction represents the IOU between the predicted box and any ground-truth box.

Each grid cell also predicts C conditional class probabilities, Pr(Class_i | Object). These probabilities are conditioned on the grid cell containing an object. Regardless of the number of boxes B, we only predict one set of class probabilities per grid cell. At test time we multiply the conditional class probabilities by the individual box confidence predictions,

Pr(Class_i | Object) × Pr(Object) × IOU_pred^truth = Pr(Class_i) × IOU_pred^truth,

which gives a class-specific confidence score for each box. These scores encode both the probability of that class appearing in the box and how well the predicted box fits the object. To evaluate YOLO on Pascal VOC, we use S = 7 and B = 2. Pascal VOC has 20 labeled classes, so C = 20. Our final prediction is therefore an S × S × (B × 5 + C) = 7×7×30 tensor.
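To make the output layout concrete, here is a small NumPy sketch of how one 7×7×30 prediction splits into class probabilities, box confidences, and box coordinates, and how the class-specific scores are formed (the array names are illustrative; the actual parsing is done in _build_detector in the code further below):

import numpy as np

S, B, C = 7, 2, 20
pred = np.random.rand(S * S * (B * 5 + C))            # one image's raw 7*7*30 prediction vector

class_probs = pred[:S * S * C].reshape(S, S, C)                 # Pr(Class_i | Object) per cell
confs = pred[S * S * C:S * S * (C + B)].reshape(S, S, B)        # Pr(Object) * IOU per box
boxes = pred[S * S * (C + B):].reshape(S, S, B, 4)              # (x, y, w, h) per box

# Class-specific confidence: Pr(Class_i | Object) * Pr(Object) * IOU = Pr(Class_i) * IOU
scores = confs[:, :, :, None] * class_probs[:, :, None, :]      # shape [S, S, B, C]
print(scores.shape)   # (7, 7, 2, 20)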

2.1 Network Design

We implement this model as a convolutional neural network and evaluate it on the Pascal VOC detection dataset. The initial convolutional layers of the network extract features from the image, while the fully connected layers are responsible for predicting output probabilities and coordinates. Our network architecture is inspired by the image classification model GoogLeNet . Our network has 24 convolutional layers followed by 2 fully connected layers. Instead of the Inception module used by GoogLeNet, we only use a 1×1 dimensionality reduction layer followed by a 3×3 convolutional layer, which is similar to Lin et al. The complete network is shown in the figure. We also train a fast version of YOLO that aims to push the boundaries of fast object detection. Fast YOLO uses a neural network with fewer convolutional layers (9 instead of 24), using fewer filters in those layers. Except for the network size, all training and testing parameters are the same for the base and fast YOLO versions. The final output of our network is a 7×7×30 prediction tensor.

2.2 Training

We pretrain our convolutional layers on the ImageNet 1000-class competition dataset. For pretraining, we use the first 20 convolutional layers from Figure 3, followed by an average pooling layer and a fully connected layer. We trained this network for about a week and achieved a single-crop top-5 accuracy of 88% on the ImageNet 2012 validation set, comparable to the GoogLeNet models in Caffe's Model Zoo. We use the Darknet framework for all training and inference. We then convert the model to perform detection. Ren et al. showed that adding both convolutional and connected layers to a pretrained network can improve performance [29]. Following their approach, we add four convolutional layers and two fully connected layers with randomly initialized weights. Detection often requires fine-grained visual information, so we increase the input resolution of the network from 224×224 to 448×448.

The final layer of the model predicts both class probabilities and bounding box coordinates. We normalize the bounding box width and height by the image width and height so that they fall between 0 and 1. We parameterize the bounding box x and y coordinates as offsets of a particular grid cell location, so they are also bounded between 0 and 1. The final layer uses a linear activation function, while all other layers use the following leaky rectified linear activation:

φ(x) = x, if x > 0; φ(x) = 0.1x, otherwise

We optimize the sum-squared error of the model's output. We use sum-squared error because it is easy to optimize, but it does not perfectly align with our goal of maximizing average precision. It weights localization error equally with classification error, which may not be ideal. Also, in every image many grid cells do not contain any object; this pushes the "confidence" scores of those cells towards zero, often overpowering the gradient from cells that do contain objects. This can lead to model instability, causing training to diverge early on. To remedy this shortcoming of sum-squared error, we increase the loss from bounding box coordinate predictions and decrease the loss from confidence predictions for boxes that do not contain objects. We use two parameters, λcoord and λnoobj, to accomplish this, and set λcoord = 5 and λnoobj = 0.5.

Sum-squared error also weights errors in large boxes and small boxes equally, whereas our error metric should reflect that small deviations matter less in large boxes than in small boxes. To partially address this, we predict the square root of the bounding box width and height instead of the width and height directly. YOLO predicts multiple bounding boxes per grid cell. At training time we only want one bounding box predictor to be responsible for each object. We assign a predictor to be "responsible" for a ground-truth object if its prediction has the highest IOU with that object among the predictors in the cell. This leads to specialization between the bounding box predictors: each predictor gets better at predicting certain sizes, aspect ratios, or classes of objects, improving overall recall. During training, we optimize the following multi-part loss function:

λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj [(x_i − x̂_i)² + (y_i − ŷ_i)²]
+ λcoord Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj [(√w_i − √ŵ_i)² + (√h_i − √ĥ_i)²]
+ Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^obj (C_i − Ĉ_i)²
+ λnoobj Σ_{i=0}^{S²} Σ_{j=0}^{B} 1_ij^noobj (C_i − Ĉ_i)²
+ Σ_{i=0}^{S²} 1_i^obj Σ_{c∈classes} (p_i(c) − p̂_i(c))²

where 1_i^obj indicates whether an object appears in grid cell i, and 1_ij^obj indicates that the j-th bounding box predictor in cell i is "responsible" for that prediction. Note that the loss function only penalizes classification error if an object is present in that grid cell (hence the conditional class probabilities discussed earlier). It also only penalizes bounding box coordinate error if that predictor is "responsible" for the ground-truth box (i.e., it has the highest IOU of any predictor in that grid cell).

We train the network for about 135 epochs on the training and validation data from Pascal VOC 2007 and 2012. When testing on Pascal VOC 2012, we also include the Pascal VOC 2007 test data in the training set. Throughout training we use a batch size of 64, a momentum of 0.9, and a weight decay of 0.0005. Our learning rate schedule is as follows: for the first epochs we slowly raise the learning rate from 10^-3 to 10^-2, because starting with a large learning rate often makes the model diverge due to unstable gradients. We continue training at 10^-2 for 75 epochs, then at 10^-3 for 30 epochs, and finally at 10^-4 for 30 epochs. To avoid overfitting, we use dropout and extensive data augmentation. A dropout layer with rate 0.5 after the first fully connected layer prevents co-adaptation between layers. For data augmentation, we introduce random scaling and translations of up to 20% of the original image size. We also randomly adjust the exposure and saturation of the image by up to a factor of 1.5 in the HSV color space.
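A rough OpenCV/NumPy sketch of this kind of augmentation (illustrative only, not the authors' exact implementation; in real training the ground-truth boxes must be transformed in the same way):

import cv2
import numpy as np

def augment(image):
    h, w, _ = image.shape
    # Random scaling and translation of up to 20% of the original image size
    scale = 1.0 + np.random.uniform(-0.2, 0.2)
    tx = np.random.uniform(-0.2, 0.2) * w
    ty = np.random.uniform(-0.2, 0.2) * h
    M = np.float32([[scale, 0, tx], [0, scale, ty]])
    image = cv2.warpAffine(image, M, (w, h))
    # Random saturation and exposure adjustment by a factor of up to 1.5 in HSV space
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)
    hsv[..., 1] *= np.random.uniform(1.0 / 1.5, 1.5)   # saturation
    hsv[..., 2] *= np.random.uniform(1.0 / 1.5, 1.5)   # exposure (value)
    hsv = np.clip(hsv, 0, 255).astype(np.uint8)
    return cv2.cvtColor(hsv, cv2.COLOR_HSV2BGR)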

2.3 Inference

Just as in training, predicting detections for a test image requires only a single network evaluation. On Pascal VOC the network predicts 98 bounding boxes per image, with class probabilities for each box. YOLO is extremely fast at test time because it only needs one network evaluation, unlike classifier-based methods. The grid design enforces spatial diversity in the bounding box predictions. Often it is clear which grid cell an object falls into, and the network only predicts one box for each object. However, some large objects, or objects near the border of multiple cells, can be localized well by multiple cells. Non-maximal suppression can be used to fix these multiple detections. While not as critical to performance as it is for R-CNN or DPM, non-maximal suppression adds 2−3% in mAP.
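For reference, greedy non-maximum suppression on corner-format boxes (xmin, ymin, xmax, ymax) can be sketched in plain NumPy as follows; the implementation further below relies on tf.image.non_max_suppression instead:

import numpy as np

def nms(boxes, scores, iou_threshold=0.4):
    # boxes: [N, 4] array of (xmin, ymin, xmax, ymax); scores: [N] confidences
    order = np.argsort(scores)[::-1]   # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        if rest.size == 0:
            break
        # Intersection of the kept box with the remaining boxes
        x1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        y1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        x2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        y2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, x2 - x1) * np.maximum(0.0, y2 - y1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_rest = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + area_rest - inter)
        order = rest[iou < iou_threshold]   # drop boxes that overlap the kept box too much
    return keep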

2.4 Limitations of YOLO

YOLO imposes spatial constraints on bounding box prediction, as each grid cell predicts only two boxes and only one class. This spatial constraint limits the number of nearby objects that our model can predict. Our model has difficulty predicting small objects (such as flocks of birds) that appear in groups. Since our model learns to predict bounding boxes from the data, it is difficult to generalize to targets of new, uncommon aspect ratios or configurations. Our model also uses relatively coarse features to predict bounding boxes, as the input image goes through multiple downsampling layers in our architecture.

Finally, our training uses a loss function that only approximates detection performance and treats errors in small bounding boxes the same as errors in large bounding boxes. A small error in a large box is usually benign, but a small error in a small box has a much larger effect on IOU. Our main source of error is incorrect localization.

3. Experiment

This time we reproduce the YOLO-small model (the 24-convolutional-layer YOLOv1 network built below). Without further ado, the code is as follows:

# -*- coding: utf-8 -*-

import tensorflow as tf
import numpy as np
import cv2


# leaky_relu activation function
def leaky_relu(x, alpha=0.1):
    return tf.maximum(alpha * x, x)


class Yolo(object):
    def __init__(self, weights_file, input_image, verbose=True):
        # Flag controlling whether progress information is printed
        self.verbose = verbose

        # Detection hyperparameters
        self.S = 7   # number of grid cells per side
        self.B = 2   # number of bounding boxes per grid cell
        self.classes = ["aeroplane", "bicycle", "bird", "boat", "bottle",
                        "bus", "car", "cat", "chair", "cow", "diningtable",
                        "dog", "horse", "motorbike", "person", "pottedplant",
                        "sheep", "sofa", "train", "tvmonitor"]
        self.C = len(self.classes)   # number of classes

        self.x_offset = np.transpose(np.reshape(np.array([np.arange(self.S)] * self.S * self.B),
                                                [self.B, self.S, self.S]), [1, 2, 0])
        self.y_offset = np.transpose(self.x_offset, [1, 0, 2])   # swap rows and columns to get the y offsets

        self.threshold = 0.2   # class confidence score threshold
        self.iou_threshold = 0.4   # NMS IOU threshold: boxes overlapping a higher-scoring box by more than 0.4 are suppressed

        self.max_output_size = 10   # maximum number of bounding boxes kept by NMS

        self.sess = tf.Session()
        self._build_net()   # [1] Build the network model (prediction): outputs a [batch, 7*7*30] tensor
        self._build_detector()   # [2] Parse the network prediction: assign a class to each box, then apply NMS
        self._load_weights(weights_file)   # [3] Load the weight file
        self.detect_from_file(image_file=input_image)   # [4] Run detection on the input image, visualize the boxes,
                                                         #     and save each object's class and coordinates to a txt file

    # [1] Build the network model (prediction): the backbone of the model; it outputs a [batch, 7*7*30] tensor
    def _build_net(self):
        # Print status information
        if self.verbose:
            print("Start to build the network ...")

        # Use a placeholder for the input; the first dimension is None so any batch size can be fed
        self.images = tf.placeholder(tf.float32, [None, 448, 448, 3])

        # Build the network model
        net = self._conv_layer(self.images, 1, 64, 7, 2)
        net = self._maxpool_layer(net, 1, 2, 2)
        net = self._conv_layer(net, 2, 192, 3, 1)
        net = self._maxpool_layer(net, 2, 2, 2)
        net = self._conv_layer(net, 3, 128, 1, 1)
        net = self._conv_layer(net, 4, 256, 3, 1)
        net = self._conv_layer(net, 5, 256, 1, 1)
        net = self._conv_layer(net, 6, 512, 3, 1)
        net = self._maxpool_layer(net, 6, 2, 2)
        net = self._conv_layer(net, 7, 256, 1, 1)
        net = self._conv_layer(net, 8, 512, 3, 1)
        net = self._conv_layer(net, 9, 256, 1, 1)
        net = self._conv_layer(net, 10, 512, 3, 1)
        net = self._conv_layer(net, 11, 256, 1, 1)
        net = self._conv_layer(net, 12, 512, 3, 1)
        net = self._conv_layer(net, 13, 256, 1, 1)
        net = self._conv_layer(net, 14, 512, 3, 1)
        net = self._conv_layer(net, 15, 512, 1, 1)
        net = self._conv_layer(net, 16, 1024, 3, 1)
        net = self._maxpool_layer(net, 16, 2, 2)
        net = self._conv_layer(net, 17, 512, 1, 1)
        net = self._conv_layer(net, 18, 1024, 3, 1)
        net = self._conv_layer(net, 19, 512, 1, 1)
        net = self._conv_layer(net, 20, 1024, 3, 1)
        net = self._conv_layer(net, 21, 1024, 3, 1)
        net = self._conv_layer(net, 22, 1024, 3, 2)
        net = self._conv_layer(net, 23, 1024, 3, 1)
        net = self._conv_layer(net, 24, 1024, 3, 1)
        net = self._flatten(net)
        net = self._fc_layer(net, 25, 512, activation=leaky_relu)
        net = self._fc_layer(net, 26, 4096, activation=leaky_relu)
        net = self._fc_layer(net, 27, self.S * self.S * (self.B * 5 + self.C))

        # Network output, a tensor of [batch, 7*7*30]
        self.predicts = net

    # [2] Parse the network prediction: assign a class to each predicted box, then apply NMS
    def _build_detector(self):
        # Width and height of the original image
        self.width = tf.placeholder(tf.float32, name='img_w')
        self.height = tf.placeholder(tf.float32, name='img_h')

        # Split the network output [batch, 7*7*30] into its three parts:
        idx1 = self.S * self.S * self.C
        idx2 = idx1 + self.S * self.S * self.B
        # 1. Class probabilities [:, :7*7*20], 20 values per cell
        class_probs = tf.reshape(self.predicts[0, :idx1], [self.S, self.S, self.C])
        # 2. Confidences [:, 7*7*20:7*7*(20+2)], 2 values per cell
        confs = tf.reshape(self.predicts[0, idx1:idx2], [self.S, self.S, self.B])
        # 3. Bounding boxes [:, 7*7*(20+2):], 8 values per cell -> (x, y, w, h) for each of the 2 boxes
        boxes = tf.reshape(self.predicts[0, idx2:], [self.S, self.S, self.B, 4])

        # Convert x, y from cell-relative offsets to coordinates relative to the upper left corner of the image;
        # w, h are predicted as square roots, so square them and scale by the image width and height
        boxes = tf.stack([(boxes[:, :, :, 0] + tf.constant(self.x_offset, dtype=tf.float32)) / self.S * self.width,
                          (boxes[:, :, :, 1] + tf.constant(self.y_offset, dtype=tf.float32)) / self.S * self.height,
                          tf.square(boxes[:, :, :, 2]) * self.width,
                          tf.square(boxes[:, :, :, 3]) * self.height], axis=3)

        # Class-specific confidence scores: [S,S,B,1] * [S,S,1,C] = [S,S,B,C]
        scores = tf.expand_dims(confs, -1) * tf.expand_dims(class_probs, 2)

        scores = tf.reshape(scores, [-1, self.C])  # [S*S*B, C]
        boxes = tf.reshape(boxes, [-1, 4])  # [S*S*B, 4]

        # For each box, keep only the class with the highest class-specific confidence
        box_classes = tf.argmax(scores, axis=1)   # class index of each box
        box_class_scores = tf.reduce_max(scores, axis=1)   # score of each box

        # Use the category confidence threshold self.threshold to filter out low category confidence
        filter_mask = box_class_scores >= self.threshold
        scores = tf.boolean_mask(box_class_scores, filter_mask)
        boxes = tf.boolean_mask(boxes, filter_mask)
        box_classes = tf.boolean_mask(box_classes, filter_mask)

        # NMS (applied over all classes together, without distinguishing between them)
        # Convert center + width/height boxes (x, y, w, h) into corner boxes (xmin, ymin, xmax, ymax),
        # e.g. xmin = x - w/2, because tf.image.non_max_suppression expects corner coordinates
        _boxes = tf.stack([boxes[:, 0] - 0.5 * boxes[:, 2], boxes[:, 1] - 0.5 * boxes[:, 3],
                           boxes[:, 0] + 0.5 * boxes[:, 2], boxes[:, 1] + 0.5 * boxes[:, 3]], axis=1)
        nms_indices = tf.image.non_max_suppression(_boxes, scores,
                                                   self.max_output_size, self.iou_threshold)
        self.scores = tf.gather(scores, nms_indices)
        self.boxes = tf.gather(boxes, nms_indices)
        self.box_classes = tf.gather(box_classes, nms_indices)

    # [3] Load the weights file
    def _load_weights(self, weights_file):
        # Print status information
        if self.verbose:
            print("Start to load weights from file: %s" % (weights_file))

        # Restore the weights from the checkpoint
        saver = tf.train.Saver()   # initialize the saver
        saver.restore(self.sess, weights_file)   # saver.restore loads weights / saver.save would save them

    # [4] Run detection on an input image, visualize the detected bounding boxes, and save each
    # object's class and coordinates to a txt file.
    # image_file: path of the input image;
    # deteted_boxes_file="boxes.txt": output txt with the box coordinates; detected_image_file="detected_image.jpg": visualization image
    def detect_from_file(self, image_file, imshow=True, deteted_boxes_file="boxes.txt",
                         detected_image_file="detected_image.jpg"):
        # read image
        image = cv2.imread(image_file)
        img_h, img_w, _ = image.shape
        scores, boxes, box_classes = self._detect_from_image(image)
        predict_boxes = []
        for i in range(len(scores)):
            # Each prediction is stored as (class name, x, y, w, h, class-specific confidence score)
            predict_boxes.append((self.classes[box_classes[i]], boxes[i, 0],
                                  boxes[i, 1], boxes[i, 2], boxes[i, 3], scores[i]))
        self.show_results(image, predict_boxes, imshow, deteted_boxes_file, detected_image_file)

    ################# Used by [1]: define the conv / maxpool / flatten / fc layers #################
    # Convolutional layer: x: input; id: layer index; num_filters: number of filters;
    # filter_size: filter size; stride: stride
    def _conv_layer(self, x, id, num_filters, filter_size, stride):

        # Number of input channels
        in_channels = x.get_shape().as_list()[-1]
        # Initialize the weights from a truncated normal with mean 0 and standard deviation 0.1;
        # shape = filter height * filter width * input channels * number of filters
        weight = tf.Variable(
            tf.truncated_normal([filter_size, filter_size, in_channels, num_filters], mean=0.0, stddev=0.1))
        bias = tf.Variable(tf.zeros([num_filters, ]))   # one bias per filter

        # Explicit padding; note: do not use padding="SAME", otherwise the coordinate calculation may be off
        pad_size = filter_size // 2   # integer division, keeps the integer part of the quotient
        pad_mat = np.array([[0, 0], [pad_size, pad_size], [pad_size, pad_size], [0, 0]])
        x_pad = tf.pad(x, pad_mat)
        conv = tf.nn.conv2d(x_pad, weight, strides=[1, stride, stride, 1], padding="VALID")
        output = leaky_relu(tf.nn.bias_add(conv, bias))

        # Print the layer information 
        if self.verbose:
            print('Layer%d:type=conv,num_filter=%d,filter_size=%d,stride=%d,output_shape=%s'
                  % (id, num_filters, filter_size, stride, str(output.get_shape())))

        return output

    # Pooling layer: x: input; id: layer index; pool_size: pooling size; stride: stride
    def _maxpool_layer(self, x, id, pool_size, stride):
        output = tf.layers.max_pooling2d(inputs=x,
                                         pool_size=pool_size,
                                         strides=stride,
                                         padding='SAME')
        if self.verbose:
            print('Layer%d:type=MaxPool,pool_size=%d,stride=%d,out_shape=%s'
                  % (id, pool_size, stride, str(output.get_shape())))
        return output

    # Flatten layer: needed before the fully connected layers, e.g. [n_samples, 7, 7, 32] -> [n_samples, 7*7*32]
    def _flatten(self, x):
        tran_x = tf.transpose(x, [0, 3, 1, 2])   # [batch, rows, cols, channels] -> [batch, channels, rows, cols]
        nums = np.product(x.get_shape().as_list()[1:])   # total number of values per sample (the first dimension is the batch size, so skip it)
        return tf.reshape(tran_x, [-1, nums])   # [batch, channels, rows, cols] -> [batch, channels*rows*cols]; -1 adapts to the batch size

    # Fully connected layer: x: input; id: layer index; num_out: output size; activation: activation function
    def _fc_layer(self, x, id, num_out, activation=None):
        num_in = x.get_shape().as_list()[-1]   # input dimension
        # Initialize the weights from a truncated normal with mean 0 and standard deviation 0.1
        weight = tf.Variable(tf.truncated_normal(shape=[num_in, num_out], mean=0.0, stddev=0.1))
        bias = tf.Variable(tf.zeros(shape=[num_out, ]))   # one bias per output unit
        output = tf.nn.xw_plus_b(x, weight, bias)

        # The hidden fully connected layers use the leaky_relu activation; the last layer is linear
        if activation:
            output = activation(output)

        # Print the layer information 
        if self.verbose:
            print('Layer%d:type=Fc,num_out=%d,output_shape=%s'
                  % (id, num_out, str(output.get_shape())))
        return output

    ######################## Used by [4]: run the network on an image, visualize the boxes, and save the results as txt ########################
    def _detect_from_image(self, image):
        """Do detection given a cv image"""
        img_h, img_w, _ = image.shape
        img_resized = cv2.resize(image, (448, 448))
        img_RGB = cv2.cvtColor(img_resized, cv2.COLOR_BGR2RGB)
        img_resized_np = np.asarray(img_RGB)
        _images = np.zeros((1, 448, 448, 3), dtype=np.float32)
        _images[0] = (img_resized_np / 255.0) * 2.0 - 1.0
        scores, boxes, box_classes = self.sess.run([self.scores, self.boxes, self.box_classes],
                                                   feed_dict={self.images: _images, self.width: img_w,
                                                              self.height: img_h})
        return scores, boxes, box_classes

    def show_results(self, image, results, imshow=True, deteted_boxes_file=None,
                     detected_image_file=None):
        """Show the detection boxes"""
        img_cp = image.copy()
        if deteted_boxes_file:
            f = open(deteted_boxes_file, "w")
        # draw boxes
        for i in range(len(results)):
            x = int(results[i][1])
            y = int(results[i][2])
            w = int(results[i][3]) // 2
            h = int(results[i][4]) // 2
            if self.verbose:
                print("class: %s, [x, y, w, h]=[%d, %d, %d, %d], confidence=%f"
                      % (results[i][0], x, y, w, h, results[i][-1]))

            # Center + width/height box (x, y, w, h) -> corner box (xmin, ymin, xmax, ymax), e.g. xmin = x - w/2
            cv2.rectangle(img_cp, (x - w, y - h), (x + w, y + h), (0, 255, 0), 2)

            # Draw the class name and score (class confidence) above the bounding box
            cv2.rectangle(img_cp, (x - w, y - h - 20), (x + w, y - h), (125, 125, 125), -1)   # background for putText
            cv2.putText(img_cp, results[i][0] + ' : %.2f' % results[i][5], (x - w + 5, y - h - 7),
                        cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1)

            if deteted_boxes_file:
                # Save this object's detection result to the txt file
                f.write(results[i][0] + ',' + str(x) + ',' + str(y) + ',' +
                        str(w) + ',' + str(h) + ',' + str(results[i][5]) + '\n')
        if imshow:
            cv2.imshow('YOLO_small detection', img_cp)
            cv2.waitKey(1)
        if detected_image_file:
            cv2.imwrite(detected_image_file, img_cp)
        if deteted_boxes_file:
            f.close()


if __name__ == '__main__':
    yolo_net = Yolo(weights_file='D:/Python/YOLOv1-Tensorflow-master/YOLO_small.ckpt',
                    input_image='D:/Python/YOLOv1-Tensorflow-master/car.jpg')

I have commented the above code in fair detail, so it should be easy to follow if you read through it slowly. The detection results are as follows:

Judging from the results, the detections are quite good. Give it a try yourself; criticism and corrections are welcome.
