A Comparative Study of Custom Object Detection Algorithms

Intellica.AI
12 min read · Dec 13, 2019

Object detection is a computer vision and image processing technique for detecting instances of certain object classes, such as humans, vehicles, banners, or buildings, in a digital image or video. Combined with other advanced technologies, object detection enables face detection and pedestrian detection, popularly known as person tracking, from video. Object detection is used in a plethora of areas such as security, human resources, healthcare, marketing, and logistics, with applications like detecting a broken bone in X-ray images, detecting a brand logo in an image or video, and tracking the players and the ball in a football match.

Research in computer vision focusing on object detection is growing rapidly. Initially, traditional image processing techniques such as pixel-by-pixel object matching were used. With advances in machine learning and deep learning, the Region-based Convolutional Neural Network (R-CNN) was developed for object detection, and Fast R-CNN and Faster R-CNN were invented to overcome its limitations. Evaluated on the COCO dataset, a benchmark dataset for object detection, Faster R-CNN was the best of these in terms of accuracy and training time. Its limitation was inference time, and to overcome it, SSD (Single Shot Detector) was developed, which outperformed the existing algorithms on this aspect, though at the cost of some accuracy compared to Faster R-CNN. Going further, the YOLO (You Only Look Once) object detection algorithm was developed using the Darknet framework, and its latest version, YOLO V3, outperformed Faster R-CNN and SSD in terms of both accuracy and inference time on the benchmark dataset.

In this blog, we focus on three object detection algorithms, their implementation details, and a comparative analysis from our case study on a custom dataset.

The algorithms that we are going to highlight are:

1. You Only Look Once (YOLO)

2. Faster Region-based Convolution Neural Networks (Faster R-CNN)

3. Single Shot Detector (SSD)

YOLO (You Only Look Once) Algorithm:

YOLO (You Only Look Once) takes only one forward propagation pass through the network to make its predictions. It was developed in 2015 by Joseph Redmon, Santosh Divvala, Ross Girshick, and Ali Farhadi. YOLO has outperformed other object detection models in terms of real-time inference.

YOLO V3 is implemented in the Darknet framework, whose backbone is a 53-layer network (Darknet-53) trained on ImageNet. For the detection task, 53 more layers are added, giving a 106-layer fully convolutional underlying architecture for YOLO V3. For this reason, YOLO V3 is slower than YOLO V2 but better in all other aspects.

Image Reference: YOLOv3: An Incremental Improvement

Unlike sliding-window and region-proposal-based techniques, YOLO sees the entire image during training and testing, so it implicitly encodes contextual information about the whole image as well as each object's appearance, which helps it detect objects very well.

The algorithm divides the image into a grid and runs classification and localization on each grid cell. Each cell predicts N bounding boxes along with confidence scores, where the confidence score reflects how certain the model is that the box contains an object of that class. Since most of these boxes have low confidence scores, unnecessary boxes and spurious detections can be discarded by setting a threshold.
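As a minimal sketch of this thresholding step (the boxes and scores below are hypothetical model outputs, not YOLO's actual API):

import numpy as np

# Hypothetical predictions: each row is one box
# [x_center, y_center, width, height], with a matching confidence score.
boxes = np.array([[0.48, 0.52, 0.20, 0.31],
                  [0.10, 0.85, 0.05, 0.07],
                  [0.51, 0.50, 0.22, 0.29]])
scores = np.array([0.92, 0.08, 0.76])

# Keep only the boxes whose confidence exceeds the threshold.
CONF_THRESHOLD = 0.5
keep = scores > CONF_THRESHOLD
print(boxes[keep], scores[keep])

In practice, non-maximum suppression is also applied afterward so that overlapping boxes for the same object collapse into a single detection.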

Faster R-CNN (Faster Region-based Convolution Neural Network):

To better understand Faster R-CNN, it helps to read about R-CNN and Fast R-CNN first, to see the evolution of object detection and, in particular, why the region proposal network (RPN) exists in this approach.

Image Reference: Smart Acc.: Web application based on Faster-RCNN architecture

In R-CNN and Fast R-CNN, regions of the image were extracted using Selective Search; in Faster R-CNN, Selective Search is replaced with a Region Proposal Network (RPN). The network first passes the input image through a ConvNet, which returns feature maps. The RPN is then applied to these feature maps to produce object proposals. All proposals are resized to the same size and passed to fully connected layers to classify and refine the bounding boxes.
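The following toy numpy sketch may help fix this pipeline in mind; the feature map, proposals, and pooled size are all invented for illustration, and this is not the actual Faster R-CNN code:

import numpy as np

# 1. Pretend a ConvNet backbone produced this feature map (H x W x C).
feature_map = np.random.rand(40, 60, 256).astype(np.float32)

# 2. Pretend the RPN emitted candidate regions on the feature map,
#    each as (x1, y1, x2, y2) with an objectness score.
proposals = np.array([[5, 8, 20, 30], [12, 4, 35, 22], [0, 0, 10, 10]])
objectness = np.array([0.9, 0.7, 0.2])

# Keep only confident proposals.
proposals = proposals[objectness > 0.5]

# 3. "RoI pooling": crop each proposal and resize it to a fixed size
#    so the classification head sees same-sized inputs.
def roi_pool(region, size=7):
    # Nearest-neighbour resize, a crude stand-in for real RoI pooling.
    h, w = region.shape[:2]
    ys = np.linspace(0, h - 1, size).astype(int)
    xs = np.linspace(0, w - 1, size).astype(int)
    return region[np.ix_(ys, xs)]

pooled = np.stack([roi_pool(feature_map[y1:y2, x1:x2])
                   for x1, y1, x2, y2 in proposals])
print(pooled.shape)  # (num_kept_proposals, 7, 7, 256)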

SSD (Single Shot Detector):

SSD, as its name says, runs a convolutional network over the input image once and computes a feature map for that image. It then runs a small 3×3 convolution on the feature map to predict bounding boxes and their category scores.

Image Source: SSD: Single Shot MultiBox Detector

Like Faster R-CNN, SSD uses anchor boxes at a variety of aspect ratios, and it learns offsets to these anchors rather than predicting the boxes directly. To handle scale, SSD predicts bounding boxes after multiple convolutional layers; since each of these layers operates at a different scale, the network can detect objects across a range of scales.
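To make "learning the offset" concrete, here is an illustrative decoding step using the box encoding commonly found in SSD implementations; the anchor, predicted offsets, and variance values are assumptions for illustration:

import numpy as np

# A hypothetical anchor box (center x, center y, width, height), normalized.
anchor = np.array([0.50, 0.50, 0.20, 0.20])

# Hypothetical offsets (tx, ty, tw, th) predicted for this anchor.
offsets = np.array([0.4, -0.2, 0.1, 0.3])

# Variances commonly used to scale the offsets in SSD implementations.
var = np.array([0.1, 0.1, 0.2, 0.2])

# Decode: shift the center by a fraction of the anchor size,
# and scale the width/height exponentially.
cx = anchor[0] + offsets[0] * var[0] * anchor[2]
cy = anchor[1] + offsets[1] * var[1] * anchor[3]
w = anchor[2] * np.exp(offsets[2] * var[2])
h = anchor[3] * np.exp(offsets[3] * var[3])
print(cx, cy, w, h)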

Implementation Details:

Here, we explain the implementation steps for three object detection models: (1) YOLO V3, (2) Faster R-CNN, and (3) SSD. To train a model on a custom dataset, the standard implementation steps are as follows:

1. Data Acquisition

2. Data Labelling

3. Data Preparation

4. Model Training

5. Evaluation / Results analysis

Data Acquisition

Data acquisition is a required step when you want to train a model on a custom dataset. You will need more than 100 images containing the object you want the model to detect. Each image should be of good quality and contain the object at least once. No image should be repeated within the dataset; the greater the variety of images, the more robust the training becomes.

Single Shot Detector (SSD) and Faster R-CNN require images containing the objects only, but in the case of YOLO, negative samples are also used for training and validation.

Here, we demonstrate model training and prediction for a custom object: the logo of the brand VIVO. We have collected ~200 images for training.

Data Labelling

For a custom dataset, we need to label the objects in each image. For object detection, the model requires certain details of the objects to be detected: the x and y coordinates, height, and width of each object within the image.

We have included images that contain the logo at different angles, against different backgrounds, and with a few other variations.

For labeling, we used the labelImg tool, which lets the user draw a bounding box, captures x-min, y-min, x-max, and y-max, and saves the values in an XML file for each image.

Run the command below to install labelImg:

$ pip install labelImg

You can also refer to https://github.com/tzutalin/labelImg for reference. The labelImg tool provides a user interface for drawing bounding boxes and captures each box's coordinates, height, and width; it also supports labeling multiple objects within a single image. labelImg stores the values in a separate XML file at the same location as the image.
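For reference, a Pascal VOC annotation produced by labelImg looks roughly like the following; the filename, image size, and coordinates here are made up:

<annotation>
  <filename>img1.jpg</filename>
  <size>
    <width>1280</width>
    <height>720</height>
    <depth>3</depth>
  </size>
  <object>
    <name>vivo</name>
    <bndbox>
      <xmin>412</xmin>
      <ymin>203</ymin>
      <xmax>680</xmax>
      <ymax>335</ymax>
    </bndbox>
  </object>
</annotation>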

Data Preparation & Model Training

YOLO v3:

Data Preparation:

1. Create the file obj.names in the directory build\darknet\x64\data\, with the object names, one per line.

2. Put image-files (.jpg) of your objects in the directory build\darknet\x64\data\obj\

3. For YOLO, set labelImg to save the annotations as YOLO-format .txt files instead of PascalVOC, since YOLO requires .txt annotations as input for the training images (see the example after this list).

4. For YOLO, roughly 50% of the images should be negative samples containing none of the logos, each with an empty .txt file of the same name; mix these in with the training images to help prevent the model from predicting false logos at test time.

5. Create the file train.txt in the directory build\darknet\x64\data\, with the filenames of your images, one per line, with paths relative to darknet.exe, for example:

data/obj/img1.jpg
data/obj/img2.jpg
data/obj/img3.jpg

6. Create the file obj.data in the directory build\darknet\x64\data\, containing:

classes = 1 (Number of classes to train)
train = data/train.txt (Path to train.txt file)
valid = data/test.txt (Path to test.txt file)
names = data/obj.names (Path to obj.names file)
backup = backup/ (Path to folder where the model is saved at each checkpoint)
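For reference, each image's YOLO-format .txt annotation contains one line per object: the class index followed by the box center, width, and height, all normalized to the image dimensions (the values below are made up):

0 0.512 0.347 0.210 0.118
0 0.113 0.722 0.154 0.090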

Training

1. To implement object detection using YOLO, first clone this GitHub repository:

git clone https://github.com/AlexeyAB/darknet

2. Copy the contents of the config file yolov3.cfg to yolo-obj.cfg and change the following in the yolo-obj.cfg file:

  1. Change the batch line to batch=64
  2. Change the subdivisions line to subdivisions=64
  3. Change the max_batches line to classes*2000, i.e. max_batches=6000 if you train for 3 classes
  4. Change the steps line to 80% and 90% of max_batches, i.e. steps=4800,5400
  5. Change classes=80 to your number of classes in each of the 3 [yolo] layers
  6. Change filters=255 to filters=(classes+5)*3 in the 3 [convolutional] layers immediately before each [yolo] layer, i.e. filters=24 if you train for 3 classes

3. Start training by using the command line:

darknet.exe detector train data/obj.data yolo-obj.cfg darknet53.conv.74

Inference:

1. Custom object detection in YOLO can be run with the following command:

darknet.exe detector test data/obj.data yolo-obj.cfg yolov3-final.weights -dont_show test_images/1.jpg

Faster R-CNN:

Data Preparation:

1. From the XML files generated by labelImg, create a text file with one line per bounding box, containing the image path and the box coordinates in the following format, and move it to the cloned repository:

filepath,xmin,ymin,xmax,ymax,class_name
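A minimal conversion sketch, assuming the labelImg Pascal VOC XML files sit next to the images in a train_images folder (the folder layout and the output filename annotate.txt are assumptions):

import glob
import xml.etree.ElementTree as ET

# Collect one "filepath,xmin,ymin,xmax,ymax,class_name" line per labeled object.
lines = []
for xml_path in glob.glob('train_images/*.xml'):
    root = ET.parse(xml_path).getroot()
    filename = root.find('filename').text
    for obj in root.findall('object'):
        box = obj.find('bndbox')
        lines.append(','.join([
            'train_images/' + filename,
            box.find('xmin').text, box.find('ymin').text,
            box.find('xmax').text, box.find('ymax').text,
            obj.find('name').text]))

with open('annotate.txt', 'w') as f:
    f.write('\n'.join(lines))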

Training

1. To implement object detection using Faster R-CNN, first clone this GitHub repository:

git clone https://github.com/kbardool/keras-frcnn

2. Move the train_images and test_images folders to the cloned repository.

3. Open a terminal in the repository folder and install the dependencies listed in requirements.txt (pip install -r requirements.txt).

4. We can then train the Faster R-CNN model by executing this line in the terminal:

python train_frcnn.py -o simple -p (path to txt file)

Inference:

1. Put the test images in the folder named test_images inside the Faster R-CNN repository folder, and execute the following line in the terminal:

python test_frcnn.py -p test_images

SSD:

We referred to the following link for the implementation of object detection using SSD:

https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html

Data Preparation:

1. We created a .pbtxt file containing the label names and their IDs (see the example after this list).

2. We converted all labelImg XML files to a CSV file using xml_to_csv.py.

3. We then converted the CSV file to .record format using generate_tfrecord.py.
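For our single "vivo" class, the label map file can be as small as the following (the class name is from our use case; additional classes get additional item blocks with incrementing IDs):

item {
  id: 1
  name: 'vivo'
}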

Training:

1. We downloaded the pretrained ssd_inception model and created a config file following the tutorial, with the following changes:

num_classes: 1
type: 'ssd_inception_v2' # Set to the name of your chosen pre-trained model
fine_tune_checkpoint: "pre-trained-model/model.ckpt" # Path to extracted files of pre-trained model
train_input_reader: {
  tf_record_input_reader {
    input_path: "annotations/train.record" # Path to training TFRecord file
  }
  label_map_path: "annotations/label_map.pbtxt" # Path to label map file
}

2. We can train the model by executing this line in the terminal:

python train.py --logtostderr --train_dir=training --pipeline_config_path=training/ssd_inception_v2_coco.config

Inference:

1. Copy the TensorFlow/models/research/object_detection/export_inference_graph.py script and paste it into your training_demo folder.

2. Then, using the model's highest-numbered checkpoint file, generate a .pb file by executing this line in the terminal:

python export_inference_graph.py --input_type image_tensor --pipeline_config_path training/ssd_inception_v2_coco.config --trained_checkpoint_prefix training/model.ckpt-13302 --output_directory trained-inference-graphs/output_inference_graph_v1.pb

3. Using the following object detection code, we can run predictions on the test images for SSD:

import os
import sys
import cv2
import numpy as np
import tensorflow as tf

sys.path.append("..")
from object_detection.utils import label_map_util
from object_detection.utils import visualization_utils as vis_util

# Path to frozen detection graph .pb file
PATH_TO_CKPT = '.../frozen_inference_graph.pb'

# Path to label map file
PATH_TO_LABELS = '.../label_map.pbtxt'

# Path to image
PATH_TO_IMAGE = '.../test_images/394.jpg'

# Number of classes the object detector can identify
NUM_CLASSES = 1

# Load the label map.
label_map = label_map_util.load_labelmap(PATH_TO_LABELS)
categories = label_map_util.convert_label_map_to_categories(
    label_map, max_num_classes=NUM_CLASSES, use_display_name=True)
category_index = label_map_util.create_category_index(categories)

# Load the TensorFlow model into memory.
detection_graph = tf.Graph()
with detection_graph.as_default():
    od_graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_CKPT, 'rb') as fid:
        serialized_graph = fid.read()
        od_graph_def.ParseFromString(serialized_graph)
        tf.import_graph_def(od_graph_def, name='')
    sess = tf.Session(graph=detection_graph)

# Input tensor is the image
image_tensor = detection_graph.get_tensor_by_name('image_tensor:0')

# Each box represents a part of the image where a particular object was detected
detection_boxes = detection_graph.get_tensor_by_name('detection_boxes:0')

# Each score represents the level of confidence for each of the objects
detection_scores = detection_graph.get_tensor_by_name('detection_scores:0')
detection_classes = detection_graph.get_tensor_by_name('detection_classes:0')
num_detections = detection_graph.get_tensor_by_name('num_detections:0')

image = cv2.imread(PATH_TO_IMAGE)
image_expanded = np.expand_dims(image, axis=0)

# Perform the actual detection by running the model with the image as input
(boxes, scores, classes, num) = sess.run(
    [detection_boxes, detection_scores, detection_classes, num_detections],
    feed_dict={image_tensor: image_expanded})

# Draw boxes with confidence above 60% on the image
vis_util.visualize_boxes_and_labels_on_image_array(
    image,
    np.squeeze(boxes),
    np.squeeze(classes).astype(np.int32),
    np.squeeze(scores),
    category_index,
    use_normalized_coordinates=True,
    line_thickness=8,
    min_score_thresh=0.60)

# All the results have been drawn on the image. Now display it.
cv2.imshow('Object detector', image)
cv2.waitKey(0)
cv2.destroyAllWindows()

Result Analysis

In our use case, we focused on detecting the logo of the brand VIVO. We annotated 150 images for training, each containing an average of 3 objects. We captured the training images at different sports events to keep a variety of backgrounds within the training set, and about 50 of the images share quite a similar background. Note that Faster R-CNN and SSD do not use negative samples (images without objects) for training.

As a first observation for the comparative analysis, we trained each object detection model for 2K steps/iterations on the 150 images.

Sample training images are as shown below:

YOLOv3

We trained YOLO V3 on our custom dataset and tested it on 15 test images.

Below are sample test images with annotated detected objects from the trained model:

From the analysis, we observed that YOLO V3 trained on the custom images performs well on objects of various sizes. The model is also able to detect multiple overlapping objects.

Faster R-CNN

We trained Faster R-CNN on our custom dataset for a similar number of iterations and tested it on the same 15 test images.

Below are a few sample test images with annotated detected objects from the trained model:

We observed that Faster R-CNN failed to detect very small objects, though its detection rate on larger objects was almost similar to that of YOLO V3.

SSD

We trained SSD on our custom dataset for a similar number of iterations and tested it on the same 15 test images.

We observed that SSD failed to detect objects in any of the test images.

The final loss value for SSD was 1.3, far larger than the loss values of YOLO V3 and Faster R-CNN, i.e., 0.17 and 0.23 respectively.

Even though we trained the SSD model for 3× the iterations used for the other models, it was still unable to detect the objects.

Comparative Analysis

Conclusion

In this blog, we have presented a comparative analysis of YOLO V3, Faster R-CNN, and SSD on our custom dataset. From our analysis, YOLO V3 worked best on our custom dataset for the specific parameters we used. The performance of an object detection algorithm depends on multiple factors: the dataset, data quality and quantity, object size, and algorithm parameters such as training steps, learning rate, anchor sizes, and anchor ratios. You can choose an algorithm based on your requirements and data availability.

If you are looking for more details on object detection solutions, feel free to reach out at info@intellica.

References:

1. https://pjreddie.com/media/files/papers/YOLOv3.pdf

2. https://arxiv.org/pdf/1506.01497.pdf

3. https://arxiv.org/pdf/1512.02325.pdf

4. https://github.com/AlexeyAB/darknet

5. https://www.analyticsvidhya.com/blog/2018/11/implementation-faster-r-cnn-python-object-detection/

6. https://tensorflow-object-detection-api-tutorial.readthedocs.io/en/latest/training.html#creating-tensorflow-records
