Introduction
Computer vision has gained considerable prominence in industry with the advent of GPUs. In particular, object recognition, detection, and segmentation play a pivotal role in self-driving cars 🚘, automated identification 👮‍♀️, and information retrieval. Sometimes, for image classification, one first needs to detect the individual objects and pass them to a classifier. Over time, different algorithms have been proposed for object detection, such as R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and many more. In this blog, we will focus primarily on the region-based algorithms; for YOLO, one can see this blog.
R-CNN
The algorithm, as proposed by Ross Girshick, can be broadly divided into six parts:
- Extract regions from an input image using Selective Search; these are called region proposals. In other words, region proposals are the regions of the image that may contain the object we are looking for.
- Label the category and ground-truth bounding box of each proposed region.
- Warp each proposed region to the same shape as the input of the CNN.
- Compute the features of those regions with a Convolutional Neural Network (CNN).
- Use the features and category labels for classification with a support vector machine (SVM).
- Finally, perform regression on the features and labeled bounding boxes to predict the ground-truth bounding box.
The intuition is the same as how a human would locate objects in an image: create a lot of candidate boxes inside the image and check which ones correspond to the object.
The big question here is: how do we create these boxes, and how many?
There are different methods to perform the above task.
- Constrained Parametric Min-Cuts
- Category Independent Object Proposals
- Randomized Prim
- Selective Search
The most frequently used is Selective Search, owing to its fast and efficient implementation.
Selective Search
Step I: R-CNN uses Felzenszwalb’s efficient graph-based image segmentation to create the initial segments/regions. You can read more about the Felzenszwalb approach here.
Step II: Combine smaller regions into larger ones based on similarity. In a way, this generates a hierarchy of bounding boxes. The four commonly used similarity measures are color, texture, size, and fill/shape.
For regions r1 and r2, the final similarity is a linear combination of these four measures:
s_final(r1,r2) = a1*s_color(r1,r2) + a2*s_texture(r1,r2) + a3*s_size(r1,r2) + a4*s_fill(r1,r2)
where ai belongs to {0, 1} depending on whether we are considering that measure or not. The details of all these measures can be read here. In a nutshell, region proposals are generated based on the similarity measures and the initial over-segmented regions from Step I.
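For a quick practical feel, here is a minimal sketch of generating region proposals with OpenCV’s Selective Search implementation (this assumes opencv-contrib-python is installed, and the image path is a placeholder):

```python
import cv2

# Load an input image (placeholder path) and initialise Selective Search.
image = cv2.imread("input.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)

# "Fast" mode trades some recall for speed; switchToSelectiveSearchQuality()
# uses more colour spaces and similarity measures.
ss.switchToSelectiveSearchFast()

# Each rect is (x, y, w, h); typically around 2k proposals per image.
rects = ss.process()
print(f"Total region proposals: {len(rects)}")

# Keep only the first few hundred proposals for downstream processing.
proposals = rects[:300]
```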
Feature Extraction
The pre-trained CNN is applied to each proposed region after warping it to the input dimensions of the network. The features extracted from these regions, along with their labels, are used for classification and bounding-box prediction.
Consider our case with the above image; for simplicity, let us assume there were only three proposed regions from Selective Search.
Next, each of the proposed regions is warped to the input shape of the pre-trained CNN.
A binary SVM is then trained per class on these features for classification.
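As a rough sketch of the warping and feature-extraction step, the snippet below uses a torchvision ResNet-50 backbone to stand in for the pre-trained CNN (this is not the original Caffe pipeline; the image array and proposal boxes are placeholders):

```python
import torch
import torchvision
from torchvision import transforms

# Pre-trained backbone; drop the final classification layer to expose features.
backbone = torchvision.models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

# Warp each proposal to the fixed input size expected by the network.
warp = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),   # warping, ignoring aspect ratio
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(image, proposals):
    """image: HxWx3 uint8 numpy array, proposals: list of (x, y, w, h) boxes."""
    feats = []
    with torch.no_grad():
        for (x, y, w, h) in proposals:
            crop = image[y:y + h, x:x + w]              # crop the proposed region
            feats.append(backbone(warp(crop).unsqueeze(0)))
    return torch.cat(feats)                             # one feature vector per region
```

The stacked feature matrix can then be fed to one binary SVM per class (for example, sklearn.svm.LinearSVC).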
Bounding Box Regression
A simple linear regression model is trained on the features of each region proposal to generate tighter bounding-box coordinates.
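As an illustration only (not the exact training setup of the paper), a regularised linear regression can map CNN features to the standard R-CNN box offsets; the feature and box arrays below are random placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge

def regression_targets(proposal, gt):
    """Boxes are (cx, cy, w, h); returns the usual R-CNN regression offsets."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([
        (gx - px) / pw,      # t_x: normalised centre shift in x
        (gy - py) / ph,      # t_y: normalised centre shift in y
        np.log(gw / pw),     # t_w: log-scale width correction
        np.log(gh / ph),     # t_h: log-scale height correction
    ])

# Placeholder data: CNN features of proposals and their matched ground-truth boxes.
features = np.random.rand(100, 2048)
proposals = np.random.rand(100, 4) * 100 + 1
gt_boxes = proposals + np.random.rand(100, 4)

targets = np.stack([regression_targets(p, g) for p, g in zip(proposals, gt_boxes)])

# Ridge regression (a regularised linear regression) from features to offsets.
reg = Ridge(alpha=1.0).fit(features, targets)
predicted_offsets = reg.predict(features[:5])
```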
Cons of R-CNN
- For each image, we have approximately 2k region proposals from Selective Search.
- Feature extraction (a full CNN forward pass) for each of these ~2k regions.
- SVM classification for each such region.
- Bounding-box regression for each region.
Clearly, this per-image cost scales with the number of proposals, and the overall cost grows linearly with the number of images, making R-CNN quite slow.
Fast R-CNN
Similar to the original R-CNN, Fast R-CNN still utilizes Selective Search to obtain region proposals; however, the novel contribution of the paper was the Region of Interest (ROI) Pooling module.
ROI Pooling works by extracting a fixed-size window from the feature map and using these features to obtain the final class label and bounding box. The primary benefit here is that the network is now, effectively, end-to-end trainable:
- We input an image and associated ground-truth bounding boxes
- Extract the feature map
- Apply ROI pooling and obtain the ROI feature vector
- And finally, use two sets of fully-connected layers to obtain (1) the class label predictions and (2) the bounding-box locations for each proposal.
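Here is a minimal sketch of the ROI Pooling step using torchvision.ops.roi_pool (the feature map, boxes, and stride are made-up placeholders, not the paper’s exact setup):

```python
import torch
from torchvision.ops import roi_pool

# A dummy feature map from the backbone: batch of 1, 256 channels, 50x50 spatial.
feature_map = torch.randn(1, 256, 50, 50)

# One region proposal per row: (batch_index, x1, y1, x2, y2) in input-image coordinates.
boxes = torch.tensor([[0, 20.0, 30.0, 220.0, 230.0],
                      [0, 50.0, 10.0, 180.0, 140.0]])

# spatial_scale maps image coordinates onto the feature map
# (e.g. 1/16 for a backbone with an effective stride of 16).
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> a fixed-size feature per proposal
```

Because every proposal is reduced to the same 7x7 grid, the same fully-connected layers can score all proposals from a single shared forward pass over the feature map.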
Faster R-CNN
While the network is now end-to-end trainable, inference (i.e., prediction) performance still suffers dramatically because the model depends on Selective Search for region proposals.
To make the R-CNN architecture even faster, we need to incorporate region proposal generation directly into the network:
The Faster R-CNN paper by Ren et al. introduced the Region Proposal Network (RPN), which bakes region proposal directly into the architecture, alleviating the need for the Selective Search algorithm.
As a whole, the Faster R-CNN architecture is capable of running at approximately 7–10 FPS, a huge step towards making real-time object detection with deep learning a reality.
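For reference, here is a minimal sketch of running a pre-trained Faster R-CNN via torchvision (the COCO-pretrained weights and the dummy image are stand-ins, not the original paper’s model):

```python
import torch
import torchvision

# Pre-trained Faster R-CNN with a ResNet-50 FPN backbone (COCO weights).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy RGB image with values in [0, 1]; in practice load and convert a real image.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])  # the model takes a list of images

# Each output dict contains the detections for one image.
boxes = outputs[0]["boxes"]    # (N, 4) boxes in (x1, y1, x2, y2) format
labels = outputs[0]["labels"]  # (N,) COCO class indices
scores = outputs[0]["scores"]  # (N,) confidence scores
```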
Mask R-CNN
The Mask R-CNN algorithm builds on the Faster R-CNN architecture with two major contributions:
- Replacing the ROI Pooling module with a more accurate ROI Align module
- Inserting an additional branch out of the ROI Align module
This additional branch accepts the output of the ROI Align and then feeds it into two CONV layers.
The output of the CONV layers is the mask itself.
As we know, the Faster R-CNN/Mask R-CNN architectures leverage a Region Proposal Network (RPN) to generate regions of an image that potentially contain an object.
Each of these regions is ranked based on its “objectness score” (i.e., how likely it is that a given region contains an object), and then the top N most confident regions are kept.
In the original Faster R-CNN publication, Ren et al. set N=2,000, but in practice we can get away with a much smaller N, such as N={10, 100, 200, 300}, and still obtain good results.
He et al. set N=300 in their publication, which is the value we’ll use here as well.
Each of the 300 selected ROIs goes through three parallel branches of the network:
- Label prediction
- Bounding box prediction
- Mask prediction
Detailed view of the Mask R-CNN architecture
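To tie the three branches together, here is a minimal inference sketch with torchvision’s pre-trained Mask R-CNN; the rpn_post_nms_top_n_test=300 argument mirrors the N=300 setting above, and the dummy image is a placeholder:

```python
import torch
import torchvision

# Pre-trained Mask R-CNN (COCO); keep only the top 300 RPN proposals at test time.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    pretrained=True, rpn_post_nms_top_n_test=300)
model.eval()

image = torch.rand(3, 480, 640)  # placeholder RGB image with values in [0, 1]

with torch.no_grad():
    out = model([image])[0]

# The three parallel outputs per detection:
labels = out["labels"]   # label prediction
boxes = out["boxes"]     # bounding-box prediction, (x1, y1, x2, y2)
masks = out["masks"]     # mask prediction: (N, 1, H, W) per-instance soft masks
```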
References
- Felzenszwalb, P. F. and Huttenlocher, D. P., Efficient Graph-Based Image Segmentation, IJCV, 2004. http://fcv2011.ulsan.ac.kr/files/announcement/413/IJCV(2004)%20Efficient%20Graph-Based%20Image%20Segmentation.pdf
- Girshick, R. et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation (R-CNN), CVPR, 2014
- Uijlings, J. R. R. et al., Selective Search for Object Recognition, IJCV, 2013
- van de Sande, K. E. A. et al., Segmentation as Selective Search for Object Recognition, ICCV, 2011
- He, K. et al., Mask R-CNN, ICCV, 2017. https://arxiv.org/pdf/1703.06870.pdf