Introduction
Computer vision has gained considerable prominence in industry with the advent of GPUs. In particular, object recognition, detection, and segmentation play a pivotal role in self-driving cars 🚘, automated identification 👮‍♀️, and information retrieval. Sometimes, for image classification, one first needs to detect the individual objects and pass them to a classifier. Over time, different algorithms have been proposed for object detection, such as R-CNN, Fast R-CNN, Faster R-CNN, YOLO, and many more. In this blog, we will focus primarily on the region-based algorithms; for YOLO, one can see this blog.
R-CNN
The algorithm, as proposed by Ross Girshick, can be broadly divided into six parts:
- Extract regions from an input image using Selective Search; these are called region proposals. In other words, region proposals are the regions of the image that may contain the object we are looking for.
- Label the category and ground-truth bounding box of each proposed region.
- Warp each proposed region to the same shape as the input of the CNN.
- Compute the features of those regions with a Convolutional Neural Network (CNN).
- Use the features and category labels for classification with a support vector machine (SVM).
- Finally, perform regression on the features and labeled bounding boxes to predict the ground-truth bounding box.
The intuition is the same as how a human would locate objects in an image: create a lot of candidate boxes inside the image and check which ones correspond to the object.
The big question here is: how do we create these boxes, and how many?
There are different methods to perform the above task.
- Constrained Parametric Min-Cuts
- Category Independent Object Proposals
- Randomized Prim
- Selective Search
The most frequently used is Selective Search, owing to its fast and efficient implementation.
Selective Search
Step I: R-CNN uses Felzenszwalb’s efficient graph-based image segmentation to create the initial segments/regions. You can read more about the Felzenszwalb approach here.
Step II: Combine smaller regions into larger ones based on similarity. In a way, this generates a hierarchy of bounding boxes. The four commonly used similarity measures are color, texture, size, and fill/shape.
For regions r1 and r2, the final similarity is a linear combination of these four measures:
s_final(r1,r2) = a1*s_color(r1,r2) + a2*s_texture(r1,r2) + a3*s_size(r1,r2) + a4*s_fill(r1,r2)
where ai belongs to {0, 1} depending on whether we are considering that measure or not. The details of all these measures can be read here. In a nutshell, region proposals are generated based on the similarity measures and the initial over-segmented regions from Step I.
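For a quick practical feel, here is a minimal sketch of generating region proposals with OpenCV’s Selective Search implementation (this assumes opencv-contrib-python is installed, and the image path is a placeholder):

```python
import cv2

# Load an input image (placeholder path) and initialise Selective Search.
image = cv2.imread("input.jpg")
ss = cv2.ximgproc.segmentation.createSelectiveSearchSegmentation()
ss.setBaseImage(image)

# "Fast" mode trades some recall for speed; switchToSelectiveSearchQuality()
# uses more colour spaces and similarity measures.
ss.switchToSelectiveSearchFast()

# Each rect is (x, y, w, h); typically around 2k proposals per image.
rects = ss.process()
print(f"Total region proposals: {len(rects)}")

# Keep only the first few hundred proposals for downstream processing.
proposals = rects[:300]
```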
Feature Extraction
The pre-trained CNN is applied to each proposed region after warping it to the input dimensions of the network. The features extracted from these regions, along with their labels, are used for classification and bounding-box prediction.
Consider our case with the above image; for simplicity, let us assume there were only three proposed regions from Selective Search.
Next, each of the proposed regions is warped to the input shape of the pre-trained CNN.
A binary SVM is then trained per class on these features for classification.
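As a rough sketch of the warping and feature-extraction step, the snippet below uses a torchvision ResNet-50 backbone to stand in for the pre-trained CNN (this is not the original Caffe pipeline; the image array and proposal boxes are placeholders):

```python
import torch
import torchvision
from torchvision import transforms

# Pre-trained backbone; drop the final classification layer to expose features.
backbone = torchvision.models.resnet50(pretrained=True)
backbone.fc = torch.nn.Identity()
backbone.eval()

# Warp each proposal to the fixed input size expected by the network.
warp = transforms.Compose([
    transforms.ToPILImage(),
    transforms.Resize((224, 224)),   # warping, ignoring aspect ratio
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def extract_features(image, proposals):
    """image: HxWx3 uint8 numpy array, proposals: list of (x, y, w, h) boxes."""
    feats = []
    with torch.no_grad():
        for (x, y, w, h) in proposals:
            crop = image[y:y + h, x:x + w]              # crop the proposed region
            feats.append(backbone(warp(crop).unsqueeze(0)))
    return torch.cat(feats)                             # one feature vector per region
```

The stacked feature matrix can then be fed to one binary SVM per class (for example, sklearn.svm.LinearSVC).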
Bounding Box Regression
A simple linear regression model is trained on the features of each region proposal to generate tighter bounding-box coordinates.
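As an illustration only (not the exact training setup of the paper), a regularised linear regression can map CNN features to the standard R-CNN box offsets; the feature and box arrays below are random placeholders:

```python
import numpy as np
from sklearn.linear_model import Ridge

def regression_targets(proposal, gt):
    """Boxes are (cx, cy, w, h); returns the usual R-CNN regression offsets."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = gt
    return np.array([
        (gx - px) / pw,      # t_x: normalised centre shift in x
        (gy - py) / ph,      # t_y: normalised centre shift in y
        np.log(gw / pw),     # t_w: log-scale width correction
        np.log(gh / ph),     # t_h: log-scale height correction
    ])

# Placeholder data: CNN features of proposals and their matched ground-truth boxes.
features = np.random.rand(100, 2048)
proposals = np.random.rand(100, 4) * 100 + 1
gt_boxes = proposals + np.random.rand(100, 4)

targets = np.stack([regression_targets(p, g) for p, g in zip(proposals, gt_boxes)])

# Ridge regression (a regularised linear regression) from features to offsets.
reg = Ridge(alpha=1.0).fit(features, targets)
predicted_offsets = reg.predict(features[:5])
```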
Cons of R-CNN
- For each image, we have approximately 2k region proposals from Selective Search.
- Feature extraction (a full CNN forward pass) for each of these ~2k regions.
- SVM classification for each such region.
- Bounding-box regression for each region.
Clearly, this per-image cost scales with the number of proposals, and the overall cost grows linearly with the number of images, making R-CNN quite slow.
Fast R-CNN
Similar to the original R-CNN, Fast R-CNN still utilizes Selective Search to obtain region proposals; however, the novel contribution of the paper was the Region of Interest (ROI) Pooling module.
ROI Pooling works by extracting a fixed-size window from the feature map and using these features to obtain the final class label and bounding box. The primary benefit here is that the network is now, effectively, end-to-end trainable:
- We input an image and associated ground-truth bounding boxes
- Extract the feature map
- Apply ROI pooling and obtain the ROI feature vector
- And finally, use two sets of fully-connected layers to obtain (1) the class label predictions and (2) the bounding-box locations for each proposal.
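Here is a minimal sketch of the ROI Pooling step using torchvision.ops.roi_pool (the feature map, boxes, and stride are made-up placeholders, not the paper’s exact setup):

```python
import torch
from torchvision.ops import roi_pool

# A dummy feature map from the backbone: batch of 1, 256 channels, 50x50 spatial.
feature_map = torch.randn(1, 256, 50, 50)

# One region proposal per row: (batch_index, x1, y1, x2, y2) in input-image coordinates.
boxes = torch.tensor([[0, 20.0, 30.0, 220.0, 230.0],
                      [0, 50.0, 10.0, 180.0, 140.0]])

# spatial_scale maps image coordinates onto the feature map
# (e.g. 1/16 for a backbone with an effective stride of 16).
pooled = roi_pool(feature_map, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)  # torch.Size([2, 256, 7, 7]) -> a fixed-size feature per proposal
```

Because every proposal is reduced to the same 7x7 grid, the same fully-connected layers can score all proposals from a single shared forward pass over the feature map.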
Faster R-CNN
While the network is now end-to-end trainable, inference (i.e., prediction) performance still suffers dramatically because the model depends on Selective Search for region proposals.
To make the R-CNN architecture even faster, we need to incorporate region proposal generation directly into the network:
The Faster R-CNN paper by Ren et al. introduced the Region Proposal Network (RPN), which bakes region proposal directly into the architecture, alleviating the need for the Selective Search algorithm.
As a whole, the Faster R-CNN architecture is capable of running at approximately 7–10 FPS, a huge step towards making real-time object detection with deep learning a reality.
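For reference, here is a minimal sketch of running a pre-trained Faster R-CNN via torchvision (the COCO-pretrained weights and the dummy image are stand-ins, not the original paper’s model):

```python
import torch
import torchvision

# Pre-trained Faster R-CNN with a ResNet-50 FPN backbone (COCO weights).
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(pretrained=True)
model.eval()

# A dummy RGB image with values in [0, 1]; in practice load and convert a real image.
image = torch.rand(3, 480, 640)

with torch.no_grad():
    outputs = model([image])  # the model takes a list of images

# Each output dict contains the detections for one image.
boxes = outputs[0]["boxes"]    # (N, 4) boxes in (x1, y1, x2, y2) format
labels = outputs[0]["labels"]  # (N,) COCO class indices
scores = outputs[0]["scores"]  # (N,) confidence scores
```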
Mask R-CNN
The Mask R-CNN algorithm builds on the Faster R-CNN architecture with two major contributions:
- Replacing the ROI Pooling module with a more accurate ROI Align module
- Inserting an additional branch out of the ROI Align module
This additional branch accepts the output of the ROI Align and then feeds it into two CONV layers.
The output of the CONV layers is the mask itself.
As we know, the Faster R-CNN/Mask R-CNN architectures leverage a Region Proposal Network (RPN) to generate regions of an image that potentially contain an object.
Each of these regions is ranked based on its “objectness score” (i.e., how likely it is that a given region contains an object), and then the top N most confident regions are kept.
In the original Faster R-CNN publication, Ren et al. set N=2,000, but in practice we can get away with a much smaller N, such as N={10, 100, 200, 300}, and still obtain good results.
He et al. set N=300 in their publication, which is the value we’ll use here as well.
Each of the 300 selected ROIs goes through three parallel branches of the network:
- Label prediction
- Bounding box prediction
- Mask prediction
Detailed view of the Mask R-CNN architecture
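To tie the three branches together, here is a minimal inference sketch with torchvision’s pre-trained Mask R-CNN; the rpn_post_nms_top_n_test=300 argument mirrors the N=300 setting above, and the dummy image is a placeholder:

```python
import torch
import torchvision

# Pre-trained Mask R-CNN (COCO); keep only the top 300 RPN proposals at test time.
model = torchvision.models.detection.maskrcnn_resnet50_fpn(
    pretrained=True, rpn_post_nms_top_n_test=300)
model.eval()

image = torch.rand(3, 480, 640)  # placeholder RGB image with values in [0, 1]

with torch.no_grad():
    out = model([image])[0]

# The three parallel outputs per detection:
labels = out["labels"]   # label prediction
boxes = out["boxes"]     # bounding-box prediction, (x1, y1, x2, y2)
masks = out["masks"]     # mask prediction: (N, 1, H, W) per-instance soft masks
```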
References
- Felzenszwalb, P. F. and Huttenlocher, D. P., Efficient Graph-Based Image Segmentation, IJCV, 2004. http://fcv2011.ulsan.ac.kr/files/announcement/413/IJCV(2004)%20Efficient%20Graph-Based%20Image%20Segmentation.pdf
- Girshick, R. et al., Rich Feature Hierarchies for Accurate Object Detection and Semantic Segmentation (R-CNN), CVPR, 2014
- Uijlings, J. R. R. et al., Selective Search for Object Recognition, IJCV, 2013
- van de Sande, K. E. A. et al., Segmentation as Selective Search for Object Recognition, ICCV, 2011
- He, K. et al., Mask R-CNN, ICCV, 2017. https://arxiv.org/pdf/1703.06870.pdf