Recent methods (2019–2020)
Fortunately, the answer to the above question is YES. Recently, researchers have been developing new one-stage methods that make object detection simpler than before. The main ideas are twofold:
- Do not use anchors; use per-pixel prediction instead.
- Do not use NMS post-processing; use one-to-one training instead.
Instead of using anchors that vary in aspect ratio and object size, recent methods reduce complexity with per-pixel prediction, much like semantic segmentation. A typical method is FCOS, in which every pixel in the final feature map predicts an object box, making the detector a fully convolutional network (FCN). The FCN style of object detection not only simplifies the task itself, but also unifies it with other FCN tasks such as semantic segmentation and keypoint detection for multi-task applications.
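To make the per-pixel idea concrete, here is a minimal NumPy sketch of an FCOS-style head (names and sizes are illustrative, not from the official implementation). A 1x1 convolution is just a per-pixel linear map, so each pixel of an H x W feature map directly emits C class scores and 4 box distances, with no anchors enumerated anywhere:

```python
import numpy as np

rng = np.random.default_rng(0)
H, W, D, C = 8, 8, 16, 3              # feature map size, channels, classes

features = rng.standard_normal((H, W, D))
w_cls = rng.standard_normal((D, C))   # 1x1 conv == per-pixel linear map
w_reg = rng.standard_normal((D, 4))

cls_logits = features @ w_cls             # (H, W, C): class scores per pixel
box_dists = np.exp(features @ w_reg)      # (H, W, 4): positive box distances

print(cls_logits.shape, box_dists.shape)  # (8, 8, 3) (8, 8, 4)
```

The exponential on the regression branch is one common way to keep the predicted distances positive; the key point is simply that every output is a dense map with the same spatial layout as the input features.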
We can see that for each pixel inside a ground-truth box, a label (l, r, t, b) can be assigned, indicating the distances from the pixel to the left, right, top, and bottom boundaries of the box. Since many pixels are assigned to the same object, the training is still many-to-one, and NMS post-processing is still necessary to obtain the final predictions. So although FCOS simplifies object detection and performs well, it is still not end-to-end.
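This target assignment can be sketched in a few lines of NumPy (a simplified illustration; box coordinates and the strict interior test are assumptions of this sketch). Every pixel center strictly inside the ground-truth box becomes a positive sample with its (l, r, t, b) distances as the regression target:

```python
import numpy as np

H, W = 6, 8
gt = (1.0, 2.0, 5.0, 6.0)             # ground-truth box as (x1, y1, x2, y2)

ys, xs = np.mgrid[0:H, 0:W].astype(float) + 0.5   # pixel centers
x1, y1, x2, y2 = gt
l, r = xs - x1, x2 - xs               # distances to left / right boundaries
t, b = ys - y1, y2 - ys               # distances to top / bottom boundaries

inside = (l > 0) & (r > 0) & (t > 0) & (b > 0)    # positive pixels
targets = np.stack([l, r, t, b], axis=-1)         # (H, W, 4) regression map

print(inside.sum())   # 16 pixels are positive for this one box
```

The printed count makes the many-to-one problem visible: a single object produces 16 positive training samples here, which is exactly why NMS is still needed at inference time.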
In order to make object detection end-to-end, people had to think differently. Since 2020, with the prevalence of transformers, people have tended to do object detection with vision transformers, and the results are also good. A typical method is DETR, which will not be discussed in this article. What I will talk about here is a parallel line of work: OneNet, which extends FCOS into an end-to-end FCN for object detection.
As discussed above, the main reason NMS is necessary is that the many-to-one paradigm is used in training, so many boxes with high confidence are predicted for a single object. To make detection end-to-end without NMS, a one-to-one training paradigm should be used instead.
Recall that in early methods, predictions and ground truths are matched using only geometric losses (IoU and L1), which are then used for back-propagation. Many-to-one matching is necessary there to increase the variance of the training data, because many candidates with similar geometric losses can be found and matched to the same ground truth. The candidate is not unique! On the other hand, if we insisted on one-to-one matching using only the candidate with the lowest geometric loss, the model would likely overfit and fail to generalize.
The authors of OneNet recognized this problem and combined two kinds of losses, geometric and classification, to match candidates with ground truths. Unlike the geometric loss, the classification loss is, in a sense, unique to the corresponding ground truth: in the high-level feature maps of an object, we can find a single pixel that best represents the object's class. Although many pixels have similar geometric losses with respect to the ground truth, the pixel with the best classification loss is unique. Thus we can combine the two losses and pick the single candidate with the lowest combined loss for one-to-one matching during training. As described in the original paper, only the candidate with the minimum loss is matched to the corresponding object; all other candidates are treated as negatives and matched with the background.
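The minimum-cost matching above can be sketched as follows (a simplified illustration, not the official OneNet code: the paper combines a focal classification loss with L1 and GIoU box losses, while here plain negative log-likelihood and L1 stand in for them, and the weight `lam` is an assumed hyperparameter):

```python
import numpy as np

rng = np.random.default_rng(0)
P, C = 12, 3                                  # candidate pixels, classes
cls_prob = rng.dirichlet(np.ones(C), size=P)  # predicted class probabilities
pred_ltrb = rng.uniform(0, 4, size=(P, 4))    # predicted (l, r, t, b)

gt_class = 1
gt_ltrb = np.array([1.0, 2.0, 1.5, 0.5])      # ground-truth (l, r, t, b)
lam = 1.0                                     # weight of the geometric term

cost_cls = -np.log(cls_prob[:, gt_class])     # classification cost per pixel
cost_geo = np.abs(pred_ltrb - gt_ltrb).sum(1) # L1 geometric cost per pixel
cost = cost_cls + lam * cost_geo              # combined matching cost

positive = int(np.argmin(cost))               # the unique matched pixel
labels = np.full(P, -1)                       # -1 == background
labels[positive] = gt_class

print((labels != -1).sum())                   # exactly one positive -> 1
```

With several ground-truth objects, the same argmin is taken per object (each candidate used at most once); the essential contrast with FCOS is that exactly one pixel per object is positive, so no NMS is needed at inference.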