This is a survey/benchmarking paper on object detection (localization) methods. It serves as a guide for selecting a detection architecture with the right speed/memory/accuracy balance, and investigates various ways to trade accuracy for speed and memory usage in CNN-based object detectors. To that end, the authors implemented three “meta-architectures”:
- SSD (Single Shot Detector)
- Faster R-CNN
- R-FCN (Region-based Fully Convolutional Network)
They paired these with different feature extractors:
- Inception v2
- Inception Resnet v2
They also tested the impact of the number of box proposals (between 10 and 300) and of the input image size (300 or 600).
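
The experimental setup described above is essentially a grid over meta-architecture, feature extractor, number of proposals, and input size. A minimal sketch of how such a grid could be enumerated is given below. It is not the authors' code: it uses only a subset of the names listed in these notes, the specific proposal counts are assumed sample points within the 10 to 300 range, and the dictionary keys are hypothetical. SSD is given no proposal count because it is a single-shot detector without a proposal stage.

```python
from itertools import product

# Subset of the names listed in these notes (illustrative, not exhaustive).
META_ARCHITECTURES = ["SSD", "Faster R-CNN"]
FEATURE_EXTRACTORS = ["Inception v2", "Inception Resnet v2"]
IMAGE_SIZES = [300, 600]

# Assumption: sample points chosen for illustration within the 10-300 range.
PROPOSAL_COUNTS = [10, 50, 100, 300]


def build_grid():
    """Enumerate every configuration in the (hypothetical) experiment grid."""
    configs = []
    for arch, extractor, size in product(
        META_ARCHITECTURES, FEATURE_EXTRACTORS, IMAGE_SIZES
    ):
        # SSD has no proposal stage, so the proposal count does not apply.
        counts = [None] if arch == "SSD" else PROPOSAL_COUNTS
        for num_proposals in counts:
            configs.append(
                {
                    "meta_architecture": arch,
                    "feature_extractor": extractor,
                    "image_size": size,
                    "num_proposals": num_proposals,
                }
            )
    return configs


grid = build_grid()
print(f"{len(grid)} configurations to benchmark")
```

Each entry in `grid` would then be trained and evaluated once, recording mAP, inference time, and memory usage for comparison.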
- Fig.4a shows that Faster R-CNN + Inception Resnet v2 is the top-performing configuration.
- Fig.4b shows that an input resolution of 600 is better than 300, especially when dealing with small objects.
- Fig.6 shows that using more than 50 box proposals does not improve mAP while requiring more processing power.
- Table 4 shows that their top method achieves a mAP of 0.347 on COCO, while ensemble methods reach 0.416, the best reported result at the time of publication.
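
To make the "right speed/accuracy balance" idea concrete, here is a rough sketch of how one might pick a configuration from benchmark results: keep only the configurations on the speed/accuracy Pareto front, or take the most accurate one that fits a time budget. The `Result` type, the function names, and above all the numbers are hypothetical placeholders, not values from the paper.

```python
from typing import List, NamedTuple, Optional


class Result(NamedTuple):
    name: str
    gpu_time_ms: float  # inference time per image
    map_score: float    # mean average precision


def pareto_front(results: List[Result]) -> List[Result]:
    """Keep configurations that no other configuration beats on both speed and mAP."""
    front = []
    best_map = float("-inf")
    # Sort fastest first; on equal time, the higher mAP comes first.
    for r in sorted(results, key=lambda r: (r.gpu_time_ms, -r.map_score)):
        if r.map_score > best_map:
            front.append(r)
            best_map = r.map_score
    return front


def best_under_budget(results: List[Result], budget_ms: float) -> Optional[Result]:
    """Return the most accurate configuration that fits the time budget, if any."""
    feasible = [r for r in results if r.gpu_time_ms <= budget_ms]
    return max(feasible, key=lambda r: r.map_score) if feasible else None


# Placeholder numbers for illustration only; NOT results from the paper.
results = [
    Result("config_a", 50.0, 0.20),
    Result("config_b", 100.0, 0.25),
    Result("config_c", 120.0, 0.22),  # slower and less accurate than config_b
    Result("config_d", 400.0, 0.33),
]

front_names = [r.name for r in pareto_front(results)]
choice = best_under_budget(results, budget_ms=150.0)
print(front_names, choice.name if choice else None)
```

With real measurements in place of the placeholders, the front corresponds to the envelope of the speed/accuracy scatter plot, and the budgeted pick answers the practical question of which detector to deploy under a latency constraint.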