This is a survey/benchmarking paper on object detection (localization) methods. It serves as a guide for selecting a detection architecture with the right speed/memory/accuracy balance, investigating various ways to trade accuracy for speed and memory usage in CNN-based object detectors. To that end, the authors implemented three “meta-architectures”:

  • SSD (Single Shot Detector)
  • Faster R-CNN
  • R-FCN

with different feature extractors:

  • VGG-16
  • ResNet-101
  • Inception v2
  • Inception ResNet v2
  • MobileNet

They also tested the impact of the number of box proposals (between 10 and 300) and of the input image size (300 or 600).
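To give a sense of the size of this experimental grid, here is a minimal Python sketch that enumerates the combinations listed above. It is only an illustration, not the authors' code: the function name `enumerate_configs`, the intermediate proposal counts (the summary only gives the range 10 to 300), and the choice to attach no proposal count to SSD (a single-shot detector without a proposal stage) are assumptions of this sketch. The paper does not necessarily evaluate every combination produced here.

```python
from itertools import product

META_ARCHITECTURES = ["SSD", "Faster R-CNN", "R-FCN"]
FEATURE_EXTRACTORS = [
    "VGG-16",
    "ResNet-101",
    "Inception v2",
    "Inception ResNet v2",
    "MobileNet",
]
INPUT_SIZES = [300, 600]
# Assumption: only the range 10-300 is stated; these sample points are illustrative.
NUM_PROPOSALS = [10, 50, 100, 300]


def enumerate_configs():
    """Return one dict per (meta-architecture, extractor, size, proposals) combination."""
    configs = []
    for meta, extractor, size in product(
        META_ARCHITECTURES, FEATURE_EXTRACTORS, INPUT_SIZES
    ):
        if meta == "SSD":
            # Assumption: SSD has no proposal stage, so no proposal count is varied.
            proposal_options = [None]
        else:
            proposal_options = NUM_PROPOSALS
        for n in proposal_options:
            configs.append(
                {
                    "meta_architecture": meta,
                    "feature_extractor": extractor,
                    "input_size": size,
                    "num_proposals": n,
                }
            )
    return configs


configs = enumerate_configs()
print(f"{len(configs)} candidate configurations")
```

Even with only four proposal counts, the grid is large enough that each configuration has to be compared on speed, memory, and accuracy together rather than on mAP alone.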

Experimental results


  • Fig.4a shows that Faster R-CNN + Inception ResNet v2 is the top performing configuration.
  • Fig.4b shows that an input resolution of 600 is better than 300, especially when dealing with small objects.
  • Fig.6 shows that using more than 50 box proposals does not improve mAP while requiring more processing power.
  • Table 4 shows that their top single method achieves an mAP of 0.347 on COCO, while ensemble methods reach 0.416, the best result reported at the time.