OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks
Winner of the ILSVRC 2013 localization challenge, OverFeat is a method that takes as input an image with one salient object and outputs the class of that object as well as its bounding box.
The network is a simple CNN with two outputs: one predicts the class scores (softmax with cross-entropy loss) and the other predicts the bounding box coordinates (L2 loss).
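As a rough illustration of the two losses mentioned above, here is a minimal NumPy sketch. The function names, the example values and the `(x_min, y_min, x_max, y_max)` box layout are assumptions made for this example, not details taken from the OverFeat implementation, and the sketch says nothing about how the two heads are scheduled or weighted during training.

```python
import numpy as np

def softmax_cross_entropy(logits, label):
    """Cross-entropy of a softmax over `logits` against an integer class label."""
    shifted = logits - np.max(logits)  # subtract the max for numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return float(-log_probs[label])

def l2_box_loss(pred_box, true_box):
    """Squared L2 distance between predicted and ground-truth box coordinates."""
    diff = np.asarray(pred_box, dtype=float) - np.asarray(true_box, dtype=float)
    return float(np.sum(diff ** 2))

# Hypothetical outputs of the two heads for a single image
class_logits = np.array([2.0, 0.5, -1.0])
pred_box = [0.1, 0.2, 0.8, 0.9]  # assumed layout: (x_min, y_min, x_max, y_max)
true_box = [0.0, 0.2, 0.8, 1.0]

cls_loss = softmax_cross_entropy(class_logits, label=0)
box_loss = l2_box_loss(pred_box, true_box)
print(cls_loss, box_loss)
```

The classification head is penalized for putting low probability on the correct class, while the regression head is penalized for the squared distance between predicted and true coordinates.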
To improve precision, the network processes several sliding windows at multiple resolutions, each sliding window producing a class score and a bounding box. The end result is obtained by combining all of these bounding boxes and scores.
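To make the idea of combining per-window predictions concrete, the sketch below merges several boxes with a score-weighted average. This is a deliberately simplified stand-in chosen for illustration: the summary above does not describe the actual merging procedure, and `combine_boxes` and the sample values are hypothetical.

```python
import numpy as np

def combine_boxes(boxes, scores):
    """Merge boxes with a score-weighted average (a simplified stand-in).

    Returns the merged box and the mean of the input scores.
    """
    boxes = np.asarray(boxes, dtype=float)    # shape (n, 4)
    scores = np.asarray(scores, dtype=float)  # shape (n,)
    weights = scores / scores.sum()           # normalize scores into weights
    merged = weights @ boxes                  # weighted average per coordinate
    return merged, float(scores.mean())

# Hypothetical predictions from three sliding windows for the same class
window_boxes = [[10, 10, 50, 50], [12, 8, 54, 52], [8, 12, 48, 46]]
window_scores = [0.9, 0.6, 0.5]

merged_box, merged_score = combine_boxes(window_boxes, window_scores)
print(merged_box, merged_score)
```

Windows with a higher class score pull the merged box more strongly toward their own prediction, so a single low-confidence window has little effect on the result.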
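To speed up the process, the fully-connected layers are converted into 1x1 convolution layers, so the network can be applied to a larger image in a single pass and overlapping windows share their computation.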
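The equivalence behind this conversion can be checked numerically. The sketch below, with arbitrary layer sizes chosen for the example, shows that a fully-connected layer acting on a feature vector gives the same result as a 1x1 convolution with the same weights applied at every spatial position of a feature map.

```python
import numpy as np

rng = np.random.default_rng(0)
C_in, C_out = 8, 3  # arbitrary sizes for illustration
W = rng.standard_normal((C_out, C_in))
b = rng.standard_normal(C_out)

def fully_connected(x):
    """Dense layer on a single feature vector of shape (C_in,)."""
    return W @ x + b

def conv1x1(feature_map):
    """1x1 convolution on a map of shape (C_in, H, W), reusing W and b."""
    return np.einsum("oc,chw->ohw", W, feature_map) + b[:, None, None]

# A 2x3 grid of positions, e.g. one per sliding-window location
fmap = rng.standard_normal((C_in, 2, 3))
out = conv1x1(fmap)

# Each output position matches the dense layer applied to that position alone
for i in range(2):
    for j in range(3):
        assert np.allclose(out[:, i, j], fully_connected(fmap[:, i, j]))
```

Because the convolutional form handles all positions in one call, a larger input simply yields a larger output map instead of requiring one forward pass per window.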