Winner of the ILSVRC 2013 localization challenge, OverFeat is a method that takes as input an image with one salient object and outputs the class of that object as well as its bounding box.
The network is a simple CNN with two outputs: one that predicts the class scores (softmax with cross-entropy loss) and one that predicts the bounding box coordinates (L2 loss).
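The two losses can be illustrated with a minimal sketch in plain Python. The function names and the `(x1, y1, x2, y2)` box layout are assumptions made for illustration, and the losses are returned separately because the text does not say how, or whether, they are combined during training.

```python
import math


def softmax(logits):
    """Turn raw class scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, target_class):
    """Classification head loss: negative log-probability of the true class."""
    return -math.log(softmax(logits)[target_class])


def l2_loss(pred_box, true_box):
    """Regression head loss: squared Euclidean distance between box coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred_box, true_box))


def two_head_losses(class_logits, target_class, pred_box, true_box):
    """Return (classification loss, box loss) for one training example."""
    return cross_entropy(class_logits, target_class), l2_loss(pred_box, true_box)


# Hypothetical example: 3 classes, boxes as (x1, y1, x2, y2) in pixels.
cls_loss, box_loss = two_head_losses(
    [2.0, 0.5, -1.0], 0, (10.0, 12.0, 50.0, 60.0), (8.0, 12.0, 52.0, 60.0)
)
```

Both heads share the same convolutional features; only the final layers and the loss differ.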
To improve precision, the network processes several sliding windows at multiple resolutions, and each sliding window yields a class score and a bounding box. The end result is obtained by combining all of these bounding boxes and scores.
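As a rough illustration of the combination step, the sketch below averages the class scores over all windows, picks the best class, and then takes a score-weighted average of the window boxes. This is a deliberately simplified stand-in and not the merging procedure used by OverFeat itself; the function name and the data layout are assumptions.

```python
def combine_predictions(windows):
    """Combine per-window predictions into one class and one box.

    windows: list of (class_scores, box) pairs, where class_scores is a list
    of positive per-class scores (e.g. softmax outputs) and box is
    (x1, y1, x2, y2) in image coordinates.
    """
    num_classes = len(windows[0][0])
    # Average each class score over all windows and keep the best class.
    mean_scores = [
        sum(scores[c] for scores, _ in windows) / len(windows)
        for c in range(num_classes)
    ]
    best = max(range(num_classes), key=lambda c: mean_scores[c])
    # Weight each window's box by its confidence in the chosen class.
    total = sum(scores[best] for scores, _ in windows)
    box = tuple(
        sum(scores[best] * b[i] for scores, b in windows) / total
        for i in range(4)
    )
    return best, box


# Hypothetical predictions from two windows, two classes.
windows = [
    ([0.9, 0.1], (0.0, 0.0, 10.0, 10.0)),
    ([0.7, 0.3], (2.0, 2.0, 12.0, 12.0)),
]
best_class, merged_box = combine_predictions(windows)
```

Windows that are confident about the chosen class pull the final box toward their own prediction, while low-confidence windows contribute little.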
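To speed up the process, the fully-connected layers are converted into 1x1 convolution layers, so the network can be applied to the whole image in a single pass and the computation shared by overlapping windows is performed only once.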
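The equivalence behind this conversion can be checked with a small sketch in plain Python: a 1x1 convolution applies the same fully-connected weights independently at every spatial position of a feature map. The function names and the nested-list layout `feature_map[channel][row][col]` are assumptions chosen for readability, not an efficient implementation.

```python
def fully_connected(x, weights, bias):
    """Dense layer: x has C inputs, weights has K rows of C values, bias has K values."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def conv1x1(feature_map, weights, bias):
    """1x1 convolution over feature_map[c][i][j], reusing the dense-layer weights.

    Returns out[k][i][j] with one output channel per row of weights.
    """
    channels = len(feature_map)
    height = len(feature_map[0])
    width = len(feature_map[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in weights]
    for i in range(height):
        for j in range(width):
            # The channel vector at one position plays the role of the dense input.
            pixel = [feature_map[c][i][j] for c in range(channels)]
            for k, value in enumerate(fully_connected(pixel, weights, bias)):
                out[k][i][j] = value
    return out


# Hypothetical layer with 2 input channels and 2 outputs, on a 1x2 feature map.
weights = [[1.0, 2.0], [3.0, 4.0]]
bias = [0.5, -0.5]
feature_map = [[[1.0, 2.0]], [[3.0, 4.0]]]
output_map = conv1x1(feature_map, weights, bias)
```

On a 1x1 feature map the result matches the fully-connected layer exactly. On a larger map, each output position corresponds to one sliding-window location, which is why a single convolutional pass can replace many separate window evaluations.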