One of the first localization papers to use a deep convolutional neural network. Unlike OverFeat, it can localize an arbitrary number of objects per image. The method is a three-stage process. First, a region proposal step extracts a large number of bounding boxes likely to contain an object of interest. Second, the image region inside each bounding box is fed to an AlexNet CNN (typically pre-trained on ImageNet) to produce a 4096-dimensional feature vector for that box. Third, each feature vector is classified with a multiclass linear SVM.
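The three stages can be sketched as a toy pipeline in plain Python. This is only an illustration of the control flow, not the paper's implementation: all function names are hypothetical, dense sliding windows stand in for the region proposal step, two hand-made statistics stand in for the 4096-dimensional AlexNet feature vector, and the linear weights are set by hand rather than learned by an SVM.

```python
from typing import List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x0, y0, x1, y1), upper bounds exclusive
Image = List[List[float]]        # grayscale image as rows of pixel values


def propose_regions(image: Image, size: int, stride: int) -> List[Box]:
    """Stage 1 stand-in: sliding windows instead of a real proposal method."""
    height, width = len(image), len(image[0])
    return [
        (x, y, x + size, y + size)
        for y in range(0, height - size + 1, stride)
        for x in range(0, width - size + 1, stride)
    ]


def extract_features(image: Image, box: Box) -> List[float]:
    """Stage 2 stand-in: [mean, value range] instead of a CNN feature vector."""
    x0, y0, x1, y1 = box
    pixels = [image[y][x] for y in range(y0, y1) for x in range(x0, x1)]
    return [sum(pixels) / len(pixels), max(pixels) - min(pixels)]


def classify(
    features: Sequence[float],
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
) -> Tuple[int, float]:
    """Stage 3: one linear scorer per class; return the best (class, score)."""
    scores = [
        sum(w * f for w, f in zip(class_weights, features)) + bias
        for class_weights, bias in zip(weights, biases)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]


def detect(
    image: Image,
    weights: Sequence[Sequence[float]],
    biases: Sequence[float],
    size: int = 2,
    stride: int = 2,
) -> List[Tuple[Box, int, float]]:
    """Run all three stages and keep boxes whose best score is positive."""
    detections = []
    for box in propose_regions(image, size, stride):
        label, score = classify(extract_features(image, box), weights, biases)
        if score > 0:
            detections.append((box, label, score))
    return detections


# Toy input: a bright 2x2 patch in the top-left corner of a 4x4 image.
toy_image = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
# A single hand-set class that fires on bright regions (assumed, not trained).
toy_detections = detect(toy_image, weights=[[1.0, 0.0]], biases=[-0.5])
```

Because each proposed box is scored independently, the number of detections is not fixed in advance, which is what lets this kind of pipeline report an arbitrary number of objects per image.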

Other stuff

Nice presentation here