Starting from weak supervision in the form of bounding box detection annotations, they proposed a new approach that does not require modifying the segmentation training procedure. The method consists of recursively training convnets for weakly supervised semantic labelling, where the ground truth (GT) at each round of training is the segmentation output produced by the model from the previous round.
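
The recursive scheme described above can be sketched as a short loop. This is only an illustrative outline, not the authors' code: `train`, `predict` and `postprocess` are hypothetical callables supplied by the caller, and `num_rounds` is an assumed parameter.

```python
from typing import Any, Callable, List, Sequence, Tuple


def recursive_training(
    images: Sequence[Any],
    initial_labels: Sequence[Any],
    train: Callable[[Sequence[Any], Sequence[Any]], Any],   # hypothetical: fits a model
    predict: Callable[[Any, Any], Any],                     # hypothetical: segments one image
    postprocess: Callable[[Any, Any], Any],                 # hypothetical: cleans a prediction
    num_rounds: int,
) -> Tuple[Any, List[Any]]:
    """Train for several rounds, feeding each round's output back as labels."""
    labels = list(initial_labels)  # round 1 uses labels derived from the boxes
    model = None
    for _ in range(num_rounds):
        # Train on the current (possibly noisy) labels.
        model = train(images, labels)
        # The predictions become the ground truth of the next round.
        # The initial labels are passed along so that a post-processing
        # step can compare each prediction against the original boxes.
        labels = [
            postprocess(predict(model, image), init)
            for image, init in zip(images, initial_labels)
        ]
    return model, labels
```

With `postprocess` set to an identity function such as `lambda pred, init: pred`, the loop corresponds to feeding the raw predictions straight back in, with no constraint from the boxes.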


They proposed four variants:

  1. a naive method that does not work (see Fig. 2)
  2. a box method, similar to naive, but which constrains the output at the end of each round as follows:
    • any pixel outside the box annotations is reset to background
    • when the predicted segment covers less than 50% of its initial bounding box, it is reset to the initial bounding box
    • the result is refined with a CRF.
  3. a box-i method, identical to box except that the pixels between the box boundary and the central region of the box are marked as ignored (cf. Fig. 3).
  4. a GrabCut/MCG method that gives a better initialization (cf. Fig. 3)
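
To make the box and box-i rules more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the `(x0, y0, x1, y1)` box format, the `IGNORE = 255` index and the `inner_scale` parameter are all assumptions made for illustration. The CRF refinement step is omitted, and overlapping boxes get no special handling.

```python
import numpy as np

BACKGROUND = 0
IGNORE = 255  # assumed ignore index, not taken from the summary


def box_mask(shape, box):
    """Boolean mask that is True inside box = (x0, y0, x1, y1), end-exclusive."""
    x0, y0, x1, y1 = box
    mask = np.zeros(shape, dtype=bool)
    mask[y0:y1, x0:x1] = True
    return mask


def constrain_to_boxes(pred, boxes, min_ratio=0.5):
    """Apply the first two 'box' rules to a predicted label map.

    pred  : (H, W) integer array of class ids.
    boxes : list of (class_id, (x0, y0, x1, y1)) annotations.
    """
    out = pred.copy()

    # Rule 1: anything outside every annotated box becomes background.
    inside_any = np.zeros(pred.shape, dtype=bool)
    for _, box in boxes:
        inside_any |= box_mask(pred.shape, box)
    out[~inside_any] = BACKGROUND

    # Rule 2: if the segment covers too little of its box, fall back to
    # labelling the whole box with its class.
    for class_id, box in boxes:
        mask = box_mask(pred.shape, box)
        covered = np.count_nonzero(out[mask] == class_id)
        if covered < min_ratio * np.count_nonzero(mask):
            out[mask] = class_id
    return out


def box_i_labels(shape, boxes, inner_scale=0.5):
    """Initial labels for a box-i style variant.

    The centred inner rectangle (each side scaled by the hypothetical
    inner_scale) gets the class label; the rest of the box is ignored.
    """
    labels = np.full(shape, BACKGROUND, dtype=np.uint8)
    for class_id, (x0, y0, x1, y1) in boxes:
        labels[y0:y1, x0:x1] = IGNORE
        w, h = x1 - x0, y1 - y0
        iw = max(1, int(round(w * inner_scale)))
        ih = max(1, int(round(h * inner_scale)))
        ix0 = x0 + (w - iw) // 2
        iy0 = y0 + (h - ih) // 2
        labels[iy0:iy0 + ih, ix0:ix0 + iw] = class_id
    return labels
```

A function like `constrain_to_boxes` would be applied to the network output at the end of each round, whereas `box_i_labels` only builds the labels used for the first round.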


They used the DeepLab network for segmentation.

Results on the Pascal VOC and COCO datasets are surprisingly good, given that only bounding box annotations are used for supervision.