Text detection is an important sub-domain of localization. Localization methods usually rely on anchors, but anchors are tricky to use for text detection because the ground-truth boxes are rotated and very small. Fig. 3 illustrates the problem: a standard RPN fails to detect small objects.

Model

To resolve this issue, the authors propose an anchor-free, multi-scale network based on FPN (fig. 2). The detection module is the same as in Faster R-CNN, except that it outputs 8 coordinates per box, enough to describe the four corners of a rotated quadrilateral.
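To make the 8-coordinate output concrete, here is a minimal sketch of how such a prediction could be read as a quadrilateral and compared with its enclosing axis-aligned box. The coordinate ordering (x1, y1, ..., x4, y4), the use of absolute pixel values, and the helper names `to_quad`, `quad_area` and `enclosing_box_area` are assumptions for illustration only; the summary does not specify how the authors encode the outputs.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def to_quad(coords: Sequence[float]) -> List[Point]:
    """Group 8 numbers into 4 (x, y) corners (assumed order: x1, y1, ..., x4, y4)."""
    if len(coords) != 8:
        raise ValueError("expected 8 coordinates per box")
    return [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]


def quad_area(quad: Sequence[Point]) -> float:
    """Area of a simple polygon via the shoelace formula."""
    total = 0.0
    for i, (x0, y0) in enumerate(quad):
        x1, y1 = quad[(i + 1) % len(quad)]
        total += x0 * y1 - x1 * y0
    return abs(total) / 2.0


def enclosing_box_area(quad: Sequence[Point]) -> float:
    """Area of the smallest axis-aligned box that contains the quadrilateral."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return (max(xs) - min(xs)) * (max(ys) - min(ys))


# A square rotated by 45 degrees: the axis-aligned box is twice as large
# as the quadrilateral itself, which is why 4-number boxes fit rotated text poorly.
rotated = to_quad([1, 0, 2, 1, 1, 2, 0, 1])
ratio = quad_area(rotated) / enclosing_box_area(rotated)
```

In this toy case only half of the axis-aligned box is covered by the object, which hints at why a quadrilateral output suits rotated text better.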

To train the network, the loss is computed against all labels at each of the 3 stages. The localization loss is Smooth L1, as in Faster R-CNN.
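The localization term can be sketched as follows. This is the standard Smooth L1 form (quadratic below 1, linear above), applied to the 8 coordinates of one box and then accumulated over the stages. The unweighted sum over coordinates and stages, and the function names, are assumptions made for illustration; the summary does not say how the authors weight or normalize the per-stage losses.

```python
from typing import Sequence


def smooth_l1(diff: float) -> float:
    """Smooth L1: 0.5 * d^2 if |d| < 1, else |d| - 0.5."""
    a = abs(diff)
    return 0.5 * a * a if a < 1.0 else a - 0.5


def quad_loc_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Sum of Smooth L1 over the 8 coordinates of one box."""
    if len(pred) != 8 or len(target) != 8:
        raise ValueError("expected 8 coordinates for prediction and target")
    return sum(smooth_l1(p - t) for p, t in zip(pred, target))


def multi_stage_loc_loss(
    preds_per_stage: Sequence[Sequence[float]], target: Sequence[float]
) -> float:
    """Assumed aggregation: every stage is compared with the same label, then summed."""
    return sum(quad_loc_loss(pred, target) for pred in preds_per_stage)


# One label, three stage predictions (toy values).
target = [0.0] * 8
stage_preds = [[0.5] * 8, [2.0] * 8, [0.0] * 8]
total = multi_stage_loc_loss(stage_preds, target)
```

The quadratic region keeps gradients small for nearly correct corners, while the linear region limits the influence of large errors.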

Results

The authors report strong results on ICDAR 2017, ICDAR 2015 and ICDAR 2013.