The main contribution of this paper is a network that performs segmentation and geolocalization of aerial images at the same time. As shown in Fig. 1, the multi-stage multi-task (MSMT) network has 2 stages and makes 3 predictions.

Stage 1

This part is first trained independently of the rest of the system. Its goal is to segment roads.

Stage 2

Once stage 1 is trained, its output feature maps are used to train 2 branches, one for geolocalization and one for building segmentation (which the authors call the localization map).

The segmentation loss is a combination of a cross-entropy term and a Dice term, while the geolocalization loss is a simple L2 regression.
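As a rough illustration of the two kinds of loss mentioned above, here is a minimal pure-Python sketch. It is not the authors' implementation: the binary (foreground/background) formulation, the `dice_weight` and `smooth` parameters, the mean-squared form of the L2 loss, and all function names are assumptions made for the example.

```python
import math


def cross_entropy(probs, targets, eps=1e-7):
    """Mean binary cross-entropy over a flat list of pixel probabilities."""
    total = 0.0
    for p, t in zip(probs, targets):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(probs)


def dice_loss(probs, targets, smooth=1.0):
    """Soft Dice loss: 1 - (2 * overlap) / (sum of both masks)."""
    intersection = sum(p * t for p, t in zip(probs, targets))
    denom = sum(probs) + sum(targets)
    return 1.0 - (2.0 * intersection + smooth) / (denom + smooth)


def segmentation_loss(probs, targets, dice_weight=1.0):
    """Cross-entropy plus weighted Dice (the weighting is an assumption)."""
    return cross_entropy(probs, targets) + dice_weight * dice_loss(probs, targets)


def l2_loss(pred, target):
    """Mean squared error between predicted and true coordinates."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


if __name__ == "__main__":
    mask = [1, 0, 1, 0]
    print(segmentation_loss([0.9, 0.1, 0.9, 0.1], mask))  # good prediction
    print(segmentation_loss([0.1, 0.9, 0.1, 0.9], mask))  # bad prediction
    print(l2_loss([1.0, 2.0], [1.0, 4.0]))
```

The cross-entropy term scores each pixel independently, while the Dice term measures overlap over the whole mask, which is commonly why the two are combined when the foreground class is sparse.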

Results

Results show that their method is more accurate and faster than previous methods. Here is an illustration of their aerial image localization dataset.