Rethinking the Faster R-CNN Architecture for Temporal Action Localization

TAL-Net is an extension of Faster R-CNN to perform action recognition on sequences. This work proposes two main contributions:

Variable receptive field
Late feature fusion

Variable receptive field

To better handle long sequences, they reuse the system of anchor but for different temporal spans. They have \(K\) 3D CNN for each temporal anchor. To get a matching dimension, they use a variable number of dilated 3D CNNs. At the end, there is a classification task on which anchor to select.

Late feature fusion

Most action recognition techniques use the RGB image and the optical flow as their input. Previous methods concatenated those streams where this method uses two distinct models where each model handles one stream before merging them.

Results

They test their method on THUMOS14 and test multiple temporal IoUs (tIoUs). On tIoU = 0.5, they are better than the previous SotA by 11.8%.