Monotonic Chunkwise Attention

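Attention is widely used in offline models such as sequence-to-sequence networks. A major drawback of the common soft attention algorithm is that every output step attends over the whole input sequence: the cost grows with the product of the input and output lengths, and decoding cannot start before the full input has been read, which makes it unusable for online tasks. This paper aims to bring attention to real-time tasks.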

Monotonic attention

In 2017, Raffel et al. proposed monotonic attention. At each output timestep, a reader scans the memory from left to right, starting from the previously attended node, and at each node either moves on or stops and attends it. The feature vector is then the single attended node. This enables the use of attention in real-time applications, but the performance fell short of soft attention.
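The scan described above can be sketched as follows. This is a minimal illustration of the inference-time behaviour only, not the authors' implementation. It assumes the energies are already given as one row per output step (in a real model they would be computed on the fly from the decoder state), that a node is attended when the sigmoid of its energy exceeds 0.5, and that the position stays unchanged when no node is selected. The function name `monotonic_attend` and its arguments are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def monotonic_attend(energy_rows, threshold=0.5):
    """Return the attended memory index for each output step (None if none).

    energy_rows[i][j] is the (assumed precomputed) energy between output
    step i and memory node j.
    """
    attended = []
    start = 0  # the scan resumes from the previously attended node
    for energies in energy_rows:
        choice = None
        for j in range(start, len(energies)):
            # Stop at the first node whose selection probability is high enough.
            if sigmoid(energies[j]) > threshold:
                choice = j
                break
        attended.append(choice)
        if choice is not None:
            start = choice  # never move backwards along the memory axis
    return attended


energy_rows = [
    [-1.0, 2.0, -1.0, -1.0],
    [-1.0, -3.0, -1.0, 4.0],
    [-1.0, -1.0, -1.0, -2.0],
]
print(monotonic_attend(energy_rows))  # [1, 3, None]
```

Because the scan only moves forward, each output step can be produced as soon as enough of the input has arrived, which is what makes the mechanism usable online.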

Monotonic chunkwise attention

The authors propose a simple way to improve monotonic attention: compute a soft attention over a small chunk of the memory at each timestep. Monotonic attention still selects the node to attend, but the feature vector is the weighted average of a fixed-size chunk ending at that node instead of the value of the node alone.
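A minimal sketch of the chunkwise step, assuming the attended index `t` has already been chosen by monotonic attention and that the chunk energies are precomputed. The names `chunkwise_context`, `chunk_energies` and `w` (the chunk width) are hypothetical, and the memory is represented as a plain list of vectors for illustration.

```python
import math


def chunkwise_context(memory, chunk_energies, t, w):
    """Softmax-weighted average of the w memory nodes ending at index t."""
    start = max(0, t - w + 1)  # the chunk is truncated at the start of memory
    scores = chunk_energies[start:t + 1]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(memory[0])
    context = [
        sum(wt * memory[start + k][d] for k, wt in enumerate(weights))
        for d in range(dim)
    ]
    return context, weights


memory = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
context, weights = chunkwise_context(memory, [0.0, 0.0, 0.0, 0.0], t=2, w=2)
print(context, weights)  # [0.5, 0.5] [0.5, 0.5]
```

With `w = 1` the chunk contains only the attended node, so the sketch falls back to plain monotonic attention.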

Experiments

Metrics

Word error rate (WER), lower is better
ROUGE-1, unigram overlap between prediction and ground truth, higher is better
ROUGE-2, bigram overlap between prediction and ground truth, higher is better
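For reference, the metrics above can be sketched as below. This is an illustrative implementation under simplifying assumptions, not the evaluation code used in the paper: text is tokenized by whitespace, the reference is non-empty, and ROUGE-N is computed in its recall form (reported scores are often an F-measure instead). The function names are hypothetical.

```python
from collections import Counter


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] is the edit distance between the reference words read so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


def rouge_n_recall(reference, prediction, n):
    """Fraction of reference n-grams that also appear in the prediction."""
    def ngrams(text):
        tokens = text.split()
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref, pred = ngrams(reference), ngrams(prediction)
    overlap = sum((ref & pred).values())  # clipped n-gram matches
    total = sum(ref.values())
    return overlap / total if total else 0.0


print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
print(rouge_n_recall("the cat sat on the mat", "the cat sat on a mat", 1))
print(rouge_n_recall("the cat sat on the mat", "the cat sat on a mat", 2))
```

In the first call one word is substituted and one is deleted, so the error rate is 2 out of 6 reference words.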

Notes:

Monotonic chunkwise attention is almost as good as offline methods
Large performance improvement compared to CTC methods

Talk

A great presentation from the author is available on YouTube.