Attention is widely used for offline models like Sequence-to-sequence. A huge drawback from the common soft attention algorithm is the huge time consumption, making it unusable for online tasks. This papers aims at bringing attention to real-time tasks.

Monotonic attention

In 2017, Raffael & al. proposed a monotonic attention. At each timestep, a reader could move or attend a node along the memory axis. The feature vector would then be all of the attended nodes. This enables the use of attention for real-time applications, but the performance were not good.

Monotonic chunkwise attention

The authors propose a simple solution to improve the monotonic attention. They compute a soft attention on a chunk at each timestep. They still use monotonic attention, but the feature vector is the weighted average of the chunk instead of the value of the node to attend.



  • Word error rate (WER), lower is better
  • ROUGE-1, overlap between prediction and groundtruth for 1-gram
  • ROUGE-2, overlap between prediction and groundtruth for bigram

Notes :

  1. Almost as good as offline methods
  2. Huge performance improvement compared to CTC methods


A great presentation from the author. Youtube