The paper's main contributions are:

  • Proposal of a densely-connected architecture applied to bi-directional LSTMs
  • Comparison against state-of-the-art methods for sentence classification on five datasets


The authors point out two main advantages of their approach:

  • Easy trainability, even of deep architectures.
  • Good parameter efficiency.

Proposed model

Figure: conceptual summary of the architecture

Figure: detailed comparison with the stacked RNN architecture
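To make the dense connectivity concrete, below is a minimal PyTorch sketch (the framework choice, pooling, and classifier head are my own assumptions, not necessarily the paper's exact model; the hyperparameter names nb_layers, nb_hidden, nb_last follow the experiments described later): every BiLSTM layer receives the concatenation of the word embeddings and the outputs of all previous layers, and only the last layer's output feeds the classifier.

    import torch
    import torch.nn as nn

    # Sketch of a densely-connected BiLSTM stack (illustrative, not the
    # authors' reference implementation).
    class DenseBiLSTM(nn.Module):
        def __init__(self, emb_dim, nb_hidden, nb_last, nb_layers, n_classes):
            super().__init__()
            self.layers = nn.ModuleList()
            in_dim = emb_dim
            for i in range(nb_layers):
                # the last layer has nb_last units, all others nb_hidden
                hidden = nb_last if i == nb_layers - 1 else nb_hidden
                self.layers.append(
                    nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)
                )
                # dense connectivity: the next layer also sees this output
                in_dim += 2 * hidden  # x2 for the two directions

            self.classifier = nn.Linear(2 * nb_last, n_classes)

        def forward(self, x):  # x: (batch, seq_len, emb_dim) word embeddings
            features = x
            for lstm in self.layers:
                out, _ = lstm(features)
                # concatenate along the feature dimension, as in DenseNet
                features = torch.cat([features, out], dim=-1)
            # max-pool the last layer's output over time, then classify
            pooled, _ = out.max(dim=1)
            return self.classifier(pooled)

In contrast, a plain stacked (Bi-)LSTM would pass only `out` to the next layer instead of the growing `features` tensor; that single line is the whole architectural difference.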



Datasets

MR - movie review data; positive / negative
SST-1 - extension of MR with more fine-grained labels
SST-2 - SST-1 with binary labels
Subj - sentences classified as subjective / objective
TREC - questions labeled with 6 classes (person, location, …)

c - number of classes
l - average sentence length
train / dev / test - size of the train / validation / test set (‘CV’ means 10-fold cross-validation)

Main experiment

They achieve state-of-the-art performance, outperforming simple (Bi-)LSTM and CNN approaches.

Further experiments

The authors conduct further experiments varying three of their hyperparameters:

  • nb_last - number of units in the last layer
  • nb_layers - number of stacked (Bi-)LSTM layers
  • nb_hidden - number of units in all layers except the last

Parameter efficiency: Increasing nb_layers while keeping the number of parameters constant might improve accuracy.

Increasing depth: Increasing nb_layers while keeping nb_last and nb_hidden constant improves accuracy.

Increasing width: Increasing nb_hidden while keeping nb_last and nb_layers constant improves accuracy.

Comment - In the last two settings the number of parameters also increases, so the improved accuracy could equally be explained by the larger model size. The conclusions about the effects of nb_layers and nb_hidden are therefore not entirely convincing.
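To see the confound numerically, one can compare parameter counts using the DenseBiLSTM sketch above (all concrete sizes here are illustrative assumptions, not the paper's settings):

    # Count the trainable parameters of the sketch above.
    def n_params(model):
        return sum(p.numel() for p in model.parameters())

    base   = DenseBiLSTM(emb_dim=300, nb_hidden=100, nb_last=100, nb_layers=3, n_classes=5)
    deeper = DenseBiLSTM(emb_dim=300, nb_hidden=100, nb_last=100, nb_layers=5, n_classes=5)  # more depth
    wider  = DenseBiLSTM(emb_dim=300, nb_hidden=200, nb_last=100, nb_layers=3, n_classes=5)  # more width

    # 'deeper' and 'wider' both have strictly more parameters than 'base',
    # so depth/width and model size are confounded in those two experiments.
    print(n_params(base), n_params(deeper), n_params(wider))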


Criticism

  • The authors claim that ‘the application of DenseNet to RNN’ in NLP was novel. However, the same idea was already applied to RNNs in Godin (2017).