On Multiplicative Integration with Recurrent Neural Networks
Main idea
Replace the RNN core additive block with a multiplicative block using the Hadamard product.
[Reminder] Hadamard product: \((A \odot B)_{ij} = a_{ij} \times b_{ij}\)
Core block of Vanilla RNN vs Proposed Multiplicative-Integration RNN (MI-RNN):
\(\phi(Wx_t + Uh_{t-1} + b) \;\rightarrow\; \phi(Wx_t \odot Uh_{t-1} + b)\)
In a more general form, where each matrix term gets its own bias vector:
\(\phi(Wx_t \odot Uh_{t-1} + \beta_1 \odot Uh_{t-1} + \beta_2 \odot Wx_t + b)\)
Finally, with an added “gate” \(\alpha\) on the second-order term:
\(\phi(\alpha \odot Wx_t \odot Uh_{t-1} + \beta_1 \odot Uh_{t-1} + \beta_2 \odot Wx_t + b)\)
(Setting \(\alpha = 0\) and \(\beta_1 = \beta_2 = \mathbf{1}\) recovers the original additive block)
NOTES:
- Number of parameters is about the same as for the additive block (only the extra vectors \(\alpha, \beta_1, \beta_2\) are added)
- The second-order term \(Wx_t \odot Uh_{t-1}\) shares its parameters \(W, U\) with the first-order terms
- Can be easily added to existing architectures (e.g. LSTM/GRU)
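To make the two blocks concrete, here is a minimal NumPy sketch of one step of the additive block and of the general MI block. The shapes, the uniform initialization range, and the function names (`vanilla_step`, `mi_step`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 8

# Illustrative parameters; sizes and init range are assumptions, not the paper's setup
W = rng.uniform(-0.1, 0.1, (n_hid, n_in))
U = rng.uniform(-0.1, 0.1, (n_hid, n_hid))
b = np.zeros(n_hid)
alpha = np.ones(n_hid)   # gate on the second-order term
beta1 = np.ones(n_hid)   # bias on U h_{t-1}
beta2 = np.ones(n_hid)   # bias on W x_t

def vanilla_step(x, h):
    # Additive block: phi(W x_t + U h_{t-1} + b)
    return np.tanh(W @ x + U @ h + b)

def mi_step(x, h):
    # General MI block: phi(alpha*(Wx)*(Uh) + beta1*(Uh) + beta2*(Wx) + b)
    wx, uh = W @ x, U @ h
    return np.tanh(alpha * wx * uh + beta1 * uh + beta2 * wx + b)

x_t = np.zeros(n_in); x_t[3] = 1.0   # one-hot input
h_prev = rng.standard_normal(n_hid)
print(vanilla_step(x_t, h_prev))
print(mi_step(x_t, h_prev))
```

Setting `alpha` to zeros (with `beta1 = beta2 = 1` as above) makes `mi_step` compute exactly the same pre-activation as `vanilla_step`.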
Gradients
Gradient for vanilla-RNN (\(\phi'_k\) = derivative of the nonlinearity at step \(k\)):
\(\frac{\partial h_n}{\partial h_1} = \prod_{k=2}^{n} U^{\top}\,\text{diag}(\phi'_k)\)
Gradient for MI-RNN:
\(\frac{\partial h_n}{\partial h_1} = \prod_{k=2}^{n} U^{\top}\,\text{diag}(Wx_k)\,\text{diag}(\phi'_k)\)
(This holds for the simple MI block; in the general form, \(\text{diag}(Wx_k)\) becomes \(\text{diag}(\alpha \odot Wx_k + \beta_1)\))
NOTES:
- Gradient is now “gated” by \(\text{diag}(Wx_k)\)
- Gradient propagation is easier to control: \(Wx_k\) changes at every time step, so the gating varies with the input rather than depending only on the fixed matrix \(U\) (see the backward-step sketch below)
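As a sanity check on the formulas above, here is a minimal NumPy sketch of one backward step through each block, i.e. one factor of the products \(U^{\top}\,\text{diag}(\phi'_k)\) and \(U^{\top}\,\text{diag}(\alpha \odot Wx_k + \beta_1)\,\text{diag}(\phi'_k)\). The parameter shapes, values, and function names are again arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 8
W = rng.uniform(-0.1, 0.1, (n_hid, n_in))
U = rng.uniform(-0.1, 0.1, (n_hid, n_hid))
b = np.zeros(n_hid)
alpha, beta1, beta2 = np.ones(n_hid), np.ones(n_hid), np.ones(n_hid)

def vanilla_backstep(x, h, grad_h):
    # One factor of the vanilla-RNN product: U^T diag(phi'_k) grad
    pre = W @ x + U @ h + b
    return U.T @ ((1.0 - np.tanh(pre) ** 2) * grad_h)

def mi_backstep(x, h, grad_h):
    # One factor of the MI-RNN product: U^T diag(alpha*Wx_k + beta1) diag(phi'_k) grad
    wx, uh = W @ x, U @ h
    pre = alpha * wx * uh + beta1 * uh + beta2 * wx + b
    return U.T @ ((alpha * wx + beta1) * (1.0 - np.tanh(pre) ** 2) * grad_h)
```

Chaining these back-steps over a sequence reproduces the products above; in the MI case the input-dependent factor \(\alpha \odot Wx_k + \beta_1\) rescales the gradient at every step.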
Experiments using the Penn Treebank (text) dataset
Activations problem
Activations over the validation set, using tanh as the nonlinearity
[Reminder] Tanh derivative: \(\tanh'(x) = 1 - \tanh^2(x)\)
- For saturated activations, \(\text{diag}(\phi'_k) \approx 0\) (no gradient flow)
- For non-saturated activations, \(\text{diag}(\phi'_k) \approx 1\) (see the quick numeric check below)
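A quick numeric illustration of the saturation effect (the sample inputs are arbitrary; only the tanh derivative above is assumed):

```python
import numpy as np

# tanh'(x) = 1 - tanh(x)^2: close to 1 near zero, collapsing toward 0 once tanh saturates
for x in (0.1, 1.0, 3.0, 6.0):
    print(f"x = {x:4.1f}   tanh(x) = {np.tanh(x):+.4f}   tanh'(x) = {1.0 - np.tanh(x)**2:.6f}")
```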
Scaling problem
- Pre-activation term: \(Wx_k + Uh_{k-1}\)
- For one-hot input vectors, \(Wx_k\) picks out a single column of \(W\) and is much smaller in magnitude than \(Uh_{k-1}\), so the initialization scale matters a lot (see top-left of Fig. 1, where \(r_w\) is the uniform initialization range, and the numeric sketch below)
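A small numeric sketch of this scale mismatch. The vocabulary size, hidden size, and uniform range `r_w` below are arbitrary choices, not the paper's exact setup; the point is only that for a one-hot \(x_k\), \(Wx_k\) is a single column of \(W\), while each entry of \(Uh_{k-1}\) sums over all hidden units.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_hid, r_w = 10000, 650, 0.1       # arbitrary sizes; r_w = uniform initialization range

W = rng.uniform(-r_w, r_w, (n_hid, vocab))
U = rng.uniform(-r_w, r_w, (n_hid, n_hid))

x = np.zeros(vocab); x[42] = 1.0          # one-hot input token
h = np.tanh(rng.standard_normal(n_hid))   # a typical bounded hidden state

print(np.linalg.norm(W @ x))   # a single column of W: small (~ r_w * sqrt(n_hid / 3))
print(np.linalg.norm(U @ h))   # each entry sums n_hid terms: much larger
```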