On Multiplicative Integration with Recurrent Neural Networks
Main idea
Replace the RNN core additive block with a multiplicative block using the Hadamard product.
[Reminder] Hadamard product: \((A \odot B)_{ij} = a_{ij} \times b_{ij}\)
Core block of Vanilla RNN vs Proposed Multiplicative-Integration RNN (MI-RNN):
\(\phi(Wx_t + Uh_{t-1} + b) \;\rightarrow\; \phi(Wx_t \odot Uh_{t-1} + b)\)
In a more general form, where each matrix term gets its own bias vector:
\(\phi(Wx_t \odot Uh_{t-1} + \beta_1 \odot Uh_{t-1} + \beta_2 \odot Wx_t + b)\)
Finally, with an added “gate” \(\alpha\) on the second-order term:
\(\phi(\alpha \odot Wx_t \odot Uh_{t-1} + \beta_1 \odot Uh_{t-1} + \beta_2 \odot Wx_t + b)\)
(Setting \(\alpha = 0\) and \(\beta_1 = \beta_2 = \mathbf{1}\) recovers the original additive block)
NOTES:
- Number of parameters is about the same as for the additive block (only the extra vectors \(\alpha, \beta_1, \beta_2\) are added)
- The second-order term \(Wx_t \odot Uh_{t-1}\) shares its parameters \(W, U\) with the first-order terms
- Can be easily added to existing architectures (e.g. LSTM/GRU)
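To make the two blocks concrete, here is a minimal NumPy sketch of one step of the additive block and of the general MI block. The shapes, the uniform initialization range, and the function names (`vanilla_step`, `mi_step`) are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 8

# Illustrative parameters; sizes and init range are assumptions, not the paper's setup
W = rng.uniform(-0.1, 0.1, (n_hid, n_in))
U = rng.uniform(-0.1, 0.1, (n_hid, n_hid))
b = np.zeros(n_hid)
alpha = np.ones(n_hid)   # gate on the second-order term
beta1 = np.ones(n_hid)   # bias on U h_{t-1}
beta2 = np.ones(n_hid)   # bias on W x_t

def vanilla_step(x, h):
    # Additive block: phi(W x_t + U h_{t-1} + b)
    return np.tanh(W @ x + U @ h + b)

def mi_step(x, h):
    # General MI block: phi(alpha*(Wx)*(Uh) + beta1*(Uh) + beta2*(Wx) + b)
    wx, uh = W @ x, U @ h
    return np.tanh(alpha * wx * uh + beta1 * uh + beta2 * wx + b)

x_t = np.zeros(n_in); x_t[3] = 1.0   # one-hot input
h_prev = rng.standard_normal(n_hid)
print(vanilla_step(x_t, h_prev))
print(mi_step(x_t, h_prev))
```

Setting `alpha` to zeros (with `beta1 = beta2 = 1` as above) makes `mi_step` compute exactly the same pre-activation as `vanilla_step`.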
Gradients
Gradient for vanilla-RNN (\(\phi'_k\) = derivative of the nonlinearity at step \(k\)):
\(\frac{\partial h_n}{\partial h_1} = \prod_{k=2}^{n} U^{\top}\,\text{diag}(\phi'_k)\)
Gradient for MI-RNN:
\(\frac{\partial h_n}{\partial h_1} = \prod_{k=2}^{n} U^{\top}\,\text{diag}(Wx_k)\,\text{diag}(\phi'_k)\)
(This holds for the simple MI block; in the general form, \(\text{diag}(Wx_k)\) becomes \(\text{diag}(\alpha \odot Wx_k + \beta_1)\))
NOTES:
- Gradient is now “gated” by \(\text{diag}(Wx_k)\)
- Gradient propagation is easier to control: \(Wx_k\) changes at every time step, so the gating varies with the input rather than depending only on the fixed matrix \(U\) (see the backward-step sketch below)
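As a sanity check on the formulas above, here is a minimal NumPy sketch of one backward step through each block, i.e. one factor of the products \(U^{\top}\,\text{diag}(\phi'_k)\) and \(U^{\top}\,\text{diag}(\alpha \odot Wx_k + \beta_1)\,\text{diag}(\phi'_k)\). The parameter shapes, values, and function names are again arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hid = 10, 8
W = rng.uniform(-0.1, 0.1, (n_hid, n_in))
U = rng.uniform(-0.1, 0.1, (n_hid, n_hid))
b = np.zeros(n_hid)
alpha, beta1, beta2 = np.ones(n_hid), np.ones(n_hid), np.ones(n_hid)

def vanilla_backstep(x, h, grad_h):
    # One factor of the vanilla-RNN product: U^T diag(phi'_k) grad
    pre = W @ x + U @ h + b
    return U.T @ ((1.0 - np.tanh(pre) ** 2) * grad_h)

def mi_backstep(x, h, grad_h):
    # One factor of the MI-RNN product: U^T diag(alpha*Wx_k + beta1) diag(phi'_k) grad
    wx, uh = W @ x, U @ h
    pre = alpha * wx * uh + beta1 * uh + beta2 * wx + b
    return U.T @ ((alpha * wx + beta1) * (1.0 - np.tanh(pre) ** 2) * grad_h)
```

Chaining these back-steps over a sequence reproduces the products above; in the MI case the input-dependent factor \(\alpha \odot Wx_k + \beta_1\) rescales the gradient at every step.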
Experiments using the Penn Treebank (text) dataset
Activations problem
Activations over the validation set, using tanh as the nonlinearity
[Reminder] Tanh derivative: \(\tanh'(x) = 1 - \tanh^2(x)\)
- For saturated activations, \(\text{diag}(\phi'_k) \approx 0\) (no gradient flow)
- For non-saturated activations, \(\text{diag}(\phi'_k) \approx 1\) (see the quick numeric check below)
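A quick numeric illustration of the saturation effect (the sample inputs are arbitrary; only the tanh derivative above is assumed):

```python
import numpy as np

# tanh'(x) = 1 - tanh(x)^2: close to 1 near zero, collapsing toward 0 once tanh saturates
for x in (0.1, 1.0, 3.0, 6.0):
    print(f"x = {x:4.1f}   tanh(x) = {np.tanh(x):+.4f}   tanh'(x) = {1.0 - np.tanh(x)**2:.6f}")
```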
Scaling problem
- Pre-activation term: \(Wx_k + Uh_{k-1}\)
- For one-hot input vectors, \(Wx_k\) picks out a single column of \(W\) and is much smaller in magnitude than \(Uh_{k-1}\), so the initialization scale matters a lot (see top-left of Fig. 1, where \(r_w\) is the uniform initialization range, and the numeric sketch below)
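A small numeric sketch of this scale mismatch. The vocabulary size, hidden size, and uniform range `r_w` below are arbitrary choices, not the paper's exact setup; the point is only that for a one-hot \(x_k\), \(Wx_k\) is a single column of \(W\), while each entry of \(Uh_{k-1}\) sums over all hidden units.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab, n_hid, r_w = 10000, 650, 0.1       # arbitrary sizes; r_w = uniform initialization range

W = rng.uniform(-r_w, r_w, (n_hid, vocab))
U = rng.uniform(-r_w, r_w, (n_hid, n_hid))

x = np.zeros(vocab); x[42] = 1.0          # one-hot input token
h = np.tanh(rng.standard_normal(n_hid))   # a typical bounded hidden state

print(np.linalg.norm(W @ x))   # a single column of W: small (~ r_w * sqrt(n_hid / 3))
print(np.linalg.norm(U @ h))   # each entry sums n_hid terms: much larger
```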