Summary

Goals

• Better weight efficiency
• Better conditioning of the recurrent weight matrix (prevent vanishing/exploding gradients)

Core idea

Factorize the NxN recurrent matrix using a Kronecker factorization. This allows some flexibility in the actual number of parameters through the size of the factors.

For example, using 2x2 factors there are $$O(\log N)$$ parameters, whereas a single NxN factor (the original RNN formulation) requires $$O(N^2)$$ parameters.
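As a minimal sketch of the idea (variable names are illustrative, not the paper's notation), the N x N matrix can be assembled from log2(N) small 2x2 factors, and the parameter count drops from N^2 to a multiple of log2(N):

```python
import numpy as np

# Build an N x N matrix as the Kronecker product of log2(N) 2x2 factors.
def kron_matrix(factors):
    """Left-fold Kronecker product of a list of 2x2 factors."""
    W = factors[0]
    for F in factors[1:]:
        W = np.kron(W, F)
    return W

N = 16                                  # N must be a power of 2 here
num_factors = int(np.log2(N))           # 4 factors
factors = [np.random.randn(2, 2) for _ in range(num_factors)]

W = kron_matrix(factors)
print(W.shape)            # (16, 16)
print(4 * num_factors)    # 16 real parameters, versus N^2 = 256 for a full matrix
```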

The vanishing and exploding gradient problem is tackled through a soft unitary constraint, which is easier to apply when using small factors.

NOTE: This soft constraint is not needed when using an LSTM, since the gating mechanism is already designed to prevent exploding/vanishing gradients.

Reminder: Unitary matrix

In real space, a unitary matrix is simply an orthogonal matrix, i.e. its transpose is also its inverse:

$Q^T Q = Q Q^T = I \Leftrightarrow Q^T = Q^{-1}$

In complex space, a matrix is unitary if its conjugate transpose is also its inverse:

$U^* U = UU^* = I$

(Remember, the conjugate transpose is the transpose matrix where the imaginary parts have their sign reversed.)
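A quick numerical check of this definition (a sketch; the unitary matrix here is obtained from a QR decomposition, whose Q factor is unitary by construction):

```python
import numpy as np

# Verify the unitarity condition U* U = U U* = I for a complex matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
U, _ = np.linalg.qr(A)          # Q factor of a QR decomposition is unitary

# Conjugate transpose: transpose, then flip the sign of the imaginary parts.
U_star = U.conj().T
print(np.allclose(U_star @ U, np.eye(4)))   # True
print(np.allclose(U @ U_star, np.eye(4)))   # True
```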

Factorization

If factors are 2 x 2, then there are $$\log_2 N$$ factors, the number of parameters is $$8 \log_2 N$$ (when using complex factors), and the time complexity of the hidden state computation is $$O(N \log_2 N)$$.
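The $$O(N \log_2 N)$$ cost comes from never materializing the full matrix: each 2x2 factor can be applied to the hidden state in O(N) via a reshape trick. A sketch of this matrix-vector product (assuming real 2x2 factors for simplicity):

```python
import numpy as np

# Multiply by the Kronecker-factorized matrix without forming it.
# Each 2x2 factor costs O(N), so the full product is O(N log N).
def kron_matvec(factors, x):
    """Compute (F_1 kron F_2 kron ... kron F_m) @ x for 2x2 factors."""
    n = x.size
    for F in factors:
        X = x.reshape(2, n // 2)    # expose the leading bit of the index
        x = (F @ X).T.reshape(n)    # apply the factor, then rotate axes
    return x

rng = np.random.default_rng(0)
factors = [rng.standard_normal((2, 2)) for _ in range(3)]   # N = 8
x = rng.standard_normal(8)

# Sanity check against the explicit Kronecker product.
W = np.kron(np.kron(factors[0], factors[1]), factors[2])
print(np.allclose(kron_matvec(factors, x), W @ x))   # True
```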

If factors are N x N, then there is a single N x N factor and the standard RNN is recovered.

Thus, there is a trade-off between computational budget and statistical performance.

Soft unitary constraint

A regularization term penalizing each factor's deviation from unitarity is simply added to the loss.

If all the factors are unitary, then the full matrix W is unitary; if all the factors are approximately unitary, then W is also approximately unitary. This new loss term comes with the usual strength hyperparameter, like any other regularization term.
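A sketch of such a penalty, assuming a squared Frobenius distance from unitarity for each factor scaled by a strength hyperparameter `lam` (the exact form used by the authors may differ):

```python
import numpy as np

# Soft unitary penalty: sum over factors of ||F* F - I||_F^2, scaled by lam.
def soft_unitary_penalty(factors, lam=1e-3):
    penalty = 0.0
    for F in factors:
        D = F.conj().T @ F - np.eye(F.shape[0])
        penalty += np.linalg.norm(D, 'fro') ** 2
    return lam * penalty

# A unitary factor contributes ~0; a scaled (non-unitary) one does not.
U = np.array([[0, 1], [1, 0]], dtype=complex)   # unitary (a permutation)
S = 2 * U                                       # not unitary
print(soft_unitary_penalty([U]))   # ~0.0
print(soft_unitary_penalty([S]))   # > 0
```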

Using complex matrices

Imposing a unitary constraint on a real matrix means that the feasible set is disconnected (the determinant is either +1 or -1). Moving to complex space makes the unitary set connected, which makes it easier to optimize over.
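This determinant argument can be checked numerically: a real orthogonal matrix has determinant exactly +1 or -1, while a complex unitary matrix can have any determinant on the unit circle (a small illustrative example):

```python
import numpy as np

# Real orthogonal: determinant is +1 or -1, nothing in between.
R = np.array([[0.0, 1.0], [1.0, 0.0]])   # a reflection
print(np.linalg.det(R))                  # -1.0

# Complex unitary: determinant can sit anywhere on the unit circle.
theta = 0.3
U = np.diag([np.exp(1j * theta), 1.0])   # unitary, det = e^{i*theta}
print(abs(np.linalg.det(U)))             # 1.0
```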

Experiments

NOTE: The authors implemented their own custom CUDA kernels in C++ for fast Kronecker operations (not available in Theano, TensorFlow or PyTorch)

Copy-memory problem (sequence to sequence)

Input : $$[x_0, x_1, ..., x_9, \text{blank} \times (T-1), \text{delimiter}, \text{blank} \times 10]$$

Target : $$[\text{blank} \times (T+10), x_0, x_1, ..., x_9]$$
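A sketch of a generator for this layout (symbol ids are illustrative: 0-7 for data, 8 for blank, 9 for the delimiter):

```python
import numpy as np

# Build one (input, target) pair for the copy-memory task described above.
def make_copy_example(T, rng):
    data = rng.integers(0, 8, size=10)   # x_0 ... x_9
    blank, delim = 8, 9
    x = np.concatenate([data, [blank] * (T - 1), [delim], [blank] * 10])
    y = np.concatenate([[blank] * (T + 10), data])
    return x, y

x, y = make_copy_example(T=20, rng=np.random.default_rng(0))
print(len(x), len(y))   # both T + 20 = 40
```

The model must hold the 10 data symbols in memory across T blank steps and emit them only after seeing the delimiter.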

NOTE: Expected cross-entropy for a memory-less strategy is 0.03 for T=1000 and 0.015 for T=2000