Code available here

Problem

Classifying videos with deep learning architectures is hard.

CNN only

Two options:

  1. Treat the whole clip as one big 3D volume; these volumes have variable sizes and might not fit into memory.
  2. Process each frame separately and fuse the per-frame results into a prediction.

CNN+RNN

Training a CNN+RNN end-to-end is impractical. In practice, two options have been explored:

  1. Focus on the CNN part and reduce the sequence length
  2. Focus on the RNN part and use a pre-trained CNN (no end-to-end training)

RNN only

Using only an RNN is also problematic, given the input dimensionality: e.g. frames of size \(160 \times 120 \times 3\) flatten to a \(57,600\)-dimensional input vector, so the input-to-hidden matrix alone has \(5,760,000\) parameters with a 100-unit hidden layer.

Summary

The paper focuses on RNN-only video classification using raw pixels. To reduce the input dimensionality, the input-to-hidden matrix is factorized with the Tensor-Train decomposition.

Tensor-Train Factorization

Step 1: Factorization

Each entry in the target tensor is represented as a sequence of matrix multiplications using elements selected from core-tensors.

Each index \(l_k\) selects a slice of the core-tensor \(\mathcal{G}_k\) along its first dimension (of size \(p_k\)), yielding an \(r_{k-1} \times r_k\) matrix. Note that since \(r_0 = r_d = 1\), the resulting product is always a scalar.
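Written out (this restates the factorization just described, with \(\mathcal{G}_k \in \mathbb{R}^{p_k \times r_{k-1} \times r_k}\); the exact notation may differ slightly from the paper):

\[
\mathcal{A}(l_1, l_2, \dots, l_d) = \mathcal{G}_1(l_1)\,\mathcal{G}_2(l_2)\cdots\mathcal{G}_d(l_d), \qquad \mathcal{G}_k(l_k) \in \mathbb{R}^{r_{k-1} \times r_k}, \quad r_0 = r_d = 1.
\]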

Visually, the paper illustrates this with a figure (omitted here).
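As a quick sanity check, here is a minimal NumPy sketch of reading a single entry out of a TT-factorized tensor; all shapes, ranks and names are illustrative choices, not taken from the paper.

```python
import numpy as np

# Illustrative TT format: core k has shape (p_k, r_{k-1}, r_k), with r_0 = r_d = 1.
d = 3
p = [4, 5, 6]          # mode sizes p_k
r = [1, 3, 3, 1]       # TT-ranks
cores = [np.random.randn(p[k], r[k], r[k + 1]) for k in range(d)]

def tt_entry(cores, indices):
    """A(l_1, ..., l_d) = G_1[l_1] @ G_2[l_2] @ ... @ G_d[l_d]."""
    result = np.eye(1)                 # 1x1 identity, since r_0 = 1
    for core, l in zip(cores, indices):
        result = result @ core[l]      # core[l] has shape (r_{k-1}, r_k)
    return result.item()               # r_d = 1, so the chain collapses to a scalar

print(tt_entry(cores, (2, 0, 4)))      # one entry of the implicitly defined 4x5x6 tensor
```

Storing the cores costs \(\sum_k p_k r_{k-1} r_k\) numbers instead of the \(\prod_k p_k\) entries of the full tensor, which is where the compression comes from.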

Step 2: Preliminary steps for Feed-Forward layer factorization

We add the constraint that each integer \(p_k\) can be factorized as \(p_k = m_k \cdot n_k \, \forall \, k \in [1,d]\). Then, we reshape \(\mathcal{G}_k\) into \(\mathcal{G}_k^* \in \mathbb{R}^{m_k \times n_k \times r_{k-1} \times r_k}\). Each index \(l_k\) can be uniquely represented with two indices \((i_k, j_k)\):
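One natural choice of bijection (my own convention, assuming 1-based indices; the paper may order the pair differently) is

\[
l_k = (i_k - 1)\, n_k + j_k, \qquad i_k \in \{1, \dots, m_k\}, \; j_k \in \{1, \dots, n_k\},
\]

so that the reshaped core satisfies \(\mathcal{G}_k^*(i_k, j_k) = \mathcal{G}_k(l_k) \in \mathbb{R}^{r_{k-1} \times r_k}\).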

Step 3: Feed-Forward Layer Factorization

The usual notation for a feed-forward layer is \(\hat{\mathbf{y}} = \mathbf{W}\mathbf{x} + \mathbf{b}\). The equation can be rewritten equivalently with scalars:
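With \(\mathbf{x} \in \mathbb{R}^{M}\) and \(\hat{\mathbf{y}} \in \mathbb{R}^{N}\), the scalar form reads (reconstructed here, so the index conventions may differ slightly from the paper's Eq. 6):

\[
\hat{y}(j) = \sum_{i=1}^{M} \mathbf{W}(j, i)\, x(i) + b(j), \qquad j = 1, \dots, N.
\]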

If both \(M\) and \(N\) can be factorized into integer arrays of length \(d\), i.e. \(M = \prod_{k=1}^d m_k\), \(N = \prod_{k=1}^d n_k\), then \(\mathbf{x}\) and \(\hat{\mathbf{y}}\) can be reshaped into tensors with the same number of dimensions: \(\mathcal{X} \in \mathbb{R}^{m_1 \times ... \times m_d}\), \(\mathcal{Y} \in \mathbb{R}^{n_1 \times ... \times n_d}\). The mapping function \(\mathbb{R}^{m_1 \times ... \times m_d} \rightarrow \mathbb{R}^{n_1 \times ... \times n_d}\) is rewritten as:
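In tensor form, with the bias reshaped to \(\mathcal{B} \in \mathbb{R}^{n_1 \times ... \times n_d}\) (again a reconstruction of the paper's Eq. 7, consistent with the reshapes above):

\[
\mathcal{Y}(j_1, \dots, j_d) = \sum_{i_1=1}^{m_1} \cdots \sum_{i_d=1}^{m_d} \mathcal{W}\big((i_1, j_1), \dots, (i_d, j_d)\big)\, \mathcal{X}(i_1, \dots, i_d) + \mathcal{B}(j_1, \dots, j_d).
\]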

Thus, the plain feed-forward layer (Eq. 6 in the paper) is just the special case \(d = 1\) of the tensorized mapping (Eq. 7). Finally, the weight tensor \(\mathcal{W}\) can be replaced by its TTF representation:
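Substituting the reshaped cores for \(\mathcal{W}\) gives (reconstructed, following the definitions above):

\[
\mathcal{Y}(j_1, \dots, j_d) = \sum_{i_1, \dots, i_d} \mathcal{G}_1^*(i_1, j_1)\, \mathcal{G}_2^*(i_2, j_2) \cdots \mathcal{G}_d^*(i_d, j_d)\, \mathcal{X}(i_1, \dots, i_d) + \mathcal{B}(j_1, \dots, j_d).
\]

The weights now take \(\sum_{k=1}^{d} m_k n_k r_{k-1} r_k\) parameters instead of \(M \cdot N\). Below is a minimal NumPy sketch of this forward pass; shapes, names and the row-major index ordering are my own illustrative choices, not the paper's implementation.

```python
import numpy as np

def tt_linear(cores, x, bias=None):
    """y = TT(W) x (+ bias), where cores[k] has shape (m_k, n_k, r_{k-1}, r_k)
    with r_0 = r_d = 1, and x is a flat vector of length prod(m_k)."""
    ms = [c.shape[0] for c in cores]
    t = x.reshape(ms + [1])                        # append the r_0 = 1 rank axis
    for core in cores:
        # contract the leading input mode m_k and the trailing rank axis r_{k-1}
        t = np.tensordot(t, core, axes=([0, -1], [0, 2]))
    y = t.reshape(-1)                              # flat vector of length prod(n_k)
    return y if bias is None else y + bias

# Illustrative sizes (not the paper's configuration):
m, n, r = [4, 8, 8], [3, 4, 5], [1, 2, 2, 1]
cores = [np.random.randn(m[k], n[k], r[k], r[k + 1]) for k in range(len(m))]
x = np.random.randn(int(np.prod(m)))
y = tt_linear(cores, x)                            # y.shape == (60,) == prod(n)

tt_params = sum(c.size for c in cores)             # 24 + 128 + 80 = 232
full_params = int(np.prod(m)) * int(np.prod(n))    # 256 * 60 = 15,360
```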

Step 4: Tensor-Train RNN

The LSTM has four gates, i.e. four input-to-hidden weight matrices, and each of them could be given its own TTF representation. However, as in most LSTM implementations, a concatenation trick can be used: all gates are computed from one big concatenated output tensor, which can be compressed even further.
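A rough sketch of the concatenation trick, reusing the hypothetical tt_linear helper from Step 3: the TT layer's output dimension is set to four times the hidden size and the result is split into the four gates. The gate equations below are the standard LSTM ones; the paper's exact formulation may differ in details.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tt_lstm_step(cores, U, b, x_t, h_prev, c_prev):
    """One LSTM step where the input-to-hidden map is a single TT layer producing
    all four gate pre-activations at once (prod(n_k) == 4 * hidden).
    `tt_linear` is the sketch from Step 3; U is a dense (4*hidden, hidden) matrix."""
    pre = tt_linear(cores, x_t) + U @ h_prev + b    # shape: (4 * hidden,)
    i, f, o, g = np.split(pre, 4)                   # input, forget, output, candidate
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_t = f * c_prev + i * g                        # new cell state
    h_t = o * np.tanh(c_t)                          # new hidden state
    return h_t, c_t
```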

Experiments

UCF11 (YouTube Action Dataset)

  • 1600 video clips
  • 11 classes
  • \(320 \times 240\) downsampled to \(160 \times 120\), at 24 fps
  • Sequence length varies between 204 and 1492 frames

Baselines:

  • (Liu et al., 2013): SIFT/STIP + Harris3D Corner Detector + Optical Flow + GMM w/ EM
  • (Hasan & Roy-Chowdhury, 2014): Ensemble of multi-class SVMs
  • (Sharma et al., 2015): Pre-trained GoogLeNet CNN + 3-layer stacked LSTM + Attention

Training times: 8-10 days for plain GRU/LSTM vs. 2 days for the TT versions.

Hollywood2 Data

  • 69 movies (33 training, 36 test) -> 823 training clips, 884 test clips
  • 12 classes, multi-label
  • \(234 \times 100\), at 12 fps
  • Sequence length varies between 29 and 1079 frames

Baselines:

  • (Sharma et al., 2015): Pre-trained GoogLeNet CNN + 3-layer stacked LSTM + Attention
  • (Fernando et al., 2015): Frame features + Frame ordering -> Learned representation -> SVM classifier
  • (Fernando & Gould, 2016): Frame-wise CNN + “Rank pooling” + Softmax