Highlights

The authors propose the data-selection method ‘Iterative Projection and Matching’ which is computationally efficient and features a high ‘selection accuracy’.

The approach is evaluated in different experiments on active learning, learning using representatives, training of a GAN or video summarization.

Introduction

Data selection, i.e. choosing $$K$$ representatives from a set of $$N$$ samples, can become a non-trivial problem when $$N$$ gets large. There are two main approaches of convex relaxation:

• D-optimal solutions (stochastic: volume sampling (VS))
• A-optimal solutions.

• lack of guarantee that ‘un-selected samples are well-covered by the selected ones
• outliers are selected with a high probability’ due to favouring diversity of samples.

An alternative are ‘representative approaches’ which have the disadvantages that

• they have a high time complexity, and
• rely on proper fine-tuning of parameters.

The authors propose ‘Iterative Projection and Matching’ (IPM) to address these issues. Their particular contributions are:

• linear time complexity
• no parameters
• theoretical underpinning of properties of IPM.

Methods

Let $$\mathbf{a}_1, \ldots, \mathbf{a}_M \in \mathbb{R}^N$$ be $$M$$ given data points of dimension $$N$$. We define an $$M \times N$$ matrix, $$\mathbf{A}$$, such that $$\mathbf{a}_m^T$$ is the $$m^th$$ row of $$\mathbf{A}$$, for $$m = 1, \ldots M.$$ The goal is to reduce this matrix into a $$K \times N$$ matrix, $$\mathbf{A}_R,$$ based on an optimal metric.

Representative selection

The central problem is formulated as

$\arg\min_{|\mathbb{T}| = K} \|\mathbf{A} - \pi_\mathbb{T}(\mathbf{A})\|_F^2,$

where $$\mathbb{T} \subset \{1, \ldots, M\}$$, $$\pi_\mathbb{T}(\mathbf{A})$$ is the matrix of rows of $$\mathbf{A}$$ projected onto the span of selected rows indexed by $$\mathbb{T}$$, and $$\|\cdot\|_F$$ denotes the Frobenius norm.

IPM

The problem can be reformulated in terms of

$\arg\min_{\mathbf{U},\mathbf{V}} \| \mathbf{A} - \mathbf{UV}^T \|_F^2 \; \mathrm{s.t.} \; \mathbf{v}_k \in \mathbb{A},$

where $$\mathbb{A}$$ contains the normalised rows of $$\mathbf{A}$$, and $$\mathbf{v}_k$$ is the $$k$$-th column of $$\mathbf{V}$$.

This is further simplified to

$(\mathbf{u}, \mathbf{v}) = \arg\min_{\mathbf{u}, \mathbf{v}} \| \mathbf{A} - \mathbf{uv}^T \|_F^2 \; \mathrm{s.t.} \; \|\mathbf{v}\| = 1,$ $m^{(1)} = \arg\max_m |\mathbf{v}^T\tilde{\mathbf{a}}_m|,$

where $$\tilde{\mathbf{a}}_m$$ is the normalised $$m$$-th row of $$\mathbf{A}$$.

Properties of IPM

Lower bound on maximum correlation

The authors show theoretically, that the lower bound for the maximum correlation between $$\mathbf{v}$$ and a data point $$\mathbf{a}_i$$ is given by the ‘rank-oneness measure $$\mathrm{ROM}(\mathbf{A})$$’. (See paper for details.) This ensures that they always find a data point which is sufficiently similar to the right singular vector $$\mathbf{v}$$ of $$\mathbf{A}$$.

Robustness to outliers

In another analysis, it is shown that the first singular vector $$\mathbf{v}$$ of $$\mathbf{A}$$ ‘is the most robust spectral component against changes in the data.’ (See paper for details.)

Results

Active learning

Fine-tuning a 3D ResNet18 (pretrained on Kinetics-400) on UCF-101 (human action dataset) the authors compare different data selection methods in a classification experiment.

At the first active learning cylce, the training set consists of one sample per class in the dataset.

At each following cycle, ‘one sample per class is selected, without the knowledge of the labels, and added to the training set’. In each cycle, ‘the fully connected layer […] is fine-tuned for 60 epochs.

The authors draw the following conclusions:

• in early cycles, uncertainty-based methods perform worse than random selection,
• IPM is outperforming all other methods in early cycles,
• adding uncertainty by $$m^\ast = \arg\max_m \alpha |\mathbf{v}^T\tilde{\mathbf{a}}_m| + (1 - \alpha) q(\mathbf{a}_m)$$ letting alpha decay with rate 0.95 (from $$\alpha = 1$$ in the first cycle) is allowing for best performance.

Learning using representatives

The strategy here is to select a subset of representative samples from each class of a dataset and train only on the reduced dataset.

The ‘CMU Muli-PIE Face Database’ (249 subjects, 13 poses, 20 illuminations, 2 expressions) is reduced to 10 images per subject using different selection methods.

IPM slects the best representatives in terms of different view angles.

Also, the results of a GAN for multi-view generation from single-view input images are best using IPM compared to other methods.

Visualising efficiency of sampled representatives

Using the 3D ResNet18 classifier pretrained on Kinetics-400, the output features of the last convolutional layer are being used to select representatives from UCF-101. Using t-SNE visualization, the authors show that IPM enables the fine-tuned classifier to learn the separation between two random classes better compared to other selection methods.

Other experiments

Further results on

• finding representatives for ImageNet, and
• video summarization are shown. (See paper for details.)

Conclusions

The authors could show that their approach for data selection is

• simple (and therefore easily extensible)
• time-efficient, and
• data-efficient.

In their experiments, the approach outperforms the competitors in terms of

• performance of the trained networks, and
• performance of data selection (computation cost, representativity of the dataset).