This paper proposes a solution for small data sets, on which models tend to overfit. The authors employ metric learning, which helps the network learn rapidly from a small data set. They apply this idea to a small subset of ImageNet and improve accuracy from 87.6% to 93.2%.

They build on the advantages of one-shot learning, but for prediction on the target they learn a metric that measures how similar an unseen sample is to the set of seen samples in an episode. This similarity can be expressed as a weight on the classes. The metric plays the same role as attention or a kernel function. The authors use the cosine distance between two vectors as the metric.

Suppose \((x_{i},y_{i})\) is a given (input, class) pair and \(\bar{x}\) is a new input. The predicted label \(\bar{y}\) is calculated as:

\[\bar{y} = \sum_{i=1}^k a(\bar{x},x_i) y_{i}\]
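The weighted sum above can be illustrated with a minimal NumPy sketch. It assumes the attention weights \(a(\bar{x},x_i)\) are already computed and sum to one, and that the labels \(y_i\) are encoded as one-hot vectors. The function name `predict_label_distribution` and the toy numbers are hypothetical, chosen only for illustration.

```python
import numpy as np

def predict_label_distribution(attention, support_labels, num_classes):
    """Combine support labels using attention weights.

    attention:      shape (k,), weights a(x_bar, x_i), assumed to sum to 1.
    support_labels: shape (k,), integer class ids y_i.
    Returns y_bar, a vector of length num_classes.
    """
    attention = np.asarray(attention, dtype=float)
    # One-hot encode the labels: shape (k, num_classes).
    one_hot = np.eye(num_classes)[np.asarray(support_labels)]
    # y_bar = sum_i a(x_bar, x_i) * y_i
    return attention @ one_hot

# Hypothetical support set with k = 3 examples over 2 classes.
y_bar = predict_label_distribution([0.7, 0.2, 0.1], [0, 1, 0], num_classes=2)
predicted_class = int(np.argmax(y_bar))
```

Because the weights sum to one, \(\bar{y}\) can be read as a distribution over the classes present in the support set, and the predicted class is the one with the largest accumulated weight.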

A small support set consists of \(k\) labeled examples with input \(x\) and label \(y\), \(S= \{(x_{i},y_{i})\}_{i=1}^k\). Given a new example \(\bar{x}\), we want to know the probability that it belongs to a given class, \(P(\bar{y}\mid\bar{x},S)\), where \(P\) is parametrized by a neural network. The support set \(S\) therefore maps to a classifier \(c_{S}(\bar{x})\). For training, suppose a task \(T\) is a distribution over label sets \(L\).

First, a label set \(L\) is sampled from task \(T\). Then, from these sampled labels, the support set \(S\) and a batch \(B\) are sampled. The matching network is trained to minimize the error in predicting the labels in batch \(B\) conditioned on the support set \(S\).
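The sampling procedure can be sketched in plain Python as follows. This is an illustrative sketch, not the authors' code: the function `sample_episode`, its parameters (`n_way`, `k_shot`, `batch_size`), and the toy data set are all assumptions, and the support set and batch are drawn without overlap here as a design choice.

```python
import random

def sample_episode(data_by_label, n_way, k_shot, batch_size, rng=random):
    """Sample one training episode.

    data_by_label: dict mapping each label to a list of examples.
    Returns (labels, support, batch), where support and batch are
    lists of (example, label) pairs.
    """
    # Sample the label set L from the task.
    labels = rng.sample(list(data_by_label), n_way)
    support, pool = [], []
    for label in labels:
        examples = list(data_by_label[label])
        rng.shuffle(examples)
        # The first k_shot examples per label form the support set S.
        support += [(x, label) for x in examples[:k_shot]]
        # The remaining examples are candidates for the batch B.
        pool += [(x, label) for x in examples[k_shot:]]
    batch = rng.sample(pool, min(batch_size, len(pool)))
    return labels, support, batch

# Hypothetical toy data set: 5 labels with 10 examples each.
toy_data = {c: [f"{c}_{i}" for i in range(10)] for c in "abcde"}
labels, support, batch = sample_episode(
    toy_data, n_way=3, k_shot=2, batch_size=4, rng=random.Random(0)
)
```

During training, the model would predict the label of each example in `batch` using only `support`, and the prediction loss on the batch would be backpropagated.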

The similarity metric relies on two embedding functions, \(f\) and \(g\), so that similarity can be measured in the feature space X. The function \(g\) embeds each \(x_{i}\) independently of the other elements, whereas \(S\) can affect how \(\bar{x}\) is embedded through the function \(f\). An attention kernel then computes the cosine distance between these embeddings (similar to a nearest-neighbor classifier).
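A minimal sketch of such an attention kernel is given below. It assumes that \(f(\bar{x})\) and \(g(x_i)\) have already been computed and are passed in as plain arrays. Normalizing the cosine similarities with a softmax is an assumption that goes beyond what is stated above, and the name `cosine_attention` and the toy vectors are hypothetical.

```python
import numpy as np

def cosine_attention(query_emb, support_embs, eps=1e-8):
    """Attention weights from cosine similarity.

    query_emb:    shape (d,), stands in for f(x_bar).
    support_embs: shape (k, d), row i stands in for g(x_i).
    Returns shape (k,) weights that are positive and sum to 1.
    """
    query_emb = np.asarray(query_emb, dtype=float)
    support_embs = np.asarray(support_embs, dtype=float)
    # Normalize to unit length so the dot product is the cosine similarity.
    q = query_emb / (np.linalg.norm(query_emb) + eps)
    s = support_embs / (np.linalg.norm(support_embs, axis=1, keepdims=True) + eps)
    sims = s @ q
    # Softmax over the support set (shifted by the max for numerical stability).
    exp = np.exp(sims - sims.max())
    return exp / exp.sum()

# Hypothetical 2-D embeddings: the first support item points the same way as the query.
weights = cosine_attention([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
```

The support example whose embedding is best aligned with the query receives the largest weight, which is what gives the kernel its nearest-neighbor flavor while keeping it differentiable.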

The embedding of \(x_{i}\) is a neural network (VGG) followed by a Bi-LSTM. The function \(f\) is also a neural network (VGG), and the embedding function \(g\) is applied to each element \(x_{i}\) to compute the kernel over the support set \(S\). Note that \(\bar{y}\) is a linear combination of the labels in the support set.