Learning from Simulated and Unsupervised Images through Adversarial Training
Summary
The refiner \(R_{\theta}\) is a fully convolutional network without striding or pooling.
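A minimal PyTorch sketch of such a refiner is below. The 64-channel width, four ResNet blocks, and 3×3 kernels follow the paper's description; the single input channel (for grayscale/depth images) and the final Tanh activation are assumptions.

```python
import torch
import torch.nn as nn

class ResnetBlock(nn.Module):
    """Two 3x3 convolutions with a residual connection; no striding
    or pooling, so spatial resolution is preserved."""
    def __init__(self, channels=64):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(x + self.block(x))

class Refiner(nn.Module):
    """Fully convolutional refiner R_theta: the output has the same
    spatial size as the synthetic input."""
    def __init__(self, in_channels=1, channels=64, n_blocks=4):
        super().__init__()
        layers = [nn.Conv2d(in_channels, channels, kernel_size=3, padding=1),
                  nn.ReLU(inplace=True)]
        layers += [ResnetBlock(channels) for _ in range(n_blocks)]
        layers += [nn.Conv2d(channels, in_channels, kernel_size=1),
                   nn.Tanh()]  # output activation is an assumption
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)
```

Because there is no striding or pooling, \(R_{\theta}(x)\) keeps the input resolution, so pixel-level annotations from the simulator stay spatially aligned with the refined image.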
In addition to the usual adversarial loss (\(l_{real}\)), a self-regularization loss (\(l_{reg}\)) is used to preserve the “annotation” information from the simulator, so that labels such as gaze direction remain valid for the refined images. The refiner is trained on the combined objective \(\mathcal{L}_R(\theta) = \sum_i l_{real}(\theta; x_i) + \lambda\, l_{reg}(\theta; x_i)\), where \(l_{reg} = \lVert \psi(R_{\theta}(x)) - \psi(x) \rVert_1\).
The function \(\psi\) in the regularization term maps images to a feature space. It is usually the identity, but the authors also consider other feature transforms, such as the mean of the color channels or the output of a convnet.
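As a rough sketch, with \(\psi\) as the identity the combined refiner loss can be computed as follows. The discriminator convention (it outputs the probability that an image is synthetic) follows the paper; the weight `lam` and the epsilon are placeholder assumptions.

```python
import torch

def refiner_loss(refined, synthetic, d_refined, lam=0.1, eps=1e-8):
    """Combined refiner objective: l_real + lambda * l_reg.

    refined:   R_theta(x), the refined synthetic images
    synthetic: x, the raw simulator images
    d_refined: discriminator output D(R_theta(x)) = P(image is synthetic)
    lam:       regularization weight (placeholder value)
    """
    # Adversarial term: the refiner wants D to call its output "real".
    l_real = -torch.log(1.0 - d_refined + eps).mean()
    # Self-regularization with psi = identity: per-pixel L1 distance,
    # which keeps the refined image close to the annotated simulator frame.
    l_reg = torch.abs(refined - synthetic).mean()
    return l_real + lam * l_reg
```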
History of refined images for discriminator training
The training set for each discriminator update is built from 50% real images, 25% refined images produced by the latest refiner, and 25% refined images produced by past versions of the refiner. Mixing in this history of generated images improves the stability of adversarial training. The authors note that this method is complementary to keeping a running average of the model parameters.
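A simple version of this history buffer could look like the following; the buffer capacity, eviction policy, and list-based storage are assumptions, not the paper's exact implementation.

```python
import random

class ImageHistoryBuffer:
    """Buffer of refined images from past refiner versions, sampled
    alongside current refined images to stabilize discriminator training."""
    def __init__(self, max_size=512):  # capacity is an assumption
        self.max_size = max_size
        self.images = []

    def sample_and_update(self, current_batch):
        """Return half current / half historical images, then store the
        current batch (replacing random old entries once full)."""
        half = len(current_batch) // 2
        if len(self.images) >= half:
            history = random.sample(self.images, half)
            mixed = current_batch[:half] + history
        else:
            mixed = list(current_batch)  # buffer still warming up
        # Insert the new images, evicting random old ones at capacity.
        for img in current_batch:
            if len(self.images) < self.max_size:
                self.images.append(img)
            else:
                self.images[random.randrange(self.max_size)] = img
        return mixed
```

At each discriminator step, `sample_and_update` returns a batch that is half current and half historical refined images; combined with an equal number of real images, this yields the 50/25/25 split described above.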
Experiments and Results
Datasets:
- Appearance-based gaze estimation on MPIIGaze dataset
- Hand pose estimation on NYU hand pose dataset of depth images
Visual Turing test
A “Visual Turing test” was run in which human subjects classified images as real vs. refined. Subjects were correct only 51.7% of the time, barely above the 50% chance level, showing that refined images are nearly indistinguishable from real ones.
Training on refined synthetic images outperforms training on purely synthetic images, a 22.3% relative improvement on the gaze estimation task.
Comparison to other methods