Binarized Neural Networks
Summary
BNNs achieve nearly state-of-the-art results on the MNIST, CIFAR-10 and SVHN datasets. We wrote a binary matrix multiplication GPU kernel with which it is possible to run our MNIST BNN 7 times faster than with an unoptimized GPU kernel, without suffering any loss in classification accuracy.
Goals:
- Binarize weights and activations.
- Be more efficient: most of the 32-bit floating-point multiply-accumulations are replaced by 1-bit XNOR-count operations (see the sketch after this list).
- Use less memory: in comparison with 32-bit DNNs, BNNs require 32 times less memory and 32 times fewer memory accesses.
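To make the XNOR-count idea concrete, here is a minimal CPU sketch (not the paper's CUDA kernel): ±1 vectors are packed into bits, so a dot product becomes an XNOR followed by a popcount. The helper names (`pack_pm1`, `binary_dot`) are mine, purely for illustration.

```python
import numpy as np

def pack_pm1(v):
    """Pack a {-1, +1} vector into uint8 words; bit = 1 encodes +1."""
    return np.packbits(v > 0)

def binary_dot(a, b):
    """Dot product of two {-1, +1} vectors via XNOR + popcount.
    Positions where the bits agree contribute +1, mismatches contribute -1,
    so dot = 2 * popcount(XNOR) - n."""
    n = len(a)
    xnor = ~(pack_pm1(a) ^ pack_pm1(b))           # 1-bit XNOR, 8 lanes per byte
    matches = int(np.unpackbits(xnor)[:n].sum())  # popcount, padding bits dropped
    return 2 * matches - n

# Sanity check against an ordinary floating-point dot product.
rng = np.random.default_rng(0)
a = rng.choice([-1, 1], size=100)
b = rng.choice([-1, 1], size=100)
assert binary_dot(a, b) == int(a @ b)
```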
Deterministic vs. stochastic binarization
Deterministic:
\[x^b = \text{sign}(x) = \begin{cases} +1 & \text{if } x \ge 0, \\ -1 & \text{otherwise} \end{cases}\]
Stochastic:
\[x^b = \begin{cases} +1 & \text{with probability } p = \sigma(x), \\ -1 & \text{with probability } 1 - p \end{cases}\]
\[\text{where } \sigma(x) = \text{clip}\left(\frac{x+1}{2}, 0, 1\right) = \max\left(0, \min\left(1, \frac{x+1}{2}\right)\right)\]
Note: Stochastic binarization is more appealing than deterministic binarization, but it is harder to implement since it requires generating random bits when quantizing.
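As a quick illustration, here is a NumPy sketch of the two binarization functions above (function names are mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize_deterministic(x):
    """x^b = sign(x): +1 if x >= 0, -1 otherwise."""
    return np.where(x >= 0, 1.0, -1.0)

def binarize_stochastic(x):
    """x^b = +1 with probability sigma(x) = clip((x + 1) / 2, 0, 1), else -1."""
    p = np.clip((x + 1) / 2, 0, 1)
    return np.where(rng.random(x.shape) < p, 1.0, -1.0)
```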
Training and gradients

Real-valued gradients of the weights and activations are stored during training, since SGD needs this precision to work: each update is small and noisy, and the updates only average out over many steps.
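A sketch of what this implies for the weight update, following the paper's training algorithm: the gradient is accumulated in real-valued weights, which are clipped to [-1, 1] so they cannot drift without affecting their sign (only the sign is used in the forward pass). The function name is mine.

```python
import numpy as np

def sgd_weight_step(w_real, grad, lr=0.01):
    """Accumulate the real-valued gradient in the real-valued weights,
    then clip to [-1, 1]; only sign(w_real) is used in the forward pass."""
    return np.clip(w_real - lr * grad, -1.0, 1.0)
```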

The derivative of the sign function is zero almost everywhere, so the gradient cannot be used as is. Instead, a straight-through estimator is used, which corresponds to computing the gradient of the hard tanh: \(\text{Htanh}(x) = \text{clip}(x, -1, 1)\)
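In code, the straight-through estimator amounts to using the identity gradient inside [-1, 1] and zero outside, i.e. the derivative of Htanh. A minimal NumPy sketch (function names are mine):

```python
import numpy as np

def sign_forward(x):
    """Forward pass: deterministic binarization."""
    return np.where(x >= 0, 1.0, -1.0)

def sign_backward_ste(x, grad_out):
    """Straight-through estimator: pretend the forward pass was
    Htanh(x) = clip(x, -1, 1), whose derivative is 1 for |x| <= 1, else 0."""
    return grad_out * (np.abs(x) <= 1.0)
```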

For better efficiency, shift-based BatchNorm is used, an approximation of BatchNorm that replaces almost all multiplications with bit shifts.
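The shift-based operations are built on the paper's approximate power-of-two operator, \(\text{AP2}(x) = \text{sign}(x) \cdot 2^{\text{round}(\log_2 |x|)}\), which turns a multiplication into a bit shift. A NumPy sketch of that operator (the eps guard against log2(0) is my addition):

```python
import numpy as np

def ap2(x, eps=1e-12):
    """Round |x| to the nearest power of two (in log space), keeping the sign.
    Multiplying by ap2(x) can then be implemented as a bit shift."""
    return np.sign(x) * 2.0 ** np.round(np.log2(np.abs(x) + eps))
```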
Experiments
- MNIST
- CIFAR-10
- SVHN
BNNs take longer to train, but are nearly as accurate.
The new binary matrix multiplication kernel runs 7 times faster on GPU than an unoptimized kernel.