The authors propose a network compression method that combines pruning and quantization.

• They use unstructured pruning, and do not care about acceleration; they only measure the storage requirements of the weight tensors.
• They use a standard sparse coding scheme to acheive practical space savings after quantizing.
• They show strong results.

Method

1. Clipping: Set two thresholds $$c^-$$ and $$c^+$$ so that a fraction $$p$$ of all weights are between these thresholds. For the next forward pass, these weights will be set to zero. In the next forward pass, they may be non-zero. $$p$$ is constant, but $$c^-$$ and $$c^+$$ change.

2. Partitionning: Divide the non-clipped portion of the axis of weight values into $$2^b-1$$ intervals, where $$b$$ is the bits budget.

3. Quantizing: Quantize the weights to the average of their interval (for the next forward pass only).

Setting hyperparameters

Prior to running Clip-Q, we need to find $$p$$ and $$b$$. The authors use Bayesian optimization; more specifically, they use a Gaussian Process. They use this method for each layer, and find $$\theta_i = (p_i, b_i)$$ for each layer $$i$$. The optimze the following objective:

$\mathrm{min}_{\theta} \quad \epsilon(\theta) - \lambda * c_i(\theta)$

where $$\epsilon(\theta)$$ is the error on a subset of the training set and $$c_i(\theta)$$ is a measure of compression benefit (economy of bits). See section 3.2 of the paper for more details.