G-SGD: Optimizing ReLU Neural Networks in its Positively Scale-Invariant Space
Highlights
- ReLU neural networks are optimized in a space that is invariant under positive rescaling. This space is also sufficient to represent ReLU networks.
- This is achieved by exploiting dependencies between the values of all the paths from the input layer to the output layer.
- The method only works for MLPs and CNNs with the same number of neurons per hidden layer.
Introduction
Let \(N_w(x): \mathcal{X} \to \mathcal{Y}\) denote an \(L\)-layer multilayer perceptron with weights \(w \in \mathcal{W} \subset \mathbb{R}^m\), input space \(\mathcal{X} \subset \mathbb{R}^d\) and output space \(\mathcal{Y} \subset \mathbb{R}^K\). The \(l\)-th layer has \(h_l\) neurons. We denote the \(i_l\)-th node by \(O_{i_l}^l\) and its value on a forward pass by \(o_{i_l}^l\). The weight matrix between layers \(l-1\) and \(l\) is denoted \(w^l\), and the weight connecting neurons \(O_{i_{l-1}}^{l-1}\) and \(O_{i_l}^l\) is denoted \(w^l(i_{l-1},i_{l})\). The path from the input node \(O_{i_0}^0\) to the output node \(O_{i_L}^L\) passing through the nodes \(O_{i_0}^0,\dots,O_{i_L}^L\) is denoted by \((i_0,\dots,i_L)\). The network structure can be regarded as a directed graph \((\mathcal{O},E)\), with vertices \(\mathcal{O}=\{ O_1,\dots,O_{H+d+K} \}\) and edges \(E\), where \(H\) is the number of hidden neurons and \(|E|=m\).
A path is represented by a vector \(p=(p_1,\dots,p_m)^T\): if edge \(e\) is contained in path \(p\) then \(p_e=1\), and otherwise \(p_e=0\). Each such vector has exactly \(L\) non-zero entries. The value of path \(p\) is defined as \(v_p(w)= \prod_{i=1}^m w_i^{p_i}\), and the set of all paths is denoted \(\mathcal{P}\). Finally, the activation status of path \(p\) is \(a_p(x;w)=\prod_{j: p_{e_j}=1} \mathbb{I}\big( o_j(x;w) > 0 \big)\), i.e. a path is active only if every node on it fires.
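For concreteness, the path value and activation status can be computed directly on a toy 2-2-1 network (all weights below are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy 2-2-1 ReLU MLP: d = 2 inputs, one hidden layer with 2 neurons,
# K = 1 output, hence m = 6 weights in total (hypothetical values).
W1 = np.array([[0.5, -1.0],   # W1[i, j]: input i -> hidden j
               [2.0,  0.3]])
W2 = np.array([[1.5],         # W2[j, k]: hidden j -> output k
               [-0.4]])

def forward(x):
    """Forward pass that also returns the hidden pre-activations."""
    z = x @ W1                # hidden pre-activations
    h = np.maximum(z, 0.0)    # ReLU
    return z, h @ W2

# The path (i0=0, i1=0, i2=0): input 0 -> hidden 0 -> output 0.
# Its value is the product of the weights along it ...
v_p = W1[0, 0] * W2[0, 0]

# ... and its activation status is 1 iff every hidden node on the
# path fires on input x.
x = np.array([1.0, 2.0])
z, _ = forward(x)
a_p = float(z[0] > 0)

print(v_p, a_p)  # -> 0.75 1.0
```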
Positively Scale-Invariant Space

The group acting on weight space via positive rescaling is \(\mathcal{G}=\{ g_{(c_1,\dots,c_H)}(\cdot) : c_1,\dots,c_H \in \mathbb{R}_+ \}\), where \(g_{(c_1,\dots,c_H)}\) multiplies the incoming weights of hidden neuron \(i\) by \(c_i\) and divides its outgoing weights by \(c_i\).

If \(w=g(w')\) for some \(g\in \mathcal{G}\), then \(N_w(x)=N_{w'}(x)\) for every input \(x\).
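This invariance is easy to verify numerically. A minimal sketch with a random toy network: rescale the incoming weights of each hidden neuron \(j\) by \(c_j>0\) and its outgoing weights by \(1/c_j\), and the outputs coincide.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 4))   # input -> hidden
W2 = rng.normal(size=(4, 2))   # hidden -> output

def net(x, W1, W2):
    return np.maximum(x @ W1, 0.0) @ W2

# Positive rescaling of the 4 hidden neurons: since c_j > 0,
# ReLU(c_j * z) = c_j * ReLU(z), and the 1/c_j on the outgoing
# weights cancels the factor exactly.
c = np.array([2.0, 0.5, 10.0, 0.1])
W1s = W1 * c            # scales column j of W1 by c[j]
W2s = W2 / c[:, None]   # scales row j of W2 by 1/c[j]

x = rng.normal(size=(5, 3))
assert np.allclose(net(x, W1, W2), net(x, W1s, W2s))
```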

Based on this, the authors propose to optimize the values of paths instead of weights.

They consider the set of path vectors \(\mathcal{P} \subset \{ 0,1 \}^m\).

The matrix whose rows are the path vectors is denoted \(A\) and is called the structure matrix of the ReLU network. It is of size \(n \times m\), where \(n = |\mathcal{P}|\).
- Given the structure matrix and the values of the basis paths, the values of the weights cannot be determined unless the values of the free variables are fixed.
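A toy sketch makes the indeterminacy concrete: for a fully connected 2-2-1 ReLU network the structure matrix has rank \(m-H\), so \(H\) degrees of freedom remain free. (The construction below is a hypothetical illustration, not the paper's code.)

```python
import numpy as np
from itertools import product

# Structure matrix of a toy 2-2-1 MLP: m = 6 weights, H = 2 hidden
# neurons, and one path per (input, hidden) pair since K = 1.
# Weight order: w1(0,0), w1(0,1), w1(1,0), w1(1,1), w2(0), w2(1).
d, H, K = 2, 2, 1
m = d * H + H * K
rows = []
for i, j in product(range(d), range(H)):
    p = np.zeros(m)
    p[i * H + j] = 1      # edge: input i -> hidden j
    p[d * H + j] = 1      # edge: hidden j -> output
    rows.append(p)
A = np.array(rows)        # n x m, one row per path

# rank(A) = m - H: the path values pin down only m - H degrees of
# freedom, so H free variables must be fixed to recover the weights.
print(np.linalg.matrix_rank(A))  # -> 4
```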
\(\mathcal{G}\)-SGD optimization algorithm
Three methods are combined to obtain the \(\mathcal{G}\)-SGD algorithm.
Skeleton Method
For an MLP of depth L and width h, this method constructs the skeleton weights.
The update rule for a basis path \(p\) is the standard gradient step on its value:
\[ v_p^{t+1} = v_p^{t} - \eta \, \frac{\partial \mathcal{L}}{\partial v_p^{t}}, \]
where \(\eta\) is the learning rate.
Inverse Chain-Rule (ICR) Method
The end goal of this method is to compute the gradient with respect to the basis paths in terms of the gradient with respect to the weights. By the chain rule we have
\[ \frac{\partial \mathcal{L}}{\partial w_e} = \sum_{p \in \mathcal{P}} \frac{\partial \mathcal{L}}{\partial v_p} \frac{\partial v_p}{\partial w_e}. \]
Observe that
- \(\dfrac{\partial v_p}{\partial w_e} = \dfrac{v_p}{w_e}\) if the edge \(e\) is contained in the path \(p\), and \(\dfrac{\partial v_p}{\partial w_e} = 0\) otherwise.
- Each non-skeleton weight is contained in exactly one basis path, so there is only one non-zero element in each column of the matrix in (3).
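The first observation can be checked numerically on a single path with hypothetical weights. Since \(v_p\) is linear in each individual weight, a finite difference recovers \(v_p/w_e\) essentially exactly:

```python
import numpy as np

# Check dv_p/dw_e = v_p / w_e for edges e on path p, and 0 otherwise,
# on a 3-edge path with hypothetical weights (w[3] is NOT on the path).
w = np.array([0.5, 2.0, -1.5, 4.0])
p = np.array([1, 1, 1, 0])            # path uses edges 0, 1, 2

def v(w):
    """Path value: product of the weights on the path."""
    return np.prod(w ** p)

eps = 1e-6
for e in range(4):
    we = w.copy()
    we[e] += eps
    num_grad = (v(we) - v(w)) / eps   # forward difference
    analytic = v(w) / w[e] if p[e] == 1 else 0.0
    assert abs(num_grad - analytic) < 1e-4
```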
Weight-Allocation (WA) Method
The previous two methods yield the update rule for the basis paths, but a way to update the weights themselves is still required; this is the purpose of weight-allocation. For this we need the following definitions:
- The path-ratio at iteration \(t\): \(R^t(p^j):= v_{p^j}^t / v_{p^j}^{t-1}\).
- The weight-ratio at iteration \(t\): \(r^t(w_i) := w_i^t / w_i^{t-1}\).
- \(w \odot p^j := \prod_{i=1}^m w_i^{p_i^j}\).
Observe that \(v_{p^j}(w) = w \odot p^j\) and that \(R^t(p^j) = r^t(w) \odot p^j\). The operation \(\odot\) extends to matrices column-wise in the obvious way.
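A quick numerical check of \(R^t(p^j) = r^t(w) \odot p^j\), with hypothetical weight vectors for two consecutive iterations:

```python
import numpy as np

# The path-ratio equals the product of the weight-ratios along the
# path: R^t(p) = r^t(w) ⊙ p (all values here are hypothetical).
p      = np.array([1, 0, 1, 1])         # path uses edges 0, 2, 3
w_prev = np.array([0.5, 3.0, 2.0, 1.0]) # weights at iteration t-1
w_curr = np.array([1.0, 3.5, 1.0, 4.0]) # weights at iteration t

def odot(w, p):
    return np.prod(w ** p)

R = odot(w_curr, p) / odot(w_prev, p)   # path-ratio R^t(p)
r = w_curr / w_prev                     # weight-ratios r^t(w)
assert np.isclose(R, odot(r, p))
```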
In particular, the vector of basis-path-ratios satisfies \(\big( R^t(p^1),\dots,R^t(p^{m-H}) \big) = r^t(w) \odot A'\), where \(A'=(p^1,\dots,p^{m-H})\). Assume that \(w_1,\dots,w_H\) are the skeleton weights. The vector \(\big( R^t(p^1),\dots,R^t(p^{m-H}) \big)\) is completed to an \(m\)-dimensional vector by adding 1's at the beginning. Define \(\widetilde{A}:= [B,A']\) with \(B=[I,0]^T\), where \(I\) is the \(H \times H\) identity matrix and \(0\) is an \(H \times (m-H)\) zero matrix. Then
\[rank( \widetilde{A} ) = m.\]
- This is not the only way to obtain weight-ratios from path-ratios, but the authors prove that their allocation is well-defined: since \(\widetilde{A}\) has full rank \(m\), the weight-ratios are uniquely determined by the completed vector of path-ratios.
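One way to read the construction: taking logarithms of \(r^t(w) \odot \widetilde{A} = \widetilde{R}^t\) gives a linear system, which full rank makes uniquely solvable. A sketch on the toy 2-2-1 network (the skeleton choice and ratio values below are hypothetical):

```python
import numpy as np

# Recover weight-ratios r from path-ratios R by solving the
# log-linear system  A~^T log r = log R~  (all ratios positive here).
# Weight order: w1(0,0), w1(1,1) (skeleton), w1(0,1), w1(1,0), w2(0), w2(1).
m, H = 6, 2
paths = np.array([[1, 0, 0, 0, 1, 0],   # path (0,0)
                  [0, 0, 1, 0, 0, 1],   # path (0,1)
                  [0, 0, 0, 1, 1, 0],   # path (1,0)
                  [0, 1, 0, 0, 0, 1]])  # path (1,1)
B = np.vstack([np.eye(H), np.zeros((m - H, H))])  # pins the skeleton ratios
A_tilde = np.hstack([B, paths.T])                 # m x m

assert np.linalg.matrix_rank(A_tilde) == m        # full rank

# Hypothetical path-ratios from one update, completed with 1's for
# the skeleton columns.
R = np.array([1.2, 0.8, 1.5, 0.9])
R_tilde = np.concatenate([np.ones(H), R])

log_r = np.linalg.solve(A_tilde.T, np.log(R_tilde))
r = np.exp(log_r)

# The recovered weight-ratios reproduce every path-ratio.
assert np.allclose(np.prod(r ** paths, axis=1), R)
```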
Experiments
Each curve is the average of 5 training curves.
- For CNNs, the following results are reported:

The authors claim that, due to the skip connections, there are no basis paths for the whole ResNet model. They also claim that positive scale invariance does not hold across residual connections, since skip connections break the path structure they construct.

They also treat batch normalization separately. The following approximation is considered:
- For an MLP with two hidden layers of size \(H\), \(\Delta\) denotes the performance gap between \(\mathcal{G}\)-SGD and vanilla SGD. These curves are produced on the Fashion-MNIST dataset.
- Finally, they compare with Path-SGD:
Conclusions

The dependencies between paths in a ReLU network can be exploited to obtain better optimization than with the usual SGD.

It is not clear how they obtain the skeleton weights for CNNs.

They do not compare with other optimization algorithms such as momentum or Adam, only with vanilla SGD and Path-SGD.