Some recent activation functions mimicking ReLU
Activation functions play a critical role in deep learning, influencing how models learn and generalize. Below is a description of the relationships between several important activation functions: ReLU, ELU, GELU, GLU, SiLU, Swish, ReGLU, GEGLU, and SwiGLU.
1. ReLU (Rectified Linear Unit)
ReLU is one of the most commonly used activation functions in neural networks. It is defined as:
\[\text{ReLU}(x) = \max(0, x)\]
If \(x\) comes from a previous layer, we have \(x = Wx^- + b\), where \(x^-\) is the output of the layer before.
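To make this concrete, here is a minimal PyTorch sketch (the framework choice and the sample tensor are illustrative) that implements the formula directly and checks it against the built-in `F.relu`:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# ReLU(x) = max(0, x): negative inputs are zeroed out, positive ones pass through.
relu_manual = torch.clamp(x, min=0.0)

# Should match PyTorch's built-in implementation.
assert torch.allclose(relu_manual, F.relu(x))
```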
2. ELU (Exponential Linear Unit)
ELU is an extension of ReLU that introduces smoothness for negative inputs, defined as:
\[\text{ELU}(x) = \begin{cases} x & \text{if } x \geq 0 \\ \alpha(\exp(x) - 1) & \text{if } x < 0 \end{cases}\]
where \(\alpha > 0\) is a hyperparameter, commonly set to 1.
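A similar sketch for the piecewise definition, checked against PyTorch's `F.elu` (the value of \(\alpha\) here is just the common default):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)
alpha = 1.0  # illustrative choice

# ELU(x) = x for x >= 0, alpha * (exp(x) - 1) for x < 0.
elu_manual = torch.where(x >= 0, x, alpha * (torch.exp(x) - 1.0))

assert torch.allclose(elu_manual, F.elu(x, alpha=alpha))
```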
3. GELU (Gaussian Error Linear Unit)
GELU is a more sophisticated activation function that models the neuron activation probabilistically:
\[\text{GELU}(x) = x \cdot \Phi(x)\]
where \(\Phi(x)\) is the cumulative distribution function of the standard normal (Gaussian) distribution.
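Since \(\Phi(x) = \tfrac{1}{2}\bigl(1 + \operatorname{erf}(x/\sqrt{2})\bigr)\), GELU can be written directly with the error function; the sketch below compares this against PyTorch's exact `F.gelu`:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# Phi(x) = 0.5 * (1 + erf(x / sqrt(2))) is the standard normal CDF.
phi = 0.5 * (1.0 + torch.erf(x / 2.0 ** 0.5))
gelu_manual = x * phi

# F.gelu uses the exact erf-based form by default.
assert torch.allclose(gelu_manual, F.gelu(x), atol=1e-6)
```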
4. SiLU (Sigmoid Linear Unit)
SiLU is a smooth approximation of ReLU whose first derivative has no discontinuity:
\[\text{SiLU}(x) = x \cdot \sigma(x)\]
where \(\sigma(x)\) is the sigmoid function.
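In code, SiLU is a one-liner on top of the sigmoid; a quick sketch checked against `F.silu`:

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# SiLU(x) = x * sigmoid(x); smooth, and close to ReLU for large |x|.
silu_manual = x * torch.sigmoid(x)

assert torch.allclose(silu_manual, F.silu(x))
```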
5. Swish
Swish is a simple generalization of SiLU:
\[\text{Swish}(x) = x \cdot \sigma(\beta x)\]
where \(\beta\) is either a fixed constant or a learnable parameter. When \(\beta = 1\), Swish reduces to SiLU.
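A minimal sketch with \(\beta\) held as a learnable scalar (this particular parameterization is illustrative; \(\beta\) may also be fixed):

```python
import torch
import torch.nn.functional as F

x = torch.linspace(-3.0, 3.0, steps=7)

# beta as a learnable scalar; with beta = 1, Swish is exactly SiLU.
beta = torch.nn.Parameter(torch.tensor(1.0))
swish = x * torch.sigmoid(beta * x)

assert torch.allclose(swish, F.silu(x))
```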
6. GLU (Gated Linear Unit)
GLU introduces a gating mechanism where the output of a linear transformation is modulated by a gate:
\[\text{GLU}(x) = x_1 \cdot \sigma(x_2)\]
where \(x_1 = Wx^- + b\) and \(x_2 = Vx^- + c\) are two different linear transformations of the previous layer's output \(x^-\).
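A sketch of the gating mechanism using two separate linear layers (the batch size and feature dimensions below are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

x_prev = torch.randn(4, 16)   # x^-: output of the previous layer (batch of 4)
W = nn.Linear(16, 32)         # x1 = W x^- + b
V = nn.Linear(16, 32)         # x2 = V x^- + c

# GLU: the first projection is gated elementwise by the sigmoid of the second.
x1, x2 = W(x_prev), V(x_prev)
glu = x1 * torch.sigmoid(x2)
```

Note that PyTorch's own `F.glu` takes a single tensor and splits it in half along a chosen dimension, so in practice the two projections are often fused into one larger linear layer.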
7. ReGLU (Rectified Gated Linear Unit)
ReGLU is a variant of GLU that uses ReLU instead of the sigmoid function as the gating mechanism:
\[\text{ReGLU}(x) = x_1 \cdot \text{ReLU}(x_2)\]
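The same two-projection sketch, with the sigmoid gate swapped for ReLU (dimensions again illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_prev = torch.randn(4, 16)
W, V = nn.Linear(16, 32), nn.Linear(16, 32)   # illustrative dimensions

# ReGLU: identical to GLU except the gate is ReLU(x2) rather than sigmoid(x2).
reglu = W(x_prev) * F.relu(V(x_prev))
```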
8. GEGLU (Gaussian Error Gated Linear Unit)
GEGLU is another variant of GLU that uses GELU instead of the sigmoid function:
\[\text{GEGLU}(x) = x_1 \cdot \text{GELU}(x_2)\]
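And the analogous sketch with a GELU gate:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_prev = torch.randn(4, 16)
W, V = nn.Linear(16, 32), nn.Linear(16, 32)   # illustrative dimensions

# GEGLU: the gate is GELU(x2) instead of sigmoid(x2).
geglu = W(x_prev) * F.gelu(V(x_prev))
```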
9. SwiGLU (Swish-Gated Linear Unit)
SwiGLU replaces the gating mechanism in GLU with the Swish activation function:
\[\text{SwiGLU}(x) = x_1 \cdot \text{Swish}(x_2)\]
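Finally, a sketch of SwiGLU with the Swish gate written out explicitly (\(\beta\), the batch size, and the dimensions are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

x_prev = torch.randn(4, 16)
W, V = nn.Linear(16, 32), nn.Linear(16, 32)   # illustrative dimensions

# SwiGLU: the gate is Swish(x2) = x2 * sigmoid(beta * x2).
x1, x2 = W(x_prev), V(x_prev)
beta = 1.0
swiglu = x1 * (x2 * torch.sigmoid(beta * x2))
```

When \(\beta = 1\), the gate is exactly `F.silu(x2)`, which is the form commonly used in transformer feed-forward blocks.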