Normalization and Activations¶
Normalization¶
Layer Normalization (LayerNorm)¶
- Used in the original Transformer ("Attention Is All You Need").
- Formula: let x be a vector with mean μ = mean(x) and variance σ² = var(x); the LayerNorm operation is commonly written as:
$$ \mathrm{LayerNorm}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta $$
- Pre-LN vs Post-LN:
- Post-LN (original): normalization after the residual block.
- Pre-LN (common now): normalization before the sub-layer (attention/FFN). Pre-LN often improves gradient stability and training dynamics for deep models.
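As a concrete reference, here is a minimal PyTorch sketch of the LayerNorm formula above (module and argument names are illustrative; in practice `torch.nn.LayerNorm` provides the same operation):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """Minimal LayerNorm over the last dimension, matching the formula above."""
    def __init__(self, d_model: int, eps: float = 1e-5):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))   # scale
        self.beta = nn.Parameter(torch.zeros(d_model))   # shift
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return self.gamma * (x - mu) / torch.sqrt(var + self.eps) + self.beta
```

In a pre-LN block this is applied before each sub-layer (`x = x + attn(norm(x))`), whereas a post-LN block normalizes after the residual add (`x = norm(x + attn(x))`).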
RMS Normalization (RMSNorm)¶
- RMSNorm removes the mean-centering step and uses root-mean-square for normalization. It has fewer operations and parameters than LayerNorm and can be faster/more memory efficient.
A common RMSNorm formulation is:
$$ \mathrm{RMSNorm}(x) = \gamma \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^d x_i^2 + \varepsilon}} $$
- Notes:
- No trainable bias (β) in the basic RMSNorm formulation.
- Slightly cheaper (no subtraction/mean computation) and used in many modern architectures.
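A matching RMSNorm sketch under the same assumptions (scale parameter only, no bias):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Minimal RMSNorm: no mean subtraction and no bias, per the formula above."""
    def __init__(self, d_model: int, eps: float = 1e-6):
        super().__init__()
        self.gamma = nn.Parameter(torch.ones(d_model))  # scale only; no beta
        self.eps = eps

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.gamma * x / rms
```

The only per-token statistic is the root-mean-square, which is what makes it slightly cheaper than LayerNorm.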
Why choose one vs the other¶
- LayerNorm provides full centering and scaling; RMSNorm is a lightweight alternative with similar empirical performance in many cases.
- Pre-LN residual blocks + either norm usually give stable training for deep transformers.
Feed-Forward Layers (FFN) — common forms¶
- Standard:
$$ \mathrm{FFN}(x) = W_2\, f(W_1 x + b_1) + b_2 $$
- W₁: input → hidden, activation f (ReLU/GeLU/etc.), W₂: hidden → output.
- Gated/Mixture variants are common (GLU/GeGLU/SwiGLU) where activations include element-wise gating for improved expressivity.
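A minimal sketch of the standard two-layer FFN above (GeLU is chosen as f here for illustration; the module and dimension names are assumptions):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeedForward(nn.Module):
    """Standard FFN: W2 * f(W1 x + b1) + b2, with f = GeLU in this sketch."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_ff)   # input -> hidden (W1, b1)
        self.w2 = nn.Linear(d_ff, d_model)   # hidden -> output (W2, b2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.gelu(self.w1(x)))
```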
Activations¶
Common activation functions used in transformer FFNs and variants:
- ReLU:
$$ \mathrm{ReLU}(x) = \max(0, x) $$
- Simple, cheap; used in early Transformers.
- GeLU (Gaussian Error Linear Unit):
$$ \mathrm{GeLU}(x) = x\,\Phi(x) $$
where Φ is the standard normal CDF; many implementations use a fast tanh-based approximation for efficiency.
- Used in GPT and many transformer variants; smoother than ReLU.
- SiLU / Swish:
$$ \mathrm{SiLU}(x) = x\,\sigma(x) $$
where σ(x) = 1/(1 + e^{-x}) is the logistic sigmoid.
- Smooth; can improve performance in some models.
- Gated activations (GLU variants) add an extra weight matrix V and gate the hidden activation element-wise; a sketch of a SwiGLU FFN follows this list.
ReGLU:
$$ \mathrm{FFN}_{\mathrm{ReGLU}}(x, W, V, W_2) = (\max(0, xW) \otimes xV)\, W_2 $$
GeGLU:
$$ \mathrm{FFN}_{\mathrm{GeGLU}}(x, W, V, W_2) = (\mathrm{GeLU}(xW) \otimes xV)\, W_2 $$
SwiGLU (Swish is $x \cdot \sigma(x)$):
$$ \mathrm{FFN}_{\mathrm{SwiGLU}}(x, W, V, W_2) = (\mathrm{Swish}(xW) \otimes xV)\, W_2 $$
To keep the parameter count comparable to the ungated FFN, gated variants typically scale the hidden dimension d_ff down to 2/3 of its ungated value.
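A minimal sketch of a SwiGLU feed-forward layer as defined above (`F.silu` is PyTorch's SiLU/Swish; the layer names and bias-free choice are assumptions typical of gated FFNs):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFeedForward(nn.Module):
    """Gated FFN: (Swish(x W) ⊗ (x V)) W2, per the SwiGLU formula above."""
    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # d_ff is often ~2/3 of the ungated hidden size to keep parameters comparable
        self.w = nn.Linear(d_model, d_ff, bias=False)    # gate branch (W)
        self.v = nn.Linear(d_model, d_ff, bias=False)    # value branch (extra matrix V)
        self.w2 = nn.Linear(d_ff, d_model, bias=False)   # output projection (W2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w2(F.silu(self.w(x)) * self.v(x))    # element-wise gating
```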
Practical notes¶
- Most modern transformer implementations use GeLU (or a gated variant) in the FFN.
- When optimizing for memory/speed, consider RMSNorm + GeLU (or gated GeLU) with pre-LN transformer blocks.
Serial vs Parallel Layers¶
In standard transformer blocks, layers are computed serially: first attention, then MLP.
Some recent models use a parallel formulation, where the MLP and attention operate in parallel on the same normalized input.
Serial (standard) formulation:
$$
y = x + \mathrm{MLP}(\mathrm{LayerNorm}(x + \mathrm{Attention}(\mathrm{LayerNorm}(x))))
$$
Parallel formulation:
$$
y = x + \mathrm{MLP}(\mathrm{LayerNorm}(x)) + \mathrm{Attention}(\mathrm{LayerNorm}(x))
$$
Benefits:
- Parallel layers enable the MLP and attention input matrix multiplications to be fused, resulting in roughly 15% faster training at large scale.
- Ablation experiments show a small quality drop at 8B scale, but no degradation at 62B scale; extrapolation suggests parallel layers are quality-neutral at even larger scales (e.g., 540B).
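A schematic sketch of the two block structures (the `attention`, `mlp`, and `norm` callables are placeholders, not a specific implementation):

```python
def serial_block(x, norm1, norm2, attention, mlp):
    # y = x + MLP(LayerNorm(x + Attention(LayerNorm(x)))), as written above
    return x + mlp(norm2(x + attention(norm1(x))))

def parallel_block(x, norm, attention, mlp):
    # y = x + MLP(LayerNorm(x)) + Attention(LayerNorm(x));
    # both branches read the same normalized input, so their input
    # projections can be fused into a single matrix multiply
    h = norm(x)
    return x + mlp(h) + attention(h)
```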