
Normalization and Activations

Normalization

Layer Normalization (LayerNorm)

  • Used in the original Transformer ("Attention Is All You Need").
  • Formula: let x be a vector, μ = mean(x), σ^2 = var(x):

The LayerNorm operation is commonly written as:

$$ \mathrm{LayerNorm}(x) = \gamma \frac{x - \mu}{\sqrt{\sigma^2 + \varepsilon}} + \beta $$

  • Pre-LN vs Post-LN:
  • Post-LN (original): normalization after the residual block.
  • Pre-LN (common now): normalization before the sub-layer (attention/FFN). Pre-LN often improves gradient stability and training dynamics for deep models.
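The LayerNorm formula above can be sketched in a few lines of NumPy (toy input; `gamma`, `beta`, and `eps` values are illustrative, not from the source):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last dimension: center, scale by std, then affine."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

x = np.array([1.0, 2.0, 3.0, 4.0])
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
# With gamma=1, beta=0 the output is zero-mean and unit-variance (up to eps)
```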

RMS Normalization (RMSNorm)

  • RMSNorm removes the mean-centering step and uses root-mean-square for normalization. It has fewer operations and parameters than LayerNorm and can be faster/more memory efficient.

A common RMSNorm formulation is:

$$ \mathrm{RMSNorm}(x) = \gamma \frac{x}{\sqrt{\tfrac{1}{d}\sum_{i=1}^d x_i^2 + \varepsilon}} $$

  • Notes:
  • No trainable bias (β) in the basic RMSNorm formulation.
  • Slightly cheaper (no subtraction/mean computation) and used in many modern architectures.
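A minimal NumPy sketch of the RMSNorm formula above (note there is no mean subtraction and no bias; the `eps` value is illustrative):

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """RMSNorm: scale by the root-mean-square of x; no centering, no beta."""
    rms = np.sqrt(np.mean(x**2, axis=-1, keepdims=True) + eps)
    return gamma * x / rms

x = np.array([1.0, 2.0, 3.0, 4.0])
y = rms_norm(x, gamma=np.ones(4))
# With gamma=1 the output has RMS approximately 1
```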

Why choose one vs the other

  • LayerNorm provides full centering and scaling; RMSNorm is a lightweight alternative with similar empirical performance in many cases.
  • Pre-LN residual blocks + either norm usually give stable training for deep transformers.

Feed-Forward Layers (FFN) — common forms

  • Standard:

$$ \mathrm{FFN}(x) = W_2\, f(W_1 x + b_1) + b_2 $$

  • W₁: input → hidden, activation f (ReLU/GeLU/etc.), W₂: hidden → output.
  • Gated/Mixture variants are common (GLU/GeGLU/SwiGLU) where activations include element-wise gating for improved expressivity.
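The standard FFN above can be sketched with NumPy, here using a GeLU activation for f (the toy sizes `d_model=8`, `d_ff=32` and the random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes for illustration

def gelu(x):
    # tanh-based approximation of GeLU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def ffn(x, W1, b1, W2, b2):
    # W1: d_model -> d_ff, activation f, W2: d_ff -> d_model
    return gelu(x @ W1 + b1) @ W2 + b2

W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)
x = rng.normal(size=(4, d_model))  # a batch of 4 token vectors
y = ffn(x, W1, b1, W2, b2)         # shape is preserved: (4, d_model)
```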

Activations

Common activation functions used in transformer FFNs and variants:

  • ReLU:

$$ \mathrm{ReLU}(x) = \max(0, x) $$

Simple and cheap; used in early Transformers.

  • GeLU (Gaussian Error Linear Unit):

$$ \mathrm{GeLU}(x) = x\,\Phi(x) $$

where Φ is the standard normal CDF. Many implementations use a fast tanh-based approximation for efficiency. Used in GPT and many transformer variants; smoother than ReLU.

  • SiLU / Swish:

$$ \mathrm{SiLU}(x) = x\,\sigma(x) $$

where σ(x) = 1/(1 + e^{-x}) is the logistic sigmoid. Smooth; can improve performance in some models.
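The three activations above are easy to compare side by side; a small sketch using the exact GeLU definition via the error function:

```python
import math

def relu(x):
    return max(0.0, x)

def gelu(x):
    # exact GeLU: x * Phi(x), with Phi computed from the error function
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def silu(x):
    return x / (1.0 + math.exp(-x))

# All three agree at 0 and approach the identity for large positive x,
# but GeLU and SiLU are smooth and slightly negative for small x < 0.
```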

  • Gated activations (GLU variants): a second projection V gates the hidden activations element-wise. ReGLU:

$$ \mathrm{FFN}_{\mathrm{ReGLU}}(x) = (\max(0, xW_1) \otimes xV)\,W_2 $$

GeGLU:

$$ \mathrm{FFN}_{\mathrm{GeGLU}}(x) = (\mathrm{GeLU}(xW_1) \otimes xV)\,W_2 $$

SwiGLU (where Swish(x) = x·σ(x)):

$$ \mathrm{FFN}_{\mathrm{SwiGLU}}(x) = (\mathrm{Swish}(xW_1) \otimes xV)\,W_2 $$

Here V is an extra parameter matrix alongside W₁. To keep the parameter count comparable to the standard FFN, gated variants typically reduce d_ff by a factor of ⅔.
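The SwiGLU variant can be sketched in NumPy; note the extra gating projection V (toy sizes and random weights are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 8, 32  # toy sizes for illustration

def swish(x):
    return x / (1.0 + np.exp(-x))

def ffn_swiglu(x, W1, V, W2):
    # Swish(xW1) gates (xV) element-wise, then W2 projects back to d_model
    return (swish(x @ W1) * (x @ V)) @ W2

W1 = rng.normal(size=(d_model, d_ff))
V = rng.normal(size=(d_model, d_ff))   # the extra gating projection
W2 = rng.normal(size=(d_ff, d_model))
x = rng.normal(size=(4, d_model))
y = ffn_swiglu(x, W1, V, W2)           # shape is preserved: (4, d_model)
```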

Practical notes

  • Most modern transformer implementations use GeLU (or a gated variant) in the FFN.
  • When optimizing for memory/speed, consider RMSNorm + GeLU (or gated GeLU) with pre-LN transformer blocks.

Serial vs Parallel Layers

In standard transformer blocks, layers are computed serially: first attention, then MLP.

Some recent models use a parallel formulation, where the MLP and attention operate in parallel on the same normalized input.

Serial (standard) formulation:


$$ y = x + \mathrm{MLP}(\mathrm{LayerNorm}(x + \mathrm{Attention}(\mathrm{LayerNorm}(x)))) $$

Parallel formulation:


$$ y = x + \mathrm{MLP}(\mathrm{LayerNorm}(x)) + \mathrm{Attention}(\mathrm{LayerNorm}(x)) $$
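The two formulations can be sketched with stand-in sub-layers; the linear `attn`/`mlp` maps below are hypothetical placeholders for real attention and MLP blocks:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # toy model dimension

# Stand-in sub-layers (hypothetical): simple linear maps instead of
# real attention and MLP blocks, just to show the wiring.
A = rng.normal(size=(d, d)) * 0.1
M = rng.normal(size=(d, d)) * 0.1
attn = lambda x: x @ A
mlp = lambda x: x @ M
ln = lambda x: (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + 1e-5)

def serial_block(x):
    # attention first, then MLP, each with its own LayerNorm
    x = x + attn(ln(x))
    return x + mlp(ln(x))

def parallel_block(x):
    # one shared LayerNorm, computed once; attn and mlp read the same input
    h = ln(x)
    return x + attn(h) + mlp(h)

x = rng.normal(size=(4, d))
y_serial, y_parallel = serial_block(x), parallel_block(x)
```

Because `parallel_block` feeds the same normalized input to both sub-layers, their input matrix multiplications can be fused in practice.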

Benefits:

  • Parallel layers enable the MLP and attention input matrix multiplications to be fused, resulting in ~15% faster training at large scale.
  • Ablation experiments show a small quality drop at 8B scale, but no degradation at 62B scale; extrapolation suggests parallel layers are quality-neutral at even larger scales (e.g., 540B).