Lecture 11

Normalization / Initialization

Lecture Notes: Normalization and Initialization in Deep Learning

Based on lecture transcript and accompanying slides.
Topic: Normalization and Initialization (Gordon, Lecture 11).
Course context: Regularization, stability, and optimization in neural networks.


1. Research Projects and Reading Papers

1.1 Choosing a Project Direction

Before designing architectures (e.g., transformers), it is recommended to first decide on an application domain.

1.2 Critical and Optimistic Reading

Academic papers, even from leading institutions, invariably contain flaws. The productive mindset combines critical reading with optimistic reading: noting the flaws while still extracting what is useful.


2. Motivation for Normalization

2.1 Optimization Landscape Intuition

Training via gradient descent can be visualized as movement over a loss surface.

2.2 Deep Learning Context


3. Batch Normalization (BatchNorm)

3.1 Conceptual Overview

Proposed by Ioffe and Szegedy (2015), Batch Normalization addresses instability in training deep networks. It normalizes intermediate activations within each mini-batch to stabilize their distributions. During backpropagation, large weights produce large partial derivatives, and multiplying many of these together across layers yields larger and larger values; BatchNorm helps to deal with these “exploding gradients”.

Essentially, BatchNorm adds extra layers to the model that improve stability and convergence rates.

Note: each unit of a hidden layer is normalized using statistics computed over the current mini-batch, so every unit receives additional (normalization) information from the other examples in the batch.

Given activations (z_i) for a layer across a mini-batch of size (n):

Normalizing Net Inputs:

\mu_B = \frac{1}{n}\sum_{i=1}^{n} z_i, \qquad \sigma_B^2 = \frac{1}{n}\sum_{i=1}^{n}(z_i - \mu_B)^2

\hat{z}_i = \frac{z_i - \mu_B}{\sigma_B}

In practice, however, a small constant \varepsilon is added to the variance for numerical stability:

\hat{z}_i = \frac{z_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}

Affine transformation (learnable scaling and shifting):

y_i = \gamma \hat{z}_i + \beta
Figure 1. BatchNorm.
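As a concrete illustration of these equations, the sketch below (using PyTorch tensor operations; the batch size, layer width, and ε are arbitrary choices) computes the batch statistics, the normalized activations, and the affine output for one mini-batch:

```python
import torch

z = torch.randn(4, 3)                  # mini-batch of net inputs: n = 4 examples, 3 units
eps = 1e-5                             # small constant for numerical stability

mu_B = z.mean(dim=0)                   # batch mean, per unit
var_B = z.var(dim=0, unbiased=False)   # batch variance (sigma_B^2), per unit

z_hat = (z - mu_B) / torch.sqrt(var_B + eps)  # normalized activations
gamma = torch.ones(3)                  # learnable scale, initialized to 1
beta = torch.zeros(3)                  # learnable shift, initialized to 0
y = gamma * z_hat + beta               # affine transformation

print(z_hat.mean(dim=0), z_hat.std(dim=0))    # roughly 0 and 1 per unit
```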

3.2 Learning BatchNorm Parameters

BatchNorm fits naturally into the computation graph:

  1. Linear transformation: (z = Wx + b)
  2. Normalization: (\hat{z})
  3. Rescaling and shifting: (y = \gamma \hat{z} + \beta)
  4. Nonlinear activation: (a = f(y))

Since all operations are differentiable, backpropagation proceeds seamlessly.

Figure 2. BatchNorm and backpropagation.
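To see that backpropagation indeed flows through every step, here is a small sketch (shapes and values are arbitrary) that composes the four operations with plain tensor ops and lets autograd compute gradients for W, b, γ, and β:

```python
import torch

x = torch.randn(8, 5)                      # mini-batch of inputs
W = torch.randn(5, 3, requires_grad=True)  # linear weights
b = torch.zeros(3, requires_grad=True)     # linear bias
gamma = torch.ones(3, requires_grad=True)  # BatchNorm scale
beta = torch.zeros(3, requires_grad=True)  # BatchNorm shift

z = x @ W + b                                                          # 1. linear transformation
z_hat = (z - z.mean(0)) / torch.sqrt(z.var(0, unbiased=False) + 1e-5)  # 2. normalization
y = gamma * z_hat + beta                                               # 3. rescale and shift
a = torch.relu(y)                                                      # 4. nonlinear activation

a.sum().backward()                         # gradients flow through all four steps
print(W.grad.shape, gamma.grad, beta.grad)
```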


3.3 Training vs. Inference
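In standard implementations (including PyTorch), the layer behaves differently in the two modes: during training it normalizes with the statistics of the current mini-batch and keeps running estimates of the mean and variance; at inference it normalizes with those running estimates instead, so each example's output no longer depends on the rest of the batch. A minimal sketch of switching between the modes (layer width and data are arbitrary):

```python
import torch
import torch.nn as nn

bn = nn.BatchNorm1d(3)

bn.train()                  # training mode: use batch statistics, update running stats
_ = bn(torch.randn(16, 3))
print(bn.running_mean)      # running estimates have been updated

bn.eval()                   # inference mode: normalize with the stored running statistics
_ = bn(torch.randn(16, 3))
```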


3.4 BatchNorm in PyTorch
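A minimal sketch of how the layer is typically placed in a model, here a small fully connected network with arbitrary layer sizes; `nn.BatchNorm1d` sits between the linear transformation and the activation, matching the ordering in Section 3.2:

```python
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(20, 64),
    nn.BatchNorm1d(64),  # normalizes each of the 64 units over the mini-batch
    nn.ReLU(),
    nn.Linear(64, 10),
)

x = torch.randn(32, 20)  # mini-batch of 32 examples
logits = model(x)
print(logits.shape)      # torch.Size([32, 10])
```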


3.5 Empirical Benefits

  1. Accelerated convergence through more stable gradients.
    • The underlying optimization problem is the same as without batch norm, but the smoother loss landscape lets us use larger learning rates and train more quickly
  2. Mitigation of internal distribution shifts.
  3. Enables higher learning rates and deeper networks.
  4. Often improves generalization due to mild regularization effects.

3.6 Theoretical Explanations (Multiple Perspectives)

Year | Explanation | Source / Insight
2015 | Internal Covariate Shift hypothesis | Original BN paper
2018–2019 | Smoothing of optimization landscape | MIT study (Santurkar et al.)
2018 | Implicit regularization effect | Empirical observations
2019 | Stabilized gradient dynamics allowing larger learning rates | Theoretical reinterpretation

Despite differing theoretical justifications, all confirm that BN improves training stability and speed.


3.7 Practical Considerations


4. Initialization of Network Weights

4.1 Importance of Initialization

Improper initialization can result in:

  1. Vanishing or exploding gradients.
  2. Symmetry, e.g., when every weight starts from the same value.

In the symmetric case, the hidden units cannot be distinguished from one another, which prevents gradient descent from finding a good minimum.


4.2 Xavier (Glorot) Initialization

Designed for activation functions centered near zero (e.g., tanh). Steps:

  1. Initialize weights from Normal or Uniform distribution.
  2. Scale the weights by √(1/m^{l-1}), where m^{l-1} is the number of inputs (fan-in) of the layer.
W^{l} := W^{l} * \sqrt{\frac{1}{m^{l-1}}}

Assuming independent, zero-mean weights and activations, the variance of a unit's net input is

\begin{aligned} \mathrm{Var}\left(z_j^{(l)}\right) &= \mathrm{Var}\left(\sum_{k=1}^{m_{l-1}} W_{jk}^{(l)} a_k^{(l-1)}\right) = \sum_{k=1}^{m_{l-1}} \mathrm{Var}\left(W_{jk}^{(l)} a_k^{(l-1)}\right) \\ &= \sum_{k=1}^{m_{l-1}} \mathrm{Var}\left(W_{jk}^{(l)}\right) \mathrm{Var}\left(a_k^{(l-1)}\right) = \sum_{k=1}^{m_{l-1}} \mathrm{Var}\left(W^{(l)}\right) \mathrm{Var}\left(a^{(l-1)}\right) \\ &= m_{l-1} \, \mathrm{Var}\left(W^{(l)}\right) \mathrm{Var}\left(a^{(l-1)}\right) \end{aligned}

Choosing \mathrm{Var}\left(W^{(l)}\right) = 1/m_{l-1} therefore gives \mathrm{Var}\left(z_j^{(l)}\right) = \mathrm{Var}\left(a^{(l-1)}\right), so the variance of the signal stays roughly constant from layer to layer.
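As a concrete sketch of this scaling rule for a single fully connected layer (the layer sizes below are arbitrary), one could write:

```python
import math
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)       # m^{l-1} = 256 inputs, 128 outputs
fan_in = layer.in_features

with torch.no_grad():
    # Draw from a standard normal, then scale by sqrt(1 / m^{l-1})
    layer.weight.normal_(0.0, 1.0).mul_(math.sqrt(1.0 / fan_in))
    layer.bias.zero_()

print(layer.weight.var().item())  # close to 1/256 ≈ 0.0039
```

PyTorch's built-in `nn.init.xavier_normal_` implements the closely related averaged form Var(W) = 2/(n_in + n_out) from Glorot and Bengio's paper.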

4.3 He (Kaiming) Initialization

Tailored for ReLU activations, which are non-symmetric around zero.

Same steps as in Xavier Initialization, but we add a factor of √2 when scaling weights:

W^{l} := W^{l} * \sqrt{\frac{2}{m^{l-1}}}

Reasoning:

W \sim \mathcal{N}\left(0, \frac{2}{n_{in}}\right)

This scaling compensates for the half-rectification effect of ReLU: because ReLU zeroes out roughly half of its inputs, it roughly halves the variance of the activations, and doubling the weight variance restores consistent activation variance across layers.
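A corresponding sketch for a fully connected layer (sizes again arbitrary); here PyTorch's built-in `nn.init.kaiming_normal_` can be used directly, since with `nonlinearity='relu'` it draws weights with variance 2/n_in:

```python
import torch
import torch.nn as nn

layer = nn.Linear(256, 128)

with torch.no_grad():
    # std = sqrt(2 / fan_in), i.e. Var(W) = 2 / n_in
    nn.init.kaiming_normal_(layer.weight, mode='fan_in', nonlinearity='relu')
    layer.bias.zero_()

print(layer.weight.var().item())  # close to 2/256 ≈ 0.0078
```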


4.4 Architectural Dependence

The optimal initialization scheme may depend on the choice of activation function, the depth of the network, and architectural features such as residual connections.


5. Gradient Stability and Residual Structures

To combat vanishing/exploding gradients, careful initialization (Section 4) and normalization layers (Section 3) are combined with residual (skip) connections, which add an identity path around a block so that gradients can flow directly back to earlier layers.
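As an illustration of the residual idea, here is a minimal sketch of a fully connected residual block (layer sizes and the choice of BatchNorm inside the block are arbitrary): the output is the input plus a learned correction, so the identity term always provides a direct gradient path.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Computes y = x + F(x); the identity shortcut gives gradients a direct path."""

    def __init__(self, dim: int):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(dim, dim),
            nn.BatchNorm1d(dim),
            nn.ReLU(),
            nn.Linear(dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.body(x)  # skip connection

block = ResidualBlock(64)
out = block(torch.randn(32, 64))
print(out.shape)                 # torch.Size([32, 64])
```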


6. Implementation Summary

6.1 Normalization

Technique | Applied Axis | Common Use | Advantages | Limitations
BatchNorm | Across mini-batch | CNNs | Fast convergence | Requires large batch
LayerNorm | Across features | Transformers | Batch-independent | Slightly slower
InstanceNorm | Across spatial dims | Style transfer | Instance-specific | Limited effect on stability
GroupNorm | Across channel groups | Small-batch CNNs | Stable | Needs tuning of group size
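To make the "applied axis" column concrete, the sketch below applies each layer to the same activation tensor of shape (batch, channels, height, width); the channel and group counts are arbitrary choices:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 16, 32, 32)  # (batch, channels, height, width)

outputs = {
    "BatchNorm":    nn.BatchNorm2d(16)(x),          # statistics over (batch, H, W), per channel
    "LayerNorm":    nn.LayerNorm([16, 32, 32])(x),  # statistics over all features, per example
    "InstanceNorm": nn.InstanceNorm2d(16)(x),       # statistics over (H, W), per example and channel
    "GroupNorm":    nn.GroupNorm(4, 16)(x),         # statistics over channel groups, per example
}
for name, out in outputs.items():
    print(name, out.shape)      # every output has the same shape as the input
```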

6.2 Initialization Quick Reference

Scheme | Suitable Activation | Formula | Key Idea
Xavier (Glorot) | tanh / sigmoid | Var(W) = 1/n_in (or 2/(n_in + n_out)) | Equalize activation variance
He (Kaiming) | ReLU / LeakyReLU | Var(W) = 2/n_in | Compensate for ReLU truncation

7. Key Takeaways

  1. Normalization stabilizes internal representations, accelerates convergence, and improves training reliability.
  2. BatchNorm remains the most effective and widely used technique despite incomplete theoretical justification.
  3. Initialization directly influences optimization trajectory; Xavier and He methods are standard baselines.
  4. Residual connections further enhance gradient flow, crucial for very deep models.
  5. Understanding and controlling the interplay between normalization, initialization, and architecture is central to modern deep learning engineering.