Lecture 09

Multilayer Perceptrons & Backpropagation

Today’s Topics:

1. Multilayer Perceptron Architecture
2. Nonlinear Activation Functions
3. Multilayer Perceptron Code Examples
4. Overfitting and Underfitting

1. Multilayer Perceptron Architecture

The Multilayer Perceptron (MLP) is a neural network model that extends the single-layer perceptron by stacking multiple fully connected layers into a computation graph. Each layer performs a linear transformation followed by a nonlinear activation function, enabling the network to capture non-linear relationships and hierarchical features.

Key Points:

  1. Definition
    • An MLP consists of multiple layers where each neuron in one layer connects to every neuron in the next.
    • This structure generalizes linear models and enables more powerful function approximation than the basic perceptron, which can only output a single bit of information (0 or 1).
    • Illustration: see the minimal forward-pass sketch below.
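As a concrete illustration of this layered structure, here is a minimal NumPy forward pass. The layer sizes, ReLU hidden activations, and sigmoid output are illustrative assumptions, not the exact setup from the slides:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def mlp_forward(x, params):
    """Forward pass: each layer is a linear map followed by a nonlinearity."""
    h = x
    for W, b in params[:-1]:
        h = relu(h @ W + b)            # hidden layers: affine transform + ReLU
    W_out, b_out = params[-1]
    return sigmoid(h @ W_out + b_out)  # output layer: affine transform + sigmoid

# Illustrative 2-4-3-1 network with small random weights
rng = np.random.default_rng(0)
sizes = [2, 4, 3, 1]
params = [(0.1 * rng.standard_normal((m, n)), np.zeros(n))
          for m, n in zip(sizes[:-1], sizes[1:])]

x = rng.standard_normal((5, 2))       # batch of 5 two-dimensional inputs
print(mlp_forward(x, params).shape)   # (5, 1)
```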


2. Growth of Parameters (Slide 10)

Every connection introduces a weight, and every neuron in a layer adds a bias term. As the network grows wider or deeper, the number of trainable parameters increases rapidly.
This highlights both the expressive power of MLPs and their tendency to overfit.
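A quick way to see the growth is to count parameters directly: a fully connected layer mapping m inputs to n outputs has m·n weights plus n biases. The layer sizes below are illustrative:

```python
def count_parameters(sizes):
    """Trainable parameters of a fully connected network:
    each (m -> n) layer contributes m*n weights plus n biases."""
    return sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))

print(count_parameters([784, 100, 10]))       # 79,510
print(count_parameters([784, 500, 500, 10]))  # 648,010
```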


3. From “Shallow” to “Deep” (Slide 11)

Adding more hidden layers turns a shallow network into a deep one; training such multi-layer architectures is what "deep learning" refers to.

This is the conceptual leap from “just a perceptron with more neurons” to the foundation of deep architectures.


4. Optimization Landscape: Convex vs. Non-convex (Slides 12–17)

Earlier models (e.g., logistic regression, Adaline) had convex loss functions, so every local minimum is also a global minimum.
Formally, a function $f$ is convex if:

\[f(\lambda x + (1-\lambda)y) \leq \lambda f(x) + (1-\lambda) f(y), \quad \forall x,y,\; \lambda \in [0,1].\]

The loss surface of an MLP, in contrast, is non-convex: it can contain many local minima and saddle points, so gradient-based training may reach different solutions depending on where it starts.
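As a small numerical illustration of the definition (the functions and test points below are arbitrary choices), the inequality holds for a convex function such as the squared loss but fails for a non-convex one:

```python
import numpy as np

def convexity_gap(f, x, y, lam):
    """Right-hand side minus left-hand side of the convexity inequality.
    Non-negative everywhere for a convex f."""
    return lam * f(x) + (1 - lam) * f(y) - f(lam * x + (1 - lam) * y)

lam = 0.3
print(convexity_gap(lambda t: t**2, -1.0, 2.0, lam))  # >= 0: squared loss is convex
print(convexity_gap(np.sin, 0.0, np.pi, lam))         # < 0 at these points: sine is not convex
```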

5. Importance of Initialization (Slide 18)

If all weights are initialized to zero, every neuron in a layer computes the same output and receives the same gradient update, so the neurons never differentiate and the network cannot learn distinct features.
To break this symmetry and ensure effective learning, random initialization (centered at zero, with variance scaled to the layer size) and input normalization are essential.
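A minimal sketch of symmetry-breaking initialization together with input standardization; the 1/sqrt(fan_in) scaling is one common convention, and the exact scheme used in the lecture may differ:

```python
import numpy as np

rng = np.random.default_rng(42)

def init_layer(fan_in, fan_out):
    """Small zero-centered random weights; dividing by sqrt(fan_in) keeps the
    scale of activations comparable across layers. Biases can start at zero,
    since the random weights already break the symmetry."""
    W = rng.standard_normal((fan_in, fan_out)) / np.sqrt(fan_in)
    b = np.zeros(fan_out)
    return W, b

def standardize(X):
    """Zero-mean, unit-variance inputs (statistics from the training set)."""
    mu, sigma = X.mean(axis=0), X.std(axis=0) + 1e-8
    return (X - mu) / sigma

X_train = rng.uniform(0, 255, size=(100, 784))  # e.g. raw pixel intensities
X_train = standardize(X_train)
W1, b1 = init_layer(784, 100)
```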


Logical Flow

Architecture → parameter growth → shallow vs. deep → non-convex optimization → initialization: each step motivates the next.


2. Nonlinear Activation Functions

Nonlinear activation functions are the key to making multilayer perceptrons more powerful than simple linear models. Without non-linearity, stacking multiple layers would still result in a single linear transformation, offering no advantage over logistic regression. By introducing nonlinear functions at each layer, MLPs can learn non-linear decision boundaries and solve problems like XOR that linear models fail to capture.


1. Why Nonlinearity Matters (Slide 21)
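Composing affine maps yields another affine map, so however many layers are stacked, a network without nonlinearities is still a single linear model. A small numerical check of this (the random matrices below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 3))

# Two "layers" with no nonlinearity in between...
W1, b1 = rng.standard_normal((3, 4)), rng.standard_normal(4)
W2, b2 = rng.standard_normal((4, 2)), rng.standard_normal(2)
two_layers = (x @ W1 + b1) @ W2 + b2

# ...equal one linear layer with W = W1 @ W2 and b = b1 @ W2 + b2.
one_layer = x @ (W1 @ W2) + (b1 @ W2 + b2)
print(np.allclose(two_layers, one_layer))  # True
```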


2. Common Activation Functions (Slides 22–24)

Several nonlinear activation functions are widely used. Each has its benefits and drawbacks:
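The comparison from the slides is not reproduced in these notes; as a sketch, three of the most commonly used choices (sigmoid, tanh, ReLU), with their usual trade-offs noted in the comments:

```python
import numpy as np

def sigmoid(z):
    # Squashes to (0, 1); saturates for large |z|, so gradients can vanish.
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Zero-centered, range (-1, 1); still saturates at both ends.
    return np.tanh(z)

def relu(z):
    # Cheap and non-saturating for z > 0, but units can "die" if z stays negative.
    return np.maximum(0.0, z)

z = np.linspace(-3, 3, 7)
print(sigmoid(z), tanh(z), relu(z), sep="\n")
```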



3. Activation Functions and Robustness (Slides 25–26)



Logical Flow

Takeaway: Nonlinear activations unlock the true power of MLPs. They not only allow non-linear decision boundaries but also shape training dynamics, convergence, and robustness.


3. Multilayer Perceptron Code Examples
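The code shown in class is not reproduced in these notes; the following is a minimal PyTorch stand-in for a fully connected classifier. The layer widths, SGD optimizer, learning rate, and synthetic batch are illustrative assumptions:

```python
import torch
from torch import nn

class MLP(nn.Module):
    """Fully connected network: Linear -> ReLU blocks, then a linear output layer."""
    def __init__(self, num_features, num_classes, hidden=(128, 64)):
        super().__init__()
        layers, width = [], num_features
        for h in hidden:
            layers += [nn.Linear(width, h), nn.ReLU()]
            width = h
        layers.append(nn.Linear(width, num_classes))
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x.flatten(start_dim=1))  # logits; pair with CrossEntropyLoss

model = MLP(num_features=784, num_classes=10)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()

# One illustrative gradient step on a random batch
x = torch.randn(32, 784)
y = torch.randint(0, 10, (32,))
loss = loss_fn(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```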

What to check before training
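The lecture's checklist is not reproduced here; the checks below (output shape, label range, input standardization, initial loss near the random-guess baseline) are common conventions offered as a sketch:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def pre_training_checks(model, loader, loss_fn, num_classes):
    """A few cheap checks before spending time on a full training run."""
    x, y = next(iter(loader))
    # 1) Output shape and label range match the classification head.
    assert model(x).shape == (x.shape[0], num_classes)
    assert y.min() >= 0 and y.max() < num_classes
    # 2) Inputs roughly standardized (mean ~ 0, std ~ 1).
    print("input mean/std:", x.mean().item(), x.std().item())
    # 3) Initial loss close to the random-guess baseline log(num_classes).
    print("initial loss:", loss_fn(model(x), y).item(),
          "| baseline:", float(torch.log(torch.tensor(float(num_classes)))))

# Illustrative usage with synthetic data
model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 5))
data = TensorDataset(torch.randn(256, 20), torch.randint(0, 5, (256,)))
pre_training_checks(model, DataLoader(data, batch_size=32),
                    nn.CrossEntropyLoss(), num_classes=5)
```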

Diagnosing Loss Curves
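A generic sketch of recording one training-loss and one validation-loss value per epoch so the two curves can be compared; the readings in the comments are rules of thumb, not values from the slides:

```python
# Typical readings of the two curves:
#   - both high and flat              -> underfitting (model too small / learning rate too low)
#   - train keeps falling, val rises  -> overfitting (the gap between the curves grows)
#   - loss diverges or oscillates     -> learning rate too high or bad initialization
import torch

@torch.no_grad()
def mean_loss(model, loader, loss_fn):
    model.eval()
    total, n = 0.0, 0
    for x, y in loader:
        total += loss_fn(model(x), y).item() * len(y)
        n += len(y)
    return total / n

def train(model, train_loader, val_loader, loss_fn, optimizer, epochs):
    history = {"train": [], "val": []}
    for _ in range(epochs):
        model.train()
        for x, y in train_loader:
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
        history["train"].append(mean_loss(model, train_loader, loss_fn))
        history["val"].append(mean_loss(model, val_loader, loss_fn))
    return history
```

Plotting history["train"] against history["val"] per epoch, e.g. for the MLP sketch above, gives the usual pair of curves to diagnose.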

4. Overfitting and Underfitting (intro)


Training vs. Generalization Error

Big picture. Training error is measured on the examples the model was fit to; generalization error is the expected error on new examples from the same distribution. A small training error is only meaningful if the generalization error is also small, and a growing gap between the two is the signature of overfitting.


Bias–Variance Decomposition — Formulas

General definition

\[\mathrm{Bias}_\theta[\hat{\theta}] = \mathbb{E}_\theta[\hat{\theta}] - \theta\]

\[\mathrm{Var}_\theta[\hat{\theta}] = \mathbb{E}_\theta[\hat{\theta}^{2}] - \big(\mathbb{E}_\theta[\hat{\theta}]\big)^{2}\]
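These two quantities combine into the standard decomposition of an estimator's mean squared error (when predicting a noisy label, an irreducible noise term \(\sigma^{2}\) is added on top):

\[\mathbb{E}_\theta\big[(\hat{\theta} - \theta)^{2}\big] = \mathrm{Bias}_\theta[\hat{\theta}]^{2} + \mathrm{Var}_\theta[\hat{\theta}]\]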

Intuition

Each “arrow” is the model learned from a different training set: bias is how far the average landing point is from the true target \(y\), and variance is how widely the arrows scatter around that average. (Noise is ignored in this sketch: high bias misses consistently; high variance sprays shots even if the average is on target.)
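This picture can be simulated directly: fit the same model class to many independently drawn training sets and measure, at a fixed test point, how far the average prediction is off (bias) and how widely individual predictions scatter (variance). In the sketch below, the sine ground truth, noise level, sample size, and polynomial degrees are all arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
def f_true(x): return np.sin(2 * np.pi * x)   # "true" function generating the data
x_test, n_train, runs = 0.25, 20, 500

def predictions(degree):
    """Prediction at x_test from polynomial models fit on independent training sets."""
    preds = []
    for _ in range(runs):
        x = rng.uniform(0, 1, n_train)
        y = f_true(x) + rng.normal(0, 0.3, n_train)   # noisy targets
        coefs = np.polyfit(x, y, degree)
        preds.append(np.polyval(coefs, x_test))
    return np.array(preds)

for degree in (1, 3, 9):
    p = predictions(degree)
    bias = p.mean() - f_true(x_test)   # how far the *average* model lands from the target
    var = p.var()                      # how widely individual models scatter
    print(f"degree {degree}: bias^2 = {bias**2:.4f}, variance = {var:.4f}")
```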

Why Deep Learning Loves Large Datasets

High-capacity networks sit at the high-variance end of this trade-off; because variance shrinks as the number of training examples grows, deep models benefit disproportionately from large datasets.


Double Descent (classical vs modern view)

Classic U-curve (first descent).
As capacity increases from very small, Bias falls faster than Variance rises, so test error drops.

Interpolation peak.
Near the point where the model just fits the training data exactly (interpolation), Variance spikes and test error peaks.

Second descent (overparameterized regime).
With even more capacity, test error falls again because learning dynamics and architecture steer solutions toward simpler interpolants (e.g., the implicit bias of stochastic gradient descent toward low-norm solutions).

Which descent is “better”?