Lecture 14

Midterm Review

Table of Contents

  1. What is Machine Learning?
  2. The Building Blocks of Deep Learning
  3. Rosenblatt’s Perceptron
  4. Logistic Regression
  5. Multilayer Perceptron
  6. Backpropagation: An Algorithm to Train Models with Hidden Variables
  7. PyTorch: Automated Differentiation
  8. Improvements to Optimization
  9. Regularization
  10. Normalization
  11. Initialization
  12. Convolutional Neural Networks

1. What is Machine Learning?

Figure 1. Traditional Programming Paradigm vs. Machine Learning Paradigm

Machine learning uses statistically based methods to improve a model's performance based on its inputs and outputs.

Definition and Types

Formally, a computer program is said to learn from experience $E$ with respect to some task $T$ and performance measure $P$ if its performance at $T$ as measured by $P$ improves with $E$.

Supervised Learning: learn a mapping from inputs to outputs using labeled examples.

Unsupervised Learning: find structure in unlabeled data (e.g., clustering, dimensionality reduction).

Reinforcement Learning: learn how to act from rewards obtained by interacting with an environment.

The Connection Between Fields

The relationship between artificial intelligence, machine learning, and deep learning can be understood hierarchically. AI represents a system that accomplishes a task through a machine learning or deep learning model. Machine Learning uses statistically based methods to improve a system’s performance through the inputs and outputs of the system. Deep Learning is a subset of machine learning with specific architectures, such as those trained with backpropagation.

Figure 2. Overlaying fields of AI, Machine Learning, and Deep Learning.

2. The Building Blocks of Deep Learning

McCulloch & Pitts’ Neuron Model (1943)

Warren McCulloch and Walter Pitts created the first computational model of a neuron. The model sums its inputs and applies a threshold function. It used threshold-based activation with weights of $+1$ or $-1$ and did not support continuous weighting.

From Biological Neuron to Artificial Neuron

The McCulloch & Pitts neuron with threshold and binary weights can represent “AND”, “OR”, and “NOT” operations, but cannot represent “XOR”.

Figure 3. Biological neuron to Computational neuron.

3. Rosenblatt’s Perceptron

Core Concept

Frank Rosenblatt introduced a learning rule for the computational neuron model in 1957. The perceptron generalizes McCulloch-Pitts neurons by introducing continuous weighting and using an activation function (typically a threshold function for the classic Rosenblatt perceptron).

Figure 4. Classic Rosenblatt Perceptron.

Perceptron Learning Algorithm

The perceptron can find a decision boundary if the classes are linearly separable. Given a dataset $D = \{(x^{[1]}, y^{[1]}), \ldots, (x^{[N]}, y^{[N]})\}$, the algorithm proceeds as follows (a NumPy sketch follows the list):

  1. Initialize $w := 0_m$ (assume weight includes bias)
  2. For every training epoch:
    • For every $(x^{[i]}, y^{[i]}) \in D$:
      • $\hat{y}^{[i]} := \sigma(x^{[i]T} w)$ (only 0 or 1)
      • $\text{err} := (y^{[i]} - \hat{y}^{[i]})$ (only $-1$, $0$, or $1$)
      • $w := w + \text{err} \times x^{[i]}$
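
As a concrete illustration, here is a minimal NumPy sketch of this learning rule; the dataset (an AND problem with a bias column appended) and the number of epochs are illustrative choices, not part of the original algorithm statement.

import numpy as np

def train_perceptron(X, y, num_epochs=10):
    # X: (N, m) inputs with a constant 1 appended as the bias input
    # y: (N,) labels in {0, 1}
    w = np.zeros(X.shape[1])                    # 1. initialize w := 0_m
    for _ in range(num_epochs):                 # 2. for every training epoch
        for x_i, y_i in zip(X, y):              #    for every (x^[i], y^[i]) in D
            y_hat = 1.0 if x_i @ w > 0 else 0.0 # threshold activation (only 0 or 1)
            err = y_i - y_hat                   # only -1, 0, or 1
            w = w + err * x_i                   # update only on mistakes
    return w

# AND is linearly separable, so the loop converges to zero training error
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)
w = train_perceptron(X, y)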

Geometric Intuition

The decision boundary is a hyperplane in input space, and the weight vector $w$ is perpendicular to this decision boundary. Learning adjusts the orientation of the decision boundary.

Perceptron Limitations

The perceptron has several important limitations. As a linear classifier, it cannot learn non-linear boundaries and is limited to binary classification, making it unable to solve XOR problems. It does not converge if classes are not linearly separable. While many solutions may achieve zero training error, most will not be optimal in terms of generalization performance.

Activation Functions

Modern neural networks use various activation functions beyond the threshold function:

Sigmoid: Smooth and differentiable, outputs in $(0, 1)$

Tanh: Mean-centered, outputs in $(-1, 1)$. Advantages include mean centering, both positive and negative outputs, and larger gradients. When using tanh, it is important to normalize inputs to zero mean and to use random weight initialization centered around zero.

ReLU: $\text{ReLU}(z) = \max(0, z)$

Leaky ReLU: Allows small negative values

ELU: Exponential Linear Unit

SELU: Scaled Exponential Linear Unit
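
All of these activations are available in PyTorch; a short reference sketch (the input values are arbitrary):

import torch
import torch.nn.functional as F

z = torch.linspace(-3.0, 3.0, steps=7)

torch.sigmoid(z)                        # outputs in (0, 1)
torch.tanh(z)                           # mean-centered, outputs in (-1, 1)
F.relu(z)                               # max(0, z)
F.leaky_relu(z, negative_slope=0.01)    # small slope for z < 0
F.elu(z)                                # exponential linear unit
F.selu(z)                               # scaled exponential linear unit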


4. Logistic Regression

Logistic Regression Neuron

For binary classes $y \in \{0, 1\}$, logistic regression uses a sigmoid activation function.

Probability Formulation

Given the output $a = \sigma(w^T x)$, we compute the probability as $P(y = 1 \mid x) = a$, following the Bernoulli distribution.

Maximum Likelihood Estimation

Under maximum likelihood estimation, we maximize the multi-sample likelihood:

\[L(w) = \prod_{i=1}^n P(y^{(i)}|x^{(i)})\]

The log-likelihood is:

\[\log L(w) = \sum_{i=1}^n \log P(y^{(i)}|x^{(i)})\]

Gradient Descent Learning Rule

We maximize the log-likelihood with stochastic gradient ascent (equivalently, gradient descent on the negative log-likelihood), updating one example at a time:

\[w := w + \eta \nabla_w \log P(y^{(i)}|x^{(i)})\]
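
A minimal NumPy sketch of this stochastic update, using the fact that for the sigmoid/Bernoulli model the gradient is $\nabla_w \log P(y|x) = (y - a)\, x$; the learning rate and epoch count are illustrative:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, eta=0.1, num_epochs=100):
    # X: (N, m) inputs (bias column included), y: (N,) labels in {0, 1}
    w = np.zeros(X.shape[1])
    for _ in range(num_epochs):
        for x_i, y_i in zip(X, y):
            a = sigmoid(x_i @ w)             # a = P(y = 1 | x)
            w = w + eta * (y_i - a) * x_i    # ascent on the log-likelihood
    return w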

5. Multilayer Perceptron

Architecture

A multilayer perceptron is a computation graph with multiple fully-connected layers. By stacking multiple layers of perceptrons, the network can learn non-linear decision boundaries.

Multinomial (“Softmax”) Logistic Regression

For multi-class classification, we use the softmax function, which is a “soft” (differentiable) version of “max”:

Figure 6. Softmax Architecture
\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_j e^{z_j}}\]

Loss Function

Assuming one-hot encoding, the cross-entropy loss is:

\[L = -\sum_i y_i \log(\hat{y}_i)\]
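
In PyTorch, the softmax and the cross-entropy loss are typically fused for numerical stability; a small sketch with made-up logits:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])   # net inputs z for one example, 3 classes
probs = F.softmax(logits, dim=1)            # e^{z_i} / sum_j e^{z_j}
target = torch.tensor([0])                  # class index, i.e., one-hot [1, 0, 0]

# F.cross_entropy applies log-softmax internally, so it takes raw logits
loss = F.cross_entropy(logits, target)
# same value as the explicit formula -sum_i y_i log(y_hat_i):
loss_manual = -torch.log(probs[0, 0])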

Solving XOR

Multilayer perceptrons can solve the XOR problem that single-layer perceptrons cannot. A single hidden layer is already sufficient for XOR, demonstrating the power of adding hidden layers.


6. Backpropagation: An Algorithm to Train Models with Hidden Variables

The Training Problem

How can we train a multilayer model when there are no targets or ground truth for the hidden nodes? The solution is backpropagation.

Core Concept

Neural networks are function compositions that can be represented as computation graphs. By applying the chain rule and working in reverse order, we can compute gradients efficiently:

\[\frac{\partial f}{\partial x_i} = \sum_j \frac{\partial f}{\partial z_j} \frac{\partial z_j}{\partial x_i}\]

Computation Graphs

Computation graphs consist of input variables, intermediate computations, and outputs. The chain rule is applied in reverse to propagate gradients.

Forward pass: Compute activations from inputs to outputs

Backward pass: Compute gradients from outputs to inputs, where the local gradient is multiplied by the upstream gradient
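
A tiny sketch of both passes for the composition $a = \sigma(wx + b)$, checking the hand-derived chain rule against PyTorch's autograd (the input values are arbitrary):

import torch

x = torch.tensor(1.5)
w = torch.tensor(0.8, requires_grad=True)
b = torch.tensor(-0.2, requires_grad=True)

# forward pass: compute activations from inputs to outputs
z = w * x + b
a = torch.sigmoid(z)

# backward pass: propagate gradients from the output back to w and b
a.backward()

# local gradient times upstream gradient: da/dw = sigma'(z) * x = a * (1 - a) * x
manual_dw = (a * (1 - a) * x).item()
assert abs(w.grad.item() - manual_dw) < 1e-6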

Weight-Sharing

When the same weights are used in multiple locations in the network, gradients are accumulated from all uses of those weights.


7. PyTorch: Automated Differentiation

PyTorch provides automated differentiation, making it easy to train neural networks through three main steps:

Step 1: Definition - Define model architecture, specifying layers and activation functions

Step 2: Creation - Instantiate the model, initialize optimizer and loss function

Step 3: Training - Perform forward pass to compute predictions, compute loss, perform backward pass to compute gradients using .backward(), update weights using the optimizer, and zero gradients before the next iteration
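
A minimal sketch of the three steps for a small classifier; the layer sizes, hyperparameters, and random data are placeholders:

import torch
import torch.nn as nn

# Step 1: definition
model = nn.Sequential(
    nn.Linear(20, 64),
    nn.ReLU(),
    nn.Linear(64, 3),
)

# Step 2: creation
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = nn.CrossEntropyLoss()

# Step 3: training
X = torch.randn(128, 20)                 # stand-in features
y = torch.randint(0, 3, (128,))          # stand-in class labels
for epoch in range(10):
    logits = model(X)                    # forward pass
    loss = loss_fn(logits, y)            # compute loss
    optimizer.zero_grad()                # zero gradients before backward
    loss.backward()                      # backward pass: compute gradients
    optimizer.step()                     # update weights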


8. Improvements to Optimization

Non-Convex Loss

While linear regression, Adaline, logistic regression, and softmax regression have convex loss functions, deep neural networks typically have non-convex loss landscapes. In practice, we usually end up at different local minima if we repeat the training.

Minibatch Training

Minibatch learning is a form of stochastic gradient descent where each minibatch can be considered a sample drawn from the training set. This means the gradient is noisier, which can be good (providing a chance to escape local minima) or bad (leading to extensive oscillation).

Learning Rate Decay

Due to batch effects, minibatch loss and gradients are approximations, typically resulting in oscillations. To dampen oscillations towards the end of training, we can decay the learning rate. However, decreasing the learning rate too early is dangerous. A practical tip is to try training the model without learning rate decay first, then add it later.

Common variants include the following (a PyTorch scheduler sketch follows the list):

  1. Exponential Decay: $\eta_t := \eta_0 e^{-k \cdot t}$ where $k$ is the decay rate
  2. Halving: $\eta_t := \eta_{t-1}/2$ where $t$ is a multiple of $T_0$ (e.g., $T_0 = 100$)
  3. Inverse decay: $\eta_t := \frac{\eta_0}{1 + k \cdot t}$
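
In PyTorch these decays correspond to learning-rate schedulers; a minimal sketch in which the parameters, learning rate, and decay constants are illustrative:

import torch

params = [torch.nn.Parameter(torch.zeros(2))]   # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.1)

# 1. exponential decay: eta_t = eta_0 * gamma^t, with gamma = e^{-k}
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=0.95)

# 2. halving every T_0 = 100 epochs
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=100, gamma=0.5)

for epoch in range(300):
    # ... one epoch of minibatch updates with optimizer.step() ...
    scheduler.step()    # decay the learning rate once per epoch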

Training with Momentum

The main idea is to dampen oscillations by using “velocity,” the speed of the “movement” from previous updates: the weights not only move in the direction opposite to the current gradient, but also in the weighted average direction of the last few updates:

\[v_t = \gamma v_{t-1} + \eta \nabla_w L\]

\[w_t = w_{t-1} - v_t\]

Nesterov Momentum

Nesterov momentum is an improvement that looks ahead before making the step. Since we already know where the momentum part will push us in this step, we calculate the new gradient with that update in mind.
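
In PyTorch, classical and Nesterov momentum are both options of the SGD optimizer; a short sketch (the parameter list and values are placeholders, and PyTorch folds the learning rate into the velocity update slightly differently than the formula above):

import torch

params = [torch.nn.Parameter(torch.zeros(2))]   # stand-in for model.parameters()

# classical momentum
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9)

# Nesterov momentum: evaluate the gradient at the looked-ahead position
optimizer = torch.optim.SGD(params, lr=0.01, momentum=0.9, nesterov=True)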

Adaptive Learning Rates

The rule of thumb for adaptive learning rates is to decrease learning if the gradient changes its direction and increase learning if the gradient stays consistent.

RMSProp

RMSProp is an unpublished but very popular algorithm by Geoff Hinton, based on Rprop and very similar to AdaDelta. The main idea is to divide the learning rate by an exponentially decaying moving average of the squared gradients (RMS stands for “Root Mean Square”). This takes into account that gradients can vary widely in magnitude.

ADAM (Adaptive Moment Estimation)

ADAM is probably the most widely used optimization algorithm in deep learning. It combines momentum (an exponentially decaying average of the gradient, the first moment) with RMSProp (an exponentially decaying average of the squared gradient, the second moment):

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1)\, \nabla_w L, \qquad v_t = \beta_2 v_{t-1} + (1 - \beta_2)\, (\nabla_w L)^2\]

\[w_t = w_{t-1} - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t, \qquad \hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

where $\hat{m}_t$ and $\hat{v}_t$ are bias-corrected estimates, and typical defaults are $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
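
In PyTorch, both optimizers are available directly; a one-line sketch each (the parameter list and hyperparameter values are placeholders):

import torch

params = [torch.nn.Parameter(torch.zeros(2))]   # stand-in for model.parameters()

optimizer = torch.optim.RMSprop(params, lr=0.001, alpha=0.99)        # alpha = moving-average decay
optimizer = torch.optim.Adam(params, lr=0.001, betas=(0.9, 0.999))   # beta_1, beta_2 as above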


9. Regularization

Where We Are

Good news: We can solve non-linear problems! Bad news: Our multilayer neural networks have lots of parameters and it’s easy to overfit the data.

Parameters vs Hyperparameters

Parameters (learned from data): weights (weight parameters) and biases (bias units)

Hyperparameters (set before training): minibatch size, data normalization schemes, number of epochs, number of hidden layers, number of hidden units, learning rates, loss function, activation function types, regularization schemes, weight initialization schemes, and optimization algorithm type

Overfitting and Underfitting

Underfitting occurs when the model is too simple, resulting in high bias. Overfitting occurs when the model is too complex, resulting in high variance. The goal is to find the right balance.

Bias-Variance Decomposition

\[\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}\]

Bias is the error from incorrect assumptions, while variance is the error from sensitivity to the training set.

General Strategies to Avoid Overfitting

Collecting more data, especially high-quality data, is best and always recommended. Alternative approaches include semi-supervised learning, transfer learning, and self-supervised learning. Data augmentation is helpful, and reducing model capacity can also help.

Data Augmentation

The key idea is that if we know the label shouldn’t depend on a transformation $h(x)$, then we can generate new training data $(h(x^{(i)}), y^{(i)})$. This requires prior knowledge of transformations that the outcome does not depend on. For image classification, these include transformations like rotation, zooming, and sepia filters.
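
For images this is typically done on the fly; a sketch using torchvision transforms, assuming torchvision is installed (the specific transforms and parameter values are illustrative):

from torchvision import transforms

train_transform = transforms.Compose([
    transforms.RandomRotation(degrees=15),                       # small rotations
    transforms.RandomResizedCrop(size=224, scale=(0.8, 1.0)),    # zooming
    transforms.ColorJitter(brightness=0.2, contrast=0.2),        # color changes
    transforms.ToTensor(),
])
# pass train_transform to the Dataset so each epoch sees fresh (h(x), y) pairs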

Reduce Network’s Capacity

The key idea is that the simplest model that matches the outputs should generalize the best. Strategies include choosing a smaller architecture (fewer hidden layers and units), adding dropout, using ReLU with L1 penalty to prune dead activations, enforcing smaller weights through early stopping or L2 norm penalty, and adding noise through dropout. With recent large language models and foundation models, it’s possible to use a large pretrained model and perform efficient fine-tuning.

Early Stopping

Split your dataset into three parts (training, validation, and test sets), using the test set only once at the end and using validation accuracy for tuning. Stop training early by monitoring the gap between training and validation accuracy during training and stopping once that gap starts to grow.
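
A minimal sketch of one common way to implement this (patience on validation accuracy); train_one_epoch and evaluate are hypothetical helpers not defined here, and model, the data loaders, and max_epochs are assumed to exist:

best_val_acc, patience, epochs_without_improvement = 0.0, 10, 0

for epoch in range(max_epochs):
    train_acc = train_one_epoch(model, train_loader)   # hypothetical helper
    val_acc = evaluate(model, val_loader)              # hypothetical helper
    if val_acc > best_val_acc:
        best_val_acc, epochs_without_improvement = val_acc, 0
        # optionally checkpoint the best model here
    else:
        epochs_without_improvement += 1
    if epochs_without_improvement >= patience:
        break   # validation accuracy stopped improving: stop early
# the test set is touched only once, after all tuning is finished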

L2 Regularization

Add a penalty term to the loss function:

\[L_{\text{total}} = L_{\text{data}} + \lambda \sum_i w_i^2\]

This encourages smaller weights and helps prevent overfitting. In PyTorch, use the weight_decay parameter in the optimizer.
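
A one-line sketch: weight_decay adds $\lambda w$ to each gradient, which corresponds to an L2 penalty of the form above (up to a constant factor); the parameter list and the value of $\lambda$ are placeholders:

import torch

params = [torch.nn.Parameter(torch.zeros(2))]   # stand-in for model.parameters()
optimizer = torch.optim.SGD(params, lr=0.01, weight_decay=1e-4)   # weight_decay = lambda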

Dropout

Randomly “drop” (set to zero) activations during training. Each neuron has probability $p$ of being dropped. This forces the network to learn redundant representations and acts as an ensemble of many networks. At test time, use all neurons but scale activations by the keep probability $1 - p$ (inverted dropout instead scales the kept activations by $1/(1-p)$ during training, so no test-time scaling is needed).
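
In PyTorch, dropout is a layer whose behavior is switched by model.train() and model.eval(); a minimal sketch (layer sizes are illustrative):

import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(100, 50),
    nn.ReLU(),
    nn.Dropout(p=0.5),    # each activation is zeroed with probability p during training
    nn.Linear(50, 10),
)

model.train()   # dropout active; PyTorch scales kept activations by 1/(1-p)
model.eval()    # dropout disabled at test time, no extra scaling needed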


10. Normalization

Normalization and Gradient Descent

Normalizing inputs helps gradient descent converge faster by making the optimization landscape more isotropic (circular rather than elliptical).

Problem in Deep Models

Normalizing the inputs only affects the first hidden layer. What about the rest of the layers? Internal covariate shift refers to the problem that the distribution of layer inputs changes during training.

Batch Normalization (“BatchNorm”)

Batch normalization normalizes hidden layer inputs, helps with exploding/vanishing gradient problems, and can increase training stability and convergence rate. It can be understood as additional normalization layers with additional parameters.

Step 1: Normalize Net Inputs

For each minibatch, compute mean and variance, then normalize:

\[\hat{z} = \frac{z - \mu_{\text{batch}}}{\sqrt{\sigma^2_{\text{batch}} + \epsilon}}\]

Step 2: Pre-Activation Scaling

Scale and shift:

\[\tilde{z} = \gamma \hat{z} + \beta\]

where $\gamma$ and $\beta$ are learnable parameters. Technically, a BatchNorm layer could learn to perform standardization with zero mean and unit variance.
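
A minimal sketch of both steps on one minibatch of net inputs (batch size, feature count, and epsilon are illustrative; gamma and beta start as a fresh BatchNorm layer would initialize them):

import torch

z = torch.randn(32, 10)     # minibatch of net inputs: (batch, features)
eps = 1e-5

# Step 1: normalize with per-feature minibatch statistics
mu = z.mean(dim=0)
var = z.var(dim=0, unbiased=False)
z_hat = (z - mu) / torch.sqrt(var + eps)

# Step 2: learnable scale and shift
gamma = torch.ones(10, requires_grad=True)
beta = torch.zeros(10, requires_grad=True)
z_tilde = gamma * z_hat + beta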

BatchNorm and Backpropagation

Gradients flow through normalization, with the chain rule applied to compute gradients with respect to $\gamma$, $\beta$, and inputs.

BatchNorm in PyTorch

Use nn.BatchNorm1d(num_features) for fully-connected layers and nn.BatchNorm2d(num_channels) for convolutional layers.

BatchNorm at Test-Time

Use an exponentially weighted average (moving average) of mean and variance:

running_mean = momentum * running_mean + (1 - momentum) * sample_mean

where momentum in this formulation is typically close to 1 (e.g., 0.9), so the running estimate changes slowly; PyTorch’s BatchNorm layers use the opposite convention, running_mean = (1 - momentum) * running_mean + momentum * sample_mean, with a default momentum of 0.1 (and the same for variance). Alternatively, you can also use global training set mean and variance.

Layer Normalization

Batch normalization calculates mean and standard deviation based on a minibatch, whereas layer normalization calculates mean and standard deviation based on feature/embedding vectors. In statistical language, batch normalization achieves zero mean unit variance, whereas layer normalization projects the feature vector to the unit sphere. Layer normalization is used in Transformers.
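
The difference is just the axis over which the statistics are computed; a short sketch comparing the two PyTorch layers (shapes are illustrative):

import torch
import torch.nn as nn

x = torch.randn(32, 64)     # (batch, features)

bn = nn.BatchNorm1d(64)     # statistics over the batch dimension, per feature
ln = nn.LayerNorm(64)       # statistics over the feature dimension, per example

x_bn = bn(x)    # each feature is normalized across the minibatch
x_ln = ln(x)    # each example's feature vector is normalized on its own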


11. Initialization

Weight Initialization

We can’t initialize all weights to 0 due to the symmetry problem, but we want weights to be relatively small. Traditionally, we can initialize weights by sampling from a random uniform distribution in range $[0, 1]$, or better, $[-0.5, 0.5]$. Alternatively, we could sample from a Gaussian distribution with mean 0 and small variance (e.g., 0.1 or 0.01).

Xavier Initialization

Xavier Glorot and Yoshua Bengio introduced this method in their 2010 paper “Understanding the difficulty of training deep feedforward neural networks.” The method involves:

  1. Initializing weights from a Gaussian or uniform distribution
  2. Scaling the weights proportional to the number of inputs to the layer (for the first hidden layer, that is the number of features in the dataset; for the second hidden layer, that is the number of units in the first hidden layer)

Formula:

\[W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}} + n_{\text{out}}}\right)\]

He Initialization

Kaiming He et al. introduced this method in their 2015 paper “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification.” Xavier initialization assumes zero-mean activations and a derivative of 1 for the activation function (which is reasonable for tanh). For ReLU, the activations are not centered at zero; He initialization takes this into account with an extra scaling factor of $\sqrt{2}$ on the standard deviation:

\[W \sim \mathcal{N}\left(0, \frac{2}{n_{\text{in}}}\right)\]
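
Both schemes are available in torch.nn.init; a minimal sketch (the layer sizes are illustrative):

import torch.nn as nn

layer = nn.Linear(256, 128)

# Xavier/Glorot: variance 2 / (n_in + n_out), suited to tanh
nn.init.xavier_normal_(layer.weight)

# He/Kaiming: variance 2 / n_in, accounts for ReLU zeroing half the activations
nn.init.kaiming_normal_(layer.weight, nonlinearity='relu')

nn.init.zeros_(layer.bias)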

12. Convolutional Neural Networks

Why Images Are Hard

Images present several challenges: high dimensionality (even small images have thousands of pixels), spatial structure (nearby pixels are correlated), and translation invariance (objects can appear anywhere in the image).

Full Connectivity Problem

Full connectivity is problematic for large inputs. For example, $3 \times 200 \times 200$ images imply 120,000 weights per neuron in the first hidden layer. This leads to too many parameters (causing overfitting) and is computationally expensive.

CNNs

Yann LeCun introduced convolutional neural networks in 1989. The key idea is to share parameters: instead of learning position-specific weights, learn weights defined for relative positions. Learn “filters” that are reused across the image to generalize across spatial translation of input. The core concept is to replace matrix multiplication in neural networks with a convolution. Later developments have shown that this approach can work for any graph-structured data, not just images.

Weight Sharing in Kernels

Reused weights are small and the same filter is applied to different regions of the input, dramatically reducing the number of parameters.
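
A small sketch of the savings, reusing the $3 \times 200 \times 200$ example from above (the number of output units/filters is illustrative):

import torch.nn as nn

# fully connected: every output unit sees the whole 3 x 200 x 200 image
fc = nn.Linear(3 * 200 * 200, 16)

# convolutional: 16 filters of size 3 x 3 x 3, reused across the whole image
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)

n_fc = sum(p.numel() for p in fc.parameters())      # 1,920,016 parameters
n_conv = sum(p.numel() for p in conv.parameters())  # 448 parameters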

Pooling: Lossy Compression

Pooling reduces spatial dimensions. Max pooling takes the maximum value in each region, while average pooling takes the average value. Pooling provides translation invariance and reduces computation in subsequent layers.

Main Ideas of CNNs

Sparse-connectivity: A single element in the feature map is connected to only a small patch of pixels (very different from connecting to the whole input image in multilayer perceptrons)

Parameter-sharing: The same weights are used for different patches of the input image

Many layers: Combining extracted local patterns to global patterns

Key Concepts

Sparse Connectivity: Each neuron only connects to a small region of the input, reducing the number of parameters and exploiting local spatial structure.

Receptive Fields: The region of the input that affects a particular neuron. Receptive fields grow over depth, allowing deeper layers to represent more global features.

Parameter Sharing: Same filter weights used across the entire image, enabling detection of features regardless of position and dramatically reducing parameter count.

Impact of Convolutions on Size

Output size depends on input size, kernel size, stride, and padding:

\[\text{output size} = \left\lfloor \frac{\text{input size} - \text{kernel size} + 2 \times \text{padding}}{\text{stride}} \right\rfloor + 1\]
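
A small helper implementing this formula, with a worked example (the function and argument names are illustrative):

def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    # floor((input - kernel + 2 * padding) / stride) + 1
    return (input_size - kernel_size + 2 * padding) // stride + 1

conv_output_size(28, kernel_size=5)              # 24: 5x5 kernel, stride 1, no padding
conv_output_size(28, kernel_size=3, padding=1)   # 28: "same" padding for a 3x3 kernel at stride 1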

Padding

Add zeros around the border of the input to preserve spatial dimensions. “Valid” padding means no padding, while “Same” padding ensures output size equals input size.

Backpropagation in CNNs

The same concept applies as before: multivariable chain rule, now with an additional weight-sharing constraint. Gradients are accumulated across all positions where a filter is applied. Convolution in the forward pass corresponds to transposed convolution in the backward pass.


Summary

We’ve covered the fundamentals of deep learning from perceptrons to CNNs. Key concepts include backpropagation, optimization, regularization, and normalization. Modern deep learning combines all these techniques. Next topics will cover more advanced architectures and generative models.