Lecture 10

Regularization and Generalization

Today’s Topics:

  1. Improving Generalization
  2. Data Augmentation
  3. Early Stopping
  4. L1 and L2 Regularization
  5. Dropout

1. Improving Generalization

Generalization refers to how well a trained model performs on unseen data.
A model that generalizes well captures the true underlying patterns in the dataset instead of memorizing noise from the training set.

Achieving high training accuracy is not enough — the goal is to ensure that the model performs well on new data.
Overfitting occurs when a model is too closely tailored to the training set, leading to poor test performance.

Key strategies to improve generalization:

  - Data augmentation
  - Early stopping
  - L1 and L2 regularization
  - Dropout

Figure 1. Strategies for improving model generalization.

2. Data Augmentation

Data augmentation enlarges the training set by applying label-preserving transformations to existing examples, improving robustness and reducing overfitting.
Useful when labeled data is scarce or costly (e.g., medical imaging).

Common augmentations:

  - Random horizontal or vertical flips
  - Small rotations and random crops
  - Color jitter (brightness, contrast)
  - Adding low-level noise

These simulate real-world variations and encourage invariant feature learning.

Formally, if h denotes a label-preserving transformation, the augmented dataset is

\mathcal{D}_{\mathrm{aug}} = \{ (h(x_i), y_i) \mid (x_i, y_i)\in\mathcal{D} \}
Figure 2. Original vs. augmented data.
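
As a concrete illustration, the following is a minimal sketch of such a pipeline using torchvision; the library choice and the specific transforms and parameters are illustrative assumptions, not part of the lecture:

# Minimal sketch: label-preserving image augmentations with torchvision.
# The transforms and their parameters are illustrative choices.
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # mirror the image half the time
    transforms.RandomRotation(degrees=15),    # small random rotation
    transforms.ColorJitter(brightness=0.2,    # mild photometric changes
                           contrast=0.2),
    transforms.ToTensor(),                    # convert PIL image -> tensor
])

# Applying `augment` to each (x_i, y_i) yields (h(x_i), y_i) in D_aug;
# the label y_i is unchanged because the transforms preserve class identity.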

3. Early Stopping

Early stopping halts training when validation performance stops improving — preventing overfitting.

Procedure

  1. Split data into training/validation/test.
  2. Track validation performance.
  3. Stop training when validation loss stops decreasing.
Figure 3. Early stopping accuracy curve.
Figure 4. Early stopping loss curve.
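
The procedure above can be sketched in a few lines of Python. A PyTorch-style model with state_dict is assumed; train_one_epoch and evaluate are hypothetical placeholders for the usual training utilities, and the patience value is an illustrative choice:

# Minimal early-stopping sketch around a generic training loop.
import copy

def train_with_early_stopping(model, train_one_epoch, evaluate,
                              max_epochs=100, patience=5):
    best_val_loss = float("inf")
    best_weights = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)            # fit on the training split
        val_loss = evaluate(model)        # track validation performance
        if val_loss < best_val_loss:      # validation still improving
            best_val_loss = val_loss
            best_weights = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break                     # stop: no improvement for `patience` epochs

    model.load_state_dict(best_weights)   # restore the best checkpoint
    return model

Waiting for several non-improving epochs (the patience) avoids stopping on noisy single-epoch fluctuations in the validation loss.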

4. L1 and L2 Regularization

Regularization penalizes large weights, encouraging simpler models and preventing overfitting.

4.1 L1 Regularization (Lasso)

L1 regularization adds the sum of absolute weight values to the loss:

\mathcal{L}_{L1}=\frac{1}{n}\sum_{i=1}^n L(y^{[i]},\hat{y}^{[i]}) + \frac{\lambda}{n}\sum_j |w_j|

Because the penalty's gradient has constant magnitude regardless of weight size, it tends to drive many weights exactly to zero, producing sparse models.
Figure 5. L1 regularization promotes sparsity.
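
A minimal sketch of how the L1 term can be added to an ordinary training loss is shown below; a PyTorch-style model is assumed, and lam plays the role of λ:

# Sketch: L1-regularized loss. `model`, `loss_fn`, `x`, `y` are assumed to be
# a PyTorch-style model, a loss function that averages over the batch, and a
# batch of inputs/targets; `lam` is the regularization strength.
def l1_regularized_loss(model, loss_fn, x, y, lam, n):
    data_loss = loss_fn(model(x), y)                              # (1/n) sum_i L(y_i, y_hat_i)
    l1_penalty = sum(p.abs().sum() for p in model.parameters())   # sum_j |w_j|
    return data_loss + (lam / n) * l1_penalty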

4.2 L2 Regularization (Ridge)

L2 regularization adds the sum of squared weights to the loss:

\mathcal{L}_{L2}=\frac{1}{n}\sum_{i=1}^n L(y^{[i]},\hat{y}^{[i]}) + \frac{\lambda}{n}\sum_j w_j^2

The corresponding gradient-descent update shrinks each weight toward zero on every step ("weight decay"):

w_{i,j} := w_{i,j} - \eta\!\left(\frac{\partial L}{\partial w_{i,j}} + \frac{2\lambda}{n}w_{i,j}\right)
Figure 6. L2 regularization smooths weight distribution.
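
In practice the L2 term is often folded into the optimizer as weight decay; a minimal sketch follows, with PyTorch assumed and the layer sizes and hyperparameters chosen purely for illustration:

# Sketch: L2 regularization via SGD's weight_decay argument, which adds
# weight_decay * w to each gradient, matching the update rule above up to
# constant factors. Layer sizes, lr, and weight_decay are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(20, 64), nn.ReLU(), nn.Linear(64, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)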

5. Dropout

Dropout randomly removes a fraction of neurons during training — forcing the network to learn redundant, distributed representations.

Why it works

Because a different random subset of neurons is active at each training step, no neuron can rely on the presence of specific partners; this discourages co-adaptation and spreads information across many units. Training can also be viewed as fitting an ensemble of thinned sub-networks that share weights. Formally, each hidden activation h_i is multiplied by a Bernoulli mask z_i:

\tilde{h}_i = z_i h_i,\quad z_i = \begin{cases} 0 & \text{with probability } p \\ 1 & \text{with probability } 1-p \end{cases}
Figure 7. Dropout training process.
Figure 8. Dropout improves validation accuracy.
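
A minimal sketch of the masking rule, together with the equivalent built-in layer, is shown below; PyTorch is assumed and p = 0.5 is an illustrative choice. The manual version uses "inverted" scaling by 1/(1-p) during training so that no rescaling is needed at test time:

# Sketch: dropout as a random Bernoulli mask on hidden activations.
import torch
import torch.nn as nn

def manual_dropout(h, p=0.5, training=True):
    if not training:
        return h                              # use the full network at test time
    z = (torch.rand_like(h) > p).float()      # z_i = 0 with prob. p, 1 with prob. 1-p
    return z * h / (1.0 - p)                  # rescale to preserve expected activation

# Equivalent built-in layer inside a small network:
model = nn.Sequential(nn.Linear(100, 64), nn.ReLU(),
                      nn.Dropout(p=0.5), nn.Linear(64, 10))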

Summary

Regularization and generalization methods — data augmentation, early stopping, L1/L2 regularization, and dropout — are essential to ensure models learn meaningful patterns rather than noise.
They enable robust generalization and stable performance on unseen data.


© 2025 University of Wisconsin — STAT 453 Lecture Notes