Lecture 10

Regularization and Generalization

Today’s Topics:

  0. From last lecture: Kaggle Cats & Dogs + DataLoader
  1. Improving generalization
  2. Data augmentation
  3. Early stopping
  4. L1 and L2 regularization
  5. Dropout

0. From last lecture: Kaggle Cats & Dogs + DataLoader

Kaggle is a platform for hosting datasets and machine learning competitions. Organizations (companies, research groups, or nonprofits) upload datasets and define an evaluation metric. Participants train models and submit predictions; the best-performing models appear at the top of the leaderboard and may receive prizes.

Reference notebook (VGG16 on Cats vs. Dogs):

Why PyTorch Dataset and DataLoader are needed

Real-world datasets (especially images) are often too large to load into memory at once. PyTorch therefore separates data handling into two components:

  1. Dataset: defines how to access a single sample (and its label) by index.
  2. DataLoader: wraps a Dataset and handles batching, shuffling, and parallel loading with worker processes.

This design enables efficient, scalable training by streaming data batch-by-batch instead of loading the full dataset at once, as the sketch below illustrates. Custom DataLoader example notebook:
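
As a concrete illustration, here is a minimal sketch of a custom Dataset plus DataLoader for an image-folder layout like the Kaggle Cats vs. Dogs data. The `CatsDogsDataset` name, the `data/train` path, and the filename convention are assumptions for illustration, not the reference notebook's actual code.

```python
import os
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

class CatsDogsDataset(Dataset):
    """Loads one image from disk per __getitem__ call instead of holding the whole set in memory."""

    def __init__(self, img_dir, transform=None):
        self.img_dir = img_dir
        # Assumed filename convention: class name is the prefix, e.g. "cat.0.jpg" / "dog.0.jpg".
        self.img_names = sorted(os.listdir(img_dir))
        self.transform = transform

    def __len__(self):
        return len(self.img_names)

    def __getitem__(self, idx):
        name = self.img_names[idx]
        img = Image.open(os.path.join(self.img_dir, name)).convert("RGB")
        label = 0 if name.startswith("cat") else 1
        if self.transform is not None:
            img = self.transform(img)
        return img, label

# The DataLoader wraps the Dataset and streams shuffled mini-batches.
transform = transforms.Compose([transforms.Resize((224, 224)), transforms.ToTensor()])
train_ds = CatsDogsDataset("data/train", transform=transform)   # hypothetical path
train_loader = DataLoader(train_ds, batch_size=32, shuffle=True, num_workers=2)

for images, labels in train_loader:
    # Each iteration yields one batch: images of shape [32, 3, 224, 224], labels of shape [32].
    break
```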

1. Improving Generalization

Generalization refers to how well a trained model performs on unseen data.
A model that generalizes well captures the true underlying patterns in the dataset instead of memorizing noise from the training set.

Achieving high training accuracy is not enough — the goal is to ensure that the model performs well on new data.
Overfitting occurs when a model is too closely tailored to the training set, leading to poor test performance.

Key strategies to improve generalization:

- Data augmentation
- Reducing model capacity and early stopping
- L1 and L2 regularization (weight penalties)
- Dropout

Figure 1. Strategies for improving model generalization.

2. Data Augmentation

Data augmentation increases dataset size by generating label-preserving transformations — improving robustness and reducing overfitting.
Useful when labeled data is scarce or costly (e.g., medical imaging).

Common augmentations:

- Random horizontal flips
- Rotations and translations
- Random crops and rescaling
- Color jitter (brightness, contrast, saturation changes)
- Adding small amounts of noise

These simulate real-world variations and encourage invariant feature learning.

\[
\mathcal{D}_{\mathrm{aug}} = \{ (h(x_i), y_i) \mid (x_i, y_i)\in\mathcal{D} \},
\]

where \(h\) is a label-preserving transformation (e.g., a random flip or rotation) applied to each input.

Figure 2. Original vs. augmented data.
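
One possible way to implement \(h\) in PyTorch is with torchvision transforms; the specific transforms and parameter values below are illustrative choices, not a prescribed recipe.

```python
from torchvision import transforms

# Each random transform plays the role of h(x) above: the image changes, the label does not.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=15),
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.ToTensor(),
])

# Validation and test images are only resized, never augmented.
eval_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
])
```

Because a fresh random transform is sampled every time a training image is loaded, the augmented dataset is generated on the fly and no extra copies need to be stored on disk.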

3. Early Stopping

Reducing Model Capacity

Another way to improve generalization is to reduce the effective capacity of the model.

Common approaches include:

- Using fewer layers or fewer hidden units per layer
- Sharing parameters across parts of the network (e.g., convolutional weight sharing)
- Pruning weights or units after training

In modern deep learning, it is common to start with a large pretrained model and control capacity through regularization and selective fine-tuning rather than training small models from scratch.
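
For example, capacity can be limited by freezing most of a pretrained network and training only a small new head. The sketch below assumes torchvision's VGG16 (as in the reference notebook) and a 2-class cats-vs-dogs output; the details are illustrative.

```python
import torch.nn as nn
from torchvision import models

# Load a pretrained VGG16 and freeze its convolutional feature extractor,
# so only the (much smaller) classifier head receives gradient updates.
model = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
for param in model.features.parameters():
    param.requires_grad = False

# Replace the final 1000-class ImageNet layer with a 2-class head for cats vs. dogs.
model.classifier[6] = nn.Linear(model.classifier[6].in_features, 2)
```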

Early Stopping

Early stopping halts training when validation performance stops improving — preventing overfitting.

Procedure

  1. Split data into training/validation/test.
  2. Track validation performance.
  3. Stop training when validation loss stops decreasing.
Figure 3. Early stopping accuracy curve.
Figure 4. Early stopping loss curve.
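
A minimal sketch of this procedure, assuming hypothetical `train_one_epoch` and `evaluate` helpers that run one training epoch and return the validation loss; the patience value is illustrative.

```python
import copy

def fit_with_early_stopping(model, train_one_epoch, evaluate, patience=5, max_epochs=100):
    """Stop once validation loss has not improved for `patience` consecutive epochs."""
    best_val_loss = float("inf")
    best_state = copy.deepcopy(model.state_dict())
    epochs_without_improvement = 0

    for epoch in range(max_epochs):
        train_one_epoch(model)        # train on the training split
        val_loss = evaluate(model)    # step 2: track validation performance

        if val_loss < best_val_loss:  # validation loss is still decreasing
            best_val_loss = val_loss
            best_state = copy.deepcopy(model.state_dict())
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:  # step 3: stop training
                break

    model.load_state_dict(best_state)  # keep the weights from the best validation epoch
    return model
```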

4. L1 and L2 Regularization

Regularization penalizes large weights, encouraging simpler models and preventing overfitting. L1 regularization corresponds to the LASSO, while L2 regularization corresponds to ridge regression.

4.1 L1 Regularization (Lasso)

\(\mathcal{L}_{L1} = \frac{1}{n}\sum_{i=1}^{n} L(y^{[i]}, \hat{y}^{[i]}) + \frac{\lambda}{n}\sum_{j} |w_j|\)

- L1 Regularization (Lasso) adds the sum of the absolute values of all weights as a penalty term to the loss function, helping to control model complexity.

- It drives many weights to zero, producing a sparse solution and enabling automatic feature selection.

- Geometrically, the circular contours represent the cost function, while the diamond shape represents the L1 constraint boundary. The model seeks a balance between minimizing the loss and satisfying the constraint.

- Because the diamond has sharp corners that align with the coordinate axes, the optimal point often lies on an axis, causing some weights to become exactly zero.

- This regularization prevents overfitting and makes the resulting model simpler and more interpretable.

Figure 5. L1 regularization promotes sparsity.
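
PyTorch optimizers do not expose an L1 penalty option, so one common approach is to add the penalty term to the loss by hand. Below is a minimal sketch of the objective above; for simplicity it penalizes all parameters, including biases.

```python
import torch

def l1_regularized_loss(data_loss_fn, outputs, targets, model, lam, n):
    """Data loss plus (lambda / n) * sum_j |w_j|, mirroring the L1 objective above."""
    data_loss = data_loss_fn(outputs, targets)
    l1_penalty = sum(param.abs().sum() for param in model.parameters())
    return data_loss + (lam / n) * l1_penalty

# Example usage inside a training step (model, logits, labels, and n are assumed to exist):
# loss = l1_regularized_loss(torch.nn.functional.cross_entropy, logits, labels, model, lam=1e-4, n=n)
# loss.backward()
```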

4.2 L2 Regularization (Ridge)

\[\mathcal{L}_{L2} = \frac{1}{n}\sum_{i=1}^{n} L(y^{[i]}, \hat{y}^{[i]}) + \frac{\lambda}{n}\sum_{j} w_j^2\]

The corresponding gradient-descent weight update is:

\[
w_{i,j} := w_{i,j} - \eta \left(\frac{\partial L}{\partial w_{i,j}} + \frac{2\lambda}{n} w_{i,j}\right)
\]

- L2 Regularization (Ridge) adds the sum of squared weights as a penalty term to the loss function, discouraging large coefficients.

- Unlike L1, it does not force weights to zero, but instead shrinks them toward smaller values, distributing the influence more evenly among features.

- Geometrically, the circular contour lines represent the cost function, while the gray circle represents the L2 constraint.

- The optimal point lies where the cost contour just touches the constraint circle, balancing the trade-off between minimizing the cost and the penalty term.

- Because the circle has no sharp corners, all weights are shrunk but rarely become exactly zero, which means L2 encourages weight smoothing rather than sparsity.

Figure 6. L2 regularization smooths weight distribution.
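
In PyTorch, this gradient-side penalty is usually applied through the optimizer's `weight_decay` argument rather than by adding the squared-weight term to the loss. A brief sketch follows; the stand-in model and the decay value are illustrative, and PyTorch folds the \(2\lambda/n\) constant into a single coefficient.

```python
import torch
import torch.nn as nn

model = nn.Linear(100, 2)  # stand-in model for illustration

# weight_decay adds a term proportional to each weight to its gradient,
# playing the role of the extra (2*lambda/n) * w_ij term in the update rule above.
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)
```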

5. Dropout

Dropout randomly removes a fraction of neurons during training — forcing the network to learn redundant, distributed representations.

Why it works

- Any neuron can be dropped at any step, so units cannot co-adapt and must learn features that are useful on their own.
- Each mini-batch effectively trains a different "thinned" subnetwork; at test time the full network behaves like an approximate ensemble average of these subnetworks.

During training, each hidden activation \(h_i\) is multiplied by an independent binary mask \(z_i\):

\[
\tilde{h}_i = z_i h_i, \quad z_i =
\begin{cases}
0 & \text{with probability } p \\
1 & \text{with probability } 1-p
\end{cases}
\]
Figure 7. Dropout training process.
Figure 8. Dropout improves validation accuracy.
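
A minimal sketch of dropout in a small PyTorch MLP; the layer sizes and drop probability are illustrative. `nn.Dropout` is active only in training mode and becomes a no-op in evaluation mode (PyTorch uses inverted dropout, rescaling activations by \(1/(1-p)\) during training).

```python
import torch
import torch.nn as nn

# Dropout zeroes each hidden activation with probability p during training,
# i.e. it applies the Bernoulli mask z_i from the formula above.
mlp = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)

mlp.train()            # training mode: random units are dropped on every forward pass
train_logits = mlp(x)

mlp.eval()             # evaluation mode: dropout is disabled, the full network is used
eval_logits = mlp(x)
```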

Summary

Regularization and generalization methods, such as data augmentation, early stopping, L1/L2 weight penalties, and dropout, are essential to ensure models learn meaningful patterns rather than noise.
They enable robust generalization and stable performance on unseen data.


© 2025 University of Wisconsin — STAT 453 Lecture Notes