Lecture 16 - Autoencoders and Variational Autoencoders (VAEs)
Detailed lecture notes exploring Autoencoders, their variants, and Variational Autoencoders in Deep Generative Models.
1. Introduction and Overview
This lecture focuses on Deep Generative Models (DGMs) — models designed to learn the underlying data distribution, enabling both prediction and generation of new samples. We move from discriminative models, which model $P(Y|X)$, to generative models, which model $P(X, Y)$ or $P(X)$.
Autoencoders (AEs) and Variational Autoencoders (VAEs) form the foundation of this shift, combining unsupervised representation learning with probabilistic inference.
2. Discriminative vs. Generative Modeling
| Model Type | Learns | Objective | Examples |
|---|---|---|---|
| Discriminative | $P(Y \mid X)$ | Classify or predict outcomes | Logistic Regression, CNNs |
| Generative | $P(X, Y)$ or $P(X \mid Y)$ | Model the data generation process | Autoencoders, VAEs, GANs |
In generative models, latent variables $z$ represent hidden structure in the data, making the following computations challenging:
\[p_\theta(x) = \int p_\theta(x, z) \, dz, \qquad p_\theta(z|x) \propto p_\theta(x|z)\, p(z)\]
Because $z$ is unobserved, both the marginal likelihood and posterior inference are intractable in complex data, requiring approximate inference.
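To make this concrete, here is a minimal sketch on an assumed toy model (a 1-D latent variable with Gaussian prior and likelihood; not from the lecture): even here, the marginal likelihood has to be estimated by averaging $p_\theta(x|z)$ over prior samples, and this naive estimator scales poorly once the decoder is a deep network over a high-dimensional $z$, which is what motivates approximate inference.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))

# Toy model (assumption for illustration): p(z) = N(0, 1), p(x|z) = N(z, 0.5^2)
x_obs = 1.3
z = rng.standard_normal(100_000)                       # samples from the prior p(z)
p_x_mc = gaussian_pdf(x_obs, mean=z, std=0.5).mean()   # Monte Carlo estimate of p(x)

# Closed form for this toy model: marginally x ~ N(0, 1 + 0.5^2)
p_x_exact = gaussian_pdf(x_obs, mean=0.0, std=np.sqrt(1.25))
print(p_x_mc, p_x_exact)
```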
3. Deep Generative Models (DGMs)
DGMs use neural networks to parameterize probability distributions over observed data $x$ and latent variables $z$. The “deep” structure arises from multiple layers of hidden variables that represent hierarchical abstractions.
Key ideas:
- Learn probabilistic mappings between $x$ and $z$.
- Use neural networks for non-linear transformations.
- Combine deep learning’s representational power with probabilistic reasoning.
4. Autoencoders: Concept and Motivation
An Autoencoder (AE) is an unsupervised neural network trained to reproduce its input. It compresses the input into a latent representation (code) and reconstructs the input from this compressed form.
Applications:
- Dimensionality reduction and data visualization
- Noise removal and feature denoising
- Data compression and reconstruction
- Feature learning for downstream tasks
5. Architecture of an Autoencoder
An autoencoder consists of:
- Encoder: $h = g(x)$ — compresses the input into latent representation.
- Decoder: $\hat{x} = f(h) = f(g(x))$ — reconstructs the input.
Training objective: \(L(x, \hat{x}) = ||x - \hat{x}||^2\)
By minimizing reconstruction loss, the network learns to capture meaningful low-dimensional structure.
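A minimal sketch of this architecture in PyTorch, with illustrative layer sizes (784-dimensional inputs such as flattened 28×28 images, a 32-dimensional code); the sizes and the dummy batch are assumptions, not prescribed by the lecture:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        self.encoder = nn.Sequential(          # h = g(x)
            nn.Linear(input_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        self.decoder = nn.Sequential(          # x_hat = f(h)
            nn.Linear(latent_dim, 128), nn.ReLU(),
            nn.Linear(128, input_dim),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = Autoencoder()
criterion = nn.MSELoss()                       # L(x, x_hat) = ||x - x_hat||^2
x = torch.rand(64, 784)                        # dummy mini-batch
loss = criterion(model(x), x)
loss.backward()
```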
6. Autoencoders vs PCA
While PCA performs linear dimensionality reduction, AEs learn non-linear mappings through neural activations. PCA can be recovered as a special case: a single-hidden-layer AE with linear activations and squared-error loss learns the same subspace that PCA finds.
Advantages of AEs:
- Capture complex non-linear manifolds
- Can be stacked to form Deep Autoencoders
- Support feature learning and transfer to other models
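To make the connection concrete, the sketch below (assumed synthetic data, sklearn's PCA, and an illustrative training budget) trains a linear single-hidden-layer AE with squared-error loss and compares its reconstruction error to PCA with the same number of components; the two should be close.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10)) @ rng.standard_normal((10, 10))  # correlated data
X = X - X.mean(axis=0)

# PCA reconstruction error with 3 components
pca = PCA(n_components=3).fit(X)
err_pca = np.mean((X - pca.inverse_transform(pca.transform(X))) ** 2)

# Linear AE with a 3-unit bottleneck and no nonlinearity
Xt = torch.tensor(X, dtype=torch.float32)
ae = nn.Sequential(nn.Linear(10, 3, bias=False), nn.Linear(3, 10, bias=False))
opt = torch.optim.Adam(ae.parameters(), lr=1e-2)
for _ in range(2000):
    opt.zero_grad()
    loss = ((ae(Xt) - Xt) ** 2).mean()
    loss.backward()
    opt.step()

print(err_pca, loss.item())   # the two reconstruction errors should be close
```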
7. Autoencoder Variants
7.1 Denoising Autoencoders (DAE)
Train the network to reconstruct the original, clean data from a corrupted input, e.g. by adding Gaussian noise or randomly zeroing entries (dropout-style corruption).
This improves robustness and generalization.
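A minimal training-step sketch, assuming an illustrative fully connected autoencoder and Gaussian input corruption; the key point is that the loss compares the reconstruction against the clean input:

```python
import torch
import torch.nn as nn

# Small stand-in autoencoder (sizes are illustrative)
model = nn.Sequential(
    nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32),   # encoder
    nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784),   # decoder
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_clean = torch.rand(64, 784)                        # dummy clean batch
x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)  # Gaussian corruption
# (dropout-style corruption would instead zero out random input entries)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x_noisy), x_clean)  # target is the *clean* input
loss.backward()
optimizer.step()
```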
7.2 Dropout Autoencoders
Add dropout layers to force redundancy in learned features, preventing overfitting and improving noise tolerance.
7.3 Sparse Autoencoders
Encourage sparsity in activations using an L1 penalty. Only a few neurons activate for each input, leading to interpretable latent representations.
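A minimal sketch of the sparsity penalty, with illustrative layer sizes and an assumed penalty weight: the L1 term on the latent activations is simply added to the reconstruction loss.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)
h = encoder(x)                        # latent activations
x_hat = decoder(h)

sparsity_weight = 1e-3                # illustrative choice
loss = nn.functional.mse_loss(x_hat, x) + sparsity_weight * h.abs().mean()  # L1 on activations
loss.backward()
```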
7.4 Variational Autoencoders (VAEs)
Introduce a probabilistic latent space, enabling smooth interpolation and random sampling of new data points.
8. Variational Autoencoders (VAEs): Theory and Intuition
VAEs model the data generation process as:
- Sample latent variable $z \sim p(z)$ (e.g., $\mathcal{N}(0, I)$).
- Generate data $x$ from the conditional distribution $p_\theta(x|z)$.
The encoder approximates the posterior $q_\phi(z|x)$ using a neural network that outputs mean and variance parameters $(\mu, \sigma)$.
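A minimal sketch of such an encoder head, with illustrative sizes: one shared body feeds two linear heads for the mean and the log-variance (predicting $\log\sigma^2$ keeps the variance positive without explicit constraints).

```python
import torch
import torch.nn as nn

class VAEEncoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=20):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu_head = nn.Linear(256, latent_dim)       # mu(x)
        self.logvar_head = nn.Linear(256, latent_dim)   # log sigma^2(x)

    def forward(self, x):
        h = self.body(x)
        return self.mu_head(h), self.logvar_head(h)

mu, logvar = VAEEncoder()(torch.rand(64, 784))          # parameters of q_phi(z|x)
```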
We cannot compute $\log p_\theta(x)$ directly because the marginalization over $z$ is intractable.
By introducing an approximate posterior $q_\phi(z|x)$ and applying Jensen's inequality:
\[\log p_\theta(x) = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \geq \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right]\]
This lower bound is the Evidence Lower Bound (ELBO), which VAEs maximize during training.
Objective (Evidence Lower Bound - ELBO):
\[\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) \,\|\, p(z))\]
- First term → reconstruction accuracy
- Second term → regularization ensuring latent space continuity
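A minimal sketch of the ELBO as a training loss (negated, since optimizers minimize), assuming a Bernoulli decoder so that the reconstruction term is a binary cross-entropy, and using the closed-form KL divergence between $\mathcal{N}(\mu, \sigma^2)$ and $\mathcal{N}(0, I)$:

```python
import torch
import torch.nn.functional as F

def negative_elbo(x, x_logits, mu, logvar):
    # E_q[log p(x|z)], approximated with one z-sample already decoded to x_logits
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    # D_KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian posterior
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

# Dummy usage (stand-in tensors for illustration)
x = torch.rand(64, 784)            # e.g. pixel intensities in [0, 1]
x_logits = torch.randn(64, 784)    # stand-in decoder output
mu, logvar = torch.randn(64, 20), torch.zeros(64, 20)
loss = negative_elbo(x, x_logits, mu, logvar)
```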
9. The Reparameterization Trick
To allow gradients to flow through random sampling, we use:
\[z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]
This separates the stochasticity from the deterministic computation, making training possible with backpropagation.
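A minimal sketch of the trick, assuming the encoder outputs $\mu$ and $\log\sigma^2$: sampling becomes a deterministic function of the encoder outputs plus external noise, so gradients flow back into $\mu$ and $\log\sigma^2$ through $z$.

```python
import torch

def reparameterize(mu, logvar):
    std = torch.exp(0.5 * logvar)       # sigma = exp(0.5 * log sigma^2)
    eps = torch.randn_like(std)         # epsilon ~ N(0, I), carries no gradient
    return mu + std * eps               # z = mu + sigma * eps

mu = torch.zeros(4, 20, requires_grad=True)
logvar = torch.zeros(4, 20, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                      # gradients reach mu and logvar
```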
10. Generating New Samples with VAEs
After training:
- Sample $z \sim \mathcal{N}(0, I)$
- Decode with $x = f_\theta(z)$
The result is a newly generated data sample similar to the training distribution — for instance, a new handwritten digit when trained on MNIST.
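A minimal sketch of generation, assuming a trained decoder network (a stand-in is used here) that maps a 20-dimensional $z$ to 784 pixel logits:

```python
import torch
import torch.nn as nn

decoder = nn.Sequential(nn.Linear(20, 256), nn.ReLU(), nn.Linear(256, 784))  # stand-in decoder

z = torch.randn(16, 20)                  # z ~ N(0, I)
x_new = torch.sigmoid(decoder(z))        # 16 generated samples with values in [0, 1]
images = x_new.view(16, 28, 28)          # e.g. reshape to MNIST-sized digits
```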
11. Applications
- Image denoising (e.g., Gondara 2016, medical imaging)
- Data generation and augmentation
- Anomaly detection via reconstruction error (see the sketch after this list)
- Feature learning for unsupervised tasks
- Data compression and transfer learning
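As an illustration of the anomaly-detection use noted above, here is a minimal sketch (the stand-in autoencoder and the 3-sigma threshold are assumptions for illustration): samples the model reconstructs poorly are flagged as anomalous.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 32), nn.ReLU(), nn.Linear(32, 784))  # stand-in trained AE

x = torch.rand(100, 784)                                    # batch to score
with torch.no_grad():
    errors = ((model(x) - x) ** 2).mean(dim=1)              # per-sample reconstruction error

threshold = errors.mean() + 3 * errors.std()                # e.g. a 3-sigma rule
anomalies = (errors > threshold).nonzero(as_tuple=True)[0]  # indices of flagged samples
```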
12. Summary and Takeaways
- Autoencoders learn compact latent representations for reconstruction and feature extraction.
- Denoising and Sparse variants enhance robustness and interpretability.
- Variational Autoencoders combine neural networks with probabilistic inference, enabling sample generation.
- The reparameterization trick enables stochastic training with gradient descent.
- These architectures are foundational to modern generative AI models (GANs, diffusion models).
References
- Michelucci, U. "Modern Autoencoders: Theory and Applications." arXiv preprint arXiv:2201.03898, 2022.
- Gondara, L. “Medical Image Denoising Using Convolutional Denoising Autoencoders.” IEEE ICDMW, 2016.
- Kingma, D. P., and Welling, M. “Auto-Encoding Variational Bayes.” arXiv preprint arXiv:1312.6114, 2013.