Lecture 16
Autoencoders and Variational Autoencoders (VAEs)
Today’s Topics:
- 1. Introduction and Overview
- 2. Discriminative vs. Generative Modeling
- 3. Deep Generative Models (DGMs)
- 4. Autoencoders: Concept and Motivation
- 5. Architecture of an Autoencoder
- 6. Autoencoders vs PCA
- 7. Autoencoder Variants
- 8. Variational Autoencoders (VAEs): Theory and Intuition
- 9. The Reparameterization Trick
- 10. Generating New Samples with VAEs
- 11. Applications
- 12. Summary and Takeaways
- References
1. Introduction and Overview
This lecture focuses on Deep Generative Models (DGMs) — models designed to learn the underlying data distribution, enabling both prediction and generation of new samples. We move from discriminative models, which model $P(Y|X)$, to generative models, which model $P(X, Y)$ or $P(X)$.
Autoencoders (AEs) and Variational Autoencoders (VAEs) form the foundation of this shift, combining unsupervised representation learning with probabilistic inference.
2. Discriminative vs Generative Modeling
| Model Type | Learns | Objective | Examples |
|---|---|---|---|
| Discriminative | $P(Y \mid X)$ | Classify or predict outcomes | Logistic Regression, CNNs |
| Generative | $P(X, Y)$ or $P(X \mid Y)$ | Model data generation process | Autoencoders, VAEs, GANs |
In generative models, latent variables $z$ represent hidden structure in the data. Because $z$ is unobserved, both the marginal likelihood $p_{\theta}(x) = \int p_{\theta}(z)\, p_{\theta}(x \mid z)\, dz$ and the posterior $p_{\theta}(z \mid x)$ are intractable for complex data, requiring approximate inference.
3. Deep Generative Models (DGMs)
DGMs use neural networks to parameterize probability distributions over observed data $x$ and latent variables $z$. The “deep” structure arises from multiple layers of hidden variables that represent hierarchical abstractions.
Key ideas:
- Learn probabilistic mappings between $x$ and $z$.
- Use neural networks for non-linear transformations.
- Combine deep learning’s representational power with probabilistic reasoning.
- Latent variables $z$ capture hidden structure that explains high-dimensional observations $x$.
4. Autoencoders: Concept and Motivation
An Autoencoder (AE) is an unsupervised neural network (no labels required) trained to reproduce its input. It compresses the input into a latent representation (code) and reconstructs the input from this compressed form.
Applications:
- Dimensionality reduction and data visualization
- Noise removal and feature denoising
- Data compression and reconstruction
- Feature learning for downstream tasks
5. Architecture of an Autoencoder
An autoencoder consists of:
- Encoder: $h = g(x)$ — compresses the input into a latent representation.
- Decoder: $\hat{x} = f(h) = f(g(x))$ — reconstructs the input.
Training objective: minimize the reconstruction error between the input and its reconstruction,
\[\mathcal{L}(x, \hat{x}) = \lVert x - f(g(x)) \rVert^2\]
By minimizing reconstruction loss, the network learns to capture meaningful low-dimensional structure.
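A minimal sketch of this architecture, assuming PyTorch; the `Autoencoder` class name, layer sizes, and random batch are illustrative, not prescribed by the lecture:

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder g: compresses x into the latent code h
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder f: reconstructs x_hat from h
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        h = self.encoder(x)           # h = g(x)
        return self.decoder(h)        # x_hat = f(g(x))

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)               # stand-in for a batch of flattened 28x28 images
optimizer.zero_grad()
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)   # reconstruction loss ||x - x_hat||^2
loss.backward()
optimizer.step()
```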
6. Autoencoders vs PCA
While PCA performs linear dimensionality reduction, AEs learn non-linear mappings through neural activations. PCA can be seen as a special case of an AE with a single linear hidden layer; a small numerical sketch follows the list below.
Advantages of AEs:
- Capture complex non-linear manifolds
- Can be stacked to form Deep Autoencoders
- Support feature learning and transfer to other models
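To make the PCA connection concrete, here is a small numerical sketch (assuming PyTorch, scikit-learn, and toy, approximately centered data): a bias-free linear autoencoder with $k$ latent units, trained with MSE, approaches the reconstruction error of PCA with $k$ components, since PCA gives the optimal rank-$k$ linear reconstruction.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.decomposition import PCA

X = np.random.randn(500, 20).astype(np.float32)   # toy data, approximately zero-mean
k = 5

# Reconstruction error of PCA with k components
pca = PCA(n_components=k).fit(X)
X_pca = pca.inverse_transform(pca.transform(X))
err_pca = np.mean((X - X_pca) ** 2)

# Linear AE: one linear encoder layer, one linear decoder layer, MSE loss, no biases
enc = nn.Linear(20, k, bias=False)
dec = nn.Linear(k, 20, bias=False)
opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()), lr=1e-2)
Xt = torch.from_numpy(X)
for _ in range(2000):
    opt.zero_grad()
    loss = ((dec(enc(Xt)) - Xt) ** 2).mean()
    loss.backward()
    opt.step()

print(err_pca, loss.item())   # the AE error approaches the PCA error from above
```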
7. Autoencoder Variants
7.1 Denoising Autoencoders (DAE)
Train the network to reconstruct the original data from a noisy input, e.g. by adding Gaussian noise or dropout.
This improves robustness and generalization.
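A sketch of one denoising training step, assuming PyTorch; the corruption level and the tiny stand-in network are illustrative:

```python
import torch
import torch.nn as nn

model = nn.Sequential(                      # tiny stand-in autoencoder
    nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x_clean = torch.rand(64, 784)               # stand-in batch of flattened images
x_noisy = x_clean + 0.3 * torch.randn_like(x_clean)         # additive Gaussian noise
# Alternative corruption: randomly zero out inputs (dropout-style masking)
# x_noisy = x_clean * (torch.rand_like(x_clean) > 0.3).float()

optimizer.zero_grad()
x_hat = model(x_noisy)                      # reconstruct from the corrupted input
loss = nn.functional.mse_loss(x_hat, x_clean)   # target is the *clean* input
loss.backward()
optimizer.step()
```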
7.2 Dropout Autoencoders
Add dropout layers to force redundancy in learned features, preventing overfitting and improving noise tolerance.
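A minimal sketch of this idea, assuming PyTorch; the dropout rate and layer sizes are illustrative:

```python
import torch.nn as nn

# A Dropout layer inside the encoder randomly zeroes activations during training,
# forcing the network to learn redundant, noise-tolerant features.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU(), nn.Dropout(p=0.5))
decoder = nn.Linear(64, 784)
```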
7.3 Sparse Autoencoders
Encourage sparsity in activations using an L1 penalty. Only a few neurons activate for each input, leading to interpretable latent representations.
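A sketch of the sparse-AE objective, assuming PyTorch; the penalty weight `lambda_sparse` and the stand-in networks are illustrative:

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())   # tiny stand-in encoder
decoder = nn.Linear(64, 784)                             # tiny stand-in decoder
lambda_sparse = 1e-3                                     # illustrative penalty weight

x = torch.rand(32, 784)
h = encoder(x)                              # latent activations
x_hat = decoder(h)

recon = nn.functional.mse_loss(x_hat, x)
l1_penalty = h.abs().mean()                 # pushes most activations toward zero
loss = recon + lambda_sparse * l1_penalty   # only a few units stay active per input
```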
7.4 Variational Autoencoders (VAEs)
Introduce a probabilistic latent space, enabling smooth interpolation and random sampling of new data points.
8. Variational Autoencoders (VAEs): Theory and Intuition
Define the latent space to follow a normal distribution, which enables sampling. To generate data, we first sample $z$ from the prior $p_{\theta^\star}(z)$ and then translate $z$ into $x$ using the decoder network. Our goal is to estimate the true parameters $\theta^\star$ of this generative model. How should we represent this model? We choose the prior $p(z)$ to be simple, e.g. a Gaussian. The conditional $p(x \mid z)$ is complex, so we represent it with a neural network.
How do we train this model? We maximize the likelihood of the training data
\[p_{\theta}(x) = \int p_{\theta}(z)\, p_{\theta}(x \mid z) \, dz\]
Since we are modeling probabilistic generation of data, the encoder and decoder networks are themselves probabilistic: we use $q_{\phi}(z \mid x)$ to represent the encoder network and $p_{\theta}(x \mid z)$ to represent the decoder network.
Now, equipped with our encoder and decoder networks, let us work out the log data likelihood:
\[\begin{align} \log p_{\theta}(x) &= \mathbb{E}_{z \sim q_{\phi}(z \mid x)}[\log p_{\theta}(x)] \\ &= \mathbb{E}_{z} \left[ \log \frac{p_{\theta}(x\mid z)\, p_{\theta}(z)}{p_{\theta}(z \mid x)} \right] \\ &= \mathbb{E}_{z} \left[ \log \frac{p_{\theta}(x\mid z)\, p_{\theta}(z)}{p_{\theta}(z \mid x)} \frac{q_{\phi}(z \mid x)}{q_{\phi} (z \mid x)} \right] \\ &= \mathbb{E}_{z}[\log p_{\theta}(x \mid z)] - \mathbb{E}_{z}\left[ \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z)} \right] + \mathbb{E}_{z}\left[\log \frac{q_{\phi}(z \mid x)}{p_\theta(z \mid x)} \right] \\ &= \mathbb{E}_{z}[\log p_{\theta}(x \mid z)] - D_{KL}(q_{\phi}(z \mid x) \mid\mid p_{\theta}(z)) + D_{KL}(q_{\phi}(z \mid x) \mid\mid p_{\theta}(z \mid x)) \end{align}\]
The decoder network gives $p_{\theta}(x \mid z)$, so we can estimate the first term (reconstruction accuracy) by sampling. The second term (a regularizer that keeps the latent space continuous) is a KL divergence between two Gaussians, the encoder distribution and the prior on $z$, and has a closed-form solution. The third term cannot be computed, but since KL divergence is always non-negative, dropping it leaves a lower bound on the log likelihood. The first two terms form the Evidence Lower Bound (ELBO), which we use as the training objective and optimize by gradient ascent. A side effect of this objective is that VAEs tend to generate blurred images due to their mode-covering behavior.
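As a sketch (assuming PyTorch and a Bernoulli decoder over pixel intensities), the negative ELBO can be computed as a sampled reconstruction term plus the closed-form KL term $\tfrac{1}{2}\sum_j \left(\mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1\right)$:

```python
import torch
import torch.nn as nn

def vae_loss(x, x_hat, mu, logvar):
    # Negative reconstruction term -E_q[log p(x|z)], estimated with one sample of z;
    # for a Bernoulli decoder this is the binary cross-entropy (x, x_hat in [0, 1]).
    recon = nn.functional.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL(q(z|x) || p(z)) for a diagonal Gaussian against N(0, I):
    # 0.5 * sum(mu^2 + sigma^2 - log sigma^2 - 1)
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1)
    return recon + kl                # negative ELBO; minimizing it maximizes the bound
```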
9. The Reparameterization Trick
To allow gradients to flow through the random sampling step, we write
\[z = \mu + \sigma \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I)\]
This separates the stochasticity ($\epsilon$) from the deterministic part ($\mu$, $\sigma$), making training possible with backpropagation.
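A minimal sketch of the trick, assuming PyTorch and an encoder that outputs `mu` and `logvar`:

```python
import torch

def reparameterize(mu, logvar):
    """Sample z = mu + sigma * eps with eps ~ N(0, I)."""
    std = torch.exp(0.5 * logvar)    # sigma = exp(0.5 * log(sigma^2))
    eps = torch.randn_like(std)      # eps carries all the randomness
    return mu + eps * std            # differentiable with respect to mu and logvar
```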
10. Generating New Samples with VAEs
After training:
- Sample $z \sim \mathcal{N}(0, I)$
- Decode with $x = f_\theta(z)$
The result is a newly generated data sample similar to the training distribution — for instance, a new handwritten digit when trained on MNIST.
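A sketch of the generation step, assuming PyTorch; the decoder here is an illustrative stand-in for a trained VAE decoder:

```python
import torch
import torch.nn as nn

# Illustrative stand-in for the trained decoder p_theta(x|z); in practice this
# network comes from a trained VAE.
latent_dim = 32
decoder = nn.Sequential(
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),   # pixel intensities in [0, 1]
)

z = torch.randn(16, latent_dim)          # sample z ~ N(0, I) from the prior
with torch.no_grad():
    x_new = decoder(z)                   # 16 new image-like samples, e.g. flattened digits
```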
11. Applications
- Image denoising (e.g., Gondara 2016, medical imaging)
- Data generation and augmentation
- Anomaly detection (via reconstruction error)
- Feature learning for unsupervised tasks
- Data compression and transfer learning
12. Summary and Takeaways
- Autoencoders learn compact latent representations for reconstruction and feature extraction.
- Denoising and Sparse variants enhance robustness and interpretability.
- Variational Autoencoders combine neural networks with probabilistic inference, enabling sample generation.
- The reparameterization trick enables stochastic training with gradient descent.
- These architectures are foundational to modern generative AI models (GANs, diffusion models).
References
- Michelucci, Umberto. Modern Autoencoders: Theory and Applications. arXiv preprint arXiv:2201.03898, 2022.
- Gondara, L. “Medical Image Denoising Using Convolutional Denoising Autoencoders.” IEEE ICDMW, 2016.
- Kingma, D. P., and Welling, M. “Auto-Encoding Variational Bayes.” arXiv preprint arXiv:1312.6114, 2013.