Lecture 16 - Autoencoders and Variational Autoencoders (VAEs)

Detailed lecture notes exploring Autoencoders, their variants, and Variational Autoencoders in Deep Generative Models.

1. Introduction and Overview

This lecture focuses on Deep Generative Models (DGMs) — models designed to learn the underlying data distribution, enabling both prediction and generation of new samples. We move from discriminative models, which model $P(Y|X)$, to generative models, which model $P(X, Y)$ or $P(X)$.

Autoencoders (AEs) and Variational Autoencoders (VAEs) form the foundation of this shift, combining unsupervised representation learning with probabilistic inference.


2. Discriminative vs. Generative Modeling

| Model Type | Learns | Objective | Examples |
| --- | --- | --- | --- |
| Discriminative | $P(Y \mid X)$ | Classify or predict outcomes | Logistic Regression, CNNs |
| Generative | $P(X, Y)$ or $P(X \mid Y)$ | Model the data generation process | Autoencoders, VAEs, GANs |

In generative models, latent variables $z$ represent hidden structure in the data, making the following computations challenging:

\[p_\theta(x) = \int p_\theta(x, z) \, dz \qquad\qquad p_\theta(z|x) \propto p_\theta(x|z)\, p(z)\]

Because $z$ is unobserved, both the marginal likelihood and posterior inference are intractable for complex, high-dimensional data, requiring approximate inference.


3. Deep Generative Models (DGMs)

DGMs use neural networks to parameterize probability distributions over observed data $x$ and latent variables $z$. The “deep” structure arises from multiple layers of hidden variables that represent hierarchical abstractions.

Key ideas:


4. Autoencoders: Concept and Motivation

An Autoencoder (AE) is an unsupervised neural network trained to reproduce its input. It compresses the input into a latent representation (code) and reconstructs the input from this compressed form.

Applications:


5. Architecture of an Autoencoder

An autoencoder consists of:

  1. An encoder that compresses the input $x$ into a low-dimensional latent code $z$.
  2. A decoder that reconstructs $\hat{x}$ from the code $z$.

Training objective: \(L(x, \hat{x}) = ||x - \hat{x}||^2\)

By minimizing reconstruction loss, the network learns to capture meaningful low-dimensional structure.
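
As a concrete illustration (not code from the lecture itself), the following is a minimal sketch of this architecture in PyTorch; the layer sizes, learning rate, and the random input batch are illustrative placeholders.

```python
import torch
from torch import nn

# A minimal fully-connected autoencoder; layer sizes are illustrative.
class Autoencoder(nn.Module):
    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input into a low-dimensional code z.
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct the input from the code.
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)
        return self.decoder(z)

model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()            # reconstruction loss ||x - x_hat||^2

x = torch.randn(64, 784)          # stand-in batch; replace with real data
x_hat = model(x)
loss = loss_fn(x_hat, x)
loss.backward()
optimizer.step()
```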


6. Autoencoders vs PCA

While PCA performs linear dimensionality reduction, AEs learn non-linear mappings through neural activations. PCA can be seen as a special case of an AE with one linear hidden layer.
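
To make this connection concrete, a linear AE with a $k$-dimensional bottleneck and squared-error loss solves

\[\min_{W \in \mathbb{R}^{k \times d},\; V \in \mathbb{R}^{d \times k}} \; \sum_{i=1}^{n} \| x_i - V W x_i \|^2\]

and, for centered data, the optimal product $VW$ is the orthogonal projection onto the top-$k$ principal subspace, so the learned code spans the same subspace as PCA (though the weights need not be the orthonormal PCA basis).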

Advantages of AEs:


7. Autoencoder Variants

7.1 Denoising Autoencoders (DAE)

Train the network to reconstruct the original, clean data from a corrupted input, e.g., one perturbed with additive Gaussian noise or input dropout.
This improves robustness and generalization.
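
A minimal sketch of one denoising training step, assuming PyTorch and a stand-in autoencoder; the noise scale of 0.3 and the dropout-style corruption rate are illustrative choices.

```python
import torch
from torch import nn

# Denoising step sketch: corrupt the input, but reconstruct the *clean* target.
model = nn.Sequential(              # stand-in autoencoder for a self-contained example
    nn.Linear(784, 64), nn.ReLU(), nn.Linear(64, 784)
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)                     # clean batch (placeholder data)
x_noisy = x + 0.3 * torch.randn_like(x)     # additive Gaussian corruption
# Alternative corruption: randomly zero out inputs (input dropout)
# x_noisy = x * (torch.rand_like(x) > 0.2).float()

x_hat = model(x_noisy)
loss = ((x_hat - x) ** 2).mean()            # compare against the clean input
loss.backward()
optimizer.step()
```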

7.2 Dropout Autoencoders

Add dropout layers to force redundancy in learned features, preventing overfitting and improving noise tolerance.

7.3 Sparse Autoencoders

Encourage sparsity in activations using an L1 penalty. Only a few neurons activate for each input, leading to interpretable latent representations.
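
A sketch of the sparse-AE objective under the same assumptions (PyTorch, illustrative sizes, and an illustrative sparsity weight $\lambda$).

```python
import torch
from torch import nn

# Sparse autoencoder sketch: add an L1 penalty on the latent activations.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
decoder = nn.Linear(128, 784)
lam = 1e-3                    # sparsity weight (illustrative hyperparameter)

x = torch.rand(16, 784)       # placeholder batch
h = encoder(x)                # latent activations
x_hat = decoder(h)

recon = ((x_hat - x) ** 2).mean()
sparsity = h.abs().mean()     # L1 penalty encourages only a few active units
loss = recon + lam * sparsity
loss.backward()
```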

7.4 Variational Autoencoders (VAEs)

Introduce a probabilistic latent space, enabling smooth interpolation and random sampling of new data points.


8. Variational Autoencoders (VAEs): Theory and Intuition

VAEs model the data generation process as:

  1. Sample latent variable $z \sim p(z)$ (e.g., $\mathcal{N}(0, I)$).
  2. Generate data $x$ from the conditional distribution $p_\theta(x|z)$.

The encoder approximates the posterior $q_\phi(z|x)$ using neural networks, producing mean and variance parameters $(\mu, \sigma)$.

We cannot compute $\log p_\theta(x)$ directly because the marginalization over $z$ is intractable.
By introducing an approximate posterior $q_\phi(z|x)$ and applying Jensen’s inequality:

\[\begin{aligned}
\log p_\theta(x) &= \log \int p_\theta(x, z)\, dz = \log \mathbb{E}_{q_\phi(z|x)}\!\left[\frac{p_\theta(x, z)}{q_\phi(z|x)}\right] \\
&\ge \mathbb{E}_{q_\phi(z|x)}\!\left[\log \frac{p_\theta(x, z)}{q_\phi(z|x)}\right] = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x)\,\|\,p(z))
\end{aligned}\]

This lower bound is the Evidence Lower Bound (ELBO), which VAEs maximize during training.

Objective (Evidence Lower Bound - ELBO):

\[\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - D_{KL}(q_\phi(z|x) || p(z))\]
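
When, as is common, $q_\phi(z|x) = \mathcal{N}(\mu, \mathrm{diag}(\sigma^2))$ and $p(z) = \mathcal{N}(0, I)$, the KL term can be evaluated in closed form over the $d$ latent dimensions:

\[D_{KL}(q_\phi(z|x) \,\|\, p(z)) = \frac{1}{2} \sum_{j=1}^{d} \left( \mu_j^2 + \sigma_j^2 - \log \sigma_j^2 - 1 \right)\]

so only the reconstruction expectation needs to be estimated by sampling $z$.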

9. The Reparameterization Trick

To allow gradients to flow through random sampling, we use:

\[z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

This separates the stochastic sampling (the noise $\epsilon$) from the deterministic parameters $\mu$ and $\sigma$, so gradients can flow back to the encoder during backpropagation.
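
Putting Sections 8 and 9 together, here is a minimal VAE sketch in PyTorch (an illustrative implementation, not the lecture's code); it uses a Bernoulli (binary cross-entropy) reconstruction term and the closed-form Gaussian KL from above, with illustrative layer sizes.

```python
import torch
from torch import nn
import torch.nn.functional as F

# Minimal VAE sketch: the encoder outputs (mu, log sigma^2), z is sampled via
# the reparameterization trick, and the training loss is the negative ELBO.
class VAE(nn.Module):
    def __init__(self, input_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(input_dim, 256), nn.ReLU())
        self.mu = nn.Linear(256, latent_dim)
        self.logvar = nn.Linear(256, latent_dim)
        self.dec = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim), nn.Sigmoid(),
        )

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.mu(h), self.logvar(h)
        eps = torch.randn_like(mu)                 # epsilon ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * eps     # z = mu + sigma * eps
        return self.dec(z), mu, logvar

def neg_elbo(x, x_hat, mu, logvar):
    # Reconstruction term: Bernoulli log-likelihood (binary cross-entropy).
    recon = F.binary_cross_entropy(x_hat, x, reduction="sum")
    # Closed-form KL between N(mu, sigma^2) and the standard normal prior.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl

model = VAE()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 784)              # placeholder batch with values in [0, 1]
x_hat, mu, logvar = model(x)
loss = neg_elbo(x, x_hat, mu, logvar)
loss.backward()
optimizer.step()
```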


10. Generating New Samples with VAEs

After training:

  1. Sample $z \sim \mathcal{N}(0, I)$
  2. Decode with $x = f_\theta(z)$

The result is a newly generated data sample similar to the training distribution — for instance, a new handwritten digit when trained on MNIST.
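
A generation sketch under the same assumptions; here `decoder` stands in for the trained decoder $f_\theta$ (an untrained placeholder network is used so the snippet runs on its own).

```python
import torch
from torch import nn

# Generation sketch: after training, only the decoder and the prior are needed.
latent_dim = 16
decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                        nn.Linear(256, 784), nn.Sigmoid())

with torch.no_grad():
    z = torch.randn(16, latent_dim)   # step 1: sample z ~ N(0, I)
    x_new = decoder(z)                # step 2: decode, x = f_theta(z)
# Each row of x_new is a generated sample (e.g., a 28x28 image when trained on MNIST).
```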


11. Applications


12. Summary and Takeaways

