Lecture 16

Autoencoders and Variational Autoencoders (VAEs)

Today’s Topics:

- Discriminative vs. generative modeling
- Deep generative models (DGMs)
- Autoencoders and their variants
- Variational Autoencoders (VAEs) and the evidence lower bound (ELBO)
- The reparameterization trick and sampling new data

1. Introduction and Overview

This lecture focuses on Deep Generative Models (DGMs) — models designed to learn the underlying data distribution, enabling both prediction and generation of new samples. We move from discriminative models, which model $P(Y|X)$, to generative models, which model $P(X, Y)$ or $P(X)$.

Autoencoders (AEs) and Variational Autoencoders (VAEs) form the foundation of this shift, combining unsupervised representation learning with probabilistic inference.


2. Discriminative vs Generative Modeling

| Model Type     | Learns                     | Objective                     | Examples                  |
|----------------|----------------------------|-------------------------------|---------------------------|
| Discriminative | $P(Y \mid X)$              | Classify or predict outcomes  | Logistic Regression, CNNs |
| Generative     | $P(X, Y)$ or $P(X \mid Y)$ | Model data generation process | Autoencoders, VAEs, GANs  |

In generative models, latent variables $z$ represent hidden structure in the data, making the following computations challenging:

\[p_\theta(x) = \int p_\theta(x, z)\, dz, \qquad p_\theta(z \mid x) \propto p_\theta(x \mid z)\, p(z)\]

Because $z$ is unobserved, both the marginal likelihood and posterior inference are intractable for complex data, requiring approximate inference.


3. Deep Generative Models (DGMs)

DGMs use neural networks to parameterize probability distributions over observed data $x$ and latent variables $z$. The “deep” structure arises from multiple layers of hidden variables that represent hierarchical abstractions.

Key ideas:

- Neural networks parameterize flexible distributions such as $p_\theta(x \mid z)$, while the prior $p(z)$ is kept simple.
- Latent variables $z$ capture hidden structure; stacking layers of them yields hierarchical abstractions.
- Exact inference over $z$ is intractable, so training relies on approximate (variational) inference.


4. Autoencoders: Concept and Motivation

An Autoencoder (AE) is an unsupervised neural network (no labels required) trained to reproduce its input. It compresses the input into a latent representation (code) and reconstructs the input from this compressed form.

Applications:

- Dimensionality reduction and feature learning
- Denoising corrupted inputs
- Anomaly detection via reconstruction error
- Pretraining representations for downstream tasks


5. Architecture of an Autoencoder

An autoencoder consists of:

- An encoder that maps the input $x$ to a low-dimensional code $z$
- A bottleneck (latent) layer that holds the code $z$
- A decoder that maps $z$ back to a reconstruction $\hat{x}$

Training objective:

\[L(x, \hat{x}) = \| x - \hat{x} \|^2\]

By minimizing reconstruction loss, the network learns to capture meaningful low-dimensional structure.
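
As a concrete sketch, a minimal autoencoder of this form might look as follows in PyTorch (the 784-dimensional input, layer sizes, and optimizer settings are illustrative choices, not fixed by the lecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder (illustrative layer sizes)."""

    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input x into a low-dimensional code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct x_hat from the code z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # bottleneck representation
        return self.decoder(z)       # reconstruction

# One training step on a stand-in batch of flattened 28x28 inputs
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)   # L(x, x_hat) = ||x - x_hat||^2 (mean)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```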


6. Autoencoders vs PCA

While PCA performs linear dimensionality reduction, AEs learn non-linear mappings through their activation functions. PCA can be seen as a special case: an AE with a single linear hidden layer trained with squared-error loss recovers the same subspace as PCA.

Advantages of AEs:

- Capture non-linear structure that PCA cannot
- Scale to high-dimensional inputs such as images
- Learn representations that can be reused for downstream tasks


7. Autoencoder Variants

7.1 Denoising Autoencoders (DAE)

Train the network to reconstruct the original, clean input from a corrupted version, e.g., by adding Gaussian noise or applying dropout to the input; see the sketch below.
This improves robustness and generalization.
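
A minimal sketch of one denoising training step, reusing an autoencoder like the one sketched in Section 5; `noise_std` is a hypothetical corruption strength:

```python
import torch
import torch.nn.functional as F

def denoising_step(model, x, noise_std=0.1):
    """One denoising-autoencoder step: corrupt the input with Gaussian
    noise, reconstruct it, and score against the *clean* target."""
    x_noisy = x + noise_std * torch.randn_like(x)  # corrupted input
    x_hat = model(x_noisy)                         # reconstruction
    return F.mse_loss(x_hat, x)                    # compare to clean x
```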

7.2 Dropout Autoencoders

Add dropout layers to force redundancy in learned features, preventing overfitting and improving noise tolerance.

7.3 Sparse Autoencoders

Encourage sparsity in activations using an L1 penalty. Only a few neurons activate for each input, leading to interpretable latent representations.
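
As a rough sketch, the sparsity penalty can be added directly to the reconstruction loss; `sparsity_weight` is a hypothetical hyperparameter, and the `encoder`/`decoder` attributes match the autoencoder sketched in Section 5:

```python
import torch
import torch.nn.functional as F

def sparse_ae_loss(model, x, sparsity_weight=1e-3):
    """Reconstruction loss plus an L1 penalty on the latent activations."""
    z = model.encoder(x)                           # latent code
    x_hat = model.decoder(z)                       # reconstruction
    return F.mse_loss(x_hat, x) + sparsity_weight * z.abs().mean()
```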

7.4 Variational Autoencoders (VAEs)

Introduce a probabilistic latent space, enabling smooth interpolation and random sampling of new data points.


8. Variational Autoencoders (VAEs): Theory and Intuition

The key idea is to make the latent space follow a known distribution, e.g., a normal distribution, which enables sampling. Generation proceeds in two steps: first sample $z$ from the prior $p_{\theta^\star}(z)$, then translate $z$ into $x$ using the decoder network. We want to estimate the true parameters $\theta^\star$ of this generative model. How should we represent it? We choose the prior $p(z)$ to be simple, e.g., a Gaussian. The conditional $p(x \mid z)$ is then complex, so we represent it with a neural network.

How do we train this model? We maximize the likelihood of the training data:

\[p_{\theta}(x) = \int p_{\theta}(z) p_{\theta}(x \mid z) \, dz\]

Since we are modeling probabilistic generation of data, the encoder and decoder networks are themselves probabilistic: we use $q_{\phi}(z \mid x)$ to represent the encoder network and $p_{\theta}(x \mid z)$ to represent the decoder network.

Now, equipped with our encoder and decoder networks, let us work out the log data likelihood:

\[\begin{aligned}
\log p_{\theta}(x) &= \mathbb{E}_{z \sim q_{\phi}(z \mid x)}[\log p_{\theta}(x)] \\
&= \mathbb{E}_{z} \left[ \log \frac{p_{\theta}(x \mid z)\, p_{\theta}(z)}{p_{\theta}(z \mid x)} \right] \\
&= \mathbb{E}_{z} \left[ \log \frac{p_{\theta}(x \mid z)\, p_{\theta}(z)}{p_{\theta}(z \mid x)} \cdot \frac{q_{\phi}(z \mid x)}{q_{\phi}(z \mid x)} \right] \\
&= \mathbb{E}_{z}[\log p_{\theta}(x \mid z)] - \mathbb{E}_{z}\left[ \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z)} \right] + \mathbb{E}_{z}\left[ \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z \mid x)} \right] \\
&= \mathbb{E}_{z}[\log p_{\theta}(x \mid z)] - D_{KL}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\big) + D_{KL}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)\big)
\end{aligned}\]

The decoder network gives $p_{\theta}(x \mid z)$, so we can estimate the first term (reconstruction accuracy) by sampling. The second term (a regularizer that keeps the latent space continuous) is the KL divergence between the encoder's Gaussian and the Gaussian prior on $z$; it has a closed-form solution. The third term cannot be computed, but KL divergence is always non-negative, so the first two terms form a lower bound on $\log p_{\theta}(x)$. This bound, the Evidence Lower Bound (ELBO), is the loss function we take the gradient of and optimize. One practical consequence of this mode-covering behavior is that VAEs tend to generate blurry images.
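
Writing $\mathcal{L}(x; \theta, \phi)$ for the ELBO, the bound we optimize is

\[\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{z \sim q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)] - D_{KL}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\big) \le \log p_{\theta}(x)\]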


9. The Reparameterization Trick

To allow gradients to flow through random sampling, we use:

\[z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

This separates stochasticity from the deterministic part, making training possible with backpropagation.
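
A minimal PyTorch sketch of the trick, assuming the encoder predicts a mean `mu` and a log-variance `logvar` for $q_{\phi}(z \mid x)$ (names are illustrative):

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in eps,
    so gradients flow through mu and logvar during backpropagation."""
    std = torch.exp(0.5 * logvar)    # sigma from the predicted log-variance
    eps = torch.randn_like(std)      # eps ~ N(0, I)
    return mu + std * eps

# Gradients reach mu and logvar despite the sampling step
mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                   # mu.grad and logvar.grad are populated
```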


10. Generating New Samples with VAEs

After training:

  1. Sample $z \sim \mathcal{N}(0, I)$
  2. Decode with $x = f_\theta(z)$

The result is a newly generated data sample similar to the training distribution — for instance, a new handwritten digit when trained on MNIST.
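
A sketch of these two steps, using an illustrative stand-in `decoder` network (a real use would load the decoder trained in the previous step):

```python
import torch
import torch.nn as nn

latent_dim = 32
decoder = nn.Sequential(                 # stand-in for a *trained* decoder
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

with torch.no_grad():
    z = torch.randn(16, latent_dim)      # 1. sample z ~ N(0, I)
    samples = decoder(z)                 # 2. decode into data space
    images = samples.view(16, 28, 28)    # e.g., sixteen MNIST-sized digits
```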


11. Applications

- Generating new samples and interpolating smoothly in the latent space
- Denoising and dimensionality reduction
- Anomaly detection via reconstruction error
- Learning representations for downstream tasks


12. Summary and Takeaways

- Discriminative models learn $P(Y \mid X)$; generative models learn $P(X, Y)$ or $P(X)$ and can synthesize new data.
- Autoencoders learn compressed, non-linear representations by minimizing reconstruction error; variants (denoising, dropout, sparse) improve robustness and interpretability.
- VAEs make the latent space probabilistic, are trained by maximizing the ELBO, and use the reparameterization trick to backpropagate through sampling.
- New samples are generated by drawing $z \sim \mathcal{N}(0, I)$ and passing it through the decoder.

