Lecture 16

Autoencoders and Variational Autoencoders (VAEs)

Today’s Topics:

- Discriminative vs. generative modeling
- Deep generative models (DGMs)
- Autoencoders and their variants
- Variational Autoencoders (VAEs) and the evidence lower bound (ELBO)
- The reparameterization trick and sampling new data

1. Introduction and Overview

This lecture focuses on Deep Generative Models (DGMs) — models designed to learn the underlying data distribution, enabling both prediction and generation of new samples. We move from discriminative models, which model $P(Y|X)$, to generative models, which model $P(X, Y)$ or $P(X)$.

Autoencoders (AEs) and Variational Autoencoders (VAEs) form the foundation of this shift, combining unsupervised representation learning with probabilistic inference.


2. Discriminative vs Generative Modeling

| Model Type     | Learns                     | Objective                     | Examples                  |
|----------------|----------------------------|-------------------------------|---------------------------|
| Discriminative | $P(Y \mid X)$              | Classify or predict outcomes  | Logistic Regression, CNNs |
| Generative     | $P(X, Y)$ or $P(X \mid Y)$ | Model data generation process | Autoencoders, VAEs, GANs  |

In generative models, latent variables $z$ represent hidden structure in the data, making the following computations challenging:

\[p_\theta(x) = \int p_\theta(x, z)\, dz, \qquad p_\theta(z \mid x) \propto p_\theta(x \mid z)\, p(z)\]

Because $z$ is unobserved, both the marginal likelihood and posterior inference are intractable for complex data, requiring approximate inference.


3. Deep Generative Models (DGMs)

DGMs use neural networks to parameterize probability distributions over observed data $x$ and latent variables $z$. The “deep” structure arises from multiple layers of hidden variables that represent hierarchical abstractions.

Key ideas:

- Neural networks parameterize flexible distributions such as $p_\theta(x \mid z)$, while the prior $p(z)$ is kept simple.
- Latent variables $z$ capture hidden structure; stacking layers of them yields hierarchical abstractions.
- Exact inference over $z$ is intractable, so training relies on approximate (variational) inference.


4. Autoencoders: Concept and Motivation

An Autoencoder (AE) is an unsupervised neural network (no labels required) trained to reproduce its input. It compresses the input into a latent representation (code) and reconstructs the input from this compressed form.

Applications:

- Dimensionality reduction and feature learning
- Denoising corrupted inputs
- Anomaly detection via reconstruction error
- Pretraining representations for downstream tasks


5. Architecture of an Autoencoder

An autoencoder consists of:

- An encoder that maps the input $x$ to a low-dimensional code $z$
- A bottleneck (latent) layer that holds the code $z$
- A decoder that maps $z$ back to a reconstruction $\hat{x}$

Training objective:

\[L(x, \hat{x}) = \| x - \hat{x} \|^2\]

By minimizing reconstruction loss, the network learns to capture meaningful low-dimensional structure.
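
As a concrete sketch, a minimal autoencoder of this form might look as follows in PyTorch (the 784-dimensional input, layer sizes, and optimizer settings are illustrative choices, not fixed by the lecture):

```python
import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    """Minimal fully connected autoencoder (illustrative layer sizes)."""

    def __init__(self, input_dim=784, latent_dim=32):
        super().__init__()
        # Encoder: compress the input x into a low-dimensional code z
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 256), nn.ReLU(),
            nn.Linear(256, latent_dim),
        )
        # Decoder: reconstruct x_hat from the code z
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 256), nn.ReLU(),
            nn.Linear(256, input_dim),
        )

    def forward(self, x):
        z = self.encoder(x)          # bottleneck representation
        return self.decoder(z)       # reconstruction

# One training step on a stand-in batch of flattened 28x28 inputs
model = Autoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
x = torch.randn(64, 784)
x_hat = model(x)
loss = nn.functional.mse_loss(x_hat, x)   # L(x, x_hat) = ||x - x_hat||^2 (mean)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```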


6. Autoencoders vs PCA

While PCA performs linear dimensionality reduction, AEs learn non-linear mappings through their activation functions. PCA can be seen as a special case: an AE with a single linear hidden layer trained with squared-error loss recovers the same subspace as PCA.

Advantages of AEs:

- Capture non-linear structure that PCA cannot
- Scale to high-dimensional inputs such as images
- Learn representations that can be reused for downstream tasks


7. Autoencoder Variants

7.1 Denoising Autoencoders (DAE)

Train the network to reconstruct the original, clean input from a corrupted version, e.g., by adding Gaussian noise or applying dropout to the input; see the sketch below.
This improves robustness and generalization.
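
A minimal sketch of one denoising training step, reusing an autoencoder like the one sketched in Section 5; `noise_std` is a hypothetical corruption strength:

```python
import torch
import torch.nn.functional as F

def denoising_step(model, x, noise_std=0.1):
    """One denoising-autoencoder step: corrupt the input with Gaussian
    noise, reconstruct it, and score against the *clean* target."""
    x_noisy = x + noise_std * torch.randn_like(x)  # corrupted input
    x_hat = model(x_noisy)                         # reconstruction
    return F.mse_loss(x_hat, x)                    # compare to clean x
```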

7.2 Dropout Autoencoders

Add dropout layers to force redundancy in learned features, preventing overfitting and improving noise tolerance.

7.3 Sparse Autoencoders

Encourage sparsity in activations using an L1 penalty. Only a few neurons activate for each input, leading to interpretable latent representations.
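
As a rough sketch, the sparsity penalty can be added directly to the reconstruction loss; `sparsity_weight` is a hypothetical hyperparameter, and the `encoder`/`decoder` attributes match the autoencoder sketched in Section 5:

```python
import torch
import torch.nn.functional as F

def sparse_ae_loss(model, x, sparsity_weight=1e-3):
    """Reconstruction loss plus an L1 penalty on the latent activations."""
    z = model.encoder(x)                           # latent code
    x_hat = model.decoder(z)                       # reconstruction
    return F.mse_loss(x_hat, x) + sparsity_weight * z.abs().mean()
```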

7.4 Variational Autoencoders (VAEs)

Introduce a probabilistic latent space, enabling smooth interpolation and random sampling of new data points.


8. Variational Autoencoders (VAEs): Theory and Intuition

The key idea is to make the latent space follow a known distribution, e.g., a normal distribution, which enables sampling. Generation proceeds in two steps: first sample $z$ from the prior $p_{\theta^\star}(z)$, then translate $z$ into $x$ using the decoder network. We want to estimate the true parameters $\theta^\star$ of this generative model. How should we represent it? We choose the prior $p(z)$ to be simple, e.g., a Gaussian. The conditional $p(x \mid z)$ is then complex, so we represent it with a neural network.

How do we train this model? We maximize the likelihood of the training data:

\[p_{\theta}(x) = \int p_{\theta}(z) p_{\theta}(x \mid z) \, dz\]

Since we are modeling probabilistic generation of data, the encoder and decoder networks are themselves probabilistic: we use $q_{\phi}(z \mid x)$ to represent the encoder network and $p_{\theta}(x \mid z)$ to represent the decoder network.

Now, equipped with our encoder and decoder networks, let us work out the log data likelihood:

\[\begin{aligned}
\log p_{\theta}(x) &= \mathbb{E}_{z \sim q_{\phi}(z \mid x)}[\log p_{\theta}(x)] \\
&= \mathbb{E}_{z} \left[ \log \frac{p_{\theta}(x \mid z)\, p_{\theta}(z)}{p_{\theta}(z \mid x)} \right] \\
&= \mathbb{E}_{z} \left[ \log \frac{p_{\theta}(x \mid z)\, p_{\theta}(z)}{p_{\theta}(z \mid x)} \cdot \frac{q_{\phi}(z \mid x)}{q_{\phi}(z \mid x)} \right] \\
&= \mathbb{E}_{z}[\log p_{\theta}(x \mid z)] - \mathbb{E}_{z}\left[ \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z)} \right] + \mathbb{E}_{z}\left[ \log \frac{q_{\phi}(z \mid x)}{p_{\theta}(z \mid x)} \right] \\
&= \mathbb{E}_{z}[\log p_{\theta}(x \mid z)] - D_{KL}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\big) + D_{KL}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z \mid x)\big)
\end{aligned}\]

The decoder network gives $p_{\theta}(x \mid z)$, so we can estimate the first term (reconstruction accuracy) by sampling. The second term (a regularizer that keeps the latent space continuous) is the KL divergence between the encoder's Gaussian and the Gaussian prior on $z$; it has a closed-form solution. The third term cannot be computed, but KL divergence is always non-negative, so the first two terms form a lower bound on $\log p_{\theta}(x)$. This bound, the Evidence Lower Bound (ELBO), is the loss function we take the gradient of and optimize. One practical consequence of this mode-covering behavior is that VAEs tend to generate blurry images.
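
Writing $\mathcal{L}(x; \theta, \phi)$ for the ELBO, the bound we optimize is

\[\mathcal{L}(x; \theta, \phi) = \mathbb{E}_{z \sim q_{\phi}(z \mid x)}[\log p_{\theta}(x \mid z)] - D_{KL}\big(q_{\phi}(z \mid x) \,\|\, p_{\theta}(z)\big) \le \log p_{\theta}(x)\]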


9. The Reparameterization Trick

To allow gradients to flow through random sampling, we use:

\[z = \mu + \sigma \odot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)\]

This separates stochasticity from the deterministic part, making training possible with backpropagation.
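
A minimal PyTorch sketch of the trick, assuming the encoder predicts a mean `mu` and a log-variance `logvar` for $q_{\phi}(z \mid x)$ (names are illustrative):

```python
import torch

def reparameterize(mu, logvar):
    """z = mu + sigma * eps with eps ~ N(0, I): the randomness lives in eps,
    so gradients flow through mu and logvar during backpropagation."""
    std = torch.exp(0.5 * logvar)    # sigma from the predicted log-variance
    eps = torch.randn_like(std)      # eps ~ N(0, I)
    return mu + std * eps

# Gradients reach mu and logvar despite the sampling step
mu = torch.zeros(4, 8, requires_grad=True)
logvar = torch.zeros(4, 8, requires_grad=True)
z = reparameterize(mu, logvar)
z.sum().backward()                   # mu.grad and logvar.grad are populated
```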


10. Generating New Samples with VAEs

After training:

  1. Sample $z \sim \mathcal{N}(0, I)$
  2. Decode with $x = f_\theta(z)$

The result is a newly generated data sample similar to the training distribution — for instance, a new handwritten digit when trained on MNIST.
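
A sketch of these two steps, using an illustrative stand-in `decoder` network (a real use would load the decoder trained in the previous step):

```python
import torch
import torch.nn as nn

latent_dim = 32
decoder = nn.Sequential(                 # stand-in for a *trained* decoder
    nn.Linear(latent_dim, 256), nn.ReLU(),
    nn.Linear(256, 784), nn.Sigmoid(),
)

with torch.no_grad():
    z = torch.randn(16, latent_dim)      # 1. sample z ~ N(0, I)
    samples = decoder(z)                 # 2. decode into data space
    images = samples.view(16, 28, 28)    # e.g., sixteen MNIST-sized digits
```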


11. Applications

- Generating new samples and interpolating smoothly in the latent space
- Denoising and dimensionality reduction
- Anomaly detection via reconstruction error
- Learning representations for downstream tasks


12. Summary and Takeaways

- Discriminative models learn $P(Y \mid X)$; generative models learn $P(X, Y)$ or $P(X)$ and can synthesize new data.
- Autoencoders learn compressed, non-linear representations by minimizing reconstruction error; variants (denoising, dropout, sparse) improve robustness and interpretability.
- VAEs make the latent space probabilistic, are trained by maximizing the ELBO, and use the reparameterization trick to backpropagate through sampling.
- New samples are generated by drawing $z \sim \mathcal{N}(0, I)$ and passing it through the decoder.

