Lecture 06
Automatic Differentiation with PyTorch
1. Introduction
Training deep neural networks centers on minimizing a scalar loss via gradient-based optimization. Two pillars enable this:
- Backpropagation efficiently computes $\nabla_{\theta} L$ (gradients of the loss w.r.t. parameters).
- Gradient Descent (GD), along with variants such as SGD and Adam, updates the parameters using these gradients.
Modern frameworks automate the derivative calculations through automatic differentiation (autodiff).
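To make this loop concrete before diving into PyTorch, here is a minimal sketch of one gradient-descent step on a single scalar parameter; the toy loss, learning rate, and starting value are illustrative, and autodiff (introduced below) supplies the gradient.
import torch

theta = torch.tensor(1.0, requires_grad=True)  # one scalar parameter (illustrative value)
lr = 0.1                                       # learning rate (assumed value)

loss = (theta - 3.0) ** 2   # toy scalar loss with its minimum at theta = 3
loss.backward()             # backpropagation: fills theta.grad with dL/dtheta

with torch.no_grad():       # the update itself should not be recorded in the graph
    theta -= lr * theta.grad
theta.grad.zero_()          # clear the gradient before the next step

print(theta)                # tensor(1.4000, requires_grad=True)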
2. PyTorch Resources
PyTorch is a Python-first deep learning library with tensor operations, GPU acceleration, and dynamic computation graphs.
- Install via pip or conda at pytorch.org
- Tutorials: pytorch.org/tutorials
- Community forum: discuss.pytorch.org
3. Computation Graphs
A computation graph is a directed acyclic graph (DAG) representing an expression as simple ops (nodes) with data dependencies (edges). It underpins forward evaluation and reverse-mode gradient computation.
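As a small illustration (values are made up), each operation in the forward pass becomes a node, and PyTorch records which op produced each intermediate tensor:
import torch

w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)   # input data; no gradient needed

u = w * x      # node: multiply
v = u + b      # node: add
L = v ** 2     # node: square (scalar output)

# Each intermediate tensor stores the op that created it (an edge back into the DAG).
print(u.grad_fn, v.grad_fn, L.grad_fn)
# e.g. <MulBackward0 ...> <AddBackward0 ...> <PowBackward0 ...>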
Chain Rule on Graphs
For a scalar loss $L$ depending on parameters $\theta$ through intermediate variables, e.g.,
\[x = f(\theta), \quad y = g(x), \quad L = h(y),\]
reverse-mode autodiff applies the chain rule backward:
\[\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x} \cdot \frac{\partial x}{\partial \theta}.\]
Example: ReLU
The Rectified Linear Unit is:
\[\text{ReLU}(z) = \max(0, z), \quad \frac{d}{dz} \text{ReLU}(z) = \begin{cases} 1, & z > 0, \\ 0, & z \leq 0. \end{cases}\]
A single-neuron computation can be decomposed as:
\[u = w \cdot x, \quad v = u + b, \quad a = \text{ReLU}(v).\]
Gradients flow only through the active branch ($v > 0$), helping mitigate vanishing gradients and enabling efficient layer stacking.
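This decomposition can be checked directly with autograd. A minimal sketch, using illustrative values (the same ones as in the worked example of Section 7):
import torch

w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)
x = torch.tensor(3.0)

u = w * x               # u = 6
v = u + b               # v = 7 (active branch: v > 0)
a = torch.relu(v)       # a = 7
a.backward()

print(w.grad, b.grad)   # tensor(3.) tensor(1.)  ->  da/dw = x, da/db = 1
# If v <= 0 (e.g., with b = -10), ReLU's derivative is 0 and both gradients vanish.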
Common Graph Structures
- Single-path chains: univariate chain rule applies node-by-node.
- Weight sharing: parameters reused across paths (CNNs/RNNs); partials accumulate across all uses (see the sketch after this list).
- Fully connected layer: $y = \sigma(Wx + b)$ with multivariate chain rule across fan-in/fan-out.
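For weight sharing, a short sketch (illustrative values) shows the partials accumulating when a parameter appears on two paths of the graph:
import torch

w = torch.tensor(2.0, requires_grad=True)       # shared parameter
x1, x2 = torch.tensor(3.0), torch.tensor(5.0)

y = w * x1 + w * x2     # w is reused on two paths
y.backward()

print(w.grad)           # tensor(8.) = x1 + x2: contributions from both uses add up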
4. Automatic Differentiation in PyTorch
Reverse vs. Forward Mode
- Reverse mode (backprop): efficient when a scalar loss depends on many parameters (typical in DL); see the sketch after this list.
- Forward mode: efficient when #inputs $\ll$ #outputs (less common in DL training loops).
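A small sketch of why reverse mode fits deep learning (the parameter count is illustrative): one backward pass from a scalar loss yields the partial derivatives with respect to all parameters at once.
import torch

params = torch.randn(1000, requires_grad=True)   # "many parameters"
loss = (params ** 2).sum()                       # a single scalar loss

loss.backward()           # one reverse pass -> all 1000 partial derivatives
print(params.grad.shape)  # torch.Size([1000]); here grad equals 2 * params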
Autograd Workflow
- Track tensors: create tensors with requires_grad=True.
- Build graph: run the forward pass; PyTorch records ops dynamically (define-by-run).
- Backward pass: call loss.backward() to compute gradients.
- Access/use grads: read param.grad; update with an optimizer; zero grads for the next step.
Minimal example:
import torch
x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x
y.backward()
print(x.grad) # 2*x + 3 = 7
5. PyTorch API: Model & Training Loop
Define a Model (nn.Module)
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)   # fully connected layer: 10 inputs -> 1 output

    def forward(self, x):
        return torch.relu(self.fc(x))   # linear transform followed by ReLU
Create Components
model = Net()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
Training Loop Skeleton
for epoch in range(num_epochs):
    optimizer.zero_grad()                # 1) clear old grads
    outputs = model(inputs)              # 2) forward
    loss = criterion(outputs, targets)
    loss.backward()                      # 3) backprop
    optimizer.step()                     # 4) update
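For completeness, here is the same skeleton made runnable with synthetic data; the data shapes, epoch count, and learning rate are illustrative placeholders, and Net is the model defined above.
import torch
import torch.nn as nn

# Synthetic regression data (illustrative: 64 samples, 10 features each).
inputs = torch.randn(64, 10)
targets = torch.randn(64, 1)

model = Net()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

num_epochs = 100
for epoch in range(num_epochs):
    optimizer.zero_grad()
    outputs = model(inputs)
    loss = criterion(outputs, targets)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 20 == 0:
        print(f"epoch {epoch + 1}: loss = {loss.item():.4f}")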
Good Practices
- Call optimizer.zero_grad() every iteration to prevent gradient accumulation.
- Use model.eval() and torch.no_grad() for evaluation/inference (see the sketch after this list).
- Move tensors to the same device (CPU/GPU) via .to(device).
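A minimal evaluation sketch that follows these practices (the device choice and test batch are illustrative):
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)      # move parameters to the chosen device
model.eval()          # switch layers such as dropout/batchnorm to eval mode

with torch.no_grad():                             # no graph building during inference
    test_inputs = torch.randn(8, 10).to(device)   # illustrative batch of 8 samples
    predictions = model(test_inputs)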
6. Debugging & Practical Notes
- Inspect tensor shapes/values and .grad to diagnose issues (see the sketch after this list).
- Visualize graphs when helpful (e.g., external tools like torchviz).
- Check learning rate, initialization, and data normalization if loss diverges.
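For instance, after loss.backward() one can print each parameter's shape and gradient norm (a sketch assuming the model from Section 5):
# Inspect shapes and gradient norms to spot vanishing/exploding gradients
# or parameters whose .grad is unexpectedly None.
for name, param in model.named_parameters():
    grad_norm = None if param.grad is None else param.grad.norm().item()
    print(name, tuple(param.shape), grad_norm)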
7. Worked Scalar Example: ReLU Neuron
Take $x = 3, w = 2, b = 1$.
Forward:
\[u = wx = 6, \quad v = u + b = 7, \quad a = \text{ReLU}(v) = 7.\]
Backward:
\[\frac{\partial a}{\partial v} = \mathbb{1}[v > 0] = 1, \quad \frac{\partial v}{\partial u} = 1, \quad \frac{\partial u}{\partial w} = x, \quad \frac{\partial u}{\partial x} = w, \quad \frac{\partial v}{\partial b} = 1.\]
Thus:
\[\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v} \frac{\partial v}{\partial u} \frac{\partial u}{\partial w} = 1 \cdot 1 \cdot x = 3, \quad \frac{\partial a}{\partial b} = 1 \cdot 1 = 1.\]
8. Summary
- Computation graphs formalize dependencies and enable systematic application of the chain rule.
- PyTorch builds these graphs dynamically and uses reverse-mode autodiff for efficient training.
- Typical workflow: define model → forward pass → compute loss → backward → optimizer step.