Lecture 06
Automatic Differentiation with PyTorch
1. Introduction
Training deep neural networks centers on minimizing a scalar loss via gradient-based optimization.
Two pillars enable this:
- Backpropagation efficiently computes $\nabla_{\theta} L$, the gradient of the loss with respect to the parameters.
- Gradient descent (GD), along with variants such as SGD and Adam, updates the parameters using these gradients.
Modern frameworks automate derivative calculations through automatic differentiation (autodiff).
2. PyTorch Resources
PyTorch is a Python-first deep learning library with tensor operations, GPU acceleration, and dynamic computation graphs.
- Install via pip or conda (instructions at pytorch.org)
- Tutorials: pytorch.org/tutorials
- Community forum: discuss.pytorch.org
3. Computation Graphs
A computation graph is a directed acyclic graph (DAG) representing an expression as simple ops (nodes) with data dependencies (edges).
It underpins forward evaluation and reverse-mode gradient computation.
Chain Rule on Graphs
For a scalar loss $L$ that depends on parameters $\theta$ through an intermediate variable $v = g(\theta)$, the chain rule gives
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial v} \cdot \frac{\partial v}{\partial \theta}.$$
Reverse-mode autodiff applies this chain rule backward from the loss: each node takes the gradient of the loss with respect to its output and multiplies it by its local derivatives to obtain gradients with respect to its inputs.
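As a quick numerical check, autograd reproduces this chain rule when the loss is computed through an intermediate variable; the function below ($L = v^2$ with $v = 3\theta + 1$) is just an illustrative choice:

```python
import torch

theta = torch.tensor(2.0, requires_grad=True)
v = 3 * theta + 1          # intermediate variable v = g(theta)
L = v ** 2                 # scalar loss L = f(v)

L.backward()               # reverse mode: dL/dtheta = (dL/dv) * (dv/dtheta)

# Analytically: dL/dv = 2v = 14, dv/dtheta = 3, so dL/dtheta = 42
print(theta.grad)          # tensor(42.)
```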
Example: ReLU
The Rectified Linear Unit is $\mathrm{ReLU}(v) = \max(0, v)$.
A single-neuron computation can be decomposed as $v = wx + b$ followed by $y = \max(0, v)$.
Gradients flow only through the active branch ($v > 0$), helping mitigate vanishing gradients and enabling efficient layer stacking.
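A minimal sketch (input values chosen arbitrarily) showing that the gradient passes only where the ReLU input is positive:

```python
import torch

v = torch.tensor([-1.5, 2.0], requires_grad=True)
y = torch.relu(v).sum()    # sum to obtain a scalar for backward()
y.backward()

# Gradient is 0 where v <= 0 and 1 where v > 0
print(v.grad)              # tensor([0., 1.])
```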
Common Graph Structures
- Single-path chains: univariate chain rule applies node-by-node.
- Weight sharing: parameters reused across paths (CNNs/RNNs); partials accumulate (see the sketch after this list).
- Fully connected layer: $y = \sigma(Wx + b)$, handled with the multivariate chain rule.
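A minimal sketch of the weight-sharing case, assuming a single scalar parameter `w` reused on two paths; the partial derivatives accumulate in `w.grad`:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x1, x2 = torch.tensor(3.0), torch.tensor(5.0)

# w appears on two paths of the graph; the output depends on it through both
out = w * x1 + w * x2
out.backward()

# d(out)/dw = x1 + x2: contributions from both paths are summed
print(w.grad)              # tensor(8.)
```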
4. Automatic Differentiation in PyTorch
Reverse vs. Forward Mode
- Reverse mode (backprop): efficient when a scalar loss depends on many parameters (typical in DL).
- Forward mode: efficient when #inputs ≪ #outputs (less common in DL).
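A rough sketch of why reverse mode fits deep learning: a single `backward()` call on a scalar loss produces gradients for every input at once (the vector size here is arbitrary):

```python
import torch

# Many inputs (think: parameters), one scalar output (the loss)
params = torch.randn(1_000_000, requires_grad=True)
loss = (params ** 2).sum()     # scalar loss

loss.backward()                # one reverse pass fills the gradient for all entries
print(params.grad.shape)       # torch.Size([1000000]); grad equals 2 * params
```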
Autograd Workflow
- Track tensors: create tensors with `requires_grad=True`.
- Build the graph: run the forward pass (PyTorch records operations dynamically).
- Backward pass: call `loss.backward()` to compute gradients.
- Access/use gradients: read `param.grad`, update with the optimizer, and zero the gradients before the next step.
Minimal example (step-by-step annotated):
```python
import torch
# Step 1: Create a tensor with gradient tracking enabled
# requires_grad=True tells PyTorch to record all operations on this tensor
x = torch.tensor(2.0, requires_grad=True)
# Step 2: Define a simple computation
# y = x**2 + 3*x creates a dynamic computation graph:
# - Node1: x
# - Node2: square operation
# - Node3: multiply by 3
# - Node4: add both results
y = x**2 + 3*x
# Step 3: Compute gradient dy/dx
# PyTorch traverses the graph in reverse order (reverse-mode autodiff)
# Using chain rule:
# dy/dx = d(x^2)/dx + 3 * d(x)/dx = 2x + 3
y.backward()
# Step 4: Access the stored gradient value
# PyTorch stores the computed gradient in x.grad
print(x.grad) # Expected: 2*x + 3 = 7
# Internally:
# - PyTorch dynamically builds a graph as ops execute
# - During backward(), it walks that graph in reverse, accumulating grads
# - After backward(), the graph is freed unless retain_graph=True is passed
```
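A small sketch of the graph-freeing behavior noted above: a second `backward()` on the same graph fails unless `retain_graph=True` was passed on the first call.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)  # keep the graph alive for another backward pass
y.backward()                   # without retain_graph=True above, this raises a RuntimeError
print(x.grad)                  # tensor(8.): both calls accumulated 2*x = 4 into x.grad
```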
---
## 5. PyTorch API: Model & Training Loop
### Define a Model (nn.Module)
```python
import torch
import torch.nn as nn
# This defines a fully connected linear layer with ReLU activation
# nn.Module is a base class that lets PyTorch track all parameters automatically.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)  # 10 inputs -> 1 output

    def forward(self, x):
        # The forward() method defines how data flows through the layers.
        # Here, x is multiplied by W, b is added, then ReLU is applied.
        return torch.relu(self.fc(x))
```

### Create Components

```python
model = Net() # Create an instance of the network
criterion = torch.nn.MSELoss() # Mean Squared Error loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2) # Stochastic Gradient Descent
```

### Training Loop Skeleton

```python
for epoch in range(num_epochs):
    optimizer.zero_grad()               # Clear old gradients from the previous iteration
    outputs = model(inputs)             # Forward pass: compute predictions via model(x)
    loss = criterion(outputs, targets)  # Compute the loss value
    loss.backward()                     # Backward pass: compute dLoss/dW for each parameter
    optimizer.step()                    # Update parameters: W := W - lr * dLoss/dW
```

### Good Practices
- Call `optimizer.zero_grad()` every iteration to prevent gradient accumulation.
- Use `model.eval()` and `torch.no_grad()` for evaluation/inference.
- Move the model and all tensors to the same device (CPU/GPU) via `.to(device)`.
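A minimal device-handling sketch, reusing the `Net` class defined above (the batch size and random data are placeholders):

```python
import torch

# Choose GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Net().to(device)                 # Net is the nn.Module defined above
inputs = torch.randn(32, 10).to(device)  # data must live on the same device
targets = torch.randn(32, 1).to(device)

outputs = model(inputs)                  # forward pass runs on `device`
```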
6. Debugging & Practical Notes
- Inspect tensor shapes/values and `.grad` to diagnose issues (see the sketch after this list).
- Visualize graphs when helpful (e.g., with external tools like torchviz).
- Check learning rate, initialization, and data normalization if loss diverges.
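A small diagnostic sketch along these lines, assuming the `model`, `criterion`, `inputs`, and `targets` from Section 5; it prints tensor shapes and per-parameter gradient norms after `backward()`:

```python
# Assumes model, criterion, inputs, targets are defined as in Section 5
model.zero_grad()                             # start from clean gradients
outputs = model(inputs)
print("outputs shape:", outputs.shape)        # sanity-check tensor shapes

loss = criterion(outputs, targets)
loss.backward()

for name, param in model.named_parameters():
    # param.grad is None if the parameter received no gradient at all
    grad_norm = None if param.grad is None else param.grad.norm().item()
    print(name, tuple(param.shape), "grad norm:", grad_norm)
```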
7. Worked Scalar Example: ReLU Neuron
Take $x = 3$, $w = 2$, $b = 1$.
Forward: $v = wx + b = 2 \cdot 3 + 1 = 7$, so $y = \max(0, v) = 7$.
Backward: since $v > 0$, $\partial y / \partial v = 1$, giving $\partial y / \partial w = x = 3$, $\partial y / \partial x = w = 2$, and $\partial y / \partial b = 1$.
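The same numbers checked with autograd (a short sketch using the values above):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

v = w * x + b                  # v = 2*3 + 1 = 7
y = torch.relu(v)              # y = max(0, 7) = 7
y.backward()

print(x.grad, w.grad, b.grad)  # tensor(2.) tensor(3.) tensor(1.)
```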
8. Dynamic Computation Graphs (Define-by-Run)
PyTorch uses a dynamic computation graph approach, also called define-by-run. The graph is built on-the-fly during the forward pass, meaning you can use standard Python control flow (loops, conditionals) and the graph structure can change every iteration.
Key advantages:
- Flexibility: Different graph structures for different inputs (e.g., variable-length sequences, conditional branching)
- Debugging: Easier to debug with standard Python tools since execution is immediate
- Intuitive: Code reads like normal Python rather than symbolic graph construction
Contrast with static graphs (e.g., TensorFlow 1.x): Static frameworks require defining the entire graph upfront, then running it repeatedly. This can be more efficient for production but less flexible for research and prototyping.
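A sketch of define-by-run with ordinary Python control flow; the loop below runs a data-dependent number of times, so the recorded graph can differ between calls (the stopping threshold is arbitrary):

```python
import torch

def forward(x):
    # The loop length depends on the input's norm, so the graph
    # recorded by autograd can change from one call to the next
    while x.norm() < 10:
        x = x * 2
    return x.sum()

x = torch.randn(3, requires_grad=True)
loss = forward(x)
loss.backward()                # gradients reflect however many doublings actually ran
print(x.grad)
```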
9. Common PyTorch Pitfalls
Pitfall 1: Forgetting to Zero Gradients
Gradients accumulate by default in PyTorch. If you don’t call optimizer.zero_grad() before each backward pass, gradients from previous iterations will keep adding up, corrupting your parameter updates and causing training to diverge.
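A sketch of the accumulation behavior on a single scalar parameter (the loss is an arbitrary toy expression):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

for step in range(3):
    loss = 2 * w               # dloss/dw = 2 at every step
    loss.backward()
    print(w.grad)              # tensor(2.), tensor(4.), tensor(6.): gradients keep adding up

w.grad.zero_()                 # what optimizer.zero_grad() does for each parameter
```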
Pitfall 2: Missing torch.no_grad() During Inference
During evaluation or inference, PyTorch still builds the computation graph by default, wasting memory and slowing down predictions by 2-3x. Always wrap inference code with torch.no_grad() to disable gradient tracking. Also remember to call model.eval() to switch dropout and batch normalization to evaluation mode.
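A minimal inference sketch, assuming the `model` instance from Section 5 (the input batch is a placeholder):

```python
import torch

# Assumes `model` is the Net instance created in Section 5
model.eval()                        # switch dropout/batch norm to evaluation behavior

with torch.no_grad():               # no computation graph is recorded inside this block
    preds = model(torch.randn(4, 10))

print(preds.requires_grad)          # False: no gradient tracking, so less memory is used
```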
10. Summary
- Computation graphs formalize dependencies and enable systematic application of the chain rule.
- PyTorch builds these graphs dynamically and uses reverse-mode autodiff for efficient training.
- Typical workflow: define model → forward pass → compute loss → backward → optimizer step.