Lecture 06
Automatic Differentiation with PyTorch
1. Introduction
Training deep neural networks centers on minimizing a scalar loss via gradient-based optimization.
Two pillars enable this:
- Backpropagation efficiently computes $\nabla_{\theta} L$, the gradient of the loss with respect to the parameters.
- Gradient descent (GD), along with variants such as SGD and Adam, updates the parameters using these gradients.
Modern frameworks automate derivative calculations through automatic differentiation (autodiff).
2. PyTorch Resources
PyTorch is a Python-first deep learning library with tensor operations, GPU acceleration, and dynamic computation graphs.
- Install via pip or conda (instructions at pytorch.org)
- Tutorials: pytorch.org/tutorials
- Community forum: discuss.pytorch.org
3. Computation Graphs
A computation graph is a directed acyclic graph (DAG) representing an expression as simple ops (nodes) with data dependencies (edges).
It underpins forward evaluation and reverse-mode gradient computation.
Chain Rule on Graphs
For a scalar loss $L$ that depends on parameters $\theta$ through an intermediate variable $v = g(\theta)$, the chain rule gives
$$\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial v} \cdot \frac{\partial v}{\partial \theta}.$$
Reverse-mode autodiff applies this chain rule backward from the loss: each node takes the gradient of the loss with respect to its output and multiplies it by its local derivatives to obtain gradients with respect to its inputs.
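As a quick numerical check, autograd reproduces this chain rule when the loss is computed through an intermediate variable; the function below ($L = v^2$ with $v = 3\theta + 1$) is just an illustrative choice:

```python
import torch

theta = torch.tensor(2.0, requires_grad=True)
v = 3 * theta + 1          # intermediate variable v = g(theta)
L = v ** 2                 # scalar loss L = f(v)

L.backward()               # reverse mode: dL/dtheta = (dL/dv) * (dv/dtheta)

# Analytically: dL/dv = 2v = 14, dv/dtheta = 3, so dL/dtheta = 42
print(theta.grad)          # tensor(42.)
```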
Example: ReLU
The Rectified Linear Unit is $\mathrm{ReLU}(v) = \max(0, v)$.
A single-neuron computation can be decomposed as $v = wx + b$ followed by $y = \max(0, v)$.
Gradients flow only through the active branch ($v > 0$), helping mitigate vanishing gradients and enabling efficient layer stacking.
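A minimal sketch (input values chosen arbitrarily) showing that the gradient passes only where the ReLU input is positive:

```python
import torch

v = torch.tensor([-1.5, 2.0], requires_grad=True)
y = torch.relu(v).sum()    # sum to obtain a scalar for backward()
y.backward()

# Gradient is 0 where v <= 0 and 1 where v > 0
print(v.grad)              # tensor([0., 1.])
```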
Common Graph Structures
- Single-path chains: univariate chain rule applies node-by-node.
- Weight sharing: parameters reused across paths (CNNs/RNNs); partials accumulate (see the sketch after this list).
- Fully connected layer: $y = \sigma(Wx + b)$, handled with the multivariate chain rule.
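A minimal sketch of the weight-sharing case, assuming a single scalar parameter `w` reused on two paths; the partial derivatives accumulate in `w.grad`:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)
x1, x2 = torch.tensor(3.0), torch.tensor(5.0)

# w appears on two paths of the graph; the output depends on it through both
out = w * x1 + w * x2
out.backward()

# d(out)/dw = x1 + x2: contributions from both paths are summed
print(w.grad)              # tensor(8.)
```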
4. Automatic Differentiation in PyTorch
Reverse vs. Forward Mode
- Reverse mode (backprop): efficient when a scalar loss depends on many parameters (typical in DL).
- Forward mode: efficient when #inputs ≪ #outputs (less common in DL).
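A rough sketch of why reverse mode fits deep learning: a single `backward()` call on a scalar loss produces gradients for every input at once (the vector size here is arbitrary):

```python
import torch

# Many inputs (think: parameters), one scalar output (the loss)
params = torch.randn(1_000_000, requires_grad=True)
loss = (params ** 2).sum()     # scalar loss

loss.backward()                # one reverse pass fills the gradient for all entries
print(params.grad.shape)       # torch.Size([1000000]); grad equals 2 * params
```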
Autograd Workflow
- Track tensors: create tensors with `requires_grad=True`.
- Build the graph: run the forward pass (PyTorch records operations dynamically).
- Backward pass: call `loss.backward()` to compute gradients.
- Access/use gradients: read `param.grad`, update with the optimizer, and zero the gradients before the next step.
Minimal example (step-by-step annotated):
```python
import torch
# Step 1: Create a tensor with gradient tracking enabled
# requires_grad=True tells PyTorch to record all operations on this tensor
x = torch.tensor(2.0, requires_grad=True)
# Step 2: Define a simple computation
# y = x**2 + 3*x creates a dynamic computation graph:
# - Node1: x
# - Node2: square operation
# - Node3: multiply by 3
# - Node4: add both results
y = x**2 + 3*x
# Step 3: Compute gradient dy/dx
# PyTorch traverses the graph in reverse order (reverse-mode autodiff)
# Using chain rule:
# dy/dx = d(x^2)/dx + 3 * d(x)/dx = 2x + 3
y.backward()
# Step 4: Access the stored gradient value
# PyTorch stores the computed gradient in x.grad
print(x.grad) # Expected: 2*x + 3 = 7
# Internally:
# - PyTorch dynamically builds a graph as ops execute
# - During backward(), it walks that graph in reverse, accumulating grads
# - After backward(), the graph is freed unless retain_graph=True is passed
```
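A small sketch of the graph-freeing behavior noted above: a second `backward()` on the same graph fails unless `retain_graph=True` was passed on the first call.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)
y = x ** 2

y.backward(retain_graph=True)  # keep the graph alive for another backward pass
y.backward()                   # without retain_graph=True above, this raises a RuntimeError
print(x.grad)                  # tensor(8.): both calls accumulated 2*x = 4 into x.grad
```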
---
## 5. PyTorch API: Model & Training Loop
### Define a Model (nn.Module)
```python
import torch
import torch.nn as nn
# This defines a fully connected linear layer with ReLU activation
# nn.Module is a base class that lets PyTorch track all parameters automatically.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)  # 10 inputs -> 1 output

    def forward(self, x):
        # The forward() method defines how data flows through the layers.
        # Here, x is multiplied by W, b is added, then ReLU is applied.
        return torch.relu(self.fc(x))
```

### Create Components

```python
model = Net() # Create an instance of the network
criterion = torch.nn.MSELoss() # Mean Squared Error loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2) # Stochastic Gradient Descent
```

### Training Loop Skeleton

```python
for epoch in range(num_epochs):
    optimizer.zero_grad()               # Clear old gradients from the previous iteration
    outputs = model(inputs)             # Forward pass: compute predictions via model(x)
    loss = criterion(outputs, targets)  # Compute the loss value
    loss.backward()                     # Backward pass: compute dLoss/dW for each parameter
    optimizer.step()                    # Update parameters: W := W - lr * dLoss/dW
```

### Good Practices
- Call `optimizer.zero_grad()` every iteration to prevent gradient accumulation.
- Use `model.eval()` and `torch.no_grad()` for evaluation/inference.
- Move the model and all tensors to the same device (CPU/GPU) via `.to(device)`.
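A minimal device-handling sketch, reusing the `Net` class defined above (the batch size and random data are placeholders):

```python
import torch

# Choose GPU if available, otherwise fall back to CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

model = Net().to(device)                 # Net is the nn.Module defined above
inputs = torch.randn(32, 10).to(device)  # data must live on the same device
targets = torch.randn(32, 1).to(device)

outputs = model(inputs)                  # forward pass runs on `device`
```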
6. Debugging & Practical Notes
- Inspect tensor shapes/values and `.grad` to diagnose issues (see the sketch after this list).
- Visualize graphs when helpful (e.g., with external tools like torchviz).
- Check learning rate, initialization, and data normalization if loss diverges.
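A small diagnostic sketch along these lines, assuming the `model`, `criterion`, `inputs`, and `targets` from Section 5; it prints tensor shapes and per-parameter gradient norms after `backward()`:

```python
# Assumes model, criterion, inputs, targets are defined as in Section 5
model.zero_grad()                             # start from clean gradients
outputs = model(inputs)
print("outputs shape:", outputs.shape)        # sanity-check tensor shapes

loss = criterion(outputs, targets)
loss.backward()

for name, param in model.named_parameters():
    # param.grad is None if the parameter received no gradient at all
    grad_norm = None if param.grad is None else param.grad.norm().item()
    print(name, tuple(param.shape), "grad norm:", grad_norm)
```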
7. Worked Scalar Example: ReLU Neuron
Take $x = 3$, $w = 2$, $b = 1$.
Forward: $v = wx + b = 2 \cdot 3 + 1 = 7$, so $y = \max(0, v) = 7$.
Backward: since $v > 0$, $\partial y / \partial v = 1$, giving $\partial y / \partial w = x = 3$, $\partial y / \partial x = w = 2$, and $\partial y / \partial b = 1$.
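The same numbers checked with autograd (a short sketch using the values above):

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

v = w * x + b                  # v = 2*3 + 1 = 7
y = torch.relu(v)              # y = max(0, 7) = 7
y.backward()

print(x.grad, w.grad, b.grad)  # tensor(2.) tensor(3.) tensor(1.)
```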
8. Dynamic Computation Graphs (Define-by-Run)
PyTorch uses a dynamic computation graph approach, also called define-by-run. The graph is built on-the-fly during the forward pass, meaning you can use standard Python control flow (loops, conditionals) and the graph structure can change every iteration.
Key advantages:
- Flexibility: Different graph structures for different inputs (e.g., variable-length sequences, conditional branching)
- Debugging: Easier to debug with standard Python tools since execution is immediate
- Intuitive: Code reads like normal Python rather than symbolic graph construction
Contrast with static graphs (e.g., TensorFlow 1.x): Static frameworks require defining the entire graph upfront, then running it repeatedly. This can be more efficient for production but less flexible for research and prototyping.
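A sketch of define-by-run with ordinary Python control flow; the loop below runs a data-dependent number of times, so the recorded graph can differ between calls (the stopping threshold is arbitrary):

```python
import torch

def forward(x):
    # The loop length depends on the input's norm, so the graph
    # recorded by autograd can change from one call to the next
    while x.norm() < 10:
        x = x * 2
    return x.sum()

x = torch.randn(3, requires_grad=True)
loss = forward(x)
loss.backward()                # gradients reflect however many doublings actually ran
print(x.grad)
```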
9. Common PyTorch Pitfalls
Pitfall 1: Forgetting to Zero Gradients
Gradients accumulate by default in PyTorch. If you don’t call optimizer.zero_grad() before each backward pass, gradients from previous iterations will keep adding up, corrupting your parameter updates and causing training to diverge.
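A sketch of the accumulation behavior on a single scalar parameter (the loss is an arbitrary toy expression):

```python
import torch

w = torch.tensor(1.0, requires_grad=True)

for step in range(3):
    loss = 2 * w               # dloss/dw = 2 at every step
    loss.backward()
    print(w.grad)              # tensor(2.), tensor(4.), tensor(6.): gradients keep adding up

w.grad.zero_()                 # what optimizer.zero_grad() does for each parameter
```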
Pitfall 2: Missing torch.no_grad() During Inference
During evaluation or inference, PyTorch still builds the computation graph by default, wasting memory and slowing down predictions by 2-3x. Always wrap inference code with torch.no_grad() to disable gradient tracking. Also remember to call model.eval() to switch dropout and batch normalization to evaluation mode.
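A minimal inference sketch, assuming the `model` instance from Section 5 (the input batch is a placeholder):

```python
import torch

# Assumes `model` is the Net instance created in Section 5
model.eval()                        # switch dropout/batch norm to evaluation behavior

with torch.no_grad():               # no computation graph is recorded inside this block
    preds = model(torch.randn(4, 10))

print(preds.requires_grad)          # False: no gradient tracking, so less memory is used
```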
10. Summary
- Computation graphs formalize dependencies and enable systematic application of the chain rule.
- PyTorch builds these graphs dynamically and uses reverse-mode autodiff for efficient training.
- Typical workflow: define model → forward pass → compute loss → backward → optimizer step.