Lecture 06

Automatic Differentiation with PyTorch

1. Introduction

Training deep neural networks centers on minimizing a scalar loss via gradient-based optimization.
Two pillars enable this:

Modern frameworks automate derivative calculations through automatic differentiation (autodiff).


2. PyTorch Resources

PyTorch is a Python-first deep learning library with tensor operations, GPU acceleration, and dynamic computation graphs.


3. Computation Graphs

A computation graph is a directed acyclic graph (DAG) representing an expression as simple ops (nodes) with data dependencies (edges).
It underpins forward evaluation and reverse-mode gradient computation.
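
As a rough illustration of the graph described above, the sketch below builds a small expression and inspects the backward nodes PyTorch attaches to the result. The expression and variable names are illustrative choices, not part of the lecture. `grad_fn` and `next_functions` are autograd attributes; the node class names they expose are internal details and may differ between PyTorch versions, so no specific names are assumed here.

```python
import torch

# Leaf node of the graph: a tensor that autograd tracks
x = torch.tensor(2.0, requires_grad=True)

# Each operation adds a node; edges follow the data dependencies
y = x ** 2 + 3 * x

# The result points at the node that produced it (the final addition)
root = y.grad_fn
print(type(root).__name__)

# next_functions lists the nodes that feed into this one
parents = [type(fn).__name__ for fn, _ in root.next_functions]
print(parents)

# Tensors created directly by the user are leaves and have no grad_fn
print(x.is_leaf, x.grad_fn is None)  # True True
```

Walking `next_functions` recursively from `y.grad_fn` visits the same nodes that the backward pass traverses.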

Chain Rule on Graphs

For a scalar loss $L$ depending on parameters $\theta$ through intermediate variables:

$$ x = f(\theta), \quad y = g(x), \quad L = h(y) $$

Reverse-mode autodiff applies the chain rule backward:

$$ \frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x} \cdot \frac{\partial x}{\partial \theta} $$
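
The chain rule above can be checked numerically. The sketch below assumes concrete functions that are not in the lecture, namely $f(\theta) = \theta^2$, $g(x) = \sin(x)$, and $h(y) = 3y$, and compares the hand-computed product of local derivatives with the gradient autograd reports.

```python
import math
import torch

theta = torch.tensor(0.5, requires_grad=True)

# Forward pass through the chain theta -> x -> y -> L
# (f, g, h are illustrative choices, not from the lecture)
x = theta ** 2       # x = f(theta)
y = torch.sin(x)     # y = g(x)
L = 3.0 * y          # L = h(y)

# Reverse-mode autodiff fills theta.grad with dL/dtheta
L.backward()

# Manual chain rule: dL/dy * dy/dx * dx/dtheta
dL_dy = 3.0
dy_dx = math.cos(0.5 ** 2)
dx_dtheta = 2 * 0.5
manual = dL_dy * dy_dx * dx_dtheta

print(theta.grad.item(), manual)  # the two values agree up to float32 rounding
```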

Example: ReLU

The Rectified Linear Unit is:

$$ \text{ReLU}(z) = \max(0, z), \quad \frac{d}{dz} \text{ReLU}(z) = \begin{cases} 1, & z > 0,\\ 0, & z \leq 0 \end{cases} $$

A single-neuron computation can be decomposed as:

$$ u = w \cdot x, \quad v = u + b, \quad a = \text{ReLU}(v) $$

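Gradients flow only through the active branch ($v > 0$). There the local derivative is exactly 1, so the upstream gradient passes through unscaled, which helps mitigate vanishing gradients when many layers are stacked. When $v \leq 0$ the local derivative is 0 and no gradient reaches $w$ or $b$.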
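
A minimal sketch of this gating behavior, assuming illustrative values for `w`, `b`, and the two inputs (none of them come from the lecture). The helper `neuron_grads` is a hypothetical name; it uses `torch.autograd.grad` to return the gradients without storing them in `.grad`.

```python
import torch

# Illustrative parameter values (assumptions, not from the lecture)
w = torch.tensor(0.5, requires_grad=True)
b = torch.tensor(-1.0, requires_grad=True)

def neuron_grads(x_value):
    """Return (a, da/dw, da/db) for a = ReLU(w * x + b)."""
    x = torch.tensor(x_value)
    u = w * x
    v = u + b
    a = torch.relu(v)
    grad_w, grad_b = torch.autograd.grad(a, (w, b))
    return a.item(), grad_w.item(), grad_b.item()

# Active unit: v = 0.5 * 4 - 1 = 1 > 0, so gradients pass through
active = neuron_grads(4.0)
print(active)

# Inactive unit: v = 0.5 * 1 - 1 = -0.5 <= 0, so both gradients are zero
inactive = neuron_grads(1.0)
print(inactive)
```

In the active case the gradient with respect to `w` equals the input value and the gradient with respect to `b` equals 1, matching the decomposition above.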

Common Graph Structures


4. Automatic Differentiation in PyTorch

Reverse vs. Forward Mode

Autograd Workflow

  1. Track tensors: create tensors with requires_grad=True.
  2. Build graph: run forward pass (PyTorch records ops dynamically).
  3. Backward pass: call loss.backward() to compute gradients.
  4. Access/use grads: read param.grad; update with optimizer; zero grads next step.

Minimal example (step-by-step annotated):

```python
import torch

# Step 1: Create a tensor with gradient tracking enabled
# requires_grad=True tells PyTorch to record all operations on this tensor
x = torch.tensor(2.0, requires_grad=True)

# Step 2: Define a simple computation
# y = x**2 + 3*x creates a dynamic computation graph:
#   - Node1: x
#   - Node2: square operation
#   - Node3: multiply by 3
#   - Node4: add both results
y = x**2 + 3*x

# Step 3: Compute gradient dy/dx
# PyTorch traverses the graph in reverse order (reverse-mode autodiff)
# Using chain rule:
#   dy/dx = d(x^2)/dx + 3 * d(x)/dx = 2x + 3
y.backward()

# Step 4: Access the stored gradient value
# PyTorch stores the computed gradient in x.grad
print(x.grad)  # Expected: 2*x + 3 = 7

# Internally:
# - PyTorch dynamically builds a graph as ops execute
# - During backward(), it walks that graph in reverse, accumulating grads
# - After backward, the graph is freed unless retain_graph=True is passed
```
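
The closing note about `retain_graph` in the example above can be seen in a short sketch. It assumes a simple squared expression chosen for illustration: calling `backward()` a second time on the same graph raises a `RuntimeError` because the saved intermediate values have been freed, while `retain_graph=True` keeps them available.

```python
import torch

x = torch.tensor(2.0, requires_grad=True)

# First graph: backward() frees the values saved for the backward pass
y = x ** 2
y.backward()
try:
    y.backward()  # second pass over a freed graph
    second_call_failed = False
except RuntimeError:
    second_call_failed = True
print(second_call_failed)  # True

# Second graph: retain_graph=True keeps the saved values alive
x.grad = None  # discard the gradient from the first graph
z = x ** 2
z.backward(retain_graph=True)
z.backward()
print(x.grad)  # tensor(8.): two passes of dz/dx = 4 accumulate
```

Retaining the graph costs memory, so it is usually reserved for cases that genuinely need more than one backward pass over the same forward computation.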


5. PyTorch API: Model & Training Loop

Define a Model (nn.Module)

```python
import torch
import torch.nn as nn

# A single fully connected (linear) layer followed by a ReLU activation.
# nn.Module is the base class that lets PyTorch track all parameters automatically.
class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)  # 10 inputs -> 1 output

    def forward(self, x):
        # forward() defines how data flows through the layers:
        # x is multiplied by W, b is added, and ReLU is applied.
        return torch.relu(self.fc(x))
```

Create Components

```python
model = Net()  # Create an instance of the network
criterion = torch.nn.MSELoss()  # Mean squared error loss
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)  # Stochastic gradient descent
```

Training Loop Skeleton

```python
# Skeleton only: num_epochs, inputs, and targets must be defined beforehand.
for epoch in range(num_epochs):
    optimizer.zero_grad()        # Clear old gradients from previous iteration
    outputs = model(inputs)      # Forward pass: compute predictions via model(x)
    loss = criterion(outputs, targets)  # Compute loss value
    loss.backward()              # Backward pass: compute dLoss/dW for each parameter
    optimizer.step()             # Update parameters: W := W - lr * dLoss/dW
```
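
The skeleton leaves `num_epochs`, `inputs`, and `targets` undefined. The sketch below fills them in with synthetic data so the loop can be run end to end. The data, the hypothetical "true" weights `true_w`, the offset of 5.0, the seed, and the epoch count are all assumptions made for illustration; only the model, loss, and optimizer follow the definitions above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # reproducible synthetic data and initialization

# Synthetic regression data (assumed for illustration): 64 samples, 10 features.
# Targets come from a hypothetical linear rule passed through ReLU.
inputs = torch.randn(64, 10)
true_w = torch.full((10, 1), 0.5)
targets = torch.relu(inputs @ true_w + 5.0)

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)

    def forward(self, x):
        return torch.relu(self.fc(x))

model = Net()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
num_epochs = 200

losses = []
for epoch in range(num_epochs):
    optimizer.zero_grad()               # clear gradients from the previous step
    outputs = model(inputs)             # forward pass
    loss = criterion(outputs, targets)  # scalar loss
    loss.backward()                     # backward pass fills param.grad
    optimizer.step()                    # gradient descent update
    losses.append(loss.item())

print(f"first loss: {losses[0]:.4f}, last loss: {losses[-1]:.4f}")
```

Recording `loss.item()` rather than the `loss` tensor keeps the history free of references to the computation graph.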

Good Practices


6. Debugging & Practical Notes


7. Worked Scalar Example: ReLU Neuron

Take $x = 3, w = 2, b = 1$.

Forward:

$$ u = wx = 6, \quad v = u + b = 7, \quad a = \text{ReLU}(v) = 7 $$

Backward:

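$$ \frac{\partial a}{\partial v} = \mathbb{1}[v > 0] = 1, \quad \frac{\partial v}{\partial u} = 1, \quad \frac{\partial u}{\partial w} = x, \quad \frac{\partial u}{\partial x} = w, \quad \frac{\partial v}{\partial b} = 1 $$

Multiplying the local derivatives along each path gives the gradients of $a$:

$$ \frac{\partial a}{\partial w} = 1 \cdot 1 \cdot x = 3, \quad \frac{\partial a}{\partial x} = 1 \cdot 1 \cdot w = 2, \quad \frac{\partial a}{\partial b} = 1 \cdot 1 = 1 $$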
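
The derivatives in this worked example can be cross-checked with autograd. The sketch below uses the same values ($x = 3$, $w = 2$, $b = 1$) and marks all three tensors as requiring gradients, which is an assumption made here so that each derivative can be read back.

```python
import torch

x = torch.tensor(3.0, requires_grad=True)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

# Forward pass, one op per line to mirror the decomposition
u = w * x          # u = 6
v = u + b          # v = 7
a = torch.relu(v)  # a = 7

# Backward pass from the scalar output a
a.backward()

print(w.grad)  # tensor(3.): da/dw = 1 * 1 * x
print(x.grad)  # tensor(2.): da/dx = 1 * 1 * w
print(b.grad)  # tensor(1.): da/db = 1 * 1
```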

8. Dynamic Computation Graphs (Define-by-Run)

PyTorch uses dynamic computation graphs, an approach also called define-by-run. The graph is built on the fly as the forward pass executes, so you can use standard Python control flow (loops, conditionals), and the graph structure can change from one iteration to the next.

Key advantages:

Contrast with static graphs (e.g., TensorFlow 1.x): Static frameworks require defining the entire graph upfront, then running it repeatedly. This can be more efficient for production but less flexible for research and prototyping.
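
A minimal sketch of define-by-run, assuming a toy function (`run_steps` is a hypothetical name) whose loop length and branch depend on runtime values. Each call records a different graph, and `backward()` differentiates whichever operations actually ran.

```python
import torch

def run_steps(x, n_steps):
    """Apply a data-dependent sequence of operations to x."""
    h = x
    for _ in range(n_steps):     # ordinary Python loop
        if h.item() > 10.0:      # ordinary Python branch on a tensor value
            h = h / 2
        else:
            h = h * 3
    return h

x = torch.tensor(1.0, requires_grad=True)

# Two steps: 1 -> 3 -> 9, so the recorded graph computes 9 * x
out_short = run_steps(x, n_steps=2)
out_short.backward()
grad_short = x.grad.item()
print(grad_short)  # 9.0

# Four steps: 1 -> 3 -> 9 -> 27 -> 13.5, a different graph with a division
x.grad = None
out_long = run_steps(x, n_steps=4)
out_long.backward()
grad_long = x.grad.item()
print(grad_long)  # 13.5, i.e. 3 * 3 * 3 / 2
```

The branch condition itself is evaluated in plain Python, so it is not part of the graph; only the tensor operations on the chosen path are differentiated.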


9. Common PyTorch Pitfalls

Pitfall 1: Forgetting to Zero Gradients

Gradients accumulate by default in PyTorch: each call to backward() adds to the existing .grad values instead of overwriting them. If you don’t call optimizer.zero_grad() before each backward pass, gradients from previous iterations keep adding up, which corrupts your parameter updates and can cause training to diverge.
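
The accumulation is easy to observe on a single parameter. The sketch below assumes a toy loss $(3w)^2$ and a one-parameter SGD optimizer, both chosen for illustration. No `optimizer.step()` is called, so `w` stays at 1 and the true gradient is 18 on every pass.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

def toy_loss():
    # Illustrative loss: (3w)^2, so d(loss)/dw = 18 * w
    return (3.0 * w) ** 2

# Two backward passes without zeroing: the second adds to the first
toy_loss().backward()
toy_loss().backward()
accumulated = w.grad.item()
print(accumulated)  # 36.0, twice the true gradient

# Zero the gradients, then a single pass gives the true gradient
optimizer.zero_grad()
toy_loss().backward()
fresh = w.grad.item()
print(fresh)  # 18.0
```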

Pitfall 2: Missing torch.no_grad() During Inference

During evaluation or inference, PyTorch still records the computation graph by default whenever the model's parameters require gradients, which wastes memory and adds overhead. Wrap inference code in torch.no_grad() to disable gradient tracking, and call model.eval() to switch dropout and batch normalization to evaluation mode. The two are independent: torch.no_grad() does not change layer behavior, and model.eval() does not disable gradient tracking.
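
A minimal sketch of an inference pass, assuming a small hypothetical model with a dropout layer and random inputs. It shows that outputs computed under `torch.no_grad()` carry no graph, and that `model.eval()` makes dropout deterministic.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and inputs, for illustration only
model = nn.Sequential(nn.Linear(10, 4), nn.Dropout(p=0.5), nn.Linear(4, 1))
inputs = torch.randn(8, 10)

# Default behavior: the forward pass records a graph
train_out = model(inputs)
print(train_out.requires_grad)  # True

# Inference: evaluation mode plus no gradient tracking
model.eval()
with torch.no_grad():
    eval_out = model(inputs)
    eval_out_again = model(inputs)

print(eval_out.requires_grad)                 # False
print(eval_out.grad_fn is None)               # True: no graph was recorded
print(torch.equal(eval_out, eval_out_again))  # True: dropout is disabled

model.train()  # switch back before resuming training
```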


10. Summary