Lecture 06

Automatic Differentiation with PyTorch

1. Introduction

Training deep neural networks centers on minimizing a scalar loss via gradient-based optimization. Two pillars enable this: the computation graph, which records a model's operations and their dependencies, and the chain rule, which propagates gradients backward through that graph.

Modern frameworks automate these derivative calculations through automatic differentiation (autodiff).


2. PyTorch Resources

PyTorch is a Python-first deep learning library with tensor operations, GPU acceleration, and dynamic computation graphs.
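
A minimal sketch of these capabilities (the shapes and the CUDA check below are illustrative choices):

import torch

a = torch.randn(3, 4)                     # tensor creation
b = torch.randn(4, 2)
c = a @ b                                 # tensor operation: matrix multiply, shape (3, 2)

device = "cuda" if torch.cuda.is_available() else "cpu"
c = c.to(device)                          # GPU acceleration when available
print(c.shape, c.device)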


3. Computation Graphs

A computation graph is a directed acyclic graph (DAG) representing an expression as simple ops (nodes) with data dependencies (edges). It underpins forward evaluation and reverse-mode gradient computation.

Chain Rule on Graphs

For a scalar loss $L$ depending on parameters $\theta$ through intermediate variables, e.g.,

\[x = f(\theta), \quad y = g(x), \quad L = h(y),\]

reverse-mode autodiff applies the chain rule backward:

\[\frac{\partial L}{\partial \theta} = \frac{\partial L}{\partial y} \cdot \frac{\partial y}{\partial x} \cdot \frac{\partial x}{\partial \theta}.\]
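
As a sketch, the chain rule above can be checked with autograd; the particular functions $f$, $g$, $h$ below are arbitrary choices for illustration:

import torch

theta = torch.tensor(0.5, requires_grad=True)
x = theta ** 2            # x = f(theta)
y = torch.sin(x)          # y = g(x)
L = y ** 2                # L = h(y)
L.backward()

# manual chain rule: dL/dy * dy/dx * dx/dtheta
manual = (2 * y.detach()) * torch.cos(x.detach()) * (2 * theta.detach())
print(theta.grad, manual)  # the two values agree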

Example: ReLU

The Rectified Linear Unit and its derivative are:

\[\text{ReLU}(z) = \max(0, z), \quad \frac{d}{dz} \text{ReLU}(z) = \begin{cases} 1, & z > 0, \\ 0, & z \leq 0. \end{cases}\]

A single-neuron computation can be decomposed as:

\[u = w \cdot x, \quad v = u + b, \quad a = \text{ReLU}(v).\]

Gradients flow only through the active branch ($v > 0$), where the derivative is exactly 1; this helps mitigate vanishing gradients when many layers are stacked.
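
A brief autograd check of the inactive case, with illustrative values chosen so that $v \leq 0$ (the active case is worked through in Section 7):

import torch

w = torch.tensor(-2.0, requires_grad=True)   # chosen so that v <= 0
b = torch.tensor(1.0, requires_grad=True)
x = 3.0

u = w * x                  # u = -6
v = u + b                  # v = -5, inactive branch
a = torch.relu(v)          # a = 0
a.backward()

print(w.grad, b.grad)      # tensor(0.) tensor(0.): no gradient flows past the ReLU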

Common Graph Structures


4. Automatic Differentiation in PyTorch

Reverse vs. Forward Mode

Forward mode propagates derivatives alongside the forward computation, with cost scaling in the number of inputs; reverse mode propagates derivatives backward from the output, with cost scaling in the number of outputs. Because training minimizes a single scalar loss with respect to many parameters, PyTorch's autograd uses reverse mode.

Autograd Workflow

  1. Track tensors: create tensors with requires_grad=True.
  2. Build graph: run the forward pass; PyTorch records ops dynamically (define-by-run).
  3. Backward pass: call loss.backward() to compute gradients.
  4. Access/use grads: read param.grad; update with an optimizer; zero grads for the next step.

Minimal example:

import torch

x = torch.tensor(2.0, requires_grad=True)
y = x**2 + 3*x
y.backward()
print(x.grad)  # dy/dx = 2*x + 3 = 7, printed as tensor(7.)
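
One detail worth noting: .grad accumulates across backward calls rather than being overwritten, which is why step 4 zeroes gradients between updates. A small sketch:

import torch

x = torch.tensor(2.0, requires_grad=True)
for _ in range(2):
    y = x**2 + 3*x
    y.backward()
print(x.grad)      # tensor(14.): two backward passes accumulated (7 + 7)
x.grad.zero_()     # reset before the next step; optimizers do this via optimizer.zero_grad()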

5. PyTorch API: Model & Training Loop

Define a Model (nn.Module)

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc = nn.Linear(10, 1)
    
    def forward(self, x):
        return torch.relu(self.fc(x))
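
A quick usage check of the model's forward pass (the batch size of 4 below is an arbitrary choice):

net = Net()
dummy = torch.randn(4, 10)       # batch of 4 samples with 10 features each
print(net(dummy).shape)          # torch.Size([4, 1]) after nn.Linear(10, 1) + ReLU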

Create Components

model = Net()
criterion = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

Training Loop Skeleton

for epoch in range(num_epochs):
    optimizer.zero_grad()        # 1) clear old grads
    outputs = model(inputs)      # 2) forward
    loss = criterion(outputs, targets)
    loss.backward()              # 3) backprop
    optimizer.step()             # 4) update
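
Putting these pieces together, a runnable sketch with synthetic data; the random regression targets, 100 samples, and 20 epochs are illustrative assumptions, and Net is the model defined above:

import torch
import torch.nn as nn

torch.manual_seed(0)                       # reproducibility
inputs = torch.randn(100, 10)              # synthetic features
targets = torch.randn(100, 1)              # synthetic regression targets

model = Net()
criterion = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

num_epochs = 20
for epoch in range(num_epochs):
    optimizer.zero_grad()                  # 1) clear old grads
    outputs = model(inputs)                # 2) forward
    loss = criterion(outputs, targets)
    loss.backward()                        # 3) backprop
    optimizer.step()                       # 4) update
    if (epoch + 1) % 5 == 0:
        print(f"epoch {epoch + 1}: loss = {loss.item():.4f}")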

Good Practices


6. Debugging & Practical Notes


7. Worked Scalar Example: ReLU Neuron

Take $x = 3, w = 2, b = 1$.

Forward:

\[u = wx = 6, \quad v = u + b = 7, \quad a = \text{ReLU}(v) = 7.\]

Backward:

\[\frac{\partial a}{\partial v} = \mathbb{1}[v > 0] = 1, \quad \frac{\partial v}{\partial u} = 1, \quad \frac{\partial u}{\partial w} = x, \quad \frac{\partial u}{\partial x} = w, \quad \frac{\partial v}{\partial b} = 1.\]

Thus:

\[\frac{\partial a}{\partial w} = \frac{\partial a}{\partial v} \frac{\partial v}{\partial u} \frac{\partial u}{\partial w} = 1 \cdot 1 \cdot x = 3, \quad \frac{\partial a}{\partial b} = 1 \cdot 1 = 1.\]
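
The same numbers can be checked with autograd, as a direct transcription of the scalar example:

import torch

x = torch.tensor(3.0)                     # input (no gradient needed)
w = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(1.0, requires_grad=True)

u = w * x                                 # u = 6
v = u + b                                 # v = 7
a = torch.relu(v)                         # a = 7
a.backward()

print(w.grad, b.grad)                     # tensor(3.) tensor(1.)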

8. Summary