Lecture 05

Fitting Neurons with Gradient Descent

Lecture Overview

  1. Online, batch, and minibatch mode
  2. Relation between perceptron and linear regression
  3. An iterative training algorithm for linear regression
  4. Calculus Refresher I: Derivatives
  5. Calculus Refresher II: Gradients
  6. Understanding gradient descent
  7. Training an adaptive linear neuron (Adaline)

1. Online, batch, and minibatch mode

Perceptron Recap

The perceptron computes

\hat{y} = \sigma \!\Bigl( \sum_{i=1}^{m} x_i w_i + b \Bigr) = \sigma \!\bigl( \mathbf{x}^{\top} \mathbf{w} + b \bigr),

where

\sigma(z) = \begin{cases} 0, & z \le 0 \\ 1, & z > 0 \end{cases} \qquad \text{and} \qquad b = -\theta.

Let $\mathcal{D} = (\langle \mathbf{x}^{[1]}, y^{[1]} \rangle, \langle \mathbf{x}^{[2]}, y^{[2]} \rangle, \dots, \langle \mathbf{x}^{[n]}, y^{[n]} \rangle) \in (\mathbb{R}^{m} \times \{0,1\})^{n} $.

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $
  2. For every training epoch:

    A. For every $ \langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D} $:

    • (a) Compute output (prediction) $ \hat{y}^{[i]} := \sigma\bigl( x^{[i]\top} \mathbf{w} + b \bigr) $

    • (b) Calculate error $ \mathrm{err} := \bigl( y^{[i]} - \hat{y}^{[i]} \bigr) $

    • (c) Update parameters $ \mathbf{w} := \mathbf{w} + \mathrm{err} \times x^{[i]}, \quad b := b + \mathrm{err} $
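As a concrete illustration, here is a minimal NumPy sketch of this training loop (the function and variable names are mine, not from the lecture):

```python
import numpy as np

def train_perceptron(X, y, num_epochs=10):
    """Perceptron learning rule in 'on-line' mode (one update per example)."""
    w = np.zeros(X.shape[1])  # weight vector, one entry per feature
    b = 0.0                   # bias unit

    for epoch in range(num_epochs):
        for x_i, y_i in zip(X, y):
            y_hat = 1 if x_i @ w + b > 0.0 else 0  # threshold activation sigma(z)
            err = y_i - y_hat                      # 0 if the prediction is correct
            w = w + err * x_i                      # update only on mistakes
            b = b + err
    return w, b
```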

“On-line” mode (= SGD: Stochastic Gradient Descent)

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $
  2. For every training epoch:

    A. For every $ \langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D} $:

    • (a) Compute output (prediction)
    • (b) Calculate error
    • (c) Update $ \mathbf{w}, b $

“On-line” mode II (alternative)

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $
  2. For every training epoch:

    A. Pick random $ \langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D} $:

    • (a) Compute output (prediction)
    • (b) Calculate error
    • (c) Update $ \mathbf{w}, b $

Batch mode

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $
  2. For every training epoch:

    A. Take all training examples from $\mathcal{D} $:

    • (a) Compute output (prediction)
    • (b) Calculate error

    B. Update $ \mathbf{w}, b $

Minibatch mode (mix between on-line and batch)

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $
  2. For every training epoch:

    A. For every minibatch of size k, namely

    $ (\langle x^{[i]}, y^{[i]} \rangle, \dots, \langle x^{[i+k-1]}, y^{[i+k-1]} \rangle) \in \mathcal{D}$

    • (a) Compute output (prediction)
    • (b) Calculate error
    • (c) Update $ \mathbf{w} := \mathbf{w} + \Delta \mathbf{w}, \quad b := b + \Delta b $
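As an illustration, the following sketch (names are mine, not from the lecture) cuts a dataset into minibatches of size k; the commented loop shows one reasonable choice of update per minibatch, using a learning rate and the mean error over the batch:

```python
import numpy as np

def iterate_minibatches(X, y, k, shuffle=True, seed=0):
    """Yield (X_batch, y_batch) pairs of (at most) k examples each."""
    rng = np.random.default_rng(seed)
    idx = np.arange(X.shape[0])
    if shuffle:
        rng.shuffle(idx)                 # random order of examples each epoch
    for start in range(0, len(idx), k):
        batch = idx[start:start + k]
        yield X[batch], y[batch]

# One epoch of minibatch training could then look like (eta = learning rate):
# for X_b, y_b in iterate_minibatches(X, y, k=32):
#     y_hat = X_b @ w + b                  # (a) outputs for the minibatch
#     err = y_b - y_hat                    # (b) errors for the minibatch
#     w = w + eta * X_b.T @ err / len(y_b) # (c) one update per minibatch
#     b = b + eta * err.mean()
```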

2. Relation between perceptron and linear regression

Perceptron

Linear Regression

(Least-Squares) Linear Regression

\mathbf{w} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y},

assuming the bias is included in $ \mathbf{w}$ and the design matrix $\mathbf{X}$ has an additional column of 1’s.
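A minimal NumPy sketch of this closed-form fit (my own illustration; X is an n×m data matrix, y an n-vector, and the bias column is added explicitly):

```python
import numpy as np

def fit_ols(X, y):
    """Least-squares linear regression via the normal equation."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend a column of 1's for the bias
    # Solving the linear system is numerically preferable to forming the inverse explicitly.
    w = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
    return w[0], w[1:]                             # bias, weights
```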


3. An iterative training algorithm for linear regression

(Least-Squares) Linear Regression

Better way

Update Rules (“on-line” mode)

Perceptron Learning Rule

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $
  2. For every training epoch:

    A. For every $ \langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D} $:

    • (a) $ \hat{y}^{[i]} := \sigma\bigl(x^{[i]\top}\mathbf{w} + b\bigr) $
    • (b) $ err := \bigl(y^{[i]} - \hat{y}^{[i]}\bigr) $
    • (c) $ \mathbf{w} := \mathbf{w} + err \times x^{[i]}, \quad b := b + err $

Stochastic Gradient Descent (Vectorized)

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $

  2. For every training epoch:

    A. For every $ \langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D} $:

    • (a) $ \hat{y}^{[i]} := \sigma\bigl(x^{[i]\top}\mathbf{w} + b\bigr) $

    • (b) $ \nabla_{\mathbf{w}}\mathcal{L} = \bigl(\hat{y}^{[i]} - y^{[i]}\bigr)x^{[i]}, \quad \nabla_{b}\mathcal{L} = \bigl(\hat{y}^{[i]} - y^{[i]}\bigr) $

    • (c) $ \mathbf{w} := \mathbf{w} + \eta \times \bigl(-\nabla_{\mathbf{w}}\mathcal{L}\bigr), \quad b := b + \eta \times \bigl(-\nabla_{b}\mathcal{L}\bigr) $

    where $\eta$ = learning rate, $\bigl(-\nabla_{\mathbf{w}}\mathcal{L}\bigr)$ = negative gradient
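A minimal NumPy sketch of this vectorized SGD loop for a linear neuron (my own illustration; σ is taken to be the identity, as in linear regression):

```python
import numpy as np

def train_sgd(X, y, eta=0.01, num_epochs=20, seed=0):
    """Stochastic gradient descent for linear regression, one example at a time."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0

    for epoch in range(num_epochs):
        for i in rng.permutation(X.shape[0]):
            y_hat = X[i] @ w + b             # (a) prediction (identity activation)
            grad_w = (y_hat - y[i]) * X[i]   # (b) gradient of the 1/2 squared error w.r.t. w
            grad_b = (y_hat - y[i])
            w = w - eta * grad_w             # (c) step in the negative gradient direction
            b = b - eta * grad_b
    return w, b
```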

Stochastic Gradient Descent (For understanding only)

  1. Initialize $ \mathbf{w} := \mathbf{0} \in \mathbb{R}^{m}, \quad b := 0 $

  2. For every training epoch:

    A. For every $ \langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D} $:

    • (a) $ \hat{y}^{[i]} := \sigma\bigl(x^{[i]\top}\mathbf{w} + b\bigr) $

    B. For each weight $ j \in \{1, \ldots, m\} $:

    • (b) $ \frac{\partial \mathcal{L}}{\partial w_{j}} = \bigl(\hat{y}^{[i]} - y^{[i]}\bigr)x_{j}^{[i]} $

    • (c) $ w_{j} := w_{j} + \eta \times \Bigl(-\frac{\partial \mathcal{L}}{\partial w_{j}}\Bigr) $

    C. $ \frac{\partial \mathcal{L}}{\partial b} = \bigl(\hat{y}^{[i]} - y^{[i]}\bigr) $, $ b := b + \eta \times \Bigl(-\frac{\partial \mathcal{L}}{\partial b}\Bigr) $

    Note that this update coincidentally looks almost the same as the perceptron learning rule, except that the prediction $\hat{y}^{[i]}$ is a real number and there is a learning rate $\eta$.

This learning rule is called Gradient Descent


4. Calculus Refresher I: Derivatives

Differential Calculus Refresher

  1. The derivative of a function $f(x)$ at a point $x=a$ is defined as the limit of the difference quotient (if it exists):
f'(a) = \lim\limits_{\Delta x \to 0} \frac{f(a+\Delta x) - f(a)}{\Delta x}
  2. If a function $f(x)$ is differentiable at every point of an interval $A$, we say that $f(x)$ is differentiable on $A$. In this case, for each $ x=a \in A$ there exists a corresponding derivative $f'(a)$.
f'(x) = \frac{df}{dx} = \lim\limits_{\Delta x \to 0} \frac{f(x+ \Delta x) - f(x)}{ \Delta x}

Example, $f(x) = 2x$:

f'(x) = \lim\limits_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{\Delta x} = \lim\limits_{\Delta x \to 0} \frac{2x+2\Delta x-2x}{\Delta x} = \lim\limits_{\Delta x \to 0} \frac{2\Delta x}{\Delta x} = 2

Example, $f(x) = x^2$:

f'(x) = \lim\limits_{\Delta x \to 0} \frac{f(x+\Delta x) - f(x)}{\Delta x} = \lim\limits_{\Delta x \to 0} \frac{x^2+2x \Delta x+(\Delta x)^2-x^2}{\Delta x} = \lim\limits_{\Delta x \to 0} \frac{2x \Delta x+(\Delta x)^2}{\Delta x} = \lim\limits_{\Delta x \to 0} \bigl(2x+ \Delta x\bigr) = 2x
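A tiny numeric sanity check of the difference quotient (my own illustration, not part of the lecture):

```python
def numeric_derivative(f, x, dx=1e-6):
    """Approximate f'(x) with the difference quotient for a small dx."""
    return (f(x + dx) - f(x)) / dx

# f(x) = x**2 has derivative 2x, so at x = 3.0 we expect roughly 6.0
print(numeric_derivative(lambda x: x**2, 3.0))  # ~6.000001
```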

Cheatsheet 1: Frequently used derivative formulas

f(x) = a \quad \Rightarrow \quad f'(x) = 0
f(x) = x \quad \Rightarrow \quad f'(x) = 1
f(x) = ax \quad \Rightarrow \quad f'(x) = a
f(x) = x^a \quad \Rightarrow \quad f'(x) = ax^{a-1}
f(x) = a^x \quad \Rightarrow \quad f'(x) = \log(a)\, a^x
f(x) = \log(x) \quad \Rightarrow \quad f'(x) = \frac{1}{x}
f(x) = \log_a(x) \quad \Rightarrow \quad f'(x) = \frac{1}{x \log(a)}
f(x) = \sin(x) \quad \Rightarrow \quad f'(x) = \cos(x)
f(x) = \cos(x) \quad \Rightarrow \quad f'(x) = -\sin(x)
f(x) = \tan(x) \quad \Rightarrow \quad f'(x) = \sec^2(x)

Cheatsheet 2: Derivative Rules

h(x) = f(x) + g(x), \quad h'(x) = f'(x) + g'(x)
h(x) = f(x) - g(x), \quad h'(x) = f'(x) - g'(x)
h(x) = f(x)\, g(x), \quad h'(x) = f'(x) g(x) + f(x) g'(x)
h(x) = \frac{f(x)}{g(x)}, \quad h'(x) = \frac{f'(x) g(x) - f(x) g'(x)}{g(x)^2}
h(x) = \frac{1}{f(x)}, \quad h'(x) = \frac{-f'(x)}{f(x)^2}
h(x) = f(g(x)), \quad h'(x) = f'(g(x))\, g'(x)

Chain Rule & Computation Graph

x \xrightarrow{g} g(x) \xrightarrow{f} f(g(x)) = z

To compute $z'$, evaluate $f'(g(x))$ along $x \xrightarrow{g} g(x) \xrightarrow{f'} f'(g(x))$ and $g'(x)$ along $x \xrightarrow{g'} g'(x)$, then multiply:

f'(g(x)) \cdot g'(x) = z'

Chain Rule & Leibniz Notation

\frac{d}{dx}[f(g(x))] = \frac{df}{dg}\cdot \frac{dg}{dx}

\frac{d}{dx}[f(g(h(u(v(x)))))] = \frac{df}{dg}\cdot \frac{dg}{dh} \cdot \frac{dh}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx}

Example, $f(g(x)) = \log\bigl(\sqrt{x}\bigr)$ with $g(x) = \sqrt{x}$:

\frac{d}{dg}\log(g) = \frac{1}{g} = \frac{1}{\sqrt{x}}, \qquad \frac{d}{dx}x^{\frac{1}{2}} = \frac{1}{2 \sqrt{x}}, \qquad \frac{df}{dx} = \frac{1}{2 \sqrt{x}}\cdot\frac{1}{\sqrt{x}} = \frac{1}{2x}

\frac{df}{dg}\cdot \frac{dg}{dh} \cdot \frac{dh}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx} \quad (\text{reverse mode: start from the outer parts}) \;=\; \frac{dv}{dx} \cdot \frac{du}{dv} \cdot \frac{dh}{du} \cdot \frac{dg}{dh} \cdot \frac{df}{dg} \quad (\text{forward mode: start from the inner parts})
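A quick numeric check of the worked example (my own sketch, using only the standard library):

```python
import math

x = 4.0
# Chain rule: df/dx = (df/dg) * (dg/dx) with g = sqrt(x) and f(g) = log(g)
df_dg = 1.0 / math.sqrt(x)           # 1/g evaluated at g = sqrt(x)
dg_dx = 1.0 / (2.0 * math.sqrt(x))   # derivative of x**0.5
print(df_dg * dg_dx, 1.0 / (2.0 * x))  # both print 0.125
```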

5. Calculus Refresher II: Gradients

Derivatives of Multivariable Functions

  1. For a multivariable function $f(x_1, x_2, \dots, x_m)$, the gradient collects all partial derivatives:

\nabla f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_m} \end{bmatrix} \in \mathbb{R}^{m \times 1}

Example, $f(x, y) = x^2 y + y$:

\nabla f(x, y) = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} = \begin{bmatrix} 2xy \\ x^2+1 \end{bmatrix}

Multivariable Chain Rule

\frac{d}{dx}[f(g(x),h(x))] = \frac{\partial f}{\partial g} \cdot \frac{\partial g}{\partial x} + \frac{\partial f}{\partial h} \cdot \frac{\partial h}{\partial x}

Example, $f(g, h) = g^2 h + h$ with $g(x) = 3x$ and $h(x) = x^2$:

\frac{d}{dx}[f(g(x),h(x))] = 2gh \cdot 3 + (g^2+1)\cdot 2x = 2xg^2 + 6gh + 2x

Multivariable Chain Rule in vector form

\frac{\partial f}{\partial x} = \begin{bmatrix} \frac{\partial f}{\partial g} & \frac{\partial f}{\partial h} \end{bmatrix} \begin{bmatrix} \frac{\partial g}{\partial x} \\ \frac{\partial h}{\partial x} \end{bmatrix} = \frac{\partial f}{\partial g} \frac{\partial g}{\partial x} + \frac{\partial f}{\partial h} \frac{\partial h}{\partial x}

With

\mathbf{v}(x) = \begin{bmatrix} g(x) \\ h(x) \end{bmatrix}, \qquad \mathbf{v}'(x) = \begin{bmatrix} \frac{dg}{dx} \\ \frac{dh}{dx} \end{bmatrix}, \qquad \nabla f(g, h) = \begin{bmatrix} \frac{\partial f}{\partial g} \\ \frac{\partial f}{\partial h} \end{bmatrix},

this can be written as

\frac{\partial f}{\partial x} = \nabla f(g, h) \cdot \mathbf{v}'(x) = \begin{bmatrix} \frac{\partial f}{\partial g} & \frac{\partial f}{\partial h} \end{bmatrix} \begin{bmatrix} \frac{dg}{dx} \\ \frac{dh}{dx} \end{bmatrix} = \frac{\partial f}{\partial g} \frac{dg}{dx} + \frac{\partial f}{\partial h} \frac{dh}{dx}

The Jacobian Matrix

\mathbf{f}(\mathbf{x}) = \begin{bmatrix} f_1(x_1, x_2, \dots, x_m) \\ f_2(x_1, x_2, \dots, x_m) \\ \vdots \\ f_m(x_1, x_2, \dots, x_m) \end{bmatrix}, \quad J(\mathbf{x}) = \frac{\partial \mathbf{f}}{\partial \mathbf{x}} = \begin{bmatrix} \frac{\partial f_1}{\partial x_1} & \frac{\partial f_1}{\partial x_2} & \cdots & \frac{\partial f_1}{\partial x_m} \\ \frac{\partial f_2}{\partial x_1} & \frac{\partial f_2}{\partial x_2} & \cdots & \frac{\partial f_2}{\partial x_m} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial f_m}{\partial x_1} & \frac{\partial f_m}{\partial x_2} & \cdots & \frac{\partial f_m}{\partial x_m} \end{bmatrix} \in \mathbb{R}^{m \times m}

Each row of $J$ is the transposed gradient of one component function:

\nabla f_i(\mathbf{x}) = \begin{bmatrix} \frac{\partial f_i}{\partial x_1} \\ \frac{\partial f_i}{\partial x_2} \\ \vdots \\ \frac{\partial f_i}{\partial x_m} \end{bmatrix} \in \mathbb{R}^{m \times 1}
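As a small illustration (my own sketch), the gradient of the example $f(x, y) = x^2 y + y$ can be approximated numerically and compared with the analytic result $[2xy,\; x^2 + 1]$:

```python
import numpy as np

def numeric_gradient(f, x, dx=1e-6):
    """Approximate the gradient of f at x with forward differences."""
    grad = np.zeros_like(x)
    for i in range(len(x)):
        step = np.zeros_like(x)
        step[i] = dx
        grad[i] = (f(x + step) - f(x)) / dx
    return grad

f = lambda v: v[0] ** 2 * v[1] + v[1]            # f(x, y) = x^2*y + y
point = np.array([2.0, 3.0])
print(numeric_gradient(f, point))                # ~[12.0, 5.0]
print(np.array([2 * 2.0 * 3.0, 2.0 ** 2 + 1]))   # analytic gradient [2xy, x^2 + 1]
```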

6. Understanding Gradient Descent

Back to Linear Regression

Gradient Descent

\text{SSE}(y, \hat{y}) = \sum_i \bigl(y^{[i]} - \hat{y}^{[i]}\bigr)^2

Benefits of convexity:

  • A convex loss surface has a single global minimum, so there are no poor local minima to get stuck in.
  • With a suitable learning rate, gradient descent converges toward this global minimum.


Returning to Previous Notes on Least-Squares Linear Regression

The update rule turns out to be this:

“On-line” mode (Perceptron learning rule)

  1. Initialize
\mathbf{w} := \mathbf{0} \in \mathbb{R}^m, \quad b := 0
  2. For every training epoch:
    • For every $\langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D}$:

\hat{y}^{[i]} := x^{[i]\top} \mathbf{w} + b

\text{err} := y^{[i]} - \hat{y}^{[i]}

\mathbf{w} := \mathbf{w} + \text{err} \cdot x^{[i]}, \quad b := b + \text{err}

Stochastic Gradient Descent (SGD)

  1. Initialize
\mathbf{w} := \mathbf{0}, \quad b := 0
  2. For every training epoch:
    • For every $\langle x^{[i]}, y^{[i]} \rangle \in \mathcal{D}$:

\hat{y}^{[i]} := x^{[i]\top} \mathbf{w} + b

\mathbf{w} := \mathbf{w} - \eta \cdot \nabla_\mathbf{w} \mathcal{L}, \quad b := b - \eta \cdot \nabla_b \mathcal{L}

Linear Regression Loss Derivative

The sum of squared errors (SSE) loss is also called the squared loss.

\mathcal{L}(\mathbf{w}, b) = \sum_i \bigl(\hat{y}^{[i]} - y^{[i]}\bigr)^2

\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w_j} &= \frac{\partial}{\partial w_j} \sum_i \bigl(\hat{y}^{[i]} - y^{[i]}\bigr)^2 = \frac{\partial}{\partial w_j} \sum_i \bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr)^2 \\
&= \sum_i 2\bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \frac{\partial}{\partial w_j}\bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \\
&= \sum_i 2\bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \frac{d\sigma}{d(\mathbf{w}^\top x^{[i]})} \frac{\partial}{\partial w_j}\mathbf{w}^\top x^{[i]} \\
&= \sum_i 2\bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \frac{d\sigma}{d(\mathbf{w}^\top x^{[i]})} \, x_j^{[i]}
\end{aligned}

Alternative Linear Regression Loss Derivative

The mean squared error (MSE) loss is often scaled by a factor of ½ for convenience, so that the 2 from the power rule cancels when differentiating.

\mathcal{L}(\mathbf{w}, b) = \frac{1}{2n} \sum_i \bigl(\hat{y}^{[i]} - y^{[i]}\bigr)^2

\begin{aligned}
\frac{\partial \mathcal{L}}{\partial w_j} &= \frac{\partial}{\partial w_j} \frac{1}{2n} \sum_i \bigl(\hat{y}^{[i]} - y^{[i]}\bigr)^2 = \frac{\partial}{\partial w_j} \sum_i \frac{1}{2n}\bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr)^2 \\
&= \sum_i \frac{1}{n}\bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \frac{\partial}{\partial w_j}\bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \\
&= \frac{1}{n}\sum_i \bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \frac{d\sigma}{d(\mathbf{w}^\top x^{[i]})} \frac{\partial}{\partial w_j}\mathbf{w}^\top x^{[i]} \\
&= \frac{1}{n}\sum_i \bigl(\sigma(\mathbf{w}^\top x^{[i]}) - y^{[i]}\bigr) \frac{d\sigma}{d(\mathbf{w}^\top x^{[i]})} \, x_j^{[i]}
\end{aligned}
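A small sketch (my own, assuming the identity activation for σ and synthetic data) that compares this analytic MSE gradient with a finite-difference approximation:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))               # 50 examples, 3 features
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3   # synthetic targets
w, b = rng.normal(size=3), 0.1

def mse_half(w, b):
    """MSE loss scaled by 1/2 (identity activation)."""
    return np.mean((X @ w + b - y) ** 2) / 2

# Analytic gradient: (1/n) * sum_i (y_hat_i - y_i) * x_i
grad_w = X.T @ (X @ w + b - y) / len(y)

# Numerical gradient via forward differences, one weight at a time
dx = 1e-6
grad_num = np.array([(mse_half(w + dx * np.eye(3)[j], b) - mse_half(w, b)) / dx
                     for j in range(3)])

print(np.allclose(grad_w, grad_num, atol=1e-4))  # True
```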

How to Think About Contour Plots

The loss over two weights can be visualized as a 3D surface plot or flattened into a 2D contour plot. Gradient descent updates point in the direction of steepest descent, which is perpendicular to the contour lines.


Stochastic Gradient Descent as Surface Plot

Stochastic updates are a bit noisier because each minibatch is an approximation of the overall loss on the training set.

Analogy:
Imagine you are a scientist who develops a new pharmaceutical drug.


7. Training an Adaptive Linear Neuron (Adaline)

(Least-Squares) Linear Regression

We want to avoid the computation and memory cost of the closed-form solution (in particular, inverting $\mathbf{X}^{\top}\mathbf{X}$) when the input data has many dimensions.

\mathbf{w} = (\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y},

assuming the bias is included in $\mathbf{w}$ and the design matrix $\mathbf{X}$ has an additional column of 1’s.


ADALINE

Widrow and Hoff’s ADALINE (1960): A nicely differentiable neuron model.
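To preview where this is going, here is a minimal Adaline-style training sketch (my own illustration, following the update rules above): a linear (identity) activation during training, gradient descent on the squared loss, and a threshold applied only for the final class prediction. The 0.5 threshold for {0, 1} targets is an assumption of this sketch.

```python
import numpy as np

def train_adaline(X, y, eta=0.01, num_epochs=50):
    """Full-batch gradient descent on the 1/2 MSE loss with a linear activation."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(num_epochs):
        y_hat = X @ w + b                 # linear net input = activation
        err = y_hat - y
        w = w - eta * X.T @ err / len(y)  # gradient of the 1/2 MSE loss w.r.t. w
        b = b - eta * err.mean()          # gradient w.r.t. b
    return w, b

def predict(X, w, b):
    """Threshold the linear output to obtain class labels in {0, 1} (threshold is an assumption)."""
    return np.where(X @ w + b > 0.5, 1, 0)
```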