Lecture 08

(Multinomial) Logistic Regression

Logistic Regression

In this lecture we covered logistic regression in the context of an artificial neuron. We also explored logits and cross-entropy (negative log-likelihood) and learnt how to generalise classification to multiple classes.

Basic Function: Sigmoid

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Use this as the activation function in a neuron. Because its output is bounded between 0 and 1, we can interpret the sigmoid output as a probability. To predict labels we can apply a threshold function to that output, but the threshold is not used during training.

\[ a = \sigma(w^{T}x+b) \] \[ P(y|x) = a^{y}(1-a)^{(1-y)} \]

which is the Bernoulli distribution.
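
The two formulas above can be sketched in a few lines of plain Python. This is only an illustration: the helper names (`sigmoid`, `neuron_output`, `bernoulli_prob`) and the example weights and inputs are assumptions chosen for this sketch, not values from the lecture.

```python
import math

def sigmoid(z):
    """Logistic sigmoid, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def neuron_output(w, x, b):
    """Activation a = sigmoid(w^T x + b) for a single example."""
    z = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
    return sigmoid(z)

def bernoulli_prob(y, a):
    """P(y | x) = a^y * (1 - a)^(1 - y) for a label y in {0, 1}."""
    return (a ** y) * ((1.0 - a) ** (1 - y))

# Hypothetical weights and input, chosen so that the net input z is exactly 0.
a = neuron_output(w=[0.5, -0.25], x=[2.0, 4.0], b=0.0)
print(a)                                          # 0.5
print(bernoulli_prob(1, a), bernoulli_prob(0, a))  # the two cases sum to 1
```

For any activation, the probabilities of the two labels add up to 1, which is what lets us read `a` as P(y = 1 | x).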

The likelihood function of the Bernoulli distribution over \(n\) training examples is

\[ \prod_{i=1}^{n} \left( \sigma(z^{(i)}) \right)^{y^{(i)}} \left( 1 - \sigma(z^{(i)}) \right)^{1 - y^{(i)}} \]

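But we usually minimise the negative log-likelihood \(\ell(w)\) instead, because a sum of logs is easier to differentiate and numerically more stable than a product of many probabilities.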

\[ \hat{w} = \arg\min \, \ell(w) \]
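
One practical reason to prefer the negative log-likelihood can be seen in a small experiment. The sketch below uses made-up data (2000 examples, each predicted with probability 0.5); the function names are assumptions for illustration.

```python
import math

def likelihood(probs, labels):
    """Product of Bernoulli probabilities, one factor per example."""
    result = 1.0
    for a, y in zip(probs, labels):
        result *= (a ** y) * ((1.0 - a) ** (1 - y))
    return result

def negative_log_likelihood(probs, labels):
    """Negative sum of log-probabilities over the same examples."""
    return -sum(
        y * math.log(a) + (1 - y) * math.log(1.0 - a)
        for a, y in zip(probs, labels)
    )

# Hypothetical data: 2000 examples, all predicted with probability 0.5.
probs = [0.5] * 2000
labels = [1, 0] * 1000

print(likelihood(probs, labels))               # underflows to 0.0
print(negative_log_likelihood(probs, labels))  # 2000 * log(2), about 1386.29
```

The product becomes too small to represent as a float, while the sum of logs stays well within range.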

Logistic loss learning rule

Loss for one training example

\[ \mathcal{L}(\mathbf{w}) = -\left[ y^{(i)} \log\left( \sigma(z^{(i)}) \right) + (1 - y^{(i)}) \log\left( 1 - \sigma(z^{(i)})\right) \right] \]

We apply this loss function with the same stochastic gradient descent rule as in linear regression to obtain gradients. After the gradients are calculated, we proceed with the same update process as ADALINE.
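
As a worked step, the gradient can be written out with the chain rule. This is a sketch for a single example, writing \(a = \sigma(z)\) and \(z = w^{T}x + b\), with \(\eta\) denoting the learning rate:

\[ \mathcal{L} = -\left[ y \log(a) + (1 - y) \log(1 - a) \right] \]

\[ \frac{\partial \mathcal{L}}{\partial a} = \frac{a - y}{a(1 - a)}, \qquad \frac{\partial a}{\partial z} = a(1 - a), \qquad \frac{\partial z}{\partial w_j} = x_j \]

\[ \frac{\partial \mathcal{L}}{\partial w_j} = (a - y)\,x_j = -(y - a)\,x_j, \qquad \frac{\partial \mathcal{L}}{\partial b} = -(y - a) \]

\[ w_j := w_j + \eta\,(y - a)\,x_j, \qquad b := b + \eta\,(y - a) \]

The \(a(1-a)\) factors cancel, which is why the update has the same form as the ADALINE rule.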


3. Predicting Labels vs Probabilities

In logistic regression, the model produces a probability value between 0 and 1 through the sigmoid activation.
To convert this probability into a class label, we typically apply a threshold function.

\[ \hat{y} := \begin{cases} 1 & \text{if } \sigma(z) > 0.5 \\ 0 & \text{otherwise} \end{cases} \]

This threshold rule is equivalent to checking the sign of the linear combination before applying the sigmoid:

\[ \hat{y} := \begin{cases} 1 & \text{if } z > 0.0 \\ 0 & \text{otherwise} \end{cases} \]
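
The equivalence of the two rules follows from \(\sigma(0) = 0.5\) and the fact that the sigmoid is strictly increasing. A quick check in Python, with illustrative function names and a hand-picked list of net inputs:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_from_probability(z, threshold=0.5):
    """Threshold applied to the sigmoid output."""
    return 1 if sigmoid(z) > threshold else 0

def predict_from_net_input(z):
    """Threshold applied directly to the net input z."""
    return 1 if z > 0.0 else 0

# Hypothetical net inputs, including the boundary case z = 0.
net_inputs = [-3.0, -0.5, 0.0, 0.5, 3.0]
labels_a = [predict_from_probability(z) for z in net_inputs]
labels_b = [predict_from_net_input(z) for z in net_inputs]
print(labels_a)  # [0, 0, 0, 1, 1]
print(labels_b)  # [0, 0, 0, 1, 1]
```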

We can think of this thresholding step as a separate part of the model that converts the continuous neural network output into a discrete class label.
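However, it's important to note that the threshold is only used to predict labels. It is not used during training, which works on the continuous sigmoid output.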


4. Logits and Cross-Entropy

About the term “Logits”

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) \]
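
The logit function is the inverse of the sigmoid, so applying it to a sigmoid output recovers the net input \(z\). A minimal check, with function names chosen for this sketch:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p in (0, 1)."""
    return math.log(p / (1.0 - p))

# A few arbitrary net inputs: logit(sigmoid(z)) should give back z.
for z in [-2.0, 0.0, 1.5]:
    p = sigmoid(z)
    print(z, p, logit(p))
```

This is why the net inputs of a logistic regression model are commonly called logits.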

About the term “Binary Cross Entropy”

\[ \mathcal{H}_{a}(y) = - \sum_{i=1}^{n} \Big[ y^{[i]} \log(a^{[i]}) + (1 - y^{[i]}) \log(1 - a^{[i]}) \Big] \]

With \(K\) classes and one-hot labels this becomes:

\[ \mathcal{H}_{a}(y) = - \sum_{i=1}^{n} \sum_{k=1}^{K} y^{[i]}_k \log(a^{[i]}_k) \]

5. Logistic Regression Code Example
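
A minimal sketch of what such an example could look like, in plain Python with no external libraries. The toy dataset, the hyperparameters (`eta`, `epochs`, `seed`) and the function names are all assumptions made for illustration.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic_regression(X, y, eta=0.1, epochs=200, seed=0):
    """Stochastic gradient descent on the logistic (cross-entropy) loss."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit the examples in a new order each epoch
        for i in order:
            z = sum(w_j * x_j for w_j, x_j in zip(w, X[i])) + b
            a = sigmoid(z)
            error = y[i] - a  # (y - a), same form as the ADALINE error
            w = [w_j + eta * error * x_j for w_j, x_j in zip(w, X[i])]
            b += eta * error
    return w, b

def predict(X, w, b):
    """Class labels via the threshold z > 0 (equivalent to sigmoid(z) > 0.5)."""
    return [
        1 if sum(w_j * x_j for w_j, x_j in zip(w, x)) + b > 0.0 else 0
        for x in X
    ]

# Hypothetical, linearly separable toy data with two features.
X_toy = [[0.0, 0.0], [0.5, 1.0], [1.0, 0.5], [3.0, 3.5], [3.5, 3.0], [4.0, 4.0]]
y_toy = [0, 0, 0, 1, 1, 1]

w, b = train_logistic_regression(X_toy, y_toy)
print(w, b)
print(predict(X_toy, w, b))
```

Note that the threshold only appears in `predict`. Training uses the continuous sigmoid output throughout.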

6. Generalizing to Multiple Classes: Softmax Regression

Approaches to multi-class classification

The softmax function converts a vector of logits (net inputs) into a probability distribution:

\[ \sigma_{\text{softmax}}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \]

where \(z_j\) is the logit (net input) for class \(j\) and \(K\) is the number of classes.

Why “Softmax”?

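Softmax is a differentiable (soft) version of the max function: rather than selecting only the largest logit, it gives every class a nonzero probability, with the largest logit receiving the largest share.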
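
The softmax formula can be sketched as follows. Subtracting the largest logit before exponentiating is a common stability trick that is not part of the formula itself (it leaves the result unchanged); the example logits are made up.

```python
import math

def softmax(z):
    """Softmax over a list of logits; subtracting max(z) avoids overflow."""
    z_max = max(z)
    exps = [math.exp(z_j - z_max) for z_j in z]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for three classes.
logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
print(probs, sum(probs))

# Scaling the logits up pushes the output toward a hard, one-hot "max" choice.
sharp = softmax([10.0 * z_j for z_j in logits])
print(sharp)
```

The second call illustrates the "soft" part of the name: as the gaps between logits grow, the output approaches an indicator of the largest logit.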


7. One-Hot Encoding and Multi-category Cross-Entropy

One-Hot Encoding for Multi-class Labels

When working with multinomial (softmax) logistic regression, class labels must be converted from integer format to one-hot encoded format. One-hot encoding represents each class label as a binary vector where only one element is 1 (indicating the true class) and all others are 0.

Example transformation:

Original label    One-hot encoded (4 classes)
Class 0           [1, 0, 0, 0]
Class 1           [0, 1, 0, 0]
Class 3           [0, 0, 0, 1]
Class 2           [0, 0, 1, 0]
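
The transformation can be sketched in plain Python. The function name `one_hot` is an assumption for this example; it expects integer labels in the range 0 to `num_classes - 1`.

```python
def one_hot(labels, num_classes):
    """Convert integer class labels into one-hot encoded rows."""
    encoded = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1  # mark the position of the true class
        encoded.append(row)
    return encoded

labels = [0, 1, 3, 2]
Y_onehot = one_hot(labels, num_classes=4)
for label, row in zip(labels, Y_onehot):
    print(label, row)
```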

Multi-category Cross-Entropy Loss

The cross-entropy loss function for multi-class classification extends the binary case to handle $h$ different class labels. For $n$ training examples and $h$ classes:

\[ \mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]}) \]

Relationship to Binary Cross-Entropy

The multi-category cross-entropy reduces to binary cross-entropy when $h=2$:

Binary case:

\[ \mathcal{L} = -\sum_{i=1}^{n} \left(y^{[i]} \log(a^{[i]}) + (1-y^{[i]}) \log(1-a^{[i]})\right) \]

Multi-category case with one-hot encoding:

\[ \mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]}) \]

Both formulations measure how well the predicted probability distribution matches the true distribution.
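
To see the reduction explicitly, consider one example with \(h = 2\). Writing the one-hot label as \((y_1, y_2) = (y, 1 - y)\) and the two outputs as \((a_1, a_2) = (a, 1 - a)\), which holds because the outputs sum to 1:

\[ \sum_{j=1}^{2} -y_j \log(a_j) = -y \log(a) - (1 - y) \log(1 - a) = -\left( y \log(a) + (1 - y) \log(1 - a) \right) \]

Summing this over the \(n\) training examples gives the binary cross-entropy.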


Practical Example

Consider a batch of 4 training examples with 3 classes:

True labels (one-hot):

\[ \mathbf{Y}_{\text{onehot}} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \]

Predicted probabilities:

\[ \mathbf{A}_{\text{softmax}} = \begin{bmatrix} 0.3792 & 0.3104 & 0.3104 \\ 0.3072 & 0.4147 & 0.2780 \\ 0.4263 & 0.2248 & 0.3490 \\ 0.2668 & 0.2978 & 0.4354 \end{bmatrix} \]

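Loss calculations (negative log of the probability assigned to the true class of each example):

\[ \mathcal{L}^{[1]} = -\log(0.3792) \approx 0.9697 \] \[ \mathcal{L}^{[2]} = -\log(0.4147) \approx 0.8802 \] \[ \mathcal{L}^{[3]} = -\log(0.3490) \approx 1.0527 \] \[ \mathcal{L}^{[4]} = -\log(0.4354) \approx 0.8315 \]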

Average loss:

\[ \mathcal{L} = \frac{1}{4}(0.9697 + 0.8802 + 1.0527 + 0.8315) \approx 0.9335 \]
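
The numbers in this example can be reproduced with a short script. The matrices are copied from above; the function name is an assumption for this sketch.

```python
import math

Y_onehot = [
    [1, 0, 0],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 1],
]
A_softmax = [
    [0.3792, 0.3104, 0.3104],
    [0.3072, 0.4147, 0.2780],
    [0.4263, 0.2248, 0.3490],
    [0.2668, 0.2978, 0.4354],
]

def cross_entropy_per_example(Y, A):
    """One loss value per row: -sum_j y_j * log(a_j)."""
    return [
        -sum(y_j * math.log(a_j) for y_j, a_j in zip(y_row, a_row))
        for y_row, a_row in zip(Y, A)
    ]

losses = cross_entropy_per_example(Y_onehot, A_softmax)
average_loss = sum(losses) / len(losses)
print([round(loss, 4) for loss in losses])  # [0.9697, 0.8802, 1.0527, 0.8315]
print(round(average_loss, 4))               # 0.9335
```

Because the labels are one-hot, only the probability of the true class contributes to each example's loss.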

8. Softmax Regression Learning Rule

Gradient Computation via Chain Rule

To update weights using gradient descent, we need to compute the gradient for each weight. Using the multivariable chain rule:

\[ \frac{\partial \mathcal{L}}{\partial w_{1,2}} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}} + \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}} \]

Simplified Gradient Formula

After applying the chain rule and simplifying, the gradient reduces to:

\[ \frac{\partial \mathcal{L}}{\partial w_{j,i}} = -(y_j - a_j)x_i \]
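
The simplification can be sketched for a single example with one-hot label \(y\), loss \(\mathcal{L} = -\sum_{k} y_k \log(a_k)\), and \(z_j = \sum_i w_{j,i} x_i + b_j\). Here \(\delta_{kj}\) is the Kronecker delta (1 if \(k = j\), 0 otherwise), a symbol introduced only for this derivation:

\[ \frac{\partial \mathcal{L}}{\partial a_k} = -\frac{y_k}{a_k}, \qquad \frac{\partial a_k}{\partial z_j} = a_k(\delta_{kj} - a_j), \qquad \frac{\partial z_j}{\partial w_{j,i}} = x_i \]

\[ \frac{\partial \mathcal{L}}{\partial z_j} = \sum_{k=1}^{h} -\frac{y_k}{a_k}\, a_k (\delta_{kj} - a_j) = -y_j + a_j \sum_{k=1}^{h} y_k = -(y_j - a_j) \]

The last step uses \(\sum_k y_k = 1\) for a one-hot label. Multiplying by \(\partial z_j / \partial w_{j,i} = x_i\) gives the formula above.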

Vectorized form:

\[ \nabla_{\mathbf{W}} \mathcal{L} = -(\mathbf{X}^{\top}(\mathbf{Y} - \mathbf{A}))^{\top} \]

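where \(\mathbf{X} \in \mathbb{R}^{n \times m}\) holds the inputs, \(\mathbf{Y} \in \mathbb{R}^{n \times h}\) the one-hot labels, and \(\mathbf{A} \in \mathbb{R}^{n \times h}\) the softmax outputs, for \(n\) examples, \(m\) features, and \(h\) classes.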
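
One way to gain confidence in the gradient formula is to compare it with a finite-difference estimate. The sketch below does this in plain Python with nested lists; the toy values for `X`, `Y`, `W`, `b` and all function names are assumptions made for illustration.

```python
import math

def softmax(z):
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, b, x):
    """Softmax outputs for one example; W has shape h x m, b has length h."""
    z = [sum(w * x_i for w, x_i in zip(W_j, x)) + b_j for W_j, b_j in zip(W, b)]
    return softmax(z)

def loss(W, b, X, Y):
    """Cross-entropy summed over all examples."""
    total = 0.0
    for x, y in zip(X, Y):
        a = forward(W, b, x)
        total += -sum(y_j * math.log(a_j) for y_j, a_j in zip(y, a))
    return total

def grad_W(W, b, X, Y):
    """Analytic gradient: entry (j, i) is the sum over examples of -(y_j - a_j) * x_i."""
    h, m = len(W), len(W[0])
    G = [[0.0] * m for _ in range(h)]
    for x, y in zip(X, Y):
        a = forward(W, b, x)
        for j in range(h):
            for i in range(m):
                G[j][i] += -(y[j] - a[j]) * x[i]
    return G

def numeric_grad(W, b, X, Y, j, i, eps=1e-6):
    """Central finite-difference estimate of dL/dw_{j,i}."""
    W_plus = [row[:] for row in W]
    W_minus = [row[:] for row in W]
    W_plus[j][i] += eps
    W_minus[j][i] -= eps
    return (loss(W_plus, b, X, Y) - loss(W_minus, b, X, Y)) / (2 * eps)

# Hypothetical toy values: n = 3 examples, m = 2 features, h = 3 classes.
X = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3]]
Y = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.05]]
b = [0.0, 0.1, -0.1]

G = grad_W(W, b, X, Y)
print(G[0][1], numeric_grad(W, b, X, Y, 0, 1))  # the two values should agree closely
```

The nested loops in `grad_W` compute the same quantity as the vectorized expression, one entry at a time.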


Gradient Descent Update Rule

The stochastic gradient descent algorithm for softmax regression:

Initialize:

\[ \mathbf{W} := \mathbf{0} \in \mathbb{R}^{h \times m}, \quad \mathbf{b} := \mathbf{0} \in \mathbb{R}^{h} \]

For each training epoch, and for each training example \((x^{[i]}, y^{[i]})\):

  1. Forward pass: \[ \hat{y}^{[i]} = \sigma_{\text{softmax}}(\mathbf{W}x^{[i]} + \mathbf{b}) \]
  2. Compute gradients: \[ \nabla_{\mathbf{W}} \mathcal{L} = -(y^{[i]} - \hat{y}^{[i]}) \cdot x^{[i]^{\top}} \] \[ \nabla_{\mathbf{b}} \mathcal{L} = -(y^{[i]} - \hat{y}^{[i]}) \]
  3. Update parameters: \[ \mathbf{W} := \mathbf{W} + \eta \times (-\nabla_{\mathbf{W}} \mathcal{L}) \] \[ \mathbf{b} := \mathbf{b} + \eta \times (-\nabla_{\mathbf{b}} \mathcal{L}) \]

where \(\eta\) is the learning rate.
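
The three steps above can be sketched in plain Python. The toy dataset (three well-separated clusters), the hyperparameters and the function names are assumptions for this example rather than anything prescribed by the lecture.

```python
import math
import random

def softmax(z):
    z_max = max(z)
    exps = [math.exp(v - z_max) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def net_input(W, b, x):
    return [sum(w * x_i for w, x_i in zip(W_j, x)) + b_j for W_j, b_j in zip(W, b)]

def train_softmax_regression(X, Y_onehot, eta=0.1, epochs=200, seed=0):
    rng = random.Random(seed)
    h, m = len(Y_onehot[0]), len(X[0])
    W = [[0.0] * m for _ in range(h)]  # W := 0, shape h x m
    b = [0.0] * h                      # b := 0, length h
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for n in order:
            x, y = X[n], Y_onehot[n]
            # 1. Forward pass
            y_hat = softmax(net_input(W, b, x))
            # 2. Compute gradients
            grad_b = [-(y[j] - y_hat[j]) for j in range(h)]
            grad_W = [[grad_b[j] * x[i] for i in range(m)] for j in range(h)]
            # 3. Update parameters
            for j in range(h):
                b[j] += eta * (-grad_b[j])
                for i in range(m):
                    W[j][i] += eta * (-grad_W[j][i])
    return W, b

def predict(X, W, b):
    """Predicted label = index of the largest net input (same as largest softmax output)."""
    labels = []
    for x in X:
        z = net_input(W, b, x)
        labels.append(z.index(max(z)))
    return labels

# Hypothetical toy data: three clusters, two features, three classes.
X_toy = [[0.0, 0.0], [0.2, 0.4], [3.0, 0.1], [3.2, -0.2], [0.1, 3.0], [-0.2, 3.3]]
labels_toy = [0, 0, 1, 1, 2, 2]
Y_toy = [[1 if k == label else 0 for k in range(3)] for label in labels_toy]

W, b = train_softmax_regression(X_toy, Y_toy)
print(predict(X_toy, W, b))
```

Prediction takes the arg max of the net inputs directly, since softmax preserves their ordering.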

Note:
The gradient has the same elegant form as in logistic regression and ADALINE:
the error term \((y - \hat{y})\) multiplied by the input \(x\).