Lecture 08

(Multinomial) Logistic Regression

Logistic Regression

In this lecture we covered logistic regression in the context of an artificial neuron. We also explored logits and cross-entropy (the negative log-likelihood) and learnt how to generalise classification to multiple classes.

Basic Function: Sigmoid

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

We use this as the activation function in a neuron. Because its output is bounded between 0 and 1, we can interpret the sigmoid output as a probability. To predict labels we can apply a threshold function, but we do not use this for training.
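As a quick illustration (a minimal NumPy sketch, not the lecture's own code; the names and example values are mine), the sigmoid squashes any real-valued net input into (0, 1), and a 0.5 threshold then turns that probability into a hard label:

```python
import numpy as np

def sigmoid(z):
    # Maps any real number into (0, 1), so the output can be read as P(y=1 | x)
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-2.0, 0.0, 3.0])          # example net inputs w^T x + b
probs = sigmoid(z)                       # approx. [0.119, 0.5, 0.953]
labels = (probs >= 0.5).astype(int)      # thresholding: used only for prediction, not training
print(probs, labels)
```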

\[a = \sigma(\mathbf{w}^{T}\mathbf{x}+b)\] \[P(y \mid \mathbf{x}) = a^{y}(1-a)^{1-y}\]

which is the Bernoulli distribution.

The likelihood function for the Bernoulli distribution is

\[\prod_{i=1}^{n} \left( \sigma(z^{(i)}) \right)^{y^{(i)}} \left( 1 - \sigma(z^{(i)}) \right)^{1 - y^{(i)}}\]

But we often minimise the negative log-likelihood $\ell(\mathbf{w})$ instead, for numerical feasibility:

\[\hat{\mathbf{w}} = \arg\min_{\mathbf{w}} \, \ell(\mathbf{w})\]

Logistic loss learning rule

Loss for a single training example:

\[\mathcal{L}(\mathbf{w}) = -\Big[ y^{(i)} \log\left( \sigma(z^{(i)}) \right) + (1 - y^{(i)}) \log\left( 1 - \sigma(z^{(i)})\right) \Big]\]

We apply this loss function with the same stochastic gradient descent rule as in linear regression to obtain gradients. Once the gradients are calculated, we proceed with the same updating process as in ADALINE.
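A sketch of one such update for a single example (my own helper and naming, assuming $a = \sigma(\mathbf{w}^{T}\mathbf{x}+b)$ as above): the derivative of the loss collapses to the familiar "error times input" form.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x, y, eta=0.1):
    # Forward pass for a single training example x with label y in {0, 1}
    a = sigmoid(np.dot(w, x) + b)
    # Gradient of the per-example negative log-likelihood: "error times input"
    grad_w = (a - y) * x
    grad_b = a - y
    # Same update style as in ADALINE
    return w - eta * grad_w, b - eta * grad_b
```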


4. Logits and Cross-Entropy

About the term “Logits”

The logit is the log-odds, i.e. the inverse of the sigmoid; in deep learning the term is also used loosely for the raw net inputs $z$ fed into the sigmoid or softmax:

\[\text{logit}(p) = \log\left(\frac{p}{1-p}\right)\]

About the term “Binary Cross Entropy”

Binary cross-entropy is the negative log-likelihood from above, written as a cross-entropy between the label distribution $\mathbf{y}$ and the predicted distribution $\mathbf{a}$; the second formula below is its general $K$-class form:

\[\mathcal{H}_{\mathbf{a}}(\mathbf{y}) = - \sum_{i=1}^{n} \Big[ y^{[i]} \log({a}^{[i]}) + (1 - y^{[i]}) \log(1 - {a}^{[i]}) \Big]\]

\[\mathcal{H}_{\mathbf{a}}(\mathbf{y}) = - \sum_{i=1}^{n} \sum_{k=1}^{K} y^{[i]}_k \log(a^{[i]}_k)\]
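As a small sanity check (my own NumPy sketch with made-up numbers), the general $K$-class formula with $K = 2$ one-hot classes gives the same value as the binary formula:

```python
import numpy as np

y = np.array([1, 0, 1, 1])                # binary targets
a = np.array([0.9, 0.2, 0.7, 0.6])        # sigmoid outputs

# Binary cross-entropy
bce = -np.sum(y * np.log(a) + (1 - y) * np.log(1 - a))

# Same quantity via the general K-class formula with one-hot targets
Y = np.column_stack([1 - y, y])           # one-hot encoding of classes {0, 1}
A = np.column_stack([1 - a, a])
ce = -np.sum(Y * np.log(A))

print(bce, ce)                            # identical up to floating-point error
```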

5. Logistic Regression Code Example
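The lecture's original code example is not reproduced here; below is a minimal NumPy sketch of full-batch gradient descent for logistic regression on a toy dataset (the data, names, and hyperparameters are placeholders of my own):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy 2D dataset: two Gaussian blobs (hypothetical data, for illustration only)
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2.0, 1.0, size=(50, 2)),
               rng.normal(+2.0, 1.0, size=(50, 2))])
y = np.concatenate([np.zeros(50), np.ones(50)])

w = np.zeros(2)
b = 0.0
eta = 0.1

for epoch in range(100):
    a = sigmoid(X @ w + b)                 # forward pass
    grad_w = X.T @ (a - y) / len(y)        # mean gradient of the negative log-likelihood
    grad_b = np.mean(a - y)
    w -= eta * grad_w                      # gradient descent updates
    b -= eta * grad_b

preds = (sigmoid(X @ w + b) >= 0.5).astype(int)
print("training accuracy:", np.mean(preds == y))
```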

6. Generalizing to Multiple Classes: Softmax Regression

Approaches to multi-class classification

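For reference (this is the standard definition, stated here as an assumption rather than copied from the slides), softmax regression uses the softmax activation to convert the net inputs $z_1, \dots, z_h$ of the $h$ output units into a probability distribution over the classes:

\[a_j = \frac{e^{z_j}}{\sum_{k=1}^{h} e^{z_k}}, \qquad j = 1, \dots, h\]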

7. One-Hot Encoding and Multi-category Cross-Entropy

One-Hot Encoding for Multi-class Labels

When working with multinomial (softmax) logistic regression, class labels must be converted from integer format to one-hot encoded format. One-hot encoding represents each class label as a binary vector where only one element is 1 (indicating the true class) and all others are 0.

Example transformation:

| Original Label | One-Hot Encoded (4 classes) |
|----------------|-----------------------------|
| Class 0        | [1, 0, 0, 0]                |
| Class 1        | [0, 1, 0, 0]                |
| Class 3        | [0, 0, 0, 1]                |
| Class 2        | [0, 0, 1, 0]                |
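A minimal sketch of this conversion (my own helper function, not the lecture's code):

```python
import numpy as np

def to_onehot(labels, num_classes):
    # labels: integer class indices, shape (n,)
    onehot = np.zeros((labels.shape[0], num_classes))
    onehot[np.arange(labels.shape[0]), labels] = 1.0
    return onehot

labels = np.array([0, 1, 3, 2])
print(to_onehot(labels, num_classes=4))
# [[1. 0. 0. 0.]
#  [0. 1. 0. 0.]
#  [0. 0. 0. 1.]
#  [0. 0. 1. 0.]]
```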

Multi-category Cross-Entropy Loss

The cross-entropy loss function for multi-class classification extends the binary case to handle $h$ different class labels. For $n$ training examples and $h$ classes:

\[\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]})\]

Relationship to Binary Cross-Entropy

The multi-category cross-entropy reduces to binary cross-entropy when $h=2$:

Binary case: \(\mathcal{L} = -\sum_{i=1}^{n} \left(y^{[i]} \log(a^{[i]}) + (1-y^{[i]}) \log(1-a^{[i]})\right)\)

Multi-category case with one-hot encoding: \(\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]})\)

Both formulations measure how well the predicted probability distribution matches the true distribution.

Practical Example

Consider a batch of 4 training examples with 3 classes:

True labels (one-hot): \(\mathbf{Y}_{\text{onehot}} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}\)

Predicted probabilities: \(\mathbf{A}_{\text{softmax}} = \begin{bmatrix} 0.3792 & 0.3104 & 0.3104 \\ 0.3072 & 0.4147 & 0.2780 \\ 0.4263 & 0.2248 & 0.3490 \\ 0.2668 & 0.2978 & 0.4354 \end{bmatrix}\)

Loss calculations (with one-hot labels, each example's loss is the negative log of the probability assigned to its true class):

- Example 1: $-\log(0.3792) \approx 0.9697$
- Example 2: $-\log(0.4147) \approx 0.8802$
- Example 3: $-\log(0.3490) \approx 1.0527$
- Example 4: $-\log(0.4354) \approx 0.8315$

Average loss: $\mathcal{L} = \frac{1}{4}(0.9697 + 0.8802 + 1.0527 + 0.8315) \approx 0.9335$
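To make the arithmetic concrete, a short NumPy check (my own sketch, using the two matrices above) reproduces the same value:

```python
import numpy as np

Y_onehot = np.array([[1, 0, 0],
                     [0, 1, 0],
                     [0, 0, 1],
                     [0, 0, 1]])

A_softmax = np.array([[0.3792, 0.3104, 0.3104],
                      [0.3072, 0.4147, 0.2780],
                      [0.4263, 0.2248, 0.3490],
                      [0.2668, 0.2978, 0.4354]])

# Multi-category cross-entropy, averaged over the batch
per_example = -np.sum(Y_onehot * np.log(A_softmax), axis=1)
print(per_example)            # approx. [0.9697 0.8802 1.0527 0.8315]
print(per_example.mean())     # approx. 0.9335
```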


8. Softmax Regression Learning Rule

Gradient Computation via Chain Rule

To update weights using gradient descent, we need to compute $\frac{\partial \mathcal{L}}{\partial w_i}$ for each weight. Using the multivariable chain rule through the computational graph:

\[\frac{\partial \mathcal{L}}{\partial w_{1,2}} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}} + \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}}\]

Simplified Gradient Formula

After applying the chain rule and simplifying, the gradient reduces to:

\[\frac{\partial \mathcal{L}}{\partial w_{j,i}} = -(y_j - a_j)x_i\]

Vectorized form: \(\nabla_{\mathbf{W}} \mathcal{L} = -(\mathbf{X}^{\top}(\mathbf{Y} - \mathbf{A}))^{\top}\)

where $\mathbf{X} \in \mathbb{R}^{n \times m}$ is the matrix of inputs, $\mathbf{Y} \in \mathbb{R}^{n \times h}$ the matrix of one-hot encoded labels, and $\mathbf{A} \in \mathbb{R}^{n \times h}$ the matrix of softmax activations.

Gradient Descent Update Rule

The stochastic gradient descent algorithm for softmax regression:

Initialize: $\mathbf{W} := \mathbf{0} \in \mathbb{R}^{h \times m}$, $\mathbf{b} := \mathbf{0} \in \mathbb{R}^h$

For each training epoch, and for every training example $(\mathbf{x}^{[i]}, \mathbf{y}^{[i]})$ (or minibatch):

1. Compute the net inputs $\mathbf{z}^{[i]} = \mathbf{W}\mathbf{x}^{[i]} + \mathbf{b}$ and the softmax activations $\mathbf{a}^{[i]}$
2. Update $\mathbf{W} := \mathbf{W} + \eta\, (\mathbf{y}^{[i]} - \mathbf{a}^{[i]})(\mathbf{x}^{[i]})^{\top}$
3. Update $\mathbf{b} := \mathbf{b} + \eta\, (\mathbf{y}^{[i]} - \mathbf{a}^{[i]})$

where $\eta$ is the learning rate.

Note: The gradient has the same elegant form as in logistic regression and ADALINE: the error $(y - \hat{y})$ multiplied by the input $x$.
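A minimal NumPy sketch of this learning rule (full-batch rather than stochastic updates), reusing the one-hot labels from Section 7 with hypothetical inputs `X`; all names are my own:

```python
import numpy as np

def softmax(Z):
    # Row-wise softmax with max-subtraction for numerical stability
    Z = Z - Z.max(axis=1, keepdims=True)
    expZ = np.exp(Z)
    return expZ / expZ.sum(axis=1, keepdims=True)

# Hypothetical inputs: n=4 examples, m=2 features, h=3 classes
X = np.array([[ 0.1,  0.5],
              [ 1.1,  2.3],
              [-1.1, -2.3],
              [-1.5, -2.5]])
Y = np.array([[1, 0, 0],
              [0, 1, 0],
              [0, 0, 1],
              [0, 0, 1]])

n, m = X.shape
h = Y.shape[1]
W = np.zeros((h, m))
b = np.zeros(h)
eta = 0.1

for epoch in range(50):
    A = softmax(X @ W.T + b)          # forward pass, shape (n, h)
    grad_W = -(X.T @ (Y - A)).T       # gradient from the vectorized formula above
    grad_b = -np.sum(Y - A, axis=0)
    W -= eta * grad_W                 # gradient descent updates
    b -= eta * grad_b

print(softmax(X @ W.T + b))           # predicted class probabilities after training
```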