Lecture 08

(Multinomial) Logistic Regression

Logistic Regression

In this lecture we covered logistic regression in the context of an artificial neuron. We also explored logits and cross-entropy (negative log-likelihood) and learnt how to generalise classification to multiple classes.

Basic Function: Sigmoid

\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]

Use this as the activation function in a neuron. Because its output is bounded between 0 and 1, we can interpret it as a probability. To predict labels we can apply a threshold function, but the threshold is not used for training.
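A minimal sketch of the sigmoid in plain Python, to make the bounds concrete. The function name `sigmoid` and the two-branch form (used here to avoid overflow for large negative inputs) are choices made for this illustration, not something taken from the lecture code.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, split into two branches to avoid overflow in exp()."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)  # z < 0 here, so exp(z) is at most 1 and cannot overflow
    return ez / (1.0 + ez)

# The output always stays between 0 and 1, and sigmoid(0) is exactly 0.5.
for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```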

\[ a = \sigma(z), \quad z = w^{T}x+b \]

\[ P(y|x) = a^{y}(1-a)^{(1-y)} \]

which is the Bernoulli distribution.

The likelihood function of the Bernoulli distribution over $n$ training examples is

\[ \prod_{i=1}^{n} \left( \sigma(z^{(i)}) \right)^{y^{(i)}} \left( 1 - \sigma(z^{(i)}) \right)^{1 - y^{(i)}} \]

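But we usually minimise the negative log-likelihood $\ell(w)$ instead, because a sum of logs is numerically more stable and easier to differentiate than a product of many probabilities

\[ \hat{w} = \arg\min_{w} \, \ell(w) \]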

Logistic loss learning rule

Loss for one training example

\[ \mathcal{L}(\mathbf{w}) = -\left[ y^{(i)} \log\left( \sigma(z^{(i)}) \right) + (1 - y^{(i)}) \log\left( 1 - \sigma(z^{(i)})\right) \right] \]

We apply this loss function with the same stochastic gradient descent rule as in linear regression to obtain gradients. After the gradients are calculated, we proceed with the same update process as ADALINE.
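A rough sketch of that update loop in plain Python, assuming the standard result that the gradient of the logistic loss with respect to the net input is $a - y$. The function name `train_logistic_sgd`, the learning rate, the epoch count and the toy AND dataset are all assumptions made for this example; the linked lecture notebook uses PyTorch instead.

```python
import math
import random

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def train_logistic_sgd(X, y, lr=0.5, epochs=1000, seed=0):
    """Stochastic gradient descent on the logistic loss, one example at a time."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            z = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            err = sigmoid(z) - y[i]  # dL/dz for the logistic loss
            w = [wj - lr * err * xj for wj, xj in zip(w, X[i])]
            b -= lr * err
    return w, b

# Toy data: logical AND, which is linearly separable.
X = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
y = [0, 0, 0, 1]
w, b = train_logistic_sgd(X, y)
preds = [1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0 for x in X]
print(w, b, preds)
```

The update has the same form as the ADALINE rule; only the activation (sigmoid instead of identity) and the loss differ.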

3. Predicting Labels vs Probabilities

In logistic regression, the model produces a probability value between 0 and 1 through the sigmoid activation.
To convert this probability into a class label, we typically apply a threshold function.

\[ \hat{y} := \begin{cases} 1 & \text{if } \sigma(z) > 0.5 \\ 0 & \text{otherwise} \end{cases} \]

This threshold rule is equivalent to checking the sign of the linear combination before applying the sigmoid:

\[ \hat{y} := \begin{cases} 1 & \text{if } z > 0.0 \\ 0 & \text{otherwise} \end{cases} \]

We can think of this thresholding step as a separate part of the model that converts the continuous neural network output into a discrete class label.
However, it’s important to note that:

Predicted class labels are not used during training.
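Logistic regression (like ADALINE and modern neural networks) optimizes based on the continuous output (here, the predicted probabilities), not the thresholded labels.
The reason is that the threshold function is a step function: its gradient is zero everywhere except at the threshold, where it is not differentiable, so it cannot be used in gradient-based optimization.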
The logistic function allows for smooth updates of the weights by using probabilities rather than binary outcomes.
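The equivalence of the two threshold rules can be checked with a short sketch. The function names are hypothetical and the net inputs are arbitrary test values.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_from_probability(z):
    """Threshold the sigmoid output at 0.5."""
    return 1 if sigmoid(z) > 0.5 else 0

def predict_from_logit(z):
    """Threshold the net input at 0, skipping the sigmoid entirely."""
    return 1 if z > 0.0 else 0

zs = [-3.0, -0.1, 0.0, 0.1, 3.0]
labels_prob = [predict_from_probability(z) for z in zs]
labels_logit = [predict_from_logit(z) for z in zs]
print(labels_prob, labels_logit)
```

At prediction time, checking the sign of $z$ gives the same labels without evaluating the exponential.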

4. Logits and Cross-Entropy

About the term “Logits”

Logits = log-odds unit
The logit function is:

\[ \text{logit}(p) = \log\left(\frac{p}{1-p}\right) \]

In deep learning, the term typically means the net input of the last layer of neurons
In logistic regression, the logits are: $w^{T}x + b$
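To see why the net input is called a logit, the sketch below checks that the logit function is the inverse of the sigmoid: applying it to $\sigma(z)$ recovers $z$. The function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Log-odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

for z in (-4.0, -1.0, 0.0, 2.0, 5.0):
    # logit(sigmoid(z)) recovers z up to floating-point rounding
    print(z, logit(sigmoid(z)))
```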

About the term “Binary Cross Entropy”

Negative log-likelihood and binary cross entropy are equivalent
Binary cross-entropy is defined as:

\[ \mathcal{H}_a(y) = - \sum_{i} \Big[ y^{[i]} \log(a^{[i]}) + (1 - y^{[i]}) \log(1 - a^{[i]}) \Big] \]

(Multi-category) Cross Entropy for K different class labels is defined as:
- This assumes one-hot encoding where the y’s are either 0 or 1

\[ \mathcal{H}_a(y) = - \sum_{i=1}^{n} \sum_{k=1}^{K} y^{[i]}_k \log(a^{[i]}_k) \]

5. Logistic Regression Code Example

https://github.com/rasbt/stat453-deep-learning-ss21/blob/master/L08/code/logistic-regression.ipynb
Implements logistic regression with PyTorch nn.Module

6. Generalizing to Multiple Classes: Softmax Regression

Approaches to multi-class classification

One-vs-all: predict each class label independently then choose the class with the highest confidence score
All-vs-all: explicitly predict the probability of each competing outcome then choose the class with the highest confidence score
Predict probabilities of class membership simultaneously (softmax activations are mutually exclusive class probabilities that sum to 1)

The Softmax Function

The softmax function converts a vector of logits (net inputs) into a probability distribution:

\[ \sigma_{\text{softmax}}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}} \]

where:

$\mathbf{z} = \mathbf{Wx} + \mathbf{b}$ are the logits (net inputs)
$K$ is the number of classes
The outputs sum to 1: $\sum_{j=1}^{K} \sigma_{\text{softmax}}(\mathbf{z})_j = 1$

Why “Softmax”?

Softmax is a differentiable (soft) version of the max function:

The argmax function (hard max) outputs 1 for the largest value and 0 for all others
Softmax outputs probabilities, with the largest logit getting the highest probability
Because it’s differentiable, we can use gradient descent to optimize it
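A small sketch contrasting the two functions on one made-up logit vector. The names `softmax` and `hardmax` are chosen for this example; subtracting the maximum logit before exponentiating is a common stability trick that does not change the result.

```python
import math

def softmax(z):
    """Softmax over a list of logits, shifted by max(z) for numerical stability."""
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def hardmax(z):
    """One-hot vector marking the position of the largest logit."""
    k = max(range(len(z)), key=lambda j: z[j])
    return [1.0 if j == k else 0.0 for j in range(len(z))]

z = [2.0, 1.0, 0.1]
probs = softmax(z)
print(probs, sum(probs))  # probabilities that sum to 1 (up to rounding)
print(hardmax(z))         # all mass on the largest logit
```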

7. One-Hot Encoding and Multi-category Cross-Entropy

One-Hot Encoding for Multi-class Labels

When working with multinomial (softmax) logistic regression, class labels must be converted from integer format to one-hot encoded format. One-hot encoding represents each class label as a binary vector where only one element is 1 (indicating the true class) and all others are 0.

Example transformation:

Original Labels	One-Hot Encoded (4 classes)
Class 0	[1, 0, 0, 0]
Class 1	[0, 1, 0, 0]
Class 3	[0, 0, 0, 1]
Class 2	[0, 0, 1, 0]
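The transformation in the table can be sketched in a few lines of plain Python. The helper name `one_hot` is hypothetical.

```python
def one_hot(labels, num_classes):
    """Convert integer class labels into one-hot rows."""
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[label] = 1  # only the true class position is set to 1
        rows.append(row)
    return rows

encoded = one_hot([0, 1, 3, 2], num_classes=4)
print(encoded)
# [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0]]
```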

Multi-category Cross-Entropy Loss

The cross-entropy loss function for multi-class classification extends the binary case to handle $h$ different class labels (the number of classes, written $K$ in the earlier sections). For $n$ training examples and $h$ classes:

\[ \mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]}) \]

Relationship to Binary Cross-Entropy

The multi-category cross-entropy reduces to binary cross-entropy when $h=2$, taking $y_1 = y$, $y_2 = 1 - y$, $a_1 = a$ and $a_2 = 1 - a$:

Binary case:

\[ \mathcal{L} = -\sum_{i=1}^{n} \left(y^{[i]} \log(a^{[i]}) + (1-y^{[i]}) \log(1-a^{[i]})\right) \]

Multi-category case with one-hot encoding:

\[ \mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]}) \]

Both formulations measure how well the predicted probability distribution matches the true distribution.
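As a numerical check of the $h=2$ case, the sketch below builds two-column one-hot labels $[y, 1-y]$ and probabilities $[a, 1-a]$ from made-up binary data and compares both losses. The function names and the values are assumptions for illustration.

```python
import math

def binary_cross_entropy(y, a):
    """Summed binary cross-entropy over all examples."""
    return -sum(yi * math.log(ai) + (1 - yi) * math.log(1.0 - ai)
                for yi, ai in zip(y, a))

def categorical_cross_entropy(Y, A):
    """Summed multi-category cross-entropy for one-hot rows Y and probability rows A."""
    return sum(-yj * math.log(aj)
               for y_row, a_row in zip(Y, A)
               for yj, aj in zip(y_row, a_row))

y = [1, 0, 1, 1]
a = [0.9, 0.2, 0.6, 0.3]

# Two-class view of the same data: column 1 is the positive class.
Y = [[yi, 1 - yi] for yi in y]
A = [[ai, 1.0 - ai] for ai in a]

bce = binary_cross_entropy(y, a)
cce = categorical_cross_entropy(Y, A)
print(bce, cce)  # the two values agree up to rounding
```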

Practical Example

Consider a batch of 4 training examples with 3 classes:

True labels (one-hot):

\[ \mathbf{Y}_{\text{onehot}} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix} \]

Predicted probabilities:

\[ \mathbf{A}_{\text{softmax}} = \begin{bmatrix} 0.3792 & 0.3104 & 0.3104 \\ 0.3072 & 0.4147 & 0.2780 \\ 0.4263 & 0.2248 & 0.3490 \\ 0.2668 & 0.2978 & 0.4354 \end{bmatrix} \]

Loss calculations:

Example 1: $\mathcal{L}^{[1]} = -1 \cdot \log(0.3792) = 0.969692$
Example 2: $\mathcal{L}^{[2]} = -1 \cdot \log(0.4147) = 0.880200$
Example 3: $\mathcal{L}^{[3]} = -1 \cdot \log(0.3490) = 1.052680$
Example 4: $\mathcal{L}^{[4]} = -1 \cdot \log(0.4354) = 0.831490$

Average loss:

\[ \mathcal{L} = \frac{1}{4}(0.9697 + 0.8802 + 1.0527 + 0.8315) \approx 0.9335 \]
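The numbers above can be reproduced with a short sketch that uses the rounded probabilities from the matrix, so the last digits may differ slightly from the values listed. The function name is illustrative.

```python
import math

Y = [[1, 0, 0],
     [0, 1, 0],
     [0, 0, 1],
     [0, 0, 1]]

A = [[0.3792, 0.3104, 0.3104],
     [0.3072, 0.4147, 0.2780],
     [0.4263, 0.2248, 0.3490],
     [0.2668, 0.2978, 0.4354]]

def cross_entropy_per_example(Y, A):
    """Cross-entropy of each row: only the true-class term is non-zero."""
    return [-sum(yj * math.log(aj) for yj, aj in zip(y_row, a_row))
            for y_row, a_row in zip(Y, A)]

losses = cross_entropy_per_example(Y, A)
mean_loss = sum(losses) / len(losses)
print([round(l, 4) for l in losses])
print(round(mean_loss, 4))
```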

8. Softmax Regression Learning Rule

Gradient Computation via Chain Rule

To update weights using gradient descent, we need to compute the gradient for each weight. Using the multivariable chain rule:

\[ \frac{\partial \mathcal{L}}{\partial w_{1,2}} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}} + \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}} \]

Simplified Gradient Formula

After applying the chain rule and simplifying, the gradient reduces to:

\[ \frac{\partial \mathcal{L}}{\partial w_{j,i}} = -(y_j - a_j)x_i \]

Vectorized form:

\[ \nabla_{\mathbf{W}} \mathcal{L} = -(\mathbf{X}^{\top}(\mathbf{Y} - \mathbf{A}))^{\top} \]
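One way to sanity-check the simplified formula is to compare it with a finite-difference gradient for a single training example. The sketch below does this in plain Python; the weights, input, label and helper names are made up for illustration, and `W` is stored with one row per class so that `W[j][i]` corresponds to $w_{j,i}$.

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def forward(W, b, x):
    """Net inputs z = Wx + b followed by softmax."""
    z = [sum(wji * xi for wji, xi in zip(row, x)) + bj for row, bj in zip(W, b)]
    return softmax(z)

def loss(W, b, x, y):
    """Cross-entropy for one example with one-hot label y."""
    return -sum(yj * math.log(aj) for yj, aj in zip(y, forward(W, b, x)))

def analytic_grad(W, b, x, y):
    """dL/dw_{j,i} = -(y_j - a_j) * x_i"""
    a = forward(W, b, x)
    return [[-(yj - aj) * xi for xi in x] for yj, aj in zip(y, a)]

def numeric_grad(W, b, x, y, eps=1e-6):
    """Central finite differences, one weight at a time."""
    grad = [[0.0] * len(x) for _ in W]
    for j in range(len(W)):
        for i in range(len(x)):
            Wp = [row[:] for row in W]
            Wm = [row[:] for row in W]
            Wp[j][i] += eps
            Wm[j][i] -= eps
            grad[j][i] = (loss(Wp, b, x, y) - loss(Wm, b, x, y)) / (2 * eps)
    return grad

W = [[0.1, -0.2], [0.0, 0.3], [-0.1, 0.05]]  # 3 classes, 2 features
b = [0.0, 0.1, -0.1]
x = [1.5, -0.5]
y = [0, 1, 0]

ga, gn = analytic_grad(W, b, x, y), numeric_grad(W, b, x, y)
max_diff = max(abs(p - q) for ra, rn in zip(ga, gn) for p, q in zip(ra, rn))
print(max_diff)  # should be very small if the formula is right
```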