Lecture 08
(Multinomial) Logistic Regression
Logistic Regression
In this lecture we covered logistic regression in the context of an artificial neuron. We also explored logits and cross-entropy (the negative log-likelihood) and learnt how to generalise classification to multiple classes.
Basic Function: Sigmoid
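The logistic sigmoid maps the net input $z$ to the interval $(0, 1)$:
$$\sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \mathbf{w}^{\top}\mathbf{x} + b$$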
We use this as the activation function in a neuron. Because its output is bounded between 0 and 1, we can interpret the sigmoid output as a probability. To predict labels we can use a threshold function, but we do not use this threshold for training.
The sigmoid output $\hat{y} = \sigma(z)$ can be interpreted as the parameter of a Bernoulli distribution, i.e. $P(y = 1 \mid \mathbf{x}) = \hat{y}$.
The likelihood function of the Bernoulli distribution is:
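For $n$ training examples, with $\hat{y}^{[i]} = \sigma\!\left(\mathbf{w}^{\top} x^{[i]} + b\right)$:
$$\mathcal{L}(\mathbf{w}, b) = \prod_{i=1}^{n} P\left(y^{[i]} \mid x^{[i]}\right) = \prod_{i=1}^{n} \left(\hat{y}^{[i]}\right)^{y^{[i]}} \left(1 - \hat{y}^{[i]}\right)^{1 - y^{[i]}}$$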
But we often minimise the negative log-likelihood instead, for numerical and computational convenience:
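$$-\log \mathcal{L}(\mathbf{w}, b) = -\sum_{i=1}^{n} \left[\, y^{[i]} \log\left(\hat{y}^{[i]}\right) + \left(1 - y^{[i]}\right) \log\left(1 - \hat{y}^{[i]}\right) \,\right]$$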
Logistic loss learning rule
Loss for one training example:
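$$\mathcal{L}\left(\hat{y}^{[i]}, y^{[i]}\right) = -\, y^{[i]} \log\left(\hat{y}^{[i]}\right) - \left(1 - y^{[i]}\right) \log\left(1 - \hat{y}^{[i]}\right)$$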
We apply this loss function with the same stochastic gradient descent rule as in linear regression to obtain the gradients. After the gradients are computed, we proceed with the same update step as in ADALINE.
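Differentiating the loss above with respect to a weight $w_j$ (and the bias $b$) gives the familiar error-times-input form:
$$\frac{\partial \mathcal{L}}{\partial w_j} = -\left( y^{[i]} - \hat{y}^{[i]} \right) x_j^{[i]}, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\left( y^{[i]} - \hat{y}^{[i]} \right)$$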
3. Predicting Labels vs Probabilities
In logistic regression, the model produces a probability value between 0 and 1 through the sigmoid activation.
To convert this probability into a class label, we typically apply a threshold function.
This threshold rule is equivalent to checking the sign of the linear combination before applying the sigmoid:
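$$\hat{y}_{\text{label}} =
\begin{cases}
1 & \text{if } \sigma\!\left(\mathbf{w}^{\top}\mathbf{x} + b\right) \ge 0.5 \\
0 & \text{otherwise}
\end{cases}
\;\Longleftrightarrow\;
\hat{y}_{\text{label}} =
\begin{cases}
1 & \text{if } \mathbf{w}^{\top}\mathbf{x} + b \ge 0 \\
0 & \text{otherwise}
\end{cases}$$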
We can think of this thresholding step as a separate part of the model that converts the continuous neural network output into a discrete class label.
However, it’s important to note that:
- Predicted class labels are not used during training. Logistic regression (like ADALINE and modern neural networks) optimizes based on probabilities, not the thresholded labels.
- The reason is that the threshold function is not differentiable, so it cannot be used in gradient-based optimization.
- The logistic function allows for smooth updates of the weights by using probabilities rather than binary outcomes.
4. Logits and Cross-Entropy
About the term “Logits”
- Logits = log-odds unit
- The logit function (the inverse of the sigmoid) is: $\text{logit}(p) = \log\left(\frac{p}{1 - p}\right)$
- Typically means the net input of the last neuron layer
- In logistic regression, logits are: wᵀ x
About the term “Binary Cross Entropy”
- Negative log-likelihood and binary cross entropy are equivalent
- Binary cross-entropy and the (multi-category) cross-entropy for $K$ different class labels are defined as shown after this list
- This assumes one-hot encoding where the y’s are either 0 or 1
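Binary cross-entropy for a single training example with predicted probability $\hat{y}$ and true label $y \in \{0, 1\}$:
$$\mathcal{L}(\hat{y}, y) = -\left[\, y \log(\hat{y}) + (1 - y) \log(1 - \hat{y}) \,\right]$$
Multi-category cross-entropy for $K$ different class labels, with one-hot label vector $\mathbf{y}$ and predicted probabilities $\hat{\mathbf{y}}$:
$$\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{k=1}^{K} y_k \log\left(\hat{y}_k\right)$$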
5. Logistic Regression Code Example
- https://github.com/rasbt/stat453-deep-learning-ss21/blob/master/L08/code/logistic-regression.ipynb
- Implements logistic regression with PyTorch nn.Module (a minimal sketch of the idea is shown below)
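A minimal sketch (not the notebook's exact code) of logistic regression as an `nn.Module`: a single linear layer followed by a sigmoid, trained with the binary cross-entropy loss on illustrative toy data.

```python
import torch
import torch.nn.functional as F

class LogisticRegression(torch.nn.Module):
    def __init__(self, num_features):
        super().__init__()
        self.linear = torch.nn.Linear(num_features, 1)

    def forward(self, x):
        logits = self.linear(x)          # net input w^T x + b
        probas = torch.sigmoid(logits)   # P(y = 1 | x)
        return probas

# Illustrative toy data (placeholders, not the lecture's dataset)
X = torch.randn(10, 2)
y = (X[:, 0] + X[:, 1] > 0).float().view(-1, 1)

model = LogisticRegression(num_features=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

for epoch in range(5):
    probas = model(X)
    loss = F.binary_cross_entropy(probas, y)  # negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```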
6. Generalizing to Multiple Classes: Softmax Regression
Approaches to multi-class classification
- One-vs-all: predict each class label independently, then choose the class with the highest confidence score
- All-vs-all: explicitly predict the probability of each competing outcome, then choose the class with the highest confidence score
- Softmax (multinomial logistic) regression: predict the probabilities of class membership simultaneously (softmax activations are mutually exclusive class probabilities that sum to 1)
The Softmax Function
The softmax function converts a vector of logits (net inputs) into a probability distribution:
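$$\sigma_{\text{softmax}}(\mathbf{z})_j = \frac{e^{z_j}}{\sum_{k=1}^{K} e^{z_k}}, \qquad j = 1, \dots, K$$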
where:
- $\mathbf{z} = \mathbf{Wx} + \mathbf{b}$ are the logits (net inputs)
- $K$ is the number of classes
- The outputs sum to 1: $\sum_{j=1}^{K} \sigma_{\text{softmax}}(\mathbf{z})_j = 1$
Why “Softmax”?
Softmax is a differentiable (soft) version of the max function:
- The argmax function (hard max) outputs 1 for the largest value and 0 for all others
- Softmax outputs probabilities, with the largest logit getting the highest probability
- Because it’s differentiable, we can use gradient descent to optimize it
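A quick illustration in PyTorch (arbitrary example logits, not from the lecture) of the difference between the hard argmax and the softmax:

```python
import torch

# Arbitrary example logits (net inputs) for K = 3 classes
z = torch.tensor([2.0, 1.0, 0.1])

# Hard (arg)max: a one-hot vector, not differentiable
hard = torch.zeros_like(z)
hard[torch.argmax(z)] = 1.0
print(hard)        # tensor([1., 0., 0.])

# Softmax: a smooth probability distribution over the classes
soft = torch.softmax(z, dim=0)
print(soft)        # approx. tensor([0.6590, 0.2424, 0.0986])
print(soft.sum())  # tensor(1.)
```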
7. One-Hot Encoding and Multi-category Cross-Entropy
One-Hot Encoding for Multi-class Labels
When working with multinomial (softmax) logistic regression, class labels must be converted from integer format to one-hot encoded format. One-hot encoding represents each class label as a binary vector where only one element is 1 (indicating the true class) and all others are 0.
Example transformation:
| Original Labels | One-Hot Encoded (4 classes) |
|---|---|
| Class 0 | [1, 0, 0, 0] |
| Class 1 | [0, 1, 0, 0] |
| Class 3 | [0, 0, 0, 1] |
| Class 2 | [0, 0, 1, 0] |
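In PyTorch this conversion can be done with `torch.nn.functional.one_hot`; a minimal sketch reproducing the table above:

```python
import torch
import torch.nn.functional as F

# Integer class labels in the same order as the table above
labels = torch.tensor([0, 1, 3, 2])

# One-hot encode with 4 classes
onehot = F.one_hot(labels, num_classes=4)
print(onehot)
# tensor([[1, 0, 0, 0],
#         [0, 1, 0, 0],
#         [0, 0, 0, 1],
#         [0, 0, 1, 0]])
```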
Multi-category Cross-Entropy Loss
The cross-entropy loss function for multi-class classification extends the binary case to handle $h$ different class labels. For $n$ training examples and $h$ classes:
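$$\mathcal{L}(\mathbf{W}, \mathbf{b}) = \sum_{i=1}^{n} \sum_{j=1}^{h} -\, y_j^{[i]} \log\left( a_j^{[i]} \right)$$
where $a_j^{[i]}$ is the predicted (softmax) probability of class $j$ for example $i$; for reporting, this sum is often averaged over the $n$ examples.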
Relationship to Binary Cross-Entropy
The multi-category cross-entropy reduces to binary cross-entropy when $h=2$:
Binary case:
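$$\mathcal{L}(\hat{y}, y) = -\, y \log(\hat{y}) - (1 - y) \log(1 - \hat{y})$$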
Multi-category case with one-hot encoding:
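$$\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y}) = -\sum_{j=1}^{2} y_j \log\left(\hat{y}_j\right) = -\, y_1 \log(\hat{y}_1) - y_2 \log(\hat{y}_2),$$
which recovers the binary case because, with one-hot encoding, $y_2 = 1 - y_1$ and $\hat{y}_2 = 1 - \hat{y}_1$.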
Both formulations measure how well the predicted probability distribution matches the true distribution.
Practical Example
Consider a batch of 4 training examples with 3 classes. With one-hot true labels, only the predicted probability assigned to each example's true class contributes to its loss; in this batch those probabilities are 0.3792, 0.4147, 0.3490, and 0.4354.
Loss calculations:
- Example 1: \(\mathcal{L}^{[1]} = -1 \cdot \log(0.3792) = 0.969692\)
- Example 2: \(\mathcal{L}^{[2]} = -1 \cdot \log(0.4147) = 0.880200\)
- Example 3: \(\mathcal{L}^{[3]} = -1 \cdot \log(0.3490) = 1.052680\)
- Example 4: \(\mathcal{L}^{[4]} = -1 \cdot \log(0.4354) = 0.831490\)
Average loss:
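Averaging the four per-example losses above gives
$$\frac{1}{4}\left(0.969692 + 0.880200 + 1.052680 + 0.831490\right) \approx 0.9335$$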
8. Softmax Regression Learning Rule
Gradient Computation via Chain Rule
To update weights using gradient descent, we need to compute the gradient for each weight. Using the multivariable chain rule:
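For one training example, write the loss as $\mathcal{L} = -\sum_{l=1}^{h} y_l \log(a_l)$ with activations $\mathbf{a} = \sigma_{\text{softmax}}(\mathbf{z})$ and net inputs $z_j = \sum_{k} w_{j,k}\, x_k + b_j$. Then
$$\frac{\partial \mathcal{L}}{\partial w_{j,k}} = \sum_{l=1}^{h} \frac{\partial \mathcal{L}}{\partial a_l} \cdot \frac{\partial a_l}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{j,k}}$$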
Simplified Gradient Formula
After applying the chain rule and simplifying, the gradient reduces to:
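$$\frac{\partial \mathcal{L}}{\partial w_{j,k}} = -\left( y_j - a_j \right) x_k$$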
Vectorized form:
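$$\nabla_{\mathbf{W}} \mathcal{L} = -\left( \mathbf{Y} - \mathbf{A} \right)^{\top} \mathbf{X}$$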
where:
- \(\mathbf{W} \in \mathbb{R}^{h \times m}\) (weight matrix)
- \(\mathbf{X} \in \mathbb{R}^{n \times m}\) (input features)
- \(\mathbf{A} \in \mathbb{R}^{n \times h}\) (predicted probabilities)
- \(\mathbf{Y} \in \mathbb{R}^{n \times h}\) (one-hot labels)
Gradient Descent Update Rule
The stochastic gradient descent algorithm for softmax regression:
Initialize: set the weights \(\mathbf{W}\) and biases \(\mathbf{b}\) to zeros (or small random values).
For each training epoch (iterating over training examples $i$):
- Forward pass:
  \[ \hat{y}^{[i]} = \sigma_{\text{softmax}}(\mathbf{W}x^{[i]} + \mathbf{b}) \]
- Compute gradients:
  \[ \nabla_{\mathbf{W}} \mathcal{L} = -(y^{[i]} - \hat{y}^{[i]}) \cdot x^{[i]^{\top}}, \qquad \nabla_{\mathbf{b}} \mathcal{L} = -(y^{[i]} - \hat{y}^{[i]}) \]
- Update parameters:
  \[ \mathbf{W} := \mathbf{W} + \eta \times (-\nabla_{\mathbf{W}} \mathcal{L}), \qquad \mathbf{b} := \mathbf{b} + \eta \times (-\nabla_{\mathbf{b}} \mathcal{L}) \]
where \(\eta\) is the learning rate.
Note: The gradient has the same elegant form as in logistic regression and ADALINE: the error term \((y^{[i]} - \hat{y}^{[i]})\) multiplied by the input \(x^{[i]}\).
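A minimal sketch of this learning rule with manually computed gradients (full-batch updates on random toy data rather than the per-example SGD written above; all data and names here are purely illustrative):

```python
import torch
import torch.nn.functional as F

# Shapes follow the notes: X is n x m, Y is n x h (one-hot), W is h x m, b is h.
torch.manual_seed(1)
n, m, h = 100, 2, 3
X = torch.randn(n, m)
Y = F.one_hot(torch.randint(0, h, (n,)), num_classes=h).float()

W = torch.zeros(h, m)   # initialize weights
b = torch.zeros(h)      # initialize biases
eta = 0.1               # learning rate

for epoch in range(50):
    Z = X @ W.T + b                 # logits, n x h
    A = torch.softmax(Z, dim=1)     # predicted probabilities, n x h
    grad_W = -(Y - A).T @ X / n     # h x m, gradient averaged over the batch
    grad_b = -(Y - A).mean(dim=0)   # h
    W += eta * (-grad_W)            # W := W + eta * (-grad_W)
    b += eta * (-grad_b)            # b := b + eta * (-grad_b)
```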