Lecture 08
(Multinomial) Logistic Regression
Logistic Regression
In this lecture we covered logistic regression in the context of an artificial neuron. We also explored logits and cross-entropy (the negative log-likelihood) and learnt how to generalise classification to multiple classes.
Basic Function: Sigmoid
\[\sigma(x) = \frac{1}{1 + e^{-x}}\]
We use this as the activation function in a neuron. Because the output is bounded between 0 and 1, we can interpret the sigmoid output as a probability. To predict labels we can apply a threshold function to the output, but the threshold is not used for training.
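As a minimal sketch (not from the lecture materials), the snippet below implements the sigmoid and applies a 0.5 threshold to obtain a label; the weights `w`, bias `b`, and input `x` are made-up placeholders.

```python
import numpy as np

def sigmoid(z):
    """Logistic sigmoid: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Made-up weights, bias, and a single input vector
w = np.array([0.5, -1.2])
b = 0.1
x = np.array([2.0, 0.3])

a = sigmoid(np.dot(w, x) + b)   # predicted probability P(y=1 | x)
y_pred = int(a > 0.5)           # thresholded label (used for prediction, not training)
print(a, y_pred)
```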
\[a = \sigma(w^{T}x+b)\]
\[P(y \mid x) = a^{y}(1-a)^{(1-y)}\]
which is the Bernoulli distribution.
The likelihood function for the Bernoulli distribution is
\[L(\mathbf{w}) = \prod_{i=1}^{n} \left( \sigma(z^{(i)}) \right)^{y^{(i)}} \left( 1 - \sigma(z^{(i)}) \right)^{1 - y^{(i)}}\]
In practice we minimise the negative log-likelihood instead, for numerical convenience (the log turns the product into a sum).
\[\hat{w} = \arg\min_{w} \, \ell(w)\]
Logistic loss learning rule
Loss for one training example
\[\mathcal{L}(\mathbf{w}) = -\left[ y^{(i)} \log\left( \sigma(z^{(i)}) \right) + (1 - y^{(i)}) \log\left( 1 - \sigma(z^{(i)}) \right) \right]\]
We apply this loss function with the same stochastic gradient descent rule as in linear regression to obtain the gradients. After the gradients are computed, we proceed with the same update process as in ADALINE.
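A minimal sketch of one such update step, assuming plain NumPy and a made-up learning rate; the gradient used here is the standard one for the loss above, $-(y - a)x$ for the weights and $-(y - a)$ for the bias.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_step(w, b, x_i, y_i, lr=0.1):
    """One stochastic gradient descent step for binary logistic regression.

    Same ADALINE-style update: the gradient of the negative
    log-likelihood w.r.t. the weights is -(y - a) * x.
    """
    a = sigmoid(np.dot(w, x_i) + b)   # predicted probability
    error = y_i - a                   # (y - a)
    w = w + lr * error * x_i          # w := w - lr * grad_w, with grad_w = -(y - a) x
    b = b + lr * error                # b := b - lr * grad_b, with grad_b = -(y - a)
    return w, b
```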
4. Logits and Cross-Entropy
About the term “Logits”
- Logits = log-odds unit
- The logit function is: $\text{logit}(p) = \log\frac{p}{1-p}$, i.e. the log-odds
- Typically means the net input of the last neuron layer
- In logistic regression, the logits are the net inputs: wᵀx + b
About the term “Binary Cross Entropy”
- Negative log-likelihood and binary cross entropy are equivalent
- Binary cross-entropy is defined as: \(\mathcal{L} = -\sum_{i=1}^{n} \left( y^{[i]} \log(a^{[i]}) + (1 - y^{[i]}) \log(1 - a^{[i]}) \right)\)
- (Multi-category) cross-entropy for $K$ different class labels is defined as: \(\mathcal{L} = -\sum_{i=1}^{n} \sum_{k=1}^{K} y_k^{[i]} \log(a_k^{[i]})\)
- This assumes one-hot encoding where the y’s are either 0 or 1
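As a quick check (not part of the lecture notebook), the sketch below compares the binary cross-entropy formula above against PyTorch's built-in `torch.nn.functional.binary_cross_entropy`; the probabilities and labels are made up.

```python
import torch
import torch.nn.functional as F

# Made-up predicted probabilities and binary labels
a = torch.tensor([0.9, 0.2, 0.7, 0.4])
y = torch.tensor([1.0, 0.0, 1.0, 1.0])

# Binary cross-entropy / negative log-likelihood written out by hand
manual = -(y * torch.log(a) + (1 - y) * torch.log(1 - a)).mean()

# PyTorch's built-in version (averages over examples by default)
builtin = F.binary_cross_entropy(a, y)

print(manual.item(), builtin.item())  # the two values agree
```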
5. Logistic Regression Code Example
- https://github.com/rasbt/stat453-deep-learning-ss21/blob/master/L08/code/logistic-regression.ipynb
- Implements logistic regression with PyTorch nn.Module
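A minimal sketch of what such an `nn.Module` implementation might look like; this is an illustrative reconstruction rather than the notebook's exact code, and `num_features`, the learning rate, and the use of plain SGD are assumptions.

```python
import torch
import torch.nn.functional as F

class LogisticRegression(torch.nn.Module):
    def __init__(self, num_features):
        super().__init__()
        # A single linear layer computes the logits w^T x + b
        self.linear = torch.nn.Linear(num_features, 1)

    def forward(self, x):
        logits = self.linear(x)
        return torch.sigmoid(logits)  # probabilities in (0, 1)

# Assumed setup: 2 input features, plain SGD, binary cross-entropy loss
model = LogisticRegression(num_features=2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

def train_step(x_batch, y_batch):
    # y_batch: float tensor of shape (batch_size, 1) with 0/1 labels
    probas = model(x_batch)
    loss = F.binary_cross_entropy(probas, y_batch)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```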
6. Generalizing to Multiple Classes: Softmax Regression
Approaches to multi-class classification
- One-vs-all: predict each class label independently, then choose the class with the highest confidence score
- All-vs-all: explicitly predict the probability of each competing outcome, then choose the class with the highest confidence score
- Predict probabilities of class membership simultaneously
- Activations are class-membership probabilities (NOT mutually exclusive classes)
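For reference, a minimal sketch of the softmax function, which turns a vector of net inputs (logits) into class-membership probabilities that sum to 1; the logit values below are made up.

```python
import numpy as np

def softmax(z):
    """Convert a vector of logits into probabilities that sum to 1."""
    z = z - np.max(z)            # subtract the max for numerical stability
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

logits = np.array([2.0, 1.0, 0.1])   # made-up net inputs for 3 classes
probas = softmax(logits)
print(probas, probas.sum())          # approx. [0.659 0.242 0.099], sums to 1.0
```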
7. One-Hot Encoding and Multi-category Cross-Entropy
One-Hot Encoding for Multi-class Labels
When working with multinomial (softmax) logistic regression, class labels must be converted from integer format to one-hot encoded format. One-hot encoding represents each class label as a binary vector where only one element is 1 (indicating the true class) and all others are 0.
Example transformation:
| Original Labels | One-Hot Encoded (4 classes) |
|---|---|
| Class 0 | [1, 0, 0, 0] |
| Class 1 | [0, 1, 0, 0] |
| Class 3 | [0, 0, 0, 1] |
| Class 2 | [0, 0, 1, 0] |
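One way to perform this conversion is PyTorch's `torch.nn.functional.one_hot`; a short sketch using the labels from the table above (in the same order):

```python
import torch
import torch.nn.functional as F

# Integer class labels in the same order as the table above
labels = torch.tensor([0, 1, 3, 2])

# Convert to one-hot vectors with 4 classes
onehot = F.one_hot(labels, num_classes=4)
print(onehot)
# tensor([[1, 0, 0, 0],
#         [0, 1, 0, 0],
#         [0, 0, 0, 1],
#         [0, 0, 1, 0]])
```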
Multi-category Cross-Entropy Loss
The cross-entropy loss function for multi-class classification extends the binary case to handle $h$ different class labels. For $n$ training examples and $h$ classes:
\[\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]})\]
Relationship to Binary Cross-Entropy
The multi-category cross-entropy reduces to binary cross-entropy when $h=2$:
Binary case: \(\mathcal{L} = -\sum_{i=1}^{n} \left(y^{[i]} \log(a^{[i]}) + (1-y^{[i]}) \log(1-a^{[i]})\right)\)
Multi-category case with one-hot encoding: \(\mathcal{L} = \sum_{i=1}^{n} \sum_{j=1}^{h} -y_j^{[i]} \log(a_j^{[i]})\)
Both formulations measure how well the predicted probability distribution matches the true distribution.
Practical Example
Consider a batch of 4 training examples with 3 classes:
True labels (one-hot): \(\mathbf{Y}_{\text{onehot}} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 1 \end{bmatrix}\)
Predicted probabilities: \(\mathbf{A}_{\text{softmax}} = \begin{bmatrix} 0.3792 & 0.3104 & 0.3104 \\ 0.3072 & 0.4147 & 0.2780 \\ 0.4263 & 0.2248 & 0.3490 \\ 0.2668 & 0.2978 & 0.4354 \end{bmatrix}\)
Loss calculations:
- Example 1: $\mathcal{L}^{[1]} = -1 \cdot \log(0.3792) = 0.969692$
- Example 2: $\mathcal{L}^{[2]} = -1 \cdot \log(0.4147) = 0.880200$
- Example 3: $\mathcal{L}^{[3]} = -1 \cdot \log(0.3490) = 1.052680$
- Example 4: $\mathcal{L}^{[4]} = -1 \cdot \log(0.4354) = 0.831490$
Average loss: $\mathcal{L} = \frac{1}{4}(0.9697 + 0.8802 + 1.0527 + 0.8315) \approx 0.9335$
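These numbers can be reproduced in a few lines of code; a sketch using the matrices from the example above:

```python
import torch

# One-hot labels and softmax outputs from the worked example above
Y = torch.tensor([[1., 0., 0.],
                  [0., 1., 0.],
                  [0., 0., 1.],
                  [0., 0., 1.]])
A = torch.tensor([[0.3792, 0.3104, 0.3104],
                  [0.3072, 0.4147, 0.2780],
                  [0.4263, 0.2248, 0.3490],
                  [0.2668, 0.2978, 0.4354]])

# Per-example cross-entropy: -sum_j y_j * log(a_j)
per_example = -(Y * torch.log(A)).sum(dim=1)
print(per_example)         # tensor([0.9697, 0.8802, 1.0527, 0.8315])
print(per_example.mean())  # tensor(0.9335)
```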
8. Softmax Regression Learning Rule
Gradient Computation via Chain Rule
To update weights using gradient descent, we need to compute $\frac{\partial \mathcal{L}}{\partial w_i}$ for each weight. Using the multivariable chain rule through the computational graph:
\[\frac{\partial \mathcal{L}}{\partial w_{1,2}} = \frac{\partial \mathcal{L}}{\partial a_1} \cdot \frac{\partial a_1}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}} + \frac{\partial \mathcal{L}}{\partial a_2} \cdot \frac{\partial a_2}{\partial z_1} \cdot \frac{\partial z_1}{\partial w_{1,2}}\]
Simplified Gradient Formula
After applying the chain rule and simplifying, the gradient reduces to:
\[\frac{\partial \mathcal{L}}{\partial w_{j,i}} = -(y_j - a_j)x_i\]
Vectorized form: \(\nabla_{\mathbf{W}} \mathcal{L} = -(\mathbf{X}^{\top}(\mathbf{Y} - \mathbf{A}))^{\top}\)
where:
- $\mathbf{W} \in \mathbb{R}^{h \times m}$ (weight matrix)
- $\mathbf{X} \in \mathbb{R}^{n \times m}$ (input features)
- $\mathbf{A} \in \mathbb{R}^{n \times h}$ (predicted probabilities)
- $\mathbf{Y} \in \mathbb{R}^{n \times h}$ (one-hot labels)
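As a quick sanity check on the vectorized formula and the shapes listed above, a sketch with made-up sizes ($n=4$, $m=2$, $h=3$) and random data:

```python
import numpy as np

n, m, h = 4, 2, 3  # 4 examples, 2 features, 3 classes (made-up sizes)
rng = np.random.default_rng(0)

X = rng.normal(size=(n, m))            # inputs,         shape (n, m)
Y = np.eye(h)[rng.integers(0, h, n)]   # one-hot labels, shape (n, h)
Z = rng.normal(size=(n, h))            # made-up net inputs
A = np.exp(Z) / np.exp(Z).sum(axis=1, keepdims=True)  # softmax outputs, shape (n, h)

# Vectorized gradient: -(X^T (Y - A))^T has the same shape as W, i.e. (h, m)
grad_W = -(X.T @ (Y - A)).T
print(grad_W.shape)  # (3, 2)
```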
Gradient Descent Update Rule
The stochastic gradient descent algorithm for softmax regression:
Initialize: $\mathbf{W} := \mathbf{0} \in \mathbb{R}^{h \times m}$, $\mathbf{b} := \mathbf{0} \in \mathbb{R}^h$
For each training epoch:
- For each training example $(x^{[i]}, y^{[i]})$:
  a. Forward pass: compute the predictions
     \[\hat{y}^{[i]} = \sigma_{\text{softmax}}(\mathbf{W}x^{[i]} + \mathbf{b})\]
  b. Compute the gradients:
     \(\nabla_{\mathbf{W}} \mathcal{L} = -(y^{[i]} - \hat{y}^{[i]}) \cdot x^{[i]^{\top}}\)
     \(\nabla_{\mathbf{b}} \mathcal{L} = -(y^{[i]} - \hat{y}^{[i]})\)
  c. Update the parameters:
     \(\mathbf{W} := \mathbf{W} + \eta \times (-\nabla_{\mathbf{W}} \mathcal{L})\)
     \(\mathbf{b} := \mathbf{b} + \eta \times (-\nabla_{\mathbf{b}} \mathcal{L})\)
where $\eta$ is the learning rate.
Note: The gradient has the same elegant form as in logistic regression and ADALINE: the error $(y - \hat{y})$ multiplied by the input $x$.
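Putting the learning rule together, a minimal sketch of the full training loop in NumPy; the function name, learning rate, and epoch count are made-up choices, not the lecture's reference implementation.

```python
import numpy as np

def softmax(z):
    z = z - np.max(z)
    exp_z = np.exp(z)
    return exp_z / exp_z.sum()

def train_softmax_regression(X, Y_onehot, num_classes, lr=0.1, epochs=50):
    """Softmax regression trained with per-example SGD, following the rule above.

    X:        (n, m) feature matrix
    Y_onehot: (n, h) one-hot label matrix
    """
    n, m = X.shape
    h = num_classes
    W = np.zeros((h, m))   # W := 0 in R^{h x m}
    b = np.zeros(h)        # b := 0 in R^h

    for _ in range(epochs):
        for x_i, y_i in zip(X, Y_onehot):
            y_hat = softmax(W @ x_i + b)          # a. forward pass
            grad_W = -np.outer(y_i - y_hat, x_i)  # b. gradients: -(y - y_hat) x^T
            grad_b = -(y_i - y_hat)
            W += lr * (-grad_W)                   # c. parameter updates
            b += lr * (-grad_b)
    return W, b
```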