Lecture 03

Statistics / linear algebra / calculus review

Today’s Topics:

  1. Tensors in Deep Learning
  2. Tensors and PyTorch
  3. Vectors, Matrices, and Broadcasting
  4. Probability Basics
  5. Estimation Methods
  6. Linear Regression

1. Tensors in Deep Learning

2. Tensors and PyTorch

3. Vectors, Matrices, and Broadcasting

Stacking the $n$ training examples (each with $m$ features) into a design matrix, we can write:

\[\mathbf{X} = \begin{bmatrix} x^{[1]}_1 & x^{[1]}_2 & \cdots & x^{[1]}_m \\ x^{[2]}_1 & x^{[2]}_2 & \cdots & x^{[2]}_m \\ \vdots & \vdots & \ddots & \vdots \\ x^{[n]}_1 & x^{[n]}_2 & \cdots & x^{[n]}_m \end{bmatrix} \in \mathbb{R}^{n \times m}, \quad \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix} \in \mathbb{R}^{m \times 1}, \quad \mathbf{z} = \begin{bmatrix} z^{[1]} \\ z^{[2]} \\ \vdots \\ z^{[n]} \end{bmatrix} \in \mathbb{R}^{n \times 1}\]
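Assuming the intended relation is $\mathbf{z} = \mathbf{Xw}$ (the shapes above are consistent with this), here is a minimal PyTorch sketch with made-up values; the scalar bias `b` is not part of the notation above and is included only to demonstrate broadcasting:

```python
import torch

# n = 4 examples, m = 3 features (illustrative values only)
X = torch.randn(4, 3)      # design matrix, shape (n, m)
w = torch.randn(3, 1)      # weight column vector, shape (m, 1)
b = torch.tensor(0.5)      # scalar bias, used only to demonstrate broadcasting

z = X.matmul(w)            # matrix-vector product, shape (n, 1)
z_b = z + b                # broadcasting: the scalar b is added to every entry of z

print(z.shape, z_b.shape)  # torch.Size([4, 1]) torch.Size([4, 1])
```

Broadcasting virtually expands the smaller operand (here a 0-dimensional tensor) to the shape of the larger one, so the elementwise addition is well defined without copying data or writing an explicit loop.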

4. Probability Basics

5. Estimation Methods

The goal of estimation is to infer unknown parameters $\theta$ from observed data.

Maximum likelihood estimation (MLE) chooses the parameters that maximize the likelihood of the observed data. For a Bernoulli example with observations $x_1, \dots, x_n \in \{0, 1\}$, the log-likelihood is

\[\ell(\theta) = \log L(\theta) = \sum_i \Big[ x_i \log \theta + (1-x_i)\log(1-\theta) \Big],\]

which is maximized at

\[\hat{\theta}_{\text{MLE}} = \frac{k}{n},\]

where $k = \sum_i x_i$ is the number of successes. Maximum a posteriori (MAP) estimation additionally weights the likelihood by a prior $P(\theta)$ over the parameters:

\[\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid \text{data}) = \arg\max_{\theta} \big[ P(\text{data} \mid \theta)\, P(\theta) \big].\]
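As a small numeric illustration (the coin-flip data and the Beta(2, 2) prior are made-up assumptions), the snippet below computes the Bernoulli MLE $k/n$ and the corresponding MAP estimate, using the fact that a Beta$(a, b)$ prior yields a Beta$(k+a,\, n-k+b)$ posterior whose mode is $(k+a-1)/(n+a+b-2)$:

```python
import torch

# Made-up coin-flip data: 1 = heads, 0 = tails
x = torch.tensor([1., 1., 0., 1., 0., 1., 1., 0., 1., 1.])
n = x.numel()
k = x.sum().item()

# Maximum likelihood estimate: fraction of successes
theta_mle = k / n

# MAP estimate under a Beta(a, b) prior (a = b = 2 is an arbitrary illustrative choice);
# it is the mode of the Beta(k + a, n - k + b) posterior
a, b = 2.0, 2.0
theta_map = (k + a - 1) / (n + a + b - 2)

print(f"MLE: {theta_mle:.3f}")  # 0.700
print(f"MAP: {theta_map:.3f}")  # 0.667 -- pulled toward the prior mean 0.5
```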

Regularized maximum likelihood estimation corresponds to MAP estimation, with the regularizer $R(\theta)$ playing the role of the negative log-prior. Formally:

\[\hat{\theta}_{\text{reg}} = \arg\max_{\theta} \Big[ \log L(\theta) - \lambda R(\theta) \Big] \quad\Longleftrightarrow\quad \hat{\theta}_{\text{MAP}}\]
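To see the correspondence, take the logarithm of the MAP objective:

\[\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} \Big[ \log P(\text{data} \mid \theta) + \log P(\theta) \Big],\]

so a prior with $\log P(\theta) = -\lambda R(\theta) + \text{const}$ yields exactly the regularized objective on the left.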

6. Linear Regression

Linear regression models the relationship between inputs (features) and outputs (responses) as a linear function of the features.

Common evaluation metrics are the coefficient of determination, the mean squared error, and the mean absolute error:

\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \qquad MSE = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2, \qquad MAE = \frac{1}{n} \sum_i |y_i - \hat{y}_i|.\]

The ordinary least squares (OLS) estimator minimizes the residual sum of squares and has a closed-form solution:

\[\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \|y - X\beta\|^2, \qquad \hat{\beta}_{\text{OLS}} = (X^TX)^{-1}X^Ty.\]

Ridge regression adds an $L_2$ penalty on the coefficients:

\[\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_2^2.\]

Equivalent MAP interpretation: Gaussian prior $\beta \sim N(0, \sigma^2I)$.
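As a sketch on made-up data (the true coefficients, noise level, and $\lambda$ are illustrative assumptions; the ridge closed form $(X^TX + \lambda I)^{-1}X^Ty$ follows from the penalized objective above), both estimators can be computed from the normal equations:

```python
import torch

torch.manual_seed(0)

# Made-up data: n = 100 examples, m = 3 features, known coefficients for checking
n, m = 100, 3
X = torch.randn(n, m)
beta_true = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ beta_true + 0.1 * torch.randn(n, 1)

# OLS via the normal equations: solve (X^T X) beta = X^T y
beta_ols = torch.linalg.solve(X.T @ X, X.T @ y)

# Ridge: same system with lambda * I added to X^T X, shrinking the coefficients
lam = 1.0
beta_ridge = torch.linalg.solve(X.T @ X + lam * torch.eye(m), X.T @ y)

# Mean squared error of the OLS fit on the training data
mse = ((y - X @ beta_ols) ** 2).mean()

print(beta_ols.flatten())    # close to the true coefficients
print(beta_ridge.flatten())  # slightly shrunk toward zero
print(mse.item())            # roughly the noise variance, ~0.01
```

In practice, `torch.linalg.lstsq` (or gradient-based optimization) is preferred over explicitly forming $X^TX$, which can be numerically ill-conditioned.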

\[\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_1\]

Equivalent MAP interpretation: Laplace prior $\beta \sim \text{Laplace}(0, b)$.
Encourages sparsity (many coefficients shrink to 0).
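To illustrate the sparsity effect, here is a small sketch using scikit-learn's `Lasso` (the data, the number of truly relevant features, and the `alpha` value are made-up assumptions; `alpha` corresponds to $\lambda$ up to scikit-learn's scaling convention):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Made-up data: 20 features, but only the first 3 actually influence y
n, m = 200, 20
X = rng.normal(size=(n, m))
beta_true = np.zeros(m)
beta_true[:3] = [3.0, -2.0, 1.5]
y = X @ beta_true + 0.1 * rng.normal(size=n)

lasso = Lasso(alpha=0.1)
lasso.fit(X, y)

print(np.round(lasso.coef_, 2))                      # most entries are exactly 0
print("nonzero coefficients:", np.count_nonzero(lasso.coef_))
```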