Lecture 03

Statistics / linear algebra / calculus review

Today’s Topics:

1. Tensors in Deep Learning

2. Tensors and PyTorch

3. Vectors, Matrices, and Broadcasting

\(\mathbf{X} = \begin{bmatrix} x^{[1]}_1 & x^{[1]}_2 & \cdots & x^{[1]}_m \\ x^{[2]}_1 & x^{[2]}_2 & \cdots & x^{[2]}_m \\ \vdots & \vdots & \ddots & \vdots \\ x^{[n]}_1 & x^{[n]}_2 & \cdots & x^{[n]}_m \end{bmatrix} \in \mathbb{R}^{n \times m}, \quad \mathbf{w} = \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_m \end{bmatrix} \in \mathbb{R}^{m \times 1}, \quad \mathbf{z} = \begin{bmatrix} z^{[1]} \\ z^{[2]} \\ \vdots \\ z^{[n]} \end{bmatrix} \in \mathbb{R}^{n \times 1}\)
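Reading the shapes above, $\mathbf{z}$ is presumably the matrix–vector product $\mathbf{X}\mathbf{w}$ (one value per training example). A minimal PyTorch sketch of that computation and of broadcasting a bias term; the sizes $n=4$, $m=3$ are arbitrary:

```python
import torch

# Hypothetical shapes: n = 4 training examples, m = 3 features.
n, m = 4, 3
X = torch.randn(n, m)      # design matrix, one row per training example
w = torch.randn(m, 1)      # column vector of weights
z = X @ w                  # matrix-vector product, shape (n, 1)

# Broadcasting: a scalar bias b is expanded across all n rows automatically.
b = torch.tensor(0.5)
z_with_bias = X @ w + b    # still shape (n, 1)
print(z.shape, z_with_bias.shape)
```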

4. Probability Basics

5. Estimation Methods

The goal of estimation is to infer unknown parameters $\theta$ from observed data.

Maximum likelihood estimation (MLE): for a Bernoulli example with observations $x_i \in \{0,1\}$, the log-likelihood is

\[\ell(\theta) = \log L(\theta) = \sum_i \Big[ x_i \log \theta + (1-x_i)\log(1-\theta) \Big].\]

Maximizing it gives

\[\hat{\theta}_{\text{MLE}} = \frac{k}{n},\]

where $k = \sum_i x_i$ is the number of successes in $n$ trials.

Maximum a posteriori (MAP) estimation weights the likelihood by a prior over $\theta$:

\[\hat{\theta}_{\text{MAP}} = \arg\max_{\theta} P(\theta \mid \text{data}) = \arg\max_{\theta} \big[ P(\text{data} \mid \theta)\, P(\theta) \big].\]
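As a quick sanity check, here is a small PyTorch sketch of the Bernoulli case; the coin-flip data and the Beta(2, 2) prior are arbitrary choices for illustration (the prior is not specified above):

```python
import torch

# Hypothetical coin-flip data: k successes in n trials.
x = torch.tensor([1., 0., 1., 1., 0., 1., 1., 0., 1., 1.])
n, k = len(x), x.sum()

# MLE for a Bernoulli parameter is the sample mean k/n.
theta_mle = k / n

# MAP with an (assumed) Beta(a, b) prior: posterior mode (k + a - 1) / (n + a + b - 2).
a, b = 2.0, 2.0
theta_map = (k + a - 1) / (n + a + b - 2)
print(theta_mle.item(), theta_map.item())
```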

Regularized maximum likelihood is equivalent to MAP estimation with a corresponding prior. Formally:

\[\hat{\theta}_{\text{reg}} = \arg\max_{\theta} \Big[ \log L(\theta) - \lambda R(\theta) \Big] \quad\Longleftrightarrow\quad \hat{\theta}_{\text{MAP}}\]
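As one concrete instance (assuming a Gaussian prior $\theta \sim \mathcal{N}(0, \tau^2 I)$, which is not spelled out above), the log-posterior becomes

\[\log P(\theta \mid \text{data}) = \log L(\theta) + \log P(\theta) + \text{const} = \log L(\theta) - \frac{\|\theta\|_2^2}{2\tau^2} + \text{const},\]

so maximizing it matches the penalized objective with $R(\theta) = \|\theta\|_2^2$ and $\lambda = \tfrac{1}{2\tau^2}$.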

6. Linear Regression

Linear regression models the relationship between inputs (features) and outputs (responses).

Common evaluation metrics:

\[R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}, \qquad \text{MSE} = \frac{1}{n} \sum_i (y_i - \hat{y}_i)^2, \qquad \text{MAE} = \frac{1}{n} \sum_i |y_i - \hat{y}_i|\]

Ordinary least squares (OLS) minimizes the squared error and has a closed-form solution:

\[\hat{\beta}_{\text{OLS}} = \arg\min_{\beta} \|y - X\beta\|^2, \qquad \hat{\beta}_{\text{OLS}} = (X^TX)^{-1}X^Ty\]

Ridge regression adds an $L_2$ penalty:

\[\hat{\beta}_{\text{ridge}} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_2^2\]

Equivalent MAP interpretation: ridge corresponds to a Gaussian prior $\beta \sim N\big(0, \frac{\sigma^2}{\lambda} I\big)$ on the coefficients.
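A minimal sketch of the closed-form OLS fit and the three metrics above, on synthetic data (the sizes, noise level, and true coefficients are made up for illustration):

```python
import torch

torch.manual_seed(0)
n, p = 100, 3
X = torch.randn(n, p)
beta_true = torch.tensor([[2.0], [-1.0], [0.5]])
y = X @ beta_true + 0.1 * torch.randn(n, 1)

# OLS: beta_hat = (X^T X)^{-1} X^T y, computed by solving a linear system
# rather than forming the explicit inverse.
beta_hat = torch.linalg.solve(X.T @ X, X.T @ y)
y_hat = X @ beta_hat

mse = torch.mean((y - y_hat) ** 2)
mae = torch.mean(torch.abs(y - y_hat))
ss_res = torch.sum((y - y_hat) ** 2)
ss_tot = torch.sum((y - y.mean()) ** 2)
r2 = 1 - ss_res / ss_tot
print(beta_hat.squeeze(), mse.item(), mae.item(), r2.item())
```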

Lasso regression instead uses an $L_1$ penalty:

\[\hat{\beta}_{\text{lasso}} = \arg\min_{\beta} \|y - X\beta\|^2 + \lambda \|\beta\|_1\]

Equivalent MAP interpretation: Laplace prior $\beta \sim \text{Laplace}(0, \frac{\sigma}{\lambda})$.
The $L_1$ penalty encourages sparsity (many coefficients are driven exactly to 0).
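To see the ridge/lasso contrast numerically, a sketch is below: ridge uses the closed form $(X^TX + \lambda I)^{-1}X^Ty$, while lasso has no closed form, so a few proximal-gradient (ISTA) steps with soft-thresholding are used instead. The data, $\lambda$, and step size are arbitrary choices for illustration.

```python
import torch

torch.manual_seed(0)
n, p = 100, 5
X = torch.randn(n, p)
beta_true = torch.tensor([3.0, 0.0, 0.0, -2.0, 0.0])   # sparse ground truth
y = X @ beta_true + 0.1 * torch.randn(n)

lam = 1.0

# Ridge: closed form (X^T X + lambda I)^{-1} X^T y
beta_ridge = torch.linalg.solve(X.T @ X + lam * torch.eye(p), X.T @ y)

# Lasso via ISTA: gradient step on 0.5*||y - X b||^2, then soft-threshold each
# coefficient. (This objective differs from the one above only by rescaling lambda.)
beta_lasso = torch.zeros(p)
step = 1.0 / torch.linalg.matrix_norm(X.T @ X, ord=2)   # 1 / largest eigenvalue
for _ in range(500):
    grad = X.T @ (X @ beta_lasso - y)
    beta_lasso = beta_lasso - step * grad
    beta_lasso = torch.sign(beta_lasso) * torch.clamp(beta_lasso.abs() - step * lam, min=0.0)

print(beta_ridge)   # all coefficients shrunk, none exactly zero
print(beta_lasso)   # several coefficients driven exactly to zero (sparsity)
```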