# Intro: Metrics, Math Basics
Slides: [lec1](https://www.dropbox.com/sh/nszfnsi4jendkch/AADGxuRRC_vSkp7bXCKQRE7Da?dl=0), [lec2](https://www.dropbox.com/sh/6c1jjhqodhutwh1/AACaSynpzebBAoEXklnm8sLaa?dl=0).
## Metrics
### Confusion matrix
![[1_intro 2023-04-05 09.49.00.excalidraw.svg|600]]
### Error Metrics
$\begin{align*}
\begin{array}{c|c|c}
\text{MAE} & \text{Mean absolute error} & \frac{1}{n} \sum_{i=1}^n\left|y_i-\hat{y}_i\right| \\\hline
\text{MSE} & \text{Mean square error} & \frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2\\\hline
\text{RMSE} & \text{Root mean square error} & \sqrt{\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2}\\
\end{array}
\end{align*}$
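A minimal sketch of the three error metrics in plain Python, assuming `y_true` and `y_pred` are equal-length lists of numbers. The function names and sample values are illustrative, not from the lecture.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average of |y_i - y_hat_i|."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)

def mse(y_true, y_pred):
    """Mean square error: average of (y_i - y_hat_i)^2."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean square error: square root of the MSE."""
    return math.sqrt(mse(y_true, y_pred))

# Illustrative data (made up for this example).
y_true = [3.0, 5.0, 2.0, 7.0]
y_pred = [2.0, 5.0, 4.0, 8.0]

print(mae(y_true, y_pred))   # 1.0  (absolute errors 1, 0, 2, 1)
print(mse(y_true, y_pred))   # 1.5  (squared errors 1, 0, 4, 1)
print(rmse(y_true, y_pred))  # sqrt(1.5)
```

RMSE is in the same units as $y$, while MSE penalizes large errors more strongly than MAE because the errors are squared.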
### Ranking measurements
**AUC**: Area under the curve, usually the ROC curve (the area under the precision-recall curve is often written AUPRC or PR-AUC)
**MAP**: Mean average precision
**nDCG**: Normalized discounted cumulative gain
**MRR**: Mean reciprocal rank (focuses on the first correct item in the ranked list)
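A rough sketch of two of these measures, assuming each ranked list is given as relevance scores in ranked order (position 1 first). The DCG here uses the linear-gain variant $\sum_k rel_k/\log_2(k+1)$; other variants (e.g. gain $2^{rel}-1$) exist, and the notes do not say which one the lecture uses.

```python
import math

def reciprocal_rank(relevance):
    """1 / rank of the first relevant item (relevance > 0), or 0 if none."""
    for rank, rel in enumerate(relevance, start=1):
        if rel > 0:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(queries):
    """Average reciprocal rank over a list of ranked lists."""
    return sum(reciprocal_rank(q) for q in queries) / len(queries)

def dcg(relevance):
    """Discounted cumulative gain with linear gain (an assumed variant)."""
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevance, start=1))

def ndcg(relevance):
    """DCG divided by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevance, reverse=True))
    return dcg(relevance) / ideal if ideal > 0 else 0.0

# First relevant item at rank 2, 1, and 3 respectively.
queries = [[0, 1, 0], [1, 0, 0], [0, 0, 1]]
print(mean_reciprocal_rank(queries))  # (1/2 + 1 + 1/3) / 3

print(ndcg([3, 2, 1]))  # already ideally ordered, so 1.0
print(ndcg([1, 2, 3]))  # worst ordering of the same items, below 1.0
```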
## Data Statistics
Notation: $n$ is sample size, $N$ is population size
### Mean, median, mode
**Mean**: $\bar{x}=\frac{1}{n}\sum\limits_{i=1}^nx_i$ (sample), or $\mu=\frac{1}{N}\sum\limits_{i=1}^Nx_i$ (population)
- **Weighted arithmetic mean**: $\bar{x}=\sum\limits_{i=1}^nw_ix_i / \sum\limits_{i=1}^nw_i$
- **Trimmed mean**: Chopping extreme values
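The two mean variants above can be sketched as follows. The trimmed mean here drops a fixed count `k` of values from each end; trimming by a percentage is an equally common convention, and the notes do not specify which is intended.

```python
def weighted_mean(xs, ws):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(xs, ws)) / sum(ws)

def trimmed_mean(xs, k):
    """Mean after dropping the k smallest and k largest values.

    Assumes 2 * k < len(xs) so that at least one value remains.
    """
    kept = sorted(xs)[k:len(xs) - k]
    return sum(kept) / len(kept)

print(weighted_mean([1, 2, 3], [1, 1, 2]))  # (1 + 2 + 6) / 4 = 2.25

values = [1, 2, 3, 4, 100]          # 100 is an extreme value
print(sum(values) / len(values))    # plain mean: 22.0
print(trimmed_mean(values, 1))      # mean of [2, 3, 4]: 3.0
```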
**Median**: Middle value or avg. of the two middle values
Median for *grouped data* is estimated by *interpolation*
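$\begin{align*}
\text{median} &= L_1+\Big(\frac{n/2-(\sum\limits\text{freq})_l}{\text{freq}_{\text{median}}}\Big)\text{width}
\end{align*}$
- $L_1$: Lower limit of the median interval
- $\text{width}=L_2-L_1$: Width of the median interval ($L_2$ is its upper limit)
- $(\sum\limits\text{freq})_l$: Sum of the frequencies of all intervals below the median interval
- $\text{freq}_{\text{median}}$: Frequency of the median interval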
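The interpolation formula above can be sketched in code. The representation of grouped data as `(lower, upper)` interval pairs with a parallel list of frequencies is an assumption made for this example.

```python
def grouped_median(intervals, freqs):
    """Estimate the median of grouped data by linear interpolation.

    intervals: list of (lower, upper) limits, sorted ascending.
    freqs:     frequency of each interval, in the same order.
    """
    n = sum(freqs)
    if n == 0:
        raise ValueError("no data")
    cum = 0  # sum of frequencies below the current interval
    for (lower, upper), f in zip(intervals, freqs):
        if cum + f >= n / 2:
            # This is the median interval: L1 + ((n/2 - cum) / f) * width
            return lower + (n / 2 - cum) / f * (upper - lower)
        cum += f

# Illustrative data: n = 10, so n/2 = 5 falls in the interval [10, 20).
intervals = [(0, 10), (10, 20), (20, 30)]
freqs = [2, 5, 3]
print(grouped_median(intervals, freqs))  # 10 + (5 - 2) / 5 * 10, about 16
```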
**Mode**: Most frequently occurring data
- *Unimodal* (empirical formula, for moderately skewed data): $\text{mean}-\text{mode}\approx 3\times (\text{mean}-\text{median})$
- *Multi-modal*: Bimodal, Trimodal
![[1_intro 2023-04-07 11.41.28.excalidraw.svg|600]]
### Symmetric vs. Skewed
![[Pasted image 20230407113744.png|600]]
### Normal distribution and plots
```start-multi-column
ID: ID_h7c1
Number of Columns: 2
Largest Column: standard
```
$N(\mu, \sigma^2)$
- Mean: $\mu$. Variance: $\sigma^2$. Std: $\sigma$
- Z-score: $z=\frac{x-\mu}{\sigma}$
--- column-end ---
![[Pasted image 20230407114438.png|200]]
=== end-multi-column
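A small sketch of the z-score, $z=\frac{x-\mu}{\sigma}$, assuming the data list is treated as the whole population (so $\sigma$ is the population standard deviation). The sample data is made up for illustration.

```python
import statistics

def z_scores(xs):
    """Standardize each value: (x - mu) / sigma.

    Treats xs as the full population; raises ZeroDivisionError
    if all values are identical (sigma = 0).
    """
    mu = statistics.mean(xs)
    sigma = statistics.pstdev(xs)  # population standard deviation
    return [(x - mu) / sigma for x in xs]

data = [2, 4, 4, 4, 5, 5, 7, 9]   # mean 5, population std 2
zs = z_scores(data)
print(zs[0])    # (2 - 5) / 2 = -1.5
print(zs[-1])   # (9 - 5) / 2 = 2.0
```

After standardization the values have mean 0 and standard deviation 1, which makes data measured on different scales comparable.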
```start-multi-column
ID: ID_msdc
Number of Columns: 3
Largest Column: standard
```
**Population vs. Sample** ([article](https://www.simplilearn.com/tutorials/machine-learning-tutorial/population-vs-sample))
- Population: Entire set, from which data is drawn.
- Sample: Subset of population
- *Sample variance* (sample std dev is $s$): $s^2=\frac{1}{n-1}\sum_i(x_i-\bar{x})^2=\frac{1}{n-1}\Big[\sum_ix_i^2-\frac{1}{n}(\sum_ix_i)^2\Big]$
- *Population variance* (population std dev is $\sigma$): $\sigma^2=\frac{1}{N}\sum_i(x_i-\mu)^2=\frac{1}{N}\sum_i x_i^2-\mu^2$
--- column-end ---
**Quantiles**, **Outliers**, **Boxplot**
- *Quartiles*: $Q_1,Q_2,Q_3,Q_4 \rightarrow$ 25th, 50th, 75th, 100th percentile
- *Inter-quartile range (IQR)*: $Q_3-Q_1$
- *Boxplot*: Box spans $Q_1$ to $Q_3$ (its length is the IQR) with a line at the median. Whiskers extend to $\min,\max$
- *Five-number summary*: $\min, Q_1, \text{median}, Q_3, \max$
- *Outlier*: (empirical) Value more than $1.5 \times IQR$ above $Q_3$ or below $Q_1$
--- column-end ---
![[Pasted image 20230407120702.png|200]]
=== end-multi-column
**Histogram**: More informative than boxplot
![[1_intro 2023-04-07 12.12.07.excalidraw.svg|300]]
The two histograms have the same boxplot, i.e. same 5-number summary. But they have different data distributions.
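The same point can be checked numerically with a toy example: two made-up datasets (not the ones in the figure) that share a five-number summary but fall into histogram bins differently. The quartile method (`"inclusive"`) and the bin width of 4 are arbitrary choices for this sketch.

```python
import statistics
from collections import Counter

def five_number_summary(xs):
    """(min, Q1, median, Q3, max) using the 'inclusive' quartile method."""
    q1, q2, q3 = statistics.quantiles(xs, n=4, method="inclusive")
    return (min(xs), q1, q2, q3, max(xs))

def histogram(xs, width):
    """Count of values per bin, keyed by bin index (x // width)."""
    return dict(Counter(x // width for x in xs))

a = [0, 2, 2, 5, 8, 8, 10]
b = [0, 1, 3, 5, 7, 9, 10]

print(five_number_summary(a))  # (0, 2.0, 5.0, 8.0, 10)
print(five_number_summary(b))  # same summary as a
print(histogram(a, 4))         # {0: 3, 1: 1, 2: 3}
print(histogram(b, 4))         # {0: 3, 1: 2, 2: 2}
```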
## Linear Algebra Properties
**Euclidean distance as norm**: $\Vert u-v \Vert_2=\sqrt{\sum_{i=1}^d(u_i-v_i)^2}$
**Squared Euclidean distance as quadratic**: $\Vert u-v \Vert_2^2=(u-v)^T(u-v)=\Vert u \Vert_2^2+\Vert v \Vert_2^2-2u^Tv$
**Cauchy-Schwarz Inequality**: $|u^Tv|\leq \Vert u \Vert_2\cdot \Vert v \Vert_2$
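A quick numeric check of these properties in plain Python, with vectors as lists of floats. The helper names and the example vectors are illustrative only.

```python
import math

def dot(u, v):
    """Inner product u^T v."""
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    """Euclidean (L2) norm."""
    return math.sqrt(dot(u, u))

def euclidean(u, v):
    """Euclidean distance ||u - v||_2."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, -1.0]

# Squared distance expanded as a quadratic: ||u||^2 + ||v||^2 - 2 u^T v
lhs = euclidean(u, v) ** 2                  # 9 + 4 + 16 = 29 (up to rounding)
rhs = dot(u, u) + dot(v, v) - 2 * dot(u, v)  # 14 + 17 - 2 * 1 = 29
print(lhs, rhs)

# Cauchy-Schwarz: |u^T v| <= ||u|| * ||v||
print(abs(dot(u, v)) <= norm(u) * norm(v))  # True
```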