# Intro: Metrics, Math Basics

Slides: [lec1](https://www.dropbox.com/sh/nszfnsi4jendkch/AADGxuRRC_vSkp7bXCKQRE7Da?dl=0), [lec2](https://www.dropbox.com/sh/6c1jjhqodhutwh1/AACaSynpzebBAoEXklnm8sLaa?dl=0).

## Metrics

### Confusion matrix

![[1_intro 2023-04-05 09.49.00.excalidraw.svg|600]]

### Error Metrics

$\begin{align*} \begin{array}{c|c|c} \text{MAE} & \text{Mean absolute error} & \frac{1}{n} \sum_{i=1}^n\left|y_i-\hat{y}_i\right| \\\hline \text{MSE} & \text{Mean square error} & \frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2\\\hline \text{RMSE} & \text{Root mean square error} & \sqrt{\frac{1}{n} \sum_{i=1}^n\left(y_i-\hat{y}_i\right)^2}\\ \end{array} \end{align*}$

### Ranking measurements

**AUC**: Area under the curve, usually the ROC curve (AUROC) or the precision-recall curve (AUPRC)

**MAP**: Mean average precision

**nDCG**: Normalized discounted cumulative gain

**MRR**: Mean reciprocal rank (focuses on the first correct item in the ranked list)

## Data Statistics

Notation: $n$ is sample size, $N$ is population size

### Mean, median, mode

**Mean**: $\bar{x}=\frac{1}{n}\sum\limits_{i=1}^nx_i$ (sample), or $\mu=\frac{1}{N}\sum\limits x$ (population)
- **Weighted arithmetic mean**: $\bar{x}=\sum\limits_{i=1}^nw_ix_i / \sum\limits_{i=1}^nw_i$
- **Trimmed mean**: Mean computed after chopping off the extreme values

**Median**: Middle value, or the average of the two middle values when $n$ is even

Median for *grouped data* is estimated by *interpolation*

$\begin{align*} \text{median} &= L_1+\Big(\frac{n/2-(\sum\limits\text{freq})_l}{\text{freq}_{\text{median}}}\Big)\text{width} \end{align*}$
- $L_1$: Lower limit of the median interval
- $\text{width}=L_2-L_1$: Median interval width
- $(\sum\limits\text{freq})_l$: Sum of the frequencies of all intervals before the median interval
- $\text{freq}_{\text{median}}$: Frequency of the median interval

**Mode**: Most frequently occurring value
- *Unimodal* (empirical formula): $\text{mean}-\text{mode}\approx 3\times (\text{mean}-\text{median})$
- *Multi-modal*: Bimodal, Trimodal

![[1_intro 2023-04-07 11.41.28.excalidraw.svg|600]]

### Symmetric vs. Skewed

![[Pasted image 20230407113744.png|600]]

### Normal distribution and plots

```start-multi-column
ID: ID_h7c1
Number of Columns: 2
Largest Column: standard
```

$N(\mu, \sigma^2)$
- Mean: $\mu$. Variance: $\sigma^2$. Std: $\sigma$
- Z-score: $z=\frac{x-\mu}{\sigma}$

--- column-end ---

![[Pasted image 20230407114438.png|200]]

=== end-multi-column

```start-multi-column
ID: ID_msdc
Number of Columns: 3
Largest Column: standard
```

**Population vs. Sample** ([article](https://www.simplilearn.com/tutorials/machine-learning-tutorial/population-vs-sample))
- Population: Entire set from which data is drawn
- Sample: Subset of the population
- *Sample variance* (sample std dev is $s$): $s^2=\frac{1}{n-1}\sum_i(x_i-\bar{x})^2=\frac{1}{n-1}\Big[\sum_ix_i^2-\frac{1}{n}(\sum_ix_i)^2\Big]$
- *Population variance* (population std dev is $\sigma$): $\sigma^2=\frac{1}{N}\sum_i(x_i-\mu)^2=\frac{1}{N}\sum_i x_i^2-\mu^2$

--- column-end ---

**Quantiles**, **Outliers**, **Boxplot**
- *Quartiles*: $Q_1,Q_2,Q_3,Q_4 \rightarrow$ 25th, 50th, 75th, 100th percentile
- *Inter-quartile range (IQR)*: $Q_3-Q_1$
- *Boxplot*: Box shows $Q_1, Q_3, IQR, \text{median}$. Whiskers extend to $\min,\max$
- *Five-number summary*: $\min, Q_1, \text{median}, Q_3, \max$
- *Outlier*: (empirical) Value more than $1.5 \times IQR$ above $Q_3$ or below $Q_1$

--- column-end ---

![[Pasted image 20230407120702.png|200]]

=== end-multi-column

**Histogram**: More informative than boxplot

![[1_intro 2023-04-07 12.12.07.excalidraw.svg|300]]

The two histograms have the same boxplot, i.e. the same five-number summary, but different data distributions.

## Linear Algebra Properties

**Euclidean distance as norm**: $\Vert u-v \Vert_2=\sqrt{\sum_{i=1}^d(u_i-v_i)^2}$

**Squared Euclidean distance as quadratic**: $\Vert u-v \Vert_2^2=(u-v)^T(u-v)=\Vert u \Vert_2^2+\Vert v \Vert_2^2-2u^Tv$

**Cauchy-Schwarz Inequality**: $|u^Tv|\leq \Vert u \Vert_2\cdot \Vert v \Vert_2$
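
The two linear algebra properties above can be checked numerically. The following is a minimal sketch in plain Python, not code from the lectures; the helper names `dot`, `norm2`, and `squared_distance` and the vectors `u`, `v` are assumptions chosen for illustration.

```python
import math


def dot(u, v):
    """Inner product u^T v of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def norm2(u):
    """Euclidean (L2) norm of u."""
    return math.sqrt(dot(u, u))


def squared_distance(u, v):
    """Squared Euclidean distance, computed directly from the differences."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


# Arbitrary example vectors (illustrative only, not from the lecture).
u = [1.0, 2.0, 3.0]
v = [4.0, 0.0, -1.0]

# Left side: ||u - v||^2 from the definition.
direct = squared_distance(u, v)
# Right side: ||u||^2 + ||v||^2 - 2 u^T v from the quadratic expansion.
expanded = norm2(u) ** 2 + norm2(v) ** 2 - 2 * dot(u, v)

# The two values agree up to floating-point rounding.
print(math.isclose(direct, expanded))

# Cauchy-Schwarz: |u^T v| never exceeds the product of the norms.
print(abs(dot(u, v)) <= norm2(u) * norm2(v))
```

The expanded form is useful in practice because the norms can be precomputed once per vector, leaving only the inner product to evaluate for each pair.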