# On-line Statistics

Slides: [[lec4_stats.pdf|stats]], [[lec5_online_learning.pdf|online stats]]

## 1 Statistics

### 1.1 Mean, median, var

*Empirical distribution*: The empirical distribution $p_n$ puts mass $1/n$ at each of $\set{x_1,x_2,\cdots,x_n}$

*Empirical mean*: $\frac{1}{n}\sum\limits_{i=1}^nx_i$. It is an estimate of the actual mean if $x_i \sim p$

***Mean*** (expected value) of random variable $X$:
$\mu=\mathbb{E}(X)=\begin{cases}\sum_x x \operatorname{Pr}(X=x) & \text { if } X \text { is discrete } \\ \int x p(x) d x & \text { if } X \text { is continuous with density } p(x)\end{cases}$

***Median***: Any value $m$ such that $\Pr(X\geq m)\geq \frac{1}{2}$ and $\Pr(X\leq m)\geq \frac{1}{2}$
- Pro: Stable to outliers
- Con: Harder to compute (selection takes $O(n)$, and there is no simple one-pass update; see [[#2.3 Online median|#2.3]])

***Variance***: The expected squared distance from the mean.
$\operatorname{var}(X)=\mathbb{E}(X-\mu)^2=\mathbb{E}\left(X^2\right)-\mu^2$

![[Pasted image 20230413130558.png|300]]

***Standard deviation***: $\operatorname{std}(X)=\sqrt{\operatorname{var}(X)}$
- Note: $\mathbb{E}[Z^2]\geq(\mathbb{E}[Z])^2$ for any random variable $Z$
- This is because $\text{var}(Z)=\mathbb{E}[Z^2]-(\mathbb{E}[Z])^2\geq 0$
- In particular $(\mathbb{E}[|X-\mu|])^2\leq \mathbb{E}[|X-\mu|^2]=\text{var}(X)$: the mean absolute deviation is at most the standard deviation

### 1.2 Independence and correlation

Random vars $X,Y$ are ***independent*** if, for all $x,y$,
$\operatorname{Pr}(X=x, Y=y)=\operatorname{Pr}(X=x) \operatorname{Pr}(Y=y)$

#example $X,Y$ are independent: every cell equals the product of its row and column marginals (e.g. $0.4 = 0.8 \times 0.5$)
$\begin{align*} \begin{array}{cc|ccc} & & & Y & \\ & & -1 & 0 & 1 \\ \hline & -1 & 0.4 & 0.16 & 0.24 \\ X & 0 & 0.05 & 0.02 & 0.03 \\ & 1 & 0.05 & 0.02 & 0.03 \end{array} \end{align*}$

*Testing for independence is hard*
- Even given samples $\left(x_1, y_1\right), \ldots,\left(x_n, y_n\right) \in \mathbb{R}^2$ from a bivariate distribution, it is already hard to test whether $X$ and $Y$ are independent

*Positive correlation*: $\mathbb{E}[XY]>\mathbb{E}[X] \mathbb{E}[Y]$
*Negative correlation*: $\mathbb{E}[X Y]<\mathbb{E}[X] \mathbb{E}[Y]$
*No correlation*: $\mathbb{E}[X Y]=\mathbb{E}[X] \mathbb{E}[Y]$

***Correlation coefficient***:
![[2_geometry_of_dataspace 2023-04-13 13.21.00.excalidraw.svg|600]]

***Covariance***: For fixed standard deviations it is maximized when $X = Y$, in which case it equals $\operatorname{var}(X)$; in general $|\operatorname{cov}(X,Y)|\leq\operatorname{std}(X)\operatorname{std}(Y)$, but covariance is not bounded by any absolute constant.
$\begin{aligned} \operatorname{cov}(X, Y) & =\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \\ & =\mathbb{E}[X Y]-\mathbb{E}[X] \mathbb{E}[Y] \end{aligned}$

***Correlation***: Always in the range $[-1,1]$
$\operatorname{corr}(X, Y)=\frac{\operatorname{cov}(X, Y)}{\operatorname{std}(X) \operatorname{std}(Y)}$

Cauchy–Schwarz:
- $\mathbb{E}[uv]\leq \sqrt{\mathbb{E}[u^2]\cdot \mathbb{E}[v^2]}$
- $\operatorname{cov}(X,Y)=\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]\leq \sqrt{\mathbb{E}[(X-\mu_X)^2]\cdot \mathbb{E}[(Y-\mu_Y)^2]}=\operatorname{std}(X)\operatorname{std}(Y)$

Independent $\not\equiv$ Uncorrelated
- Independent $\rightarrow$ Uncorrelated
- Uncorrelated $\not\rightarrow$ Independent

#example Uncorrelated but dependent $X$ and $Y$: here $\mathbb{E}[XY]=\mathbb{E}[X]\,\mathbb{E}[Y]=0$, yet $\Pr(X=1,Y=1)=0\neq \Pr(X=1)\Pr(Y=1)$

| $X \backslash Y$ | -1 | 0 | 1 |
| --- | --- | --- | --- |
| -1 | 0 | 1/4 | 0 |
| 0 | 1/4 | 0 | 1/4 |
| 1 | 0 | 1/4 | 0 |

### 1.3 Linear function of RV
1. If you double a set of numbers: the mean doubles and the variance quadruples
2. If you increase every number by 1: the mean increases by 1 and the variance is unchanged

In general, $\mathbb{E}[aX+b]=a\,\mathbb{E}[X]+b$ (linearity of expectation) and $\operatorname{var}(aX+b)=a^2\operatorname{var}(X)$.

## 2 On-line Stats Computation

Big picture: data arrives as an endless stream, and we want space complexity $O(\text{model size})$, not $O(\text{data size})$

![[Pasted image 20230418131408.png|300]]

### 2.1 Online mean
1. Initialize $\mu=0$
2. For $t=1,2,3,\cdots$: Get $x_t$ and update $\mu$
$\begin{align*} \mu\leftarrow \left(1- \frac{1}{t}\right)\mu+ \frac{1}{t} x_t \end{align*}$

*Proof*.
$\begin{align*} \text{mean}(\set{x_1,\cdots,x_t})= \frac{1}{t}\sum\limits_{i=1}^t x_i=\frac{1}{t}\sum\limits_{i=1}^{t-1}x_i+\frac{1}{t}x_t = \frac{t-1}{t}\cdot\text{mean}(\set{x_1,\cdots,x_{t-1}})+\frac{1}{t}x_t \end{align*}$
(A Python sketch of this update appears at the end of [[#2.2 Maintain $k$ random samples|#2.2]].)

### 2.2 Maintain $k$ random samples

**Method 1**: Maintain 1 random sample
1. Initialize random sample $s=x_1$
2. For $t=1,2,\cdots$: Get $x_t$, update $s$:
$\begin{align*} s\leftarrow x_t \quad\text{with } \Pr= 1/t \end{align*}$

*Proof*. Pick any time $t$ and any $x_j$, $j\leq t$; we want to show $\Pr(s=x_j)=1/t$.
$\begin{align*} \Pr(s=x_j)&= \Pr(\text{picked } x_j \text{ at time } j)\cdot \Pr(\text{picked nothing after time } j)\\ &= \frac{1}{j}\cdot\frac{j}{j+1}\cdot\frac{j+1}{j+2}\cdot\;\cdots\;\cdot\frac{t-1}{t}=\frac{1}{t} \end{align*}$
(the product telescopes: each denominator cancels the next numerator)

**Method 2** (preferred): Maintain $k$ random samples (w/ replacement)
1. Initialize $k$ random samples $s_1,\cdots,s_k=x_1$
2. For $t=1,2,\cdots$: Get $x_t$
	- for $j=1,\cdots,k$ perform *method 1* independently on $s_j$

*Proof*. This runs $k$ independent parallel copies of method 1.

**Method 3**: Maintain $k$ random samples (w/o replacement); see the sketch after this list
1. $\left(s_1, s_2, \ldots, s_k\right)=\left(x_1, x_2, \ldots, x_k\right)$
2. For $t=k+1, k+2, \cdots$: Get $x_t$, update:
	- With $\Pr= \frac{k}{t}$, pick $j\in\set{1,\cdots,k}$ uniformly at random and set $s_j=x_t$

*Proof*. (HW)
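Not from the slides: a minimal Python sketch of the running-mean update from [[#2.1 Online mean|#2.1]], assuming a generic iterable stream; the name `online_mean` is my own.

```python
def online_mean(stream):
    """One-pass mean via mu <- (1 - 1/t) * mu + (1/t) * x_t."""
    mu = 0.0
    for t, x in enumerate(stream, start=1):
        mu += (x - mu) / t  # equivalent to (1 - 1/t)*mu + x/t
    return mu
```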
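And a sketch of method 3 (reservoir sampling without replacement), again with hypothetical names:

```python
import random

def reservoir_sample(stream, k):
    """Method 3: maintain k uniform samples w/o replacement in O(k) space."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(x)                 # (s_1,...,s_k) = (x_1,...,x_k)
        elif random.random() < k / t:           # with probability k/t ...
            reservoir[random.randrange(k)] = x  # ... overwrite a random slot s_j
    return reservoir
```

E.g. `reservoir_sample(range(10**6), 5)` keeps five uniform draws while touching each element only once.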
### 2.3 Online median
- Maintain a random sample of $k$ elements with replacement (method 2 in [[#2.2 Maintain $k$ random samples|#2.2]])
- At any time $t$: let $m_t$ be the median of these $k$ elements

**Claim**: Pick any $0<\delta, \epsilon<1$. If $k \geq \frac{1}{2 \epsilon^2} \ln \frac{2}{\delta}$, then for any time $t$, with probability at least $1-\delta$, the value $m_t$ is a $(1 / 2 \pm \epsilon)$-fractile of $\left\{x_1, \ldots, x_t\right\}$, i.e., it lies between the $(1/2-\epsilon)$- and $(1/2+\epsilon)$-quantiles.

### 2.4 The heavy hitter problem

**Problem:** Given a stream of data $x_1, x_2, \ldots, x_m \in \mathcal{X}$ (where $m$ is unknown in advance) and a parameter $\epsilon \in(0,1)$, keep track of the *heavy hitters* (as well as their frequencies):
$\begin{align*} \text {elements with frequency } \geq \epsilon m \end{align*}$

*Space complexity* requirement, where $n=|\mathcal{X}|$: there can be at most $1/\epsilon$ heavy hitters, so the target space is
$\begin{align*} \text{Total space} \approx \frac{1}{\epsilon} (\log n+ \log m) \text{ bits} \end{align*}$
($\log n$ bits to name each element, $\log m$ bits for its counter).

The ***Misra–Gries algorithm*** runs with space complexity $O(\frac{1}{\epsilon} (\log n+ \log m))$; a Python sketch follows at the end of this section.
![[Pasted image 20230418130158.png|400]]
- Think of $V[x]$ as holding (approximately) the number of times item $x$ has been seen
- Once in a while, $k+1$ of these values are decremented (the $k$ stored counters plus the new item's implicit count of 1)
- By time $t$, at most $t$ counts have been added, so the number of such decrement-steps is at most $t/(k+1)$

**Claim**: Suppose that the number of times $x$ appears in $x_1, \ldots, x_t$ is $\text{freq}_t(x)$. The following holds at all times $t$, for every $x \in \mathcal{X}$ (take $V[x]=0$ for any $x \notin T$):
$\begin{align*} \operatorname{freq}_t(x)-t /(k+1) \leq V[x] \leq \operatorname{freq}_t(x) \end{align*}$

*Space complexity*. $O(\frac{1}{\epsilon}(\log n + \log m))$, taking $k \approx 1/\epsilon$.
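The algorithm itself is on the slide above; here is a minimal Python sketch consistent with the bullets (a table $T$ of at most $k$ counters $V[x]$; when a new element arrives and all slots are taken, $k+1$ counts are decremented). The name `misra_gries` and the dict `V` mirror the notes' notation, but the code is my reconstruction, not the slide's.

```python
def misra_gries(stream, k):
    """Track candidate heavy hitters with at most k counters.
    Invariant at time t: freq_t(x) - t/(k+1) <= V.get(x, 0) <= freq_t(x)."""
    V = {}  # the table T, mapping element x -> counter V[x]
    for x in stream:
        if x in V:
            V[x] += 1   # x already tracked: bump its counter
        elif len(V) < k:
            V[x] = 1    # spare slot: start a new counter for x
        else:
            # table full: decrement all k counters (plus x's implicit 1),
            # dropping any counter that reaches zero
            for y in list(V):
                V[y] -= 1
                if V[y] == 0:
                    del V[y]
    return V
```

With $k \geq 1/\epsilon$, the undercount $t/(k+1)$ stays below $\epsilon t$, so every element of frequency $\geq \epsilon m$ still has a positive counter at the end; the counters themselves (or a second pass) give approximate frequencies.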