# On-line Statistics
Slides:[[lec4_stats.pdf|stats]], [[lec5_online_learning.pdf|online stats]]
## 1 Statistics
### 1.1 Mean, median, var
*Empirical distribution*: The empirical distribution $p_n$ puts mass $1/n$ at each of $\set{x_1,x_2,\cdots,x_n}$
*Empirical mean*: $\frac{1}{n}\sum\limits_{i=1}^nx_i$. An estimate of the actual mean if $x_i \sim p$
***Mean*** (expected value) of random variable $X$:
$\mu=\mathbb{E}(x)=\begin{cases}\sum_x x \operatorname{Pr}(X=x) & \text { if } X \text { is discrete } \\ \int x p(x) d x & \text { if } X \text { is continuous with density } p(x)\end{cases}$
***Median***: Any value $m$ such that $Pr(X\geq m)\geq \frac{1}{2}$ and $Pr(X\leq m)\geq \frac{1}{2}$
- Pro: Stable to outliers
- Con: Harder to compute than the mean (selection takes $O(n)$ time, and there is no simple exact online update)
***Variance***: The expected (squared) distance from mean. $\operatorname{var}(X)=\mathbb{E}(X-\mu)^2=\mathbb{E}\left(X^2\right)-\mu^2$.
![[Pasted image 20230413130558.png|300]]
***Standard deviation***: $\operatorname{std}(X)=\sqrt{\operatorname{var}(X)}$
- Note: $\mathbb{E}(z^2)\geq(\mathbb{E}[z])^2$
- $\text{var}(z)=\mathbb{E}[z^2]-(\mathbb{E}[z])^2$
- $(\mathbb{E}[|x-\mu|])^2\leq \mathbb{E}[|x-\mu|^2]=\text{var}(x)$
### 1.2 Independence and correlation
Random vars X,Y are ***independent*** if $\operatorname{Pr}(\boldsymbol{X}=x, Y=y)=\operatorname{Pr}(\boldsymbol{X}=x) \operatorname{Pr}(Y=y)$
#example X,Y are dependent:
$\begin{align*}
\begin{array}{cc|ccc}
& & &Y \\
& & -1 & 0 & 1 \\
\hline & -1 & 0.4 & 0.16 & 0.24 \\
X & 0 & 0.05 & 0.02 & 0.03 \\
& 1 & 0.05 & 0.02 & 0.03
\end{array}
\end{align*}$
*Testing for independence is hard*
- Suppose we observe samples from a bivariate distribution: $\left(x_1, y_1\right), \ldots,\left(x_n, y_n\right) \in \mathbb{R}^2$
- Even with such samples in hand, testing whether $X$ and $Y$ are independent is hard
*Positive correlation*: $\mathbb{E}[XY]>\mathbb{E}[X] \mathbb{E}[Y]$
*Negative correlation*: $\mathbb{E}[X Y]<\mathbb{E}[X] \mathbb{E}[Y]$
*No correlation*: $\mathbb{E}[X Y]=\mathbb{E}[X] \mathbb{E}[Y]$
***Correlation coefficient***:
![[2_geometry_of_dataspace 2023-04-13 13.21.00.excalidraw.svg|600]]
***Covariance***: Maximized when $Y = X$, in which case it equals $\operatorname{var}(X)$. In general it is at most $\operatorname{std}(X)\operatorname{std}(Y)$; unlike correlation, it is not confined to a fixed range.
$\begin{aligned}
\operatorname{cov}(X, Y) & =\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])] \\
& =\mathbb{E}[X Y]-\mathbb{E}[X] \mathbb{E}[Y]
\end{aligned}$
***Correlation***: Always in the range $[-1,1]$
$\operatorname{corr}(X, Y)=\frac{\operatorname{cov}(X, Y)}{\operatorname{std}(X) \operatorname{std}(Y)}$
Cauchy-Schwarz:
- $\mathbb{E}[uv]\leq \sqrt{\mathbb{E}[u^2]\cdot \mathbb{E}[v^2]}$
- $\operatorname{cov}(X,Y)=\mathbb{E}[(X-\mu_X)(Y-\mu_Y)]\leq \sqrt{\mathbb{E}[(X-\mu_X)^2]\cdot \mathbb{E}[(Y-\mu_Y)^2]}=\operatorname{std}(X)\operatorname{std}(Y)$
Independent $\not\equiv$ Uncorrelated
- Independent $\rightarrow$ Uncorrelated
- Uncorrelated $\not \rightarrow$ Independent
#example Uncorrelated but dependent X and Y
| $X \backslash Y$ | -1 | 0 | 1 |
| --- | --- | --- | --- |
| -1 | | 1/4 | |
| 0 | 1/4 | | 1/4 |
| 1 | | 1/4 | |
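A quick numerical check of the example above (the `pmf` dictionary below just encodes the table):

```python
# X and Y are uncorrelated (E[XY] = E[X]E[Y] = 0) yet clearly dependent.
pmf = {(-1, 0): 0.25, (0, -1): 0.25, (0, 1): 0.25, (1, 0): 0.25}

E_X  = sum(x * p for (x, y), p in pmf.items())
E_Y  = sum(y * p for (x, y), p in pmf.items())
E_XY = sum(x * y * p for (x, y), p in pmf.items())
assert E_XY == E_X * E_Y == 0          # uncorrelated

# Dependence: Pr(X=1, Y=1) = 0, but Pr(X=1) * Pr(Y=1) = 1/16 > 0.
p_x1 = sum(p for (x, y), p in pmf.items() if x == 1)
p_y1 = sum(p for (x, y), p in pmf.items() if y == 1)
assert pmf.get((1, 1), 0) != p_x1 * p_y1
```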
### 1.3 Linear function of RV
1. If you double a set of numbers: mean $\times 2$, var $\times 4$
2. If you increase a set of numbers by 1: mean $+1$, var unchanged
In general, by linearity of expectation, $\mathbb{E}[aX+b]=a\,\mathbb{E}[X]+b$ and $\operatorname{var}(aX+b)=a^2\operatorname{var}(X)$.
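The two rules above can be checked on a tiny dataset (using Python's `statistics` module, with population variance):

```python
import statistics

data = [1.0, 2.0, 3.0, 4.0]
doubled = [2 * x for x in data]
shifted = [x + 1 for x in data]

# Doubling: mean doubles, variance quadruples.
assert statistics.mean(doubled) == 2 * statistics.mean(data)
assert statistics.pvariance(doubled) == 4 * statistics.pvariance(data)

# Adding 1: mean increases by 1, variance is unchanged.
assert statistics.mean(shifted) == statistics.mean(data) + 1
assert statistics.pvariance(shifted) == statistics.pvariance(data)
```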
## 2 On-line Stats Computation
Big picture: Endless incoming data, want $O(\text{model size})$ space complexity
![[Pasted image 20230418131408.png|300]]
### 2.1 Online mean
1. Initialize $\mu=0$
2. For $t=1,2,3,\cdots$: Get $x_t$ and update $\mu$
$\begin{align*}
\mu\leftarrow (1- \frac{1}{t})\mu+ \frac{1}{t} x_t
\end{align*}$
*Proof*.
$\begin{align*}
\text{mean}(\set{x_1,\cdots,x_t})= \frac{1}{t}\sum\limits_{i=1}^t x_i=\frac{1}{t}\sum\limits_{i=1}^{t-1}x_i+\frac{1}{t}x_t\\
= \frac{(t-1)}{t}\cdot\text{mean}(\set{x_1,\cdots,x_{t-1}})+\frac{1}{t}x_t
\end{align*}$
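A minimal Python sketch of the update above, using $O(1)$ space regardless of stream length (the function name is my own):

```python
def online_mean(stream):
    """Running mean via mu <- (1 - 1/t) * mu + (1/t) * x_t."""
    mu = 0.0
    for t, x in enumerate(stream, start=1):
        mu = (1 - 1 / t) * mu + x / t
    return mu

# online_mean([1, 2, 3, 4]) -> 2.5
```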
### 2.2 Maintain $k$ random samples
**Method 1**: Maintain 1 random sample
1. Initialize random sample $s=x_1$
2. For $t=1,2,\cdots$: Get $x_t$, update $s$.
$\begin{align*}
s\leftarrow x_t \quad\text{with } \text{Pr}= 1/t
\end{align*}$
*Proof*. Pick any time $t$ and any $x_j,j\leq t$, WTS: $\text{Pr}(s=x_j)=1/t$.
$\begin{align*}
Pr(s=x_j)&= Pr(\text{Picked } x_j \text{ at time } j)\cdot Pr(\text{never replaced it after time } j)\\
&= \frac{1}{\cancel{j}}\cdot\frac{\cancel{j}}{\cancel{j+1}}\cdot\frac{\cancel{j+1}}{\cancel{j+2}}\cdot\;\cdots\;\cdot\frac{\cancel{t-1}}{t}=\frac{1}{t}
\end{align*}$
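A sketch of method 1 in Python (a single-slot reservoir; names are my own):

```python
import random

def reservoir_one(stream, rng=random):
    """Maintain one uniformly random sample from a stream."""
    s = None
    for t, x in enumerate(stream, start=1):
        # Replace the current sample with x_t with probability 1/t.
        if rng.random() < 1 / t:
            s = x
    return s
```

At the end, each of the $t$ items seen so far is held with probability exactly $1/t$, matching the telescoping product in the proof.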
**Method 2** (preferred): Maintain $k$ random samples (w/ replacements)
1. Initialize $k$ random samples $s_1,\cdots,s_k=x_1$
2. For $t=1,2,\cdots$: Get $x_t$
- for $j=1,\cdots,k$ perform *method 1*.
*Proof*. Running $k$ independent parallel copies of method 1.
**Method 3**: Maintain $k$ random samples (w/o replacements)
1. $\left(s_1, s_2, \ldots, s_k\right)=\left(x_1, x_2, \ldots, x_k\right)$
2. For $t=k+1, k+2, \cdots$: Get $x_t$, update:
- With $\text{Pr}= \frac{k}{t}$, pick $j\in\set{1,\cdots,k}$ at random, set $s_j=x_t$
*Proof*. (HW)
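Method 3 is standard reservoir sampling; a sketch under that reading (function name is mine):

```python
import random

def reservoir_k(stream, k, rng=random):
    """Maintain k uniform random samples without replacement."""
    reservoir = []
    for t, x in enumerate(stream, start=1):
        if t <= k:
            reservoir.append(x)                # fill with the first k items
        elif rng.random() < k / t:
            # With probability k/t, overwrite a uniformly chosen slot.
            reservoir[rng.randrange(k)] = x
    return reservoir
```

Since each stream item occupies at most one slot, the $k$ held samples are distinct stream positions, i.e. a sample without replacement.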
### 2.3 Online median
- Maintain a random sample of $k$ elements with replacement (method 2 in [[#2.2 Maintain $k$ random samples|#2.2]])
- At any time $t$: let $m_t$ be the median of these $k$ elements
**Claim**: Pick any $0<\delta, \epsilon<1$. If
$k \geq \frac{1}{2 \epsilon^2} \ln \frac{2}{\delta}$
then for any time $t$, with probability at least $1-\delta$, the value $m_t$ is a $(1 / 2 \pm \epsilon)$-fractile of $\left\{x_1, \ldots, x_t\right\}$.
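A sketch combining method 2 with a median of the $k$ held samples (a rough approximation scheme; the names and the use of `statistics.median` are my own):

```python
import random
import statistics

def online_median(stream, k, rng=random):
    """Approximate the running median via k samples kept with replacement."""
    samples = []
    for t, x in enumerate(stream, start=1):
        if t == 1:
            samples = [x] * k                  # k independent copies of method 1
        else:
            for j in range(k):
                if rng.random() < 1 / t:       # each copy replaced w.p. 1/t
                    samples[j] = x
    return statistics.median(samples)
```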
### 2.4 The heavy hitter problem
**Problem:** Given a stream of data $x_1, x_2, \ldots, x_m \in \mathcal{X}$ (where $m$ is unknown) and a parameter $\epsilon \in(0,1)$.
Want to keep track of *heavy hitters* (as well as their frequencies):
$\begin{align*}
\text { Elements with frequency } \geq \epsilon m
\end{align*}$
*Space complexity* requirement, where $n=|\mathcal{X}|$: there are at most $1/\epsilon$ heavy hitters, and each needs $\log n$ bits for its identity plus $\log m$ bits for its count, so
$\begin{align*}
\text{Total space} &\approx \frac{1}{\epsilon} (\log n+ \log m)
\end{align*}$
***Misra-Gries algo.*** runs w/ space complexity $O(\frac{1}{\epsilon} (\log n+ \log m))$
![[Pasted image 20230418130158.png|400]]
- Think of $V[x]$ as (approximately) holding the number of occurrences of each item $x$ seen so far
- Once in a while, $k+1$ of these values are decremented
- By time $t$, the maximum number of such decrement-steps is $t/(k+1)$
**Claim**: Suppose that the number of times $x$ appears in $x_1, \ldots, x_t$ is $\text{freq}_t(x)$. The following is true at all times $t$, for every $x \in \mathcal{X}$ (Take $V[x]=0$ for any $x \notin T$):
$
\operatorname{freq}_t(x)-t /(k+1) \leq V[x] \leq \operatorname{freq}_t(x) .
$
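A sketch of the Misra-Gries sketch with $k$ counters (this is the standard formulation; details may differ slightly from the slide's pseudocode):

```python
def misra_gries(stream, k):
    """Misra-Gries sketch: after t items, for every x,
    freq_t(x) - t/(k+1) <= V.get(x, 0) <= freq_t(x)."""
    V = {}
    for x in stream:
        if x in V:
            V[x] += 1
        elif len(V) < k:
            V[x] = 1
        else:
            # Decrement step: all k counters (plus the implicit count
            # for x) drop by 1, so it is charged to k+1 items at once.
            for key in list(V):
                V[key] -= 1
                if V[key] == 0:
                    del V[key]
    return V

# misra_gries([1]*6 + [2]*3 + [3], k=2) -> {1: 5, 2: 2}
```

Here $\text{freq}(1)=6$ and $t/(k+1)=10/3$, so the guarantee $6-10/3\leq V[1]\leq 6$ indeed holds for the returned count of 5.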
*Space complexity*. $O\left(\frac{1}{\epsilon}(\log n + \log m)\right)$
## Reference
[Online-variance computation](https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance)