The [[sampling distribution]] of the least squares estimator $\hat \beta$ is the probability distribution of $\hat \beta$ treated as a random variable across random samples of size $n$. With the sampling distribution, we can compute statistical significance and confidence intervals. Because $\hat \beta = (X^TX)^{-1}X^T \vec Y$ is a [[linear combination of normal random variables]] (the $Y_i$s), it also has a [[normal distribution]]

$\hat \beta \sim N(\beta, \ \sigma^2(X^T X)^{-1})$

where the matrix $\sigma^2 (X^TX)^{-1}$ holds, on the diagonal, the [[variance]] of each parameter coefficient in the model and, on the off-diagonal, the [[covariance]] of each pair of parameters. This is called the **variance-covariance matrix**. To prove this, we calculate the expectation and variance of $\hat \beta$.

# expectation of $\hat \beta$

We can show that the expectation of $\hat \beta$ is equal to $\beta$. Thus $\hat \beta$ is an unbiased [[estimator]] of $\beta$. By definition,

$E(\hat \beta) = E \Big [ (X^TX)^{-1} X^T \vec Y \Big ]$

Note that $(X^TX)^{-1}X^T$ is a constant because we are assuming that the $x_i$s are fixed. It can be pulled out of the expectation.

$E(\hat \beta) = (X^TX)^{-1} X^T E [ \vec Y ]$

The expectation of $\vec Y$ is simply $X \beta$.

$E(\hat \beta) = (X^TX)^{-1} X^T X \beta$

$(X^TX)^{-1}X^TX$ is the identity matrix of size $(p+1) \times (p+1)$, and $\beta$ is of size $(p+1) \times 1$. Thus we find that the expected value of $\hat \beta$ is $\beta$.

$E(\hat \beta) = \beta$

# variance of $\hat \beta$

We can also show that $Var(\hat \beta) = \sigma^2 (X^TX)^{-1}$. Again note that $(X^TX)^{-1}X^T$ is a constant because the data are assumed to be fixed.

$Var(\hat \beta) = Var\Big [(X^TX)^{-1}X^T \vec Y \Big]$

To pull a constant out of a variance, recall we must square it. For a constant matrix $A$, this means multiplying by $A$ in front and by $A^T$ behind: $Var(A \vec Y) = A \, Var(\vec Y) \, A^T$.

$Var(\hat \beta) = (X^TX)^{-1}X^T \cdot Var (\vec Y ) \cdot \Big [(X^TX)^{-1}X^T \Big ]^T$

$Var (\vec Y)$ is $\sigma^2$ multiplied by the $n \times n$ identity matrix $I_n$: a variance-covariance matrix with $\sigma^2$ along the diagonal and $0$ everywhere else, because we assume that the $Y_i$s are independent. We can also simplify the last term by taking the transpose of the product of the matrices, noting that $(X^TX)^{-1}$ is symmetric, so its transpose is again $(X^TX)^{-1}$.

$Var(\hat \beta) = (X^TX)^{-1}X^T \cdot \sigma^2 I_n \cdot X(X^TX)^{-1}$

We can pull the constant $\sigma^2$ to the front and drop the $I_n$ identity matrix (after confirming the dimensions still give a well-formed multiplication).

$Var(\hat \beta) = \sigma^2 (X^TX)^{-1}X^T X(X^TX)^{-1}$

Finally we note that $X^T X(X^TX)^{-1}$ is an identity matrix that can also be dropped. Thus

$Var(\hat \beta) = \sigma^2 (X^TX)^{-1}$

# sampling distribution for simple linear regression

The sampling distributions of the intercept and slope coefficients for simple linear regression are

$\hat \beta_0 \sim N \Big ( \beta_0, \ \sigma^2 \Big ( \frac{1}{n} + \frac{\bar x^2}{\sum (x_i - \bar x)^2} \Big ) \Big)$

and

$\hat \beta_1 \sim N \Big ( \beta_1, \ \sigma^2 \Big ( \frac{1}{\sum (x_i - \bar x)^2} \Big ) \Big)$
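
As a numerical sanity check of the derivation above, the sketch below (a minimal sketch assuming NumPy, with a made-up design matrix and coefficients) simulates many samples of $\vec Y = X\beta + \vec \varepsilon$ with independent normal errors, computes $\hat \beta$ for each, and compares the Monte Carlo mean and covariance of $\hat \beta$ against $\beta$ and $\sigma^2(X^TX)^{-1}$.

```python
# Minimal sketch (assumes NumPy): Monte Carlo check that E(beta_hat) = beta
# and Var(beta_hat) = sigma^2 (X^T X)^{-1} for a fixed design matrix X.
import numpy as np

rng = np.random.default_rng(0)

n, p = 50, 2                       # sample size and number of predictors
sigma = 1.5                        # error standard deviation (assumed known here)
beta = np.array([2.0, -1.0, 0.5])  # true coefficients: intercept + p slopes

# Fixed design matrix: a column of ones plus p predictor columns.
X = np.column_stack([np.ones(n), rng.uniform(0, 10, size=(n, p))])

# Theoretical variance-covariance matrix sigma^2 (X^T X)^{-1}.
theory_cov = sigma**2 * np.linalg.inv(X.T @ X)

# Draw many samples of Y and compute beta_hat = (X^T X)^{-1} X^T Y for each.
n_sims = 20_000
beta_hats = np.empty((n_sims, p + 1))
for s in range(n_sims):
    y = X @ beta + rng.normal(0, sigma, size=n)
    beta_hats[s] = np.linalg.solve(X.T @ X, X.T @ y)

print("mean of beta_hat:", beta_hats.mean(axis=0))        # ~ beta (unbiased)
print("true beta:       ", beta)
print("MC covariance:\n", np.cov(beta_hats, rowvar=False))  # ~ theory_cov
print("sigma^2 (X^T X)^{-1}:\n", theory_cov)
```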
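
Similarly, for simple linear regression the diagonal of $\sigma^2(X^TX)^{-1}$ should match the closed-form variances stated above. A small sketch (again assuming NumPy; the $x$ values are arbitrary illustrative data):

```python
# Minimal sketch (assumes NumPy): for simple linear regression, the diagonal of
# sigma^2 (X^T X)^{-1} equals the closed-form variances of beta_0_hat and beta_1_hat.
import numpy as np

sigma = 2.0
x = np.array([1.0, 2.0, 4.0, 5.0, 7.0, 8.0])   # arbitrary fixed x values
n = len(x)

X = np.column_stack([np.ones(n), x])
cov = sigma**2 * np.linalg.inv(X.T @ X)

Sxx = np.sum((x - x.mean())**2)                  # sum of (x_i - x_bar)^2
var_b0 = sigma**2 * (1 / n + x.mean()**2 / Sxx)  # Var(beta_0_hat)
var_b1 = sigma**2 / Sxx                          # Var(beta_1_hat)

print(cov[0, 0], var_b0)   # should agree
print(cov[1, 1], var_b1)   # should agree
```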