up:: [[Correlation testing]]
Tags:: #🌱
# Linear regression
Just as the mean is the measure of central tendency for a single data set, the line of best fit is the measure of central tendency for a correlation between two variables.
The mean of a data set is the value that makes the sum of squared deviations the smallest. Similarly, the line of best fit is the line that makes the sum of squared deviations of the points from that line the smallest.
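A quick way to see the "smallest sum of squared deviations" idea is to compare the mean against a few other candidate centers. This is only a sketch in Python; the scores and the helper name `sum_sq_dev` are made up for illustration.

```python
def sum_sq_dev(values, center):
    """Sum of squared deviations of the values from a chosen center."""
    return sum((v - center) ** 2 for v in values)

scores = [2, 4, 5, 4, 5]           # made-up example scores
mean = sum(scores) / len(scores)   # 4.0

ss_at_mean = sum_sq_dev(scores, mean)

# Shift the center away from the mean in both directions.
ss_nearby = [sum_sq_dev(scores, mean + shift) for shift in (-1, -0.5, 0.5, 1)]

print(ss_at_mean)   # sum of squares around the mean
print(ss_nearby)    # every shifted center gives a larger sum of squares
```

Any center other than the mean gives a larger sum of squares, which is the same property the line of best fit has among all possible lines.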
Equation for sum of squared deviations for the line of best fit:
![[Pasted image 20221114150428.png|500]]
###### What should the deviations from a correct regression equation create when plotted as a histogram?
The deviations (y minus y hat) should create an approximately normal distribution centered on zero. If they don't, you either don't have a linear relationship or you don't have the correct line of best fit.
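A rough way to check this is to fit a line, collect the deviations, and count them in bins. The sketch below uses simulated data (a linear relationship plus normal noise, with a made-up intercept of 2.0 and slope of 0.5), so everything about the data is an assumption. It prints a simple text histogram instead of using a plotting library.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Simulated data: a truly linear relationship plus normal noise (assumption).
xs = [i / 10 for i in range(200)]
ys = [2.0 + 0.5 * x + random.gauss(0, 1) for x in xs]

# Least-squares slope (b) and intercept (a).
n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
ss_x = sum((x - mean_x) ** 2 for x in xs)
b = sp / ss_x
a = mean_y - b * mean_x

# Deviation of each point from the line: y minus y hat.
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

# Text histogram: count the deviations in bins of width 0.5.
bin_width = 0.5
counts = {}
for res in residuals:
    bin_start = (res // bin_width) * bin_width
    counts[bin_start] = counts.get(bin_start, 0) + 1

for bin_start in sorted(counts):
    print(f"{bin_start:5.1f} | {'#' * counts[bin_start]}")
```

With a linear relationship the bars should pile up around zero and thin out on both sides. A lopsided or two-humped shape would suggest the relationship is not linear.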
###### What are the measures of y in a linear regression?
Y equals value of y for a point.
Y hat equals value of that y point on the line of best fit.
Deviation equals y minus y hat.
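As a small sketch of these three quantities in Python: the data is made up, and the intercept `a = 2.2` and slope `b = 0.6` are the least-squares values for this particular data, assumed here rather than derived.

```python
xs = [1, 2, 3, 4, 5]   # made-up x values
ys = [2, 4, 5, 4, 5]   # made-up y values
a, b = 2.2, 0.6        # assumed line of best fit for this data: y hat = a + b * x

# Y hat: the value on the line of best fit for each x.
y_hats = [a + b * x for x in xs]

# Deviation: y minus y hat for each point.
deviations = [y - y_hat for y, y_hat in zip(ys, y_hats)]

for x, y, y_hat, dev in zip(xs, ys, y_hats, deviations):
    print(f"x={x}  y={y}  y_hat={y_hat:.1f}  deviation={dev:+.1f}")
```

For the line of best fit the deviations cancel out, so they sum to zero apart from floating-point rounding.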
### Equation for line of best fit
![[Pasted image 20221114152124.png]]
The line of best fit always passes through ![[Pasted image 20221114151723.png]], the point at the mean value of x and the mean value of y.
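This can be checked numerically. In the sketch below the data is made up, and `a = 2.2` and `b = 0.6` are the least-squares intercept and slope for that data (assumed, not derived here).

```python
xs = [1, 2, 3, 4, 5]   # made-up x values
ys = [2, 4, 5, 4, 5]   # made-up y values
a, b = 2.2, 0.6        # assumed least-squares intercept and slope for this data

mean_x = sum(xs) / len(xs)   # 3.0
mean_y = sum(ys) / len(ys)   # 4.0

# Predicted y at the mean of x should equal the mean of y.
y_hat_at_mean_x = a + b * mean_x
print(mean_y, round(y_hat_at_mean_x, 10))
```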
# Steps to finding the Line of Best Fit Equation
1. Fill out a table with columns for X, Y, X^2, Y^2, and X times Y, then find the sum of each column.
2. Find the sum of squares for both x and y.
3. Find SP (the sum of products) from [[Pearson's correlation]] using the deviations of x and y from their means.
4. Find Pearson's R from [[Pearson's correlation]].
5. Find the slope of the line (b) using one of the two equations shown above.
6. Find the intercept (a) by plugging the slope and the means of both x and y into the equation.
7. Plug all values in to finish the equation!
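The steps above can be sketched in Python. This is a sketch under assumptions, not the note's exact method: it assumes the computational formulas SS = ΣX^2 - (ΣX)^2 / n and SP = ΣXY - (ΣX)(ΣY) / n, and it assumes the two slope equations are b = SP / SSx and b = r × sqrt(SSy / SSx). The function name `fit_line` and the data are made up.

```python
from math import sqrt

def fit_line(xs, ys):
    """Follow the table-based steps to get r, slope b, and intercept a."""
    n = len(xs)
    # Step 1: column sums from the table.
    sum_x, sum_y = sum(xs), sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Step 2: sums of squares for x and y.
    ss_x = sum_x2 - sum_x ** 2 / n
    ss_y = sum_y2 - sum_y ** 2 / n
    # Step 3: sum of products.
    sp = sum_xy - sum_x * sum_y / n
    # Step 4: Pearson's r.
    r = sp / sqrt(ss_x * ss_y)
    # Step 5: slope, computed both ways (they should agree).
    b = sp / ss_x
    b_from_r = r * sqrt(ss_y / ss_x)
    # Step 6: intercept from the means and the slope.
    a = sum_y / n - b * sum_x / n
    return {"ss_x": ss_x, "ss_y": ss_y, "sp": sp, "r": r,
            "b": b, "b_from_r": b_from_r, "a": a}

fit = fit_line([1, 2, 3, 4, 5], [2, 4, 5, 4, 5])
# Step 7: the finished equation.
print(f"y hat = {fit['a']:.2f} + {fit['b']:.2f}x   (r = {fit['r']:.3f})")
```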
# Statistical Analysis of Regression Lines
#### How do we know there is a true linear relationship?
After we have found the line of best fit, how do we know that there is truly a linear relationship in the data? We calculate the **standard error of the estimate**. It measures the variability in y that is not explained by the regression line, that is, the standard distance between the data points and the line.
![[Pasted image 20221114153931.png]]
The standard error of the estimate equation is very similar to the standard deviation equation except that the df value is n - 2 instead of n - 1.
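A minimal sketch of that calculation in Python, assuming the standard form sqrt(SSresidual / (n - 2)). The function name, the data, and the line (`a = 2.2`, `b = 0.6`, the least-squares values for this data) are made up for illustration.

```python
from math import sqrt

def standard_error_of_estimate(xs, ys, a, b):
    """sqrt(SSresidual / df) with df = n - 2."""
    ss_residual = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    df = len(xs) - 2   # n - 2, not n - 1
    return sqrt(ss_residual / df)

xs = [1, 2, 3, 4, 5]   # made-up x values
ys = [2, 4, 5, 4, 5]   # made-up y values
se = standard_error_of_estimate(xs, ys, a=2.2, b=0.6)
print(round(se, 3))
```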
![[Pasted image 20221114154558.png|700]]
SSregression measures the variability in y that is predicted by the line of best fit, whereas SSresidual measures the variability in y that is not explained by the line of best fit (the error left over).
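The two pieces add up to the total variability in y (SSy), and SSregression equals r^2 times SSy. A short Python sketch with made-up data shows both facts; the variable names are illustrative only.

```python
xs = [1, 2, 3, 4, 5]   # made-up x values
ys = [2, 4, 5, 4, 5]   # made-up y values
n = len(xs)

mean_x = sum(xs) / n
mean_y = sum(ys) / n
ss_x = sum((x - mean_x) ** 2 for x in xs)
ss_y = sum((y - mean_y) ** 2 for y in ys)
sp = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))

# Line of best fit and its predictions.
b = sp / ss_x
a = mean_y - b * mean_x
y_hats = [a + b * x for x in xs]

# Variability predicted by the line vs. variability left over.
ss_regression = sum((y_hat - mean_y) ** 2 for y_hat in y_hats)
ss_residual = sum((y - y_hat) ** 2 for y, y_hat in zip(ys, y_hats))

r_squared = sp ** 2 / (ss_x * ss_y)

print(round(ss_regression, 3), round(ss_residual, 3), round(ss_y, 3))
print(round(r_squared * ss_y, 3))   # matches SSregression
```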
### Hypothesis testing for line of best fit
H0: The linear regression equation doesn't account for a significant proportion of the variance in the y values.
H1: The linear regression equation accounts for a significant proportion of the variance in the y values.
![[Pasted image 20221114154840.png]]
# Steps to analysis of regression lines
We use this table to organize our findings of the values.
![[Pasted image 20221114155726.png]]
Find the variance for both x and y. Remember that the df value for the residual variance is n - 2 rather than n - 1, and the df value for the regression variance is 1.
1. Find the sum of squares for both x and y.
2. Find SP (the sum of products) from [[Pearson's correlation]] using the deviations of x and y from their means.
3. Find Pearson's R from [[Pearson's correlation]].
4. Find the slope of the line (b) using one of the two equations shown above.
5. Find the intercept (a) by plugging the slope and the means of both x and y into the equation.
6. Plug all values in to finish the equation for the line of best fit.
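7. Find the SSresidual and SSregression values using r^2: SSregression = r^2 × SSy and SSresidual = (1 - r^2) × SSy.
8. Find the variance for the regression (SSregression / 1) and the variance for the residual (SSresidual / (n - 2)), then perform an F-test using the [[F unit table]].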
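The last two steps can be sketched in Python. The function name `regression_f_ratio` and the numbers are made up for illustration (SSy = 6, SSx = 10, SP = 6, n = 5). The alpha level of .05 is an assumption, and the critical value is the standard F table entry for df = (1, 3) at that level.

```python
from math import sqrt

def regression_f_ratio(ss_y, r, n):
    """F-ratio for a simple linear regression, built from r^2 and SSy."""
    ss_regression = r ** 2 * ss_y
    ss_residual = (1 - r ** 2) * ss_y
    df_regression = 1
    df_residual = n - 2
    variance_regression = ss_regression / df_regression
    variance_residual = ss_residual / df_residual
    return variance_regression / variance_residual, df_regression, df_residual

# Made-up values: SP = 6, SSx = 10, SSy = 6, n = 5.
r = 6 / sqrt(10 * 6)
f_ratio, df1, df2 = regression_f_ratio(ss_y=6, r=r, n=5)

critical_f = 10.13   # F table value for df = (1, 3), assuming alpha = .05
reject_h0 = f_ratio > critical_f

print(f"F({df1}, {df2}) = {f_ratio:.2f}, reject H0: {reject_h0}")
```

Here the F-ratio falls below the critical value, so with this small made-up sample we would fail to reject H0.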
Related:
___
# Resources