# Week 1
# Introduction to Statistics, Stanford University
https://www.coursera.org/learn/stanford-statistics/lecture/SLz3n/meet-guenther-walther
- Descriptive Statistics
- Presenting data
- Communicate information
- Support reasoning about data
- Summaries
- Pie Chart, Bar Graph and Histograms
- Ways to summarize data: graphical summaries
- Qualitative data (e.g. colors)
- Use a pie chart or dot plot to show the proportion of each category
- Bar Graph
- Quantitative data
- Histogram: the areas of the blocks are proportional to frequency
- Information to gain from a histogram: **Density** (crowding): the height of a bar tells how many subjects there are per unit on the horizontal scale.
- **Percentages** (relative frequencies): area = height x width
- Box-and-Whisker Plot (Box Plot)
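- It shows 5 key numbers of the data (the five-number summary).
- Lower whisker: the smallest value (minimum)
- Upper whisker: the largest value (maximum)
- Median: the middle value (half of the data lie below it)
- 1st quartile: 1/4 of the data lie below it
- 3rd quartile: 3/4 of the data lie below it
- Box plots for different groups can be drawn side by side to compare them.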
- Scatter Plot
- e.g. years of education compared with salary
- To visualise relationships between two variables
- Providing Context is Key for Statistical Analyses
- "Statistical analyses typically compare the observed data to a reference. Therefore context is essential for graphical integrity"
- "The visual display of quantitative information" by Edward Tufte (p.74)
- Principles of Small multiples
- Combining multiple summaries of data.
- e.g. one box plot of the weather for each month, plotted together in one graph for the year.
- Overlaying multiple layers of information / context
- Pitfalls when visualising information
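The histogram rule noted above (area of a block = height x width = percentage) can be checked with a small sketch. The bin edges and counts are made-up illustration values, not data from the course.

```python
# Hypothetical bins with unequal widths (made-up values for illustration only).
bin_edges = [0, 10, 20, 50]   # horizontal scale
counts = [20, 30, 50]         # number of subjects in each bin

total = sum(counts)
widths = [right - left for left, right in zip(bin_edges, bin_edges[1:])]

# Height = percentage of subjects per unit on the horizontal scale (density).
heights = [100 * c / total / w for c, w in zip(counts, widths)]

# Area = height x width recovers the percentage of subjects in each bin.
areas = [h * w for h, w in zip(heights, widths)]

print("heights:", heights)
print("areas:  ", areas)
```

The last bin holds the most subjects but has the lowest bar, because it is three times as wide: the height shows crowding, the area shows the percentage.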
## Mean and Median
- Mean: simply the average
- Median: the value with half the data larger and half the data smaller.
- The midpoint of the data.
- Why is knowing the median important?
- E.g. household incomes of a country
- The histogram of incomes is skewed to the right: most people earn less and a few earn far more. The mean is then larger than the median.
- A single very large data point can pull the mean far away, so the mean is not an accurate picture of the typical value. The median is barely affected.
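A minimal sketch of how one large value moves the mean but hardly moves the median. The incomes are made-up numbers (in thousands), chosen only for illustration.

```python
from statistics import mean, median

# Made-up household incomes in thousands, for illustration only.
incomes = [30, 35, 40, 45, 50]
with_outlier = incomes + [1000]  # add one very large income

print("without outlier:", mean(incomes), median(incomes))
print("with outlier:   ", mean(with_outlier), median(with_outlier))
```

With the extra value the mean ends up far above every ordinary income, while the median stays close to where it was.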
## Percentiles
- Why is this important? How will people use it?
- Quartiles divide the data into 4 equal parts.
- The 25th percentile is the 1st quartile.
## Percentiles, The 5 number summary and standard deviation
In the box plot, the distance between the 1st and 3rd quartiles tells you how much the data is spread out (the interquartile range).
Another measure of spread is the standard deviation (SD).
- Research papers usually report the mean, which gives a sense of the center, and the SD, which tells how spread out the data are.
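A rough sketch of the five-number summary, the interquartile range and the SD using Python's standard library. The data are made up, and quartile conventions differ slightly between tools, so the quartiles may not match a hand calculation exactly.

```python
from statistics import median, quantiles, stdev

data = [2, 4, 4, 5, 7, 9, 10, 12, 15]  # made-up values for illustration

# quantiles(..., n=4) returns the three cut points between the four quarters.
q1, q2, q3 = quantiles(data, n=4)

five_number_summary = (min(data), q1, median(data), q3, max(data))
iqr = q3 - q1     # interquartile range: spread of the middle half of the data
sd = stdev(data)  # sample standard deviation (divides by n - 1)

print("five-number summary:", five_number_summary)
print("IQR:", iqr)
print("SD: ", sd)
```

`stdev` uses the sample formula; `statistics.pstdev` divides by n instead, which may be closer to the definition used in a given course.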
---
Quiz https://niyander.blogspot.com/2022/07/Introduction%20to%20Statistics%20Week%201%20Quiz%20Answer.html
Suppose all household incomes in California increase by 5%. How does that change the median household income?
- median household income goes up by 5%
Suppose all household incomes in California increase by 5%. How does that change the standard deviation of the household incomes?
- **the standard deviation of the household incomes goes up by 5%**
Suppose all household incomes in California increase by $5,000. How does that change the standard deviation of the household incomes?
- cannot be determined from the information given
- **the standard deviation of the household incomes doesn't change**
- the standard deviation of the household incomes goes up by $5,000
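Question: Why does a percentage increase change the SD, but adding a fixed amount does not?
- The SD measures how far the data are from the mean. Adding $5,000 to every income moves the mean by $5,000 too, so every distance from the mean stays the same. Multiplying every income by 1.05 multiplies every distance from the mean by 1.05, so the SD goes up by 5%.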
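The two quiz answers can be checked numerically. This is only a sketch with made-up incomes; any list of numbers would show the same pattern.

```python
from statistics import pstdev

incomes = [40_000, 55_000, 70_000, 120_000]  # made-up values for illustration

sd_original = pstdev(incomes)
sd_scaled = pstdev([x * 1.05 for x in incomes])     # every income up by 5%
sd_shifted = pstdev([x + 5_000 for x in incomes])   # every income up by $5,000

print("original SD:    ", sd_original)
print("after +5%:      ", sd_scaled, "ratio:", sd_scaled / sd_original)
print("after +$5,000:  ", sd_shifted)
```

The ratio for the 5% case comes out at 1.05 (up to rounding), and the shifted SD equals the original.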
----
# Week 2
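- What is statistical inference? Drawing conclusions about a population from a sample, e.g. estimating the approval percentage among all U.S. voters.
- Population: the entire group of subjects about which we want information
- Parameter: a quantity of the population that we want to know, e.g. "approval percentage among all U.S. voters"
- Sample: the part of the population from which we collect information
- Statistic (estimate): the quantity computed from the sample to estimate the parameter, e.g. the approval percentage among the voters in the sample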
Simple Random Sampling
Stratified Random Sampling
Biases
- Non response
- Voluntary response
- Selection
"The weird power of the placebo effect, explained" by Brian Resnick (7/7/2017)
## Probability
A probability is always between 0 and 1.
- Complement rule
- Rule for equally likely outcomes
- Mutually exclusive events
- Addition rule (or)
- Independent events
- Multiplication rule (and)
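The rules listed above can be illustrated by counting outcomes for two fair dice. The dice example is an assumption made for this sketch and does not come from the lectures.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Rule for equally likely outcomes: favourable outcomes / all outcomes."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

# Complement rule: P(not A) = 1 - P(A)
p_sum_is_7 = prob(lambda o: sum(o) == 7)
p_sum_not_7 = prob(lambda o: sum(o) != 7)

# Addition rule for mutually exclusive events: P(A or B) = P(A) + P(B)
p_sum_is_11 = prob(lambda o: sum(o) == 11)
p_7_or_11 = prob(lambda o: sum(o) in (7, 11))

# Multiplication rule for independent events: P(A and B) = P(A) * P(B)
p_first_is_6 = prob(lambda o: o[0] == 6)
p_second_is_6 = prob(lambda o: o[1] == 6)
p_both_6 = prob(lambda o: o == (6, 6))

print(p_sum_not_7, 1 - p_sum_is_7)
print(p_7_or_11, p_sum_is_7 + p_sum_is_11)
print(p_both_6, p_first_is_6 * p_second_is_6)
```

Each printed pair shows the directly counted probability next to the value given by the rule.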
## Bayesian Analysis
---
# Week 3 - Normal Approximation and Binomial Distribution
The empirical rule for the normal (bell-shaped) curve
- About 68% of the data fall within 1 standard deviation of the mean
- About 95% fall within 2 standard deviations of the mean
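The two percentages of the empirical rule can be recovered from the standard normal curve. A small sketch using `statistics.NormalDist` from the Python standard library:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Area under the curve within 1 and within 2 standard deviations of the mean.
within_1_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_2_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"within 1 SD: {within_1_sd:.1%}")
print(f"within 2 SD: {within_2_sd:.1%}")
```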
Standardizing Data
- Why do we need to standardize data?
- Subtract the mean from each data point and divide by the standard deviation: z = (data - mean) / SD
- z is called the standardized value (z-score)
- z has no unit
A z-score says how many standard deviations a data point is above or below the mean, e.g. a z-score of 2 means the point is 2 standard deviations above the mean.
Standardization is most useful when the data follow a normal distribution.
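A short sketch of standardizing a list of values. The heights are made up for illustration, and the population SD (`pstdev`) is an assumption about which SD formula to use.

```python
from statistics import mean, pstdev

heights_in = [64, 66, 68, 70, 72]  # made-up heights in inches

m = mean(heights_in)
sd = pstdev(heights_in)

# z = (data - mean) / SD; the inches cancel, so z has no unit.
z_scores = [(x - m) / sd for x in heights_in]

for x, z in zip(heights_in, z_scores):
    print(f"{x} in -> z = {z:+.2f}")
```

After standardizing, the z-scores have mean 0 and SD 1, whatever the original unit was.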
Normal approximation
- To answer questions such as "What percentage of fathers have heights between 67 in and 71 in?"
- If we know the mean and the SD, we can find the area under the normal curve between the two values, which is the percentage.
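A sketch of the area calculation for the fathers' heights question. The mean of 68.3 in and SD of 1.8 in are assumed values for illustration; they are not stated in these notes.

```python
from statistics import NormalDist

# Assumed values for illustration (not taken from the notes).
mean_height = 68.3  # inches
sd_height = 1.8     # inches

heights = NormalDist(mu=mean_height, sigma=sd_height)

# Area under the normal curve between 67 in and 71 in.
share_between = heights.cdf(71) - heights.cdf(67)

print(f"Estimated share between 67 in and 71 in: {share_between:.1%}")
```

The same answer comes from standardizing both endpoints first and using the standard normal curve.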
[[30-01-2023]]
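Computing percentiles for normal data
e.g. what is the 30th percentile of the fathers' heights?
1. Find the z-value that has 30% of the area under the standard normal curve to its left.
2. Convert back to the original scale: height = mean + z x SD.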
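A sketch of the percentile calculation: find z first, then convert z back to inches. The mean of 68.3 in and SD of 1.8 in are assumed values for illustration only.

```python
from statistics import NormalDist

# Assumed values for illustration (not taken from the notes).
mean_height = 68.3  # inches
sd_height = 1.8     # inches

# Step 1: z-value with 30% of the standard normal area to its left.
z_30 = NormalDist().inv_cdf(0.30)

# Step 2: undo the standardization, height = mean + z * SD.
height_30 = mean_height + z_30 * sd_height

print(f"z = {z_30:.3f}, 30th percentile = {height_30:.2f} in")
```

The z-value is negative because the 30th percentile lies below the mean.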
**The Binomial Setting and Binomial Coefficient**
The binomial coefficient counts the number of ways one can arrange k successes in n experiments.
https://www.omnicalculator.com/math/binomial-coefficient
The Binomial Formula
- The binomial formula gives the probability of getting exactly k successes in n independent experiments, each with the same success probability.
![[Screenshot 2023-01-30 at 6.09.57 AM.png]]
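A small sketch of the binomial coefficient and the binomial formula using `math.comb`. The coin-toss numbers are an illustration chosen here, and `binomial_probability` is a helper name made up for this example.

```python
from math import comb

def binomial_probability(k, n, p):
    """P(exactly k successes in n independent experiments with success probability p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Number of ways to arrange 2 successes in 5 experiments.
ways = comb(5, 2)

# Probability of exactly 2 heads in 5 tosses of a fair coin.
p_two_heads = binomial_probability(2, 5, 0.5)

print("ways:", ways)
print("P(2 heads in 5 tosses):", p_two_heads)
```

Summing `binomial_probability(k, 5, 0.5)` over k = 0 to 5 gives 1, since exactly one of those counts must occur.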
**Random Variables and Probability Histograms**
- A probability histogram is a purely theoretical construct: it shows the probabilities of the possible outcomes, not observed data.
**Normal Approximation to the Binomial; Sampling Without Replacement**
![[Screenshot 2023-01-30 at 6.18.50 AM.png]]
As the number of experiments n gets larger, the probability histogram of the binomial distribution looks more and more like a normal curve.
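A sketch comparing an exact binomial probability with its normal approximation. The values n = 100 and p = 0.5 are chosen for illustration, and the half-unit continuity correction is a common refinement that is not mentioned in these notes.

```python
from math import comb, sqrt
from statistics import NormalDist

def binomial_pmf(k, n, p):
    """Exact probability of k successes in n experiments."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

n, p = 100, 0.5  # illustrative values

# Exact probability of between 45 and 55 successes (inclusive).
exact = sum(binomial_pmf(k, n, p) for k in range(45, 56))

# Normal curve with the same mean (n*p) and SD (sqrt(n*p*(1-p))).
approx_curve = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))

# Continuity correction: treat the block for k as covering k - 0.5 to k + 0.5.
approx = approx_curve.cdf(55.5) - approx_curve.cdf(44.5)

print(f"exact:  {exact:.4f}")
print(f"approx: {approx:.4f}")
```

For small n or for p close to 0 or 1 the two numbers drift apart, which is when the approximation should not be trusted.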
---
# Week 4 - Regression
- Understanding the basics of regression, inference, and regression diagnostics
- Understanding why regression is such an important statistical technique
- Ability to apply the knowledge about regression on real data
https://www.coursera.org/learn/stanford-statistics/lecture/3mFTj/prediction-is-a-key-task-of-statistics
Prediction is a key task of Statistics
Correlation Coefficient
The correlation coefficient r measures linear association and is always between -1 and 1. r = 1 or r = -1 means the points lie exactly on a line. The sign (+ or -) gives the direction of the slope.
It is only useful for linear association, i.e. a roughly linear scatter plot.
Remember: correlation is not causation.
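A sketch of computing the correlation coefficient by hand. The education and salary numbers are made up for illustration, and `correlation` is a helper defined here rather than a course-provided function.

```python
from math import sqrt
from statistics import mean

def correlation(xs, ys):
    """Correlation coefficient r of two equally long lists of numbers."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Made-up data: years of education and salary (in thousands).
years = [10, 12, 14, 16, 18]
salary = [30, 38, 41, 52, 60]

r = correlation(years, salary)

# Points exactly on a line give r = 1 (upward) or r = -1 (downward).
r_perfect = correlation(years, [2 * x + 1 for x in years])
r_negative = correlation(years, [-x for x in years])

print("r for made-up data:", round(r, 3))
print("perfect upward line:", r_perfect)
print("perfect downward line:", r_negative)
```

A high r here only describes how closely the points follow a line; it says nothing about whether more education causes the higher salary.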