A chi-square goodness of fit test can be used to test whether a sample comes from a particular, fully-specified distribution.
For a [[discrete random variable]], we can compare the observed counts of each unique value with the expected counts provided by the [[probability mass function|pmf]]. A test statistic (see below) can be calculated that has approximately a [[chi-squared distribution]]. Using this test statistic, we can conduct [[hypothesis testing]] to assess whether the observed counts are consistent with the specified distribution.
For a [[continuous random variable]], group the values into bins and run the test on this finite number of categories. The result can be sensitive to the choice of bins, so try a few different bin widths and be cautious if the conclusions vary widely. A sketch of this approach follows the note below.
> [!NOTE]
> You can use a chi-square goodness of fit test to assess whether the sample is from a [[normal distribution]]; however, it is usually better to look at a histogram, a QQ-plot, or a K-S test instead.
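As a sketch of the binning approach, suppose we want to test whether a sample came from an Exponential(1) distribution (the sample, rate, and bin edges below are illustrative choices):
```R
set.seed(1)
# Sample to test; the null hypothesis is Exponential(rate = 1)
x <- rexp(200, rate = 1)
# Bin edges (one arbitrary choice; try others)
breaks <- c(0, 0.5, 1, 2, Inf)
# Observed counts per bin
observed <- table(cut(x, breaks))
# Expected bin probabilities under the null, from the exponential CDF
p <- diff(pexp(breaks, rate = 1))
chisq.test(observed, p = p)
```
Rerunning with different `breaks` gives a quick sense of how sensitive the conclusion is to the binning.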
Suppose that $X_1, X_2, \dots, X_n$ is a random sample from some distribution.
Consider the test statistic
$W = \sum_{i=1}^k \frac{(O_i-E_i)^2}{E_i}$
where $O_i$ is the observed count and $E_i$ is the expected count for the $i$th of $k$ unique values or categories.
For a large sample, under the null hypothesis that the data do come from the specified distribution, $W$ [[converges in distribution]] to a chi-square distribution with $k-1$ [[degrees of freedom]] where $k$ is the number of unique values or categories.
$W \overset{d}{\to} \chi^2(k-1)$
The form of the test is to reject the null hypothesis if
$W > \chi^2_{\alpha, \ k-1}$
A large sample, in this case, is one for which the expected count in each category (under the null hypothesis) is at least 5. Thus, for a distribution whose pmf includes a category with a small probability, you will need many samples (potentially many more than 30) to apply the chi-square goodness of fit test.
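For example, the critical value on the right-hand side can be computed with `qchisq()` (the $\alpha$ and $k$ below are illustrative choices):
```R
# Critical value for the rejection rule at alpha = 0.05 with k = 4 categories
alpha <- 0.05
k <- 4
qchisq(alpha, df = k - 1, lower.tail = FALSE)  # about 7.81
```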
The proof of this convergence result relies on:
- The [[Central Limit Theorem]], since each observed count $O_i$ can be written as a sum of indicator random variables.
- The fact that the sum of $k$ independent $\chi^2(1)$ random variables has a $\chi^2(k)$ distribution.
- However, the proof is complicated by the fact that the counts are not independent: the count in the last category is determined by the counts in the other categories, since all $k$ counts sum to $n$. This dependence is why the limiting distribution has $k-1$ rather than $k$ degrees of freedom.
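To see the convergence in action, here is a minimal simulation sketch (the null probabilities, sample size, and number of replicates are arbitrary choices). Under the null hypothesis, the simulated $W$ values should behave like draws from $\chi^2(k-1)$:
```R
set.seed(42)
p <- c(0.25, 0.35, 0.25, 0.15)  # null probabilities (k = 4)
n <- 300                        # sample size per replicate
reps <- 10000
W <- replicate(reps, {
  # Draw category counts under the null and compute the statistic
  observed <- as.vector(rmultinom(1, size = n, prob = p))
  expected <- n * p
  sum((observed - expected)^2 / expected)
})
# The simulated upper-tail probability should be close to 0.05
mean(W > qchisq(0.95, df = 3))
```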
## R
In R, the chi-squared test can be run with `chisq.test()`. Optionally, provide a vector of probabilities of the same length as the observed vector; by default, all categories are assumed to be equally likely.
```R
# Observed counts for each category
observed <- c(74, 98, 75, 53)
# Specify the expected probabilities
probabilities <- c(0.25, 0.35, 0.25, 0.15)
# Perform test
chisq.test(observed, p=probabilities)
```
The output will look like this:
```
Chi-squared test for given probabilities
data: observed
X-squared = 1.9022, df = 3, p-value = 0.5929
```
- `X-squared`: value of the chi-squared test statistic
- `df`: degrees of freedom used
- `p-value`: p-value
Let's get the same result manually.
```R
# Observed counts for each category
observed <- c(74, 98, 75, 53)
# Specify the expected probabilities
probabilities <- c(0.25, 0.35, 0.25, 0.15)
# Calculate the expected values per category
expected <- probabilities * sum(observed)
# Calculate chi-squared test statistic
x_squared <- sum((observed - expected)^2 / expected)
# Calculate the p-value
df <- length(observed) - 1
pval <- pchisq(x_squared, df=df, lower.tail=FALSE)
# Print the results
print(paste("X-squared:", round(x_squared, 4)))
print(paste("df:", df))
print(paste("p-value:", round(pval, 4)))
```
We should get the same result:
```
[1] "X-squared: 1.9022"
[1] "df: 3"
[1] "p-value: 0.5929"
```
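Note that `chisq.test()` warns ("Chi-squared approximation may be incorrect") when any expected count is below 5, echoing the large-sample requirement above (the counts below are illustrative):
```R
# Expected counts here are 45, 45, 6, and 4; the last triggers the warning
observed <- c(40, 50, 7, 3)
probabilities <- c(0.45, 0.45, 0.06, 0.04)
chisq.test(observed, p = probabilities)
```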