Heights and Weights Data Set - Scott A. Strong

If you go to kaggle.com you can find the [Heights and Weights Dataset](https://www.kaggle.com/datasets/burnoutminer/heights-and-weights-dataset), which contains roughly 25,000 data points. The following code snippet loads the data into R.

```
# Set the working directory to the R-script directory
# (requires RStudio and the rstudioapi package)
setwd(dirname(rstudioapi::getSourceEditorContext()$path))

# Load the dataset
data <- read.csv("SOCR-HeightWeight3.csv")
```

A fully functioning R script appears in the dropbox, as does the Kaggle data file.

After we calculate the mean and standard deviation of the sampled data, we check for normality with the [Kolmogorov-Smirnov test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test). See also [[GPT ELI5 Kolmogorov-Smirnov test]]

![](https://www.youtube.com/watch?v=02I84i8Knas)

The null hypothesis is that the sampled data do not differ significantly from a normal distribution. Running these tests in R gives

- Height: p-value = 0.9785
- Weight: p-value = 0.8082

so we fail to reject the null hypothesis, and the data are consistent with a normal distribution. Given this, we can form the probability density function (PDF) for each variable by substituting the sample mean and standard deviation into the normal density,

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$$

- For Height:
    - Mean: $\mu = 67.99$ inches
    - Standard Deviation: $\sigma = 1.90$ inches

The PDF for height is:

$$f_{\text{height}}(x) = \frac{1}{1.90 \sqrt{2\pi}} e^{-\frac{(x - 67.99)^2}{2 \times 1.90^2}}$$

- For Weight:
    - Mean: $\mu = 127.08$ pounds
    - Standard Deviation: $\sigma = 11.66$ pounds

The PDF for weight is:

$$f_{\text{weight}}(x) = \frac{1}{11.66 \sqrt{2\pi}} e^{-\frac{(x - 127.08)^2}{2 \times 11.66^2}}$$

Now, let's consider sampling the human population and finding André René Roussimoff, better known as André the Giant, who stood 7'4" (88 inches) and weighed 520 lbs. To calculate the probability of sampling someone within plus or minus 10% of each of his measurements, we can integrate each PDF over the corresponding interval, that is, $\int_{0.9\,x_0}^{1.1\,x_0} f(x)\,dx$ with $x_0$ his height or weight.
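The integration step can be sketched in R as follows. This is a minimal illustration, not necessarily what the R script in the dropbox does: it uses the rounded sample statistics quoted above, the helper name `prob_within` is hypothetical, and it evaluates the integral through the normal CDF (`pnorm`) rather than numerical quadrature. Both CDF values are taken from the upper tail, because this far out the lower-tail CDF rounds to exactly 1 and a naive difference would return 0.

```r
# Sample statistics quoted in the text (rounded to two decimals)
mu_h <- 67.99;  sd_h <- 1.90    # height, inches
mu_w <- 127.08; sd_w <- 11.66   # weight, pounds

# André the Giant's reported measurements: 7'4" = 88 inches, 520 lbs
andre_height <- 88
andre_weight <- 520

# Probability that a normal variable lies within +/- tol of a target value.
# (Hypothetical helper; the integral of the PDF over [lo, hi] equals the
# difference of the two upper-tail probabilities.)
prob_within <- function(target, mu, sigma, tol = 0.10) {
  lo <- target * (1 - tol)
  hi <- target * (1 + tol)
  pnorm(lo, mu, sigma, lower.tail = FALSE) -
    pnorm(hi, mu, sigma, lower.tail = FALSE)
}

p_height <- prob_within(andre_height, mu_h, sd_h)
p_weight <- prob_within(andre_weight, mu_w, sd_w)
print(c(height = p_height, weight = p_weight))
```

Because the tails are so thin, the exact figures are very sensitive to the unrounded sample statistics and to how the interval is defined, so a sketch like this should only be expected to agree with a full script in order of magnitude at best.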
Doing so we find:

- Probability of sampling an individual in this weight range: $9.706757\times 10^{-249}$
- Probability of sampling an individual in this height range: $1.894462 \times 10^{-9}$

As there are estimated to be only $10^{80}$ to $10^{82}$ atoms in the known universe, the first probability is too small to have any practical meaning: it is far less likely than picking one specific atom out of the entire universe in a single random draw. The second is somewhat more reasonable, and here is what GPT had to say:

A probability of approximately 0.000000189% (or 1.89 × 10⁻⁹) is extremely small, indicating an event that is highly unlikely to occur. Here are a few scenarios with similarly low probabilities:

1. **Winning the Powerball Jackpot Twice**:
    - The probability of winning the Powerball jackpot is about 1 in 292 million (approximately 0.000000342%). Winning it twice in a row would have a probability in the same ballpark as 0.000000189%.
2. **Getting Struck by Lightning Multiple Times**:
    - The odds of being struck by lightning in a given year are about 1 in 1.2 million. Being struck by lightning multiple times in a year would have a probability similar to 0.000000189%.
3. **Randomly Guessing a 30-Digit Number Correctly**:
    - If you were to guess a randomly chosen 30-digit number correctly, the probability is about 1 in 10 billion billion, which is similar to the probability of 0.000000189%.
4. **Finding a Specific Atom in a Human Body**:
    - The probability of randomly picking out a specific atom from a human body is roughly in the range of 10⁻⁹ to 10⁻¹², depending on the size of the atom. This is similar to the probability of 0.000000189%.

These examples illustrate events that are theoretically possible but so rare that they are nearly impossible to witness or experience in reality.
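The comparisons above are easy to sanity-check with a few lines of R. The sketch below takes the Powerball and lightning odds as quoted, treats repeated events as independent, and assumes roughly $7 \times 10^{27}$ atoms in a human body (a commonly cited estimate that does not come from the text). It reports how many orders of magnitude separate each event from the height probability.

```r
# Target probability from the height calculation in the text
p_target <- 1.894462e-9

# Odds quoted in the text
p_powerball <- 1 / 292e6    # one Powerball jackpot
p_lightning <- 1 / 1.2e6    # struck by lightning in a given year

# Assumption (not from the text): about 7e27 atoms in a human body
n_atoms_body <- 7e27

events <- c(
  powerball_once  = p_powerball,
  powerball_twice = p_powerball^2,   # two independent jackpots
  lightning_twice = p_lightning^2,   # two independent strikes
  guess_30_digits = 10^-30,          # one specific 30-digit number
  atom_in_body    = 1 / n_atoms_body
)

# log10 ratio to the target: 0 means the same order of magnitude,
# -8 means eight orders of magnitude less likely
gap <- log10(events / p_target)
print(round(gap, 1))
```

Under these assumptions only a single Powerball win lands within an order of magnitude of $1.89 \times 10^{-9}$. The compound events and the counting examples come out several to many orders of magnitude smaller, so the quoted scenarios are best read as loose illustrations rather than matched probabilities.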
I talked with GPT about [Andre as an outlier](https://chatgpt.com/share/499b6a80-1497-4852-9d70-c52b26e41247) to reconcile two facts: gigantism affects about 1 in 1,000,000 individuals, yet our probability calculation gives a value that is almost zero. Some key points:

- The near-zero probability predicted by the normal distribution reflects the fact that, without considering the presence of a specific medical condition, an individual reaching André's height or weight due to random genetic variation alone is extremely unlikely.
- The actual existence of individuals like André the Giant is better explained by the occurrence of rare medical conditions, which the normal distribution does not account for.
- **5% Prevalence**: If such a condition had a prevalence of around 5%, the distribution would likely begin to exhibit noticeable skewness, deviating from normality. The mean would shift higher, and the distribution's tails would be fatter.
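The effect of prevalence on the shape of the distribution can be illustrated with a two-component normal mixture: the general population plus a taller subpopulation. The sketch below computes the mixture's skewness in closed form from the component moments. The subpopulation mean and standard deviation (`mu_sub`, `sd_sub`) are purely hypothetical values chosen for illustration, and the function name `mixture_skewness` is likewise an assumption, not part of the original analysis.

```r
# Skewness of a two-component normal mixture, computed from its moments.
# p is the prevalence (mixing weight) of the second component.
mixture_skewness <- function(p, mu1, sd1, mu2, sd2) {
  w  <- c(1 - p, p)
  mu <- c(mu1, mu2)
  s  <- c(sd1, sd2)
  m  <- sum(w * mu)                    # mixture mean
  d  <- mu - m                         # component offsets from that mean
  v  <- sum(w * (s^2 + d^2))           # mixture variance
  m3 <- sum(w * (d^3 + 3 * d * s^2))   # third central moment
  m3 / v^1.5
}

# General population: sample statistics for height quoted in the text
mu_pop <- 67.99; sd_pop <- 1.90

# Hypothetical taller subpopulation (illustrative values, not from the data)
mu_sub <- 84; sd_sub <- 3

prevalence <- c(1e-6, 1e-3, 0.01, 0.05)
skew <- sapply(prevalence, mixture_skewness,
               mu1 = mu_pop, sd1 = sd_pop, mu2 = mu_sub, sd2 = sd_sub)
print(data.frame(prevalence, skewness = signif(skew, 3)))
```

With these illustrative parameters the skewness is essentially zero at a prevalence of one in a million, which is why a sample of 25,000 looks normal, and becomes clearly positive once the prevalence reaches a few percent. The exact threshold depends on the assumed subpopulation parameters.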