Predicting Height

Criminal investigators would sometimes like to predict the height of a suspect based on limited evidence, such as a footprint. Below is data from a sample of 20 statistics students, comparing foot length (in centimeters) against height (in inches).

results = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/FootHeight.csv")
myLM = lm(height~foot,data=results)
plot(results$foot,results$height,xlab='Foot length (cm)',ylab='Height (in)',pch=20)
abline(myLM)

Checking Inference Conditions

Before you can use inference tools for regression, you should check the following conditions:

Sample is an SRS (simple random sample) of the whole population.
Data has a linear relationship (check the scatterplot).
Variance of the residuals is the same for all x-values (check the residual plot).
Residuals are normally distributed (check a qqplot of the residuals).
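
The last two checks can be sketched in R as follows. This is only an illustration: the helper name `check_conditions` is made up here and is not part of the original lab, while `resid`, `fitted`, `qqnorm` and `qqline` are standard R functions. It plots residuals against fitted values, which in simple regression shows the same pattern as plotting them against x.

```r
# Hypothetical helper (not from the original lab): draws a residual plot and
# a normal QQ plot for a fitted lm object, and returns the residuals invisibly.
check_conditions <- function(model) {
  res <- resid(model)
  old_par <- par(mfrow = c(1, 2))   # two plots side by side
  on.exit(par(old_par))             # restore the plot layout afterwards

  # Residual plot: look for roughly constant vertical spread and no curve.
  plot(fitted(model), res, xlab = "Fitted values", ylab = "Residuals", pch = 20)
  abline(h = 0, lty = 2)

  # QQ plot: points close to the line suggest roughly normal residuals.
  qqnorm(res, pch = 20)
  qqline(res)

  invisible(res)
}

# Run on the model from above, if it has been fitted in this session.
if (exists("myLM")) check_conditions(myLM)
```

With only 20 points, expect some wobble in both plots; what matters is a clear fan shape, a curve, or strong departures from the QQ line.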

Confidence Intervals

This linear model has the following relevant statistics. The average y-value is 67.75 and the average x-value is 28.5. The standard deviation of the y-values is 5.0039458 and the standard deviation of the x-values is 3.4450575. The sample size was 20. The residual standard error \(s\) is 3.613. A formula for the regression line is \[\hat{y} = 38.3 + 1.033 x\]

Use www.desmos.com to plot the 95% confidence interval endpoints as functions of \(x^*\) using the formula: \[\hat{y} \pm t^* s \sqrt{\frac{1}{N}+ \frac{(x^*-\bar{x})^2}{(N-1)s_x^2}}\] Note that the \(t^*\) for a 95% confidence interval with 18 degrees of freedom is 2.100922. You should set your viewing window so that you have a range of x and y-values similar to the ones above. Describe what you see. Why do the confidence intervals get wider as \(x^*\) gets further from \(\bar{x}\)?
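
As an alternative to Desmos, the same two endpoint curves can be drawn in R directly from the summary statistics quoted above. This is a rough sketch: the function names `y_hat` and `ci_margin` are invented for illustration, the coefficients are the rounded ones from the text, and the plotting range of 22 to 36 cm is an assumption (roughly \(\bar{x} \pm 2 s_x\)), not the actual range of the data.

```r
# Summary statistics quoted in the text above.
n      <- 20
x_bar  <- 28.5
s_x    <- 3.4450575
s      <- 3.613                  # residual standard error
t_star <- qt(0.975, df = n - 2)  # about 2.100922 for 18 degrees of freedom

# Regression line from the text (rounded coefficients).
y_hat <- function(x) 38.3 + 1.033 * x

# Margin of error for the mean height at foot length x_star.
ci_margin <- function(x_star) {
  t_star * s * sqrt(1 / n + (x_star - x_bar)^2 / ((n - 1) * s_x^2))
}

# Plot the line with the lower and upper 95% confidence endpoints (dashed).
# The range 22 to 36 cm is an assumed viewing window.
curve(y_hat(x), from = 22, to = 36,
      xlab = "Foot length (cm)", ylab = "Height (in)")
curve(y_hat(x) - ci_margin(x), add = TRUE, lty = 2)
curve(y_hat(x) + ci_margin(x), add = TRUE, lty = 2)
```

Calling `ci_margin()` at a few foot lengths, such as 28.5, 30 and 35, gives the half-widths numerically if you want to compare them with what you see on the graph.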

Prediction Intervals

Predicting an individual’s height based on their foot length is harder than predicting the average height based on the same information. This is because individuals are even more variable than groups. Essentially you have to add the variance in the residuals to the variance for average y-values. This leads to this formula:

\[\hat{y} \pm t^* s \sqrt{1+\frac{1}{N}+ \frac{(x^*-\bar{x})^2}{(N-1)s_x^2}}\]

When I measured my foot with my shoe on, it was 30 cm long. So a 95% prediction interval for my height in inches would be (note that the second argument must be a data frame with a column named foot, and that 95% is the default level):

predict(myLM,data.frame(foot=30),interval='prediction',level=0.95)

##        fit      lwr      upr
## 1 69.29989 61.48439 77.11538

This means that I can be 95% sure that the height of one individual (like me!) with a 30 cm foot length is between 61.48439 and 77.11538 inches. That is a pretty lousy prediction! A larger sample would only help a little, though, since most of the width comes from the residual standard error \(s\), which reflects how much individuals vary.
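
As a check, the interval reported by predict() can be reproduced by hand from the prediction interval formula and the summary statistics quoted earlier. The sketch below uses the rounded coefficients 38.3 and 1.033 and the \(t^*\) for 95% with 18 degrees of freedom, so it should agree with the predict() output only to about one decimal place. The variable names are chosen for illustration.

```r
# Summary statistics quoted earlier in the text.
n      <- 20
x_bar  <- 28.5
s_x    <- 3.4450575
s      <- 3.613                  # residual standard error
t_star <- qt(0.975, df = n - 2)  # 95% level, 18 degrees of freedom

x_star <- 30                      # foot length in cm
fit    <- 38.3 + 1.033 * x_star   # rounded regression line from the text

# Prediction margin: note the extra "1 +" compared with the confidence interval.
margin <- t_star * s *
  sqrt(1 + 1 / n + (x_star - x_bar)^2 / ((n - 1) * s_x^2))

c(fit = fit, lwr = fit - margin, upr = fit + margin)
```

Replacing 0.975 with 0.95 in the call to `qt` would give the narrower 90% interval instead.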