Criminal investigators would sometimes like to predict the height of a suspect based on limited evidence, such as a footprint. Below is data from a sample of 20 statistics students, comparing foot length (in centimeters) with height (in inches).

results = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/FootHeight.csv")
myLM = lm(height~foot,data=results)
plot(results$foot,results$height,xlab='Foot length (cm)',ylab='Height (in)',pch=20)
abline(myLM)

Checking Inference Conditions

Before you can use inference tools for regression, you should check the following conditions:

  1. Observations are independent (ideally the data are a simple random sample, or SRS).

  2. The data have a linear relationship (check the scatterplot or a residual plot).

  3. Variance of the residuals is the same for all x-values (check the residual plot).

  4. Residuals are normally distributed (check a histogram or qqplot of the residuals).

To get the residuals so that you can make a residual plot or check whether they are normal, use the command resid(myLM).

hist(resid(myLM),col='gray',main='Distribution of Residuals',xlab='Residual (in)')

That’s an okay shape for a sample with only 20 observations.

plot(results$foot,resid(myLM),xlab='Foot length (cm)', ylab = 'Residual (in)', main='Residual Plot', pch=20)
abline(0,0)
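The fourth condition mentions a qqplot as an alternative to the histogram. Here is a sketch, assuming the myLM model fit above:

```r
# Normal quantile plot of the residuals: points that fall roughly
# along the line support the normality condition.
qqnorm(resid(myLM), pch = 20, main = 'Normal Q-Q Plot of Residuals')
qqline(resid(myLM))  # reference line through the first and third quartiles
```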

Confidence Intervals

This linear model has the following relevant statistics: the mean y-value is \(\bar{y} = 67.75\) inches, the mean x-value is \(\bar{x} = 28.5\) cm, the standard deviation of the y-values is \(s_y = 5.0039458\), the standard deviation of the x-values is \(s_x = 3.4450575\), and the sample size is \(N = 20\). The residual standard error is \(s = 3.613\). A formula for the regression line is \[\hat{y} = 38.3 + 1.033 x\]
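As a sanity check, you can rebuild the slope and intercept from these summary statistics by hand. This sketch recovers the correlation \(r\) from the multiple R-squared of 0.506 reported by summary(myLM):

```r
# Rebuild the regression coefficients from the summary statistics above.
ybar <- 67.75;     xbar <- 28.5
sy   <- 5.0039458; sx   <- 3.4450575
r <- sqrt(0.506)        # correlation, from R-squared = 0.506
b <- r * sy / sx        # slope = r * sy / sx
a <- ybar - b * xbar    # the line passes through (xbar, ybar)
c(intercept = a, slope = b)
```

These agree with the coefficients reported by coef(myLM) up to rounding.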

Use www.desmos.com to plot the 95% confidence interval endpoints as functions of \(x^*\) using the formula: \[\hat{y} \pm t^* s \sqrt{\frac{1}{N}+ \frac{(x^*-\bar{x})^2}{(N-1)s_x^2}}\] Note that the \(t^*\) for a 95% confidence interval with 18 degrees of freedom is 2.100922. You should set your viewing window so that you have a range of x and y-values similar to the ones above. Describe what you see. Why do the confidence intervals get wider as \(x^*\) gets further from \(\bar{x}\)?

The R command to make a confidence interval is:

predict(myLM,data.frame(foot=30),interval='confidence',level = 0.95)
##        fit      lwr      upr
## 1 69.29989 67.44079 71.15899

Notice the slightly odd syntax for inputting the \(x^*\) value: it must be supplied as a data frame whose column name matches the predictor variable (foot).
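You can verify that predict is using the confidence interval formula above. This sketch recomputes the interval by hand with the rounded coefficients, so the endpoints match the R output up to rounding:

```r
# Recompute the 95% confidence interval at x* = 30 from the formula.
xstar <- 30; N <- 20
xbar  <- 28.5; sx <- 3.4450575
s     <- 3.613                     # residual standard error
tstar <- qt(0.975, df = N - 2)     # 2.100922
yhat  <- 38.3 + 1.033 * xstar      # predicted average height
SE <- s * sqrt(1/N + (xstar - xbar)^2 / ((N - 1) * sx^2))
c(fit = yhat, lwr = yhat - tstar * SE, upr = yhat + tstar * SE)
```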

Prediction Intervals

Predicting an individual’s height from their foot length is harder than predicting the average height of everyone with that foot length, because individual observations vary more than averages. Essentially, you have to add the variance of the residuals to the variance of the estimated average y-value. This leads to the formula:

\[\hat{y} \pm t^* s \sqrt{1+\frac{1}{N}+ \frac{(x^*-\bar{x})^2}{(N-1)s_x^2}}\]

When I measured my foot with my shoe on, it was 30 cm long. So a 95% prediction interval for my height in inches would be:

predict(myLM,data.frame(foot=30),interval='prediction')
##        fit      lwr      upr
## 1 69.29989 61.48439 77.11538

This means that I can be 95% sure that the height of one individual (like me!) with a 30 cm footprint is between 61.48 and 77.12 inches. That is a pretty lousy prediction, but it is all that a sample of size 20 can do.
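As with the confidence interval, you can recompute the prediction interval by hand. The only change is the extra 1 under the square root; this sketch uses the rounded statistics from above:

```r
# Recompute the 95% prediction interval at x* = 30 from the formula.
xstar <- 30; N <- 20
xbar  <- 28.5; sx <- 3.4450575
s     <- 3.613
tstar <- qt(0.975, df = N - 2)
yhat  <- 38.3 + 1.033 * xstar
SE <- s * sqrt(1 + 1/N + (xstar - xbar)^2 / ((N - 1) * sx^2))  # extra 1 here
c(fit = yhat, lwr = yhat - tstar * SE, upr = yhat + tstar * SE)
```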

summary(myLM)
## 
## Call:
## lm(formula = height ~ foot, data = results)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -7.1003 -2.2251 -0.7833  2.1330  8.7334 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  38.3021     6.9050   5.547 2.89e-05 ***
## foot          1.0333     0.2406   4.294 0.000437 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.613 on 18 degrees of freedom
## Multiple R-squared:  0.506,  Adjusted R-squared:  0.4786 
## F-statistic: 18.44 on 1 and 18 DF,  p-value: 0.0004367