Evolutionary biologists Harris and Steudel (2002) investigated factors related to the jumping ability of domestic cats. They measured takeoff velocity (using high-speed cameras) as a proxy for jumping ability in 18 healthy adult cats. Several traits that might be related to takeoff velocity were also recorded: sex, relative limb length (hindlimb), relative extensor muscle mass (musclemass), body mass (bodymass), and fat mass relative to lean body mass (percentbodyfat).
results = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/CatJumping.txt")
results
## sex bodymass hindlimb musclemass percentbodyfat velocity
## 1 F 3640 29.10 51.15 29 334.5
## 2 F 2670 28.55 46.05 17 387.3
## 3 M 5600 31.74 95.90 31 410.8
## 4 F 4130 26.90 55.65 39 318.6
## 5 F 3020 26.11 57.20 15 368.7
## 6 F 2660 26.69 48.67 11 358.8
## 7 F 3240 26.74 64.55 21 344.6
## 8 M 5140 27.71 78.80 35 324.6
## 9 F 3690 25.47 54.60 33 301.4
## 10 F 3620 28.18 55.50 15 331.8
## 11 F 5310 28.45 68.80 42 312.6
## 12 M 5560 28.65 79.80 37 316.8
## 13 M 3970 29.82 69.40 20 375.6
## 14 F 3770 26.66 60.25 26 372.4
## 15 F 5100 27.84 60.70 41 314.3
## 16 F 2950 27.89 55.65 25 367.5
## 17 M 7930 30.58 98.95 48 286.3
## 18 F 3550 28.06 79.25 16 352.5
What is the response variable and what are the explanatory variables in this data set?
The scatterplots below show the relationship between each explanatory variable and the response variable. For each plot, comment on the (a) direction, (b) linearity, and (c) strength of the trend. Because the plots from the pairs() function are a little hard to read, here is each explanatory variable plotted against the response variable.
par(mfrow=c(2,3))
plot(factor(results$sex),results$velocity,ylab='Velocity',xlab='Sex') # factor() gives boxplots even if read.csv() leaves sex as character
plot(results$bodymass,results$velocity,ylab='Velocity',xlab='Body Mass')
plot(results$musclemass,results$velocity,ylab='Velocity',xlab='Muscle Mass')
plot(results$percentbodyfat,results$velocity,ylab='Velocity', xlab='Percent Body Fat')
plot(results$hindlimb,results$velocity,ylab='Velocity',xlab = 'Hind Limb Length')
I used 4 different methods to find a good model for predicting velocity based on the other variables. Here are the results.
By adding variables one at a time based on whether and how much they increased \(R^2_{adj}\), we get a model that includes every variable except sex.
mylm = lm(velocity~percentbodyfat+musclemass+hindlimb+bodymass,data=results)
summary(mylm)
##
## Call:
## lm(formula = velocity ~ percentbodyfat + musclemass + hindlimb +
## bodymass, data = results)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.229 -11.119 -1.931 8.156 39.486
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 46.000774 109.100588 0.422 0.6802
## percentbodyfat -0.005902 1.040256 -0.006 0.9956
## musclemass 1.263113 0.712982 1.772 0.0999 .
## hindlimb 12.540988 4.337745 2.891 0.0126 *
## bodymass -0.032726 0.013209 -2.478 0.0277 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 20.14 on 13 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.6292
## F-statistic: 8.213 on 4 and 13 DF, p-value: 0.001565
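Here is a minimal sketch of the add-one-variable-at-a-time search by \(R^2_{adj}\) described above. It assumes the data printed earlier are retyped into a data frame (called cats here, a name that does not appear in the original), and the loop is only one reasonable way to code the idea, not necessarily how the model above was found.

```r
# Data retyped from the printout above; `cats` is a hypothetical name.
cats <- data.frame(
  sex = factor(c("F","F","M","F","F","F","F","M","F","F","F","M","M","F","F","F","M","F")),
  bodymass = c(3640, 2670, 5600, 4130, 3020, 2660, 3240, 5140, 3690,
               3620, 5310, 5560, 3970, 3770, 5100, 2950, 7930, 3550),
  hindlimb = c(29.10, 28.55, 31.74, 26.90, 26.11, 26.69, 26.74, 27.71, 25.47,
               28.18, 28.45, 28.65, 29.82, 26.66, 27.84, 27.89, 30.58, 28.06),
  musclemass = c(51.15, 46.05, 95.90, 55.65, 57.20, 48.67, 64.55, 78.80, 54.60,
                 55.50, 68.80, 79.80, 69.40, 60.25, 60.70, 55.65, 98.95, 79.25),
  percentbodyfat = c(29, 17, 31, 39, 15, 11, 21, 35, 33,
                     15, 42, 37, 20, 26, 41, 25, 48, 16),
  velocity = c(334.5, 387.3, 410.8, 318.6, 368.7, 358.8, 344.6, 324.6, 301.4,
               331.8, 312.6, 316.8, 375.6, 372.4, 314.3, 367.5, 286.3, 352.5)
)

candidates <- c("sex", "bodymass", "hindlimb", "musclemass", "percentbodyfat")
selected <- character(0)  # variables chosen so far
best_adj <- 0             # adjusted R^2 of the intercept-only model

repeat {
  remaining <- setdiff(candidates, selected)
  if (length(remaining) == 0) break

  # Adjusted R^2 for each model that adds one more candidate
  adj <- sapply(remaining, function(v) {
    f <- reformulate(c(selected, v), response = "velocity")
    summary(lm(f, data = cats))$adj.r.squared
  })

  # Stop as soon as no candidate improves the adjusted R^2
  if (max(adj) <= best_adj) break
  selected <- c(selected, names(which.max(adj)))
  best_adj <- max(adj)
}

selected   # variables in the order they were added
best_adj   # adjusted R^2 of the final model
```

The stopping rule (stop when \(R^2_{adj}\) no longer goes up) is the assumption that matters most here; a different rule could give a different model.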
This leads to a model with three explanatory variables: musclemass, hindlimb, and bodymass.
mylm = lm(velocity~musclemass+hindlimb+bodymass,data=results)
summary(mylm)
##
## Call:
## lm(formula = velocity ~ musclemass + hindlimb + bodymass, data = results)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.232 -11.126 -1.947 8.163 39.491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.758845 96.768264 0.473 0.64359
## musclemass 1.265299 0.578058 2.189 0.04605 *
## hindlimb 12.548448 3.983319 3.150 0.00709 **
## bodymass -0.032792 0.006172 -5.313 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.41 on 14 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.6557
## F-statistic: 11.79 on 3 and 14 DF, p-value: 0.0004012
The first variable to add is percentbodyfat, followed by hindlimb. After that, there are no other significant variables to add, so we stop.
mylm = lm(velocity~percentbodyfat+hindlimb,data=results)
summary(mylm)
##
## Call:
## lm(formula = velocity ~ percentbodyfat + hindlimb, data = results)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41.981 -11.090 -1.173 12.371 43.004
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 164.5979 99.1876 1.659 0.117779
## percentbodyfat -2.2978 0.5212 -4.409 0.000508 ***
## hindlimb 8.6462 3.6379 2.377 0.031211 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 22.77 on 15 degrees of freedom
## Multiple R-squared: 0.5818, Adjusted R-squared: 0.5261
## F-statistic: 10.44 on 2 and 15 DF, p-value: 0.001446
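The p-value version of adding variables one at a time can be sketched with add1(), which tests each variable not yet in the model. This is an illustration under some assumptions: the data frame name cats and the cutoff alpha = 0.05 are my choices here, and the F-test from add1() is used in place of reading t-tests off summary() (for a single added variable the two give the same p-value).

```r
# Data retyped from the printout above; `cats` is a hypothetical name.
cats <- data.frame(
  sex = factor(c("F","F","M","F","F","F","F","M","F","F","F","M","M","F","F","F","M","F")),
  bodymass = c(3640, 2670, 5600, 4130, 3020, 2660, 3240, 5140, 3690,
               3620, 5310, 5560, 3970, 3770, 5100, 2950, 7930, 3550),
  hindlimb = c(29.10, 28.55, 31.74, 26.90, 26.11, 26.69, 26.74, 27.71, 25.47,
               28.18, 28.45, 28.65, 29.82, 26.66, 27.84, 27.89, 30.58, 28.06),
  musclemass = c(51.15, 46.05, 95.90, 55.65, 57.20, 48.67, 64.55, 78.80, 54.60,
                 55.50, 68.80, 79.80, 69.40, 60.25, 60.70, 55.65, 98.95, 79.25),
  percentbodyfat = c(29, 17, 31, 39, 15, 11, 21, 35, 33,
                     15, 42, 37, 20, 26, 41, 25, 48, 16),
  velocity = c(334.5, 387.3, 410.8, 318.6, 368.7, 358.8, 344.6, 324.6, 301.4,
               331.8, 312.6, 316.8, 375.6, 372.4, 314.3, 367.5, 286.3, 352.5)
)

alpha <- 0.05  # assumed significance cutoff
candidates <- c("sex", "bodymass", "hindlimb", "musclemass", "percentbodyfat")
fit <- lm(velocity ~ 1, data = cats)  # start from the intercept-only model

repeat {
  in_model <- attr(terms(fit), "term.labels")
  if (length(setdiff(candidates, in_model)) == 0) break

  # One F-test per variable that is not yet in the model
  tab <- add1(fit, scope = reformulate(candidates), test = "F")
  tab <- tab[rownames(tab) != "<none>", , drop = FALSE]

  best <- which.min(tab[["Pr(>F)"]])
  if (tab[["Pr(>F)"]][best] >= alpha) break  # nothing significant left to add

  fit <- update(fit, as.formula(paste(". ~ . +", rownames(tab)[best])))
}

summary(fit)
```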
Of all the model selection methods, this is the easiest. You start with the full model and, at each step, eliminate the variable with the largest p-value until all remaining variables are statistically significant (usually at the \(\alpha = 5\%\) level). I ended up with a model with three explanatory variables: musclemass, hindlimb, and bodymass.
mylm = lm(velocity~musclemass+hindlimb+bodymass,data=results)
summary(mylm)
##
## Call:
## lm(formula = velocity ~ musclemass + hindlimb + bodymass, data = results)
##
## Residuals:
## Min 1Q Median 3Q Max
## -29.232 -11.126 -1.947 8.163 39.491
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 45.758845 96.768264 0.473 0.64359
## musclemass 1.265299 0.578058 2.189 0.04605 *
## hindlimb 12.548448 3.983319 3.150 0.00709 **
## bodymass -0.032792 0.006172 -5.313 0.00011 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 19.41 on 14 degrees of freedom
## Multiple R-squared: 0.7165, Adjusted R-squared: 0.6557
## F-statistic: 11.79 on 3 and 14 DF, p-value: 0.0004012
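The elimination steps described above can be automated with drop1(), which reports a p-value for removing each variable from the current model. Treat this as a sketch: the data frame name cats and the cutoff alpha = 0.05 are assumptions, and I am assuming the "full model" means all five explanatory variables, including sex.

```r
# Data retyped from the printout above; `cats` is a hypothetical name.
cats <- data.frame(
  sex = factor(c("F","F","M","F","F","F","F","M","F","F","F","M","M","F","F","F","M","F")),
  bodymass = c(3640, 2670, 5600, 4130, 3020, 2660, 3240, 5140, 3690,
               3620, 5310, 5560, 3970, 3770, 5100, 2950, 7930, 3550),
  hindlimb = c(29.10, 28.55, 31.74, 26.90, 26.11, 26.69, 26.74, 27.71, 25.47,
               28.18, 28.45, 28.65, 29.82, 26.66, 27.84, 27.89, 30.58, 28.06),
  musclemass = c(51.15, 46.05, 95.90, 55.65, 57.20, 48.67, 64.55, 78.80, 54.60,
                 55.50, 68.80, 79.80, 69.40, 60.25, 60.70, 55.65, 98.95, 79.25),
  percentbodyfat = c(29, 17, 31, 39, 15, 11, 21, 35, 33,
                     15, 42, 37, 20, 26, 41, 25, 48, 16),
  velocity = c(334.5, 387.3, 410.8, 318.6, 368.7, 358.8, 344.6, 324.6, 301.4,
               331.8, 312.6, 316.8, 375.6, 372.4, 314.3, 367.5, 286.3, 352.5)
)

alpha <- 0.05  # assumed significance cutoff

# Assumed full model: every explanatory variable, including sex
fit <- lm(velocity ~ sex + bodymass + hindlimb + musclemass + percentbodyfat,
          data = cats)

repeat {
  if (length(attr(terms(fit), "term.labels")) == 0) break

  # One F-test per variable currently in the model; for a one-coefficient
  # term this matches the t-test p-value shown by summary()
  tab <- drop1(fit, test = "F")
  tab <- tab[rownames(tab) != "<none>", , drop = FALSE]

  worst <- which.max(tab[["Pr(>F)"]])
  if (tab[["Pr(>F)"]][worst] < alpha) break  # everything left is significant

  fit <- update(fit, as.formula(paste(". ~ . -", rownames(tab)[worst])))
}

summary(fit)
```

Removing only one variable per step matters, because the p-values of the remaining variables can change a lot once a correlated variable is gone (compare bodymass in the first and second summaries above).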