The data set below contains the results for 248 runners in a 5K race held in California in 2013. The following example shows how to fit a multiple linear regression model in R, with race time as the response and age and gender as predictors.
raceData = read.csv('http://www.rossmanchance.com/iscam2/data/Talley5K2013.txt',sep='\t')
head(raceData)
##     BIB                Name            Hometown Gender AgeRank GenderRank
## 1  1538 Ricketts, Christian    Grover Beach, Ca      M       1          1
## 2  1581     Mccarty, Travis   Arroyo Grande, Ca      M       1          2
## 3  1679       Bounds, Julia    Redwood City, Ca      F       1          1
## 4 91506  Krichevsky, Daniel San Luis Obispo, Ca      M       1          3
## 5  1591          Shea, Owen             Slo, Ca      M       1          4
## 6  1542    Gillespie, Tyler   Arroyo Grande, Ca      M       2          5
##   OverallRank  Time Age
## 1           1 16.18  13
## 2           2 17.15  34
## 3           3 18.80  14
## 4           4 19.02  27
## 5           5 19.07  49
## 6           6 19.38  25
myLM = lm(Time~Age+Gender,data=raceData)
summary(myLM)
##
## Call:
## lm(formula = Time ~ Age + Gender, data = raceData)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -13.789  -5.440  -1.831   3.768  53.025
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.98722    1.41269  21.227  < 2e-16 ***
## Age          0.12259    0.03196   3.836 0.000159 ***
## GenderM     -5.20582    1.09495  -4.754  3.4e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.485 on 245 degrees of freedom
## Multiple R-squared: 0.1318, Adjusted R-squared: 0.1247
## F-statistic: 18.6 on 2 and 245 DF, p-value: 3.021e-08
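To make the coefficient table concrete, here is a minimal sketch that computes a predicted race time by hand from the estimates printed above. The runner (a 30-year-old male) is a hypothetical example chosen only for illustration, and the names `coefs`, `age`, `is_male`, and `predicted_time` are not part of the original analysis.

```r
# Coefficient estimates copied from the summary(myLM) output above
coefs <- c(Intercept = 29.98722, Age = 0.12259, GenderM = -5.20582)

# Hypothetical runner used only for illustration: a 30-year-old male
age <- 30
is_male <- 1  # the GenderM indicator is 1 for male runners, 0 for female runners

# Fitted equation: Time = Intercept + Age * age + GenderM * is_male
predicted_time <- coefs[["Intercept"]] + coefs[["Age"]] * age +
  coefs[["GenderM"]] * is_male
print(round(predicted_time, 2))

# With the fitted model in memory, the equivalent call would be:
# predict(myLM, newdata = data.frame(Age = 30, Gender = "M"))
```

Under this model the predicted time is roughly 28.5 minutes. Setting `is_male` to 0 gives the prediction for a female runner of the same age, which is about 5.2 minutes slower according to the `GenderM` estimate.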
We should also plot these variables to see how they relate to one another and to judge whether a linear model is appropriate.
par(mfrow=c(2,2))
plot(raceData$Age,raceData$Time,xlab='Age',ylab='Race Time (min)')
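# factor() keeps these two calls drawing box plots in R >= 4.0, where read.csv() no longer converts strings to factors
plot(factor(raceData$Gender),raceData$Time,xlab="Gender",ylab='Race Time (min)')
plot(factor(raceData$Gender),raceData$Age,xlab="Gender",ylab="Age")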
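Beyond plotting the raw variables, one common way to judge whether a linear model is appropriate is to look at its residuals. The sketch below is an illustration under stated assumptions, not part of the original analysis: it uses simulated stand-in data (`simData`, generated with right-skewed noise) so that it runs without downloading the race file, and `simLM` is a hypothetical name. To apply it to the race data, replace `simLM` with `myLM`.

```r
# Simulated stand-in data (NOT the race results), so the sketch runs offline
set.seed(1)
n <- 248
simData <- data.frame(
  Age = sample(10:70, n, replace = TRUE),
  Gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
# Right-skewed noise, loosely mimicking a few very slow finishers
simData$Time <- 30 + 0.12 * simData$Age - 5 * (simData$Gender == "M") +
  rexp(n, rate = 1 / 8) - 8

simLM <- lm(Time ~ Age + Gender, data = simData)

par(mfrow = c(1, 2))
# Residuals vs. fitted values: look for curvature or changing spread
plot(fitted(simLM), resid(simLM), xlab = "Fitted values", ylab = "Residuals")
abline(h = 0, lty = 2)
# Normal Q-Q plot: points well above the line at the right end suggest right skew
qqnorm(resid(simLM))
qqline(resid(simLM))
```

If the residuals-versus-fitted plot shows no clear pattern and the Q-Q points follow the line, the linear model's assumptions look reasonable. A long upper tail, as the residual summary for the race model hints (median -1.831, maximum 53.025), may call for a transformation of the response or a closer look at outliers.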