Not surprisingly, there is a strong correlation between a student’s grades on different midterm exams. Here is data from my statistics sections from 2013 to 2015.
results = read.csv('http://people.hsc.edu/faculty-staff/blins/StatsExamples/midtermRegressionS13_F15.csv')
head(results)
## Midterm1 Midterm2
## 1 52 54
## 2 70 73
## 3 86 89
## 4 73 62
## 5 92 85
## 6 47 72
It is very easy to make scatterplots in R.
plot(results$Midterm1,results$Midterm2,xlab="Midterm 1 grade",ylab="Midterm 2 grade")
The cor() function calculates the correlation between two vectors of data.
cor(results$Midterm1,results$Midterm2)
## [1] 0.6094617
The command lm() constructs a least squares linear regression model for predicting one variable from another; lm stands for linear model. Its syntax is a little different from that of other commands. Here is an example of how to use it.
myLM = lm(Midterm2~Midterm1,data=results)
myLM
##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = results)
##
## Coefficients:
## (Intercept) Midterm1
## 29.6567 0.5797
Notice that the linear model contains the slope and the y-intercept of the least squares regression line. The slope is the coefficient on the Midterm1 grade, so it is the number 0.5797. This means that for each additional point a student earns on midterm 1, their midterm 2 grade tends to be 0.5797 points higher, on average.
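If you want to pull the slope and intercept out of the model programmatically rather than reading them off the printout, base R's coef() function returns them as a named vector. A minimal sketch:
coef(myLM) # named vector with entries (Intercept) and Midterm1
coef(myLM)["Midterm1"] # just the slope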
You can add the trendline to the scatterplot with the command abline().
plot(results$Midterm1,results$Midterm2,xlab="Midterm 1 grade",ylab="Midterm 2 grade")
abline(myLM) #adds the trendline to the scatterplot
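abline() passes the usual base graphics arguments through, so you can style the trendline if the default is hard to see. For example:
abline(myLM, col="red", lwd=2) # same trendline, drawn thicker and in red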
If you want to predict a y-value based on a particular x-value, use the predict() function. The arguments of predict() are a linear model and a data frame with a column whose name matches the x-variable in the linear model. In this example, the second argument is a simple data frame with two values for Midterm 1: one student who gets a 50 and another who gets a 100.
predict(myLM,data.frame(Midterm1=c(50,100)))
## 1 2
## 58.64292 87.62917
Based on the results, it looks like students who get a 50 on midterm 1 will have an average grade of about 58.6 on midterm 2. On the other hand, students with a perfect score on midterm 1 will only average about 87.6 on midterm 2. This phenomenon is known as regression to the mean: the predicted outcomes in a linear regression model tend to be less extreme than the inputs.
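You can double-check these predictions by hand, since they are just the fitted line evaluated at 50 and 100. A quick sanity check using the coefficients stored in the model:
coef(myLM)[1] + coef(myLM)[2]*c(50,100) # should match the predict() output above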
A linear model in R contains a lot of useful information that can be accessed using the summary() function. This information will be particularly useful when we apply statistical inference techniques to linear regression.
summary(myLM)
##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = results)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.672 -6.336 1.096 7.918 26.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.65667 4.41632 6.715 2.43e-10 ***
## Midterm1 0.57972 0.05652 10.256 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.98 on 178 degrees of freedom
## Multiple R-squared: 0.3714, Adjusted R-squared: 0.3679
## F-statistic: 105.2 on 1 and 178 DF, p-value: < 2.2e-16
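The pieces of this summary can be extracted individually. For example, the R-squared value is stored as a component of the summary object, and for simple linear regression it is just the square of the correlation we computed earlier:
summary(myLM)$r.squared # extract R-squared from the summary object
cor(results$Midterm1,results$Midterm2)^2 # the same number, since R-squared = r^2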
The function resid() takes a linear model (like myLM) and returns the residuals. You can use this to plot the residuals against the x-values, which is often useful if you want to assess whether the data really follow a linear model and whether the conditions for statistical inference are met.
plot(results$Midterm1,resid(myLM))
abline(0,0) # add line with slope = 0 and y-intercept = 0 to the residual plot
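R also has built-in diagnostic plots for linear models: calling plot() on the model itself cycles through a standard set of diagnostics, including a residuals vs. fitted values plot and a normal Q-Q plot of the residuals. For example:
plot(myLM, which=1) # just the first diagnostic plot: residuals vs. fitted values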