Not surprisingly, there is a strong correlation between a student’s grades on different midterm exams. Here is data from my statistics sections from 2013 to 2015.
results = read.csv('http://people.hsc.edu/faculty-staff/blins/StatsExamples/midtermRegressionS13_F15.csv')
head(results)
## Midterm1 Midterm2
## 1 52 54
## 2 70 73
## 3 86 89
## 4 73 62
## 5 92 85
## 6 47 72
It is very easy to make scatterplots in R.
plot(results$Midterm1,results$Midterm2,xlab="Midterm 1 grade",ylab="Midterm 2 grade")
The cor() function calculates the correlation between two vectors of data.
cor(results$Midterm1,results$Midterm2)
## [1] 0.6094617
The command lm() constructs a least squares linear regression model for predicting one variable from another; lm stands for linear model. Its syntax is a little different from that of other commands. Here is an example of how to use it.
myLM = lm(Midterm2~Midterm1,data=results)
myLM
##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = results)
##
## Coefficients:
## (Intercept) Midterm1
## 29.6567 0.5797
Notice that the linear model contains the slope and the y-intercept of the least squares regression line. The slope is the coefficient on the Midterm1 grade, so it is the number 0.5797. This means that for each additional point a student earns on midterm 1, their midterm 2 grade tends to be 0.5797 points higher, on average.
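If you want to pull the slope and intercept out of the model programmatically rather than reading them off the printout, base R's coef() function returns them as a named vector. A minimal sketch:
coef(myLM) # named vector with entries (Intercept) and Midterm1
coef(myLM)["Midterm1"] # just the slope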
You can add the trendline to the scatterplot with the command abline().
plot(results$Midterm1,results$Midterm2,xlab="Midterm 1 grade",ylab="Midterm 2 grade")
abline(myLM) #adds the trendline to the scatterplot
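abline() passes the usual base graphics arguments through, so you can style the trendline if the default is hard to see. For example:
abline(myLM, col="red", lwd=2) # same trendline, drawn thicker and in red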
If you want to predict a y-value based on a particular x-value, use the predict() function. The arguments of predict() are a linear model and a data frame with a column whose name matches the x-variable in the linear model. In this example, the second argument is a simple data frame with two values for Midterm 1: one student who gets a 50 and another who gets a 100.
predict(myLM,data.frame(Midterm1=c(50,100)))
## 1 2
## 58.64292 87.62917
Based on the results, it looks like students who get a 50 on midterm 1 will have an average grade of about 58.6 on midterm 2. On the other hand, students with a perfect score on midterm 1 will only average about 87.6 on midterm 2. This phenomenon is known as regression to the mean: the predicted outcomes in a linear regression model tend to be less extreme than the inputs.
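You can double-check these predictions by hand, since they are just the fitted line evaluated at 50 and 100. A quick sanity check using the coefficients stored in the model:
coef(myLM)[1] + coef(myLM)[2]*c(50,100) # should match the predict() output above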
A linear model in R contains a lot of useful information that can be accessed using the summary() function. This information will be particularly useful when we apply statistical inference techniques to linear regression.
summary(myLM)
##
## Call:
## lm(formula = Midterm2 ~ Midterm1, data = results)
##
## Residuals:
## Min 1Q Median 3Q Max
## -44.672 -6.336 1.096 7.918 26.821
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 29.65667 4.41632 6.715 2.43e-10 ***
## Midterm1 0.57972 0.05652 10.256 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 10.98 on 178 degrees of freedom
## Multiple R-squared: 0.3714, Adjusted R-squared: 0.3679
## F-statistic: 105.2 on 1 and 178 DF, p-value: < 2.2e-16
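The pieces of this summary can be extracted individually. For example, the R-squared value is stored as a component of the summary object, and for simple linear regression it is just the square of the correlation we computed earlier:
summary(myLM)$r.squared # extract R-squared from the summary object
cor(results$Midterm1,results$Midterm2)^2 # the same number, since R-squared = r^2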
The function resid() takes a linear model (like myLM) and returns the residuals. You can use this to plot the residuals against the x-values, which is often useful if you want to assess whether the data really follow a linear model and whether the conditions for statistical inference are met.
plot(results$Midterm1,resid(myLM))
abline(0,0) # add line with slope = 0 and y-intercept = 0 to the residual plot
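R also has built-in diagnostic plots for linear models: calling plot() on the model itself cycles through a standard set of diagnostics, including a residuals vs. fitted values plot and a normal Q-Q plot of the residuals. For example:
plot(myLM, which=1) # just the first diagnostic plot: residuals vs. fitted values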