Two-way ANOVA Example: Gender and Major vs. SAT Scores

The following data is from a study of undergraduate students who initially planned to major in computer science. These students all began the same year and were classified according to gender. We also look at the declared major at the end of the second year. We use the labels CS for computer science majors, EO for engineering and other science majors, and O for other majors.

Here we look at differences in SAT math scores for these student vs. their gender and which type of major they switched to.

myData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/majors.csv")
head(myData)
##   sex maj satm satv hsm hss hse  gpa
## 1   F  CS  640  530   8   6   8 2.35
## 2   F  CS  670  600   9  10   7 2.08
## 3   F  CS  600  400   8   8   7 3.21
## 4   F  CS  570  480   7   7   6 2.34
## 5   F  CS  510  530   6   8   8 1.40
## 6   F  CS  750  610  10   9   9 1.43
dim(myData)
## [1] 234   8
table(myData$sex,myData$maj)
##    
##     CS EO  O
##   F 39 39 39
##   M 39 39 39
tapply(myData$satm,list(myData$sex,myData$maj),mean)
##         CS       EO        O
## F 629.6923 617.8974 589.5128
## M 582.3590 632.0513 542.3590

Now let’s look at how (and whether) our factors influence income.

par(mfrow=c(1,2))
boxplot(satm~sex,data=myData,ylab='Math SAT',col='gray')
boxplot(satm~maj,data=myData,ylab='Math SAT',col='gray')

Looking at Interactions

We can make two interaction plots for this data.

interaction.plot(myData$sex,myData$maj,myData$satm,main='Interaction Plot',xlab='Gender',ylab='Math SAT',col=c(1,2,3,4),lwd=3)

interaction.plot(myData$maj,myData$sex,myData$satm,main='Interaction Plot',xlab='New Major',ylab='Math SAT',col=c(1,2,3,4),lwd=3)

myAOV = aov(satm~sex*maj,data=myData)
summary(myAOV)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## sex           1   41947   41947   7.843  0.00554 ** 
## maj           2  141746   70873  13.252 3.59e-06 ***
## sex:maj       2   49006   24503   4.582  0.01120 *  
## Residuals   228 1219389    5348                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: the pooled standard deviation \(s_p\) is the square root of the MSE. In this case it is: 73.1. This is the best estimate for the population standard deviation \(\sigma\) of the residuals. Before we use this, we should check that the variances really are the same in each group:

tapply(myData$satm,list(myData$sex,myData$maj),sd)
##         CS       EO        O
## F 78.55710 62.59090 74.52194
## M 72.47776 61.73711 85.92055