Gender and Major vs. SAT Scores

The following data is from a study of undergraduate students who initially planned to major in computer science. These students all began the same year and were classified according to gender. We also look at the declared major at the end of the second year. We use the labels CS for computer science majors, EO for engineering and other science majors, and O for other majors.

First Look at the Data

Here we look at differences in SAT math scores for these student vs. their gender and which type of major they switched to.

myData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/majors.csv")
head(myData)
##   sex maj satm satv hsm hss hse  gpa
## 1   F  CS  640  530   8   6   8 2.35
## 2   F  CS  670  600   9  10   7 2.08
## 3   F  CS  600  400   8   8   7 3.21
## 4   F  CS  570  480   7   7   6 2.34
## 5   F  CS  510  530   6   8   8 1.40
## 6   F  CS  750  610  10   9   9 1.43
dim(myData)
## [1] 234   8

As you can see this data set includes 234 students who all initially wanted to major in CS. Here are the numbers in each subgroup:

table(myData$sex,myData$maj)
##    
##     CS EO  O
##   F 39 39 39
##   M 39 39 39

The fact that all of the subgroups had the same number of students isn’t a coincidence. This was a study with a 2-factor design. The two controlled factors are sex of the student and their eventual major. We want to see what the relationship is between these factors and SAT math scores.

Here are the five number summaries for each group:

par(mfrow=c(1,2))
boxplot(satm~sex,data=myData,ylab='Math SAT',col='gray')
boxplot(satm~maj,data=myData,ylab='Math SAT',col='gray')

And here are the mean SAT math scores:

tapply(myData$satm,list(myData$sex,myData$maj),mean)
##         CS       EO        O
## F 629.6923 617.8974 589.5128
## M 582.3590 632.0513 542.3590

Looking at Interactions

We can make two interaction plots for this data.

interaction.plot(myData$sex,myData$maj,myData$satm,main='Interaction Plot',xlab='Gender',ylab='Math SAT',col=c(1,2,3,4),lwd=3)

interaction.plot(myData$maj,myData$sex,myData$satm,main='Interaction Plot',xlab='New Major',ylab='Math SAT',col=c(1,2,3,4),lwd=3)

myAOV = aov(satm~sex*maj,data=myData)
summary(myAOV)
##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## sex           1   41947   41947   7.843  0.00554 ** 
## maj           2  141746   70873  13.252 3.59e-06 ***
## sex:maj       2   49006   24503   4.582  0.01120 *  
## Residuals   228 1219389    5348                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Note: the pooled standard deviation \(s_p\) is the square root of the MSE. In this case it is: 73.1. This is the best estimate for the population standard deviation \(\sigma\) of the residuals.

Checking Conditions

  1. Independence and Bias We will assume this data is a SRS of college students who were interested in majoring in Computer Science.

  2. Constant Variance In two-way ANOVA, the tapply function is useful to calculate the standard deviations for each group. As long as the biggest standard deviation is not more than twice the smallest, we are probably okay using two-way ANOVA.

tapply(myData$satm,list(myData$sex,myData$maj),sd)
##         CS       EO        O
## F 78.55710 62.59090 74.52194
## M 72.47776 61.73711 85.92055
  1. Normality Each group should have a roughly normal distribution. Typically in ANOVA it is reasonable to just check the residuals.
qqnorm(resid(myAOV))
qqline(resid(myAOV))

That looks pretty normal, and since normality is the least important of our three conditions, we can be fairly confident in the ANOVA results.

Estimating Main Effects

The built-in Tukey’s Honest Significant Differences function in R (TukeyHSD) is a quick and easy way to not only estimate the size of the main effects, but also make confidence intervals for the differences.

TukeyHSD(myAOV,which="sex")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = satm ~ sex * maj, data = myData)
## 
## $sex
##          diff       lwr       upr     p adj
## M-F -26.77778 -45.61797 -7.937583 0.0055388
TukeyHSD(myAOV,which="maj")
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = satm ~ sex * maj, data = myData)
## 
## $maj
##            diff        lwr       upr     p adj
## EO-CS  18.94872  -8.677539  46.57498 0.2400579
## O-CS  -40.08974 -67.716001 -12.46349 0.0021096
## O-EO  -59.03846 -86.664719 -31.41220 0.0000028