The following data is from a study of undergraduate students who initially planned to major in computer science. These students all began the same year and were classified according to gender. We also look at the declared major at the end of the second year. We use the labels CS for computer science majors, EO for engineering and other science majors, and O for other majors.
Here we examine how these students' SAT math scores differ by gender and by the type of major they switched to.
myData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/majors.csv")
head(myData)
## sex maj satm satv hsm hss hse gpa
## 1 F CS 640 530 8 6 8 2.35
## 2 F CS 670 600 9 10 7 2.08
## 3 F CS 600 400 8 8 7 3.21
## 4 F CS 570 480 7 7 6 2.34
## 5 F CS 510 530 6 8 8 1.40
## 6 F CS 750 610 10 9 9 1.43
dim(myData)
## [1] 234 8
As you can see, this data set includes 234 students, all of whom initially planned to major in CS. Here are the counts in each subgroup:
table(myData$sex,myData$maj)
##
## CS EO O
## F 39 39 39
## M 39 39 39
The fact that all of the subgroups have the same number of students isn't a coincidence: this was a study with a two-factor design. The two controlled factors are the sex of the student and their eventual major. We want to see how these factors relate to SAT math scores.
Here are side-by-side boxplots showing the five-number summaries for each group:
par(mfrow=c(1,2))
boxplot(satm~sex,data=myData,ylab='Math SAT',col='gray')
boxplot(satm~maj,data=myData,ylab='Math SAT',col='gray')
And here are the mean SAT math scores:
tapply(myData$satm,list(myData$sex,myData$maj),mean)
## CS EO O
## F 629.6923 617.8974 589.5128
## M 582.3590 632.0513 542.3590
We can make two interaction plots for this data, one with each factor on the horizontal axis.
interaction.plot(myData$sex,myData$maj,myData$satm,main='Interaction Plot',xlab='Gender',ylab='Math SAT',col=c(1,2,3,4),lwd=3)
interaction.plot(myData$maj,myData$sex,myData$satm,main='Interaction Plot',xlab='New Major',ylab='Math SAT',col=c(1,2,3,4),lwd=3)
Next, we fit a two-way ANOVA model that includes an interaction between sex and major:
myAOV = aov(satm~sex*maj,data=myData)
summary(myAOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 41947 41947 7.843 0.00554 **
## maj 2 141746 70873 13.252 3.59e-06 ***
## sex:maj 2 49006 24503 4.582 0.01120 *
## Residuals 228 1219389 5348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: the pooled standard deviation \(s_p\) is the square root of the mean squared error (MSE). In this case \(s_p = \sqrt{5348} \approx 73.1\). This is our best estimate of the population standard deviation \(\sigma\) of the residuals.
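We can verify this arithmetic directly in R (a quick sketch using the MSE printed in the ANOVA table above):

```r
# MSE from the ANOVA table above; s_p is its square root
mse <- 5348
s_p <- sqrt(mse)
round(s_p, 1)   # 73.1
```

With the fitted model in hand, `sigma(myAOV)` extracts the same residual standard error without reading the table.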
Independence and Bias
We will assume this data is an SRS (simple random sample) of college students who were interested in majoring in computer science.
Constant Variance
In two-way ANOVA, the tapply function is useful for calculating the standard deviation of each group. As long as the biggest standard deviation is no more than twice the smallest, we are probably okay using two-way ANOVA.
tapply(myData$satm,list(myData$sex,myData$maj),sd)
## CS EO O
## F 78.55710 62.59090 74.52194
## M 72.47776 61.73711 85.92055
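Plugging the group standard deviations from the table above into the rule of thumb (a quick sketch using the printed, rounded values):

```r
# Group standard deviations read off the tapply output above
sds <- c(78.56, 62.59, 74.52, 72.48, 61.74, 85.92)
max(sds) / min(sds)   # about 1.39, comfortably under 2
```

So the constant-variance condition looks fine here.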
Normality
qqnorm(resid(myAOV))
qqline(resid(myAOV))
The QQ-plot looks fairly straight, so the residuals are approximately normal. Since normality is also the least important of our three conditions, we can be fairly confident in the ANOVA results.
The built-in Tukey’s Honest Significant Differences function in R (TukeyHSD) is a quick and easy way not only to estimate the sizes of the main effects, but also to construct confidence intervals for the differences.
TukeyHSD(myAOV,which="sex")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = satm ~ sex * maj, data = myData)
##
## $sex
## diff lwr upr p adj
## M-F -26.77778 -45.61797 -7.937583 0.0055388
TukeyHSD(myAOV,which="maj")
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = satm ~ sex * maj, data = myData)
##
## $maj
## diff lwr upr p adj
## EO-CS 18.94872 -8.677539 46.57498 0.2400579
## O-CS -40.08974 -67.716001 -12.46349 0.0021096
## O-EO -59.03846 -86.664719 -31.41220 0.0000028
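As a sanity check, because the design is balanced, each Tukey difference for maj is just a difference of unweighted marginal means from the table of group means above. A quick sketch using those printed (rounded) values:

```r
# Cell means copied from the earlier tapply table (rounded)
means <- matrix(c(629.69, 617.90, 589.51,
                  582.36, 632.05, 542.36),
                nrow = 2, byrow = TRUE,
                dimnames = list(c("F", "M"), c("CS", "EO", "O")))
# Average over sexes to get the marginal mean for each major
marginal <- colMeans(means)
marginal["EO"] - marginal["CS"]   # about 18.95, matching the Tukey table
marginal["O"]  - marginal["EO"]   # about -59.04
```

The confidence intervals show that students who switched to other (non-science) majors had significantly lower SAT math scores than either of the science groups, while the CS and EO groups were not significantly different.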