The following data is from a study of undergraduate students who initially planned to major in computer science. These students all began the same year and were classified according to gender. We also look at the declared major at the end of the second year. We use the labels CS for computer science majors, EO for engineering and other science majors, and O for other majors.
Here we look at differences in SAT math scores for these student vs. their gender and which type of major they switched to.
myData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/majors.csv")
head(myData)
## sex maj satm satv hsm hss hse gpa
## 1 F CS 640 530 8 6 8 2.35
## 2 F CS 670 600 9 10 7 2.08
## 3 F CS 600 400 8 8 7 3.21
## 4 F CS 570 480 7 7 6 2.34
## 5 F CS 510 530 6 8 8 1.40
## 6 F CS 750 610 10 9 9 1.43
dim(myData)
## [1] 234 8
table(myData$sex,myData$maj)
##
## CS EO O
## F 39 39 39
## M 39 39 39
tapply(myData$satm,list(myData$sex,myData$maj),mean)
## CS EO O
## F 629.6923 617.8974 589.5128
## M 582.3590 632.0513 542.3590
Now let’s look at how (and whether) our factors influence income.
par(mfrow=c(1,2))
boxplot(satm~sex,data=myData,ylab='Math SAT',col='gray')
boxplot(satm~maj,data=myData,ylab='Math SAT',col='gray')
We can make two interaction plots for this data.
interaction.plot(myData$sex,myData$maj,myData$satm,main='Interaction Plot',xlab='Gender',ylab='Math SAT',col=c(1,2,3,4),lwd=3)
interaction.plot(myData$maj,myData$sex,myData$satm,main='Interaction Plot',xlab='New Major',ylab='Math SAT',col=c(1,2,3,4),lwd=3)
myAOV = aov(satm~sex*maj,data=myData)
summary(myAOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## sex 1 41947 41947 7.843 0.00554 **
## maj 2 141746 70873 13.252 3.59e-06 ***
## sex:maj 2 49006 24503 4.582 0.01120 *
## Residuals 228 1219389 5348
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Note: the pooled standard deviation \(s_p\) is the square root of the MSE. In this case it is: 73.1. This is the best estimate for the population standard deviation \(\sigma\) of the residuals. Before we use this, we should check that the variances really are the same in each group:
tapply(myData$satm,list(myData$sex,myData$maj),sd)
## CS EO O
## F 78.55710 62.59090 74.52194
## M 72.47776 61.73711 85.92055