In the 2014 General Social Survey, respondents were asked questions about many topics, including their religion and income. Below is a two-way analysis of variance that looks at how two factors affect income. The two factors are: region (Northeast, South, Midwest, or West) and religion (Catholic, Protestant, or Other).
myData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring18/math222/data/RegionReligionIncome.csv")
head(myData)
## Household_Income Region Religion Income_Number
## 1 75000 to 89999 Northeast Catholic 82500
## 2 150000 or over Northeast Catholic 250000
## 3 40000 to 49999 Northeast Protestant 45000
## 4 150000 or over Northeast Catholic 250000
## 5 Refused Northeast Catholic NA
## 6 150000 or over Northeast Protestant 250000
dim(myData)
## [1] 2538 4
Once again, this data has a lot of NA entries. So we will use the na.omit() command to clean things up.
myData=na.omit(myData)
head(myData)
## Household_Income Region Religion Income_Number
## 1 75000 to 89999 Northeast Catholic 82500
## 2 150000 or over Northeast Catholic 250000
## 3 40000 to 49999 Northeast Protestant 45000
## 4 150000 or over Northeast Catholic 250000
## 6 150000 or over Northeast Protestant 250000
## 8 75000 to 89999 Northeast Catholic 82500
dim(myData)
## [1] 2314 4
It looks like we lost a little over 200 individuals from the sample (about 10%) when we omitted the individuals with NA entries. That is a potential source of bias, and we might want to take a closer look at that later. To see how many individuals are in each group, use the table()
command:
table(myData$Region,myData$Religion)
##
## Catholic Other Protestant
## Midwest 120 155 249
## Northeast 134 145 112
## South 173 178 496
## West 121 252 179
Now let’s look at how (and whether) our factors influence income.
par(mfrow=c(1,2))
boxplot(Income_Number~Region,data=myData,ylab='Income',col='gray')
boxplot(Income_Number~Religion,data=myData,ylab='Income',col='gray')
We can make two interaction plots for this data. Both interaction plots are ways of showing the group means for this data:
tapply(myData$Income_Number,list(myData$Region,myData$Religion),FUN=mean)
## Catholic Other Protestant
## Midwest 70702.08 61629.03 84159.64
## Northeast 80684.70 83343.10 65694.20
## South 64235.55 70228.93 59636.59
## West 60202.48 69063.49 65736.03
interaction.plot(myData$Religion,myData$Region,myData$Income_Number,main='Interaction Plot: Religion vs. Income',xlab='Religion',ylab='Mean Income',col=c(1,2,3,4),lwd=3)
This first plot shows a complicated relationshop between religion and income (that changes based on region). So it appears that their are some interesting interactions. Below is the other interaction plot.
interaction.plot(myData$Region,myData$Religion,myData$Income_Number,,main='Interaction Plot: Region vs. Income',xlab='Region',ylab='Mean Income',col=c(1,2,3,4),lwd=3)
To test the significance of the factors and the interactions, you can make an ANOVA table as shown below. Note that the two factors are attached by a *
symbol in the aov()
function.
myAOV = aov(Income_Number~Region*Religion,data = myData)
summary(myAOV)
## Df Sum Sq Mean Sq F value Pr(>F)
## Region 3 7.990e+10 2.663e+10 6.116 0.000385 ***
## Religion 2 1.987e+09 9.935e+08 0.228 0.796019
## Region:Religion 6 9.215e+10 1.536e+10 3.527 0.001772 **
## Residuals 2302 1.002e+13 4.355e+09
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Region clearly has a significant effect on income, but religion does not. The interaction between region and religion is significant, however. It appears that religion actually does matter, but in different regions it has the opposite effect (protestants are wealthier in the west, but poorer in the northeast and south) which masks the overall significance.
Before we trust these results from the ANOVA table, we should check that the conditions of the analysis of variance are satistified.
Note that these assumptions are listed in the order of importance. Assumption 1 is more important than assumption 3.
We can be pretty sure that this is close to being a SRS of the people in the United States, since the General Social Survey is a very reputable survey run by people who know a lot about statistics. Since the population is the entire United States, so we don’t have to worry about small population size causing the sample to lose independence. There is the problem that approximately 10% of the sample refused to answer some of the questions. That could be a source of bias, but I don’t see why it would differ based on region or even on religion. It might be worth looking into though.
To check the second assumption, we can use the tapply() function to find the standard deviation for each subgroup in our study:
tapply(myData$Income_Number,list(myData$Region,myData$Religion),sd)
## Catholic Other Protestant
## Midwest 60319.39 59621.43 76522.09
## Northeast 73418.60 77886.68 68138.84
## South 70879.39 69892.39 56320.61
## West 56246.46 70059.87 56845.68
The gap between the largest and smallest sample standard deviation much less than a factor of 2, so assumption #2 is probably okay. Let us turn to assumption #3.
qqnorm(resid(myAOV))
qqline(resid(myAOV))
That is not normal at all! But if you look at the sample sizes, even the smallest group contains 112 individuals, so the right skew in the income data is probably not going to be an issue. Even though individual income is very right-skewed, the Central Limit Theorem is so strong with samples this large that the group means \(\bar{x}_{ij}\) are probably all approximately normally distributed.