Two-way ANOVA Example: Region and Religion vs. Income

In the 2014 General Social Survey, respondents were asked questions about many topics, including their religion and income. Below is a two-way analysis of variance that looks at how two factors affect income. The two factors are: region (Northeast, South, Midwest, or West) and religion (Catholic, Protestant, or Other).

myData = read.csv("Data/RegionReligionIncome2.csv")
head(myData)
##   Household_Income    Region   Religion Income_Number
## 1   75000 to 89999 Northeast   Catholic         82500
## 2   150000 or over Northeast   Catholic        250000
## 3   40000 to 49999 Northeast Protestant         45000
## 4   150000 or over Northeast   Catholic        250000
## 5          Refused Northeast   Catholic            NA
## 6   150000 or over Northeast Protestant        250000
dim(myData)
## [1] 2538    4

Once again, this data has a lot of NA entries. So we will use the na.omit() command to clean things up.

myData=na.omit(myData)
head(myData)
##   Household_Income    Region   Religion Income_Number
## 1   75000 to 89999 Northeast   Catholic         82500
## 2   150000 or over Northeast   Catholic        250000
## 3   40000 to 49999 Northeast Protestant         45000
## 4   150000 or over Northeast   Catholic        250000
## 6   150000 or over Northeast Protestant        250000
## 8   75000 to 89999 Northeast   Catholic         82500
dim(myData)
## [1] 2314    4

It looks like we lost a little over 200 individuals from the sample (about 10%) when we omitted the individuals with NA entries. That is a potential source of bias, and we might want to take a closer look at that later. To see how many individuals are in each group, use the table() command:

table(myData$Region,myData$Religion)
##            
##             Catholic Other Protestant
##   Midwest        120   155        249
##   Northeast      134   145        112
##   South          173   178        496
##   West           121   252        179

Now let’s look at how (and whether) our factors influence income.

par(mfrow=c(1,2))
boxplot(Income_Number~Region,data=myData,ylab='Income',col='gray')
boxplot(Income_Number~Religion,data=myData,ylab='Income',col='gray')

Looking at Interactions

We can make two interaction plots for this data.

interaction.plot(myData$Religion,myData$Region,myData$Income_Number,main='Interaction Plot: Religion vs. Income',xlab='Religion',ylab='Mean Income',col=c(1,2,3,4),lwd=3)

This first plot shows a complicated relationshop between religion and income (that changes based on region). So it appears that their are some interesting interactions. Below is the other interaction plot.

interaction.plot(myData$Region,myData$Religion,myData$Income_Number,,main='Interaction Plot: Region vs. Income',xlab='Region',ylab='Mean Income',col=c(1,2,3,4),lwd=3)

Testing Significance

To test the significance of the factors and the interactions, you can make an ANOVA table as shown below. Note that the two factors are attached by a * symbol in the aov() function.

myAOV = aov(Income_Number~Region*Religion,data = myData)
summary(myAOV)
##                   Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Region             3 7.990e+10 2.663e+10   6.116 0.000385 ***
## Religion           2 1.987e+09 9.935e+08   0.228 0.796019    
## Region:Religion    6 9.215e+10 1.536e+10   3.527 0.001772 ** 
## Residuals       2302 1.002e+13 4.355e+09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Region clearly has a significant effect on income, but religion does not. The interaction between region and religion is significant, however. It appears that religion actually does matter, but in different regions it has the opposite effect (protestants are wealthier in the west, but poorer in the northeast and south) which masks the overall significance.

Checking Conditions

Before we trust these results from the ANOVA table, we should check that the conditions of the analysis of variance are satistified.

  1. Data is a SRS of a large population
  2. Constant variance
  3. Normality

Note that these assumptions are listed in the order of importance. Assumption 1 is more important than assumption 3.

We can be pretty sure that this is close to being valid, since the general social survey is a very reputable survey run by people who know a lot about statistics. The population is the entire United States, so we don’t have to worry about small population size causing the sample to lose independence. There is the problem that approximately 10% of the sample refused to answer some of the questions. That could be a source of bias, but I don’t see why it would differ based on region or even on religion. It might be worth looking into though.

To check the second assumption, we can use the tapply() function to find the standard deviation for each subgroup in our study:

tapply(myData$Income_Number,list(myData$Region,myData$Religion),sd)
##           Catholic    Other Protestant
## Midwest   60319.39 59621.43   76522.09
## Northeast 73418.60 77886.68   68138.84
## South     70879.39 69892.39   56320.61
## West      56246.46 70059.87   56845.68

The gap between the largest and smallest sample standard deviation much less than a factor of 2, so assumption #2 is probably okay. Let us turn to assumption #3.

qqnorm(resid(myAOV))

That is not normal at all! Let’s see which group(s) might be problematical.

par(mfrow=c(4,3))
hist(subset(myData,Region=='Midwest' & Religion=='Catholic')$Income_Number,main='Midwest Catholic',col='gray',xlab='Income')
hist(subset(myData,Region=='Midwest' & Religion=='Other')$Income_Number,main='Midwest Other',col='gray',xlab='Income')
hist(subset(myData,Region=='Midwest' & Religion=='Protestant')$Income_Number,main='Midwest Protestant',col='gray',xlab='Income')
hist(subset(myData,Region=='Northeast' & Religion=='Catholic')$Income_Number,main='Northeast Catholic',col='gray',xlab='Income')
hist(subset(myData,Region=='Northeast' & Religion=='Other')$Income_Number,main='Northeast Other',col='gray',xlab='Income')
hist(subset(myData,Region=='Northeast' & Religion=='Protestant')$Income_Number,main='Northeast Protestant',col='gray',xlab='Income')
hist(subset(myData,Region=='South' & Religion=='Catholic')$Income_Number,main='South Catholic',col='gray',xlab='Income')
hist(subset(myData,Region=='South' & Religion=='Other')$Income_Number,main='South Other',col='gray',xlab='Income')
hist(subset(myData,Region=='South' & Religion=='Protestant')$Income_Number,main='South Protestant',col='gray',xlab='Income')
hist(subset(myData,Region=='West' & Religion=='Catholic')$Income_Number,main='West Catholic',col='gray',xlab='Income')
hist(subset(myData,Region=='West' & Religion=='Other')$Income_Number,main='West Other',col='gray',xlab='Income')
hist(subset(myData,Region=='West' & Religion=='Protestant')$Income_Number,main='West Protestant',col='gray',xlab='Income')

The data in each cell is clearly skewed right, but that really won’t be a problem. Because the sample sizes in each cell are so large (over 100), the sample means for each cell will definitely have come from an approximately normal distribution due to the central limit theorem. Therefore using ANOVA in this situation is probably fine.