Two-way ANOVA Example: Region and Religion vs. Income

In the 2014 General Social Survey, respondents were asked questions about many topics, including their religion and income. Below is a two-way analysis of variance that looks at how two factors affect income. The two factors are: region (Northeast, South, Midwest, or West) and religion (Catholic, Protestant, or Other).

myData = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring18/math222/data/RegionReligionIncome.csv")
head(myData)

##   Household_Income    Region   Religion Income_Number
## 1   75000 to 89999 Northeast   Catholic         82500
## 2   150000 or over Northeast   Catholic        250000
## 3   40000 to 49999 Northeast Protestant         45000
## 4   150000 or over Northeast   Catholic        250000
## 5          Refused Northeast   Catholic            NA
## 6   150000 or over Northeast Protestant        250000

dim(myData)

## [1] 2538    4

Once again, this data has a lot of NA entries. So we will use the na.omit() command to clean things up.

myData=na.omit(myData)
head(myData)

##   Household_Income    Region   Religion Income_Number
## 1   75000 to 89999 Northeast   Catholic         82500
## 2   150000 or over Northeast   Catholic        250000
## 3   40000 to 49999 Northeast Protestant         45000
## 4   150000 or over Northeast   Catholic        250000
## 6   150000 or over Northeast Protestant        250000
## 8   75000 to 89999 Northeast   Catholic         82500

dim(myData)

## [1] 2314    4

It looks like we lost a little over 200 individuals from the sample (about 10%) when we omitted the individuals with NA entries. That is a potential source of bias, and we might want to take a closer look at that later. To see how many individuals are in each group, use the table() command:

table(myData$Region,myData$Religion)

##            
##             Catholic Other Protestant
##   Midwest        120   155        249
##   Northeast      134   145        112
##   South          173   178        496
##   West           121   252        179

Now let’s look at how (and whether) our factors influence income.

par(mfrow=c(1,2))
boxplot(Income_Number~Region,data=myData,ylab='Income',col='gray')
boxplot(Income_Number~Religion,data=myData,ylab='Income',col='gray')

Looking at Interactions

We can make two interaction plots for this data. Both interaction plots are ways of showing the group means for this data:

tapply(myData$Income_Number,list(myData$Region,myData$Religion),FUN=mean)

##           Catholic    Other Protestant
## Midwest   70702.08 61629.03   84159.64
## Northeast 80684.70 83343.10   65694.20
## South     64235.55 70228.93   59636.59
## West      60202.48 69063.49   65736.03

interaction.plot(myData$Religion,myData$Region,myData$Income_Number,main='Interaction Plot: Religion vs. Income',xlab='Religion',ylab='Mean Income',col=c(1,2,3,4),lwd=3)

This first plot shows a complicated relationshop between religion and income (that changes based on region). So it appears that their are some interesting interactions. Below is the other interaction plot.

interaction.plot(myData$Region,myData$Religion,myData$Income_Number,,main='Interaction Plot: Region vs. Income',xlab='Region',ylab='Mean Income',col=c(1,2,3,4),lwd=3)

Testing Significance

To test the significance of the factors and the interactions, you can make an ANOVA table as shown below. Note that the two factors are attached by a * symbol in the aov() function.

myAOV = aov(Income_Number~Region*Religion,data = myData)
summary(myAOV)

##                   Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Region             3 7.990e+10 2.663e+10   6.116 0.000385 ***
## Religion           2 1.987e+09 9.935e+08   0.228 0.796019    
## Region:Religion    6 9.215e+10 1.536e+10   3.527 0.001772 ** 
## Residuals       2302 1.002e+13 4.355e+09                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Region clearly has a significant effect on income, but religion does not. The interaction between region and religion is significant, however. It appears that religion actually does matter, but in different regions it has the opposite effect (protestants are wealthier in the west, but poorer in the northeast and south) which masks the overall significance.

Checking Conditions

Before we trust these results from the ANOVA table, we should check that the conditions of the analysis of variance are satistified.

Independence - Data is a SRS of a large population.
Constant variance - Each group has the same (population) standard deviation \(\sigma\).
Normality - Each group has a normal distribution.

Note that these assumptions are listed in the order of importance. Assumption 1 is more important than assumption 3.

We can be pretty sure that this is close to being a SRS of the people in the United States, since the General Social Survey is a very reputable survey run by people who know a lot about statistics. Since the population is the entire United States, so we don’t have to worry about small population size causing the sample to lose independence. There is the problem that approximately 10% of the sample refused to answer some of the questions. That could be a source of bias, but I don’t see why it would differ based on region or even on religion. It might be worth looking into though.

To check the second assumption, we can use the tapply() function to find the standard deviation for each subgroup in our study:

tapply(myData$Income_Number,list(myData$Region,myData$Religion),sd)

##           Catholic    Other Protestant
## Midwest   60319.39 59621.43   76522.09
## Northeast 73418.60 77886.68   68138.84
## South     70879.39 69892.39   56320.61
## West      56246.46 70059.87   56845.68

The gap between the largest and smallest sample standard deviation much less than a factor of 2, so assumption #2 is probably okay. Let us turn to assumption #3.

qqnorm(resid(myAOV))
qqline(resid(myAOV))

That is not normal at all! But if you look at the sample sizes, even the smallest group contains 112 individuals, so the right skew in the income data is probably not going to be an issue. Even though individual income is very right-skewed, the Central Limit Theorem is so strong with samples this large that the group means \(\bar{x}_{ij}\) are probably all approximately normally distributed.