myData = read.csv('polpartyfull.csv')
The head() function allows us to preview the data in the csv file without printing all of the data on the notebook. This way we can quickly check that the data loaded correctly, without filling up the page with useless information.
head(myData)
| Education | Politics | Count | |
|---|---|---|---|
| 1 | None | StrongDemocrat | 63 |
| 2 | None | WeakDemocrat | 45 |
| 3 | None | NearDemocrat | 34 |
| 4 | None | Independent | 87 |
| 5 | None | NearRepublican | 19 |
| 6 | None | WeakRepublican | 20 |
Notice that the rows in myData do not correspond to individual people, but rather to groups of people who all have the same educational background and political views. The count column tells us how many people we have in in group. Because of this, we cannot use the table() function to convert this data into an R table. We need to use the xtabs() function below, which has a somewhat strange input format:
myTable = xtabs(myData$Count~myData$Education+myData$Politics)
myTable
myData$Politics
myData$Education Independent NearDemocrat NearRepublican Otherparty
Bachelor 31 57 35 11
Graduate 26 20 10 5
Highschool 156 132 78 16
JrCollege 22 19 20 4
None 87 34 19 2
myData$Politics
myData$Education StrongDemocrat StrongRepublican WeakDemocrat WeakRepublican
Bachelor 56 43 44 76
Graduate 54 22 29 27
Highschool 185 98 183 147
JrCollege 32 13 30 33
None 63 25 45 20
mosaicplot(myTable,color=T)
Note that this mosaic plot is a mess, mostly because the education levels and political offiliations are not in order. Also, the text on the left side overlaps, so it is very hard to read. In the cells below, I will order the political offiliation and education level data to fix this problem.
education = factor(myData$Education,levels=c("None","Highschool","JrCollege","Bachelor","Graduate"),ordered=T)
politics = factor(myData$Politics,levels=c("StrongDemocrat","WeakDemocrat","NearDemocrat","Otherparty","Independent","NearRepublican",
"WeakRepublican","StrongRepublican"),ordered=T)
The new variables, education and politics have the same information as myData$Education and myData$Politics, respectively, but now the categorical variables have an order that will be respected when they are put into a table or a mosaic plot.
myTable = xtabs(myData$Count~education+politics)
myTable
politics
education StrongDemocrat WeakDemocrat NearDemocrat Otherparty Independent
None 63 45 34 2 87
Highschool 185 183 132 16 156
JrCollege 32 30 19 4 22
Bachelor 56 44 57 11 31
Graduate 54 29 20 5 26
politics
education NearRepublican WeakRepublican StrongRepublican
None 19 20 25
Highschool 78 147 98
JrCollege 20 33 13
Bachelor 35 76 43
Graduate 10 27 22
mosaicplot(myTable,color=T,las=1)
Now we can see what is going on. It appears that as education level increases from high school drop outs to high school graduates to college graduates, the proportion of people who are on the more conservative end of the spectrum increases a little and the proportion of people on the Democrat end decreases, but the trend reverses when we look at people with graduate degrees. They are more likely than anyone else to be in the StrongDemocrat category. Now that we can see that there is an apparent association between education and political views, lets do a chi-squared test to test the strength of the association. Notice the las=1 option in the mosaicplot() command. This sets all plot labels to horizontal.
chisq.test(myTable)
Warning message:
In chisq.test(myTable): Chi-squared approximation may be incorrect
Pearson's Chi-squared test
data: myTable
X-squared = 111.01, df = 28, p-value = 7.753e-12
Since the p-value is astronomically small, we can be very confident that the association we observed in the mosaic plot above is not a random fluke, but instead truely reflects an association between education level and political views in the population. Notice the warning message above. That warning is probably due to the fact that the counts in the Otherparty column are so low (two below 5). We can remove that column from the table and recompute the chi-squared test to make sure.
myTable[,-c(4)]
| StrongDemocrat | WeakDemocrat | NearDemocrat | Independent | NearRepublican | WeakRepublican | StrongRepublican | |
|---|---|---|---|---|---|---|---|
| None | 63 | 45 | 34 | 87 | 19 | 20 | 25 |
| Highschool | 185 | 183 | 132 | 156 | 78 | 147 | 98 |
| JrCollege | 32 | 30 | 19 | 22 | 20 | 33 | 13 |
| Bachelor | 56 | 44 | 57 | 31 | 35 | 76 | 43 |
| Graduate | 54 | 29 | 20 | 26 | 10 | 27 | 22 |
chisq.test(myTable[,-c(4)])
Pearson's Chi-squared test
data: myTable[, -c(4)]
X-squared = 104.63, df = 24, p-value = 4.832e-12
Now there is no warning and the p-value is roughly the same order of magnitude. Also there are relatively few respondents who identified with another political party, so removing those people from the data is completely reasonable.