marriages = read.csv("http://people.hsc.edu/faculty-staff/blins/StatsExamples/marriageAges.txt")
head(marriages)
## Husband_Age Wife_Age
## 1 25 22
## 2 25 32
## 3 51 50
## 4 25 25
## 5 38 33
## 6 30 27
dim(marriages)
## [1] 24 2
diff=marriages$Husband_Age-marriages$Wife_Age
hist(diff,col='gray',main='Age Differences in the Sample', xlab='Age Difference (Husband - Wife)')
To generate a bootstrap distribution, we need to resample from the diff
vector many times, each time we need to take a sample of the same size (N=24), but with replacement (replace=T
). The we calculate the mean, since that is the statistic that we care about.
bootDist = c()
for (i in 1:5000) {
bootSamp = sample(diff,24,replace=T) # bootSamp is a bootstrap sample from the original data.
bootStat = mean(bootSamp) # bootStat is the bootstrap statistic we care about. In this case it is the mean.
bootDist = c(bootDist,bootStat) # Each time we iterate the for-loop, we add our new bootSamp to the bootstrap distribution.
}
hist(bootDist,col='gray',main='Bootstrap Distribution for the Mean Age Difference')
We can also resample other statistics, like correlation. Below, we compute a bootstrap distribution for the correlation in the data set. Notice how we sample a random collection of 24 rows from the variable marriages
which is a special type of variable in R called a data frame.
corDist = c()
for (i in 1:5000) {
bootSamp = marriages[sample(1:24,24,replace=T),]
bootStat = cor(bootSamp$Husband_Age,bootSamp$Wife_Age)
corDist = c(corDist,bootStat)
}
hist(corDist,col='gray')
As you can see, the boostrap distribution for the correlation coefficient is very left-skewed.