Bootstrap Example: Marriage Age Data

marriages = read.csv("http://people.hsc.edu/faculty-staff/blins/StatsExamples/marriageAges.txt")
head(marriages)

##   Husband_Age Wife_Age
## 1          25       22
## 2          25       32
## 3          51       50
## 4          25       25
## 5          38       33
## 6          30       27

dim(marriages)

## [1] 24  2

diff=marriages$Husband_Age-marriages$Wife_Age
hist(diff,col='gray',main='Age Differences in the Sample', xlab='Age Difference (Husband - Wife)')

Bootstrapping the Mean

To generate a bootstrap distribution, we need to resample from the diff vector many times, each time we need to take a sample of the same size (N=24), but with replacement (replace=T). The we calculate the mean, since that is the statistic that we care about.

bootDist = c()
for (i in 1:5000) {
  bootSamp = sample(diff,24,replace=T) # bootSamp is a bootstrap sample from the original data.
  bootStat = mean(bootSamp) # bootStat is the bootstrap statistic we care about. In this case it is the mean.
  bootDist = c(bootDist,bootStat) # Each time we iterate the for-loop, we add our new bootSamp to the bootstrap distribution.
}
hist(bootDist,col='gray',main='Bootstrap Distribution for the Mean Age Difference')

Bootstrapping other Statistics

We can also resample other statistics, like correlation. Below, we compute a bootstrap distribution for the correlation in the data set. Notice how we sample a random collection of 24 rows from the variable marriages which is a special type of variable in R called a data frame.

corDist = c()
for (i in 1:5000) {
  bootSamp = marriages[sample(1:24,24,replace=T),]
  bootStat = cor(bootSamp$Husband_Age,bootSamp$Wife_Age)
  corDist = c(corDist,bootStat)
}
hist(corDist,col='gray')

As you can see, the boostrap distribution for the correlation coefficient is very left-skewed.