Bootstrap Example: Farmville February Rainfall

We we look at rainfall data for Farmville, VA from the Southeastern Regional Climate Center.

rainfall=read.csv("http://people.hsc.edu/faculty-staff/blins/StatsExamples/rainfall.csv")
dim(rainfall)
## [1] 81 14
head(rainfall)
##   year  Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec total
## 1 1931 2.12 2.15 3.76 3.94 7.92 4.05 6.11 5.89 3.98 0.50 0.15 3.52 44.09
## 2 1932 6.12 1.89 8.07 1.89 5.02 2.33 0.82 0.85 4.10 8.54 6.24 5.78 51.65
## 3 1933 3.76 2.93 2.16 4.86 5.96 4.08 7.70 4.10 0.88 0.30 1.12 3.49 41.34
## 4 1934 1.14 4.07 4.91 1.62 5.26 3.79 3.62 4.13 5.07 2.36 4.30 3.07 43.34
## 5 1935 5.01 3.36 5.06 4.93 3.52 3.41 4.79 2.71 5.33 1.66 3.97 2.47 46.22
## 6 1936 8.21 4.57 5.88 3.00 0.43 5.37 1.80 7.86 4.87 1.48 0.46 4.73 48.66

The data includes the monthly rainfall totals in Farmville from 1931 to 2012. Here is what the data for February looks like:

hist(rainfall$Feb,col='gray',main='February Rainfall Totals',xlab='Inches')

Suppose we wanted to know how variable rainfall totals are. The right way to compute standard deviation from a sample is to use the sample standard deviation: \[s = \sqrt{\frac{\sum (x_i - \bar{x})}{N-1}}.\] The standard deviation of the February rainfall totals is 1.4790047 inches per year. That is only a point estimate. We might like to make a confidence interval for the parameter \(\sigma_\text{Feb}\), that is, the true population standard deviation for Farmville February rainfall amounts. Since we don’t know how to make confidence intervals for standard deviations, we can use a bootstrap distribution.

boot.dist = c()
for (i in 1:10000) {
  boot.sample = sample(rainfall$Feb,81,replace=T)
  boot.stat = sd(boot.sample)
  boot.dist = c(boot.dist,boot.stat)
}
hist(boot.dist,col='gray',main='Bootstrap Distribution for s',xlab='Inches')

This looks roughly normal. Let’s check normality even more carefully with a qqplot:

qqnorm(boot.dist)
qqline(boot.dist)

That looks like a good fit to me. When the bootstrap distribution is normal, you can make a better bootstrap confidence interval using the t-distribution. The equation is:

\[\text{statistic} \pm t^* \text{SE}_{boot}\] where \(\text{SE}_{boot}\) is the standard deviation of the bootstrap distribution (it’s called the bootstrap standard error). You calculate it with the command sd(boot.dist) and in this example we get 0.1072397. Our statistic is \(s\) which is 1.4790047 inches per year, and the critical t-value for a 95% confidence interval with 80 degrees of freedom is given by qt(0.975,80).

s = sd(rainfall$Feb)
t = qt(0.975,80)
SEb = sd(boot.dist)
c(s-t*SEb,s+t*SEb)
## [1] 1.265591 1.692419

So we are 95% certain that the true population standard deviation of Farmville February rainfall amounts is between 1.2655909 and 1.6924185 inches of rain per year.