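When we looked at confidence intervals using the formula \(\displaystyle \bar{x} \pm z^* \frac{s}{\sqrt{N}}\), we saw that the sample size mattered a lot. Even when the population has a normal distribution, the formula doesn’t work well for small sample sizes. The simulation below draws 10,000 samples of size \(N = 4\) from the Bellville population and records how often the interval with \(z^* = 1.96\) contains the population mean.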

ages = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring19/math222/Examples/townies.csv")
bell = ages$bellville

N=4
zstar=1.96
popMean = mean(bell) # Population mean of Bellville

nSim = 10000
results = logical(nSim) # results[i] will be TRUE if the i-th simulated confidence interval contains popMean.
tvals = numeric(nSim) # tvals[i] will hold the i-th simulated t-value.

for (i in 1:nSim) {
  mySample = sample(bell,N)
  xbar = mean(mySample)
  s = sd(mySample)
  t = (xbar-popMean)/(s/sqrt(N))
  lower = xbar - zstar*s/sqrt(N)
  upper = xbar + zstar*s/sqrt(N)
  results[i] = (lower < popMean) & (upper > popMean)
  tvals[i] = t
}
table(results)
## results
## FALSE  TRUE 
##  1466  8534

As the table above shows, our confidence interval contained the population mean only about 85% of the time, well short of the 95% that \(z^* = 1.96\) is supposed to deliver. The problem is that we are using the normal distribution to model the distribution of the t-values: \[t = \frac{\bar{x} - \mu}{s/\sqrt{N}}.\]

As the normal quantile plot below shows, t-values aren’t normal!

qqnorm(tvals)
qqline(tvals)
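
To put a number on how far the t-values stray from the normal model, here is a minimal sketch that does not need the data file. It assumes a hypothetical normal population with mean 40 and standard deviation 10 (made-up values, not the Bellville data), and the names `simT`, `mu`, and `sigma` are introduced only for this illustration. It also compares the simulated cutoff with `qt`, R's quantile function for the t distribution with \(N - 1\) degrees of freedom, which is an addition here rather than something the text above has established.

```r
set.seed(1)   # only so the sketch is reproducible
N = 4
mu = 40       # hypothetical population mean (assumption)
sigma = 10    # hypothetical population standard deviation (assumption)

# Simulate 10,000 t-values from samples of size N drawn from a normal population.
simT = replicate(10000, {
  x = rnorm(N, mean = mu, sd = sigma)
  (mean(x) - mu) / (sd(x) / sqrt(N))
})

# Share of t-values outside +/- 1.96.
# This would be close to 0.05 if the t-values followed a standard normal distribution.
mean(abs(simT) > 1.96)

# Cutoff that actually captures the middle 95% of the simulated t-values.
quantile(abs(simT), 0.95)

# For comparison: the 97.5th percentile of the t distribution with N - 1 = 3
# degrees of freedom, which is about 3.18.
qt(0.975, df = N - 1)
```

If this sketch behaves as expected, the share of t-values outside \(\pm 1.96\) should be far above 5%, which is consistent with the roughly 15% miss rate in the table above, and the cutoff needed to capture 95% of the t-values should be much larger than 1.96.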