The data frame NBA in the snippet below contains data on all of the NBA players from the 2014 season. The salary numbers are in millions of dollars.
NBA = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/NBASalaries2014.txt")
head(NBA)
##   firstname  lastname         team conference   salary
## 1       Joe   Johnson BrooklynNets    eastern 23.18079
## 2     Deron  Williams BrooklynNets    eastern 19.75446
## 3     Brook     Lopez BrooklynNets    eastern 15.71900
## 4     Kevin   Garnett BrooklynNets    eastern 12.00000
## 5   Jarrett      Jack BrooklynNets    eastern  6.30000
## 6     Mirza Teletovic BrooklynNets    eastern  3.36810
East = subset(NBA,conference == "eastern")
West = subset(NBA,conference == "western")
mean(East$salary)-mean(West$salary)
## [1] -0.0001861654
Because this is data from the whole population, there is no need to use statistical inference to learn about salaries; we can calculate the relevant parameters directly. Converting the difference above from millions of dollars to dollars:
\[\mu_{eastern}-\mu_{western} = -\$186.17.\]
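The step from the R output to the dollar figure is a single multiplication, since salaries are recorded in millions of dollars. A minimal check, with the gap typed in by hand from the output above rather than recomputed (the variable names are introduced here only for illustration):

```r
# Difference in mean salaries as printed above, in millions of dollars
gap_millions = -0.0001861654
# Convert to dollars and round to the nearest cent
gap_dollars = round(gap_millions * 1e6, 2)
gap_dollars
```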
Does the salary data have any outliers? How can you tell?
Make histograms to display the distributions of salaries for both conferences. What do you notice?
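One possible starting point for both questions is sketched below. It assumes the `East` and `West` data frames from above; so that the snippet also runs on its own, it falls back to made-up, right-skewed stand-in salaries when they are not loaded. The helper `count_outliers` and the 1.5 times IQR rule are illustrative choices, not the only way to flag outliers.

```r
# Stand-in data (hypothetical, not real salaries), used only if East and West are not loaded
if (!exists("East") || !exists("West")) {
  set.seed(1)
  East = data.frame(salary = rlnorm(200, meanlog = 0.8, sdlog = 1))
  West = data.frame(salary = rlnorm(200, meanlog = 0.8, sdlog = 1))
}

# Count values beyond the boxplot whiskers (1.5 * IQR rule)
count_outliers = function(x) length(boxplot.stats(x)$out)
count_outliers(East$salary)
count_outliers(West$salary)

# Side-by-side histograms with the same bins, so the two shapes are comparable
breaks = seq(0, ceiling(max(East$salary, West$salary)), by = 1)
par(mfrow = c(1, 2))
hist(East$salary, breaks = breaks, main = "Eastern", xlab = "Salary (millions of dollars)")
hist(West$salary, breaks = breaks, main = "Western", xlab = "Salary (millions of dollars)")
par(mfrow = c(1, 1))
```

Using the same `breaks` for both plots keeps the horizontal scales identical, which makes it easier to compare the two conferences by eye.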
How often are t-distribution confidence intervals correct? Let’s simulate many random samples from our two populations and make 95% confidence intervals for \(\mu_{eastern} - \mu_{western}\) using the 2-sample t-distribution confidence interval formula: \[\bar{x}_1-\bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{N_1}+\frac{s_2^2}{N_2}}.\] In the loop below, I repeatedly draw samples of size \(N=20\) from each conference (so \(N_1 = N_2 = N\)), use \(N-1\) degrees of freedom for \(t^*\), and record whether or not the true difference in population means is in the interval.
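results = c()
N = 20
dF = N-1                # degrees of freedom used for the critical value
tstar = qt(0.975,dF)    # critical value for 95% confidence
# true difference in population means
trueGap = mean(East$salary)-mean(West$salary)
for (i in 1:10000) {
  # draw N salaries from each conference (without replacement)
  EastSample = sample(East$salary,N)
  WestSample = sample(West$salary,N)
  # margin of error from the 2-sample formula
  margin = tstar*sqrt(sd(EastSample)^2/N+sd(WestSample)^2/N)
  upper = mean(EastSample)-mean(WestSample) + margin
  lower = mean(EastSample)-mean(WestSample) - margin
  # record whether this interval captured the true difference
  containsTrueGap = lower <= trueGap & upper >= trueGap
  results = c(results,containsTrueGap)
}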
And here are the results:
accuracy = sum(results)/length(results)
accuracy
## [1] 0.9641
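Because the accuracy is itself estimated from 10,000 simulated intervals, it carries some simulation error. A rough sketch of how large that error is, using the usual standard error for a proportion (the value 0.9641 is copied from the output above; `reps` and `se` are names introduced here):

```r
accuracy = 0.9641   # copied from the output above
reps = 10000        # number of simulated intervals in the loop
# Standard error of a proportion estimated from reps independent trials
se = sqrt(accuracy*(1-accuracy)/reps)
se
# How many standard errors the observed accuracy sits from the nominal 95%
(accuracy-0.95)/se
```

The standard error comes out a little under 0.002, so the gap between 0.9641 and 0.95 is several standard errors wide and is unlikely to be simulation noise alone.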
It would also be nice to know what one of these confidence intervals looks like. The last interval computed in the loop above ranged from -3.1321345 to 2.5151145 million dollars.
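To look at a single interval outside the loop, the formula can be wrapped in a small function. This is only a sketch: `t_interval` is a name introduced here, the two input vectors are made-up stand-ins for `EastSample` and `WestSample`, and the degrees of freedom follow the loop's choice of one less than the (smaller) sample size. R's built-in `t.test` uses the Welch approximation for the degrees of freedom instead, so its interval has the same center and is never wider.

```r
# 95% interval for a difference in means, using min(n1, n2) - 1 degrees of freedom
t_interval = function(x1, x2, level = 0.95) {
  n1 = length(x1)
  n2 = length(x2)
  tstar = qt(1 - (1 - level)/2, min(n1, n2) - 1)
  margin = tstar*sqrt(sd(x1)^2/n1 + sd(x2)^2/n2)
  gap = mean(x1) - mean(x2)
  c(lower = gap - margin, upper = gap + margin)
}

# Hypothetical stand-in samples (not real salaries)
set.seed(222)
x1 = rlnorm(20, meanlog = 0.8, sdlog = 1)
x2 = rlnorm(20, meanlog = 0.8, sdlog = 1)

ours = t_interval(x1, x2)
welch = t.test(x1, x2)$conf.int   # built-in interval with Welch degrees of freedom
ours
welch
```

With the real data, `t_interval(EastSample, WestSample)` should reproduce the `lower` and `upper` values from the last pass through the loop.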
Copy the code above into your own R markdown file, and answer the following questions:
What happens if the sample size \(N\) is larger (say around 100)? Are the confidence intervals 95% accurate?
What happens if the sample size \(N\) is smaller (like around 5)?
Why is the accuracy so far away from 95% when \(N\) is small?
Why is the accuracy so far away from 95% when \(N\) is large?
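For these questions, it may be convenient to wrap the simulation in a function of the sample size instead of editing the loop each time. The sketch below does that; `coverage` and its arguments are names introduced here, and the two populations are made-up, right-skewed stand-ins so that the code runs on its own. To use it with the real data, pass `East$salary` and `West$salary` instead.

```r
# Fraction of simulated 95% intervals that contain the true difference in means
coverage = function(pop1, pop2, N, reps = 2000) {
  trueGap = mean(pop1) - mean(pop2)
  tstar = qt(0.975, N - 1)
  hits = replicate(reps, {
    s1 = sample(pop1, N)   # sampling without replacement, as in the loop above
    s2 = sample(pop2, N)
    margin = tstar*sqrt(sd(s1)^2/N + sd(s2)^2/N)
    gap = mean(s1) - mean(s2)
    gap - margin <= trueGap & gap + margin >= trueGap
  })
  mean(hits)
}

# Hypothetical stand-in populations (not the NBA data)
set.seed(2014)
popA = rlnorm(250, meanlog = 0.8, sdlog = 1)
popB = rlnorm(250, meanlog = 0.8, sdlog = 1)

# Estimated accuracy for a small, a medium, and a large sample size
sapply(c(5, 20, 100), function(N) coverage(popA, popB, N))
```

The default of 2,000 repetitions keeps the run time short; raising `reps` to 10,000 matches the loop above at the cost of a longer wait.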