2014 NBA Salaries

The data frame NBA in the snippet below contains data on all of the NBA players from the 2014 season. The salary numbers are in millions of dollars.

NBA = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/NBASalaries2014.txt")
head(NBA)
##   firstname  lastname         team conference   salary
## 1       Joe   Johnson BrooklynNets    eastern 23.18079
## 2     Deron  Williams BrooklynNets    eastern 19.75446
## 3     Brook     Lopez BrooklynNets    eastern 15.71900
## 4     Kevin   Garnett BrooklynNets    eastern 12.00000
## 5   Jarrett      Jack BrooklynNets    eastern  6.30000
## 6     Mirza Teletovic BrooklynNets    eastern  3.36810
East = subset(NBA,conference == "eastern")
West = subset(NBA,conference == "western")
mean(East$salary)-mean(West$salary)
## [1] -0.0001861654

Because this is data from the whole population, there is no need to use statistical inference to learn about salaries; we can just calculate the relevant parameters directly. The difference above is in millions of dollars, so multiplying by one million gives the gap in dollars:

\[\mu_{eastern}-\mu_{western} = -\$186.17.\]

  1. Does the salary data have any outliers? How can you tell?

  2. Make histograms to display the distributions of salaries for both conferences. What do you notice?

Simulating Confidence

How often are t-distribution confidence intervals correct? Let’s simulate many random samples from our two populations and make 95% confidence intervals for \(\mu_{eastern} - \mu_{western}\) using the 2-sample t-distribution confidence interval formula: \[\bar{x}_1-\bar{x}_2 \pm t^* \sqrt{\frac{s_1^2}{N_1}+\frac{s_2^2}{N_2}}.\] In the loop below, I simulate taking samples of size \(N_1 = N_2 = N = 20\) from each conference and check whether or not the true difference in population means falls inside each interval.

N = 20                                          # sample size drawn from each conference
trueGap = mean(East$salary)-mean(West$salary)   # population difference (a fixed parameter)
results = logical(10000)                        # one TRUE/FALSE per simulated interval
for (i in 1:10000) {
  EastSample = sample(East$salary,N)            # sampled without replacement
  WestSample = sample(West$salary,N)
  dF = N-1                                      # conservative degrees of freedom
  tstar = qt(0.975,dF)                          # critical value for 95% confidence
  margin = tstar*sqrt(sd(EastSample)^2/N+sd(WestSample)^2/N)
  upper = mean(EastSample)-mean(WestSample) + margin
  lower = mean(EastSample)-mean(WestSample) - margin
  containsTrueGap = lower <= trueGap & upper >= trueGap
  results[i] = containsTrueGap
}

And here are the results:

accuracy = sum(results)/length(results)
accuracy
## [1] 0.9641
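
Because the accuracy is itself estimated from 10000 simulated intervals, it carries some simulation noise. Here is a rough sketch of how much, assuming the intervals really did have 95% coverage and treating the 10000 trials as independent. Under those assumptions the standard error of the simulated proportion is \(\sqrt{0.95 \cdot 0.05/10000}\). The variable names are only illustrative.

```r
reps = 10000                      # number of simulated intervals in the loop above
nominal = 0.95                    # coverage the intervals are supposed to have
observed = 0.9641                 # accuracy reported above

# Standard error of a simulated proportion if the true coverage were 95%.
simSE = sqrt(nominal*(1 - nominal)/reps)
simSE

# Gap between observed and nominal coverage, measured in standard errors.
(observed - nominal)/simSE
```

The standard error is roughly 0.002, so the observed accuracy sits more than six standard errors above 95%, which suggests the gap is not just simulation noise.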

It would also be nice to know what one of these confidence intervals looks like. The last interval computed in the loop above ranged from -3.1321345 to 2.5151145 million dollars.
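
To look at a single interval outside the loop, the formula can be wrapped in a small helper. This is only a sketch: `twoSampleInterval` is a hypothetical name, and the two vectors are made-up numbers in millions of dollars, not values from the NBA data. R's built-in `t.test` is shown for comparison. It uses the same center and standard error but Welch's approximation for the degrees of freedom, so its interval is never wider than the conservative one.

```r
# Hypothetical helper: conservative two-sample t interval for mean(x) - mean(y).
twoSampleInterval = function(x, y, conf = 0.95) {
  dF = min(length(x), length(y)) - 1            # conservative degrees of freedom
  tstar = qt(1 - (1 - conf)/2, dF)              # critical value
  se = sqrt(sd(x)^2/length(x) + sd(y)^2/length(y))
  center = mean(x) - mean(y)
  c(lower = center - tstar*se, upper = center + tstar*se)
}

# Made-up samples in millions of dollars (not taken from the NBA data).
x = c(1.2, 0.9, 5.5, 2.0, 14.7, 3.1, 0.8, 7.4)
y = c(2.3, 1.1, 0.9, 9.8, 4.0, 1.5, 6.2, 0.7)

conservative = twoSampleInterval(x, y)
welch = t.test(x, y)$conf.int                   # Welch interval from base R
conservative
welch
```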

Copy the code above into your own R markdown file, and answer the following questions:

  1. What happens if the sample size \(N\) is larger (say around 100)? Are the confidence intervals 95% accurate?

  2. What happens if the sample size \(N\) is smaller (like around 5)?

  3. Why is the accuracy so far away from 95% when \(N\) is small?

  4. Why is the accuracy so far away from 95% when \(N\) is large?
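
One way to explore these questions is to wrap the simulation in a function of the sample size. The following is a sketch, not the code used above: `coverageRate` is a hypothetical name, and the two populations are randomly generated stand-ins so that the block runs on its own. Substitute `East$salary` and `West$salary` to use the real data.

```r
# Hypothetical helper: fraction of simulated 95% intervals that contain the
# true difference in population means, for samples of size N from each group.
coverageRate = function(pop1, pop2, N, reps = 2000) {
  trueGap = mean(pop1) - mean(pop2)
  tstar = qt(0.975, N - 1)                      # conservative degrees of freedom
  hits = replicate(reps, {
    s1 = sample(pop1, N)                        # sampled without replacement
    s2 = sample(pop2, N)
    margin = tstar*sqrt(sd(s1)^2/N + sd(s2)^2/N)
    center = mean(s1) - mean(s2)
    center - margin <= trueGap & trueGap <= center + margin
  })
  mean(hits)
}

# Stand-in populations (right-skewed, made up for illustration only).
set.seed(1)
pop1 = rexp(220, rate = 0.25)
pop2 = rexp(230, rate = 0.25)

# Estimated coverage for a small, a medium, and a large sample size.
sapply(c(5, 20, 100), function(N) coverageRate(pop1, pop2, N))
```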