Baseball Positions at Bat

Are baseball players at some positions better at bat than players at other positions? For example, are outfielders better batters than catchers? The data file below contains batting statistics for 327 MLB players.

mlb = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring17/math222/data/bat10.txt")
head(mlb)
##        name team position  AB   H HR RBI   OBP   AVG
## 1  I Suzuki  SEA       OF 680 214  6  43 0.359 0.315
## 2   D Jeter  NYY       IF 663 179 10  67 0.340 0.270
## 3   M Young  TEX       IF 656 186 21  91 0.330 0.284
## 4  J Pierre  CWS       OF 651 179  1  47 0.341 0.275
## 5   R Weeks  MIL       IF 651 175 29  83 0.366 0.269
## 6 M Scutaro  BOS       IF 632 174 11  56 0.333 0.275

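Here the variables are: name (player), team, position (C = catcher, DH = designated hitter, IF = infielder, OF = outfielder), AB (at bats), H (hits), HR (home runs), RBI (runs batted in), OBP (on-base percentage), and AVG (batting average). Side-by-side boxplots show how OBP varies by position.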

# Boxplots of OBP by position (works whether position is stored as a factor or as character)
boxplot(OBP~position,data=mlb,col='gray')

It looks like designated hitters (DH) have a better OBP than other players, but is the difference statistically significant?

ANOVA in R

R has a built-in function, aov, for doing analysis of variance (ANOVA).

results = aov(OBP~position,data=mlb)
summary(results)
##              Df Sum Sq  Mean Sq F value Pr(>F)
## position      3 0.0076 0.002519   1.994  0.115
## Residuals   323 0.4080 0.001263
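
The F value and p-value in the table can be reproduced by hand from the mean squares and degrees of freedom. The sketch below copies the rounded numbers from the output above (the variable names are only illustrative), so the results may differ slightly in the last digit.

```r
# Values copied from the summary(results) table above
df_between <- 3         # Df for position (4 groups - 1)
df_within  <- 323       # Df for residuals (327 players - 4 groups)
ms_between <- 0.002519  # Mean Sq for position
ms_within  <- 0.001263  # Mean Sq for residuals

# F statistic: between-group mean square divided by within-group mean square
f_stat <- ms_between / ms_within

# p-value: upper-tail area of the F distribution beyond f_stat
p_value <- pf(f_stat, df_between, df_within, lower.tail = FALSE)

round(f_stat, 3)
round(p_value, 3)
```

Since the p-value of about 0.115 is larger than 0.05, the differences in mean OBP between positions are not statistically significant at the 5% level.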

Checking Conditions for ANOVA

The three mathematical assumptions for ANOVA are:

  1. Independence - Ideally, your sample should be a simple random sample (SRS) of less than 10% of the population.
  2. Normality - Check histograms or qqplots for each group (less important if the sample sizes are large).
  3. Constant Variance - If the largest sample standard deviation is no more than twice the smallest sample standard deviation, that is a good sign.
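
One possible way to carry out the normality check is to draw a normal QQ plot for each group. The helper below is a minimal sketch, not part of the original analysis: the function name qq_by_group and its arguments are hypothetical, and it assumes a data frame with one numeric column and one grouping column, like mlb.

```r
# Draw one normal QQ plot per group and return the group sizes invisibly.
# qq_by_group, df, value, and group are hypothetical names for illustration.
qq_by_group <- function(df, value = "OBP", group = "position") {
  groups <- split(df[[value]], df[[group]])      # one numeric vector per group
  old_par <- par(mfrow = c(1, length(groups)))   # panels side by side
  on.exit(par(old_par))                          # restore plot layout afterwards
  for (g in names(groups)) {
    qqnorm(groups[[g]], main = g)                # sample vs. normal quantiles
    qqline(groups[[g]])                          # reference line through the quartiles
  }
  invisible(lengths(groups))
}

# Run on the data from this page if it has already been loaded
if (exists("mlb")) qq_by_group(mlb)
```

Points that fall roughly along the reference line in each panel suggest that the normality condition is reasonable for that group.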

We should look at the sample sizes for each group.

aggregate(OBP~position,data=mlb,FUN=length)
##   position OBP
## 1        C  39
## 2       DH  14
## 3       IF 154
## 4       OF 120

As you can see, the overall sample size is large (327 players), although the DH group has only 14. The box-and-whisker plots for each group don’t show a lot of skew, so normality probably won’t be an issue.

To check constant variance, look at the standard deviations for each group.

aggregate(OBP~position,data=mlb,FUN=sd)
##   position        OBP
## 1        C 0.04513175
## 2       DH 0.03603669
## 3       IF 0.03709504
## 4       OF 0.02944394
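
The rule of thumb from the list of conditions can be checked directly by dividing the largest standard deviation by the smallest. This short sketch copies the values from the output above; the variable names are only illustrative.

```r
# Group standard deviations copied from the aggregate() output above
sds <- c(C = 0.04513175, DH = 0.03603669, IF = 0.03709504, OF = 0.02944394)

# Ratio of the largest to the smallest sample standard deviation
sd_ratio <- max(sds) / min(sds)

round(sd_ratio, 2)
sd_ratio <= 2   # TRUE means the rule of thumb is satisfied
```

The ratio is about 1.53, which is below 2, so the constant variance condition looks reasonable for these data.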