High Bridge 2018 Half-Marathon Results

results = read.csv("http://people.hsc.edu/faculty-staff/blins/classes/spring19/math222/Examples/highbridge2018.csv")
head(results)
##   place bib gender age state    time minutes
## 1     1  66      M  35    VA 1:10:28   70.47
## 2     2  87      M  29    VA 1:18:08   78.13
## 3     3 112      F  32    VA 1:25:47   85.78
## 4     4 116      M  32    VA 1:27:02   87.03
## 5     5  32      M  38    VA 1:27:14   87.23
## 6     6 115      F  31    VA 1:28:15   88.25
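
How text columns such as gender and state are stored depends on how the file is read. The sketch below uses a small made-up CSV string (the name csv_text and its values are illustrative assumptions, not the race data) to show the stringsAsFactors argument of read.csv(). Since R 4.0.0 the default is FALSE, so text columns arrive as character vectors unless you ask for factors.

```r
# Toy CSV text standing in for the race file (values are made up).
csv_text = "place,gender,state,minutes
1,M,VA,70.5
2,F,NC,85.8
3,M,VA,87.0"

# stringsAsFactors = FALSE (the default since R 4.0.0):
# text columns are read as character vectors.
toy_chr = read.csv(text = csv_text, stringsAsFactors = FALSE)
class(toy_chr$state)

# stringsAsFactors = TRUE: text columns are read as factors,
# which is how R represents categorical data.
toy_fct = read.csv(text = csv_text, stringsAsFactors = TRUE)
class(toy_fct$state)
levels(toy_fct$state)
```

A character column can also be converted after the fact with factor(), for example factor(toy_chr$state).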

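The variables in the results data frame are place, bib, gender, age, state, time (the finishing time as h:mm:ss), and minutes (the same finishing time in decimal minutes).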

Questions

  1. How many people ran the High Bridge half-marathon in 2018?

  2. Make a histogram of the runners’ times using the hist() function. What is the shape of the distribution? Would you say it is skewed left, skewed right, or symmetric? Based on the shape of the distribution, which would you expect to be larger: the mean or the median?

  3. Use the summary() function to find the mean and the five-number summary of the race times. Was your prediction about the mean vs. the median from the last problem correct?

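  4. Make a bar graph to show the number of runners from each state. To make a bar graph, you can use the plot() function on a vector of categorical data (called a factor in R). If R has read the column as character data, which is the default for read.csv() in R 4.0 and later, convert it with factor() first.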

  5. Use the function class() to determine the data type that R is using for each of the variables in the data frame above. How many different data types are there in this data frame?

  6. What percent of runners were male/female?

  7. Make two different barplots, one to show the number of male vs. female runners, the other to show the percents.

  8. Make side by side boxplots to compare the times of male vs. female runners. What do you notice?

  9. How does the plot() function work in each of the following situations?
    1. The input is a vector of numerical data.

    2. The input is a vector of categorical data.

    3. The first input is a vector of numerical data, and the second is a vector of categorical data.

    4. The first input is a vector of numerical data, and the second is also a vector of numerical data.

    5. The first input is a vector of categorical data, and the second is also a vector of categorical data.

    6. The first input is a vector of categorical data, and the second is a vector of numerical data.

  10. Add another column to the results data frame that gives each runner’s average speed (in miles per hour) for the race. (A half-marathon is 13.1 miles.) Then use the summary() function to give a quick summary of the speeds of the runners.
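
As a hint for the last question, here is a minimal sketch of adding a computed column to a data frame. It uses a made-up data frame called toy (the names toy, runner, and hours are illustrative assumptions) and converts minutes to hours rather than computing a speed.

```r
# Toy data frame with made-up values; the column names are illustrative.
toy = data.frame(
  runner  = c("A", "B", "C"),
  minutes = c(90, 120, 150)
)

# Assigning to a column name that does not exist yet creates a new column.
# Here the new column converts each time from minutes to hours.
toy$hours = toy$minutes / 60

names(toy)          # "runner" "minutes" "hours"
summary(toy$hours)  # minimum 1.5, median 2, mean 2, maximum 2.5
```

The same pattern works with any arithmetic on existing columns, since R applies the operation to every row at once.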

Subsets of Data Frames

What if we are only interested in the times of men? To get a subset of a data frame, you can use the subset() function. Here is an example:

men = subset(results, results$gender == 'M')

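The first argument of the subset() function is a data frame and the second argument is a logical expression, which is evaluated for each row; only the rows where it is TRUE are kept. Note that R is case-sensitive, so the column name must be spelled exactly as it appears in the data frame. You can use the comparison operators ==, !=, <, >, <=, >= to create a logical expression. You can also use the symbol & to combine two logical expressions. The result is TRUE if both expressions are TRUE and FALSE otherwise. To get the logical or, which is TRUE if at least one of the expressions is TRUE, use the symbol |.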
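
Here is a short sketch of these operators in action. It uses a small made-up data frame called toy (the values and the names women_toy, fast_men, and extremes are illustrative assumptions) so that you can check each result by hand. Inside subset(), columns can be referred to by name without the toy$ prefix.

```r
# Toy data frame (made-up values) standing in for the race results.
toy = data.frame(
  gender  = c("M", "F", "M", "F", "M"),
  age     = c(35, 29, 62, 41, 17),
  minutes = c(70, 85, 120, 95, 110)
)

# One condition: keep the rows where gender is "F".
women_toy = subset(toy, gender == "F")

# Two conditions joined with &: both must be TRUE for a row to be kept.
fast_men = subset(toy, gender == "M" & minutes < 100)

# Two conditions joined with |: at least one must be TRUE.
extremes = subset(toy, minutes < 80 | minutes > 115)

nrow(women_toy)          # 2
nrow(fast_men)           # 1
nrow(extremes)           # 2
mean(women_toy$minutes)  # 90
```

Each result is itself a data frame, so functions such as nrow(), mean(), and summary() work on it just as they do on the original.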

Use this idea to answer the following additional questions.

  1. How many men ran the half-marathon?

  2. What was the average time for men?

  3. Make a data frame with information about the women who ran the half-marathon. What was the average time for women?

  4. Make a data frame for all runners over the age of 50. What was their average time?

  5. How would you make a data frame for runners in their twenties only?

  6. What about a data frame with runners under 20 or over 50?