Math 121 - Week 1 Notes

Monday, August 31

Today we talked about the best way to plot data.

Categorical data:
- Pie Charts
- Bar Charts (bar charts are usually better because they make comparisons easier)
Numerical data:
- Stemplots
- Histograms
- Boxplots

We showed how to use Excel to make pie charts and bar charts. We also showed how to make stemplots and histograms. Then we talked about what statisticians mean when they talk about the distribution of data. We used our class weights as a first example.

Distributions of Data

The distribution of a numerical variable encompasses three things:

Shape (is the data skewed left/right or symmetrical?)
Center (what is the mean and/or median?)
Spread (how spread out are the numbers?)

Robustness

A statistic is robust if it is not strongly affected by outliers or skew. The median is robust; the mean is not.

Important concept: If a distribution has a long right tail (is right skewed), then the mean will be larger than the median. If a distribution is skewed left, the opposite holds: the mean will be smaller than the median.
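Here is a small Python sketch of this idea. It is not something we did in class, and the numbers are made up for illustration (they are not the class weights). It compares the mean and median of a symmetric list, a list with one very large value, and a list with one very small value.

```python
from statistics import mean, median

# Made-up values for illustration only (not data from class).
symmetric = [2, 3, 4, 5, 6]
right_skewed = [2, 3, 4, 5, 60]   # one large value creates a long right tail
left_skewed = [-50, 3, 4, 5, 6]   # one small value creates a long left tail

for name, data in [("symmetric", symmetric),
                   ("right skewed", right_skewed),
                   ("left skewed", left_skewed)]:
    # The median stays at 4 in every list; only the mean moves.
    print(f"{name}: mean = {mean(data):.1f}, median = {median(data)}")
```

In all three lists the median is 4, while the mean is pulled toward the extreme value. That is what it means for the median to be robust and the mean not.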

Workshop

We finished by doing this workshop.

Wednesday, September 2

Today we talked about boxplots and different ways to measure spread in a quantitative variable.

A boxplot is a picture showing the 5-number summary. These five numbers are:

Min
Quartile 1 (Q1)
Median
Quartile 3 (Q3)
Max

You find the quartiles Q1 and Q3 by finding the medians of the bottom half and the top half of the data, respectively. We made a boxplot for the 2018 High Bridge Half-Marathon times. Then, we did a workshop comparing homicide rates in states with and without the death penalty: Death Penalty Workshop.
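The quartile rule above can be sketched in Python. This is only a sketch under one assumption: when the number of data values is odd, the middle value is left out of both halves. Other textbooks and software (including Excel) may use slightly different rules. The function name `five_number_summary` and the sample values are made up for illustration; they are not the half-marathon times.

```python
from statistics import median

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule.

    Assumes at least two values. When the count is odd, the middle
    value is excluded from both halves (an assumed convention).
    """
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]       # bottom half of the sorted data
    upper = data[n - half:]   # top half of the sorted data
    return (data[0], median(lower), median(data), median(upper), data[-1])

# Made-up values for illustration only.
example = [7, 1, 3, 9, 5, 11, 13]
print(five_number_summary(example))
```

A boxplot draws the box from Q1 to Q3 with a line at the median, so these five numbers are all you need to sketch one by hand.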

We also discussed how to measure spread in data. Three ways to do this are:

Range \(( = \text{max} - \text{min})\)
Interquartile Range (IQR) \(( = Q_3 - Q_1)\)
Standard Deviation

Note that the range is not robust, since it only depends on two numbers: the max and the min. IQR is very robust, and standard deviation (which we will discuss more later) is somewhere in between in terms of robustness.
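To see the difference in robustness between the range and the IQR, here is a short Python sketch with made-up numbers. It assumes the median-of-halves rule for quartiles (leaving out the middle value when the count is odd), and the helper names `quartiles`, `data_range`, and `iqr` are hypothetical. The second list is the same as the first except that the largest value has been replaced by an outlier.

```python
from statistics import median

def quartiles(values):
    """Q1 and Q3 as medians of the bottom and top halves (assumed convention)."""
    data = sorted(values)
    half = len(data) // 2
    return median(data[:half]), median(data[len(data) - half:])

def data_range(values):
    """Range = max - min."""
    return max(values) - min(values)

def iqr(values):
    """Interquartile range = Q3 - Q1."""
    q1, q3 = quartiles(values)
    return q3 - q1

# Made-up values for illustration only.
original = [1, 3, 5, 7, 9, 11, 13]
with_outlier = [1, 3, 5, 7, 9, 11, 130]   # largest value replaced by an outlier

for label, data in [("original", original), ("with outlier", with_outlier)]:
    print(f"{label}: range = {data_range(data)}, IQR = {iqr(data)}")
```

The single outlier changes the range from 12 to 129, while the IQR stays at 8 in both lists.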

Thursday, September 3

Today we talked about standard deviation. We calculated one example by hand using the formula, but then I promised that you would never have to do that again (at least in my class)! Instead, we will use the =STDEV() function in Excel.
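For anyone curious what the by-hand calculation looks like as code, here is a minimal Python sketch. It assumes the sample standard deviation formula (dividing by n - 1), which is what Excel's =STDEV() computes. The function name `sample_std_dev` and the data values are made up for illustration. Python's built-in `statistics.stdev` uses the same formula, so it serves as a check.

```python
from math import sqrt
from statistics import stdev

def sample_std_dev(values):
    """Sample standard deviation: sqrt( sum((x - mean)^2) / (n - 1) )."""
    n = len(values)
    xbar = sum(values) / n                               # step 1: the mean
    squared_devs = [(x - xbar) ** 2 for x in values]     # step 2: squared deviations
    return sqrt(sum(squared_devs) / (n - 1))             # step 3: average (n - 1) and root

# Made-up values for illustration only.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(f"by hand: {sample_std_dev(data):.4f}")
print(f"built in: {stdev(data):.4f}")
```

The two printed values should agree, which is the same reason we can trust =STDEV() instead of doing the arithmetic ourselves.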

How to Plot Two Variables

We also talked about how to plot two variables at a time. Depending on whether the variables are numerical or categorical, these are the best options:

Two Numerical Variables: Use a scatterplot.
One Numerical and One Categorical Variable: Use side-by-side boxplots.
Two Categorical Variables: Use a segmented bar graph.

Friday, September 4

Today we talked about contingency tables (also known as two-way tables). We did a couple of examples, and looked at how the tables can be used to see if there is an association between two categorical variables. The key is to compare either the row proportions or the column proportions.

Our first example was from a question on the General Social Survey, which asked a random sample of 977 Americans: "In general, how do you feel about your time? Would you say that you always feel rushed, sometimes feel rushed, or almost never feel rushed?" The results are broken down by gender in this contingency table:

	Male	Female	total
Always	116	188	304
Sometimes	229	284	513
Never	82	78	160
total	427	550	977

To see if there is an association, you can look at the row proportions, which are the row counts divided by the row totals, or the column proportions, which are the column counts divided by the column totals. You can tell the difference by how they are worded. The percent of men who always feel rushed is a column proportion, since it is a percent of men, and men are a column. In this case, it is 27.2%, which is 116 divided by the column total 427. The corresponding column proportion for women is 188/550 = 34.2%. So women were somewhat more likely to say they always feel rushed. That means the two variables, gender and rushedness, are associated, not independent.
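The column proportions for the whole table can be computed with a few lines of Python. This is just a sketch of the arithmetic using the counts from the table above; the variable names (`counts`, `column_totals`, `column_props`) are my own and not part of anything we used in class.

```python
# Counts from the General Social Survey table above.
counts = {
    "Always":    {"Male": 116, "Female": 188},
    "Sometimes": {"Male": 229, "Female": 284},
    "Never":     {"Male": 82,  "Female": 78},
}

# Column totals: add up each gender's counts over all three answers.
column_totals = {
    gender: sum(row[gender] for row in counts.values())
    for gender in ("Male", "Female")
}

# Column proportions: each count divided by its column total.
column_props = {
    answer: {gender: row[gender] / column_totals[gender] for gender in row}
    for answer, row in counts.items()
}

for answer, props in column_props.items():
    shares = ", ".join(f"{gender}: {p:.1%}" for gender, p in props.items())
    print(f"{answer:<10} {shares}")
```

If gender and rushedness were independent, the male and female percentages in each row would be roughly equal. Comparing them row by row is the same comparison a segmented bar graph shows visually.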

We also talked about segmented bar graphs as a way to visualize row or column proportions. These are easy to make in Excel.

Segmented Bar Graph

We defined relative risk, which is a ratio of two corresponding row or column percentages and tells you how many times higher one percent is than the other. Here are two example problems we did in class.

A study by the NHTSA looked at people who were injured in car accidents. For each person, they looked at whether the injury was fatal or not, and whether the person was wearing a seatbelt during the accident. Here were the results:

	Seatbelt	No Seatbelt	total
Nonfatal	412368	162527	574895
Fatal	510	1601	2111
total	412878	164128	577006

1. Which fractions would be better to compare to see if there is an association between seatbelt use and survival? (i) 510/2111 vs. 1601/2111, (ii) 510/412878 vs. 1601/164128, or (iii) 510/577006 vs. 1601/577006?
2. What is the relative risk of dying in a car accident without a seatbelt vs. with one?
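One way to work the relative risk question is sketched below in Python. It assumes we compare the fatality rate within each seatbelt column (the fractions in option ii), since each of those is a percent of one seatbelt group. The function name `relative_risk` is hypothetical; the counts come from the table above.

```python
def relative_risk(events_a, total_a, events_b, total_b):
    """Ratio of the risk in group A to the risk in group B."""
    return (events_a / total_a) / (events_b / total_b)

# Counts from the NHTSA table above.
fatal_no_belt, total_no_belt = 1601, 164128
fatal_belt, total_belt = 510, 412878

risk_no_belt = fatal_no_belt / total_no_belt   # fatality rate without a seatbelt
risk_belt = fatal_belt / total_belt            # fatality rate with a seatbelt
rr = relative_risk(fatal_no_belt, total_no_belt, fatal_belt, total_belt)

print(f"risk without seatbelt: {risk_no_belt:.2%}")
print(f"risk with seatbelt:    {risk_belt:.2%}")
print(f"relative risk:         {rr:.1f}")
```

With these counts the relative risk comes out to about 7.9, so among people injured in these accidents, the fatality rate without a seatbelt was roughly eight times the rate with one.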

A 1993 study looked at whether the anti-retroviral drug AZT could help prevent pregnant mothers who were HIV-positive from passing the virus on to their babies. The study was a randomized controlled experiment. 183 pregnant women were randomly assigned to a placebo group, while 180 received AZT. When their babies were born, 40 out of the 183 babies in the placebo group were HIV-positive, while only 13 of the babies in the AZT group were HIV-positive.
1. Make a contingency table showing the results of this experiment
2. What are the explanatory and response variables?
3. Are the variables associated or independent?
4. What is the relative risk of a baby getting HIV from their mother in the placebo group vs. in the AZT group?
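Here is a rough Python sketch of how the table and the relative risk for the AZT problem could be set up. The counts come straight from the problem statement, with the HIV-negative counts found by subtraction. The layout (treatment groups as columns) and the variable names are my own choices, not necessarily how we wrote it on the board.

```python
# Counts given in the problem statement.
placebo_total, placebo_hiv = 183, 40
azt_total, azt_hiv = 180, 13

# Contingency table: treatment group in the columns, baby's HIV status in the rows.
table = {
    "HIV-positive": {"Placebo": placebo_hiv, "AZT": azt_hiv},
    "HIV-negative": {"Placebo": placebo_total - placebo_hiv,
                     "AZT": azt_total - azt_hiv},
}

print(f"{'':<14}{'Placebo':>8}{'AZT':>6}{'total':>7}")
for status, row in table.items():
    print(f"{status:<14}{row['Placebo']:>8}{row['AZT']:>6}{sum(row.values()):>7}")
print(f"{'total':<14}{placebo_total:>8}{azt_total:>6}{placebo_total + azt_total:>7}")

# Column proportions: the risk of HIV within each treatment group.
risk_placebo = placebo_hiv / placebo_total
risk_azt = azt_hiv / azt_total
rr = risk_placebo / risk_azt

print(f"risk in placebo group: {risk_placebo:.1%}")
print(f"risk in AZT group:     {risk_azt:.1%}")
print(f"relative risk (placebo vs. AZT): {rr:.2f}")
```

The relative risk works out to about 3.0, meaning the percent of HIV-positive babies in the placebo group was about three times the percent in the AZT group.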

Finally, we gave two examples of Simpson’s paradox, which is when an apparent association between two variables reverses when you divide the data into smaller groups.
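To make the definition concrete, here is a Python sketch with invented counts. These are not either of the examples from class; the group names, treatment names, and numbers are all made up so that the reversal is easy to see. Each entry is a pair (successes, trials).

```python
# Invented counts for illustration only: (successes, trials).
groups = {
    "Group 1": {"A": (8, 10),   "B": (70, 100)},
    "Group 2": {"A": (30, 100), "B": (2, 10)},
}

def rate(successes, trials):
    """Success rate as a fraction."""
    return successes / trials

# Within each group, A has the higher success rate.
for name, results in groups.items():
    print(f"{name}: A = {rate(*results['A']):.0%}, B = {rate(*results['B']):.0%}")

# Combine the groups by adding up successes and trials for each treatment.
overall = {
    treatment: (sum(groups[g][treatment][0] for g in groups),
                sum(groups[g][treatment][1] for g in groups))
    for treatment in ("A", "B")
}

# Overall, the comparison reverses: B has the higher success rate.
print(f"Overall: A = {rate(*overall['A']):.1%}, B = {rate(*overall['B']):.1%}")
```

In this made-up data, A does better than B inside each group (80% vs. 70%, and 30% vs. 20%), but B does better when the groups are combined. The reversal happens because most of A's trials are in the group with low success rates, while most of B's trials are in the group with high success rates.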