Today we talked about the best way to plot data.
We showed how to use Excel to make pie charts and bar charts. We also showed how to make stemplots and histograms. Then we talked about what statisticians mean when they talk about the distribution of data. We used our class weights as a first example.
The distribution of a numerical variable encompasses three things: its shape, its center, and its spread.
A statistic is robust if it is not strongly affected by outliers or skew. The median is robust; the mean is not.
Important concept: If a distribution has a long right tail (is right skewed), then the mean will typically be larger than the median. If a distribution is skewed left, the opposite happens: the mean is typically lower than the median.
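To see the mean and median respond differently to skew and outliers, here is a small Python sketch. The numbers are made up for illustration (they are not the class weights), and it uses only the standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical right-skewed data: most values are small, one is much larger.
right_skewed = [1, 2, 2, 3, 3, 4, 30]
# Mirror image of the same values, so the long tail is on the left.
left_skewed = [-x for x in right_skewed]
# Same as right_skewed, but with the largest value made ten times bigger.
bigger_outlier = right_skewed[:-1] + [300]

for label, data in [("right skewed", right_skewed),
                    ("left skewed", left_skewed),
                    ("bigger outlier", bigger_outlier)]:
    print(f"{label}: mean = {mean(data):.2f}, median = {median(data)}")
```

In the right-skewed list the mean lands above the median, and in the mirrored list it lands below. Making the outlier larger moves the mean but leaves the median where it was, which is what robustness means here.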
We finished by doing this workshop.
Today we talked about boxplots and different ways to measure spread in a quantitative variable.
A boxplot is a picture showing the 5-number summary. These five numbers are the minimum, the first quartile (Q1), the median, the third quartile (Q3), and the maximum.
You find the quartiles Q1 and Q3 by finding the medians of the bottom half and the top half of the data, respectively. We made a boxplot for the 2018 High Bridge Half-Marathon times. Then, we did a workshop comparing homicide rates in states with and without the death penalty: Death Penalty Workshop.
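The quartile rule above can be sketched in Python. This version assumes that when the number of data values is odd, the middle value is left out of both halves; that is one common convention, and Excel's quartile functions may give slightly different answers. The function name `five_number_summary` and the data are hypothetical, not the half-marathon times.

```python
from statistics import median

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) using the median-of-halves rule."""
    values = sorted(data)
    half = len(values) // 2
    lower = values[:half]                 # bottom half of the sorted data
    upper = values[len(values) - half:]   # top half of the sorted data
    return (values[0], median(lower), median(values), median(upper), values[-1])

# Hypothetical data with 9 values, so the middle value (7) is in neither half.
example = [9, 2, 12, 4, 7, 4, 10, 5, 8]
summary = five_number_summary(example)
print(summary)
```

The list needs at least two values for both halves to be non-empty.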
We also discussed how to measure spread in data. Three ways to do this are the range, the interquartile range (IQR), and the standard deviation.
Note that the range is not robust, since it only depends on two numbers: the max and the min. IQR is very robust, and standard deviation (which we will discuss more later) is somewhere in between in terms of robustness.
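A quick way to check the robustness claims is to compute all three measures on a small list, then swap the largest value for an outlier. The sketch below uses made-up numbers and the same median-of-halves convention for quartiles; the helper names `iqr` and `spread_summary` are hypothetical.

```python
from statistics import median, stdev

def iqr(data):
    """Interquartile range, Q3 - Q1, using medians of the two halves."""
    values = sorted(data)
    half = len(values) // 2
    q1 = median(values[:half])
    q3 = median(values[len(values) - half:])
    return q3 - q1

def spread_summary(data):
    """Range, IQR, and sample standard deviation of a list of numbers."""
    return {"range": max(data) - min(data), "iqr": iqr(data), "stdev": stdev(data)}

typical = [10, 12, 13, 15, 16, 18, 20, 21]   # hypothetical values
with_outlier = typical[:-1] + [90]           # replace the max with an outlier

print(spread_summary(typical))
print(spread_summary(with_outlier))
```

With the outlier, the range jumps from 11 to 80 and the standard deviation grows, while the IQR stays at 6.5.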
Today we talked about standard deviation. We calculated one example by hand using the formula, but then I promised that you would never have to do that again (at least in my class)! Instead, we will use the =STDEV() function in Excel.
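For anyone curious what the by-hand calculation looks like in code, here is a sketch. It assumes the formula from class was the sample standard deviation, which divides by n - 1; that is the version Excel's =STDEV() and Python's `statistics.stdev` both compute. The data values are made up, and `stdev_by_hand` is a hypothetical name.

```python
from math import sqrt
from statistics import stdev

def stdev_by_hand(data):
    """Sample standard deviation: sqrt of (sum of squared deviations) / (n - 1)."""
    n = len(data)
    xbar = sum(data) / n                            # step 1: the mean
    squared_devs = [(x - xbar) ** 2 for x in data]  # step 2: squared deviations
    return sqrt(sum(squared_devs) / (n - 1))        # step 3: divide by n - 1, take root

sample = [2, 4, 4, 5, 10]   # hypothetical data with mean 5
by_hand = stdev_by_hand(sample)
built_in = stdev(sample)
print(by_hand, built_in)
```

Here the squared deviations are 9, 1, 1, 0, and 25, which add up to 36. Dividing by n - 1 = 4 gives 9, and the square root is 3.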
We also talked about how to plot two variables at a time. Depending on whether you have numerical or categorical, these are the best options:
Today we talked about contingency tables (also known as two-way tables). We did a couple of examples, and looked at how the tables can be used to see if there is an association between two categorical variables. The key is to compare either the row proportions or the column proportions.
| | Male | Female | Total |
|---|---|---|---|
| Always | 116 | 188 | 304 |
| Sometimes | 229 | 284 | 513 |
| Never | 82 | 78 | 160 |
| Total | 427 | 550 | 977 |
To see if there is an association, you can look at the row proportions, which are the row counts divided by the row totals, or the column proportions, which are the column counts divided by the column totals. You can tell the difference by how they are worded. The percent of men who always feel rushed is a column proportion, since it is a percent of men, and men are a column. In this case, it is 27.2%, which is 116 divided by the column total 427. The corresponding column proportion for women is 188/550 = 34.2%. So women were somewhat more likely to say they always feel rushed. That means the two variables, gender and how often a person feels rushed, are associated, not independent.
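The column proportions for the whole table can be computed the same way. This is a minimal Python sketch using the counts from the table above; the function name `column_proportions` and the nested-dictionary layout are just one possible choice.

```python
# Counts from the table above: rows are responses, columns are genders.
counts = {
    "Always": {"Male": 116, "Female": 188},
    "Sometimes": {"Male": 229, "Female": 284},
    "Never": {"Male": 82, "Female": 78},
}

def column_proportions(table):
    """Divide every count by the total of its column."""
    columns = list(next(iter(table.values())).keys())
    totals = {c: sum(row[c] for row in table.values()) for c in columns}
    return {r: {c: row[c] / totals[c] for c in columns} for r, row in table.items()}

col_props = column_proportions(counts)
for response, row in col_props.items():
    print(response, {gender: f"{p:.1%}" for gender, p in row.items()})
```

Each column's proportions add up to 100%, so comparing the Male and Female entries within a row shows how the two groups differ.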
We also talked about segmented bar graphs as a way to visualize row or column proportions. These are easy to make in Excel.
We defined relative risk, which is a ratio of two corresponding row or column percentages and tells you how many times higher one percent is than the other. Here are two example problems we did in class.
| | Seatbelt | No Seatbelt | Total |
|---|---|---|---|
| Nonfatal | 412368 | 162527 | 574895 |
| Fatal | 510 | 1601 | 2111 |
| Total | 412878 | 164128 | 577006 |
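As a sketch of the relative risk calculation for this table, the Python below compares the fatality rate without a seatbelt to the rate with one. Which group goes in the numerator is a choice; here it is the no-seatbelt group, and the variable names are hypothetical.

```python
# Counts from the seatbelt table above.
fatal = {"Seatbelt": 510, "No Seatbelt": 1601}
total = {"Seatbelt": 412878, "No Seatbelt": 164128}

# Column proportions: the share of each group whose accident was fatal.
fatal_rate = {group: fatal[group] / total[group] for group in fatal}

# Relative risk: how many times higher the no-seatbelt rate is.
relative_risk = fatal_rate["No Seatbelt"] / fatal_rate["Seatbelt"]

for group, rate in fatal_rate.items():
    print(f"{group}: {rate:.3%} fatal")
print(f"relative risk: {relative_risk:.1f}")
```

Both rates are small, but the rate without a seatbelt is roughly eight times the rate with one.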
Finally, we gave two examples of Simpson’s paradox, which is when an apparent association between two variables reverses when you divide the data into smaller groups.
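To make the reversal concrete, here is a Python sketch using a widely cited set of kidney stone treatment counts. This is not necessarily one of the examples from class, and the labels "A" and "B" and the function name `pooled_rate` are just placeholders.

```python
# (successes, patients) for two treatments, split by stone size.
# These are the often-quoted kidney stone figures, not data from class.
outcomes = {
    "small stones": {"A": (81, 87), "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def pooled_rate(treatment):
    """Success rate for one treatment with both stone sizes combined."""
    successes = sum(group[treatment][0] for group in outcomes.values())
    patients = sum(group[treatment][1] for group in outcomes.values())
    return successes / patients

for size, group in outcomes.items():
    rates = {t: s / n for t, (s, n) in group.items()}
    print(size, {t: f"{r:.1%}" for t, r in rates.items()})

print("combined", {t: f"{pooled_rate(t):.1%}" for t in ("A", "B")})
```

Treatment A has the higher success rate within each stone size, yet B looks better when the groups are combined. That happens because A was used mostly on large stones, which are harder to treat.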