Math 121 - Week 1 Notes

Monday, August 24

Today we went over the syllabus for the course. Then we introduced some key terminology that we will use throughout the course. We started by making an example data table on the blackboard. The data table contained the following variables:

Height
Weight
Birth Location
Number of Siblings

We defined the following terms:

Variables
Observational units (aka Individuals, Cases, Subjects)
Data matrix (aka Data frame or Data table)

We also paid attention to the difference between numerical and categorical variables.
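To make these terms concrete, here is a minimal Python sketch of a data matrix with the four variables listed above. The rows, the values, the units (inches and pounds), and the helper name `variable_kind` are all invented for illustration; they are not the table from the blackboard.

```python
# Hypothetical data matrix: each dict is one observational unit (a student),
# and each key is a variable. All values are invented for illustration.
data_matrix = [
    {"height_in": 66, "weight_lb": 140, "birth_location": "Virginia", "siblings": 2},
    {"height_in": 71, "weight_lb": 175, "birth_location": "Ohio", "siblings": 0},
    {"height_in": 63, "weight_lb": 120, "birth_location": "Virginia", "siblings": 3},
]


def variable_kind(values):
    """Label a variable numerical if every value is a number, else categorical."""
    if all(isinstance(v, (int, float)) for v in values):
        return "numerical"
    return "categorical"


# Pull out each column (variable) and classify it.
kinds = {
    name: variable_kind([row[name] for row in data_matrix])
    for name in data_matrix[0]
}

for name, kind in kinds.items():
    print(f"{name}: {kind}")
```

Checking the value type is only a rough rule of thumb. A variable stored as numbers (a ZIP code, for example) can still be categorical if arithmetic on it makes no sense.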

Examples

Kristin Gilbert was a nurse who was accused of murdering patients at a hospital where she worked. During the 18 months that she worked at the hospital, patients died during 40 out of the 257 shifts when Gilbert was working. During the other 1384 shifts, only 34 shifts had patient deaths. Who or what are the individuals and variables in this example?
The National Highway Traffic Safety Administration (NHTSA) keeps records on the number of traffic fatalities that occur in the USA every year. Here is a link with the data up through 2018: https://en.wikipedia.org/wiki/Motor_vehicle_fatality_rate_in_U.S._by_year. The data table there includes several quantitative variables.

Which variable is a better measure of how safe cars are (Total deaths, Fatalities over Population, or Fatalities over VMT)?
Which variable is a better measure of how traffic fatalities compare with other causes of death like cancer, heart disease, etc.?
Why is it better to look at the fatality rates than the total fatalities?
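The Gilbert example above shows the same contrast between totals and rates, using only the numbers given there. The following is a small Python sketch of the arithmetic (the variable names are ours):

```python
# Counts taken from the Gilbert example above.
gilbert_shifts = 257        # shifts when Gilbert was working
gilbert_death_shifts = 40   # of those, shifts with a patient death
other_shifts = 1384         # shifts when she was not working
other_death_shifts = 34     # of those, shifts with a patient death

# Raw counts look similar (40 vs. 34), but the two groups of shifts
# are very different sizes, so compare rates instead.
rate_gilbert = gilbert_death_shifts / gilbert_shifts
rate_other = other_death_shifts / other_shifts

print(f"Gilbert working: {rate_gilbert:.1%} of shifts had a death")  # 15.6%
print(f"Other shifts:    {rate_other:.1%} of shifts had a death")    # 2.5%
```

The totals (40 and 34) are close, but the rate is roughly six times higher for the shifts when Gilbert was working. The totals hide this because there were far more shifts without her. As the rest of these notes stress, an association like this is not by itself proof of causation.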

Good vs. Bad Answers

The last question is tricky. It is really tempting to give a bad answer:

Bad Answer: The rate is better because it is a more precise number and it is easier to understand.
Good Answer: The rates are better because they take the size of the population or the amount people drive into account, so they give a more accurate picture of how safe car travel is.

The first answer is what I call a bullshit answer. It is designed to try to sound good and hide the fact that you don’t really know the real answer. Here is what the philosopher Harry Frankfurt has to say about bullshit:

It is impossible for someone to lie unless he thinks he knows the truth. Producing bullshit requires no such conviction. A person who lies is thereby responding to the truth, and he is to that extent respectful of it. When an honest man speaks, he says only what he believes to be true; and for the liar, it is correspondingly indispensable that he considers his statements to be false. For the bullshitter, however, all these bets are off: he is neither on the side of the true nor on the side of the false. His eye is not on the facts at all, as the eyes of the honest man and of the liar are, except insofar as they may be pertinent to his interest in getting away with what he says. He does not care whether the things he says describe reality correctly. He just picks them out, or makes them up, to suit his purpose.

One of the goals of this course is to learn to avoid bullshit answers!

More Examples

Question from an old midterm exam: A study by the Economic Innovation Group found that since 2007, counties in the USA with small populations have had a lower rate of job growth than larger counties. Who or what were the individuals and what were the variables in this study?

Answer: The variables are job growth rate and population. The individuals are the counties, since each county has a population and a job growth rate. Both variables are numerical.

Someone suggested that the individuals might be the different levels of growth rates (low, high, etc.), and then the two variables might be large population and small population. This is not correct, because large population and small population are not different variables. They are different categories for the one variable: population. Later we’ll talk about a special kind of table, called a contingency table, for displaying the data this way. For now, we’ll focus on data matrices where each variable has only one column.

Wednesday, August 26

We started by revisiting some of the examples from Monday to discuss explanatory vs. response variables.

Explanatory and Response Variables

Sometimes variables are associated, like height and weight. Other variables might not be associated, like height and number of siblings. When variables are not associated, we say they are independent.

+----------------------+    might affect    +-------------------+
| Explanatory Variable | -----------------> | Response Variable |
+----------------------+                    +-------------------+

Remember: Association is not causation!

In the examples above from Monday, which (if any) of the variables were explanatory and which were response?

Populations vs. Samples

We started this topic by each randomly selecting 10 words from the Gettysburg Address. This led to a discussion of population parameters versus sample statistics. There are two reasons why sample statistics can be off target:

Bias
Random Error

We spent some time discussing some examples where bias was a problem. We covered the following important lessons:

Large samples have less random error.
Large samples can still be very biased!
The only way to avoid bias is with a simple random sample from the whole population.
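The first two lessons can be illustrated with a small simulation. This is a sketch under stated assumptions, not a record of what we did in class: the population is only the opening sentence of the Gettysburg Address (to keep the code short), the statistic is the mean word length, and the "biased" sampler assumes that longer words are more likely to be picked. The function names are ours.

```python
import random
import statistics

# Population: the words of the Address's opening sentence only (a shortened
# stand-in for the full text, with punctuation removed).
OPENING = (
    "Four score and seven years ago our fathers brought forth on this "
    "continent a new nation conceived in Liberty and dedicated to the "
    "proposition that all men are created equal"
)
word_lengths = [len(w) for w in OPENING.split()]
true_mean = statistics.mean(word_lengths)  # the population parameter


def srs_mean(n, rng):
    """Sample statistic from a simple random sample of n words."""
    return statistics.mean(rng.sample(word_lengths, n))


def biased_mean(n, rng):
    """Sample statistic when longer words are more likely to be picked.

    Weighting by word length is an assumed model of bias, not a claim
    about how any real sample was drawn. Draws are with replacement.
    """
    return statistics.mean(rng.choices(word_lengths, weights=word_lengths, k=n))


def simulate(sampler, n, reps=2000, seed=0):
    """Return (average, standard deviation) of the sample mean over many samples."""
    rng = random.Random(seed)
    means = [sampler(n, rng) for _ in range(reps)]
    return statistics.mean(means), statistics.stdev(means)


for label, sampler, n in [
    ("SRS, n=5", srs_mean, 5),
    ("SRS, n=20", srs_mean, 20),
    ("biased, n=5", biased_mean, 5),
    ("biased, n=20", biased_mean, 20),
]:
    center, spread = simulate(sampler, n)
    print(f"{label:>13}: center {center:.2f}, spread {spread:.2f} "
          f"(true mean {true_mean:.2f})")
```

The spread of the sample means plays the role of random error, and it should shrink as the sample size grows. The gap between the center and the true mean plays the role of bias. For the length-weighted sampler, that gap should not go away with a larger sample.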

Extra Practice

Exercise 1.13

Thursday, August 27

Today we did a workshop about bias and random error in samples.

Friday, August 28

Today we talked about randomized controlled experiments, which are the only way to prove a cause-and-effect relationship. We compared observational studies with experiments. The difference is that an experiment imposes a treatment on the individuals in the study. An experiment is randomized if the individuals are randomly assigned to the different treatments.

Example 1: Brain Cancer and Cellphones

We looked at an observational study from 1998 where 469 patients with brain cancer were interviewed about how much they used cellphones over the previous few years. The patients were then matched with 469 healthy people of the same age, sex, and race who were also interviewed. This study controlled the variables age, sex, and race, but there were many other lurking variables that were not controlled.

Big Idea #1 You can’t be sure that an explanatory variable is the cause of a response unless you control all possible lurking variables. In observational studies it is impossible to control all lurking variables.

Example 2: Polio Vaccine Trials

We also looked at the polio vaccine trials from the early 1950s. This was a very famous randomized controlled experiment. Hundreds of thousands of kids were given the experimental polio vaccine to see if it worked. The kids were randomly assigned to one of two treatments: they got either the vaccine or a placebo. Then they were tracked to see if they developed polio.

Big Idea #2 Random assignment controls lurking variables, which lets you prove cause and effect. You can control lurking variables with a randomized controlled experiment.
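Here is a rough Python sketch of why random assignment helps. The subjects and their ages are simulated, with age standing in for any lurking variable; nothing here comes from the actual polio trial data, and the group names are just labels.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable

# Hypothetical subjects: each has an age (5-9) standing in for a lurking variable.
ages = [rng.randint(5, 9) for _ in range(10_000)]

# Random assignment: shuffle the subject indices, then split them in half.
subjects = list(range(len(ages)))
rng.shuffle(subjects)
half = len(subjects) // 2
vaccine_group = subjects[:half]
placebo_group = subjects[half:]

# Compare the lurking variable across the two treatment groups.
mean_age_vaccine = statistics.mean([ages[i] for i in vaccine_group])
mean_age_placebo = statistics.mean([ages[i] for i in placebo_group])

print(f"mean age, vaccine group: {mean_age_vaccine:.2f}")
print(f"mean age, placebo group: {mean_age_placebo:.2f}")
```

Because chance alone decides who goes in which group, the two groups should end up with nearly the same average age, even though the code never looks at age when assigning. The same reasoning applies to lurking variables nobody thought to measure. The balance is approximate and gets better with more subjects.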

The polio vaccine trials were double blind. That means that neither the patients nor their doctors knew which treatment each patient was getting. This was done so that all patients would be treated the same (so no new lurking variables could creep in after the random assignment).

Example 3: Magnetic Bracelets

Do magnetic bracelets work to treat joint pain? We talked about how we could design a randomized controlled experiment to test this. We also talked about why people still buy magnetic bracelets, even though studies show they don’t work better than a placebo. One reason is anecdotal evidence, which is evidence based on a short, memorable personal story. Anecdotal evidence has two problems (from a statistical perspective). One, it is based on small samples, so it is subject to random error. And two, anecdotes are usually memorable stories, and so they are usually biased. Even so, sometimes we have to rely on anecdotes in everyday life because that’s all we have.