Today we went over the syllabus for the course. Then we introduced some key terminology that we will use throughout the course. We started by making an example data table on the blackboard. The data table contained the following variables:
We defined the following terms:
We also paid attention to the difference between numerical and categorical variables.
Kristin Gilbert was a nurse who was accused of murdering patients at a hospital where she worked. During the 18 months she worked at the hospital, patients died during 40 of the 257 shifts when Gilbert was working. Of the other 1384 shifts, only 34 had patient deaths. Who or what are the individuals and variables in this example?
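To compare the two groups of shifts fairly, it helps to turn the counts into rates. Here is a minimal sketch in Python that uses only the counts given above; the variable and function names are made up for illustration.

```python
# Counts taken directly from the example above.
shifts_with_gilbert = 257
death_shifts_with_gilbert = 40
shifts_without_gilbert = 1384
death_shifts_without_gilbert = 34


def death_shift_rate(death_shifts, total_shifts):
    """Proportion of shifts during which at least one patient died."""
    return death_shifts / total_shifts


rate_with = death_shift_rate(death_shifts_with_gilbert, shifts_with_gilbert)
rate_without = death_shift_rate(death_shifts_without_gilbert, shifts_without_gilbert)

print(f"Gilbert working:     {rate_with:.1%}")     # 15.6%
print(f"Gilbert not working: {rate_without:.1%}")  # 2.5%
```

The rates show a strong association between Gilbert working and patient deaths. On its own, that association does not show what caused the deaths.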
The last question is tricky. It is really tempting to give a bad answer:
Bad Answer: The rate is better because it is a more precise number and it is easier to understand.
Good Answer: The rates are better because they take the size of the population or the amount people drive into account, so they give a more accurate picture of how safe car travel is.
The first answer is what I call a bullshit answer. It is designed to try to sound good and hide the fact that you don’t really know the real answer. Here is what the philosopher Harry Frankfurt has to say about bullshit:
It is impossible for someone to lie unless he thinks he knows the truth. Producing bullshit requires no such conviction. A person who lies is thereby responding to the truth, and he is to that extent respectful of it. When an honest man speaks, he says only what he believes to be true; and for the liar, it is correspondingly indispensable that he considers his statements to be false. For the bullshitter, however, all these bets are off: he is neither on the side of the true nor on the side of the false. His eye is not on the facts at all, as the eyes of the honest man and of the liar are, except insofar as they may be pertinent to his interest in getting away with what he says. He does not care whether the things he says describe reality correctly. He just picks them out, or makes them up, to suit his purpose.
One of the goals of this course is to learn to avoid bullshit answers!
Answer: The variables are job growth rate and population. The individuals are the counties, since each county has a population and a job growth rate. Both variables are numerical.
Someone suggested that the individuals might be the different levels of growth rates (low, high, etc.), and then the two variables might be large population and small population. This is not correct, because large population and small population are not different variables. They are different categories for the one variable: population. Later we’ll talk about a special kind of table, called a contingency table, for displaying the data this way. For now, we’ll focus on data matrices where each variable has only one column.
We started by revisiting some of the examples from Monday to discuss explanatory vs. response variables.
Some variables are associated, like height and weight. Others might not be associated, like height and number of siblings. When variables are not associated, we say they are independent.
+----------------------+    might affect    +-------------------+
| Explanatory Variable | -----------------> | Response Variable |
+----------------------+                    +-------------------+
Remember: Association is not causation!
We started this topic by each randomly selecting 10 words from the Gettysburg Address. This led to a discussion of population parameters versus sample statistics. There are two reasons why sample statistics can be off target: bias and random error.
We spent some time discussing some examples where bias was a problem. We covered the following important lessons:
Large samples have less random error.
Large samples can still be very biased!
The only way to avoid bias is with a simple random sample from the whole population.
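The difference between random error and bias can be seen in a small simulation. The sketch below rests on several assumptions that are not from class: the population is only the opening sentence of the Gettysburg Address (a stand-in for the full text), samples are drawn with replacement so that the sample size can exceed the population size, and the "biased" scheme is a hypothetical one where a word's chance of being picked is proportional to its length.

```python
import random
import statistics

# Population: the opening sentence of the Gettysburg Address (stand-in for the full text).
TEXT = ("Four score and seven years ago our fathers brought forth on this "
        "continent a new nation conceived in Liberty and dedicated to the "
        "proposition that all men are created equal")
word_lengths = [len(word) for word in TEXT.split()]
parameter = statistics.mean(word_lengths)  # population parameter: mean word length


def srs_mean(n, rng):
    """Sample statistic from n words chosen at random (with replacement)."""
    return statistics.mean(rng.choices(word_lengths, k=n))


def biased_mean(n, rng):
    """Sample statistic when longer words are more likely to be chosen (hypothetical)."""
    return statistics.mean(rng.choices(word_lengths, weights=word_lengths, k=n))


def summarize(sampler, n, rng, repeats=2000):
    """Repeat the sampling many times; return (center, spread) of the sample means."""
    means = [sampler(n, rng) for _ in range(repeats)]
    return statistics.mean(means), statistics.stdev(means)


rng = random.Random(0)
print(f"population mean word length: {parameter:.2f}")
for n in (10, 100):
    for name, sampler in (("random", srs_mean), ("biased", biased_mean)):
        center, spread = summarize(sampler, n, rng)
        print(f"n={n:3d} {name}: center={center:.2f} spread={spread:.2f}")
```

For both schemes the spread (random error) shrinks when n goes from 10 to 100. The center of the biased scheme stays above the population mean no matter how large the sample is.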
Today we did this workshop about bias and random error in samples.
Today we talked about randomized controlled experiments, which are the only way to prove a cause & effect relationship. We compared observational studies with experiments. The difference is that an experiment imposes a treatment on the individuals in the study. An experiment is randomized if the individuals are randomly assigned to the different treatments.
We looked at an observational study from 1998 where 469 patients with brain cancer were interviewed about how much they used cell phones over the previous few years. The patients were then matched with 469 healthy people of the same age, sex, and race who were also interviewed. This study controlled the variables age, sex, and race, but there were many other lurking variables that were not controlled.
Big Idea #1 You can’t be sure that an explanatory variable is the cause of a response unless you control all possible lurking variables. In observational studies it is impossible to control all lurking variables.
We also looked at the polio vaccine trials from the early 1950s. This was a very famous randomized controlled experiment. Hundreds of thousands of kids in the 1950s were given the experimental polio vaccine to see if it worked. The kids were randomly assigned one of two treatments: they got either the vaccine or a placebo. Then they were tracked to see if they developed polio.
Big Idea #2 Random assignment controls lurking variables, which lets you prove cause and effect. You can control lurking variables with a randomized controlled experiment.
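Here is a rough sketch of why random assignment controls lurking variables. Everything in it is hypothetical: the participants, their ages (standing in for any lurking variable), and the self-selection rule used for comparison. None of it is data from the polio trials.

```python
import random
import statistics

rng = random.Random(1)

# Hypothetical participants, each with one lurking variable (age in years).
ages = [rng.randint(5, 12) for _ in range(10_000)]


def random_assignment(ages, rng):
    """Shuffle the participants and split them in half: treatment vs. placebo."""
    shuffled = ages[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def self_selection(ages, rng):
    """Hypothetical observational setup: older kids are more likely to opt in."""
    treated, untreated = [], []
    for age in ages:
        if rng.random() < age / 12:
            treated.append(age)
        else:
            untreated.append(age)
    return treated, untreated


for name, split in (("random assignment", random_assignment),
                    ("self-selection", self_selection)):
    group_a, group_b = split(ages, rng)
    gap = statistics.mean(group_a) - statistics.mean(group_b)
    print(f"{name}: difference in mean age = {gap:+.2f} years")
```

With random assignment the two groups end up with nearly the same average age, so age cannot explain a difference in outcomes. With self-selection the treated group is noticeably older, so age is tangled up with the treatment.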
The polio vaccine trials were double blind. That means that neither the patients nor their doctors knew which treatment each patient was getting. This is done so that all patients are treated the same (so no new lurking variables can creep in after the random assignment).
Do magnetic bracelets work to treat joint pain? We talked about how we could design a randomized controlled experiment to test this. We also talked about why people still buy magnetic bracelets, even though studies show they don’t work better than a placebo. One reason is anecdotal evidence, which is evidence based on a short, memorable personal story. Anecdotal evidence has two problems (from a statistical perspective). One, it is based on small samples, so it is subject to random error. Two, anecdotes are usually memorable stories, so they are usually biased. Even so, sometimes we have to rely on anecdotes in everyday life because that’s all we have.