Day | Sections | Topic |
---|---|---|
Mon, Aug 21 | 1.1 - 1.4 | Counting |
Wed, Aug 23 | 1.5 - 1.6 | Story proofs & definition of probability |
Fri, Aug 25 | 1.5 | Simulation with Python |
Today we reviewed the basic rules for counting and looked at how these rules can help us solve probability problems using the naive definition of probability.
What is the probability that a hand of 5 cards from a shuffled deck of 52 playing cards is a full house (3 of one rank and 2 of another)?
What is the probability that a group of 30 random people will all have different birthdays?
How many different ways are there to rearrange the letters of MISSISSIPPI?
What if you randomly permuted the letters of MISSISSIPPI? What is the probability that they would still spell MISSISSIPPI?
The functions $_n C_k$ and $_n P_k$ are both in the Python standard math library. You can use the following code to calculate them:

```python
import math

n, k = 5, 3
print(math.comb(n, k))  # 10
print(math.perm(n, k))  # 60
```
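For instance, the full-house probability from the first question above can be computed directly with `math.comb` (a quick sketch: choose the triple's rank and suits, then the pair's):

```python
import math

# Full house: pick the triple's rank and 3 of its suits,
# then the pair's rank and 2 of its suits
ways = 13 * math.comb(4, 3) * 12 * math.comb(4, 2)
hands = math.comb(52, 5)
print(ways / hands)  # 3744 / 2598960, about 0.00144
```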
Today we introduced story proofs with these examples:
Explain why it makes sense that $\binom{n}{k} = \binom{n}{n-k}$ in terms of choosing subsets of a set with $n$ elements.
Prove that $\displaystyle n\binom{n-1}{k-1} = k \binom{n}{k}$. Hint: there are the same number of ways to choose a team captain who then picks the rest of the team as there are ways to pick the team first, and then select the captain from within the team.
Prove Vandermonde’s identity $\binom{m+n}{k} = \sum_{j = 0}^k \binom{m}{j} \binom{n}{k-j}.$
After that, we introduced the general definition of probability:
Definition. A probability space consists of a set $S$ of possible outcomes called the sample space, and a probability function $P:2^S \rightarrow [0,1]$ which satisfies two axioms:
Axiom 1. $P(\varnothing) = 0$ and $P(S) = 1$, and
Axiom 2. For any finite or countably infinite collection of disjoint events $A_1, A_2, \ldots$, $P(\bigcup_{i = 1}^\infty A_i) = \sum_{i = 1}^\infty P(A_i).$
We looked at the following examples of probability spaces:
Describe the sample space and probability function for rolling a fair six-sided die.
Suppose you flip a fair coin repeatedly until it lands on heads, and record the number of flips it takes. Describe the sample space, the probabilities of the individual outcomes in the sample space, and then calculate the probability that you get an odd number of flips. (We needed geometric series to answer this question!)
What is the probability it takes an even number of flips?
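A quick standard-library simulation (a sketch; the names are my own) agrees with the geometric-series answers of $2/3$ for odd and $1/3$ for even:

```python
import random

random.seed(0)

def flips_until_heads():
    """Flip a fair coin until heads; return the number of flips."""
    n = 1
    while random.random() >= 0.5:
        n += 1
    return n

trials = 100_000
odd = sum(flips_until_heads() % 2 == 1 for _ in range(trials))
print(odd / trials)  # close to 2/3
```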
We finished by proving the Complementary events formula $P(A^C) = 1-P(A)$.
Today we did probability simulations in Python using Google Colab.
Estimate the probability that the total of three six-sided dice is 12 by simulating rolling 3 dice many times.
Chevalier de Mere’s problem. The Chevalier de Mere was a French gambler in the 1600s. He knew that the chance of rolling a one on a six-sided die is 1/6 and the chance of rolling two dice and having both land on ones is 1/36. So he reasoned that the probability of getting a one in four rolls of a single die should be the same as getting a pair of ones in 24 rolls of two dice. Write a simulation in Python to see which is more likely to happen.
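Here is one possible sketch of such a simulation (my own variable names; the exact answers are $1 - (5/6)^4 \approx 0.518$ and $1 - (35/36)^{24} \approx 0.491$):

```python
import random

random.seed(1)
trials = 100_000

# At least one 1 in four rolls of one die
a = sum(any(random.randint(1, 6) == 1 for _ in range(4))
        for _ in range(trials))
# At least one pair of 1s in 24 rolls of two dice
b = sum(any(random.randint(1, 6) == 1 and random.randint(1, 6) == 1
            for _ in range(24))
        for _ in range(trials))
print(a / trials, b / trials)
```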
Here are my sample solutions in Google Colab.
Day | Sections | Topic |
---|---|---|
Mon, Aug 28 | 2.1 - 2.3 | Bayes rule |
Wed, Aug 30 | 2.4 - 2.5 | Independence |
Fri, Sep 1 | 2.6 - 2.7 | Conditioning |
Today we talked about conditional probability: $P(A | B) = \frac{P( A \cap B)}{P(B)}.$
We did these examples:
Shuffle a deck of cards and draw two cards off the top. Find $P(\text{2nd is an Ace} ~|~ \text{1st is an Ace} )$.
Women in their 40s have a 0.8% chance of having breast cancer. Mammograms detect breast cancer 90% of the time when a person has it. They also correctly come back negative 93% of the time when a person doesn’t have it.
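Plugging these numbers into Bayes' rule shows why the example is surprising (a sketch):

```python
p_cancer, sens, fpr = 0.008, 0.90, 0.07  # 93% specificity -> 7% false positives
p_positive = p_cancer * sens + (1 - p_cancer) * fpr
posterior = p_cancer * sens / p_positive
print(posterior)  # about 0.094: most positive tests are false positives
```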
We also talked about how to draw weighted tree diagrams to keep track of the probabilities in an example like this. We also reviewed some of the Python we used on Monday; see this Google Colab example.
Today we talked about Bayes formula, both the standard version and the version for calculating posterior odds from prior odds and the likelihood ratio. We did these examples:
5% of men are color blind, but only 0.25% of women are. Find the posterior odds that someone is male given that they are color blind.
What is the likelihood ratio for having breast cancer if someone has a positive mammogram test?
(Problem from OpenIntroStats.) Jose visits campus every Thursday evening. However, some days the parking garage is full, often due to college events. There are academic events on 35% of evenings, sporting events on 20% of evenings, and no events on 45% of evenings. When there is an academic event, the garage fills up about 25% of the time, and it fills up 70% of evenings with sporting events. On evenings when there are no events, it only fills up about 5% of the time. If Jose comes to campus and finds the garage full, what is the probability that there is a sporting event? Use a tree diagram to solve this problem.
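The tree-diagram arithmetic can be checked in a few lines (a sketch; the dictionary names are my own):

```python
priors = {"academic": 0.35, "sporting": 0.20, "none": 0.45}
full_given = {"academic": 0.25, "sporting": 0.70, "none": 0.05}

# Law of total probability, then Bayes' rule
p_full = sum(priors[e] * full_given[e] for e in priors)
p_sporting = priors["sporting"] * full_given["sporting"] / p_full
print(p_full, p_sporting)  # 0.25 and 0.56
```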
We also looked at other examples of conditional probabilities in two-way tables.
We finished by defining independent events and proving that if $A, B$ are independent, then so are $A$ and $B^c$.
Today we talked about how to use conditioning to solve probability problems. We did two types of examples: ones where you condition on all of the possible outcomes of a family of events that partition the sample space (using the Law of Total Probability) and the other where you condition on the outcome of a single event (using the conditional probability definition or Bayes formula). We did these examples:
In the example from Wednesday where Jose is trying to find parking on campus, find $P(\text{garage is full})$.
If I have a bag with five dice, one four-sided, one six-sided, one eight-sided, one twelve-sided, and one twenty-sided, and I randomly select one die from the bag and roll it, what is $P(\text{result} = 5)$?
What if you didn’t know which die I rolled, but you knew the result was 5? Find $P(\text{die is twenty-sided}|\text{result}=5)$.
Prove that for any two events $A$ and $B$, $P(A | A \cup B) \ge P(A|B)$. Hint: Condition on whether or not $B$ occurs.
5% of men are color blind and 0.25% of women are. 25% of men are taller than 6 feet tall, while only 2% of women are. Find the conditional probability that a random person is male if you know they are both color blind and taller than 6 ft.
$P(A | B \cap C) = \frac{P(A \cap B | C)}{P(B | C)} = \frac{P(A \cap B \cap C)}{P(B \cap C)}.$
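For the color-blindness-and-height example, here is a sketch of the computation, assuming a 50/50 prior for male/female and that the two traits are independent given sex (assumptions I'm adding to make the problem well posed):

```python
# Assumes P(male) = 0.5 and independence of the traits given sex
p_m = 0.5 * 0.05 * 0.25    # male and color blind and tall
p_f = 0.5 * 0.0025 * 0.02  # female and color blind and tall
posterior = p_m / (p_m + p_f)
print(posterior)  # about 0.996
```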
Day | Sections | Topic |
---|---|---|
Wed, Sep 6 | 2.8 - 2.9 | Conditioning - con’d |
Fri, Sep 8 | 3.1 - 3.2 | Discrete random variables |
Today we talked about counter-intuitive probability examples.
The Monty Hall problem.
The prosecutor’s fallacy. In 1998, Sally Clark was tried for murder after two of her sons died suddenly after birth. The prosecutor argued that the probability of one child dying of Sudden Infant Death Syndrome (SIDS) is 1/8500, so $P(\text{two children died} ~|~ \text{not guilty}) = \frac{1}{8500} \cdot \frac{1}{8500} = \frac{1}{72,250,000}.$ Given how small that number is, the prosecutor argued that this proved Sally Clark was guilty beyond a reasonable doubt. What assumptions was the prosecutor making?
p-values in Statistics. We looked at an example where I used my psychic powers to guess 10 out of 25 Zener cards correctly. The conditional probability that I would get so many correct if I were just guessing is very low. The actual answer uses the binomial distribution which we will talk about next week. But is that strong evidence that I am psychic?
Today we introduced random variables. We defined discrete random variables and their probability mass functions (PMFs). We looked at these examples:
Flip two coins. Let X = the number of heads. What is the PMF for X?
Roll two six-sided dice and let T = the total. What is the PMF for T?
Flip a coin until it lands on heads. Let Z = the number of flips. What is the PMF for Z?
We also simulated the PMF of the random variable X = # of correct guesses on 25 tries with Zener cards.
Day | Sections | Topic |
---|---|---|
Mon, Sep 11 | 3.3 - 3.4 | Bernoulli, Binomial, Hypergeometric |
Wed, Sep 13 | 3.3 - 3.6 | Hypergeometric, CDFs |
Fri, Sep 15 | 3.7 - 3.8 | Functions of random variables |
Today we introduced the binomial distribution and the hypergeometric distribution. We derived the probability mass functions for both distributions and spent some time proving the PMF formula for the binomial distribution. We did the following in class exercises:
Graph the PMFs for Bin(2,0.5) and Bin(3,0.5).
One step in the proof of the binomial PMF formula was to give a story proof that $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}.$
Suppose a large town is 47% Republican, 35% Democrat, and 18% independent. A political poll asks a sample of 100 residents their political affiliation. Let X be the number of Republicans. What is the probability distribution for X?
What is the probability of getting exactly 3 aces in a five card poker hand?
What is the distribution of the number of aces in a five card poker hand?
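Both poker questions are hypergeometric; here is a sketch with `math.comb`:

```python
import math

hands = math.comb(52, 5)
# P(exactly k aces) = C(4, k) * C(48, 5 - k) / C(52, 5)
pmf = [math.comb(4, k) * math.comb(48, 5 - k) / hands for k in range(5)]
print(pmf[3])    # probability of exactly 3 aces
print(sum(pmf))  # sanity check: sums to 1
```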
Today we introduced the cumulative distribution function (CDF) of a random variable. We used the binomial distribution CDF to calculate the following:
In roulette, if you bet on a number like 7, you have a 1/38 probability of winning. If you bet $1 and you win, then you get $36. If you play 100 games of roulette and bet $1 on 7 every time, what is the probability that you lose money?
What is the probability that someone who is just guessing would get 10 or more Zener cards correct out of 25?
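Both of these are binomial CDF calculations; here is a standard-library sketch (the helper name is my own):

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Bin(n, p)."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k + 1))

# Roulette: you lose money iff you win 2 or fewer games (36 * 3 > 100)
p_lose = binom_cdf(2, 100, 1 / 38)
# Zener cards: 10 or more correct out of 25 when guessing (p = 1/5)
p_zener = 1 - binom_cdf(9, 25, 1 / 5)
print(p_lose, p_zener)
```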
There is also a CDF function for the hypergeometric distribution, which we used for similar calculations. We also talked about how to access CDF functions in Python and R. Here is an example of how to use the `scipy.stats` module in Python to work with binomial and hypergeometric random variables:
```python
from scipy.stats import binom, hypergeom

X = binom(25, 1/5)
print(1 - X.cdf(9))

Y = hypergeom(10, 7, 5)
print(Y.cdf(4))
```
Then we talked about functions of random variables.
You can also define functions of more than one random variable. For example:
If you roll two six-sided dice and let X be the result of the first roll and Y be the result of the second. Then f(X,Y) = X+Y is a function of two random variables. It is also a new R.V. in its own right. How would you find the PMF for this new R.V.?
For the random variables X and Y above, what is the difference between 2X and X+Y?
Today we finished chapter 3 in the textbook. We defined what it means for two or more random variables to be independent. We observed that if $X \sim \operatorname{Bin}(n,p)$, then $X = X_1 + X_2 + \ldots + X_n$ where the $X_k \sim \operatorname{Bin}(1,p)$ are i.i.d. (independent, identically distributed) Bernoulli random variables.
Use this idea to show that if $X \sim \operatorname{Bin}(n,p)$ and $Y \sim \operatorname{Bin}(m,p)$ are independent, then $X+Y \sim \operatorname{Bin}(m+n,p)$.
A casino has 5 roulette wheels. One of the wheels is improperly balanced, so it lands on a 7 twice as often as it should ($p = 1/19$). Bob likes to play roulette and he always bets on 7. Suppose that Bob plays 100 games of roulette at the casino. If he wins 5 games, find the probability that Bob was at the unbalanced wheel. What if he wins 10 times?
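Here is a sketch of the Bayes computation (the 1/5 prior assumes Bob picks a wheel at random):

```python
import math

def binom_pmf(k, n, p):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def p_unbalanced(wins):
    """Posterior probability Bob played the unbalanced wheel."""
    bad = (1 / 5) * binom_pmf(wins, 100, 1 / 19)   # doubled: 2/38 = 1/19
    good = (4 / 5) * binom_pmf(wins, 100, 1 / 38)
    return bad / (bad + good)

print(p_unbalanced(5), p_unbalanced(10))
```

More wins make the unbalanced wheel more plausible, so the second posterior is much larger than the first.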
Day | Sections | Topic |
---|---|---|
Mon, Sep 18 | Review | |
Wed, Sep 20 | Midterm 1 | |
Fri, Sep 22 | 4.1 - 4.2 | Expectation |
Today we looked at the following questions:
Explain why the identity below makes sense (i.e., give a story proof). $\binom{n}{2} = \binom{k}{2} + k(n-k) + \binom{n-k}{2}, \text{ for all } 1 \le k \le n-1.$
Alice and Bob are both asked to pick their favorite three movies from the same list of 10 choices. Assume their choices are completely independent and are essentially random with each movie equally likely to be picked. Let $M$ be the number of movies that they both pick.
Is $M$ a sample space, an event, a probability, a random variable, or a probability distribution?
What is $P(M = 3)$?
What is the probability that the surgery succeeds without infection?
Are the events that Bob gets an infection and the surgery fails independent?
If Bob gets an infection, what is the conditional probability that the surgery will fail to fix his knee?
What percent of applicants will test positive?
Find P( uses drugs | tests positive).
Today we defined expected value for discrete random variables. We calculated the expected value for the following examples:
A six-sided die.
A single game of Roulette betting $1 on 7.
Then we used the linearity property of expectation to find the expected value of binomial and hypergeometric random variables. After that, we introduced the geometric distribution, derived its PMF, and wrote down the definition of the expected value for a geometric random variable.
Day | Sections | Topic |
---|---|---|
Mon, Sep 25 | 4.3 - 4.4 | Geometric and negative binomial |
Wed, Sep 27 | 4.5 - 4.6 | Variance |
Fri, Sep 29 | 4.7 - 4.8 | Poisson distribution |
Today I handed out the Discrete Probability Distributions cheat sheet in class. You don’t need to memorize these formulas, but make sure you are familiar with them and can recognize them when they come up in problems. We also did the following:
We derived the formula for the expected value of a geometric random variable.
McDonald’s happy meals come with a toy. One month they give out Teenage Mutant Ninja Turtle toys. Each of the four turtles is equally likely. How many happy meals would you need to buy on average to get all four toys? Key idea: treat this as a sum of random variables $T_1$, $T_2$, $T_3$, and $T_4$ which represent the meals needed to get the next turtle.
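By linearity, $E(T_1 + T_2 + T_3 + T_4) = \frac{4}{4} + \frac{4}{3} + \frac{4}{2} + \frac{4}{1}$; a quick check with exact fractions:

```python
from fractions import Fraction

# E(T_i) = 1/p_i where p_i = 4/4, 3/4, 2/4, 1/4 for each new toy
expected = sum(Fraction(4, k) for k in (4, 3, 2, 1))
print(expected)  # 25/3, about 8.33 happy meals
```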
We found the expected value in The St. Petersburg Paradox.
We introduced the negative binomial distribution and derived the formula for its expected value.
We also gave a brief introduction to variance and standard deviation and we found the variance of rolling a six-sided die.
Today we talked more about variance. We talked about how to use the Law of the Unconscious Statistician (LOTUS) to find variance. We also proved this alternate formula for variance: $\operatorname{Var}(X) = E(X^2) - E(X)^2.$ We also looked at the properties of variance. If $X, Y$ are independent random variables and $c$ is any constant, then
$\operatorname{Var}(c) = 0$
$\operatorname{Var}(X+Y) = \operatorname{Var}(X) + \operatorname{Var}(Y)$
$\operatorname{Var}(cX) = c^2 \operatorname{Var}(X)$
We used these ideas to do the following exercises.
Flip a coin that lands on heads with probability $p$ and let $X$ be 1 if you get heads and 0 otherwise. Find $\operatorname{Var}(X)$.
Find the variance and standard deviation of $Y \sim \operatorname{Binom}(100,0.5)$.
Let $X$ and $Y$ be the results from rolling two different fair six-sided dice. Find $\operatorname{Var}(X+Y)$ and compare it with $\operatorname{Var}(X+X)$. Which random variable has a larger variance $X+Y$ or $X+X$? Why does the answer make sense?
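Exact values for the dice exercises, using fractions:

```python
from fractions import Fraction

faces = range(1, 7)
mean = Fraction(sum(faces), 6)                           # 7/2
var = Fraction(sum(x * x for x in faces), 6) - mean**2   # 35/12
print(var, 2 * var, 4 * var)  # Var(X), Var(X+Y), Var(X+X) = Var(2X)
```

$\operatorname{Var}(X+X) = \operatorname{Var}(2X) = 4\operatorname{Var}(X)$ is twice as big as $\operatorname{Var}(X+Y)$ because the two copies of $X$ always move together.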
The variance for a $\operatorname{Geom}(p)$ random variable is $\frac{1-p}{p^2}$. Use that formula to find the variance of a $\operatorname{NegBinom}(n,p)$ random variable.
Today we talked about the Poisson distribution. We started with the fact that Virginia gets about 0.3 earthquakes per year (at least, earthquakes of magnitude greater than 4). This is a good example of a Poisson process, which is any situation where the events you are looking for are rare but occur independently of each other and at a predictable rate. We modeled this example with a binomial distribution where every day there is a very small chance of an earthquake.
How small would p have to be for $\lambda = np$ to be 0.3 for $n = 365$ days?
What if you used $n =(365)(24) = 8760$ hours instead?
Use the Poisson distribution PMF to compute the probability that VA gets an earthquake next year. How does it compare to the two Binomial approximations above?
What is the probability that VA gets an earthquake in the next 4 months?
Write down an infinite series for the expected value of a Pois(λ) random variable.
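The earthquake questions above can be checked numerically (a sketch):

```python
import math

lam = 0.3  # expected earthquakes per year

p_year = 1 - math.exp(-lam)          # P(at least one quake next year)
p_4_months = 1 - math.exp(-lam / 3)  # P(at least one in 4 months)
print(p_year, p_4_months)

# Binomial approximations with n = days or hours in a year
for n in (365, 365 * 24):
    p = lam / n
    print(1 - (1 - p)**n)  # very close to p_year
```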
Theorem. If $X \sim \operatorname{Pois}(\lambda_1)$ and $Y \sim \operatorname{Pois}(\lambda_2)$ are independent, then $X+Y \sim \operatorname{Pois}(\lambda_1 + \lambda_2).$
A store typically gets 10 female customers and 3 male customers per hour. Find the probability distribution for the total number of customers per hour. If the store is open from 10 AM to 6 PM, what is the probability that they get at least 100 customers in one day?
If the store gets 15 customers in one hour, what is the probability that 4 of them are men?
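Here is a sketch of these two calculations; the second uses the fact that, conditional on $n$ total customers, the number of men is $\operatorname{Bin}(n, \lambda_M/(\lambda_M + \lambda_F))$ (a standard consequence of Bayes' rule for independent Poissons):

```python
import math

lam_f, lam_m = 10, 3
lam = lam_f + lam_m  # total per hour ~ Pois(13) by the sum theorem

# P(at least 100 customers in 8 hours): daily total ~ Pois(8 * 13)
mu = 8 * lam
p_100_plus = 1 - sum(math.exp(-mu) * mu**k / math.factorial(k)
                     for k in range(100))
print(p_100_plus)

# Given 15 customers in an hour, the number of men ~ Bin(15, 3/13)
p = lam_m / lam
p_4_men = math.comb(15, 4) * p**4 * (1 - p)**11
print(p_4_men)
```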
Day | Sections | Topic |
---|---|---|
Mon, Oct 2 | 5.1 - 5.3 | Continuous random variables |
Wed, Oct 4 | 5.4 - 5.5 | Normal and exponential distributions |
Fri, Oct 6 | 5.6 | Poisson processes |
Today we introduced continuous random variables. I gave the following slightly different definition than the one in the book (they are equivalent, however):
Definition. A random variable $X$ is continuous if it has a piecewise continuous probability density function (PDF) $f(x)$ such that $P(a \le X \le b) = \int_a^b f(x)\, dx$ for all $a \le b$ in $[-\infty, \infty]$. The cumulative distribution function (CDF) for $X$ is $P(X \le k) = \int_{-\infty}^k f(x) \, dx.$ Note that the PDF is always the derivative of the CDF.
We looked at these examples:
The uniform distribution $\operatorname{Unif}(a,b)$.
The Cauchy distribution with PDF $\displaystyle f(x) = \frac{1}{\pi(x^2 + 1)}$.
The distribution with PDF $e^{-x}$ on $[0, \infty)$.
We also defined the expected value of a continuous random variable to be $E(X) = \int_{-\infty}^\infty x f(x) \, dx.$
Find the expected value of the random variable with PDF $e^{-x}$ on $[0,\infty)$.
What is $E(U)$ when $U \sim \operatorname{Unif}(a,b)$?
Today we introduced the normal and the exponential distributions. We did the following problems.
Theorem. If $X \sim \operatorname{Norm}(\mu_X, \sigma_X)$ and $Y \sim \operatorname{Norm}(\mu_Y, \sigma_Y)$ are independent random variables, then $X+Y \sim \operatorname{Norm}(\mu_X + \mu_Y, \sqrt{\sigma_X^2 + \sigma_Y^2})$.
Let $M$ be the height of a random man and $W$ be the height of a random woman. Use the theorem above to find $P(M > W)$. Hint: First find the probability distribution for $M - W$.
If $X \sim \operatorname{Exp}(\lambda)$, find the expected value and variance for $X$. Explain why the expected value formula makes intuitive sense.
We finished by briefly discussing the fact that exponential random variables are memoryless, that is $P(X > s + t | X > s) = P(X > t).$
Today we started with the Blissville vs. Blotchville example from Section 5.5 in the book.
Then we introduced the Gamma distribution. This is the last of the continuous probability distributions on the Continuous Distributions cheat sheet. We use the parameters $n$ and $\lambda$, but a lot of software implementations of Gamma distributions use parameters $\alpha$ (which is the same as $n$, except it can be a decimal) and $\beta$ for the scale which is the reciprocal of the rate $\lambda$. See for example this Gamma distribution app.
We finished today by talking about inverse CDFs. They convert percentiles into the value (or location) of the random variable at that percentile.
Day | Sections | Topic |
---|---|---|
Mon, Oct 9 | 6.1 - 6.2 | Measures of center & moments |
Wed, Oct 11 | 6.3 | Sampling moments |
Fri, Oct 13 | 6.7 | Probability generating functions |
Today we talked about the mean and the median of a random variable. We started with the following example:
The time a person has to wait for a bus to arrive in Blotchville is exponentially distributed with mean 10 minutes (so λ=0.1/minute). What is the median wait time? (In the homework, you’ll invert the CDF for the exponential distribution explicitly. But in class we just used the app).
Why does it make sense that the median is less than the average wait time?
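Inverting the exponential CDF gives the median in closed form:

```python
import math

lam = 0.1  # rate per minute, so the mean wait is 1/lam = 10 minutes
# Solve 1 - e^{-lam * m} = 1/2 for the median m
median = math.log(2) / lam
print(median)  # about 6.93 minutes, less than the mean of 10
```

The median is below the mean because the exponential distribution is skewed right: occasional very long waits pull the average up.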
Then we introduced moments. We also defined central moments and standardized moments. The 3rd standardized moment is called the skewness and the 4th is called the kurtosis (our book uses a slightly different definition of kurtosis, but I’ll stick with the simpler definition).
We talked about how skewness measures how asymmetric a distribution is, which led to a discussion of symmetric distributions.
We started by talking about how the first moment (the expected value) corresponds to the center of mass and the second central moment (the variance) corresponds to the moment of inertia (it’s proportional to the kinetic energy of rotation if you spun the PMF/PDF around its center of mass).
After that we talked about sample moments $M_k = \frac{1}{n}\sum_{j = 1}^n X_j^k$ where $X_1, X_2, \ldots, X_n$ is an i.i.d. sample of $n$ random variables.
Definition. A random variable $Y$ is an unbiased estimator for a parameter $\theta$ if $E(Y) = \theta$.
Show that the sample mean $\bar{X} = \frac{X_1 + X_2 + \ldots + X_n}{n}$ is an unbiased estimator for the population mean $\mu$.
Show that the $k$-th sample moment $M_k$ is an unbiased estimator for the $k$-th moment for any probability distribution.
Then we talked about the problem of how to find a useful unbiased estimator for the variance $\sigma^2$. The best unbiased estimator ends up being the sample variance $s^2 = \frac{\sum_{j = 1}^n (X_j - \bar{X})^2}{n-1}.$ Why do we divide by $n-1$? We started but didn’t finish the explanation in class. First we defined two random vectors which are vectors with random variable entries:
$V = \begin{bmatrix} X_1 - \bar{X} \\ X_2 - \bar{X} \\ \vdots \\ X_n - \bar{X} \end{bmatrix}, ~ W = \begin{bmatrix} X_1 - \mu \\ X_2 - \mu \\ \vdots \\ X_n - \mu \end{bmatrix}$
Recall that the length of a vector $u = \begin{bmatrix} u_1 \\ \vdots \\ u_n \end{bmatrix}$ is $\|u\| = \sqrt{u_1^2 + \ldots + u_n^2}$.
Show that $\|W\|^2 = n \sigma^2$ where $\sigma^2$ is the variance of each of the RVs $X_j$.
Show that $V$ is orthogonal to $V-W$, i.e., show that the dot product $V \cdot (V-W) = 0$.
We didn’t finish, but we ended with this last question:
With all of those pieces, you can calculate the expected value $E(\|V\|^2) = E(\|W\|^2) - E(\|W-V\|^2)$.
Today we started by reviewing moments and the Law of the Unconscious Statistician (LOTUS). I gave out this handout about random variables.
We also finished the calculation from last time to show that $\frac{\sum_{j = 1}^n (X_j - \bar{X})^2}{n-1}$ is an unbiased estimator for the variance.
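A simulation sketch (my own code) confirming that the $n-1$ denominator removes the bias:

```python
import random

random.seed(2)
n, trials = 5, 200_000
total_s2 = 0.0
for _ in range(trials):
    xs = [random.gauss(0, 1) for _ in range(n)]  # true variance is 1
    xbar = sum(xs) / n
    total_s2 += sum((x - xbar)**2 for x in xs) / (n - 1)
print(total_s2 / trials)  # close to 1; dividing by n would give about 0.8
```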
Then we started a discussion of probability generating functions (PGFs) which are covered in Section 6.7 of the book. We looked at these examples:
In general, the PGF of a discrete random variable is a function of a variable $t$ in which the coefficient of $t^k$ is the probability of the outcome $k$ in the probability model.
Theorem. If $X$ and $Y$ are independent discrete RVs with PGFs $f_X(t)$ and $f_Y(t)$, then $X+Y$ has PGF $f_X(t) \cdot f_Y(t).$
Find the PGF for rolling 6 six-sided dice, and use it to find probability that the total is 20.
Find the PGF for flipping a coin once. What about 10 times?
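PGF multiplication is just polynomial multiplication on coefficient lists; here is a sketch with exact fractions (the helper name is my own):

```python
from fractions import Fraction

# Coefficient of t^k in the PGF of one fair die (k = 0..6)
die = [Fraction(0)] + [Fraction(1, 6)] * 6

def pgf_multiply(f, g):
    """Multiply two PGFs given as coefficient lists."""
    out = [Fraction(0)] * (len(f) + len(g) - 1)
    for i, a in enumerate(f):
        for j, b in enumerate(g):
            out[i + j] += a * b
    return out

pgf = [Fraction(1)]  # PGF of the constant 0
for _ in range(6):
    pgf = pgf_multiply(pgf, die)
print(pgf[20])  # P(total of six dice is 20)
```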
Day | Sections | Topic |
---|---|---|
Wed, Oct 18 | 6.7 | Probability generating functions - con’d |
Fri, Oct 20 | 6.4 - 6.5 | Moment generating functions |
Today we discussed probability generating functions in more detail. We also talked about using generating functions for counting too. We did this in-class workshop:
For people who finished the workshop early, I suggested this extra problem:
Today we introduced moment generating functions (MGFs). We calculated several examples:
Find the MGFs for a Bernoulli(p) random variable and for a six-sided die.
Find the MGF for a Unif(a,b) random variable.
Find the MGF for $X \sim \operatorname{Exp}(1)$.
Then we talked about these important theorems:
Theorem. (MGFs completely determine the distribution) If two R.V.s have the same MGF on an open interval around 0, then they have the same probability distributions.
Theorem. (MGFs for sums of independent R.V.s) If $X, Y$ are independent R.V.s with MGFs $m_X(t)$ and $m_Y(t)$, then $X+Y$ has MGF $m_X(t) \cdot m_Y(t)$.
We didn’t prove these theorems, but we did talk a little about why the second one is true.
Finally we explained why they are called moment generating functions by proving:
Theorem. If $X$ is a R.V. with MGF $m_X(t)$, then the k-th moment of $X$ is the k-th derivative of $m_X(t)$ at $t=0$: $E(X^k) = m_X^{(k)}(0).$
We also pointed out that not every R.V. has a moment generating function, because the integral (or sum) that defines the MGF might not converge when $t \ne 0$.
Day | Sections | Topic |
---|---|---|
Mon, Oct 23 | 6.6 | Sums of independent r.v.s. |
Wed, Oct 25 | Review | |
Fri, Oct 27 | no class | |
Today we started with two exercises:
Let $X \sim \operatorname{Pois}(\lambda)$, so the PMF for $X$ is: $e^{-\lambda} \lambda^k/ k!$. Find the MGF $M_X(t)$.
Let $Z \sim \operatorname{Norm}(0,1)$. Find the MGF for $Z$. Hint: You’ll have to complete the square: $tx-x^2/2 = -\frac{x^2 - 2tx}{2} = -\frac{(x - t)^2}{2} + \frac{t^2}{2}.$
If $X$ is any random variable with MGF $M_X(t)$, what is the MGF for $Y = aX + b$? Hint: use the definition.
Use your answer to the previous problem to find the MGF for $\mu + \sigma Z$.
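Putting the last three exercises together: since $M_Z(t) = e^{t^2/2}$ and $M_{aX+b}(t) = e^{bt} M_X(at)$, we get $M_{\mu + \sigma Z}(t) = e^{\mu t}\, M_Z(\sigma t) = e^{\mu t + \sigma^2 t^2/2},$ which is the MGF of a $\operatorname{Norm}(\mu, \sigma)$ random variable.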
After we did these two exercises, we looked at this Table of Moment Generating Functions. Then we proved these two theorems:
Theorem 1. Let $X \sim \operatorname{Pois}(\lambda)$ and $Y \sim \operatorname{Pois}(\mu)$ be independent. Then $X+Y \sim \operatorname{Pois}(\lambda + \mu)$.
Theorem 2. Let $X \sim \operatorname{Norm}(\mu_X, \sigma_X)$ and $Y \sim \operatorname{Norm}(\mu_Y, \sigma_Y)$ be independent. Then $X+Y \sim \operatorname{Norm}(\mu_X + \mu_Y, \sqrt{\sigma_X^2 + \sigma_Y^2}).$
Today we looked at some examples similar to what might be on the midterm on Monday.
Day | Sections | Topic |
---|---|---|
Mon, Oct 30 | Midterm 2 | |
Wed, Nov 1 | 7.1 | Joint distributions |
Fri, Nov 3 | 7.1 | Marginal & conditional distributions |
Today we introduced joint distributions. First we defined joint PMFs for discrete random variables:
Definition. A function $f(x,y)$ is a joint probability mass function for two discrete r.v.s $X$ and $Y$ if $P(X=x \text{ and } Y = y) = f(x,y)$ for all pairs $(x,y)$ in the support of $X$ and $Y$.
We briefly looked at example 7.1.5 from the book before moving on to joint PDFs for continuous random variables:
Definition. A function $f(x,y)$ is a joint probability density function for two continuous r.v.s $X$ and $Y$ if $P(a \le X \le b \text{ and } c \le Y \le d) = \int_c^d \int_a^b f(x,y) \, dx dy.$
We did the example of a uniform distribution on a circle, then we did this workshop:
If you need to review double integrals, I recommend trying these videos & examples on Khan Academy.
Today we talked about marginal and conditional distributions for jointly distributed random variables. If $X, Y$ are jointly distributed, then
The marginal PDF/PMFs $f_X(x)$ and $f_Y(y)$ tell you how $X$ and $Y$ are distributed when you don’t care or know about the other random variable.
$f_X(x) = \sum_y f(x,y) \text{ or } \int_{-\infty}^\infty f(x,y) \, dy.$
The conditional PDF/PMFs $f_{X\,|\,Y}(x\,|\,Y=y)$ and $f_{Y\,|\,X}(y\,|\,X=x)$ tell you what the distributions of $X$ and $Y$ are when you know the value of the other.
$f_{X \, | \, Y}(x \, | \, Y = y) = \frac{f(x,y)}{f_Y(y)}.$
The book has a good picture to help understand how the shape of the conditional PDF comes from the shape of the joint PDF after you renormalize: see Figure 7.5 on page 325.
We did these examples.
Day | Sections | Topic |
---|---|---|
Mon, Nov 6 | 7.2 | 2D Lotus |
Wed, Nov 8 | 7.3 | Covariance & correlation |
Fri, Nov 10 | 7.5 | Multivariate normal distribution |
Today we introduced the 2-dimensional version of the Law of the Unconscious Statistician (2D LOTUS). For random variables $X, Y$ with joint PDF or PMF $f(x,y)$, $E(g(X,Y)) = \int_{-\infty}^\infty \int_{-\infty}^\infty g(x,y) f(x,y) \, dx \, dy \text{ or } \sum_x \sum_y g(x,y) f(x,y).$
Suppose $X, Y \sim \operatorname{Unif}(0,1)$ are i.i.d. RVs. Find $E(|X-Y|)$.
Let $X, Y$ be any independent random variables with joint distribution $f(x,y)$. Then prove that $E(XY) = E(X) E(Y).$
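A Monte Carlo check of the first example (the double integral gives exactly $1/3$):

```python
import random

random.seed(3)
trials = 200_000
est = sum(abs(random.random() - random.random())
          for _ in range(trials)) / trials
print(est)  # close to 1/3
```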
After these examples, we defined the covariance of two random variables: $\operatorname{Cov}(X,Y) = E((X-\mu_X)(Y-\mu_Y)).$
Show that $\operatorname{Cov}(X,Y) = E(XY) - E(X)E(Y)$.
Show that $\operatorname{Cov}(X,Y) = 0$ if $X, Y$ are independent.
We also discussed the following additional properties of covariance:
Finally, we defined the correlation between two random variables: $\rho(X,Y) = \frac{\operatorname{Cov}(X,Y)}{\sigma_X \sigma_Y}.$
Today we introduced multivariate normal distributions. Suppose that $X_1, \ldots, X_n \overset{i.i.d.}{\sim} \operatorname{Norm}(0,1)$. We can arrange the values of $X_1, \ldots X_n$ into a vector $X = [X_1, \ldots, X_n]^T$ in $\mathbb{R}^n$ which we call a random vector. Then for any $m$-by-$n$ matrix $A \in \mathbb{R}^{m \times n}$ and any vector $b \in \mathbb{R}^m$, the random vector $Y = AX + b$ has a multivariate normal distribution.
For any multivariate normal random vector $Y = AX + b$ where $X$ is a standard normal random vector, the covariance matrix for $Y$ is $\Sigma = AA^T.$ The entries of the covariance matrix are $\Sigma = \begin{bmatrix} \operatorname{Cov}(Y_1,Y_1) & \operatorname{Cov}(Y_1, Y_2) & \ldots & \operatorname{Cov}(Y_1, Y_m) \\ \operatorname{Cov}(Y_2,Y_1) & \ddots & & \vdots \\ \vdots & & \ddots & \vdots \\ \operatorname{Cov}(Y_m,Y_1) & \ldots & \ldots & \operatorname{Cov}(Y_m, Y_m) \end{bmatrix}.$
The heights of fathers and their adult sons have a moderately strong correlation $\rho = 0.5$. Both father’s and son’s heights (measured in standard deviations from the mean) have a standard normal distribution $\operatorname{Norm}(0,1)$. But they are not independent. Instead, find the covariance matrix for this situation.
You can think of the height of the son $Y$ as a linear combination of his father’s height $X$ and an additional independent random component $Z$ that has a standard normal distribution. Find coefficients for $aX + bZ$ such that the resulting normal distribution has standard deviation $1$ and $\operatorname{Cov}(aX+bZ,X) = 0.5$.
Compute $AA^T$ where $A = \begin{bmatrix} 1 & 0 \\ a & b \end{bmatrix}.$ Do you get the correct covariance matrix for the heights of fathers and sons?
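With $\rho = 0.5$, the previous problem gives $a = 0.5$ and $b = \sqrt{1 - a^2} = \sqrt{3}/2$; here is a quick numerical check of $AA^T$:

```python
import math

a = 0.5                  # so Cov(aX + bZ, X) = a = rho
b = math.sqrt(1 - a**2)  # keeps Var(aX + bZ) = 1
A = [[1, 0],
     [a, b]]
# Sigma = A A^T
sigma = [[sum(A[i][k] * A[j][k] for k in range(2)) for j in range(2)]
         for i in range(2)]
print(sigma)  # [[1, 0.5], [0.5, 1.0]] up to rounding
```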
Today we started with some examples that applied the following theorem about multivariate normal distributions.
Theorem. If $X = \begin{bmatrix} X_1 \\ \vdots \\ X_n\end{bmatrix}$ is a random vector with a multivariate normal distribution, and if $A \in \mathbb{R}^{m \times n}$ is a matrix, then $Y = AX$ has a multivariate normal distribution. In the special case where $A$ has only one row (i.e., $m=1$), $Y$ has a normal distribution.
We used this theorem to help solve the following problems.
After that, we introduced the log-normal distribution. This is the distribution of $Y = e^X$ if $X \sim \operatorname{Norm}(0,1)$.
Find $P(Y \le y)$ using $\Phi$ to represent the standard normal CDF (which doesn’t have a nice formula).
Differentiate the CDF for $Y$ to find the PDF for $Y$.
Find a formula for the moments $E(Y^n)$. Hint: $E(Y^n) = E(e^{nX})$ which looks a lot like the MGF $m_X(t) = E(e^{tX})$. So you can use the MGF for a standard normal $X$ to find the moments of $Y$.
Day | Sections | Topic |
---|---|---|
Mon, Nov 13 | 8.1 | Change of variables |
Wed, Nov 15 | 8.2 | Convolutions |
Fri, Nov 17 | Review | |
Mon, Nov 20 | Midterm 3 | |
Today we did more examples of change of variables for random variables. We focused on one dimensional examples.
Let $U \sim \operatorname{Unif}(0,1)$. Find PDF for $U^2$.
Here is how you can generate an exponentially distributed r.v. $X$. Start by randomly generating $U \sim \operatorname{Unif}(0,1)$. Then apply the function $f(x) = -\ln x$ to $U$. Prove that $X = -\ln U$ has the $\operatorname{Exp}(1)$ distribution.
Let $X$ be any r.v. For any monotone (i.e., either always increasing or always decreasing) differentiable function $g$ defined on the support of $X$, let $Y = g(X)$. Prove that the PDF of $Y$ is $f_Y(y) = f_X(x) \left| \dfrac{dx}{dy} \right|$.
Find the PDF for $X^3$ where $X \sim \operatorname{Norm}(0,1)$.
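The construction $X = -\ln U$ above is a special case of inverse transform sampling. A minimal simulation sketch to make the claim plausible (the seed and sample size are arbitrary):

```python
import math
import random

random.seed(0)
N = 100_000

# X = -ln(U) for U ~ Unif(0, 1); we use 1 - random.random() so the
# argument of log is in (0, 1] and never exactly 0.
xs = [-math.log(1.0 - random.random()) for _ in range(N)]

# An Exp(1) random variable has mean 1 and P(X > 1) = e^{-1}
print(sum(xs) / N)                 # should be close to 1
print(sum(x > 1 for x in xs) / N)  # should be close to e^{-1}, about 0.368
```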
We also defined the $\chi^2$-distribution, which is the distribution of a sum of i.i.d. random variables $X_1^2 + \ldots + X_n^2$ where each $X_i \sim \operatorname{Norm}(0,1)$. When $n = 1$, we were able to find the PDF for the $\chi^2(1)$ distribution even though the function $g(x) = x^2$ is not monotone increasing on the whole real line.
Today we defined the convolution of two functions: $(f_X \ast f_Y)(t) = \int_{-\infty}^\infty f_Y(t-x) f_X(x) \, dx = \int_{-\infty}^\infty f_X(t-y) f_Y(y) \, dy.$ If $X, Y$ are independent continuous random variables with PDFs $f_X$ and $f_Y$, then $X+Y$ has PDF $f_X \ast f_Y$. There is also a discrete convolution, which gives the PMF for a sum of two independent discrete random variables: $(f_X \ast f_Y)(k) = \sum_{x} f_Y(k-x) f_X(x) = \sum_y f_X(k-y) f_Y(y).$
Find the PMF for the sum of two 6-sided dice.
If $X, Y \overset{i.i.d.}{\sim} \operatorname{Unif}(0,1)$, find the PDF for $X+Y$. (https://youtu.be/Blg5RIjGwBE)
To help with the notation in this last problem, we introduced indicator functions $\mathbf{1}_A$. (See https://en.wikipedia.org/wiki/Indicator_function)
Find the PDF for the sum of $X, Y \overset{i.i.d.}{\sim} \operatorname{Exp}(\lambda)$. (https://youtu.be/Glff9dvPVEg)
Last time, we found the distribution for $X^2$ if $X \sim \operatorname{Norm}(0,1)$. The PDF for $X^2$ is $f_{X^2}(x) = \frac{1}{\sqrt{2 \pi x}} e^{-x/2}.$ This is the $\chi^2(1)$ PDF. Now suppose that we have $X, Y \overset{i.i.d.}{\sim} \operatorname{Norm}(0,1)$ random variables. Set up, but don’t evaluate, a convolution integral for the PDF of $X^2 + Y^2$.
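The discrete convolution defined above can be computed directly in Python. Here is a sketch for the two-dice problem, using `fractions` for exact arithmetic:

```python
from fractions import Fraction

die = {k: Fraction(1, 6) for k in range(1, 7)}  # PMF of one fair 6-sided die

def convolve(f, g):
    """PMF of the sum of two independent discrete r.v.s with PMFs f and g."""
    h = {}
    for x, px in f.items():
        for y, py in g.items():
            h[x + y] = h.get(x + y, Fraction(0)) + px * py
    return h

two_dice = convolve(die, die)
print(two_dice[7])  # 1/6: six of the 36 equally likely outcomes sum to 7
print(two_dice[2])  # 1/36: only 1 + 1 sums to 2
```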
Today we did a review of material that will be on midterm 3. This includes:
We did these examples in class:
Suppose $X, Y$ are discrete r.v.s. each taking values in $\{0,1\}$ with joint PMF $f(x,y)$ given by $f(0,0) = \tfrac{1}{2}, ~f(0,1) = \tfrac{1}{3}, ~f(1,0) = \tfrac{1}{6}, ~f(1,1) = 0.$
Suppose that $(X,Y)$ are continuous r.v.s. that are jointly uniformly distributed in the unit disk $D = \{(x,y) \in \mathbb{R}^2 \, : \, x^2+y^2 \le 1\}.$
Suppose $X,Y$ are both $\operatorname{Norm}(0,1)$ r.v.s., and the correlation between $X$ and $Y$ is $\rho = 0.5$. Find $P(X+2Y \ge 3)$.
Let $X \sim \operatorname{Unif}(0,2)$ and $Y \sim \operatorname{Unif}(0,1)$ be independent r.v.s. Find the PDF for $X+Y$.
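For the correlated-normal problem above: assuming $(X,Y)$ is bivariate normal (so that $X+2Y$ is itself normal), $X+2Y$ has mean $0$ and variance $\operatorname{Var}(X) + 4\operatorname{Var}(Y) + 4\operatorname{Cov}(X,Y) = 1 + 4 + 4\rho = 7$. A numeric check with the standard library:

```python
import math
from statistics import NormalDist

rho = 0.5
var = 1 + 4 + 4 * rho  # Var(X + 2Y) = Var(X) + 4 Var(Y) + 4 Cov(X, Y)

# P(X + 2Y >= 3) = 1 - Phi(3 / sqrt(7))
p = 1 - NormalDist(0, math.sqrt(var)).cdf(3)
print(p)  # about 0.128
```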
Day | Sections | Topic |
---|---|---|
Mon, Nov 27 | 10.1 - 10.2 | Inequalities & Law of large numbers |
Wed, Nov 29 | 10.3 | Central limit theorem |
Fri, Dec 1 | 10.3 | Applications of LLN & CLT |
Mon, Dec 4 | | Review & recap |
Today we introduced two important inequalities: the Markov Inequality and Chebyshev’s Inequality.
Theorem (Markov’s Inequality). For any r.v. $X$ and constant $a > 0$, $P(|X| \ge a) \le \frac{E(|X|)}{a}.$
We gave a visual proof by looking at the graph of the PDF for $X$ (assuming $X$ is continuous, but the proof is essentially the same for discrete r.v.s.) and comparing the integrals to find $P(|X| \ge a)$ with the one to find $E(\tfrac{1}{a} |X|)$. Which integral is bigger?
A corollary of Markov’s inequality is the following more useful inequality.
Theorem (Chebyshev’s Inequality). Let $X$ be any r.v. with mean $\mu$ and variance $\sigma^2$. Let $a > 0$. Then $P(|X - \mu | \ge a) \le \frac{\sigma^2}{a^2}.$
Prove Chebyshev’s inequality by applying Markov’s inequality to the random variable $(X-\mu)^2$.
Here on Earth, heights of adults are roughly normally distributed. But if you go to another planet, they might have a totally different probability distribution. Explain how we can nevertheless be certain that at most 25% of Martians have a height that is 2 or more standard deviations above average.
If $X_1, \ldots, X_n$ are i.i.d. r.v.s. with mean $\mu$ and standard deviation $\sigma$, then what are the mean and standard deviation of $\bar{X} = \frac{X_1 + \ldots + X_n}{n}?$
What does Chebyshev’s inequality say about the probability $P(|\bar{X} - \mu| \ge a)$? What happens as $n$ gets bigger?
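Combining the answers to the last two exercises gives (a form of) the weak law of large numbers: since $\bar{X}$ has mean $\mu$ and variance $\sigma^2/n$, Chebyshev's inequality yields

```latex
P\big(|\bar{X} - \mu| \ge a\big) \le \frac{\sigma^2}{n a^2} \longrightarrow 0
\quad \text{as } n \to \infty,
```

so the sample mean concentrates around $\mu$ as the sample size grows.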
Today we talked about the central limit theorem.
Central Limit Theorem. Let $X_1, X_2, \ldots, X_n$ be i.i.d. r.v.s. with mean $\mu$ and variance $\sigma^2$. Then the PMF or PDF for the random variable $Z = \frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ converges to the PDF for a $\operatorname{Norm}(0,1)$ random variable as $n \rightarrow \infty$.
We proved this theorem in class under the extra assumption that the MGF for $X_i$ exists. For simplicity we also assumed that $\mu = 0$ and $\sigma = 1$. Neither assumption is actually necessary (and it is easy to remove the second one by replacing each $X_i$ with $(X_i - \mu)/\sigma$).
Let $M(t)$ denote the MGF for each of the $X_i$. How do we know that they all have the same MGF?
Show that the MGF for $Z = \sqrt{n}\,\bar{X} = \frac{X_1 + \cdots + X_n}{\sqrt{n}}$ is $M(t/\sqrt{n})^n$.
What are $M(0)$, $M'(0)$, and $M''(0)$?
Find $\lim_{n \rightarrow \infty} M(t/\sqrt{n})^n$.
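One way to organize the last computation: a second-order Taylor expansion of $M$ around $0$, using $M(0) = 1$, $M'(0) = \mu = 0$, and $M''(0) = E(X_i^2) = 1$, gives

```latex
M\!\left(\frac{t}{\sqrt{n}}\right)^{n}
= \left(1 + \frac{t^2}{2n} + o\!\left(\frac{1}{n}\right)\right)^{\!n}
\longrightarrow e^{t^2/2} \quad \text{as } n \to \infty,
```

which is the MGF of a $\operatorname{Norm}(0,1)$ random variable.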
We finished with this exercise:
The key to the last problem is to use these normal approximation facts:
Corollary (Normal Approximations). If $X_1, \ldots, X_n$ are i.i.d. RVs with mean $\mu$ and variance $\sigma^2$ and $n$ is large, then
The total $X_1 + X_2 + \ldots + X_n$ is approximately $\operatorname{Norm}(n \mu, \sqrt{n} \sigma)$.
The sample mean $\bar{X}$ is approximately $\operatorname{Norm}(\mu, \tfrac{\sigma}{\sqrt{n}})$.
Today we talked about applications of the Central Limit Theorem and the Law of Large Numbers. We started with this corollary of the Central Limit Theorem that we didn’t write down explicitly in class last time:
Corollary (Normal Approximations). If $X_1, \ldots, X_n$ are i.i.d. RVs with mean $\mu$ and variance $\sigma^2$ and $n$ is large, then
The total $X_1 + X_2 + \ldots + X_n$ is approximately $\operatorname{Norm}(n \mu, \sqrt{n} \sigma)$.
The sample mean $\bar{X}$ is approximately $\operatorname{Norm}(\mu, \tfrac{\sigma}{\sqrt{n}})$.
We also used the corollary to derive the formula for a 95% confidence interval:
95% Confidence Intervals. In a large sample, $\bar{X}$ is within 2 standard deviations of $\mu$ about 95% of the time. So if you know $\sigma$ (or have a good estimate of it), then you can use $\bar{X}$ to estimate $\mu$ with the interval
$\bar{x} \pm 2 \frac{\sigma}{\sqrt{n}}.$
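A simulation sketch of this interval (the distribution, seed, and sample size are arbitrary illustrative choices, with $\sigma$ treated as known):

```python
import math
import random
import statistics

random.seed(1)
mu_true, sigma, n = 10.0, 2.0, 400  # pretend sigma is known
sample = [random.gauss(mu_true, sigma) for _ in range(n)]

xbar = statistics.fmean(sample)
half_width = 2 * sigma / math.sqrt(n)  # two standard errors
print((xbar - half_width, xbar + half_width))  # should usually contain mu_true
```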
We finished by introducing Monte Carlo Integration.
Find $\int_0^1 \sqrt{1-x^2} \, dx$ using Monte Carlo integration. Write a computer program to randomly generate points uniformly in the square $[0,1] \times [0,1]$, then record 1 if the point is under the curve, or 0 if it is not.
When you randomly generate points in a rectangle and calculate the proportion that hit the region you want, what is the approximate probability distribution for the proportion that hit? What is its mean and standard deviation?
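A minimal sketch of the suggested program (the seed and sample size are arbitrary); the exact value of the integral is $\pi/4$, since the region under the curve is a quarter of the unit disk:

```python
import math
import random

random.seed(0)
N = 100_000
hits = 0
for _ in range(N):
    x, y = random.random(), random.random()  # uniform point in [0,1] x [0,1]
    if x * x + y * y <= 1:                   # below the curve y = sqrt(1 - x^2)
        hits += 1

estimate = hits / N
print(estimate, math.pi / 4)  # estimate should be close to 0.7854
```

The hit proportion is a sample mean of i.i.d. indicator variables, so the normal approximation corollary applies to it.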
Today we did a review of some material that might be on the final. We did the following problems in class:
Suppose 10 men and 10 women get in a line in a random order. Find the probability that the 10 men are in front of the 10 women in the line.
Let $X$ be a random variable that is partially determined by flipping a coin. If the coin is heads, then $X \sim \operatorname{Exp}(1)$ and if the coin is tails, then $X \sim \operatorname{Unif}(0,1)$. Find $P( \text{head} \,|\, X \ge 0.9)$.
Let $X \sim \operatorname{Norm}(0,1)$ and $Y \sim \operatorname{Exp}(1)$ be independent random variables. Set up, but do not evaluate, an integral that represents $P(Y \ge X)$.
We also reviewed two problems from midterm 3: the problem about finding the standard deviation of a difference of two correlated normal random variables, and the problem of finding the expected value of a function of a pair of random variables that are jointly uniform on a triangle.