Probability Distributions & Sampling Distributions

GOV 2000 Section 4: Probability Distributions & Sampling Distributions
Konstantin Kashin [1]
Harvard University
September 26, 2012

[1] These notes and accompanying code draw on the notes from Molly Roberts, Maya Sen, Iain Osgood, Brandon Stewart, and TFs from previous years.

Outline
- Administrative Details
- Probability Distributions
- Sampling Distributions
- Estimating Sampling Distributions

Problem Set Expectations
- Third problem set distributed yesterday.
- Corrections for first problem set due next Tuesday.
- Must be typeset using LaTeX or Word and submitted electronically as one PDF document containing graphics and explanation.
- Must be accompanied by source-able, commented code.

Midterm Exam
- Window of exam: Tuesday, October 9th (after class) - Sunday, October 14th at 11:59pm
- Needs to be completed in 5 hours
- Open note / book, but no collaboration allowed

Sampling from Common Probability Distributions

How do we sample from the normal distribution with µ = 0 and σ = 2 in R?

    set.seed(12345)
    rnorm(10, mean = 0, sd = 2)

How do we sample from the uniform distribution on the interval [0, 10] in R?

    set.seed(12345)
    runif(10, min = 0, max = 10)
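
R's distribution functions share a prefix scheme: r draws random values, d evaluates the density, p evaluates the CDF, and q evaluates the quantile function. The following is a minimal sketch for the same normal distribution (µ = 0, σ = 2); the variable name draws and the choice of 10,000 draws are illustrative assumptions, not part of the original notes.

```r
set.seed(12345)

# r*: random draws from N(mean = 0, sd = 2)
draws <- rnorm(10000, mean = 0, sd = 2)

# With many draws, the sample moments should sit close to the parameters
mean(draws)   # close to 0
sd(draws)     # close to 2

# d*: density at a point, p*: CDF, q*: quantile function (inverse CDF)
dnorm(0, mean = 0, sd = 2)
pnorm(2, mean = 0, sd = 2)        # P(X <= 2), one sd above the mean
qnorm(0.975, mean = 0, sd = 2)    # 97.5th percentile, about 1.96 * 2
```

The same prefixes apply to the other distributions used in this section (for example runif, rexp, and pexp).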

What is the probability that a random variable lies in a particular subdomain?

X is the wait time (in minutes) for the red line in the morning. Let X ~ Expo(0.1). What is the probability that X ∈ [1, 5], that is, that one has to wait more than a minute but less than 5 minutes?

[Figure: density of X; x-axis: X (Wait Time in Minutes), 0 to 50; y-axis: Density]

Analytic Approach: CDFs

$P(1 \le X \le 5) = P(X \le 5) - P(X \le 1) = F_X(5) - F_X(1)$

In R:

    pexp(q = 5, rate = 0.1) - pexp(q = 1, rate = 0.1)

$P(1 \le X \le 5) \approx 0.298$
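
For the exponential distribution the CDF has a closed form, F(x) = 1 - exp(-rate * x) for x ≥ 0, so the same probability can be worked out by hand as exp(-0.1) - exp(-0.5). The sketch below checks that against pexp; the helper name F.closed is hypothetical.

```r
rate <- 0.1

# Closed-form CDF of the exponential distribution: F(x) = 1 - exp(-rate * x)
F.closed <- function(x) 1 - exp(-rate * x)

# P(1 <= X <= 5) = F(5) - F(1) = exp(-0.1) - exp(-0.5)
p.closed <- F.closed(5) - F.closed(1)

# Same quantity via R's built-in exponential CDF
p.builtin <- pexp(q = 5, rate = rate) - pexp(q = 1, rate = rate)

# Both round to 0.298
round(c(closed = p.closed, builtin = p.builtin), 3)
```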

Simulation Approach

1. Sample from the distribution of interest to approximate it.
2. Calculate the proportion of observations in the sample that fall in the subdomain of interest.

In R:

    set.seed(12345)
    exp.vec <- rexp(n = 10000, rate = 0.1)
    mean(1 <= exp.vec & exp.vec <= 5)

$P(1 \le X \le 5) \approx 0.3016$
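
The simulated answer differs slightly from the analytic one because it is based on a finite number of draws. As a rough illustration, the sketch below repeats the two steps for several sample sizes and reports the gap to the analytic value; the helper sim.prob and the grid of sample sizes are assumptions made for this example. The error typically shrinks on the order of 1/sqrt(n), though not necessarily monotonically for any single seed.

```r
set.seed(12345)
truth <- pexp(q = 5, rate = 0.1) - pexp(q = 1, rate = 0.1)

# Hypothetical helper: simulated estimate of P(1 <= X <= 5) from n draws
sim.prob <- function(n) {
  x <- rexp(n = n, rate = 0.1)
  mean(1 <= x & x <= 5)
}

n.grid <- c(100, 10000, 1000000)
est <- sapply(n.grid, sim.prob)

# Absolute error of each simulated estimate relative to the analytic answer
data.frame(n = n.grid, estimate = est, abs.error = abs(est - truth))
```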

Our Data

We are going to work with the precincts dataset from fulton.rdata.

- Election data at the precinct level for Fulton County, Georgia.
- Population: 268 precincts
- Variables: turnout rate, % black, % female, mean age, turnout in Dem. primary, turnout in Rep. primary, dummy variable for whether precinct is in Atlanta, and location dummies for polling stations

Let's define X = % black.

Visualizing the Population

What does the population of X = % black look like?

[Figure: histogram of X in the population; x-axis: % Black in Precinct (X), 0.0 to 1.0; y-axis: Frequency]

Calculate True (Population) Mean

    mean(precincts$black)

µ = 0.506

Sampling of Precincts

Suppose we can only sample (SRS without replacement) n = 40 precincts. How do we do this in R?

First, note that we can subset the dataset as:

    precincts[c(10, 3, 200), ]

To sample n = 40 rows / precincts then:

    set.seed(12345)
    n.sample <- 40
    N <- nrow(precincts)
    rand.rows <- sample(N, size = n.sample, replace = FALSE)
    mypoll <- precincts[rand.rows, ]
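
Since fulton.rdata is not reproduced here, the row-sampling pattern can be tried on a stand-in. The sketch below builds a hypothetical data frame, fake.precincts, with 268 rows and a made-up black column (these are not the Fulton County values), then checks that sampling without replacement returns 40 distinct rows.

```r
set.seed(12345)

# Hypothetical stand-in for the precincts data: 268 rows with a made-up
# `black` proportion column (NOT the real Fulton County values)
fake.precincts <- data.frame(id = 1:268, black = rbeta(268, 0.5, 0.5))

n.sample <- 40
N <- nrow(fake.precincts)
rand.rows <- sample(N, size = n.sample, replace = FALSE)
fake.poll <- fake.precincts[rand.rows, ]

nrow(fake.poll)                # 40 rows in the sample
anyDuplicated(fake.poll$id)    # 0: no precinct is drawn twice
```

With the real data loaded, the same calls apply with precincts in place of fake.precincts.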

Visualizing the Sample

What does the sample of X = % black look like?

[Figure: histogram of X in the sample; x-axis: % Black in Precinct (X), 0.0 to 1.0; y-axis: Frequency]

Calculate Sample Mean (X̄)

    mean(mypoll$black)

X̄ = 0.412

Connecting this to Resampling

We can think of our sample as one of many possible samples we can draw from our population:

Samples from Pop.   1       2          3          ...   10000
X_1                 0.496   x_{1,2}    x_{1,3}    ...   x_{1,10000}
X_2                 0.320   x_{2,2}    x_{2,3}    ...   x_{2,10000}
...                 ...     ...        ...        ...   ...
X_40                0.172   x_{40,2}   x_{40,3}   ...   x_{40,10000}
µ̂ = X̄_40            0.412   x̄_2        x̄_3        ...   x̄_10000
σ̂² = S²             0.160   s²_2       s²_3       ...   s²_10000

Sampling Distribution of X̄

The sampling distribution of X̄ is the distribution of the following vector, the row of sample means (µ̂ = X̄_40) taken across all samples:

(0.412, x̄_2, x̄_3, ..., x̄_10000)

Calculating Sampling Distribution with Known Population

1. Start with complete population.
2. Define a quantity of interest (the parameter). For us, it's µ.
3. Choose a plausible estimator. For us, it's µ̂ = X̄.
4. Draw a sample from the population and calculate the estimate using the estimator.
5. Repeat step 4 many times (we will do 10,000).

We already saw how to take one sample in R, but let's now repeat it 10,000 times and store X̄ for each sample:

    set.seed(12345)
    n.sample <- 40
    N <- nrow(precincts)
    xbar.vec <- replicate(n = 10000,
        mean(precincts[sample(N, size = n.sample, replace = FALSE), ]$black))
    plot(density(xbar.vec), col = "navy", lwd = 2,
        main = "Sampling Distribution of Sample Mean", xlab = "Sample Mean")

Visualizing Sampling Distributions

Here is the sampling distribution for X̄:

[Figure: density of the simulated sample means; x-axis: Sample Mean, 0.3 to 0.7; y-axis: Density; legend: Original Sample Mean, Pop. Mean]

Characterizing Sampling Distribution of Sample Mean

We know that in a large sample (large enough for the Central Limit Theorem to kick in), the sample mean will be approximately distributed as:

$\bar{X} \sim N(\mu, \sigma^2/n)$
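
One way to sanity-check this approximation is to compare the simulated sampling distribution with the mean and standard deviation the formula implies. The sketch below uses a hypothetical stand-in population, pop (not the real precinct data). One caveat it illustrates: because the draws here are made without replacement from a finite population of N = 268, the simulated standard deviation of X̄ tends to fall somewhat below σ/√n and closer to the value with the finite population correction sqrt((N - n)/(N - 1)) applied.

```r
set.seed(12345)

# Hypothetical stand-in population (NOT the real precinct data)
pop <- rbeta(268, 0.5, 0.5)
N <- length(pop)
n.sample <- 40

mu <- mean(pop)
sigma <- sqrt(mean((pop - mu)^2))   # population sd (divides by N, not N - 1)

# Simulated sampling distribution of the sample mean (SRS without replacement)
xbar.vec <- replicate(n = 10000,
                      mean(sample(pop, size = n.sample, replace = FALSE)))

se.clt <- sigma / sqrt(n.sample)                    # sd implied by N(mu, sigma^2 / n)
se.fpc <- se.clt * sqrt((N - n.sample) / (N - 1))   # with finite population correction

c(mean.xbar = mean(xbar.vec), mu = mu)
c(sd.xbar = sd(xbar.vec), se.clt = se.clt, se.fpc = se.fpc)
```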

Sampling Distributions for Other Statistics

We can calculate infinitely many statistics from a sample. If we have the population, we can simulate a sampling distribution for those! For example, we can look at the sampling distribution of S, the sample standard deviation:

    set.seed(12345)
    n.sample <- 40
    N <- nrow(precincts)
    sample.fxn <- function() {
        poll.i <- precincts[sample(N, size = n.sample, replace = FALSE), ]$black
        return(c(mean(poll.i), sd(poll.i)))
    }
    out.df <- replicate(n = 10000, sample.fxn())
    head(out.df)
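
A note on the shape of the result: when the replicated expression returns a length-2 vector, replicate() simplifies the output to a matrix with 2 rows and one column per replication, so head() on it returns both rows across all 10,000 columns. The sketch below, run on a hypothetical stand-in population (not the real precinct data), names the two rows and pulls out each sampling distribution separately.

```r
set.seed(12345)

# Hypothetical stand-in population (NOT the real precinct data)
pop <- rbeta(268, 0.5, 0.5)

sample.fxn <- function() {
  poll.i <- sample(pop, size = 40, replace = FALSE)
  c(mean = mean(poll.i), sd = sd(poll.i))
}

# replicate() binds the length-2 results column-wise: a 2 x 10000 matrix
out <- replicate(n = 10000, sample.fxn())
dim(out)

xbar.vec <- out["mean", ]   # sampling distribution of the sample mean
s.vec    <- out["sd", ]     # sampling distribution of S

out[, 1:6]                  # compact preview: first six replications
summary(s.vec)
```

Passing s.vec to plot(density(...)) would give a plot analogous to the one for the sample mean.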

Visualizing Sampling Distributions

Here is the sampling distribution for S:

[Figure: density of the simulated sample standard deviations; x-axis: Sample Standard Deviation, 0.35 to 0.45; y-axis: Density]

Sampling Distribution of X̄ with Unknown Population

In reality, we only draw one sample and cannot directly observe the sampling distribution of X̄:

Samples from Pop.   1       2   3   ...   10000
X_1                 0.496   ?   ?   ...   ?
X_2                 0.320   ?   ?   ...   ?
...                 ...     ?   ?   ...   ?
X_40                0.172   ?   ?   ...   ?
µ̂ = X̄_40            0.412   ?   ?   ...   ?
σ̂² = S²             0.160   ?   ?   ...   ?

How do we estimate the sampling distribution?

Characterizing Sampling Distribution of Sample Mean

We know that in a large sample (large enough for the Central Limit Theorem to kick in), the sample mean will be approximately distributed as:

$\bar{X} \sim N(\mu, \sigma^2/n)$

Estimating the Sampling Distribution of Sample Mean

We can estimate the distribution on the previous slide using estimators for the population mean µ and the population variance σ². We will use the following estimators:

$\hat{\mu} = \bar{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$

$\hat{\sigma}^2 = S_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})^2$

The estimated sampling distribution is thus:

$\bar{X} \sim N(\bar{X}, S_X^2/n)$
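
As a rough sketch of how these estimators translate into R, the block below computes µ̂, S², and the implied standard deviation sqrt(S²/n) from a single sample. The vector x is a hypothetical stand-in for mypoll$black (not real data), and the names mu.hat, s2, se.hat, grid, and dens are illustrative.

```r
set.seed(12345)

# Hypothetical stand-in for the sampled values mypoll$black (NOT real data)
x <- rbeta(40, 0.5, 0.5)
n <- length(x)

mu.hat <- mean(x)          # estimator of mu: the sample mean
s2 <- var(x)               # estimator of sigma^2: divides by n - 1
se.hat <- sqrt(s2 / n)     # sd of the estimated sampling distribution

# var() matches the formula with the n - 1 denominator
all.equal(s2, sum((x - mu.hat)^2) / (n - 1))

# Density of the estimated sampling distribution N(mu.hat, s2 / n) on a grid
grid <- seq(mu.hat - 4 * se.hat, mu.hat + 4 * se.hat, length.out = 200)
dens <- dnorm(grid, mean = mu.hat, sd = se.hat)
```

Calling plot(grid, dens, type = "l") would draw the estimated sampling distribution for this one sample.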

Bootstrapping the Sampling Distribution of Sample Mean

1. Start with sample.
2. Define a quantity of interest (the parameter). For us, it's µ.
3. Choose a plausible estimator. For us, it's µ̂ = X̄.
4. Take a resample of size n (with replacement) from the sample and calculate the estimate using the estimator.
5. Repeat step 4 many times (we will do 10,000).

Recall that at the beginning of lecture we already took a sample of size n = 40 from the population and stored the resultant dataframe as the object mypoll.

    set.seed(12345)
    n.sample <- 40
    xbar.vec.bs <- replicate(n = 10000,
        mean(sample(mypoll$black, size = n.sample, replace = TRUE)))
    plot(density(xbar.vec.bs), col = "navy", lwd = 2,
        main = "Bootstrapped Sampling Distribution of Sample Mean",
        xlab = "Sample Mean")
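
The bootstrap and the normal-approximation estimate target the same sampling distribution, so their spreads should roughly agree. The sketch below compares the two on a hypothetical stand-in for mypoll$black (not real data); se.boot and se.analytic are illustrative names.

```r
set.seed(12345)

# Hypothetical stand-in for mypoll$black (NOT real data)
x <- rbeta(40, 0.5, 0.5)
n <- length(x)

# Bootstrap: resample n values with replacement, recompute the mean each time
xbar.vec.bs <- replicate(n = 10000, mean(sample(x, size = n, replace = TRUE)))

se.boot <- sd(xbar.vec.bs)       # bootstrap standard error
se.analytic <- sd(x) / sqrt(n)   # sqrt(S^2 / n) from the normal approximation

c(center = mean(xbar.vec.bs), sample.mean = mean(x))
c(se.boot = se.boot, se.analytic = se.analytic)
```

Note that the bootstrap distribution is centered on the sample mean rather than on the population mean, since the resamples are drawn from the sample.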

Visualizing the Bootstrapped Sampling Distribution

Here is the bootstrapped sampling distribution:

[Figure: density of the bootstrapped sample means; x-axis: Sample Mean, 0.2 to 0.7; y-axis: Density; legend: Sample Mean]

However, any given sample may be far away from the truth!

[Figure: the same density; legend: Sample Mean, Population Mean]

Conclusion

Any questions?