
Will Monroe
CS 109: Sampling and Bootstrapping
Lecture Notes #17, August 2, 2017
Based on a handout by Chris Piech

In this chapter we are going to talk about statistics calculated on samples from a population. We are then going to talk about probability claims that we can make with respect to the original population, a central requirement for most scientific disciplines. Let's say you are the king of Bhutan, and you want to know the average happiness of the people in your country. You can't ask every single person, but you could ask a random subsample. In this next section we will consider principled claims that you can make based on a subsample. Assume we randomly sample 200 Bhutanese people and ask them about their happiness, on a scale of 1 to 100 (happinesses? smiles?). Our data looks like this: 72, 85, ..., 71. You can also think of it as a collection of n = 200 I.I.D. (independent, identically distributed) random variables X_1, X_2, ..., X_n.

Understanding Samples

The idea behind sampling is simple, but the details and the mathematical notation can be complicated. Here is a picture to show you all of the ideas involved:

[Figure: a large population, from which a random sample X_1, ..., X_n is drawn and summarized as a histogram.]

The theory is that there is some large population (such as the 774,000 people who live in Bhutan). We collect a sample of n people at random, where each person in the population is equally likely to be in our sample. From each person we record one number (e.g., their reported happiness). We are going to call X_i the number from the i-th person we sampled. One way to visualize your samples X_1, X_2, ..., X_n is to make a histogram of their values. We make the assumption that all of our X_i's are identically distributed. That means that we are assuming there is a single underlying distribution F that we drew our samples from. Recall that a distribution for discrete random variables should define a probability mass function.
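To make the setup concrete, here is a minimal Python sketch. The population values are invented purely for illustration (in reality the population is never observed; that is why we sample); it draws an I.I.D. sample of n = 200 people and tallies a histogram of the sampled values:

    import random
    from collections import Counter

    # Hypothetical population of 774,000 happiness scores (1 to 100).
    # We pretend we can see it only so that we can sample from it.
    population = [random.randint(1, 100) for _ in range(774_000)]

    # Each draw picks every person with equal probability, so the
    # X_i are I.I.D. draws from the population distribution F.
    n = 200
    sample = [random.choice(population) for _ in range(n)]

    # Histogram of sample values: counts of each observed score.
    histogram = Counter(sample)
    print(histogram.most_common(5))  # the five most frequent scores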

Estimating Mean and Variance from Samples

We assume that the data we look at are I.I.D. from the same underlying distribution (F) with a true mean (µ) and a true variance (σ²). Since we can't talk to everyone in Bhutan, we have to rely on our sample to estimate the mean and variance. From our sample we can calculate a sample mean (X̄) and a sample variance (S²). These are the best guesses that we can make about the true mean and true variance:

$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i \qquad S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$$

The first thing to know about these estimates is that they are unbiased. Having an unbiased estimate means that if we were to repeat this sampling process many times, the expected value of each estimate should be equal to the true value we are trying to estimate. We will first prove that this is the case for X̄:

$$E[\bar{X}] = E\left[\frac{1}{n}\sum_{i=1}^{n} X_i\right] = \frac{1}{n}\sum_{i=1}^{n} E[X_i] = \frac{1}{n}\sum_{i=1}^{n} \mu = \frac{1}{n}\,n\mu = \mu$$

The equation for the sample mean thus fits our understanding of expectation. The same could be said about the sample variance, except for the surprising (n − 1) in the denominator of the equation. Why (n − 1)? That denominator is necessary to make sure that E[S²] = σ². The proof for S² is a bit more involved; you don't have to remember it, but some people may be interested in knowing it. Multiplying the definition of S² by (n − 1):

$$
\begin{aligned}
(n-1)E[S^2] &= E\left[\sum_{i=1}^{n}(X_i - \bar{X})^2\right]
  = E\left[\sum_{i=1}^{n}\bigl((X_i - \mu) + (\mu - \bar{X})\bigr)^2\right] \\
&= E\left[\sum_{i=1}^{n}(X_i - \mu)^2 + \sum_{i=1}^{n}(\mu - \bar{X})^2
  + 2(\mu - \bar{X})\sum_{i=1}^{n}(X_i - \mu)\right] \\
&= E\left[\sum_{i=1}^{n}(X_i - \mu)^2 + n(\mu - \bar{X})^2
  + 2n(\mu - \bar{X})(\bar{X} - \mu)\right] \\
&= E\left[\sum_{i=1}^{n}(X_i - \mu)^2 - n(\mu - \bar{X})^2\right]
  = \sum_{i=1}^{n} E\left[(X_i - \mu)^2\right] - n\,E\left[(\mu - \bar{X})^2\right] \\
&= n\sigma^2 - n\,\mathrm{Var}(\bar{X})
  = n\sigma^2 - n\,\frac{\sigma^2}{n} = n\sigma^2 - \sigma^2 = (n-1)\sigma^2
\end{aligned}
$$

The last step uses Var(X̄) = σ²/n, which we derive in the next section. Dividing both sides by (n − 1) gives E[S²] = σ².
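Before moving on, here is a small simulation sketch of these unbiasedness claims, using a fair six-sided die as a stand-in distribution (true mean 3.5, true variance 35/12). Averaging the estimates over many repetitions should recover the true values, while dividing by n instead of (n − 1) should visibly undershoot:

    import random

    n, trials = 10, 100_000
    mean_sum = s2_sum = biased_sum = 0.0
    for _ in range(trials):
        xs = [random.randint(1, 6) for _ in range(n)]  # one sample of size n
        xbar = sum(xs) / n
        ss = sum((x - xbar) ** 2 for x in xs)
        mean_sum += xbar
        s2_sum += ss / (n - 1)   # unbiased sample variance
        biased_sum += ss / n     # biased version (divides by n)
    print(mean_sum / trials)     # ~ 3.5   = mu
    print(s2_sum / trials)       # ~ 2.917 = 35/12 = sigma^2
    print(biased_sum / trials)   # ~ 2.625 = (n-1)/n * sigma^2, too small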

The intuition behind the proof is that the sample variance measures the distance of each sample to the sample mean, not to the true mean. The sample mean itself varies, and we can show that its variance is also related to the true variance.

Variance of the Sample Mean

We now have estimates for the mean and variance that are not biased; that is, they are correct on average. However, the estimates change depending on the samples. How stable are they? The sample mean is computed as an average of random variables. It takes on values probabilistically, which makes it a random variable itself. Since the X_i are independent, we can compute its variance:

$$\mathrm{Var}(\bar{X}) = \mathrm{Var}\left(\frac{1}{n}\sum_{i=1}^{n} X_i\right) = \left(\frac{1}{n}\right)^2 \mathrm{Var}\left(\sum_{i=1}^{n} X_i\right) = \left(\frac{1}{n}\right)^2 \sum_{i=1}^{n} \mathrm{Var}(X_i) = \left(\frac{1}{n}\right)^2 n\sigma^2 = \frac{\sigma^2}{n}$$

This tells us that the variance of the sample mean is proportional to the variance of the underlying distribution, but goes down with the number of samples.

Standard Error

Knowing that the variance of the sample mean is small when the number of samples is large is reassuring, but the expression for the variance of the sample mean depends on the true variance of the underlying distribution. What if we don't know that true variance? What can we say about the stability of our estimate of the mean, given only the sample we took? We know that S² is an unbiased estimator of the true variance, so one reasonable thing to try is to substitute S² for σ²:

$$\mathrm{Var}(\bar{X}) = \frac{\sigma^2}{n} \approx \frac{S^2}{n} \quad \text{since } S^2 \text{ is an unbiased estimate of } \sigma^2$$
$$\mathrm{SD}(\bar{X}) = \frac{\sigma}{\sqrt{n}} \approx \frac{S}{\sqrt{n}} \quad \text{since SD is the square root of Var}$$

That SD(X̄) formula has a special name: it is called the standard error, and it is a common way of reporting uncertainty of estimates of means ("error bars") in scientific papers. Let's say our sample of happiness has n = 200 people, the sample mean is X̄ = 83, and the sample variance is S² = 450. We can calculate the standard error of our estimate of the mean to be S/√n = √(450/200) ≈ 1.5. When we report our results, we will say that the average happiness score in Bhutan is 83 ± 1.5, with variance 450. (If you're wondering, S² has a variance too; it turns out to be $\frac{1}{n}\left(E[(X-\mu)^4] - \frac{n-3}{n-1}(\sigma^2)^2\right)$. We won't use that one in CS 109.)
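Here is a minimal sketch of the standard-error calculation (the helper standard_error is just one way to write it; the final line assumes only the summary statistics n = 200 and S² = 450 from the example):

    import math

    def standard_error(sample):
        # SE = S / sqrt(n), with S^2 the (n - 1)-denominator sample variance
        n = len(sample)
        xbar = sum(sample) / n
        s2 = sum((x - xbar) ** 2 for x in sample) / (n - 1)
        return math.sqrt(s2 / n)

    # Plugging in the example's summary statistics directly:
    print(math.sqrt(450 / 200))  # 1.5, so we report 83 +/- 1.5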

Bootstrap

The bootstrap is a statistical technique for understanding distributions of statistics. It was invented here at Stanford in 1979, when mathematicians were just starting to understand how computers, and computer simulations, could be used to better understand probabilities.

The first key insight is that if we had access to the underlying distribution (F), then answering almost any question we might have about how accurate our statistics are would become straightforward. For example, in the previous section we gave a formula for calculating the sample variance from a sample of size n. We know that in expectation our sample variance is equal to the true variance. But what if we want to know the probability that the true variance is within a certain range of the number we calculated? That question might sound dry, but it is critical to evaluating scientific claims. If you knew the underlying distribution F, you could simply repeat the experiment of drawing a sample of size n from F, calculate the sample variance of each new sample, and test what portion fell within a certain range.

The next insight behind bootstrapping is that the best estimate we can get for F is our sample itself. The general algorithm looks like this (in Python, where stat_fn is whatever statistic you want the distribution of):

    import random

    def bootstrap(sample, stat_fn, num_iter=10_000):
        n = len(sample)
        stats = []
        for _ in range(num_iter):
            # Drawing n values uniformly with replacement from the sample
            # is the same as drawing n new samples from the estimated pmf.
            resample = random.choices(sample, k=n)
            stats.append(stat_fn(resample))
        # stats can now be used to estimate the distribution of the stat
        return stats

Next week we will talk in much more detail about estimating distributions from samples. For now, the simplest way to estimate F (and the one we will use in this class) is to assume that P(X = k) is simply the fraction of times that k showed up in the sample. This set of probabilities defines a probability mass function of a discrete random variable, which we'll call F̂, the hat indicating that F̂ is an estimate of the true probability distribution F. This estimated distribution, formed from the counts of the samples, is sometimes called the empirical distribution.

Bootstrapping is a reasonable thing to do because the sample you have is the best and only information you have about what the underlying population distribution actually looks like, and many samples will look quite like the population they came from. With this approach, we can compute probabilities and estimates not just for the mean, but for any statistic we want. To calculate Var(S²), for example, we could calculate S²_i for each resample i, and after 10,000 iterations, we could calculate the sample variance of all the S²_i's.
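Following the Var(S²) example just described, one could feed the sample-variance statistic to the bootstrap function defined above. The data here are hypothetical stand-ins for the real n = 200 happiness scores:

    def sample_variance(xs):
        xbar = sum(xs) / len(xs)
        return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

    sample = [72, 85, 64, 91, 58, 77, 71]  # hypothetical observed scores
    s2_values = bootstrap(sample, sample_variance)
    # The sample variance of the 10,000 bootstrapped S^2 values is an
    # estimate of Var(S^2).
    print(sample_variance(s2_values))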

You might be wondering why the resample is the same size as the original sample (n). The answer is that the variation of the statistic you are calculating could depend on the size of the sample (or the resample). To accurately estimate the distribution of the statistic, we must use resamples of the same size.

The bootstrap has strong theoretical guarantees, and it is accepted by the scientific community. It breaks down when the underlying distribution has a long tail, or when the samples are not I.I.D.

Example of p-value calculation

We are trying to figure out whether people are happier in Bhutan or in Nepal. We sample n_1 = 200 individuals in Bhutan and n_2 = 300 individuals in Nepal and ask them to rate their happiness on a scale from 1 to 10. We measure the sample means for the two groups and observe that people in our Nepal sample are slightly happier: the difference between the Nepal sample mean and the Bhutan sample mean is 0.5 points on the happiness scale. Have we really shown that people in Nepal are happier? Sample means can fluctuate. How do we know that we didn't just get that difference because of random differences among samples?

There isn't a rigorous, objective way to prove that the difference you discovered wasn't due to chance, or even to give a probability that the difference was due to chance. It is possible, however, to give a probability for the reverse statement: if the only difference between the samples was due to chance, what would be the probability of getting a result at least as extreme? The assumption that the difference between the samples was due to chance is an example of a null hypothesis. A null hypothesis says that there is no relationship between two measured phenomena, or no difference between two groups. The probability we just described is known as a p-value. So a p-value is the probability that, when the null hypothesis is true, the statistic measured would be equal to, or more extreme than, the value you are reporting. In the case of comparing Nepal to Bhutan, the null hypothesis is that there is no difference between the distributions of happiness in Bhutan and Nepal, and that when you drew samples, Nepal's mean came out 0.5 points larger than Bhutan's by chance.

We can use bootstrapping to calculate the p-value. First, we estimate the underlying distribution of the null hypothesis by making a probability mass function from all of our samples from Nepal and all of our samples from Bhutan together:

    import random
    from statistics import mean

    def pvalue_bootstrap(bhutan_sample, nepal_sample, observed_difference,
                         num_iter=10_000):
        n, m = len(bhutan_sample), len(nepal_sample)
        # Under the null hypothesis, both groups share one distribution;
        # its pmf is estimated from the combined ("universal") sample.
        universal_sample = bhutan_sample + nepal_sample
        count = 0
        for _ in range(num_iter):
            bhutan_resample = random.choices(universal_sample, k=n)
            nepal_resample = random.choices(universal_sample, k=m)
            mean_difference = mean(nepal_resample) - mean(bhutan_resample)
            if mean_difference > observed_difference:
                count += 1
        return count / num_iter

This is particularly nice because we never had to assume that the distribution our samples came from had a particular form (e.g., we never had to claim that happiness is normally distributed). You might have heard of a t-test. That is another way of calculating p-values, but it makes the assumptions that the samples are normally distributed and have the same variance. Nowadays, when we have reasonable computer power, bootstrapping is a more versatile and accurate tool.
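Continuing from the function above, a usage sketch with hypothetical survey data standing in for the real n_1 = 200 and n_2 = 300 responses:

    bhutan = [7, 5, 8, 6, 7, 6]    # hypothetical scores on the 1-10 scale
    nepal = [8, 6, 7, 9, 7, 8, 6]
    observed = mean(nepal) - mean(bhutan)
    print(pvalue_bootstrap(bhutan, nepal, observed))

A small p-value (conventionally below 0.05) would suggest that a difference as large as the observed one would rarely arise if Bhutan and Nepal shared the same happiness distribution.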