Will Monroe, CS 109
Sampling and Bootstrapping
Lecture Notes #17, August 2, 2017
Based on a handout by Chris Piech

In this chapter we are going to talk about statistics calculated on samples from a population. We are then going to talk about probability claims that we can make with respect to the original population, a central requirement for most scientific disciplines. Let's say you are the king of Bhutan, and you want to know the average happiness of the people in your country. You can't ask every single person, but you could ask a random subsample. In this next section we will consider principled claims that you can make based on a subsample. Assume we randomly sample 200 Bhutanese people and ask them about their happiness, on a scale of 1 to 100 (happinesses? smiles?). Our data looks like this: 72, 85, ..., 71. You can also think of it as a collection of n = 200 I.I.D. (independent, identically distributed) random variables X_1, X_2, ..., X_n.

Understanding Samples

The idea behind sampling is simple, but the details and the mathematical notation can be complicated. Here is a picture to show you all of the ideas involved: the theory is that there is some large population (such as the 774,000 people who live in Bhutan). We collect a sample of people at random, where each person in the population is equally likely to be in our sample. From each person we record one number (e.g., their reported happiness). We are going to call X_i the number from the i-th person we sampled. One way to visualize your samples X_1, X_2, ..., X_n is to make a histogram of their values. We make the assumption that all of our X_i's are identically distributed. That means that we are assuming there is a single underlying distribution F that we drew our samples from. Recall that a distribution for a discrete random variable is defined by a probability mass function.
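For concreteness, here is a minimal sketch of such a histogram in Python; the sample values are invented for illustration:

```python
from collections import Counter

# A hypothetical sample of n = 10 happiness scores (values made up).
sample = [72, 85, 71, 90, 72, 85, 85, 60, 71, 85]

# Count how many times each value appears, then print a text histogram.
histogram = Counter(sample)
for value in sorted(histogram):
    print(value, '*' * histogram[value])
```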
Estimating Mean and Variance from Samples

We assume that the data we look at are I.I.D. from the same underlying distribution (F) with a true mean (µ) and a true variance (σ²). Since we can't talk to everyone in Bhutan, we have to rely on our sample to estimate the mean and variance. From our sample we can calculate a sample mean (X̄) and a sample variance (S²). These are the best guesses that we can make about the true mean and true variance:

    X̄ = (1/n) Σ_{i=1}^{n} X_i        S² = (1/(n−1)) Σ_{i=1}^{n} (X_i − X̄)²

The first thing to know about these estimates is that they are unbiased. Having an unbiased estimate means that if we were to repeat this sampling process many times, the expected value of each estimate should be equal to the true value we are trying to estimate. We will first prove that this is the case for X̄:

    E[X̄] = E[(1/n) Σ_{i=1}^{n} X_i]
         = (1/n) Σ_{i=1}^{n} E[X_i]
         = (1/n) Σ_{i=1}^{n} µ
         = (1/n) (nµ)
         = µ

The equation for the sample mean seems closely related to our understanding of expectation. The same could be said about the sample variance, except for the surprising (n−1) in the denominator of the equation. Why (n−1)? That denominator is necessary to make sure that E[S²] = σ². The proof for S² is a bit more involved; you don't have to remember it, but some people may be interested in knowing it:

    (n−1) E[S²] = E[ Σ_{i=1}^{n} (X_i − X̄)² ]
                = E[ Σ_i ((X_i − µ) + (µ − X̄))² ]
                = E[ Σ_i (X_i − µ)² + Σ_i (µ − X̄)² + 2 Σ_i (X_i − µ)(µ − X̄) ]
                = E[ Σ_i (X_i − µ)² + n(µ − X̄)² + 2(µ − X̄) Σ_i (X_i − µ) ]
                = E[ Σ_i (X_i − µ)² + n(µ − X̄)² + 2(µ − X̄) · n(X̄ − µ) ]
                = E[ Σ_i (X_i − µ)² − n(µ − X̄)² ]
                = Σ_i E[(X_i − µ)²] − n E[(µ − X̄)²]
                = nσ² − n Var(X̄)
                = nσ² − n (σ²/n)
                = nσ² − σ²
                = (n−1) σ²

(The second-to-last line uses Var(X̄) = σ²/n, which we derive in the next section.)
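This unbiasedness claim can also be checked empirically. The sketch below uses a fair six-sided die as an arbitrary stand-in for F (its true variance is σ² = 35/12 ≈ 2.917), repeats the sampling experiment many times, and averages the S² estimates:

```python
import random

random.seed(109)

def sample_variance(xs):
    # The (n - 1) denominator is what makes this estimator unbiased.
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

# Repeat the sampling experiment many times with small samples (n = 5)
# drawn from a fair die, whose true variance is 35/12.
estimates = []
for _ in range(20000):
    xs = [random.randint(1, 6) for _ in range(5)]
    estimates.append(sample_variance(xs))

print(sum(estimates) / len(estimates))  # close to 35/12, about 2.917
```

Even though each individual S² (from only five rolls) is noisy, their average converges to the true variance, which is what "unbiased" promises.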
So E[S²] = σ². The intuition behind the proof is that the sample variance measures the distance of each sample to the sample mean, not to the true mean. The sample mean itself varies, and we can show that its variance is also related to the true variance.

Variance of the Sample Mean

We now have estimates for the mean and variance that are not biased, that is, they are correct on average. However, the estimates change depending on the samples. How stable are they? The sample mean is computed as an average of random variables. It takes on values probabilistically, which makes it a random variable itself. We can compute its variance:

    Var(X̄) = Var((1/n) Σ_{i=1}^{n} X_i)
           = (1/n)² Var(Σ_{i=1}^{n} X_i)
           = (1/n)² Σ_{i=1}^{n} Var(X_i)      (since the X_i are independent)
           = (1/n)² (nσ²)
           = σ²/n

This tells us that the variance of the sample mean is proportional to the variance of the underlying distribution, but goes down with the number of samples.

Standard Error

Knowing that the variance of the sample mean is small if the number of samples is large is reassuring, but the expression for the variance of the sample mean depends on the true variance of the underlying distribution. What if we don't know that true variance? What can we say about the stability of our estimate of the mean, given only the sample we took? We know that S² is an unbiased estimator of the true variance. So one reasonable thing to try is to substitute S² for σ²:

    Var(X̄) = σ²/n ≈ S²/n      since S² is an unbiased estimate of σ²
    SD(X̄) ≈ S/√n              since SD is the square root of Var

That SD(X̄) formula has a special name: it is called the standard error, and it is a common way of reporting uncertainty of estimates of means ("error bars") in scientific papers. Let's say our sample of happiness has n = 200 people, the sample mean is X̄ = 83, and the sample variance is S² = 450. We can calculate the standard error of our estimate of the mean to be S/√n = √(450/200) = 1.5. When we report our results, we will say that the average happiness score in Bhutan is 83 ± 1.5, with variance 450. (If you're wondering, S² has a variance too; it turns out to be equal to (1/n)(E[(X − µ)⁴] − ((n−3)/(n−1))(σ²)²). We won't use that one in CS 109.)
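The estimators and the worked example above can be sketched in a few lines of Python (the helper names are my own):

```python
import math

def sample_mean(xs):
    return sum(xs) / len(xs)

def sample_variance(xs):
    # Note the (n - 1) denominator, which makes E[S^2] = sigma^2.
    xbar = sample_mean(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

def standard_error(n, s_squared):
    # SD(X-bar) is approximately S / sqrt(n) = sqrt(S^2 / n).
    return math.sqrt(s_squared / n)

# The worked example from the text: n = 200 people, S^2 = 450.
se = standard_error(200, 450)
print(se)  # 1.5
```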
Bootstrap

The bootstrap is a statistical technique for understanding distributions of statistics. It was invented here at Stanford in 1979, when mathematicians were just starting to understand how computers, and computer simulations, could be used to better understand probabilities.

The first key insight is that if we had access to the underlying distribution (F), then answering almost any question we might have about how accurate our statistics are would become straightforward. For example, in the previous section we gave a formula for how you could calculate the sample variance from a sample of size n. We know that in expectation our sample variance is equal to the true variance. But what if we want to know the probability that the true variance is within a certain range of the number we calculated? That question might sound dry, but it is critical to evaluating scientific claims. If you knew the underlying distribution, F, you could simply repeat the experiment of drawing a sample of size n from F, calculate the sample variance of each new sample, and test what portion fell within a certain range.

The next insight behind bootstrapping is that the best estimate that we can get for F is from our sample itself. The general algorithm looks like this:

    import random

    def bootstrap(sample, statistic, iterations=10000):
        n = len(sample)
        stats = []
        for _ in range(iterations):
            # Drawing n values from the pmf estimated from the sample is
            # the same as drawing n values from the sample with replacement.
            resample = random.choices(sample, k=n)
            stats.append(statistic(resample))
        # stats can now be used to estimate the distribution of the statistic.
        return stats

Next week we will talk in much more detail about estimating distributions from samples. For now, the simplest way to estimate F (and the one we will use in this class) is to assume that P(X = k) is simply the fraction of times that k showed up in the sample. This set of probabilities defines a probability mass function for a discrete random variable, which we'll call F̂, the hat indicating that F̂ is an estimate of the distribution F.
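As a concrete sketch, F̂ can be built with a counter, and drawing from F̂ amounts to resampling the original data with replacement (the sample values here are invented):

```python
import random
from collections import Counter

random.seed(109)

# A hypothetical sample (values made up for illustration).
sample = [72, 85, 71, 90, 72, 85, 85, 60, 71, 85]

# F-hat: P(X = k) is the fraction of times k showed up in the sample.
counts = Counter(sample)
pmf_hat = {k: counts[k] / len(sample) for k in counts}

# Drawing from F-hat is the same as drawing from the original sample
# with replacement.
values = list(pmf_hat)
weights = [pmf_hat[k] for k in values]
resample = random.choices(values, weights=weights, k=len(sample))
print(resample)
```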
This estimated distribution, formed from counts of samples, is sometimes called the empirical distribution. Bootstrapping is a reasonable thing to do because the sample you have is the best and only information you have about what the underlying population distribution actually looks like. Most samples will look quite like the population they came from. With this approach, we can compute probabilities and estimates not just for the mean, but for any statistic we want. To calculate Var(S²), for example, we could calculate S_i² for each resample i, and after 10,000 iterations we could calculate the sample variance of all the S_i²'s.
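Here is a minimal sketch of that Var(S²) calculation, using a small invented sample; random.choices performs the resampling with replacement:

```python
import random

random.seed(109)

def sample_variance(xs):
    xbar = sum(xs) / len(xs)
    return sum((x - xbar) ** 2 for x in xs) / (len(xs) - 1)

# A small invented sample of happiness scores.
sample = [72, 85, 71, 90, 72, 85, 85, 60, 71, 85]

# Bootstrap: compute S^2 on each of 10,000 resamples drawn (with
# replacement) from the original sample.
variances = []
for _ in range(10000):
    resample = random.choices(sample, k=len(sample))
    variances.append(sample_variance(resample))

# The sample variance of the bootstrapped S^2 values estimates Var(S^2).
var_of_s2 = sample_variance(variances)
print(var_of_s2)
```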
You might be wondering why the resample is the same size as the original sample (n). The answer is that the variation of the statistic you are calculating can depend on the size of the sample (or the resample). To accurately estimate the distribution of the statistic, we must use resamples of the same size. The bootstrap has strong theoretical guarantees, and it is accepted by the scientific community. It breaks down when the underlying distribution has a long tail or if the samples are not I.I.D.

Example of p-value calculation

We are trying to figure out if people are happier in Bhutan or in Nepal. We sample n_1 = 200 individuals in Bhutan and n_2 = 300 individuals in Nepal and ask them to rate their happiness on a scale from 1 to 10. We measure the sample means for the two samples and observe that people in our Nepal sample are slightly happier: the difference between the Nepal sample mean and the Bhutan sample mean is 0.5 points on the happiness scale. Have we really shown that people in Nepal are happier? Sample means can fluctuate. How do we know that we didn't just get that difference because of the random differences among samples? There isn't a rigorous, objective way to prove that the difference you discovered wasn't due to chance, or even to give a probability that the difference was due to chance. It is possible, however, to give a probability for the reverse statement: if the only difference in the samples was due to chance, what would be the probability that we got a result just as extreme? The assumption that the difference between the samples was due to chance is an example of a null hypothesis. A null hypothesis says that there is no relationship between two measured phenomena, or no difference between two groups. The probability we gave is known as a p-value. So a p-value is the probability that, when the null hypothesis is true, the statistic measured would be equal to, or more extreme than, the value you are reporting.
In the case of comparing Nepal to Bhutan, the null hypothesis is that there is no difference between the distributions of happiness in Bhutan and Nepal. When you drew samples, Nepal had a mean that was 0.5 points larger than Bhutan's by chance. We can use bootstrapping to calculate the p-value. First, we estimate the underlying distribution under the null hypothesis by making a probability mass function from all of our samples from Nepal and all of our samples from Bhutan together:

    import random

    def pvalue_bootstrap(bhutan_sample, nepal_sample, observed_difference,
                         iterations=10000):
        n = len(bhutan_sample)
        m = len(nepal_sample)
        # Under the null hypothesis, both samples come from the same
        # distribution, so their union is our best estimate of it.
        universal_sample = bhutan_sample + nepal_sample
        count = 0
        for _ in range(iterations):
            # Drawing from the empirical pmf of the combined sample is
            # the same as resampling from it with replacement.
            bhutan_resample = random.choices(universal_sample, k=n)
            nepal_resample = random.choices(universal_sample, k=m)
            mu_bhutan = sum(bhutan_resample) / n
            mu_nepal = sum(nepal_resample) / m
            mean_difference = mu_nepal - mu_bhutan
            if mean_difference > observed_difference:
                count += 1
        return count / iterations

This is particularly nice because we never had to assume that the distribution our samples came from had a particular form (e.g., we never had to claim that happiness is normally distributed). You might have heard of a t-test. That is another way of calculating p-values, but it makes the assumptions that the samples are normally distributed and have the same variance. Nowadays, when we have reasonable computing power, bootstrapping is a more versatile and accurate tool.
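As an end-to-end illustration, the sketch below runs this procedure on synthetic data in which the null hypothesis is true by construction (both samples are drawn from the same distribution); all names and values are invented:

```python
import random

random.seed(42)

# Synthetic data: both "countries" draw happiness from the SAME
# distribution, so the null hypothesis holds by construction.
bhutan = [random.randint(1, 10) for _ in range(200)]
nepal = [random.randint(1, 10) for _ in range(300)]

observed = sum(nepal) / len(nepal) - sum(bhutan) / len(bhutan)

# Estimate the null distribution from the combined sample.
universal = bhutan + nepal

count = 0
for _ in range(10000):
    b = random.choices(universal, k=len(bhutan))
    n = random.choices(universal, k=len(nepal))
    if sum(n) / len(n) - sum(b) / len(b) > observed:
        count += 1

# When the null hypothesis is true, this p-value is roughly uniformly
# distributed between 0 and 1, so it will usually not be small.
pvalue = count / 10000
print(pvalue)
```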