DS 100: Principles and Techniques of Data Science
Date: April 13, 2018
Discussion #10

Hypothesis Testing

1. Define these terms below as they relate to hypothesis testing.

a) Data Generation Model:
Solution: A set of assumptions about the process that generated the data. Some examples of assumptions include:
- The data are drawn independently.
- The data are uniformly distributed.
- There are two populations present in the data.

b) Null Hypothesis:
Solution: The null hypothesis is a statement about the model that corresponds to the idea that any observed difference is due to sampling or experimental error. We are often trying to disprove this.

c) Test Statistic:
Solution: A statistic (a function of the data) that can be used to help reject or fail to reject the null hypothesis.

d) Sampling distribution:
Solution: The distribution of all the possible values of a statistic with a fixed sample size. The assumptions of the null model should specify a sampling distribution.

e) p-value:
Solution: The chance, under the null hypothesis, of getting a test statistic equal to or more extreme than the observed test statistic.
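
The definitions in question 1 can be tied together with a small simulation. The sketch below is only an illustration under assumed details that are not part of the worksheet: the null model is "a fair coin flipped 100 times", the test statistic is the number of heads, and the observed value of 60 heads is made up. The function name `simulated_p_value` is hypothetical.

```python
import numpy as np

def simulated_p_value(observed_heads, n_flips=100, n_sims=10_000, seed=0):
    """Estimate a one-sided p-value under the null model 'the coin is fair'."""
    rng = np.random.default_rng(seed)
    # Simulate the sampling distribution of the test statistic (number of
    # heads) under the null model's data generation assumptions.
    null_stats = rng.binomial(n=n_flips, p=0.5, size=n_sims)
    # p-value: fraction of simulated statistics that are equal to or more
    # extreme than (here: at least as large as) the observed statistic.
    return float(np.mean(null_stats >= observed_heads))

# Hypothetical observation: 60 heads in 100 flips.
p = simulated_p_value(observed_heads=60)
print(p)
```

A small value of `p` says that 60 or more heads would be unusual if the coin were fair; it does not give the probability that the coin is fair.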

2. State whether each statement below is True or False. Provide an explanation.

a) p-values can indicate how incompatible the data are with a specified statistical model.
Solution: True. The p-value is computed in the context of the null model (i.e. it is the probability of extreme data under the null model).

b) p-values measure the probability that the null hypothesis is true.
Solution: False. The p-value is the probability of extreme data given the null hypothesis.

c) If our p-value is small, we have proven that the null model is false.
Solution: False. If we get a small p-value, we think that the evidence is strong enough to reject the null hypothesis, i.e. we no longer think random chance in a null model is an adequate explanation for the variability. This is evidence against the null model, not a proof that it is false.

d) The p-value is the probability of the null hypothesis given the data.
Solution: False. The p-value is the probability of observing data at least as extreme as ours under all the assumptions of the null hypothesis.

e) By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Solution: True. p-values are often misused or misrepresented through p-hacking and multiple testing without appropriate correction. Therefore, more information about the hypothesis testing procedure is needed to understand the evidence regarding a model.

Bootstrap

We take an i.i.d. random sample of size 9 from a population. We write all the values on pieces of paper and stick them in a box:

1 2 2 3 3 3 4 4 5

The numbers in the box have the following summary statistics:

Statistic   Sum   Sum of Squares   Mean   Median
Value       27    93               3      3

3. For each of the following, answer the following questions: Is this value calculable from the information given? If so, either calculate it by hand or describe how you would calculate this value. If not, then suggest an estimate for the quantity. All draws are with replacement.

a) The expected value of a single draw from the box.
Solution: $E[\text{single draw}] = \text{average of the box} = \frac{\text{sum of the tickets}}{\text{number of tickets}} = \frac{27}{9} = 3$

b) The expected value of the average of nine draws from this box.
Solution: $\underbrace{E[\text{average of nine draws}]}_{\text{estimator}} = \underbrace{\text{average of the box}}_{\text{bootstrap population parameter}} = 3$

c) The exact variance of the tickets in the box.
Solution: $\frac{93}{9} - 3^2 = \frac{4}{3}$

d) The exact variance of a single draw from the box.
Solution: $\frac{4}{3}$

e) The exact variance of the average of nine draws from the box.
Solution: $\frac{4/3}{9} = \frac{4}{27}$
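
The arithmetic in parts a) through e) can be checked with exact fractions. This is a minimal sketch; the variable names are chosen here for illustration and do not come from the worksheet.

```python
from fractions import Fraction

box = [1, 2, 2, 3, 3, 3, 4, 4, 5]
n = len(box)

total = sum(box)                           # sum of the tickets
sum_sq = sum(x * x for x in box)           # sum of squares of the tickets
mean = Fraction(total, n)                  # parts a) and b): average of the box
var_box = Fraction(sum_sq, n) - mean ** 2  # parts c) and d): variance of the box
var_mean = var_box / n                     # part e): variance of the average of n draws

print(total, sum_sq, mean, var_box, var_mean)
```

The variance in part c) uses the identity "mean of the squares minus square of the mean", and part e) divides by 9 because the nine draws are independent.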

f) The exact variance of the average of nine draws from the population.
Solution: We cannot calculate this from the sample. It can be estimated by the number in part e).

4. Let's say we forgot the analytic solution for finding the variance of the average of nine draws with replacement from the population. Describe a bootstrap procedure to estimate the variance.
Solution:
1. Draw a bootstrap sample of size 9 from the box (the bootstrap population).
2. Calculate the mean of the bootstrap sample.
3. Steps 1 and 2 constitute a single bootstrap replicate. Repeat them a large number of times (10000 is usually suggested).
4. Calculate the variance of the means from step 2. This is the bootstrap estimate of the variance of the sample mean.

5. What are the sources of error in the bootstrap procedure?
Solution:
- Estimation error: from estimating using a sample rather than calculating directly from the population distribution function (we don't have this!).
- Simulation error: from simulating the bootstrap sampling distribution. We can reduce this by increasing the number of bootstrap replications we do, or by enumerating all possible bootstrap resamples.
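
The bootstrap procedure from question 4 can be sketched in a few lines of NumPy. The function name `bootstrap_variance_of_mean`, the seed, and the default of 10000 replicates are illustrative choices, not part of the worksheet.

```python
import numpy as np

box = np.array([1, 2, 2, 3, 3, 3, 4, 4, 5])

def bootstrap_variance_of_mean(sample, n_replicates=10_000, seed=0):
    """Bootstrap estimate of the variance of the sample mean."""
    rng = np.random.default_rng(seed)
    n = len(sample)
    # Step 1 (repeated): each row is one bootstrap sample of size n,
    # drawn with replacement from the box.
    resamples = rng.choice(sample, size=(n_replicates, n), replace=True)
    # Step 2: the mean of each bootstrap sample.
    means = resamples.mean(axis=1)
    # Step 4: the variance of the bootstrap means.
    return float(means.var())

estimate = bootstrap_variance_of_mean(box)
print(estimate, 4 / 27)
```

The printed estimate should land close to the exact box value 4/27 from part e). The remaining gap is simulation error, which shrinks as `n_replicates` grows; the estimation error relative to the unknown population value is not visible here.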

6. Which of the following could be valid bootstrap resamples? Provide reasons for the ones that are not.

a) 1, 2, 2, 3, 3, 4, 4, 5, 6
Solution: No, 6 is not part of the original sample.

b) 1, 2, 2, 2, 3, 3, 3, 4, 4, 5
Solution: No, this resample (size 10) is too big.

c) 1, 1, 1, 1, 1, 1, 1, 1, 1
Solution: Yes, this could be a resample.

d) 2, 2, 3, 3, 3, 4, 4, 4
Solution: No, this resample (size 8) is too small.

e) 1, 2, 3, 3, 3, 4, 4, 5, 5
Solution: Yes, this could be a resample.

7. What are some assumptions we are making when performing the bootstrap?
Solution:
- The sample is representative of the population (drawn from the same distribution and big enough).
- The sample was drawn i.i.d.
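
The two reasons used in question 6 (wrong size, or a value that is not in the original sample) can be written as a small check. This is a sketch; `could_be_resample` is a hypothetical helper name.

```python
original = [1, 2, 2, 3, 3, 3, 4, 4, 5]

def could_be_resample(candidate, sample=original):
    """True if candidate could arise by drawing len(sample) values
    with replacement from sample."""
    same_size = len(candidate) == len(sample)
    # Drawing with replacement allows any value to repeat any number of
    # times, so only membership in the original sample matters.
    values_ok = set(candidate) <= set(sample)
    return same_size and values_ok

candidates = {
    "a": [1, 2, 2, 3, 3, 4, 4, 5, 6],
    "b": [1, 2, 2, 2, 3, 3, 3, 4, 4, 5],
    "c": [1, 1, 1, 1, 1, 1, 1, 1, 1],
    "d": [2, 2, 3, 3, 3, 4, 4, 4],
    "e": [1, 2, 3, 3, 3, 4, 4, 5, 5],
}
results = {label: could_be_resample(c) for label, c in candidates.items()}
print(results)
```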

8. You generate 10 bootstrap resamples (you would normally take many more). They are sorted and printed below:

[1, 2, 2, 2, 4, 4, 4, 4, 5]
[1, 2, 3, 3, 3, 3, 3, 4, 4]
[1, 2, 2, 3, 3, 3, 4, 4, 5]
[1, 1, 2, 3, 4, 4, 4, 5, 5]
[2, 3, 3, 3, 4, 4, 4, 5, 5]
[2, 3, 3, 3, 3, 3, 3, 4, 4]
[1, 1, 1, 1, 2, 2, 3, 4, 5]
[2, 2, 3, 4, 4, 4, 4, 4, 5]
[1, 2, 2, 3, 3, 3, 4, 4, 4]
[1, 2, 2, 2, 3, 3, 3, 4, 5]

Construct a 60% confidence interval for the 40th percentile of the population.

Solution: The 40th percentiles of the bootstrap resamples (the 4th smallest of the nine values in each resample) are:

2, 3, 3, 3, 3, 3, 1, 4, 3, 2

Sorting these values:

1, 2, 2, 3, 3, 3, 3, 3, 3, 4

We take the 20th and 80th percentiles to be the endpoints of our confidence interval, giving us [2, 3].

9. Which of the following statements are valid claims? Provide revisions for the others.

a) There's a 60% chance that the confidence interval in question 8 covers the true population 40th percentile.
Solution: No, the confidence interval either covers the true population parameter or it doesn't. The 60% refers to the coverage of the many confidence intervals constructed from hypothetical new samples.

b) If we were to repeat our sampling procedure and bootstrap confidence interval estimation many times on the population, then in the limit of infinite samples, at least 40% of those 60% confidence intervals will cover the 40th percentile of the population.
Solution: This statement is fine. About 60% of such intervals cover the parameter, which is at least 40%; put another way, 40% intervals are contained in 60% intervals.

c) An 80% confidence interval will in general be narrower than a 60% confidence interval.
Solution: No, they are wider.
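
The interval in question 8 can be reproduced with a short script. It assumes the percentile convention that matches the integer values in the solution: the p-th percentile is the smallest value that is at least as large as p% of the values. The helper name `percentile` is illustrative, and this convention differs from the default interpolation used by `numpy.percentile`.

```python
import math

resamples = [
    [1, 2, 2, 2, 4, 4, 4, 4, 5],
    [1, 2, 3, 3, 3, 3, 3, 4, 4],
    [1, 2, 2, 3, 3, 3, 4, 4, 5],
    [1, 1, 2, 3, 4, 4, 4, 5, 5],
    [2, 3, 3, 3, 4, 4, 4, 5, 5],
    [2, 3, 3, 3, 3, 3, 3, 4, 4],
    [1, 1, 1, 1, 2, 2, 3, 4, 5],
    [2, 2, 3, 4, 4, 4, 4, 4, 5],
    [1, 2, 2, 3, 3, 3, 4, 4, 4],
    [1, 2, 2, 2, 3, 3, 3, 4, 5],
]

def percentile(values, p):
    """Smallest value that is at least as large as p% of the values."""
    ordered = sorted(values)
    rank = max(math.ceil(p * len(ordered) / 100), 1)  # 1-based rank
    return ordered[rank - 1]

# Bootstrap statistic: the 40th percentile of each resample.
stats = [percentile(r, 40) for r in resamples]
# A 60% interval keeps the middle 60% of the bootstrap statistics.
lower, upper = percentile(stats, 20), percentile(stats, 80)
print(sorted(stats), [lower, upper])
```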

Properties of the Bootstrap

In the bootstrap, we have a sample $\{X_1, \ldots, X_n\}$ from which we sample with replacement $n$ times to obtain $\{\tilde{X}_1, \ldots, \tilde{X}_n\}$. Most likely, some of the values $\{X_1, \ldots, X_n\}$ will show up more than once.

10. For a really big sample, how likely are we to observe a data point $X_1$ in a particular bootstrap sample? Write down a guess.

11. Let's see how we would answer this question analytically. First, pick a fixed sample size $n$. What is the probability that $X_1$ appears on the second draw of a bootstrap sample?
Solution: $\frac{1}{n}$

12. What is the probability that $X_1$ appears in a particular bootstrap sample?
Solution: The probability that we don't pick $X_1$ in the $i$th draw is $1 - \frac{1}{n}$. The probability that we don't pick $X_1$ at all is the probability that we don't pick it for the 1st draw, for the second draw, and so on up to the $n$th draw. Since we are sampling them independently, this probability is the product

$$P(X_1 \text{ doesn't appear in the bootstrap sample}) = \underbrace{\left(1 - \frac{1}{n}\right) \cdots \left(1 - \frac{1}{n}\right)}_{n \text{ times}} = \left(1 - \frac{1}{n}\right)^n.$$

This is the probability that we don't observe $X_1$ at all. We are interested in the probability of the complement event: observing $X_1$ at least once. This is given by one minus the above probability:

$$P(X_1 \text{ appears at least once in the bootstrap sample}) = 1 - \left(1 - \frac{1}{n}\right)^n.$$

13. What is the limit of this probability as $n$ approaches $\infty$? Hint: Define $y = P(X_1 \text{ doesn't appear in the bootstrap sample})$. Then take the natural log of both sides.

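Solution: Define $y = \left(1 - \frac{1}{n}\right)^n$ and take the log of both sides to get

$$\ln y = n \ln\left(1 - \frac{1}{n}\right) = \frac{\ln\left(1 - \frac{1}{n}\right)}{\frac{1}{n}}.$$

Take the limit of both sides and apply L'Hôpital's Rule:

$$\lim_{n \to \infty} \ln y = \lim_{n \to \infty} \frac{\ln\left(1 - \frac{1}{n}\right)}{\frac{1}{n}} = \lim_{n \to \infty} \frac{\frac{1}{1 - \frac{1}{n}} \cdot \frac{1}{n^2}}{-\frac{1}{n^2}} = \lim_{n \to \infty} \frac{-1}{1 - \frac{1}{n}} = -1.$$

Exponentiating both sides,

$$e^{\lim_{n \to \infty} \ln y} = \lim_{n \to \infty} e^{\ln y} = \lim_{n \to \infty} y = e^{-1}.$$

Putting things together:

$$\lim_{n \to \infty} P(X_1 \text{ appears at least once in the bootstrap sample}) = \lim_{n \to \infty} \left[1 - \left(1 - \frac{1}{n}\right)^n\right] = 1 - \frac{1}{e}.$$

Technical points:
- There is a minor abuse of notation when we take the derivatives, since $n \in \mathbb{N}$. Here you should understand the numerator and denominator as continuous functions of real numbers.
- For those of you who are worrying about the existence of this limit, you can be more careful using a squeeze theorem argument. See http://www.maths.manchester.ac.uk/~mprest/elimit.pdf for a closely related proof.

14. Approximately what is the limit above equal to numerically?
Solution: $\lim_{n \to \infty} P(X_1 \text{ appears at least once in the bootstrap sample}) = 1 - \frac{1}{e} \approx 0.632$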
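
As a numerical sanity check on the limit, the probability $1 - (1 - 1/n)^n$ can be evaluated for growing $n$ and compared with $1 - 1/e$. This is a minimal sketch; the function name `prob_appears` and the chosen values of $n$ are illustrative.

```python
import math

def prob_appears(n):
    """P(a fixed data point appears at least once in a bootstrap sample of size n)."""
    return 1 - (1 - 1 / n) ** n

limit = 1 - math.exp(-1)
for n in (9, 100, 10_000, 1_000_000):
    print(n, prob_appears(n))
print("limit", limit)
```

The values decrease toward the limit, so even for the sample of size 9 used earlier the probability is already within a few hundredths of 0.632.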

15. How many times does a data point $X_1$ show up on average in the bootstrap sample?
Solution: First note that the number of times $T$ that $X_1$ is picked for the bootstrap sample can be written as

$$T = \sum_{j=1}^{n} I[\tilde{X}_j = X_1],$$

where $I[\tilde{X}_j = X_1]$ is 1 if $\tilde{X}_j = X_1$ and 0 if $\tilde{X}_j \neq X_1$. To get the expected number of times, use linearity of expectation!

$$E[T] = \sum_{j=1}^{n} E\, I[\tilde{X}_j = X_1] = \sum_{j=1}^{n} P(\tilde{X}_j = X_1) = \sum_{j=1}^{n} \frac{1}{n} = 1.$$

So the expected number of times $X_1$ shows up is 1, and the same is true for every other observation $X_2, \ldots, X_n$. This makes sense, since there are $n$ candidate positions and $n$ observations to choose from, with no one privileged over the other.
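
Both results, $E[T] = 1$ and the appearance probability of about 0.632, can be checked by simulation. The sketch below assumes a sample of size $n = 100$ and resamples indices rather than values, so that index 0 stands in for the data point $X_1$; the function name and defaults are illustrative.

```python
import numpy as np

def simulate_first_point(n=100, n_replicates=10_000, seed=0):
    """Simulate bootstrap resampling of indices 0..n-1 and track index 0."""
    rng = np.random.default_rng(seed)
    # Each row is one bootstrap sample: n indices drawn with replacement.
    draws = rng.integers(0, n, size=(n_replicates, n))
    # T for each replicate: how many draws picked index 0.
    counts = (draws == 0).sum(axis=1)
    mean_count = float(counts.mean())          # estimates E[T]
    frac_present = float((counts > 0).mean())  # estimates P(appears at least once)
    return mean_count, frac_present

mean_count, frac_present = simulate_first_point()
print(mean_count, frac_present)
```

Working with indices sidesteps repeated values in the data: the indicator in the derivation refers to a particular data point, not to its value.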