5. Data, Estimates, and Models: quantifying the accuracy of estimates.


5. Data, Estimates, and Models: quantifying the accuracy of estimates. 5.1 Estimating a Normal Mean. 5.2 The Distribution of the Normal Sample Mean. 5.3 Normal data, confidence interval for μ, σ known. 5.4 Normal data, confidence interval for μ, σ unknown (the t distribution). 5.5 Bernoulli data, confidence interval for p. 5.6 The Central Limit Theorem and a General Approximate Confidence Interval for μ. Big Picture: We now move to looking at using data to estimate parameters of models. We begin by considering estimation of the mean of a distribution. The mean and variance are the two parameters that describe a Normal model. We saw that as the sample size gets big, the sample average x̄ should get close to the mean. What determines how close? Can we quantify the accuracy?

5.1 Estimating the Mean of a Normal distribution. Consider a plant which fills cereal boxes. The manager needs to know how much cereal is going into the boxes, at least on average. How accurate will the sample average be as an estimate for the true mean? The setup: The distribution of cereal box weights is Normal(345, 15²), so the true mean (long-run average weight) is μ = 345. The manager doesn't know that μ = 345, so she randomly grabs n boxes that have been filled and uses the sample average of their weights as an estimate for the unknown true mean (345).

Our Approach: First, I'll show that if we know the true distribution of cereal box weights (say N(345, 15²)), we can describe how likely it is that the estimate constructed from the sample average lies near or far from the true value. Next, we'll use those results to quantify the accuracy of our estimates in the realistic setting where we don't know the true value of the mean. [Figure: time series and histogram of the observed weights for 500 boxes. The time series looks iid; the histogram looks Normal.] The weights of cereal boxes are iid normal with μ = 345 and σ = 15.

With 500 observations, our guess for μ is probably pretty good (we get x̄ = 344.83, very close). But what if you had fewer observations? Suppose you only had the first 10! How would you guess? [Figure: the first 10 observations with their sample average marked by a solid black line; it lies further from the true value 345 than the x̄ computed from all 500.] We saw that the sample average of a large number of iid draws should converge to the mean of the distribution we are drawing from. In our cereal box example, the weights are iid draws from a N(345, 15²). This means that the sample average should be close to 345. In general: x̄ = (1/n) Σᵢ xᵢ ≈ E(X) (for large n).

Given a sample of size n of observations that look iid normal, the sample mean x̄ = (1/n) Σᵢ xᵢ is our estimate of μ = E(Xᵢ). μ is sometimes called the population mean since it is the mean of the entire population of all potential values, while the sample mean is just the average of some of them. 5.2 The Distribution of the Normal Sample Mean. How bad can the estimate be if you only have 10 observations? To investigate this we perform a conceptual experiment. Let's take our 500 observations and break them up into 50 groups of 10 consecutive observations. Each group represents a sample of size 10 that you might have gotten. For each group we calculate the mean. This will show us what kinds of values we could get for the average of just 10 observations.

I want to see how noisy the sample average is when we have a sample of size 10, so I will look at a bunch of sample averages constructed using different datasets of 10 observations. We will look at how close or far the sample averages lie from the true mean. In reality we would have just a single sample of size 10; we could have gotten any of the 50 samples we look at. [Figure: the 500 weights plotted in 50 groups of 10; the little solid segments are plotted at the mean of the corresponding 10 numbers.]

Here is the histogram of the 50 sample averages (these are the 50 sample averages, not the individual cereal boxes). [Figure: Histogram of 50 sample averages. They look Normal too!] So the distribution of the types of values we get for our sample averages looks Normal too! Suppose the manager is about to grab a new sample of size 10 (say, observations 501–510) and use that sample average x̄ = (1/n) Σᵢ xᵢ as her estimate for the mean. What values might she get for the sample average? Recall that, empirically, we found this histogram for our conceptual experiment.
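The conceptual experiment above is easy to rerun with simulated data. This is a sketch, not the slides' actual dataset: we draw 500 iid Normal(345, 15) weights ourselves, split them into 50 groups of 10, and compute one sample average per group.

```python
import random
import statistics

# Simulated stand-in for the slides' 500 cereal-box weights.
random.seed(42)
weights = [random.gauss(345, 15) for _ in range(500)]

# Break into 50 consecutive groups of 10 and average each group.
groups = [weights[i:i + 10] for i in range(0, 500, 10)]
group_means = [statistics.mean(g) for g in groups]

# The 50 group means cluster around 345 far more tightly than the
# individual box weights do.
print(len(group_means))
print(min(group_means), max(group_means))
```

A histogram of `group_means` reproduces the bell shape seen on the slide.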

With the new sample, the manager could get any value like the ones we saw in our conceptual experiment (or other values). When we take a new sample it is like a random outcome. Why is it random? Because the data are random outcomes: each Xᵢ is a random draw from a N(345, 15²). Key idea: before we get the sample, each Xᵢ is random, so we think of the sample mean as a random variable!! It is a linear combination of iid Normals: x̄ = (X₁ + X₂ + … + X₁₀)/10. Q: What is the value that we will get for the first observation, X₁? Ans: It's unknown. It will be the outcome of a random draw from a N(345, 15²).

So, the big idea is that before we collect our observations, we can think of the sample average as a random variable. When we finally take our sample, it gives us one realization of the sample average. It is random because it is a linear combination of iid random variables. Note that the notation will remain the same, but we now think of the sample average before we take the sample as random. E(x̄) = E((1/n) Σᵢ Xᵢ) = (1/n) Σᵢ E(Xᵢ) = (1/n)(μ + μ + … + μ) = μ. Since the expected value of x̄ is equal to the thing we are trying to estimate, μ, we say x̄ is an unbiased estimate of the population mean.

What is the variance of the sample average? Var(x̄) = Var((1/n)(X₁ + X₂ + … + Xₙ)) = (1/n²)(Var(X₁) + Var(X₂) + … + Var(Xₙ)) = (1/n²)(nσ²) = σ²/n. So the sample average is unbiased, and the variance of the sample average can be quantified. Ideally we would like the variance to be small, so that the sample average should be close to the mean.
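The Var(x̄) = σ²/n result can be checked by simulation. An illustrative sketch (the repetition count and seed are ours): with σ = 15 and n = 10, the variance of x̄ should come out near 15²/10 = 22.5.

```python
import random
import statistics

random.seed(0)
sigma, n = 15.0, 10

# 20,000 independent sample averages, each from n = 10 draws.
xbars = [statistics.mean(random.gauss(345, sigma) for _ in range(n))
         for _ in range(20000)]

print(statistics.mean(xbars))       # should be near 345 (unbiased)
print(statistics.variance(xbars))   # should be near sigma**2 / n = 22.5
```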

The variance of the sample average depends on two things: the variance σ² of the population from which we are sampling, and the sample size n. The variability of our sample average is decreasing with larger sample sizes (larger values of n). The variability of our sample average is larger when the population variance is larger: a larger population variance means that our individual draws of the X's are more spread out. Why don't any covariances appear in the variance of x̄? The Xᵢ must be independent. Does this make sense?

Fact: since the average is a linear combination of independent Normals, it is also Normally distributed. Let Xᵢ ~ N(μ, σ²), iid. Then x̄ ~ N(μ, σ²/n). This is the same σ²: in the first line it represents the variance of the distribution of cereal box weights; in the second line, the ratio σ²/n gives the variance of sample averages constructed by averaging n cereal box weights. [Figure: relationship between the distribution of individual cereal box weights and the distribution of the sample average of ten; the average's density is much narrower.]

Example: for different sample sizes we get different distributions for the sample averages: x̄ ~ N(345, 15²/10), x̄ ~ N(345, 15²/50), x̄ ~ N(345, 15²/500). [Figure: the three densities, each centered at 345 and tighter as n grows.] For different sample sizes, the normal curves tell us how close we can expect our estimate to be to the true value! If we assume μ = 345 and σ = 15: individual box weights fall in 345 ± 2(15) = (315, 375) about 95% of the time, while averages of 10 boxes fall in 345 ± 2(15/√10) ≈ (335.5, 354.5).

5.3 Confidence Intervals: How do we use the results from the previous section when we don't know μ? We just figured out that if we sample from a N(μ, σ²), we can say what kind of sample averages we will get from a sample of size n. What we really want to know is: given a sample average, where do we think μ is? At first, we will still assume that we know σ but we don't know μ. In the next section we will relax this unrealistic assumption. We are assuming that the data are iid normal.

First let's add a bit of notation. Let σ_x̄ = σ/√n. This will simplify the look of the formulas and emphasize that the sample mean has its own standard deviation. Now we standardize: since x̄ ~ N(μ, σ_x̄²), we have (x̄ − μ)/σ_x̄ ~ N(0, 1), so Pr(−2 ≤ (x̄ − μ)/σ_x̄ ≤ 2) ≈ .95 (really the 2 is 1.96!!).

So Pr(μ − 2σ_x̄ ≤ x̄ ≤ μ + 2σ_x̄) ≈ .95. This says that there is a 95% chance that the sample mean x̄ lies within two standard deviations of the true mean. Remember that σ_x̄ gets smaller as the sample size gets larger, so we should expect the sample mean to be closer to the true mean in larger samples. [Figure: the density of x̄ ~ N(345, 15²/10); there is a 95% chance that x̄ falls in 345 ± 2σ_x̄.] Next let's rearrange some more to get something useful! If there is a 95% chance that x̄ lies within two standard deviations of μ, then there is a 95% chance that μ lies within two standard deviations of x̄. Mathematically, we rearrange the last inequality to get: Pr(x̄ − 2σ_x̄ ≤ μ ≤ x̄ + 2σ_x̄) ≈ .95.

For iid normal data with known standard deviation σ, a 95% confidence interval for the true mean μ is x̄ ± 2σ_x̄. 95% of the time the true value will be contained in the interval. The 95% CI for μ is the set of all values we cannot rule out based on the data. A picture of the process: our sample gives us a value for x̄; we want to ask what values for μ are reasonable. Consider a possible value μ₁; the curve centered at μ₁ is the sampling distribution of x̄ under that value. Is μ₁ reasonable? NO. If that were the right value of μ, it's extremely unlikely we'd see an x̄ like the one we got in the data. Consider μ₂: if this were the right value of μ, it's perfectly possible we'd see an x̄ like the one we saw in the data.
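The known-σ interval is a one-liner. A minimal sketch (the function name and the z = 2 default are ours), applied to the n = 10 cereal example with x̄ = 348.5 and σ = 15:

```python
import math

def ci_known_sigma(xbar, sigma, n, z=2.0):
    # 95% CI for the mean when sigma is known: xbar ± z * sigma / sqrt(n)
    half = z * sigma / math.sqrt(n)
    return xbar - half, xbar + half

lo, hi = ci_known_sigma(348.5, 15, 10)
print(lo, hi)   # roughly (339.0, 358.0)
```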

Example: Remember our weight data? Given the 500 observations, what do we know about μ? Assume σ = 15. Then σ_x̄ = 15/√500 ≈ .67. IN EXCEL, use the formula =15/sqrt(500); EXCEL gives us 0.670820. The sample average was 344.83, so the 95% CI is (343.49, 346.17). IN EXCEL, use the formulas =344.83-2*.67 and =344.83+2*.67; EXCEL gives us the values 343.49 and 346.17. Example: Given just the 10 observations, what do we know about μ? Assume σ = 15. Then σ_x̄ = 15/√10 ≈ 4.74. IN EXCEL, use the formula =15/sqrt(10); EXCEL gives us 4.74342. The sample average was 348.5, so the 95% CI is (339.02, 357.98). IN EXCEL, use the formulas =348.5-2*4.74 and =348.5+2*4.74; EXCEL gives us the values 339.02 and 357.98.

Confidence intervals answer the basic questions: what do you think the parameter is, and how sure are you? In particular, a 95% CI means that if we took 100 samples and created 100 different confidence intervals, we would expect 95 of them to contain the true (but unknown) value. Small interval: good, you know a lot. Big interval: bad, you don't know much. Clearly there is nothing special (outside of convention) in using a 95% CI; we could have constructed any confidence interval we like. For example, a 68% CI is given by x̄ ± σ_x̄. More generally, we can compute a 100(1−α)% confidence interval by: x̄ ± z_{α/2} σ_x̄.
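The "95 out of 100 intervals" interpretation can be checked directly by simulation. An illustrative sketch (repetition count and seed are ours): sample repeatedly from N(345, 15²), build x̄ ± 1.96·σ/√n each time, and count how often the interval traps the true mean.

```python
import math
import random
import statistics

random.seed(1)
mu, sigma, n, reps = 345.0, 15.0, 10, 10000
half = 1.96 * sigma / math.sqrt(n)   # fixed half-width, sigma known

hits = 0
for _ in range(reps):
    xbar = statistics.mean(random.gauss(mu, sigma) for _ in range(n))
    if xbar - half <= mu <= xbar + half:
        hits += 1

print(hits / reps)   # long-run coverage, close to 0.95
```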

Here are some tabulated values:

  1−α:      .80    .90    .95    .99
  α:        .20    .10    .05    .01
  α/2:      .10    .05    .025   .005
  z_{α/2}:  1.28   1.64   1.96   2.58

The (1−α)100% C.I. for μ is then given by x̄ ± z_{α/2} σ/√n. 5.4 Normal data, confidence interval for μ, σ unknown. Now we will extend our CI to the more realistic situation where σ is unknown. Typically you don't know σ, so we have to estimate it as well. How do we estimate σ? Just as we now think of the sample mean as an estimate of μ, we can think of the sample sd as an estimate of σ.

Estimating σ: s_x² = (1/(n−1)) Σᵢ (xᵢ − x̄)² is our estimate for σ². We divide by n−1 so that the estimator is unbiased. Fact: E(s_x²) = σ². The estimate of σ is s_x = √[(1/(n−1)) Σᵢ (xᵢ − x̄)²].
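Python's standard library already uses the n−1 divisor, so it matches s_x² and s_x as defined above. A quick check on made-up data:

```python
import statistics

# Hypothetical weights, just to exercise the formula.
data = [340.0, 350.0, 345.0, 360.0, 330.0]

xbar = statistics.mean(data)
s2 = sum((x - xbar) ** 2 for x in data) / (len(data) - 1)

# statistics.variance divides by n-1 as well, so the two agree.
print(s2, statistics.variance(data))   # both 125.0
```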

Now our big idea is that in the formula, instead of using σ, we use an estimate of it: se(x̄) = s_x/√n. This is called the standard error. Clearly, it is an estimate of the true standard deviation of x̄. We might think that (x̄ − μ)/se(x̄) ~ N(0, 1) approximately (squiggly lines mean "approximately distributed as"), giving the CI x̄ ± 2 se(x̄) (just replace σ with its estimate). This is approximately right for large n (n > 30). But it turns out that for iid normal data we can get an exact result. First we need to learn about the t distribution.

The t distribution: the t is just another continuous distribution. It has one parameter, called the degrees of freedom, which is usually denoted by the symbol ν. Each value of ν gives you a different distribution. [Figure: comparison of Normal and t distributions for different values of ν.]

When ν is bigger than about 30, the t is very much like the standard normal. [Figure: one curve is t with 30 df, the other is standard normal, and they are nearly indistinguishable; a t with ν = 3 df has visibly fatter tails.] For smaller ν, the t puts more probability in the tails. For our Normal mean problem we use ν = n − 1. Now let t_{n−1,.025} be such that P(−t_{n−1,.025} ≤ t ≤ t_{n−1,.025}) = .95, where t is a t rv with n − 1 df. [Figure: t density with .95 probability in the middle and .025 in each tail, cut at ±t_{n−1,.025}.]

For n − 1 greater than about 30, the t_{n−1} is so much like the standard normal that t_{n−1,.025} ≈ 2. For smaller n, the t value gets bigger than 2. Here is a table of t values; we can see that for ν > 30 (or even about 20) the t value is about 2.

  ν:           2      10     20     30     60
  t_{ν,.025}:  4.303  2.228  2.086  2.042  2.000

IN EXCEL, use the formula =TINV(0.05, 10), where 0.05 is the probability in the tails and 10 is the degrees of freedom. There is .025 prob less than −2.228 and .025 prob greater than 2.228 for the t dist with 10 degrees of freedom. EXCEL gives us 2.228.

Our basic result is: (x̄ − μ)/se(x̄) ~ t_{n−1}. For small n, the t distribution accounts for our estimation of σ with s_x. Thus Pr(−t_{n−1,.025} ≤ (x̄ − μ)/se(x̄) ≤ t_{n−1,.025}) = .95. Just as before, we can rearrange this to obtain the interval: x̄ ± t_{n−1,.025} se(x̄).

An exact 95% confidence interval for μ with σ unknown is x̄ ± t_{n−1,.025} se(x̄). Using the t value instead of the z value makes the interval bigger for smaller n. This reflects the fact that we are not sure our estimate for σ is quite right. Example: Back to our weight data. With n = 500, the sample sd is 15.455 and the sample mean is 344.83. The t dist with ν = 499 is just like the standard normal, so the t value is about 2. se(x̄) = 15.455/√500 ≈ .69. CI: 344.83 ± 1.4. IN EXCEL, use the formulas =344.83-1.4 and =344.83+1.4; EXCEL gives us the values 343.43 and 346.23.
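The t interval is the same arithmetic with t_{n−1,.025} in place of 2. A sketch using the slides' n = 10 numbers (x̄ = 348.5, s_x = 14.6, and the tabulated t_{9,.025} = 2.262 hard-coded, since the standard library has no t quantile function):

```python
import math

xbar, s_x, n = 348.5, 14.6, 10
t9 = 2.262                      # t_{9,.025} from the table above

se = s_x / math.sqrt(n)         # standard error of the mean
lo, hi = xbar - t9 * se, xbar + t9 * se
print(round(lo, 1), round(hi, 1))   # (338.1, 358.9), as on the slides
```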

T Confidence Intervals (se(x̄) = s_x/√n):

  Variable   N    Mean     StDev   SE Mean  95.0 % CI
  weights    500  344.828  15.455  0.691    (343.470, 346.186)

[Figure: histogram of weights with the 95% t confidence interval for the mean marked.] For the first 10 observations, the sample sd = 14.6 and the sample mean was 348.5. The t_{9,.025} value is 2.262. se(x̄) = 14.6/√10 ≈ 4.6. CI: 348.5 ± 2.262 × 4.6 ≈ 348.5 ± 10.4 = (338.1, 358.9).

T Confidence Intervals:

  Variable   N   Mean    StDev  SE Mean  95.0 % CI
  weights10  10  348.51  14.60  4.62     (338.07, 358.96)

[Figure: histogram of weights10 with the 95% t confidence interval for the mean marked.] Example: Let's get a 95% CI for the true mean of Canadian returns. IN EXCEL, use the pull-down menu: StatPro > Statistical Inference > One-sample analysis. Results for one-sample analysis for canada:

  Sample size                107
  Sample mean                0.009
  Sample standard deviation  0.038
  Confidence level           95.0%
  Std error of mean          0.004
  Degrees of freedom         106
  Lower limit                0.002
  Upper limit                0.016

[Figure: histogram of canada with the 95% t confidence interval for the mean marked.] Is the confidence interval big?

Example: 95% CI for the true mean of the NYSE stock index over the same period.

T Confidence Intervals:

  Variable  N    Mean     StDev    SE Mean  95.0 % CI
  nyse      107  0.01330  0.03686  0.00356  (0.00624, 0.02036)

[Figure: histogram of nyse with the 95% t confidence interval for the mean marked.] Of course, just as for the Normal case, we can find any confidence interval that we would like. The (1−α)100% C.I. for μ is then given by x̄ ± t_{α/2, n−1} s_x/√n, with t_{α/2, n−1} defined similarly to z_{α/2} for the N(0, 1).

5.5 Bernoulli data, confidence interval for p. Now we consider confidence intervals for p given n iid Bernoulli observations. Suppose we had data where 1 means a default and 0 means no default. [Figure: plot of the 0/1 outcomes against index 1 to 50.] What do you think the true default rate is, and how sure are you? Our data consist of n Bernoulli outcomes, where a mortgage either defaults (1) or does not (0). Our best estimate of p will be the sample fraction of defaults. That is: p̂ = (1/n) Σᵢ xᵢ. For our data it is 12/50 = .24.

We play the same game as before: before we take our sample, we ask what can happen. This time the outcomes are realizations of n iid Bernoulli(p) draws. The sum of n iid Bernoullis has a Binomial distribution, so the numerator is the outcome of a Binomial(n, p), where n is the sample size and p is the parameter we want to know. For iid Bernoulli data, the estimate of p is p̂ = (observed number of successes in the n trials)/(number of trials) = Y/n, where Y ~ Binomial(n, p).

Before we get a sample of size, what kid of estimate ca we expect to get? Y E(p) ˆ E E(Y) p p (ubiased) p( p) ˆ Var(p) p( p) 2 Two thigs: ) The variace of pˆ is agai decreasig i the sample size. 2) The variace of depeds o the value of p. pˆ Ulike the ormal case, oly approximate results are available. Sice our estimate is a combiatio of idepedet beroullis, the cetral limit theorem tells us that it should be approximately ormal: pˆ p( p) N(p, ) 33

We make a final approximation: se(p̂) = √[p̂(1−p̂)/n], so (p̂ − p)/se(p̂) ≈ N(0, 1). The approximate 95% interval for the true proportion p is p̂ ± 2 se(p̂). In our example the interval would be: IN EXCEL, use the formulas =.24-2*sqrt(.24*(1-.24)/50) and =.24+2*sqrt(.24*(1-.24)/50); EXCEL gives us the values 0.119203 and 0.360797.
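The same computation as the Excel formulas, packaged as a small helper (the function name is ours), applied to the default data: 12 defaults in 50 mortgages.

```python
import math

def prop_ci(successes, trials, z=2.0):
    # Approximate 95% CI for p: p_hat +/- z * sqrt(p_hat*(1-p_hat)/trials)
    p_hat = successes / trials
    half = z * math.sqrt(p_hat * (1 - p_hat) / trials)
    return p_hat - half, p_hat + half

lo, hi = prop_ci(12, 50)
print(round(lo, 6), round(hi, 6))   # 0.119203 0.360797, matching Excel
```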

Example: Remember the discrimination case? We used .07 for p. Not counting the firm being sued, we had 1128 partners, 77 of which were female: p̂ = 77/1128 ≈ .07. The confidence interval is .07 ± 2 √(.07(1−.07)/1128) = .07 ± .0152. This interval tells us where we think p is: (.0548, .0852).

Suppose we had only 100 partners, 7% of whom are female. The interval would be .07 ± 2 √(.07(1−.07)/100) = .07 ± .051, i.e. (.02, .12). This interval is much bigger, telling us that with only 100 observations, our estimate could be a lot farther from the truth. 100 observations carry less information than 1128. National Poll of likely voters (CNN):

"Trump tops Clinton 58% to 56% in unfavorable poll by CNN/ORC. The CNN/ORC Poll was conducted by telephone October 20-23 among a random national sample of 1,017 adults, including 779 who were determined to be likely voters. The margin of sampling error for results among the sample of likely voters is plus or minus 3.5 percentage points." What is the margin of error? It is actually a 95% Confidence Interval: se(p̂) = √(.58 × (1−.58)/779) ≈ .0177, so 2 × se(p̂) ≈ .0353, i.e. 3.5%.

Why is the Margin of Error often 3%? Generally the sample size is a little over 1000. The numerator of the standard error depends on p, so why are the reported errors not dependent on the value of the estimate of p? The media use the largest interval, which is obtained when p̂ = .5: 2 × se(p̂) = 2 × √(.5 × .5/1000) ≈ 2 × .0158 ≈ 3%. So if the estimate for p is different from .5, the confidence interval (margin of error) will be smaller than 3%. This means that we are at least 95% sure that the true value of p lies within plus or minus three percentage points.
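Both the worst-case 3% figure and the CNN/ORC 3.5% figure fall out of one formula. A sketch (the helper name is ours):

```python
import math

def margin_of_error(p_hat, n, z=2.0):
    # Half-width of the approximate 95% CI for a proportion.
    return z * math.sqrt(p_hat * (1 - p_hat) / n)

# Worst case p_hat = .5 with n = 1000: the familiar "plus or minus 3".
print(round(margin_of_error(0.5, 1000), 3))   # about 0.032

# The CNN/ORC likely-voter sample: p_hat = .58, n = 779.
print(round(margin_of_error(0.58, 779), 3))   # about 0.035
```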

5.6 The Central Limit Theorem and a General Approximate Confidence Interval for μ. Suppose we are willing to assume that our data are iid, but not willing to assume that they are normally distributed, and they are not Bernoulli. We might still want to estimate μ = E(Xᵢ). It turns out that the approach we used for normal data is approximately correct (with large sample sizes): x̄ ≈ N(μ, σ²/n) ≈ N(μ, se(x̄)²) (the first approximation is the CLT; for the second we just hope our estimate of σ is good), so (x̄ − μ)/se(x̄) ≈ N(0, 1).

This is an extremely powerful result. It says that even if we don't know that the distribution of the population we are sampling from is Normal, the distribution of the sample average will still be approximately Normally distributed in large samples! Given n iid observations Xᵢ, an approximate 95% confidence interval for μ = E(Xᵢ) is given by x̄ ± 2 se(x̄).
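A simulated illustration of the CLT claim (the population choice, sample size, and seed are ours, not the slides'): averages of n = 50 draws from a strongly skewed Exponential(1) population, which has mean 1 and sd 1, still pile up around 1 with spread close to 1/√50 ≈ 0.141, and a histogram of them looks roughly normal.

```python
import random
import statistics

random.seed(7)
n, reps = 50, 5000

# 5000 sample averages, each from n = 50 skewed (exponential) draws.
xbars = [statistics.mean(random.expovariate(1.0) for _ in range(n))
         for _ in range(reps)]

print(statistics.mean(xbars))    # near the population mean, 1.0
print(statistics.stdev(xbars))   # near sigma / sqrt(n) = 1 / sqrt(50)
```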