CS 70: Discrete Mathematics for CS, Spring 2008
David Wagner, Note 22

I.I.D. Random Variables

Estimating the bias of a coin

Question: We want to estimate the proportion $p$ of Democrats in the US population by taking a small random sample. How large does our sample have to be to guarantee that our estimate will be within (say) ±1 percentage point (in absolute terms) of the true value with probability at least 0.95?

This is perhaps the most basic statistical estimation problem, and it shows up everywhere. We will develop a simple solution that uses only Chebyshev's inequality. More refined methods can be used to get sharper results.

Let's denote the size of our sample by $n$ (to be determined), and the number of Democrats in it by the random variable $S_n$. (The subscript $n$ just reminds us that the r.v. depends on the size of the sample.) Then our estimate will be the value $A_n = \frac{1}{n} S_n$.

Now, as has often been the case, we will find it helpful to write $S_n = X_1 + X_2 + \cdots + X_n$, where
$$X_i = \begin{cases} 1 & \text{if person } i \text{ in the sample is a Democrat;} \\ 0 & \text{otherwise.} \end{cases}$$
Note that each $X_i$ can be viewed as a coin toss, with Heads probability $p$ (though of course we do not know the value of $p$). And the coin tosses are independent.¹ [We say that the $X_i$'s are independent and identically distributed, or just i.i.d. for short.]

What is the expectation of our estimate?
$$E[A_n] = E\Big[\frac{1}{n} S_n\Big] = \frac{1}{n} E[X_1 + X_2 + \cdots + X_n] = \frac{1}{n}(np) = p.$$
So for any value of $n$, our estimate will always have the correct expectation $p$. [Such an r.v. is often called an unbiased estimator of $p$.]

Now presumably, as we increase our sample size $n$, our estimate should get more and more accurate. This will show up in the fact that the variance decreases with $n$: i.e., as $n$ increases, the probability that we are far from the mean $p$ gets smaller. To see this, we need to compute $\mathrm{Var}(A_n)$. And since $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, we need to figure out how to compute the variance of a sum of random variables.

¹ We are assuming here that the sampling is done "with replacement"; i.e., we select each person in the sample from the entire population, including those we have already picked. So there is a small chance that we will pick the same person twice.
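Before turning to the variance, here is a quick empirical check of the unbiasedness claim. This is a minimal simulation sketch (assuming numpy is available; the particular values of $p$, $n$, and the number of polls are illustrative choices, not from the notes): it runs many independent polls and confirms that the estimates $A_n$ center on the true $p$.

```python
import numpy as np

rng = np.random.default_rng(0)

p = 0.47        # "true" proportion: unknown in a real poll, fixed here for the test
n = 1_000       # sample size
polls = 10_000  # number of independent simulated polls

# Each poll draws n i.i.d. Bernoulli(p) responses, so S_n ~ Binomial(n, p),
# and the estimate from one poll is A_n = S_n / n.
S_n = rng.binomial(n, p, size=polls)
A_n = S_n / n

print(f"average of A_n over {polls} polls: {A_n.mean():.4f} (true p = {p})")
```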

Theorem 22.1: For any random variable $X$ and constant $c$, we have $\mathrm{Var}(cX) = c^2\,\mathrm{Var}(X)$. And for independent random variables $X$ and $Y$, we have $\mathrm{Var}(X+Y) = \mathrm{Var}(X) + \mathrm{Var}(Y)$.

Proof: From the definition of variance, we have
$$\mathrm{Var}(cX) = E[(cX - E[cX])^2] = E[(cX - cE[X])^2] = E[c^2(X - E[X])^2] = c^2\,\mathrm{Var}(X).$$
The proof of the second claim is left as an exercise. Note that the second claim does not in general hold unless $X$ and $Y$ are independent.

Using this theorem, we can now compute $\mathrm{Var}(A_n)$:
$$\mathrm{Var}(A_n) = \mathrm{Var}\Big(\frac{1}{n}\sum_{i=1}^{n} X_i\Big) = \Big(\frac{1}{n}\Big)^2\,\mathrm{Var}\Big(\sum_{i=1}^{n} X_i\Big) = \Big(\frac{1}{n}\Big)^2 \sum_{i=1}^{n}\mathrm{Var}(X_i) = \frac{\sigma^2}{n},$$
where we have written $\sigma^2$ for the variance of each of the $X_i$, i.e., $\sigma^2 = \mathrm{Var}(X_i)$. So we see that the variance of $A_n$ decreases linearly with $n$. This fact ensures that, as we take larger and larger sample sizes $n$, the probability that we deviate much from the expectation $p$ gets smaller and smaller.

Let's now use Chebyshev's inequality to figure out how large $n$ has to be to ensure a specified accuracy in our estimate of the proportion of Democrats $p$. A natural way to measure this is to specify two parameters, $\varepsilon$ and $\delta$, both in the range $(0,1)$. The parameter $\varepsilon$ controls the error we are prepared to tolerate in our estimate, and $\delta$ controls the confidence we want to have in our estimate. A more precise version of our original question is then the following:

Question: For the Democrat-estimation problem above, how large does the sample size $n$ have to be in order to ensure that $\Pr[|A_n - p| \ge \varepsilon] \le \delta$?

In our original question, we had $\varepsilon = 0.01$ and $\delta = 0.05$: we wanted to know how large $n$ needs to be so that $\Pr[p - 0.01 < A_n < p + 0.01] \ge 0.95$, which is equivalent to asking how large $n$ needs to be so that $\Pr[|A_n - p| \ge 0.01] \le 0.05$.

Notice that, in this example, $\varepsilon$ measures the absolute error, i.e., the difference between the estimate $A_n$ and the true value $p$. In many applications, the relative error is a better measure of error, but in the case of polling it's usually enough to bound the absolute error.²

Let's apply Chebyshev's inequality to answer our more precise question above. Since we know $\mathrm{Var}(A_n)$, this will be quite simple. From Chebyshev's inequality, we have
$$\Pr[|A_n - p| \ge \varepsilon] \le \frac{\mathrm{Var}(A_n)}{\varepsilon^2} = \frac{\sigma^2}{n\varepsilon^2}.$$
To make this less than the desired value $\delta$, we need to set
$$n \ge \frac{\sigma^2}{\varepsilon^2 \delta}. \qquad (1)$$

² In many other applications, the absolute error is a poor measure, because a given absolute error (say, ±0.01) might be quite small in the context of measuring a large value like $p = 0.5$, but very large when measuring a small value like $p = 0.005$. For this reason, in most real-life applications, it is more useful to examine the relative error, i.e., to measure the error as a ratio of the target value $p$. (Thus the absolute error of the estimate $A_n$ is $|A_n - p|$, while the relative error is $|A_n - p|/p = |A_n/p - 1|$.) The relative error has the advantage of treating all values of $p$ equally. However, polling is a special case where it is often sufficient to use the absolute error, and the mathematics for the absolute error is slightly simpler, so we will continue to use the absolute error in this example, confident in the knowledge that we could modify these calculations to use a relative error measure if we wanted.

Now recall that $\sigma^2 = \mathrm{Var}(X_i)$ is the variance of a single sample $X_i$. Since $X_i$ is a 0/1-valued r.v., we have $\sigma^2 = p(1-p)$. It is easy to see, using a bit of calculus, that $p(1-p) \le \frac{1}{4}$ (since $0 \le p \le 1$). As a result, inequality (1) becomes
$$n \ge \frac{1}{4\varepsilon^2 \delta}.$$
Plugging in $\varepsilon = 0.01$ and $\delta = 0.05$, we see that a sample size of $n = 50{,}000$ is sufficient.

One amazing corollary is that the necessary sample size depends only on the desired margin of error ($\varepsilon$) and confidence level ($\delta$), but not on the size of the underlying population. We could be polling the state of Wyoming, the state of California, the whole of the US, or the entire world, and the same sample size would be sufficient. This is perhaps a bit counterintuitive.

Estimating a general expectation

What if we wanted to estimate something a little more complex than the proportion of Democrats in the population, such as the average wealth of people in the US? Then we could use exactly the same scheme as above, except that now the r.v. $X_i$ is the wealth of the $i$th person in our sample. Clearly $E[X_i] = \mu$, the average wealth of people in the US (which is what we are trying to estimate). And our estimate will again be $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$, for a suitably chosen sample size $n$. Once again the $X_i$ are i.i.d. random variables, so we again have $E[A_n] = \mu$ and $\mathrm{Var}(A_n) = \frac{\sigma^2}{n}$, where $\sigma^2 = \mathrm{Var}(X_i)$ is the variance of the $X_i$. (Recall that the only facts we used about the $X_i$ were that they are independent and have the same distribution. Actually it would be enough for them to be independent and all have the same expectation and variance; do you see why?)

In this case, we probably want to use the relative error: we want to choose $n$ to ensure that $\Pr[(1-\varepsilon)\mu < A_n < (1+\varepsilon)\mu] \ge 1 - \delta$, i.e., to ensure that $\Pr[|A_n - \mu| \ge \varepsilon\mu] \le \delta$. Applying Chebyshev's inequality much as before, we find
$$\Pr[|A_n - \mu| \ge \varepsilon\mu] \le \frac{\mathrm{Var}(A_n)}{(\varepsilon\mu)^2} = \frac{\sigma^2}{n\varepsilon^2\mu^2}. \qquad (2)$$
Hence it is enough for the sample size $n$ to satisfy
$$n \ge \frac{\sigma^2}{\mu^2}\cdot\frac{1}{\varepsilon^2\delta}. \qquad (3)$$

Here $\varepsilon$ and $\delta$ are the desired error and confidence, respectively, as before. Now of course we don't know the other two quantities, $\mu$ and $\sigma^2$, appearing in equation (3). In practice, we would try to find some reasonable lower bound on $\mu$ and some reasonable upper bound on $\sigma^2$ (just as we used an upper bound on $p(1-p)$ in the Democrats problem). Plugging these bounds into equation (3) will ensure that our sample size is large enough.

For example, in the average wealth problem we could probably safely take $\mu$ to be at least (say) \$20k (probably more). However, the existence of people such as Bill Gates means that we would need to take a very high value for the variance $\sigma^2$. Indeed, if there is at least one individual with wealth \$50 billion, then, assuming a relatively small value of $\mu$, the variance must be at least about $\frac{(50 \times 10^9)^2}{250 \times 10^6} = 10^{13}$. (Check this.) However, this individual's contribution to the mean is only $\frac{50 \times 10^9}{250 \times 10^6} = 200$. There is really no way around this problem with simple uniform sampling: the uneven distribution of wealth means that the variance is inherently very large, and we will need a huge number of samples before we are likely to find anybody who is immensely wealthy. But if we don't include such people in our sample, then our estimate will be way too low.
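Inequalities (1) and (3) translate directly into a small sample-size calculator. The sketch below is plain Python; the function names are mine, and the 10% relative error used in the wealth example is an illustrative choice, not a figure from the notes.

```python
import math

def sample_size_absolute(sigma2_max: float, eps: float, delta: float) -> int:
    """Smallest n with sigma^2 / (n * eps^2) <= delta, i.e. inequality (1)."""
    return math.ceil(sigma2_max / (eps**2 * delta))

def sample_size_relative(sigma2_max: float, mu_min: float,
                         eps: float, delta: float) -> int:
    """Smallest n satisfying inequality (3), given bounds on sigma^2 and mu."""
    return math.ceil((sigma2_max / mu_min**2) / (eps**2 * delta))

# Democrats problem: sigma^2 <= 1/4, eps = 0.01, delta = 0.05.
print(sample_size_absolute(0.25, 0.01, 0.05))          # -> 50000

# Wealth problem with the crude bounds from the text (sigma^2 ~ 10^13,
# mu >= $20k), even allowing a generous 10% relative error:
print(sample_size_relative(1e13, 20_000, 0.10, 0.05))  # -> 50000000
```

The second answer (50 million samples) makes the point of the wealth discussion numerically: with a heavy-tailed distribution, the $\sigma^2/\mu^2$ factor dominates everything else.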

As a further example, suppose we are trying to estimate the average rate of emission from a radioactive source, and we are willing to assume that the emissions follow a Poisson distribution with some unknown parameter $\lambda$; of course, this $\lambda$ is precisely the expectation we are trying to estimate. Now in this case we have $\mu = \lambda$ and also $\sigma^2 = \lambda$ (see the previous lecture notes). So $\frac{\sigma^2}{\mu^2} = \frac{1}{\lambda}$. Thus in this case a sample size of $n = \frac{1}{\lambda\varepsilon^2\delta}$ suffices. (Again, in practice we would use a lower bound on $\lambda$.)

The Law of Large Numbers

The estimation method we used in the previous two sections is based on a principle that we accept as part of everyday life: namely, the Law of Large Numbers. This asserts that, if we observe some random variable many times, and take the average of the observations, then this average will converge to a single value, which is of course the expectation of the random variable. In other words, averaging tends to smooth out any large fluctuations, and the more averaging we do the better the smoothing.

Theorem 22.2: [Law of Large Numbers] Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with common expectation $\mu = E[X_i]$. Define $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then for any $\alpha > 0$, we have
$$\Pr[|A_n - \mu| \ge \alpha] \to 0 \quad \text{as } n \to \infty.$$

We will not prove this theorem here. Notice that it says that the probability of any deviation $\alpha$ from the mean, however small, tends to zero as the number of observations $n$ in our average tends to infinity. Thus by taking $n$ large enough, we can make the probability of any given deviation as small as we like. [Note, however, that the Law of Large Numbers does not say anything about how large $n$ has to be to achieve a certain accuracy. For that, we need Chebyshev's inequality or some other quantitative tool.]

Actually we can say something much stronger than the Law of Large Numbers: namely, the distribution of the sample average $A_n$, for large enough $n$, looks like a bell-shaped curve centered about the mean $\mu$. The width of this curve decreases with $n$, so it approaches a sharp spike at $\mu$. This fact is known as the Central Limit Theorem.

To say this precisely, we need to define the "bell-shaped curve." This is the so-called Normal distribution, and it is the first (and only) non-discrete distribution we will meet in this course. For random variables that take on continuous real values, it no longer makes sense to talk about $\Pr[X = a]$. As an example, consider an r.v. $X$ that has the uniform distribution on the continuous interval $[0,1]$. Then for any single point $0 \le a \le 1$, we have $\Pr[X = a] = 0$. However, it is clearly the case that, for example, $\Pr[\frac{1}{4} \le X \le \frac{3}{4}] = \frac{1}{2}$. So in place of point probabilities $\Pr[X = a]$, we need a different notion of distribution for continuous random variables.

Definition 22.1 (density function): For a real-valued r.v. $X$, a real-valued function $f(x)$ is called a (probability) density function for $X$ if
$$\Pr[X \le a] = \int_{-\infty}^{a} f(x)\,dx.$$

Thus we can think of $f(x)$ as defining a curve, such that the area under the curve between the points $x = a$ and $x = b$ is precisely $\Pr[a \le X \le b]$. Note that we must always have $\int_{-\infty}^{\infty} f(x)\,dx = 1$. (Do you see why?)

As an example, for the uniform distribution on $[0,1]$ the density would be
$$f(x) = \begin{cases} 0 & \text{for } x < 0; \\ 1 & \text{for } 0 \le x \le 1; \\ 0 & \text{for } x > 1. \end{cases}$$
[Check that you agree with this. What would be the density for the uniform distribution on $[-1,1]$?]
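The Law of Large Numbers stated above is easy to watch in action, and the uniform distribution just defined is a convenient test case: its expectation is $\frac{1}{2}$ (verified via the density in the next section). A minimal sketch, assuming numpy:

```python
import numpy as np

rng = np.random.default_rng(2)

# i.i.d. Uniform[0,1] draws; E[X_i] = 1/2, so A_n should approach 0.5.
x = rng.uniform(0.0, 1.0, size=100_000)
running_avg = np.cumsum(x) / np.arange(1, len(x) + 1)

for n in (10, 100, 1_000, 10_000, 100_000):
    print(f"A_{n} = {running_avg[n - 1]:.4f}")
```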

Expectations of continuous r.v.'s are computed in an analogous way to those of discrete r.v.'s, except that we use integrals instead of summations. Thus
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx,$$
and also
$$\mathrm{Var}(X) = E[X^2] - (E[X])^2, \quad \text{where } E[X^2] = \int_{-\infty}^{\infty} x^2 f(x)\,dx.$$
[You should check that, for the uniform distribution on $[0,1]$, the expectation is $\frac{1}{2}$ and the variance is $\frac{1}{12}$.]

Now we are in a position to define the Normal distribution.

Definition 22.2 (Normal distribution): The Normal distribution with mean $\mu$ and variance $\sigma^2$ is the distribution with density function
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x-\mu)^2/2\sigma^2}.$$

[The constant factor $\frac{1}{\sigma\sqrt{2\pi}}$ comes from the fact that $\int_{-\infty}^{\infty} e^{-(x-\mu)^2/2\sigma^2}\,dx = \sigma\sqrt{2\pi}$; we have to normalize by this constant factor to ensure that $\int_{-\infty}^{\infty} f(x)\,dx = 1$. If you like calculus, you might like to do the integrals to check that the expectation $\int x f(x)\,dx$ is indeed $\mu$ and that the variance is indeed $\sigma^2$.]

If you plot the above density function $f(x)$, you will see that it is a symmetrical bell-shaped curve centered around the mean $\mu$. Its height and width are determined by the standard deviation $\sigma$ as follows: 50% of the mass is contained in the interval of width $0.67\sigma$ on either side of the mean, and 99.7% in the interval of width $3\sigma$ on either side of the mean. (Note that, to get the correct scale, deviations are on the order of $\sigma$ rather than $\sigma^2$.) Put another way, if we sample a random value from a Normal distribution, then 50% of the time the value we get will be within 0.67 standard deviations of the mean (i.e., in the range $[\mu - 0.67\sigma,\, \mu + 0.67\sigma]$), and 99.7% of the time it will be within 3 standard deviations of the mean.

Now we are in a position to state the Central Limit Theorem. Because our treatment of continuous distributions has been rather sketchy, we shall be content with a rather imprecise statement. This can be made completely rigorous without too much extra effort.

Theorem 22.3: [Central Limit Theorem] Let $X_1, X_2, \ldots, X_n$ be i.i.d. random variables with common expectation $\mu = E[X_i]$ and variance $\sigma^2 = \mathrm{Var}(X_i)$ (both assumed to be finite). Define $A_n = \frac{1}{n}\sum_{i=1}^{n} X_i$. Then as $n \to \infty$, the distribution of $A_n$ approaches the Normal distribution with mean $\mu$ and variance $\frac{\sigma^2}{n}$.

Note that the variance is $\frac{\sigma^2}{n}$ (as we would expect), so the width of the bell-shaped curve decreases by a factor of $\sqrt{n}$ as $n$ increases.

The Central Limit Theorem is actually a very striking fact. What it says is the following. If we take the average of $n$ observations of absolutely any r.v. $X$, then the distribution of that average will be a bell-shaped curve centered at $\mu = E[X]$. Thus all trace of the distribution of $X$ disappears as $n$ gets large: all distributions, no matter how complex,³ look like the Normal distribution when they are averaged. The only effect of the original distribution is through the variance $\sigma^2$, which determines the width of the curve for a given value of $n$, and hence the rate at which the curve shrinks to a spike.

³ We do need to assume that the mean and variance of $X$ are finite.
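The 50% and 99.7% figures, combined with the Central Limit Theorem's claim that $A_n$ is approximately Normal, can be checked by simulation. A sketch (assuming numpy; Exponential(1) is chosen deliberately because its density looks nothing like a bell curve):

```python
import numpy as np

rng = np.random.default_rng(3)

n, trials = 200, 10_000
mu, sigma = 1.0, 1.0  # Exponential(1) has mean 1 and variance 1

# A_n for many independent experiments; the CLT says A_n is approximately
# Normal(mu, sigma^2 / n) even though the X_i themselves are far from normal.
A_n = rng.exponential(scale=1.0, size=(trials, n)).mean(axis=1)
sd = sigma / np.sqrt(n)  # standard deviation of A_n

within_067 = np.mean(np.abs(A_n - mu) <= 0.67 * sd)
within_3 = np.mean(np.abs(A_n - mu) <= 3.0 * sd)

print(f"within 0.67 sd: {within_067:.3f} (Normal predicts ~0.50)")
print(f"within 3 sd:    {within_3:.3f} (Normal predicts ~0.997)")
```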

One useful consequence is that the Binomial distribution can be approximated by the Normal distribution, since the Binomial$(n,p)$ distribution is obtained as the sum of $n$ i.i.d. (0/1-valued) random variables. In particular, if we hold $p$ fixed and let $X_n$ be a random variable with distribution $X_n \sim \text{Binomial}(n,p)$, then as $n \to \infty$, $\frac{1}{n}X_n$ approaches the Normal distribution with mean $p$ and variance $\frac{p(1-p)}{n}$. This means that if $n$ is sufficiently large, we can approximate the r.v. $\frac{1}{n}X_n$ as a Normal distribution with mean $p$ and variance $\frac{p(1-p)}{n}$; or, in other words, we can approximate the r.v. $X_n$ itself as a Normal distribution with mean $np$ and variance $np(1-p)$. So, for large $n$, the Binomial distribution can often be approximated by a Normal distribution, which can be a helpful tool for approximate computations about Binomially distributed random variables. What is amazing about the Central Limit Theorem is that it shows that the same kind of approximation applies not only to sums of 0/1-valued random variables (i.e., not only to the Binomial distribution) but also to sums of any other kind of i.i.d. r.v.'s, as long as $n$ is sufficiently large.
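As a concrete check on this approximation, the following sketch (standard library only; the parameters $n = 100$, $p = 0.3$ are illustrative) compares the exact Binomial CDF with the Normal approximation of mean $np$ and variance $np(1-p)$:

```python
import math

def binomial_cdf(k: int, n: int, p: float) -> float:
    """Exact Pr[X_n <= k] for X_n ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def normal_cdf(x: float, mu: float, var: float) -> float:
    """Pr[Y <= x] for Y ~ Normal(mu, var)."""
    return 0.5 * (1 + math.erf((x - mu) / math.sqrt(2 * var)))

n, p = 100, 0.3
for k in (20, 25, 30, 35, 40):
    exact = binomial_cdf(k, n, p)
    approx = normal_cdf(k + 0.5, n * p, n * p * (1 - p))  # +0.5: continuity correction
    print(f"k={k}: exact {exact:.4f}  normal approx {approx:.4f}")
```

(The $+0.5$ continuity correction is a standard refinement when approximating a discrete distribution by a continuous one.)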