CS 70 Discrete Mathematics and Probability Theory, Summer 2014, James Cook: Note 17

I.I.D. Random Variables

Estimating the bias of a coin

Question: We want to estimate the proportion p of Democrats in the US population by taking a small random sample. How large does our sample have to be to guarantee that our estimate will be within (say) an additive factor of 0.1 of the true value with probability at least 0.95?

This is perhaps the most basic statistical estimation problem, and it shows up everywhere. We will develop a simple solution that uses only Chebyshev's inequality. More refined methods can be used to get sharper results.

Let's denote the size of our sample by n (to be determined), and the number of Democrats in it by the random variable S_n. (The subscript n just reminds us that the random variable depends on the size of the sample.) Then our estimate will be the value A_n = (1/n)S_n.

Now, as has often been the case, we will find it helpful to write S_n = X_1 + X_2 + ... + X_n, where

    X_i = 1 if person i in the sample is a Democrat, and X_i = 0 otherwise.

Note that each X_i can be viewed as a coin toss, with Heads probability p (though of course we do not know the value of p!). And the coin tosses are independent.¹ We call such a family of random variables independent, identically distributed, or i.i.d. for short.

What is the expectation of our estimate?

    E(A_n) = E((1/n)S_n) = (1/n)E(X_1 + X_2 + ... + X_n) = (1/n)(np) = p.

So for any value of n, our estimate will always have the correct expectation p. (Such an r.v. is often called an unbiased estimator of p.) Now, presumably, as we increase our sample size n, our estimate should get more and more accurate. This will show up in the fact that the variance decreases as n grows: i.e., the probability that we are far from the mean p will get smaller.

To see this, we need to compute Var(A_n). But A_n = (1/n)∑_{i=1}^n X_i, which is just a multiple of a sum of independent random variables.

Theorem 17.1: For any random variable X and constant c, we have Var(cX) = c²Var(X). And for independent random variables X and Y, we have Var(X + Y) = Var(X) + Var(Y).

¹We are assuming here that the sampling is done "with replacement"; i.e., we select each person in the sample from the entire population, including those we have already picked. So there is a small chance that we will pick the same person twice.
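To make the setup concrete, here is a small Python sketch (our illustration, not part of the original note) that simulates this poll: each X_i is an independent Bernoulli(p) toss for an assumed true bias p, and A_n = S_n/n is the estimate. The values p = 0.45 and the trial counts are arbitrary choices.

```python
import random

def poll(p, n):
    """One poll of size n: S_n = X_1 + ... + X_n with X_i ~ Bernoulli(p).
    Returns the estimate A_n = S_n / n."""
    s_n = sum(1 for _ in range(n) if random.random() < p)
    return s_n / n

if __name__ == "__main__":
    p = 0.45          # assumed true bias (unknown in a real poll)
    trials = 10_000   # number of independent simulated polls
    for n in (10, 100, 1000):
        estimates = [poll(p, n) for _ in range(trials)]
        mean = sum(estimates) / trials
        var = sum((a - mean) ** 2 for a in estimates) / trials
        print(f"n={n:5d}  mean(A_n)={mean:.4f}  var(A_n)={var:.6f}")
```

The empirical mean of A_n stays near p for every n (unbiasedness), while the empirical variance shrinks roughly like 1/n, which is exactly the behavior derived below.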

Before we prove this theorem, let us look more carefully at something we have been using implicitly for some time:

Joint Distributions

Consider two random variables X and Y defined on the same probability space. By linearity of expectation, we know that E(X + Y) = E(X) + E(Y). Since E(X) can be calculated if we know the distribution of X, and E(Y) can be calculated if we know the distribution of Y, this means that E(X + Y) can be computed knowing only the two individual distributions. No information is needed about the relationship between X and Y.

This is not true if we need to compute, say, E((X + Y)²), e.g. as when we computed the variance of a binomial r.v. This is because E((X + Y)²) = E(X²) + 2E(XY) + E(Y²), and E(XY) depends on the relationship between X and Y. How can we capture such a relationship?

Recall that the distribution of a single random variable X is the collection of the probabilities of all events X = a, for all possible values a that X can take on. When we have two random variables X and Y, we can think of (X, Y) as a "two-dimensional" random variable, in which case the events of interest are X = a ∧ Y = b for all possible values (a, b) that (X, Y) can take on. Thus, a natural generalization of the notion of distribution to multiple random variables is the following.

Definition 17.1 (joint distribution): The joint distribution of two discrete random variables X and Y is the collection of values {(a, b, Pr[X = a ∧ Y = b]) : (a, b) ∈ A × B}, where A and B are the sets of all possible values taken by X and Y respectively.

This notion obviously generalizes to three or more random variables. Since we will write Pr[X = a ∧ Y = b] quite often, we will abbreviate it to Pr[X = a, Y = b]. Just like the distribution of a single random variable, the joint distribution is normalized, i.e.

    ∑_{a∈A, b∈B} Pr[X = a, Y = b] = 1.

This follows from noticing that the events X = a ∧ Y = b, for a ∈ A and b ∈ B, partition the sample space.

The joint distribution of two random variables fully describes their statistical relationship, and provides enough information to compute any probability or expectation involving the two random variables. For example,

    E(XY) = ∑_c c · Pr[XY = c] = ∑_{a,b} ab · Pr[X = a, Y = b].

More generally, if f is any function on R × R,

    E(f(X, Y)) = ∑_c c · Pr[f(X, Y) = c] = ∑_{a,b} f(a, b) · Pr[X = a, Y = b].

Moreover, the individual distributions of X and Y can be recovered from the joint distribution as follows:

    Pr[X = a] = ∑_{b∈B} Pr[X = a, Y = b]    (1)

    Pr[Y = b] = ∑_{a∈A} Pr[X = a, Y = b]    (2)

The first follows from the fact that the events Y = b, for b ∈ B, form a partition of the sample space Ω, so the events X = a ∧ Y = b, for b ∈ B, are disjoint and their union yields the event X = a. Similar logic applies to the second fact.

Pictorially, one can think of the joint distribution values as entries filling a table, with the columns indexed by the values that X can take on and the rows indexed by the values that Y can take on (Figure 1).

[Figure 1: A tabular representation of a joint distribution.]

To get the distribution of X, all one needs to do is sum the entries in each of the columns. To get the distribution of Y, just sum the entries in each of the rows. This process is sometimes called "marginalization," and the individual distributions are sometimes called "marginal distributions" to differentiate them from the joint distribution.
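The table picture translates directly into code. The following sketch (with a made-up joint distribution; the names are ours) stores Pr[X = a, Y = b] in a dictionary and recovers the marginals by summing over the other variable, exactly as in equations (1) and (2):

```python
# Hypothetical joint distribution: keys are (a, b), values are Pr[X=a, Y=b].
joint = {
    (0, 0): 0.125, (0, 1): 0.25,
    (1, 0): 0.375, (1, 1): 0.25,
}
assert abs(sum(joint.values()) - 1.0) < 1e-12  # normalization check

def marginal_X(joint):
    """Pr[X = a] = sum over b of Pr[X = a, Y = b], as in equation (1)."""
    px = {}
    for (a, b), pr in joint.items():
        px[a] = px.get(a, 0.0) + pr
    return px

def marginal_Y(joint):
    """Pr[Y = b] = sum over a of Pr[X = a, Y = b], as in equation (2)."""
    py = {}
    for (a, b), pr in joint.items():
        py[b] = py.get(b, 0.0) + pr
    return py

print(marginal_X(joint))  # {0: 0.375, 1: 0.625}  (column sums)
print(marginal_Y(joint))  # {0: 0.5, 1: 0.5}      (row sums)
```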

Independent Random Variables

Independence of random variables is defined in an analogous fashion to independence for events:

Definition 17.2 (independent r.v.'s): Random variables X and Y on the same probability space are said to be independent if the events X = a and Y = b are independent for all values a, b. Equivalently, the joint distribution of independent r.v.'s decomposes as

    Pr[X = a, Y = b] = Pr[X = a] · Pr[Y = b]   for all a, b.

Note that for independent r.v.'s, the joint distribution is fully specified by the marginal distributions. Mutual independence of more than two r.v.'s is defined similarly. A very important example of independent r.v.'s is indicator r.v.'s for independent events. Thus, for example, if {X_i} are indicator r.v.'s for the event that the ith toss of a coin is Heads, then the X_i are mutually independent r.v.'s.

We saw that the expectation of a sum of r.v.'s is the sum of the expectations of the individual r.v.'s. This is not true in general for variance. However, as the above theorem states, it is true if the random variables are independent. To see this, we first look at the expectation of a product of independent r.v.'s (a quantity that frequently shows up in variance calculations, as we have seen).

Theorem 17.2: For independent random variables X and Y, we have E(XY) = E(X)E(Y).

Proof: We have

    E(XY) = ∑_{a,b} ab · Pr[X = a, Y = b]
          = ∑_{a,b} ab · Pr[X = a] · Pr[Y = b]
          = (∑_a a · Pr[X = a]) · (∑_b b · Pr[Y = b])
          = E(X) · E(Y),

as required. In the second line here we made crucial use of independence.

For example, this theorem would have allowed us to conclude immediately, in our random walk example at the beginning of Lecture Note 16, that E(X_i X_j) = E(X_i)E(X_j) = 0, without the need for a calculation.
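As a numerical sanity check (again our own sketch, with arbitrary marginals), we can build the joint distribution of an independent pair as the product of its marginals and verify that E(XY) = E(X)E(Y):

```python
# Arbitrary marginal distributions for X and Y (illustrative values).
px = {0: 0.25, 1: 0.5, 2: 0.25}
py = {-1: 0.5, 1: 0.5}

# Independence: Pr[X=a, Y=b] = Pr[X=a] * Pr[Y=b] for all a, b.
joint = {(a, b): pa * pb for a, pa in px.items() for b, pb in py.items()}

E_X = sum(a * pa for a, pa in px.items())
E_Y = sum(b * pb for b, pb in py.items())
E_XY = sum(a * b * pr for (a, b), pr in joint.items())

print(E_XY, E_X * E_Y)          # equal, as Theorem 17.2 asserts
assert abs(E_XY - E_X * E_Y) < 1e-12
```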

We now use the above theorem to prove the nice property of the variance of independent random variables stated in Theorem 17.1, namely that for independent random variables X and Y, Var(X + Y) = Var(X) + Var(Y):

Proof: From the alternative formula for variance in Theorem 16.1, we have, using linearity of expectation extensively,

    Var(X + Y) = E((X + Y)²) − E(X + Y)²
               = E(X²) + E(Y²) + 2E(XY) − (E(X) + E(Y))²
               = (E(X²) − E(X)²) + (E(Y²) − E(Y)²) + 2(E(XY) − E(X)E(Y))
               = Var(X) + Var(Y) + 2(E(XY) − E(X)E(Y)).

Now, because X and Y are independent, by Theorem 17.2 the final term in this expression is zero. Hence we get our result.

Note: The expression E(XY) − E(X)E(Y) appearing in the above proof is called the covariance of X and Y, and is a measure of the dependence between X and Y. It is zero when X and Y are independent.

It is very important to remember that neither of these two results is true in general, without the assumption that X and Y are independent. As a simple example, note that even for a 0/1-valued r.v. X with Pr[X = 1] = p, E(X²) = p is not equal to E(X)² = p² (because of course X and X are not independent!). Note also that the theorem does not quite say that variance is linear for independent random variables: it says only that variances sum. It is not true that Var(cX) = c·Var(X) for a constant c; rather, Var(cX) = c²·Var(X). The proof is left as a straightforward exercise.

We now return to our example of estimating the proportion of Democrats, where we were about to compute Var(A_n):

    Var(A_n) = Var((1/n)∑_{i=1}^n X_i) = (1/n)² Var(∑_{i=1}^n X_i) = (1/n)² ∑_{i=1}^n Var(X_i) = σ²/n,

where we have written σ² for the variance of each of the X_i. So we see that the variance of A_n decreases linearly with n. This fact ensures that, as we take larger and larger sample sizes n, the probability that we deviate much from the expectation p gets smaller and smaller.

Let's now use Chebyshev's inequality to figure out how large n has to be to ensure a specified accuracy in our estimate of the proportion of Democrats p. A natural way to measure this is to specify two parameters, ε and δ, both in the range (0, 1). The parameter ε controls the error we are prepared to tolerate in our estimate, and δ controls the confidence we want to have in our estimate. A more precise version of our original question is then the following:

Question: For the Democrat-estimation problem above, how large does the sample size n have to be in order to ensure that Pr[|A_n − p| ≥ ε] ≤ δ?

In our original question, we had ε = 0.1 and δ = 0.05.

Let's apply Chebyshev's inequality to answer this more precise question. Since we know Var(A_n), this will be quite simple. From Chebyshev's inequality, we have

    Pr[|A_n − p| ≥ ε] ≤ Var(A_n)/ε² = σ²/(nε²).
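It is instructive to compare Chebyshev's guarantee with what actually happens in simulation. Here is a rough sketch (the parameters p = 0.5 and ε = 0.1 are assumed for illustration) that estimates the deviation probability empirically and prints it next to the bound σ²/(nε²):

```python
import random

def deviation_prob(p, n, eps, trials=20_000):
    """Empirical estimate of Pr[|A_n - p| >= eps] over many simulated polls."""
    bad = 0
    for _ in range(trials):
        a_n = sum(random.random() < p for _ in range(n)) / n
        if abs(a_n - p) >= eps:
            bad += 1
    return bad / trials

p, eps = 0.5, 0.1
sigma2 = p * (1 - p)  # variance of a single 0/1-valued sample
for n in (50, 100, 500):
    bound = sigma2 / (n * eps**2)  # Chebyshev tail bound
    print(f"n={n:4d}  empirical={deviation_prob(p, n, eps):.4f}  bound={bound:.4f}")
```

The empirical probability sits well below the bound: Chebyshev's inequality is loose, which is why the more refined methods mentioned at the start give sharper sample-size guarantees.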

To make this less than the desired value δ, we need to set

    n ≥ σ²/(ε²δ).    (3)

Now recall that σ² = Var(X_i) is the variance of a single sample X_i. Since X_i is a 0/1-valued r.v., we have σ² = p(1 − p), and inequality (3) becomes

    n ≥ p(1 − p)/(ε²δ).    (4)

Since p(1 − p) takes on its maximum value at p = 1/2, we can conclude that it is sufficient to choose n such that

    n ≥ 1/(4ε²δ).    (5)

Plugging in ε = 0.1 and δ = 0.05, we see that a sample size of n = 500 is sufficient.

Notice that the size of the sample is independent of the total size of the population! This is how polls can accurately estimate quantities of interest for a population of several hundred million while sampling only a very small number of people.

Estimating a general expectation

What if we wanted to estimate something a little more complex than the proportion of Democrats in the population, such as the average wealth of people in the US? Then we could use exactly the same scheme as above, except that now the r.v. X_i is the wealth of the ith person in our sample. Clearly E(X_i) = µ, the average wealth (which is what we are trying to estimate). And our estimate will again be A_n = (1/n)∑_{i=1}^n X_i, for a suitably chosen sample size n. Once again the X_i are i.i.d. random variables, so we again have E(A_n) = µ and Var(A_n) = σ²/n, where σ² = Var(X_i) is the variance of the X_i. (Recall that the only facts we used about the X_i were that they were independent and had the same distribution; actually, the same expectation and variance would be enough: why?)

This time, however, since we do not have any a priori bound on the mean µ, it makes more sense to let ε be the relative error; i.e., we wish to find an estimate that is within an additive error of εµ. Using equation (3), but substituting εµ in place of ε, it is enough for the sample size n to satisfy

    n ≥ (σ²/µ²) · 1/(ε²δ).    (6)

Here ε and δ are the desired relative error and confidence respectively. Now of course we don't know the other two quantities, µ and σ², appearing in equation (6). In practice, we would use a lower bound on µ and an upper bound on σ² (just as we used a lower bound on p in the Democrats problem). Plugging these bounds into equation (6) will ensure that our sample size is large enough.

For example, in the average wealth problem we could probably safely take µ to be at least (say) $20k (probably more). However, the existence of people such as Bill Gates means that we would need to take a very high value for the variance σ². Indeed, if there is at least one individual with wealth $50 billion, then even assuming a relatively small value of µ, the variance must be at least about (50 × 10⁹)²/(250 × 10⁶) = 10¹³. (Check this.) There is really no way around this problem with simple uniform sampling: the uneven distribution of wealth means that the variance is inherently very large, and we will need a huge number of samples before we are likely to find anybody who is immensely wealthy. But if we don't include such people in our sample, then our estimate will be way too low.
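Inequalities (5) and (6) amount to a one-line sample-size calculator. Here is a minimal sketch (the function names are ours, not the note's):

```python
import math

def sample_size_proportion(eps, delta):
    """Smallest n satisfying n >= 1/(4 eps^2 delta), inequality (5)."""
    return math.ceil(1 / (4 * eps**2 * delta))

def sample_size_mean(mu_lower, sigma2_upper, eps, delta):
    """Smallest n satisfying n >= (sigma^2/mu^2) / (eps^2 delta), inequality (6),
    given a lower bound on mu and an upper bound on sigma^2."""
    return math.ceil((sigma2_upper / mu_lower**2) / (eps**2 * delta))

print(sample_size_proportion(0.1, 0.05))          # 500, as computed in the text
# Wealth example: mu >= $20,000 and sigma^2 ~ 10^13, as discussed above.
print(sample_size_mean(20_000, 1e13, 0.1, 0.05))  # 50,000,000 samples!
```

With the wealth numbers from the text, the bound demands fifty million samples, which makes concrete why heavy-tailed quantities are so hard to estimate by uniform sampling.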

As a further example, suppose we are trying to estimate the average rate of emission from a radioactive source, and we are willing to assume that the emissions follow a Poisson distribution with some unknown parameter λ; of course, this λ is precisely the expectation we are trying to estimate. Now in this case we have µ = λ and also σ² = λ (see the previous lecture note). So σ²/µ² = 1/λ. Thus in this case a sample size of n = 1/(λε²δ) suffices. (Again, in practice we would use a lower bound on λ.)

The Law of Large Numbers

The estimation method we used in the previous two sections is based on a principle that we accept as part of everyday life: namely, the Law of Large Numbers (LLN). This asserts that, if we observe some random variable many times, and take the average of the observations, then this average will converge to a single value, which is of course the expectation of the random variable. In other words, averaging tends to smooth out any large fluctuations, and the more averaging we do, the better the smoothing.

Theorem 17.3 (Law of Large Numbers): Let X_1, X_2, ..., X_n be i.i.d. random variables with common expectation µ = E(X_i). Define A_n = (1/n)∑_{i=1}^n X_i. Then for any α > 0, we have

    Pr[|A_n − µ| ≥ α] → 0   as n → ∞.

Proof: Let Var(X_i) = σ² be the common variance of the r.v.'s; we assume that σ² is finite.² With this (relatively mild) assumption, the LLN is an immediate consequence of Chebyshev's Inequality. For, as we have seen above, E(A_n) = µ and Var(A_n) = σ²/n, so by Chebyshev we have

    Pr[|A_n − µ| ≥ α] ≤ Var(A_n)/α² = σ²/(nα²) → 0   as n → ∞.

This completes the proof.

Notice that the LLN says that the probability of any deviation α from the mean, however small, tends to zero as the number of observations n in our average tends to infinity. Thus, by taking n large enough, we can make the probability of any given deviation as small as we like.

²If σ² is not finite, the LLN still holds, but the proof is much trickier.
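Finally, the LLN is easy to watch in action. The sketch below (our own illustration; a fair six-sided die with µ = 3.5 and an arbitrary deviation α = 0.1) estimates Pr[|A_n − µ| ≥ α] for growing n:

```python
import random

def lln_deviation_prob(n, alpha=0.1, trials=2_000):
    """Empirical Pr[|A_n - mu| >= alpha], where A_n averages n fair die rolls."""
    mu = 3.5  # expectation of one roll of a fair six-sided die
    bad = 0
    for _ in range(trials):
        a_n = sum(random.randint(1, 6) for _ in range(n)) / n
        if abs(a_n - mu) >= alpha:
            bad += 1
    return bad / trials

for n in (10, 100, 1000, 10_000):
    print(f"n={n:6d}  Pr[|A_n - 3.5| >= 0.1] ~ {lln_deviation_prob(n):.3f}")
```

The printed probabilities fall toward zero as n grows, exactly as Theorem 17.3 predicts; Chebyshev gives the rate σ²/(nα²), with σ² = 35/12 for a fair die.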