
EECS 70 Discrete Mathematics and Probability Theory Spring 2014 Anant Sahai Lecture 16

Variance

Question: Let us return once again to the question of how many heads appear in a typical sequence of coin flips. Recall that we used the Galton board to visualize this as follows: consider a token falling through the Galton board, deflected randomly to the right or left at each step. We think of these random deflections as taking place according to whether a fair coin comes up heads or tails. The expected position of the token when it reaches the bottom is right in the middle, but how far from the middle should we typically expect it to land?

Denoting a right-move by +1 and a left-move by -1, we can describe the probability space here as the set of all words of length n over the alphabet {±1}, each having equal probability 1/2^n. Let the r.v. X denote our position (relative to our starting point 0) after n moves. Thus

X = X_1 + X_2 + ... + X_n,

where X_i = +1 if the ith toss is Heads and X_i = -1 otherwise.

Now obviously we have E(X) = 0. The easiest way to see this is to note that E(X_i) = (1/2 · 1) + (1/2 · (-1)) = 0, so by linearity of expectation E(X) = Σ_{i=1}^n E(X_i) = 0. But of course this is not very informative, and is due to the fact that positive and negative deviations from 0 cancel out. What the above question is really asking is: what is the expected value of |X|, the distance from 0?

Rather than consider the r.v. |X|, which is a little awkward due to the absolute value operator, we will instead look at the r.v. X^2. Notice that this also has the effect of making all deviations from 0 positive, so it should also give a good measure of the distance traveled. However, because it is the squared distance, we will need to take a square root at the end to make the units make sense. Let's calculate E(X^2):

E(X^2) = E((X_1 + X_2 + ... + X_n)^2) = E(Σ_{i=1}^n X_i^2 + Σ_{i≠j} X_i X_j) = Σ_{i=1}^n E(X_i^2) + Σ_{i≠j} E(X_i X_j).

In the last step here, we used linearity of expectation. To proceed, we need to compute E(X_i^2) and E(X_i X_j) (for i ≠ j). Let's consider first X_i^2. Since X_i can take on only the values ±1, clearly X_i^2 = 1 always, so E(X_i^2) = 1. What about E(X_i X_j)? Well, X_i X_j = +1 when X_i = X_j = +1 or X_i = X_j = -1, and otherwise X_i X_j = -1. Also,

Pr[(X_i = X_j = +1) ∪ (X_i = X_j = -1)] = Pr[X_i = X_j = +1] + Pr[X_i = X_j = -1] = 1/4 + 1/4 = 1/2,

so X_i X_j = +1 with probability 1/2. In the above calculation we used the fact that the events X_i = +1 and X_j = +1 are independent, so Pr[X_i = X_j = +1] = Pr[X_i = +1] · Pr[X_j = +1] = 1/2 · 1/2 = 1/4 (and similarly for Pr[X_i = X_j = -1]). Therefore X_i X_j = -1 with probability 1/2 also. Hence E(X_i X_j) = 0. Alternatively, since X_i and X_j are independent, we saw in the last lecture note that E(X_i X_j) = E(X_i)E(X_j) = 0.

Plugging these values into the above equation gives E(X^2) = (n · 1) + 0 = n. So we see that our expected squared distance from 0 is n. One interpretation of this is that we might expect to be a distance of about √n away from 0 after n steps.
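The value E(X^2) = n can also be checked empirically. The following is a small simulation sketch added here for illustration (it is not part of the original note); the trial counts and names are arbitrary choices. It estimates E(X^2) for the ±1 random walk, and the estimates should come out close to n.

    import random

    def estimate_mean_square(n, trials=20_000):
        """Estimate E(X^2) for X = X_1 + ... + X_n, where each X_i is uniform on {-1, +1}."""
        total = 0
        for _ in range(trials):
            x = sum(random.choice((-1, 1)) for _ in range(n))
            total += x * x
        return total / trials

    if __name__ == "__main__":
        for n in (10, 100, 1000):
            # Each printed estimate should be close to n.
            print(n, estimate_mean_square(n))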

However, we have to be careful here: we cannot simply argue that E(|X|) = √(E(X^2)) = √n. (Why not?) We will see later in the lecture how to make precise deductions about |X| from knowledge of E(X^2). For the moment, however, let's agree to view E(X^2) as an intuitive measure of "spread" of the r.v. X. In fact, for a more general r.v. with expectation E(X) = μ, what we are really interested in is E((X - μ)^2), the expected squared distance from the mean. In our random walk example, we had μ = 0, so E((X - μ)^2) just reduces to E(X^2).

Definition 16.1 (variance): For a r.v. X with expectation E(X) = μ, the variance of X is defined to be

Var(X) = E((X - μ)^2).

The square root σ(X) := √Var(X) is called the standard deviation of X.

The point of the standard deviation is merely to undo the squaring in the variance. Thus the standard deviation is on the same scale as the r.v. itself. Since the variance and standard deviation differ just by a square, it really doesn't matter which one we choose to work with, as we can always compute one from the other immediately. We shall usually use the variance. For the random walk example above, we have Var(X) = n, and the standard deviation of X, σ(X), is √n.

The following easy observation gives us a slightly different way to compute the variance that is simpler in many cases.

Theorem 16.1: For a r.v. X with expectation E(X) = μ, we have Var(X) = E(X^2) - μ^2.

Proof: From the definition of variance, we have

Var(X) = E((X - μ)^2) = E(X^2 - 2μX + μ^2) = E(X^2) - 2μE(X) + μ^2 = E(X^2) - μ^2.

In the third step here, we used linearity of expectation.

Examples

Let's see some examples of variance calculations.

1. Fair die. Let X be the score on the roll of a single fair die. Recall from an earlier lecture that E(X) = 7/2. So we just need to compute E(X^2), which is a routine calculation:

E(X^2) = (1/6)(1^2 + 2^2 + 3^2 + 4^2 + 5^2 + 6^2) = 91/6.

Thus from Theorem 16.1, Var(X) = E(X^2) - (E(X))^2 = 91/6 - 49/4 = 35/12.

More generally, if X is a random variable that takes on the values 1, ..., n with equal probability 1/n (i.e. X has a uniform distribution), the mean, variance and standard deviation of X are

E(X) = (n+1)/2,   Var(X) = (n^2 - 1)/12,   σ(X) = √((n^2 - 1)/12).

(You should verify these.)
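As a quick numerical check of these formulas, here is a short sketch (an addition for illustration, not part of the note) that computes the mean and variance of the uniform distribution on {1, ..., n} exactly and compares them with (n+1)/2 and (n^2 - 1)/12.

    import math

    def uniform_stats(n):
        """Exact mean and variance of a r.v. that is uniform on {1, ..., n}."""
        values = range(1, n + 1)
        mean = sum(values) / n
        var = sum(v * v for v in values) / n - mean ** 2  # Var(X) = E(X^2) - E(X)^2
        return mean, var

    if __name__ == "__main__":
        for n in (6, 10, 100):
            mean, var = uniform_stats(n)
            # The paired columns should agree; for n = 6 the variance is 35/12.
            print(n, mean, (n + 1) / 2, var, (n * n - 1) / 12, math.sqrt(var))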

2. Number of fixed points. Let X be the number of fixed points in a random permutation of n items (i.e., the number of students in a class of size n who receive their own homework after shuffling). We saw in an earlier lecture that E(X) = 1 (regardless of n). To compute E(X^2), write X = X_1 + X_2 + ... + X_n, where X_i = 1 if i is a fixed point and X_i = 0 otherwise. Then as usual we have

E(X^2) = Σ_{i=1}^n E(X_i^2) + Σ_{i≠j} E(X_i X_j).   (1)

Since X_i is an indicator r.v., we have E(X_i^2) = Pr[X_i = 1] = 1/n. Since both X_i and X_j are indicators, we can compute E(X_i X_j) as follows:

E(X_i X_j) = Pr[X_i = 1 ∧ X_j = 1] = Pr[both i and j are fixed points] = 1/(n(n-1)).

[Check that you understand the last step here.] Plugging this into equation (1) we get

E(X^2) = (n · 1/n) + (n(n-1) · 1/(n(n-1))) = 1 + 1 = 2.

Thus Var(X) = E(X^2) - (E(X))^2 = 2 - 1 = 1. I.e., the variance and the mean are both equal to 1. Like the mean, the variance is also independent of n. Intuitively at least, this means that it is unlikely that there will be more than a small number of fixed points even when the number of items, n, is very large.

Variance for sums of independent random variables

One of the most important and useful facts about variance is that if a random variable X is the sum of independent random variables X = X_1 + ... + X_n, then its variance is the sum of the variances of the individual r.v.'s. This is not just true for the specific coin-toss example considered at the beginning of the note. In particular, if the individual r.v.'s X_i are identically distributed, then Var(X) = Σ_i Var(X_i) = n Var(X_1). This means that the standard deviation is σ(X) = √n · σ(X_1). Note that by contrast, the expected value is E[X] = n E[X_1]. Intuitively this means that whereas the average value of X grows proportionally to n, the spread of the distribution grows proportionally only to √n. In other words, the distribution of X tends to concentrate around its mean.
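Here is a small simulation sketch (again an addition, not part of the note; the sample counts are arbitrary) that checks the fixed-point calculation: for every n, both the sample mean and the sample variance of the number of fixed points should be close to 1.

    import random
    from statistics import mean, pvariance

    def fixed_points(n):
        """Number of fixed points in a uniformly random permutation of {0, ..., n-1}."""
        perm = list(range(n))
        random.shuffle(perm)
        return sum(1 for i, p in enumerate(perm) if i == p)

    if __name__ == "__main__":
        for n in (5, 50, 500):
            samples = [fixed_points(n) for _ in range(20_000)]
            # Both numbers should be close to 1, independent of n.
            print(n, mean(samples), pvariance(samples))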

Let us formalize these ideas.

Theorem 16.2: For any random variable X and constant c, we have

Var(cX) = c^2 Var(X).

And for independent random variables X, Y, we have

Var(X + Y) = Var(X) + Var(Y).

Proof: The first part of the proof is a simple calculation; you should verify it yourself as an exercise. From the alternative formula for variance in Theorem 16.1, we have, using linearity of expectation extensively,

Var(X + Y) = E((X + Y)^2) - E(X + Y)^2
           = E(X^2) + E(Y^2) + 2E(XY) - (E(X) + E(Y))^2
           = (E(X^2) - E(X)^2) + (E(Y^2) - E(Y)^2) + 2(E(XY) - E(X)E(Y))
           = Var(X) + Var(Y) + 2(E(XY) - E(X)E(Y)).

Now because X and Y are independent, the final term in this expression is zero. Hence we get our result.

Note: The expression E(XY) - E(X)E(Y) appearing in the above proof is called the covariance of X and Y, and is a measure of the dependence between X and Y. It is always zero when X and Y are independent, but it can also be zero in other cases; random variables with zero covariance are called uncorrelated, and uncorrelated random variables need not be independent. The counterexample at the end of the last note gives a pair of random variables that are not independent but are uncorrelated.

By induction, you can extend the result above to any finite collection of independent random variables. In the homework, however, you will see that pairwise independence actually suffices; full mutual independence is not needed.

Example

Let's return to our motivating example of a sequence of n coin tosses. Let X be the number of Heads in n tosses of a biased coin with Heads probability p (i.e., X has the binomial distribution with parameters n and p). We already know that E(X) = np. As usual, let X = X_1 + X_2 + ... + X_n, where X_i = 1 if the ith toss is Heads and X_i = 0 otherwise. We can compute

Var(X_i) = E(X_i^2) - E(X_i)^2 = p - p^2 = p(1 - p).

So Var(X) = np(1 - p).

As an example, for a fair coin the expected number of Heads in n tosses is n/2, and the standard deviation is √n/2. Note that since the maximum number of Heads is n, the standard deviation is much less than this maximum number for large n. This is in contrast to the previous example of the uniformly distributed random variable, where the standard deviation σ(X) = √((n^2 - 1)/12) ≈ n/√12 is of the same order as the largest value n. In this sense, the spread of a binomially distributed r.v. is much smaller than that of a uniformly distributed r.v.
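The identity Var(X) = np(1 - p) can be checked with another short simulation sketch (an addition; the values of n, p and the number of draws are arbitrary illustrative choices):

    import random
    from statistics import pvariance

    def binomial_sample(n, p):
        """One draw of the number of Heads in n tosses of a coin with Heads probability p."""
        return sum(1 for _ in range(n) if random.random() < p)

    if __name__ == "__main__":
        n, p = 100, 0.3
        draws = [binomial_sample(n, p) for _ in range(50_000)]
        # The empirical variance should be close to n*p*(1-p) = 21.
        print(pvariance(draws), n * p * (1 - p))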

However, to make precise exactly how the variance is connected to the spread of a distribution, we need to build one more critical tool: a way to use expectations to bound the underlying probability distributions themselves.

Chebyshev's Inequality

We have seen that, intuitively, the variance (or, more correctly, the standard deviation) is a measure of spread, or deviation from the mean. Our next goal is to make this intuition quantitatively precise. What we can show is the following:

Theorem 16.3: [Chebyshev's Inequality] For a random variable X with expectation E(X) = μ, and for any α > 0,

Pr[|X - μ| ≥ α] ≤ Var(X)/α^2.

Before proving Chebyshev's inequality, let's pause to consider what it says. It tells us that the probability of any given deviation α from the mean, either above it or below it (note the absolute value sign), is at most Var(X)/α^2. As expected, this deviation probability will be small if the variance is small. An immediate corollary of Chebyshev's inequality is the following:

Corollary 16.4: For a random variable X with expectation E(X) = μ and standard deviation σ = √Var(X),

Pr[|X - μ| ≥ βσ] ≤ 1/β^2.

Proof: Plug α = βσ into Chebyshev's inequality.

So, for example, we see that the probability of deviating from the mean by more than (say) two standard deviations on either side is at most 1/4. In this sense, the standard deviation is a good working definition of the "width" or "spread" of a distribution.

We should now go back and prove Chebyshev's inequality. The proof will make use of the following simpler bound, which applies only to non-negative random variables (i.e., r.v.'s which take only values ≥ 0).

Theorem 16.5: [Markov's Inequality] For a non-negative random variable X with expectation E(X) = μ, and any α > 0,

Pr[X ≥ α] ≤ E(X)/α.

Proof: From the definition of expectation, we have

E(X) = Σ_a a·Pr[X = a] ≥ Σ_{a≥α} a·Pr[X = a] ≥ α Σ_{a≥α} Pr[X = a] = α·Pr[X ≥ α].

The crucial step here is the first inequality, where we drop all the terms with a < α; this is valid only because X takes on no negative values. (Why is this step not valid otherwise?)

There is an intuitive way of understanding Markov's inequality through the analogy of a seesaw. Imagine that the distribution of a non-negative random variable X is resting on a fulcrum at μ = E(X). We are trying to find an upper bound on the fraction of the distribution which lies beyond kμ, i.e. Pr[X ≥ kμ]. In other words, we seek to add as much weight m_2 as possible on the seesaw at kμ while minimizing the effect it has on the seesaw's balance. This weight will represent the upper bound we are searching for. To minimize the weight's effect, we imagine that all of the distribution's weight lying beyond kμ is concentrated at exactly kμ. However, to keep things balanced and maximize the weight at kμ, we must place another weight m_1 as far to the left of the fulcrum as we can, so that m_2 can be as large as possible. The farthest we can go to the left is 0, since X is non-negative. Moreover, the two weights m_1 and m_2 must add up to 1, since they represent the total area under the distribution curve. Since the lever arms are in the ratio k - 1 to 1, a unit weight at kμ balances k - 1 units of weight at 0. So the weights should be (k-1)/k at 0 and 1/k at kμ, which is exactly Markov's bound.

Figure 1: Markov's inequality interpreted as balancing a seesaw.
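The seesaw argument actually identifies the distribution for which Markov's inequality is tight: all of the mass sits at 0 and at kμ. Here is a short sketch (an addition, not from the note; the choice k = 5 is arbitrary) that samples from this two-point distribution and checks that its mean is μ while Pr[X ≥ kμ] equals the Markov bound 1/k.

    import random

    def two_point_sample(k, mu=1.0):
        """Sample the extremal 'seesaw' distribution: k*mu with probability 1/k, else 0."""
        return k * mu if random.random() < 1 / k else 0.0

    if __name__ == "__main__":
        k, mu, trials = 5, 1.0, 200_000
        samples = [two_point_sample(k, mu) for _ in range(trials)]
        est_mean = sum(samples) / trials
        tail = sum(1 for s in samples if s >= k * mu) / trials
        print(est_mean, mu)          # the mean is mu, as required
        print(tail, mu / (k * mu))   # the tail probability matches Markov's bound 1/k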

Now we can prove Chebyshev's inequality quite easily.

Proof of Theorem 16.3: Define the r.v. Y = (X - μ)^2. Note that E(Y) = E((X - μ)^2) = Var(X). Also, notice that the probability we are interested in, Pr[|X - μ| ≥ α], is exactly the same as Pr[Y ≥ α^2]. (Why?) Moreover, Y is obviously non-negative, so we can apply Markov's inequality to it to get

Pr[Y ≥ α^2] ≤ E(Y)/α^2 = Var(X)/α^2.

This completes the proof.

Examples

Here are some examples of applications of Chebyshev's inequality (you should check the algebra in them):

1. Coin tosses. Let X be the number of Heads in n tosses of a fair coin. The probability that X deviates from μ = n/2 by more than √n is at most 1/4. The probability that it deviates by more than 5√n is at most 1/100. You have seen this in your virtual labs. You should also know that this is a pretty coarse bound.

2. Fixed points. Let X be the number of fixed points in a random permutation of n items; recall that E(X) = Var(X) = 1. Thus the probability that more than (say) 10 students get their own homework back after shuffling is at most 1/100, however large n is.

In some special cases, including the coin tossing example above, it is possible to get much tighter bounds on the probability of deviations from the mean. However, for general random variables Chebyshev's inequality is sometimes the only tool. Its power derives from the fact that it can be applied to any random variable, as long as it has a variance.
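To see how coarse the coin-toss bound is, here is one more simulation sketch (an addition; n = 400 and the trial count are arbitrary) comparing the empirical deviation probabilities with the Chebyshev bounds 1/4 and 1/100.

    import math
    import random

    def coin_deviation_prob(n, k, trials=50_000):
        """Empirical Pr[|X - n/2| > k*sqrt(n)] where X = number of Heads in n fair coin tosses."""
        mu, cutoff = n / 2, k * math.sqrt(n)
        hits = 0
        for _ in range(trials):
            x = sum(random.randint(0, 1) for _ in range(n))
            if abs(x - mu) > cutoff:
                hits += 1
        return hits / trials

    if __name__ == "__main__":
        n = 400
        print(coin_deviation_prob(n, 1), 1 / 4)     # empirical value is far below the bound 1/4
        print(coin_deviation_prob(n, 5), 1 / 100)   # empirical value is essentially 0 vs. the bound 1/100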

Estimating the bias of a coin

Question: We want to estimate the proportion p of Democrats in the US population, by taking a small random sample. How large does our sample have to be to guarantee that our estimate will be within (say) an additive error of 0.1 of the true value with probability at least 0.95?

This is perhaps the most basic statistical estimation problem, and it shows up everywhere. We will develop a simple solution that uses only Chebyshev's inequality. More refined methods can be used to get sharper results.

Let's denote the size of our sample by n (to be determined), and the number of Democrats in it by the random variable S_n. (The subscript n just reminds us that the r.v. depends on the size of the sample.) Then our estimate will be the value A_n = (1/n) S_n.

Now, as has often been the case, we will find it helpful to write S_n = X_1 + X_2 + ... + X_n, where X_i = 1 if person i in the sample is a Democrat and X_i = 0 otherwise. Note that each X_i can be viewed as a coin toss, with Heads probability p (though of course we do not know the value of p!). And the coin tosses are independent.[1] We call such a family of random variables independent, identically distributed, or i.i.d. for short.

What is the expectation of our estimate?

E(A_n) = E((1/n) S_n) = (1/n) E(X_1 + X_2 + ... + X_n) = (1/n)(np) = p.

So for any value of n, our estimate will always have the correct expectation p. [Such a r.v. is often called an unbiased estimator of p.] Now presumably, as we increase our sample size n, our estimate should get more and more accurate. This will show up in the fact that the variance decreases with n: i.e., as n increases, the probability that we are far from the mean p gets smaller.

To see this, we need to compute Var(A_n). But A_n = (1/n) Σ_{i=1}^n X_i, which is just a multiple of a sum of independent random variables. Therefore

Var(A_n) = Var((1/n) Σ_{i=1}^n X_i) = (1/n)^2 Var(Σ_{i=1}^n X_i) = (1/n)^2 Σ_{i=1}^n Var(X_i) = σ^2/n,

where we have written σ^2 for the variance of each of the X_i. So we see that the variance of A_n decreases linearly with n. This fact ensures that, as we take larger and larger sample sizes n, the probability that we deviate much from the expectation p gets smaller and smaller.

Let's now use Chebyshev's inequality to figure out how large n has to be to ensure a specified accuracy in our estimate of the proportion of Democrats p. A natural way to measure this is for us to specify two parameters, ε and δ, both in the range (0,1). The parameter ε controls the error we are prepared to tolerate in our estimate, and δ controls the confidence we want to have in our estimate. A more precise version of our original question is then the following:

Question: For the Democrat-estimation problem above, how large does the sample size n have to be in order to ensure that Pr[|A_n - p| ≥ ε] ≤ δ?

In our original question, we had ε = 0.1 and δ = 0.05.

Let's apply Chebyshev's inequality to answer our more precise question above. Since we know Var(A_n), this will be quite simple. From Chebyshev's inequality, we have

Pr[|A_n - p| ≥ ε] ≤ Var(A_n)/ε^2 = σ^2/(nε^2).

[1] We are assuming here that the sampling is done "with replacement"; i.e., we select each person in the sample from the entire population, including those we have already picked. So there is a small chance that we will pick the same person twice.
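Before finishing the calculation, here is a quick sketch (an addition; p = 0.4 and the poll counts are arbitrary) that simulates many polls and checks that the estimator A_n is unbiased and has variance close to p(1 - p)/n, shrinking as n grows.

    import random
    from statistics import mean, pvariance

    def one_poll(n, p):
        """One poll of size n: the observed fraction of 'Democrats', each sampled with probability p."""
        return sum(1 for _ in range(n) if random.random() < p) / n

    if __name__ == "__main__":
        p = 0.4
        for n in (50, 500, 5000):
            polls = [one_poll(n, p) for _ in range(5_000)]
            # The sample mean should be near p; the sample variance near p*(1-p)/n.
            print(n, mean(polls), pvariance(polls), p * (1 - p) / n)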

To make this bound less than the desired value δ, we need to set

n ≥ σ^2/(ε^2 δ).   (2)

Now recall that σ^2 = Var(X_i) is the variance of a single sample X_i. So, since X_i is a 0/1-valued r.v., we have σ^2 = p(1 - p), and inequality (2) becomes

n ≥ p(1 - p)/(ε^2 δ).   (3)

Since p(1 - p) takes on its maximum[2] value at p = 1/2, we can conclude that it is sufficient to choose n such that

n ≥ 1/(4ε^2 δ).   (4)

Plugging in ε = 0.1 and δ = 0.05, we see that a sample size of n = 500 is sufficient.

Notice that the size of the sample is independent of the total size of the population! This is how polls can accurately estimate quantities of interest for a population of several hundred million while sampling only a very small number of people.

Estimating a general expectation

What if we wanted to estimate something a little more complex than the proportion of Democrats in the population, such as the average wealth of people in the US? Then we could use exactly the same scheme as above, except that now the r.v. X_i is the wealth of the ith person in our sample. Clearly E(X_i) = μ, the average wealth (which is what we are trying to estimate). And our estimate will again be A_n = (1/n) Σ_{i=1}^n X_i, for a suitably chosen sample size n. Once again the X_i are i.i.d. random variables, so we again have E(A_n) = μ and Var(A_n) = σ^2/n, where σ^2 = Var(X_i) is the variance of the X_i. (Recall that the only facts we used about the X_i were that they were independent and had the same distribution; actually, the same expectation and variance would be enough. Why?)

This time, however, since we do not have any a priori bound on the mean μ, it makes more sense to let ε measure the relative error, i.e., we wish to find an estimate A_n that is within an additive error of εμ:

Pr[|A_n - μ| ≥ εμ] ≤ δ.

Using equation (2), but substituting εμ in place of ε, it is enough for the sample size n to satisfy

n ≥ (σ^2/μ^2) · 1/(ε^2 δ).   (5)

Here ε and δ are the desired relative error and confidence respectively. Now of course we don't know the other two quantities, μ and σ^2, appearing in equation (5). In practice, we would use a lower bound on μ and an upper bound on σ^2 (just as we used a lower bound on p in the Democrats problem). Plugging these bounds into equation (5) will ensure that our sample size is large enough.

For example, in the average wealth problem we could probably safely take μ to be at least (say) $20k (probably more). However, the existence of people such as Bill Gates means that we would need to take a very high value for the variance σ^2. Indeed, if there is at least one individual with wealth $50 billion, then assuming a relatively small value of μ means that the variance must be at least about (50 × 10^9)^2 / (250 × 10^6) = 10^13. (Check this.)

[2] Use calculus if you need to see why this is true.
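The arithmetic in (4) is easy to script. The sketch below (an addition; the simulation check and its parameters are illustrative) computes the Chebyshev-based sample size and then verifies empirically that a poll of that size misses by ε or more far less often than δ, reflecting how conservative the bound is.

    import math
    import random

    def sample_size(eps, delta):
        """Chebyshev-based sample size: smallest n with n >= 1/(4*eps^2*delta)."""
        return math.ceil(1 / (4 * eps ** 2 * delta))

    def failure_rate(n, p, eps, polls=20_000):
        """Empirical Pr[|A_n - p| >= eps] over many simulated polls of size n."""
        bad = 0
        for _ in range(polls):
            a = sum(1 for _ in range(n) if random.random() < p) / n
            if abs(a - p) >= eps:
                bad += 1
        return bad / polls

    if __name__ == "__main__":
        eps, delta = 0.1, 0.05
        n = sample_size(eps, delta)
        print(n)                                 # 500, matching the calculation above
        print(failure_rate(n, 0.5, eps), delta)  # observed failure rate is far below delta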

There is really no way around this problem with simple uniform sampling: the uneven distribution of wealth means that the variance is inherently very large, and we will need a huge number of samples before we are likely to find anybody who is immensely wealthy. But if we don't include such people in our sample, then our estimate will be way too low.

The Law of Large Numbers

The estimation method we used in the previous two sections is based on a principle that we accept as part of everyday life: namely, the Law of Large Numbers (LLN). This asserts that, if we observe some random variable many times, and take the average of the observations, then this average will converge to a single value, which is of course the expectation of the random variable. In other words, averaging tends to smooth out any large fluctuations, and the more averaging we do the better the smoothing.

Theorem 16.6: [Law of Large Numbers] Let X_1, X_2, ..., X_n be i.i.d. random variables with common expectation μ = E(X_i). Define A_n = (1/n) Σ_{i=1}^n X_i. Then for any α > 0, we have

Pr[|A_n - μ| ≥ α] → 0 as n → ∞.

Proof: Let Var(X_i) = σ^2 be the common variance of the r.v.'s; we assume that σ^2 is finite.[3] With this (relatively mild) assumption, the LLN is an immediate consequence of Chebyshev's Inequality. For, as we have seen above, E(A_n) = μ and Var(A_n) = σ^2/n, so by Chebyshev we have

Pr[|A_n - μ| ≥ α] ≤ Var(A_n)/α^2 = σ^2/(nα^2) → 0 as n → ∞.

This completes the proof.

Notice that the LLN says that the probability of any deviation α from the mean, however small, tends to zero as the number of observations n in our average tends to infinity. Thus by taking n large enough, we can make the probability of any given deviation as small as we like.

[3] If σ^2 is not finite, the LLN still holds but the proof is much trickier.
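Finally, here is a closing sketch (an addition; the die example and parameters are arbitrary) that illustrates the LLN numerically: for a fixed deviation α, the empirical probability that the average of n fair die rolls differs from μ = 3.5 by at least α shrinks toward 0 as n grows.

    import random

    def die_deviation_prob(n, alpha, trials=5_000):
        """Empirical Pr[|A_n - 3.5| >= alpha] where A_n is the average of n fair die rolls."""
        mu, bad = 3.5, 0
        for _ in range(trials):
            a = sum(random.randint(1, 6) for _ in range(n)) / n
            if abs(a - mu) >= alpha:
                bad += 1
        return bad / trials

    if __name__ == "__main__":
        for n in (10, 100, 1000, 10_000):
            print(n, die_deviation_prob(n, alpha=0.1))   # tends to 0 as n grows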