princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 3: Large deviations bounds and applications Lecturer: Sanjeev Arora

Today's topic is deviation bounds: what is the probability that a random variable deviates from its mean by a lot? Recall that a random variable X is a mapping from a probability space to R. The expectation or mean is denoted E[X] or sometimes as µ.

In many settings we have a set of n random variables X_1, X_2, X_3, ..., X_n defined on the same probability space. To give an example, the probability space could be that of all possible outcomes of n tosses of a fair coin, and X_i is the random variable that is 1 if the i-th toss is a head, and 0 otherwise, which means E[X_i] = 1/2.

The first observation we make is that of the Linearity of Expectation, viz.

    E[Σ_i X_i] = Σ_i E[X_i].

It is important to realize that linearity holds regardless of whether or not the random variables are independent.

Can we say something about E[X_1 X_2]? In general, nothing much, but if X_1, X_2 are independent (formally, this means that for all a, b, Pr[X_1 = a, X_2 = b] = Pr[X_1 = a] Pr[X_2 = b]), then E[X_1 X_2] = E[X_1] E[X_2]. Note that if the X_i's are pairwise independent (i.e., each pair is mutually independent), then var[Σ_i X_i] = Σ_i var[X_i].

1 Three progressively stronger tail bounds

Now we give three methods that give progressively stronger bounds.

1.1 Markov's Inequality (aka averaging)

The first of a number of inequalities presented today, Markov's inequality says that any non-negative random variable X satisfies

    Pr[X ≥ k E[X]] ≤ 1/k.

Note that this is just another way to write the trivial observation that E[X] ≥ k · Pr[X ≥ k].

Can we give any meaningful upper bound on Pr[X < c E[X]] where c < 1, in other words the probability that X is a lot less than its expectation? In general we cannot. However, if we know an upper bound on X then we can. For example, if X ∈ [0, 1] and E[X] = µ, then for any c < 1 we have (simple exercise)

    Pr[X ≤ cµ] ≤ (1 − µ)/(1 − cµ).

Sometimes this is also called an averaging argument.

Example 1 Suppose you took a lot of exams, each scored from 1 to 100. If your average score was 90, then in at least half the exams you scored at least 80.
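The following small Python simulation is not part of the original notes; it is a sketch that sanity-checks the two bounds above. The particular distributions (Exponential(1) for the Markov check, U² with U uniform for the averaging check) are arbitrary illustrative choices.

    import random

    random.seed(0)
    N = 200_000

    # Markov: any non-negative X satisfies Pr[X >= k*E[X]] <= 1/k.
    # Illustrative choice of X (an assumption): X ~ Exponential(1), so E[X] = 1.
    xs = [random.expovariate(1.0) for _ in range(N)]
    mu = sum(xs) / N
    for k in (2, 5, 10):
        tail = sum(x >= k * mu for x in xs) / N
        print(f"Pr[X >= {k} mu] ~ {tail:.4f}   Markov bound 1/k = {1 / k:.4f}")

    # Averaging bound for X in [0, 1]: Pr[X <= c*mu] <= (1 - mu)/(1 - c*mu).
    # Illustrative choice: X = U^2 with U uniform on [0, 1], so E[X] = 1/3.
    ys = [random.random() ** 2 for _ in range(N)]
    mu_y = sum(ys) / N
    for c in (0.25, 0.5):
        frac = sum(y <= c * mu_y for y in ys) / N
        bound = (1 - mu_y) / (1 - c * mu_y)
        print(f"Pr[X <= {c} mu] ~ {frac:.4f}   averaging bound = {bound:.4f}")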

1.2 Chebyshev's Inequality

The variance of a random variable X is one measure (there are others too) of how spread out it is around its mean. It is defined as E[(X − µ)²] = E[X²] − µ².

A more powerful inequality, Chebyshev's inequality, says

    Pr[|X − µ| ≥ kσ] ≤ 1/k²,

where µ and σ² are the mean and variance of X. Recall that σ² = E[(X − µ)²] = E[X²] − µ².

Actually, Chebyshev's inequality is just a special case of Markov's inequality: by definition,

    E[|X − µ|²] = σ²,

and so,

    Pr[|X − µ|² ≥ k²σ²] ≤ 1/k².

Here is a simple fact that is used a lot: if Y_1, Y_2, ..., Y_t are iid (which is jargon for independent and identically distributed) then the variance of their average (1/t) Σ_i Y_i is exactly 1/t times the variance of one of them. Using Chebyshev's inequality, this already implies that the average of iid variables converges sort-of strongly to the mean.

1.2.1 Example: Load balancing

Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to n processors. Let X = number of balls assigned to the first bin. Then E[X] = m/n. What is the chance that X > 2m/n? Markov's inequality says this is less than 1/2.

To use Chebyshev we need to compute the variance of X. For this let Y_i be the indicator random variable that is 1 iff the i-th ball falls in the first bin. Then X = Σ_i Y_i. Hence

    E[X²] = E[Σ_i Y_i² + 2 Σ_{i<j} Y_i Y_j] = Σ_i E[Y_i²] + 2 Σ_{i<j} E[Y_i Y_j].

Now for independent random variables E[Y_i Y_j] = E[Y_i] E[Y_j], so

    E[X²] = m/n + m(m − 1)/n².

Hence the variance is very close to m/n, and thus Chebyshev implies that

    Pr[X > 2m/n] < n/m.

When m > 3n, say, this is stronger than Markov.
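A quick way to see the gap between the two bounds in this example is to simulate the experiment. The sketch below is my addition (the values of m and n are illustrative, not from the notes); it estimates Pr[X > 2m/n] empirically and prints the Markov and Chebyshev bounds next to it.

    import random

    random.seed(0)
    m, n = 200, 20        # m balls into n bins (illustrative values)
    trials = 50_000

    exceed = 0
    for _ in range(trials):
        # Count how many of the m balls land in the first bin (bin 0).
        first_bin = sum(random.randrange(n) == 0 for _ in range(m))
        if first_bin > 2 * m / n:
            exceed += 1

    print(f"empirical Pr[X > 2m/n] ~ {exceed / trials:.5f}")
    print(f"Markov bound:    1/2 = {1 / 2}")
    print(f"Chebyshev bound: n/m = {n / m}")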

1.3 Large deviation bounds

When we toss a coin many times, the expected number of heads is half the number of tosses. How tightly is this distribution concentrated? Should we be very surprised if after 1000 tosses we have 625 heads?

The Central Limit Theorem says that the sum of n independent random variables (with bounded mean and variance) converges to the famous Gaussian distribution (popularly known as the Bell Curve). This is very useful in algorithm design: we maneuver to design algorithms so that the analysis boils down to estimating the sum of independent (or somewhat independent) random variables.

To do a back of the envelope calculation, if all n coin tosses are fair (Heads has probability 1/2) then the Gaussian approximation implies that the probability of seeing N heads where |N − n/2| > a√n is at most e^(−a²/2). The chance of seeing at least 625 heads in 1000 tosses of an unbiased coin is less than 5.3 × 10^(−7). These are pretty strong bounds! Of course, for finite n the sum of n random variables need not be an exact Gaussian, and that is where Chernoff bounds come in. (By the way, these bounds are also known by other names in different fields since they have been independently discovered.)

First we give an inequality that works for general variables that are real-valued in [−1, 1]. (To apply it to more general bounded variables, just scale them to [−1, 1] first.)

Theorem 1 (Quantitative version of CLT due to H. Chernoff)
Let X_1, X_2, ..., X_n be independent random variables with each X_i ∈ [−1, 1], and let µ_i = E[X_i] and σ_i² = var[X_i]. Then X = Σ_i X_i satisfies

    Pr[|X − µ| > kσ] ≤ 2 exp(−k²/(4n)),

where µ = Σ_i µ_i and σ² = Σ_i σ_i².

Instead of proving the above we prove a simpler theorem for binary valued variables which showcases the basic idea.

Theorem 2
Let X_1, X_2, ..., X_n be independent 0/1-valued random variables and let p_i = E[X_i], where 0 < p_i < 1. Then the sum X = Σ_{i=1}^n X_i, which has mean µ = Σ_{i=1}^n p_i, satisfies

    Pr[X ≥ (1 + δ)µ] ≤ (c_δ)^µ,

where c_δ is shorthand for e^δ/(1 + δ)^(1+δ).

Remark: There is an analogous inequality that bounds the probability of deviation below the mean, in which δ becomes negative, the ≥ in the probability becomes ≤, and the c_δ is very similar.

Proof: Surprisingly, this inequality is also proved using Markov's inequality. We introduce a positive dummy variable t and observe that

    E[exp(tX)] = E[exp(t Σ_i X_i)] = E[Π_i exp(tX_i)] = Π_i E[exp(tX_i)],    (1)

where the last equality holds because the X_i's are independent. Now,

    E[exp(tX_i)] = (1 − p_i) + p_i e^t,

therefore,

    Π_i E[exp(tX_i)] = Π_i [1 + p_i(e^t − 1)] ≤ Π_i exp(p_i(e^t − 1)) = exp(Σ_i p_i(e^t − 1)) = exp(µ(e^t − 1)),    (2)

since 1 + x ≤ e^x. Finally, apply Markov's inequality to the random variable exp(tX), viz.

    Pr[X ≥ (1 + δ)µ] = Pr[exp(tX) ≥ exp(t(1 + δ)µ)] ≤ E[exp(tX)]/exp(t(1 + δ)µ) = exp((e^t − 1)µ)/exp(t(1 + δ)µ),

using lines (1) and (2) and the fact that t is positive. Since t is a dummy variable, we can choose any positive value we like for it. The right hand side is minimized at t = ln(1 + δ) (just differentiate), and this choice leads to the theorem statement.
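To connect Theorem 2 back to the 625-heads question, the short Python computation below (my addition, not part of the notes) evaluates the exact binomial tail Pr[X ≥ 625] for 1000 fair tosses together with the bound (c_δ)^µ for µ = 500 and δ = 0.25.

    import math

    n, mu, delta = 1000, 500, 0.25
    threshold = int((1 + delta) * mu)        # 625 heads

    # Exact tail of Binomial(1000, 1/2): Pr[X >= 625].
    exact = sum(math.comb(n, k) for k in range(threshold, n + 1)) / 2 ** n

    # Chernoff bound from Theorem 2: (e^delta / (1 + delta)^(1 + delta))^mu.
    c_delta = math.exp(delta) / (1 + delta) ** (1 + delta)
    print(f"exact   Pr[X >= 625] = {exact:.3e}")
    print(f"Chernoff bound       = {c_delta ** mu:.3e}")

The bound comes out to roughly 5 × 10^(−7), matching the figure quoted in the back-of-the-envelope calculation above, while the exact tail is smaller still.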

2 Application 1: Sampling/Polling

Opinion polls and statistical sampling rely on tail bounds. Suppose there are n arbitrary numbers in [0, 1]. If we pick t of them at random (with replacement!) then the sample mean is within a (1 ± ε) factor of the true mean with probability at least 1 − δ if t > Ω((1/ε²) log(1/δ)). (Verify this calculation!)

In general, Chernoff bounds imply that taking k independent estimates and taking their mean ensures that the value is highly concentrated about their mean; large deviations happen with exponentially small probability.
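The following Python sketch is my addition (the data set, the target ε and δ, and the leading constant 3 in the sample-size formula are all illustrative assumptions, not derived constants). It repeatedly polls t random samples with replacement and reports how often the sample mean lands within a (1 ± ε) factor of the true mean.

    import math
    import random

    random.seed(0)

    # Illustrative data set: n arbitrary numbers in [0, 1].
    n = 10_000
    data = [random.random() for _ in range(n)]
    true_mean = sum(data) / n

    eps, delta = 0.05, 0.01
    # t = O((1/eps^2) log(1/delta)); the constant 3 is an arbitrary illustrative choice.
    t = math.ceil(3 / eps ** 2 * math.log(2 / delta))

    polls, good = 2_000, 0
    for _ in range(polls):
        sample_mean = sum(random.choice(data) for _ in range(t)) / t
        if abs(sample_mean - true_mean) <= eps * true_mean:
            good += 1

    print(f"t = {t} samples per poll")
    print(f"fraction of polls within a (1 +/- {eps}) factor of the true mean: "
          f"{good / polls:.3f}  (want at least {1 - delta})")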

3 Balls and Bins revisited: Load balancing

Suppose we toss m balls into n bins. You can think of m jobs being randomly assigned to n processors. Then the expected number of balls in each bin is m/n. When m = n this expectation is 1, but we saw in Lecture 1 that the most overloaded bin has Ω(log n / log log n) balls. However, if m = cn log n then the expected number of balls in each bin is c log n. Thus Chernoff bounds imply that the chance of seeing less than 0.5c log n or more than 1.5c log n balls in a given bin is less than γ^(c log n) for some constant γ < 1 (which depends on the 0.5, 1.5, etc.), and this can be made less than, say, 1/n² by choosing c to be a large constant.

Moral: if an office boss is trying to allocate work fairly, he/she should first create more work and then do a random assignment.

4 What about the median?

Given n numbers in [0, 1], can we approximate the median via sampling? This will be part of your homework.

Exercise: Show that it is impossible to estimate the value of the median within, say, a 1.1 factor with o(n) samples.

But what is possible is to produce an approximate median: a number that is greater than at least n/2 − n/t of the numbers and less than at least n/2 − n/t of the numbers. The idea is to take a random sample of a certain size and take the median of that sample. (Hint: Use balls and bins.)

One can use the approximate median algorithm to describe a version of quicksort with very predictable performance. Say we are given n numbers in an array. Recall that (random) quicksort is the sorting algorithm where you randomly pick one of the n numbers as a pivot, then partition the numbers into those that are bigger than and smaller than the pivot (which takes O(n) time). Then you recursively sort the two subsets. This procedure works in expected O(n log n) time, as you may have learnt in an undergrad course. But its performance is uneven because the pivot may not divide the instance into two exactly equal pieces. For instance, the chance that the running time exceeds 10 n log n is quite high. A better way to run quicksort is to first do a quick estimation of the median and then pivot on it. This algorithm runs in very close to n log n time, which is optimal.
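To make the last two sections concrete, here is a hedged Python sketch (my addition, not the algorithm as specified in the notes or the homework). It implements the "quick estimation of the median" step by taking the median of a small random sample and uses that as the quicksort pivot; the sample size s and the small-instance cutoff are illustrative choices.

    import random

    def approx_median(arr, s=101):
        # Median of a random sample of size s serves as an approximate median of arr.
        sample = random.sample(arr, min(s, len(arr)))
        return sorted(sample)[len(sample) // 2]

    def quicksort(arr):
        # Quicksort that pivots on an approximate median found by sampling.
        if len(arr) <= 16:                    # small-instance cutoff (arbitrary)
            return sorted(arr)
        pivot = approx_median(arr)
        smaller = [x for x in arr if x < pivot]
        equal   = [x for x in arr if x == pivot]
        bigger  = [x for x in arr if x > pivot]
        return quicksort(smaller) + equal + quicksort(bigger)

    if __name__ == "__main__":
        random.seed(0)
        data = [random.random() for _ in range(100_000)]
        assert quicksort(data) == sorted(data)
        print("sorted 100,000 numbers using approximate-median pivots")

The sample size controls how balanced the split is: a larger sample gives a pivot whose rank is closer to n/2, at the cost of more time spent choosing it.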