Korea Lectures, June 2017. Joel Spencer. Tuesday Lecture III.
1 Large Deviations

We analyze how random variables behave asymptotically. Initially we consider sums of random variables that take the values $+1$ and $-1$; then we will consider sums of arbitrary random variables. The key is what are called Chernoff Bounds.

Markov's Inequality: Given a non-negative random variable $Z$, $\Pr[Z \ge \alpha E[Z]] \le \alpha^{-1}$.

Chernoff Bound: For any $Z$ and any $a$,
$$\Pr[Z > a] = \Pr[e^{\lambda Z} > e^{\lambda a}] \le E[e^{\lambda Z}]\, e^{-\lambda a}$$
for any nonnegative real $\lambda$. The strength of the Chernoff Bound is that one selects $\lambda$ optimally (more usually near optimally, as the calculations are too hard for true optimality, and generally we are happy with a reasonable bound).

Let $X_1,\dots,X_n$ be independent random variables taking the values $\pm 1$ with probability $1/2$ each. It is immediate that $E[X_i] = 0$ and $\mathrm{Var}[X_i] = 1$. Define also $S_n = X_1 + \cdots + X_n$. The Central Limit Theorem says that when $n$ is large, $S_n \approx N(0,n)$. Applying the Chernoff bound, for any $\lambda > 0$,
$$\Pr[S_n \ge a n^{1/2}] = \Pr[e^{\lambda S_n} \ge e^{\lambda a n^{1/2}}] \le E[e^{\lambda S_n}]\, e^{-\lambda a n^{1/2}},$$
where the first equality holds because $\lambda > 0$ and the exponential function is monotone.

Theorem: Given $n$ and $a > 0$, $\Pr[S_n \ge a n^{1/2}] \le e^{-a^2/2}$.

Proof: Recall that although the expectation of a product of many RVs is not necessarily the product of the expectations, it is when the RVs are mutually independent. As that is the case here,
$$E[e^{\lambda S_n}] = E\Big[\prod_i e^{\lambda X_i}\Big] = \prod_i E[e^{\lambda X_i}] = \Big(\frac{e^\lambda + e^{-\lambda}}{2}\Big)^n = \cosh^n\lambda \le \big(e^{\lambda^2/2}\big)^n.$$
The last inequality can be shown by comparing the terms of the two Taylor series. Using the Chernoff Bound, we have
$$\Pr[S_n \ge a n^{1/2}] \le e^{\lambda^2 n/2 - \lambda a n^{1/2}} = e^{-a^2/2},$$
setting $\lambda = a n^{-1/2}$, which minimizes $\lambda^2 n/2 - \lambda a n^{1/2}$. End of Proof.
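To see the theorem in action, here is a small Python sketch comparing the empirical tail of $S_n$ with the bound $e^{-a^2/2}$; the values of $n$, the number of trials, and the grid of $a$'s are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 100, 100_000  # arbitrary sketch parameters

# sample S_n = X_1 + ... + X_n for independent uniform +/-1 signs
signs = rng.choice([-1, 1], size=(trials, n))
S = signs.sum(axis=1)

for a in [0.5, 1.0, 1.5, 2.0, 2.5]:
    empirical = np.mean(S >= a * np.sqrt(n))
    bound = np.exp(-a**2 / 2)
    print(f"a={a:.1f}  Pr[S_n >= a sqrt(n)] ~ {empirical:.4f}  bound e^(-a^2/2) = {bound:.4f}")
```

The bound is not tight (the true tail is roughly a polynomial factor smaller), but the simulation shows the same exponential order of decay.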
Now consider a more general setting. Let $Y_1,\dots,Y_n$ be mutually independent RVs with $E[Y_i] = 0$ and $\mathrm{Var}[Y_i] = \sigma_i^2$. Define their sum $Y = \sum_i Y_i$; thus $E[Y] = 0$ and $\mathrm{Var}[Y] = \sigma^2 = \sum_i \sigma_i^2$. We would like to conclude, as before, that the tail of $Y$ decays like $e^{-a^2/2}$, but unfortunately that is not always true. Using again the Markov Inequality and independence,
$$\Pr[Y \ge a\sigma] \le E[e^{\lambda Y}]\, e^{-\lambda a\sigma} = \prod_i E[e^{\lambda Y_i}]\, e^{-\lambda a\sigma}.$$
But in this case
$$E[e^{\lambda Y_i}] = E\Big[\sum_{j=0}^{\infty} \frac{\lambda^j}{j!}\, Y_i^j\Big] = 1 + \frac{\lambda^2 \sigma_i^2}{2} + \cdots \le e^{\lambda^2 \sigma_i^2/2}$$
is not always true. For the last inequality to hold we need the higher-order terms to be small with respect to the quadratic one. When this happens for all $i$, multiplying everything together we see that
$$\Pr[Y \ge a\sigma] < e^{\lambda^2 \sigma^2/2 - \lambda a\sigma} = e^{-a^2/2},$$
letting $\lambda = a/\sigma$. Most of the time, when we use this result, $a$ will be small compared to $\sigma$, because otherwise we would be looking very far off the mean. Therefore $\lambda$ will be small and the higher-order terms will be negligible, which makes the bound valid.

Now let's try the same for a normal distribution. If $N \sim N(0,1)$, it is known that its transform is
$$E[e^{\lambda N}] = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}\, e^{\lambda t}\, dt = e^{\lambda^2/2}.$$
Doing the same thing one more time,
$$\Pr[N \ge a] \le e^{\lambda^2/2 - \lambda a} = e^{-a^2/2},$$
letting $\lambda = a$. The last equation is what it intuitively should be: as the sum of mutually independent RVs is roughly normal, the Chernoff Bound and this bound should say the same thing.
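As a quick numerical check of the transform and the resulting tail bound, here is a sketch (assuming SciPy is available) that integrates the Gaussian density against $e^{\lambda t}$ and compares with the closed form $e^{\lambda^2/2}$; the grids of $\lambda$'s and $a$'s are arbitrary.

```python
import numpy as np
from scipy.integrate import quad
from scipy.stats import norm

# E[e^{lambda N}] for N ~ N(0,1), computed by numerical integration
for lam in [0.5, 1.0, 2.0]:
    mgf, _ = quad(lambda t: np.exp(lam * t - t**2 / 2) / np.sqrt(2 * np.pi),
                  -np.inf, np.inf)
    print(f"lambda={lam}: integral = {mgf:.6f}, e^(lambda^2/2) = {np.exp(lam**2 / 2):.6f}")

# exact Gaussian tail vs the Chernoff bound with lambda = a
for a in [1.0, 2.0, 3.0]:
    print(f"a={a}: Pr[N >= a] = {norm.sf(a):.6f} <= e^(-a^2/2) = {np.exp(-a**2 / 2):.6f}")
```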
2 Applications

Let's consider tournaments again. We would like to see how well we can rank the vertices in a way that reflects the outcome of the tournament. Given a ranking $\sigma$, we measure its quality by defining
$$\mathrm{FIT}[T_n,\sigma] = \#\text{nonupsets} - \#\text{upsets},$$
where a game $\{i,j\}$ is a nonupset when $\sigma$ ranks the winner above the loser, and an upset otherwise. The overall measure of how well we can do is given by
$$F(n) = \min_{T_n} \max_{\sigma} \mathrm{FIT}(T_n,\sigma).$$
This can be thought of as an adversary giving us a tournament which we must rank as well as possible. The adversary will of course choose the tournament that makes us return a bad solution.

Theorem: $F(n) \ge 0$.

Proof: Take a ranking, and if it does not work, just consider the opposite ranking. That's it.

Theorem (Erdős-Moon): $F(n) \le n^{3/2}\sqrt{\ln n}$. This means that there always exists a tournament $T_n$ such that no $\sigma$ has $\mathrm{FIT}[T_n,\sigma] > n^{3/2}\sqrt{\ln n}$. No matter how we rank it, it will be bad: for large $n$ we won't be able to rank much more than 50% of the games correctly.

Proof: Take a random $T_n$, orienting each game by a fair coin flip. Define the event $B_\sigma$ as $\mathrm{FIT}[T_n,\sigma] > \alpha$. By construction of $T_n$, every game counts as $+1$ or $-1$ with equal probability. As there are $\binom{n}{2}$ games, $\mathrm{FIT}[T_n,\sigma] \sim S_{\binom{n}{2}}$, where $\sim$ means that the two have the same distribution. Thus
$$\Pr[B_\sigma] = \Pr\Big[S_{\binom{n}{2}} \ge \alpha\Big] \le \exp\Big(-\frac{\alpha^2}{2\binom{n}{2}}\Big).$$
Then, with $\alpha = n^{3/2}\sqrt{\ln n}$,
$$\Pr\Big[\bigcup_\sigma B_\sigma\Big] \le n!\,\exp\Big(-\frac{\alpha^2}{2\binom{n}{2}}\Big) < 1,$$
so some tournament $T_n$ avoids every $B_\sigma$. End of Proof.
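For tiny $n$ one can compute $F(n)$ exactly by brute force; here is an illustrative sketch (exponential in $n$, so only feasible up to about $n = 5$) that enumerates all tournaments and all rankings.

```python
from itertools import combinations, permutations, product

def fit(beats, rank, games):
    # FIT = #nonupsets - #upsets: a game is a nonupset when the ranking
    # places the winner above the loser (smaller rank value = ranked higher)
    return sum(1 if (beats[g] == 1) == (rank[g[0]] < rank[g[1]]) else -1
               for g in games)

def F(n):
    games = list(combinations(range(n), 2))
    rankings = [{v: pos for pos, v in enumerate(p)} for p in permutations(range(n))]
    best = None
    # orientation bit 1 on pair (i, j) means i beats j; bit 0 means j beats i
    for orientation in product([0, 1], repeat=len(games)):
        beats = dict(zip(games, orientation))
        m = max(fit(beats, rank, games) for rank in rankings)
        best = m if best is None else min(best, m)
    return best

for n in range(2, 6):
    print(n, F(n))  # exact values for tiny n; the theorem above is asymptotic
```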
3 Large Deviations: Generalizing the Chernoff Bound

Theorem (Generalized Chernoff Bound): Given $n$ independent variables $X_i$ with $-1 \le X_i \le 1$ and $E[X_i] = 0$ for every $i$, let $X = \sum_i X_i$. Then $\Pr[X > a\sqrt{n}] < e^{-a^2/2}$.

Proof: The only difference between this proof and the one we gave in the simpler case is that this time we can't simply replace $E[e^{\lambda X_i}]$ by $\frac{e^\lambda + e^{-\lambda}}{2}$. However, it turns out that, for $X_i$ as above,
$$E[e^{\lambda X_i}] \le \frac{e^\lambda + e^{-\lambda}}{2}.$$
Why would this be true? Let's examine the function $e^{\lambda x}$ (see figure). As one can see, the function is convex, which means that on $[-1,1]$ it lies entirely under the linear function $h(x)$, the chord joining its values at $x = -1$ and $x = 1$. But then it follows that $e^{\lambda x} \le h(x)$, so $E[e^{\lambda X_i}] \le E[h(X_i)]$. However, since $h$ is linear, $E[h(X_i)] = h(E[X_i])$, so we get
$$E[e^{\lambda X_i}] \le E[h(X_i)] = h(E[X_i]) = h(0) = \frac{e^\lambda + e^{-\lambda}}{2},$$
so we have our upper bound, and the proof of the Chernoff Bound continues just as in the simpler case.

4 More Applications of Large Deviations

4.1 Balancing Vectors

Theorem: Let $v_1,\dots,v_n$ be vectors in $\mathbb{R}^m$ with $|v_i| \le 1$ for every $i$. Then there is a choice of signs $\epsilon_i \in \{-1,1\}$ for every $v_i$ such that $p_n = \sum_i \epsilon_i v_i$ satisfies $|p_n| \le \sqrt{n}$.

Proof: As it happens, this theorem is easy to prove in a non-probabilistic context, in an algorithmic way: introduce the vectors one by one, keeping track of the position vectors $p_i$. If the new vector $v_{i+1}$ makes an acute angle with $p_i$, subtract it ($\epsilon_{i+1} = -1$); if not, add it ($\epsilon_{i+1} = 1$); if they are perpendicular, do either. This results in a new position vector $p_{i+1}$ which, by the law of cosines, has $|p_{i+1}|^2 \le |p_i|^2 + 1$; by induction, this choice of $\epsilon$'s yields $|p_n| \le \sqrt{n}$ (a runnable version of this greedy procedure appears at the end of this subsection).

The probabilistic proof is somewhat shorter: independently choose $\epsilon_i$ to be $+1$ or $-1$ with equal probabilities. Then look at
$$E[|p_n|^2] = E\Big[\sum_{i,j} \epsilon_i \epsilon_j\, v_i \cdot v_j\Big] = \sum_{i,j} E[\epsilon_i \epsilon_j]\, v_i \cdot v_j.$$
When $i \ne j$, $E[\epsilon_i \epsilon_j] = E[\epsilon_i]E[\epsilon_j] = 0$, while when $i = j$, $E[\epsilon_i \epsilon_j] = E[\epsilon_i^2] = 1$, so the sum sieves out to
$$E[|p_n|^2] = \sum_i E[\epsilon_i^2]\, |v_i|^2 \le \sum_i E[\epsilon_i^2] = n;$$
so by the Probabilistic Method there must be a choice of $\epsilon$'s such that $|p_n| \le \sqrt{n}$. End of Proof.

Theorem: Let $v_1,\dots,v_n$ be vectors in $\mathbb{R}^m$ with $|v_i| \le 1$ for every $i$, and let $p_1,\dots,p_n$ be numbers in $[0,1]$. Set $w = \sum_i p_i v_i$. Then there is a choice of $\epsilon_i \in \{0,1\}$ for every $v_i$ such that
$$\Big|\sum_i \epsilon_i v_i - w\Big| \le \frac{\sqrt{n}}{2}.$$
This has an interesting geometric interpretation: when they are linearly independent, the $v_i$ generate a parallelepiped with the $2^n$ vertices $\sum_i \epsilon_i v_i$; $w$ can lie anywhere inside this parallelepiped, and some vertex will be within distance $\sqrt{n}/2$ from it.

Proof: Choose the $\epsilon_i$'s independently to be $1$ with probability $p_i$ and $0$ with probability $1 - p_i$. Now $E[(\epsilon_i - p_i)(\epsilon_j - p_j)]$ is zero when $i \ne j$ and $p_i(1-p_i)$ when $i = j$. So
$$E\Big[\Big|\sum_i \epsilon_i v_i - w\Big|^2\Big] = \sum_{i,j} E[(\epsilon_i - p_i)(\epsilon_j - p_j)]\, v_i \cdot v_j = \sum_i E[(\epsilon_i - p_i)^2]\, |v_i|^2 \le \sum_i p_i(1-p_i) \le \frac{n}{4},$$
since $x(1-x) \le 1/4$ for any $x \in [0,1]$. Hence, by the Probabilistic Method, there must be a choice of the $\epsilon_i$'s which leads to
$$\Big|\sum_i \epsilon_i v_i - w\Big| \le \frac{\sqrt{n}}{2}.$$
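The greedy procedure from the algorithmic proof of the first theorem translates directly into code; a minimal sketch, run here on randomly generated unit vectors as stand-in input:

```python
import numpy as np

def balance(vectors):
    # greedy signs from the proof: subtract v when it makes an acute angle
    # with the running sum p, add it otherwise, so |p|^2 grows by at most 1
    p = np.zeros(vectors.shape[1])
    signs = []
    for v in vectors:
        eps = -1 if p @ v > 0 else 1
        p = p + eps * v
        signs.append(eps)
    return np.array(signs), p

rng = np.random.default_rng(1)
n, m = 1000, 10
v = rng.normal(size=(n, m))
v /= np.linalg.norm(v, axis=1, keepdims=True)  # normalize so |v_i| = 1
signs, p = balance(v)
print(np.linalg.norm(p), "<=", np.sqrt(n))  # |p_n| is guaranteed at most sqrt(n)
```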
4.2 Balancing Lights

Description: Suppose we have an $n \times n$ array of lights, and $2n$ switches (one for each row and each column). A switch toggles all the lights in a particular row or column. The purpose of the problem is to switch on as many lights as possible.

Anecdotal History: This problem has been studied by Berlekamp (who is now at Berkeley); not surprisingly, both his undergraduate and graduate degrees are in Electrical Engineering! He actually built an $8 \times 8$ circuit, which he kept in his office, and invited people to play with it, to see how many lights they could turn on.

Helpful Remarks: Switch pulling is order-independent, and it does not make sense to pull the same switch twice, since the effects will cancel each other.

Mathematical Formulation: Since the total number of lights is $n^2$, maximizing the number of on's is the same as maximizing the difference (discrepancy) between the number of on's and off's. Hence the above problem is equivalent to finding
$$F(n) = \min_{a_{ij} = \pm 1}\ \max_{x_i = \pm 1,\ y_j = \pm 1}\ \sum_{i,j} a_{ij}\, x_i y_j = \min_A \max_{x,y}\ x^\top A y;$$
$x_i = -1$ or $y_j = -1$ corresponds to pulling the switch in the $i$th row or $j$th column. Note that $F(n) \ge 0$, because if in the initial state of the matrix we have more off lights, we can switch every row and no columns, thus flipping every light, and get to a state with more on lights.

Theorem: $F(n) \le \beta = 2 n^{3/2}\sqrt{\ln 2}$.

Proof: The purpose is to find a matrix $A$ which minimizes the discrepancy in lights. Randomly set the $a_{ij}$'s to $\pm 1$, and define the bad events
$$B_{x,y}:\ x^\top A y > \beta;$$
then define the union of bad events $B = \bigcup_{x,y} B_{x,y}$. Since there are $2^{2n}$ pairs of vectors $x, y$, and we want to use the Union Bound, we want to make $\Pr[B_{x,y}] < 2^{-2n}$. As it turns out,
$$\Pr[x^\top A y > \beta] \le \Pr[S_{n^2} > \beta],$$
since the great property of randomization is that multiplication by a random sign randomizes: if the $x_i, y_j$ are arbitrary signs, then $a_{ij} x_i y_j$ is uniformly $\pm 1$. So we get that
$$\Pr[B_{x,y}] < e^{-\beta^2/(2n^2)} = 2^{-2n};$$
in fact, you can see that this is how $\beta$ was chosen in the first place. Hence, by the Probabilistic Method, there must be an $A$ whose discrepancy is no more than $2 n^{3/2}\sqrt{\ln 2}$, and the theorem follows. End of Proof.
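To see the bound in action, here is a small sketch: for a random $\pm 1$ matrix it computes $\max_{x,y} x^\top A y$ exactly, using the fact that for fixed $x$ the optimal $y$ takes $y_j = \mathrm{sign}\big((x^\top A)_j\big)$, so enumerating the $2^n$ choices of $x$ suffices; the value of $n$ and the seed are arbitrary choices.

```python
import numpy as np
from itertools import product

def max_discrepancy(A):
    # exact max over x, y in {-1,1}^n of x^T A y: with x fixed, the best y
    # picks y_j = sign((x^T A)_j), giving sum_j |(x^T A)_j|
    n = A.shape[0]
    best = -np.inf
    for x in product([-1, 1], repeat=n):
        best = max(best, np.abs(np.asarray(x) @ A).sum())
    return best

rng = np.random.default_rng(2)
n = 12
A = rng.choice([-1, 1], size=(n, n))
beta = 2 * n**1.5 * np.sqrt(np.log(2))
print(max_discrepancy(A), "vs beta =", beta)  # for this seed the max stays below beta
```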
Remarks: The above can be interpreted as follows: if an adversary is to hand us an $n \times n$ array, we cannot hope to turn on more than $n^2/2 + n^{3/2}\sqrt{\ln 2}$ of the lights. In reality it's even a little less than that, because in the above proof we have not accounted for the fact that antipodal switching (flipping all rows and all columns) creates the exact same array of lights, so instead of having $2^{2n}$ possibilities we have only $2^{2n-1}$. As it turns out, this makes no difference whatsoever in the asymptotics. The moral is: don't sweat the small stuff (assuming you can tell which stuff is small).