CSE 291: Learning Theory                                                Fall 2006

Lecture 1: Measure concentration

Lecturer: Sanjoy Dasgupta        Scribes: Nakul Verma, Aaron Arvey, and Paul Ruvolo

1.1 Concentration of measure: examples

We start with some examples of concentration of measure. This phenomenon is very useful in analyzing machine learning algorithms and can be used to bound quantities such as error probabilities. The first and most standard of concentration results is for averages. It states that the average of bounded independent random variables is tightly concentrated around its expectation.

1.1.1 Example: coin tosses

Suppose a coin of unknown bias $p$ is tossed $n$ times: $X_1, \ldots, X_n \in \{0, 1\}$. Then the average of the $X_i$ is tightly concentrated around $p$. Specifically,

    $P\left( \left| \frac{X_1 + \cdots + X_n}{n} - p \right| \ge \epsilon \right) \le 2 e^{-2\epsilon^2 n}.$

Figure 1.1 shows the quick drop-off of the probability that the sample mean deviates from its expectation. So for a large enough $n$, we can estimate $p$ quite accurately.

[Figure 1.1: Shows the exponential decay in the probability of the sample mean deviating from its expectation ($p$) in the coin-tossing experiment.]

1.1.2 Example: random points in a d-dimensional box

Pick a point $X \in [-1, +1]^d$ uniformly at random. Then it can be shown that $\|X\|$ is tightly concentrated around $\sqrt{d/3}$. To see this, write $X = (X_1, \ldots, X_d)$; then

    $E\|X\|^2 = E\left[ X_1^2 + \cdots + X_d^2 \right] = \sum_{i=1}^d EX_i^2 = \sum_{i=1}^d \int_{-1}^{+1} \frac{1}{2} x^2 \, dx = \frac{d}{3},$

where the second equality is due to the linearity of expectation. Now since the $X_i$ are independent (and bounded), we can show the concentration

    $P\left( \left| \|X\|^2 - \frac{d}{3} \right| \ge \epsilon d \right) \le 2 e^{-2\epsilon^2 d}.$
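As an aside (not part of the original notes), the coin-toss bound is easy to check empirically. The sketch below, with arbitrary illustrative parameters $p = 0.3$, $n = 500$, $\epsilon = 0.05$, estimates the deviation probability by simulation and compares it with $2e^{-2\epsilon^2 n}$:

```python
import math
import random

# Illustrative simulation of the coin-toss bound: estimate
# P(|sample mean - p| >= eps) and compare it against the
# Hoeffding bound 2*exp(-2*eps^2*n). Parameters are arbitrary.
def deviation_probability(p, n, eps, trials=20000, seed=0):
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        mean = sum(rng.random() < p for _ in range(n)) / n
        if abs(mean - p) >= eps:
            bad += 1
    return bad / trials

p, n, eps = 0.3, 500, 0.05
empirical = deviation_probability(p, n, eps)
bound = 2 * math.exp(-2 * eps**2 * n)
print(empirical, bound)   # the empirical tail sits below the bound
```

The bound is loose but already non-trivial at these parameters; the empirical deviation probability is far smaller.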
This provides us with the counter-intuitive result that the volume of the high-dimensional cube tends to lie in its corners, where the points have length approximately $\sqrt{d/3}$.

Note that the above examples are special cases of Hoeffding's inequality:

Lemma 1 (Hoeffding's inequality). Suppose $X_1, \ldots, X_n$ are independent and bounded variables, such that $a_i \le X_i \le b_i$. Then

    $P\left[ \left| \frac{X_1 + \cdots + X_n}{n} - E\left( \frac{X_1 + \cdots + X_n}{n} \right) \right| \ge \epsilon \right] \le 2 e^{-2\epsilon^2 n^2 / \sum_i (b_i - a_i)^2}.$    (1.1)

We will soon prove a much more general version of this, which is introduced next.

1.1.3 Concentration of Lipschitz functions

Observing the Hoeffding bound, one might wonder whether such concentration applies only to averages of random variables. After all, what is so special about averages? It turns out that the relevant feature of the average that yields tight concentration is that it is smooth. In fact, any smooth function of bounded independent random variables is tightly concentrated around its expectation. The notion of smoothness we will use is Lipschitz continuity.

Definition 2. $f : \mathbb{R}^n \to \mathbb{R}$ is $\lambda$-Lipschitz w.r.t. the $\ell_p$ metric if, for all $x, y$, $|f(x) - f(y)| \le \lambda \|x - y\|_p$.

Example. For $x = (x_1, \ldots, x_n)$, define the average $a(x) = \frac{1}{n}(x_1 + \cdots + x_n)$. Then $a(\cdot)$ is $(1/n)$-Lipschitz with respect to the $\ell_1$ metric, since for any $x, x'$,

    $|a(x) - a(x')| = \frac{1}{n} \left| (x_1 - x'_1) + \cdots + (x_n - x'_n) \right| \le \frac{1}{n} \left( |x_1 - x'_1| + \cdots + |x_n - x'_n| \right) = \frac{1}{n} \|x - x'\|_1.$

It turns out that Hoeffding's bound holds for all functions that are Lipschitz with respect to $\ell_1$.

Lemma 3 (Concentration of Lipschitz functions w.r.t. the $\ell_1$ metric). Suppose $X_1, \ldots, X_n$ are independent and bounded, with $a_i \le X_i \le b_i$. Then for any $f : \mathbb{R}^n \to \mathbb{R}$ which is $\lambda$-Lipschitz w.r.t. the $\ell_1$ metric,

    $P[f \ge Ef + \epsilon] \le e^{-2\epsilon^2 / (\lambda^2 \sum_i (b_i - a_i)^2)}.$

Proof. See Section 1.5.

Remark. Since $-f$ is also $\lambda$-Lipschitz, we can bound deviations both above and below:

    $P[|f - Ef| \ge \epsilon] \le 2 e^{-2\epsilon^2 / (\lambda^2 \sum_i (b_i - a_i)^2)}.$    (1.2)

We now look at bounds for functions that are Lipschitz with respect to other metrics.

1.1.4 Concentration of Lipschitz functions w.r.t. the $\ell_2$ metric

Let $S^d$ denote the surface of the unit sphere in $\mathbb{R}^d$, and let $\mu$ be the uniform distribution over $S^d$. The following is known (we will prove it later in the course):
[Figure 1.2: The function $f(x) = w \cdot x$ is $-1$ at one pole of the sphere, $+1$ at the other pole, and increases steadily from $-1$ to $+1$ as one moves from one pole to the other. Since $f$ is 1-Lipschitz on $S^d$, all but an $\exp(-\Omega(\epsilon^2 d))$ fraction of the sphere's mass is within $\epsilon$ of the median value of $f$: most of the volume lies in a thin slice near the equator of the sphere (perpendicular to $w$).]

Lemma 4. Let $f : S^d \to \mathbb{R}$ be $\lambda$-Lipschitz w.r.t. the $\ell_2$ metric. Then

    $\mu[f \ge \text{med}(f) + \epsilon] \le 4 e^{-\epsilon^2 d / 2\lambda^2},$    (1.3)

where $\text{med}(f)$ is a median value of $f$.

One immediate consequence of (1.3) is that most of the volume of the sphere lies in a thin slice around the equator (for all equators!). To see this, fix any unit vector $w \in S^d$. Then for $X \sim \mu$ (this notation means $X$ is drawn from distribution $\mu$), $E(w \cdot X) = 0$ and also $\text{med}(w \cdot X) = 0$. Moreover, the function $f(x) = w \cdot x$ is 1-Lipschitz w.r.t. the $\ell_2$ norm: for all $x, y \in S^d$,

    $|f(x) - f(y)| = |w \cdot x - w \cdot y| = |w \cdot (x - y)| \le \|w\|_2 \|x - y\|_2 = \|x - y\|_2,$

where the second-to-last inequality uses Cauchy-Schwarz. Thus by (1.3), $f$ is tightly concentrated around its median, i.e.,

    $\mu[X : |w \cdot X| \ge \epsilon] \le 4 e^{-\epsilon^2 d / 2}.$

See Figure 1.2. Moreover, since there is nothing special about this particular $w$, the above bound is true for any equator!

1.1.5 Types of concentration

Types of concentration we'll encounter in this course:

- Concentration of a product measure $X = (X_1, \ldots, X_n)$, where the $X_i$ are independent and bounded, with respect to the $\ell_1$ and Hamming metrics.
- Concentration of the uniform measure over $S^d$, with respect to the $\ell_2$ metric.
- Concentration of the multivariate Gaussian measure, with respect to the $\ell_2$ metric.
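The thin-slice phenomenon is easy to see numerically. The sketch below (not from the notes; parameters are illustrative) draws points uniformly from the sphere by normalizing standard Gaussian vectors and measures how much mass falls outside the equatorial slab $|w \cdot x| < \epsilon$, taking $w$ to be the first coordinate axis:

```python
import math
import random

# Draw points uniformly from the unit sphere in R^d by normalizing
# standard Gaussian vectors, then measure the fraction of mass outside
# the equatorial slab |w . x| < eps, for w = first coordinate axis.
def uniform_sphere_point(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

d, eps, trials = 1000, 0.1, 2000
rng = random.Random(0)
outside = sum(abs(uniform_sphere_point(d, rng)[0]) >= eps
              for _ in range(trials))
frac = outside / trials
bound = 4 * math.exp(-eps**2 * d / 2)
print(frac, bound)   # almost no mass lies outside the thin slab
```

By symmetry of the Gaussian construction, the same numbers would come out for any other unit vector $w$, matching the "any equator" remark.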
1.2 Probability review

1.2.1 Warm-up problem

Question. Let $\sigma$ be a random permutation of $\{1, \ldots, n\}$. Let $S$ be the number of fixed points of this permutation. What are the expected value and variance of $S$?

Answer. Use $n$ indicator random variables $X_i = \mathbf{1}(\sigma(i) = i)$, so that $S = \sum_i X_i$. By linearity of expectation, we can solve the first problem as follows:

    $ES = E(X_1 + \cdots + X_n) = \sum_{i=1}^n EX_i = \sum_{i=1}^n P(X_i = 1) = n \cdot \frac{1}{n} = 1.$

For the second problem, we use $\text{var}(S) = E(S^2) - (ES)^2 = E(S^2) - 1$, and

    $E(S^2) = E(X_1 + \cdots + X_n)^2 = E\left[ \sum_i X_i^2 + \sum_{i \ne j} X_i X_j \right] = \sum_i EX_i^2 + \sum_{i \ne j} E(X_i X_j)$  (linearity of expectation).

Here $EX_i^2 = P(X_i = 1) = 1/n$, and for $i \ne j$, $E(X_i X_j) = P(\sigma(i) = i, \sigma(j) = j) = \frac{1}{n(n-1)}$, so

    $E(S^2) = n \cdot \frac{1}{n} + n(n-1) \cdot \frac{1}{n(n-1)} = 2.$

Thus $\text{var}(S) = 1$.

1.2.2 Some basics

Property 5 (Linearity of expectation). $E(X + Y) = EX + EY$ (holds even if $X$ and $Y$ are not independent).

Property 6. $\text{var}(X) = E(X - EX)^2 = EX^2 - (EX)^2$.

Property 7 (Jensen's inequality). If $f$ is a convex function, then $Ef(X) \ge f(EX)$. Here's a picture to help you remember this enormously useful property of convex functions:

[Figure: a convex function $f$ on an interval $[a, b]$; the chord lying above the graph illustrates $Ef(X) \ge f(EX)$.]
Lemma 8. If $X_1, \ldots, X_n$ are independent, then $\text{var}(X_1 + \cdots + X_n) = \text{var}(X_1) + \cdots + \text{var}(X_n)$.

Proof. Let $X_1, \ldots, X_n$ be $n$ independent random variables. Set $Y_i = X_i - EX_i$. Thus $Y_1, \ldots, Y_n$ are independent with mean zero, and

    $\text{var}(X_1 + \cdots + X_n) = E[(X_1 - EX_1) + \cdots + (X_n - EX_n)]^2 = E(Y_1 + \cdots + Y_n)^2$
    $= E\left[ \sum_i Y_i^2 + \sum_{i \ne j} Y_i Y_j \right] = \sum_i EY_i^2 + \sum_{i \ne j} EY_i \, EY_j = \sum_i EY_i^2 = \sum_i E(X_i - EX_i)^2 = \sum_i \text{var}(X_i),$

where $E(Y_i Y_j) = EY_i \, EY_j = 0$ by independence.

As an example of an incorrect application: had we mistakenly assumed that the $X_i$ in the warm-up problem were independent, we would have found that the variance was $\sum_i \frac{1}{n}\left(1 - \frac{1}{n}\right) = 1 - \frac{1}{n}$ instead of $1$. Not too far off, since those $X_i$ are approximately independent (for large $n$).

Lemma 9 (Markov's inequality). $P(|X| \ge a) \le \frac{E|X|}{a}$.

Proof. Observe $|X| \ge a \cdot \mathbf{1}(|X| \ge a)$; take expectations of both sides, using $E[\mathbf{1}(|X| \ge a)] = P(|X| \ge a)$.

Example. A simple application of Markov's inequality to the random variable $S$, which is always nonnegative and has $ES = 1$, is $P(S \ge k) \le 1/k$.

Lemma 10 (Chebyshev's inequality). $P(|X - EX| \ge a) \le \frac{\text{var}(X)}{a^2}$.

Proof. Apply Markov's inequality to $(X - EX)^2$:

    $P(|X - EX| \ge a) = P((X - EX)^2 \ge a^2) \le \frac{E(X - EX)^2}{a^2} = \frac{\text{var}(X)}{a^2}.$

Example. Again for $S$, which has $ES = \text{var}(S) = 1$, Chebyshev gives $P(S \ge k) \le P(|S - 1| \ge k - 1) \le 1/(k-1)^2$. Note that for $k \ge 3$ this is a better bound than the one given by Markov's inequality.
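Both tail bounds can be checked on the warm-up variable $S$ (an illustrative simulation, not part of the notes; $n$, $k$, and the trial count are arbitrary):

```python
import random

# The warm-up variable S (fixed points of a random permutation) has
# ES = var(S) = 1; check this, plus the Markov and Chebyshev tail
# bounds for P(S >= k), by simulation.
def num_fixed_points(n, rng):
    perm = list(range(n))
    rng.shuffle(perm)
    return sum(perm[i] == i for i in range(n))

rng = random.Random(0)
n, trials, k = 50, 20000, 3
samples = [num_fixed_points(n, rng) for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials
tail = sum(s >= k for s in samples) / trials
print(round(mean, 2), round(var, 2))      # both near 1
print(tail, 1 / k, 1 / (k - 1) ** 2)      # empirical vs Markov vs Chebyshev
```

At $k = 3$ the Chebyshev bound ($1/4$) already beats the Markov bound ($1/3$), and both sit well above the true tail.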
1.2.3 Example: symmetric random walk

A symmetric random walk is a stochastic process on the line. One starts at the origin and at each time step moves either one unit to the left or one unit to the right, with equal probability. The move at time $t$ is thus a random variable $X_t$, where

    $X_t = \begin{cases} +1 \text{ (right)} & \text{with probability } 1/2 \\ -1 \text{ (left)} & \text{with probability } 1/2 \end{cases}$

Let $S_n = \sum_{i=1}^n X_i$ be the position after $n$ steps of the random walk. What are the expected value and variance of $S_n$? The expected value of $X_i$ is $0$ since we are equally likely to obtain $+1$ and $-1$, so

    $ES_n = E \sum_{i=1}^n X_i = \sum_i EX_i = 0.$

Similarly, since the $X_i$ are independent, variance becomes linear as well. The variance of $X_i$ is $EX_i^2 = 1$, therefore

    $\text{var}(S_n) = \text{var}\left( \sum_{i=1}^n X_i \right) = \sum_i \text{var}(X_i) = n.$

The standard deviation of $S_n$ is thus $\sqrt{n}$; so we would expect that $S_n$ is $\pm O(\sqrt{n})$. We can make this more precise by using Markov's and Chebyshev's inequalities:

    (Markov)    $P(|S_n| \ge c\sqrt{n}) \le \frac{E|S_n|}{c\sqrt{n}} \le \frac{\sqrt{ES_n^2}}{c\sqrt{n}} = \frac{\sqrt{\text{var}(S_n)}}{c\sqrt{n}} = \frac{1}{c}$

    (Chebyshev)    $P(|S_n| \ge c\sqrt{n}) \le \frac{\text{var}(S_n)}{(c\sqrt{n})^2} = \frac{1}{c^2}$

1.2.4 Moment-generating functions

The Chebyshev inequality is just the Markov inequality applied to $X^2$; this often yields a better bound, as in the case of the symmetric random walk. We could similarly apply Markov's inequality to $X^4$, or $X^6$, or even higher powers of $X$. For the symmetric random walk, the bounds would get better and better (they would look like $O(1/c^k)$ for increasing powers $k$). The natural culmination of all this is to apply Markov's inequality to $e^X$ (or, for a little flexibility, $e^{tX}$, where $t$ is a constant we will optimize).

Lemma 11 (Chernoff's bounding method). $P(X \ge c) \le \frac{Ee^{tX}}{e^{tc}}$ for any $t > 0$.

Proof. Again, we use Markov's inequality:

    $P(X \ge c) = P(e^{tX} \ge e^{tc}) \le \frac{Ee^{tX}}{e^{tc}}.$

Definition 12. The moment-generating function of a random variable $X$ is the function $\psi(t) = Ee^{tX}$.
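The Markov-vs-Chebyshev comparison for the walk is easy to see in simulation. The following sketch (illustrative parameters, not from the notes) estimates $P(|S_n| \ge c\sqrt{n})$ and prints it next to the two bounds:

```python
import math
import random

# Simulate the symmetric random walk and compare the empirical tail
# P(|S_n| >= c*sqrt(n)) with Chebyshev's bound 1/c^2 and Markov's 1/c.
def walk_position(n, rng):
    return sum(1 if rng.random() < 0.5 else -1 for _ in range(n))

rng = random.Random(0)
n, c, trials = 400, 2.0, 5000
hits = sum(abs(walk_position(n, rng)) >= c * math.sqrt(n)
           for _ in range(trials))
empirical = hits / trials
print(empirical, 1 / c**2, 1 / c)   # empirical <= Chebyshev <= Markov here
```

For $c = 2$ the true tail (about $0.05$ by the central limit theorem) is far below Chebyshev's $1/4$, which in turn beats Markov's $1/2$; the exponential bounds developed next close most of the remaining gap.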
Example. If $X$ is Gaussian with mean $0$ and variance $1$,

    $\psi(t) = \int e^{tx} \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = e^{t^2/2}.$

In general, the value $Ee^{tX}$ may not always be defined. However, if $Ee^{t_0 X}$ is defined for some $t_0 > 0$, then:

1. $Ee^{tX}$ is defined for all $t < t_0$.
2. All moments of $X$ are finite and $\psi(t)$ has derivatives of all orders at $t = 0$, with $EX^k = \frac{\partial^k \psi}{\partial t^k}\Big|_{t=0}$.
3. $\{\psi(t), |t| \le t_0\}$ uniquely determines the distribution of $X$.

1.3 Bounding $Ee^{tX}$

We can compute this expectation directly if we know the distribution of $X$ (simply do an integral), but can we get bounds on it given just some coarse statistics of $X$?

Lemma 13. If $X \in [a, b]$ and $X$ has mean $0$, then $Ee^{tX} \le e^{t^2(b-a)^2/8}$.

Proof. As shown in Figure 1.3, $e^{tx}$ is a convex function.

[Figure 1.3: $e^{tx}$ is a convex function on $[a, b]$.]

If we write $x = \lambda a + (1 - \lambda) b$ (where $0 \le \lambda \le 1$), convexity tells us that

    $e^{tx} \le \lambda e^{ta} + (1 - \lambda) e^{tb}.$

Plugging in $\lambda = (b - x)/(b - a)$ then gives

    $e^{tx} \le \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.$

Take expectations of both sides, using linearity of expectation and the fact that $EX = 0$:

    $Ee^{tX} \le \frac{b - EX}{b - a} e^{ta} + \frac{EX - a}{b - a} e^{tb} = \frac{b e^{ta} - a e^{tb}}{b - a} \le e^{t^2(b-a)^2/8},$

where the last step is just calculus.
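As a sanity check (not in the original notes), one can verify Lemma 13 numerically for a mean-zero two-point distribution on $\{a, b\}$, which is exactly the extreme case produced by the convexity argument above:

```python
import math

# Verify E e^{tX} <= e^{t^2 (b-a)^2 / 8} over a grid of t values for the
# mean-zero distribution putting all its mass on the endpoints a and b.
a, b = -1.0, 3.0           # arbitrary interval containing 0
p_b = -a / (b - a)         # P(X = b), chosen so that EX = p_a*a + p_b*b = 0
p_a = 1.0 - p_b            # P(X = a)
ok = True
for k in range(-20, 21):
    t = k / 10.0
    mgf = p_a * math.exp(t * a) + p_b * math.exp(t * b)
    bound = math.exp(t**2 * (b - a)**2 / 8)
    ok = ok and (mgf <= bound + 1e-12)
print(ok)   # True: the bound holds at every grid point
```

Since the proof bounds any mean-zero $X$ on $[a, b]$ by this two-point mixture, checking the mixture checks the worst case.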
1.4 Hoeffding's Inequality

Theorem 14 (Hoeffding's inequality). Let $X_1, \ldots, X_n$ be independent and bounded with $a_i \le X_i \le b_i$. Let $S_n = X_1 + \cdots + X_n$. Then for any $\epsilon > 0$,

    $P(S_n \ge ES_n + \epsilon) \le e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$
    $P(S_n \le ES_n - \epsilon) \le e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$

Proof. We'll just do the upper bound (the lower-bound proof is very similar). Define $Y_i = X_i - EX_i$; then the $\{Y_i\}$ are independent, with mean zero and range $[a_i - EX_i, b_i - EX_i]$. For any $t > 0$,

    $P(S_n - ES_n \ge \epsilon) = P(Y_1 + \cdots + Y_n \ge \epsilon) = P\left( e^{t(Y_1 + \cdots + Y_n)} \ge e^{t\epsilon} \right) \le \frac{Ee^{t(Y_1 + \cdots + Y_n)}}{e^{t\epsilon}}$

by Chernoff's bounding method. Exploiting the independence of the $Y_i$'s, and using our generic bound (Lemma 13) for each $Y_i$, we get

    $P(S_n - ES_n \ge \epsilon) \le \frac{Ee^{tY_1} \cdots Ee^{tY_n}}{e^{t\epsilon}} \le \frac{e^{t^2(b_1 - a_1)^2/8} \cdots e^{t^2(b_n - a_n)^2/8}}{e^{t\epsilon}} = e^{(t^2/8)\sum_i (b_i - a_i)^2 - t\epsilon} \le e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}$

by choosing $t = 4\epsilon / \sum_i (b_i - a_i)^2$.

Next: generalize to Lipschitz functions.

1.5 Concentration in metric spaces

1.5.1 Basic definitions

Definition 15. A metric space $(S, d)$ consists of a set $S$ and a function $d : S \times S \to \mathbb{R}$ which satisfies three properties:

1. $d(x, y) \ge 0$, with equality iff $x = y$
2. $d(x, y) = d(y, x)$
3. $d(x, z) \le d(x, y) + d(y, z)$

Example. $(\mathbb{R}^n, \ell_p\text{-distance})$ is a metric space for any $p \ge 1$.

Definition 16. $f : S \to \mathbb{R}$ is $\lambda$-Lipschitz if $|f(x) - f(y)| \le \lambda \, d(x, y)$ for all $x, y \in S$.

Now suppose that $\mu$ is a probability measure on $S$, and that we want to bound

    $\mu\{f \ge Ef + \epsilon\} = P_{X \sim \mu}(f(X) \ge Ef + \epsilon).$

Once again, it would be natural to look at the moment-generating function $E_\mu e^{tf} = \int e^{tf(x)} \mu(dx)$. But we want a bound that holds for all Lipschitz functions, so we take the supremum of this quantity.
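The final "choosing $t$" step of the Hoeffding proof above is just minimizing a parabola, and the same optimization will reappear in the metric-space setting. A quick check with arbitrary illustrative values for $B = \sum_i (b_i - a_i)^2$ and $\epsilon$:

```python
# The exponent in the Chernoff step is g(t) = t^2*B/8 - t*eps, where
# B = sum_i (b_i - a_i)^2. Check that t* = 4*eps/B minimizes g and that
# g(t*) = -2*eps^2/B, the exponent in Hoeffding's bound.
B, eps = 7.0, 0.9            # arbitrary illustrative values

def g(t):
    return t**2 * B / 8 - t * eps

t_star = 4 * eps / B
gap = abs(g(t_star) - (-2 * eps**2 / B))
print(gap < 1e-12)           # True: the optimized exponent matches
for t in (0.01, t_star / 2, 2 * t_star, 1.0):
    assert g(t) >= g(t_star)   # g is convex, so t* is the minimizer
```

Setting $g'(t) = tB/4 - \epsilon = 0$ gives $t^* = 4\epsilon/B$ analytically; the loop just confirms no nearby $t$ does better.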
Definition 17. The Laplace functional of a metric measure space $(S, d, \mu)$ is

    $L_{(S,d,\mu)}(t) = \sup E_\mu e^{tf},$

where the supremum is taken over all 1-Lipschitz functions with mean $0$.

1.5.2 Metric spaces of bounded diameter

We start with an analog of Lemma 13.

Lemma 18. If $(S, d)$ has bounded diameter $D = \sup_{x,y \in S} d(x, y) < \infty$, then for any probability measure $\mu$ on $S$, $L_{(S,d,\mu)}(t) \le e^{t^2 D^2 / 2}$.

Proof. First some intuition. Pick any function $f : S \to \mathbb{R}$ which is 1-Lipschitz and has mean zero. Then certainly $|f(x)| \le D$ for all $x$, and so $Ee^{tf} \le e^{tD}$. The bound we seek is much tighter than this for small values of $t$ (recall that in Hoeffding's proof we chose $t = O(\epsilon)$). To see why it is plausible, let's write out the Taylor expansion of $e^{tf}$ and make an unjustifiable approximation:

    $Ee^{tf} = E\left[ 1 + tf + \frac{t^2 f^2}{2} + \frac{t^3 f^3}{3!} + \cdots \right] \approx 1 + tEf + \frac{t^2 Ef^2}{2} \le 1 + \frac{t^2 D^2}{2} \le e^{t^2 D^2 / 2}.$

We've exploited the fact that $Ef = 0$ to eliminate the linear term of the series. However, notice that $e^{t^2 D^2 / 2}$ contains only even powers of $t$, and so we really need to eliminate all the odd terms in the original Taylor series. When is $Ef^i = 0$ for odd $i$? Answer: when the distribution of $f$ is symmetric around zero. Since this might not be the case, we need to explicitly symmetrize $f$.

Now let's start the real proof. Take any 1-Lipschitz mean-0 function $f : S \to \mathbb{R}$. First note that by Jensen's inequality, $E_\mu e^{-tf} \ge e^{-t E_\mu f} = 1$. Let $X, Y$ be two independent draws from distribution $\mu$. Then:

    $E_\mu e^{tf} \le E_\mu e^{tf} \cdot E_\mu e^{-tf} = E_{X \sim \mu} e^{tf(X)} \cdot E_{Y \sim \mu} e^{-tf(Y)} = E_{X,Y \sim \mu} e^{t(f(X) - f(Y))},$

which is just what we wanted, because $f(X) - f(Y)$ has a symmetric distribution. Thus its odd powers have zero mean:

    $E_\mu e^{tf} \le \sum_{i=0}^{\infty} \frac{t^i}{i!} E(f(X) - f(Y))^i = \sum_{i=0}^{\infty} \frac{t^{2i}}{(2i)!} E(f(X) - f(Y))^{2i}.$

Now we use the fact that $|f(X) - f(Y)| \le D$, along with the inequality $(2i)! \ge i! \, 2^i$, to get

    $E_\mu e^{tf} \le \sum_{i=0}^{\infty} \frac{t^{2i} D^{2i}}{(2i)!} \le \sum_{i=0}^{\infty} \frac{(t^2 D^2 / 2)^i}{i!} = e^{t^2 D^2 / 2},$

and we're done.
In fact, by being a little more careful and using the same technique as in Lemma 13, we can get a slightly better bound.

Lemma 19. Under the same conditions as Lemma 18, $L_{(S,d,\mu)}(t) \le e^{t^2 D^2 / 8}$.

We will apply this lemma to individual coordinates, as we did in Hoeffding's proof.
1.5.3 Product spaces

Lemma 20. If $(S, d)$ and $(T, \delta)$ are metric spaces, so is $(S \times T, d + \delta)$.

Example. $S = T = \mathbb{R}$ and $d(x, y) = |x - y| = \delta(x, y)$. In this case, the metric on the product space is the $\ell_1$ distance.

Definition 21. If $\mu$ is a measure on $S$ and $\nu$ is a measure on $T$, let $\mu \times \nu$ denote the product measure on $S \times T$, i.e., the measure which satisfies $(\mu \times \nu)(A \times B) = \mu(A)\nu(B)$ for all measurable $A \subseteq S$, $B \subseteq T$.

Lemma 22. If $(S, d, \mu)$ and $(T, \delta, \nu)$ are metric measure spaces, then

    $L_{(S \times T, \, d + \delta, \, \mu \times \nu)}(t) \le L_{(S,d,\mu)}(t) \cdot L_{(T,\delta,\nu)}(t).$

Proof. Pick any 1-Lipschitz $f : S \times T \to \mathbb{R}$ which has mean zero. For any $y \in T$, define $\bar{f}(y) = E_{X \sim \mu} f(X, y)$. Then $\bar{f}$ has mean zero over $Y \sim \nu$. Moreover, it is 1-Lipschitz on $(T, \delta)$, since for any $y, y' \in T$,

    $|\bar{f}(y) - \bar{f}(y')| = \left| E_{X \sim \mu}[f(X, y)] - E_{X \sim \mu}[f(X, y')] \right| = \left| E_{X \sim \mu}[f(X, y) - f(X, y')] \right| \le \delta(y, y')$

(the last step uses the fact that $f$ is 1-Lipschitz). Now for any fixed $y$, the function $x \mapsto f(x, y) - \bar{f}(y)$ is 1-Lipschitz on $(S, d)$ and has mean zero over $X \sim \mu$. Therefore,

    $E_{\mu \times \nu} e^{tf} = E_{Y \sim \nu} E_{X \sim \mu} \left[ e^{t\bar{f}(Y)} e^{t(f(X,Y) - \bar{f}(Y))} \right] = E_{Y \sim \nu} \left[ e^{t\bar{f}(Y)} E_{X \sim \mu} e^{t(f(X,Y) - \bar{f}(Y))} \right] \le E_{Y \sim \nu} \left[ e^{t\bar{f}(Y)} \right] L_{(S,d,\mu)}(t) \le L_{(S,d,\mu)}(t) \cdot L_{(T,\delta,\nu)}(t).$

Theorem 23. Let $(S_1, d_1, \mu_1), \ldots, (S_n, d_n, \mu_n)$ be metric measure spaces of bounded diameters $D_i < \infty$. Let $S = (S_1 \times S_2 \times \cdots \times S_n, \; d_1 + d_2 + \cdots + d_n)$ be the product space and $\mu = \mu_1 \times \mu_2 \times \cdots \times \mu_n$ the product measure. Then for any 1-Lipschitz function $f : S \to \mathbb{R}$,

    $\mu\{f \ge Ef + \epsilon\} \le e^{-2\epsilon^2 / \sum_i D_i^2}.$

Proof. Combining Lemmas 19 and 22, we see that $L_{(S,d,\mu)}(t) \le e^{(t^2/8) \sum_i D_i^2}$. Now it is a simple matter of applying Chernoff's bounding method, using the fact that $f - Ef$ is 1-Lipschitz with mean zero:

    $\mu\{f \ge Ef + \epsilon\} = \mu\left\{ e^{t(f - Ef)} \ge e^{t\epsilon} \right\} \le \frac{E_\mu e^{t(f - Ef)}}{e^{t\epsilon}} \le \frac{L_{(S,d,\mu)}(t)}{e^{t\epsilon}},$

and the rest is algebra.

Example. Take $S_i = \mathbb{R}$ and $d_i(x, y) = |x - y|$. Then $S = \mathbb{R}^n$ and $d(x, y) = \|x - y\|_1$. This leads to the following corollary.

Corollary 24. Let $X_1, \ldots, X_n$ be independent and bounded with $a_i \le X_i \le b_i$. Then for any function $f : \mathbb{R}^n \to \mathbb{R}$ which is 1-Lipschitz with respect to the $\ell_1$ metric,

    $P(|f(X_1, \ldots, X_n) - Ef| \ge \epsilon) \le 2 e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}.$

Remark.
Hoeffding's inequality is a special case of this corollary, where $f(x_1, \ldots, x_n) = x_1 + \cdots + x_n$.
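To see the corollary at work on a genuinely non-linear function (an illustration not in the notes), take $f(x) = \sum_i \sin(x_i)$ with the $X_i$ uniform on $[-1, 1]$. Since $|\sin u - \sin v| \le |u - v|$, $f$ is 1-Lipschitz w.r.t. the $\ell_1$ metric, so the corollary applies with $b_i - a_i = 2$:

```python
import math
import random

# f(x) = sum_i sin(x_i) is 1-Lipschitz w.r.t. the l1 metric, so with
# X_i uniform on [-1, 1] Corollary 24 gives
#   P(|f - Ef| >= eps) <= 2*exp(-2*eps^2 / (4n)).
def f(xs):
    return sum(math.sin(x) for x in xs)

rng = random.Random(0)
n, trials = 200, 4000
eps = 2.0 * math.sqrt(n)                       # deviation at the sqrt(n) scale
bound = 2 * math.exp(-2 * eps**2 / (4 * n))    # (b_i - a_i)^2 = 4
samples = [f([rng.uniform(-1, 1) for _ in range(n)]) for _ in range(trials)]
mean_f = sum(samples) / trials
empirical = sum(abs(s - mean_f) >= eps for s in samples) / trials
print(empirical, bound)   # empirical tail far below the bound
```

Unlike a plain average, this $f$ is not a sum to which Theorem 14 applies with a simple change of variables, yet the product-space machinery handles it with no extra work.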