Lecture 1: Measure concentration

CSE 291: Learning Theory            Fall 2006
Lecture 1: Measure concentration
Lecturer: Sanjoy Dasgupta        Scribes: Nakul Verma, Aaron Arvey, and Paul Ruvolo

1.1 Concentration of measure: examples

We start with some examples of concentration of measure. This phenomenon is very useful in analyzing machine learning algorithms and can be used to bound things like error probabilities. The first and most standard of concentration results is for averages: the average of bounded independent random variables is tightly concentrated around its expectation.

1.1.1 Example: coin tosses

Suppose a coin of unknown bias $p$ is tossed $n$ times, giving outcomes $X_1, \dots, X_n \in \{0, 1\}$. Then the average of the $X_i$ is tightly concentrated around $p$. Specifically,
$$
P\left( \left| \frac{X_1 + \cdots + X_n}{n} - p \right| \geq \epsilon \right) \leq 2 e^{-2\epsilon^2 n}.
$$
Figure 1.1 shows the quick drop-off of the probability that the sample mean deviates from its expectation. So for a large enough $n$, we can estimate $p$ quite accurately.

[Figure 1.1: the exponential decay in the probability of the sample mean deviating from its expectation ($p$) in the coin tossing experiment; the horizontal axis marks $p - 1/\sqrt{n}$, $p$, and $p + 1/\sqrt{n}$.]

1.1.2 Example: random points in a d-dimensional box

Pick a point $X \in [-1, +1]^d$ uniformly at random. Then it can be shown that $\|X\|$ is tightly concentrated around $\sqrt{d/3}$. To see this, write $X = (X_1, \dots, X_d)$; then
$$
E\|X\|^2 = E\left[ X_1^2 + \cdots + X_d^2 \right] = \sum_{i=1}^{d} E X_i^2 = \sum_{i=1}^{d} \int_{-1}^{+1} \tfrac{1}{2} x^2 \, dx = \frac{d}{3},
$$
where the second equality is due to the linearity of expectation. Now, since the $X_i$ are independent (and bounded), we can show the concentration
$$
P\left( \left| \|X\|^2 - \frac{d}{3} \right| \geq \epsilon d \right) \leq 2 e^{-2\epsilon^2 d}.
$$
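Both examples are easy to check empirically. Below is a minimal NumPy simulation sketch (an addition, not part of the original notes) that estimates the two deviation probabilities and compares them with the stated bounds; the particular values of $p$, $n$, $d$, $\epsilon$ and the trial counts are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def coin_deviation_prob(p=0.3, n=500, eps=0.05, trials=20000):
    """Empirical estimate of P(|mean(X_1..X_n) - p| >= eps) for i.i.d. Bernoulli(p) tosses."""
    X = rng.random((trials, n)) < p              # trials x n coin flips with bias p
    return np.mean(np.abs(X.mean(axis=1) - p) >= eps)

def cube_deviation_prob(d=200, eps=0.05, trials=20000):
    """Empirical estimate of P(| ||X||^2 - d/3 | >= eps*d) for X uniform on [-1,+1]^d."""
    X = rng.uniform(-1.0, 1.0, size=(trials, d))
    return np.mean(np.abs((X**2).sum(axis=1) - d / 3) >= eps * d)

p, n, d, eps = 0.3, 500, 200, 0.05
print("coins: empirical", coin_deviation_prob(p, n, eps), " bound", 2 * np.exp(-2 * eps**2 * n))
print("cube:  empirical", cube_deviation_prob(d, eps),    " bound", 2 * np.exp(-2 * eps**2 * d))
```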

This bound gives the counter-intuitive result that the volume of the high-dimensional cube tends to lie in its corners, where the points have length approximately $\sqrt{d/3}$.

Note that the above examples are special cases of Hoeffding's inequality:

Lemma 1 (Hoeffding's inequality). Suppose $X_1, \dots, X_n$ are independent and bounded random variables, with $a_i \leq X_i \leq b_i$. Then
$$
P\left[ \left| \frac{X_1 + \cdots + X_n}{n} - E\left( \frac{X_1 + \cdots + X_n}{n} \right) \right| \geq \epsilon \right] \leq 2 e^{-2\epsilon^2 n^2 / \sum_i (b_i - a_i)^2}. \tag{1.1}
$$
We will soon prove a much more general version of this, which is introduced next.

1.1.3 Concentration of Lipschitz functions

Observing the Hoeffding bound, one might wonder whether such concentration applies only to averages of random variables. After all, what is so special about averages? It turns out that the relevant feature of the average that yields tight concentration is that it is smooth; in fact, any smooth function of bounded independent random variables is tightly concentrated around its expectation. The notion of smoothness we will use is Lipschitz continuity.

Definition 2. $f : \mathbb{R}^n \to \mathbb{R}$ is $\lambda$-Lipschitz w.r.t. the $\ell_p$ metric if, for all $x, y$, $|f(x) - f(y)| \leq \lambda \|x - y\|_p$.

Example. For $x = (x_1, \dots, x_n)$, define the average $a(x) = \frac{1}{n}(x_1 + \cdots + x_n)$. Then $a(\cdot)$ is $(1/n)$-Lipschitz with respect to the $\ell_1$ metric, since for any $x, x'$,
$$
|a(x) - a(x')| = \frac{1}{n} \left| (x_1 - x_1') + \cdots + (x_n - x_n') \right| \leq \frac{1}{n} \left( |x_1 - x_1'| + \cdots + |x_n - x_n'| \right) = \frac{1}{n} \|x - x'\|_1.
$$
It turns out that Hoeffding's bound holds for all Lipschitz (with respect to $\ell_1$) functions.

Lemma 3 (Concentration of Lipschitz functions w.r.t. the $\ell_1$ metric). Suppose $X_1, \dots, X_n$ are independent and bounded with $a_i \leq X_i \leq b_i$. Then, for any $f : \mathbb{R}^n \to \mathbb{R}$ which is $\lambda$-Lipschitz w.r.t. the $\ell_1$ metric,
$$
P\left[ f \geq Ef + \epsilon \right] \leq e^{-2\epsilon^2 / \lambda^2 \sum_i (b_i - a_i)^2}.
$$
Proof. See Section 1.5.

Remark. Since $-f$ is also $\lambda$-Lipschitz, we can bound deviations in both directions:
$$
P\left[ |f - Ef| \geq \epsilon \right] \leq 2 e^{-2\epsilon^2 / \lambda^2 \sum_i (b_i - a_i)^2}. \tag{1.2}
$$
We now look at bounds for functions that are Lipschitz with respect to other metrics.

1.1.4 Concentration of Lipschitz functions w.r.t. the $\ell_2$ metric

Let $S^d$ denote the surface of the unit sphere in $\mathbb{R}^d$, and let $\mu$ be the uniform distribution over $S^d$. The following is known (we will prove it later in the course):

[Figure 1.2] The function $f(x) = w \cdot x$ is $-1$ at one pole of the sphere, $+1$ at the other pole, and increases steadily from $-1$ to $+1$ as one moves from one pole to the other. Since $f$ is 1-Lipschitz on $S^d$, most of the volume lies in a thin slice near the equator of the sphere (perpendicular to $w$): $f$ is within $\epsilon$ of its median value on all but an $\exp(-\Omega(\epsilon^2 d))$ fraction of the sphere's mass.

Lemma 4. Let $f : S^d \to \mathbb{R}$ be $\lambda$-Lipschitz w.r.t. the $\ell_2$ metric. Then
$$
\mu\left[ f \geq \mathrm{med}(f) + \epsilon \right] \leq 4 e^{-\epsilon^2 d / 2\lambda^2}, \tag{1.3}
$$
where $\mathrm{med}(f)$ is a median value of $f$.

One immediate consequence of (1.3) is that most of the volume of the sphere lies in a thin slice around the equator (for all equators!). To see this, fix any unit vector $w \in S^d$. Then for $X \sim \mu$ (this notation means $X$ is drawn from the distribution $\mu$), $E(w \cdot X) = 0$ and also $\mathrm{med}(w \cdot X) = 0$. Moreover, the function $f(x) = w \cdot x$ is 1-Lipschitz w.r.t. the $\ell_2$ norm: for all $x, y \in S^d$,
$$
|f(x) - f(y)| = |w \cdot x - w \cdot y| = |w \cdot (x - y)| \leq \|w\|_2 \|x - y\|_2 = \|x - y\|_2,
$$
where the second-to-last inequality uses Cauchy–Schwarz. Thus by (1.3), $f$ is tightly concentrated around its median, i.e.,
$$
\mu\left[ X : |w \cdot X| \geq \epsilon \right] \leq 4 e^{-\epsilon^2 d / 2}.
$$
See Figure 1.2. Moreover, since there is nothing special about this particular $w$, the above bound is true for any equator!

1.1.5 Types of concentration

Types of concentration we'll encounter in this course:
- Concentration of a product measure $X = (X_1, \dots, X_n)$, where the $X_i$ are independent and bounded, with respect to the $\ell_1$ and Hamming metrics.
- Concentration of the uniform measure over $S^d$, with respect to the $\ell_2$ metric.
- Concentration of the multivariate Gaussian measure, with respect to the $\ell_2$ metric.
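The equator statement can also be checked numerically. The sketch below (an addition, not from the original notes) samples uniform points on the sphere by normalizing standard Gaussian vectors, a standard construction, and compares the empirical mass of $\{x : |w \cdot x| \geq \epsilon\}$ with the bound $4e^{-\epsilon^2 d/2}$; the values of $d$, $\epsilon$ and the trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)

def equator_mass(d=500, eps=0.1, trials=20000):
    """Empirical estimate of mu[x : |w . x| >= eps] for the uniform measure on the unit sphere in R^d."""
    X = rng.standard_normal((trials, d))
    X /= np.linalg.norm(X, axis=1, keepdims=True)    # normalizing Gaussian vectors gives uniform points on the sphere
    w = np.zeros(d)
    w[0] = 1.0                                       # any fixed unit vector (normal to the chosen "equator")
    return np.mean(np.abs(X @ w) >= eps)

d, eps = 500, 0.1
print("empirical:", equator_mass(d, eps), " lemma bound:", 4 * np.exp(-eps**2 * d / 2))
```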

1.2 Probability review

1.2.1 Warm-up problem

Question. Let $\sigma$ be a random permutation of $\{1, \dots, n\}$. Let $S$ be the number of fixed points of this permutation. What are the expected value and variance of $S$?

Answer. Use $n$ indicator random variables $X_i = \mathbf{1}(\sigma(i) = i)$, so that $S = \sum_i X_i$. By linearity of expectation, we can solve the first problem as follows:
$$
ES = E(X_1 + \cdots + X_n) = \sum_{i=1}^{n} EX_i = \sum_{i=1}^{n} P(X_i = 1) = n \cdot \frac{1}{n} = 1.
$$
For the second problem, we use $\mathrm{var}(S) = E(S^2) - (ES)^2 = E(S^2) - 1$, and
$$
E(S^2) = E(X_1 + \cdots + X_n)^2 = E\Big( \sum_i X_i^2 + \sum_{i \neq j} X_i X_j \Big) = \sum_i EX_i^2 + \sum_{i \neq j} E(X_i X_j) \quad \text{(linearity of expectation)}
$$
$$
= n \cdot \frac{1}{n} + \sum_{i \neq j} \frac{1}{n(n-1)} = 1 + 1 = 2,
$$
where we used $E(X_i X_j) = P(\sigma(i) = i, \sigma(j) = j) = \frac{(n-2)!}{n!} = \frac{1}{n(n-1)}$ for $i \neq j$. Thus $\mathrm{var}(S) = 1$.

1.2.2 Some basics

Property 5 (Linearity of expectation). $E(X + Y) = EX + EY$ (this holds even if $X$ and $Y$ are not independent).

Property 6. $\mathrm{var}(X) = E(X - EX)^2 = EX^2 - (EX)^2$.

Property 7 (Jensen's inequality). If $f$ is a convex function, then $Ef(X) \geq f(EX)$.

Here's a picture to help you remember this enormously useful property of convex functions:

[Figure: a convex curve $f$ over an interval $[a, b]$, with the chord lying above the curve, illustrating $Ef(X) \geq f(EX)$ at the point $EX$.]
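As a quick sanity check on the warm-up answer (an addition, not part of the original notes), the following sketch estimates $ES$ and $\mathrm{var}(S)$ by sampling random permutations; $n$ and the number of trials are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(2)

def fixed_point_stats(n=10, trials=50000):
    """Monte Carlo estimates of E[S] and var(S), where S = number of fixed points of a random permutation."""
    counts = np.empty(trials)
    for t in range(trials):
        perm = rng.permutation(n)                    # uniformly random permutation of {0,...,n-1}
        counts[t] = np.sum(perm == np.arange(n))     # number of fixed points
    return counts.mean(), counts.var()

mean_S, var_S = fixed_point_stats()
print(f"E[S] ~ {mean_S:.3f} (exact value 1),  var(S) ~ {var_S:.3f} (exact value 1)")
```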

Lemma 8. If $X_1, \dots, X_n$ are independent, then $\mathrm{var}(X_1 + \cdots + X_n) = \mathrm{var}(X_1) + \cdots + \mathrm{var}(X_n)$.

Proof. Let $X_1, \dots, X_n$ be $n$ independent random variables. Set $Y_i = X_i - EX_i$. Then $Y_1, \dots, Y_n$ are independent with mean zero, and
$$
\mathrm{var}(X_1 + \cdots + X_n) = E\left[ (X_1 - EX_1) + \cdots + (X_n - EX_n) \right]^2 = E(Y_1 + \cdots + Y_n)^2 = E\Big( \sum_i Y_i^2 + \sum_{i \neq j} Y_i Y_j \Big)
$$
$$
= \sum_i EY_i^2 + \sum_{i \neq j} EY_i \, EY_j = \sum_i EY_i^2 = \sum_i E(X_i - EX_i)^2 = \sum_i \mathrm{var}(X_i),
$$
where the cross terms vanish because each $EY_i = 0$.

As an example of an incorrect application: had we mistakenly assumed that the $X_i$ in the warm-up problem were independent, we would have found that the variance was $\frac{n-1}{n}$ instead of $1$. Not too far off, since those $X_i$ are approximately independent (for large $n$).

Lemma 9 (Markov's inequality). $P(|X| \geq a) \leq \dfrac{E|X|}{a}$.

Proof. Observe that $|X| \geq a \cdot \mathbf{1}(|X| \geq a)$; take expectations of both sides, using $E[\mathbf{1}(|X| \geq a)] = P(|X| \geq a)$.

Example. A simple application of Markov's inequality to the random variable $S$, which is always nonnegative, is $P(S \geq k) \leq \frac{ES}{k} = \frac{1}{k}$.

Lemma 10 (Chebyshev's inequality).
$$
P(|X - EX| \geq a) \leq \frac{\mathrm{var}(X)}{a^2}.
$$
Proof. Apply Markov's inequality to $(X - EX)^2$:
$$
P(|X - EX| \geq a) = P\big( (X - EX)^2 \geq a^2 \big) \leq \frac{E(X - EX)^2}{a^2} = \frac{\mathrm{var}(X)}{a^2}.
$$
Example. Applying this to the random variable $S$ again (recall $ES = \mathrm{var}(S) = 1$):
$$
P(S \geq k) \leq P(|S - ES| \geq k - 1) \leq \frac{1}{(k-1)^2}.
$$
Note that for $k \geq 3$ this is a better bound than the one given by Markov's inequality.
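For the fixed-point count $S$, the Markov and Chebyshev bounds can be compared against the actual tail numerically. The sketch below is an illustrative addition (not from the notes); the values of $n$, $k$ and the trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(3)

# Empirical tail P(S >= k) for the fixed-point count S of a random permutation (E[S] = var(S) = 1),
# compared with the Markov bound 1/k and the Chebyshev bound 1/(k-1)^2 from the notes.
n, trials = 20, 200000
perms = np.argsort(rng.random((trials, n)), axis=1)   # each row is a uniformly random permutation
S = (perms == np.arange(n)).sum(axis=1)               # fixed-point counts

for k in (2, 3, 4, 5):
    emp = np.mean(S >= k)
    print(f"k={k}: empirical {emp:.4f}   Markov {1/k:.3f}   Chebyshev {1/(k-1)**2:.3f}")
```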

1.2.3 Example: symmetric random walk

A symmetric random walk is a stochastic process on the line. One starts at the origin and at each time step moves either one unit to the left or one unit to the right, with equal probability. The move at time $t$ is thus a random variable $X_t$, where
$$
X_t = \begin{cases} +1 & \text{(right) with probability } 1/2, \\ -1 & \text{(left) with probability } 1/2. \end{cases}
$$
Let $S_n = \sum_{i=1}^{n} X_i$ be the position after $n$ steps of the random walk. What are the expected value and variance of $S_n$?

The expected value of $X_i$ is 0, since we are equally likely to obtain $+1$ and $-1$, so
$$
ES_n = E\sum_{i=1}^{n} X_i = \sum_{i=1}^{n} EX_i = 0.
$$
Similarly, since the $X_i$ are independent, the variance is linear as well. The variance of $X_i$ is $EX_i^2 = 1$, therefore
$$
\mathrm{var}(S_n) = \mathrm{var}\Big( \sum_{i=1}^{n} X_i \Big) = \sum_{i=1}^{n} \mathrm{var}(X_i) = n.
$$
The standard deviation of $S_n$ is thus $\sqrt{n}$, so we would expect that $S_n$ is $\pm O(\sqrt{n})$. We can make this more precise by using Markov's and Chebyshev's inequalities:
$$
\text{(Markov)} \quad P(|S_n| \geq c\sqrt{n}) \leq \frac{E|S_n|}{c\sqrt{n}} \leq \frac{\sqrt{ES_n^2}}{c\sqrt{n}} = \frac{\sqrt{\mathrm{var}(S_n)}}{c\sqrt{n}} = \frac{1}{c}
$$
$$
\text{(Chebyshev)} \quad P(|S_n| \geq c\sqrt{n}) \leq \frac{\mathrm{var}(S_n)}{(c\sqrt{n})^2} = \frac{1}{c^2}
$$
(In the Markov line, the step $E|S_n| \leq \sqrt{ES_n^2}$ is Jensen's inequality.)

1.2.4 Moment-generating functions

The Chebyshev inequality is just the Markov inequality applied to $X^2$; this often yields a better bound, as in the case of the symmetric random walk. We could similarly apply Markov's inequality to $X^4$, or $X^6$, or even higher powers of $X$. For the symmetric random walk, the bounds would get better and better (they would look like $O(1/c^k)$ for increasing powers $k$). The natural culmination of all this is to apply Markov's inequality to $e^X$ (or, for a little flexibility, $e^{tX}$, where $t$ is a constant we will optimize).

Lemma 11 (Chernoff's bounding method). For any $t > 0$,
$$
P(X \geq c) \leq \frac{Ee^{tX}}{e^{tc}}.
$$
Proof. Again, we use Markov's inequality:
$$
P(X \geq c) = P(e^{tX} \geq e^{tc}) \leq \frac{Ee^{tX}}{e^{tc}}.
$$
Definition 12. The moment-generating function of a random variable $X$ is the function $\psi(t) = Ee^{tX}$.
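The sketch below (an addition, not from the original notes) compares these bounds with the empirical tail of the symmetric random walk, and also with the bound $2e^{-c^2/2}$ that falls out of Chernoff's bounding method for $\pm 1$ steps (equivalently, Hoeffding's inequality from Lemma 1 applied with $\epsilon = c/\sqrt{n}$); the parameters $n$, $c$ and the trial count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(4)

n, c, trials = 400, 3.0, 20000
steps = np.where(rng.random((trials, n)) < 0.5, 1, -1)   # symmetric +/-1 steps
S_n = steps.sum(axis=1)                                  # position after n steps, one walk per row

emp = np.mean(np.abs(S_n) >= c * np.sqrt(n))
print(f"empirical P(|S_n| >= c*sqrt(n)) ~ {emp:.4f}")
print(f"Markov    bound 1/c            = {1/c:.4f}")
print(f"Chebyshev bound 1/c^2          = {1/c**2:.4f}")
print(f"Chernoff  bound 2*exp(-c^2/2)  = {2*np.exp(-c**2/2):.4f}")
```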

Example. If $X$ is Gaussian with mean 0 and variance 1,
$$
\psi(t) = \int_{-\infty}^{\infty} e^{tx} \, \frac{1}{\sqrt{2\pi}} e^{-x^2/2} \, dx = e^{t^2/2}.
$$
In general, the value $Ee^{tX}$ may not always be defined. However, if $Ee^{t_0 X}$ is defined for some $t_0 > 0$, then:
1. $Ee^{tX}$ is defined for all $|t| < t_0$.
2. All moments of $X$ are finite and $\psi(t)$ has derivatives of all orders at $t = 0$, with
$$
EX^k = \frac{\partial^k \psi}{\partial t^k} \bigg|_{t=0}.
$$
3. $\{\psi(t) : |t| \leq t_0\}$ uniquely determines the distribution of $X$.

1.3 Bounding $Ee^{tX}$

We can compute this expectation directly if we know the distribution of $X$ (simply do an integral), but can we get bounds on it given just some coarse statistics of $X$?

Lemma 13. If $X \in [a, b]$ and $X$ has mean 0, then $Ee^{tX} \leq e^{t^2 (b-a)^2 / 8}$.

Proof. As shown in Figure 1.3, $e^{tx}$ is a convex function.

[Figure 1.3: $e^{tx}$ is a convex function, shown over an interval $[a, b]$ with $a \leq 0 \leq b$.]

If we write $x = \lambda a + (1 - \lambda) b$ (where $0 \leq \lambda \leq 1$), convexity tells us that
$$
e^{tx} \leq \lambda e^{ta} + (1 - \lambda) e^{tb}.
$$
Plugging in $\lambda = (b - x)/(b - a)$ then gives
$$
e^{tx} \leq \frac{b - x}{b - a} e^{ta} + \frac{x - a}{b - a} e^{tb}.
$$
Take expectations of both sides, using linearity of expectation and the fact that $EX = 0$:
$$
Ee^{tX} \leq \frac{b - EX}{b - a} e^{ta} + \frac{EX - a}{b - a} e^{tb} = \frac{b e^{ta} - a e^{tb}}{b - a} \leq e^{t^2 (b-a)^2 / 8},
$$
where the last step is just calculus.
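Lemma 13 is easy to sanity-check for a specific distribution. The sketch below (an addition, not part of the original notes) uses an arbitrary two-point, mean-zero distribution and compares $Ee^{tX}$, computed exactly, with the bound $e^{t^2(b-a)^2/8}$ for a few values of $t$.

```python
import numpy as np

# Numerical check of Lemma 13: if X is in [a, b] with EX = 0, then E e^{tX} <= e^{t^2 (b-a)^2 / 8}.
# Example (arbitrary choice): X = -1 with probability 2/3 and X = +2 with probability 1/3,
# so EX = 0 and [a, b] = [-1, 2].
a, b = -1.0, 2.0
values = np.array([-1.0, 2.0])
probs = np.array([2/3, 1/3])

for t in (0.1, 0.5, 1.0, 2.0):
    mgf = np.sum(probs * np.exp(t * values))       # E e^{tX}, computed exactly for this discrete X
    bound = np.exp(t**2 * (b - a)**2 / 8)          # the bound from Lemma 13
    print(f"t = {t:3.1f}:  E e^(tX) = {mgf:8.4f}  <=  bound {bound:8.4f}")
```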

1.4 Hoeffding's Inequality

Theorem 14 (Hoeffding's inequality). Let $X_1, \dots, X_n$ be independent and bounded with $a_i \leq X_i \leq b_i$. Let $S_n = X_1 + \cdots + X_n$. Then for any $\epsilon > 0$,
$$
P(S_n \geq ES_n + \epsilon) \leq e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}, \qquad P(S_n \leq ES_n - \epsilon) \leq e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}.
$$
Proof. We'll just do the upper bound (the proof of the lower bound is very similar). Define $Y_i = X_i - EX_i$; then the $Y_i$ are independent, with mean zero and range $[a_i - EX_i, \, b_i - EX_i]$. For any $t > 0$,
$$
P(S_n \geq ES_n + \epsilon) = P(Y_1 + \cdots + Y_n \geq \epsilon) = P\big( e^{t(Y_1 + \cdots + Y_n)} \geq e^{t\epsilon} \big) \leq \frac{E e^{t(Y_1 + \cdots + Y_n)}}{e^{t\epsilon}}
$$
by Chernoff's bounding method. Exploiting the independence of the $Y_i$, and using our generic bound (Lemma 13) for each $Y_i$, we get
$$
P(S_n \geq ES_n + \epsilon) \leq \frac{Ee^{tY_1} \, Ee^{tY_2} \cdots Ee^{tY_n}}{e^{t\epsilon}} \leq \frac{e^{t^2 (b_1 - a_1)^2/8} \, e^{t^2 (b_2 - a_2)^2/8} \cdots e^{t^2 (b_n - a_n)^2/8}}{e^{t\epsilon}} \leq e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}
$$
by choosing $t = 4\epsilon / \sum_i (b_i - a_i)^2$, which minimizes the exponent $\frac{t^2}{8} \sum_i (b_i - a_i)^2 - t\epsilon$.

Next: generalize to Lipschitz functions.

1.5 Concentration in metric spaces

1.5.1 Basic definitions

Definition 15. A metric space $(S, d)$ consists of a set $S$ and a function $d : S \times S \to \mathbb{R}$ which satisfies three properties:
1. $d(x, y) \geq 0$, with equality iff $x = y$;
2. $d(x, y) = d(y, x)$;
3. $d(x, z) \leq d(x, y) + d(y, z)$.

Example. $(\mathbb{R}^n, \ell_p\text{-distance})$ is a metric space for any $p \geq 1$.

Definition 16. $f : S \to \mathbb{R}$ is $\lambda$-Lipschitz if $|f(x) - f(y)| \leq \lambda \, d(x, y)$ for all $x, y \in S$.

Now suppose that $\mu$ is a probability measure on $S$, and that we want to bound
$$
\mu\{f \geq Ef + \epsilon\} = P_{X \sim \mu}\big( f(X) \geq Ef + \epsilon \big).
$$
Once again, it would be natural to look at the moment-generating function $E_\mu e^{tf} = \int e^{tf(x)} \, \mu(dx)$. But we want a bound that holds for all Lipschitz functions, so we take the supremum of this quantity.

Definition 17. The Laplace functional of a metric measure space $(S, d, \mu)$ is
$$
L_{(S,d,\mu)}(t) = \sup_f E_\mu e^{tf},
$$
where the supremum is taken over all 1-Lipschitz functions $f$ with mean 0.

1.5.2 Metric spaces of bounded diameter

We start with an analog of Lemma 13.

Lemma 18. If $(S, d)$ has bounded diameter $D = \sup_{x, y \in S} d(x, y) < \infty$, then for any probability measure $\mu$ on $S$,
$$
L_{(S,d,\mu)}(t) \leq e^{t^2 D^2 / 2}.
$$
Proof. First some intuition. Pick any function $f : S \to \mathbb{R}$ which is 1-Lipschitz and has mean zero. Then certainly $|f(x)| \leq D$ for all $x$, and so $Ee^{tf} \leq e^{tD}$. The bound we seek is much tighter than this for small values of $t$ (recall that in Hoeffding's proof we chose the small value $t = 4\epsilon / \sum_i (b_i - a_i)^2$). To see why it is plausible, let's write out the Taylor expansion of $e^{tf}$ and make an unjustifiable approximation:
$$
Ee^{tf} = E\left[ 1 + tf + \frac{t^2 f^2}{2} + \frac{t^3 f^3}{3!} + \cdots \right] \approx 1 + tEf + \frac{t^2 Ef^2}{2} \leq 1 + \frac{t^2 D^2}{2} \leq e^{t^2 D^2 / 2}.
$$
We've exploited the fact that $Ef = 0$ to eliminate the first-order term of the series. However, notice that $e^{t^2 D^2/2}$ contains all the even powers of $t$, and so we really need to eliminate all the odd terms in the original Taylor series. When is $Ef^i = 0$ for odd $i$? Answer: when the distribution of $f$ is symmetric around zero. Since this might not be the case, we need to explicitly symmetrize $f$.

Now let's start the real proof. Take any 1-Lipschitz, mean-0 function $f : S \to \mathbb{R}$. First note that by Jensen's inequality, $E_\mu e^{-tf} \geq e^{-t E_\mu f} = 1$. Let $X, Y$ be two independent draws from the distribution $\mu$. Then
$$
E_\mu e^{tf} \leq E_\mu e^{tf} \cdot E_\mu e^{-tf} = E_{X \sim \mu} e^{tf(X)} \cdot E_{Y \sim \mu} e^{-tf(Y)} = E_{X, Y \sim \mu} e^{t(f(X) - f(Y))},
$$
which is just what we wanted, because $f(X) - f(Y)$ has a symmetric distribution. Thus its odd powers have zero mean:
$$
E_\mu e^{tf} \leq \sum_{i=0}^{\infty} \frac{t^i}{i!} E(f(X) - f(Y))^i = \sum_{i=0}^{\infty} \frac{t^{2i}}{(2i)!} E(f(X) - f(Y))^{2i}.
$$
Now we use the fact that $|f(X) - f(Y)| \leq D$, along with the inequality $(2i)! \geq i! \, 2^i$, to get
$$
E_\mu e^{tf} \leq \sum_{i=0}^{\infty} \frac{t^{2i} D^{2i}}{(2i)!} \leq \sum_{i=0}^{\infty} \frac{(t^2 D^2)^i}{2^i \, i!} = e^{t^2 D^2 / 2},
$$
and we're done.

In fact, by being a little more careful and using the same technique as in Lemma 13, we can get a slightly better bound.

Lemma 19. Under the same conditions as Lemma 18, $L_{(S,d,\mu)}(t) \leq e^{t^2 D^2 / 8}$.

We will apply this lemma to individual coordinates, as we did in Hoeffding's proof.
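As an illustration of Lemmas 18 and 19 (an addition, not from the original notes), the sketch below takes $S = [0, 1]$ with the usual metric, so $D = 1$, lets $\mu$ be uniform, and checks $E_\mu e^{tf} \leq e^{t^2 D^2/8}$ by Monte Carlo for one particular 1-Lipschitz, mean-zero function, $f(x) = x - 1/2$. This checks only a single $f$, not the supremum defining the Laplace functional, and the sample size and values of $t$ are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(5)

# S = [0,1] with |x - y| as the metric (diameter D = 1), mu = uniform distribution,
# f(x) = x - 1/2: a 1-Lipschitz function with E f = 0.
D = 1.0
X = rng.uniform(0.0, 1.0, size=1_000_000)
f = X - 0.5

for t in (0.5, 1.0, 2.0, 4.0):
    lhs = np.mean(np.exp(t * f))          # Monte Carlo estimate of E_mu e^{tf}
    rhs = np.exp(t**2 * D**2 / 8)         # Lemma 19 bound on the Laplace functional
    print(f"t = {t:3.1f}:  E e^(tf) ~ {lhs:6.4f}  <=  {rhs:6.4f}")
```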

1.5.3 Product spaces

Lemma 20. If $(S, d)$ and $(T, \delta)$ are metric spaces, then so is $(S \times T, d + \delta)$.

Example. $S = T = \mathbb{R}$ and $d(x, y) = |x - y| = \delta(x, y)$. In this case, the metric on the product space is the $\ell_1$ distance.

Definition 21. If $\mu$ is a measure on $S$ and $\nu$ is a measure on $T$, let $\mu \times \nu$ denote the product measure on $S \times T$, i.e., the measure which satisfies $(\mu \times \nu)(A \times B) = \mu(A) \nu(B)$ for all measurable $A \subseteq S$, $B \subseteq T$.

Lemma 22. If $(S, d, \mu)$ and $(T, \delta, \nu)$ are metric measure spaces, then
$$
L_{(S \times T, \, d + \delta, \, \mu \times \nu)}(t) \leq L_{(S,d,\mu)}(t) \cdot L_{(T,\delta,\nu)}(t).
$$
Proof. Pick any 1-Lipschitz $f : S \times T \to \mathbb{R}$ which has mean zero. For any $y \in T$, define $\bar{f}(y) = E_{X \sim \mu} f(X, y)$. Then $\bar{f}$ has mean zero over $Y \sim \nu$. Moreover, it is 1-Lipschitz on $(T, \delta)$, since for any $y, y' \in T$,
$$
|\bar{f}(y) - \bar{f}(y')| = \big| E_{X \sim \mu}[f(X, y)] - E_{X \sim \mu}[f(X, y')] \big| = \big| E_{X \sim \mu}[f(X, y) - f(X, y')] \big| \leq \delta(y, y')
$$
(the last step uses the fact that $f$ is 1-Lipschitz). Now, for any fixed $y$, the function $f(x, y) - \bar{f}(y)$ is 1-Lipschitz on $(S, d)$ and has mean zero over $X \sim \mu$. Therefore,
$$
E_{\mu \times \nu} e^{tf} = E_{X \sim \mu} E_{Y \sim \nu} \left[ e^{t\bar{f}(Y)} \, e^{t(f(X, Y) - \bar{f}(Y))} \right] = E_{Y \sim \nu} \left[ e^{t\bar{f}(Y)} \, E_{X \sim \mu} e^{t(f(X, Y) - \bar{f}(Y))} \right] \leq E_{Y \sim \nu} \left[ e^{t\bar{f}(Y)} \right] L_{(S,d,\mu)}(t) \leq L_{(S,d,\mu)}(t) \cdot L_{(T,\delta,\nu)}(t).
$$

Theorem 23. Let $(S_1, d_1, \mu_1), \dots, (S_n, d_n, \mu_n)$ be metric measure spaces of bounded diameters $D_i < \infty$. Let $S = (S_1 \times S_2 \times \cdots \times S_n, \, d_1 + d_2 + \cdots + d_n)$ be the product space and $\mu = \mu_1 \times \mu_2 \times \cdots \times \mu_n$ the product measure. Then for any 1-Lipschitz function $f : S \to \mathbb{R}$,
$$
\mu\{f \geq Ef + \epsilon\} \leq e^{-2\epsilon^2 / \sum_i D_i^2}.
$$
Proof. Combining Lemmas 19 and 22, we see that $L_{(S,d,\mu)}(t) \leq e^{(t^2/8) \sum_i D_i^2}$. Now it is a simple matter of applying Chernoff's bounding method, using the fact that $f - Ef$ is 1-Lipschitz with mean zero:
$$
\mu\{f - Ef \geq \epsilon\} = \mu\left\{ e^{t(f - Ef)} \geq e^{t\epsilon} \right\} \leq \frac{E_\mu e^{t(f - Ef)}}{e^{t\epsilon}} \leq \frac{L_{(S,d,\mu)}(t)}{e^{t\epsilon}},
$$
and the rest is algebra.

Example. Take $S_i = \mathbb{R}$ and $d_i(x, y) = |x - y|$. Then $S = \mathbb{R}^n$ and $d(x, y) = \|x - y\|_1$. This leads to the following corollary.

Corollary 24. Let $X_1, \dots, X_n$ be independent and bounded with $a_i \leq X_i \leq b_i$. Then for any function $f : \mathbb{R}^n \to \mathbb{R}$ which is 1-Lipschitz with respect to the $\ell_1$ metric,
$$
P\big( |f(X_1, \dots, X_n) - Ef| \geq \epsilon \big) \leq 2 e^{-2\epsilon^2 / \sum_i (b_i - a_i)^2}.
$$
Remark. Hoeffding's inequality is a special case of this corollary where $f(x_1, \dots, x_n) = x_1 + \cdots + x_n$.
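To close, here is a sketch (an addition, not in the original notes) checking Corollary 24 for a Lipschitz function that is not a sum or average: $f(x) = \sum_i |x_i - 1/2|$, which is 1-Lipschitz w.r.t. the $\ell_1$ metric by the triangle inequality. The parameters are arbitrary, and the bound is quite loose for this particular $f$, but it is seen to hold.

```python
import numpy as np

rng = np.random.default_rng(6)

# Corollary 24 check for f(x) = sum_i |x_i - 1/2|, which is 1-Lipschitz w.r.t. the l1 metric.
# X_i ~ Uniform[0,1] independently, so a_i = 0, b_i = 1 and sum_i (b_i - a_i)^2 = n.
n, trials = 100, 100000
X = rng.uniform(0.0, 1.0, size=(trials, n))
f_vals = np.abs(X - 0.5).sum(axis=1)
Ef = f_vals.mean()                                   # Monte Carlo estimate of Ef (exact value is n/4)

for eps in (3.0, 6.0, 10.0):
    emp = np.mean(np.abs(f_vals - Ef) >= eps)
    bound = 2 * np.exp(-2 * eps**2 / n)
    print(f"eps = {eps:4.1f}:  empirical {emp:.5f}  <=  corollary bound {bound:.4f}")
```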