Concentration inequalities and the entropy method
Gábor Lugosi
ICREA and Pompeu Fabra University, Barcelona
what is concentration?

We are interested in bounding random fluctuations of functions of many independent random variables.

$X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$.

How large are typical deviations of $Z$ from $\mathbf{E}Z$? We seek upper bounds for
$$\mathbf{P}\{Z > \mathbf{E}Z + t\} \quad \text{and} \quad \mathbf{P}\{Z < \mathbf{E}Z - t\}$$
for $t > 0$.
various approaches

- martingales (Yurinskii, 1974; Milman and Schechtman, 1986; Shamir and Spencer, 1987; McDiarmid, 1989, 1998);
- information-theoretic and transportation methods (Ahlswede, Gács, and Körner, 1976; Marton, 1986, 1996, 1997; Dembo, 1997);
- Talagrand's induction method, 1996;
- logarithmic Sobolev inequalities (Ledoux, 1996; Massart, 1998; Boucheron, Lugosi, Massart, 1999, 2001).
chernoff bounds

By Markov's inequality, if $\lambda > 0$,
$$\mathbf{P}\{Z - \mathbf{E}Z > t\} = \mathbf{P}\big\{e^{\lambda(Z - \mathbf{E}Z)} > e^{\lambda t}\big\} \le \frac{\mathbf{E}e^{\lambda(Z - \mathbf{E}Z)}}{e^{\lambda t}}.$$
Next derive bounds for the moment generating function $\mathbf{E}e^{\lambda(Z - \mathbf{E}Z)}$ and optimize $\lambda$.

If $Z = \sum_{i=1}^n X_i$ is a sum of independent random variables,
$$\mathbf{E}e^{\lambda Z} = \mathbf{E}\prod_{i=1}^n e^{\lambda X_i} = \prod_{i=1}^n \mathbf{E}e^{\lambda X_i}$$
by independence. It suffices to find bounds for $\mathbf{E}e^{\lambda X_i}$.

Serguei Bernstein (1880–1968), Herman Chernoff (1923–)
hoeffding's inequality

If $X_1, \dots, X_n \in [0, 1]$, then
$$\mathbf{E}e^{\lambda(X_i - \mathbf{E}X_i)} \le e^{\lambda^2/8}.$$
We obtain
$$\mathbf{P}\left\{\left|\frac{1}{n}\sum_{i=1}^n X_i - \mathbf{E}\left[\frac{1}{n}\sum_{i=1}^n X_i\right]\right| > t\right\} \le 2e^{-2nt^2}.$$

Wassily Hoeffding (1914–1991)
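A quick numerical sanity check, added here as a minimal sketch (the choice of uniform $[0,1]$ variables and of $n$, $t$ is arbitrary): compare the empirical two-sided tail of the sample mean with the distribution-free Hoeffding bound $2e^{-2nt^2}$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 100_000, 0.1

# Sample means of n i.i.d. uniform [0, 1] variables; E[mean] = 1/2.
means = rng.random((trials, n)).mean(axis=1)
empirical = np.mean(np.abs(means - 0.5) > t)
bound = 2 * np.exp(-2 * n * t**2)

# The empirical tail should come out well below the (distribution-free) bound.
print(f"empirical tail:  {empirical:.5f}")
print(f"Hoeffding bound: {bound:.5f}")
```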
martingale representation

$X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$.

Denote $\mathbf{E}_i[\cdot] = \mathbf{E}[\cdot \mid X_1, \dots, X_i]$. Thus, $\mathbf{E}_0 Z = \mathbf{E}Z$ and $\mathbf{E}_n Z = Z$.

Writing
$$\Delta_i = \mathbf{E}_i Z - \mathbf{E}_{i-1} Z,$$
we have
$$Z - \mathbf{E}Z = \sum_{i=1}^n \Delta_i.$$
This is the Doob martingale representation of $Z$.

Joseph Leo Doob (1910–2004)
martingale representation: the variance

$$\mathrm{Var}(Z) = \mathbf{E}\left(\sum_{i=1}^n \Delta_i\right)^2 = \sum_{i=1}^n \mathbf{E}\big[\Delta_i^2\big] + 2\sum_{j > i} \mathbf{E}[\Delta_i \Delta_j].$$
Now if $j > i$, $\mathbf{E}_i \Delta_j = 0$, so
$$\mathbf{E}[\Delta_i \Delta_j] = \mathbf{E}[\Delta_i\, \mathbf{E}_i \Delta_j] = 0.$$
We obtain
$$\mathrm{Var}(Z) = \sum_{i=1}^n \mathbf{E}\big[\Delta_i^2\big].$$
From this, using independence, it is easy to derive the Efron-Stein inequality.
efron-stein inequality (1981)

Let $X_1, \dots, X_n$ be independent random variables taking values in $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$. Then
$$\mathrm{Var}(Z) \le \mathbf{E}\sum_{i=1}^n \big(Z - \mathbf{E}^{(i)} Z\big)^2 = \mathbf{E}\sum_{i=1}^n \mathrm{Var}^{(i)}(Z),$$
where $\mathbf{E}^{(i)} Z$ is expectation with respect to the $i$-th variable $X_i$ only.

We obtain more useful forms by using that
$$\mathrm{Var}(X) = \frac{1}{2}\mathbf{E}(X - X')^2 \quad \text{and} \quad \mathrm{Var}(X) \le \mathbf{E}(X - a)^2$$
for any constant $a$.
efron-stein inequality (1981)

If $X_1', \dots, X_n'$ are independent copies of $X_1, \dots, X_n$, and
$$Z_i' = f(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n),$$
then
$$\mathrm{Var}(Z) \le \frac{1}{2}\mathbf{E}\left[\sum_{i=1}^n (Z - Z_i')^2\right].$$
$Z$ is concentrated if it doesn't depend too much on any of its variables.

If $Z = \sum_{i=1}^n X_i$ then we have an equality. Sums are the least concentrated of all functions!
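The bound is easy to test by simulation. A minimal sketch, added for illustration (the choice $Z = \max_i X_i$ with uniform $X_i$ is an arbitrary example), estimating both sides by Monte Carlo:

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 100_000

X = rng.random((trials, n))
Z = X.max(axis=1)  # Z = f(X_1, ..., X_n) = max of n uniforms

# (1/2) E[sum_i (Z - Z_i')^2], replacing one coordinate at a time
# by an independent copy.
es_bound = 0.0
for i in range(n):
    Xi = X.copy()
    Xi[:, i] = rng.random(trials)
    es_bound += 0.5 * np.mean((Z - Xi.max(axis=1)) ** 2)

print(f"Var(Z)            ~ {Z.var():.5f}")
print(f"Efron-Stein bound ~ {es_bound:.5f}")
```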
efron-stein inequality (1981)

If for some arbitrary functions $f_i$
$$Z_i = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n),$$
then
$$\mathrm{Var}(Z) \le \mathbf{E}\left[\sum_{i=1}^n (Z - Z_i)^2\right].$$
efron, stein, and steele

Bradley Efron, Charles Stein, Mike Steele
example: kernel density estimation

Let $X_1, \dots, X_n$ be i.i.d. real samples drawn according to some density $\phi$. The kernel density estimate is
$$\phi_n(x) = \frac{1}{nh}\sum_{i=1}^n K\left(\frac{x - X_i}{h}\right),$$
where $h > 0$, and $K$ is a nonnegative kernel with $\int K = 1$. The $L_1$ error is
$$Z = f(X_1, \dots, X_n) = \int |\phi(x) - \phi_n(x)|\, dx.$$
It is easy to see that
$$|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le \frac{1}{nh}\int \left|K\left(\frac{x - x_i}{h}\right) - K\left(\frac{x - x_i'}{h}\right)\right| dx \le \frac{2}{n},$$
so we get $\mathrm{Var}(Z) \le \frac{2}{n}$ (Devroye, 1991).
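A minimal simulation sketch, added for illustration (standard normal $\phi$, a Gaussian kernel, and grid-based integration are arbitrary choices), comparing the empirical variance of the $L_1$ error with the $2/n$ bound:

```python
import numpy as np

rng = np.random.default_rng(0)
n, h, reps = 200, 0.3, 2000
grid = np.linspace(-5, 5, 400)
dx = grid[1] - grid[0]
phi = np.exp(-grid**2 / 2) / np.sqrt(2 * np.pi)  # true density (standard normal)

def l1_error(sample):
    # Gaussian-kernel density estimate on the grid, then L1 distance to phi.
    K = np.exp(-0.5 * ((grid[None, :] - sample[:, None]) / h) ** 2)
    phi_n = K.sum(axis=0) / (n * h * np.sqrt(2 * np.pi))
    return np.sum(np.abs(phi - phi_n)) * dx

Z = np.array([l1_error(rng.standard_normal(n)) for _ in range(reps)])
print(f"Var(Z) ~ {Z.var():.2e}   bound 2/n = {2 / n:.2e}")
```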
Luc Devroye on Taksim square (2013).
weakly self-bounding functions

$f : \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
$$\sum_{i=1}^n \big(f(x) - f_i(x^{(i)})\big)^2 \le a f(x) + b.$$
Then
$$\mathrm{Var}(f(X)) \le a\,\mathbf{E}f(X) + b.$$
self-bounding functions

If
$$0 \le f(x) - f_i(x^{(i)}) \le 1$$
and
$$\sum_{i=1}^n \big(f(x) - f_i(x^{(i)})\big) \le f(x),$$
then $f$ is self-bounding and $\mathrm{Var}(f(X)) \le \mathbf{E}f(X)$.

Rademacher averages, random VC dimension, random VC entropy, and the longest increasing subsequence in a random permutation are all examples of self-bounding functions.

Configuration functions.
beyond the variance

$X_1, \dots, X_n$ are independent random variables taking values in some set $\mathcal{X}$. Let $f : \mathcal{X}^n \to \mathbb{R}$ and $Z = f(X_1, \dots, X_n)$.

Recall the Doob martingale representation:
$$Z - \mathbf{E}Z = \sum_{i=1}^n \Delta_i \quad \text{where} \quad \Delta_i = \mathbf{E}_i Z - \mathbf{E}_{i-1} Z,$$
with $\mathbf{E}_i[\cdot] = \mathbf{E}[\cdot \mid X_1, \dots, X_i]$.

To get exponential inequalities, we bound the moment generating function $\mathbf{E}e^{\lambda(Z - \mathbf{E}Z)}$.
azuma's inequality

Suppose that the martingale differences are bounded: $|\Delta_i| \le c_i$. Then
$$\begin{aligned}
\mathbf{E}e^{\lambda(Z - \mathbf{E}Z)} &= \mathbf{E}e^{\lambda\sum_{i=1}^n \Delta_i} = \mathbf{E}\left[e^{\lambda\sum_{i=1}^{n-1}\Delta_i}\,\mathbf{E}_{n-1}e^{\lambda\Delta_n}\right] \\
&\le \mathbf{E}\left[e^{\lambda\sum_{i=1}^{n-1}\Delta_i}\right]e^{\lambda^2 c_n^2/2} \quad \text{(by Hoeffding)} \\
&\le \cdots \le e^{\lambda^2\left(\sum_{i=1}^n c_i^2\right)/2}.
\end{aligned}$$
This is the Azuma-Hoeffding inequality for sums of bounded martingale differences.
bounded differences inequality

If $Z = f(X_1, \dots, X_n)$ and $f$ is such that
$$|f(x_1, \dots, x_n) - f(x_1, \dots, x_i', \dots, x_n)| \le c_i,$$
then the martingale differences are bounded.

Bounded differences inequality: if $X_1, \dots, X_n$ are independent, then
$$\mathbf{P}\{|Z - \mathbf{E}Z| > t\} \le 2e^{-2t^2/\sum_{i=1}^n c_i^2}.$$
Also known as McDiarmid's inequality.

Colin McDiarmid
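A minimal sketch of the inequality in action, added as an illustration (the "number of distinct values" statistic and all parameters are arbitrary choices): changing one draw changes the count of distinct values by at most 1, so $c_i = 1$ for every $i$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k, trials, t = 100, 50, 50_000, 10

# Z = number of distinct values among n i.i.d. draws from {0, ..., k-1}.
# Changing a single draw moves Z by at most 1, so c_i = 1.
draws = rng.integers(0, k, size=(trials, n))
Z = np.array([np.unique(row).size for row in draws])

empirical = np.mean(np.abs(Z - Z.mean()) > t)
bound = 2 * np.exp(-2 * t**2 / n)  # sum of c_i^2 equals n here
print(f"empirical: {empirical:.5f}   bounded differences bound: {bound:.5f}")
```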
bounded differences inequality

Easy to use. Distribution-free. Often close to optimal.
Does not exploit variance information. Often too rigid. Other methods are necessary.
shannon entropy

If $X, Y$ are random variables taking values in a set of size $N$,
$$H(X) = -\sum_x p(x)\log p(x),$$
$$H(X \mid Y) = H(X, Y) - H(Y) = -\sum_{x,y} p(x, y)\log p(x \mid y),$$
$$H(X) \le \log N \quad \text{and} \quad H(X \mid Y) \le H(X).$$

Claude Shannon (1916–2001)
han's inequality

If $X = (X_1, \dots, X_n)$ and $X^{(i)} = (X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n)$, then
$$\sum_{i=1}^n \big(H(X) - H(X^{(i)})\big) \le H(X).$$
Proof:
$$H(X) = H(X^{(i)}) + H(X_i \mid X^{(i)}) \le H(X^{(i)}) + H(X_i \mid X_1, \dots, X_{i-1}).$$
Since $\sum_{i=1}^n H(X_i \mid X_1, \dots, X_{i-1}) = H(X)$, summing the inequality, we get
$$(n - 1)H(X) \le \sum_{i=1}^n H(X^{(i)}).$$

Te Sun Han
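Han's inequality is easy to verify numerically for a random joint distribution on a small product space; the following brute-force sketch (an added illustration) does exactly that:

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 3, 2  # three binary coordinates

# A random joint distribution p on {0, ..., k-1}^n.
p = rng.random((k,) * n)
p /= p.sum()

def H(q):  # Shannon entropy in nats (with 0 log 0 = 0)
    q = q[q > 0]
    return -np.sum(q * np.log(q))

# Marginalizing out coordinate i gives the distribution of X^(i).
rhs = sum(H(p.sum(axis=i).ravel()) for i in range(n))
print(f"(n-1) H(X) = {(n - 1) * H(p.ravel()):.4f}  <=  sum_i H(X^(i)) = {rhs:.4f}")
```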
combinatorial entropies: an example

Let $X_1, \dots, X_n$ be independent points in the plane (of arbitrary distribution!). Let $N$ be the number of subsets of points that are in convex position. Then
$$\mathrm{Var}(\log_2 N) \le \mathbf{E}\log_2 N.$$
proof

By Efron-Stein, it suffices to prove that $f_n = \log_2 N$ is self-bounding:
$$0 \le f_n(x) - f_{n-1}(x^{(i)}) \le 1$$
and
$$\sum_{i=1}^n \big(f_n(x) - f_{n-1}(x^{(i)})\big) \le f_n(x).$$
The first property is obvious; we only need to prove the second. This is a deterministic property, so fix the points.
proof

Among all sets in convex position, draw one uniformly at random. Define $Y_i$ as the indicator that $x_i$ is in the chosen set.
$$H(Y) = H(Y_1, \dots, Y_n) = \log_2 N = f_n(x).$$
Also, $H(Y^{(i)}) \le f_{n-1}(x^{(i)})$, so by Han's inequality,
$$\sum_{i=1}^n \big(f_n(x) - f_{n-1}(x^{(i)})\big) \le \sum_{i=1}^n \big(H(Y) - H(Y^{(i)})\big) \le H(Y) = f_n(x).$$
han's inequality for relative entropies

Let $P = P_1 \otimes \cdots \otimes P_n$ be a product distribution and $Q$ an arbitrary distribution on $\mathcal{X}^n$. By Han's inequality,
$$D(Q \,\|\, P) \ge \frac{1}{n-1}\sum_{i=1}^n D\big(Q^{(i)} \,\|\, P^{(i)}\big).$$
subadditivity of entropy

The entropy of a random variable $Z \ge 0$ is
$$\mathrm{Ent}(Z) = \mathbf{E}\Phi(Z) - \Phi(\mathbf{E}Z)$$
where $\Phi(x) = x\log x$. By Jensen's inequality, $\mathrm{Ent}(Z) \ge 0$.

Let $X_1, \dots, X_n$ be independent and let $Z = f(X_1, \dots, X_n)$, where $f \ge 0$. $\mathrm{Ent}(Z)$ is the relative entropy between the distribution induced by $Z$ on $\mathcal{X}^n$ and the distribution of $X = (X_1, \dots, X_n)$.

Denote
$$\mathrm{Ent}^{(i)}(Z) = \mathbf{E}^{(i)}\Phi(Z) - \Phi(\mathbf{E}^{(i)}Z).$$
Then by Han's inequality,
$$\mathrm{Ent}(Z) \le \mathbf{E}\sum_{i=1}^n \mathrm{Ent}^{(i)}(Z).$$
a logarithmic sobolev inequality on the hypercube

Let $X = (X_1, \dots, X_n)$ be uniformly distributed over $\{-1, 1\}^n$. If $f : \{-1, 1\}^n \to \mathbb{R}$ and $Z = f(X)$,
$$\mathrm{Ent}(Z^2) \le \frac{1}{2}\,\mathbf{E}\sum_{i=1}^n (Z - Z_i')^2,$$
where $Z_i' = f(X_1, \dots, -X_i, \dots, X_n)$ is obtained by flipping the $i$-th bit.

The proof uses subadditivity of the entropy and calculus for the case $n = 1$. Implies Efron-Stein.
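For small $n$ the inequality can be checked exhaustively. A brute-force sketch, added as an illustration (the random choice of $f$ is arbitrary):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
n = 4
cube = np.array(list(product([-1, 1], repeat=n)))  # all 2^n points
f = rng.standard_normal(len(cube))                 # an arbitrary f on the cube

def ent(w):  # Ent(W) = E[W log W] - E[W] log E[W], X uniform on the cube
    m = w.mean()
    return np.mean(w * np.log(w)) - m * np.log(m)

# RHS: (1/2) E sum_i (f(X) - f(X with bit i flipped))^2
rhs = 0.0
for i in range(n):
    flipped = cube.copy()
    flipped[:, i] *= -1
    idx = [int(np.flatnonzero((cube == y).all(axis=1))[0]) for y in flipped]
    rhs += 0.5 * np.mean((f - f[idx]) ** 2)

print(f"Ent(f^2) = {ent(f**2):.4f}  <=  {rhs:.4f}")
```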
Sergei Lvovich Sobolev (1908–1989)
herbst's argument: exponential concentration

If $f : \{-1, 1\}^n \to \mathbb{R}$, the log-Sobolev inequality may be used with $g(x) = e^{\lambda f(x)/2}$ where $\lambda \in \mathbb{R}$. If $F(\lambda) = \mathbf{E}e^{\lambda Z}$ is the moment generating function of $Z = f(X)$,
$$\mathrm{Ent}(g(X)^2) = \lambda\mathbf{E}\big[Ze^{\lambda Z}\big] - \mathbf{E}\big[e^{\lambda Z}\big]\log\mathbf{E}\big[e^{\lambda Z}\big] = \lambda F'(\lambda) - F(\lambda)\log F(\lambda).$$
Differential inequalities are obtained for $F(\lambda)$.
herbst's argument

As an example, suppose $f$ is such that $\sum_{i=1}^n (Z - Z_i')_+^2 \le v$. Then by the log-Sobolev inequality,
$$\lambda F'(\lambda) - F(\lambda)\log F(\lambda) \le \frac{v\lambda^2}{4}F(\lambda).$$
If $G(\lambda) = \log F(\lambda)$, this becomes
$$\left(\frac{G(\lambda)}{\lambda}\right)' \le \frac{v}{4}.$$
This can be integrated: $G(\lambda) \le \lambda\mathbf{E}Z + \lambda^2 v/4$, so
$$F(\lambda) \le e^{\lambda\mathbf{E}Z + \lambda^2 v/4}.$$
This implies
$$\mathbf{P}\{Z > \mathbf{E}Z + t\} \le e^{-t^2/v}.$$
Stronger than the bounded differences inequality!
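For completeness, here is the integration step spelled out (a standard calculation, using only the definitions above):

```latex
% Divide the differential inequality by \lambda^2 F(\lambda):
\[
  \frac{\lambda F'(\lambda) - F(\lambda)\log F(\lambda)}{\lambda^2 F(\lambda)}
  = \frac{G'(\lambda)}{\lambda} - \frac{G(\lambda)}{\lambda^2}
  = \left(\frac{G(\lambda)}{\lambda}\right)' \le \frac{v}{4}.
\]
% Since G(\lambda)/\lambda -> E Z as \lambda -> 0, integrating over (0, \lambda]
% gives G(\lambda)/\lambda <= E Z + \lambda v/4, that is,
% F(\lambda) <= exp(\lambda E Z + \lambda^2 v/4). Markov's inequality with the
% optimal choice \lambda = 2t/v then yields
\[
  \mathbf{P}\{Z > \mathbf{E}Z + t\}
  \le e^{-\lambda t}\,\mathbf{E}e^{\lambda(Z - \mathbf{E}Z)}
  \le e^{\lambda^2 v/4 - \lambda t}
  = e^{-t^2/v}.
\]
```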
gaussian log-sobolev inequality

Let $X = (X_1, \dots, X_n)$ be a vector of i.i.d. standard normal random variables. If $f : \mathbb{R}^n \to \mathbb{R}$ and $Z = f(X)$,
$$\mathrm{Ent}(Z^2) \le 2\,\mathbf{E}\big[\|\nabla f(X)\|^2\big]$$
(Gross, 1975).

Proof sketch: By the subadditivity of entropy, it suffices to prove it for $n = 1$. Approximate $Z = f(X)$ by
$$f\left(\frac{1}{\sqrt{m}}\sum_{i=1}^m \varepsilon_i\right)$$
where the $\varepsilon_i$ are i.i.d. Rademacher random variables. Use the log-Sobolev inequality of the hypercube and the central limit theorem.
gaussian concentration inequality

Herbst's argument may now be repeated: Suppose $f$ is Lipschitz: for all $x, y \in \mathbb{R}^n$,
$$|f(x) - f(y)| \le L\|x - y\|.$$
Then, for all $t > 0$,
$$\mathbf{P}\{f(X) \ge \mathbf{E}f(X) + t\} \le e^{-t^2/(2L^2)}$$
(Tsirelson, Ibragimov, and Sudakov, 1976).
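A minimal simulation sketch, added as an illustration: $f(x) = \max_i x_i$ is 1-Lipschitz in the Euclidean norm, so the theorem applies with $L = 1$; the sample mean stands in for $\mathbf{E}f(X)$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 50, 200_000, 1.0

# f(x) = max_i x_i is 1-Lipschitz for the Euclidean norm.
Z = rng.standard_normal((trials, n)).max(axis=1)

empirical = np.mean(Z >= Z.mean() + t)  # sample mean as a proxy for E f(X)
bound = np.exp(-t**2 / 2)               # L = 1
print(f"empirical: {empirical:.5f}   Gaussian concentration bound: {bound:.5f}")
```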
beyond bernoulli and gaussian: the entropy method

For general distributions, logarithmic Sobolev inequalities are not available. Solution: modified logarithmic Sobolev inequalities.

Suppose $X_1, \dots, X_n$ are independent. Let $Z = f(X_1, \dots, X_n)$ and
$$Z_i = f_i(X^{(i)}) = f_i(X_1, \dots, X_{i-1}, X_{i+1}, \dots, X_n).$$
Let $\phi(x) = e^x - x - 1$. Then for all $\lambda \in \mathbb{R}$,
$$\lambda\mathbf{E}\big[Ze^{\lambda Z}\big] - \mathbf{E}\big[e^{\lambda Z}\big]\log\mathbf{E}\big[e^{\lambda Z}\big] \le \sum_{i=1}^n \mathbf{E}\big[e^{\lambda Z}\phi\big(-\lambda(Z - Z_i)\big)\big].$$

Michel Ledoux
the entropy method

Define $Z_i = \inf_{x_i'} f(X_1, \dots, x_i', \dots, X_n)$ and suppose
$$\sum_{i=1}^n (Z - Z_i)^2 \le v.$$
Then for all $t > 0$,
$$\mathbf{P}\{Z - \mathbf{E}Z > t\} \le e^{-t^2/(2v)}.$$
This implies the bounded differences inequality and much more.
example: the largest eigenvalue of a symmetric matrix

Let $A = (X_{i,j})_{n\times n}$ be symmetric, the $X_{i,j}$ independent ($i \le j$) with $|X_{i,j}| \le 1$. Let
$$Z = \lambda_1 = \sup_{u: \|u\| = 1} u^T A u,$$
and suppose $v$ is such that $Z = v^T A v$. Let $A_{i,j}'$ be obtained by replacing $X_{i,j}$ by $X_{i,j}'$, and $Z_{i,j}'$ the corresponding largest eigenvalue. Then
$$(Z - Z_{i,j}')_+ \le \big(v^T A v - v^T A_{i,j}' v\big)\mathbb{1}_{Z > Z_{i,j}'} = \big(v^T(A - A_{i,j}')v\big)\mathbb{1}_{Z > Z_{i,j}'} \le 2\big(v_i v_j(X_{i,j} - X_{i,j}')\big)_+ \le 4|v_i v_j|.$$
Therefore,
$$\sum_{1\le i\le j\le n} (Z - Z_{i,j}')_+^2 \le \sum_{1\le i\le j\le n} 16 v_i^2 v_j^2 \le 16\left(\sum_{i=1}^n v_i^2\right)^2 = 16.$$
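With $v = 16$, the entropy method above gives $\mathbf{P}\{Z > \mathbf{E}Z + t\} \le e^{-t^2/32}$, uniformly in $n$. A minimal simulation sketch, added as an illustration (uniform $[-1, 1]$ entries are an arbitrary choice within the assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 30, 2000, 3.0

def lam1():
    # Symmetric matrix with independent uniform [-1, 1] entries for i <= j.
    U = rng.uniform(-1, 1, size=(n, n))
    A = np.triu(U) + np.triu(U, 1).T
    return np.linalg.eigvalsh(A)[-1]  # largest eigenvalue

Z = np.array([lam1() for _ in range(trials)])
empirical = np.mean(Z >= Z.mean() + t)
bound = np.exp(-t**2 / 32)  # v = 16 from the computation above
print(f"empirical: {empirical:.4f}   bound: {bound:.4f}")
```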
self-bounding functions

Suppose $Z$ satisfies
$$0 \le Z - Z_i \le 1 \quad \text{and} \quad \sum_{i=1}^n (Z - Z_i) \le Z.$$
Recall that $\mathrm{Var}(Z) \le \mathbf{E}Z$. We have much more:
$$\mathbf{P}\{Z > \mathbf{E}Z + t\} \le e^{-t^2/(2\mathbf{E}Z + 2t/3)}$$
and
$$\mathbf{P}\{Z < \mathbf{E}Z - t\} \le e^{-t^2/(2\mathbf{E}Z)}.$$
exponential efron-stein inequality

Define
$$V^+ = \mathbf{E}'\left[\sum_{i=1}^n (Z - Z_i')_+^2\right] \quad \text{and} \quad V^- = \mathbf{E}'\left[\sum_{i=1}^n (Z - Z_i')_-^2\right],$$
where $\mathbf{E}'$ denotes expectation with respect to the independent copies $X_1', \dots, X_n'$ only. By Efron-Stein,
$$\mathrm{Var}(Z) \le \mathbf{E}V^+ \quad \text{and} \quad \mathrm{Var}(Z) \le \mathbf{E}V^-.$$
The following exponential versions hold for all $\lambda, \theta > 0$ with $\lambda\theta < 1$:
$$\log\mathbf{E}e^{\lambda(Z - \mathbf{E}Z)} \le \frac{\lambda\theta}{1 - \lambda\theta}\log\mathbf{E}e^{\lambda V^+/\theta}.$$
If also $Z_i' - Z \le 1$ for every $i$, then for all $\lambda \in (0, 1/2)$,
$$\log\mathbf{E}e^{\lambda(Z - \mathbf{E}Z)} \le \frac{2\lambda}{1 - 2\lambda}\log\mathbf{E}e^{\lambda V^-}.$$
weakly self-bounding functions

$f : \mathcal{X}^n \to [0, \infty)$ is weakly $(a, b)$-self-bounding if there exist $f_i : \mathcal{X}^{n-1} \to [0, \infty)$ such that for all $x \in \mathcal{X}^n$,
$$\sum_{i=1}^n \big(f(x) - f_i(x^{(i)})\big)^2 \le a f(x) + b.$$
Then
$$\mathbf{P}\{Z \ge \mathbf{E}Z + t\} \le \exp\left(-\frac{t^2}{2(a\mathbf{E}Z + b + at/2)}\right).$$
If, in addition, $f(x) - f_i(x^{(i)}) \le 1$, then for $0 < t \le \mathbf{E}Z$,
$$\mathbf{P}\{Z \le \mathbf{E}Z - t\} \le \exp\left(-\frac{t^2}{2(a\mathbf{E}Z + b + c_- t)}\right),$$
where $c_- = (3a - 1)_-/6$ (in particular, $c_- = 0$ when $a \ge 1/3$).
the isoperimetric view

Let $X = (X_1, \dots, X_n)$ have independent components, taking values in $\mathcal{X}^n$. Let $A \subset \mathcal{X}^n$. The Hamming distance of $X$ to $A$ is
$$d(X, A) = \min_{y \in A} d(X, y) = \min_{y \in A}\sum_{i=1}^n \mathbb{1}_{X_i \ne y_i}.$$
$$\mathbf{P}\left\{d(X, A) \ge t + \sqrt{\frac{n}{2}\log\frac{1}{\mathbf{P}\{A\}}}\right\} \le e^{-2t^2/n}.$$

Michel Talagrand
the isoperimetric view

Proof: By the bounded differences inequality,
$$\mathbf{P}\{\mathbf{E}d(X, A) - d(X, A) \ge t\} \le e^{-2t^2/n}.$$
Taking $t = \mathbf{E}d(X, A)$ (note that $d(X, A) = 0$ on $A$, so $\mathbf{P}\{A\} \le e^{-2(\mathbf{E}d(X, A))^2/n}$), we get
$$\mathbf{E}d(X, A) \le \sqrt{\frac{n}{2}\log\frac{1}{\mathbf{P}\{A\}}}.$$
By the bounded differences inequality again,
$$\mathbf{P}\left\{d(X, A) \ge t + \sqrt{\frac{n}{2}\log\frac{1}{\mathbf{P}\{A\}}}\right\} \le e^{-2t^2/n}.$$
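A minimal sketch checking the statement, added as an illustration: for the monotone set $A = \{x \in \{0,1\}^n : \sum_i x_i \le n/2\}$ under the uniform distribution, the Hamming distance to $A$ has the closed form $(\sum_i x_i - n/2)_+$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 100, 200_000, 5.0

X = rng.integers(0, 2, size=(trials, n))
S = X.sum(axis=1)

# A = {x : sum(x) <= n/2}; flipping the excess ones to zeros is optimal,
# so d(x, A) = (sum(x) - n/2)_+.
pA = np.mean(S <= n / 2)
d = np.maximum(S - n / 2, 0)

thresh = t + np.sqrt(n / 2 * np.log(1 / pA))
print(f"P(d(X,A) >= {thresh:.2f}) ~ {np.mean(d >= thresh):.5f}")
print(f"bound exp(-2 t^2 / n)    = {np.exp(-2 * t**2 / n):.5f}")
```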
talagrand's convex distance

The weighted Hamming distance is
$$d_\alpha(x, A) = \inf_{y \in A} d_\alpha(x, y) = \inf_{y \in A}\sum_{i: x_i \ne y_i} \alpha_i,$$
where $\alpha = (\alpha_1, \dots, \alpha_n)$. The same argument as before gives
$$\mathbf{P}\left\{d_\alpha(X, A) \ge t + \sqrt{\frac{\|\alpha\|^2}{2}\log\frac{1}{\mathbf{P}\{A\}}}\right\} \le e^{-2t^2/\|\alpha\|^2}.$$
This implies
$$\sup_{\alpha: \|\alpha\| = 1}\min\big(\mathbf{P}\{A\},\, \mathbf{P}\{d_\alpha(X, A) \ge t\}\big) \le e^{-t^2/2}.$$
convex distance inequality

convex distance:
$$d_T(x, A) = \sup_{\alpha \in [0, \infty)^n: \|\alpha\| = 1} d_\alpha(x, A).$$
Talagrand's convex distance inequality:
$$\mathbf{P}\{A\}\,\mathbf{P}\{d_T(X, A) \ge t\} \le e^{-t^2/4}.$$
Follows from the fact that $d_T(X, A)^2$ is $(4, 0)$ weakly self-bounding (by a saddle point representation of $d_T$). Talagrand's original proof was different. It can also be recovered from Marton's transportation inequality.
convex lipschitz functions

For $A \subset [0, 1]^n$ and $x \in [0, 1]^n$, define
$$D(x, A) = \inf_{y \in A} \|x - y\|.$$
If $A$ is convex, then $D(x, A) \le d_T(x, A)$.

Proof: Let $\mathcal{M}(A)$ denote the set of probability measures on $A$. Then
$$\begin{aligned}
D(x, A) &= \inf_{\nu \in \mathcal{M}(A)} \|x - \mathbf{E}_\nu Y\| \quad \text{(since $A$ is convex)} \\
&\le \inf_{\nu \in \mathcal{M}(A)} \sqrt{\sum_{j=1}^n \big(\mathbf{E}_\nu \mathbb{1}_{x_j \ne Y_j}\big)^2} \quad \text{(since $x_j, Y_j \in [0, 1]$)} \\
&= \inf_{\nu \in \mathcal{M}(A)} \sup_{\alpha: \|\alpha\| \le 1} \sum_{j=1}^n \alpha_j\,\mathbf{E}_\nu \mathbb{1}_{x_j \ne Y_j} \quad \text{(by Cauchy-Schwarz)} \\
&= d_T(x, A) \quad \text{(by minimax theorem)}.
\end{aligned}$$
convex lipschitz functions

Let $X = (X_1, \dots, X_n)$ have independent components taking values in $[0, 1]$. Let $f : [0, 1]^n \to \mathbb{R}$ be quasi-convex such that $|f(x) - f(y)| \le \|x - y\|$. Then
$$\mathbf{P}\{f(X) > \mathbf{M}f(X) + t\} \le 2e^{-t^2/4} \quad \text{and} \quad \mathbf{P}\{f(X) < \mathbf{M}f(X) - t\} \le 2e^{-t^2/4},$$
where $\mathbf{M}f(X)$ denotes a median of $f(X)$.

Proof: Let $A_s = \{x : f(x) \le s\} \subset [0, 1]^n$. $A_s$ is convex. Since $f$ is Lipschitz,
$$f(x) \le s + D(x, A_s) \le s + d_T(x, A_s),$$
so by the convex distance inequality,
$$\mathbf{P}\{f(X) \ge s + t\}\,\mathbf{P}\{f(X) \le s\} \le e^{-t^2/4}.$$
Take $s = \mathbf{M}f(X)$ for the upper tail and $s = \mathbf{M}f(X) - t$ for the lower tail.
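A minimal simulation sketch, added as an illustration: $f(x) = \|x\|$ is convex (hence quasi-convex) and 1-Lipschitz on $[0, 1]^n$, so both tails around the median obey the $2e^{-t^2/4}$ bound.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, t = 50, 200_000, 1.0

# f(x) = ||x|| is convex (hence quasi-convex) and 1-Lipschitz on [0, 1]^n.
Z = np.linalg.norm(rng.random((trials, n)), axis=1)
med = np.median(Z)

print(f"P(f > M + t) ~ {np.mean(Z > med + t):.6f}")
print(f"P(f < M - t) ~ {np.mean(Z < med - t):.6f}")
print(f"bound 2 exp(-t^2/4) = {2 * np.exp(-t**2 / 4):.4f}")
```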
Φ entropies

For a convex function $\Phi$ on $[0, \infty)$, the $\Phi$-entropy of $Z \ge 0$ is
$$H_\Phi(Z) = \mathbf{E}\Phi(Z) - \Phi(\mathbf{E}Z).$$
$H_\Phi$ is subadditive:
$$H_\Phi(Z) \le \sum_{i=1}^n \mathbf{E}H_\Phi^{(i)}(Z)$$
if (and only if) $\Phi$ is twice differentiable on $(0, \infty)$, and either $\Phi$ is affine or $\Phi''$ is strictly positive and $1/\Phi''$ is concave.

$\Phi(x) = x^2$ corresponds to Efron-Stein. $\Phi(x) = x\log x$ is subadditivity of entropy. We may consider $\Phi(x) = x^p$ for $p \in (1, 2]$.
generalized efron-stein

Define
$$Z_i' = f(X_1, \dots, X_{i-1}, X_i', X_{i+1}, \dots, X_n), \qquad V^+ = \sum_{i=1}^n (Z - Z_i')_+^2.$$
For $q \ge 2$ and $q/2 \le \alpha \le q - 1$,
$$\mathbf{E}\big[(Z - \mathbf{E}Z)_+^q\big] \le \big(\mathbf{E}\big[(Z - \mathbf{E}Z)_+^\alpha\big]\big)^{q/\alpha} + \alpha(q - \alpha)\,\mathbf{E}\big[V^+(Z - \mathbf{E}Z)_+^{q-2}\big],$$
and similarly for $\mathbf{E}\big[(Z - \mathbf{E}Z)_-^q\big]$.
moment inequalities

We may solve the recursions, for $q \ge 2$.

If $V^+ \le c$ for some constant $c \ge 0$, then for all integers $q \ge 2$,
$$\big(\mathbf{E}\big[(Z - \mathbf{E}Z)_+^q\big]\big)^{1/q} \le \sqrt{Kqc},$$
where $K = 1/(e - \sqrt{e}) < 0.935$.

More generally,
$$\big(\mathbf{E}\big[(Z - \mathbf{E}Z)_+^q\big]\big)^{1/q} \le \sqrt{1.6\,q}\,\big(\mathbf{E}\big[V^{+\,q/2}\big]\big)^{1/q}.$$
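A minimal numerical sketch, added as an illustration: for the normalized Rademacher sum $Z = n^{-1/2}\sum_i \varepsilon_i$, each term satisfies $(Z - Z_i')_+^2 \le (2/\sqrt{n})^2 = 4/n$, so $V^+ \le 4$ deterministically and the moment bound applies with $c = 4$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials, c = 100, 1_000_000, 4.0
K = 1 / (np.e - np.sqrt(np.e))  # < 0.935

# Z = (1/sqrt(n)) * (sum of n Rademacher signs), so E Z = 0. Each
# (Z - Z_i')_+^2 <= 4/n, hence V+ <= 4 for every realization.
Z = (2 * rng.binomial(n, 0.5, size=trials) - n) / np.sqrt(n)

for q in (2, 4, 6, 8):
    moment = np.mean(np.maximum(Z, 0.0) ** q) ** (1 / q)
    print(f"q={q}: (E(Z-EZ)_+^q)^(1/q) = {moment:.3f}  <=  sqrt(Kqc) = {np.sqrt(K * q * c):.3f}")
```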