Concentration of Measure with Applications in Information Theory, Communications, and Coding
1 Concentration of Measure with Applications in Information Theory, Communications, and Coding Maxim Raginsky (UIUC) and Igal Sason (Technion) A Tutorial, Presented at 2015 IEEE International Symposium on Information Theory (ISIT) Hong Kong, June 2015 Part 2 of 2
2 The Plan 1. Prelude: The Chernoff bound 2. The entropy method 3. Logarithmic Sobolev inequalities 4. Transportation cost inequalities 5. Some applications
3 Given: X_1, X_2, ..., X_n: independent random variables; Z = f(X^n), for some real-valued f.
Problem: derive sharp bounds on the deviation probabilities P[Z − EZ ≥ t], for t ≥ 0.
Benchmark: X_1, ..., X_n i.i.d. N(0, σ²), Z = f(X^n) = (1/n)(X_1 + ... + X_n) (sample mean).
Goal: P[Z ≥ t] ≤ exp(−nt²/2σ²); extend to other distributions besides Gaussians; extend to nonlinear f.
4 Prelude: The Chernoff Bound
5 Review: The Chernoff Bound.
Define: the logarithmic moment-generating function ψ(λ) ≜ log E[e^{λ(Z−EZ)}] and its Legendre dual ψ*(t) ≜ sup_{λ≥0} {λt − ψ(λ)}.
Markov's inequality: P[Z − EZ ≥ t] = P[e^{λ(Z−EZ)} ≥ e^{λt}] ≤ e^{−λt} E[e^{λ(Z−EZ)}] = e^{−(λt − ψ(λ))}.
Optimize over λ: P[Z − EZ ≥ t] ≤ e^{−ψ*(t)}.
Sanity check: Z ~ N(0, σ²) ⇒ ψ(λ) = λ²σ²/2 and ψ*(t) = t²/2σ², so P[Z ≥ t] ≤ e^{−t²/2σ²}.
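As a quick numerical sanity check of the Chernoff calculation (an illustrative sketch, not part of the slides; σ², t, and the sample size are arbitrary choices), one can compare the optimized bound e^{−t²/2σ²} with the empirical Gaussian tail:

```python
import math, random

def gaussian_chernoff_bound(t, sigma2):
    # optimized Chernoff bound for Z ~ N(0, sigma^2): exp(-t^2 / (2 sigma^2))
    return math.exp(-t ** 2 / (2 * sigma2))

random.seed(0)
sigma2, t, n_samples = 1.0, 2.0, 100_000
samples = [random.gauss(0.0, math.sqrt(sigma2)) for _ in range(n_samples)]
empirical_tail = sum(z >= t for z in samples) / n_samples
bound = gaussian_chernoff_bound(t, sigma2)
```

The empirical tail (about 0.023 for t = 2, σ² = 1) sits well below the bound e^{−2} ≈ 0.135, as expected: the Chernoff bound is valid but loses a polynomial factor.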
6 Chernoff Bound: Subgaussian Random Variables.
Definition. A real-valued r.v. Z is σ²-subgaussian if ψ(λ) ≤ λ²σ²/2.
Immediate: if Z is σ²-subgaussian, then ψ*(t) = sup_{λ≥0} {λt − ψ(λ)} ≥ sup_{λ≥0} {λt − λ²σ²/2} = t²/2σ², giving the Gaussian tail bound P[Z − EZ ≥ t] ≤ e^{−t²/2σ²}.
How do we establish subgaussianity?
7 Review: Hoeffding's Lemma.
Any almost surely bounded r.v. is subgaussian: if there exist −∞ < a ≤ b < ∞ such that Z ∈ [a, b] a.s., then ψ(λ) ≤ λ²(b−a)²/8.
Corollary. If Z ∈ [a, b] a.s., then P[Z − EZ ≥ t] ≤ exp(−2t²/(b−a)²).
Wassily Hoeffding.
Proof (of Corollary). By Hoeffding's lemma, Z is subgaussian with σ² = (b−a)²/4.
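Applying Hoeffding's lemma to each summand of a sample mean of independent bounded variables yields Hoeffding's inequality, P[(1/n)ΣZ_i − EZ ≥ t] ≤ exp(−2nt²/(b−a)²). A small simulation (illustrative choices: Uniform[0,1] variables, n = 100, t = 0.1) checks the bound:

```python
import math, random

def hoeffding_bound(t, n, a, b):
    # P[(1/n) * sum(Z_i) - E[(1/n) * sum(Z_i)] >= t] <= exp(-2 n t^2 / (b - a)^2)
    return math.exp(-2 * n * t ** 2 / (b - a) ** 2)

random.seed(1)
n, t, trials = 100, 0.1, 20_000
exceed = 0
for _ in range(trials):
    mean = sum(random.random() for _ in range(n)) / n   # Z_i ~ Uniform[0, 1], mean 1/2
    if mean - 0.5 >= t:
        exceed += 1
empirical = exceed / trials
bound = hoeffding_bound(t, n, 0.0, 1.0)
```

Here the bound is e^{−2} ≈ 0.135, while the empirical deviation probability is tiny — Hoeffding ignores the variance (1/12 for Uniform[0,1]), which is much smaller than the worst case (b−a)²/4.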
8 Hoeffding's Lemma: An Alternative Proof.
1. Assume without loss of generality that EZ = 0.
2. Compute the first two derivatives of ψ: ψ′(λ) = E[Z e^{λZ}]/E[e^{λZ}], ψ″(λ) = E[Z² e^{λZ}]/E[e^{λZ}] − (E[Z e^{λZ}]/E[e^{λZ}])².
3. Tilted distribution: P = L(Z), dQ/dP(Z) = e^{λZ}/E_P[e^{λZ}]. Then ψ′(λ) = E_Q[Z], ψ″(λ) = Var_Q[Z].
4. Z ∈ [a, b] P-a.s. ⇒ Z ∈ [a, b] Q-a.s. ⇒ Var_Q[Z] ≤ (b−a)²/4.
5. Calculus: ψ(λ) = ∫₀^λ ∫₀^τ ψ″(ρ) dρ dτ ≤ λ²(b−a)²/8.
9 The Entropy Method
10 Exponential Tilting and (Relative) Entropy.
Back to our setting: Z = f(X), X an arbitrary r.v. Want to prove subgaussianity of Z, so we need to analyze ψ(λ) = log E[e^{λ(Z−EZ)}] = log E[e^{λ(f(X)−Ef(X))}].
Let P = L(X), and introduce the tilted distribution P^{λf}: dP^{λf}/dP(X) = e^{λf(X)}/E[e^{λf(X)}].
We will relate ψ(λ) to the relative entropy D(P^{λf}‖P).
11 Exponential Tilting and (Relative) Entropy.
Tilting of P = L(X): dP^{λf}/dP(X) = e^{λf(X)}/E[e^{λf(X)}] = e^{λ(f(X)−Ef(X))} e^{−ψ(λ)}.
Relative entropy:
D(P^{λf}‖P) = ∫ dP^{λf} log (dP^{λf}/dP) = ∫ dP^{λf} (λ(f − E_P f) − ψ(λ)) = λ E_P[(f − E_P f) e^{λ(f−E_P f)}] e^{−ψ(λ)} − ψ(λ) = λψ′(λ) − ψ(λ).
With a bit of foresight, λψ′(λ) − ψ(λ) = λ²(ψ′(λ)/λ − ψ(λ)/λ²) = λ² (d/dλ)(ψ(λ)/λ).
12 The Herbst Argument.
Tilting of P = L(X): dP^{λf}/dP(X) = e^{λf(X)}/E[e^{λf(X)}].
Relative entropy: D(P^{λf}‖P) = λ² (d/dλ)(ψ(λ)/λ).
Since lim_{λ↓0} ψ(λ)/λ = 0 (by l'Hôpital), we have ψ(λ)/λ = ∫₀^λ D(P^{ρf}‖P)/ρ² dρ.
Suppose now that P = L(X) and f are such that D(P^{ρf}‖P) ≤ ρ²σ²/2 for all ρ ≥ 0, for some σ². Then ψ(λ) ≤ λ ∫₀^λ (ρ²σ²/2)/ρ² dρ = λ²σ²/2.
13 The Herbst Argument.
Lemma (Herbst, 1975). Suppose that Z = f(X) is such that D(P^{λf}‖P) ≤ λ²σ²/2 for all λ ≥ 0. Then Z is σ²-subgaussian, and so P[f(X) − Ef(X) ≥ t] ≤ e^{−t²/2σ²}, t ≥ 0.
Ira Herbst.
"For the sake of completeness, we ... prove a theorem which states roughly that the potential of an intrinsically hypercontractive Schrödinger operator must increase at least quadratically at infinity. ... [I]ts complete proof requires information in an unpublished letter of 1975 from I. Herbst to L. Gross. [C]ertain steps in the argument ... can be written down abstractly." — from a 1984 paper by B. Simon and E.B. Davies
14 The Herbst Converse.
Lemma (R. van Handel, 2014). Suppose Z = f(X) is σ²/4-subgaussian. Then D(P^{λf}‖P) ≤ λ²σ²/2 for all λ ≥ 0.
Proof. Let Ẽ[·] ≜ E_{P^{λf}}[·]. Then
D(P^{λf}‖P) = Ẽ[log (dP^{λf}/dP)]
≤ log Ẽ[dP^{λf}/dP]   (Jensen)
= log Ẽ[e^{λ(f(X)−Ef(X))}/E[e^{λ(f(X)−Ef(X))}]]
= log E[e^{2λ(f−Ef)}] − 2 log E[e^{λ(f−Ef)}]
≤ log E[e^{2λ(f−Ef)}]   (since E[e^{λ(f−Ef)}] ≥ 1)
≤ (2λ)²(σ²/4)/2 = λ²σ²/2.
15 The Herbst Argument: What Is It Good For?
Subgaussianity of Z = f(X) is equivalent to D(P^{λf}‖P) = O(λ²), but what does that give us?
Recall: we are interested in high-dimensional settings Z = f(X^n) = f(X_1, ..., X_n), with X_1, ..., X_n independent r.v.'s. Thus, P is a product measure: P = P_1 ⊗ P_2 ⊗ ... ⊗ P_n, P_i ≜ L(X_i).
The relative entropy tensorizes!! We can break the (hard) n-dimensional problem into n (hopefully) easier 1-dimensional problems.
16 Tensorization.
X_1, ..., X_n independent r.v.'s, P = L(X^n) = P_1 ⊗ ... ⊗ P_n, P_i = L(X_i). Write X^{−i} ≜ (X_1, ..., X_{i−1}, X_{i+1}, ..., X_n).
Recall the Efron–Stein–Steele inequality (tensorization of variance): Var_P[f(X^n)] ≤ Σ_{i=1}^n E[Var_P[f(X^n) | X^{−i}]].
Tensorization of relative entropy: for an arbitrary probability measure Q on X^n,
D(Q‖P) ≤ Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ P_{X_i} | Q_{X^{−i}})   (conditional divergence).
Independence of the X_i's is key.
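The Efron–Stein–Steele inequality can be probed numerically via the identity Var[f(X^n) | X^{−i}] = ½ E[(f(X^n) − f(X^{(i)}))² | X^{−i}], where X^{(i)} replaces the ith coordinate by an independent copy. The sketch below (with the illustrative choice f = max over i.i.d. Uniform[0,1] coordinates) estimates both sides by Monte Carlo:

```python
import random

def efron_stein_check(f, n, sample, trials, seed=2):
    # Monte Carlo check of Efron-Stein-Steele:
    #   Var[f(X^n)] <= (1/2) * sum_i E[(f(X) - f(X^(i)))^2],
    # where X^(i) replaces coordinate i by an independent copy.
    rng = random.Random(seed)
    vals, rhs = [], 0.0
    for _ in range(trials):
        x = [sample(rng) for _ in range(n)]
        fx = f(x)
        vals.append(fx)
        for i in range(n):
            y = list(x)
            y[i] = sample(rng)          # resample the ith coordinate
            rhs += 0.5 * (fx - f(y)) ** 2
    mean = sum(vals) / trials
    var = sum((v - mean) ** 2 for v in vals) / trials
    return var, rhs / trials

var_f, es_bound = efron_stein_check(f=max, n=10, sample=lambda r: r.random(), trials=5000)
```

For the maximum of 10 uniforms the true variance is ≈ 0.0069, and the Efron–Stein side comes out larger by a modest constant factor — the bound is sharp in order but not exact.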
17 Tensorization: A Quick Proof.
D(Q‖P) = Σ_{i=1}^n D(Q_{X_i|X^{i−1}} ‖ P_{X_i} | Q_{X^{i−1}})   (chain rule; independence makes P_{X_i|X^{i−1}} = P_{X_i})
= Σ_{i=1}^n E_Q[log (dQ_{X_i|X^{i−1}}/dP_{X_i})]
= Σ_{i=1}^n E_Q[log (dQ_{X_i|X^{−i}}/dP_{X_i})] − Σ_{i=1}^n E_Q[log (dQ_{X_i|X^{−i}}/dQ_{X_i|X^{i−1}})]
= Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ P_{X_i} | Q_{X^{−i}}) − Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ Q_{X_i|X^{i−1}} | Q_{X^{−i}})
≤ Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ P_{X_i} | Q_{X^{−i}}) ≜ D^−(Q‖P),
since each subtracted term is a nonnegative conditional divergence. D^−(Q‖P) is the erasure divergence (Verdú–Weissman, 2008).
18 Tensorization and Tilting.
Recall: we are interested in D(Q‖P), where P = P_1 ⊗ ... ⊗ P_n and dQ = (e^{λf}/E_P[e^{λf}]) dP.
Then
dQ_{X_i|X^{−i}=x^{−i}}/dP_{X_i}(x_i) = E_P[e^{λf(X^n)} | X_i = x_i, X^{−i} = x^{−i}] / E_P[e^{λf(X^n)} | X^{−i} = x^{−i}] = e^{λf(x_1,...,x_{i−1},x_i,x_{i+1},...,x_n)} / E_{P_i}[e^{λf(x_1,...,x_{i−1},X_i,x_{i+1},...,x_n)}]
— remember, x^{−i} is fixed.
19 Tensorization and Tilting.
dP^{λf}_{X_i|X^{−i}=x^{−i}}/dP_{X_i}(x_i) = e^{λf(x_1,...,x_{i−1},x_i,x_{i+1},...,x_n)} / E_{P_i}[e^{λf(x_1,...,x_{i−1},X_i,x_{i+1},...,x_n)}].
For a fixed x^{−i}, define the function f_i(·|x^{−i}) : X → R by f_i(x_i|x^{−i}) ≜ f(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) = f_i(x_i) (just shorthand; always remember x^{−i}).
Now observe that Q_{X_i|X^{−i}=x^{−i}} is the tilting of P_i ≡ P_{X_i}: dQ_{X_i|X^{−i}=x^{−i}} = (e^{λf_i}/E_{P_i}[e^{λf_i}]) dP_i.
20 Tensorization and Tilting.
Lemma. If X_1, ..., X_n are independent r.v.'s with joint law P = P_1 ⊗ ... ⊗ P_n, where P_i = L(X_i), then for any f : X^n → R,
D(P^{λf}‖P) ≤ Σ_{i=1}^n Ẽ[D(P_i^{λf_i}‖P_i)],
where Ẽ[·] ≜ E_{P^{λf}}[·] and f_i(·) = f_i(·|X^{−i}) for each i.
Proof.
D(P^{λf}‖P) ≤ Σ_{i=1}^n D(P^{λf}_{X_i|X^{−i}} ‖ P_{X_i} | P^{λf}_{X^{−i}}) = Σ_{i=1}^n Ẽ[log (dP^{λf}_{X_i|X^{−i}}/dP_{X_i})] = Σ_{i=1}^n Ẽ[log (dP_i^{λf_i}/dP_i)] = Σ_{i=1}^n Ẽ[D(P_i^{λf_i}‖P_i)].
21 The Entropy Method: Divide and Conquer.
Want to derive a subgaussian tail bound P[f(X^n) − Ef(X^n) ≥ t] ≤ e^{−t²/2σ²}, t ≥ 0, where X_1, ..., X_n are independent r.v.'s.
Suppose we can prove that there exist constants c_1, ..., c_n such that
(*)  D(P_i^{λf_i}‖P_i) ≤ λ²c_i²/2,  i ∈ {1, ..., n}.
Then D(P^{λf}‖P) ≤ (tensorization) ≤ λ² Σ_{i=1}^n c_i² / 2, and (Herbst) σ² = Σ_{i=1}^n c_i².
Now we just need to prove (*)!!
22 Logarithmic Sobolev Inequalities
23 Log-Sobolev in a Nutshell.
Goal: control the relative entropy D(P^f‖P). A log-Sobolev inequality ties together: (i) the underlying probability measure P; (ii) a function class A (containing the f of interest); (iii) an energy functional E : A → R such that E(αf) = αE(f) for all f ∈ A, α ≥ 0; and looks like this:
D(P^f‖P) ≤ (c/2) E²(f),  f ∈ A.
In that case, if E(f) ≤ L, then D(P^{λf}‖P) ≤ (c/2) E²(λf) = (c/2) λ² E²(f) ≤ λ²cL²/2.
The name comes from an analogy with Sobolev inequalities in functional analysis.
24 The Bernoulli Log-Sobolev Inequality.
Theorem (Gross, 1975). Let X_1, ..., X_n be i.i.d. Bern(1/2) random variables. Then, for any function f : {0,1}^n → R,
D(P^f‖P) ≤ (1/8) E[|Df(X^n)|² e^{f(X^n)}] / E[e^{f(X^n)}],
where |Df(x^n)|² ≜ Σ_{i=1}^n (f(x^n) − f(x^n ⊕ e_i))² (flip the ith bit), and E[·] is w.r.t. P = (Bern(1/2))^{⊗n}.
Leonard Gross.
Remarks: This is not the original form of the inequality from Gross' 1975 paper, but they are equivalent. Note that |D(λf)| = λ|Df| for all λ ≥ 0.
25 Bernoulli LSI: Proof Sketch.
Consider first n = 1 and f : {0,1} → R with a = f(0) and b = f(1). In that case, the log-Sobolev inequality reads
(e^a/(e^a + e^b)) log (2e^a/(e^a + e^b)) + (e^b/(e^a + e^b)) log (2e^b/(e^a + e^b)) ≤ (1/8)(b − a)².
Proof: elementary (but tedious) exercise in calculus.
Tensorization: D(P^f‖P) ≤ Σ_{i=1}^n Ẽ[D(P_i^{f_i}‖P_i)], where Ẽ[h(X^n)] = E[h(X^n) e^{f(X^n)}]/E[e^{f(X^n)}],
≤ (1/8) Σ_{i=1}^n Ẽ[(f(X^{i−1}, 0, X^n_{i+1}) − f(X^{i−1}, 1, X^n_{i+1}))²] = (1/8) Ẽ[|Df(X^n)|²] = (1/8) E[|Df(X^n)|² e^{f(X^n)}]/E[e^{f(X^n)}].
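The scalar (n = 1) inequality above is easy to check numerically over a grid of (a, b) — an illustrative verification, not a proof:

```python
import math

def bernoulli_lsi_sides(a, b):
    # n = 1 Bernoulli LSI for P = Bern(1/2), f(0) = a, f(1) = b:
    # LHS is D(P^f || P); RHS is the log-Sobolev bound (b - a)^2 / 8.
    za, zb = math.exp(a), math.exp(b)
    z = za + zb
    lhs = (za / z) * math.log(2 * za / z) + (zb / z) * math.log(2 * zb / z)
    return lhs, (b - a) ** 2 / 8

# check LHS <= RHS on a grid of (a, b) in [-3, 3]^2
ok = all(lhs <= rhs + 1e-12
         for a in range(-30, 31, 3) for b in range(-30, 31, 3)
         for lhs, rhs in [bernoulli_lsi_sides(a / 10, b / 10)])
```

Note that the left side is always at most log 2 (the entropy of a bit), while the right side grows quadratically, so only the near-diagonal region is interesting.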
26 The Gaussian Log-Sobolev Inequality.
Theorem (Gross, 1975). Let X_1, ..., X_n be i.i.d. N(0,1) random variables. Then, for any smooth function f : R^n → R,
D(P^f‖P) ≤ (1/2) E[‖∇f(X^n)‖²₂ e^{f(X^n)}] / E[e^{f(X^n)}],
where all expectations are w.r.t. P = L(X^n) = N(0, I_n).
Leonard Gross.
Remarks: This is not the original form of the inequality from Gross' 1975 paper, but they are equivalent. Equivalent forms of the Gaussian LSI were obtained independently by A. Stam (1959) and by P. Federbush (1969). The contribution of Stam (via the entropy power inequality) was first pointed out by E.A. Carlen in 1991.
27 Proof(s) of the Gaussian LSI.
There are many ways of proving the Gaussian log-Sobolev inequality.
Original proof by Gross: apply the Bernoulli LSI to f((X_1 + ... + X_n − n/2)/√(n/4)) with X_i i.i.d. Bern(1/2), then use the Central Limit Theorem: (X_1 + ... + X_n − n/2)/√(n/4) → N(0, 1) in distribution as n → ∞.
Other routes: via Markov semigroups; via hypercontractivity (E. Nelson); via Stam's inequality for entropy power and Fisher information; via the I-MMSE relation (cf. Raginsky and Sason); ...
28 Application of Gaussian LSI.
Theorem (Tsirelson–Ibragimov–Sudakov, 1976). Let X_1, ..., X_n be independent N(0,1) random variables. Let f : R^n → R be a function which is L-Lipschitz: |f(x^n) − f(y^n)| ≤ L‖x^n − y^n‖₂ for all x^n, y^n ∈ R^n. Then Z = f(X^n) is L²-subgaussian:
log E[e^{λ(f(X^n)−Ef(X^n))}] ≤ λ²L²/2,  λ ≥ 0.
Remarks: The original proof did not rely on the Gaussian LSI. It is a striking result: f can be an arbitrary nonlinear function, and the subgaussian constant is independent of the dimension n.
29 Tsirelson–Ibragimov–Sudakov: Proof via LSI.
By an approximation argument, we can assume that f is differentiable. Since it is L-Lipschitz, ‖∇f‖₂ ≤ L. By the Gaussian LSI, for any λ ≥ 0,
D(P^{λf}‖P) ≤ (1/2) E[‖∇(λf)(X^n)‖²₂ e^{λf(X^n)}]/E[e^{λf(X^n)}] = (λ²/2) E[‖∇f(X^n)‖²₂ e^{λf(X^n)}]/E[e^{λf(X^n)}] ≤ λ²L²/2.
By the Herbst argument, log E[e^{λ(f(X^n)−Ef(X^n))}] ≤ λ²L²/2.
30 A Gaussian Concentration Bound.
The Tsirelson–Ibragimov–Sudakov inequality gives us:
Corollary. Let X_1, ..., X_n be i.i.d. N(0,1), and let f : R^n → R be an L-Lipschitz function. Then P[f(X^n) − Ef(X^n) ≥ t] ≤ e^{−t²/2L²}.
Proof. Use the Chernoff bound.
Remarks: This is an example of dimension-free concentration: the tail bound does not depend on n. Applying the same result to −f and using the union bound, we get P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2e^{−t²/2L²}.
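A quick simulation of the corollary (illustrative choices: f(x) = ‖x‖₂, which is 1-Lipschitz, together with arbitrary n and t; the true mean is replaced by its empirical estimate) shows the dimension-free bound in action:

```python
import math, random

def lipschitz_tail_demo(n=100, t=2.0, trials=10_000, seed=3):
    # f(x) = ||x||_2 is 1-Lipschitz, so Tsirelson-Ibragimov-Sudakov gives
    #   P[f(X^n) - E f(X^n) >= t] <= exp(-t^2 / 2),  independently of n.
    rng = random.Random(seed)
    vals = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
            for _ in range(trials)]
    mean = sum(vals) / trials                 # empirical stand-in for E f(X^n)
    tail = sum(v - mean >= t for v in vals) / trials
    return tail, math.exp(-t ** 2 / 2)

tail, bound = lipschitz_tail_demo()
```

For the Euclidean norm the actual fluctuations are even smaller than the bound suggests (the norm of an N(0, I_n) vector has standard deviation ≈ 1/√2 for large n), so the empirical tail lands far below e^{−2}.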
31 Deriving Log-Sobolev (1).
Are there systematic ways to derive log-Sobolev inequalities? The usual (probabilistic) approach (a subtle art):
Construct a continuous-time Markov process {X_t}_{t≥0} with stationary distribution P and Markov generator Lf(x) ≜ lim_{t↓0} (E[f(X_t) | X_0 = x] − f(x))/t.
Use the structure of L to obtain an inequality of the form D(P^f‖P) ≤ (c/2) E(e^f, f)/E_P[e^f], where E(g, h) ≜ −E_P[h(X) Lg(X)] is the Dirichlet form.
Extract Γ by looking for a bound of the form E(e^f, f) ≤ E_P[|Γf(X)|² e^{f(X)}].
Different choices of L (for the same P) will yield different Γ's, and hence different log-Sobolev inequalities.
32 Deriving Log-Sobolev (2).
An alternative (information-theoretic) approach, based on a recent paper of A. Maurer (2012). It exploits a representation of D(P^{λf}‖P) in terms of the variance of f(X) under the tilted distributions dP^{sf} = (e^{sf}/E_P[e^{sf}]) dP.
An interpretation in terms of statistical physics: think of f as energy and of s ≥ 0 as inverse temperature. Then
Var^{sf}[f(X)] ≜ E_P[f²(X) e^{sf(X)}]/E_P[e^{sf(X)}] − (E_P[f(X) e^{sf(X)}]/E_P[e^{sf(X)}])²
gives the thermal fluctuations of f at temperature T = 1/s. Infinite-temperature limit (T → ∞): recover Var_P[f(X)].
33 Entropy via Thermal Fluctuations.
Theorem (A. Maurer, 2012). Let X be a random variable with law P. Then for any real-valued function f and any λ ≥ 0,
D(P^{λf}‖P) = ∫₀^λ ∫_t^λ Var^{sf}[f(X)] ds dt,
where Var^{sf}[f(X)] is the variance of f(X) under the tilted distribution P^{sf}.
Andreas Maurer.
Recall: Var^{sf}[f(X)] = E[f²(X) e^{sf(X)}]/E[e^{sf(X)}] − (E[f(X) e^{sf(X)}]/E[e^{sf(X)}])² = ψ″(s), where ψ(s) = log E[e^{s(f(X)−Ef(X))}].
34 Proof.
Recall D(P^{λf}‖P) = λψ′(λ) − ψ(λ), where ψ(λ) = log E[e^{λ(f−Ef)}]. Since ψ(0) = ψ′(0) = 0, we have λψ′(λ) = ∫₀^λ ψ′(λ) dt and ψ(λ) = ∫₀^λ ψ′(t) dt. Substitute:
D(P^{λf}‖P) = ∫₀^λ (ψ′(λ) − ψ′(t)) dt = ∫₀^λ ∫_t^λ ψ″(s) ds dt = ∫₀^λ ∫_t^λ Var^{sf}[f(X)] ds dt.
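Maurer's identity can be verified numerically for a small discrete law. Swapping the order of integration turns the double integral into ∫₀^λ s · Var^{sf}[f(X)] ds, which a midpoint rule approximates well (the law P, the function f, and λ below are illustrative choices):

```python
import math

# a small discrete law P, a function f, and a tilting parameter (illustrative)
p = [0.2, 0.5, 0.3]
f = [1.0, -0.5, 2.0]
lam = 1.3

def tilt(s):
    # tilted law dP^{sf} = (e^{sf} / E_P[e^{sf}]) dP
    w = [pi * math.exp(s * fi) for pi, fi in zip(p, f)]
    z = sum(w)
    return [wi / z for wi in w]

def tilted_var(s):
    q = tilt(s)
    m = sum(qi * fi for qi, fi in zip(q, f))
    return sum(qi * (fi - m) ** 2 for qi, fi in zip(q, f))

# left side: D(P^{lam*f} || P), computed directly
q = tilt(lam)
direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# right side: int_0^lam int_t^lam Var ds dt = int_0^lam s * Var^{sf}[f] ds
# (midpoint rule)
N = 2000
h = lam / N
integral = sum((i + 0.5) * h * tilted_var((i + 0.5) * h) * h for i in range(N))
```

The two sides agree to within the quadrature error, which is tiny for a smooth integrand on a short interval.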
35 From Thermal Fluctuations to Log-Sobolev.
Theorem. Let A be a class of functions of X, and suppose that there is a mapping Γ : A → R such that:
1. For all f ∈ A and α ≥ 0, Γ(αf) = αΓ(f).
2. There exists a constant c > 0 such that Var^{λf}[f(X)] ≤ c Γ(f)² for all f ∈ A, λ ≥ 0.
Then D(P^{λf}‖P) ≤ (λ²c/2) Γ(f)², f ∈ A, λ ≥ 0.
Proof. D(P^{λf}‖P) ≤ c Γ(f)² ∫₀^λ ∫_t^λ ds dt = c Γ(f)² λ²/2.
36 From Thermal Fluctuations to Log-Sobolev: Example 1.
Let's use Maurer's method to derive the Bernoulli LSI. For any f : {0,1} → R, define Γ(f) ≜ |f(0) − f(1)|. Since f is obviously bounded, for every s ≥ 0 we have Var^{sf}[f(X)] ≤ (f(0) − f(1))²/4 = Γ(f)²/4. Finally,
D(P^f‖P) = ∫₀^1 ∫_t^1 Var^{sf}[f(X)] ds dt ≤ (Γ(f)²/4) ∫₀^1 ∫_t^1 ds dt = (1/8) Γ(f)².
For n > 1, use tensorization.
37 From Thermal Fluctuations to Log-Sobolev: Example 2.
Let's use Maurer's method to derive McDiarmid's inequality. We will use tensorization, so let's first consider n = 1. We are interested in all functions f : X → R such that sup_{x∈X} f(x) − inf_{x∈X} f(x) ≤ c for some c < ∞. Define Γ(f) ≜ sup_{x∈X} f(x) − inf_{x∈X} f(x).
Any f satisfies f(X) ∈ [inf f, sup f]. If Γ(f) < ∞, then [inf f, sup f] is a bounded interval. In that case, for any P,
Var^{sf}[f(X)] ≤ (sup f − inf f)²/4 = Γ(f)²/4.
Using the integral representation of the divergence, we get D(P^{λf}‖P) ≤ λ²c²/8 if sup f − inf f ≤ c.
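The key variance estimate — that a bounded f has tilted variance at most (sup f − inf f)²/4 under every tilting, which is just Popoviciu's inequality applied to the tilted law — can be checked numerically (P and f below are illustrative choices):

```python
import math

def tilted_var(p, f, s):
    # variance of f(X) under the tilted law dP^{sf} = (e^{sf} / E_P[e^{sf}]) dP
    w = [pi * math.exp(s * fi) for pi, fi in zip(p, f)]
    z = sum(w)
    q = [wi / z for wi in w]
    m = sum(qi * fi for qi, fi in zip(q, f))
    return sum(qi * (fi - m) ** 2 for qi, fi in zip(q, f))

p = [0.1, 0.2, 0.3, 0.4]          # illustrative law
f = [0.0, 1.5, -2.0, 0.7]         # illustrative bounded function
gamma2 = (max(f) - min(f)) ** 2   # Gamma(f)^2 = (sup f - inf f)^2
# worst tilted variance over a grid of inverse temperatures s in [0, 5]
worst = max(tilted_var(p, f, s / 10) for s in range(0, 51))
```

The point of the uniform-in-s bound is exactly what Maurer's representation needs: it holds for every tilting, so the double integral picks up the same constant.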
38 Proof of McDiarmid (cont.).
So far, we have obtained
(*)  D(P^{λf}‖P) ≤ λ²c²/8, if sup f − inf f ≤ c.
Let X_i ~ P_i, 1 ≤ i ≤ n, be independent r.v.'s. Consider f : X^n → R that has bounded differences:
sup_{x^{−i}} (sup_{x_i} f(x^{i−1}, x_i, x^n_{i+1}) − inf_{x_i} f(x^{i−1}, x_i, x^n_{i+1})) ≤ c_i
for all i, for some constants 0 ≤ c_1, ..., c_n < ∞. For each i, apply (*) to f_i(·) ≜ f(x^{i−1}, ·, x^n_{i+1}):
D(P_i^{λf_i}‖P_i) ≤ (λ²/8) (sup_{x_i} f(x^{i−1}, x_i, x^n_{i+1}) − inf_{x_i} f(x^{i−1}, x_i, x^n_{i+1}))²
[recall, f_i(·) depends on x^{−i}].
39 Proof of McDiarmid (cont.).
Now we tensorize: for P = P_1 ⊗ ... ⊗ P_n,
D(P^{λf}‖P) ≤ Σ_{i=1}^n Ẽ[D(P_i^{λf_i}‖P_i)] ≤ (λ²/8) Ẽ[Σ_{i=1}^n (sup_{x_i} f(X^{i−1}, x_i, X^n_{i+1}) − inf_{x_i} f(X^{i−1}, x_i, X^n_{i+1}))²] = (λ²/8) Ẽ[|Γf(X^n)|²].
40 Theorem. Let X_1, ..., X_n ∈ X be independent r.v.'s with joint law P = P_1 ⊗ ... ⊗ P_n. Then, for any function f : X^n → R,
D(P^f‖P) ≤ (1/8) sup_{x^n} |Γf(x^n)|²,
where
Γf(x^n) ≜ (Σ_{i=1}^n (sup_{x_i} f(x^{i−1}, x_i, x^n_{i+1}) − inf_{x_i} f(x^{i−1}, x_i, x^n_{i+1}))²)^{1/2},
the ith summand being |Γ_i f(·|x^{−i})|².
Remarks:
McDiarmid: if f has bounded differences with c_1, ..., c_n, then f(X^n) is (Σ_{i=1}^n c_i²/4)-subgaussian.
Since sup |Γf|² ≤ Σ_{i=1}^n sup |Γ_i f|², the above theorem is stronger than McDiarmid.
41 Transportation-Cost Inequalities
42 Concentration and the Lipschitz Property.
A common theme:
Let f : R^n → R be 1-Lipschitz w.r.t. the Euclidean norm: |f(x^n) − f(y^n)| ≤ ‖x^n − y^n‖₂. Let X_1, ..., X_n be i.i.d. N(0,1) r.v.'s. Then P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2e^{−t²/2}.
Let f : X^n → R be 1-Lipschitz w.r.t. a weighted Hamming metric: |f(x^n) − f(y^n)| ≤ d_c(x^n, y^n), where d_c(x^n, y^n) ≜ Σ_{i=1}^n c_i 1{x_i ≠ y_i} (this is equivalent to bounded differences). Let X_1, ..., X_n be independent r.v.'s. Then P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2 exp(−2t²/Σ_{i=1}^n c_i²).
43 The Setting: Probability in Metric Spaces.
Let (X, d) be a metric space. A function f : X → R is L-Lipschitz (w.r.t. d) if |f(x) − f(y)| ≤ L d(x, y) for all x, y ∈ X. Notation: Lip_L(X, d) — the class of all L-Lipschitz functions.
Question: What conditions does a probability measure P on X have to satisfy so that f(X), X ~ P, is σ²-subgaussian for every f ∈ Lip_1(X, d)?
44 Some Definitions.
A coupling of two probability measures P and Q on X is any probability measure π on X × X such that (X, Y) ~ π ⇒ X ~ P, Y ~ Q. Π(P, Q): the family of all couplings of P and Q.
Definition. For p ≥ 1, the L^p Wasserstein distance between P and Q is given by W_p(P, Q) ≜ inf_{π∈Π(P,Q)} (E_π[d^p(X, Y)])^{1/p}.
Kantorovich–Rubinstein formula. For any two P, Q:
W_1(P, Q) = sup_{f∈Lip_1(X,d)} {E_P[f] − E_Q[f]}.
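For distributions on the real line with d(x, y) = |x − y|, the optimal W₁ coupling is the monotone (sorted) one, which makes W₁ between equal-size empirical measures trivial to compute; the Kantorovich–Rubinstein dual then gives a lower bound from any single 1-Lipschitz test function (the data below are illustrative):

```python
def w1_empirical(xs, ys):
    # On the real line with d(x, y) = |x - y|, the optimal coupling is monotone:
    # sort both samples and pair order statistics.
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]       # empirical measure P (equal weights)
ys = [0.5, 0.5, 2.5, 4.0]       # empirical measure Q (equal weights)
w1 = w1_empirical(xs, ys)

# Kantorovich-Rubinstein: f(x) = x is 1-Lipschitz, so |E_P f - E_Q f| <= W1
kr_lower = abs(sum(xs) - sum(ys)) / len(xs)
```

Here W₁ = 0.625 while the single test function f(x) = x only certifies 0.375 — the supremum in the dual is over all of Lip₁, and one test function generally undershoots it.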
45 Optimal Transportation.
Monge–Kantorovich optimal transportation problem. Given two probability measures P, Q on a common space X and a cost function c : X × X → R_+, minimize E_π[c(X, Y)] over all couplings π ∈ Π(P, Q).
Interpretation: P and Q are the initial and final distributions of some material (say, sand) in space; c(x, y) is the cost of transporting a grain of sand from location x to location y; π(dy|x) is a randomized strategy for transporting from location x.
Wasserstein distances: the transportation cost is some power of a metric.
Gaspard Monge. Leonid Kantorovich.
46 Transportation Cost Inequalities.
Definition. A probability measure P on a metric space (X, d) satisfies an L^p transportation cost inequality with constant c, or T_p(c) for short, if
W_p(P, Q) ≤ √(2c D(Q‖P)),  for all Q.
Preview: We will be primarily concerned with T_1(c) and T_2(c) inequalities. T_1(c) is easier to work with, while T_2(c) has strong properties (dimension-free concentration).
47 Examples of TC Inequalities: 1.
X: arbitrary space with trivial metric d(x, y) = 1{x ≠ y}. Then
W_1(P, Q) = inf_{π∈Π(P,Q)} E_π[1{X ≠ Y}] = inf_{π∈Π(P,Q)} P_π[X ≠ Y] = ‖P − Q‖_TV,
the total variation distance (proof by explicit construction of the optimal coupling).
Any P satisfies T_1(1/4): ‖P − Q‖_TV ≤ √((1/2) D(Q‖P)) (Csiszár–Kemperman–Kullback–Pinsker).
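Pinsker's inequality ‖P − Q‖_TV ≤ √(D(Q‖P)/2) is easy to stress-test on random discrete distributions (an illustrative check; natural logarithms throughout):

```python
import math, random

def tv(p, q):
    # total variation distance between two pmfs on a common finite alphabet
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(q, p):
    # relative entropy D(Q || P), natural log; assumes p > 0 everywhere
    return sum(b * math.log(b / a) for a, b in zip(p, q) if b > 0)

def random_pmf(rng, k):
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(4)
ok = all(tv(p, q) <= math.sqrt(0.5 * kl(q, p)) + 1e-12
         for p, q in ((random_pmf(rng, 5), random_pmf(rng, 5)) for _ in range(1000)))
```

The small additive 1e-3 mass keeps P strictly positive so that D(Q‖P) is finite for every pair drawn.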
48 Examples of TC Inequalities: 2.
X = {0, 1} with trivial metric d(x, y) = 1{x ≠ y}, P = Bern(p).
Distribution-dependent refinement of Pinsker: ‖P − Q‖_TV ≤ √(2c(p) D(Q‖P)), where c(p) = (1 − 2p)/(2 log((1−p)/p)) for p ≠ 1/2, with c(1/2) = 1/4 by continuity (Ordentlich–Weinberger, 2005).
Thus, P = Bern(p) satisfies T_1(c(p)), and the constant is optimal:
inf_{Q≠P} D(Q‖P)/‖P − Q‖²_TV = 1/(2c(p)).
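A numerical check of the refined Pinsker inequality with this c(p) (illustrative: p = 0.1 and a grid of q; for Bernoulli measures ‖Bern(p) − Bern(q)‖_TV = |p − q|, and q = 1 − p appears to attain equality, reflecting the optimality of the constant):

```python
import math

def c_ow(p):
    # Ordentlich-Weinberger constant; c(1/2) = 1/4 by continuity
    if abs(p - 0.5) < 1e-12:
        return 0.25
    return (1 - 2 * p) / (2 * math.log((1 - p) / p))

def d_bern(q, p):
    # binary relative entropy D(Bern(q) || Bern(p)), natural log
    d = 0.0
    if q > 0:
        d += q * math.log(q / p)
    if q < 1:
        d += (1 - q) * math.log((1 - q) / (1 - p))
    return d

p = 0.1   # illustrative; any p in (0, 1) works
ok = all(abs(p - q) <= math.sqrt(2 * c_ow(p) * d_bern(q, p)) + 1e-12
         for q in (i / 100 for i in range(1, 100)))
```

Since c(0.1) ≈ 0.182 < 1/4, this really is a strict improvement over plain Pinsker for biased coins.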
49 Examples of TC Inequalities: 3.
X = R^n with d(x, y) = ‖x − y‖₂. The Gaussian measure P = N(0, I_n) satisfies T_2(1):
W_2(P, Q) ≤ √(2 D(Q‖P))   (Talagrand, 1996).
Note: the constant is independent of n!!
Michel Talagrand.
50 Transportation and Concentration.
Theorem (Bobkov–Götze, 1999). Let P be a probability measure on a metric space (X, d). Then the following two statements are equivalent:
1. f(X), X ~ P, is σ²-subgaussian for every f ∈ Lip_1(X, d).
2. P satisfies T_1(σ²) on (X, d): W_1(P, Q) ≤ √(2σ² D(Q‖P)) for all Q ≪ P.
Sergey Bobkov. Friedrich Götze.
Remarks: The connection between concentration and transportation inequalities was first pointed out by Marton (1986). Remarkable result: the concentration phenomenon for Lipschitz functions can be expressed purely in terms of probabilistic and metric structures.
51 Proof of Bobkov–Götze (ultra-light version, due to R. van Handel).
Statement 1 of the theorem is equivalent to
(*)  sup_{λ≥0} sup_{f∈Lip_1(X,d)} {log E_P[e^{λ(f−E_P f)}] − λ²σ²/2} ≤ 0.
Use the Gibbs variational principle: log E_P[e^h] = sup_Q {E_Q[h] − D(Q‖P)} for all h such that e^h ∈ L¹(P) (the supremum is achieved by the tilted distribution Q* = P^h). Then (*) is equivalent to
(**)  sup_{λ≥0} sup_{f∈Lip_1(X,d)} sup_Q {λ(E_Q[f] − E_P[f]) − D(Q‖P) − λ²σ²/2} ≤ 0.
52 Proof of Bobkov–Götze (cont.).
(**)  sup_{λ≥0} sup_{f∈Lip_1(X,d)} sup_Q {λ(E_Q[f] − E_P[f]) − D(Q‖P) − λ²σ²/2} ≤ 0.
Interchange the order of suprema: sup_{λ≥0} sup_{f∈Lip_1} sup_Q [...] = sup_Q sup_{λ≥0} sup_{f∈Lip_1} [...]. Then
(**) ⟺ sup_Q sup_{λ≥0} {λ W_1(P, Q) − D(Q‖P) − λ²σ²/2} ≤ 0   (by Kantorovich–Rubinstein)
⟺ sup_Q {W_1(P, Q) − √(2σ² D(Q‖P))} ≤ 0   (optimize over λ).
53 Tensorization of Transportation Inequalities At first sight, all we have is another equivalent characterization of concentration of Lipschitz functions. However, transportation inequalities tensorize!! Proof of tensorization is through a beautiful result on couplings by Katalin Marton.
54 The Marton Coupling.
Theorem (Marton, 1986). Let (X_i, P_i), 1 ≤ i ≤ n, be probability spaces. Let w_i : X_i × X_i → R_+, 1 ≤ i ≤ n, be positive weight functions, and let ϕ : R_+ → R_+ be a convex function. Suppose that, for each i,
inf_{π∈Π(P_i,Q)} ϕ(E_π[w_i(X, Y)]) ≤ 2σ² D(Q‖P_i),  for all Q.
Then the following holds for the product measure P = P_1 ⊗ ... ⊗ P_n on the product space X = X_1 × ... × X_n:
inf_{π∈Π(P,Q)} Σ_{i=1}^n ϕ(E_π[w_i(X_i, Y_i)]) ≤ 2σ² D(Q‖P)
for every Q on X.
Katalin Marton.
Proof idea: chain rule for relative entropy + conditional coupling + induction.
55 Tensorization of Transportation Cost Inequalities.
Theorem. Let (X_i, P_i, d_i), 1 ≤ i ≤ n, be probability metric spaces. If for some 1 ≤ p ≤ 2 each P_i satisfies T_p(c) on (X_i, d_i), then the product measure P = P_1 ⊗ ... ⊗ P_n on X = X_1 × ... × X_n satisfies T_p(c n^{2/p−1}) w.r.t. the metric
d_p(x^n, y^n) ≜ (Σ_{i=1}^n d_i^p(x_i, y_i))^{1/p}.
Remarks:
If each P_i satisfies T_1(c), then P = P_1 ⊗ ... ⊗ P_n satisfies T_1(cn) with respect to the metric Σ_i d_i. Note: the constant deteriorates with n.
If each P_i satisfies T_2(c), then P satisfies T_2(c) with respect to √(Σ_i d_i²). Note: the constant is independent of n.
56 Proof of Tensorization.
By hypothesis, for each i,
inf_{π∈Π(P_i,Q)} (E_π[d_i^p(X, Y)])^{2/p} ≤ 2c D(Q‖P_i),  for all Q
(the left-hand side is W²_{p,d_i}(P_i, Q)). Since 1 ≤ p ≤ 2, ϕ(u) = u^{2/p} is convex. Take w_i = d_i^p. Then
inf_{π∈Π(P_i,Q)} ϕ(E_π[w_i(X, Y)]) ≤ 2c D(Q‖P_i),  for all Q.
By Marton's coupling,
(•)  inf_{π∈Π(P,Q)} Σ_{i=1}^n (E_π[d_i^p(X_i, Y_i)])^{2/p} ≤ 2c D(Q‖P),  for all Q.
57 Proof of Tensorization (cont.).
We have shown that if P_i satisfies T_p(c) w.r.t. d_i for each i, then
(•)  inf_{π∈Π(P,Q)} Σ_{i=1}^n (E_π[d_i^p(X_i, Y_i)])^{2/p} ≤ 2c D(Q‖P),  for all Q.
We will now prove that LHS(•) ≥ n^{1−2/p} W²_{p,d_p}(P, Q). For any π ∈ Π(P, Q),
W²_{p,d_p}(P, Q) ≤ (E_π[Σ_{i=1}^n d_i^p(X_i, Y_i)])^{2/p} = (Σ_{i=1}^n E_π[d_i^p(X_i, Y_i)])^{2/p}   (linearity of expectation)
≤ n^{2/p−1} Σ_{i=1}^n (E_π[d_i^p(X_i, Y_i)])^{2/p}   (convexity of t ↦ t^{2/p}).
Take the infimum over all π ∈ Π(P, Q), and we are done.
58 (Yet Another) Proof of McDiarmid's Inequality.
Product probability space: (X_1 × ... × X_n, P_1 ⊗ ... ⊗ P_n). Given c_1, ..., c_n ≥ 0, equip X_i with d_{c_i}(x_i, y_i) ≜ c_i 1{x_i ≠ y_i}. Then any P_i on X_i satisfies T_1(c_i²/4):
W_{1,d_{c_i}}(P_i, Q) = c_i ‖P_i − Q‖_TV ≤ c_i √((1/2) D(Q‖P_i))   (by rescaling Pinsker).
By Marton's coupling, P = P_1 ⊗ ... ⊗ P_n satisfies T_1 with constant (1/4) Σ_{i=1}^n c_i² with respect to the metric
d_c(x^n, y^n) ≜ Σ_{i=1}^n d_{c_i}(x_i, y_i) = Σ_{i=1}^n c_i 1{x_i ≠ y_i}.
By Bobkov–Götze, this is equivalent to the subgaussian property of all f ∈ Lip_1(X_1 × ... × X_n, d_c) (i.e., all f with bounded differences):
P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2 exp(−2t²/Σ_{i=1}^n c_i²)   — McDiarmid.
59 Some Applications
60 The Blowing-Up Lemma.
Consider a product space Y^n equipped with the Hamming metric d(y^n, z^n) = Σ_{i=1}^n 1{y_i ≠ z_i}. For a set A ⊆ Y^n and for r ∈ {0, 1, ..., n}, define its r-blowup
[A]_r ≜ {z^n ∈ Y^n : min_{y^n∈A} d(z^n, y^n) ≤ r}.
The following result, in a different (asymptotic) form, was first proved by Ahlswede–Gács–Körner (1976); a simple proof was given by Marton (1986): Let Y_1, ..., Y_n be independent r.v.'s taking values in Y. Then for every set A ⊆ Y^n with P_{Y^n}(A) > 0,
P_{Y^n}([A]_r) ≥ 1 − exp(−(2/n) [(r − √((n/2) log(1/P_{Y^n}(A))))₊]²).
Informally, any set in a product space can be blown up to engulf most of the probability mass.
61 Marton's Proof.
Let P_i = L(Y_i), 1 ≤ i ≤ n; P = P_1 ⊗ ... ⊗ P_n = L(Y^n). By tensorization, P satisfies the TC inequality
(*)  W_1(P, Q) ≤ √((n/2) D(Q‖P)),  for all Q ∈ P(Y^n),
where W_1(P, Q) = inf_{π∈Π(P,Q)} E_π[Σ_{i=1}^n 1{Y_i ≠ Z_i}], Y^n ~ P, Z^n ~ Q.
For any B ⊆ Y^n with P(B) > 0, consider the conditional distribution P_B(·) ≜ P(· ∩ B)/P(B). Then D(P_B‖P) = log(1/P(B)), and therefore (*) gives
W_1(P, P_B) ≤ √((n/2) log(1/P(B))).
62 Marton's Proof.
(*)  W_1(P, P_B) ≤ √((n/2) log(1/P(B))).
Apply (*) to B = A and B = [A]ᶜ_r:
W_1(P, P_A) ≤ √((n/2) log(1/P(A))),  W_1(P, P_{[A]ᶜ_r}) ≤ √((n/2) log(1/(1 − P([A]_r)))).
Then
√((n/2) log(1/P(A))) + √((n/2) log(1/(1 − P([A]_r)))) ≥ W_1(P_A, P) + W_1(P, P_{[A]ᶜ_r})
≥ W_1(P_A, P_{[A]ᶜ_r})   (triangle inequality)
≥ min_{y^n∈A, z^n∈[A]ᶜ_r} d(y^n, z^n) > r   (def. of [·]_r: every z^n ∉ [A]_r is at distance > r from A).
Rearrange to finish the proof.
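For a concrete instance of the lemma, both sides can be computed exactly (illustrative setup: P uniform on {0,1}^n and A a threshold set, for which blowups have closed form; the lemma itself holds for any product measure and any set):

```python
from math import comb, exp, log, sqrt

def binom_cdf(n, k):
    # exact P[Bin(n, 1/2) <= k]
    return sum(comb(n, j) for j in range(0, min(k, n) + 1)) / 2 ** n

# A = {y in {0,1}^n : sum(y) <= k}: a point z is within Hamming distance r of A
# iff sum(z) <= k + r, so the r-blowup [A]_r is again a threshold set.
n, k = 100, 40
pA = binom_cdf(n, k)
lemma_holds = True
for r in range(0, n - k + 1):
    blowup_prob = binom_cdf(n, k + r)
    lower = 1 - exp(-(2 / n) * max(0.0, r - sqrt((n / 2) * log(1 / pA))) ** 2)
    if blowup_prob < lower - 1e-12:
        lemma_holds = False
```

With P(A) ≈ 0.03, blowing up by a few standard deviations (r of order √n) already captures nearly all of the probability mass, exactly as the lemma predicts.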
63 The Blowing-Up Lemma: Consequences.
Consider a DMC (X, Y, T) with input alphabet X, output alphabet Y, and transition probabilities T(y|x), (x, y) ∈ X × Y.
(n, M, ε)-code C: encoder f : {1, ..., M} → X^n and decoder g : Y^n → {1, ..., M} with max_{1≤j≤M} P[g(Y^n) ≠ j | X^n = f(j)] ≤ ε.
Equivalently, C = {(u_j, D_j)}_{j=1}^M, where u_j = f(j) ∈ X^n are the codewords, D_j = g^{−1}(j) = {y^n ∈ Y^n : g(y^n) = j} are the decoding sets, and T^n(D_j|u_j) ≥ 1 − ε, j = 1, ..., M.
Lemma. There exists a sequence δ_n > 0 with δ_n → 0, such that T^n([D_j]_{nδ_n} | u_j) ≥ 1 − 1/n, j = 1, ..., M.
64 Proof.
Choose δ_n = √(log n / (2n)) + √(log(1/(1−ε)) / (2n)), so that δ_n → 0.
For each j, apply the Blowing-Up Lemma to the product measure P_j(y^n) = Π_{i=1}^n T(y_i | u_j(i)), where u_j = (u_j(1), ..., u_j(n)) = f(j) is the jth codeword. Since P_j(D_j) ≥ 1 − ε, taking r = nδ_n gives
T^n([D_j]_{nδ_n} | u_j) ≥ 1 − exp(−(2/n)(nδ_n − √((n/2) log(1/(1−ε))))²) = 1 − 1/n.
65 From Blowing-Up Lemma to Strong Converses.
Informally, the Blowing-Up Lemma shows that any bad code contains a good subcode (Ahlswede and Dueck, 1976).
Consider an (n, M, ε)-code C = {(u_j, D_j)}_{j=1}^M. Each decoding set D_j can be blown up to a set D̃_j ⊆ Y^n with T^n(D̃_j | u_j) ≥ 1 − 1/n.
The object C̃ = {(u_j, D̃_j)}_{j=1}^M is not a code (since the sets D̃_j are no longer disjoint), but a random coding argument can be used to extract an (n, M′, ε′) subcode with M′ slightly smaller than M and ε′ < ε. Then one can apply the usual (weak) converse to the subcode.
Similar ideas can be used in multiterminal settings (starting with Ahlswede–Gács–Körner).
66 Example: Capacity-Achieving Channel Codes.
The set-up: DMC (X, Y, T) with capacity C = C(T) = max_{P_X} I(X; Y).
(n, M)-code: C = (f, g) with encoder f : {1, ..., M} → X^n and decoder g : Y^n → {1, ..., M}.
Capacity-achieving codes: a sequence {C_n}_{n=1}^∞, where each C_n is an (n, M_n)-code, is capacity-achieving if lim_{n→∞} (1/n) log M_n = C.
67 Capacity-Achieving Channel Codes.
Capacity-achieving input and output distributions: P*_X ∈ arg max_{P_X} I(X; Y) (may not be unique); P*_X → T → P*_Y (always unique).
Theorem (Shamai–Verdú, 1997). Let {C_n} be any capacity-achieving code sequence with vanishing error probability. Then
lim_{n→∞} (1/n) D(P^{(C_n)}_{Y^n} ‖ P*_{Y^n}) = 0,
where P^{(C_n)}_{Y^n} is the output distribution induced by the code C_n when the messages in {1, ..., M_n} are equiprobable.
68 Capacity-Achieving Channel Codes.
lim_{n→∞} (1/n) D(P^{(C_n)}_{Y^n} ‖ P*_{Y^n}) = 0.
Main message: channel output sequences induced by a good code resemble i.i.d. sequences drawn from the CAOD P*_Y.
Useful implications: estimate performance characteristics of good channel codes by their expectations w.r.t. P*_{Y^n} = (P*_Y)^{⊗n} — often much easier to compute explicitly; bound the estimation accuracy using large-deviation theory (e.g., Sanov's theorem).
Question: what about good codes with nonvanishing error probability?
69 Codes with Nonvanishing Error Probability.
Y. Polyanskiy and S. Verdú, "Empirical distribution of good channel codes with non-vanishing error probability" (2012):
1. Let C = (f, g) be any (n, M, ε)-code for T: max_{1≤j≤M} P[g(Y^n) ≠ j | X^n = f(j)] ≤ ε. Then D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + o(n).
2. If {C_n}_{n=1}^∞ is a capacity-achieving sequence, where each C_n is an (n, M_n, ε)-code for some fixed ε > 0, then lim_{n→∞} (1/n) D(P^{(C_n)}_{Y^n} ‖ P*_{Y^n}) = 0.
In some cases, the o(n) term can be improved to O(√n).
70 Relative Entropy at the Output of a Code.
Consider a DMC T : X → Y with T(·|·) > 0, and let
c(T) ≜ 2 max_{x∈X} max_{y,y′∈Y} log (T(y|x)/T(y′|x)).
Theorem (Raginsky–Sason, 2013). Any (n, M, ε)-code C for T, where ε ∈ (0, 1/2), satisfies
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + log(1/ε) + c(T)√((n/2) log(1/(1−2ε))).
Remark: Polyanskiy and Verdú show that D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + a√n for some constant a = a(ε).
71 Proof Sketch (1).
Fix x^n ∈ X^n and study the concentration of the function
h_{x^n}(y^n) ≜ log (dP_{Y^n|X^n=x^n}/dP^{(C)}_{Y^n})(y^n)
around its expectation w.r.t. P_{Y^n|X^n=x^n}:
E[h_{x^n}(Y^n) | X^n = x^n] = D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}).
Step 1: Because T(·|·) > 0, the function h_{x^n}(y^n) is 1-Lipschitz w.r.t. the scaled Hamming metric d(y^n, ȳ^n) = c(T) Σ_{i=1}^n 1{y_i ≠ ȳ_i}.
72 Proof Sketch (2).
Step 1: Because T(·|·) > 0, the function h_{x^n}(y^n) is 1-Lipschitz w.r.t. the scaled Hamming metric d(y^n, ȳ^n) = c(T) Σ_{i=1}^n 1{y_i ≠ ȳ_i}.
Step 2: Any product probability measure µ on (Y^n, d) satisfies
log E_µ[e^{tF(Y^n)}] ≤ n c(T)² t² / 8
for any F with E_µ F = 0 and F ∈ Lip_1.
Proof: tensorization of T_1 (Pinsker), followed by an appeal to Bobkov–Götze.
73 Proof Sketch (3).
h_{x^n}(y^n) = log (dP_{Y^n|X^n=x^n}/dP^{(C)}_{Y^n})(y^n), E[h_{x^n}(Y^n) | X^n = x^n] = D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}).
Step 3: For any x^n, µ = P_{Y^n|X^n=x^n} is a product measure, so
P[h_{x^n}(Y^n) ≥ D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}) + r] ≤ exp(−2r²/(n c(T)²)).
Use this with r = c(T)√((n/2) log(1/(1−2ε))):
P[h_{x^n}(Y^n) ≥ D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}) + c(T)√((n/2) log(1/(1−2ε)))] ≤ 1 − 2ε.
Remark: Polyanskiy–Verdú show Var[h_{x^n}(Y^n) | X^n = x^n] = O(n).
74 Proof Sketch (4).
Recall: P[h_{x^n}(Y^n) ≥ D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}) + c(T)√((n/2) log(1/(1−2ε)))] ≤ 1 − 2ε.
Step 4: As in Polyanskiy–Verdú, appeal to Augustin's strong converse (1966) to get
log M ≤ log(1/ε) + D(P^{(C)}_{Y^n|X^n} ‖ P^{(C)}_{Y^n} | P^{(C)}_{X^n}) + c(T)√((n/2) log(1/(1−2ε))).
Combining this with
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) = D(P^{(C)}_{Y^n|X^n} ‖ P*_{Y^n} | P^{(C)}_{X^n}) − D(P^{(C)}_{Y^n|X^n} ‖ P^{(C)}_{Y^n} | P^{(C)}_{X^n})
and D(P^{(C)}_{Y^n|X^n} ‖ P*_{Y^n} | P^{(C)}_{X^n}) ≤ nC gives
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + log(1/ε) + c(T)√((n/2) log(1/(1−2ε))).
75 Relative Entropy at the Output of a Code.
Theorem (Raginsky–Sason, 2013). Let (X, Y, T) be a DMC with C > 0. Then, for any 0 < ε < 1, any (n, M, ε)-code C for T satisfies
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + O(√n (log n)^{3/2}),
where the constant in the O(·) term is explicit and depends only on |X|, |Y|, and ε.
Proof idea: Apply the Blowing-Up Lemma to the code, then extract a good subcode.
Remark: Polyanskiy and Verdú show that D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + b√n (log n)^{3/2} for some constant b > 0.
76 Concentration of Lipschitz Functions.
Theorem (Raginsky–Sason, 2013). Let (X, Y, T) be a DMC with c(T) < ∞. Let d : Y^n × Y^n → R_+ be a metric, and suppose that P_{Y^n|X^n=x^n}, for all x^n ∈ X^n, as well as P*_{Y^n}, satisfy T_1(c) for some c > 0. Then, for any ε ∈ (0, 1/2), any (n, M, ε)-code C for T, and any function f : Y^n → R, we have
P^{(C)}_{Y^n}(|f(Y^n) − E[f(Y^n)]| ≥ r) ≤ (4/ε) exp(nC − log M + a√n − r²/(8c‖f‖²_Lip)),  r ≥ 0,
where the expectation E[f(Y^n)] is w.r.t. P*_{Y^n}, and a ≜ c(T)√((1/2) log(1/(1−2ε))).
77 Proof Sketch.
Step 1: For each x^n ∈ X^n, let φ(x^n) ≜ E[f(Y^n) | X^n = x^n]. Then, by Bobkov–Götze,
P(|f(Y^n) − φ(x^n)| ≥ r | X^n = x^n) ≤ 2 exp(−r²/(2c‖f‖²_Lip)).
Step 2: By restricting to the subcode C′ with the M′ = M·P^{(C)}_{X^n}(φ(X^n) ≥ E[f(Y^n)] + r) codewords x^n ∈ X^n satisfying φ(x^n) ≥ E[f(Y^n)] + r, we can show that
r ≤ ‖f‖_Lip √(2c (nC − log M′ + a√n + log(1/ε))).
Solve for M′ to get
P^{(C)}_{X^n}(φ(X^n) ≥ E[f(Y^n)] + r) ≤ 2 exp(nC − log M + a√n + log(1/ε) − r²/(2c‖f‖²_Lip)).
Step 3: Apply the union bound.
78 Empirical Averages at the Code Output.
Equip Y^n with the Hamming metric d(y^n, ȳ^n) = Σ_{i=1}^n 1{y_i ≠ ȳ_i}. Consider functions of the form f(y^n) = (1/n) Σ_{i=1}^n f_i(y_i), where |f_i(y_i) − f_i(ȳ_i)| ≤ L·1{y_i ≠ ȳ_i} for all i, y_i, ȳ_i. Then ‖f‖_Lip ≤ L/n.
Since P_{Y^n|X^n=x^n} for all x^n and P*_{Y^n} are product measures on Y^n, they all satisfy T_1(n/4) (by tensorization). Therefore, for any (n, M, ε)-code and any such f we have
P^{(C)}_{Y^n}(|f(Y^n) − E[f(Y^n)]| ≥ r) ≤ (4/ε) exp(nC − log M + a√n − nr²/(2L²)).
79 Concentration of Measure: Information-Theoretic Converse.
Concentration phenomenon in a nutshell: if a subset of a metric probability space does not have too small a probability mass, then its blowups will eventually take up most of the probability mass.
Question: given a set whose blowups eventually take up most of the probability mass, how small can this set be?
This question was answered by Kontoyiannis (1999) as a consequence of a general information-theoretic converse.
80 Converse Concentration of Measure: The Set-Up.
Let X be a finite set, together with a distortion function d : X × X → R_+ and a mass function M : X → (0, ∞). Extend to the product space X^n:
d_n(x^n, y^n) ≜ Σ_{i=1}^n d(x_i, y_i),  M_n(x^n) ≜ Π_{i=1}^n M(x_i),  M_n(A) ≜ Σ_{x^n∈A} M_n(x^n) for A ⊆ X^n.
Blowups: for A ⊆ X^n, [A]_r ≜ {x^n ∈ X^n : min_{y^n∈A} d_n(x^n, y^n) ≤ r}.
81 Converse Concentration of Measure.
Let P be a probability measure on X. Define
R_n(δ) ≜ min {I(X^n; Y^n) + E[log M_n(Y^n)] : P_{X^n Y^n} s.t. P_{X^n} = P^{⊗n}, E[d_n(X^n, Y^n)] ≤ nδ}.
Theorem (Kontoyiannis). Let A_n ⊆ X^n be an arbitrary set. Then
(1/n) log M_n(A_n) ≥ R(δ),
where δ ≜ (1/n) E[min_{y^n∈A_n} d_n(X^n, y^n)] (with X^n ~ P^{⊗n}) and R(δ) ≜ lim_{n→∞} R_n(δ)/n.
Remark: It can be shown that R(δ) = inf_{n≥1} R_n(δ)/n = R_1(δ).
82 Proof

Define the mapping ϕ_n : X^n → A_n via

ϕ_n(x^n) ≜ arg min_{y^n ∈ A_n} d_n(x^n, y^n),

and let Y^n = ϕ_n(X^n), Q_n = L(Y^n). Then

log M_n(A_n) = log Σ_{y^n ∈ A_n} M_n(y^n)
≥ log Σ_{y^n ∈ A_n : Q_n(y^n) > 0} Q_n(y^n) · M_n(y^n)/Q_n(y^n)
≥ Σ_{y^n ∈ A_n} Q_n(y^n) log ( M_n(y^n)/Q_n(y^n) )     (by Jensen's inequality)
= − Σ_{y^n ∈ A_n} Q_n(y^n) log Q_n(y^n) + Σ_{y^n ∈ A_n} Q_n(y^n) log M_n(y^n)
= H(Y^n) + E log M_n(Y^n)
= I(X^n; Y^n) + E log M_n(Y^n)
≥ R_n(δ).
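The chain above can be checked numerically on a toy example. The sketch below uses illustrative choices of n, p, and A_n: it takes M = P = Bern(p), maps each x^n to its nearest point in A_n under Hamming distance, and verifies log M_n(A_n) ≥ H(Y^n) + E log M_n(Y^n).

```python
import math
from itertools import product

# Toy check of the proof's key inequality with mass function M = P, so that
# M_n(A_n) = P^n(A_n). Y^n is the nearest point of A_n to X^n.

n, p = 3, 0.3
cube = list(product((0, 1), repeat=n))

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def Pn(x):                      # product (Bern(p))^n mass of a point
    k = sum(x)
    return p**k * (1 - p)**(n - k)

A = [(0, 0, 0), (1, 1, 0), (0, 1, 1)]   # an arbitrary small set

# Push-forward Q_n of P^n under the nearest-point map (ties by list order)
Q = {y: 0.0 for y in A}
for x in cube:
    Q[min(A, key=lambda a: hamming(x, a))] += Pn(x)

H_Y = -sum(q * math.log(q) for q in Q.values() if q > 0)
E_logM = sum(Q[y] * math.log(Pn(y)) for y in A)
lhs = math.log(sum(Pn(y) for y in A))    # log M_n(A_n)
print(lhs, H_Y + E_logM)
```

Jensen's inequality guarantees lhs ≥ H(Y^n) + E log M_n(Y^n) for any push-forward supported on A_n, which the printout confirms on this instance.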
83 Converse Concentration of Measure

Consider a sequence of sets {A_n}_{n=1}^∞ with

P^⊗n( [A_n]_{nδ} ) → 1 as n → ∞.   (★)

Apply Kontoyiannis' converse with the mass function M = P to get the following:

Corollary. If the sequence {A_n} satisfies (★), then

liminf_{n→∞} (1/n) log P^⊗n(A_n) ≥ R(δ),

where the concentration exponent is

R(δ) = min over P_XY { I(X; Y) + E log P(Y) : P_X = P, E[d(X, Y)] ≤ δ }
     = − max over P_XY { H(Y|X) + D(P_Y ‖ P) : P_X = P, E[d(X, Y)] ≤ δ }.
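For a binary alphabet, the single-letter concentration exponent can be approximated by a grid search over channels P_{Y|X}. The sketch below is illustrative (grid resolution and p are arbitrary choices): it uses the max form R(δ) = −max{ H(Y|X) + D(P_Y ‖ P) } with the channel parameterized by a = P(Y=1|X=0) and b = P(Y=0|X=1), and Hamming distortion.

```python
import math

# Grid-search sketch of R(delta) = -max{ H(Y|X) + D(P_Y || P) :
#   P_X = Bern(p), E[d(X,Y)] <= delta } for Hamming d on {0,1}.

def h(x):
    """Binary entropy in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def kl_bern(q, p):
    """D(Bern(q) || Bern(p)) in nats, with the usual 0 log 0 = 0 convention."""
    val = 0.0
    if q > 0:
        val += q * math.log(q / p)
    if q < 1:
        val += (1 - q) * math.log((1 - q) / (1 - p))
    return val

def R(delta, p, grid=200):
    best = -float("inf")
    for i in range(grid + 1):
        for j in range(grid + 1):
            a, b = i / grid, j / grid          # a = P(Y=1|X=0), b = P(Y=0|X=1)
            if (1 - p) * a + p * b <= delta + 1e-12:   # E[d(X,Y)] <= delta
                q = (1 - p) * a + p * (1 - b)          # P_Y(1)
                val = (1 - p) * h(a) + p * h(b) + kl_bern(q, p)
                best = max(best, val)
    return -best

p = 0.1
print(R(0.0, p), R(1.0, p))   # R(0) = 0; R(1) = log p = -log 10
```

At δ = 0 the only feasible channel is Y = X and the exponent vanishes; at δ = 1 the optimum sets Y ≡ 1, giving R(1) = log p, consistent with the Bernoulli example on the next slide.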
84 Example of the Concentration Exponent

Theorem (Raginsky–Sason, 2013). Let P = Bern(p). Then

R(δ) ≤ −ϕ(p)δ² − (1−p) h( δ/(1−p) ),  if δ ∈ [0, 1−p]
R(δ) = log p,                          if δ ∈ [1−p, 1]

where

ϕ(p) ≜ (1/(1−2p)) log((1−p)/p)

and h(·) is the binary entropy function.

Remarks:

The upper bound is not tight, but in this case R(δ) can be evaluated numerically (cf. Kontoyiannis, 2001).

The proof is by a coupling argument.
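Assuming the piecewise form read off this slide (natural logarithms; treat the exact constants as a reconstruction to be checked against the ISIT 2013 paper), the bound is straightforward to evaluate:

```python
import math

# Evaluate the slide's closed-form bound on the Bernoulli concentration
# exponent (reconstruction; natural logs throughout).

def h(x):
    """Binary entropy in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def phi(p):
    # limit value at p = 1/2; otherwise the displayed formula
    return 2.0 if p == 0.5 else math.log((1 - p) / p) / (1 - 2 * p)

def R_upper(delta, p):
    if delta <= 1 - p:
        return -phi(p) * delta**2 - (1 - p) * h(delta / (1 - p))
    return math.log(p)

p = 0.1
print(R_upper(0.0, p), R_upper(0.5, p), R_upper(1.0, p))
```

The bound vanishes at δ = 0, decreases with δ, and matches the exact value log p on [1−p, 1].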
85 Summary Three related methods for obtaining sharp concentration inequalities in high dimension: 1. The entropy method 2. Log-Sobolev inequalities 3. Transportation-cost inequalities All three methods crucially rely on tensorization: Breaking the original high-dimensional problem into low-dimensional pieces, exploiting low-dimensional structure to control entropy locally, assembling local information into a global bound. Tensorization is a consequence of independence. Applications to information theory: Exploit the problem structure to isolate independence (e.g., output distribution of a DMC for any fixed input block).
86 What We Had to Skip Log-Sobolev inequalities and hypercontractivity Log-Sobolev inequalities when Herbst fails (e.g., Poisson measures) Connections to isoperimetric inequalities HWI inequalities: tying together relative entropy, Wasserstein distance, Fisher information Concentration inequalities for functions of dependent random variables For this and more, consult our monograph: M. Raginsky and I. Sason, Concentration of Measure Inequalities in Info. Theory, Comm. and Coding, FnT, 2nd edition, 2014.
87 Recent Books and Surveys - Concentration Inequalities

1. N. Alon and J. H. Spencer, The Probabilistic Method, Wiley, 3rd edition, 2008.
2. S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press, 2013.
3. F. Chung and L. Lu, Complex Graphs and Networks, CBMS Regional Conference Series in Mathematics, vol. 107, AMS, 2006.
4. D. P. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms, Cambridge University Press, 2009.
5. M. Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs, vol. 89, AMS, 2001.
6. C. McDiarmid, "Concentration," in Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 195-248, Springer, 1998.
7. M. Raginsky and I. Sason, Concentration of Measure Inequalities in Information Theory, Communications, and Coding, Foundations and Trends in Communications and Information Theory, 2nd edition, 2014.
8. R. van Handel, Probability in High Dimension, lecture notes, 2014.