Concentration of Measure with Applications in Information Theory, Communications, and Coding
1 Concentration of Measure with Applications in Information Theory, Communications, and Coding Maxim Raginsky (UIUC) and Igal Sason (Technion) A Tutorial, Presented at 2015 IEEE International Symposium on Information Theory (ISIT) Hong Kong, June 2015 Part 2 of 2
2 The Plan 1. Prelude: The Chernoff bound 2. The entropy method 3. Logarithmic Sobolev inequalities 4. Transportation cost inequalities 5. Some applications
3 Given: X_1, X_2, ..., X_n: independent random variables; Z = f(X^n), for some real-valued f.
Problem: derive sharp bounds on the deviation probabilities P[Z − EZ ≥ t], for t ≥ 0.
Benchmark: X_1, ..., X_n i.i.d. N(0, σ²), Z = f(X^n) = (1/n)(X_1 + ... + X_n) (sample mean).
Goal: P[Z ≥ t] ≤ exp(−nt²/2σ²); extend to other distributions besides Gaussians; extend to nonlinear f.
4 Prelude: The Chernoff Bound
5 Review: The Chernoff Bound.
Define: the logarithmic moment-generating function ψ(λ) ≜ log E[e^{λ(Z−EZ)}] and its Legendre dual ψ*(t) ≜ sup_{λ≥0} {λt − ψ(λ)}.
Markov's inequality: P[Z − EZ ≥ t] = P[e^{λ(Z−EZ)} ≥ e^{λt}] ≤ e^{−λt} E[e^{λ(Z−EZ)}] = e^{−(λt − ψ(λ))}.
Optimize over λ: P[Z − EZ ≥ t] ≤ e^{−ψ*(t)}.
Sanity check: Z ~ N(0, σ²) ⇒ ψ(λ) = λ²σ²/2 and ψ*(t) = t²/2σ², so P[Z ≥ t] ≤ e^{−t²/2σ²}.
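As a quick numerical sanity check of the Chernoff calculation (an illustrative sketch, not part of the slides; σ², t, and the sample size are arbitrary choices), one can compare the optimized bound e^{−t²/2σ²} with the empirical Gaussian tail:

```python
import math, random

def gaussian_chernoff_bound(t, sigma2):
    # optimized Chernoff bound for Z ~ N(0, sigma^2): exp(-t^2 / (2 sigma^2))
    return math.exp(-t ** 2 / (2 * sigma2))

random.seed(0)
sigma2, t, n_samples = 1.0, 2.0, 100_000
samples = [random.gauss(0.0, math.sqrt(sigma2)) for _ in range(n_samples)]
empirical_tail = sum(z >= t for z in samples) / n_samples
bound = gaussian_chernoff_bound(t, sigma2)
```

The empirical tail (about 0.023 for t = 2, σ² = 1) sits well below the bound e^{−2} ≈ 0.135, as expected: the Chernoff bound is valid but loses a polynomial factor.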
6 Chernoff Bound: Subgaussian Random Variables.
Definition. A real-valued r.v. Z is σ²-subgaussian if ψ(λ) ≤ λ²σ²/2.
Immediate: if Z is σ²-subgaussian, then ψ*(t) = sup_{λ≥0} {λt − ψ(λ)} ≥ sup_{λ≥0} {λt − λ²σ²/2} = t²/2σ², giving the Gaussian tail bound P[Z − EZ ≥ t] ≤ e^{−t²/2σ²}.
How do we establish subgaussianity?
7 Review: Hoeffding's Lemma.
Any almost surely bounded r.v. is subgaussian: if there exist −∞ < a ≤ b < ∞ such that Z ∈ [a, b] a.s., then ψ(λ) ≤ λ²(b−a)²/8.
Corollary. If Z ∈ [a, b] a.s., then P[Z − EZ ≥ t] ≤ exp(−2t²/(b−a)²).
Wassily Hoeffding.
Proof (of Corollary). By Hoeffding's lemma, Z is subgaussian with σ² = (b−a)²/4.
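Applying Hoeffding's lemma to each summand of a sample mean of independent bounded variables yields Hoeffding's inequality, P[(1/n)ΣZ_i − EZ ≥ t] ≤ exp(−2nt²/(b−a)²). A small simulation (illustrative choices: Uniform[0,1] variables, n = 100, t = 0.1) checks the bound:

```python
import math, random

def hoeffding_bound(t, n, a, b):
    # P[(1/n) * sum(Z_i) - E[(1/n) * sum(Z_i)] >= t] <= exp(-2 n t^2 / (b - a)^2)
    return math.exp(-2 * n * t ** 2 / (b - a) ** 2)

random.seed(1)
n, t, trials = 100, 0.1, 20_000
exceed = 0
for _ in range(trials):
    mean = sum(random.random() for _ in range(n)) / n   # Z_i ~ Uniform[0, 1], mean 1/2
    if mean - 0.5 >= t:
        exceed += 1
empirical = exceed / trials
bound = hoeffding_bound(t, n, 0.0, 1.0)
```

Here the bound is e^{−2} ≈ 0.135, while the empirical deviation probability is tiny — Hoeffding ignores the variance (1/12 for Uniform[0,1]), which is much smaller than the worst case (b−a)²/4.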
8 Hoeffding's Lemma: An Alternative Proof.
1. Assume without loss of generality that EZ = 0.
2. Compute the first two derivatives of ψ: ψ′(λ) = E[Z e^{λZ}]/E[e^{λZ}], ψ″(λ) = E[Z² e^{λZ}]/E[e^{λZ}] − (E[Z e^{λZ}]/E[e^{λZ}])².
3. Tilted distribution: P = L(Z), dQ/dP(Z) = e^{λZ}/E_P[e^{λZ}]. Then ψ′(λ) = E_Q[Z], ψ″(λ) = Var_Q[Z].
4. Z ∈ [a, b] P-a.s. ⇒ Z ∈ [a, b] Q-a.s. ⇒ Var_Q[Z] ≤ (b−a)²/4.
5. Calculus: ψ(λ) = ∫₀^λ ∫₀^τ ψ″(ρ) dρ dτ ≤ λ²(b−a)²/8.
9 The Entropy Method
10 Exponential Tilting and (Relative) Entropy.
Back to our setting: Z = f(X), X an arbitrary r.v. Want to prove subgaussianity of Z, so we need to analyze ψ(λ) = log E[e^{λ(Z−EZ)}] = log E[e^{λ(f(X)−Ef(X))}].
Let P = L(X), and introduce the tilted distribution P^{λf}: dP^{λf}/dP(X) = e^{λf(X)}/E[e^{λf(X)}].
We will relate ψ(λ) to the relative entropy D(P^{λf}‖P).
11 Exponential Tilting and (Relative) Entropy.
Tilting of P = L(X): dP^{λf}/dP(X) = e^{λf(X)}/E[e^{λf(X)}] = e^{λ(f(X)−Ef(X))} e^{−ψ(λ)}.
Relative entropy:
D(P^{λf}‖P) = ∫ dP^{λf} log (dP^{λf}/dP) = ∫ dP^{λf} (λ(f − E_P f) − ψ(λ)) = λ E_P[(f − E_P f) e^{λ(f−E_P f)}] e^{−ψ(λ)} − ψ(λ) = λψ′(λ) − ψ(λ).
With a bit of foresight, λψ′(λ) − ψ(λ) = λ²(ψ′(λ)/λ − ψ(λ)/λ²) = λ² (d/dλ)(ψ(λ)/λ).
12 The Herbst Argument.
Tilting of P = L(X): dP^{λf}/dP(X) = e^{λf(X)}/E[e^{λf(X)}].
Relative entropy: D(P^{λf}‖P) = λ² (d/dλ)(ψ(λ)/λ).
Since lim_{λ↓0} ψ(λ)/λ = 0 (by l'Hôpital), we have ψ(λ)/λ = ∫₀^λ D(P^{ρf}‖P)/ρ² dρ.
Suppose now that P = L(X) and f are such that D(P^{ρf}‖P) ≤ ρ²σ²/2 for all ρ ≥ 0, for some σ². Then ψ(λ) ≤ λ ∫₀^λ (ρ²σ²/2)/ρ² dρ = λ²σ²/2.
13 The Herbst Argument.
Lemma (Herbst, 1975). Suppose that Z = f(X) is such that D(P^{λf}‖P) ≤ λ²σ²/2 for all λ ≥ 0. Then Z is σ²-subgaussian, and so P[f(X) − Ef(X) ≥ t] ≤ e^{−t²/2σ²}, t ≥ 0.
Ira Herbst.
"For the sake of completeness, we ... prove a theorem which states roughly that the potential of an intrinsically hypercontractive Schrödinger operator must increase at least quadratically at infinity. ... [I]ts complete proof requires information in an unpublished letter of 1975 from I. Herbst to L. Gross. [C]ertain steps in the argument ... can be written down abstractly." — from a 1984 paper by B. Simon and E.B. Davies
14 The Herbst Converse.
Lemma (R. van Handel, 2014). Suppose Z = f(X) is σ²/4-subgaussian. Then D(P^{λf}‖P) ≤ λ²σ²/2 for all λ ≥ 0.
Proof. Let Ẽ[·] ≜ E_{P^{λf}}[·]. Then
D(P^{λf}‖P) = Ẽ[log (dP^{λf}/dP)]
≤ log Ẽ[dP^{λf}/dP]   (Jensen)
= log Ẽ[e^{λ(f(X)−Ef(X))}/E[e^{λ(f(X)−Ef(X))}]]
= log E[e^{2λ(f−Ef)}] − 2 log E[e^{λ(f−Ef)}]
≤ log E[e^{2λ(f−Ef)}]   (since E[e^{λ(f−Ef)}] ≥ 1)
≤ (2λ)²(σ²/4)/2 = λ²σ²/2.
15 The Herbst Argument: What Is It Good For?
Subgaussianity of Z = f(X) is equivalent to D(P^{λf}‖P) = O(λ²), but what does that give us?
Recall: we are interested in high-dimensional settings Z = f(X^n) = f(X_1, ..., X_n), with X_1, ..., X_n independent r.v.'s. Thus, P is a product measure: P = P_1 ⊗ P_2 ⊗ ... ⊗ P_n, P_i ≜ L(X_i).
The relative entropy tensorizes!! We can break the (hard) n-dimensional problem into n (hopefully) easier 1-dimensional problems.
16 Tensorization.
X_1, ..., X_n independent r.v.'s, P = L(X^n) = P_1 ⊗ ... ⊗ P_n, P_i = L(X_i). Write X^{−i} ≜ (X_1, ..., X_{i−1}, X_{i+1}, ..., X_n).
Recall the Efron–Stein–Steele inequality (tensorization of variance): Var_P[f(X^n)] ≤ Σ_{i=1}^n E[Var_P[f(X^n) | X^{−i}]].
Tensorization of relative entropy: for an arbitrary probability measure Q on X^n,
D(Q‖P) ≤ Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ P_{X_i} | Q_{X^{−i}})   (conditional divergence).
Independence of the X_i's is key.
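The Efron–Stein–Steele inequality can be probed numerically via the identity Var[f(X^n) | X^{−i}] = ½ E[(f(X^n) − f(X^{(i)}))² | X^{−i}], where X^{(i)} replaces the ith coordinate by an independent copy. The sketch below (with the illustrative choice f = max over i.i.d. Uniform[0,1] coordinates) estimates both sides by Monte Carlo:

```python
import random

def efron_stein_check(f, n, sample, trials, seed=2):
    # Monte Carlo check of Efron-Stein-Steele:
    #   Var[f(X^n)] <= (1/2) * sum_i E[(f(X) - f(X^(i)))^2],
    # where X^(i) replaces coordinate i by an independent copy.
    rng = random.Random(seed)
    vals, rhs = [], 0.0
    for _ in range(trials):
        x = [sample(rng) for _ in range(n)]
        fx = f(x)
        vals.append(fx)
        for i in range(n):
            y = list(x)
            y[i] = sample(rng)          # resample the ith coordinate
            rhs += 0.5 * (fx - f(y)) ** 2
    mean = sum(vals) / trials
    var = sum((v - mean) ** 2 for v in vals) / trials
    return var, rhs / trials

var_f, es_bound = efron_stein_check(f=max, n=10, sample=lambda r: r.random(), trials=5000)
```

For the maximum of 10 uniforms the true variance is ≈ 0.0069, and the Efron–Stein side comes out larger by a modest constant factor — the bound is sharp in order but not exact.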
17 Tensorization: A Quick Proof.
D(Q‖P) = Σ_{i=1}^n D(Q_{X_i|X^{i−1}} ‖ P_{X_i} | Q_{X^{i−1}})   (chain rule; independence makes P_{X_i|X^{i−1}} = P_{X_i})
= Σ_{i=1}^n E_Q[log (dQ_{X_i|X^{i−1}}/dP_{X_i})]
= Σ_{i=1}^n E_Q[log (dQ_{X_i|X^{−i}}/dP_{X_i})] − Σ_{i=1}^n E_Q[log (dQ_{X_i|X^{−i}}/dQ_{X_i|X^{i−1}})]
= Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ P_{X_i} | Q_{X^{−i}}) − Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ Q_{X_i|X^{i−1}} | Q_{X^{−i}})
≤ Σ_{i=1}^n D(Q_{X_i|X^{−i}} ‖ P_{X_i} | Q_{X^{−i}}) ≜ D^−(Q‖P),
since each subtracted term is a nonnegative conditional divergence. D^−(Q‖P) is the erasure divergence (Verdú–Weissman, 2008).
18 Tensorization and Tilting.
Recall: we are interested in D(Q‖P), where P = P_1 ⊗ ... ⊗ P_n and dQ = (e^{λf}/E_P[e^{λf}]) dP.
Then
dQ_{X_i|X^{−i}=x^{−i}}/dP_{X_i}(x_i) = E_P[e^{λf(X^n)} | X_i = x_i, X^{−i} = x^{−i}] / E_P[e^{λf(X^n)} | X^{−i} = x^{−i}] = e^{λf(x_1,...,x_{i−1},x_i,x_{i+1},...,x_n)} / E_{P_i}[e^{λf(x_1,...,x_{i−1},X_i,x_{i+1},...,x_n)}]
— remember, x^{−i} is fixed.
19 Tensorization and Tilting.
dP^{λf}_{X_i|X^{−i}=x^{−i}}/dP_{X_i}(x_i) = e^{λf(x_1,...,x_{i−1},x_i,x_{i+1},...,x_n)} / E_{P_i}[e^{λf(x_1,...,x_{i−1},X_i,x_{i+1},...,x_n)}].
For a fixed x^{−i}, define the function f_i(·|x^{−i}) : X → R by f_i(x_i|x^{−i}) ≜ f(x_1, ..., x_{i−1}, x_i, x_{i+1}, ..., x_n) = f_i(x_i) (just shorthand; always remember x^{−i}).
Now observe that Q_{X_i|X^{−i}=x^{−i}} is the tilting of P_i ≡ P_{X_i}: dQ_{X_i|X^{−i}=x^{−i}} = (e^{λf_i}/E_{P_i}[e^{λf_i}]) dP_i.
20 Tensorization and Tilting.
Lemma. If X_1, ..., X_n are independent r.v.'s with joint law P = P_1 ⊗ ... ⊗ P_n, where P_i = L(X_i), then for any f : X^n → R,
D(P^{λf}‖P) ≤ Σ_{i=1}^n Ẽ[D(P_i^{λf_i}‖P_i)],
where Ẽ[·] ≜ E_{P^{λf}}[·] and f_i(·) = f_i(·|X^{−i}) for each i.
Proof.
D(P^{λf}‖P) ≤ Σ_{i=1}^n D(P^{λf}_{X_i|X^{−i}} ‖ P_{X_i} | P^{λf}_{X^{−i}}) = Σ_{i=1}^n Ẽ[log (dP^{λf}_{X_i|X^{−i}}/dP_{X_i})] = Σ_{i=1}^n Ẽ[log (dP_i^{λf_i}/dP_i)] = Σ_{i=1}^n Ẽ[D(P_i^{λf_i}‖P_i)].
21 The Entropy Method: Divide and Conquer.
Want to derive a subgaussian tail bound P[f(X^n) − Ef(X^n) ≥ t] ≤ e^{−t²/2σ²}, t ≥ 0, where X_1, ..., X_n are independent r.v.'s.
Suppose we can prove that there exist constants c_1, ..., c_n such that
(*)  D(P_i^{λf_i}‖P_i) ≤ λ²c_i²/2,  i ∈ {1, ..., n}.
Then D(P^{λf}‖P) ≤ (tensorization) ≤ λ² Σ_{i=1}^n c_i² / 2, and (Herbst) σ² = Σ_{i=1}^n c_i².
Now we just need to prove (*)!!
22 Logarithmic Sobolev Inequalities
23 Log-Sobolev in a Nutshell.
Goal: control the relative entropy D(P^f‖P). A log-Sobolev inequality ties together: (i) the underlying probability measure P; (ii) a function class A (containing the f of interest); (iii) an energy functional E : A → R such that E(αf) = αE(f) for all f ∈ A, α ≥ 0; and looks like this:
D(P^f‖P) ≤ (c/2) E²(f),  f ∈ A.
In that case, if E(f) ≤ L, then D(P^{λf}‖P) ≤ (c/2) E²(λf) = (c/2) λ² E²(f) ≤ λ²cL²/2.
The name comes from an analogy with Sobolev inequalities in functional analysis.
24 The Bernoulli Log-Sobolev Inequality.
Theorem (Gross, 1975). Let X_1, ..., X_n be i.i.d. Bern(1/2) random variables. Then, for any function f : {0,1}^n → R,
D(P^f‖P) ≤ (1/8) E[|Df(X^n)|² e^{f(X^n)}] / E[e^{f(X^n)}],
where |Df(x^n)|² ≜ Σ_{i=1}^n (f(x^n) − f(x^n ⊕ e_i))² (flip the ith bit), and E[·] is w.r.t. P = (Bern(1/2))^{⊗n}.
Leonard Gross.
Remarks: This is not the original form of the inequality from Gross' 1975 paper, but they are equivalent. Note that |D(λf)| = λ|Df| for all λ ≥ 0.
25 Bernoulli LSI: Proof Sketch.
Consider first n = 1 and f : {0,1} → R with a = f(0) and b = f(1). In that case, the log-Sobolev inequality reads
(e^a/(e^a + e^b)) log (2e^a/(e^a + e^b)) + (e^b/(e^a + e^b)) log (2e^b/(e^a + e^b)) ≤ (1/8)(b − a)².
Proof: elementary (but tedious) exercise in calculus.
Tensorization: D(P^f‖P) ≤ Σ_{i=1}^n Ẽ[D(P_i^{f_i}‖P_i)], where Ẽ[h(X^n)] = E[h(X^n) e^{f(X^n)}]/E[e^{f(X^n)}],
≤ (1/8) Σ_{i=1}^n Ẽ[(f(X^{i−1}, 0, X^n_{i+1}) − f(X^{i−1}, 1, X^n_{i+1}))²] = (1/8) Ẽ[|Df(X^n)|²] = (1/8) E[|Df(X^n)|² e^{f(X^n)}]/E[e^{f(X^n)}].
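The scalar (n = 1) inequality above is easy to check numerically over a grid of (a, b) — an illustrative verification, not a proof:

```python
import math

def bernoulli_lsi_sides(a, b):
    # n = 1 Bernoulli LSI for P = Bern(1/2), f(0) = a, f(1) = b:
    # LHS is D(P^f || P); RHS is the log-Sobolev bound (b - a)^2 / 8.
    za, zb = math.exp(a), math.exp(b)
    z = za + zb
    lhs = (za / z) * math.log(2 * za / z) + (zb / z) * math.log(2 * zb / z)
    return lhs, (b - a) ** 2 / 8

# check LHS <= RHS on a grid of (a, b) in [-3, 3]^2
ok = all(lhs <= rhs + 1e-12
         for a in range(-30, 31, 3) for b in range(-30, 31, 3)
         for lhs, rhs in [bernoulli_lsi_sides(a / 10, b / 10)])
```

Note that the left side is always at most log 2 (the entropy of a bit), while the right side grows quadratically, so only the near-diagonal region is interesting.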
26 The Gaussian Log-Sobolev Inequality.
Theorem (Gross, 1975). Let X_1, ..., X_n be i.i.d. N(0,1) random variables. Then, for any smooth function f : R^n → R,
D(P^f‖P) ≤ (1/2) E[‖∇f(X^n)‖²₂ e^{f(X^n)}] / E[e^{f(X^n)}],
where all expectations are w.r.t. P = L(X^n) = N(0, I_n).
Leonard Gross.
Remarks: This is not the original form of the inequality from Gross' 1975 paper, but they are equivalent. Equivalent forms of the Gaussian LSI were obtained independently by A. Stam (1959) and by P. Federbush (1969). The contribution of Stam (via the entropy power inequality) was first pointed out by E.A. Carlen in 1991.
27 Proof(s) of the Gaussian LSI.
There are many ways of proving the Gaussian log-Sobolev inequality.
Original proof by Gross: apply the Bernoulli LSI to f((X_1 + ... + X_n − n/2)/√(n/4)) with X_i i.i.d. Bern(1/2), then use the Central Limit Theorem: (X_1 + ... + X_n − n/2)/√(n/4) → N(0, 1) in distribution as n → ∞.
Other routes: via Markov semigroups; via hypercontractivity (E. Nelson); via Stam's inequality for entropy power and Fisher information; via the I-MMSE relation (cf. Raginsky and Sason); ...
28 Application of Gaussian LSI.
Theorem (Tsirelson–Ibragimov–Sudakov, 1976). Let X_1, ..., X_n be independent N(0,1) random variables. Let f : R^n → R be a function which is L-Lipschitz: |f(x^n) − f(y^n)| ≤ L‖x^n − y^n‖₂ for all x^n, y^n ∈ R^n. Then Z = f(X^n) is L²-subgaussian:
log E[e^{λ(f(X^n)−Ef(X^n))}] ≤ λ²L²/2,  λ ≥ 0.
Remarks: The original proof did not rely on the Gaussian LSI. It is a striking result: f can be an arbitrary nonlinear function, and the subgaussian constant is independent of the dimension n.
29 Tsirelson–Ibragimov–Sudakov: Proof via LSI.
By an approximation argument, we can assume that f is differentiable. Since it is L-Lipschitz, ‖∇f‖₂ ≤ L. By the Gaussian LSI, for any λ ≥ 0,
D(P^{λf}‖P) ≤ (1/2) E[‖∇(λf)(X^n)‖²₂ e^{λf(X^n)}]/E[e^{λf(X^n)}] = (λ²/2) E[‖∇f(X^n)‖²₂ e^{λf(X^n)}]/E[e^{λf(X^n)}] ≤ λ²L²/2.
By the Herbst argument, log E[e^{λ(f(X^n)−Ef(X^n))}] ≤ λ²L²/2.
30 A Gaussian Concentration Bound.
The Tsirelson–Ibragimov–Sudakov inequality gives us:
Corollary. Let X_1, ..., X_n be i.i.d. N(0,1), and let f : R^n → R be an L-Lipschitz function. Then P[f(X^n) − Ef(X^n) ≥ t] ≤ e^{−t²/2L²}.
Proof. Use the Chernoff bound.
Remarks: This is an example of dimension-free concentration: the tail bound does not depend on n. Applying the same result to −f and using the union bound, we get P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2e^{−t²/2L²}.
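A quick simulation of the corollary (illustrative choices: f(x) = ‖x‖₂, which is 1-Lipschitz, together with arbitrary n and t; the true mean is replaced by its empirical estimate) shows the dimension-free bound in action:

```python
import math, random

def lipschitz_tail_demo(n=100, t=2.0, trials=10_000, seed=3):
    # f(x) = ||x||_2 is 1-Lipschitz, so Tsirelson-Ibragimov-Sudakov gives
    #   P[f(X^n) - E f(X^n) >= t] <= exp(-t^2 / 2),  independently of n.
    rng = random.Random(seed)
    vals = [math.sqrt(sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(n)))
            for _ in range(trials)]
    mean = sum(vals) / trials                 # empirical stand-in for E f(X^n)
    tail = sum(v - mean >= t for v in vals) / trials
    return tail, math.exp(-t ** 2 / 2)

tail, bound = lipschitz_tail_demo()
```

For the Euclidean norm the actual fluctuations are even smaller than the bound suggests (the norm of an N(0, I_n) vector has standard deviation ≈ 1/√2 for large n), so the empirical tail lands far below e^{−2}.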
31 Deriving Log-Sobolev (1).
Are there systematic ways to derive log-Sobolev inequalities? The usual (probabilistic) approach (a subtle art):
Construct a continuous-time Markov process {X_t}_{t≥0} with stationary distribution P and Markov generator Lf(x) ≜ lim_{t↓0} (E[f(X_t) | X_0 = x] − f(x))/t.
Use the structure of L to obtain an inequality of the form D(P^f‖P) ≤ (c/2) E(e^f, f)/E_P[e^f], where E(g, h) ≜ −E_P[h(X) Lg(X)] is the Dirichlet form.
Extract Γ by looking for a bound of the form E(e^f, f) ≤ E_P[|Γf(X)|² e^{f(X)}].
Different choices of L (for the same P) will yield different Γ's, and hence different log-Sobolev inequalities.
32 Deriving Log-Sobolev (2).
An alternative (information-theoretic) approach, based on a recent paper of A. Maurer (2012). It exploits a representation of D(P^{λf}‖P) in terms of the variance of f(X) under the tilted distributions dP^{sf} = (e^{sf}/E_P[e^{sf}]) dP.
An interpretation in terms of statistical physics: think of f as energy and of s ≥ 0 as inverse temperature. Then
Var^{sf}[f(X)] ≜ E_P[f²(X) e^{sf(X)}]/E_P[e^{sf(X)}] − (E_P[f(X) e^{sf(X)}]/E_P[e^{sf(X)}])²
gives the thermal fluctuations of f at temperature T = 1/s. Infinite-temperature limit (T → ∞): recover Var_P[f(X)].
33 Entropy via Thermal Fluctuations.
Theorem (A. Maurer, 2012). Let X be a random variable with law P. Then for any real-valued function f and any λ ≥ 0,
D(P^{λf}‖P) = ∫₀^λ ∫_t^λ Var^{sf}[f(X)] ds dt,
where Var^{sf}[f(X)] is the variance of f(X) under the tilted distribution P^{sf}.
Andreas Maurer.
Recall: Var^{sf}[f(X)] = E[f²(X) e^{sf(X)}]/E[e^{sf(X)}] − (E[f(X) e^{sf(X)}]/E[e^{sf(X)}])² = ψ″(s), where ψ(s) = log E[e^{s(f(X)−Ef(X))}].
34 Proof.
Recall D(P^{λf}‖P) = λψ′(λ) − ψ(λ), where ψ(λ) = log E[e^{λ(f−Ef)}]. Since ψ(0) = ψ′(0) = 0, we have λψ′(λ) = ∫₀^λ ψ′(λ) dt and ψ(λ) = ∫₀^λ ψ′(t) dt. Substitute:
D(P^{λf}‖P) = ∫₀^λ (ψ′(λ) − ψ′(t)) dt = ∫₀^λ ∫_t^λ ψ″(s) ds dt = ∫₀^λ ∫_t^λ Var^{sf}[f(X)] ds dt.
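Maurer's identity can be verified numerically for a small discrete law. Swapping the order of integration turns the double integral into ∫₀^λ s · Var^{sf}[f(X)] ds, which a midpoint rule approximates well (the law P, the function f, and λ below are illustrative choices):

```python
import math

# a small discrete law P, a function f, and a tilting parameter (illustrative)
p = [0.2, 0.5, 0.3]
f = [1.0, -0.5, 2.0]
lam = 1.3

def tilt(s):
    # tilted law dP^{sf} = (e^{sf} / E_P[e^{sf}]) dP
    w = [pi * math.exp(s * fi) for pi, fi in zip(p, f)]
    z = sum(w)
    return [wi / z for wi in w]

def tilted_var(s):
    q = tilt(s)
    m = sum(qi * fi for qi, fi in zip(q, f))
    return sum(qi * (fi - m) ** 2 for qi, fi in zip(q, f))

# left side: D(P^{lam*f} || P), computed directly
q = tilt(lam)
direct = sum(qi * math.log(qi / pi) for qi, pi in zip(q, p))

# right side: int_0^lam int_t^lam Var ds dt = int_0^lam s * Var^{sf}[f] ds
# (midpoint rule)
N = 2000
h = lam / N
integral = sum((i + 0.5) * h * tilted_var((i + 0.5) * h) * h for i in range(N))
```

The two sides agree to within the quadrature error, which is tiny for a smooth integrand on a short interval.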
35 From Thermal Fluctuations to Log-Sobolev.
Theorem. Let A be a class of functions of X, and suppose that there is a mapping Γ : A → R such that:
1. For all f ∈ A and α ≥ 0, Γ(αf) = αΓ(f).
2. There exists a constant c > 0 such that Var^{λf}[f(X)] ≤ c Γ(f)² for all f ∈ A, λ ≥ 0.
Then D(P^{λf}‖P) ≤ (λ²c/2) Γ(f)², f ∈ A, λ ≥ 0.
Proof. D(P^{λf}‖P) ≤ c Γ(f)² ∫₀^λ ∫_t^λ ds dt = c Γ(f)² λ²/2.
36 From Thermal Fluctuations to Log-Sobolev: Example 1.
Let's use Maurer's method to derive the Bernoulli LSI. For any f : {0,1} → R, define Γ(f) ≜ |f(0) − f(1)|. Since f is obviously bounded, for every s ≥ 0 we have Var^{sf}[f(X)] ≤ (f(0) − f(1))²/4 = Γ(f)²/4. Finally,
D(P^f‖P) = ∫₀^1 ∫_t^1 Var^{sf}[f(X)] ds dt ≤ (Γ(f)²/4) ∫₀^1 ∫_t^1 ds dt = (1/8) Γ(f)².
For n > 1, use tensorization.
37 From Thermal Fluctuations to Log-Sobolev: Example 2.
Let's use Maurer's method to derive McDiarmid's inequality. We will use tensorization, so let's first consider n = 1. We are interested in all functions f : X → R such that sup_{x∈X} f(x) − inf_{x∈X} f(x) ≤ c for some c < ∞. Define Γ(f) ≜ sup_{x∈X} f(x) − inf_{x∈X} f(x).
Any f satisfies f(X) ∈ [inf f, sup f]. If Γ(f) < ∞, then [inf f, sup f] is a bounded interval. In that case, for any P,
Var^{sf}[f(X)] ≤ (sup f − inf f)²/4 = Γ(f)²/4.
Using the integral representation of the divergence, we get D(P^{λf}‖P) ≤ λ²c²/8 if sup f − inf f ≤ c.
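The key variance estimate — that a bounded f has tilted variance at most (sup f − inf f)²/4 under every tilting, which is just Popoviciu's inequality applied to the tilted law — can be checked numerically (P and f below are illustrative choices):

```python
import math

def tilted_var(p, f, s):
    # variance of f(X) under the tilted law dP^{sf} = (e^{sf} / E_P[e^{sf}]) dP
    w = [pi * math.exp(s * fi) for pi, fi in zip(p, f)]
    z = sum(w)
    q = [wi / z for wi in w]
    m = sum(qi * fi for qi, fi in zip(q, f))
    return sum(qi * (fi - m) ** 2 for qi, fi in zip(q, f))

p = [0.1, 0.2, 0.3, 0.4]          # illustrative law
f = [0.0, 1.5, -2.0, 0.7]         # illustrative bounded function
gamma2 = (max(f) - min(f)) ** 2   # Gamma(f)^2 = (sup f - inf f)^2
# worst tilted variance over a grid of inverse temperatures s in [0, 5]
worst = max(tilted_var(p, f, s / 10) for s in range(0, 51))
```

The point of the uniform-in-s bound is exactly what Maurer's representation needs: it holds for every tilting, so the double integral picks up the same constant.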
38 Proof of McDiarmid (cont.).
So far, we have obtained
(*)  D(P^{λf}‖P) ≤ λ²c²/8, if sup f − inf f ≤ c.
Let X_i ~ P_i, 1 ≤ i ≤ n, be independent r.v.'s. Consider f : X^n → R that has bounded differences:
sup_{x^{−i}} (sup_{x_i} f(x^{i−1}, x_i, x^n_{i+1}) − inf_{x_i} f(x^{i−1}, x_i, x^n_{i+1})) ≤ c_i
for all i, for some constants 0 ≤ c_1, ..., c_n < ∞. For each i, apply (*) to f_i(·) ≜ f(x^{i−1}, ·, x^n_{i+1}):
D(P_i^{λf_i}‖P_i) ≤ (λ²/8) (sup_{x_i} f(x^{i−1}, x_i, x^n_{i+1}) − inf_{x_i} f(x^{i−1}, x_i, x^n_{i+1}))²
[recall, f_i(·) depends on x^{−i}].
39 Proof of McDiarmid (cont.).
Now we tensorize: for P = P_1 ⊗ ... ⊗ P_n,
D(P^{λf}‖P) ≤ Σ_{i=1}^n Ẽ[D(P_i^{λf_i}‖P_i)] ≤ (λ²/8) Ẽ[Σ_{i=1}^n (sup_{x_i} f(X^{i−1}, x_i, X^n_{i+1}) − inf_{x_i} f(X^{i−1}, x_i, X^n_{i+1}))²] = (λ²/8) Ẽ[|Γf(X^n)|²].
40 Theorem. Let X_1, ..., X_n ∈ X be independent r.v.'s with joint law P = P_1 ⊗ ... ⊗ P_n. Then, for any function f : X^n → R,
D(P^f‖P) ≤ (1/8) sup_{x^n} |Γf(x^n)|²,
where
Γf(x^n) ≜ (Σ_{i=1}^n (sup_{x_i} f(x^{i−1}, x_i, x^n_{i+1}) − inf_{x_i} f(x^{i−1}, x_i, x^n_{i+1}))²)^{1/2},
the ith summand being |Γ_i f(·|x^{−i})|².
Remarks:
McDiarmid: if f has bounded differences with c_1, ..., c_n, then f(X^n) is (Σ_{i=1}^n c_i²/4)-subgaussian.
Since sup |Γf|² ≤ Σ_{i=1}^n sup |Γ_i f|², the above theorem is stronger than McDiarmid.
41 Transportation-Cost Inequalities
42 Concentration and the Lipschitz Property.
A common theme:
Let f : R^n → R be 1-Lipschitz w.r.t. the Euclidean norm: |f(x^n) − f(y^n)| ≤ ‖x^n − y^n‖₂. Let X_1, ..., X_n be i.i.d. N(0,1) r.v.'s. Then P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2e^{−t²/2}.
Let f : X^n → R be 1-Lipschitz w.r.t. a weighted Hamming metric: |f(x^n) − f(y^n)| ≤ d_c(x^n, y^n), where d_c(x^n, y^n) ≜ Σ_{i=1}^n c_i 1{x_i ≠ y_i} (this is equivalent to bounded differences). Let X_1, ..., X_n be independent r.v.'s. Then P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2 exp(−2t²/Σ_{i=1}^n c_i²).
43 The Setting: Probability in Metric Spaces.
Let (X, d) be a metric space. A function f : X → R is L-Lipschitz (w.r.t. d) if |f(x) − f(y)| ≤ L d(x, y) for all x, y ∈ X. Notation: Lip_L(X, d) — the class of all L-Lipschitz functions.
Question: What conditions does a probability measure P on X have to satisfy so that f(X), X ~ P, is σ²-subgaussian for every f ∈ Lip_1(X, d)?
44 Some Definitions.
A coupling of two probability measures P and Q on X is any probability measure π on X × X such that (X, Y) ~ π ⇒ X ~ P, Y ~ Q. Π(P, Q): the family of all couplings of P and Q.
Definition. For p ≥ 1, the L^p Wasserstein distance between P and Q is given by W_p(P, Q) ≜ inf_{π∈Π(P,Q)} (E_π[d^p(X, Y)])^{1/p}.
Kantorovich–Rubinstein formula. For any two P, Q:
W_1(P, Q) = sup_{f∈Lip_1(X,d)} {E_P[f] − E_Q[f]}.
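For distributions on the real line with d(x, y) = |x − y|, the optimal W₁ coupling is the monotone (sorted) one, which makes W₁ between equal-size empirical measures trivial to compute; the Kantorovich–Rubinstein dual then gives a lower bound from any single 1-Lipschitz test function (the data below are illustrative):

```python
def w1_empirical(xs, ys):
    # On the real line with d(x, y) = |x - y|, the optimal coupling is monotone:
    # sort both samples and pair order statistics.
    assert len(xs) == len(ys)
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)

xs = [0.0, 1.0, 2.0, 3.0]       # empirical measure P (equal weights)
ys = [0.5, 0.5, 2.5, 4.0]       # empirical measure Q (equal weights)
w1 = w1_empirical(xs, ys)

# Kantorovich-Rubinstein: f(x) = x is 1-Lipschitz, so |E_P f - E_Q f| <= W1
kr_lower = abs(sum(xs) - sum(ys)) / len(xs)
```

Here W₁ = 0.625 while the single test function f(x) = x only certifies 0.375 — the supremum in the dual is over all of Lip₁, and one test function generally undershoots it.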
45 Optimal Transportation.
Monge–Kantorovich optimal transportation problem. Given two probability measures P, Q on a common space X and a cost function c : X × X → R_+, minimize E_π[c(X, Y)] over all couplings π ∈ Π(P, Q).
Interpretation: P and Q are the initial and final distributions of some material (say, sand) in space; c(x, y) is the cost of transporting a grain of sand from location x to location y; π(dy|x) is a randomized strategy for transporting from location x.
Wasserstein distances: the transportation cost is some power of a metric.
Gaspard Monge. Leonid Kantorovich.
46 Transportation Cost Inequalities.
Definition. A probability measure P on a metric space (X, d) satisfies an L^p transportation cost inequality with constant c, or T_p(c) for short, if
W_p(P, Q) ≤ √(2c D(Q‖P)),  for all Q.
Preview: We will be primarily concerned with T_1(c) and T_2(c) inequalities. T_1(c) is easier to work with, while T_2(c) has strong properties (dimension-free concentration).
47 Examples of TC Inequalities: 1.
X: arbitrary space with trivial metric d(x, y) = 1{x ≠ y}. Then
W_1(P, Q) = inf_{π∈Π(P,Q)} E_π[1{X ≠ Y}] = inf_{π∈Π(P,Q)} P_π[X ≠ Y] = ‖P − Q‖_TV,
the total variation distance (proof by explicit construction of the optimal coupling).
Any P satisfies T_1(1/4): ‖P − Q‖_TV ≤ √((1/2) D(Q‖P)) (Csiszár–Kemperman–Kullback–Pinsker).
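Pinsker's inequality ‖P − Q‖_TV ≤ √(D(Q‖P)/2) is easy to stress-test on random discrete distributions (an illustrative check; natural logarithms throughout):

```python
import math, random

def tv(p, q):
    # total variation distance between two pmfs on a common finite alphabet
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))

def kl(q, p):
    # relative entropy D(Q || P), natural log; assumes p > 0 everywhere
    return sum(b * math.log(b / a) for a, b in zip(p, q) if b > 0)

def random_pmf(rng, k):
    w = [rng.random() + 1e-3 for _ in range(k)]
    s = sum(w)
    return [x / s for x in w]

rng = random.Random(4)
ok = all(tv(p, q) <= math.sqrt(0.5 * kl(q, p)) + 1e-12
         for p, q in ((random_pmf(rng, 5), random_pmf(rng, 5)) for _ in range(1000)))
```

The small additive 1e-3 mass keeps P strictly positive so that D(Q‖P) is finite for every pair drawn.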
48 Examples of TC Inequalities: 2.
X = {0, 1} with trivial metric d(x, y) = 1{x ≠ y}, P = Bern(p).
Distribution-dependent refinement of Pinsker: ‖P − Q‖_TV ≤ √(2c(p) D(Q‖P)), where c(p) = (1 − 2p)/(2 log((1−p)/p)) for p ≠ 1/2, with c(1/2) = 1/4 by continuity (Ordentlich–Weinberger, 2005).
Thus, P = Bern(p) satisfies T_1(c(p)), and the constant is optimal:
inf_{Q≠P} D(Q‖P)/‖P − Q‖²_TV = 1/(2c(p)).
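A numerical check of the refined Pinsker inequality with this c(p) (illustrative: p = 0.1 and a grid of q; for Bernoulli measures ‖Bern(p) − Bern(q)‖_TV = |p − q|, and q = 1 − p appears to attain equality, reflecting the optimality of the constant):

```python
import math

def c_ow(p):
    # Ordentlich-Weinberger constant; c(1/2) = 1/4 by continuity
    if abs(p - 0.5) < 1e-12:
        return 0.25
    return (1 - 2 * p) / (2 * math.log((1 - p) / p))

def d_bern(q, p):
    # binary relative entropy D(Bern(q) || Bern(p)), natural log
    d = 0.0
    if q > 0:
        d += q * math.log(q / p)
    if q < 1:
        d += (1 - q) * math.log((1 - q) / (1 - p))
    return d

p = 0.1   # illustrative; any p in (0, 1) works
ok = all(abs(p - q) <= math.sqrt(2 * c_ow(p) * d_bern(q, p)) + 1e-12
         for q in (i / 100 for i in range(1, 100)))
```

Since c(0.1) ≈ 0.182 < 1/4, this really is a strict improvement over plain Pinsker for biased coins.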
49 Examples of TC Inequalities: 3.
X = R^n with d(x, y) = ‖x − y‖₂. The Gaussian measure P = N(0, I_n) satisfies T_2(1):
W_2(P, Q) ≤ √(2 D(Q‖P))   (Talagrand, 1996).
Note: the constant is independent of n!!
Michel Talagrand.
50 Transportation and Concentration.
Theorem (Bobkov–Götze, 1999). Let P be a probability measure on a metric space (X, d). Then the following two statements are equivalent:
1. f(X), X ~ P, is σ²-subgaussian for every f ∈ Lip_1(X, d).
2. P satisfies T_1(σ²) on (X, d): W_1(P, Q) ≤ √(2σ² D(Q‖P)) for all Q ≪ P.
Sergey Bobkov. Friedrich Götze.
Remarks: The connection between concentration and transportation inequalities was first pointed out by Marton (1986). Remarkable result: the concentration phenomenon for Lipschitz functions can be expressed purely in terms of probabilistic and metric structures.
51 Proof of Bobkov–Götze (ultra-light version, due to R. van Handel).
Statement 1 of the theorem is equivalent to
(*)  sup_{λ≥0} sup_{f∈Lip_1(X,d)} {log E_P[e^{λ(f−E_P f)}] − λ²σ²/2} ≤ 0.
Use the Gibbs variational principle: log E_P[e^h] = sup_Q {E_Q[h] − D(Q‖P)} for all h such that e^h ∈ L¹(P) (the supremum is achieved by the tilted distribution Q* = P^h). Then (*) is equivalent to
(**)  sup_{λ≥0} sup_{f∈Lip_1(X,d)} sup_Q {λ(E_Q[f] − E_P[f]) − D(Q‖P) − λ²σ²/2} ≤ 0.
52 Proof of Bobkov–Götze (cont.).
(**)  sup_{λ≥0} sup_{f∈Lip_1(X,d)} sup_Q {λ(E_Q[f] − E_P[f]) − D(Q‖P) − λ²σ²/2} ≤ 0.
Interchange the order of suprema: sup_{λ≥0} sup_{f∈Lip_1} sup_Q [...] = sup_Q sup_{λ≥0} sup_{f∈Lip_1} [...]. Then
(**) ⟺ sup_Q sup_{λ≥0} {λ W_1(P, Q) − D(Q‖P) − λ²σ²/2} ≤ 0   (by Kantorovich–Rubinstein)
⟺ sup_Q {W_1(P, Q) − √(2σ² D(Q‖P))} ≤ 0   (optimize over λ).
53 Tensorization of Transportation Inequalities At first sight, all we have is another equivalent characterization of concentration of Lipschitz functions. However, transportation inequalities tensorize!! Proof of tensorization is through a beautiful result on couplings by Katalin Marton.
54 The Marton Coupling.
Theorem (Marton, 1986). Let (X_i, P_i), 1 ≤ i ≤ n, be probability spaces. Let w_i : X_i × X_i → R_+, 1 ≤ i ≤ n, be positive weight functions, and let ϕ : R_+ → R_+ be a convex function. Suppose that, for each i,
inf_{π∈Π(P_i,Q)} ϕ(E_π[w_i(X, Y)]) ≤ 2σ² D(Q‖P_i),  for all Q.
Then the following holds for the product measure P = P_1 ⊗ ... ⊗ P_n on the product space X = X_1 × ... × X_n:
inf_{π∈Π(P,Q)} Σ_{i=1}^n ϕ(E_π[w_i(X_i, Y_i)]) ≤ 2σ² D(Q‖P)
for every Q on X.
Katalin Marton.
Proof idea: chain rule for relative entropy + conditional coupling + induction.
55 Tensorization of Transportation Cost Inequalities.
Theorem. Let (X_i, P_i, d_i), 1 ≤ i ≤ n, be probability metric spaces. If for some 1 ≤ p ≤ 2 each P_i satisfies T_p(c) on (X_i, d_i), then the product measure P = P_1 ⊗ ... ⊗ P_n on X = X_1 × ... × X_n satisfies T_p(c n^{2/p−1}) w.r.t. the metric
d_p(x^n, y^n) ≜ (Σ_{i=1}^n d_i^p(x_i, y_i))^{1/p}.
Remarks:
If each P_i satisfies T_1(c), then P = P_1 ⊗ ... ⊗ P_n satisfies T_1(cn) with respect to the metric Σ_i d_i. Note: the constant deteriorates with n.
If each P_i satisfies T_2(c), then P satisfies T_2(c) with respect to √(Σ_i d_i²). Note: the constant is independent of n.
56 Proof of Tensorization.
By hypothesis, for each i,
inf_{π∈Π(P_i,Q)} (E_π[d_i^p(X, Y)])^{2/p} ≤ 2c D(Q‖P_i),  for all Q
(the left-hand side is W²_{p,d_i}(P_i, Q)). Since 1 ≤ p ≤ 2, ϕ(u) = u^{2/p} is convex. Take w_i = d_i^p. Then
inf_{π∈Π(P_i,Q)} ϕ(E_π[w_i(X, Y)]) ≤ 2c D(Q‖P_i),  for all Q.
By Marton's coupling,
(•)  inf_{π∈Π(P,Q)} Σ_{i=1}^n (E_π[d_i^p(X_i, Y_i)])^{2/p} ≤ 2c D(Q‖P),  for all Q.
57 Proof of Tensorization (cont.).
We have shown that if P_i satisfies T_p(c) w.r.t. d_i for each i, then
(•)  inf_{π∈Π(P,Q)} Σ_{i=1}^n (E_π[d_i^p(X_i, Y_i)])^{2/p} ≤ 2c D(Q‖P),  for all Q.
We will now prove that LHS(•) ≥ n^{1−2/p} W²_{p,d_p}(P, Q). For any π ∈ Π(P, Q),
W²_{p,d_p}(P, Q) ≤ (E_π[Σ_{i=1}^n d_i^p(X_i, Y_i)])^{2/p} = (Σ_{i=1}^n E_π[d_i^p(X_i, Y_i)])^{2/p}   (linearity of expectation)
≤ n^{2/p−1} Σ_{i=1}^n (E_π[d_i^p(X_i, Y_i)])^{2/p}   (convexity of t ↦ t^{2/p}).
Take the infimum over all π ∈ Π(P, Q), and we are done.
58 (Yet Another) Proof of McDiarmid's Inequality.
Product probability space: (X_1 × ... × X_n, P_1 ⊗ ... ⊗ P_n). Given c_1, ..., c_n ≥ 0, equip X_i with d_{c_i}(x_i, y_i) ≜ c_i 1{x_i ≠ y_i}. Then any P_i on X_i satisfies T_1(c_i²/4):
W_{1,d_{c_i}}(P_i, Q) = c_i ‖P_i − Q‖_TV ≤ c_i √((1/2) D(Q‖P_i))   (by rescaling Pinsker).
By Marton's coupling, P = P_1 ⊗ ... ⊗ P_n satisfies T_1 with constant (1/4) Σ_{i=1}^n c_i² with respect to the metric
d_c(x^n, y^n) ≜ Σ_{i=1}^n d_{c_i}(x_i, y_i) = Σ_{i=1}^n c_i 1{x_i ≠ y_i}.
By Bobkov–Götze, this is equivalent to the subgaussian property of all f ∈ Lip_1(X_1 × ... × X_n, d_c) (i.e., all f with bounded differences):
P[|f(X^n) − Ef(X^n)| ≥ t] ≤ 2 exp(−2t²/Σ_{i=1}^n c_i²)   — McDiarmid.
59 Some Applications
60 The Blowing-Up Lemma.
Consider a product space Y^n equipped with the Hamming metric d(y^n, z^n) = Σ_{i=1}^n 1{y_i ≠ z_i}. For a set A ⊆ Y^n and for r ∈ {0, 1, ..., n}, define its r-blowup
[A]_r ≜ {z^n ∈ Y^n : min_{y^n∈A} d(z^n, y^n) ≤ r}.
The following result, in a different (asymptotic) form, was first proved by Ahlswede–Gács–Körner (1976); a simple proof was given by Marton (1986): Let Y_1, ..., Y_n be independent r.v.'s taking values in Y. Then for every set A ⊆ Y^n with P_{Y^n}(A) > 0,
P_{Y^n}([A]_r) ≥ 1 − exp(−(2/n) [(r − √((n/2) log(1/P_{Y^n}(A))))₊]²).
Informally, any set in a product space can be blown up to engulf most of the probability mass.
61 Marton's Proof.
Let P_i = L(Y_i), 1 ≤ i ≤ n; P = P_1 ⊗ ... ⊗ P_n = L(Y^n). By tensorization, P satisfies the TC inequality
(*)  W_1(P, Q) ≤ √((n/2) D(Q‖P)),  for all Q ∈ P(Y^n),
where W_1(P, Q) = inf_{π∈Π(P,Q)} E_π[Σ_{i=1}^n 1{Y_i ≠ Z_i}], Y^n ~ P, Z^n ~ Q.
For any B ⊆ Y^n with P(B) > 0, consider the conditional distribution P_B(·) ≜ P(· ∩ B)/P(B). Then D(P_B‖P) = log(1/P(B)), and therefore (*) gives
W_1(P, P_B) ≤ √((n/2) log(1/P(B))).
62 Marton's Proof.
(*)  W_1(P, P_B) ≤ √((n/2) log(1/P(B))).
Apply (*) to B = A and B = [A]ᶜ_r:
W_1(P, P_A) ≤ √((n/2) log(1/P(A))),  W_1(P, P_{[A]ᶜ_r}) ≤ √((n/2) log(1/(1 − P([A]_r)))).
Then
√((n/2) log(1/P(A))) + √((n/2) log(1/(1 − P([A]_r)))) ≥ W_1(P_A, P) + W_1(P, P_{[A]ᶜ_r})
≥ W_1(P_A, P_{[A]ᶜ_r})   (triangle inequality)
≥ min_{y^n∈A, z^n∈[A]ᶜ_r} d(y^n, z^n) > r   (def. of [·]_r: every z^n ∉ [A]_r is at distance > r from A).
Rearrange to finish the proof.
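For a concrete instance of the lemma, both sides can be computed exactly (illustrative setup: P uniform on {0,1}^n and A a threshold set, for which blowups have closed form; the lemma itself holds for any product measure and any set):

```python
from math import comb, exp, log, sqrt

def binom_cdf(n, k):
    # exact P[Bin(n, 1/2) <= k]
    return sum(comb(n, j) for j in range(0, min(k, n) + 1)) / 2 ** n

# A = {y in {0,1}^n : sum(y) <= k}: a point z is within Hamming distance r of A
# iff sum(z) <= k + r, so the r-blowup [A]_r is again a threshold set.
n, k = 100, 40
pA = binom_cdf(n, k)
lemma_holds = True
for r in range(0, n - k + 1):
    blowup_prob = binom_cdf(n, k + r)
    lower = 1 - exp(-(2 / n) * max(0.0, r - sqrt((n / 2) * log(1 / pA))) ** 2)
    if blowup_prob < lower - 1e-12:
        lemma_holds = False
```

With P(A) ≈ 0.03, blowing up by a few standard deviations (r of order √n) already captures nearly all of the probability mass, exactly as the lemma predicts.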
63 The Blowing-Up Lemma: Consequences.
Consider a DMC (X, Y, T) with input alphabet X, output alphabet Y, and transition probabilities T(y|x), (x, y) ∈ X × Y.
(n, M, ε)-code C: encoder f : {1, ..., M} → X^n and decoder g : Y^n → {1, ..., M} with max_{1≤j≤M} P[g(Y^n) ≠ j | X^n = f(j)] ≤ ε.
Equivalently, C = {(u_j, D_j)}_{j=1}^M, where u_j = f(j) ∈ X^n are the codewords, D_j = g^{−1}(j) = {y^n ∈ Y^n : g(y^n) = j} are the decoding sets, and T^n(D_j|u_j) ≥ 1 − ε, j = 1, ..., M.
Lemma. There exists a sequence δ_n > 0 with δ_n → 0, such that T^n([D_j]_{nδ_n} | u_j) ≥ 1 − 1/n, j = 1, ..., M.
64 Proof.
Choose δ_n = √(log n / (2n)) + √(log(1/(1−ε)) / (2n)), so that δ_n → 0.
For each j, apply the Blowing-Up Lemma to the product measure P_j(y^n) = Π_{i=1}^n T(y_i | u_j(i)), where u_j = (u_j(1), ..., u_j(n)) = f(j) is the jth codeword. Since P_j(D_j) ≥ 1 − ε, taking r = nδ_n gives
T^n([D_j]_{nδ_n} | u_j) ≥ 1 − exp(−(2/n)(nδ_n − √((n/2) log(1/(1−ε))))²) = 1 − 1/n.
65 From Blowing-Up Lemma to Strong Converses.
Informally, the Blowing-Up Lemma shows that any bad code contains a good subcode (Ahlswede and Dueck, 1976).
Consider an (n, M, ε)-code C = {(u_j, D_j)}_{j=1}^M. Each decoding set D_j can be blown up to a set D̃_j ⊆ Y^n with T^n(D̃_j | u_j) ≥ 1 − 1/n.
The object C̃ = {(u_j, D̃_j)}_{j=1}^M is not a code (since the sets D̃_j are no longer disjoint), but a random coding argument can be used to extract an (n, M′, ε′) subcode with M′ slightly smaller than M and ε′ < ε. Then one can apply the usual (weak) converse to the subcode.
Similar ideas can be used in multiterminal settings (starting with Ahlswede–Gács–Körner).
66 Example: Capacity-Achieving Channel Codes.
The set-up: DMC (X, Y, T) with capacity C = C(T) = max_{P_X} I(X; Y).
(n, M)-code: C = (f, g) with encoder f : {1, ..., M} → X^n and decoder g : Y^n → {1, ..., M}.
Capacity-achieving codes: a sequence {C_n}_{n=1}^∞, where each C_n is an (n, M_n)-code, is capacity-achieving if lim_{n→∞} (1/n) log M_n = C.
67 Capacity-Achieving Channel Codes.
Capacity-achieving input and output distributions: P*_X ∈ arg max_{P_X} I(X; Y) (may not be unique); P*_X → T → P*_Y (always unique).
Theorem (Shamai–Verdú, 1997). Let {C_n} be any capacity-achieving code sequence with vanishing error probability. Then
lim_{n→∞} (1/n) D(P^{(C_n)}_{Y^n} ‖ P*_{Y^n}) = 0,
where P^{(C_n)}_{Y^n} is the output distribution induced by the code C_n when the messages in {1, ..., M_n} are equiprobable.
68 Capacity-Achieving Channel Codes.
lim_{n→∞} (1/n) D(P^{(C_n)}_{Y^n} ‖ P*_{Y^n}) = 0.
Main message: channel output sequences induced by a good code resemble i.i.d. sequences drawn from the CAOD P*_Y.
Useful implications: estimate performance characteristics of good channel codes by their expectations w.r.t. P*_{Y^n} = (P*_Y)^{⊗n} — often much easier to compute explicitly; bound the estimation accuracy using large-deviation theory (e.g., Sanov's theorem).
Question: what about good codes with nonvanishing error probability?
69 Codes with Nonvanishing Error Probability.
Y. Polyanskiy and S. Verdú, "Empirical distribution of good channel codes with non-vanishing error probability" (2012):
1. Let C = (f, g) be any (n, M, ε)-code for T: max_{1≤j≤M} P[g(Y^n) ≠ j | X^n = f(j)] ≤ ε. Then D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + o(n).
2. If {C_n}_{n=1}^∞ is a capacity-achieving sequence, where each C_n is an (n, M_n, ε)-code for some fixed ε > 0, then lim_{n→∞} (1/n) D(P^{(C_n)}_{Y^n} ‖ P*_{Y^n}) = 0.
In some cases, the o(n) term can be improved to O(√n).
70 Relative Entropy at the Output of a Code.
Consider a DMC T : X → Y with T(·|·) > 0, and let
c(T) ≜ 2 max_{x∈X} max_{y,y′∈Y} log (T(y|x)/T(y′|x)).
Theorem (Raginsky–Sason, 2013). Any (n, M, ε)-code C for T, where ε ∈ (0, 1/2), satisfies
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + log(1/ε) + c(T)√((n/2) log(1/(1−2ε))).
Remark: Polyanskiy and Verdú show that D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + a√n for some constant a = a(ε).
71 Proof Sketch (1).
Fix x^n ∈ X^n and study the concentration of the function
h_{x^n}(y^n) ≜ log (dP_{Y^n|X^n=x^n}/dP^{(C)}_{Y^n})(y^n)
around its expectation w.r.t. P_{Y^n|X^n=x^n}:
E[h_{x^n}(Y^n) | X^n = x^n] = D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}).
Step 1: Because T(·|·) > 0, the function h_{x^n}(y^n) is 1-Lipschitz w.r.t. the scaled Hamming metric d(y^n, ȳ^n) = c(T) Σ_{i=1}^n 1{y_i ≠ ȳ_i}.
72 Proof Sketch (2).
Step 1: Because T(·|·) > 0, the function h_{x^n}(y^n) is 1-Lipschitz w.r.t. the scaled Hamming metric d(y^n, ȳ^n) = c(T) Σ_{i=1}^n 1{y_i ≠ ȳ_i}.
Step 2: Any product probability measure µ on (Y^n, d) satisfies
log E_µ[e^{tF(Y^n)}] ≤ n c(T)² t² / 8
for any F with E_µ F = 0 and F ∈ Lip_1.
Proof: tensorization of T_1 (Pinsker), followed by an appeal to Bobkov–Götze.
73 Proof Sketch (3).
h_{x^n}(y^n) = log (dP_{Y^n|X^n=x^n}/dP^{(C)}_{Y^n})(y^n), E[h_{x^n}(Y^n) | X^n = x^n] = D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}).
Step 3: For any x^n, µ = P_{Y^n|X^n=x^n} is a product measure, so
P[h_{x^n}(Y^n) ≥ D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}) + r] ≤ exp(−2r²/(n c(T)²)).
Use this with r = c(T)√((n/2) log(1/(1−2ε))):
P[h_{x^n}(Y^n) ≥ D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}) + c(T)√((n/2) log(1/(1−2ε)))] ≤ 1 − 2ε.
Remark: Polyanskiy–Verdú show Var[h_{x^n}(Y^n) | X^n = x^n] = O(n).
74 Proof Sketch (4).
Recall: P[h_{x^n}(Y^n) ≥ D(P_{Y^n|X^n=x^n} ‖ P^{(C)}_{Y^n}) + c(T)√((n/2) log(1/(1−2ε)))] ≤ 1 − 2ε.
Step 4: As in Polyanskiy–Verdú, appeal to Augustin's strong converse (1966) to get
log M ≤ log(1/ε) + D(P^{(C)}_{Y^n|X^n} ‖ P^{(C)}_{Y^n} | P^{(C)}_{X^n}) + c(T)√((n/2) log(1/(1−2ε))).
Combining this with
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) = D(P^{(C)}_{Y^n|X^n} ‖ P*_{Y^n} | P^{(C)}_{X^n}) − D(P^{(C)}_{Y^n|X^n} ‖ P^{(C)}_{Y^n} | P^{(C)}_{X^n})
and D(P^{(C)}_{Y^n|X^n} ‖ P*_{Y^n} | P^{(C)}_{X^n}) ≤ nC gives
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + log(1/ε) + c(T)√((n/2) log(1/(1−2ε))).
75 Relative Entropy at the Output of a Code.
Theorem (Raginsky–Sason, 2013). Let (X, Y, T) be a DMC with C > 0. Then, for any 0 < ε < 1, any (n, M, ε)-code C for T satisfies
D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + O(√n (log n)^{3/2}),
where the constant in the O(·) term is explicit and depends only on |X|, |Y|, and ε.
Proof idea: Apply the Blowing-Up Lemma to the code, then extract a good subcode.
Remark: Polyanskiy and Verdú show that D(P^{(C)}_{Y^n} ‖ P*_{Y^n}) ≤ nC − log M + b√n (log n)^{3/2} for some constant b > 0.
76 Concentration of Lipschitz Functions.
Theorem (Raginsky–Sason, 2013). Let (X, Y, T) be a DMC with c(T) < ∞. Let d : Y^n × Y^n → R_+ be a metric, and suppose that P_{Y^n|X^n=x^n}, for all x^n ∈ X^n, as well as P*_{Y^n}, satisfy T_1(c) for some c > 0. Then, for any ε ∈ (0, 1/2), any (n, M, ε)-code C for T, and any function f : Y^n → R, we have
P^{(C)}_{Y^n}(|f(Y^n) − E[f(Y^n)]| ≥ r) ≤ (4/ε) exp(nC − log M + a√n − r²/(8c‖f‖²_Lip)),  r ≥ 0,
where the expectation E[f(Y^n)] is w.r.t. P*_{Y^n}, and a ≜ c(T)√((1/2) log(1/(1−2ε))).
77 Proof Sketch.
Step 1: For each x^n ∈ X^n, let φ(x^n) ≜ E[f(Y^n) | X^n = x^n]. Then, by Bobkov–Götze,
P(|f(Y^n) − φ(x^n)| ≥ r | X^n = x^n) ≤ 2 exp(−r²/(2c‖f‖²_Lip)).
Step 2: By restricting to the subcode C′ with the M′ = M·P^{(C)}_{X^n}(φ(X^n) ≥ E[f(Y^n)] + r) codewords x^n ∈ X^n satisfying φ(x^n) ≥ E[f(Y^n)] + r, we can show that
r ≤ ‖f‖_Lip √(2c (nC − log M′ + a√n + log(1/ε))).
Solve for M′ to get
P^{(C)}_{X^n}(φ(X^n) ≥ E[f(Y^n)] + r) ≤ 2 exp(nC − log M + a√n + log(1/ε) − r²/(2c‖f‖²_Lip)).
Step 3: Apply the union bound.
78 Empirical Averages at the Code Output.
Equip Y^n with the Hamming metric d(y^n, ȳ^n) = Σ_{i=1}^n 1{y_i ≠ ȳ_i}. Consider functions of the form f(y^n) = (1/n) Σ_{i=1}^n f_i(y_i), where |f_i(y_i) − f_i(ȳ_i)| ≤ L·1{y_i ≠ ȳ_i} for all i, y_i, ȳ_i. Then ‖f‖_Lip ≤ L/n.
Since P_{Y^n|X^n=x^n} for all x^n and P*_{Y^n} are product measures on Y^n, they all satisfy T_1(n/4) (by tensorization). Therefore, for any (n, M, ε)-code and any such f we have
P^{(C)}_{Y^n}(|f(Y^n) − E[f(Y^n)]| ≥ r) ≤ (4/ε) exp(nC − log M + a√n − nr²/(2L²)).
79 Concentration of Measure: Information-Theoretic Converse.
Concentration phenomenon in a nutshell: if a subset of a metric probability space does not have too small a probability mass, then its blowups will eventually take up most of the probability mass.
Question: given a set whose blowups eventually take up most of the probability mass, how small can this set be?
This question was answered by Kontoyiannis (1999) as a consequence of a general information-theoretic converse.
80 Converse Concentration of Measure: The Set-Up.
Let X be a finite set, together with a distortion function d : X × X → R_+ and a mass function M : X → (0, ∞). Extend to the product space X^n:
d_n(x^n, y^n) ≜ Σ_{i=1}^n d(x_i, y_i),  M_n(x^n) ≜ Π_{i=1}^n M(x_i),  M_n(A) ≜ Σ_{x^n∈A} M_n(x^n) for A ⊆ X^n.
Blowups: for A ⊆ X^n, [A]_r ≜ {x^n ∈ X^n : min_{y^n∈A} d_n(x^n, y^n) ≤ r}.
81 Converse Concentration of Measure.
Let P be a probability measure on X. Define
R_n(δ) ≜ min {I(X^n; Y^n) + E[log M_n(Y^n)] : P_{X^n Y^n} s.t. P_{X^n} = P^{⊗n}, E[d_n(X^n, Y^n)] ≤ nδ}.
Theorem (Kontoyiannis). Let A_n ⊆ X^n be an arbitrary set. Then
(1/n) log M_n(A_n) ≥ R(δ),
where δ ≜ (1/n) E[min_{y^n∈A_n} d_n(X^n, y^n)] (with X^n ~ P^{⊗n}) and R(δ) ≜ lim_{n→∞} R_n(δ)/n.
Remark: It can be shown that R(δ) = inf_{n≥1} R_n(δ)/n = R_1(δ).
82 Proof

Define the mapping ϕ_n : X^n → A_n via

ϕ_n(x^n) ≜ arg min_{y^n ∈ A_n} d_n(x^n, y^n),

and let Y^n = ϕ_n(X^n), Q_n = L(Y^n). Then

log M_n(A_n) = log Σ_{y^n ∈ A_n} M_n(y^n)
≥ log Σ_{y^n ∈ A_n : Q_n(y^n) > 0} Q_n(y^n) · M_n(y^n)/Q_n(y^n)
≥ Σ_{y^n ∈ A_n} Q_n(y^n) log ( M_n(y^n)/Q_n(y^n) )     (by Jensen's inequality)
= − Σ_{y^n ∈ A_n} Q_n(y^n) log Q_n(y^n) + Σ_{y^n ∈ A_n} Q_n(y^n) log M_n(y^n)
= H(Y^n) + E log M_n(Y^n)
= I(X^n; Y^n) + E log M_n(Y^n)
≥ R_n(δ).
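The chain above can be checked numerically on a toy example. The sketch below uses illustrative choices of n, p, and A_n: it takes M = P = Bern(p), maps each x^n to its nearest point in A_n under Hamming distance, and verifies log M_n(A_n) ≥ H(Y^n) + E log M_n(Y^n).

```python
import math
from itertools import product

# Toy check of the proof's key inequality with mass function M = P, so that
# M_n(A_n) = P^n(A_n). Y^n is the nearest point of A_n to X^n.

n, p = 3, 0.3
cube = list(product((0, 1), repeat=n))

def hamming(x, y):
    return sum(a != b for a, b in zip(x, y))

def Pn(x):                      # product (Bern(p))^n mass of a point
    k = sum(x)
    return p**k * (1 - p)**(n - k)

A = [(0, 0, 0), (1, 1, 0), (0, 1, 1)]   # an arbitrary small set

# Push-forward Q_n of P^n under the nearest-point map (ties by list order)
Q = {y: 0.0 for y in A}
for x in cube:
    Q[min(A, key=lambda a: hamming(x, a))] += Pn(x)

H_Y = -sum(q * math.log(q) for q in Q.values() if q > 0)
E_logM = sum(Q[y] * math.log(Pn(y)) for y in A)
lhs = math.log(sum(Pn(y) for y in A))    # log M_n(A_n)
print(lhs, H_Y + E_logM)
```

Jensen's inequality guarantees lhs ≥ H(Y^n) + E log M_n(Y^n) for any push-forward supported on A_n, which the printout confirms on this instance.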
83 Converse Concentration of Measure

Consider a sequence of sets {A_n}_{n=1}^∞ with

P^⊗n( [A_n]_{nδ} ) → 1 as n → ∞.   (★)

Apply Kontoyiannis' converse with the mass function M = P to get the following:

Corollary. If the sequence {A_n} satisfies (★), then

liminf_{n→∞} (1/n) log P^⊗n(A_n) ≥ R(δ),

where the concentration exponent is

R(δ) = min over P_XY { I(X; Y) + E log P(Y) : P_X = P, E[d(X, Y)] ≤ δ }
     = − max over P_XY { H(Y|X) + D(P_Y ‖ P) : P_X = P, E[d(X, Y)] ≤ δ }.
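For a binary alphabet, the single-letter concentration exponent can be approximated by a grid search over channels P_{Y|X}. The sketch below is illustrative (grid resolution and p are arbitrary choices): it uses the max form R(δ) = −max{ H(Y|X) + D(P_Y ‖ P) } with the channel parameterized by a = P(Y=1|X=0) and b = P(Y=0|X=1), and Hamming distortion.

```python
import math

# Grid-search sketch of R(delta) = -max{ H(Y|X) + D(P_Y || P) :
#   P_X = Bern(p), E[d(X,Y)] <= delta } for Hamming d on {0,1}.

def h(x):
    """Binary entropy in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def kl_bern(q, p):
    """D(Bern(q) || Bern(p)) in nats, with the usual 0 log 0 = 0 convention."""
    val = 0.0
    if q > 0:
        val += q * math.log(q / p)
    if q < 1:
        val += (1 - q) * math.log((1 - q) / (1 - p))
    return val

def R(delta, p, grid=200):
    best = -float("inf")
    for i in range(grid + 1):
        for j in range(grid + 1):
            a, b = i / grid, j / grid          # a = P(Y=1|X=0), b = P(Y=0|X=1)
            if (1 - p) * a + p * b <= delta + 1e-12:   # E[d(X,Y)] <= delta
                q = (1 - p) * a + p * (1 - b)          # P_Y(1)
                val = (1 - p) * h(a) + p * h(b) + kl_bern(q, p)
                best = max(best, val)
    return -best

p = 0.1
print(R(0.0, p), R(1.0, p))   # R(0) = 0; R(1) = log p = -log 10
```

At δ = 0 the only feasible channel is Y = X and the exponent vanishes; at δ = 1 the optimum sets Y ≡ 1, giving R(1) = log p, consistent with the Bernoulli example on the next slide.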
84 Example of the Concentration Exponent

Theorem (Raginsky–Sason, 2013). Let P = Bern(p). Then

R(δ) ≤ −ϕ(p)δ² − (1−p) h( δ/(1−p) ),  if δ ∈ [0, 1−p]
R(δ) = log p,                          if δ ∈ [1−p, 1]

where

ϕ(p) ≜ (1/(1−2p)) log((1−p)/p)

and h(·) is the binary entropy function.

Remarks:

The upper bound is not tight, but in this case R(δ) can be evaluated numerically (cf. Kontoyiannis, 2001).

The proof is by a coupling argument.
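Assuming the piecewise form read off this slide (natural logarithms; treat the exact constants as a reconstruction to be checked against the ISIT 2013 paper), the bound is straightforward to evaluate:

```python
import math

# Evaluate the slide's closed-form bound on the Bernoulli concentration
# exponent (reconstruction; natural logs throughout).

def h(x):
    """Binary entropy in nats."""
    return 0.0 if x in (0.0, 1.0) else -x * math.log(x) - (1 - x) * math.log(1 - x)

def phi(p):
    # limit value at p = 1/2; otherwise the displayed formula
    return 2.0 if p == 0.5 else math.log((1 - p) / p) / (1 - 2 * p)

def R_upper(delta, p):
    if delta <= 1 - p:
        return -phi(p) * delta**2 - (1 - p) * h(delta / (1 - p))
    return math.log(p)

p = 0.1
print(R_upper(0.0, p), R_upper(0.5, p), R_upper(1.0, p))
```

The bound vanishes at δ = 0, decreases with δ, and matches the exact value log p on [1−p, 1].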
85 Summary Three related methods for obtaining sharp concentration inequalities in high dimension: 1. The entropy method 2. Log-Sobolev inequalities 3. Transportation-cost inequalities All three methods crucially rely on tensorization: Breaking the original high-dimensional problem into low-dimensional pieces, exploiting low-dimensional structure to control entropy locally, assembling local information into a global bound. Tensorization is a consequence of independence. Applications to information theory: Exploit the problem structure to isolate independence (e.g., output distribution of a DMC for any fixed input block).
86 What We Had to Skip Log-Sobolev inequalities and hypercontractivity Log-Sobolev inequalities when Herbst fails (e.g., Poisson measures) Connections to isoperimetric inequalities HWI inequalities: tying together relative entropy, Wasserstein distance, Fisher information Concentration inequalities for functions of dependent random variables For this and more, consult our monograph: M. Raginsky and I. Sason, Concentration of Measure Inequalities in Info. Theory, Comm. and Coding, FnT, 2nd edition, 2014.
87 Recent Books and Surveys - Concentration Inequalities

1. N. Alon and J. H. Spencer, The Probabilistic Method, Wiley, 3rd edition, 2008.
2. S. Boucheron, G. Lugosi, and P. Massart, Concentration Inequalities: A Nonasymptotic Theory of Independence, Oxford University Press, 2013.
3. F. Chung and L. Lu, Complex Graphs and Networks, CBMS Regional Conference Series in Mathematics, vol. 107, AMS, 2006.
4. D. P. Dubhashi and A. Panconesi, Concentration of Measure for the Analysis of Randomized Algorithms, Cambridge University Press, 2009.
5. M. Ledoux, The Concentration of Measure Phenomenon, Mathematical Surveys and Monographs, vol. 89, AMS, 2001.
6. C. McDiarmid, "Concentration," in Probabilistic Methods for Algorithmic Discrete Mathematics, pp. 195-248, Springer, 1998.
7. M. Raginsky and I. Sason, Concentration of Measure Inequalities in Information Theory, Communications, and Coding, Foundations and Trends in Communications and Information Theory, 2nd edition, 2014.
8. R. van Handel, Probability in High Dimension, lecture notes, 2014.