Distance-Divergence Inequalities

Size: px

Start display at page:

Download "Distance-Divergence Inequalities"

Simon Walter Parrish
5 years ago
Views:

1 Distance-Divergence Inequalities Katalin Marton Alfréd Rényi Institute of Mathematics of the Hungarian Academy of Sciences

2 Motivation To find a simple proof of the Blowing-up Lemma, proved by Ahlswede, Gács and Körner, with the aim to prove Strong Converses in Shannon theory; Ornstein s Copying Lemma in the proof of his Isomorphism Theorem; My experience with Single-Letter Characterization techniques in Shannon theory.

3 What does measure concentration mean? Enlargements and distances of sets Notation (X, d, q): a metric probability space. More precisely, Assume: X is a Polish space, with the Borel σ-algebra; d is a Borel measurable distance on X ; q is a Borel probability measure on X. Subsets of X are assumed to be measurable.

4 Notation For A X and κ > 0: The κ-enlargement (or κ-neighborhood) of the set A is { } [A] κ = x X : y A, d(x, y) κ. Informal definition of Measure Concentration: The a metric probability space (X, d, q) satisfies the measure concentration property if: For every ε > 0 there exists a κ = κ(ε) such that q(a) ε = q([a] κ ) 1 ε, A X, and κ(ε) can be overbounded in a non-trivial way.

6 By the equality d ( A, [A] c κ) = κ : Equivalent definition The a metric probability space (X, d, q) satisfies the measure concentration property if: There is a function κ : R + R + such that: For every A, B X d(a, B) κ ( q(a) ) + κ ( q(b) ), A, B X, and κ can be overbounded in a non-trivial way.

7 Sub-Gaussian measure concentration κ(ε) = c(q) log 1 ( ( ) κ 2 ) ε(κ) = exp ε c(q) I.e.: d(a, B) c(q) [ log ] 1 q(a) + 1 log, A, B X. q(b)

8 Example: Hamming distance, C. McDiarmid 1989, Marton 1996 X n : n-th power of a Borel space X ; δ n (x n, y n ) = n i=1 δ(x i, y i ): Hamming distance on X n ; x n = (x 1, x 2,..., x n ); q n = n i=1 q i: product measure on X n. Then, taking neighborhood with respect to Hamming distance: q n ([A] κ ) 1 ( ( ) ) κ exp 2n n 1 2n log 1 2 q n, κ, A X n. (A)

9 Cont d q n ([A] κ ) 1 ( ( ) ) κ exp 2n n 1 2n log 1 2 q n, κ, A X n. (A) Equivalently: 1 δ n (A, B) 2n log 1 q n (A) + 1 2n log 1 q n (B), A, B X n Thus, for Hamming distance δ n 1 κ(ε) = 2n log 1 ε.

10 Cont d q n ([A] κ ) 1 ( ( ) ) κ exp 2n n 1 2n log 1 2 q n, κ, A X n. (A) Equivalently: 1 δ n (A, B) 2n log 1 q n (A) + 1 2n log 1 q n (B), A, B X n Thus, for Hamming distance δ n 1 κ(ε) = 2n log 1 ε.

11 Example: Euclidean distance, Talagrand 1996 R n : n-dimensional Euclidean space; e n : Euclidean distance; q n = n i=1 q i = n i=1 exp( V i): product density on R n. Assume: V i : R R is strictly uniformly convex: V i (x) ρ > 0, x. Then: e n (A, B) 2 ρ log 1 q n (A) + 2 ρ log 1 q n (B), A, B X n.

12 Example: Euclidean distance, Talagrand 1996 R n : n-dimensional Euclidean space; e n : Euclidean distance; q n = n i=1 q i = n i=1 exp( V i): product density on R n. Assume: V i : R R is strictly uniformly convex: V i (x) ρ > 0, x. Then: e n (A, B) 2 ρ log 1 q n (A) + 2 ρ log 1 q n (B), A, B X n.

13 Cont d Assume: V i : R R is strictly uniformly convex: V i (x) ρ > 0, x. Then: e n (A, B) 2 ρ log 1 q n (A) + 2 ρ log 1 q n (B), A, B X n. Thus, for Euclidean distance, and for product density q n = n i=1 exp( V i) satisfying V i ρ > 0, 2 κ(ε) = ρ log 1, independently of n. ε

14 Cont d Assume: V i : R R is strictly uniformly convex: V i (x) ρ > 0, x. Then: e n (A, B) 2 ρ log 1 q n (A) + 2 ρ log 1 q n (B), A, B X n. Thus, for Euclidean distance, and for product density q n = n i=1 exp( V i) satisfying V i ρ > 0, 2 κ(ε) = ρ log 1, independently of n. ε

15 Distance-Divergence Inequalities: A way to prove measure concentration Notation P(X ): the set of probability measures on X. q is a distinguished element of P(X ). Definition: set ofcouplings Π(p, r) For p, r P(X ): Π(p, r) set of all couplings of p and r set of all joint probability distributions π = L(Y, Z) on X X, with marginals: p = L(Y ) and r = L(Z).

16 Definition: Wasserstein distance Fix a β, 1 β 2. For p, r P(X ) : W β (p, r) = W (d) β (p, r) { ( E π d(y, Z) β) } 1/β : L(Y ) = p, L(Z) = r,. inf π Π(p,r) W β (p, r) is the Wasserstein distance, of order β, of p and r. Claim W β is a distance on P(X ). (May take the value.)

17 Definition: Wasserstein distance Fix a β, 1 β 2. For p, r P(X ) : W β (p, r) = W (d) β (p, r) { ( E π d(y, Z) β) } 1/β : L(Y ) = p, L(Z) = q,. inf π Π(p,r) W β (p, r) is the Wasserstein distance, of order β, of p and r. Example: Variational distance δ: Kronecker s distance: δ(x, y) = 1 for x y, δ(x, x) = 0. W (δ) 1 (p, r) p r T V = min E π δ(y, Z) ( ) = min P r π {Y Z} = max p(a) r(a), A X where π = L(Y, Z) Π(p, r).

18 Definition: Wasserstein distance Fix a β, 1 β 2. For p, r P(X ) : W β (p, r) = W (d) β (p, r) { ( E π d(y, Z) β) } 1/β : L(Y ) = p, L(Z) = q,. inf π Π(p,r) W β (p, r) is the Wasserstein distance, of order β, of p and r. Example: Variational distance δ: Kronecker s distance: δ(x, y) = 1 for x y, δ(x, x) = 0. W (δ) 1 (p, r) p r T V = min E π δ(y, Z) ( ) = min P r π {Y Z} = max p(a) r(a), A X where π = L(Y, Z) Π(p, r).

19 Definition: (Informational) Divergence For p, q P(X ) : ( dp ) D(p q) log dp dq if p q; + otherwise. X Definition: Distance-Divergence Inequality q P(X ) satisfies a distance-divergence inequality of order β, with constant ρ, if: 2 W β (p, q) D(p q) for all p P(X ). ρ Most important cases: β = 1 and β = 2.

20 Definition: Distance-Divergence Inequality q P(X ) satisfies a distance-divergence inequality of order β, with constant ρ, if: 2 W β (p, q) D(p q) for all p P(X ). ρ Example: Pinsker-Csiszár-Kullback inequality p q T V 1 D(p q) p, q P(X ). 2 This distance-divergence inequality holds with the same constant ρ = 4 for every q.

21 Definition: Distance-Divergence Inequality q P(X ) satisfies a distance-divergence inequality of order β, with constant ρ, if: 2 W β (p, q) D(p q) for all p P(X ). ρ Example: Pinsker-Csiszár-Kullback inequality p q T V 1 D(p q) p, q P(X ). 2 This distance-divergence inequality holds with the same constant ρ = 4 for every q.

22 Notation W 2 : Wasserstein distance on P(R) from the Euclidean distance, i.e.: For densities p and r on R W 2 (p, r) = W (e) 2 (p, r) { ( = inf E π Y Z 2) } 1/2 : L(Y ) = p, L(Z) = q. π Π(p,r)

23 M. Talagrand s Theorem on R, 1996 Assume: q(x) = exp( V (x)) strictly uniformly log-concave density on R, i.e., V (x) ρ > 0, x. E.g., for q normal, ρ = 1/Variance. Then: W 2 (p, r) 2 ρ D(p q), p P(R).

24 The proof of Talagrand s Theorem is based on the unique monotone map T that takes the measure p on R to the measure q (provided both measures are absolutely continuous): x q(t)dt = T (x) p(t)dt. A monotone map taking a given density p to the density q exists also on R n. (Proved, e.g., by Y. Brenier 1991 or R. McCann 1994.)

25 Talagrand s Theorem on R n Assume: q(x) = exp( V (x)): strictly uniformly log-concave density on R n : HessV (x) ρ Id x, ρ > 0. Then: W 2 (p, q) 2 ρ D(p q) p P(R). W 2 denotes the Wasserstein distance on P(R n ) derived from the Euclidean distance. Proved e.g. by S. Bobkov - M. Ledoux 2000, F. Otto - C. Villani 2000, D. Cordero-Erausquin

26 Lemma: Distance-Divergence Inequality implies Measure Concentration, Marton 1986, 1996 (X, d, q): metric probability space, W (d) β Wasserstein distance of order β associated with d. Then: W (d) 2 β (p, q) D(p q), p P(X ) ρ = d(a, B) 2 ρ log 1 q(a) + 2 ρ log 1 q(b), A, B X.

27 Proof Define Triangle inequality q(a C) p = q A : p(c) =, q(a) q(b C) r = q B : r(c) =. q(b) + Distance-Divergence inequality: d(a, B) W β (p, r) W β (p, q) + W β (q, r) 2 2 ρ D(p q) + D(r q) ρ β [1, 2].

28 Cont d d(a, B) 2 2 ρ D(p q) + ρ D(r q). Further: and similarly p(x) q(x) = 1 q(a) = D(p q) = log p almost everywhere 1 q(a), D(r q) = log 1 q(b).

29 Product spaces Definition: Distances in Product Spaces Given the metric d on X, define distances on X n : ( n ) 1/β d β,n (x n, y n ) = d(x i, y i ) β, 1 β 2. i=1 Hamming distance: from Kronecker s distance, with β = 1, Euclidean distance: from one-dimensional Euclidean distance, with β = 2.

30 Definition: Wasserstein Distances in Product Spaces The distance d β,n on X n gives rise to the Wasserstein distance W β,n = W (d) β,n on P(X n ): W β,n (p n, r n ) = W (d) β,n (pn, r n ) { ( n = inf E π d(y i, Z i ) β) 1/β : π Π(p n,r n ) i=1 L(Y n ) = p n, L(Z n ) = r n }.

31 Lemma: Tensorisation of Distance-Divergence inequalities, Marton 1996, M. Talagrand 1996 q n = n i=1 q i: product measure on X n. Then: 2 W β (p, q i ) ρ D(p q i), i, p P(X ) = Thus: W β,n (p n, q n ) 2n 2/β 1 ρ D(p n q n ), p n P(X n ). For Hamming distance: n-fold product measures satisfy a distance-divergence inequality with the same constant 4/n. For Euclidean distance: n-fold product measures satisfy distance-divergence inequalities with the smallest of the constants valid for the factors, independently of n.

32 Lemma: Tensorisation of Distance-Divergence inequalities, Marton 1996, M. Talagrand 1996 q n = n i=1 q i: product measure on X n. Then: 2 W β (p, q i ) ρ D(p q i), i, p P(X ) = Thus: W β,n (p n, q n ) 2n 2/β 1 ρ D(p n q n ), p n P(X n ). For Hamming distance: n-fold product measures satisfy a distance-divergence inequality with the same constant 4/n. For Euclidean distance: n-fold product measures satisfy distance-divergence inequalities with the smallest of the constants valid for the factors, independently of n.

33 Proof From couplings between q i and the conditional distributions p i ( y i 1 ) L(Y i Y 1 = y 1,..., Y i 1 = y i 1 ) construct good couplings between measures p n and q n on X n : Successively define L ( Y i, X i Y i 1 = y i 1, X i 1 = x i 1) best coupling of p i ( y i 1 ) and q i.

34 Cont d We have defined a joint distribution π = L(Y n, X n ) with marginals p n = L(Y n ), q n = L(X n ). E π δ n (Y n, X n ) = = n P r π {Y i X i } i=1 n E { Y i X i Y i 1} i=1 n i=1 n 1 = 2 D(Y i Y i 1 qi ) i=1 n n 2 D(Y i Y i 1 qi ) = i=1 n 2 D(pn q n ). 1 2 D( Y i Y i 1 qi ( Y i 1 ) )

35 Distance-Divergence Inequalities and Large Deviation type Inequalities S. Bobkov and Götze s theorem, 1999 (X, d, q): metric probability space. The following two statements are equivalent: (i) W (d) 1 (p, q) (ii) X 2 D(p q) ρ for all p P(X ), ) ( t e tf(x) 2 dq(x) exp 2ρ + t E qf for all Lipschitz functions f : X R with Lipschitz coefficient 1, and all t > 0.

36 Remark Bobkov and Götze actually proved that: For any fixed function f : X R ( ) t e tf(x) 2 dq(x) exp 2ρ + t E qf X = Ep f E q f 2 D(p q) ρ for all p P(X ).

37 Distance-Divergence Inequalities and Large Deviation type Inequalities N. Gozlan s theorem, 2009 (X, d, q): metric probability space, q n : n-th power of q. The following two statements are equivalent: (i) W (d) 2 (p, q) (ii) q n {x n : f(x n ) Ef + r 2 D(p q) ρ for all p P(X ), } ) ( ρr2 exp for all n, all Lipschitz functions f : X n R with Lipschitz coefficient 1 (with respect to distance d 2,n ), and all r > 0. 2

38 Measure Concentration for Non-product Measures: Contracting Markov Chains, Marton 1996 Assume: q 1 : probability measure on the Borel space X ; q i ( x): Markov kernels, X X, Contracting: q n on X n : Then: q i ( x) q i ( y) T V (1 β), β > 0, x, y; q n (x n ) q 1 (x 1 ) n q i (x i x i 1 ), x n X n. i=2 W (δ) 1,n (pn, q n ) 1 β 1 2n D(pn q n ), p n P(X ).

39 Measure Concentration for Non-product Measures X n : n-th power of a Borel space X ; q n : Borel probability measure on X n ; X n : random sequence with L(X n ) = q n ; p n : another probability measure on X n ; Y n : random sequence in X n, L(Y n ) = p n. Notation For y n X n and 1 i n : y i (y 1, y 2,..., y i ) and y n i (y i+1, y i+2,..., y n ); p i ( y i 1 ) L(Y i Y i 1 = y i 1 ); p n i ( y i ) L(Y n i Y i = y i );

40 Definition: Measures admitting coupling with bounded distance q n admits coupling with distance bounded by constant C if: For every i n 1 and every pair of sequences y i, z i X i such that y i 1 = z i 1, y i z i, there exists a coupling of q n i ( yi ) and q n i ( zi ): π n i ( y i, z i ) = L(Y n i, Z n i Y i = y i, Z i = z i ) satisfying E π n i { n j=i+1 δ(y j, Z j ) } y i, z i ) C.

41 Theorem: Measure Concentration for random processes, Marton 1998 Assume: q n admits coupling with distance bounded by the constant C. Then W (δ) n 1,n (pn, q n ) (C + 1) 2 D(pn q n ). Thus: For measures admitting couplings with bounded distance, the distance-divergence inequality is worse by a constant factor (only) than the inequality for product measures. The Theorem applies for segments of sufficiently mixing stationary random processes.

42 Measure Concentration for Gibbs Measures Notation q n : Borel probability measure on X n ; X n : random sequence, L(X n ) = q n ; For y n X n : ȳ i (y 1, y 2,..., y i 1, y i+1,..., y n ); Q i ( ȳ i ) L(X i X i = ȳ i ). The conditional distributions Q i ( ȳ i ) are called the local specifications of q n.

43 (Non-standard) Definition: Gibbs Measures q n, when defined by the local specifications Q i ( ȳ i ), is called a Gibbs measure.

44 Definition: Dobrushin s Uniqueness Condition q n satisfies Dobrushin s uniqueness condition if: there exist numbers a i,k 0, (i, k [1, n], i k), such that: (i) For any i, k, i k, and any ȳ k, z k X n 1, differing only in the i-th coordinate: Q k ( ȳ k ) Q k ( z k ) T V a i,k ; (ii) For the matrix A ( a i,k ) n i,k=1, where a i,i = 0, we have A = max k n a i,k < 1. The numbers a i,k are called Dobrushin s interdependence coefficients. i=1

45 Definition: Gibbs Sampler Gibbs sampler associated with the local specifications Q i ( ȳ i ): Markov chain with state space X n and transition law Γ(z n y n ) defined as follows: Given y n X n, (i) select a random index i [1, n] independently of y n and uniformly distributed; (ii) Set Γ(z n y n ) = δȳi, z i Q i (z i ȳ i ). Dobrushin s uniqueness condition implies that the Gibbs sampler Γ is a contraction with respect to the distance W (δ) W (δ) 1,n (pn Γ, r n Γ) ( 1 1 A n ) W (δ) 1,n (pn, r n ). 1,n :

46 Definition: Gibbs Sampler Gibbs sampler associated with the local specifications Q i ( ȳ i ): Markov chain with state space X n and transition law Γ(z n y n ) defined as follows: Given y n X n, (i) select a random index i [1, n] independently of y n and uniformly distributed; (ii) Set Γ(z n y n ) = δȳi, z i Q i (z i ȳ i ). Dobrushin s uniqueness condition implies that the Gibbs sampler Γ is a contraction with respect to the distance W (δ) W (δ) 1,n (pn Γ, r n Γ) ( 1 1 A n ) W (δ) 1,n (pn, r n ). 1,n :

47 Theorem, Li Ming Wu 2006 If q n satisfies Dobrushin s uniqueness condition then W (δ) 1,n (pn, q n 1 n ) 1 A 2 D(pn q n ) for all p n P(X ) n. Remark A similar theorem holds for Gibbs samplers with state space R n and W (e) 2,n. (Wiith the appropriate modification of Dobrushin s uniqueness condition.) Proved by Marton 2004, 2009.

48 Theorem, Li Ming Wu 2006 If q n satisfies Dobrushin s uniqueness condition then W (δ) 1,n (pn, q n 1 n ) 1 A 2 D(pn q n ) for all p n P(X ) n. Remark A similar theorem holds for Gibbs samplers with state space R n and W (e) 2,n. (Wiith the appropriate modification of Dobrushin s uniqueness condition.) Proved by Marton 2004, 2009.

49 Rate of decrease of divergence for Gibbs samplers Theorem: The Gibbs sampler Decreases Divergence by a rate < 1, Marton, in preparation Assume (i) α inf x n,i Q i (x i x i ) > 0; (ii) q n satisfies the L 2 -version of Dobrushin s uniqueness condition: For the matrix of Dobrushin s interdependence coefficients A 2 < 1. Then: ( α (1 A ) 2 ) 2 D(p n Γ q n ) 1 2n D(p n q n ).

50 Cont d The proof uses the following "Converse" of Pinsker s lemma Let q be a probability measure on a finite set X, and assume: α min x q(x) > 0. Then: D(p q) 4 α p q 2 T V, p P(X ).

51 Corollary D(p n Γ t q n ) 0 exponentially, as t.

52 Distances between measures that are not Wasserstein distances Example: X n : product Borel space, p n and r n : Borel probability measures on X n. W (p n, r n ) min n } P rπ{ 2 Yi Z i. π=l(y n,z n ) Π(p n,r n ) i=1

Logarithmic Sobolev inequalities in discrete product spaces: proof by a transportation cost distance

Logarithmic Sobolev inequalities in discrete product spaces: proof by a transportation cost distance Katalin Marton Alfréd Rényi Institute of Mathematics of the Hungarian Academy of Sciences Relative entropy