MIT OpenCourseWare
http://ocw.mit.edu

18.175 Theory of Probability
Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
Section: Prekopa-Leindler inequality, entropy and concentration.

In this section we will make several connections between the Kantorovich-Rubinstein theorem and other classical objects. Let us start with the following classical inequality.

Theorem 51 (Prekopa-Leindler) Consider nonnegative integrable functions $w, u, v : \mathbb{R}^n \to [0,\infty)$ such that for some $\lambda \in (0,1)$,
\[
w(\lambda x + (1-\lambda)y) \ge u(x)^{\lambda} v(y)^{1-\lambda} \quad \text{for all } x, y \in \mathbb{R}^n.
\]
Then,
\[
\int w\,dx \ge \Bigl(\int u\,dx\Bigr)^{\lambda} \Bigl(\int v\,dx\Bigr)^{1-\lambda}.
\]

Proof. The proof will proceed by induction on $n$. Let us first show the induction step. Suppose the statement holds for $n$ and we would like to show it for $n+1$. By assumption, for any $x, y \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$,
\[
w(\lambda x + (1-\lambda)y, \lambda a + (1-\lambda)b) \ge u(x,a)^{\lambda} v(y,b)^{1-\lambda}.
\]
Let us fix $a$ and $b$ and consider the functions
\[
\bar{w}(x) = w(x, \lambda a + (1-\lambda)b), \quad \bar{u}(x) = u(x,a), \quad \bar{v}(x) = v(x,b)
\]
on $\mathbb{R}^n$ that satisfy
\[
\bar{w}(\lambda x + (1-\lambda)y) \ge \bar{u}(x)^{\lambda} \bar{v}(y)^{1-\lambda}.
\]
By the induction assumption,
\[
\int_{\mathbb{R}^n} \bar{w}\,dx \ge \Bigl(\int_{\mathbb{R}^n} \bar{u}\,dx\Bigr)^{\lambda} \Bigl(\int_{\mathbb{R}^n} \bar{v}\,dx\Bigr)^{1-\lambda}.
\]
These integrals still depend on $a$ and $b$ and we can define
\[
\tilde{w}(\lambda a + (1-\lambda)b) = \int_{\mathbb{R}^n} \bar{w}\,dx = \int_{\mathbb{R}^n} w(x, \lambda a + (1-\lambda)b)\,dx
\]
and, similarly,
\[
\tilde{u}(a) = \int_{\mathbb{R}^n} u(x,a)\,dx, \quad \tilde{v}(b) = \int_{\mathbb{R}^n} v(x,b)\,dx,
\]
so that
\[
\tilde{w}(\lambda a + (1-\lambda)b) \ge \tilde{u}(a)^{\lambda} \tilde{v}(b)^{1-\lambda}.
\]
These functions are defined on $\mathbb{R}$ and, by the induction assumption (the case $n=1$),
\[
\int_{\mathbb{R}} \tilde{w}\,ds \ge \Bigl(\int_{\mathbb{R}} \tilde{u}\,ds\Bigr)^{\lambda} \Bigl(\int_{\mathbb{R}} \tilde{v}\,ds\Bigr)^{1-\lambda},
\quad\text{i.e.}\quad
\int_{\mathbb{R}^{n+1}} w\,dz \ge \Bigl(\int_{\mathbb{R}^{n+1}} u\,dz\Bigr)^{\lambda} \Bigl(\int_{\mathbb{R}^{n+1}} v\,dz\Bigr)^{1-\lambda},
\]
which finishes the proof of the induction step. It remains to prove the case $n=1$. Let us show two different proofs.

1. One approach is based on the Brunn-Minkowski inequality on the real line, which says that if $\gamma$ is the Lebesgue measure and $A, B$ are Borel sets on $\mathbb{R}$, then
\[
\gamma(\lambda A + (1-\lambda)B) \ge \lambda\gamma(A) + (1-\lambda)\gamma(B),
\]
where $A+B$ is the set addition, i.e. $A+B = \{a+b : a \in A,\, b \in B\}$. We can also assume that $u, v, w : \mathbb{R} \to [0,1]$ because the inequality is homogeneous with respect to scaling. We have
\[
\{w \ge a\} \supseteq \lambda\{u \ge a\} + (1-\lambda)\{v \ge a\},
\]
because if $u(x) \ge a$ and $v(y) \ge a$ then, by assumption,
\[
w(\lambda x + (1-\lambda)y) \ge u(x)^{\lambda} v(y)^{1-\lambda} \ge a^{\lambda} a^{1-\lambda} = a.
\]
The Brunn-Minkowski inequality implies that
\[
\gamma(w \ge a) \ge \lambda\gamma(u \ge a) + (1-\lambda)\gamma(v \ge a).
\]
Finally,
\[
\int_{\mathbb{R}} w(z)\,dz = \int_{\mathbb{R}} \int_0^{\infty} I(x \le w(z))\,dx\,dz = \int_0^{\infty} \gamma(w \ge x)\,dx
\ge \lambda \int_0^{\infty} \gamma(u \ge x)\,dx + (1-\lambda) \int_0^{\infty} \gamma(v \ge x)\,dx
\]
\[
= \lambda \int_{\mathbb{R}} u(z)\,dz + (1-\lambda) \int_{\mathbb{R}} v(z)\,dz
\ge \Bigl(\int_{\mathbb{R}} u(z)\,dz\Bigr)^{\lambda} \Bigl(\int_{\mathbb{R}} v(z)\,dz\Bigr)^{1-\lambda}.
\]

2. Another approach is based on the transportation of measure. We can assume that $\int u = \int v = 1$ by rescaling
\[
u \to \frac{u}{\int u}, \quad v \to \frac{v}{\int v}, \quad w \to \frac{w}{(\int u)^{\lambda}(\int v)^{1-\lambda}}.
\]
Then we need to show that $\int w \ge 1$. Without loss of generality, let us assume that $u, v$ are smooth and strictly positive, since one can easily reduce to this case. Define $x(t), y(t)$ for $0 \le t \le 1$ by
\[
\int_{-\infty}^{x(t)} u(s)\,ds = t, \quad \int_{-\infty}^{y(t)} v(s)\,ds = t.
\]
Then
\[
u(x(t))\,x'(t) = 1, \quad v(y(t))\,y'(t) = 1,
\]
and the derivatives $x'(t), y'(t) > 0$. Define $z(t) = \lambda x(t) + (1-\lambda)y(t)$. Then
\[
\int_{-\infty}^{+\infty} w(s)\,ds \ge \int_0^1 w(z(s))\,dz(s) = \int_0^1 w(\lambda x(s) + (1-\lambda)y(s))\,z'(s)\,ds.
\]
By the arithmetic-geometric mean inequality,
\[
z'(s) = \lambda x'(s) + (1-\lambda)y'(s) \ge (x'(s))^{\lambda} (y'(s))^{1-\lambda}
\]
and, by assumption,
\[
w(\lambda x(s) + (1-\lambda)y(s)) \ge u(x(s))^{\lambda} v(y(s))^{1-\lambda}.
\]
Therefore,
\[
\int w(s)\,ds \ge \int_0^1 \bigl(u(x(s))\,x'(s)\bigr)^{\lambda} \bigl(v(y(s))\,y'(s)\bigr)^{1-\lambda}\,ds = \int_0^1 ds = 1.
\]
This finishes the proof of the theorem. (A numerical sanity check of the one-dimensional case appears after the discussion of entropy below.)

Entropy and the Kullback-Leibler divergence. Consider a probability measure $P$ on $\mathbb{R}^n$ and a nonnegative measurable function $u : \mathbb{R}^n \to [0,\infty)$.

Definition (Entropy) We define the entropy of $u$ with respect to $P$ by
\[
\operatorname{Ent}_P(u) = \int u \log u\,dP - \int u\,dP \,\log \int u\,dP.
\]
One can give a different representation of entropy:
\[
\operatorname{Ent}_P(u) = \sup\Bigl\{ \int uv\,dP : \int e^{v}\,dP \le 1 \Bigr\}. \tag{1}
\]
Indeed, if we consider the convex set $V = \{v : \int e^v\,dP \le 1\}$ then the above supremum is obviously a solution of the following saddle point problem:
\[
L(v,\lambda) = \int uv\,dP - \lambda\Bigl(\int e^v\,dP - 1\Bigr) \to \sup_{v}\,\inf_{\lambda \ge 0}.
\]
The functional $L$ is linear in $\lambda$ and concave in $v$. Therefore, by the minimax theorem, a saddle point solution exists and $\sup\inf = \inf\sup$. The integral
\[
\int uv\,dP - \lambda \int e^v\,dP = \int (uv - \lambda e^v)\,dP
\]
can be maximized pointwise by taking $v$ such that $u = \lambda e^v$. Then
\[
L(v,\lambda) = \int u \log\frac{u}{\lambda}\,dP - \int u\,dP + \lambda,
\]
and minimizing over $\lambda$ gives $\lambda = \int u\,dP$ and $v = \log\bigl(u / \int u\,dP\bigr)$. This proves (1).

Suppose now that a law $Q$ is absolutely continuous with respect to $P$ and denote its Radon-Nikodym derivative by
\[
u = \frac{dQ}{dP}. \tag{2}
\]

Definition (Kullback-Leibler divergence) The quantity
\[
D(Q\,\|\,P) := \int \log u\,dQ = \int \log\frac{dQ}{dP}\,dQ
\]
is called the Kullback-Leibler divergence between $P$ and $Q$. Clearly, $D(Q\,\|\,P) = \operatorname{Ent}_P(u)$, since
\[
\operatorname{Ent}_P(u) = \int \frac{dQ}{dP}\log\frac{dQ}{dP}\,dP - \int \frac{dQ}{dP}\,dP\,\log\int\frac{dQ}{dP}\,dP = \int \log\frac{dQ}{dP}\,dQ.
\]
The variational characterization (1) implies that if $\int e^v\,dP \le 1$ then
\[
\int v\,dQ = \int uv\,dP \le D(Q\,\|\,P). \tag{3}
\]
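The identities above are easy to sanity-check on a finite probability space. The sketch below is not part of the notes: the measure $P$, the function $u$ and the use of numpy are arbitrary choices made only for illustration. It computes $\operatorname{Ent}_P(u)$ directly, evaluates $\int uv\,dP$ at the maximizer $v = \log(u/\int u\,dP)$ from the proof of (1), and checks the consequence (3) for a randomly chosen admissible $v$.

```python
import numpy as np

rng = np.random.default_rng(0)

# A discrete probability measure P on six points and a nonnegative function u.
# The specific numbers are arbitrary illustrations, not taken from the notes.
p = rng.random(6)
p /= p.sum()                              # P({i}) = p[i]
u = 3.0 * rng.random(6)                   # u >= 0

ent = np.sum(u * np.log(u) * p) - np.sum(u * p) * np.log(np.sum(u * p))

# The supremum in (1) is attained at v* = log(u / int u dP), which satisfies
# int e^{v*} dP = 1, and int u v* dP equals Ent_P(u).
v_star = np.log(u / np.sum(u * p))
print("int e^{v*} dP =", np.sum(np.exp(v_star) * p))   # should equal 1
print("Ent_P(u)      =", ent)
print("int u v* dP   =", np.sum(u * v_star * p))       # same value as Ent_P(u)

# Consequence (3): with dQ/dP = u / int u dP, any v with int e^v dP <= 1
# satisfies int v dQ <= D(Q||P).
q = u / np.sum(u * p)                     # Radon-Nikodym derivative dQ/dP
kl = np.sum(q * np.log(q) * p)            # D(Q||P) = Ent_P(dQ/dP)
v = rng.normal(size=6)
v -= np.log(np.sum(np.exp(v) * p))        # normalize so that int e^v dP = 1
print("int v dQ =", np.sum(v * q * p), "<= D(Q||P) =", kl)
```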
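Returning to Theorem 51, the one-dimensional case promised above can also be checked numerically. In the sketch below (again an illustration with arbitrarily chosen Gaussian bumps, not from the notes) $w$ is taken to be the smallest admissible function, namely the sup-convolution $w(z) = \sup\{u(x)^{\lambda}v(y)^{1-\lambda} : \lambda x + (1-\lambda)y = z\}$, discretized on a grid, and the two sides of the inequality are compared by Riemann sums.

```python
import numpy as np

# Numerical sanity check of the one-dimensional Prekopa-Leindler inequality.
# The functions u, v and the value of lambda are arbitrary illustrative choices.
lam = 0.5
grid = np.linspace(-10.0, 10.0, 4001)
dx = grid[1] - grid[0]

u = np.exp(-(grid - 1.0) ** 2)                 # Gaussian bump centered at 1
v = 0.5 * np.exp(-8.0 * (grid + 2.0) ** 2)     # narrower bump centered at -2

# Smallest admissible w: the sup-convolution
#   w(z) = sup { u(x)^lam * v(y)^(1-lam) : lam*x + (1-lam)*y = z },
# accumulated on the grid by taking a maximum over all pairs (x, y).
w = np.zeros_like(grid)
for i in range(len(grid)):
    z = lam * grid[i] + (1.0 - lam) * grid     # points lam*x_i + (1-lam)*y
    vals = (u[i] ** lam) * (v ** (1.0 - lam))
    idx = np.clip(np.searchsorted(grid, z), 0, len(grid) - 1)
    np.maximum.at(w, idx, vals)                # w[idx] = max(w[idx], vals)

lhs = w.sum() * dx
rhs = (u.sum() * dx) ** lam * (v.sum() * dx) ** (1.0 - lam)
print(f"int w dx = {lhs:.4f}  >=  (int u dx)^lam (int v dx)^(1-lam) = {rhs:.4f}")
assert lhs >= rhs
```

For these choices the left-hand side comes out around $0.94$ and the right-hand side around $0.74$ (up to discretization error), so the inequality holds with a visible gap.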
Transportation inequality for log-concave measures. Suppose that a probability distribution $P$ on $\mathbb{R}^n$ has the Lebesgue density $e^{-V(x)}$, where $V(x)$ is strictly convex in the following sense:
\[
tV(x) + (1-t)V(y) - V(tx + (1-t)y) \ge C_p\bigl((1-t) + o(1-t)\bigr)\,\|x-y\|^p \tag{4}
\]
as $t \to 1$, for some $p \ge 1$ and $C_p > 0$.

Example. One example of a distribution that satisfies (4) is the non-degenerate normal distribution $N(0,C)$ that corresponds to
\[
V(x) = \frac{1}{2}(C^{-1}x, x) + \text{const}
\]
for some covariance matrix $C$ with $\det C \neq 0$. If we denote $A = C^{-1}$ then
\[
tV(x) + (1-t)V(y) - V(tx+(1-t)y) = \frac{1}{2}\,t(1-t)\bigl(A(x-y), x-y\bigr) \ge \frac{t(1-t)}{2\lambda_{\max}(C)}\,\|x-y\|^2, \tag{5}
\]
where $\lambda_{\max}(C)$ is the largest eigenvalue of $C$. Thus, (4) holds with $p = 2$ and $C_p = 1/(2\lambda_{\max}(C))$.

Let us prove the following useful inequality for the Wasserstein distance.

Theorem 52 If $P$ satisfies (4) and $Q$ is absolutely continuous w.r.t. $P$ then
\[
W_p(Q, P)^p \le \frac{1}{C_p}\,D(Q\,\|\,P).
\]

Proof. Take functions $f, g \in C(\mathbb{R}^n)$ such that
\[
f(x) + g(y) \le \frac{C_p\bigl((1-t)+o(1-t)\bigr)}{t(1-t)}\,\|x-y\|^p.
\]
Then, by (4),
\[
f(x) + g(y) \le \frac{tV(x) + (1-t)V(y) - V(tx+(1-t)y)}{t(1-t)}
\]
and
\[
t(1-t)f(x) - tV(x) + t(1-t)g(y) - (1-t)V(y) \le -V(tx+(1-t)y).
\]
This implies that for
\[
u(x) = e^{(1-t)f(x) - V(x)}, \quad v(y) = e^{tg(y) - V(y)} \quad \text{and} \quad w(z) = e^{-V(z)}
\]
we have
\[
w(tx + (1-t)y) \ge u(x)^t v(y)^{1-t}.
\]
By the Prekopa-Leindler inequality,
\[
\Bigl(\int e^{(1-t)f(x) - V(x)}\,dx\Bigr)^t \Bigl(\int e^{tg(x) - V(x)}\,dx\Bigr)^{1-t} \le \int e^{-V(x)}\,dx = 1,
\]
and since $e^{-V}$ is the density of $P$ we get
\[
\Bigl(\int e^{(1-t)f}\,dP\Bigr)^t \Bigl(\int e^{tg}\,dP\Bigr)^{1-t} \le 1
\quad\text{and}\quad
\Bigl(\int e^{(1-t)f}\,dP\Bigr)^{\frac{t}{1-t}} \int e^{tg}\,dP \le 1.
\]
It is a simple calculus exercise to show that
\[
\lim_{s \to 0} \Bigl(\int e^{sf}\,dP\Bigr)^{1/s} = e^{\int f\,dP},
\]
and, therefore, letting $t \to 1$ proves that
\[
\text{if } f(x) + g(y) \le C_p\,\|x-y\|^p \text{ then } \int e^{g}\,dP \le e^{-\int f\,dP}.
\]
If we denote $v = g + \int f\,dP$ then the last inequality means that $\int e^v\,dP \le 1$, and (3) implies that
\[
\int v\,dQ = \int f\,dP + \int g\,dQ \le D(Q\,\|\,P).
\]
Finally, using the Kantorovich-Rubinstein theorem, we get
\[
W_p(Q,P)^p = \frac{1}{C_p}\inf\Bigl\{ \int C_p\|x-y\|^p\,d\mu(x,y) : \mu \in M(P,Q) \Bigr\}
= \frac{1}{C_p}\sup\Bigl\{ \int f\,dP + \int g\,dQ : f(x)+g(y) \le C_p\|x-y\|^p \Bigr\}
\le \frac{1}{C_p}\,D(Q\,\|\,P),
\]
and this finishes the proof.

Concentration of Gaussian measure. Applying this result to the example before Theorem 52 gives that for the non-degenerate Gaussian distribution $P = N(0,C)$,
\[
W_2(P, Q) \le \sqrt{2\lambda_{\max}(C)\,D(Q\,\|\,P)}. \tag{6}
\]
(A numerical illustration of this inequality in one dimension is given at the end of the section.) Given a measurable set $A \subseteq \mathbb{R}^n$ with $P(A) > 0$, define the conditional distribution $P_A$ by
\[
P_A(E) = \frac{P(E \cap A)}{P(A)}.
\]
Then, obviously, the Radon-Nikodym derivative
\[
\frac{dP_A}{dP} = \frac{I_A}{P(A)}
\]
and the Kullback-Leibler divergence
\[
D(P_A\,\|\,P) = \int \log\frac{dP_A}{dP}\,dP_A = \log\frac{1}{P(A)}.
\]
Since $W_2$ is a metric, for any two Borel sets $A$ and $B$,
\[
W_2(P_A, P_B) \le W_2(P_A, P) + W_2(P_B, P) \le \sqrt{2\lambda_{\max}(C)}\,\Bigl(\sqrt{\log\tfrac{1}{P(A)}} + \sqrt{\log\tfrac{1}{P(B)}}\Bigr).
\]
Suppose that the sets $A$ and $B$ are apart from each other by a distance $t$, i.e. $d(A,B) \ge t > 0$. Then any two points in the supports of the measures $P_A$ and $P_B$ are at distance at least $t$ from each other, so the transportation distance $W_2(P_A, P_B) \ge t$. Therefore,
\[
t \le W_2(P_A, P_B) \le \sqrt{2\lambda_{\max}(C)}\,\Bigl(\sqrt{\log\tfrac{1}{P(A)}} + \sqrt{\log\tfrac{1}{P(B)}}\Bigr) \le \sqrt{4\lambda_{\max}(C)\log\frac{1}{P(A)P(B)}}.
\]
Therefore,
\[
P(B) \le \frac{1}{P(A)}\exp\Bigl(-\frac{t^2}{4\lambda_{\max}(C)}\Bigr).
\]
In particular, if $B = \{x : d(x,A) \ge t\}$ then
\[
P\bigl(d(x,A) \ge t\bigr) \le \frac{1}{P(A)}\exp\Bigl(-\frac{t^2}{4\lambda_{\max}(C)}\Bigr).
\]
If the set $A$ is not too small, e.g. $P(A) \ge 1/2$, this implies that
\[
P\bigl(d(x,A) \ge t\bigr) \le 2\exp\Bigl(-\frac{t^2}{4\lambda_{\max}(C)}\Bigr).
\]
This shows that the Gaussian measure is exponentially concentrated near any large enough set. The constant $1/4$ in the exponent is not optimal and can be replaced by $1/2$; this is just an example of an application of the above ideas. The optimal result is the famous Gaussian isoperimetric inequality: if $P(A) = P(B)$ for some half-space $B$ then $P(A_t) \ge P(B_t)$, where $A_t = \{x : d(x,A) \le t\}$.

Gaussian concentration via the Prekopa-Leindler inequality. If we denote $c = 1/\lambda_{\max}(C)$ then setting $t = 1/2$ in (5) (and multiplying by $2$) gives
\[
V(x) + V(y) - 2V\Bigl(\frac{x+y}{2}\Bigr) \ge \frac{c}{4}\,\|x-y\|^2.
\]
Given a function $f$ on $\mathbb{R}^n$, let us define its infimum-convolution by
\[
g(y) = \inf_{x}\Bigl( f(x) + \frac{c}{4}\|x-y\|^2 \Bigr).
\]
Then, for all $x$ and $y$,
\[
g(y) - f(x) \le \frac{c}{4}\|x-y\|^2 \le V(x) + V(y) - 2V\Bigl(\frac{x+y}{2}\Bigr). \tag{7}
\]
If we define
\[
u(x) = e^{-f(x) - V(x)}, \quad v(y) = e^{g(y) - V(y)}, \quad w(z) = e^{-V(z)},
\]
then (7) implies that
\[
w\Bigl(\frac{x+y}{2}\Bigr) \ge u(x)^{1/2} v(y)^{1/2}.
\]
The Prekopa-Leindler inequality with $\lambda = 1/2$ implies that
\[
\int e^{g}\,dP \int e^{-f}\,dP \le 1. \tag{8}
\]
Given a measurable set $A$, let $f$ be equal to $0$ on $A$ and $+\infty$ on the complement of $A$. Then
\[
g(y) = \frac{c}{4}\,d(y,A)^2
\]
and (8) implies
\[
\int \exp\Bigl(\frac{c}{4}\,d(x,A)^2\Bigr)\,dP(x) \le \frac{1}{P(A)}.
\]
By Chebyshev's inequality,
\[
P\bigl(d(x,A) \ge t\bigr) \le \frac{1}{P(A)}\exp\Bigl(-\frac{ct^2}{4}\Bigr) = \frac{1}{P(A)}\exp\Bigl(-\frac{t^2}{4\lambda_{\max}(C)}\Bigr).
\]

Trivial metric and total variation.

Definition The total variation distance between probability measures $P$ and $Q$ on a measurable space $(S, \mathcal{B})$ is defined by
\[
\mathrm{TV}(P,Q) = \sup_{A \in \mathcal{B}} |P(A) - Q(A)|.
\]
Using the Hahn-Jordan decomposition, we can represent the signed measure $\mu = P - Q$ as $\mu = \mu^{+} - \mu^{-}$ such that for some set $D \in \mathcal{B}$ and for any set $E \in \mathcal{B}$,
\[
\mu^{+}(E) = \mu(E \cap D) \ge 0 \quad\text{and}\quad \mu^{-}(E) = -\mu(E \cap D^{c}) \ge 0.
\]
Therefore, for any $A \in \mathcal{B}$,
\[
P(A) - Q(A) = \mu^{+}(A) - \mu^{-}(A) = \mu^{+}(A \cap D) - \mu^{-}(A \cap D^{c}),
\]
which makes it obvious that
\[
\sup_{A \in \mathcal{B}} |P(A) - Q(A)| = \mu^{+}(D).
\]
Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-Rubinstein theorem. Let us start with the following simple observation.

Lemma 43 If $f$ is a measurable function on $S$ such that $|f| \le 1$ and $\int f\,dP = 0$ then, for any $\lambda \in \mathbb{R}$,
\[
\int e^{\lambda f}\,dP \le e^{\lambda^2/2}.
\]

Proof. Since $(1+f)/2,\ (1-f)/2 \in [0,1]$ and
\[
\lambda f = \lambda\,\frac{1+f}{2} + (-\lambda)\,\frac{1-f}{2},
\]
by convexity of $e^x$,
\[
e^{\lambda f} \le \frac{1+f}{2}\,e^{\lambda} + \frac{1-f}{2}\,e^{-\lambda} = \cosh(\lambda) + f\sinh(\lambda).
\]
Therefore, we get
\[
\int e^{\lambda f}\,dP \le \cosh(\lambda) \le e^{\lambda^2/2},
\]
where the last inequality is easy to see by Taylor's expansion.

Let us now consider the trivial metric on $S$ given by
\[
d(x,y) = I(x \neq y). \tag{9}
\]
Then a $1$-Lipschitz function $f$ w.r.t. $d$, $\|f\|_L \le 1$, is defined by the condition that for all $x, y \in S$,
\[
|f(x) - f(y)| \le 1. \tag{10}
\]
Formally, the Kantorovich-Rubinstein theorem in this case would state that
\[
W(P,Q) := \inf\Bigl\{ \int I(x \neq y)\,d\mu(x,y) : \mu \in M(P,Q) \Bigr\}
= \sup\Bigl\{ \int f\,dQ - \int f\,dP : \|f\|_L \le 1 \Bigr\} =: \gamma(P,Q).
\]
However, since any uncountable set $S$ is not separable w.r.t. the trivial metric $d$, we cannot apply the Kantorovich-Rubinstein theorem directly. In this case one can use the Hahn-Jordan decomposition to show that $\gamma$ coincides with the total variation distance, $\gamma(P,Q) = \mathrm{TV}(P,Q)$, and it is easy to construct a measure $\mu \in M(P,Q)$ explicitly that witnesses the above equality. We leave this as an exercise. Thus, for the trivial metric $d$,
\[
W(P,Q) = \gamma(P,Q) = \mathrm{TV}(P,Q).
\]
We have the following analogue of the KL divergence bound.

Theorem 53 If $Q$ is absolutely continuous w.r.t. $P$ then
\[
\mathrm{TV}(P,Q) \le \sqrt{2\,D(Q\,\|\,P)}.
\]
Proof. Take $f$ such that (10) holds. If we define $g(x) = f(x) - \int f\,dP$ then, clearly, $|g| \le 1$ and $\int g\,dP = 0$. The above lemma implies that for any $\lambda \in \mathbb{R}$,
\[
\int e^{\lambda f - \lambda\int f\,dP - \lambda^2/2}\,dP \le 1.
\]
The variational characterization of entropy, more precisely (3), implies that
\[
\lambda\int f\,dQ - \lambda\int f\,dP - \frac{\lambda^2}{2} \le D(Q\,\|\,P),
\]
and for $\lambda > 0$ we get
\[
\int f\,dQ \le \int f\,dP + \frac{\lambda}{2} + \frac{1}{\lambda}\,D(Q\,\|\,P).
\]
Minimizing the right hand side over $\lambda > 0$, we get
\[
\int f\,dQ - \int f\,dP \le \sqrt{2\,D(Q\,\|\,P)}.
\]
Applying this to $f$ and $-f$ yields the result.
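For distributions on a finite set both sides of Theorem 53 are explicit finite sums, so the bound can be checked directly. The following sketch (randomly generated distributions, used only as an illustration and not part of the notes) does this for a few pairs $P, Q$.

```python
import numpy as np

rng = np.random.default_rng(1)

def tv(p, q):
    # TV(P, Q) = sup_A |P(A) - Q(A)| = (1/2) sum_i |p_i - q_i| on a finite set
    return 0.5 * np.abs(p - q).sum()

def kl(q, p):
    # D(Q||P) = sum_i q_i log(q_i / p_i), assuming Q << P (here all p_i > 0)
    mask = q > 0
    return np.sum(q[mask] * np.log(q[mask] / p[mask]))

for _ in range(5):
    p = rng.random(10)
    p /= p.sum()
    q = rng.random(10)
    q /= q.sum()
    print(f"TV = {tv(p, q):.4f}  <=  sqrt(2 D(Q||P)) = {np.sqrt(2.0 * kl(q, p)):.4f}")
```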
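Finally, here is the numerical illustration of the Gaussian transportation inequality (6) promised earlier. For one-dimensional Gaussians both sides are available in closed form: $W_2\bigl(N(m_1,\sigma_1^2), N(m_2,\sigma_2^2)\bigr)^2 = (m_1-m_2)^2 + (\sigma_1-\sigma_2)^2$, and $D(Q\,\|\,P)$ has the usual Gaussian expression. The sketch below (with randomly chosen parameters, not from the notes) compares the two sides; in one dimension $\lambda_{\max}(C) = \sigma_P^2$.

```python
import numpy as np

rng = np.random.default_rng(2)

def w2_gauss(m1, s1, m2, s2):
    # Closed form for the quadratic Wasserstein distance between
    # the one-dimensional Gaussians N(m1, s1^2) and N(m2, s2^2).
    return np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)

def kl_gauss(mq, sq, mp, sp):
    # D(Q||P) for Q = N(mq, sq^2) and P = N(mp, sp^2).
    return np.log(sp / sq) + (sq ** 2 + (mq - mp) ** 2) / (2.0 * sp ** 2) - 0.5

for _ in range(5):
    mp, sp = 0.0, rng.uniform(0.5, 2.0)   # P = N(0, sp^2), so lambda_max(C) = sp^2
    mq, sq = rng.normal(), rng.uniform(0.5, 2.0)
    lhs = w2_gauss(mp, sp, mq, sq)
    rhs = np.sqrt(2.0 * sp ** 2 * kl_gauss(mq, sq, mp, sp))
    print(f"W2(P, Q) = {lhs:.4f}  <=  sqrt(2 lambda_max D(Q||P)) = {rhs:.4f}")
```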