Theory of Probability Fall 2008


MIT OpenCourseWare
http://ocw.mit.edu

18.175 Theory of Probability, Fall 2008

For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.

Section: Prekopa-Leindler inequality, entropy and concentration.

In this section we will make several connections between the Kantorovich-Rubinstein theorem and other classical objects. Let us start with the following classical inequality.

Theorem 51 (Prekopa-Leindler) Consider nonnegative integrable functions $w, u, v : \mathbb{R}^n \to [0,\infty)$ such that for some $\lambda \in (0,1)$,
$$w(\lambda x + (1-\lambda)y) \ge u(x)^{\lambda} v(y)^{1-\lambda} \quad \text{for all } x, y \in \mathbb{R}^n.$$
Then,
$$\int w\,dx \ge \Big(\int u\,dx\Big)^{\lambda} \Big(\int v\,dx\Big)^{1-\lambda}.$$

Proof. The proof will proceed by induction on $n$. Let us first show the induction step. Suppose the statement holds for $n$ and we would like to show it for $n+1$. By assumption, for any $x, y \in \mathbb{R}^n$ and $a, b \in \mathbb{R}$,
$$w(\lambda x + (1-\lambda)y, \lambda a + (1-\lambda)b) \ge u(x,a)^{\lambda} v(y,b)^{1-\lambda}.$$
Let us fix $a$ and $b$ and consider the functions
$$w_1(x) = w(x, \lambda a + (1-\lambda)b), \quad u_1(x) = u(x,a), \quad v_1(x) = v(x,b)$$
on $\mathbb{R}^n$ that satisfy
$$w_1(\lambda x + (1-\lambda)y) \ge u_1(x)^{\lambda} v_1(y)^{1-\lambda}.$$
By the induction assumption,
$$\int_{\mathbb{R}^n} w_1\,dx \ge \Big(\int_{\mathbb{R}^n} u_1\,dx\Big)^{\lambda} \Big(\int_{\mathbb{R}^n} v_1\,dx\Big)^{1-\lambda}.$$
These integrals still depend on $a$ and $b$ and we can define
$$w_2(\lambda a + (1-\lambda)b) = \int_{\mathbb{R}^n} w_1\,dx = \int_{\mathbb{R}^n} w(x, \lambda a + (1-\lambda)b)\,dx$$
and, similarly,
$$u_2(a) = \int_{\mathbb{R}^n} u(x,a)\,dx, \quad v_2(b) = \int_{\mathbb{R}^n} v(x,b)\,dx,$$
so that
$$w_2(\lambda a + (1-\lambda)b) \ge u_2(a)^{\lambda} v_2(b)^{1-\lambda}.$$

These functions are defined on $\mathbb{R}$ and, by the $n = 1$ case of the statement (proved below),
$$\int_{\mathbb{R}} w_2\,ds \ge \Big(\int_{\mathbb{R}} u_2\,ds\Big)^{\lambda} \Big(\int_{\mathbb{R}} v_2\,ds\Big)^{1-\lambda}, \quad \text{i.e.} \quad \int_{\mathbb{R}^{n+1}} w\,dz \ge \Big(\int_{\mathbb{R}^{n+1}} u\,dz\Big)^{\lambda} \Big(\int_{\mathbb{R}^{n+1}} v\,dz\Big)^{1-\lambda},$$
which finishes the proof of the induction step. It remains to prove the case $n = 1$. Let us show two different proofs.

1. One approach is based on the Brunn-Minkowski inequality on the real line, which says that if $\gamma$ is the Lebesgue measure and $A, B$ are Borel sets on $\mathbb{R}$ then
$$\gamma(\lambda A + (1-\lambda)B) \ge \lambda \gamma(A) + (1-\lambda)\gamma(B),$$
where $A + B$ is the set addition, i.e. $A + B = \{a + b : a \in A, b \in B\}$. We can also assume that $u, v, w : \mathbb{R} \to [0,1]$ because the inequality is homogeneous with respect to scaling. We have
$$\{w \ge a\} \supseteq \lambda\{u \ge a\} + (1-\lambda)\{v \ge a\}$$
because if $u(x) \ge a$ and $v(y) \ge a$ then, by assumption,
$$w(\lambda x + (1-\lambda)y) \ge u(x)^{\lambda} v(y)^{1-\lambda} \ge a^{\lambda} a^{1-\lambda} = a.$$
The Brunn-Minkowski inequality implies that
$$\gamma(w \ge a) \ge \lambda \gamma(u \ge a) + (1-\lambda)\gamma(v \ge a).$$
Finally,
$$\int_{\mathbb{R}} w(z)\,dz = \int_{\mathbb{R}} \int_0^1 I(x \le w(z))\,dx\,dz = \int_0^1 \gamma(w \ge x)\,dx \ge \lambda \int_0^1 \gamma(u \ge x)\,dx + (1-\lambda)\int_0^1 \gamma(v \ge x)\,dx$$
$$= \lambda \int_{\mathbb{R}} u(z)\,dz + (1-\lambda)\int_{\mathbb{R}} v(z)\,dz \ge \Big(\int_{\mathbb{R}} u(z)\,dz\Big)^{\lambda} \Big(\int_{\mathbb{R}} v(z)\,dz\Big)^{1-\lambda},$$
where the last step is the arithmetic-geometric mean inequality.

2. Another approach is based on the transportation of measure. We can assume that $\int u = \int v = 1$ by rescaling
$$u \to \frac{u}{\int u}, \quad v \to \frac{v}{\int v}, \quad w \to \frac{w}{(\int u)^{\lambda}(\int v)^{1-\lambda}}.$$
Then we need to show that $\int w \ge 1$. Without loss of generality, let us assume that $u, v > 0$ are smooth and strictly positive, since one can easily reduce to this case. Define $x(t), y(t)$ for $0 \le t \le 1$ by
$$\int_{-\infty}^{x(t)} u(s)\,ds = t, \quad \int_{-\infty}^{y(t)} v(s)\,ds = t.$$
Then
$$u(x(t))\,x'(t) = 1, \quad v(y(t))\,y'(t) = 1$$
and the derivatives $x'(t), y'(t) > 0$. Define $z(t) = \lambda x(t) + (1-\lambda)y(t)$. Then
$$\int_{-\infty}^{+\infty} w(s)\,ds \ge \int_0^1 w(z(s))\,dz(s) = \int_0^1 w(\lambda x(s) + (1-\lambda)y(s))\,z'(s)\,ds.$$
By the arithmetic-geometric mean inequality,
$$z'(s) = \lambda x'(s) + (1-\lambda)y'(s) \ge (x'(s))^{\lambda}(y'(s))^{1-\lambda}$$
and, by assumption,
$$w(\lambda x(s) + (1-\lambda)y(s)) \ge u(x(s))^{\lambda} v(y(s))^{1-\lambda}.$$
Therefore,
$$\int w(s)\,ds \ge \int_0^1 \big(u(x(s))\,x'(s)\big)^{\lambda}\big(v(y(s))\,y'(s)\big)^{1-\lambda}\,ds = \int_0^1 1\,ds = 1.$$
This finishes the proof of the theorem.
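Before moving on, here is a quick numerical sanity check of Theorem 51 in dimension one (an illustrative sketch; the functions $u, v$, the value of $\lambda$ and the grid are arbitrary choices, not from the notes). For fixed $u, v, \lambda$ the smallest admissible $w$ is the supremal convolution $w(z) = \sup\{u(x)^{\lambda}v(y)^{1-\lambda} : \lambda x + (1-\lambda)y = z\}$, so it suffices to compare its integral with the right-hand side.

```python
import numpy as np

# Sketch: check the n = 1 Prekopa-Leindler inequality on a grid.
lam = 0.3
xs = np.linspace(-6.0, 6.0, 401)
dx = xs[1] - xs[0]

u = np.exp(-xs ** 2)               # arbitrary nonnegative integrable functions
v = np.exp(-np.abs(xs))

# smallest admissible w: supremal convolution of u^lam and v^(1-lam)
w = np.zeros_like(xs)
for i, x in enumerate(xs):
    for j, y in enumerate(xs):
        k = int(round((lam * x + (1 - lam) * y - xs[0]) / dx))
        if 0 <= k < len(xs):
            w[k] = max(w[k], u[i] ** lam * v[j] ** (1 - lam))

lhs = w.sum() * dx
rhs = (u.sum() * dx) ** lam * (v.sum() * dx) ** (1 - lam)
print(lhs, rhs, lhs >= rhs)        # the theorem predicts lhs >= rhs
```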

Entropy and the Kullback-Leibler divergence. Consider a probability measure $P$ on $\mathbb{R}^n$ and a nonnegative measurable function $u : \mathbb{R}^n \to [0,\infty)$.

Definition (Entropy). We define the entropy of $u$ with respect to $P$ by
$$\mathrm{Ent}_P(u) = \int u \log u\,dP - \int u\,dP \cdot \log \int u\,dP.$$
One can give a different representation of the entropy:
$$\mathrm{Ent}_P(u) = \sup\Big\{ \int uv\,dP : \int e^v\,dP \le 1 \Big\}. \tag{1}$$
Indeed, if we consider the convex set $V = \{v : \int e^v\,dP \le 1\}$ then the above supremum is obviously a solution of the following saddle point problem:
$$L(v,\lambda) = \int uv\,dP - \lambda\Big(\int e^v\,dP - 1\Big) \to \sup_{v}\inf_{\lambda \ge 0}.$$
The functional $L$ is linear in $\lambda$ and concave in $v$. Therefore, by the minimax theorem, a saddle point solution exists and $\sup\inf = \inf\sup$. The integral
$$\int uv\,dP - \lambda \int e^v\,dP = \int (uv - \lambda e^v)\,dP$$
can be maximized pointwise by taking $v$ such that $u = \lambda e^v$. Then
$$L(v,\lambda) = \int u \log\frac{u}{\lambda}\,dP - \int u\,dP + \lambda,$$
and minimizing over $\lambda$ gives $\lambda = \int u\,dP$ and $v = \log(u/\int u\,dP)$. This proves (1).

Suppose now that a law $Q$ is absolutely continuous with respect to $P$ and denote its Radon-Nikodym derivative by
$$u = \frac{dQ}{dP}. \tag{2}$$

Definition (Kullback-Leibler divergence). The quantity
$$D(Q\|P) := \int \log u\,dQ = \int \log\frac{dQ}{dP}\,dQ$$
is called the Kullback-Leibler divergence between $P$ and $Q$. Clearly, $D(Q\|P) = \mathrm{Ent}_P(u)$, since
$$\mathrm{Ent}_P(u) = \int \frac{dQ}{dP}\log\frac{dQ}{dP}\,dP - \int \frac{dQ}{dP}\,dP \cdot \log\int \frac{dQ}{dP}\,dP = \int \log\frac{dQ}{dP}\,dQ,$$
where we used that $\int \frac{dQ}{dP}\,dP = 1$. The variational characterization (1) implies that if $\int e^v\,dP \le 1$ then
$$\int v\,dQ = \int uv\,dP \le D(Q\|P). \tag{3}$$
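On a finite sample space these identities are easy to verify numerically. The following sketch (an illustration, not part of the notes; the distributions are random) checks that $\mathrm{Ent}_P(u)$ with $u = dQ/dP$ coincides with $D(Q\|P)$, and that the optimizer $v = \log(u/\int u\,dP)$ from the argument above satisfies $\int e^v\,dP = 1$ and attains the supremum in (1), consistent with (3).

```python
import numpy as np

rng = np.random.default_rng(0)
p = rng.random(6); p /= p.sum()            # reference law P on 6 points
q = rng.random(6); q /= q.sum()            # a law Q absolutely continuous w.r.t. P
u = q / p                                  # Radon-Nikodym derivative dQ/dP

# Ent_P(u) and the Kullback-Leibler divergence D(Q||P) coincide
ent = np.sum(u * np.log(u) * p) - np.sum(u * p) * np.log(np.sum(u * p))
kl = np.sum(q * np.log(q / p))
print(ent, kl)

# optimizer in (1): v = log(u / E_P u) has E_P[e^v] = 1 and E_Q[v] = D(Q||P)
v = np.log(u / np.sum(u * p))
print(np.sum(p * np.exp(v)), np.sum(q * v))
```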

Transportation inequality for log-concave measures. Suppose that a probability distribution $P$ on $\mathbb{R}^n$ has the Lebesgue density $e^{-V(x)}$ where $V(x)$ is strictly convex in the following sense:
$$tV(x) + (1-t)V(y) - V(tx + (1-t)y) \ge C_p\big((1-t) + o(1-t)\big)|x-y|^p \tag{4}$$
as $t \to 1$, for some $p$ and $C_p > 0$.

Example. One example of a distribution that satisfies (4) is the non-degenerate normal distribution $N(0,C)$, which corresponds to $V(x) = \frac{1}{2}(C^{-1}x, x) + \mathrm{const}$ for a covariance matrix $C$ with $\det C \neq 0$. If we denote $A = C^{-1}/2$ then
$$t(Ax,x) + (1-t)(Ay,y) - \big(A(tx+(1-t)y),\, tx+(1-t)y\big) = t(1-t)\big(A(x-y), x-y\big) \ge \frac{t(1-t)}{2\lambda_{\max}(C)}|x-y|^2, \tag{5}$$
where $\lambda_{\max}(C)$ is the largest eigenvalue of $C$. Thus, (4) holds with $p = 2$ and $C_p = 1/(2\lambda_{\max}(C))$.

Let us prove the following useful inequality for the Wasserstein distance.

Theorem 52 If $P$ satisfies (4) and $Q$ is absolutely continuous w.r.t. $P$ then
$$W_p(Q,P)^p \le \frac{1}{C_p}\, D(Q\|P).$$

Proof. Take functions $f, g \in C(\mathbb{R}^n)$ such that
$$f(x) + g(y) \le \frac{C_p\big((1-t) + o(1-t)\big)}{t(1-t)}\,|x-y|^p.$$
Then, by (4),
$$f(x) + g(y) \le \frac{tV(x) + (1-t)V(y) - V(tx+(1-t)y)}{t(1-t)}$$
and
$$t(1-t)f(x) - tV(x) + t(1-t)g(y) - (1-t)V(y) \le -V(tx+(1-t)y).$$
This implies that
$$w(tx + (1-t)y) \ge u(x)^{t} v(y)^{1-t}$$
for
$$u(x) = e^{(1-t)f(x) - V(x)}, \quad v(y) = e^{tg(y) - V(y)} \quad \text{and} \quad w(z) = e^{-V(z)}.$$
By the Prekopa-Leindler inequality,
$$\int e^{-V(x)}\,dx \ge \Big(\int e^{(1-t)f(x) - V(x)}\,dx\Big)^{t}\Big(\int e^{tg(x) - V(x)}\,dx\Big)^{1-t},$$
and since $e^{-V}$ is the density of $P$ we get
$$\Big(\int e^{(1-t)f}\,dP\Big)^{t}\Big(\int e^{tg}\,dP\Big)^{1-t} \le 1 \quad \text{and} \quad \int e^{tg}\,dP \le \Big(\int e^{(1-t)f}\,dP\Big)^{-t/(1-t)}.$$
It is a simple calculus exercise to show that
$$\lim_{s \to 0}\Big(\int e^{sf}\,dP\Big)^{1/s} = e^{\int f\,dP}$$
and, therefore, letting $t \to 1$ proves that
$$\text{if } f(x) + g(y) \le C_p|x-y|^p \ \text{ then } \ \int e^g\,dP \le e^{-\int f\,dP}.$$
If we denote $v = g + \int f\,dP$ then the last inequality is $\int e^v\,dP \le 1$, and (3) implies that
$$\int v\,dQ = \int f\,dP + \int g\,dQ \le D(Q\|P).$$
Finally, using the Kantorovich-Rubinstein theorem, we get
$$W_p(Q,P)^p = \inf\Big\{\int |x-y|^p\,d\mu(x,y) : \mu \in M(P,Q)\Big\} = \frac{1}{C_p}\sup\Big\{\int f\,dP + \int g\,dQ : f(x)+g(y) \le C_p|x-y|^p\Big\} \le \frac{D(Q\|P)}{C_p}$$
and this finishes the proof.
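Theorem 52 can be checked numerically in the Gaussian case, where both sides are explicit. For $P = N(0,C)$ and $Q = N(m,C)$ one has the standard closed forms $W_2(Q,P)^2 = |m|^2$ and $D(Q\|P) = \frac{1}{2}(C^{-1}m, m)$ (facts used here without proof), so the claim with $C_p = 1/(2\lambda_{\max}(C))$ reduces to $|m|^2 \le \lambda_{\max}(C)(C^{-1}m, m)$. A minimal sketch on random instances (the choices below are arbitrary, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5
B = rng.standard_normal((n, n))
C = B @ B.T + 0.1 * np.eye(n)          # a random positive definite covariance
m = rng.standard_normal(n)             # mean shift defining Q = N(m, C), P = N(0, C)

w2_sq = m @ m                          # W_2(Q, P)^2 for equal covariances
kl = 0.5 * m @ np.linalg.solve(C, m)   # D(Q || P)
lam_max = np.linalg.eigvalsh(C).max()

print(w2_sq, 2 * lam_max * kl, w2_sq <= 2 * lam_max * kl + 1e-10)
```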

Concentration of Gaussian measure. Applying this result to the example before Theorem 52 gives that, for the non-degenerate Gaussian distribution $P = N(0,C)$,
$$W_2(P,Q) \le \sqrt{2\lambda_{\max}(C)\,D(Q\|P)}. \tag{6}$$
Given a measurable set $A \subseteq \mathbb{R}^n$ with $P(A) > 0$, define the conditional distribution $P_A$ by
$$P_A(E) = \frac{P(E \cap A)}{P(A)} \quad \text{for any Borel set } E.$$
Then, obviously, the Radon-Nikodym derivative
$$\frac{dP_A}{dP} = \frac{I_A}{P(A)}$$
and the Kullback-Leibler divergence
$$D(P_A\|P) = \int_A \log\frac{1}{P(A)}\,dP_A = \log\frac{1}{P(A)}.$$
Since $W_2$ is a metric, for any two Borel sets $A$ and $B$,
$$W_2(P_A, P_B) \le W_2(P_A, P) + W_2(P_B, P) \le \sqrt{2\lambda_{\max}(C)}\Big(\sqrt{\log\tfrac{1}{P(A)}} + \sqrt{\log\tfrac{1}{P(B)}}\Big).$$
Suppose that the sets $A$ and $B$ are apart from each other by a distance $t$, i.e. $d(A,B) \ge t > 0$. Then any two points in the supports of the measures $P_A$ and $P_B$ are at distance at least $t$ from each other, and the transportation distance $W_2(P_A, P_B) \ge t$. Therefore,
$$t \le W_2(P_A,P_B) \le \sqrt{2\lambda_{\max}(C)}\Big(\sqrt{\log\tfrac{1}{P(A)}} + \sqrt{\log\tfrac{1}{P(B)}}\Big) \le \sqrt{4\lambda_{\max}(C)\log\tfrac{1}{P(A)P(B)}}.$$
Therefore,
$$P(B) \le \frac{1}{P(A)}\exp\Big(-\frac{t^2}{4\lambda_{\max}(C)}\Big).$$
In particular, if $B = \{x : d(x,A) \ge t\}$ then
$$P\big(d(x,A) \ge t\big) \le \frac{1}{P(A)}\exp\Big(-\frac{t^2}{4\lambda_{\max}(C)}\Big).$$
If the set $A$ is not too small, e.g. $P(A) \ge 1/2$, this implies that
$$P\big(d(x,A) \ge t\big) \le 2\exp\Big(-\frac{t^2}{4\lambda_{\max}(C)}\Big).$$
This shows that the Gaussian measure is exponentially concentrated near any large enough set. The constant $1/4$ in the exponent is not optimal and can be replaced by $1/2$; this is just an example of an application of the above ideas. The optimal result is the famous Gaussian isoperimetry: if $P(A) = P(B)$ for some half-space $B$ then $P(A_t) \ge P(B_t)$, where $A_t = \{x : d(x,A) \le t\}$ denotes the $t$-neighborhood of a set $A$.
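A small Monte Carlo illustration of this bound (a sketch with arbitrary choices, not from the notes): take $P = N(0, I_n)$, so that $\lambda_{\max}(C) = 1$, and the half-space $A = \{x : x_1 \le 0\}$, for which $P(A) = 1/2$ and $d(x,A) = \max(x_1, 0)$; the empirical tail probability can then be compared with $2\exp(-t^2/4)$.

```python
import numpy as np

n, t = 20, 2.0
rng = np.random.default_rng(2)
x = rng.standard_normal((200_000, n))        # samples from P = N(0, I_n)

dist_to_A = np.maximum(x[:, 0], 0.0)         # distance to the half-space {x_1 <= 0}
empirical = np.mean(dist_to_A >= t)
bound = 2 * np.exp(-t ** 2 / 4)

print(empirical, bound, empirical <= bound)  # roughly 0.023 vs 0.74
```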

Gaussian concentration via the Prekopa-Leindler inequality. If we denote $c = 1/\lambda_{\max}(C)$ then setting $t = 1/2$ in (5),
$$V(x) + V(y) - 2V\Big(\frac{x+y}{2}\Big) \ge \frac{c}{4}|x-y|^2.$$
Given a function $f$ on $\mathbb{R}^n$, let us define its infimum-convolution by
$$g(y) = \inf_{x}\Big(f(x) + \frac{c}{4}|x-y|^2\Big).$$
Then, for all $x$ and $y$,
$$g(y) - f(x) \le \frac{c}{4}|x-y|^2 \le V(x) + V(y) - 2V\Big(\frac{x+y}{2}\Big). \tag{7}$$
If we define
$$u(x) = e^{-f(x) - V(x)}, \quad v(y) = e^{g(y) - V(y)}, \quad w(z) = e^{-V(z)},$$
then (7) implies that
$$w\Big(\frac{x+y}{2}\Big) \ge u(x)^{1/2} v(y)^{1/2}.$$
The Prekopa-Leindler inequality with $\lambda = 1/2$ implies that
$$\int e^{g}\,dP \int e^{-f}\,dP \le 1. \tag{8}$$
Given a measurable set $A$, let $f$ be equal to $0$ on $A$ and $+\infty$ on the complement of $A$. Then
$$g(y) = \frac{c}{4}\,d(y,A)^2$$
and, since $\int e^{-f}\,dP = P(A)$, (8) implies
$$\int \exp\Big(\frac{c}{4}\,d(x,A)^2\Big)\,dP(x) \le \frac{1}{P(A)}.$$
By Chebyshev's inequality,
$$P\big(d(x,A) \ge t\big) \le \frac{1}{P(A)}\exp\Big(-\frac{ct^2}{4}\Big) = \frac{1}{P(A)}\exp\Big(-\frac{t^2}{4\lambda_{\max}(C)}\Big).$$
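The inequality $\int \exp\big(\frac{c}{4}d(x,A)^2\big)\,dP \le 1/P(A)$ above is easy to test by numerical integration (an illustrative sketch, not from the notes; $P = N(0,1)$ in one dimension, so $c = 1$, and $A = (-\infty, a]$ for a few arbitrary values of $a$):

```python
import numpy as np

xs = np.linspace(-12.0, 12.0, 200_001)
dx = xs[1] - xs[0]
phi = np.exp(-xs ** 2 / 2) / np.sqrt(2 * np.pi)     # density of P = N(0, 1)

c = 1.0                                             # c = 1 / lambda_max for N(0, 1)
for a in (-1.0, 0.0, 1.0):
    dist = np.maximum(xs - a, 0.0)                  # d(x, A) for A = (-inf, a]
    lhs = np.sum(np.exp(c * dist ** 2 / 4) * phi) * dx
    rhs = 1.0 / (np.sum(phi[xs <= a]) * dx)         # 1 / P(A)
    print(a, round(lhs, 3), round(rhs, 3), lhs <= rhs)
```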

Trivial metric and total variation.

Definition. The total variation distance between probability measures $P$ and $Q$ on a measurable space $(S, \mathcal{B})$ is defined by
$$\mathrm{TV}(P,Q) = \sup_{A \in \mathcal{B}} |P(A) - Q(A)|.$$
Using the Hahn-Jordan decomposition, we can represent the signed measure $\mu = P - Q$ as $\mu = \mu^+ - \mu^-$ such that, for some set $D \in \mathcal{B}$ and for any set $E \in \mathcal{B}$,
$$\mu^+(E) = \mu(E \cap D) \ge 0 \quad \text{and} \quad \mu^-(E) = -\mu(E \cap D^c) \ge 0.$$
Therefore, for any $A \in \mathcal{B}$,
$$P(A) - Q(A) = \mu^+(A) - \mu^-(A) = \mu^+(A \cap D) - \mu^-(A \cap D^c),$$
which makes it obvious that
$$\sup_{A \in \mathcal{B}} |P(A) - Q(A)| = \mu^+(D).$$
Let us describe some connections of the total variation distance to the Kullback-Leibler divergence and the Kantorovich-Rubinstein theorem. Let us start with the following simple observation.

Lemma 43 If $f$ is a measurable function on $S$ such that $|f| \le 1$ and $\int f\,dP = 0$ then for any $\lambda \in \mathbb{R}$,
$$\int e^{\lambda f}\,dP \le e^{\lambda^2/2}.$$

Proof. Since $(1+f)/2,\ (1-f)/2 \in [0,1]$ and
$$\lambda f = \lambda\,\frac{1+f}{2} + (-\lambda)\,\frac{1-f}{2},$$
by convexity of $e^x$,
$$e^{\lambda f} \le \frac{1+f}{2}\,e^{\lambda} + \frac{1-f}{2}\,e^{-\lambda} = \mathrm{ch}(\lambda) + f\,\mathrm{sh}(\lambda).$$
Therefore, we get
$$\int e^{\lambda f}\,dP \le \mathrm{ch}(\lambda) \le e^{\lambda^2/2},$$
where the last inequality is easy to see by Taylor's expansion.

Let us now consider the trivial metric on $S$ given by
$$d(x,y) = I(x \neq y). \tag{9}$$
Then a $1$-Lipschitz function $f$ w.r.t. $d$, $\|f\|_L \le 1$, is defined by the condition that for all $x, y \in S$,
$$|f(x) - f(y)| \le 1. \tag{10}$$
Formally, the Kantorovich-Rubinstein theorem in this case would state that
$$W_1(P,Q) := \inf\Big\{\int I(x \neq y)\,d\mu(x,y) : \mu \in M(P,Q)\Big\} = \sup\Big\{\int f\,dQ - \int f\,dP : \|f\|_L \le 1\Big\} =: \gamma(P,Q).$$
However, since any uncountable set $S$ is not separable w.r.t. the trivial metric $d$, we cannot apply the Kantorovich-Rubinstein theorem directly. In this case one can use the Hahn-Jordan decomposition to show that $\gamma$ coincides with the total variation distance, $\gamma(P,Q) = \mathrm{TV}(P,Q)$, and it is easy to construct a measure $\mu \in M(P,Q)$ explicitly that witnesses the above equality. We leave this as an exercise. Thus, for the trivial metric $d$,
$$W_1(P,Q) = \gamma(P,Q) = \mathrm{TV}(P,Q).$$
We have the following analogue of the KL divergence bound.

Theorem 53 If $Q$ is absolutely continuous w.r.t. $P$ then
$$\mathrm{TV}(P,Q) \le \sqrt{2\,D(Q\|P)}.$$

Proof. Take $f$ such that (10) holds. If we define $g(x) = f(x) - \int f\,dP$ then, clearly, $|g| \le 1$ and $\int g\,dP = 0$. The above lemma implies that for any $\lambda \in \mathbb{R}$,
$$\int e^{\lambda f - \lambda\int f\,dP - \lambda^2/2}\,dP \le 1.$$
The variational characterization of entropy (3) implies that
$$\lambda\int f\,dQ - \lambda\int f\,dP - \frac{\lambda^2}{2} \le D(Q\|P)$$
and for $\lambda > 0$ we get
$$\int f\,dQ \le \int f\,dP + \frac{\lambda}{2} + \frac{D(Q\|P)}{\lambda}.$$
Minimizing the right-hand side over $\lambda > 0$, we get
$$\int f\,dQ - \int f\,dP \le \sqrt{2\,D(Q\|P)}.$$
Applying this to $f$ and $-f$ yields the result.
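As with the earlier bounds, Theorem 53 is easy to check on a finite sample space (a sketch, not part of the notes; the distributions are random): the total variation distance, computed here as half the $\ell_1$ distance between the probability vectors, never exceeds $\sqrt{2\,D(Q\|P)}$.

```python
import numpy as np

rng = np.random.default_rng(3)
for _ in range(5):
    p = rng.random(8); p /= p.sum()
    q = rng.random(8); q /= q.sum()
    tv = 0.5 * np.abs(p - q).sum()          # TV(P, Q) = sup_A |P(A) - Q(A)|
    kl = np.sum(q * np.log(q / p))          # D(Q || P)
    print(round(tv, 4), round(np.sqrt(2 * kl), 4), tv <= np.sqrt(2 * kl))
```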