Sharpening Jensen's Inequality

arXiv:1707.08644v2 [math.ST] 4 Oct 2017

J. G. Liao and Arthur Berg
Division of Biostatistics and Bioinformatics, Penn State University College of Medicine

October 6, 2017

Abstract

This paper proposes a new, sharpened version of Jensen's inequality. The proposed bound is simple and insightful, is broadly applicable by imposing minimal assumptions, and provides fairly accurate results in spite of its simple form. Applications to the moment generating function, power mean inequalities, and Rao-Blackwell estimation are presented. This presentation can be incorporated into any calculus-based statistical course.

Keywords: Jensen gap, Power mean inequality, Rao-Blackwell estimator, Taylor series

1 Introduction

Jensen's inequality is a fundamental inequality in mathematics, and it underlies many important statistical proofs and concepts. Standard applications include the derivation of the arithmetic-geometric mean inequality, the non-negativity of the Kullback-Leibler divergence, and the convergence property of the expectation-maximization algorithm (Dempster et al., 1977). Jensen's inequality is covered in all major statistical textbooks, such as Casella and Berger (2002, Section 4.7) and Wasserman (2013, Section 4.2), as a basic mathematical tool for statistics.

Let X be a random variable with finite expectation and let ϕ(x) be a convex function. Then Jensen's inequality (Jensen, 1906) establishes

E[ϕ(X)] − ϕ(E[X]) ≥ 0. (1)

This inequality, however, is not sharp unless var(X) = 0 or ϕ(x) is a linear function of x, so there is substantial room for improvement. This paper proposes a new, sharper bound for the Jensen gap E[ϕ(X)] − ϕ(E[X]). Some other improvements of Jensen's inequality have been developed recently; see, for example, Walker (2014), Abramovich and Persson (2016), Horvath et al. (2014), and the references cited therein. Our proposed bound has the following advantages. First, it has a simple, easy-to-use, and insightful form in terms of the second derivative ϕ″(x) and var(X), yet it gives fairly accurate results in the several examples below. Many previously published improvements are much more complicated in form, much more involved to use, and can even be more difficult to compute than E[ϕ(X)] itself, as discussed in Walker (2014). Second, our method requires only the existence of ϕ″(x) and is therefore broadly applicable. In contrast, some other methods require ϕ(x) to admit a power series representation with positive coefficients (Abramovich and Persson, 2016; Dragomir, 2014; Walker, 2014) or require ϕ(x) to be superquadratic (Abramovich et al., 2014). Third, we provide both a lower bound and an upper bound in a single formula.

We have incorporated the material in this paper in our classroom teaching. With only slightly increased technical level and lecture time, we are able to present a much sharper version of Jensen's inequality that significantly enhances students' understanding of the underlying concepts.
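As a quick numerical illustration of (1) (added here for teaching purposes; it is not part of the original text), the following Python snippet, assuming NumPy is available, estimates both sides of the inequality by Monte Carlo for the convex function ϕ(x) = x² and X ~ Exp(1), in which case the Jensen gap equals var(X) = 1.

import numpy as np

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=1_000_000)   # X ~ Exp(1), so E[X] = var(X) = 1

phi = lambda t: t ** 2                   # a convex function
gap = phi(x).mean() - phi(x.mean())      # Monte Carlo estimate of E[phi(X)] - phi(E[X])
print(gap)                               # close to var(X) = 1, and in particular >= 0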

2 Main result

Theorem 1. Let X be a one-dimensional random variable with mean µ and P(X ∈ (a,b)) = 1, where −∞ ≤ a < b ≤ ∞. Let ϕ(x) be a twice differentiable function on (a,b), and define the function

h(x;ν) = [ϕ(x) − ϕ(ν)]/(x − ν)² − ϕ′(ν)/(x − ν).

Then

inf_{x∈(a,b)} h(x;µ) var(X) ≤ E[ϕ(X)] − ϕ(E[X]) ≤ sup_{x∈(a,b)} h(x;µ) var(X). (2)

Proof. Let F(x) be the cumulative distribution function of X. Applying Taylor's theorem to ϕ(x) about µ with the mean-value form of the remainder gives

ϕ(x) = ϕ(µ) + ϕ′(µ)(x − µ) + [ϕ″(g(x))/2] (x − µ)²,

where g(x) is between x and µ. Explicitly solving for ϕ″(g(x))/2 gives ϕ″(g(x))/2 = h(x;µ) as defined above. Therefore

E[ϕ(X)] − ϕ(E[X]) = ∫_a^b {ϕ(x) − ϕ(µ)} dF(x)
= ∫_a^b {ϕ′(µ)(x − µ) + h(x;µ)(x − µ)²} dF(x)
= ∫_a^b h(x;µ)(x − µ)² dF(x),

and the result follows because inf_{x∈(a,b)} h(x;µ) ≤ h(x;µ) ≤ sup_{x∈(a,b)} h(x;µ).

Theorem 1 also holds when inf h(x;µ) is replaced by inf ϕ″(x)/2 and sup h(x;µ) is replaced by sup ϕ″(x)/2, since inf_x ϕ″(x)/2 ≤ inf_x h(x;µ) and sup_x h(x;µ) ≤ sup_x ϕ″(x)/2. These less tight bounds are implied in the economics working paper Becker (2012). Our lower and upper bounds have the general form J · var(X), where J depends on ϕ.
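The bound (2) is straightforward to verify numerically. The sketch below (added for illustration; it is not part of the original paper and assumes only NumPy) implements h(x;ν), approximates the infimum and supremum on a grid, and estimates the Jensen gap by Monte Carlo for ϕ(x) = exp(x) and X uniform on (0, 2).

import numpy as np

def h(x, nu, phi, dphi):
    """h(x; nu) = [phi(x) - phi(nu)]/(x - nu)^2 - phi'(nu)/(x - nu)."""
    return (phi(x) - phi(nu)) / (x - nu) ** 2 - dphi(nu) / (x - nu)

phi, dphi = np.exp, np.exp                 # phi(x) = e^x, phi'(x) = e^x
a, b = 0.0, 2.0                            # X ~ Uniform(a, b)
mu, var = (a + b) / 2, (b - a) ** 2 / 12

grid = np.linspace(a + 1e-6, b - 1e-6, 10_000)
grid = grid[np.abs(grid - mu) > 1e-6]      # avoid the removable singularity at x = mu
hvals = h(grid, mu, phi, dphi)

rng = np.random.default_rng(1)
x = rng.uniform(a, b, size=1_000_000)
gap = phi(x).mean() - phi(mu)              # Monte Carlo estimate of E[phi(X)] - phi(E[X])

print(hvals.min() * var, gap, hvals.max() * var)   # lower bound <= gap <= upper bound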

Similar forms of bounds are presented in Abramovich and Persson (2016), Dragomir (2014), and Walker (2014), but our J in Theorem 1 is much simpler and applies to a wider class of ϕ. Inequality (2) implies Jensen's inequality when ϕ″(x) ≥ 0. Note also that Jensen's inequality is sharp when ϕ(x) is linear, whereas inequality (2) is sharp when ϕ(x) is a quadratic function of x.

In some applications the moments of X appearing in (2) are unknown, although a random sample x_1, ..., x_n from the underlying distribution F is available. A version of Theorem 1 suitable for this situation is given in the following corollary.

Corollary 1.1. Let x_1, ..., x_n be any n data points in (−∞, ∞), and let

x̄ = (1/n) Σ_{i=1}^n x_i,   ϕ̄_x = (1/n) Σ_{i=1}^n ϕ(x_i),   S² = (1/n) Σ_{i=1}^n (x_i − x̄)².

Then

inf_{x∈[a,b]} h(x; x̄) S² ≤ ϕ̄_x − ϕ(x̄) ≤ sup_{x∈[a,b]} h(x; x̄) S²,

where a = min{x_1, ..., x_n} and b = max{x_1, ..., x_n}.

Proof. Consider the discrete random variable X with probability distribution P(X = x_i) = 1/n, i = 1, ..., n. We have E[X] = x̄, E[ϕ(X)] = ϕ̄_x, and var(X) = S². The corollary then follows from an application of Theorem 1.

Lemma 1. If ϕ′(x) is convex, then h(x;µ) is monotonically increasing in x, and if ϕ′(x) is concave, then h(x;µ) is monotonically decreasing in x.

Proof. We prove that h′(x;µ) ≥ 0 when ϕ′(x) is convex; the analogous result for concave ϕ′(x) follows similarly. Note that

dh(x;µ)/dx = [ ϕ′(x) + ϕ′(µ) − 2(ϕ(x) − ϕ(µ))/(x − µ) ] · 1/(x − µ)²,

so it suffices to prove

(ϕ′(x) + ϕ′(µ))/2 ≥ (ϕ(x) − ϕ(µ))/(x − µ).

Without loss of generality we assume x > µ. Convexity of ϕ′(x) gives

ϕ′(y) ≤ ϕ′(µ) + [(ϕ′(x) − ϕ′(µ))/(x − µ)] (y − µ)

for all y ∈ (µ, x). Therefore we have

ϕ(x) − ϕ(µ) = ∫_µ^x ϕ′(y) dy ≤ ∫_µ^x { ϕ′(µ) + [(ϕ′(x) − ϕ′(µ))/(x − µ)] (y − µ) } dy = [(ϕ′(x) + ϕ′(µ))/2] (x − µ),

and the result follows.

Lemma 1 makes Theorem 1 easy to use, as the following results hold:

inf h(x;µ) = lim_{x→a} h(x;µ) and sup h(x;µ) = lim_{x→b} h(x;µ), when ϕ′(x) is convex;
inf h(x;µ) = lim_{x→b} h(x;µ) and sup h(x;µ) = lim_{x→a} h(x;µ), when ϕ′(x) is concave.

Note that the limits of h(x;µ) can be either finite or infinite. The proof of Lemma 1 borrows ideas from Bennish (2003). Examples of functions ϕ(x) for which ϕ′ is convex include ϕ(x) = exp(x) and ϕ(x) = x^p for p ≥ 2 or p ∈ (0,1]. Examples of functions ϕ(x) for which ϕ′ is concave include ϕ(x) = −log x and ϕ(x) = x^p for p < 0 or p ∈ [1,2].
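Lemma 1 is easy to check numerically as well. The short sketch below (added for illustration, not from the paper) evaluates h(x;µ) on a grid for ϕ(x) = −log x, whose first derivative −1/x is concave, and confirms that h(x;µ) is monotonically decreasing.

import numpy as np

def h(x, nu, phi, dphi):
    return (phi(x) - phi(nu)) / (x - nu) ** 2 - dphi(nu) / (x - nu)

phi = lambda t: -np.log(t)        # phi'(x) = -1/x is concave on (0, infinity)
dphi = lambda t: -1.0 / t

mu = 1.0
x = np.linspace(0.05, 10.0, 2_000)
x = x[np.abs(x - mu) > 1e-3]      # drop points too close to mu
print(np.all(np.diff(h(x, mu, phi, dphi)) < 0))   # True: h is decreasing, as Lemma 1 predicts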

3 Examples

Example 1 (Moment Generating Function). For any random variable X supported on (a,b) with a finite variance, we can bound the moment generating function E[e^{tX}] using Theorem 1 to get

inf_{x∈(a,b)} h(x;µ) var(X) ≤ E[e^{tX}] − e^{tE[X]} ≤ sup_{x∈(a,b)} h(x;µ) var(X),

where

h(x;µ) = [e^{tx} − e^{tµ}]/(x − µ)² − t e^{tµ}/(x − µ).

For t > 0 and (a,b) = (−∞, ∞), we have

inf h(x;µ) = lim_{x→−∞} h(x;µ) = 0 and sup h(x;µ) = lim_{x→∞} h(x;µ) = ∞,

so Theorem 1 provides no improvement over Jensen's inequality. However, on a restricted domain, such as for a non-negative random variable with (a,b) = (0, ∞), a significant improvement in the lower bound is possible because

inf h(x;µ) = h(0;µ) = [1 − e^{tµ} + tµ e^{tµ}]/µ² > 0.

Similar results hold for t < 0. We apply this to an example from Walker (2014), where X is an exponential random variable with mean 1 and ϕ(x) = e^{tx} with t = 1/2. Here the actual Jensen gap is E[e^{tX}] − e^{tE[X]} = 2 − e^{1/2} ≈ 0.351. Since var(X) = 1, we have

0.176 = h(0;µ) ≤ E[e^{tX}] − e^{tE[X]} ≤ lim_{x→∞} h(x;µ) = ∞.

The less sharp lower bound using inf ϕ″(x)/2 is 0.125. Utilizing elaborate approximations and numerical optimizations, Walker (2014) yielded a more accurate lower bound of 0.2721.
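The quantities in Example 1 can be reproduced directly. The sketch below (added for illustration, not from the paper, assuming NumPy) computes the crude bound inf ϕ″(x)/2, the sharper bound h(0;µ), and a Monte Carlo estimate of the true gap for X ~ Exp(1) and t = 1/2.

import numpy as np

t, mu, var = 0.5, 1.0, 1.0                      # X ~ Exp(1): E[X] = var(X) = 1

crude = t ** 2 / 2                              # inf phi''(x)/2 over (0, infinity) = 0.125
lower = (1 - np.exp(t * mu) + t * mu * np.exp(t * mu)) / mu ** 2   # h(0; mu) ~ 0.176

rng = np.random.default_rng(2)
x = rng.exponential(1.0, size=2_000_000)
gap = np.exp(t * x).mean() - np.exp(t * mu)     # true gap is 2 - e^{1/2} ~ 0.351

print(crude, lower, gap)                        # ~0.125 <= ~0.176 <= ~0.351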

Example 2 (Arithmetic vs. Geometric Mean). Let X be a positive random variable on an interval (a,b) with mean µ. Note that ϕ(x) = −log(x) is convex and its first derivative −1/x is concave. Applying Theorem 1 and Lemma 1 leads to

lim_{x→b} h(x;µ) var(X) ≤ −E[log(X)] + log µ ≤ lim_{x→a} h(x;µ) var(X),

where

h(x;µ) = [−log x + log µ]/(x − µ)² + 1/(µ(x − µ)).

Now consider a sample of n positive data points x_1, ..., x_n. Let x̄ be the arithmetic mean and x_g = (x_1 x_2 ⋯ x_n)^{1/n} be the geometric mean. Applying Corollary 1.1 gives

exp{S² h(b; x̄)} ≤ x̄/x_g ≤ exp{S² h(a; x̄)},

where a, b, and S² are as defined in Corollary 1.1. To give some numerical results, we generated 100 random numbers from the uniform distribution on [10, 100]. For these 100 numbers, the arithmetic mean x̄ is 54.830 and the geometric mean x_g is 47.509. The above inequality becomes

1.075 ≤ x̄/(x_1 x_2 ⋯ x_n)^{1/n} = 1.154 ≤ 1.331,

which are fairly tight bounds. Replacing h(b; x̄) by ϕ″(b)/2 and h(a; x̄) by ϕ″(a)/2 leads to a less accurate lower bound of 1.0339 and upper bound of 12.698.

Example 3 (Power Mean). Let X be a positive random variable on an interval (a,b) ⊂ (0, ∞) with mean µ. For any real number s ≠ 0, define the power mean as

M_s(X) = (E[X^s])^{1/s}.

Jensen's inequality establishes that M_s(X) is an increasing function of s. We now give a sharper inequality by applying Theorem 1. Let r ≠ 0, Y = X^r, µ_y = E[Y], p = s/r, and ϕ(y) = y^p. Note that E[X^s] = E[ϕ(Y)]. Applying Theorem 1 leads to

inf h(y;µ_y) var(Y) ≤ E[X^s] − (E[X^r])^p ≤ sup h(y;µ_y) var(Y),

where

h(y;µ_y) = [y^p − µ_y^p]/(y − µ_y)² − p µ_y^{p−1}/(y − µ_y).

To apply Lemma 1, note that ϕ′(y) is convex for p ≥ 2 or p ∈ (0,1] and is concave for p < 0 or p ∈ [1,2], as noted in Section 2. Applying the above result to the case of r = 1 and s = −1, we have Y = X and p = −1. Therefore

( (E[X])^{−1} + lim_{y→a} h(y;µ_y) var(X) )^{−1} ≤ (E[X^{−1}])^{−1} ≤ ( (E[X])^{−1} + lim_{y→b} h(y;µ_y) var(X) )^{−1}.

For the same sequence x_1, ..., x_n generated in Example 2, the harmonic mean is x̄_harmonic = 39.113. Applying Corollary 1.1 leads to

25.337 ≤ x̄_harmonic = 39.113 ≤ 48.905.

Note that the upper bound 48.905 is much smaller than the bound of x̄ = 54.830 given by Jensen's inequality alone. Replacing h(b; x̄) by ϕ″(b)/2 and h(a; x̄) by ϕ″(a)/2 leads to a less accurate lower bound of 2.0898 and upper bound of 51.0839.

In a recent article published in The American Statistician, de Carvalho (2016) revisited Kolmogorov's formulation of the generalized mean as

E_ϕ(X) = ϕ^{−1}(E[ϕ(X)]), (3)

where ϕ is a continuous monotone function with inverse ϕ^{−1}. Example 2 corresponds to ϕ(x) = −log(x) and Example 3 corresponds to ϕ(x) = x^s. We can also apply Theorem 1 to bound ϕ^{−1}(E[ϕ(X)]) for a more general function ϕ(x).
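Bounds of the kind used in Examples 2 and 3 are easy to compute from data with a small helper implementing Corollary 1.1. The sketch below (added for illustration; the helper name corollary_bounds and the fresh random sample are my own, so the numbers will differ from those quoted above) bounds the ratio of the arithmetic mean to the geometric mean.

import numpy as np

def h(x, nu, phi, dphi):
    return (phi(x) - phi(nu)) / (x - nu) ** 2 - dphi(nu) / (x - nu)

def corollary_bounds(xs, phi, dphi):
    """Corollary 1.1 bounds on mean(phi(xs)) - phi(mean(xs))."""
    xbar, s2 = xs.mean(), xs.var()                 # S^2 uses 1/n, as in Corollary 1.1
    grid = np.linspace(xs.min(), xs.max(), 10_000)
    grid = grid[np.abs(grid - xbar) > 1e-9]
    hv = h(grid, xbar, phi, dphi)
    return hv.min() * s2, hv.max() * s2

rng = np.random.default_rng(3)
xs = rng.uniform(10, 100, size=100)

# phi(x) = -log(x), so the bounded quantity is log(arithmetic mean / geometric mean)
lo, up = corollary_bounds(xs, lambda t: -np.log(t), lambda t: -1.0 / t)
ratio = xs.mean() / np.exp(np.log(xs).mean())      # arithmetic mean / geometric mean
print(np.exp(lo), ratio, np.exp(up))               # exp(lo) <= ratio <= exp(up)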

Example 4 (Rao-Blackwell Estimator). The Rao-Blackwell theorem (Theorem 7.3.17 in Casella and Berger, 2002; Theorem 10.4 in Wasserman, 2013) is a basic result in statistical estimation. Let θ̂ be an estimator of θ, let L(θ, θ̂) be a loss function convex in θ̂, and let T be a sufficient statistic. Then the Rao-Blackwell estimator θ̂* = E[θ̂ | T] satisfies the following risk inequality:

E[L(θ, θ̂)] ≥ E[L(θ, θ̂*)]. (4)

We can improve this inequality by applying Theorem 1 to ϕ(θ̂) = L(θ, θ̂) with respect to the conditional distribution of θ̂ given T:

E[L(θ, θ̂) | T] − L(θ, θ̂*) ≥ inf_{x∈(a,b)} h(x; θ̂*) var(θ̂ | T),

where the function h is defined as in Theorem 1 for ϕ(θ̂) and P(θ̂ ∈ (a,b) | T) = 1. Further taking expectations over T gives

E[L(θ, θ̂)] − E[L(θ, θ̂*)] ≥ E[ inf_{x∈(a,b)} h(x; θ̂*) var(θ̂ | T) ].

In particular, for squared-error loss L(θ, θ̂) = (θ̂ − θ)², we have

E[(θ − θ̂)²] − E[(θ − θ̂*)²] = E[var(θ̂ | T)].

Using the original Jensen's inequality only establishes the cruder inequality in Equation (4).
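For a concrete check of the squared-error identity above, the following simulation (my own illustrative setup, not from the paper) uses X_1, ..., X_n i.i.d. N(θ, 1), the crude estimator θ̂ = X_1, and the sufficient statistic T = X̄, so that θ̂* = E[X_1 | X̄] = X̄ and E[var(θ̂ | T)] = 1 − 1/n.

import numpy as np

rng = np.random.default_rng(4)
theta, n, reps = 2.0, 5, 200_000

x = rng.normal(theta, 1.0, size=(reps, n))
theta_hat = x[:, 0]                # crude estimator: a single observation
theta_rb = x.mean(axis=1)          # Rao-Blackwell estimator E[X_1 | X-bar] = X-bar

risk_hat = np.mean((theta_hat - theta) ** 2)    # ~ 1
risk_rb = np.mean((theta_rb - theta) ** 2)      # ~ 1/n
print(risk_hat - risk_rb, 1 - 1 / n)            # both ~ E[var(theta_hat | T)] = 1 - 1/n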

4 Improved bounds by partitioning

As discussed in Example 1 above, Theorem 1 does not improve on Jensen's inequality if inf h(x;µ) = 0. In such cases, we can often sharpen the bounds by partitioning the domain (a,b), following an approach used in Walker (2014). Let a = x_0 < x_1 < ⋯ < x_m = b, I_j = [x_{j−1}, x_j), η_j = P(X ∈ I_j), and µ_j = E(X | X ∈ I_j). It follows from the law of total expectation that

E[ϕ(X)] = Σ_{j=1}^m η_j E[ϕ(X) | X ∈ I_j]
= Σ_{j=1}^m η_j ϕ(µ_j) + Σ_{j=1}^m η_j ( E[ϕ(X) | X ∈ I_j] − ϕ(µ_j) ).

Let Y be a discrete random variable with distribution P(Y = µ_j) = η_j, j = 1, 2, ..., m, and let µ_y = E[Y]. It is easy to see that E[Y] = E[X]. It follows from Theorem 1 that

Σ_{j=1}^m η_j ϕ(µ_j) = E[ϕ(Y)] ≥ ϕ(E[Y]) + inf_{y∈[µ_1, µ_m]} h(y; µ_y) var(Y).

We can also apply Theorem 1 to each E[ϕ(X) | X ∈ I_j] − ϕ(µ_j) term:

E[ϕ(X) | X ∈ I_j] − ϕ(µ_j) ≥ inf_{x∈I_j} h(x; µ_j) var(X | X ∈ I_j).

Combining the above two equations, we have

E[ϕ(X)] − ϕ(E[X]) ≥ inf_{y∈[µ_1, µ_m]} h(y; µ_y) var(Y) + Σ_{j=1}^m η_j inf_{x∈I_j} h(x; µ_j) var(X | X ∈ I_j). (5)

Replacing inf by sup on the right-hand side gives the upper bound.

The Jensen gap on the left side of (5) is positive if any of the m + 1 terms on the right is positive. In particular, the Jensen gap is positive if there exists an interval I ⊂ (a,b) that satisfies inf_{x∈I} ϕ″(x) > 0, P(X ∈ I) > 0, and var(X | X ∈ I) > 0. Note that a finer partition does not necessarily lead to a sharper lower bound in (5). The focus of the partition should therefore be on isolating the part of the interval (a,b) in which ϕ″(x) is close to 0.

Consider the example X ~ N(µ, σ²) with µ = 0 and σ² = 1 and ϕ(x) = e^x. We divide (−∞, ∞) into three intervals with equal probabilities. This gives:

I_j             | η_j | E[X | X ∈ I_j] | var(X | X ∈ I_j) | inf_{x∈I_j} h(x;µ_j) | sup_{x∈I_j} h(x;µ_j)
(−∞, −0.431)    | 1/3 | −1.091         | 0.280            | 0.000                | 0.212
(−0.431, 0.431) | 1/3 | 0.000          | 0.060            | 0.435                | 0.580
(0.431, ∞)      | 1/3 | 1.091          | 0.280            | 1.209                | ∞

The actual Jensen gap is e^{µ+σ²/2} − e^µ = 0.649. The lower bound from (5) is 0.409, which is a huge improvement over Jensen's bound of 0. The upper bound, however, provides no improvement over Theorem 1.
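The partitioned lower bound (5) for this example can be reproduced numerically. The sketch below (added for illustration, not from the paper) uses Monte Carlo for the conditional moments and approximates each per-interval infimum by the minimum of h over the sampled points (which slightly overstates the exact infimum in the unbounded tails); it should return a lower bound close to the 0.409 reported above, against the true gap e^{1/2} − 1 ≈ 0.649.

import numpy as np

def h(x, nu, phi, dphi):
    return (phi(x) - phi(nu)) / (x - nu) ** 2 - dphi(nu) / (x - nu)

phi, dphi = np.exp, np.exp
rng = np.random.default_rng(5)
x = rng.normal(0.0, 1.0, size=2_000_000)           # X ~ N(0, 1)

cuts = np.quantile(x, [1 / 3, 2 / 3])              # ~ (-0.431, 0.431): equal-probability split
edges = np.concatenate(([-np.inf], cuts, [np.inf]))

lower, mus, etas = 0.0, [], []
for lo, up in zip(edges[:-1], edges[1:]):
    xj = x[(x >= lo) & (x < up)]
    mu_j = xj.mean()
    mus.append(mu_j)
    etas.append(len(xj) / len(x))
    pts = xj[np.abs(xj - mu_j) > 1e-6]             # approximate inf of h over I_j on sample points
    lower += etas[-1] * h(pts, mu_j, phi, dphi).min() * xj.var()

# discrete part: Y puts mass eta_j at mu_j, with E[Y] = E[X] = 0
mus = np.array(mus)
var_y = np.sum(np.array(etas) * mus ** 2)          # E[Y] ~ 0, so var(Y) ~ E[Y^2]
ygrid = np.linspace(mus.min(), mus.max(), 10_000)
ygrid = ygrid[np.abs(ygrid) > 1e-6]
lower += h(ygrid, 0.0, phi, dphi).min() * var_y

print(lower, np.exp(0.5) - 1)                      # lower bound (~0.41) vs true gap (~0.649)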

To summarize, this paper proposes a new sharpened version of Jensen's inequality. The proposed bound is simple and insightful, is broadly applicable by imposing minimal assumptions on ϕ(x), and provides fairly accurate results in spite of its simple form. It can be incorporated into any calculus-based statistical course.

References

Abramovich, S. and L.-E. Persson (2016). Some new estimates of the Jensen gap. Journal of Inequalities and Applications 2016(1), 39.

Abramovich, S., L.-E. Persson, and N. Samko (2014). Some new scales of refined Jensen and Hardy type inequalities. Mathematical Inequalities & Applications 17(3), 1105-1114.

Becker, R. A. (2012). The variance drain and Jensen's inequality. Technical report, CAEPR Working Paper No. 2012-004. Available at SSRN: https://ssrn.com/abstract=2027471 or http://dx.doi.org/10.2139/ssrn.2027471.

Bennish, J. (2003). A proof of Jensen's inequality. Missouri Journal of Mathematical Sciences 15(1).

Casella, G. and R. L. Berger (2002). Statistical Inference, Volume 2. Duxbury, Pacific Grove, CA.

de Carvalho, M. (2016). Mean, what do you mean? The American Statistician 70(3), 270-274.

Dempster, A. P., N. M. Laird, and D. B. Rubin (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1-38.

Dragomir, S. (2014). Jensen integral inequality for power series with nonnegative coefficients and applications. RGMIA Res. Rep. Collect. 17, 4.

Horvath, L., K. A. Khan, and J. Pecaric (2014). Refinement of Jensen's inequality for operator convex functions. Advances in Inequalities and Applications 2014.

Jensen, J. L. W. V. (1906). Sur les fonctions convexes et les inégalités entre les valeurs moyennes. Acta Mathematica 30(1), 175-193.

Walker, S. G. (2014). On a lower bound for the Jensen inequality. SIAM Journal on Mathematical Analysis 46(5), 3151-3157.

Wasserman, L. (2013). All of Statistics: A Concise Course in Statistical Inference. Springer Science & Business Media.