Appendix B: Inequalities Involving Random Variables and Their Expectations


Chapter Fourteen

In this appendix we present specific properties of the expectation, additional to those of the integral of measurable functions on possibly infinite measure spaces. It is to be expected that on probability spaces we may obtain more specific properties, since the probability space has total measure 1.

Proposition 14.1 (Markov inequality) Let $Z$ be a r.v. and let $g : \mathbb{R} \to [0, \infty]$ be an increasing, positive measurable function. Then
$$E[g(Z)] \ge E\big[g(Z)\mathbf{1}_{\{Z \ge c\}}\big] \ge g(c)\,P(Z \ge c).$$
Thus
$$P(Z \ge c) \le \frac{E[g(Z)]}{g(c)}$$
for all increasing functions $g$ and all $c > 0$.

Proof: Take $\lambda > 0$ arbitrary and define the random variable $Y = \lambda \mathbf{1}_{\{|X| \ge \lambda\}}$.

Handbook of Probability, First Edition. Ionuţ Florescu and Ciprian Tudor. © 2014 John Wiley & Sons, Inc. Published 2014 by John Wiley & Sons, Inc.

Then clearly $Y \le |X|$ and, taking the expectation, we get $EY = \lambda P(|X| \ge \lambda) \le E|X|$.

EXAMPLE 14.1 Special cases of the Markov inequality

If we take $g(x) = x$, an increasing function, and $Z$ a positive random variable, then we obtain
$$P(Z \ge c) \le \frac{E(Z)}{c}.$$
To get rid of the positivity condition, we take the random variable $Z = |X|$. Then we obtain the classical form of the Markov inequality:
$$P(|X| \ge c) \le \frac{E|X|}{c}.$$

If we take $g(x) = x^2$ and $Z = |X - E(X)|$ and we use the definition of variance, we obtain the Chebyshev inequality:
$$P(|X - E(X)| \ge c) \le \frac{\mathrm{Var}(X)}{c^2}.$$

If we denote $E(X) = \mu$ and $\mathrm{Var}(X) = \sigma^2$ and we take $c = k\sigma$ in the previous inequality, we obtain the classical Chebyshev inequality presented in undergraduate courses:

Proposition 14.2 For every $k > 0$ and for any random variable $X$ such that $EX^2 < \infty$, we have
$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

If we take $g(x) = e^{\theta x}$, with $\theta > 0$, then
$$P(Z \ge c) \le e^{-\theta c} E(e^{\theta Z}).$$
This last inequality states that the tail of the distribution decays exponentially in $c$ if $Z$ has finite exponential moments. With simple manipulations, one can obtain Chernoff's inequality from it.
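As a quick numerical sanity check (not from the text; the Exponential(1) distribution is a hypothetical illustrative choice, convenient because its tail, mean, and variance are all known in closed form), the Markov and Chebyshev bounds above can be compared with exact tail probabilities:

```python
import math

# Illustrative check for X ~ Exponential(1): E(X) = Var(X) = 1 and
# P(X >= c) = exp(-c), so both bounds can be compared with exact values.

def exp_tail(c):
    """Exact P(X >= c) for X ~ Exponential(1)."""
    return math.exp(-c)

def exp_central_tail(c):
    """Exact P(|X - 1| >= c) for X ~ Exponential(1)."""
    upper = math.exp(-(1.0 + c))                          # P(X >= 1 + c)
    lower = 1.0 - math.exp(-(1.0 - c)) if c < 1 else 0.0  # P(X <= 1 - c)
    return upper + lower

# Markov: P(X >= c) <= E(X)/c;  Chebyshev: P(|X - E(X)| >= c) <= Var(X)/c^2
markov_ok = all(exp_tail(c) <= 1.0 / c for c in (0.5, 1.0, 2.0, 5.0))
cheby_ok = all(exp_central_tail(c) <= 1.0 / c**2 for c in (1.5, 2.0, 3.0))
```

Both checks pass, and they also show how loose the bounds can be: at $c = 5$ the exact tail is $e^{-5} \approx 0.007$, while Markov only guarantees $0.2$.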

Remark 14.3 In fact the Chebyshev inequality is far from being sharp. Consider, for example, a random variable $X$ with standard normal distribution $N(0, 1)$. If we calculate the probability using a table of the normal law or using the computer, we obtain $P(X \ge 2) = 1 - \Phi(2) \approx 0.0228$. However, if we bound the probability using the Chebyshev inequality, we obtain
$$P(X \ge 2) = \tfrac{1}{2} P(|X| \ge 2) \le \tfrac{1}{2} \cdot \tfrac{1}{4} = \tfrac{1}{8} = 0.125,$$
which is very far from the actual probability.

The following definition is just a reminder.

Definition 14.4 A function $g : I \to \mathbb{R}$ is called a convex function on $I$ (where $I$ is any open interval in $\mathbb{R}$) if its graph lies below any of its chords. Mathematically: for any $x, y \in I$ and for any $\lambda \in (0, 1)$, we have
$$g(\lambda x + (1 - \lambda)y) \le \lambda g(x) + (1 - \lambda)g(y).$$
A function $g$ is called concave if the opposite happens:
$$g(\lambda x + (1 - \lambda)y) \ge \lambda g(x) + (1 - \lambda)g(y).$$

Some examples of convex functions on the whole of $\mathbb{R}$: $|x|$, $x^2$, and $e^{\theta x}$ with $\theta > 0$.

Lemma 14.5 (Jensen's inequality) Let $f$ be a convex function and let $X$ be a r.v. in $L^1(\Omega)$. Assume that $E(f(X))$ exists. Then
$$f(E(X)) \le E(f(X)).$$

Proof: Skipped. The classic approach (indicators, then simple functions, then positive measurable functions, then general measurable functions) is a standard way to prove Jensen's inequality.

Remark 14.6 The discrete form of Jensen's inequality is as follows. Let $\varphi : \mathbb{R} \to \mathbb{R}$ be a convex function and let $x_1, \dots, x_n \in \mathbb{R}$ and $a_i > 0$ for $i = 1, \dots, n$. Then
$$\varphi\!\left(\frac{\sum_{i=1}^n a_i x_i}{\sum_{i=1}^n a_i}\right) \le \frac{\sum_{i=1}^n a_i \varphi(x_i)}{\sum_{i=1}^n a_i}.$$
If the function $\varphi$ is concave, we have
$$\varphi\!\left(\frac{\sum_{i=1}^n a_i x_i}{\sum_{i=1}^n a_i}\right) \ge \frac{\sum_{i=1}^n a_i \varphi(x_i)}{\sum_{i=1}^n a_i}.$$
The remark is a particular case of the Jensen inequality. Indeed, consider a discrete random variable $X$ with outcomes $x_i$ and corresponding probabilities

$a_i / \sum_j a_j$. Apply Jensen's inequality above to the convex function $\varphi$, using the expression of the expectation of discrete random variables.

A Historical Remark The next inequality, one of the most famous and useful in any area of analysis (not only probability), is usually credited to Cauchy for sums and Schwarz for integrals and is usually known as the Cauchy–Schwarz inequality. However, the Russian mathematician Victor Yakovlevich Bunyakovsky (1804–1889) discovered and first published the inequality for integrals in 1859 (when Schwarz was 16). Unfortunately, he was born in eastern Europe. However, all who are born in eastern Europe (including myself) learn the inequality by its proper name.

Lemma 14.7 (Cauchy–Bunyakovsky–Schwarz inequality) If $X, Y \in L^2(\Omega)$, then $XY \in L^1(\Omega)$ and
$$|E[XY]| \le E[|XY|] \le \|X\|_2 \|Y\|_2,$$
where we used the notation of the norm in $L^p$: $\|X\|_p = \left(E[|X|^p]\right)^{1/p}$.

Proof: The first inequality is clear, applying Jensen's inequality to the function $|x|$. We need to show
$$E[|XY|] \le (E[X^2])^{1/2} (E[Y^2])^{1/2}.$$
Let $W = |X|$ and $Z = |Y|$. Clearly, $W, Z \ge 0$.

Truncation. Let $W_n = W \wedge n$ and $Z_n = Z \wedge n$; that is,
$$W_n(\omega) = \begin{cases} W(\omega), & \text{if } W(\omega) < n, \\ n, & \text{if } W(\omega) \ge n. \end{cases}$$
Clearly, defined in this way, $W_n, Z_n$ are bounded. Let $a, b \in \mathbb{R}$ be two constants. Then
$$0 \le E[(aW_n + bZ_n)^2] = a^2 E(W_n^2) + 2ab\,E(W_n Z_n) + b^2 E(Z_n^2).$$
If we let $a/b = c$, we get
$$c^2 E(W_n^2) + 2c\,E(W_n Z_n) + E(Z_n^2) \ge 0, \qquad \forall c \in \mathbb{R}.$$
This means that the quadratic function in $c$ has to be nonnegative. But this is only possible if the discriminant of the equation is nonpositive and the leading coefficient

$E(W_n^2)$ is strictly positive; the latter condition is obviously true. Thus we must have
$$4(E(W_n Z_n))^2 - 4E(W_n^2)E(Z_n^2) \le 0 \;\Rightarrow\; (E(W_n Z_n))^2 \le E(W_n^2)E(Z_n^2) \le E(W^2)E(Z^2) \quad \forall n,$$
which is in fact the inequality for the truncated variables. If we let $n \to \infty$ and use the monotone convergence theorem, we get
$$(E(WZ))^2 \le E(W^2)E(Z^2).$$

A generalization of the Cauchy–Bunyakovsky–Schwarz inequality is:

Lemma 14.8 (Hölder inequality) If $1/p + 1/q = 1$, $X \in L^p(\Omega)$, and $Y \in L^q(\Omega)$, then $XY \in L^1(\Omega)$ and
$$E|XY| \le \|X\|_p \|Y\|_q = \left(E|X|^p\right)^{1/p} \left(E|Y|^q\right)^{1/q}.$$

Proof: The proof is simple and uses the following inequality (the Young inequality): if $a$ and $b$ are positive real numbers and $p, q$ are as in the theorem, then
$$ab \le \frac{a^p}{p} + \frac{b^q}{q},$$
with equality if and only if $a^p = b^q$. Taking this inequality as given (it is not hard to prove), define
$$f = \frac{|X|}{\|X\|_p}, \qquad g = \frac{|Y|}{\|Y\|_q}.$$
Note that the Hölder inequality is equivalent to $E[fg] \le 1$. (Note that $\|X\|_p$ and $\|Y\|_q$ are just numbers which can be taken in and out of the integral using the linearity property of the integral.) To finish the proof, apply the Young inequality to $f \ge 0$ and $g \ge 0$ and then integrate to obtain
$$E[fg] \le \frac{1}{p} E[f^p] + \frac{1}{q} E[g^q] = \frac{1}{p} + \frac{1}{q} = 1,$$
since $E[f^p] = 1$ and similarly for $g$. Finally, the extreme cases ($p = 1$, $q = \infty$, etc.) may be treated separately, but they yield the same inequality.

This inequality together with the Riesz representation theorem creates the notion of conjugate space. This notion is only provided to create links with real analysis. For further details we recommend Royden (1988).
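Since the empirical (uniform) measure on a finite sample is itself a probability measure, the Cauchy–Bunyakovsky–Schwarz and Hölder inequalities must hold for sample averages as well. The following sketch checks both; the distributions, seed, and sample size are arbitrary illustrative choices, not from the text:

```python
import random

random.seed(42)
n = 1000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]   # arbitrary sample for X
ys = [random.expovariate(1.0) for _ in range(n)]  # arbitrary sample for Y

def lp_norm(vs, p):
    """Empirical L^p norm (E|V|^p)^(1/p) under the uniform measure on the sample."""
    return (sum(abs(v) ** p for v in vs) / len(vs)) ** (1.0 / p)

e_abs_xy = sum(abs(x * y) for x, y in zip(xs, ys)) / n  # empirical E|XY|

# Cauchy-Schwarz is Holder with p = q = 2
cs_ok = e_abs_xy <= lp_norm(xs, 2) * lp_norm(ys, 2)
# Holder with conjugate indices p = 3, q = 3/2 (since 1/3 + 2/3 = 1)
holder_ok = e_abs_xy <= lp_norm(xs, 3) * lp_norm(ys, 1.5)
```

Both inequalities hold deterministically for every sample, so the checks pass for any seed, not just this one.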

Definition 14.9 (Conjugate space of $L^p$) For $p > 0$, let $L^p(\Omega)$ denote the space on $(\Omega, \mathcal{F}, P)$. The number $q > 0$ with the property $1/p + 1/q = 1$ is called the conjugate index of $p$. The corresponding space $L^q(\Omega)$ is called the conjugate space of $L^p(\Omega)$.

Any of these spaces are metric spaces with the distance induced by the norm, that is,
$$d(X, Y) = \|X - Y\|_p = \left(E\left[|X - Y|^p\right]\right)^{1/p}.$$
The fact that this is a properly defined linear space is implied by the triangle inequality in $L^p$, the next theorem.

Lemma 14.10 (Minkowski inequality) If $X, Y \in L^p$, then $X + Y \in L^p$ and
$$\|X + Y\|_p \le \|X\|_p + \|Y\|_p.$$

Proof: We clearly have
$$|X + Y|^p \le 2^{p-1}\left(|X|^p + |Y|^p\right).$$
To show this inequality in terms of real numbers, just use the definition of convexity for the function $x^p$ with $x = |X|$, $y = |Y|$, and $\lambda = 1/2$. Integrating the inequality implies that $X + Y \in L^p$. Now we can write
$$\|X + Y\|_p^p = E[|X + Y|^p] \le E\big[(|X| + |Y|)|X + Y|^{p-1}\big] = E\big[|X||X + Y|^{p-1}\big] + E\big[|Y||X + Y|^{p-1}\big]$$
and, by the Hölder inequality, this is
$$\le \left(E[|X|^p]\right)^{1/p}\left(E[|X + Y|^{(p-1)q}]\right)^{1/q} + \left(E[|Y|^p]\right)^{1/p}\left(E[|X + Y|^{(p-1)q}]\right)^{1/q},$$
so, using $q = p/(p-1)$,
$$= \left(\|X\|_p + \|Y\|_p\right)\left(E[|X + Y|^p]\right)^{1 - \frac{1}{p}} = \left(\|X\|_p + \|Y\|_p\right)\|X + Y\|_p^{p-1}.$$
Finally, identifying the left- and right-hand sides after simplification, we obtain the result.

The Case of $L^2$ The case when $p = 2$ is quite special. This is because 2 is its own conjugate index ($1/2 + 1/2 = 1$). Because of this, the space is quite similar to the Euclidean space. If $X, Y \in L^2$, we may define the inner product
$$\langle X, Y \rangle = E[XY] = \int_\Omega XY \, dP,$$

which is a well-defined quantity by the Cauchy–Bunyakovsky–Schwarz inequality. The existence of the inner product and the completeness of the norm make $L^2$ a Hilbert space, with all the benefits that follow. In particular, the notion of orthogonality is well-defined. Two variables $X$ and $Y$ in $L^2$ are orthogonal if and only if $\langle X, Y \rangle = 0$. In turn, the orthogonality definition allows a Fourier representation and, in general, representations in terms of an orthonormal basis of functions in $L^2$. Again, we do not wish to enter into more details than necessary; please consult (Billingsley, 1995, Section 19) for further reference.

A consequence of the Markov inequality is the Bernstein inequality.

Proposition 14.11 (Bernstein inequality) Let $X_1, X_2, \dots, X_n$ be independent, square integrable random variables with zero expectation. Assume that there exists a constant $M > 0$ such that for every $i = 1, \dots, n$ we have $|X_i| \le M$ almost surely, that is, the variables are bounded by $M$ almost surely. Then, for every $t \ge 0$, we have
$$P\!\left(\sum_{i=1}^n X_i > t\right) \le \exp\!\left(-\frac{t^2}{2\sum_{i=1}^n EX_i^2 + \frac{2Mt}{3}}\right).$$

EXAMPLE 14.2 A random variable $X$ has finite variance $\sigma^2$. Show that for any number $c$,
$$P(X \ge t) \le \frac{E[(X + c)^2]}{(t + c)^2} \quad \text{if } t > -c.$$
Show that if $E(X) = 0$, then
$$P(X \ge t) \le \frac{\sigma^2}{\sigma^2 + t^2}, \qquad t > 0.$$

Solution: Let us use a technique similar to the Markov inequality to prove the first inequality. Let $F(x)$ be the distribution function of $X$. For any $c \in \mathbb{R}$ we may write
$$E\left[(X + c)^2\right] = \int_{-\infty}^{t} (x + c)^2 \, dF(x) + \int_{t}^{\infty} (x + c)^2 \, dF(x).$$

The first integral is always positive and, if $t > -c$, then $t + c > 0$ and on the interval $x \in (t, \infty)$ the function $(x + c)^2$ is increasing. Therefore we may continue:
$$E\left[(X + c)^2\right] \ge \int_{t}^{\infty} (t + c)^2 \, dF(x) = (t + c)^2 P(X > t).$$
Rewriting the final expression gives the first assertion.

To show the second assertion, note that if $E[X] = 0$, then $V(X) = E[X^2]$ and thus $E[(X + c)^2] = \sigma^2 + c^2$. Thus the inequality we just proved reads in this case:
$$P(X \ge t) \le \frac{\sigma^2 + c^2}{(t + c)^2}, \quad \text{if } t > -c.$$
Now take $c = \sigma^2/t$. Then $-c$ is a negative value for any positive $t$, so the condition $t > -c$ is satisfied for any positive $t$. Substituting and simplifying, we obtain exactly what we need. You may wonder (and should wonder) how we came up with the value $\sigma^2/t$. The explanation is simple: that is the value of $c$ which minimizes the expression $\frac{\sigma^2 + c^2}{(t + c)^2}$; in other words, the value of $c$ which produces the best bound.

14.1 Functions of Random Variables. The Transport Formula

In the previous chapters dedicated to discrete and continuous random variables, we learned how to calculate distributions, in particular pdf's for continuous random variables. In this appendix we present a more general result. This general result allows us to construct random variables and, in particular, distributions on any abstract space. This is the result that allows us to claim that studying random variables on $([0, 1], \mathcal{B}([0, 1]), \lambda)$ is enough. We had to postpone presenting the result until this point since we first had to learn how to integrate.

Theorem 14.12 (General Transport Formula) Let $(\Omega, \mathcal{F}, P)$ be a probability space. Let $f$ be a measurable function such that
$$(\Omega, \mathcal{F}) \xrightarrow{\;f\;} (\Omega', \mathcal{G}) \xrightarrow{\;\varphi\;} (\mathbb{R}, \mathcal{B}(\mathbb{R})),$$
where $(\Omega', \mathcal{G})$ is a measurable space. Assuming that at least one of the integrals exists, we then have
$$\int_\Omega \varphi \circ f \, dP = \int_{\Omega'} \varphi \, d(P \circ f^{-1}),$$
for all measurable functions $\varphi$.

Proof: We will use the standard argument technique discussed above.

1. Let $\varphi$ be the indicator function $\varphi = \mathbf{1}_A$ for $A \in \mathcal{G}$:
$$\mathbf{1}_A(\omega) = \begin{cases} 1 & \text{if } \omega \in A, \\ 0 & \text{otherwise.} \end{cases}$$
Then we get
$$\int \mathbf{1}_A \circ f \, dP = \int \mathbf{1}_A(f(\omega)) \, dP(\omega) = \int \mathbf{1}_{f^{-1}(A)}(\omega) \, dP(\omega) = P(f^{-1}(A)) = P \circ f^{-1}(A) = \int \mathbf{1}_A \, d(P \circ f^{-1}),$$
recalling the definition of the integral of an indicator.

2. Let $\varphi$ be a simple function $\varphi = \sum_{i=1}^n a_i \mathbf{1}_{A_i}$, where the $a_i$ are constants and $A_i \in \mathcal{G}$. Then
$$\int \varphi \circ f \, dP = \int \left(\sum_{i=1}^n a_i \mathbf{1}_{A_i}\right) \circ f \, dP = \sum_{i=1}^n a_i \int (\mathbf{1}_{A_i} \circ f) \, dP \stackrel{\text{(part 1)}}{=} \sum_{i=1}^n a_i \int \mathbf{1}_{A_i} \, d(P \circ f^{-1}) = \int \varphi \, d(P \circ f^{-1}).$$

3. Let $\varphi$ be a positive measurable function and let $\varphi_n$ be a sequence of simple functions such that $\varphi_n \uparrow \varphi$. Then
$$\int \varphi \circ f \, dP = \int \left(\lim_n \varphi_n\right) \circ f \, dP = \int \lim_n (\varphi_n \circ f) \, dP \stackrel{\text{(MCT)}}{=} \lim_n \int \varphi_n \circ f \, dP \stackrel{\text{(part 2)}}{=} \lim_n \int \varphi_n \, d(P \circ f^{-1}) \stackrel{\text{(MCT)}}{=} \int \varphi \, d(P \circ f^{-1}),$$
where (MCT) denotes the monotone convergence theorem.

4. Let $\varphi$ be a measurable function. Then $\varphi^+ = \max(\varphi, 0)$ and $\varphi^- = \max(-\varphi, 0)$, which gives us $\varphi = \varphi^+ - \varphi^-$. Since at least one integral is assumed to exist, we get that the integrals of $\varphi^+$ and $\varphi^-$ exist. Also note that
$$\varphi^+ \circ f(\omega) = \varphi^+(f(\omega)) = \max(\varphi(f(\omega)), 0) = \max(\varphi \circ f(\omega), 0) = (\varphi \circ f)^+(\omega).$$

Then
$$\int \varphi^+ \, d(P \circ f^{-1}) = \int \varphi^+ \circ f \, dP = \int (\varphi \circ f)^+ \, dP, \qquad \int \varphi^- \, d(P \circ f^{-1}) = \int \varphi^- \circ f \, dP = \int (\varphi \circ f)^- \, dP.$$
These equalities follow from part 3 of the proof. After subtracting the two, we obtain
$$\int \varphi \, d(P \circ f^{-1}) = \int \varphi \circ f \, dP.$$

EXAMPLE 14.3 If $X$ and $Y$ are independent random variables defined on $(\Omega, \mathcal{F}, P)$ with $X, Y \in L^1(\Omega)$, then $XY \in L^1(\Omega)$ and
$$\int XY \, dP = \int X \, dP \int Y \, dP \qquad (E(XY) = E(X)E(Y)).$$

Solution: Let us solve this example using the transport formula. Take $f : \Omega \to \mathbb{R}^2$, $f(\omega) = (X(\omega), Y(\omega))$, and $\varphi : \mathbb{R}^2 \to \mathbb{R}$, $\varphi(x, y) = xy$. Then we have from the transport formula:
$$\int_\Omega X(\omega)Y(\omega) \, dP(\omega) \stackrel{\text{(T)}}{=} \int_{\mathbb{R}^2} xy \, d\left(P \circ (X, Y)^{-1}\right).$$
The integral on the left is $E(XY)$, while the integral on the right, using independence (the joint law is the product of the marginals), can be calculated as
$$\int_{\mathbb{R}^2} xy \, d\left(P \circ X^{-1} \otimes P \circ Y^{-1}\right) = \int_{\mathbb{R}} x \, d(P \circ X^{-1}) \int_{\mathbb{R}} y \, d(P \circ Y^{-1}) \stackrel{\text{(T)}}{=} \int_\Omega X \, dP \int_\Omega Y \, dP = E(X)E(Y).$$
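The identity $E(XY) = E(X)E(Y)$ for independent variables can also be illustrated by Monte Carlo simulation; the distributions, seed, and sample size below are arbitrary illustrative assumptions, not from the text:

```python
import random

random.seed(1)
n = 100_000
xs = [random.gauss(0.0, 1.0) for _ in range(n)]  # X ~ N(0, 1), so E(X) = 0
ys = [random.random() for _ in range(n)]         # Y ~ Uniform(0, 1), so E(Y) = 1/2

def mean(vs):
    return sum(vs) / len(vs)

e_xy = mean([x * y for x, y in zip(xs, ys)])  # estimates E(XY)
e_x_e_y = mean(xs) * mean(ys)                 # estimates E(X)E(Y)

# Since the samples are drawn independently, the two estimates
# should agree up to Monte Carlo error of order 1/sqrt(n).
gap = abs(e_xy - e_x_e_y)
```

With $n = 100{,}000$ samples the standard error of each estimate is of order $10^{-3}$, so the gap between the two estimates is small.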

EXAMPLE 14.4 Finally, we conclude with an application of the transport formula which produces one of the most useful formulas. Let $X$ be a r.v. defined on the probability space $(\Omega, \mathcal{F}, P)$ with distribution function $F(x)$. Show that
$$E(X) = \int_{\mathbb{R}} x \, dF(x),$$
where the integral is understood in the Riemann–Stieltjes sense.

Proving the formula is immediate. Take $f : \Omega \to \mathbb{R}$, $f(\omega) = X(\omega)$, and $\varphi : \mathbb{R} \to \mathbb{R}$, $\varphi(x) = x$. Then from the transport formula we have
$$E(X) = \int_\Omega X(\omega) \, dP(\omega) \stackrel{\text{(T)}}{=} \int_{\mathbb{R}} x \, d(P \circ X^{-1})(x) = \int_{\mathbb{R}} x \, dF(x).$$
Clearly, if the distribution function $F(x)$ is differentiable with $dF(x) = f(x)\,dx$, we obtain the classical formula for calculating the expectation of a continuous random variable:
$$E(X) = \int_{\mathbb{R}} x f(x) \, dx.$$
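The last formula can be checked numerically. As a sketch (the exponential density, rate, step size, and truncation point are all illustrative assumptions, not from the text), a midpoint Riemann sum of $x f(x)$ for the Exponential(2) density recovers the known mean $1/\lambda = 0.5$:

```python
import math

lam = 2.0  # rate of the illustrative Exponential(lam) distribution

def density(x):
    """pdf f(x) = lam * exp(-lam * x) on [0, infinity)."""
    return lam * math.exp(-lam * x)

# Midpoint Riemann sum for E(X) = integral of x * f(x) dx over [0, 25];
# the tail beyond 25 contributes on the order of exp(-50), i.e. is negligible.
h = 0.001
steps = 25_000
expectation = sum(
    (k + 0.5) * h * density((k + 0.5) * h) * h
    for k in range(steps)
)
# The exact mean of Exponential(lam) is 1/lam = 0.5.
```

The midpoint rule has error of order $h^2$ here, so the computed value matches $0.5$ to well within $10^{-3}$.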


Lecture 4: September Reminder: convergence of sequences 36-705: Intermediate Statistics Fall 2017 Lecturer: Siva Balakrishnan Lecture 4: September 6 In this lecture we discuss the convergence of random variables. At a high-level, our first few lectures focused

More information

Lecture 2 One too many inequalities

Lecture 2 One too many inequalities University of Illinois Department of Economics Spring 2017 Econ 574 Roger Koenker Lecture 2 One too many inequalities In lecture 1 we introduced some of the basic conceptual building materials of the course.

More information

Chapter 1 Preliminaries

Chapter 1 Preliminaries Chapter 1 Preliminaries 1.1 Conventions and Notations Throughout the book we use the following notations for standard sets of numbers: N the set {1, 2,...} of natural numbers Z the set of integers Q the

More information

Limiting Distributions

Limiting Distributions Limiting Distributions We introduce the mode of convergence for a sequence of random variables, and discuss the convergence in probability and in distribution. The concept of convergence leads us to the

More information

2. Variance and Higher Moments

2. Variance and Higher Moments 1 of 16 7/16/2009 5:45 AM Virtual Laboratories > 4. Expected Value > 1 2 3 4 5 6 2. Variance and Higher Moments Recall that by taking the expected value of various transformations of a random variable,

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Expectation is a positive linear operator

Expectation is a positive linear operator Department of Mathematics Ma 3/103 KC Border Introduction to Probability and Statistics Winter 2017 Lecture 6: Expectation is a positive linear operator Relevant textbook passages: Pitman [3]: Chapter

More information

4 Hilbert spaces. The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan

4 Hilbert spaces. The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan The proof of the Hilbert basis theorem is not mathematics, it is theology. Camille Jordan Wir müssen wissen, wir werden wissen. David Hilbert We now continue to study a special class of Banach spaces,

More information

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN

Lecture Notes 5 Convergence and Limit Theorems. Convergence with Probability 1. Convergence in Mean Square. Convergence in Probability, WLLN Lecture Notes 5 Convergence and Limit Theorems Motivation Convergence with Probability Convergence in Mean Square Convergence in Probability, WLLN Convergence in Distribution, CLT EE 278: Convergence and

More information

17. Convergence of Random Variables

17. Convergence of Random Variables 7. Convergence of Random Variables In elementary mathematics courses (such as Calculus) one speaks of the convergence of functions: f n : R R, then lim f n = f if lim f n (x) = f(x) for all x in R. This

More information

Legendre transformation and information geometry

Legendre transformation and information geometry Legendre transformation and information geometry CIG-MEMO #2, v1 Frank Nielsen École Polytechnique Sony Computer Science Laboratorie, Inc http://www.informationgeometry.org September 2010 Abstract We explain

More information

Lecture 2: Review of Basic Probability Theory

Lecture 2: Review of Basic Probability Theory ECE 830 Fall 2010 Statistical Signal Processing instructor: R. Nowak, scribe: R. Nowak Lecture 2: Review of Basic Probability Theory Probabilistic models will be used throughout the course to represent

More information

The Canonical Gaussian Measure on R

The Canonical Gaussian Measure on R The Canonical Gaussian Measure on R 1. Introduction The main goal of this course is to study Gaussian measures. The simplest example of a Gaussian measure is the canonical Gaussian measure P on R where

More information

1. Stochastic Processes and filtrations

1. Stochastic Processes and filtrations 1. Stochastic Processes and 1. Stoch. pr., A stochastic process (X t ) t T is a collection of random variables on (Ω, F) with values in a measurable space (S, S), i.e., for all t, In our case X t : Ω S

More information

(z 0 ) = lim. = lim. = f. Similarly along a vertical line, we fix x = x 0 and vary y. Setting z = x 0 + iy, we get. = lim. = i f

(z 0 ) = lim. = lim. = f. Similarly along a vertical line, we fix x = x 0 and vary y. Setting z = x 0 + iy, we get. = lim. = i f . Holomorphic Harmonic Functions Basic notation. Considering C as R, with coordinates x y, z = x + iy denotes the stard complex coordinate, in the usual way. Definition.1. Let f : U C be a complex valued

More information

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product

Finite-dimensional spaces. C n is the space of n-tuples x = (x 1,..., x n ) of complex numbers. It is a Hilbert space with the inner product Chapter 4 Hilbert Spaces 4.1 Inner Product Spaces Inner Product Space. A complex vector space E is called an inner product space (or a pre-hilbert space, or a unitary space) if there is a mapping (, )

More information

Probability and Measure

Probability and Measure Chapter 4 Probability and Measure 4.1 Introduction In this chapter we will examine probability theory from the measure theoretic perspective. The realisation that measure theory is the foundation of probability

More information

High Dimensional Probability

High Dimensional Probability High Dimensional Probability for Mathematicians and Data Scientists Roman Vershynin 1 1 University of Michigan. Webpage: www.umich.edu/~romanv ii Preface Who is this book for? This is a textbook in probability

More information

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms

08a. Operators on Hilbert spaces. 1. Boundedness, continuity, operator norms (February 24, 2017) 08a. Operators on Hilbert spaces Paul Garrett garrett@math.umn.edu http://www.math.umn.edu/ garrett/ [This document is http://www.math.umn.edu/ garrett/m/real/notes 2016-17/08a-ops

More information

Lecture Notes for MA 623 Stochastic Processes. Ionut Florescu. Stevens Institute of Technology address:

Lecture Notes for MA 623 Stochastic Processes. Ionut Florescu. Stevens Institute of Technology  address: Lecture Notes for MA 623 Stochastic Processes Ionut Florescu Stevens Institute of Technology E-mail address: ifloresc@stevens.edu 2000 Mathematics Subject Classification. 60Gxx Stochastic Processes Abstract.

More information

Solution of the 8 th Homework

Solution of the 8 th Homework Solution of the 8 th Homework Sangchul Lee December 8, 2014 1 Preinary 1.1 A simple remark on continuity The following is a very simple and trivial observation. But still this saves a lot of words in actual

More information

CHAPTER 3: LARGE SAMPLE THEORY

CHAPTER 3: LARGE SAMPLE THEORY CHAPTER 3 LARGE SAMPLE THEORY 1 CHAPTER 3: LARGE SAMPLE THEORY CHAPTER 3 LARGE SAMPLE THEORY 2 Introduction CHAPTER 3 LARGE SAMPLE THEORY 3 Why large sample theory studying small sample property is usually

More information

Probability Theory I: Syllabus and Exercise

Probability Theory I: Syllabus and Exercise Probability Theory I: Syllabus and Exercise Narn-Rueih Shieh **Copyright Reserved** This course is suitable for those who have taken Basic Probability; some knowledge of Real Analysis is recommended( will

More information

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability...

Functional Analysis. Franck Sueur Metric spaces Definitions Completeness Compactness Separability... Functional Analysis Franck Sueur 2018-2019 Contents 1 Metric spaces 1 1.1 Definitions........................................ 1 1.2 Completeness...................................... 3 1.3 Compactness......................................

More information

CHANGE OF MEASURE. D.Majumdar

CHANGE OF MEASURE. D.Majumdar CHANGE OF MEASURE D.Majumdar We had touched upon this concept when we looked at Finite Probability spaces and had defined a R.V. Z to change probability measure on a space Ω. We need to do the same thing

More information

STA 711: Probability & Measure Theory Robert L. Wolpert

STA 711: Probability & Measure Theory Robert L. Wolpert STA 711: Probability & Measure Theory Robert L. Wolpert 6 Independence 6.1 Independent Events A collection of events {A i } F in a probability space (Ω,F,P) is called independent if P[ i I A i ] = P[A

More information

CHAPTER 1. Metric Spaces. 1. Definition and examples

CHAPTER 1. Metric Spaces. 1. Definition and examples CHAPTER Metric Spaces. Definition and examples Metric spaces generalize and clarify the notion of distance in the real line. The definitions will provide us with a useful tool for more general applications

More information

Your first day at work MATH 806 (Fall 2015)

Your first day at work MATH 806 (Fall 2015) Your first day at work MATH 806 (Fall 2015) 1. Let X be a set (with no particular algebraic structure). A function d : X X R is called a metric on X (and then X is called a metric space) when d satisfies

More information

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space.

Hilbert Spaces. Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Hilbert Spaces Hilbert space is a vector space with some extra structure. We start with formal (axiomatic) definition of a vector space. Vector Space. Vector space, ν, over the field of complex numbers,

More information

5 Operations on Multiple Random Variables

5 Operations on Multiple Random Variables EE360 Random Signal analysis Chapter 5: Operations on Multiple Random Variables 5 Operations on Multiple Random Variables Expected value of a function of r.v. s Two r.v. s: ḡ = E[g(X, Y )] = g(x, y)f X,Y

More information

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539

Brownian motion. Samy Tindel. Purdue University. Probability Theory 2 - MA 539 Brownian motion Samy Tindel Purdue University Probability Theory 2 - MA 539 Mostly taken from Brownian Motion and Stochastic Calculus by I. Karatzas and S. Shreve Samy T. Brownian motion Probability Theory

More information

Chapter 5. Random Variables (Continuous Case) 5.1 Basic definitions

Chapter 5. Random Variables (Continuous Case) 5.1 Basic definitions Chapter 5 andom Variables (Continuous Case) So far, we have purposely limited our consideration to random variables whose ranges are countable, or discrete. The reason for that is that distributions on

More information

3 Orthogonality and Fourier series

3 Orthogonality and Fourier series 3 Orthogonality and Fourier series We now turn to the concept of orthogonality which is a key concept in inner product spaces and Hilbert spaces. We start with some basic definitions. Definition 3.1. Let

More information

Exercises to Applied Functional Analysis

Exercises to Applied Functional Analysis Exercises to Applied Functional Analysis Exercises to Lecture 1 Here are some exercises about metric spaces. Some of the solutions can be found in my own additional lecture notes on Blackboard, as the

More information

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm

EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm EE/Stats 376A: Homework 7 Solutions Due on Friday March 17, 5 pm 1. Feedback does not increase the capacity. Consider a channel with feedback. We assume that all the recieved outputs are sent back immediately

More information

Mathematics Department Stanford University Math 61CM/DM Inner products

Mathematics Department Stanford University Math 61CM/DM Inner products Mathematics Department Stanford University Math 61CM/DM Inner products Recall the definition of an inner product space; see Appendix A.8 of the textbook. Definition 1 An inner product space V is a vector

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 9 10/2/2013. Conditional expectations, filtration and martingales

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 9 10/2/2013. Conditional expectations, filtration and martingales MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 9 10/2/2013 Conditional expectations, filtration and martingales Content. 1. Conditional expectations 2. Martingales, sub-martingales

More information

ECE Lecture #9 Part 2 Overview

ECE Lecture #9 Part 2 Overview ECE 450 - Lecture #9 Part Overview Bivariate Moments Mean or Expected Value of Z = g(x, Y) Correlation and Covariance of RV s Functions of RV s: Z = g(x, Y); finding f Z (z) Method : First find F(z), by

More information

THEOREMS, ETC., FOR MATH 515

THEOREMS, ETC., FOR MATH 515 THEOREMS, ETC., FOR MATH 515 Proposition 1 (=comment on page 17). If A is an algebra, then any finite union or finite intersection of sets in A is also in A. Proposition 2 (=Proposition 1.1). For every

More information

1 Presessional Probability

1 Presessional Probability 1 Presessional Probability Probability theory is essential for the development of mathematical models in finance, because of the randomness nature of price fluctuations in the markets. This presessional

More information

COMPSCI 240: Reasoning Under Uncertainty

COMPSCI 240: Reasoning Under Uncertainty COMPSCI 240: Reasoning Under Uncertainty Andrew Lan and Nic Herndon University of Massachusetts at Amherst Spring 2019 Lecture 20: Central limit theorem & The strong law of large numbers Markov and Chebyshev

More information

Projection Theorem 1

Projection Theorem 1 Projection Theorem 1 Cauchy-Schwarz Inequality Lemma. (Cauchy-Schwarz Inequality) For all x, y in an inner product space, [ xy, ] x y. Equality holds if and only if x y or y θ. Proof. If y θ, the inequality

More information
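The Markov and Chebyshev inequalities presented in this appendix can be checked numerically. The sketch below (an illustrative example, not from the text) draws samples from an Exponential(1) distribution, for which E[X] = 1 and Var(X) = 1, and verifies that the empirical tail probabilities stay below the corresponding bounds E[X]/c and Var(X)/c²:

```python
import random

random.seed(0)
n = 200_000

# Exponential(1) samples: a positive random variable with E[X] = 1, Var(X) = 1.
xs = [random.expovariate(1.0) for _ in range(n)]

mean = sum(xs) / n
var = sum((x - mean) ** 2 for x in xs) / n

c = 3.0
# Markov: P(X >= c) <= E[X] / c, valid since X >= 0.
tail = sum(x >= c for x in xs) / n
markov = mean / c
# Chebyshev: P(|X - E[X]| >= c) <= Var(X) / c^2.
dev_tail = sum(abs(x - mean) >= c for x in xs) / n
cheby = var / c ** 2

print(f"P(X >= {c}) = {tail:.4f}  <=  Markov bound {markov:.4f}")
print(f"P(|X - EX| >= {c}) = {dev_tail:.4f}  <=  Chebyshev bound {cheby:.4f}")
```

For this distribution the true tail P(X ≥ 3) = e⁻³ ≈ 0.050 is well below the Markov bound 1/3, which illustrates that these bounds are universal but often loose; the exponential-moment bound P(X ≥ c) ≤ e^(−λc) E[e^(λX)] from the last example is much sharper when it applies.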