Section 21: Expected Values

Po-Ning Chen, Professor
Institute of Communications Engineering
National Chiao Tung University
Hsin Chu, Taiwan 30010, R.O.C.
Expected value as integral 21-1

Definition (expected value) The expected value of a random variable $X$ is the integral of $X$ with respect to its cdf, i.e.,

  $E[X] = E[X^+] - E[X^-] = \int_0^\infty x \, dF_X(x) - \int_{-\infty}^0 (-x) \, dF_X(x) = \int_{-\infty}^\infty x \, dF_X(x).$

Properties of expected value

1. $E[X]$ exists if, and only if, at least one of $E[X^+]$ and $E[X^-]$ is finite.
2. $X$ is integrable if $E[|X|] < \infty$.
3. For a $B/B$-measurable (or simply $B$-measurable) function $g$,

  $E[g(X)] = E[g(X)^+] - E[g(X)^-] = \int_{\{x \in R : g(x) \ge 0\}} g(x) \, dF_X(x) - \int_{\{x \in R : g(x) < 0\}} [-g(x)] \, dF_X(x) = \int_{-\infty}^\infty g(x) \, dF_X(x).$
Absolute moments 21-2

Definition (absolute moments) The $k$th absolute moment of $X$ is

  $E[|X|^k] = \int_{-\infty}^\infty |x|^k \, dF_X(x).$

Properties of absolute moments

1. The absolute moment always exists (possibly infinite).
2. If the $k$th absolute moment is finite, then the $j$th absolute moment (and hence the $j$th moment) is finite for any $j \le k$.
   Proof: It can be easily proved by $|x|^j \le 1 + |x|^k$ for $j \le k$.
Moments 21-3

Definition (moments) The $k$th moment of $X$ is

  $E[X^k] = \int_{-\infty}^\infty x^k \, dF_X(x) = \int \max\{x^k, 0\} \, dF_X(x) - \int \left(-\min\{x^k, 0\}\right) dF_X(x) = E[(X^k)^+] - E[(X^k)^-].$

Properties of moments

1. $E[X^k]$ exists if, and only if, at least one of $E[(X^k)^+]$ and $E[(X^k)^-]$ is finite.
2. $X^k$ is integrable if $E[|X^k|] < \infty$.
3. For a $B/B$-measurable (or simply $B$-measurable) function $g$,

  $E[g(X^k)] = E[g(X^k)^+] - E[g(X^k)^-] = \int_{\{x \in R : g(x^k) \ge 0\}} g(x^k) \, dF_X(x) - \int_{\{x \in R : g(x^k) < 0\}} [-g(x^k)] \, dF_X(x) = \int_{-\infty}^\infty g(x^k) \, dF_X(x).$
Moments 21-4

Example 21.1 (moments of the standard normal) The pdf of a standard normal distribution is equal to

  $\frac{1}{\sqrt{2\pi}} e^{-x^2/2}.$

Integration by parts shows that

  $\frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty x^k e^{-x^2/2} \, dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^\infty (k-1) \, x^{k-2} e^{-x^2/2} \, dx,$

or equivalently, $E[X^k] = (k-1) E[X^{k-2}]$. As a result,

  $E[X^k] = \begin{cases} 0, & \text{for } k \text{ odd}; \\ 1 \cdot 3 \cdot 5 \cdots (k-1), & \text{for } k \text{ even}. \end{cases}$
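The recursion is easy to sanity-check numerically (a minimal sketch, assuming numpy and scipy are available; it simply integrates $x^k$ against the standard normal pdf):

```python
import numpy as np
from scipy import integrate

pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # standard normal pdf

def moment(k):
    # E[X^k] by numerical integration over the whole real line.
    val, _ = integrate.quad(lambda x: x**k * pdf(x), -np.inf, np.inf)
    return val

for k in range(2, 9):
    print(k, round(moment(k), 6), round((k - 1) * moment(k - 2), 6))
# Odd k print ~0; even k print 1, 3, 15, 105 = 1*3*5*...*(k-1).
```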
Computation of mean 21-5

Theorem For a non-negative random variable $X$,

  $E[X] = \int_0^\infty \Pr[X > t] \, dt = \int_0^\infty \Pr[X \ge t] \, dt.$

In other words, the area under the ccdf (complementary cdf) is the mean. (The theorem is valid even if $E[X] = \infty$.) (When $\Pr[X > t]$ is replaced by its empirical estimate, the area under the empirical ccdf approximates the sample mean, in line with the law of large numbers.)

Geometric interpretation Suppose $\Pr[X = x_i] = p_i$ for $1 \le i \le k$, with $0 \le x_1 < x_2 < \cdots < x_k$.

[Figure: the staircase cdf with a jump of size $p_i$ at each $x_i$; the region above the cdf and below the level 1 is the ccdf area.]

The area of the ccdf $= p_1 x_1 + p_2 x_2 + p_3 x_3 + \cdots + p_{k-1} x_{k-1} + p_k x_k = E[X]$.
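The staircase picture can be checked directly (a minimal sketch, assuming numpy; the point masses $x_i$, $p_i$ below are arbitrary illustrative values):

```python
import numpy as np

# Discrete X with Pr[X = x_i] = p_i, as in the geometric interpretation.
xi = np.array([1.0, 2.0, 5.0])
pi = np.array([0.5, 0.3, 0.2])
mean = np.dot(pi, xi)                                 # p1*x1 + p2*x2 + p3*x3

# The ccdf Pr[X > t] is a staircase; integrate it exactly, piece by piece.
t = np.concatenate(([0.0], xi))                       # breakpoints of the staircase
ccdf = 1.0 - np.cumsum(np.concatenate(([0.0], pi)))   # value on each piece
area = np.sum(np.diff(t) * ccdf[:-1])

print(area, mean)                                     # both 2.1
```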
Computation of mean 21-6

Theorem For $\alpha \ge 0$,

  $\int_{(\alpha, \infty)} x \, dF_X(x) = \alpha \Pr[X > \alpha] + \int_\alpha^\infty \Pr[X > t] \, dt.$

Note that it is always true that $\int_{[0,\infty)} x \, dF_X(x) = \int_{(0,\infty)} x \, dF_X(x)$, but it is possible that $\int_{[\alpha,\infty)} x \, dF_X(x) \ne \int_{(\alpha,\infty)} x \, dF_X(x)$ (when there is an atom at $\alpha > 0$)!

Proof: Let $Y = X \cdot I_{[X > \alpha]}$, where $I_{[X > \alpha]} = 1$ if $X > \alpha$, and zero otherwise. Hence $Y = 0$ for $X \le \alpha$, and $Y = X$ for $X > \alpha$. Consequently,

  $\int_{(\alpha, \infty)} x \, dF_X(x) = E[Y] = \int_0^\infty \Pr[Y > t] \, dt = \int_0^\alpha \Pr[Y > t] \, dt + \int_\alpha^\infty \Pr[Y > t] \, dt$
  $= \int_0^\alpha \Pr[X > \alpha] \, dt + \int_\alpha^\infty \Pr[X > t] \, dt = \alpha \Pr[X > \alpha] + \int_\alpha^\infty \Pr[X > t] \, dt.$

The empirical approximation of $\Pr[X > t]$ (or $\Pr[X \ge t]$) is more easily obtained than $dF_X(x)$. With the above result, $E[X]$ (or $E[Y]$) can be estimated directly from $\Pr[X > t]$.
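The last remark can be illustrated by simulation (a hedged sketch, assuming numpy; Exp(1) is chosen only because its tail integral $(1+\alpha)e^{-\alpha}$ is known in closed form):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.exponential(scale=1.0, size=200_000)
a = 0.7

lhs = np.mean(x * (x > a))        # direct estimate of the integral over (a, inf)

xs = np.sort(x)
t = np.linspace(a, xs[-1], 4000)
ccdf = 1.0 - np.searchsorted(xs, t, side="right") / xs.size   # empirical Pr[X > t]
tail_area = np.sum(np.diff(t) * (ccdf[:-1] + ccdf[1:]) / 2)   # trapezoid rule
rhs = a * np.mean(x > a) + tail_area

print(lhs, rhs)                   # both approximate (1 + a) * exp(-a) ~ 0.844
```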
Inequalities regarding moments 21-7
Markov's inequality 21-8

Lemma (Markov's inequality) For any $k > 0$ (and implicitly $\alpha > 0$),

  $\Pr[|X| \ge \alpha] \le \frac{1}{\alpha^k} E[|X|^k].$

Proof: The inequality below is valid for any $x \in R$:

  $|x|^k \ge \alpha^k \, 1\{|x|^k \ge \alpha^k\}.$  (21.1)

Hence,

  $E[|X|^k] = \int_{-\infty}^\infty |x|^k \, dF_X(x) \ge \int_{-\infty}^\infty \alpha^k \, 1\{|x|^k \ge \alpha^k\} \, dF_X(x) = \alpha^k \Pr[|X| \ge \alpha].$

Equality holds if, and only if, equality in (21.1) is true with probability 1, i.e.,

  $\Pr\left[|X|^k = \alpha^k \, 1\{|X|^k \ge \alpha^k\}\right] = 1,$

or equivalently, $\Pr[|X| = 0 \text{ or } |X| = \alpha] = 1$.
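A quick empirical check of the bound (a minimal sketch, assuming numpy; the exponential sample is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.exponential(size=100_000)
for alpha in (1.0, 2.0, 4.0):
    for k in (1, 2):
        tail = np.mean(np.abs(x) >= alpha)             # Pr[|X| >= alpha]
        bound = np.mean(np.abs(x) ** k) / alpha ** k   # E|X|^k / alpha^k
        print(alpha, k, tail <= bound, tail, bound)    # tail never exceeds bound
```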
Chebyshev-Bienaymé inequality 21-9

Lemma (Chebyshev-Bienaymé inequality) For $\alpha > 0$,

  $\Pr[|X - E[X]| \ge \alpha] \le \frac{1}{\alpha^2} Var[X].$

Proof: By Markov's inequality applied to $|X - E[X]|$ with $k = 2$, we have

  $\Pr[|X - E[X]| \ge \alpha] \le \frac{1}{\alpha^2} E[|X - E[X]|^2].$

Equality holds if, and only if,

  $\Pr[|X - E[X]| = 0] + \Pr[|X - E[X]| = \alpha] = 1,$

which implies that

  $\Pr[X = E[X] + \alpha] = \Pr[X = E[X] - \alpha] = p$ and $\Pr[X = E[X]] = 1 - 2p$

for some $p$, with $\alpha > 0$.
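The same kind of check works for Chebyshev-Bienaymé (a sketch, assuming numpy; the sample mean and variance stand in for $E[X]$ and $Var[X]$):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.normal(loc=1.0, scale=2.0, size=100_000)
for alpha in (1.0, 2.0, 4.0):
    tail = np.mean(np.abs(x - x.mean()) >= alpha)   # Pr[|X - E X| >= alpha]
    print(alpha, tail, x.var() / alpha ** 2)        # tail never exceeds the bound
```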
Jensen's inequality 21-10

Definition (convexity) A function $\varphi(x)$ is said to be convex over an interval $(a, b)$ if for every $x_1, x_2 \in (a, b)$ and $0 \le \lambda \le 1$,

  $\varphi(\lambda x_1 + (1 - \lambda) x_2) \le \lambda \varphi(x_1) + (1 - \lambda) \varphi(x_2).$

Furthermore, a function $\varphi$ is said to be strictly convex if equality holds only when $\lambda = 0$ or $\lambda = 1$. (Can we replace $(a, b)$ by a real set $\mathcal{X}$?)

Definition (support line) A line $y = ax + b$ is said to be a support line of a function $\varphi(x)$ if, among all lines of the same slope $a$, it is the largest one satisfying $ax + b \le \varphi(x)$ for every $x$.

A support line $ax + b$ may not necessarily intersect $\varphi(\cdot)$. In other words, it is possible that no $x_0$ satisfies $ax_0 + b = \varphi(x_0)$. However, the existence of an intersection between the function $\varphi(\cdot)$ and its support line is guaranteed if $\varphi(\cdot)$ is convex.
Jensen's inequality 21-11

[Figure: an example in which a function $\varphi(x)$ and its support line $ax + b$ have no intersection point.]
Jensen's inequality 21-12

Lemma (Jensen's inequality) Suppose that the function $\varphi(\cdot)$ is convex on the domain $\mathcal{X}$ of $X$. (Implicitly, $E[X] \in \mathcal{X}$.) Then

  $\varphi(E[X]) \le E[\varphi(X)].$

Proof: Let $ax + b$ be a support line through the point $(E[X], \varphi(E[X]))$. Thus, over the domain $\mathcal{X}$ of $\varphi(x)$,

  $ax + b \le \varphi(x).$

(If equality holds on $\mathcal{X}$ in this step, then equality remains true in the subsequent steps.) By taking the expectation of both sides, we obtain

  $a E[X] + b \le E[\varphi(X)],$

but we know that $a E[X] + b = \varphi(E[X])$. Consequently, $\varphi(E[X]) \le E[\varphi(X)]$.

Equality holds if, and only if, there exist $a$ and $b$ such that $a E[X] + b = \varphi(E[X])$ and $\Pr\left(\{x \in \mathcal{X} : ax + b = \varphi(x)\}\right) = 1$.
Jensen's inequality 21-13

[Figure: the support line $y = ax + b$ of the convex function $\varphi(x)$, touching the curve at the point $(E[X], \varphi(E[X]))$.]
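Jensen's inequality is easy to see empirically (a minimal sketch, assuming numpy; $\varphi = \exp$ is convex, and for a standard normal $E[e^X] = e^{1/2}$):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=100_000)
# phi(E[X]) vs E[phi(X)] with phi = exp:
print(np.exp(x.mean()), np.exp(x).mean())   # ~1.0 <= ~1.65, as Jensen predicts
```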
Hölder's inequality 21-14

Lemma (Hölder's inequality) For $p > 1$, $q > 1$ and $1/p + 1/q = 1$,

  $E[|XY|] \le E^{1/p}[|X|^p] \, E^{1/q}[|Y|^q].$

Proof: The inequality is trivially valid if $E^{1/p}[|X|^p] \, E^{1/q}[|Y|^q] = 0$. So, without loss of generality, assume $E^{1/p}[|X|^p] \, E^{1/q}[|Y|^q] > 0$.

$\exp\{x\}$ is a convex function in $x$. Hence, by Jensen's inequality,

  $\exp\left\{\frac{1}{p} s + \frac{1}{q} t\right\} \le \frac{1}{p} \exp\{s\} + \frac{1}{q} \exp\{t\}.$

Since $e^x$ is strictly convex, equality holds iff $s = t$.

Let $a = \exp\{s/p\}$ and $b = \exp\{t/q\}$. Then the above inequality becomes

  $ab \le \frac{1}{p} a^p + \frac{1}{q} b^q,$

whose validity is not restricted to positive $a$ and $b$ but extends to non-negative $a$ and $b$. Equality holds iff $a^p = b^q$.

By letting $a = |x| / E^{1/p}[|X|^p]$ and $b = |y| / E^{1/q}[|Y|^q]$, we obtain

  $\frac{|xy|}{E^{1/p}[|X|^p] \, E^{1/q}[|Y|^q]} \le \frac{1}{p} \frac{|x|^p}{E[|X|^p]} + \frac{1}{q} \frac{|y|^q}{E[|Y|^q]}.$

Equality holds if, and only if, $\Pr\left[\dfrac{|X|^p}{E[|X|^p]} = \dfrac{|Y|^q}{E[|Y|^q]}\right] = 1$.
Hölder's inequality 21-15

Taking the expectation of both sides yields

  $\frac{E[|XY|]}{E^{1/p}[|X|^p] \, E^{1/q}[|Y|^q]} \le \frac{1}{p} \frac{E[|X|^p]}{E[|X|^p]} + \frac{1}{q} \frac{E[|Y|^q]}{E[|Y|^q]} = \frac{1}{p} + \frac{1}{q} = 1.$

Lemma (Hölder's inequality) For $p > 1$, $q > 1$ and $1/p + 1/q = 1$,

  $E[|XY|] \le E^{1/p}[|X|^p] \, E^{1/q}[|Y|^q].$

Equality holds if, and only if,

  $\Pr\left[\frac{|X|^p}{E[|X|^p]} = \frac{|Y|^q}{E[|Y|^q]}\right] = 1$, or $\Pr[X = 0] = 1$, or $\Pr[Y = 0] = 1$.

Example. Take $p = q = 2$ and the joint pmf:

           Y = 0    Y = 1
  X = 0    p_00     p_01
  X = 1    p_10     p_11

Then

  $E[|XY|] = p_{11} = p_{11}^{1/2} p_{11}^{1/2} \le (p_{10} + p_{11})^{1/2} (p_{01} + p_{11})^{1/2} = E^{1/2}[|X|^2] \, E^{1/2}[|Y|^2],$

with equality holding if, and only if, $p_{10} = p_{01} = 0$.
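A numerical illustration of the lemma (a sketch, assuming numpy; the pair $(p, q) = (3, 3/2)$ and the choice of distributions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(5)
x = rng.normal(size=100_000)
y = rng.exponential(size=100_000)   # independent of x here, though Hoelder
p, q = 3.0, 1.5                     # needs no independence; 1/p + 1/q = 1
lhs = np.mean(np.abs(x * y))
rhs = np.mean(np.abs(x)**p)**(1/p) * np.mean(np.abs(y)**q)**(1/q)
print(lhs, rhs)                     # lhs <= rhs
```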
Cauchy-Schwarz inequality 21-16

Suppose $E[|X|] > 0$ and $E[|Y|] > 0$.

Lemma (Hölder's inequality) For $p > 1$, $q > 1$ and $1/p + 1/q = 1$,

  $E[|XY|] \le E^{1/p}[|X|^p] \, E^{1/q}[|Y|^q].$

Equality holds if, and only if, there exists a constant $a$ such that $\Pr[|X|^p = a |Y|^q] = 1$.

Lemma (Cauchy-Schwarz inequality)

  $E[|XY|] \le E^{1/2}[X^2] \, E^{1/2}[Y^2].$

Equality holds if, and only if, there exists a constant $a$ such that $\Pr[X^2 = a Y^2] = 1$.

Proof: This is a special case of Hölder's inequality with $p = q = 2$. Equality holds if, and only if, $\Pr[X^2 = a Y^2] = 1$ for some $a$.
Lyapounov's inequality 21-17

Lemma (Lyapounov's inequality) For $0 < \alpha < \beta$,

  $E^{1/\alpha}[|Z|^\alpha] \le E^{1/\beta}[|Z|^\beta].$

Equality holds if, and only if, $\Pr[|Z| = a] = 1$ for some constant $a$.

Proof: Letting $X = |Z|^\alpha$, $Y = 1$, $p = \beta/\alpha$ and $q = \beta/(\beta - \alpha)$ in Hölder's inequality yields

  $E[|Z|^\alpha] \le E^{\alpha/\beta}\left[(|Z|^\alpha)^{\beta/\alpha}\right] E^{(\beta-\alpha)/\beta}\left[1^{\beta/(\beta-\alpha)}\right] = E^{\alpha/\beta}[|Z|^\beta].$

Equality holds if, and only if,

  $\Pr\left[(|Z|^\alpha)^{\beta/\alpha} = a\right] = \Pr\left[|Z|^\beta = a\right] = \Pr\left[|Z| = a^{1/\beta}\right] = 1$

for some $a$.

Notably, in the statement of the lemma, $\beta$ is strictly larger than $\alpha$. If $\alpha = \beta$, the inequality automatically becomes an equality.
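Equivalently, the $L^\alpha$ norm $E^{1/\alpha}[|Z|^\alpha]$ is non-decreasing in $\alpha$, which is easy to check (a minimal sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(6)
z = rng.exponential(size=100_000)
for a in (0.5, 1.0, 2.0, 3.0):
    print(a, np.mean(np.abs(z)**a)**(1/a))   # second column increases with a
```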
Joint integrals 21-18

Lemma For a $B^k/B$-measurable (or simply $B^k$-measurable) function $g$ and a $k$-dimensional random vector $X$,

  $E[g(X)] = \int_{R^k} g(x^k) \, dF_X(x^k),$

if one of $E[(g(X))^+]$ and $E[(g(X))^-]$ is finite.

Definition (covariance) The covariance of two random vectors $X$ and $Y$ is

  $Cov[X, Y] = E\left[(X - E[X])(Y - E[Y])^T\right]$
  $= E \begin{bmatrix} (X_1 - E[X_1])(Y_1 - E[Y_1]) & \cdots & (X_1 - E[X_1])(Y_l - E[Y_l]) \\ \vdots & \ddots & \vdots \\ (X_k - E[X_k])(Y_1 - E[Y_1]) & \cdots & (X_k - E[X_k])(Y_l - E[Y_l]) \end{bmatrix}$, a $k \times l$ matrix,

where $T$ represents the vector transpose operation, provided that one of $E[((X_i - E[X_i])(Y_j - E[Y_j]))^+]$ and $E[((X_i - E[X_i])(Y_j - E[Y_j]))^-]$ is finite for every $i, j$.
Joint integrals 21-19

Definition (uncorrelated) $X$ and $Y$ are uncorrelated if $Cov[X, Y] = 0_{k \times l}$.

Definition (independence) $X$ and $Y$ are independent if

  $\Pr\left[(X_1 \le x_1, \ldots, X_k \le x_k) \text{ and } (Y_1 \le y_1, \ldots, Y_l \le y_l)\right] = \Pr\left[X_1 \le x_1, \ldots, X_k \le x_k\right] \Pr\left[Y_1 \le y_1, \ldots, Y_l \le y_l\right].$

Lemma (integrability of product) For independent $X_1, X_2, \ldots, X_k$, if each of the $X_i$ is integrable, so is $X_1 X_2 \cdots X_k$, and

  $E[X_1 X_2 \cdots X_k] = E[X_1] E[X_2] \cdots E[X_k].$

Lemma (sum of variances for pairwise independent samples) If $X_1, X_2, \ldots, X_k$ are pairwise independent and square-integrable, then

  $Var[X_1 + X_2 + \cdots + X_k] = Var[X_1] + Var[X_2] + \cdots + Var[X_k].$

Notably, pairwise independence does not imply complete independence.
Joint integrals 21-20

Example (only pairwise independence) Toss a fair coin twice, and assume the tosses are independent. Define

  $X = 1$ if a head appears on the first toss, and $0$ otherwise;
  $Y = 1$ if a head appears on the second toss, and $0$ otherwise;
  $Z = 1$ if exactly one head and one tail appear in the two tosses, and $0$ otherwise.

Then $\Pr[X = 1, Y = 1, Z = 1] = 0$, but

  $\Pr[X = 1] \Pr[Y = 1] \Pr[Z = 1] = \frac{1}{2} \cdot \frac{1}{2} \cdot \frac{1}{2} = \frac{1}{8}.$

So $X$, $Y$ and $Z$ are not independent. Obviously, $X \perp Y$. In addition,

  $\Pr[X = 1 \mid Z = 1] = \frac{\Pr[X = 1, Z = 1]}{\Pr[Z = 1]} = \frac{1/4}{1/2} = \frac{1}{2} = \Pr[X = 1],$
  $\Pr[X = 1 \mid Z = 0] = \frac{\Pr[X = 1, Z = 0]}{\Pr[Z = 0]} = \frac{1/4}{1/2} = \frac{1}{2} = \Pr[X = 1],$

so $X \perp Z$. One can similarly show (or argue by symmetry) that $Y \perp Z$.
Joint integrals 21-21

Example (con't) The four equally likely outcomes give:

  head, head: $X + Y + Z = 2$
  head, tail: $X + Y + Z = 2$
  tail, head: $X + Y + Z = 2$
  tail, tail: $X + Y + Z = 0$

Hence

  $Var[X + Y + Z] = \frac{3}{4}(2 - 3/2)^2 + \frac{1}{4}(0 - 3/2)^2 = \frac{3}{4}.$

This is equal to

  $Var[X] + Var[Y] + Var[Z] = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4}.$

This result confirms that "the variance of the sum equals the sum of the variances" holds for pairwise independent random variables.
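The whole two-coin example can be simulated (a minimal sketch, assuming numpy):

```python
import numpy as np

rng = np.random.default_rng(7)
n = 1_000_000
x = rng.integers(0, 2, size=n)           # 1 = head on the first toss
y = rng.integers(0, 2, size=n)           # 1 = head on the second toss
z = (x != y).astype(int)                 # 1 = exactly one head and one tail

print(np.mean((x == 1) & (y == 1) & (z == 1)))         # 0, not 1/8: not independent
print((x + y + z).var(), x.var() + y.var() + z.var())  # both ~ 3/4
```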
Moment generating function 21-22

Definition (moment generating function) The moment generating function of $X$ is defined as

  $M_X(t) = E[e^{tX}] = \int_{-\infty}^\infty e^{tx} \, dF_X(x),$

for all $t$ for which this is finite.

If $M_X(t)$ is defined (i.e., finite) throughout an interval $(-t_0, t_0)$, where $t_0 > 0$, then

  $M_X(t) = \sum_{k=0}^\infty \frac{t^k}{k!} E[X^k].$

In other words, $M_X(t)$ has a Taylor expansion about $t = 0$ with positive radius of convergence if it is defined in some $(-t_0, t_0)$.

In case $M_X(t)$ has a Taylor expansion about $t = 0$ with positive radius of convergence, the moments of $X$ can be computed from the derivatives of $M_X(t)$ through

  $M_X^{(k)}(0) = E[X^k].$

If $M_{X_i}(t)$ is defined throughout an interval $(-t_0, t_0)$ for each $i$, and $X_1, X_2, \ldots, X_n$ are independent, then the moment generating function of $X_1 + \cdots + X_n$ is also defined on $(-t_0, t_0)$, and is equal to $\prod_{i=1}^n M_{X_i}(t)$.
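The relation $M_X^{(k)}(0) = E[X^k]$ can be checked numerically (a sketch, assuming numpy; for a standard normal, $M(t) = e^{t^2/2}$ in closed form, and finite differences approximate the derivatives at 0):

```python
import numpy as np

M = lambda t: np.exp(t**2 / 2)             # mgf of a standard normal
h = 1e-2                                   # finite-difference step

d1 = (M(h) - M(-h)) / (2 * h)              # ~ M'(0)  = E[X]   = 0
d2 = (M(h) - 2 * M(0) + M(-h)) / h**2      # ~ M''(0) = E[X^2] = 1
print(d1, d2)
```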
Moment generating function 21-23

Example (a random variable whose moment generating function is defined only at zero) The pdf of the Cauchy distribution is

  $f(x) = \frac{1}{\pi(1 + x^2)}.$

For $t \ne 0$, the integral

  $\int_{-\infty}^\infty e^{tx} \frac{1}{\pi(1 + x^2)} \, dx = \int_0^\infty \left(e^{tx} + e^{-tx}\right) \frac{1}{\pi(1 + x^2)} \, dx$
  $\ge \int_0^\infty e^{|t|x} \frac{1}{\pi(1 + x^2)} \, dx$
  $\ge \int_0^\infty \frac{|t| x}{\pi(1 + x^2)} \, dx$  (by $e^x \ge 1 + x \ge x$ for $x \ge 0$)
  $\ge \int_1^\infty \frac{|t| x}{\pi(1 + x^2)} \, dx \ge \int_1^\infty \frac{|t| x}{\pi(x^2 + x^2)} \, dx = \frac{|t|}{2\pi} \int_1^\infty \frac{1}{x} \, dx = \infty.$

So the moment generating function is not defined for any $t \ne 0$.

The Cauchy distribution is indeed the Student's t-distribution with 1 degree of freedom.
Moment generating function 21-24

Example (Student's t-distribution with n degrees of freedom) This distribution has moments of order $\le n - 1$, but any higher moments either do not exist or are infinite. Some books consider infinite moments as non-existent, so they write: "This distribution has moments of order $\le n - 1$, but no higher moments exist."

Let $X, Y_1, Y_2, \ldots, Y_n$ be i.i.d. with standard normal marginal distribution. Then

  $T_n = \frac{X \sqrt{n}}{\sqrt{Y_1^2 + \cdots + Y_n^2}}$

is said to have the Student's t-distribution (on $R$) with $n$ degrees of freedom.

The numerator $X\sqrt{n}$ has a normal density with mean 0 and variance $n$. $\chi^2 = Y_1^2 + \cdots + Y_n^2$ has a chi-square distribution with $n$ degrees of freedom, with density $f_{1/2, n/2}(y)$, where

  $f_{\alpha,\nu}(x) = \frac{1}{\Gamma(\nu)} \alpha^\nu x^{\nu-1} e^{-\alpha x}$ on $[0, \infty)$

is the gamma density (sometimes named the Erlangian density when $\nu$ is a positive integer) with parameters $\nu > 0$ and $\alpha > 0$, and $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x} \, dx$ is the gamma function.
Moment generating function 21-25

(Closure under convolution for the gamma density) $f_{\alpha,\mu} * f_{\alpha,\nu} = f_{\alpha,\mu+\nu}$.

Proof:

  $(f_{\alpha,\mu} * f_{\alpha,\nu})(x) = \int_0^\infty f_{\alpha,\mu}(x - y) f_{\alpha,\nu}(y) \, dy$
  $= \frac{\alpha^{\mu+\nu}}{\Gamma(\mu)\Gamma(\nu)} \int_0^x (x - y)^{\mu-1} e^{-\alpha(x-y)} y^{\nu-1} e^{-\alpha y} \, dy$
  $= \frac{\alpha^{\mu+\nu}}{\Gamma(\mu)\Gamma(\nu)} e^{-\alpha x} \int_0^x (x - y)^{\mu-1} y^{\nu-1} \, dy$
  $= \frac{\alpha^{\mu+\nu}}{\Gamma(\mu)\Gamma(\nu)} e^{-\alpha x} \int_0^1 (x - xt)^{\mu-1} (xt)^{\nu-1} x \, dt$  (by $y = xt$)
  $= \frac{1}{\Gamma(\mu+\nu)} \alpha^{\mu+\nu} x^{\mu+\nu-1} e^{-\alpha x} \cdot \frac{\Gamma(\mu+\nu)}{\Gamma(\mu)\Gamma(\nu)} \int_0^1 (1 - t)^{\mu-1} t^{\nu-1} \, dt$
  $= f_{\alpha,\mu+\nu}(x),$

where the fact that the last factor equals one follows from the beta integral: for $\mu > 0$ and $\nu > 0$,

  $B(\mu, \nu) = \int_0^1 (1 - t)^{\mu-1} t^{\nu-1} \, dt = \frac{\Gamma(\mu)\Gamma(\nu)}{\Gamma(\mu+\nu)}$

is the so-called beta integral.
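The closure property can be verified by simulation (a sketch, assuming numpy and scipy; note that scipy's gamma is parameterized by shape $\nu$ and scale $1/\alpha$):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
alpha, mu, nu = 2.0, 1.5, 0.7
z = (rng.gamma(shape=mu, scale=1/alpha, size=100_000)
     + rng.gamma(shape=nu, scale=1/alpha, size=100_000))

# Kolmogorov-Smirnov test against the claimed f_{alpha, mu+nu} law:
print(stats.kstest(z, stats.gamma(a=mu + nu, scale=1/alpha).cdf))
```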
Moment generating function 21-26

That the pdf of the chi-square distribution with 1 degree of freedom equals $f_{1/2,1/2}(y)$ can be derived as follows:

  $\Pr[Y_1^2 \le y] = \Pr[-\sqrt{y} \le Y_1 \le \sqrt{y}] = \Phi(\sqrt{y}) - \Phi(-\sqrt{y}),$

where $\Phi(\cdot)$ represents the unit Gaussian cdf. So the derivative of the above expression is

  $y^{-1/2} \phi(y^{1/2}) = \frac{1}{\Gamma(1/2)} (1/2)^{1/2} y^{(1/2)-1} e^{-(1/2)y}$ for $y \ge 0$.

Then, by closure under convolution for gamma densities, the chi-square distribution with $n$ degrees of freedom can be obtained.

In statistical mechanics, $Y_1^2 + Y_2^2 + Y_3^2$ appears as the square of the speed of a particle. So the density of the particle speed (on $[0, \infty)$) is equal to

  $v(x) = 2x f_{1/2,3/2}(x^2),$

which is called the Maxwell density.

By letting $\bar{Y} = \sqrt{Y_1^2 + \cdots + Y_n^2}$,

  $\Pr[T_n \le t] = \Pr\left[X\sqrt{n}/\bar{Y} \le t\right] = \int_0^\infty \Pr\left[X \le yt/\sqrt{n}\right] dF_{\bar{Y}}(y) = \int_0^\infty \Phi(yt/\sqrt{n}) \left(2y f_{1/2,n/2}(y^2)\right) dy.$
Moment generating function 21-27

So the density of $T_n$ is

  $f_{T_n}(t) = \int_0^\infty \frac{y}{\sqrt{n}} \, \phi(yt/\sqrt{n}) \left(2y f_{1/2,n/2}(y^2)\right) dy$
  $= \int_0^\infty \left(\frac{y}{\sqrt{2\pi n}} e^{-y^2 t^2/(2n)}\right) \left(\frac{1}{2^{n/2-1}\Gamma(n/2)} y^{n-1} e^{-y^2/2}\right) dy$
  $= \frac{1}{2^{(n-1)/2}\Gamma(n/2)\sqrt{\pi n}} \int_0^\infty y^n e^{-y^2(1 + t^2/n)/2} \, dy$
  $= \frac{1}{2^{(n-1)/2}\Gamma(n/2)\sqrt{\pi n}} \int_0^\infty \frac{2^{n/2} s^{n/2} \, 2^{-1/2} s^{-1/2}}{(1 + t^2/n)^{(n+1)/2}} e^{-s} \, ds$  (by $y = \sqrt{2s}/(1 + t^2/n)^{1/2}$)
  $= \frac{1}{\Gamma(n/2)\sqrt{\pi n}\,(1 + t^2/n)^{(n+1)/2}} \int_0^\infty s^{(n+1)/2 - 1} e^{-s} \, ds$
  $= \frac{C_n}{(1 + t^2/n)^{(n+1)/2}},$ where $C_n = \frac{\Gamma((n+1)/2)}{\Gamma(n/2)\sqrt{\pi n}}.$
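The derived closed form agrees with standard implementations (a sketch, assuming numpy and scipy):

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as G

n = 5
C = G((n + 1) / 2) / (G(n / 2) * np.sqrt(np.pi * n))
t = np.linspace(-4.0, 4.0, 9)
ours = C * (1 + t**2 / n) ** (-(n + 1) / 2)
print(np.max(np.abs(ours - stats.t(df=n).pdf(t))))   # ~1e-16
```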
Moment generating function 21-28

For $t \ne 0$, the integral

  $\int_{-\infty}^\infty e^{tx} \frac{C_n}{(1 + x^2/n)^{(n+1)/2}} \, dx \ge C_n \int_0^\infty e^{|t|x} \frac{1}{(1 + x^2/n)^{(n+1)/2}} \, dx$
  $\ge \frac{C_n |t|^n}{n!} \int_0^\infty \frac{x^n}{(1 + x^2/n)^{(n+1)/2}} \, dx$  (by $e^x \ge 1 + x + \frac{x^2}{2} + \cdots + \frac{x^n}{n!} \ge \frac{x^n}{n!}$ for $x \ge 0$)
  $\ge \frac{C_n |t|^n}{n!} \int_{\sqrt{n}}^\infty \frac{x^n}{(x^2/n + x^2/n)^{(n+1)/2}} \, dx = \frac{C_n |t|^n n^{(n+1)/2}}{n! \, 2^{(n+1)/2}} \int_{\sqrt{n}}^\infty \frac{1}{x} \, dx = \infty.$

So the moment generating function is not defined for any $t \ne 0$.
Moment generating function 21-29

However, for $r < n$,

  $E[|T_n|^r] = \int_{-\infty}^\infty |t|^r \frac{C_n}{(1 + t^2/n)^{(n+1)/2}} \, dt = 2 C_n \int_0^\infty \frac{t^r}{(1 + t^2/n)^{(n+1)/2}} \, dt$
  $= 2 C_n \frac{n^{(r+1)/2}\,\Gamma((n-r)/2)\,\Gamma((r+1)/2)}{2\,\Gamma((n+1)/2)} = \frac{n^{r/2}\,\Gamma((n-r)/2)\,\Gamma((r+1)/2)}{\Gamma(n/2)\sqrt{\pi}} < \infty.$

But

  $E[(T_n^n)^+] = \begin{cases} \int_{-\infty}^\infty C_n \dfrac{|t|^n}{(1 + t^2/n)^{(n+1)/2}} \, dt, & \text{if } n \text{ even}; \\[1ex] \int_0^\infty C_n \dfrac{t^n}{(1 + t^2/n)^{(n+1)/2}} \, dt, & \text{if } n \text{ odd}, \end{cases} \;=\; \infty,$

and

  $E[(T_n^n)^-] = \begin{cases} 0, & \text{if } n \text{ even}; \\[1ex] \int_0^\infty C_n \dfrac{t^n}{(1 + t^2/n)^{(n+1)/2}} \, dt = \infty, & \text{if } n \text{ odd}. \end{cases}$

This is a good example in which the moments, even if some of them exist (and are finite), cannot be obtained through the moment generating function.
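The closed form for $E[|T_n|^r]$, $r < n$, can be confirmed numerically (a sketch, assuming numpy and scipy; the values $n = 5$, $r = 3$ are an arbitrary illustrative choice):

```python
import numpy as np
from scipy import integrate, stats
from scipy.special import gamma as G

n, r = 5, 3
num, _ = integrate.quad(lambda t: abs(t)**r * stats.t(df=n).pdf(t),
                        -np.inf, np.inf)
closed = n**(r/2) * G((n - r)/2) * G((r + 1)/2) / (G(n/2) * np.sqrt(np.pi))
print(num, closed)   # agree (~4.745); for r >= n the integral diverges
```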
Some seldom-known densities 21-30

Gamma distribution: $f_{\alpha,\nu}(x)$ (on $[0, \infty)$) has mean $\nu/\alpha$ and variance $\nu/\alpha^2$.

Snedecor's distribution, or F-distribution: it is the distribution of

  $Z = \frac{(X_1^2 + \cdots + X_m^2)/m}{(Y_1^2 + \cdots + Y_n^2)/n},$

where the $X_i$ and $Y_j$ are all independent standard normal. Its pdf with positive integer parameters $m$ and $n$ is

  $\frac{m^{m/2}}{n^{m/2}} \frac{\Gamma((m+n)/2)}{\Gamma(m/2)\Gamma(n/2)} \frac{z^{m/2-1}}{(1 + mz/n)^{(m+n)/2}}$ on $z \ge 0$.

Bilateral exponential distribution: it is the distribution of $Z = X_1 - X_2$, where $X_1$ and $X_2$ are independent and have the common exponential density $\alpha e^{-\alpha x}$ on $x \ge 0$. Its pdf is $\frac{1}{2}\alpha e^{-\alpha |x|}$ on $R$. It has zero mean and variance $2/\alpha^2$.
Some seldom-known densities 21-31

Beta distribution: its pdf with parameters $\mu > 0$ and $\nu > 0$ is

  $\beta_{\mu,\nu}(x) = \frac{\Gamma(\mu+\nu)}{\Gamma(\mu)\Gamma(\nu)} (1 - x)^{\mu-1} x^{\nu-1}$ on $0 < x < 1$.

Its mean is $\nu/(\mu+\nu)$, and its variance is $\mu\nu/[(\mu+\nu)^2(\mu+\nu+1)]$.

Arc sine distribution: its pdf is $\beta_{1/2,1/2}(x) = \frac{1}{\pi\sqrt{x(1-x)}}$ on $0 < x < 1$. Its cdf is given by $\frac{2}{\pi}\sin^{-1}(\sqrt{x})$ for $0 < x < 1$.

Generalized arc sine distribution: it is a beta distribution with $\mu + \nu = 1$.

Pareto distribution: it is the distribution of $Z = X^{-1} - 1$, where $X$ is beta distributed with parameters $\mu > 0$ and $\nu > 0$. Its density is

  $\frac{\Gamma(\mu+\nu)}{\Gamma(\mu)\Gamma(\nu)} \frac{z^{\mu-1}}{(1 + z)^{\mu+\nu}}$ on $0 < z < \infty$.

This is often used to model incoming traffic with a heavy tail, since the density decays as $z^{-(\nu+1)}$.
Some seldom-known densities 21-32 Cauchy distribution : It is the distribution of Z = αx/ Y, wherex and Y are independent standard normal distributed. Its pdf with parameter α>0is Its cdf is 1 2 + 1 π tan 1 (x/α). γ α (x) = 1 π α x 2 + α 2 on R. It is also closure under convolution, i.e., γ s γ u = γ s+u. It is interesting that Cauchy distribution is also closure under scaling, i.e., a X has density γ aα ( ), if X has density γ α ( ). Hence, we can easily obtain the density of a 1 X 1 + a 2 X 2 + + a n X n as γ a1 α 1 +a 2 α 2 + +a n α n ( ), if X i has density γ αi ( ).
Some seldom-known densities 21-33

One-sided stable distribution of index 1/2: it is the distribution of $Z = \alpha^2 / X^2$, where $X$ has a standard normal distribution. Its pdf is

  $f_\alpha(z) = \frac{\alpha}{\sqrt{2\pi z^3}} \, e^{-\frac{1}{2}\alpha^2/z}$ on $z \ge 0$.

It is also closed under convolution, namely, $f_\alpha * f_\beta = f_{\alpha+\beta}$. If $Z_1, Z_2, \ldots, Z_n$ are i.i.d. with marginal density $f_\alpha(\cdot)$, then $(Z_1 + \cdots + Z_n)/n^2$ also has density $f_\alpha(\cdot)$.

Weibull distribution: its pdf and cdf are respectively given by

  $\frac{\alpha}{\beta}\left(\frac{y}{\beta}\right)^{\alpha-1} e^{-(y/\beta)^\alpha}$ and $1 - e^{-(y/\beta)^\alpha}$,

with $\alpha > 0$, $\beta > 0$ and support $(0, \infty)$. By defining $X = 1/Y$, where $Y$ has the above distribution, we derive the pdf and cdf of $X$ as, respectively,

  $\alpha\beta(\beta x)^{-\alpha-1} e^{-1/(\beta x)^\alpha}$ and $e^{-1/(\beta x)^\alpha}$.

This is useful for order statistics.
Some seldom-known densities 21-34 For example, [ let X (n) ] denote the largest one among i.i.d. X 1,X 2,...,X n. X(n) Then Pr n x = e (α/π)(1/x),ifx j is Cauchy distributed with parameter α. [ ] X(n) Or Pr x = e (α 2/ π)(1/x 1/2),ifX j is a one-sided stable distribution n 2 of index 1/2 with parameter α. Logistic distribution : Its cdf with parameters α>0andβ Ris on R. 1 1+e αx β
Distributions with nice moment generating functions 21-35

Gaussian distribution:

  $M(t) = \frac{1}{\sqrt{2\pi\sigma^2}} \int_{-\infty}^\infty e^{tx} e^{-(x-m)^2/(2\sigma^2)} \, dx = e^{mt + \sigma^2 t^2/2},$

which exists for all $t \in R$.

Exponential distribution:

  $M(t) = \int_0^\infty e^{tx} \alpha e^{-\alpha x} \, dx = \frac{\alpha}{\alpha - t} = \sum_{k=0}^\infty \left(\frac{k!}{\alpha^k}\right) \frac{t^k}{k!},$

which is defined for $t < \alpha$. So the $k$th moment is $k!/\alpha^k$.

Poisson distribution:

  $M(t) = \sum_{r=0}^\infty e^{rt} \, e^{-\lambda} \frac{\lambda^r}{r!} = e^{\lambda(e^t - 1)},$

which exists for all $t \in R$.
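These closed forms are easy to cross-check (a minimal sketch, assuming numpy and scipy; shown here for the exponential case):

```python
import numpy as np
from math import factorial
from scipy import integrate

alpha, t = 2.0, 0.5                       # the mgf needs t < alpha
mgf, _ = integrate.quad(lambda x: np.exp(t*x) * alpha * np.exp(-alpha*x),
                        0, np.inf)
print(mgf, alpha / (alpha - t))           # both 4/3

for k in (1, 2, 3):
    mk, _ = integrate.quad(lambda x: x**k * alpha * np.exp(-alpha*x), 0, np.inf)
    print(k, mk, factorial(k) / alpha**k) # kth moment equals k!/alpha^k
```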