EXPECTED VALUE of a RV corresponds to the average value one would get for the RV when repeating the experiment, independently, infinitely many times. More accurately, consider a Random Independent Sample (RIS) of n values of X (e.g. 0, 1, 1, 0, 2, ..., 0) and the corresponding Sample Mean X̄ = (Σ_{j=1}^n X_j)/n = Σ_{All i} i·f_i = 0·f_0 + 1·f_1 + 2·f_2 + ..., where the f_i are the observed relative frequencies (of each possible value). We know that, as n (the sample size) increases, each f_i tends to the corresponding Pr(X = i). This implies that, in the same limit, X̄ tends to Σ_{All i} i·Pr(X = i), which we denote E(X) and call the expected (mean) value of X (so, in the end, the expected value is computed from the theoretical probabilities, not by doing the experiment). Note that Σ_{All i} i·Pr(X = i) is a weighted average of all possible values of X by their probabilities. And, as we have two names for it, we also have two alternate notations: E(X) is sometimes denoted µ_x (the Greek "mu").

EXAMPLES:

When X is the number of dots in a single roll of a die, we get E(X) = (1+2+3+4+5+6)/6 = 3.5 (for a symmetric distribution, the mean is at the centre of symmetry; for any other distribution, it is at the "center of mass"). Note that the name "expected value" is somehow misleading: 3.5 is not a value we would ever expect to see in a single roll.

Let Y be the larger of the two numbers when rolling two dice, so that Pr(Y = i) = (2i−1)/36. Then

E(Y) = 1·(1/36) + 2·(3/36) + 3·(5/36) + 4·(7/36) + 5·(9/36) + 6·(11/36) = (1+6+15+28+45+66)/36 = 161/36 = 4.472

Let U have the following (arbitrarily chosen) distribution:

U:     0    1    2    3    4
Prob: 0.3  0.2  0.4   0   0.1

Then E(U) = 1·0.2 + 2·0.4 + 4·0.1 = 0.2 + 0.8 + 0.4 = 1.4.
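The weighted-average formula is easy to check numerically. A minimal sketch (the helper name `expected_value` and the dictionary representation of a distribution are mine, not from the notes):

```python
from fractions import Fraction

def expected_value(dist):
    # weighted average of all possible values by their probabilities
    return sum(v * p for v, p in dist.items())

# one roll of a die: each of 1..6 has probability 1/6
die = {i: Fraction(1, 6) for i in range(1, 7)}

# Y = larger of two dice: Pr(Y = i) = (2i - 1)/36
larger = {i: Fraction(2 * i - 1, 36) for i in range(1, 7)}

# the arbitrarily chosen U
U = {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.0, 4: 0.1}

print(expected_value(die))     # 7/2
print(expected_value(larger))  # 161/36
print(expected_value(U))       # 1.4
```

Using `Fraction` keeps the die answers exact, which matches the hand computation above.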
What happens to the expected value of X when we transform it, i.e. define a new RV by U = X², or 1/(1+X), or g(X) in general?

The main thing to remember is that, in general, the mean does not transform accordingly, i.e. µ_u ≠ g(µ_x): E(X²) ≠ µ_x², etc. This is also true for a transformation of X and Y, i.e. E[h(X,Y)] ≠ h(µ_x, µ_y). But (at least one good news): to find the expected value of g(X), h(X,Y), ..., we can bypass constructing the new distribution (which was a tedious process) and use:

E[g(X)] = Σ_{All i} g(i)·Pr(X = i)

E[h(X,Y)] = Σ_{All i,j} h(i,j)·Pr(X = i ∩ Y = j)

EXAMPLE: Based on U, we define W = |U − 2|³ and compute E(W) = 8·0.3 + 1·0.2 + 0·0.4 + 8·0.1 = 3.4. Clearly, it would have been a mistake to use |1.4 − 2|³ = 0.216 (totally wrong). We can also get the correct answer the long way (just to verify the short answer), by first finding the distribution of W:

W:     8    1    0    8
U:     0    1    2    4
Prob: 0.3  0.2  0.4  0.1

implying

W:     0    1    8
Prob: 0.4  0.2  0.4

Based on this, E(W) = 0.2 + 3.2 = 3.4 (check).

Exception: Linear Transformations
Only for these, we can find the mean of the new RV by simply replacing X by µ_x, thus:

E(aX + c) = a·µ_x + c
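Both routes of the W example can be sketched in a few lines; the "long way" below merges values of U that map to the same value of W, exactly as in the table:

```python
U = {0: 0.3, 1: 0.2, 2: 0.4, 3: 0.0, 4: 0.1}
g = lambda u: abs(u - 2) ** 3          # the transformation W = |U - 2|^3

# short way: E[g(U)] = sum of g(u) * Pr(U = u), no new distribution needed
short = sum(g(u) * p for u, p in U.items())

# long way: first construct the distribution of W, merging duplicate values
W = {}
for u, p in U.items():
    W[g(u)] = W.get(g(u), 0.0) + p
long_way = sum(w * p for w, p in W.items())

print(short, long_way)     # 3.4 both ways
print(abs(1.4 - 2) ** 3)   # 0.216 -- the tempting shortcut, totally wrong
```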
Proof: E(aX + c) = Σ_{All i} (a·i + c)·f_x(i) = a·Σ_{All i} i·f_x(i) + c·Σ_{All i} f_x(i) = a·µ_x + c (since the f_x(i) sum to 1).

EXAMPLE: E(U − 3) = 1.4 − 3 = −1.6.

Expected values related to a bivariate distribution
When a bivariate distribution is given, the easiest way to compute the individual expected values (of X and Y) is through the marginals.

EXAMPLE:

         X = 1    2    3  |
Y = 0:     0.1    0  0.3  | 0.4
Y = 1:     0.3  0.1  0.2  | 0.6
         -----------------
           0.4  0.1  0.5

We compute E(X) = 1·0.4 + 2·0.1 + 3·0.5 = 2.1 and E(Y) = 0·0.4 + 1·0.6 = 0.6. This is how we also deal with an expected value of g₁(X) and/or g₂(Y). Only when the new variable is defined by h(X,Y) do we have to weigh-average over the whole table. For example:

E(X·Y) = 1·0·0.1 + 2·0·0 + 3·0·0.3 + 1·1·0.3 + 2·1·0.1 + 3·1·0.2 = 1.1

More EXAMPLES:

E[(X−1)²] = 0·0.4 + 1·0.1 + 4·0.5 = 2.1

E[1/(1+Y)] = (1/1)·0.4 + (1/2)·0.6 = 0.7

E[(X−1)²/(1+Y)] (multiplying the last two results would be wrong). Here it may help to first build the corresponding table of the (X−1)²/(1+Y) values:

Y = 0:   0    1    4
Y = 1:   0  0.5    2

Answer: 4·0.3 + 0.5·0.1 + 2·0.2 = 1.2 + 0.05 + 0.4 = 1.65.
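A quick way to verify all of these at once is to store the joint table as a dictionary and weigh-average any h(x, y) over it (the representation and the helper `E` are mine):

```python
# joint[(x, y)] = Pr(X = x and Y = y), the bivariate table from the example
joint = {(1, 0): 0.1, (2, 0): 0.0, (3, 0): 0.3,
         (1, 1): 0.3, (2, 1): 0.1, (3, 1): 0.2}

def E(h):
    # weigh-average h(x, y) over the whole table
    return sum(h(x, y) * p for (x, y), p in joint.items())

print(E(lambda x, y: x))                       # 2.1 (same as via the X marginal)
print(E(lambda x, y: y))                       # 0.6
print(E(lambda x, y: x * y))                   # 1.1
print(E(lambda x, y: (x - 1) ** 2 / (1 + y)))  # 1.65, not 2.1 * 0.7
```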
For a Linear Transformation (of two RVs) we get:

E(aX + bY + c) = a·µ_x + b·µ_y + c

Proof: E(aX + bY + c) = Σ_i Σ_j (a·i + b·j + c)·f_xy(i,j) = a·Σ_i i·f_x(i) + b·Σ_j j·f_y(j) + c = a·E(X) + b·E(Y) + c.

Note that X and Y need not be independent!

EXAMPLE: Using the previous bivariate distribution, E(X − 3Y + 4) is simply 2.1 − 3·0.6 + 4 = 4.3.

The previous formula easily extends to any number of RVs (again, not necessarily independent!):

E(a₁X₁ + a₂X₂ + ... + a_kX_k + c) = a₁E(X₁) + a₂E(X₂) + ... + a_kE(X_k) + c

Can Independence help (in other cases of expected value)? Yes, the expected value of a product of RVs equals the product of the individual expected values, but ONLY when these RVs are independent:

E(X·Y) = E(X)·E(Y)

Proof: E(X·Y) = Σ_i Σ_j i·j·f_x(i)·f_y(j) = (Σ_i i·f_x(i))·(Σ_j j·f_y(j)) = E(X)·E(Y).

The statement can actually be made more general: when X and Y are independent,

E[g₁(X)·g₂(Y)] = E[g₁(X)]·E[g₂(Y)]

Moments of a RV
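The product rule is easy to confirm for two independent dice, where the joint probability factorizes into (1/6)·(1/6):

```python
from itertools import product

# two independent dice X and Y
E_X = sum(i * (1 / 6) for i in range(1, 7))                          # 3.5
E_XY = sum(i * j * (1 / 36) for i, j in product(range(1, 7), repeat=2))

print(E_X * E_X, E_XY)   # both equal 12.25
```

For the dependent X and Y of the bivariate example, by contrast, E(X·Y) = 1.1 while E(X)·E(Y) = 2.1·0.6 = 1.26.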
There are two types of moments: simple moments E(X^k) and central moments E[(X − µ_x)^k], where k is an integer.

Special cases: The 0th moment is identically equal to 1. The first simple moment is µ_x (yet another name for it!). The second simple moment is E(X²), etc. The first central moment is identically equal to 0. The second central moment E[(X − µ_x)²] must be ≥ 0 (averaging non-negative quantities cannot result in a negative number). It is of such importance that it gets its own name: the variance of X, denoted Var(X). When doing the computation by hand, it helps to realize that

E[(X − µ_x)²] = E(X²) − µ_x²

As a measure of the spread of the distribution of X values, we take σ_x ≡ √Var(X) and call it the standard deviation of X. For all distributions, the (µ − σ, µ + σ) interval should contain the "bulk" of the distribution, i.e. anywhere from 50 to 90% (in terms of the corresponding histogram).

Finally, skewness is defined as E[(X − µ_x)³]/σ_x³ (it measures to what extent the distribution is non-symmetric, or "skewed"), and kurtosis as E[(X − µ_x)⁴]/σ_x⁴ (it measures the degree of "flatness", 3 being its typical value, higher for peaked, smaller for flat distributions). These are a lot less important than the mean and variance (later, we will understand why).

EXAMPLES:

X is the number of dots when rolling one die: µ_x = 7/2, Var(X) = (1² + 2² + 3² + 4² + 5² + 6²)/6 − (7/2)² = 91/6 − 49/4 = 35/12, implying σ_x = √(35/12) = 1.7078. Note that 3.5 ± 1.708 contains 4/6 = 66.7% of the distribution. Skewness, for a symmetric distribution, is always equal to 0; kurtosis can be
computed from E[(X − µ)⁴] = ((−2.5)⁴ + (−1.5)⁴ + (−0.5)⁴ + 0.5⁴ + 1.5⁴ + 2.5⁴)/6 = 14.729, so kurtosis = 14.729/(35/12)² = 1.731 ("flat").

For

U:     0    1    2    4
Prob: 0.3  0.2  0.4  0.1

µ_u = 1.4, Var(U) = 0²·0.3 + 1²·0.2 + 2²·0.4 + 4²·0.1 − 1.4² = 3.4 − 1.96 = 1.44 and σ_u = √1.44 = 1.2. From E[(U − µ_u)³] = (−1.4)³·0.3 + (−0.4)³·0.2 + 0.6³·0.4 + 2.6³·0.1 = 1.008, the skewness is 1.008/1.2³ = 0.583 (long right tail), and from E[(U − µ_u)⁴] = (−1.4)⁴·0.3 + (−0.4)⁴·0.2 + 0.6⁴·0.4 + 2.6⁴·0.1 = 5.779, the kurtosis equals 5.779/1.2⁴ = 2.787.

When X is transformed to Y = g(X), we already know that there is no general shortcut for computing E(Y). This (even more so) applies to the variance of Y, which also needs to be computed from scratch. But we did manage to simplify the expected value of a linear transformation of X (i.e., of Y = aX + c). Is there any direct conversion of Var(X) into Var(Y) in this (linear) case? The answer is yes, and we can easily derive the corresponding formula:

Var(aX + c) = E[(aX + c)²] − (aµ_x + c)² = E[a²X² + 2acX + c²] − (a²µ_x² + 2acµ_x + c²) = a²E(X²) − a²µ_x² = a²Var(X)

This implies that σ_{aX+c} = |a|·σ_x.

Moments - the bivariate case
Firstly, there will be the individual moments of X (and, separately, Y), which can be established based on the corresponding marginal distribution.

Are there any other (joint) moments? Yes, and again we have the simple moments E(X^k·Y^m) and the central moments E[(X − µ_x)^k·(Y − µ_y)^m]. The most important of
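A single helper can reproduce all of these moment computations; a sketch (the function name and return convention are mine):

```python
def moments(dist):
    # mean, variance, skewness and kurtosis of a discrete distribution
    m = sum(v * p for v, p in dist.items())
    c2, c3, c4 = (sum((v - m) ** k * p for v, p in dist.items()) for k in (2, 3, 4))
    sd = c2 ** 0.5
    return m, c2, c3 / sd ** 3, c4 / sd ** 4

die = {i: 1 / 6 for i in range(1, 7)}
U = {0: 0.3, 1: 0.2, 2: 0.4, 4: 0.1}

print(moments(die))  # (3.5, 35/12 = 2.9166..., 0.0, 1.7314...)  flat
print(moments(U))    # (1.4, 1.44, 0.5833..., 2.7870...)         long right tail
```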
these is the first, first central moment (k = m = 1), called the covariance of X and Y:

Cov(X,Y) ≡ E[(X − µ_x)·(Y − µ_y)] = E(X·Y) − µ_x·µ_y

It is obviously symmetric, i.e. Cov(X,Y) = Cov(Y,X), and it becomes zero when X and Y are independent (but not necessarily the other way round).

Based on Cov(X,Y), one can define the Correlation Coefficient between X and Y by:

ρ_xy = Cov(X,Y)/(σ_x·σ_y)

(Greek letter "rho"). Its value must be always between −1 and 1.

Proof: Obviously,

Var(X − bY) = Var(X) − 2b·Cov(X,Y) + b²·Var(Y) ≥ 0

for any value of b, including b = Cov(X,Y)/Var(Y). This implies that

Var(X) − 2·Cov(X,Y)²/Var(Y) + Cov(X,Y)²/Var(Y) = Var(X) − Cov(X,Y)²/Var(Y) ≥ 0

which, after dividing by Var(X), yields 1 − ρ_xy² ≥ 0, i.e. −1 ≤ ρ_xy ≤ 1.
EXAMPLE: For one of our previous distributions

         X = 1    2    3
Y = 0:     0.1    0  0.3
Y = 1:     0.3  0.1  0.2

we get µ_x = 2.1, µ_y = 0.6, Var(X) = 5.3 − 2.1² = 0.89, Var(Y) = 0.6 − 0.6² = 0.24, Cov(X,Y) = (0.3 + 0.2 + 0.6) − 2.1·0.6 = −0.16 (covariance may be negative), and ρ_xy = −0.16/√(0.89·0.24) = −0.346.

One can also show that

ρ_{aX+c, bY+d} = Cov(aX+c, bY+d)/(σ_{aX+c}·σ_{bY+d}) = a·b·Cov(X,Y)/(|a|·|b|·σ_x·σ_y) = ±ρ_xy

(a linear transformation does not change the magnitude of ρ, but it may change its sign - can you tell when?).

Linear Combination of RVs
Starting with X and Y, let's see whether we can simplify the expression for

Var(aX + bY + c) = E[(aX + bY + c)²] − (aµ_x + bµ_y + c)² = a²E(X²) + b²E(Y²) + 2ab·E(X·Y) − a²µ_x² − b²µ_y² − 2ab·µ_x·µ_y = a²Var(X) + b²Var(Y) + 2ab·Cov(X,Y)

Independence eliminates the last term. This result can be easily extended to a linear combination of any number of random variables:

Var(a₁X₁ + a₂X₂ + ... + a_kX_k + c) = a₁²Var(X₁) + a₂²Var(X₂) + ... + a_k²Var(X_k) + 2a₁a₂Cov(X₁,X₂) + 2a₁a₃Cov(X₁,X₃) + ... + 2a_{k−1}a_kCov(X_{k−1},X_k)

Mutual independence (if present) eliminates the last row of k(k−1)/2 covariances.
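Both the example's numbers and the linear-combination formula can be checked against a direct computation over the joint table (the helper `E` and the choice a = 2, b = −1, c = 5 are mine, for illustration only):

```python
joint = {(1, 0): 0.1, (2, 0): 0.0, (3, 0): 0.3,
         (1, 1): 0.3, (2, 1): 0.1, (3, 1): 0.2}

def E(h):
    return sum(h(x, y) * p for (x, y), p in joint.items())

mx, my = E(lambda x, y: x), E(lambda x, y: y)
var_x = E(lambda x, y: (x - mx) ** 2)     # 0.89
var_y = E(lambda x, y: (y - my) ** 2)     # 0.24
cov = E(lambda x, y: x * y) - mx * my     # -0.16
rho = cov / (var_x * var_y) ** 0.5        # -0.346...

# check Var(aX + bY + c) = a^2 Var(X) + b^2 Var(Y) + 2ab Cov(X,Y)
a, b, c = 2, -1, 5
mean_lin = a * mx + b * my + c
direct = E(lambda x, y: (a * x + b * y + c - mean_lin) ** 2)
formula = a ** 2 * var_x + b ** 2 * var_y + 2 * a * b * cov
print(cov, rho, direct, formula)          # direct and formula agree
```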
And finally a formula for the covariance of one linear combination of RVs against another:

Cov(a₁X₁ + a₂X₂ + ..., b₁Y₁ + b₂Y₂ + ...) = a₁b₁Cov(X₁,Y₁) + a₁b₂Cov(X₁,Y₂) + a₂b₁Cov(X₂,Y₁) + a₂b₂Cov(X₂,Y₂) + ...

(I will call this the distributive law of covariance).

Sample Mean and its distribution
Consider a random independent sample of size n from an arbitrary distribution. We know that the sample mean X̄ = (Σ_{i=1}^n X_i)/n is itself a RV with its own distribution. Regardless of how that distribution looks, its expected value must be the same as the expected value of the distribution from which we sample.

Proof: E(X̄) = (1/n)·E(X₁) + (1/n)·E(X₂) + ... + (1/n)·E(X_n) = (1/n)·µ + (1/n)·µ + ... + (1/n)·µ = µ

That's why X̄ is often used as an estimator of µ, when its value is unknown. Similarly,

Var(X̄) = (1/n)²·Var(X₁) + (1/n)²·Var(X₂) + ... + (1/n)²·Var(X_n) = (1/n)²·σ² + (1/n)²·σ² + ... + (1/n)²·σ² = σ²/n

where σ is the standard deviation of the original distribution. This implies that

σ_X̄ = σ/√n

The standard deviation of X̄ (sometimes called the standard error of X̄) is √n times smaller than that of the original distribution. Note that the standard error tends to zero as the sample size increases.

Now, how about the shape of the X̄-distribution; how does it relate to the shape of the sampled distribution? The surprising answer is: it doesn't. For n bigger than, say, 5, the
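For a small sample we can verify Var(X̄) = σ²/n exactly, by enumerating every equally likely sample instead of simulating; here n = 2 rolls of a die (36 equally likely pairs):

```python
from itertools import product

die = range(1, 7)
mu = sum(die) / 6                            # 3.5
var = sum((i - mu) ** 2 for i in die) / 6    # 35/12

n = 2
# all 36 equally likely samples of size 2, and the sample mean of each
means = [(i + j) / n for i, j in product(die, repeat=n)]
mu_xbar = sum(means) / len(means)
var_xbar = sum((m - mu_xbar) ** 2 for m in means) / len(means)

print(mu_xbar)             # 3.5, the same mean
print(var_xbar, var / n)   # both 35/24, i.e. sigma^2 / n
```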
distribution of X̄ quickly approaches the same regular shape, regardless of how the original distribution looked.

Now, consider a random independent sample of size n from a bivariate distribution, where (X₁,Y₁), (X₂,Y₂), ..., (X_n,Y_n) are the individual pairs of observations. Then

Var(X₁ + X₂ + ... + X_n) = Var(X₁) + Var(X₂) + ... + Var(X_n) = n·Var(X)

and, similarly, Var(Σ_{i=1}^n Y_i) = n·Var(Y). Similarly,

Cov(Σ_{i=1}^n X_i, Σ_{i=1}^n Y_i) = Cov(X₁,Y₁) + Cov(X₂,Y₂) + ... + Cov(X_n,Y_n) = n·Cov(X,Y)

(the cross-covariances between pairs with different indices vanish, due to independence). All this implies that the correlation coefficient between Σ X_i and Σ Y_i equals

n·Cov(X,Y)/√(n·Var(X)·n·Var(Y)) = ρ_xy

(the correlation between a single X and Y pair). The same is true for the corresponding sample means X̄ = (Σ_{i=1}^n X_i)/n and Ȳ = (Σ_{i=1}^n Y_i)/n. Why?

Conditional expected value is, simply put, an expected value computed using the corresponding conditional distribution, e.g.

E(X | Y = 1) = Σ_i i·f_x(i | Y = 1)

etc.

EXAMPLE: Using our old bivariate distribution

         X = 1    2    3
Y = 0:     0.1    0  0.3
Y = 1:     0.3  0.1  0.2

E(X | Y = 1) is constructed based on the corresponding conditional distribution

X | Y = 1:    1    2    3
Prob:       1/2  1/6  1/3
by the usual process: E(X | Y = 1) = 1·(1/2) + 2·(1/6) + 3·(1/3) = 11/6 = 1.83̄ (note that this is different from E(X) = 2.1 calculated previously). Similarly, E(X² | Y = 1) = 1·(1/2) + 4·(1/6) + 9·(1/3) = 25/6 = 4.16̄. These imply that Var(X | Y = 1) = 25/6 − (11/6)² = 0.8056. Also, E(1/X | Y = 1) = (1/1)·(1/2) + (1/2)·(1/6) + (1/3)·(1/3) = 25/36 = 0.694̄.

When the values of a RV are only integers (the case of most of our examples so far), we define its probability generating function (PGF) by

P(z) = Σ_{All i} z^i·Pr(X = i)

where z is a parameter. Differentiating k times and substituting z = 1 yields

d^k P(z)/dz^k |_{z=1} = E[X·(X−1)·...·(X−k+1)]

(the so-called k-th factorial moment). For mutually independent RVs, we also get

P_{X+Y+Z}(z) = P_X(z)·P_Y(z)·P_Z(z)

Furthermore, given P(z), we can recover the distribution by Taylor-expanding P(z): the coefficient of z^i is Pr(X = i).

EXAMPLES:

For the number of flips needed to get the first head (a fair coin), i.e. Pr(X = i) = (1/2)^i for i = 1, 2, 3, ..., we get

P(z) = Σ_{i=1}^∞ z^i·(1/2)^i = (z/2)/(1 − z/2) = z/(2 − z)

(note that this can be also obtained from the MGF by the substitution e^t → z). Then

P′(z) = 2/(2 − z)²  evaluated at z = 1 gives 2
P″(z) = 4/(2 − z)³  evaluated at z = 1 gives 4

giving us the mean of 2 (check) and the variance of 4 + 2 − 2² = 2 (check).
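The factorial-moment property is easy to confirm numerically for this distribution by truncating the series far enough that the tail is negligible (the cutoff N = 200 is my choice):

```python
# Pr(X = i) = (1/2)^i, i = 1, 2, ... : flips until the first head
N = 200
pr = {i: 0.5 ** i for i in range(1, N)}

P1 = sum(pr.values())                              # P(1)   = total probability = 1
dP = sum(i * p for i, p in pr.items())             # P'(1)  = E(X)
d2P = sum(i * (i - 1) * p for i, p in pr.items())  # P''(1) = E[X(X-1)]

variance = d2P + dP - dP ** 2                      # Var(X) from factorial moments
print(P1, dP, d2P, variance)                       # approximately 1, 2, 4, 2
```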
What is the distribution of the total number of dots when rolling 3 dice? Well, the PGF of X₁, X₂ and X₃ (each) is

(z + z² + z³ + z⁴ + z⁵ + z⁶)/6

Since they are independent RVs, the PGF of their sum is simply

((z + z² + z³ + z⁴ + z⁵ + z⁶)/6)³

Expanding (easy in Maple), we get:

(1/216)z³ + (1/72)z⁴ + (1/36)z⁵ + (5/108)z⁶ + (5/72)z⁷ + (7/72)z⁸ + (25/216)z⁹ + (1/8)z¹⁰ + (1/8)z¹¹ + (25/216)z¹² + (7/72)z¹³ + (5/72)z¹⁴ + (5/108)z¹⁵ + (1/36)z¹⁶ + (1/72)z¹⁷ + (1/216)z¹⁸

Note that a RV can have an infinite expected value (not too common, but it can happen).

EXAMPLE: Suppose you flip a coin till the first head appears, and you win $2 if this takes only one flip, $4 if it takes two flips, $8 if it takes 3 flips, etc. (the amount always doubles with each additional flip). What is the expected value of your win? Clearly, you win Y = 2^X dollars, so

E(Y) = 2·(1/2) + 4·(1/4) + 8·(1/8) + ... = 1 + 1 + 1 + ... = ∞
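Multiplying PGFs is the same as convolving distributions, so the three-dice expansion can be reproduced exactly without Maple (the helper `convolve` is mine):

```python
from fractions import Fraction

one_die = {i: Fraction(1, 6) for i in range(1, 7)}

def convolve(d1, d2):
    # distribution of the sum of two independent RVs
    # (equivalent to multiplying their PGFs and collecting powers of z)
    out = {}
    for a, p in d1.items():
        for b, q in d2.items():
            out[a + b] = out.get(a + b, Fraction(0)) + p * q
    return out

three_dice = convolve(convolve(one_die, one_die), one_die)
print(three_dice[3], three_dice[10], three_dice[18])  # 1/216 1/8 1/216

# the doubling game: every term of E(2^X) contributes exactly 1,
# so the partial sums grow without bound
print(sum(2 ** i * 0.5 ** i for i in range(1, 31)))   # 30.0 after 30 terms
```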