Summary of EE226a: MMSE and LLSE
Jean Walrand

Here are the key ideas and results:

I. SUMMARY

The MMSE of X given Y is E[X | Y]. The LLSE of X given Y is L[X | Y] = E(X) + K_{X,Y} K_Y^{-1} (Y - E(Y)) if K_Y is nonsingular. Theorem 3 states the properties of the LLSE. The linear regression approximates the LLSE when the samples are realizations of i.i.d. random pairs (X_m, Y_m).

II. ESTIMATION: FORMULATION

The estimation problem is a generalized version of the Bayesian decision problem in which the set of values of X can be R^n. In applications, one is given a source model f_X and a channel model f_{Y|X} that together define f_{X,Y}. One may have to estimate f_{Y|X} by using a training sequence, i.e., by selecting the values of X and observing the channel outputs. Alternatively, one may be able to observe a sequence of values of (X, Y) and use them to estimate f_{X,Y}.

Definition 1: Estimation Problems
One is given the joint distribution of (X, Y). The estimation problem is to calculate Z = g(Y) to minimize E(c(X, Z)) for a given cost function c(., .). The random variable Z = g(Y) that minimizes E(|X - Z|^2) is the minimum mean squares estimator (MMSE) of X given Y. The random variable Z = AY + b that minimizes E(|X - Z|^2) is called the linear least squares estimator (LLSE) of X given Y; we designate it by Z = L[X | Y].

III. MMSE

Here is the central result about minimum mean squares estimation. You have seen this before, but we recall the proof of this important result.

Theorem 1: MMSE
The MMSE of X given Y is E[X | Y].

Proof: You should recall the definition of E[X | Y]: it is the random variable that has the property

E[(X - E[X | Y]) h_1(Y)] = 0 for all h_1(.),   (1)

or equivalently

E[h_2(Y) (X - E[X | Y])] = 0 for all h_2(.).   (2)

In particular, E(X) = E[E[X | Y]]. The interpretation is that X - E[X | Y] ⊥ h(Y) for all h(.). By Pythagoras, one then expects E[X | Y] to be the function of Y that is closest to X, as illustrated in Figure 1.

Fig. 1. The conditional expectation as a projection: E[X | Y] = g(Y) is the projection of X onto {r(Y) : r(.) a function}.

Formally, let h(Y) be an arbitrary function.
Then E( h(y) 2 ) = E( E[ Y] + E[ Y] h(y) 2 ) = E( E[ Y] 2 ) + 2E((E[ Y] h(y)) T ( E[ Y])) + E( E[ Y] h(y) 2 ) = E( E[ Y] 2 ) + E( E[ Y] h(y) 2 ) E( E[ Y] 2 ). In this derivation, the third identity follows from the fact that the cross-term vanishes in view of (2). Note also that this derivation corresponds to the fact that the triangle {, E[ Y], h(y)} is a right triangle with hypothenuse (, h(y)).
Example 1: You recall that if (X, Y) are jointly Gaussian with K_Y nonsingular, then

E[X | Y] = E(X) + K_{X,Y} K_Y^{-1} (Y - E(Y)).

You also recall what happens when K_Y is singular. The key is to find A so that K_{X,Y} = A K_Y = A Q Λ Q^T. By writing

Q = [Q_1 Q_2], Λ = diag(Λ_1, 0),

where Λ_1 corresponds to the nonzero eigenvalues of K_Y, one finds

K_{X,Y} = A Q Λ Q^T = A [Q_1 Q_2] diag(Λ_1, 0) [Q_1 Q_2]^T = A Q_1 Λ_1 Q_1^T,

so that K_{X,Y} = A K_Y with A = K_{X,Y} Q_1 Λ_1^{-1} Q_1^T.

IV. LLSE

Theorem 2: LLSE
L[X | Y] = E(X) + K_{X,Y} K_Y^{-1} (Y - E(Y)) if K_Y is nonsingular.
L[X | Y] = E(X) + K_{X,Y} Q_1 Λ_1^{-1} Q_1^T (Y - E(Y)) if K_Y is singular.

Proof: Z* = L[X | Y] = AY + b satisfies X - Z* ⊥ BY + d for any B and d, with E(X) = E[L[X | Y]]. If we consider any Z = CY + c, then E((Z* - Z)^T (X - Z*)) = 0, since Z* - Z = AY + b - CY - c = (A - C)Y + (b - c) =: BY + d. It follows that

E(|X - Z|^2) = E(|X - Z* + Z* - Z|^2)
= E(|X - Z*|^2) + 2E((Z* - Z)^T (X - Z*)) + E(|Z* - Z|^2)
= E(|X - Z*|^2) + E(|Z* - Z|^2)
>= E(|X - Z*|^2).

Figure 2 illustrates this calculation. It shows that L[X | Y] is the projection of X onto the set of linear functions of Y. The projection Z* is characterized by the property that X - Z* is orthogonal to all the linear functions BY + d. This property holds if and only if E(X - Z*) = 0 and E((X - Z*) Y^T) = 0, which are equivalent to E(X - Z*) = 0 and cov(X - Z*, Y) = 0. The comparison between L[X | Y] and E[X | Y] is shown in Figure 3.

Fig. 2. The LLSE as a projection: Z* = L[X | Y] is the projection of X onto {r(Y) = CY + d}.

Fig. 3. LLSE vs. MMSE: L[X | Y] is the projection of X onto {CY + d}, while E[X | Y] is the projection onto the larger set {g(Y)}.

The following result indicates some properties of the LLSE. They are easy to prove.

Theorem 3: Properties of LLSE
(a) L[A X_1 + B X_2 | Y] = A L[X_1 | Y] + B L[X_2 | Y].
(b) L[L[X | Y, Z] | Y] = L[X | Y].
(c) If X ⊥ Y (i.e., cov(X, Y) = 0), then L[X | Y] = E(X).
(d) Assume that X and Z are conditionally independent given Y. Then, in general, L[X | Y, Z] ≠ L[X | Y].
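Before proving Theorem 3, note that the singular case of Theorem 2 is easy to check numerically: for a symmetric positive semidefinite K_Y, the matrix Q_1 Λ_1^{-1} Q_1^T is exactly the Moore-Penrose pseudoinverse of K_Y. The following Python sketch uses a made-up rank-deficient observation (the model and tolerances are illustrative, not from the notes):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Made-up example with singular K_Y: the second observation is a
# deterministic multiple of the first, so K_Y is 2x2 of rank 1.
Y1 = rng.uniform(0.0, 1.0, n)
Y = np.vstack([Y1, 2.0 * Y1])                  # shape (2, n)
X = Y1 + rng.normal(0.0, 0.1, n)

K_Y = np.cov(Y)                                # singular 2x2 matrix
K_XY = np.array([np.cov(X, Y[i])[0, 1] for i in range(2)])

# Q_1 Lambda_1^{-1} Q_1^T equals the Moore-Penrose pseudoinverse of K_Y.
A = K_XY @ np.linalg.pinv(K_Y)
Z = X.mean() + A @ (Y - Y.mean(axis=1, keepdims=True))   # L[X | Y]

# Characterizing property: E(X - Z) = 0 and cov(X - Z, Y) = 0.
err = X - Z
print(abs(err.mean()) < 1e-6, abs(np.cov(err, Y1)[0, 1]) < 1e-3)  # True True
```

Using the pseudoinverse rather than `np.linalg.inv` is what makes the computation well defined when K_Y is singular.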
Proof of Theorem 3: (b): It suffices to show that E(L[X | Y, Z] - L[X | Y]) = 0 and E((L[X | Y] - L[X | Y, Z]) Y^T) = 0. We already have E(X - L[X | Y, Z]) = 0 and E(X - L[X | Y]) = 0, which together imply E(L[X | Y, Z] - L[X | Y]) = 0. Moreover, from E((X - L[X | Y, Z]) Y^T) = 0 and E((X - L[X | Y]) Y^T) = 0, one obtains E((L[X | Y] - L[X | Y, Z]) Y^T) = 0.

(d): Here is a trivial example. Assume that X = Z = Y^{1/2}, where X is U[0, 1]. Then L[X | Y, Z] = Z. However, since Y = X^2, Y^2 = X^4, and XY = X^3, we find

L[X | Y] = E(X) + ((E(XY) - E(X)E(Y)) / (E(Y^2) - E(Y)^2)) (Y - E(Y))
= 1/2 + ((1/4 - (1/2)(1/3)) / (1/5 - 1/9)) (Y - 1/3)
= 3/16 + (15/16) Y.

We see that L[X | Y, Z] ≠ L[X | Y] because Z ≠ 3/16 + (15/16) Y = 3/16 + (15/16) Z^2.

Figure 4 illustrates property (b), which is called the smoothing property of the LLSE.

Fig. 4. Smoothing property of the LLSE: L[X | Y] is the projection of L[X | Y, Z] from {AY + BZ + c} onto {CY + d}.

V. EXAMPLES

We illustrate the previous ideas on a few examples.

Example 2: Assume that U, V, W are independent and U[0, 1]. Calculate L[(U + V)^2 | V^2 + W^2].
Solution: Let X = (U + V)^2 and Y = V^2 + W^2. Then

K_{X,Y} = cov(U^2 + 2UV + V^2, V^2 + W^2) = 2 cov(UV, V^2) + cov(V^2, V^2).

Now,

cov(UV, V^2) = E(UV^3) - E(UV)E(V^2) = 1/8 - 1/12 = 1/24,

and

cov(V^2, V^2) = E(V^4) - E(V^2)E(V^2) = 1/5 - 1/9 = 4/45.

Hence,

K_{X,Y} = 1/12 + 4/45 = 31/180.

Also, K_Y = var(V^2) + var(W^2) = 8/45, so that K_{X,Y} K_Y^{-1} = (31/180)/(8/45) = 31/32. With E((U + V)^2) = 1/3 + 1/2 + 1/3 = 7/6 and E(Y) = 2/3, we conclude that

L[(U + V)^2 | V^2 + W^2] = E((U + V)^2) + K_{X,Y} K_Y^{-1} (Y - E(Y)) = 7/6 + (31/32)(Y - 2/3).

Example 3: Let U, V, W be as in the previous example. Calculate L[cos(U + V) | V + W].
Solution: Let X = cos(U + V) and Y = V + W. We find K_{X,Y} = E(XY) - E(X)E(Y). Now,

E(XY) = E(V cos(U + V)) + (1/2) E(X).
Also,

E(V cos(U + V)) = ∫_0^1 ∫_0^1 v cos(u + v) du dv = ∫_0^1 [v sin(u + v)]_{u=0}^{u=1} dv = ∫_0^1 (v sin(v + 1) - v sin(v)) dv.

Moreover, integrating by parts,

∫_0^1 v sin(v + 1) dv = -[v cos(v + 1)]_0^1 + ∫_0^1 cos(v + 1) dv = -cos(2) + [sin(v + 1)]_0^1 = -cos(2) + sin(2) - sin(1),

and

∫_0^1 v sin(v) dv = -[v cos(v)]_0^1 + ∫_0^1 cos(v) dv = -cos(1) + sin(1),

so that

E(V cos(U + V)) = -cos(2) + sin(2) - sin(1) + cos(1) - sin(1).

In addition,

E(X) = ∫_0^1 ∫_0^1 cos(u + v) du dv = ∫_0^1 [sin(u + v)]_{v=0}^{v=1} du = ∫_0^1 (sin(u + 1) - sin(u)) du = [-cos(u + 1)]_0^1 + [cos(u)]_0^1 = -cos(2) + cos(1) + cos(1) - cos(0) = -cos(2) + 2cos(1) - 1,

and

E(Y) = E(V) + E(W) = 1, K_Y = var(V) + var(W) = 1/6.

VI. LINEAR REGRESSION VS. LLSE

We discuss the connection between the familiar linear regression procedure and the LLSE. One is given a set of n pairs of numbers {(x_m, y_m), m = 1, ..., n}, as shown in Figure 5. One draws a line through the points. That line approximates the values y_m by z_m = α x_m + β. The line is chosen to minimize

Σ_m (z_m - y_m)^2 = Σ_m (α x_m + β - y_m)^2.   (3)

That is, the linear regression is the linear approximation that minimizes the sum of the squared errors. Note that there is no probabilistic framework in this procedure. To find the values of α and β that minimize the sum of the squared errors, one differentiates (3) with respect to α and β and sets the derivatives to zero. One finds

Σ_m (α x_m + β - y_m) = 0 and Σ_m x_m (α x_m + β - y_m) = 0.

Solving these equations, we find

α = (A(xy) - A(x)A(y)) / (A(x^2) - A(x)^2) and β = A(y) - α A(x).

In these expressions, we used the following notation:

A(y) := (1/n) Σ_m y_m, A(x) := (1/n) Σ_m x_m, A(xy) := (1/n) Σ_m x_m y_m, and A(x^2) := (1/n) Σ_m x_m^2.

Thus, the point y_m is approximated by

z_m = A(y) + ((A(xy) - A(x)A(y)) / (A(x^2) - A(x)^2)) (x_m - A(x)).

Note that if the pairs (x_m, y_m) are realizations of i.i.d. random variables (X_m, Y_m) distributed like (X, Y) with finite variances, then, as n → ∞, A(x) → E(X), A(x^2) → E(X^2), A(y) → E(Y), and A(xy) → E(XY). Consequently, if n is large, we see that

z_m ≈ E(Y) + (cov(X, Y)/var(X)) (x_m - E(X)) = L[Y | X = x_m].

See [2] for examples.

REFERENCES
[1] R. G. Gallager, Stochastic Processes: A Conceptual Approach, August 2, 2001.
[2] D. Freedman, Statistical Models: Theory and Practice, Cambridge University Press, 2005.
[3] J. Walrand, Lecture Notes on Probability and Random Processes, on-line book, August 2004.
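The closed-form coefficients derived earlier (slope 15/16 and intercept 3/16 in the counterexample of Theorem 3(d), slope 31/32 in Example 2) are easy to cross-check by Monte Carlo simulation. A sketch, with ad hoc sample sizes and tolerances:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 1_000_000

# Counterexample of Theorem 3(d): X uniform on [0, 1], Y = X^2.
X = rng.uniform(0.0, 1.0, n)
Y = X**2
slope = np.cov(X, Y)[0, 1] / np.var(Y)         # ~ 15/16
intercept = X.mean() - slope * Y.mean()        # ~ 3/16

# Example 2: L[(U+V)^2 | V^2 + W^2] has slope K_{X,Y}/K_Y = 31/32.
U, V, W = rng.uniform(0.0, 1.0, (3, n))
slope2 = np.cov((U + V)**2, V**2 + W**2)[0, 1] / np.var(V**2 + W**2)

print(round(slope, 2), round(intercept, 2), round(slope2, 2))
```

The sample estimates agree with 15/16 = 0.9375, 3/16 = 0.1875, and 31/32 = 0.96875 to within Monte Carlo error.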
Fig. 5. Linear regression: the line z_m = α x_m + β drawn through the points (x_m, y_m).
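The regression recipe of Section VI, and its large-n agreement with the LLSE, can be sketched in a few lines of Python (the data-generating model here is made up purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000

# Made-up i.i.d. pairs: X uniform on [0, 1], Y = 2X + 1 + noise, so that
# cov(X, Y)/var(X) = 2 and L[Y | X = x] = 2x + 1.
x = rng.uniform(0.0, 1.0, n)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.3, n)

def A(v):
    """The sample-average operator A(.) of Section VI."""
    return v.mean()

alpha = (A(x * y) - A(x) * A(y)) / (A(x * x) - A(x)**2)
beta = A(y) - alpha * A(x)

# For large n the regression line approaches L[Y | X = x_m]:
# alpha -> cov(X, Y)/var(X) = 2 and beta -> 1.
print(round(alpha, 1), round(beta, 1))  # -> 2.0 1.0
```

Note that `alpha` and `beta` are computed purely from the sample averages A(.), with no probabilistic model; the probabilistic interpretation only enters in the limit n → ∞.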