
Chapter 5 Two Random Variables

In a practical engineering problem, there is almost always a causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage and current, while others are abstracted from the problem, e.g., the probability of passing a class and the probability of graduating. Whenever we need to handle the relationship between two or more events, we need mathematical tools to describe the probabilistic phenomenon. The objective of this chapter is to present the concepts of joint distributions.

5.1 Joint PMF and Joint PDF

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Definition 1. Let $X$ and $Y$ be two discrete random variables. The joint PMF of $X$ and $Y$ is defined as
$$p_{X,Y}(x, y) = P[X = x \cap Y = y]. \tag{5.1}$$

The interpretation of a joint PMF is that the sample space is now the Cartesian product $\Omega_X \times \Omega_Y$, where $\Omega_X$ is the sample space of $X$, and $\Omega_Y$ is the sample space of $Y$. Pictorially, this means that the sample space of the joint PMF is a two-dimensional plane $(X, Y)$. We stress the importance of this two-dimensional sample space, because every outcome of a pair of random variables is a point in the two-dimensional space. Therefore, $P[X \in A \cap Y \in B]$ for sets $A$ and $B$ can be interpreted as
$$P[X \in A \cap Y \in B] = P[\{(\xi, \zeta) \mid \xi \in X^{-1}(A) \text{ and } \zeta \in Y^{-1}(B)\}]. \tag{5.2}$$
For discrete random variables, the PMF $p_{X,Y}(x, y)$ can be considered as a collection of delta functions in the two-dimensional space.

Example. Let $X$ be a coin flip and $Y$ be a die roll. Find the joint PMF of $X$ and $Y$.

Solution. The joint PMF is
$$p_{X,Y}(x, y) = \frac{1}{12}, \quad x = 0, 1, \quad y = 1, 2, 3, 4, 5, 6.$$
Pictorially, the joint PMF is given by the following table.

         Y = 1   Y = 2   Y = 3   Y = 4   Y = 5   Y = 6
X = 0    1/12    1/12    1/12    1/12    1/12    1/12
X = 1    1/12    1/12    1/12    1/12    1/12    1/12

In this example, we observe that if $X$ and $Y$ are not interacting (formally, we call them independent, which we will discuss later), then the joint PMF is the product of the two individual probabilities.

The continuous version of the joint PMF is called the joint PDF.

Definition 2. Let $X$ and $Y$ be two continuous random variables. The joint PDF of $X$ and $Y$ is a function $f_{X,Y}(x, y)$ that can be integrated to yield a probability:
$$P[a \le X \le b \cap c \le Y \le d] = \int_c^d \int_a^b f_{X,Y}(x, y)\,dx\,dy. \tag{5.3}$$

Like PDFs for single random variables, a joint PDF is a density which can be integrated to obtain a probability. Note also that in this definition, the events $\{a \le X \le b\}$ and $\{c \le Y \le d\}$ are combined using logical AND.

Example. Consider a uniform joint PDF $f_{X,Y}(x, y)$ defined on $[0, 1]^2$, as shown in Figure 5.1. The shaded area corresponds to
$$P[a \le X \le b \cap c \le Y \le d] = \int_c^d \int_a^b f_{X,Y}(x, y)\,dx\,dy = \int_c^d \int_a^b 1\,dx\,dy = (d - c)(b - a).$$
In general, when $f_{X,Y}(x, y)$ is not uniform, we have to integrate $f_{X,Y}(x, y)$ over the specified rectangle.

© Stanley Chan. All Rights Reserved.
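The coin and die example can be reproduced numerically. The sketch below assumes a fair coin and a fair die (the variable names `p_x`, `p_y`, `p_xy` are illustrative, not from the text) and builds the joint PMF as an outer product of the two individual PMFs, which is valid only because the two are assumed not to interact.

```python
import numpy as np

# Individual PMFs: a fair coin X in {0, 1} and a fair die Y in {1, ..., 6}.
p_x = np.full(2, 1 / 2)
p_y = np.full(6, 1 / 6)

# Assumption: X and Y do not interact, so the joint PMF is the outer product.
p_xy = np.outer(p_x, p_y)          # shape (2, 6), every entry equals 1/12

# Probability of the event {X = 1 and Y in {5, 6}}: sum the matching cells.
prob_event = p_xy[1, 4:].sum()

print(p_xy)
print(prob_event)
```

Any event defined on the pair (X, Y) is evaluated the same way: select the cells of the table that belong to the event and add them up.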

Figure 5.1: (a) General $f_{X,Y}(x, y)$. (b) Example. The joint PDF $f_{X,Y}(x, y)$ is a two-dimensional function. Integrating over the rectangle $[a, b] \times [c, d]$ returns the probability $P[a \le X \le b \cap c \le Y \le d]$.

Normalization

The normalization property of a two-dimensional PMF or PDF states that summing or integrating over all outcomes of the sample space yields 1.

Theorem 1. All joint PMFs and joint PDFs satisfy
$$\sum_x \sum_y p_{X,Y}(x, y) = 1 \quad \text{or} \quad \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx\,dy = 1. \tag{5.4}$$

Example. Consider a joint uniform PDF defined on the shaded area $\Omega$, with the PDF given below. Find the constant $c$.
$$f_{X,Y}(x, y) = \begin{cases} c, & \text{if } (x, y) \in \Omega, \\ 0, & \text{otherwise.} \end{cases}$$

Solution. To find the constant $c$, we note that
$$\int\!\!\int f_{X,Y}(x, y)\,dx\,dy = c \int\!\!\int_{\Omega} dx\,dy = 1.$$
The remaining integral is precisely the area of $\Omega$, which we denote by $|\Omega|$. Therefore, $c = 1/|\Omega|$.

Marginal PMF and Marginal PDF

If we sum or integrate with respect to only one random variable, we obtain the PMF or PDF of the other random variable. The resulting PMF or PDF is called the marginal PMF or PDF.

Definition 3. The marginal PMF is defined as
$$p_X(x) = \sum_y p_{X,Y}(x, y) \quad \text{and} \quad p_Y(y) = \sum_x p_{X,Y}(x, y). \tag{5.5}$$

Definition 4. The marginal PDF is defined as
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx. \tag{5.6}$$

Since $f_{X,Y}(x, y)$ is a two-dimensional function, integrating over $y$ from $-\infty$ to $\infty$ projects $f_{X,Y}(x, y)$ onto the $x$-axis. Therefore, the resulting function depends on $x$ only.

Example. Consider the joint PDF $f_{X,Y}(x, y)$ shown in Figure 5.2. Find the marginal PDFs.

Solution. Integrating the joint PDF over $y$ gives $f_X(x)$, and integrating over $x$ gives $f_Y(y)$. Both marginals are piecewise constant: $f_X(x)$ takes one value on each of the intervals $1 < x \le 2$, $2 < x \le 3$, and $3 < x \le 4$, and is $0$ otherwise; $f_Y(y)$ takes one value on each of the intervals $1 < y \le 2$ and $2 < y \le 3$, and is $0$ otherwise. The values are those plotted in Figure 5.2.

Figure 5.2: Example of a joint uniform PDF $f_{X,Y}(x, y)$ and the corresponding marginal PDFs.

Example. Consider a 2D Gaussian PDF as shown in Figure 5.3. The PDF of the joint Gaussian is
$$f_{X,Y}(x, y) = \frac{1}{2\pi\sigma^2} \exp\left\{-\frac{(x - \mu_X)^2 + (y - \mu_Y)^2}{2\sigma^2}\right\}.$$

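Find the marginal PDFs $f_X(x)$ and $f_Y(y)$.

Solution.
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma^2} \exp\left\{-\frac{(x - \mu_X)^2 + (y - \mu_Y)^2}{2\sigma^2}\right\} dy$$
$$= \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu_X)^2}{2\sigma^2}\right\} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y - \mu_Y)^2}{2\sigma^2}\right\} dy = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(x - \mu_X)^2}{2\sigma^2}\right\},$$
because the remaining integral is that of a 1D Gaussian PDF and equals 1. Similarly, we have
$$f_Y(y) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y - \mu_Y)^2}{2\sigma^2}\right\}.$$

The result of this example shows that the marginalization of a 2D Gaussian is a 1D Gaussian along the vertical and the horizontal axes. Thus, we can think of marginalization as a projection.

Figure 5.3: Marginalization is equivalent to projection. A joint PDF shown in this figure can be marginalized onto the $x$ or the $y$ axis.

Independence of Random Variables

Finally, we say that two random variables are independent if the joint PMF or PDF can be factorized as a product of the marginal PMFs or PDFs.

Definition 5. Two random variables $X$ and $Y$ are independent if
$$p_{X,Y}(x, y) = p_X(x)\,p_Y(y) \quad \text{(discrete)}, \quad \text{or} \quad f_{X,Y}(x, y) = f_X(x)\,f_Y(y) \quad \text{(continuous)}.$$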
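The Gaussian marginalization above can be checked numerically. The following is a minimal sketch with hypothetical parameter values (`mu_x`, `mu_y`, `sigma` are chosen only for illustration): it integrates the joint PDF over y with a Riemann sum and compares the result with the 1D Gaussian formula.

```python
import numpy as np

# Hypothetical parameters, chosen only for illustration.
mu_x, mu_y, sigma = 1.0, -0.5, 0.8

def joint_pdf(x, y):
    """Isotropic 2D Gaussian PDF with common variance sigma^2."""
    q = (x - mu_x) ** 2 + (y - mu_y) ** 2
    return np.exp(-q / (2 * sigma ** 2)) / (2 * np.pi * sigma ** 2)

def gauss_1d(t, mu):
    """1D Gaussian PDF with mean mu and variance sigma^2."""
    return np.exp(-(t - mu) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

# Integrate out y with a Riemann sum on a grid that covers +/- 10 sigma.
y = np.linspace(mu_y - 10 * sigma, mu_y + 10 * sigma, 4001)
dy = y[1] - y[0]
x = np.linspace(-2.0, 4.0, 13)
f_x_numeric = joint_pdf(x[:, None], y[None, :]).sum(axis=1) * dy

# Largest deviation from the closed-form marginal.
max_err = np.max(np.abs(f_x_numeric - gauss_1d(x, mu_x)))
print(max_err)
```

The same projection idea applies to any joint PDF sampled on a grid: sum along the axis of the variable being removed and multiply by the grid spacing.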

To see why this definition is consistent with the definition of independence of two events, recall that two events $A$ and $B$ are independent if $P[A \cap B] = P[A]P[B]$. Letting $A = \{X = x\}$ and $B = \{Y = y\}$, we see that if $A$ and $B$ are independent then $P[X = x \cap Y = y] = P[X = x]P[Y = y]$. This is precisely the relationship $p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$.

Independence is an important statistical property. If there are many random variables $X_1, X_2, \ldots, X_N$, the joint PDF $f_{X_1,\ldots,X_N}(x_1, \ldots, x_N)$ is an $N$-dimensional function, which could be computationally intractable. However, if we assume all these random variables are independent, then the joint PDF becomes
$$f_{X_1,\ldots,X_N}(x_1, \ldots, x_N) = \prod_{n=1}^{N} f_{X_n}(x_n),$$
which is often manageable. As a special case of independent random variables, we define the notion of independent and identically distributed (i.i.d.) random variables.

Definition 6 (Independent and Identically Distributed (i.i.d.)). A collection of random variables $X_1, \ldots, X_N$ is called independent and identically distributed (i.i.d.) if
• all $X_1, \ldots, X_N$ are independent;
• all $X_1, \ldots, X_N$ have the same distribution, i.e., $f_{X_1}(x) = \ldots = f_{X_N}(x)$.

If $X_1, \ldots, X_N$ are i.i.d., we have that
$$f_{X_1,\ldots,X_N}(x, \ldots, x) = \prod_{n=1}^{N} f_{X_n}(x) = [f_{X_1}(x)]^N, \tag{5.7}$$
where the particular choice of $X_1$ is unimportant because $f_{X_1}(x) = \ldots = f_{X_N}(x)$.

5.2 Joint CDF

As in Ch. 3 and Ch. 4, we need to understand the cumulative distribution function (CDF) for the multi-variable case.

Definition 7. Let $X$ and $Y$ be two random variables. The joint CDF of $X$ and $Y$ is the function $F_{X,Y}(x, y)$ such that
$$F_{X,Y}(x, y) = P[X \le x \cap Y \le y]. \tag{5.8}$$

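From this definition, we can explicitly write out the probability as follows.

Definition 8. If $X$ and $Y$ are discrete, then
$$F_{X,Y}(x, y) = \sum_{y' \le y} \sum_{x' \le x} p_{X,Y}(x', y'). \tag{5.9}$$
If $X$ and $Y$ are continuous, then
$$F_{X,Y}(x, y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(x', y')\,dx'\,dy'. \tag{5.10}$$

Note that since $F_{X,Y}(x, y)$ is the integration from $-\infty$ to $x$ (and $y$), we have
$$F_{X,Y}(-\infty, y) = \int_{-\infty}^{y} \int_{-\infty}^{-\infty} f_{X,Y}(x', y')\,dx'\,dy' = \int_{-\infty}^{y} 0\,dy' = 0.$$
Similarly, we have $F_{X,Y}(x, -\infty) = 0$ and $F_{X,Y}(-\infty, -\infty) = 0$. The CDF evaluated at $x = \infty$ and $y = \infty$ is
$$F_{X,Y}(\infty, \infty) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x', y')\,dx'\,dy' = 1.$$
If only $x$ or $y$ is at $\infty$, we obtain the marginal CDF.

Proposition. Let $X$ and $Y$ be two random variables. Then the marginal CDFs can be obtained from
$$F_X(x) = F_{X,Y}(x, \infty), \qquad F_Y(y) = F_{X,Y}(\infty, y).$$

To see these results, we note that
$$F_{X,Y}(x, \infty) = \int_{-\infty}^{x} \int_{-\infty}^{\infty} f_{X,Y}(x', y')\,dy'\,dx' = \int_{-\infty}^{x} f_X(x')\,dx' = F_X(x).$$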
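For a discrete pair, the double sum in the definition of the joint CDF is a cumulative sum along both axes of the PMF table. The sketch below reuses the coin and die setting as an assumed example (a 2-by-6 table with all entries 1/12); the last column and last row of the resulting array play the role of $F_{X,Y}(x, \infty)$ and $F_{X,Y}(\infty, y)$.

```python
import numpy as np

# Assumed example: joint PMF of a fair coin (rows) and a fair die (columns).
p_xy = np.full((2, 6), 1 / 12)

# Joint CDF: accumulate over x' <= x (axis 0), then over y' <= y (axis 1).
F = p_xy.cumsum(axis=0).cumsum(axis=1)

# Marginal CDFs are the joint CDF with the other variable at its largest value.
F_x = F[:, -1]    # F_X(x) = F_{X,Y}(x, infinity)
F_y = F[-1, :]    # F_Y(y) = F_{X,Y}(infinity, y)

print(F)
print(F_x, F_y)
```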

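By the fundamental theorem of calculus, we can derive the PDF from the CDF.

Definition 9. Let $F_{X,Y}(x, y)$ be the joint CDF of $X$ and $Y$. Then, the joint PDF can be obtained through
$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial y\,\partial x} F_{X,Y}(x, y).$$
The order of the partial derivatives can be switched, yielding a symmetric result:
$$f_{X,Y}(x, y) = \frac{\partial^2}{\partial x\,\partial y} F_{X,Y}(x, y).$$

5.3 Conditional PMF and PDF

Conditional PMF

Definition 10. Let $X$ and $Y$ be two discrete random variables. The conditional PMF of $X$ given $Y$ is
$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)}. \tag{5.11}$$

By the definition of conditional probability, we can also define $p_{X|Y}(x|y) = P[X = x \mid Y = y]$, because
$$p_{X|Y}(x|y) = \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{P[X = x \cap Y = y]}{P[Y = y]} = P[X = x \mid Y = y].$$

It is important to understand the randomness exhibited in a conditional PMF. In $p_{X|Y}(x|y)$, the random variable $Y$ is fixed to a specific value $Y = y$. The randomness of $Y$ has been taken care of by the denominator $p_Y(y)$ in Equation (5.11). Therefore, there is no randomness associated with $Y$. The variable $x$ in $p_{X|Y}(x|y)$ describes the randomness. In particular, we have that
$$\sum_x p_{X|Y}(x|y) = \sum_x \frac{p_{X,Y}(x, y)}{p_Y(y)} = \frac{\sum_x p_{X,Y}(x, y)}{p_Y(y)} = \frac{p_Y(y)}{p_Y(y)} = 1,$$
but
$$\sum_{y'} p_{X|Y}(x|y') = \sum_{y'} \frac{p_{X,Y}(x, y')}{p_Y(y')},$$
which is not equal to 1 in general. Therefore, $p_{X|Y}(x|y)$ is a probability of $X$, not $Y$.

Unlike a marginal PMF, which is a function of either $x$ or $y$, e.g., $p_X(x)$ or $p_Y(y)$, a conditional PMF can be a function of both $x$ and $y$. For example, $p_{X|Y}(x|y)$ is the conditional probability of having the random variable $X = x$, given that $Y$ is at a fixed value $y$. Thus $p_{X|Y}(x|y)$ depends on both $x$ and $y$.

Example. Consider a joint PMF given in the following table. Find the conditional PMF $p_{X|Y}(x|1)$ and the marginal PMF $p_X(x)$.

         Y = 1   Y = 2   Y = 3   Y = 4
X = 1    1/20    1/20    1/20    0
X = 2    1/20    2/20    2/20    1/20
X = 3    1/20    3/20    3/20    1/20
X = 4    0       1/20    1/20    1/20

Solution. To find the marginal PMF, we need to sum over all the $y$ for every $x$. Therefore,
$$x = 1: \quad p_X(1) = \sum_{y=1}^{4} p_{X,Y}(1, y) = \tfrac{1}{20} + \tfrac{1}{20} + \tfrac{1}{20} + \tfrac{0}{20} = \tfrac{3}{20},$$
$$x = 2: \quad p_X(2) = \sum_{y=1}^{4} p_{X,Y}(2, y) = \tfrac{1}{20} + \tfrac{2}{20} + \tfrac{2}{20} + \tfrac{1}{20} = \tfrac{6}{20},$$
$$x = 3: \quad p_X(3) = \sum_{y=1}^{4} p_{X,Y}(3, y) = \tfrac{1}{20} + \tfrac{3}{20} + \tfrac{3}{20} + \tfrac{1}{20} = \tfrac{8}{20},$$
$$x = 4: \quad p_X(4) = \sum_{y=1}^{4} p_{X,Y}(4, y) = \tfrac{0}{20} + \tfrac{1}{20} + \tfrac{1}{20} + \tfrac{1}{20} = \tfrac{3}{20}.$$
Hence, the marginal PMF is
$$p_X(x) = \begin{bmatrix} \tfrac{3}{20} & \tfrac{6}{20} & \tfrac{8}{20} & \tfrac{3}{20} \end{bmatrix}.$$
The conditional PMF $p_{X|Y}(x|1)$ is
$$p_{X|Y}(x|1) = \frac{p_{X,Y}(x, 1)}{p_Y(1)} = \frac{\begin{bmatrix} \tfrac{1}{20} & \tfrac{1}{20} & \tfrac{1}{20} & \tfrac{0}{20} \end{bmatrix}}{\tfrac{3}{20}} = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & 0 \end{bmatrix}.$$

Example. Consider two random variables $X$ and $Y$ defined as follows.
$$Y = \begin{cases} 10^2, & \text{with prob } 5/6, \\ 10^4, & \text{with prob } 1/6, \end{cases} \qquad X = \begin{cases} 10^{-4}\,Y, & \text{with prob } 1/2, \\ 10^{-3}\,Y, & \text{with prob } 1/3, \\ 10^{-2}\,Y, & \text{with prob } 1/6. \end{cases}$$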

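Find $p_{X|Y}(x|y)$, $p_X(x)$ and $p_{X,Y}(x, y)$.

Solution. Since $Y$ takes two different states, we can enumerate $Y = 10^2$ and $Y = 10^4$. This gives us
$$p_{X|Y}(x|10^2) = \begin{cases} 1/2, & \text{if } x = 0.01, \\ 1/3, & \text{if } x = 0.1, \\ 1/6, & \text{if } x = 1, \end{cases} \quad \text{and} \quad p_{X|Y}(x|10^4) = \begin{cases} 1/2, & \text{if } x = 1, \\ 1/3, & \text{if } x = 10, \\ 1/6, & \text{if } x = 100. \end{cases}$$
The joint PMF $p_{X,Y}(x, y)$ can be found as
$$p_{X,Y}(x, 10^2) = p_{X|Y}(x|10^2)\,p_Y(10^2) = \begin{cases} \left(\tfrac{1}{2}\right)\left(\tfrac{5}{6}\right), & x = 0.01, \\ \left(\tfrac{1}{3}\right)\left(\tfrac{5}{6}\right), & x = 0.1, \\ \left(\tfrac{1}{6}\right)\left(\tfrac{5}{6}\right), & x = 1, \end{cases}$$
$$p_{X,Y}(x, 10^4) = p_{X|Y}(x|10^4)\,p_Y(10^4) = \begin{cases} \left(\tfrac{1}{2}\right)\left(\tfrac{1}{6}\right), & x = 1, \\ \left(\tfrac{1}{3}\right)\left(\tfrac{1}{6}\right), & x = 10, \\ \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right), & x = 100. \end{cases}$$
Therefore, the joint PMF is given by the following table.

             X = 0.01   X = 0.1   X = 1   X = 10   X = 100
Y = 10^4     0          0         1/12    1/18     1/36
Y = 10^2     5/12       5/18      5/36    0        0

The marginal PMF $p_X(x)$ is
$$p_X(x) = \sum_y p_{X,Y}(x, y) = \begin{bmatrix} \tfrac{5}{12} & \tfrac{5}{18} & \tfrac{2}{9} & \tfrac{1}{18} & \tfrac{1}{36} \end{bmatrix}.$$

Conditional PDF

Definition 11. Let $X$ and $Y$ be two continuous random variables. The conditional PDF of $X$ given $Y$ is
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)}. \tag{5.12}$$

Example. Let $X$ and $Y$ be two continuous random variables with a joint PDF
$$f_{X,Y}(x, y) = \begin{cases} 2e^{-x}e^{-y}, & 0 \le y \le x < \infty, \\ 0, & \text{otherwise.} \end{cases}$$
Find the conditional PDFs $f_{X|Y}(x|y)$ and $f_{Y|X}(y|x)$.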

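Solution. In order to find the conditional PDFs, we first find the marginal PDFs.
$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy = \int_0^x 2e^{-x}e^{-y}\,dy = 2e^{-x}(1 - e^{-x}),$$
$$f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx = \int_y^{\infty} 2e^{-x}e^{-y}\,dx = 2e^{-2y}.$$
Therefore, the conditional PDFs are
$$f_{X|Y}(x|y) = \frac{f_{X,Y}(x, y)}{f_Y(y)} = \frac{2e^{-x}e^{-y}}{2e^{-2y}} = e^{-(x - y)}, \quad x \ge y,$$
$$f_{Y|X}(y|x) = \frac{f_{X,Y}(x, y)}{f_X(x)} = \frac{2e^{-x}e^{-y}}{2e^{-x}(1 - e^{-x})} = \frac{e^{-y}}{1 - e^{-x}}, \quad 0 \le y < x.$$

Example. This example considers a classical detection problem. Let $X$ be a random bit such that
$$X = \begin{cases} +1, & \text{with prob } 1/2, \\ -1, & \text{with prob } 1/2. \end{cases}$$
Suppose that $X$ is transmitted over a noisy channel so that the observed signal is
$$Y = X + N,$$
where $N \sim \mathcal{N}(0, 1)$ is the noise, which is independent of the signal $X$. Suppose that we observe $Y > 0$. Is the signal more likely to be $X = +1$ or $X = -1$?

Solution. First of all, we know that
$$f_{Y|X}(y|+1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y - 1)^2}{2}} \quad \text{and} \quad f_{Y|X}(y|-1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y + 1)^2}{2}}.$$
Therefore, given $Y > 0$, we need to find
$$P[X = +1 \mid Y > 0] = \frac{P[Y > 0 \mid X = +1]\,P[X = +1]}{P[Y > 0]}.$$
It holds that
$$P[Y > 0 \mid X = +1] = \int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{(y - 1)^2}{2}}\,dy = 1 - \int_{-\infty}^{0} \frac{1}{\sqrt{2\pi}} e^{-\frac{(y - 1)^2}{2}}\,dy = 1 - \Phi\left(\frac{0 - 1}{1}\right) = 1 - \Phi(-1).$$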

Similarly, we have
$$P[Y > 0 \mid X = -1] = 1 - \Phi(+1).$$
By the law of total probability, we have that
$$P[Y > 0] = P[Y > 0 \mid X = +1]\,P[X = +1] + P[Y > 0 \mid X = -1]\,P[X = -1] = 1 - \tfrac{1}{2}\left(\Phi(+1) + \Phi(-1)\right) = \tfrac{1}{2},$$
because $\Phi(+1) + \Phi(-1) = \Phi(+1) + 1 - \Phi(+1) = 1$. Therefore,
$$P[X = +1 \mid Y > 0] = \frac{(1 - \Phi(-1)) \cdot \tfrac{1}{2}}{\tfrac{1}{2}} = 1 - \Phi(-1) = 0.8413.$$
The implication is that if $Y > 0$, the posterior probability is $P[X = +1 \mid Y > 0] = 0.8413$. The complement of this result gives $P[X = -1 \mid Y > 0] = 1 - 0.8413 = 0.1587$. Therefore, $X = +1$ is more likely.

5.4 Joint Expectation, Moment, and Covariance

Joint Expectation and Joint Moment

Definition 12. Let $X$ and $Y$ be two random variables. The joint expectation is
$$E[XY] = \sum_y \sum_x xy\,p_{X,Y}(x, y) \tag{5.13}$$
if $X$ and $Y$ are discrete, or
$$E[XY] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f_{X,Y}(x, y)\,dx\,dy \tag{5.14}$$
if $X$ and $Y$ are continuous. Joint expectation is also called correlation.

Theorem 2. If $X$ and $Y$ are independent, then
$$E[XY] = E[X]\,E[Y]. \tag{5.15}$$

Proof. We only prove the discrete case because the continuous case can be proved similarly. If $X$ and $Y$ are independent, we have $p_{X,Y}(x, y) = p_X(x)\,p_Y(y)$. Therefore,
$$E[XY] = \sum_y \sum_x xy\,p_{X,Y}(x, y) = \sum_y \sum_x xy\,p_X(x)\,p_Y(y) = \left(\sum_x x\,p_X(x)\right)\left(\sum_y y\,p_Y(y)\right) = E[X]\,E[Y].$$
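The factorization of the joint expectation for independent variables can be checked on a small table. The sketch below assumes the fair coin and fair die setting as an illustration (all names are hypothetical) and evaluates the double sum for $E[XY]$ directly, then compares it with the product of the two marginal means.

```python
import numpy as np

# Assumed example: fair coin X in {0, 1}, fair die Y in {1, ..., 6}, independent.
x = np.array([0, 1])
y = np.arange(1, 7)
p_xy = np.full((2, 6), 1 / 12)

# Joint expectation: sum of x * y * p(x, y) over the whole table.
e_xy = np.sum(np.outer(x, y) * p_xy)

# Marginal means, using marginals obtained by summing out the other variable.
e_x = np.sum(x * p_xy.sum(axis=1))
e_y = np.sum(y * p_xy.sum(axis=0))

print(e_xy, e_x * e_y)
```

If the table is replaced by a joint PMF that does not factorize, the two printed numbers will generally differ.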

In general, for any two independent random variables and two functions $f$ and $g$, it holds that
$$E[f(X)\,g(Y)] = E[f(X)]\,E[g(Y)]. \tag{5.16}$$
Of particular interest is the choice $f(X) = X^k$ and $g(Y) = Y^l$, which gives the definition of joint moments.

Definition 13. Let $X$ and $Y$ be two random variables. The joint moment is
$$E[X^k Y^l] = \sum_y \sum_x x^k y^l\,p_{X,Y}(x, y) \tag{5.17}$$
if $X$ and $Y$ are discrete, or
$$E[X^k Y^l] = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x^k y^l\,f_{X,Y}(x, y)\,dx\,dy \tag{5.18}$$
if $X$ and $Y$ are continuous.

Covariance

The concept of covariance can be considered as a generalization of the concept of variance. Instead of measuring $(X - \mu_X)^2$, the covariance of two random variables measures $(X - \mu_X)(Y - \mu_Y)$. Thus, while the variance is always non-negative, a covariance can be negative.

Definition 14. Let $X$ and $Y$ be two random variables. The covariance is
$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)], \tag{5.19}$$
where $\mu_X = E[X]$ and $\mu_Y = E[Y]$.

The following theorem illustrates a few important properties of the covariance.

Theorem 3. The following results hold:
a. $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$.
b. $X$ and $Y$ are independent $\Rightarrow$ $\mathrm{Cov}(X, Y) = 0$.
c. $\mathrm{Cov}(X, Y) = 0$ $\not\Rightarrow$ $X$ and $Y$ are independent.

Remark: If $Y = X$, then $\mathrm{Cov}(X, Y) = E[X^2] - E[X]^2 = \mathrm{Var}[X]$.

Proof. The proof of part (a) is straightforward:
$$\mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY - X\mu_Y - Y\mu_X + \mu_X\mu_Y] = E[XY] - \mu_X\mu_Y.$$

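The proof of part (b) follows from Equation (5.15). If $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$. In this case,
$$\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[X]E[Y] - E[X]E[Y] = 0.$$

The proof of part (c) requires a counterexample. Consider a discrete random variable $Z$ taking values in $\{0, 1, 2, 3\}$ with PMF
$$p_Z(z) = \begin{bmatrix} \tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{4} \end{bmatrix}.$$
Let $X$ and $Y$ be
$$X = \cos\frac{\pi}{2} Z \quad \text{and} \quad Y = \sin\frac{\pi}{2} Z.$$
Then, we can show that $E[X] = 0$ and $E[Y] = 0$. The covariance is
$$\mathrm{Cov}(X, Y) = E[(X - 0)(Y - 0)] = E\left[\cos\frac{\pi}{2}Z\,\sin\frac{\pi}{2}Z\right] = E\left[\tfrac{1}{2}\sin \pi Z\right]$$
$$= \tfrac{1}{2}\left[(\sin \pi 0)\tfrac{1}{4} + (\sin \pi 1)\tfrac{1}{4} + (\sin \pi 2)\tfrac{1}{4} + (\sin \pi 3)\tfrac{1}{4}\right] = 0.$$

Our next goal is to show that $X$ and $Y$ are dependent. To this end, we only need to show that $p_{X,Y}(x, y) \ne p_X(x)\,p_Y(y)$. The joint PMF $p_{X,Y}(x, y)$ can be found by noting that
$$Z = 0 \Rightarrow X = 1,\ Y = 0; \quad Z = 1 \Rightarrow X = 0,\ Y = 1; \quad Z = 2 \Rightarrow X = -1,\ Y = 0; \quad Z = 3 \Rightarrow X = 0,\ Y = -1.$$
Thus, the joint PMF, arranged on the grid $\{-1, 0, 1\} \times \{-1, 0, 1\}$, is
$$p_{X,Y}(x, y) = \begin{bmatrix} 0 & \tfrac{1}{4} & 0 \\ \tfrac{1}{4} & 0 & \tfrac{1}{4} \\ 0 & \tfrac{1}{4} & 0 \end{bmatrix}.$$
The marginal PMFs are
$$p_X(x) = \begin{bmatrix} \tfrac{1}{4} & \tfrac{1}{2} & \tfrac{1}{4} \end{bmatrix}, \qquad p_Y(y) = \begin{bmatrix} \tfrac{1}{4} & \tfrac{1}{2} & \tfrac{1}{4} \end{bmatrix}.$$
The product $p_X(x)\,p_Y(y)$ is
$$p_X(x)\,p_Y(y) = \begin{bmatrix} \tfrac{1}{16} & \tfrac{1}{8} & \tfrac{1}{16} \\ \tfrac{1}{8} & \tfrac{1}{4} & \tfrac{1}{8} \\ \tfrac{1}{16} & \tfrac{1}{8} & \tfrac{1}{16} \end{bmatrix}.$$
Therefore, $p_{X,Y}(x, y) \ne p_X(x)\,p_Y(y)$, although $E[XY] = E[X]E[Y]$.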
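The counterexample can be reproduced in a few lines. The sketch below follows the construction in the proof (Z uniform on {0, 1, 2, 3}, X the cosine and Y the sine of πZ/2); the array names and the indexing convention "value + 1" for the grid {-1, 0, 1} are choices made here for illustration.

```python
import numpy as np

# Z is uniform on {0, 1, 2, 3}; X = cos(pi Z / 2), Y = sin(pi Z / 2).
z = np.arange(4)
p_z = np.full(4, 0.25)
x = np.rint(np.cos(np.pi * z / 2)).astype(int)   # rounds away floating-point noise
y = np.rint(np.sin(np.pi * z / 2)).astype(int)

# Covariance via Cov(X, Y) = E[XY] - E[X]E[Y].
cov = np.sum(x * y * p_z) - np.sum(x * p_z) * np.sum(y * p_z)

# Joint PMF on the grid {-1, 0, 1} x {-1, 0, 1}; index = value + 1.
p_xy = np.zeros((3, 3))
for xi, yi, pi in zip(x, y, p_z):
    p_xy[xi + 1, yi + 1] += pi

# Product of the marginals, to be compared with the joint PMF.
p_prod = np.outer(p_xy.sum(axis=1), p_xy.sum(axis=0))
is_independent = np.allclose(p_xy, p_prod)

print(cov, is_independent)
```

The covariance is zero while the joint table differs from the product of the marginals, for instance at the center cell (x, y) = (0, 0).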

The next theorem applies to random variables that are not necessarily independent.

Theorem 4. For any $X$ and $Y$ (not necessarily independent),
a. $E[X + Y] = E[X] + E[Y]$.
b. $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + 2\,\mathrm{Cov}(X, Y) + \mathrm{Var}[Y]$.

Of course, if $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$ and hence $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Proof. Proof of (a). Recall the definition of joint expectation:
$$E[X + Y] = \sum_y \sum_x (x + y)\,p_{X,Y}(x, y) = \sum_y \sum_x x\,p_{X,Y}(x, y) + \sum_y \sum_x y\,p_{X,Y}(x, y)$$
$$= \sum_x x \left(\sum_y p_{X,Y}(x, y)\right) + \sum_y y \left(\sum_x p_{X,Y}(x, y)\right) = \sum_x x\,p_X(x) + \sum_y y\,p_Y(y) = E[X] + E[Y].$$

Proof of (b).
$$\mathrm{Var}[X + Y] = E[(X + Y)^2] - E[X + Y]^2 = E[(X + Y)^2] - (\mu_X + \mu_Y)^2$$
$$= E[X^2 + 2XY + Y^2] - (\mu_X^2 + 2\mu_X\mu_Y + \mu_Y^2)$$
$$= E[X^2] - \mu_X^2 + E[Y^2] - \mu_Y^2 + 2(E[XY] - \mu_X\mu_Y) = \mathrm{Var}[X] + 2\,\mathrm{Cov}(X, Y) + \mathrm{Var}[Y].$$

Correlation Coefficient

Definition 15. Let $X$ and $Y$ be two random variables. The correlation coefficient is
$$\rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}. \tag{5.20}$$

The correlation coefficient provides a convenient way of assessing the relationship between two random variables. The following theorem outlines its properties.

Theorem 5. The correlation coefficient $\rho$ has the following properties:
• When $X = Y$ (fully correlated), $\rho = +1$.
• When $X = -Y$ (negatively correlated), $\rho = -1$.
• When $X$ and $Y$ are independent, $\rho = 0$. However, $\rho = 0$ does not imply that $X$ and $Y$ are independent.

Proof. When $X = Y$,
$$\rho = \frac{\mathrm{Var}[X]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[X]}} = 1.$$
When $Y = -X$,
$$\rho = \frac{E[X(-X)] - E[X]E[-X]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[-X]}} = \frac{-\mathrm{Var}[X]}{\mathrm{Var}[X]} = -1.$$
When $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$ and so $\rho = 0$. A counterexample for the converse can be found in Theorem 3(c).

In general, a correlation coefficient is always bounded between $-1$ and $1$.

Theorem 6. The correlation coefficient always satisfies
$$|\rho| \le 1. \tag{5.21}$$

Proof. We prove this result using the Cauchy inequality, which states that
$$E[XY]^2 \le E[X^2]\,E[Y^2].$$
Therefore, we have
$$\mathrm{Cov}(X, Y)^2 = E[(X - \mu_X)(Y - \mu_Y)]^2 \le E[(X - \mu_X)^2]\,E[(Y - \mu_Y)^2] = \mathrm{Var}[X]\,\mathrm{Var}[Y].$$
Hence, $|\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}$, which gives $|\rho| \le 1$.
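The properties of the correlation coefficient can be observed on sample data. The sketch below is an illustration under assumptions: it replaces the expectations by sample averages, draws synthetic Gaussian samples with a fixed seed, and uses a hypothetical helper `corr_coef`; the linear model `y = 0.5 x + noise` is chosen arbitrarily.

```python
import numpy as np

def corr_coef(x, y):
    """Sample version of rho = Cov(X, Y) / sqrt(Var[X] Var[Y])."""
    cov = np.mean((x - x.mean()) * (y - y.mean()))
    return cov / np.sqrt(x.var() * y.var())

rng = np.random.default_rng(0)          # fixed seed for repeatability
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)     # partially correlated with x

rho_same = corr_coef(x, x)      # fully correlated case, X = Y
rho_neg = corr_coef(x, -x)      # negatively correlated case, Y = -X
rho_xy = corr_coef(x, y)        # somewhere strictly between -1 and 1

print(rho_same, rho_neg, rho_xy)
```

NumPy's `np.corrcoef(x, y)[0, 1]` computes the same quantity and can be used as a cross-check.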

5.5 Conditional Expectation

When dealing with two dependent random variables, we sometimes would like to determine the expectation of one random variable when the second random variable takes a particular state. The conditional expectation is a formal way of doing so.

Definition 16. The conditional expectation of $X$ given $Y = y$ is
$$E[X \mid Y = y] = \sum_x x\,p_{X|Y}(x|y) \tag{5.22}$$
for discrete random variables, and
$$E[X \mid Y = y] = \int_{-\infty}^{\infty} x\,f_{X|Y}(x|y)\,dx \tag{5.23}$$
for continuous random variables.

There are a few points to note here:
• In $E[X \mid Y = y]$, the expectation is taken over $X$. In other words, we are exploring the randomness of $X$. To evaluate the conditional expectation, the PDF to use is $f_{X|Y}(x|y)$.
• The random variable $Y$ is fixed at $Y = y$. Thus, there is no randomness associated with $Y$. The resulting object $E[X \mid Y = y]$ is a function of $y$ because the random variable $X$ has been eliminated by the expectation.
• Conditional expectation is meaningful only when $X$ and $Y$ are dependent. If $X$ and $Y$ are independent, then $f_{X|Y}(x|y) = f_X(x)$ and so $E[X \mid Y = y] = E[X]$. That is, the conditional expectation does not depend on $y$.
• If we do not specify a particular value that $Y$ takes, then we refer to $E[X \mid Y]$, which is a random variable because it is a function of $Y$.

One of the most useful results in conditional expectation is the following theorem.

Theorem 7 (Law of Total Expectation).
$$E[X] = \sum_y E[X \mid Y = y]\,p_Y(y), \quad \text{or} \quad E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y]\,f_Y(y)\,dy. \tag{5.24}$$
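The law of total expectation can be checked with exact arithmetic. The sketch below assumes the two-state example from the conditional PMF section, with the values taken here as Y equal to 10^2 or 10^4 with probabilities 5/6 and 1/6, and X equal to 10^-4 Y, 10^-3 Y, or 10^-2 Y with probabilities 1/2, 1/3, 1/6. The dictionary layout and names are illustrative.

```python
from fractions import Fraction as Fr

# Support of X and the assumed model: p_Y(y) and p_{X|Y}(x | y).
x_vals = [Fr(1, 100), Fr(1, 10), Fr(1), Fr(10), Fr(100)]
p_y = {100: Fr(5, 6), 10000: Fr(1, 6)}
p_x_given_y = {
    100:   [Fr(1, 2), Fr(1, 3), Fr(1, 6), Fr(0), Fr(0)],
    10000: [Fr(0), Fr(0), Fr(1, 2), Fr(1, 3), Fr(1, 6)],
}

# Conditional expectations E[X | Y = y] for each state of Y.
cond_mean = {
    y: sum(x * p for x, p in zip(x_vals, probs))
    for y, probs in p_x_given_y.items()
}

# Law of total expectation: weight each conditional mean by p_Y(y).
e_x_total = sum(cond_mean[y] * p_y[y] for y in p_y)

# Direct computation from the marginal PMF of X, for comparison.
p_x = [sum(p_x_given_y[y][i] * p_y[y] for y in p_y) for i in range(len(x_vals))]
e_x_direct = sum(x * p for x, p in zip(x_vals, p_x))

print(cond_mean, e_x_total, float(e_x_total))
```

Both routes give the same value, which is the content of the theorem: averaging the conditional means over Y recovers the mean of X.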

Proof. We only prove the discrete case, as the continuous case can be proved by replacing summation with integration.
$$E[X] = \sum_x x\,p_X(x) = \sum_x x \left(\sum_y p_{X,Y}(x, y)\right) = \sum_x \sum_y x\,p_{X|Y}(x|y)\,p_Y(y)$$
$$= \sum_y \left(\sum_x x\,p_{X|Y}(x|y)\right) p_Y(y) = \sum_y E[X \mid Y = y]\,p_Y(y).$$

Corollary. Let $X$ and $Y$ be two random variables. Then,
$$E[X] = E\left[E[X \mid Y]\right]. \tag{5.25}$$

Proof. The previous theorem states that $E[X] = \sum_y E[X \mid Y = y]\,p_Y(y)$. If we treat $E[X \mid Y = y]$ as a function of $y$, e.g., $h(y)$, then
$$E[X] = \sum_y E[X \mid Y = y]\,p_Y(y) = \sum_y h(y)\,p_Y(y) = E[h(Y)] = E\left[E[X \mid Y]\right].$$

Remark: To be slightly more precise, the two expectations in Equation (5.25) are
$$E[X] = E_Y\left[E_{X|Y}[X \mid Y]\right],$$
i.e., the inner expectation is taken over $f_{X|Y}$, whereas the outer expectation is taken over $f_Y$.

Example. Consider a joint PMF given by the following table. Find $E[X \mid Y = 10^2]$ and $E[X \mid Y = 10^4]$.

             X = 0.01   X = 0.1   X = 1   X = 10   X = 100
Y = 10^4     0          0         1/12    1/18     1/36
Y = 10^2     5/12       5/18      5/36    0        0

Solution. To find the conditional expectation, we first need to know the conditional PMF.
$$p_{X|Y}(x|10^2) = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{6} & 0 & 0 \end{bmatrix}, \qquad p_{X|Y}(x|10^4) = \begin{bmatrix} 0 & 0 & \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{6} \end{bmatrix}.$$

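Therefore, the conditional expectations are
$$E[X \mid Y = 10^2] = (10^{-2})\left(\tfrac{1}{2}\right) + (10^{-1})\left(\tfrac{1}{3}\right) + (1)\left(\tfrac{1}{6}\right) = \tfrac{123}{600},$$
$$E[X \mid Y = 10^4] = (1)\left(\tfrac{1}{2}\right) + (10)\left(\tfrac{1}{3}\right) + (100)\left(\tfrac{1}{6}\right) = \tfrac{123}{6}.$$
From the conditional expectations we can also find $E[X]$:
$$E[X] = E[X \mid Y = 10^2]\,p_Y(10^2) + E[X \mid Y = 10^4]\,p_Y(10^4) = \left(\tfrac{123}{600}\right)\left(\tfrac{5}{6}\right) + \left(\tfrac{123}{6}\right)\left(\tfrac{1}{6}\right) = 3.5875.$$

Example. Consider two random variables $X$ and $Y$. The random variable $X$ is Gaussian distributed with $X \sim \mathcal{N}(\mu, \sigma^2)$. The random variable $Y$ has a conditional distribution $Y \mid X \sim \mathcal{N}(X, X^2)$. Find $E[Y]$.

Solution. We know that the two PDFs are
$$f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \quad \text{and} \quad f_{Y|X}(y|x) = \frac{1}{\sqrt{2\pi x^2}} e^{-\frac{(y - x)^2}{2x^2}}.$$
The conditional expectation of $Y$ given $X$ is
$$E[Y \mid X = x] = \int_{-\infty}^{\infty} y\,f_{Y|X}(y|x)\,dy = \int_{-\infty}^{\infty} y\,\frac{1}{\sqrt{2\pi x^2}} e^{-\frac{(y - x)^2}{2x^2}}\,dy = x.$$
The last equality holds because we are computing the expectation of a Gaussian random variable with mean $x$. Finally, applying the law of total expectation, we can show that
$$E[Y] = \int_{-\infty}^{\infty} E[Y \mid X = x]\,f_X(x)\,dx = \int_{-\infty}^{\infty} x\,\frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}\,dx = \mu.$$

Application: MMSE Estimator. (Optional)

Consider a pair of random variables $(X, Y)$. Suppose we observe this pair of random variables. Can we determine the relationship between them? That is, can we design a function $g$ that minimizes the error
$$\min_g\ E[(Y - g(X))^2]?$$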

We may assume that we know the distributions $f_X(x)$, $f_Y(y)$, and $f_{Y|X}(y|x)$. The solution to this problem is called the minimum mean squared error (MMSE) estimator.

Theorem 8. The MMSE estimator is a function $g^*$ which minimizes the mean squared error:
$$g^* = \underset{g}{\mathrm{argmin}}\ E[(Y - g(X))^2],$$
and is given by
$$g^*(x) = E[Y \mid X = x]. \tag{5.26}$$

Proof. By the law of total expectation, we have that
$$E[(Y - g(X))^2] = \int_{-\infty}^{\infty} E[(Y - g(X))^2 \mid X = x]\,f_X(x)\,dx.$$
Since all terms in this integration are non-negative, we can minimize the overall integral by minimizing the inner expectation for every $x$. The inner expectation is $E[(Y - g(X))^2 \mid X = x]$. When conditioned on $X = x$, the function $g(X) = g(x)$ is a fixed number that does not depend on $Y$. Therefore, we can treat $g(x) = c$ for some constant $c$ and try to determine $c$. This means that we want to find
$$c^* = \underset{c}{\mathrm{argmin}}\ E[(Y - c)^2 \mid X = x] = \underset{c}{\mathrm{argmin}} \int_{-\infty}^{\infty} (y - c)^2 f_{Y|X}(y|x)\,dy.$$
Taking the derivative with respect to $c$ and setting it to zero yields
$$\frac{d}{dc}\left(\int_{-\infty}^{\infty} (y - c)^2 f_{Y|X}(y|x)\,dy\right) = 0 \quad \Rightarrow \quad \int_{-\infty}^{\infty} (y - c)\,f_{Y|X}(y|x)\,dy = 0,$$
which implies that
$$c = \int_{-\infty}^{\infty} y\,f_{Y|X}(y|x)\,dy = E[Y \mid X = x].$$
Therefore, the inner expectation is minimized when $g(x) = E[Y \mid X = x]$.

5.6 Sum of Two Random Variables

One typical problem we encounter in engineering is the following: given two random variables $X$ and $Y$, what is the PDF of the sum $X + Y$? Such a problem arises naturally when we want to evaluate the average of a number of random variables, e.g., the sample mean of a collection of data points. In this section we will discuss a general principle for determining the PDF of a sum of two random variables.

To start with, we consider two random variables $X$ and $Y$ with PDFs $f_X(x)$ and $f_Y(y)$ respectively. Let us define the sum as $Z = X + Y$. Our goal is to determine the PDF of $Z$.

Theorem 9. Let $X$ and $Y$ be two independent random variables with PDFs $f_X(x)$ and $f_Y(y)$ respectively. Let $Z = X + Y$. The PDF of $Z$ is given by
$$f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(z - y)\,f_Y(y)\,dy, \tag{5.27}$$
where $*$ denotes the convolution.

Proof. Let us start by analyzing the CDF of $Z$. The CDF of $Z$ is
$$F_Z(z) = P[Z \le z] = \int_{-\infty}^{\infty} \int_{-\infty}^{z - y} f_X(x)\,f_Y(y)\,dx\,dy,$$
where the integration limits can be seen from Figure 5.4. Then, by the fundamental theorem of calculus, we can show that
$$f_Z(z) = \frac{d}{dz} F_Z(z) = \frac{d}{dz} \int_{-\infty}^{\infty} \int_{-\infty}^{z - y} f_X(x)\,f_Y(y)\,dx\,dy = \int_{-\infty}^{\infty} \left(\frac{d}{dz} \int_{-\infty}^{z - y} f_X(x)\,dx\right) f_Y(y)\,dy$$
$$= \int_{-\infty}^{\infty} f_X(z - y)\,f_Y(y)\,dy = (f_X * f_Y)(z),$$
where $*$ denotes the convolution.

The result of this derivation shows that the PDF of $X + Y$ is the convolution of $f_X(x)$ and $f_Y(y)$. The following example illustrates how we can compute the convolution.

Example. Let $X$ and $Y$ be independent, and let
$$f_X(x) = \begin{cases} x e^{-x}, & x \ge 0, \\ 0, & x < 0, \end{cases} \qquad f_Y(y) = \begin{cases} y e^{-y}, & y \ge 0, \\ 0, & y < 0. \end{cases}$$

Figure 5.4: The shaded region highlights the set $X + Y \le z$.

Find the PDF of $Z = X + Y$.

Solution. Using the results derived above, we see that
$$f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\,f_Y(y)\,dy = \int_0^z f_X(z - y)\,f_Y(y)\,dy,$$
where the lower limit $0$ comes from $y \ge 0$ and the upper limit $z$ comes from the fact that $x \ge 0$: since $Z = X + Y$, we must have $Z - Y = X \ge 0$ and so $Z \ge Y$. Substituting the PDFs into the integration yields
$$f_Z(z) = \int_0^z (z - y)e^{-(z - y)}\,y e^{-y}\,dy = \frac{z^3}{6} e^{-z}, \quad z \ge 0.$$
For $z < 0$, $f_Z(z) = 0$.

In general, a function of two random variables is not limited to summation. The following example illustrates the case of a product of two random variables.

Example. Let $X$ and $Y$ be two independent random variables such that
$$f_X(x) = \begin{cases} 2x, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise,} \end{cases} \quad \text{and} \quad f_Y(y) = \begin{cases} 1, & \text{if } 0 \le y \le 1, \\ 0, & \text{otherwise.} \end{cases}$$
Let $Z = XY$. Find $f_Z(z)$.

Solution. The CDF of $Z$ can be evaluated as
$$F_Z(z) = P[Z \le z] = P[XY \le z] = \int_{-\infty}^{\infty} \int_{-\infty}^{z/y} f_X(x)\,f_Y(y)\,dx\,dy.$$

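Taking the derivative yields
$$f_Z(z) = \frac{d}{dz} F_Z(z) = \frac{d}{dz} \int_{-\infty}^{\infty} \int_{-\infty}^{z/y} f_X(x)\,f_Y(y)\,dx\,dy \overset{(a)}{=} \int_{-\infty}^{\infty} \frac{1}{y}\,f_X\!\left(\frac{z}{y}\right) f_Y(y)\,dy,$$
where (a) holds by the fundamental theorem of calculus. The upper and lower limits of this integration can be determined by noting that $z \ge 0$ and $0 \le z/y = x \le 1$, which implies that $z \le y$. Since $y \le 1$, we have that $z \le y \le 1$. Therefore, the PDF is
$$f_Z(z) = \int_z^1 \frac{1}{y}\,f_X\!\left(\frac{z}{y}\right) f_Y(y)\,dy = \int_z^1 \frac{2z}{y^2}\,dy = 2(1 - z), \quad 0 \le z \le 1.$$
For $z < 0$ (and for $z > 1$), $f_Z(z) = 0$.

5.7 Two-dimensional Gaussian

Covariance Matrix and Joint Gaussian PDF

Among many joint distributions, the joint Gaussian is of particular interest because of its usefulness. To define a joint Gaussian distribution, we first define some notation:
$$\boldsymbol{X} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \quad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \boldsymbol{\Sigma} = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) \end{bmatrix}.$$
The vector $\boldsymbol{\mu}$ is called the mean vector, and the matrix $\boldsymbol{\Sigma}$ is called the covariance matrix. It is not difficult to show that the covariance matrix can be defined in the following way.

Theorem 10. The covariance matrix $\boldsymbol{\Sigma}$ is equivalent to
$$\boldsymbol{\Sigma} = E[(\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^T]. \tag{5.28}$$

Proof. For a two-dimensional random variable, the theorem holds because
$$E[(\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^T] = E\left[\begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix} \begin{bmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{bmatrix}\right]$$
$$= E\begin{bmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) \end{bmatrix}.$$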
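The outer-product form of the covariance matrix translates directly into code when the expectation is replaced by a sample average. The sketch below is an illustration under assumptions: the samples are synthetic (a fixed-seed Gaussian draw passed through an arbitrary mixing matrix), and the normalization by the number of samples corresponds to `bias=True` in `np.cov`.

```python
import numpy as np

rng = np.random.default_rng(0)                       # fixed seed for repeatability
mixing = np.array([[2.0, 0.0], [1.0, 0.5]])          # arbitrary, for illustration
samples = rng.normal(size=(5000, 2)) @ mixing        # each row is one draw of [X1, X2]

# Sample version of Sigma = E[(X - mu)(X - mu)^T].
mu = samples.mean(axis=0)
centered = samples - mu
sigma_hat = centered.T @ centered / len(samples)

# Cross-check against NumPy's estimator with the same 1/N normalization.
sigma_np = np.cov(samples, rowvar=False, bias=True)

print(sigma_hat)
```

The diagonal entries are the sample variances of the two coordinates, and the two off-diagonal entries are equal to each other, as the matrix in the proof suggests.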

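Clearly, the definition can be extended to a random vector of any finite dimension. We can also prove the following property of the covariance matrix.

Theorem 11. The covariance matrix $\boldsymbol{\Sigma}$ is symmetric positive semi-definite, i.e.,
$$\boldsymbol{\Sigma}^T = \boldsymbol{\Sigma}, \quad \text{and} \quad \boldsymbol{v}^T \boldsymbol{\Sigma} \boldsymbol{v} \ge 0, \quad \forall \boldsymbol{v} \in \mathbb{R}^d.$$

Proof. Symmetry is immediate from the definition, because $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$. The positive semi-definiteness comes from the fact that
$$\boldsymbol{v}^T \boldsymbol{\Sigma} \boldsymbol{v} = \boldsymbol{v}^T E[(\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^T]\,\boldsymbol{v} = E[\boldsymbol{v}^T (\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^T \boldsymbol{v}] = E[u^T u] = E[\|u\|^2] \ge 0,$$
where we let $u = (\boldsymbol{X} - \boldsymbol{\mu})^T \boldsymbol{v}$.

With these tools in hand, we can now define a joint Gaussian. The PDF of a multi-dimensional Gaussian is given by the following definition.

Definition 17. A $d$-dimensional joint Gaussian has a PDF
$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu})\right\}, \tag{5.29}$$
where $d$ denotes the dimensionality of the vector $\boldsymbol{x}$.

In this course, we are mostly interested in the case when $d = 2$. As a special case, if we assume that $X_1$ and $X_2$ are independent, then we can show the following result.

Theorem 12. Let $\boldsymbol{X} = [X_1, X_2]^T$ be a joint Gaussian. If $X_1$ and $X_2$ are independent, then
$$f_{\boldsymbol{X}}(\boldsymbol{x}) = \left(\frac{1}{\sqrt{2\pi\sigma_1^2}} \exp\left\{-\frac{(x_1 - \mu_1)^2}{2\sigma_1^2}\right\}\right) \left(\frac{1}{\sqrt{2\pi\sigma_2^2}} \exp\left\{-\frac{(x_2 - \mu_2)^2}{2\sigma_2^2}\right\}\right), \tag{5.30}$$
i.e., the product of two 1D Gaussians.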
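The factorization of an independent 2D Gaussian can be confirmed by evaluating both sides at a point. The sketch below implements the $d$-dimensional PDF formula with a hypothetical helper `gaussian_pdf` and compares it with the product of two 1D Gaussians; the mean, standard deviations, and evaluation point are arbitrary values chosen for illustration.

```python
import numpy as np

def gaussian_pdf(x, mu, cov):
    """d-dimensional Gaussian PDF evaluated at a single point x."""
    d = len(mu)
    diff = x - mu
    quad = diff @ np.linalg.solve(cov, diff)      # (x - mu)^T Sigma^{-1} (x - mu)
    norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(cov))
    return np.exp(-0.5 * quad) / norm

def gaussian_pdf_1d(t, m, s):
    """1D Gaussian PDF with mean m and standard deviation s."""
    return np.exp(-(t - m) ** 2 / (2 * s ** 2)) / np.sqrt(2 * np.pi * s ** 2)

# Arbitrary illustration values; a diagonal covariance encodes independence.
mu = np.array([1.0, -2.0])
s1, s2 = 0.5, 2.0
cov = np.diag([s1 ** 2, s2 ** 2])
x = np.array([0.7, -1.0])

joint = gaussian_pdf(x, mu, cov)
product = gaussian_pdf_1d(x[0], mu[0], s1) * gaussian_pdf_1d(x[1], mu[1], s2)
print(joint, product)
```

With a non-zero off-diagonal entry in `cov`, the two values would no longer agree, because the quadratic form then contains a cross term.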

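Proof. To show this result, we note that if $X_1$ and $X_2$ are independent, then
$$\boldsymbol{\Sigma} = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(X_1) & 0 \\ 0 & \mathrm{Var}(X_2) \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}.$$
The determinant is $|\boldsymbol{\Sigma}| = \sigma_1^2 \sigma_2^2$. Therefore,
$$(\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) = \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix} \begin{bmatrix} \sigma_1^{-2} & 0 \\ 0 & \sigma_2^{-2} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} = \frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2}.$$
Substituting these results into Equation (5.29) yields the desired result.

Geometric Interpretation

Geometrically, the mean $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$ can be interpreted as the center and the radius of the ellipse representing the Gaussian. Figure 5.5 illustrates three examples. As one can observe in these examples, the mean vector $\boldsymbol{\mu}$ controls the center of the Gaussian. The radius and orientation of the Gaussian are controlled by the covariance matrix.

Figure 5.5: The center and the radius of the ellipse are determined by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.

The precise relation between the radius and orientation of the Gaussian and the covariance matrix is determined by the eigenvectors and eigenvalues of $\boldsymbol{\Sigma}$.

Definition 18. The covariance matrix $\boldsymbol{\Sigma}$ can be decomposed as
$$\boldsymbol{\Sigma} = \boldsymbol{U} \boldsymbol{\Lambda} \boldsymbol{U}^T, \tag{5.31}$$
for some unitary matrix $\boldsymbol{U}$ and diagonal matrix $\boldsymbol{\Lambda}$. The columns of $\boldsymbol{U}$ are called the eigenvectors, and the diagonal entries of $\boldsymbol{\Lambda}$ are called the eigenvalues.

If we write out the definition of the eigenvectors and eigenvalues, we can see that (at least for the two-dimensional case):
$$\boldsymbol{\Sigma} = \boldsymbol{U} \boldsymbol{\Lambda} \boldsymbol{U}^T = \begin{bmatrix} \boldsymbol{u}_1 & \boldsymbol{u}_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} \boldsymbol{u}_1^T \\ \boldsymbol{u}_2^T \end{bmatrix}.$$
With the eigenvalues ordered so that $\lambda_1 \ge \lambda_2$, the column vector $\boldsymbol{u}_1$ defines the direction of the major axis, and $\boldsymbol{u}_2$ defines the direction of the minor axis. The values $\lambda_1$ and $\lambda_2$ define the radii of the axes, respectively (the axis lengths scale with $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$). See Figure 5.6 for an illustration.

Figure 5.6: The center and the radius of the ellipse are determined by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.

Maximum-a-Posteriori Classifier

Consider a dataset of two classes $C_1$ and $C_2$. We assume that all data within each class follow a Gaussian distribution. More specifically, we assume that
$$\boldsymbol{X} \mid C_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \quad \text{and} \quad \boldsymbol{X} \mid C_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2).$$
Suppose we are given a testing data point $\boldsymbol{x}$. How do we design a classifier to classify this data point?

To answer this question, we first need to determine the two PDFs. Assume that the probability of obtaining $C_1$ is $\pi_1$ and the probability of obtaining $C_2$ is $\pi_2$. That is, $f_C(C_1) = \pi_1$ and $f_C(C_2) = \pi_2$, with $\pi_1 + \pi_2 = 1$. The conditional PDFs are given by
$$f_{\boldsymbol{X}|C}(\boldsymbol{x}|C_1) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_1|}} \exp\left\{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_1^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1)\right\},$$
$$f_{\boldsymbol{X}|C}(\boldsymbol{x}|C_2) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_2|}} \exp\left\{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2)\right\}.$$

One possible way of designing a classifier is to test the posterior distribution, and check
$$f_{C|\boldsymbol{X}}(C_1|\boldsymbol{x}) \ \underset{C_2}{\overset{C_1}{\gtrless}}\ f_{C|\boldsymbol{X}}(C_2|\boldsymbol{x}). \tag{5.32}$$
If $f_{C|\boldsymbol{X}}(C_1|\boldsymbol{x}) \ge f_{C|\boldsymbol{X}}(C_2|\boldsymbol{x})$, we claim that the class is $C_1$. Otherwise it is $C_2$. By Bayes' theorem, and since both posteriors share the same denominator $f_{\boldsymbol{X}}(\boldsymbol{x})$, we can rewrite the test as
$$f_{\boldsymbol{X}|C}(\boldsymbol{x}|C_1)\,f_C(C_1) \ \underset{C_2}{\overset{C_1}{\gtrless}}\ f_{\boldsymbol{X}|C}(\boldsymbol{x}|C_2)\,f_C(C_2).$$
Substituting the Gaussians, we have
$$\frac{\pi_1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_1|}}\,e^{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_1^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1)} \ \underset{C_2}{\overset{C_1}{\gtrless}}\ \frac{\pi_2}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_2|}}\,e^{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2)}.$$
The comparison defined by this posterior distribution is called the maximum-a-posteriori (MAP) classification.

Definition 19. The maximum-a-posteriori (MAP) classification is a test to check whether
$$f_{\boldsymbol{X}|C}(\boldsymbol{x}|C_1)\,f_C(C_1) \ \underset{C_2}{\overset{C_1}{\gtrless}}\ f_{\boldsymbol{X}|C}(\boldsymbol{x}|C_2)\,f_C(C_2). \tag{5.33}$$

To demonstrate how the MAP classification can be used in practice, we consider a special case when $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ and $\pi_1 = \pi_2 = 1/2$.

Theorem 13. Let $\boldsymbol{X} \mid C_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\boldsymbol{X} \mid C_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. Suppose that $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$ and $\pi_1 = \pi_2 = 1/2$. Then the MAP classifier of $C_1$ and $C_2$ is
$$\boldsymbol{w}^T \boldsymbol{x} + x_0 \ \underset{C_2}{\overset{C_1}{\lessgtr}}\ 0, \tag{5.34}$$
where $\boldsymbol{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$ and $x_0 = \frac{1}{2}\left\{\boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2\right\}$. That is, the class is $C_1$ when the left-hand side is at most $0$, and $C_2$ otherwise.

Proof. When $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$ and $\pi_1 = \pi_2 = 1/2$, the MAP classifier can be simplified as
$$e^{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1)} \ \underset{C_2}{\overset{C_1}{\gtrless}}\ e^{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2)}, \tag{5.35}$$
which implies that
$$(\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1) \ \underset{C_2}{\overset{C_1}{\lessgtr}}\ (\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2). \tag{5.36}$$
Note that the sign is flipped because there is a $-1/2$ term in the exponential. Expanding both sides, cancelling the common term $\boldsymbol{x}^T \boldsymbol{\Sigma}^{-1} \boldsymbol{x}$, and rearranging, we obtain an equivalent expression
$$\boldsymbol{x}^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1) \ \underset{C_2}{\overset{C_1}{\lessgtr}}\ -\frac{1}{2}\left\{\boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2\right\}. \tag{5.37}$$
If we define $\boldsymbol{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_2 - \boldsymbol{\mu}_1)$ and $x_0 = \frac{1}{2}\left\{\boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2\right\}$, the above expression can be simplified as
$$\boldsymbol{w}^T \boldsymbol{x} + x_0 \ \underset{C_2}{\overset{C_1}{\lessgtr}}\ 0. \tag{5.38}$$

The result above shows a linear classifier. Given a data point $\boldsymbol{x}$, all we need to do is to project $\boldsymbol{x}$ onto $\boldsymbol{w}$, and then check whether the resulting value $\boldsymbol{w}^T \boldsymbol{x} + x_0$ is less than or greater than 0. If it is less than 0, then we claim that the class is $C_1$.

Figure 5.7: Classifying two classes of data points.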
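The linear MAP rule for equal covariances and equal priors can be sketched as follows. This is an illustration under a stated sign convention, which is an assumption made here: `w` is taken as the inverse covariance times (mu2 - mu1), `x0` as half of (mu1' S^-1 mu1 - mu2' S^-1 mu2), and class 1 is declared when `w @ x + x0` is at most zero. The function names and the numerical values are hypothetical.

```python
import numpy as np

def map_linear_classifier(mu1, mu2, cov):
    """Return (w, x0) for the equal-covariance, equal-prior MAP rule."""
    cov_inv = np.linalg.inv(cov)
    w = cov_inv @ (mu2 - mu1)
    x0 = 0.5 * (mu1 @ cov_inv @ mu1 - mu2 @ cov_inv @ mu2)
    return w, x0

def classify(x, w, x0):
    """Class 1 if w^T x + x0 <= 0, otherwise class 2 (assumed convention)."""
    return 1 if w @ x + x0 <= 0 else 2

def mahalanobis_sq(x, mu, cov):
    """Squared Mahalanobis distance (x - mu)^T Sigma^{-1} (x - mu)."""
    diff = x - mu
    return diff @ np.linalg.solve(cov, diff)

# Arbitrary illustration values.
mu1 = np.array([0.0, 0.0])
mu2 = np.array([3.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 2.0]])
w, x0 = map_linear_classifier(mu1, mu2, cov)

x_test = np.array([0.5, 0.2])
print(classify(x_test, w, x0))
```

Under this convention the linear rule agrees with comparing the two squared Mahalanobis distances directly: the point is assigned to the class whose mean is closer in the metric induced by the shared covariance.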