Chapter 5 Two Random Variables

In a practical engineering problem, there is almost always a causal relationship between different events. Some relationships are determined by physical laws, e.g., voltage and current, while some are abstracted from the problem, e.g., the probability of passing a class and the probability of graduating. Whenever we need to handle the relationship between two or more events, we need mathematical tools to describe the probabilistic phenomenon. The objective of this chapter is to present the concepts of joint distributions.

5.1 Joint PMF and Joint PDF

Perhaps the simplest way of modeling two (discrete) random variables is by means of a joint PMF, defined as follows.

Definition 1. Let $X$ and $Y$ be two discrete random variables. The joint PMF of $X$ and $Y$ is defined as
\[ p_{X,Y}(x,y) = P[X = x \cap Y = y]. \tag{5.1} \]

The interpretation of a joint PMF is that the sample space is now the Cartesian product $\Omega_X \times \Omega_Y$, where $\Omega_X$ is the sample space of $X$ and $\Omega_Y$ is the sample space of $Y$. Pictorially, this means that the sample space of the joint PMF is a two-dimensional plane $(X, Y)$. We stress the importance of this two-dimensional sample space, because every outcome of the joint pair is a point in the two-dimensional space, i.e., $(X, Y)$. Therefore, $P[X \in A \cap Y \in B]$ for sets $A$ and $B$ can be interpreted as
\[ P[X \in A \cap Y \in B] = P[\{(\xi, \zeta) \mid \xi \in X^{-1}(A) \text{ and } \zeta \in Y^{-1}(B)\}]. \tag{5.2} \]
For discrete random variables, the PMF $p_{X,Y}(x,y)$ can be considered as a collection of delta functions in the two-dimensional space.

Example. Let $X$ be a coin flip and $Y$ be a die. Find the joint PMF of $X$ and $Y$.

Solution. The joint PMF is $p_{X,Y}(x,y) = \frac{1}{12}$ for $x = 0, 1$ and $y = 1, 2, 3, 4, 5, 6$. Pictorially, the joint PMF is given by the following table.

              Y = 1   Y = 2   Y = 3   Y = 4   Y = 5   Y = 6
    X = 0      1/12    1/12    1/12    1/12    1/12    1/12
    X = 1      1/12    1/12    1/12    1/12    1/12    1/12

In this example, we observe that if $X$ and $Y$ are not interacting (formally, we call them independent, which we will discuss later), then the joint PMF is the product of the two individual probabilities.
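As a quick numerical check of the coin-and-die example, here is a minimal sketch (assuming NumPy is available; the variable names are ours) that builds the $2 \times 6$ table and verifies its normalization and product structure:

```python
import numpy as np

# Joint PMF of a fair coin X (rows: x = 0, 1) and a fair die Y
# (columns: y = 1, ..., 6); every one of the 12 outcomes has probability 1/12.
p_XY = np.full((2, 6), 1 / 12)

print(p_XY.sum())                    # 1.0 -- normalization (Theorem 1 below)

p_X = p_XY.sum(axis=1)               # marginal of X: [0.5, 0.5]
p_Y = p_XY.sum(axis=0)               # marginal of Y: [1/6, ..., 1/6]

# The joint PMF equals the outer product of the marginals, i.e., the
# coin and the die do not interact.
print(np.allclose(p_XY, np.outer(p_X, p_Y)))   # True
```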

The continuous version of the joint PMF is called the joint PDF.

Definition 2. Let $X$ and $Y$ be two continuous random variables. The joint PDF of $X$ and $Y$ is a function $f_{X,Y}(x,y)$ that can be integrated to yield a probability:
\[ P[a \le X \le b \cap c \le Y \le d] = \int_c^d \int_a^b f_{X,Y}(x,y)\, dx\, dy. \tag{5.3} \]
Like PDFs for single random variables, a joint PDF is a density which can be integrated to obtain a probability. Note also that in this definition the events $\{a \le X \le b\}$ and $\{c \le Y \le d\}$ are combined with a logical AND.

Example. Consider a uniform joint PDF $f_{X,Y}(x,y) = 1$ defined on $[0,1]^2$, as shown in Figure 5.1. The shaded area corresponds to
\[ P[a \le X \le b \cap c \le Y \le d] = \int_c^d \int_a^b f_{X,Y}(x,y)\, dx\, dy = \int_c^d \int_a^b 1\, dx\, dy = (d-c)(b-a). \]
In general, when $f_{X,Y}(x,y)$ is not uniform, we have to integrate $f_{X,Y}(x,y)$ over the intervals specified.

Figure 5.1: (a) A general $f_{X,Y}(x,y)$; (b) the example above. The joint PDF $f_{X,Y}(x,y)$ is a two-dimensional function. Integrating over the rectangle $[a,b] \times [c,d]$ returns the probability $P[a \le X \le b \cap c \le Y \le d]$.

Normalization

The normalization property of a two-dimensional PMF or PDF states that by enumerating over all outcomes of the sample space we obtain 1.

Theorem 1. All joint PMFs and joint PDFs satisfy
\[ \sum_x \sum_y p_{X,Y}(x,y) = 1 \quad \text{or} \quad \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx\, dy = 1. \tag{5.4} \]

Example. Consider a joint uniform PDF defined on a shaded region $\Omega$,
\[ f_{X,Y}(x,y) = \begin{cases} c, & \text{if } (x,y) \in \Omega, \\ 0, & \text{otherwise.} \end{cases} \]
Find the constant $c$.

Solution. To find the constant $c$, we note that $\int\!\!\int f_{X,Y}(x,y)\, dx\, dy = 1$. The left hand side of this equation is precisely $c$ times the area of the region, i.e., $c\,|\Omega|$. Therefore, we have $c = 1/|\Omega|$.

Marginal PMF and Marginal PDF

If we only sum / integrate with respect to one random variable, we obtain the PMF / PDF of the other random variable. The resulting PMF / PDF is called the marginal PMF / PDF.

Definition 3. The marginal PMF is defined as
\[ p_X(x) = \sum_y p_{X,Y}(x,y) \quad \text{and} \quad p_Y(y) = \sum_x p_{X,Y}(x,y). \tag{5.5} \]

Definition 4. The marginal PDF is defined as
\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy \quad \text{and} \quad f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx. \tag{5.6} \]

Since $f_{X,Y}(x,y)$ is a two-dimensional function, when integrating over $y$ from $-\infty$ to $\infty$ we project $f_{X,Y}(x,y)$ onto the $x$-axis. Therefore, the resulting function depends on $x$ only.

Example. Consider the joint PDF $f_{X,Y}(x,y)$ shown in Figure 5.2. Find the marginal PDFs.

Solution. Integrating the joint PDF over $y$ projects it onto the $x$-axis, and integrating over $x$ projects it onto the $y$-axis. Because the joint PDF in Figure 5.2 is uniform over its support, both $f_X(x)$ and $f_Y(y)$ are piecewise-constant functions: on each interval, the marginal value equals the constant height of the joint PDF multiplied by the extent of the support in the direction being integrated, and it is zero outside the support.

Figure 5.2: Example of a joint uniform PDF $f_{X,Y}(x,y)$ and the corresponding marginal PDFs.

Example. Consider a 2D Gaussian PDF as shown in Figure 5.3. The PDF of the joint Gaussian is
\[ f_{X,Y}(x,y) = \frac{1}{2\pi\sigma^2} \exp\left\{ -\frac{(x-\mu_X)^2 + (y-\mu_Y)^2}{2\sigma^2} \right\}. \]

Find the marginal PDFs $f_X(x)$ and $f_Y(y)$.

Solution.
\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy = \int_{-\infty}^{\infty} \frac{1}{2\pi\sigma^2} \exp\left\{-\frac{(x-\mu_X)^2 + (y-\mu_Y)^2}{2\sigma^2}\right\} dy \]
\[ = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x-\mu_X)^2}{2\sigma^2}\right\} \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y-\mu_Y)^2}{2\sigma^2}\right\} dy = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(x-\mu_X)^2}{2\sigma^2}\right\}. \]
Similarly, we have
\[ f_Y(y) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left\{-\frac{(y-\mu_Y)^2}{2\sigma^2}\right\}. \]
The result of this example shows that the marginalization of a 2D Gaussian is a 1D Gaussian along the vertical and the horizontal axes. Thus, we can think of marginalization as a projection.

Figure 5.3: Marginalization is equivalent to projection. The joint PDF shown in this figure can be marginalized onto the $x$- or the $y$-axis.
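To make the projection picture concrete, here is a minimal numerical sketch (assuming NumPy; the grid and the values $\mu_X = 1$, $\mu_Y = -1$, $\sigma = 0.8$ are ours, chosen only for illustration) that marginalizes a discretized 2D Gaussian by a Riemann sum over $y$ and compares the result with the 1D Gaussian above:

```python
import numpy as np

# Discretize a 2D Gaussian on a grid, then marginalize by summing out y
# (the discrete analogue of integrating over y).
mu_x, mu_y, sigma = 1.0, -1.0, 0.8
x = np.linspace(-5, 5, 401)
y = np.linspace(-5, 5, 401)
dx = x[1] - x[0]
X, Y = np.meshgrid(x, y, indexing="ij")

f_xy = np.exp(-((X - mu_x)**2 + (Y - mu_y)**2) / (2 * sigma**2)) / (2 * np.pi * sigma**2)

# Marginal of X: Riemann sum over the y-axis.
f_x = f_xy.sum(axis=1) * dx

# Compare against the closed-form 1D Gaussian N(mu_x, sigma^2).
f_x_exact = np.exp(-(x - mu_x)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
print(np.max(np.abs(f_x - f_x_exact)))   # small numerical error
```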

Independence of Random Variables

Finally, we say that two random variables are independent if the joint PMF or PDF can be factorized as a product of the marginal PMFs / PDFs.

Definition 5. If two random variables $X$ and $Y$ are independent, then
\[ p_{X,Y}(x,y) = p_X(x)\, p_Y(y), \quad \text{and} \quad f_{X,Y}(x,y) = f_X(x)\, f_Y(y). \]

To see why this definition is coherent with the definition of independence of two events, recall that two events $A$ and $B$ are independent if $P[A \cap B] = P[A]P[B]$. Letting $A = \{X = x\}$ and $B = \{Y = y\}$, we see that if $A$ and $B$ are independent then
\[ P[X = x \cap Y = y] = P[X = x]\, P[Y = y]. \]
This is precisely the relationship $p_{X,Y}(x,y) = p_X(x)p_Y(y)$.

Independence is an important statistical property. If there are many random variables $X_1, X_2, \ldots, X_N$, the joint PDF $f_{X_1,\ldots,X_N}(x_1,\ldots,x_N)$ is an $N$-dimensional function which could be computationally intractable. However, if we assume that all these random variables are independent, then the joint PDF becomes
\[ f_{X_1,\ldots,X_N}(x_1,\ldots,x_N) = \prod_{n=1}^{N} f_{X_n}(x_n), \]
which is often manageable. As a special case of independent random variables, we define the notion of independent and identically distributed (i.i.d.) random variables.

Definition 6 (Independent and Identically Distributed (i.i.d.)). A collection of random variables $X_1, \ldots, X_N$ is called independent and identically distributed (i.i.d.) if
- all $X_1, \ldots, X_N$ are independent;
- all $X_1, \ldots, X_N$ have the same distribution, i.e., $f_{X_1}(x) = \cdots = f_{X_N}(x)$.

If $X_1, \ldots, X_N$ are i.i.d., we have that
\[ f_{X_1,\ldots,X_N}(x,\ldots,x) = \prod_{n=1}^{N} f_{X_n}(x) = [f_{X_1}(x)]^N, \tag{5.7} \]
where the particular choice of $X_1$ is unimportant because $f_{X_1}(x) = \cdots = f_{X_N}(x)$.
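The factorization in Definition 5 is easy to test numerically for finite alphabets. Below is a small sketch (assuming NumPy; the helper name `is_independent` and the example tables are ours) that checks whether a joint PMF equals the outer product of its marginals:

```python
import numpy as np

def is_independent(p_xy, tol=1e-12):
    """Check whether a joint PMF table factorizes as the outer product
    of its marginals, i.e., p_{X,Y}(x, y) = p_X(x) p_Y(y) for all x, y."""
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    return np.allclose(p_xy, np.outer(p_x, p_y), atol=tol)

# Coin-and-die table from the earlier example: independent.
print(is_independent(np.full((2, 6), 1 / 12)))      # True

# A joint PMF that does not factorize: dependent.
p = np.array([[0.4, 0.1],
              [0.1, 0.4]])
print(is_independent(p))                            # False
```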

5.2 Joint CDF

As in Chapters 3 and 4, we need to understand the cumulative distribution function (CDF) in the multi-variable case.

Definition 7. Let $X$ and $Y$ be two random variables. The joint CDF of $X$ and $Y$ is the function $F_{X,Y}(x,y)$ such that
\[ F_{X,Y}(x,y) = P[X \le x \cap Y \le y]. \tag{5.8} \]

From this definition, we can explicitly write out the probability as follows.

Definition 8. If $X$ and $Y$ are discrete, then
\[ F_{X,Y}(x,y) = \sum_{y' \le y} \sum_{x' \le x} p_{X,Y}(x', y'). \tag{5.9} \]
If $X$ and $Y$ are continuous, then
\[ F_{X,Y}(x,y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f_{X,Y}(x', y')\, dx'\, dy'. \tag{5.10} \]

Note that since $F_{X,Y}(x,y)$ is the integration from $-\infty$ to $x$ (and $y$), we have
\[ F_{X,Y}(-\infty, y) = \int_{-\infty}^{y} \int_{-\infty}^{-\infty} f_{X,Y}(x', y')\, dx'\, dy' = \int_{-\infty}^{y} 0\, dy' = 0. \]
Similarly, we have $F_{X,Y}(x, -\infty) = 0$ and $F_{X,Y}(-\infty, -\infty) = 0$. The CDF evaluated at $x = \infty$ and $y = \infty$ is
\[ F_{X,Y}(\infty, \infty) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f_{X,Y}(x', y')\, dx'\, dy' = 1. \]
If only $x$ or $y$ is at $\infty$, we obtain the marginal CDF.

Proposition 1. Let $X$ and $Y$ be two random variables. Then the marginal CDFs can be obtained from
\[ F_X(x) = F_{X,Y}(x, \infty), \qquad F_Y(y) = F_{X,Y}(\infty, y). \]
To see these results, we note that
\[ F_{X,Y}(x, \infty) = \int_{-\infty}^{x} \left( \int_{-\infty}^{\infty} f_{X,Y}(x', y')\, dy' \right) dx' = \int_{-\infty}^{x} f_X(x')\, dx' = F_X(x). \]
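For discrete random variables, Equation (5.9) is just a double cumulative sum, and Proposition 1 corresponds to reading off the last column or row of the CDF table. A small sketch (assuming NumPy; the $3 \times 3$ table is ours):

```python
import numpy as np

# Joint PMF on a grid of support points; rows index x-values, columns y-values.
p_xy = np.array([[0.10, 0.05, 0.05],
                 [0.20, 0.25, 0.05],
                 [0.05, 0.05, 0.20]])

# Joint CDF F(x, y) = sum of p(x', y') over x' <= x and y' <= y:
# cumulative sums along both axes.
F_xy = p_xy.cumsum(axis=0).cumsum(axis=1)

print(F_xy[-1, -1])                 # 1.0: F(x_max, y_max) = 1
print(F_xy[:, -1])                  # last column = marginal CDF of X, F_X(x) = F(x, +inf)
print(p_xy.sum(axis=1).cumsum())    # same values computed from the marginal PMF
```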

By the fundamental theorem of calculus, we can derive the PDF from the CDF.

Definition 9. Let $F_{X,Y}(x,y)$ be the joint CDF of $X$ and $Y$. Then the joint PDF can be obtained through
\[ f_{X,Y}(x,y) = \frac{\partial^2}{\partial y\, \partial x} F_{X,Y}(x,y). \]
The order of the partial derivatives can be switched, yielding a symmetric result:
\[ f_{X,Y}(x,y) = \frac{\partial^2}{\partial x\, \partial y} F_{X,Y}(x,y). \]

5.3 Conditional PMF and PDF

Conditional PMF

Definition 10. Let $X$ and $Y$ be two discrete random variables. The conditional PMF of $X$ given $Y$ is
\[ p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)}. \tag{5.11} \]
By the definition of conditional probability, we can also write $p_{X|Y}(x \mid y) = P[X = x \mid Y = y]$, because
\[ p_{X|Y}(x \mid y) = \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{P[X = x \cap Y = y]}{P[Y = y]} = P[X = x \mid Y = y]. \]

It is important to understand the randomness exhibited in a conditional PMF. In $p_{X|Y}(x \mid y)$, the random variable $Y$ is fixed to a specific value $Y = y$. The randomness of $Y$ has been taken care of by the denominator $p_Y(y)$ in Equation (5.11); therefore, there is no randomness associated with $Y$. It is the variable $x$ in $p_{X|Y}(x \mid y)$ that describes the randomness. In particular, we have
\[ \sum_x p_{X|Y}(x \mid y) = \sum_x \frac{p_{X,Y}(x,y)}{p_Y(y)} = \frac{p_Y(y)}{p_Y(y)} = 1, \quad \text{but} \quad \sum_y p_{X|Y}(x \mid y) = \sum_y \frac{p_{X,Y}(x,y)}{p_Y(y)} \ne 1 \ \text{in general}. \]
Therefore, $p_{X|Y}(x \mid y)$ is a probability of $X$, not of $Y$.

Unlike a marginal PMF, which is a function of either $x$ or $y$ alone (e.g., $p_X(x)$ or $p_Y(y)$), a conditional PMF can be a function of both $x$ and $y$.

For example, $p_{X|Y}(x \mid y)$ is the conditional probability of the random variable $X$ taking the value $x$, given that $Y$ is fixed at the value $y$. Thus $p_{X|Y}(x \mid y)$ depends on both $x$ and $y$.

Example. Consider the joint PMF given in the following table. Find the conditional PMF $p_{X|Y}(x \mid 1)$ and the marginal PMF $p_X(x)$.

               Y = 1   Y = 2   Y = 3   Y = 4
      X = 1     1/20    1/20    1/20     0
      X = 2     1/20    2/20    2/20    1/20
      X = 3     1/20    3/20    3/20    1/20
      X = 4      0      1/20    1/20    1/20

Solution. To find the marginal PMF, we sum over all $y$ for every $x$:
\[ x = 1: \quad p_X(1) = \sum_{y=1}^{4} p_{X,Y}(1,y) = \tfrac{1}{20} + \tfrac{1}{20} + \tfrac{1}{20} + 0 = \tfrac{3}{20}, \]
\[ x = 2: \quad p_X(2) = \sum_{y=1}^{4} p_{X,Y}(2,y) = \tfrac{1}{20} + \tfrac{2}{20} + \tfrac{2}{20} + \tfrac{1}{20} = \tfrac{6}{20}, \]
\[ x = 3: \quad p_X(3) = \sum_{y=1}^{4} p_{X,Y}(3,y) = \tfrac{1}{20} + \tfrac{3}{20} + \tfrac{3}{20} + \tfrac{1}{20} = \tfrac{8}{20}, \]
\[ x = 4: \quad p_X(4) = \sum_{y=1}^{4} p_{X,Y}(4,y) = 0 + \tfrac{1}{20} + \tfrac{1}{20} + \tfrac{1}{20} = \tfrac{3}{20}. \]
Hence, the marginal PMF is
\[ p_X(x) = \begin{bmatrix} \tfrac{3}{20} & \tfrac{6}{20} & \tfrac{8}{20} & \tfrac{3}{20} \end{bmatrix}. \]
The conditional PMF $p_{X|Y}(x \mid 1)$ is
\[ p_{X|Y}(x \mid 1) = \frac{p_{X,Y}(x, 1)}{p_Y(1)} = \frac{\begin{bmatrix} \tfrac{1}{20} & \tfrac{1}{20} & \tfrac{1}{20} & 0 \end{bmatrix}}{3/20} = \begin{bmatrix} \tfrac{1}{3} & \tfrac{1}{3} & \tfrac{1}{3} & 0 \end{bmatrix}. \]
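The same bookkeeping can be done in a few lines of code (a sketch assuming NumPy; the table is the one reconstructed above):

```python
import numpy as np

# Joint PMF table from the example: rows are x = 1..4, columns are y = 1..4.
p_xy = np.array([[1, 1, 1, 0],
                 [1, 2, 2, 1],
                 [1, 3, 3, 1],
                 [0, 1, 1, 1]]) / 20

p_x = p_xy.sum(axis=1)          # marginal PMF of X: [0.15, 0.3, 0.4, 0.15]
p_y = p_xy.sum(axis=0)          # marginal PMF of Y

# Conditional PMF p_{X|Y}(x | y = 1): first column divided by p_Y(1).
p_x_given_y1 = p_xy[:, 0] / p_y[0]
print(p_x_given_y1)             # [1/3, 1/3, 1/3, 0]
print(p_x_given_y1.sum())       # 1.0 -- a conditional PMF is a valid PMF in x
```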

Example. Consider two random variables $X$ and $Y$ defined as follows:
\[ Y = \begin{cases} 10^2, & \text{with prob } 5/6, \\ 10^4, & \text{with prob } 1/6, \end{cases} \qquad X = \begin{cases} 10^{-4}\, Y, & \text{with prob } 1/2, \\ 10^{-3}\, Y, & \text{with prob } 1/3, \\ 10^{-2}\, Y, & \text{with prob } 1/6. \end{cases} \]
Find $p_{X|Y}(x \mid y)$, $p_X(x)$ and $p_{X,Y}(x,y)$.

Solution. Since $Y$ takes two different states, we can enumerate $Y = 10^2$ and $Y = 10^4$. This gives us
\[ p_{X|Y}(x \mid 10^2) = \begin{cases} 1/2, & \text{if } x = 0.01, \\ 1/3, & \text{if } x = 0.1, \\ 1/6, & \text{if } x = 1, \end{cases} \qquad p_{X|Y}(x \mid 10^4) = \begin{cases} 1/2, & \text{if } x = 1, \\ 1/3, & \text{if } x = 10, \\ 1/6, & \text{if } x = 100. \end{cases} \]
The joint PMF $p_{X,Y}(x,y)$ can be found as
\[ p_{X,Y}(x, 10^2) = p_{X|Y}(x \mid 10^2)\, p_Y(10^2) = \begin{cases} \left(\tfrac{1}{2}\right)\left(\tfrac{5}{6}\right), & x = 0.01, \\ \left(\tfrac{1}{3}\right)\left(\tfrac{5}{6}\right), & x = 0.1, \\ \left(\tfrac{1}{6}\right)\left(\tfrac{5}{6}\right), & x = 1, \end{cases} \qquad p_{X,Y}(x, 10^4) = p_{X|Y}(x \mid 10^4)\, p_Y(10^4) = \begin{cases} \left(\tfrac{1}{2}\right)\left(\tfrac{1}{6}\right), & x = 1, \\ \left(\tfrac{1}{3}\right)\left(\tfrac{1}{6}\right), & x = 10, \\ \left(\tfrac{1}{6}\right)\left(\tfrac{1}{6}\right), & x = 100. \end{cases} \]
Therefore, the joint PMF is given by the following table.

               x = 0.01   x = 0.1   x = 1    x = 10   x = 100
   y = 10^4       0          0       1/12     1/18     1/36
   y = 10^2      5/12       5/18     5/36      0        0

The marginal PMF $p_X(x)$ is
\[ p_X(x) = \sum_y p_{X,Y}(x,y) = \begin{bmatrix} \tfrac{5}{12} & \tfrac{5}{18} & \tfrac{2}{9} & \tfrac{1}{18} & \tfrac{1}{36} \end{bmatrix}. \]

Conditional PDF

Definition 11. Let $X$ and $Y$ be two continuous random variables. The conditional PDF of $X$ given $Y$ is
\[ f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}. \tag{5.12} \]

Example. Let $X$ and $Y$ be two continuous random variables with joint PDF
\[ f_{X,Y}(x,y) = \begin{cases} 2 e^{-x} e^{-y}, & 0 \le y \le x < \infty, \\ 0, & \text{otherwise.} \end{cases} \]
Find the conditional PDFs $f_{X|Y}(x \mid y)$ and $f_{Y|X}(y \mid x)$.

Solution. In order to find the conditional PDFs, we first find the marginal PDFs:
\[ f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dy = \int_0^x 2 e^{-x} e^{-y}\, dy = 2 e^{-x} (1 - e^{-x}), \]
\[ f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x,y)\, dx = \int_y^{\infty} 2 e^{-x} e^{-y}\, dx = 2 e^{-2y}. \]
Therefore, the conditional PDFs are
\[ f_{X|Y}(x \mid y) = \frac{f_{X,Y}(x,y)}{f_Y(y)} = \frac{2 e^{-x} e^{-y}}{2 e^{-2y}} = e^{-(x-y)}, \quad x \ge y, \]
\[ f_{Y|X}(y \mid x) = \frac{f_{X,Y}(x,y)}{f_X(x)} = \frac{2 e^{-x} e^{-y}}{2 e^{-x}(1 - e^{-x})} = \frac{e^{-y}}{1 - e^{-x}}, \quad 0 \le y \le x. \]

Example. This example considers a classical detection problem. Let $X$ be a random bit such that
\[ X = \begin{cases} +1, & \text{with prob } 1/2, \\ -1, & \text{with prob } 1/2. \end{cases} \]
Suppose that $X$ is transmitted over a noisy channel so that the observed signal is
\[ Y = X + N, \]
where $N \sim \mathcal{N}(0, 1)$ is noise which is independent of the signal $X$. Given that we observe $Y > 0$, is the signal more likely to be $X = +1$ or $X = -1$?

Solution. First of all, we know that
\[ f_{Y|X}(y \mid +1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-1)^2}{2}}, \qquad f_{Y|X}(y \mid -1) = \frac{1}{\sqrt{2\pi}} e^{-\frac{(y+1)^2}{2}}. \]
Therefore, given $Y > 0$, we need to find $P[X = +1 \mid Y > 0]$. It holds that
\[ P[X = +1 \mid Y > 0] = \frac{P[Y > 0 \mid X = +1]\, P[X = +1]}{P[Y > 0]}, \]
where
\[ P[Y > 0 \mid X = +1] = \int_0^{\infty} \frac{1}{\sqrt{2\pi}} e^{-\frac{(y-1)^2}{2}}\, dy = 1 - \Phi\!\left(\frac{0 - 1}{1}\right) = 1 - \Phi(-1). \]

Similarly, we have
\[ P[Y > 0 \mid X = -1] = 1 - \Phi(+1). \]
By the law of total probability, we have that
\[ P[Y > 0] = P[Y > 0 \mid X = +1]\, P[X = +1] + P[Y > 0 \mid X = -1]\, P[X = -1] = \tfrac{1}{2}\big(2 - \Phi(+1) - \Phi(-1)\big) = \tfrac{1}{2}, \]
because $\Phi(+1) + \Phi(-1) = \Phi(+1) + 1 - \Phi(+1) = 1$. Therefore,
\[ P[X = +1 \mid Y > 0] = \frac{\big(1 - \Phi(-1)\big)\left(\tfrac{1}{2}\right)}{\tfrac{1}{2}} = 1 - \Phi(-1) \approx 0.8413. \]
The implication is that if $Y > 0$, the posterior probability is $P[X = +1 \mid Y > 0] \approx 0.8413$. The complement of this result gives $P[X = -1 \mid Y > 0] = 1 - 0.8413 = 0.1587$. Therefore, $X = +1$ is more likely.
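A quick way to sanity-check the 0.8413 figure is to evaluate the same Bayes computation numerically and to simulate the channel (a sketch assuming NumPy and SciPy are available):

```python
import numpy as np
from scipy.stats import norm

# Analytic posterior: P[X=+1 | Y>0] = P[Y>0 | X=+1] P[X=+1] / P[Y>0].
p_pos_given_plus = 1 - norm.cdf(0, loc=+1, scale=1)   # = 1 - Phi(-1)
p_pos_given_minus = 1 - norm.cdf(0, loc=-1, scale=1)  # = 1 - Phi(+1)
p_pos = 0.5 * p_pos_given_plus + 0.5 * p_pos_given_minus
print(p_pos_given_plus * 0.5 / p_pos)                 # ~0.8413

# Monte Carlo check of the same quantity.
rng = np.random.default_rng(0)
x = rng.choice([+1, -1], size=1_000_000)
y = x + rng.standard_normal(x.size)
print(np.mean(x[y > 0] == +1))                        # ~0.84
```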

5.4 Joint Expectation, Moment, and Covariance

Joint Expectation and Joint Moment

Definition 12. Let $X$ and $Y$ be two random variables. The joint expectation is
\[ E[XY] = \sum_y \sum_x x\, y\, p_{X,Y}(x,y) \tag{5.13} \]
if $X$ and $Y$ are discrete, or
\[ E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x\, y\, f_{X,Y}(x,y)\, dx\, dy \tag{5.14} \]
if $X$ and $Y$ are continuous. Joint expectation is also called correlation.

Theorem 2. If $X$ and $Y$ are independent, then
\[ E[XY] = E[X]\, E[Y]. \tag{5.15} \]

Proof. We only prove the discrete case because the continuous case can be proved similarly. If $X$ and $Y$ are independent, we have $p_{X,Y}(x,y) = p_X(x)\, p_Y(y)$. Therefore,
\[ E[XY] = \sum_y \sum_x x\, y\, p_{X,Y}(x,y) = \sum_y \sum_x x\, y\, p_X(x)\, p_Y(y) = \left( \sum_x x\, p_X(x) \right) \left( \sum_y y\, p_Y(y) \right) = E[X]\, E[Y]. \]

In general, for any two independent random variables and any two functions $f$ and $g$, it holds that
\[ E[f(X)\, g(Y)] = E[f(X)]\, E[g(Y)]. \tag{5.16} \]
Of particular interest are the functions $f(X) = X^k$ and $g(Y) = Y^l$, which give the definition of joint moments.

Definition 13. Let $X$ and $Y$ be two random variables. The joint moment is
\[ E[X^k Y^l] = \sum_y \sum_x x^k y^l\, p_{X,Y}(x,y) \tag{5.17} \]
if $X$ and $Y$ are discrete, or
\[ E[X^k Y^l] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} x^k y^l\, f_{X,Y}(x,y)\, dx\, dy \tag{5.18} \]
if $X$ and $Y$ are continuous.

Covariance

The concept of covariance can be considered as a generalization of the concept of variance. Instead of measuring $(X - \mu_X)^2$, the covariance of two random variables measures $(X - \mu_X)(Y - \mu_Y)$. Thus, while the variance is always non-negative, a covariance can be negative.

Definition 14. Let $X$ and $Y$ be two random variables. The covariance is
\[ \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)], \tag{5.19} \]
where $\mu_X = E[X]$ and $\mu_Y = E[Y]$.

The following theorem illustrates a few important properties of the covariance.

Theorem 3. The following results hold:
(a) $\mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y]$;
(b) $X$ and $Y$ are independent $\Rightarrow$ $\mathrm{Cov}(X, Y) = 0$;
(c) $\mathrm{Cov}(X, Y) = 0$ does not imply that $X$ and $Y$ are independent.

Remark: If $Y = X$, then $\mathrm{Cov}(X, Y) = E[X^2] - E[X]^2 = \mathrm{Var}[X]$.

Proof. The proof of part (a) is straightforward:
\[ \mathrm{Cov}(X, Y) = E[(X - \mu_X)(Y - \mu_Y)] = E[XY - X\mu_Y - Y\mu_X + \mu_X \mu_Y] = E[XY] - \mu_X \mu_Y. \]

The proof of part (b) follows from Equation (5.15). If $X$ and $Y$ are independent, then $E[XY] = E[X]E[Y]$. In this case,
\[ \mathrm{Cov}(X, Y) = E[XY] - E[X]E[Y] = E[X]E[Y] - E[X]E[Y] = 0. \]
The proof of part (c) requires a counterexample. Consider a discrete random variable $Z$ with PMF
\[ p_Z(z) = \begin{bmatrix} \tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{4} & \tfrac{1}{4} \end{bmatrix} \quad \text{on } z = 0, 1, 2, 3. \]
Let $X = \cos\frac{\pi Z}{2}$ and $Y = \sin\frac{\pi Z}{2}$. Then we can show that $E[X] = 0$ and $E[Y] = 0$. The covariance is
\[ \mathrm{Cov}(X, Y) = E[(X - 0)(Y - 0)] = E\left[\cos\frac{\pi Z}{2} \sin\frac{\pi Z}{2}\right] = \frac{1}{2} E[\sin \pi Z] = \frac{1}{2}\left[ \frac{\sin \pi \cdot 0}{4} + \frac{\sin \pi \cdot 1}{4} + \frac{\sin \pi \cdot 2}{4} + \frac{\sin \pi \cdot 3}{4} \right] = 0. \]
Our next goal is to show that $X$ and $Y$ are dependent. To this end, we only need to show that $p_{X,Y}(x,y) \ne p_X(x)\, p_Y(y)$. The joint PMF $p_{X,Y}(x,y)$ can be found by noting that
\[ Z = 0 \Rightarrow X = 1,\ Y = 0; \quad Z = 1 \Rightarrow X = 0,\ Y = 1; \quad Z = 2 \Rightarrow X = -1,\ Y = 0; \quad Z = 3 \Rightarrow X = 0,\ Y = -1. \]
Thus, the PMF is
\[ p_{X,Y}(x,y) = \begin{bmatrix} 0 & \tfrac{1}{4} & 0 \\ \tfrac{1}{4} & 0 & \tfrac{1}{4} \\ 0 & \tfrac{1}{4} & 0 \end{bmatrix}, \]
where the rows correspond to $y = 1, 0, -1$ and the columns to $x = -1, 0, 1$. The marginal PMFs are
\[ p_X(x) = \begin{bmatrix} \tfrac{1}{4} & \tfrac{1}{2} & \tfrac{1}{4} \end{bmatrix}, \qquad p_Y(y) = \begin{bmatrix} \tfrac{1}{4} & \tfrac{1}{2} & \tfrac{1}{4} \end{bmatrix}. \]
The product $p_X(x)\, p_Y(y)$ is
\[ p_X(x)\, p_Y(y) = \begin{bmatrix} \tfrac{1}{16} & \tfrac{1}{8} & \tfrac{1}{16} \\ \tfrac{1}{8} & \tfrac{1}{4} & \tfrac{1}{8} \\ \tfrac{1}{16} & \tfrac{1}{8} & \tfrac{1}{16} \end{bmatrix}. \]
Therefore, $p_{X,Y}(x,y) \ne p_X(x)\, p_Y(y)$, although $E[XY] = E[X]E[Y]$.
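The counterexample is easy to reproduce by direct enumeration (a sketch assuming NumPy; variable names are ours):

```python
import numpy as np

# Z is uniform on {0, 1, 2, 3}; X = cos(pi Z / 2), Y = sin(pi Z / 2).
z = np.array([0, 1, 2, 3])
p_z = np.full(4, 0.25)
x = np.cos(np.pi * z / 2)          # [ 1,  0, -1,  0]
y = np.sin(np.pi * z / 2)          # [ 0,  1,  0, -1]

E_X = np.sum(x * p_z)              # 0
E_Y = np.sum(y * p_z)              # 0
cov = np.sum(x * y * p_z) - E_X * E_Y
print(cov)                         # 0.0 (up to floating-point rounding)

# Yet X and Y are dependent: P[X=1, Y=1] = 0 while P[X=1] P[Y=1] = 1/16.
print(np.sum(p_z[(x == 1) & (y == 1)]), np.sum(p_z[x == 1]) * np.sum(p_z[y == 1]))
```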

The next theorem applies to random variables that are not necessarily independent.

Theorem 4. For any $X$ and $Y$ (not necessarily independent),
(a) $E[X + Y] = E[X] + E[Y]$;
(b) $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + 2\,\mathrm{Cov}(X, Y) + \mathrm{Var}[Y]$.

Of course, if $X$ and $Y$ are independent, then $\mathrm{Cov}(X, Y) = 0$ and hence $\mathrm{Var}[X + Y] = \mathrm{Var}[X] + \mathrm{Var}[Y]$.

Proof of (a). Recall the definition of joint expectation:
\[ E[X + Y] = \sum_y \sum_x (x + y)\, p_{X,Y}(x,y) = \sum_y \sum_x x\, p_{X,Y}(x,y) + \sum_y \sum_x y\, p_{X,Y}(x,y) \]
\[ = \sum_x x \left( \sum_y p_{X,Y}(x,y) \right) + \sum_y y \left( \sum_x p_{X,Y}(x,y) \right) = \sum_x x\, p_X(x) + \sum_y y\, p_Y(y) = E[X] + E[Y]. \]

Proof of (b).
\[ \mathrm{Var}[X + Y] = E[(X + Y)^2] - E[X + Y]^2 = E[(X + Y)^2] - (\mu_X + \mu_Y)^2 = E[X^2 + 2XY + Y^2] - (\mu_X^2 + 2\mu_X\mu_Y + \mu_Y^2) \]
\[ = \big(E[X^2] - \mu_X^2\big) + \big(E[Y^2] - \mu_Y^2\big) + 2\big(E[XY] - \mu_X\mu_Y\big) = \mathrm{Var}[X] + 2\,\mathrm{Cov}(X, Y) + \mathrm{Var}[Y]. \]

Correlation Coefficient

Definition 15. Let $X$ and $Y$ be two random variables. The correlation coefficient is
\[ \rho = \frac{\mathrm{Cov}(X, Y)}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}}. \tag{5.20} \]
The correlation coefficient provides a convenient way of assessing the relationship between two random variables. The following theorem outlines its properties.

Theorem 5. The correlation coefficient $\rho$ has the following properties:
- When $X = Y$ (fully correlated), $\rho = +1$.
- When $X = -Y$ (negatively correlated), $\rho = -1$.
- When $X$ and $Y$ are independent, $\rho = 0$. However, $\rho = 0$ does not imply that $X$ and $Y$ are independent.

Proof. When $X = Y$,
\[ \rho = \frac{\mathrm{Var}[X]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[X]}} = 1. \]
When $X = -Y$,
\[ \rho = \frac{E[X(-X)] - E[X]E[-X]}{\sqrt{\mathrm{Var}[X]\,\mathrm{Var}[-X]}} = \frac{-\mathrm{Var}[X]}{\mathrm{Var}[X]} = -1. \]
When $X$ and $Y$ are independent, $\mathrm{Cov}(X, Y) = 0$, and hence $\rho = 0$. A counterexample for the converse can be found in Theorem 3(c).

In general, a correlation coefficient is always bounded between $-1$ and $+1$.

Theorem 6. The correlation coefficient always satisfies
\[ -1 \le \rho \le 1. \tag{5.21} \]

Proof. We prove this result by the Cauchy-Schwarz inequality, which states that $E[XY]^2 \le E[X^2]\, E[Y^2]$. Therefore, we have
\[ \mathrm{Cov}(X, Y)^2 = E[(X - \mu_X)(Y - \mu_Y)]^2 \le E[(X - \mu_X)^2]\, E[(Y - \mu_Y)^2] = \mathrm{Var}[X]\,\mathrm{Var}[Y]. \]
Hence, $|\mathrm{Cov}(X, Y)| \le \sqrt{\mathrm{Var}[X]\,\mathrm{Var}[Y]}$, and the result follows.
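The bound can be observed empirically. A minimal sketch (assuming NumPy; the linear model $Y = 2X + \text{noise}$ is ours, chosen only for illustration) estimates $\rho$ from samples:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(100_000)

for noise_scale in [0.0, 1.0, 10.0]:
    y = 2 * x + noise_scale * rng.standard_normal(x.size)
    # Sample correlation coefficient: Cov(X, Y) / sqrt(Var[X] Var[Y]).
    rho = np.cov(x, y)[0, 1] / np.sqrt(np.var(x, ddof=1) * np.var(y, ddof=1))
    print(noise_scale, rho)   # always within [-1, 1]; close to 1 for small noise
```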

5.5 Conditional Expectation

When dealing with two dependent random variables, sometimes we would like to determine the expectation of one random variable when the second random variable takes a particular state. The conditional expectation is a formal way of doing so.

Definition 16. The conditional expectation of $X$ given $Y = y$ is
\[ E[X \mid Y = y] = \sum_x x\, p_{X|Y}(x \mid y) \tag{5.22} \]
for discrete random variables, and
\[ E[X \mid Y = y] = \int_{-\infty}^{\infty} x\, f_{X|Y}(x \mid y)\, dx \tag{5.23} \]
for continuous random variables.

There are a few points to note here:
- In $E[X \mid Y = y]$, the expectation is taken over $X$. In other words, we are exploring the randomness of $X$. To evaluate the conditional expectation, the PDF is $f_{X|Y}(x \mid y)$.
- The random variable $Y$ is fixed at $Y = y$. Thus, there is no randomness associated with $Y$.
- The resulting object $E[X \mid Y = y]$ is a function of $y$, because the random variable $X$ has been eliminated by the expectation.
- Conditional expectation is meaningful only when $X$ and $Y$ are dependent. If $X$ and $Y$ are independent, then $f_{X|Y}(x \mid y) = f_X(x)$ and so $E[X \mid Y = y] = E[X]$. That is, the conditional expectation does not really depend on $y$.
- If we do not specify a particular value that $Y$ takes, then we write $E[X \mid Y]$, which is a random variable in $Y$.

One of the most useful results about conditional expectation is the following theorem.

Theorem 7 (Law of Total Expectation).
\[ E[X] = \sum_y E[X \mid Y = y]\, p_Y(y), \quad \text{or} \quad E[X] = \int_{-\infty}^{\infty} E[X \mid Y = y]\, f_Y(y)\, dy. \tag{5.24} \]

Proof. We only prove the discrete case, as the continuous case can be proved by replacing summation with integration:
\[ E[X] = \sum_x x\, p_X(x) = \sum_x x \left( \sum_y p_{X,Y}(x,y) \right) = \sum_x \sum_y x\, p_{X|Y}(x \mid y)\, p_Y(y) = \sum_y \left( \sum_x x\, p_{X|Y}(x \mid y) \right) p_Y(y) = \sum_y E[X \mid Y = y]\, p_Y(y). \]

Corollary 1. Let $X$ and $Y$ be two random variables. Then
\[ E[X] = E\big[ E[X \mid Y] \big]. \tag{5.25} \]

Proof. The previous theorem states that $E[X] = \sum_y E[X \mid Y = y]\, p_Y(y)$. If we treat $E[X \mid Y = y]$ as a function of $y$, say $h(y)$, then
\[ E[X] = \sum_y E[X \mid Y = y]\, p_Y(y) = \sum_y h(y)\, p_Y(y) = E[h(Y)] = E\big[ E[X \mid Y] \big]. \]

Remark: To be slightly more precise, the two expectations in Equation (5.25) are
\[ E[X] = E_Y\big[ E_{X|Y}[X \mid Y] \big], \]
i.e., the inner expectation is taken over $f_{X|Y}$, whereas the outer expectation is taken over $f_Y$.

Example. Consider the joint PMF given by the following table. Find $E[X \mid Y = 10^2]$ and $E[X \mid Y = 10^4]$.

               x = 0.01   x = 0.1   x = 1    x = 10   x = 100
   y = 10^4       0          0       1/12     1/18     1/36
   y = 10^2      5/12       5/18     5/36      0        0

Solution. To find the conditional expectations, we first need the conditional PMFs:
\[ p_{X|Y}(x \mid 10^2) = \begin{bmatrix} \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{6} & 0 & 0 \end{bmatrix}, \qquad p_{X|Y}(x \mid 10^4) = \begin{bmatrix} 0 & 0 & \tfrac{1}{2} & \tfrac{1}{3} & \tfrac{1}{6} \end{bmatrix}. \]

Therefore, the conditional expectations are
\[ E[X \mid Y = 10^2] = (0.01)\left(\tfrac{1}{2}\right) + (0.1)\left(\tfrac{1}{3}\right) + (1)\left(\tfrac{1}{6}\right) = \tfrac{123}{600}, \]
\[ E[X \mid Y = 10^4] = (1)\left(\tfrac{1}{2}\right) + (10)\left(\tfrac{1}{3}\right) + (100)\left(\tfrac{1}{6}\right) = \tfrac{123}{6}. \]
From the conditional expectations we can also find $E[X]$:
\[ E[X] = E[X \mid Y = 10^2]\, p_Y(10^2) + E[X \mid Y = 10^4]\, p_Y(10^4) = \left(\tfrac{123}{600}\right)\left(\tfrac{5}{6}\right) + \left(\tfrac{123}{6}\right)\left(\tfrac{1}{6}\right) = 3.5875. \]
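The same computation takes only a few lines in code (assuming NumPy; the support points and table are those of the example above):

```python
import numpy as np

x_vals = np.array([0.01, 0.1, 1.0, 10.0, 100.0])
p_y = {1e2: 5 / 6, 1e4: 1 / 6}
p_x_given_y = {1e2: np.array([1/2, 1/3, 1/6, 0, 0]),
               1e4: np.array([0, 0, 1/2, 1/3, 1/6])}

# Conditional expectations E[X | Y = y] and the law of total expectation.
E_x_given_y = {y: np.sum(x_vals * p) for y, p in p_x_given_y.items()}
E_x = sum(E_x_given_y[y] * p_y[y] for y in p_y)
print(E_x_given_y)    # roughly {100.0: 0.205, 10000.0: 20.5}
print(E_x)            # 3.5875
```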

Example. Consider two random variables $X$ and $Y$. The random variable $X$ is Gaussian distributed with $X \sim \mathcal{N}(\mu, \sigma^2)$. The random variable $Y$ has the conditional distribution $Y \mid X \sim \mathcal{N}(X, X^2)$. Find $E[Y]$.

Solution. We know that the two PDFs are
\[ f_X(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}, \qquad f_{Y|X}(y \mid x) = \frac{1}{\sqrt{2\pi x^2}}\, e^{-\frac{(y-x)^2}{2x^2}}. \]
The conditional expectation of $Y$ given $X$ is
\[ E[Y \mid X = x] = \int_{-\infty}^{\infty} y\, f_{Y|X}(y \mid x)\, dy = \int_{-\infty}^{\infty} y\, \frac{1}{\sqrt{2\pi x^2}}\, e^{-\frac{(y-x)^2}{2x^2}}\, dy = x. \]
The last equality holds because we are computing the expectation of a Gaussian random variable with mean $x$. Finally, applying the law of total expectation, we can show that
\[ E[Y] = \int_{-\infty}^{\infty} E[Y \mid X = x]\, f_X(x)\, dx = \int_{-\infty}^{\infty} x\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx = \mu. \]

Application: MMSE Estimator (Optional). Consider a pair of random variables $(X, Y)$. We observe this pair of random variables. Can we determine the relationship between them? That is, can we design a function $g$ such that we minimize the error
\[ \min_g\ E[(Y - g(X))^2]? \]
We may assume that we know the distributions $f_X(x)$, $f_Y(y)$, and $f_{Y|X}(y \mid x)$. The solution to this problem is called the minimum mean squared error (MMSE) estimator.

Theorem 8. The MMSE estimator is the function $g$ which minimizes the mean squared error,
\[ g^* = \operatorname*{argmin}_g\ E[(Y - g(X))^2], \]
and it is given by
\[ g^*(x) = E[Y \mid X = x]. \tag{5.26} \]

Proof. By the law of total expectation, we have
\[ E[(Y - g(X))^2] = \int_{-\infty}^{\infty} E[(Y - g(X))^2 \mid X = x]\, f_X(x)\, dx. \]
Since all terms in this integration are non-negative, we can minimize the overall error by minimizing the inner expectation $E[(Y - g(X))^2 \mid X = x]$ for every $x$. When conditioned on $X = x$, the quantity $g(x)$ is a fixed number independent of $Y$. Therefore, we can treat $g(x) = c$ for some constant $c$ and determine $c$. This means that we want to find $c$ to minimize
\[ c^* = \operatorname*{argmin}_c\ E[(Y - c)^2 \mid X = x] = \operatorname*{argmin}_c\ \int_{-\infty}^{\infty} (y - c)^2\, f_{Y|X}(y \mid x)\, dy. \]
Taking the derivative with respect to $c$ and setting it to zero yields
\[ \frac{d}{dc} \left( \int_{-\infty}^{\infty} (y - c)^2\, f_{Y|X}(y \mid x)\, dy \right) = 0 \quad \Longrightarrow \quad \int_{-\infty}^{\infty} (y - c)\, f_{Y|X}(y \mid x)\, dy = 0, \]
which implies that
\[ c^* = \int_{-\infty}^{\infty} y\, f_{Y|X}(y \mid x)\, dy = E[Y \mid X = x]. \]
Therefore, the inner expectation is minimized when $g(x) = E[Y \mid X = x]$.
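To see the theorem in action, here is a small Monte Carlo sketch (assuming NumPy; the joint model $Y = X^2 + \text{noise}$ is ours, chosen so that the conditional mean is known in closed form). The conditional-mean estimator achieves a lower mean squared error than a competing constant guess:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.uniform(-1, 1, n)
y = x**2 + 0.1 * rng.standard_normal(n)     # E[Y | X = x] = x^2

mse_conditional_mean = np.mean((y - x**2) ** 2)   # g*(x) = E[Y | X = x]
mse_constant_guess = np.mean((y - 0.5) ** 2)      # any other g, e.g. c = 0.5, does worse

print(mse_conditional_mean)   # ~0.01 (the noise variance)
print(mse_constant_guess)     # noticeably larger
```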

5.6 Sum of Two Random Variables

One typical problem we encounter in engineering is the following: given two random variables $X$ and $Y$, what is the PDF of the sum $X + Y$? Such a problem arises naturally when we want to evaluate the average of a number of random variables, e.g., the sample mean of a collection of data points. In this section we discuss a general principle for determining the PDF of a sum of two random variables.

To start with, we consider two random variables $X$ and $Y$ with PDFs $f_X(x)$ and $f_Y(y)$, respectively. Let us define the sum as $Z = X + Y$. Our goal is to determine the PDF of $Z$.

Theorem 9. Let $X$ and $Y$ be two independent random variables with PDFs $f_X(x)$ and $f_Y(y)$, respectively. Let $Z = X + Y$. The PDF of $Z$ is given by
\[ f_Z(z) = (f_X * f_Y)(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy, \tag{5.27} \]
where $*$ denotes convolution.

Proof. Let us start by analyzing the CDF of $Z$. The CDF of $Z$ is
\[ F_Z(z) = P[Z \le z] = \int_{-\infty}^{\infty} \int_{-\infty}^{z - y} f_X(x)\, f_Y(y)\, dx\, dy, \]
where the integration limits can be seen from Figure 5.4. Then, by the fundamental theorem of calculus, we can show that
\[ f_Z(z) = \frac{d}{dz} F_Z(z) = \int_{-\infty}^{\infty} \left( \frac{d}{dz} \int_{-\infty}^{z - y} f_X(x)\, dx \right) f_Y(y)\, dy = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy = (f_X * f_Y)(z). \]

The result of this derivation shows that the PDF of $X + Y$ is the convolution of $f_X(x)$ and $f_Y(y)$. The following example illustrates how we can compute the convolution.

Example. Let $X$ and $Y$ be independent, and let
\[ f_X(x) = \begin{cases} x e^{-x}, & x \ge 0, \\ 0, & x < 0, \end{cases} \qquad f_Y(y) = \begin{cases} y e^{-y}, & y \ge 0, \\ 0, & y < 0. \end{cases} \]
Find the PDF of $Z = X + Y$.

Figure 5.4: The shaded region highlights the set $\{X + Y \le z\}$.

Solution. Using the result derived above, we see that
\[ f_Z(z) = \int_{-\infty}^{\infty} f_X(z - y)\, f_Y(y)\, dy = \int_{0}^{z} f_X(z - y)\, f_Y(y)\, dy, \]
where the upper limit $z$ comes from the fact that $x \ge 0$: since $Z = X + Y$, we must have $Z - Y = X \ge 0$ and so $Y \le Z$. The lower limit $0$ comes from $y \ge 0$. Substituting the PDFs into the integration yields
\[ f_Z(z) = \int_0^z (z - y)\, e^{-(z-y)}\, y\, e^{-y}\, dy = e^{-z} \int_0^z (z - y)\, y\, dy = \frac{z^3}{6}\, e^{-z}, \quad z \ge 0. \]
For $z < 0$, $f_Z(z) = 0$.
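The convolution can also be checked numerically: discretize both PDFs, convolve them with a Riemann-sum scaling, and compare with the closed form $z^3 e^{-z}/6$ (a sketch assuming NumPy):

```python
import numpy as np

dz = 0.001
t = np.arange(0, 30, dz)
f_x = t * np.exp(-t)                  # f_X(x) = x e^{-x}, x >= 0
f_y = t * np.exp(-t)                  # f_Y(y) = y e^{-y}, y >= 0

# Discrete convolution approximates the convolution integral when scaled
# by the grid step dz.
f_z = np.convolve(f_x, f_y)[: t.size] * dz

f_z_exact = t**3 * np.exp(-t) / 6
print(np.max(np.abs(f_z - f_z_exact)))   # small discretization error
```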

In general, a function of two random variables is not limited to their sum. The following example illustrates the case of a product of two random variables.

Example. Let $X$ and $Y$ be two independent random variables such that
\[ f_X(x) = \begin{cases} 2x, & \text{if } 0 \le x \le 1, \\ 0, & \text{otherwise,} \end{cases} \qquad f_Y(y) = \begin{cases} 1, & \text{if } 0 \le y \le 1, \\ 0, & \text{otherwise.} \end{cases} \]
Let $Z = XY$. Find $f_Z(z)$.

Solution. The CDF of $Z$ can be evaluated as
\[ F_Z(z) = P[Z \le z] = P[XY \le z] = \int_{-\infty}^{\infty} \int_{-\infty}^{z/y} f_X(x)\, f_Y(y)\, dx\, dy. \]
Taking the derivative yields
\[ f_Z(z) = \frac{d}{dz} F_Z(z) = \frac{d}{dz} \int_{-\infty}^{\infty} \int_{-\infty}^{z/y} f_X(x)\, f_Y(y)\, dx\, dy \stackrel{(a)}{=} \int_{-\infty}^{\infty} \frac{1}{y}\, f_X\!\left(\frac{z}{y}\right) f_Y(y)\, dy, \]
where (a) holds by the fundamental theorem of calculus. The upper and lower limits of this integration can be determined by noting that $x = z/y \le 1$ implies $y \ge z$, and since $y \le 1$ we have $z \le y \le 1$. Therefore, for $0 \le z \le 1$ the PDF is
\[ f_Z(z) = \int_z^1 \frac{1}{y}\, f_X\!\left(\frac{z}{y}\right) f_Y(y)\, dy = \int_z^1 \frac{2z}{y^2}\, dy = 2(1 - z), \]
and $f_Z(z) = 0$ otherwise.

5.7 Two-dimensional Gaussian

Covariance Matrix and Joint Gaussian PDF

Among the many joint distributions, the joint Gaussian is of particular interest because of its usefulness. To define a joint Gaussian distribution, we first define a few notations:
\[ \boldsymbol{X} = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \qquad \boldsymbol{\mu} = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \qquad \boldsymbol{\Sigma} = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) \end{bmatrix}. \]
The vector $\boldsymbol{\mu}$ is called the mean vector, and the matrix $\boldsymbol{\Sigma}$ is called the covariance matrix. It is not difficult to show that the covariance matrix can be defined in the following way.

Theorem 10. The covariance matrix $\boldsymbol{\Sigma}$ is equivalent to
\[ \boldsymbol{\Sigma} = E[(\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^T]. \tag{5.28} \]

Proof. For a two-dimensional random vector, the theorem holds because
\[ E[(\boldsymbol{X} - \boldsymbol{\mu})(\boldsymbol{X} - \boldsymbol{\mu})^T] = E\left[ \begin{bmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{bmatrix} \begin{bmatrix} X_1 - \mu_1 & X_2 - \mu_2 \end{bmatrix} \right] = E \begin{bmatrix} (X_1 - \mu_1)^2 & (X_1 - \mu_1)(X_2 - \mu_2) \\ (X_2 - \mu_2)(X_1 - \mu_1) & (X_2 - \mu_2)^2 \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) \end{bmatrix}. \]

Clearly, the definition can be extended to random vectors of any finite dimension. We can also prove the following property of the covariance matrix.

Theorem 11. The covariance matrix $\boldsymbol{\Sigma}$ is symmetric positive semi-definite, i.e.,
\[ \boldsymbol{\Sigma}^T = \boldsymbol{\Sigma}, \qquad \text{and} \qquad \boldsymbol{v}^T \boldsymbol{\Sigma} \boldsymbol{v} \ge 0 \ \ \forall\, \boldsymbol{v} \in \mathbb{R}^d. \]

Proof. Symmetry is immediate from the definition, because $\mathrm{Cov}(X_i, X_j) = \mathrm{Cov}(X_j, X_i)$. The positive semi-definiteness comes from the fact that
\[ \boldsymbol{v}^T \boldsymbol{\Sigma} \boldsymbol{v} = \boldsymbol{v}^T E[(\boldsymbol{X} - \boldsymbol{\mu}_X)(\boldsymbol{X} - \boldsymbol{\mu}_X)^T] \boldsymbol{v} = E[\boldsymbol{v}^T (\boldsymbol{X} - \boldsymbol{\mu}_X)(\boldsymbol{X} - \boldsymbol{\mu}_X)^T \boldsymbol{v}] = E[u^T u] = E[\|u\|^2] \ge 0, \]
where we let $u = (\boldsymbol{X} - \boldsymbol{\mu}_X)^T \boldsymbol{v}$.

With these tools in hand, we can now define a joint Gaussian. The PDF of a multi-dimensional Gaussian is given by the following definition.

Definition 17. A $d$-dimensional joint Gaussian has the PDF
\[ f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}|}} \exp\left\{ -\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) \right\}, \tag{5.29} \]
where $d$ denotes the dimensionality of the vector $\boldsymbol{x}$. In this course, we are mostly interested in the case $d = 2$.

As a special case, if we assume that $X_1$ and $X_2$ are independent, then we can show the following result.

Theorem 12. Let $\boldsymbol{x} = [x_1, x_2]^T$. If $X_1$ and $X_2$ are independent, then
\[ f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{2\pi}\,\sigma_1} \exp\left\{ -\frac{(x_1 - \mu_1)^2}{2\sigma_1^2} \right\} \cdot \frac{1}{\sqrt{2\pi}\,\sigma_2} \exp\left\{ -\frac{(x_2 - \mu_2)^2}{2\sigma_2^2} \right\}, \tag{5.30} \]
i.e., the product of two 1D Gaussians.

Proof. To show this result, we note that if $X_1$ and $X_2$ are independent, then
\[ \boldsymbol{\Sigma} = \begin{bmatrix} \mathrm{Var}(X_1) & \mathrm{Cov}(X_1, X_2) \\ \mathrm{Cov}(X_2, X_1) & \mathrm{Var}(X_2) \end{bmatrix} = \begin{bmatrix} \mathrm{Var}(X_1) & 0 \\ 0 & \mathrm{Var}(X_2) \end{bmatrix} = \begin{bmatrix} \sigma_1^2 & 0 \\ 0 & \sigma_2^2 \end{bmatrix}. \]
The determinant is $|\boldsymbol{\Sigma}| = \sigma_1^2 \sigma_2^2$. Therefore,
\[ (\boldsymbol{x} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}) = \begin{bmatrix} x_1 - \mu_1 & x_2 - \mu_2 \end{bmatrix} \begin{bmatrix} \sigma_1^{-2} & 0 \\ 0 & \sigma_2^{-2} \end{bmatrix} \begin{bmatrix} x_1 - \mu_1 \\ x_2 - \mu_2 \end{bmatrix} = \frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2}. \]
Substituting these results into Equation (5.29) yields the desired result.

Geometric Interpretation

Geometrically, the mean $\boldsymbol{\mu}$ and the covariance matrix $\boldsymbol{\Sigma}$ can be interpreted as the center and the "radius" of the ellipse representing the Gaussian. Figure 5.5 illustrates three examples. As one can observe in these examples, the mean vector $\boldsymbol{\mu}$ controls the center of the Gaussian, while the radius and orientation of the Gaussian are controlled by the covariance matrix.

Figure 5.5: The center and the radius of the ellipse are determined by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.
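A small numerical sketch of Theorem 12 (assuming NumPy; the values $\mu = (1, -2)$, $\sigma_1 = 0.5$, $\sigma_2 = 2$ are ours) evaluates the 2D Gaussian PDF with a diagonal covariance and compares it with the product of two 1D Gaussians:

```python
import numpy as np

mu = np.array([1.0, -2.0])
sigma1, sigma2 = 0.5, 2.0
Sigma = np.diag([sigma1**2, sigma2**2])

def gauss2d(x, mu, Sigma):
    """Evaluate the 2D Gaussian PDF of Equation (5.29) at a point x."""
    d = x - mu
    quad = d @ np.linalg.inv(Sigma) @ d
    return np.exp(-0.5 * quad) / np.sqrt((2 * np.pi) ** 2 * np.linalg.det(Sigma))

def gauss1d(x, m, s):
    """Evaluate a 1D Gaussian PDF with mean m and standard deviation s."""
    return np.exp(-(x - m) ** 2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

x = np.array([0.3, -1.1])
print(gauss2d(x, mu, Sigma))
print(gauss1d(x[0], mu[0], sigma1) * gauss1d(x[1], mu[1], sigma2))   # same value
```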

The precise relation of the radius and orientation of the Gaussian is determined by the eigenvectors and eigenvalues of $\boldsymbol{\Sigma}$.

Definition 18. The covariance matrix $\boldsymbol{\Sigma}$ can be decomposed as
\[ \boldsymbol{\Sigma} = \boldsymbol{U} \boldsymbol{\Lambda} \boldsymbol{U}^T, \tag{5.31} \]
for some unitary matrix $\boldsymbol{U}$ and diagonal matrix $\boldsymbol{\Lambda}$. The columns of $\boldsymbol{U}$ are called the eigenvectors, and the diagonal entries of $\boldsymbol{\Lambda}$ are called the eigenvalues.

If we write out the definition of the eigenvectors and eigenvalues, we can see that (at least for the two-dimensional case)
\[ \boldsymbol{\Sigma} = \boldsymbol{U} \boldsymbol{\Lambda} \boldsymbol{U}^T = \begin{bmatrix} \boldsymbol{u}_1 & \boldsymbol{u}_2 \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} \begin{bmatrix} \boldsymbol{u}_1^T \\ \boldsymbol{u}_2^T \end{bmatrix}. \]
The column vector $\boldsymbol{u}_1$ defines the direction of the major axis, and $\boldsymbol{u}_2$ defines the direction of the minor axis. The values $\lambda_1$ and $\lambda_2$ define the radii of the axes, respectively. See Figure 5.6 for an illustration.

Figure 5.6: The center and the radius of the ellipse are determined by $\boldsymbol{\mu}$ and $\boldsymbol{\Sigma}$.
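The decomposition is one call in code. A sketch (assuming NumPy; the covariance matrix is ours) that recovers the axis directions and the eigenvalues of the ellipse:

```python
import numpy as np

Sigma = np.array([[2.0, 1.2],
                  [1.2, 1.0]])

# Eigendecomposition of a symmetric matrix: Sigma = U diag(lam) U^T.
lam, U = np.linalg.eigh(Sigma)       # eigenvalues in ascending order, columns of U are eigenvectors

print(lam)                            # lambda_1, lambda_2 (>= 0 for a valid covariance)
print(U)                              # u_1, u_2: directions of the ellipse axes
print(np.allclose(U @ np.diag(lam) @ U.T, Sigma))   # True
```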

Maximum-a-Posteriori Classifier

Consider a dataset with two classes $C_1$ and $C_2$. We assume that all data within each class follows a Gaussian distribution. More specifically, we assume that
\[ \boldsymbol{X} \mid C_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1), \qquad \boldsymbol{X} \mid C_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2). \]
Suppose we are given a testing data point $\boldsymbol{x}$; how do we design a classifier to classify this data point? To answer this question, we first need to determine the two PDFs. Assume that the probability of obtaining $C_1$ is $\pi_1$ and the probability of obtaining $C_2$ is $\pi_2$, that is, $f_C(C_1) = \pi_1$ and $f_C(C_2) = \pi_2$, with $\pi_1 + \pi_2 = 1$. The conditional PDFs are given by
\[ f_{\boldsymbol{X}|C}(\boldsymbol{x} \mid C_1) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_1|}} \exp\left\{ -\tfrac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_1^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1) \right\}, \]
\[ f_{\boldsymbol{X}|C}(\boldsymbol{x} \mid C_2) = \frac{1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_2|}} \exp\left\{ -\tfrac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2) \right\}. \]
One possible way of designing a classifier is to compare the posterior distributions and check whether
\[ f_{C|\boldsymbol{X}}(C_1 \mid \boldsymbol{x}) \ \ge\ f_{C|\boldsymbol{X}}(C_2 \mid \boldsymbol{x}). \tag{5.32} \]
If $f_{C|\boldsymbol{X}}(C_1 \mid \boldsymbol{x}) \ge f_{C|\boldsymbol{X}}(C_2 \mid \boldsymbol{x})$, we claim that the class is $C_1$; otherwise it is $C_2$. By Bayes' theorem, we can rewrite the comparison of the posterior distributions as
\[ f_{\boldsymbol{X}|C}(\boldsymbol{x} \mid C_1)\, f_C(C_1) \ \ge\ f_{\boldsymbol{X}|C}(\boldsymbol{x} \mid C_2)\, f_C(C_2). \]
Substituting the Gaussians, we have
\[ \frac{\pi_1}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_1|}}\, e^{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}_1^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1)} \ \ge\ \frac{\pi_2}{\sqrt{(2\pi)^d |\boldsymbol{\Sigma}_2|}}\, e^{-\frac{1}{2} (\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}_2^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2)}. \]
The comparison defined by this posterior distribution is called the maximum-a-posteriori (MAP) classification.

Definition 19. The maximum-a-posteriori (MAP) classification is a test that checks whether
\[ f_{\boldsymbol{X}|C}(\boldsymbol{x} \mid C_1)\, f_C(C_1) \ \ge\ f_{\boldsymbol{X}|C}(\boldsymbol{x} \mid C_2)\, f_C(C_2). \tag{5.33} \]

To demonstrate how the MAP classification can be used in practice, we consider the special case where $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2$ and $\pi_1 = \pi_2 = 1/2$.

Theorem 13. Let $\boldsymbol{X} \mid C_1 \sim \mathcal{N}(\boldsymbol{\mu}_1, \boldsymbol{\Sigma}_1)$ and $\boldsymbol{X} \mid C_2 \sim \mathcal{N}(\boldsymbol{\mu}_2, \boldsymbol{\Sigma}_2)$. Suppose that $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$ and $\pi_1 = \pi_2 = 1/2$. Then the MAP classifier of $C_1$ and $C_2$ is
\[ \boldsymbol{w}^T \boldsymbol{x} + x_0 \ \ge\ 0, \tag{5.34} \]
where $\boldsymbol{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $x_0 = -\tfrac{1}{2}\left\{ \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 \right\}$.

Proof. When $\boldsymbol{\Sigma}_1 = \boldsymbol{\Sigma}_2 = \boldsymbol{\Sigma}$ and $\pi_1 = \pi_2 = 1/2$, the MAP classifier simplifies to
\[ e^{-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1)} \ \ge\ e^{-\frac{1}{2}(\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2)}, \tag{5.35} \]
which implies that
\[ (\boldsymbol{x} - \boldsymbol{\mu}_1)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_1) \ \le\ (\boldsymbol{x} - \boldsymbol{\mu}_2)^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{x} - \boldsymbol{\mu}_2). \tag{5.36} \]
Note that the direction of the inequality is flipped because of the $-\frac{1}{2}$ factor in the exponent. Expanding and rearranging the terms, we obtain the equivalent expression
\[ \boldsymbol{x}^T \boldsymbol{\Sigma}^{-1} (\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2) \ \ge\ \tfrac{1}{2}\left\{ \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 \right\}. \tag{5.37} \]
If we define $\boldsymbol{w} = \boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_2)$ and $x_0 = -\tfrac{1}{2}\left\{ \boldsymbol{\mu}_1^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_1 - \boldsymbol{\mu}_2^T \boldsymbol{\Sigma}^{-1} \boldsymbol{\mu}_2 \right\}$, the above expression can be simplified to
\[ \boldsymbol{w}^T \boldsymbol{x} + x_0 \ \ge\ 0. \tag{5.38} \]

The result above is a linear classifier. Given a data point $\boldsymbol{x}$, all we need to do is to project $\boldsymbol{x}$ onto $\boldsymbol{w}$ and check whether $\boldsymbol{w}^T \boldsymbol{x} + x_0$ is less than or greater than $0$. If it is less than $0$, then we claim that the class is $C_2$; otherwise we claim that the class is $C_1$.

Figure 5.7: Classifying two classes of data points.
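A compact sketch of Theorem 13 (assuming NumPy; the class means, the shared covariance, and the sample sizes are ours, chosen only for illustration) builds $\boldsymbol{w}$ and $x_0$, classifies synthetic points drawn from the two Gaussians, and reports the empirical accuracy:

```python
import numpy as np

rng = np.random.default_rng(0)

mu1 = np.array([2.0, 0.0])
mu2 = np.array([-1.0, 1.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 0.5]])
Sigma_inv = np.linalg.inv(Sigma)

# Linear MAP classifier for equal priors and a shared covariance (Theorem 13).
w = Sigma_inv @ (mu1 - mu2)
x0 = -0.5 * (mu1 @ Sigma_inv @ mu1 - mu2 @ Sigma_inv @ mu2)

def classify(x):
    """Return 1 for class C1, 2 for class C2."""
    return 1 if w @ x + x0 >= 0 else 2

# Draw synthetic test data from the two classes and measure the accuracy.
n = 5000
X1 = rng.multivariate_normal(mu1, Sigma, size=n)
X2 = rng.multivariate_normal(mu2, Sigma, size=n)
pred1 = np.array([classify(x) for x in X1])
pred2 = np.array([classify(x) for x in X2])
accuracy = (np.sum(pred1 == 1) + np.sum(pred2 == 2)) / (2 * n)
print(accuracy)   # high, since the class means are well separated
```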