The Multivariate Normal Distribution
Defn: $Z \in \mathbb{R}^1 \sim N(0,1)$ iff
$$f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}.$$

Defn: $Z \in \mathbb{R}^p \sim MVN_p(0, I)$ if and only if $Z = (Z_1, \ldots, Z_p)^T$ with the $Z_i$ independent and each $Z_i \sim N(0,1)$. In this case, according to our theorem,
$$f_Z(z_1, \ldots, z_p) = \prod_{i=1}^p \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2} = (2\pi)^{-p/2} \exp\{-z^T z/2\};$$
superscript $T$ denotes matrix transpose.

Defn: $X \in \mathbb{R}^p$ has a multivariate normal distribution if it has the same distribution as $AZ + \mu$ for some $\mu \in \mathbb{R}^p$, some $p \times q$ matrix of constants $A$, and $Z \sim MVN_q(0, I)$.

Case $p = q$, $A$ singular: $X$ does not have a density.

Case $A$ invertible: derive the multivariate normal density by change of variables:
$$X = AZ + \mu \iff Z = A^{-1}(X - \mu), \qquad \frac{\partial X}{\partial Z} = A, \quad \frac{\partial Z}{\partial X} = A^{-1}.$$
So
$$f_X(x) = f_Z\big(A^{-1}(x-\mu)\big)\,\big|\det(A^{-1})\big| = \frac{\exp\{-(x-\mu)^T (A^{-1})^T A^{-1} (x-\mu)/2\}}{(2\pi)^{p/2}\,|\det A|}.$$
Now define $\Sigma = AA^T$ and notice that
$$\Sigma^{-1} = (A^T)^{-1} A^{-1} = (A^{-1})^T A^{-1}$$
and
$$\det \Sigma = \det A \det A^T = (\det A)^2.$$
Thus $f_X$ is
$$\frac{\exp\{-(x-\mu)^T \Sigma^{-1} (x-\mu)/2\}}{(2\pi)^{p/2} (\det \Sigma)^{1/2}};$$
the $MVN(\mu, \Sigma)$ density. Note the density is the same for all $A$ such that $AA^T = \Sigma$. This justifies the notation $MVN(\mu, \Sigma)$.
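As a quick numerical sanity check of this formula, here is a minimal sketch (numpy and scipy assumed; the matrix $A$, the mean $\mu$, and the evaluation point are arbitrary illustrations, with $A$ invertible almost surely) comparing the density above against `scipy.stats.multivariate_normal`:

```python
import numpy as np
from scipy.stats import multivariate_normal

rng = np.random.default_rng(0)
p = 3
A = rng.standard_normal((p, p))      # an arbitrary (a.s. invertible) A
mu = np.array([1.0, -2.0, 0.5])
Sigma = A @ A.T                      # Sigma = A A^T

x = rng.standard_normal(p)           # an arbitrary evaluation point
quad = (x - mu) @ np.linalg.solve(Sigma, x - mu)
dens = np.exp(-quad / 2) / ((2 * np.pi) ** (p / 2) * np.sqrt(np.linalg.det(Sigma)))

# The hand-computed density should match scipy's MVN pdf
print(dens, multivariate_normal(mean=mu, cov=Sigma).pdf(x))
```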

For which $\mu, \Sigma$ is this a density? Any $\mu$, but if $x \in \mathbb{R}^p$ then
$$x^T \Sigma x = x^T A A^T x = (A^T x)^T (A^T x) = \sum_{i=1}^p y_i^2 \ge 0$$
where $y = A^T x$. The inequality is strict except for $y = 0$, which is equivalent to $x = 0$. Thus $\Sigma$ is a positive definite symmetric matrix.

Conversely, if $\Sigma$ is a positive definite symmetric matrix then there is a square invertible matrix $A$ such that $AA^T = \Sigma$, so that there is an $MVN(\mu, \Sigma)$ distribution. ($A$ can be found via the Cholesky decomposition, e.g.)

When $A$ is singular $X$ will not have a density: there is an $a \neq 0$ such that $P(a^T X = a^T \mu) = 1$; $X$ is confined to a hyperplane.

Still true: the distribution of $X$ depends only on $\Sigma = AA^T$: if $AA^T = BB^T$ then $AZ + \mu$ and $BZ + \mu$ have the same distribution.
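A hedged sketch of the converse in code (numpy assumed): build $A$ from $\Sigma$ by Cholesky, then check that samples of $AZ + \mu$ have sample mean and covariance close to $\mu$ and $\Sigma$. The particular $\Sigma$ and $\mu$ below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])   # symmetric, positive definite
mu = np.array([1.0, -1.0])

A = np.linalg.cholesky(Sigma)                # lower triangular, A A^T = Sigma
Z = rng.standard_normal((2, 100_000))        # columns are MVN(0, I) draws
X = A @ Z + mu[:, None]                      # columns are MVN(mu, Sigma) draws

print(np.cov(X))                             # close to Sigma
print(X.mean(axis=1))                        # close to mu
```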

Expectation, moments

Defn: If $X \in \mathbb{R}^p$ has density $f$ then
$$E(g(X)) = \int g(x) f(x)\, dx$$
for any $g$ from $\mathbb{R}^p$ to $\mathbb{R}$.

FACT: if $Y = g(X)$ for a smooth $g$ (mapping $\mathbb{R} \to \mathbb{R}$) then
$$E(Y) = \int y f_Y(y)\, dy = \int g(x) f_Y(g(x)) g'(x)\, dx = E(g(X))$$
by the change of variables formula for integration (for monotone increasing $g$, $f_X(x) = f_Y(g(x))\,g'(x)$). This is good because otherwise we might have two different values for $E(e^X)$.

Linearity: $E(aX + bY) = aE(X) + bE(Y)$ for real $X$ and $Y$.
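To see the two routes agree in practice, here is a small sketch (numpy and scipy assumed) computing $E(e^Z)$ for $Z \sim N(0,1)$ both by averaging $g(Z)$ directly and from the density of $Y = e^Z$, which is lognormal:

```python
import numpy as np
from scipy.stats import lognorm

rng = np.random.default_rng(8)
z = rng.standard_normal(1_000_000)

# Route 1: average g(Z) against the distribution of Z (Monte Carlo)
print(np.mean(np.exp(z)))
# Route 2: use the density of Y = e^Z directly; Y is lognormal with shape s = 1
print(lognorm(s=1).mean())                   # both should be near e^{1/2} = 1.6487...
```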

Defn: The $r$th moment (about the origin) of a real rv $X$ is $\mu_r' = E(X^r)$ (provided it exists). We generally use $\mu$ for $E(X)$.

Defn: The $r$th central moment is
$$\mu_r = E[(X - \mu)^r].$$
We call $\sigma^2 = \mu_2$ the variance.

Defn: For an $\mathbb{R}^p$-valued random vector $X$, $\mu_X = E(X)$ is the vector whose $i$th entry is $E(X_i)$ (provided all entries exist). Fact: the same idea is used for random matrices.

Defn: The ($p \times p$) variance-covariance matrix of $X$ is
$$\text{Var}(X) = E\left[(X - \mu)(X - \mu)^T\right],$$
which exists provided each component $X_i$ has a finite second moment.

Example moments: If $Z \sim N(0,1)$ then
$$E(Z) = \int_{-\infty}^{\infty} z e^{-z^2/2}\, \frac{dz}{\sqrt{2\pi}} = \left. -\frac{e^{-z^2/2}}{\sqrt{2\pi}} \right|_{-\infty}^{\infty} = 0$$
and (integrating by parts)
$$E(Z^r) = \int_{-\infty}^{\infty} z^r e^{-z^2/2}\, \frac{dz}{\sqrt{2\pi}} = \left. -\frac{z^{r-1} e^{-z^2/2}}{\sqrt{2\pi}} \right|_{-\infty}^{\infty} + (r-1)\int_{-\infty}^{\infty} z^{r-2} e^{-z^2/2}\, \frac{dz}{\sqrt{2\pi}}$$
so that
$$\mu_r = (r-1)\mu_{r-2}$$
for $r \ge 2$. Remembering that $\mu_1 = 0$ and
$$\mu_0 = \int_{-\infty}^{\infty} z^0 e^{-z^2/2}\, \frac{dz}{\sqrt{2\pi}} = 1$$
we find that
$$\mu_r = \begin{cases} 0 & r \text{ odd} \\ (r-1)(r-3)\cdots 1 & r \text{ even.} \end{cases}$$
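The recursion is easy to verify numerically; a sketch (numpy assumed) comparing it with Monte Carlo moments of $N(0,1)$:

```python
import numpy as np

def mu_r(r):
    """Central moments of N(0,1) via the recursion mu_r = (r-1) * mu_{r-2}."""
    if r == 0:
        return 1.0
    if r == 1:
        return 0.0
    return (r - 1) * mu_r(r - 2)

rng = np.random.default_rng(2)
z = rng.standard_normal(1_000_000)
for r in range(1, 7):
    # exact recursion vs Monte Carlo: 0, 1, 0, 3, 0, 15 pattern
    print(r, mu_r(r), np.mean(z ** r))
```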

If now $X \sim N(\mu, \sigma^2)$, that is, $X$ has the same distribution as $\sigma Z + \mu$, then $E(X) = \sigma E(Z) + \mu = \mu$ and
$$\mu_r(X) = E[(X - \mu)^r] = \sigma^r E(Z^r).$$
In particular, we see that our choice of notation $N(\mu, \sigma^2)$ for the distribution of $\sigma Z + \mu$ is justified; $\sigma^2$ is indeed the variance.

Similarly for $X \sim MVN(\mu, \Sigma)$ we have $X = AZ + \mu$ with $Z \sim MVN(0, I)$, so $E(X) = \mu$ and
$$\text{Var}(X) = E\left[(X - \mu)(X - \mu)^T\right] = E\left[AZ(AZ)^T\right] = A\,E(ZZ^T)\,A^T = AIA^T = \Sigma.$$
Note the use of the easy calculation: $E(Z) = 0$ and $\text{Var}(Z) = E(ZZ^T) = I$.

Moments and independence

Theorem: If $X_1, \ldots, X_p$ are independent and each $X_i$ is integrable then $X = X_1 \cdots X_p$ is integrable and
$$E(X_1 \cdots X_p) = E(X_1) \cdots E(X_p).$$

Moment Generating Functions

Defn: The moment generating function of a real-valued $X$ is
$$M_X(t) = E(e^{tX}),$$
defined for those real $t$ for which the expected value is finite.

Defn: The moment generating function of $X \in \mathbb{R}^p$ is
$$M_X(u) = E[e^{u^T X}],$$
defined for those vectors $u$ for which the expected value is finite.

Example: If $Z \sim N(0,1)$ then
$$M_Z(t) = \frac{1}{\sqrt{2\pi}} \int e^{tz - z^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int e^{-(z-t)^2/2 + t^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int e^{-u^2/2 + t^2/2}\, du = e^{t^2/2}.$$

Theorem ($p = 1$): If $M$ is finite for all $t$ in a neighbourhood of $0$ then:

1. Every moment of $X$ is finite.
2. $M$ is $C^\infty$ (in fact $M$ is analytic).
3. $\mu_k' = \frac{d^k}{dt^k} M_X(0)$.

Note: $C^\infty$ means $M$ has continuous derivatives of all orders. Analytic means $M$ has a convergent power series expansion in a neighbourhood of each $t \in (-\epsilon, \epsilon)$. The proof, and many other facts about mgfs, rely on techniques of complex variables.
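A quick Monte Carlo check of $M_Z(t) = e^{t^2/2}$ (numpy assumed; the $t$ values are arbitrary, and the Monte Carlo error grows with $|t|$ since $e^{tZ}$ becomes heavy-tailed):

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.standard_normal(1_000_000)
for t in (-1.0, 0.5, 2.0):
    # empirical E(e^{tZ}) vs the closed form e^{t^2/2}
    print(t, np.mean(np.exp(t * z)), np.exp(t ** 2 / 2))
```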

Characterization & MGFs

Theorem: Suppose $X$ and $Y$ are $\mathbb{R}^p$-valued random vectors such that $M_X(u) = M_Y(u)$ for $u$ in some open neighbourhood of $0$ in $\mathbb{R}^p$. Then $X$ and $Y$ have the same distribution.

The proof relies on techniques of complex variables.

MGFs and Sums

If $X_1, \ldots, X_p$ are independent and $Y = \sum X_i$ then the mgf of $Y$ is the product of the mgfs of the individual $X_i$:
$$E(e^{tY}) = \prod_i E(e^{tX_i}), \quad \text{or} \quad M_Y = \prod M_{X_i}.$$
(Also for multivariate $X_i$.)

Example: If $Z_1, \ldots, Z_p$ are independent $N(0,1)$ then
$$E\left(e^{\sum a_i Z_i}\right) = \prod_i E\left(e^{a_i Z_i}\right) = \prod_i e^{a_i^2/2} = \exp\left(\sum a_i^2/2\right).$$

Conclusion: If $Z \sim MVN_p(0, I)$ then
$$M_Z(u) = \exp\left(\sum u_i^2/2\right) = \exp(u^T u/2).$$

Example: If $X \sim N(\mu, \sigma^2)$ then $X = \sigma Z + \mu$ and
$$M_X(t) = E\left(e^{t(\sigma Z + \mu)}\right) = e^{t\mu} e^{\sigma^2 t^2/2}.$$
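The same kind of Monte Carlo check works for the $MVN_p(0, I)$ mgf (numpy assumed; $u$ is an arbitrary vector):

```python
import numpy as np

rng = np.random.default_rng(4)
p = 3
Z = rng.standard_normal((1_000_000, p))      # rows are MVN(0, I) draws
u = np.array([0.3, -0.5, 0.2])
# empirical E(e^{u^T Z}) vs the closed form exp(u^T u / 2)
print(np.mean(np.exp(Z @ u)), np.exp(u @ u / 2))
```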

Theorem: Suppose $X = AZ + \mu$ and $Y = A'Z' + \mu'$ where $Z \sim MVN_p(0, I)$ and $Z' \sim MVN_q(0, I)$. Then $X$ and $Y$ have the same distribution if and only if the following two conditions hold:

1. $\mu = \mu'$.
2. $AA^T = A'(A')^T$.

Alternatively: if $X$, $Y$ are each MVN, then $E(X) = E(Y)$ and $\text{Var}(X) = \text{Var}(Y)$ imply that $X$ and $Y$ have the same distribution.

Proof: If 1 and 2 hold, the mgf of $X$ is
$$E\left(e^{t^T X}\right) = E\left(e^{t^T(AZ + \mu)}\right) = e^{t^T \mu} E\left(e^{(A^T t)^T Z}\right) = e^{t^T \mu + (A^T t)^T(A^T t)/2} = e^{t^T \mu + t^T \Sigma t/2}.$$
Thus $M_X = M_Y$. Conversely, if $X$ and $Y$ have the same distribution then they have the same mean and variance.

Thus the mgf is determined by $\mu$ and $\Sigma$.

Theorem: If $X \sim MVN_p(\mu, \Sigma)$ then there is a $p \times p$ matrix $A$ such that $X$ has the same distribution as $AZ + \mu$ for $Z \sim MVN_p(0, I)$. We may assume that $A$ is symmetric and non-negative definite, or that $A$ is upper triangular, or that $A$ is lower triangular.

Proof: Pick any $A$ such that $AA^T = \Sigma$, such as $A = PD^{1/2}P^T$ from the spectral decomposition. Then $AZ + \mu \sim MVN_p(\mu, \Sigma)$.

From the symmetric square root we can produce an upper triangular square root by the Gram-Schmidt process: if $A$ has rows $a_1^T, \ldots, a_p^T$ then let $v_p = a_p/\sqrt{a_p^T a_p}$. Choose $v_{p-1}$ proportional to $a_{p-1} - bv_p$, where $b = a_{p-1}^T v_p$, so that $v_{p-1}$ has unit length. Continue in this way; you automatically get $a_j^T v_k = 0$ if $j > k$. If $P$ has columns $v_1, \ldots, v_p$ then $P$ is orthogonal and $AP$ is an upper triangular square root of $\Sigma$.
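A sketch of two square roots of the same $\Sigma$ (numpy assumed): the symmetric root $PD^{1/2}P^T$ from the spectral decomposition, and the lower-triangular root from `np.linalg.cholesky`. Both satisfy $AA^T = \Sigma$, so both give the same $MVN(\mu, \Sigma)$ distribution:

```python
import numpy as np

Sigma = np.array([[4.0, 1.0], [1.0, 2.0]])   # an arbitrary positive definite Sigma

# Symmetric nonnegative-definite square root via the spectral decomposition
d, P = np.linalg.eigh(Sigma)                 # Sigma = P diag(d) P^T
A_sym = P @ np.diag(np.sqrt(d)) @ P.T

# Lower-triangular square root via Cholesky
A_tri = np.linalg.cholesky(Sigma)

print(np.allclose(A_sym @ A_sym.T, Sigma))   # True
print(np.allclose(A_tri @ A_tri.T, Sigma))   # True: same MVN either way
```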

Variances, Covariances, Correlations

Defn: The covariance between $X$ and $Y$ is
$$\text{Cov}(X, Y) = E\left[(X - \mu_X)(Y - \mu_Y)^T\right].$$
This is a matrix.

Properties:

- $\text{Cov}(X, X) = \text{Var}(X)$.
- Cov is bilinear:
$$\text{Cov}(AX + BW, Y) = A\,\text{Cov}(X, Y) + B\,\text{Cov}(W, Y)$$
and
$$\text{Cov}(X, CY + DZ) = \text{Cov}(X, Y)C^T + \text{Cov}(X, Z)D^T.$$
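Because the sample cross-covariance is itself bilinear in the data, bilinearity can be demonstrated exactly, not just approximately. A sketch with numpy, where the helper `cov` and the matrices `A`, `B` are illustrative:

```python
import numpy as np

def cov(X, Y):
    """Sample cross-covariance of row-vector samples X (n x p) and Y (n x q)."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    return Xc.T @ Yc / (len(X) - 1)

rng = np.random.default_rng(5)
n = 1000
X, W, Y = (rng.standard_normal((n, 2)) for _ in range(3))
A = np.array([[1.0, 2.0], [0.0, -1.0]])
B = np.array([[0.5, 0.0], [1.0, 1.0]])

lhs = cov(X @ A.T + W @ B.T, Y)              # Cov(AX + BW, Y) on the data
rhs = A @ cov(X, Y) + B @ cov(W, Y)
print(np.allclose(lhs, rhs))                 # True: exact, not just approximate
```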

Properties of the MVN distribution

1: All margins are multivariate normal: if
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$
then $X \sim MVN(\mu, \Sigma)$ implies $X_1 \sim MVN(\mu_1, \Sigma_{11})$.

2: $MX + \nu \sim MVN(M\mu + \nu, M\Sigma M^T)$: an affine transformation of an MVN is MVN.

3: If $\Sigma_{12} = \text{Cov}(X_1, X_2) = 0$ then $X_1$ and $X_2$ are independent.

4: All conditionals are normal: the conditional distribution of $X_1$ given $X_2 = x_2$ is
$$MVN\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$
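A sketch computing the conditional mean and variance of property 4 for a partitioned $\Sigma$ (numpy assumed; the partition, $\Sigma$, and the conditioning value `x2` are arbitrary illustrations):

```python
import numpy as np

mu = np.array([0.0, 1.0, 2.0])
Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

# Partition: X1 = first component, X2 = last two components
i1, i2 = [0], [1, 2]
S11, S12 = Sigma[np.ix_(i1, i1)], Sigma[np.ix_(i1, i2)]
S21, S22 = Sigma[np.ix_(i2, i1)], Sigma[np.ix_(i2, i2)]

x2 = np.array([1.5, 1.0])                    # hypothetical observed value of X2
gain = S12 @ np.linalg.inv(S22)              # Sigma12 Sigma22^{-1}
cond_mean = mu[i1] + gain @ (x2 - mu[i2])
cond_var = S11 - gain @ S21
print(cond_mean, cond_var)
```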

Proof of (1): If $X = AZ + \mu$ then $X_1 = [I\; 0]X$ for $I$ the identity matrix of the correct dimension. So
$$X_1 = ([I\; 0]A)Z + [I\; 0]\mu.$$
Compute the mean and variance to check the rest.

Proof of (2): If $X = AZ + \mu$ then
$$MX + \nu = (MA)Z + (M\mu + \nu).$$

Proof of (3): If
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$$
then $M_X(u) = M_{X_1}(u_1) M_{X_2}(u_2)$.

Proof of (4), first case: assume $\Sigma_{22}$ has an inverse. Define
$$W = X_1 - \Sigma_{12}\Sigma_{22}^{-1}X_2.$$
Then
$$\begin{bmatrix} W \\ X_2 \end{bmatrix} = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}.$$
Thus $(W, X_2)^T$ is
$$MVN\left(\begin{bmatrix} \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}\mu_2 \\ \mu_2 \end{bmatrix},\; \Sigma^*\right)$$
where
$$\Sigma^* = \begin{bmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}.$$

Now the joint density of $W$ and $X_2$ factors:
$$f_{W,X_2}(w, x_2) = f_W(w) f_{X_2}(x_2).$$
By change of variables the joint density of $X$ is
$$f_{X_1,X_2}(x_1, x_2) = c\, f_W(x_1 - Mx_2) f_{X_2}(x_2)$$
where $c = 1$ is the constant Jacobian of the linear transformation from $(W, X_2)$ to $(X_1, X_2)$ and $M = \Sigma_{12}\Sigma_{22}^{-1}$.

Thus the conditional density of $X_1$ given $X_2 = x_2$ is
$$\frac{f_W(x_1 - Mx_2) f_{X_2}(x_2)}{f_{X_2}(x_2)} = f_W(x_1 - Mx_2).$$
As a function of $x_1$ this density has the form of the advertised multivariate normal density.

Specialization to the bivariate case: write
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$
where we define
$$\rho = \frac{\text{Cov}(X_1, X_2)}{\sqrt{\text{Var}(X_1)\text{Var}(X_2)}}.$$
Note that $\sigma_i^2 = \text{Var}(X_i)$. Then
$$W = X_1 - \rho\frac{\sigma_1}{\sigma_2}X_2$$
is independent of $X_2$. The marginal distribution of $W$ is $N(\mu_1 - \rho\sigma_1\mu_2/\sigma_2, \tau^2)$ where
$$\tau^2 = \text{Var}(X_1) - 2\rho\frac{\sigma_1}{\sigma_2}\text{Cov}(X_1, X_2) + \left(\rho\frac{\sigma_1}{\sigma_2}\right)^2\text{Var}(X_2).$$

This simplifies to $\sigma_1^2(1 - \rho^2)$. Notice that it follows that $-1 \le \rho \le 1$.

More generally, for any $X$ and $Y$:
$$0 \le \text{Var}(X - \lambda Y) = \text{Var}(X) - 2\lambda\,\text{Cov}(X, Y) + \lambda^2\,\text{Var}(Y).$$
The RHS is minimized at
$$\lambda = \frac{\text{Cov}(X, Y)}{\text{Var}(Y)}.$$
The minimum value is
$$0 \le \text{Var}(X)(1 - \rho_{XY}^2)$$
where
$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}$$
defines the correlation between $X$ and $Y$.
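The minimization can be checked on data; in the sketch below (numpy assumed, arbitrary correlated pair) the identity holds up to floating-point error for sample quantities, since the sample variance of $x - \lambda y$ is the same quadratic in $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(6)
y = rng.standard_normal(10_000)
x = 0.8 * y + rng.standard_normal(10_000)    # an arbitrary correlated pair

C = np.cov(x, y)                             # 2x2 sample covariance (ddof=1)
lam = C[0, 1] / C[1, 1]                      # minimizer Cov(X,Y)/Var(Y)
rho2 = C[0, 1] ** 2 / (C[0, 0] * C[1, 1])    # squared sample correlation
# Both prints give the minimum value Var(X)(1 - rho^2)
print(np.var(x - lam * y, ddof=1), C[0, 0] * (1 - rho2))
```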

Multiple Correlation

Now suppose $X_2$ is scalar but $X_1$ is a vector.

Defn: The multiple correlation between $X_1$ and $X_2$ is
$$R^2(X_1, X_2) = \max_a \rho^2_{a^T X_1, X_2}$$
over all $a \neq 0$.

Thus: maximize
$$\frac{\text{Cov}^2(a^T X_1, X_2)}{\text{Var}(a^T X_1)\text{Var}(X_2)} = \frac{a^T \Sigma_{12}\Sigma_{21} a}{(a^T \Sigma_{11} a)\,\Sigma_{22}}.$$
Put $b = \Sigma_{11}^{1/2} a$. For $\Sigma_{11}$ invertible the problem is equivalent to maximizing
$$\frac{b^T Q b}{b^T b}$$
where
$$Q = \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{21}\Sigma_{11}^{-1/2}.$$
Solution: find the largest eigenvalue of $Q$.

Note that $Q = vv^T$ where
$$v = \Sigma_{11}^{-1/2}\Sigma_{12}$$
is a vector. Set $vv^T x = \lambda x$ and multiply by $v^T$ to get $(v^T v)(v^T x) = \lambda v^T x$, so either $v^T x = 0$ or $\lambda = v^T v$. If $v^T x = 0$ then we see $\lambda = 0$, so the largest eigenvalue is $v^T v$.

Summary: the maximum squared correlation is
$$R^2(X_1, X_2) = \frac{v^T v}{\Sigma_{22}} = \frac{\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}}{\Sigma_{22}}.$$
It is achieved when the eigenvector is $x = v = b$, so
$$a = \Sigma_{11}^{-1/2}\Sigma_{11}^{-1/2}\Sigma_{12} = \Sigma_{11}^{-1}\Sigma_{12}.$$
Notice: since $R^2$ is a squared correlation between two scalars ($a^T X_1$ and $X_2$) we have $0 \le R^2 \le 1$, with equality to 1 iff $X_2$ is a linear combination of the entries of $X_1$.
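A sketch (numpy assumed; the data-generating coefficients are arbitrary) checking that the formula for $R^2$ agrees with the squared correlation achieved by $a = \Sigma_{11}^{-1}\Sigma_{12}$. The agreement is exact up to rounding because $a$ is computed from the same sample covariance:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 100_000, 3
X1 = rng.standard_normal((n, p))
X2 = X1 @ np.array([1.0, -0.5, 2.0]) + rng.standard_normal(n)  # scalar response

S = np.cov(np.column_stack([X1, X2]).T)      # (p+1) x (p+1) sample covariance
S11, S12, S22 = S[:p, :p], S[:p, p], S[p, p]

R2 = S12 @ np.linalg.solve(S11, S12) / S22   # Sigma21 Sigma11^{-1} Sigma12 / Sigma22
a = np.linalg.solve(S11, S12)                # optimal a = Sigma11^{-1} Sigma12
print(R2, np.corrcoef(X1 @ a, X2)[0, 1] ** 2)  # agree
```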

Correlation matrices, partial correlations

The correlation between two scalars $X$ and $Y$ is
$$\rho_{XY} = \frac{\text{Cov}(X, Y)}{\sqrt{\text{Var}(X)\text{Var}(Y)}}.$$
If $X$ has variance matrix $\Sigma$ then the correlation matrix of $X$ is $R_X$ with entries
$$R_{ij} = \frac{\text{Cov}(X_i, X_j)}{\sqrt{\text{Var}(X_i)\text{Var}(X_j)}} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}.$$
If $X_1, X_2$ are MVN with the usual partitioned variance-covariance matrix then the conditional variance of $X_1$ given $X_2$ is
$$\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$
From this define the partial correlation matrix $R_{11\cdot 2}$ with entries
$$(R_{11\cdot 2})_{ij} = \frac{(\Sigma_{11\cdot 2})_{ij}}{\sqrt{(\Sigma_{11\cdot 2})_{ii}(\Sigma_{11\cdot 2})_{jj}}}.$$
Note: these are used even when $X_1, X_2$ are NOT MVN.
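A sketch computing $R_{11\cdot 2}$ from a partitioned covariance matrix (numpy assumed; the $\Sigma$ below is an arbitrary positive definite example):

```python
import numpy as np

Sigma = np.array([[2.0, 0.8, 0.6, 0.4],
                  [0.8, 1.5, 0.5, 0.3],
                  [0.6, 0.5, 1.0, 0.2],
                  [0.4, 0.3, 0.2, 1.2]])
p1 = 2                                        # X1 = first two coords, X2 = the rest

S11, S12 = Sigma[:p1, :p1], Sigma[:p1, p1:]
S21, S22 = Sigma[p1:, :p1], Sigma[p1:, p1:]

S11_2 = S11 - S12 @ np.linalg.solve(S22, S21) # conditional variance of X1 given X2
d = np.sqrt(np.diag(S11_2))
R11_2 = S11_2 / np.outer(d, d)                # partial correlation matrix
print(R11_2)
```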
