The Multivariate Normal Distribution

Defn: $Z \in \mathbb{R}^1 \sim N(0,1)$ iff
$$f_Z(z) = \frac{1}{\sqrt{2\pi}} e^{-z^2/2}.$$

Defn: $Z \in \mathbb{R}^p \sim MVN_p(0, I)$ if and only if $Z = (Z_1, \ldots, Z_p)^T$ with the $Z_i$ independent and each $Z_i \sim N(0,1)$. In this case, according to our theorem,
$$f_Z(z_1, \ldots, z_p) = \prod_{i=1}^p \frac{1}{\sqrt{2\pi}} e^{-z_i^2/2} = (2\pi)^{-p/2} \exp\{-z^T z/2\};$$
superscript $T$ denotes matrix transpose.

Defn: $X \in \mathbb{R}^p$ has a multivariate normal distribution if it has the same distribution as $AZ + \mu$ for some $\mu \in \mathbb{R}^p$, some $p \times q$ matrix of constants $A$, and $Z \sim MVN_q(0, I)$.
$p = q$, $A$ singular: $X$ does not have a density.

$A$ invertible: derive the multivariate normal density by change of variables:
$$X = AZ + \mu \iff Z = A^{-1}(X - \mu),$$
so $\partial X/\partial Z = A$ and $\partial Z/\partial X = A^{-1}$. Then
$$f_X(x) = f_Z\left(A^{-1}(x-\mu)\right) |\det(A^{-1})| = \frac{\exp\{-(x-\mu)^T (A^{-1})^T A^{-1} (x-\mu)/2\}}{(2\pi)^{p/2} |\det A|}.$$

Now define $\Sigma = AA^T$ and notice that
$$\Sigma^{-1} = (A^T)^{-1} A^{-1} = (A^{-1})^T A^{-1}$$
and
$$\det \Sigma = \det A \det A^T = (\det A)^2.$$

Thus $f_X$ is
$$\frac{\exp\{-(x-\mu)^T \Sigma^{-1} (x-\mu)/2\}}{(2\pi)^{p/2} (\det \Sigma)^{1/2}};$$
the $MVN(\mu, \Sigma)$ density. Note the density is the same for all $A$ such that $AA^T = \Sigma$. This justifies the notation $MVN(\mu, \Sigma)$.
For which $\mu$, $\Sigma$ is this a density? Any $\mu$; but if $x \in \mathbb{R}^p$ then
$$x^T \Sigma x = x^T A A^T x = (A^T x)^T (A^T x) = \sum_{i=1}^p y_i^2 \ge 0$$
where $y = A^T x$. The inequality is strict except for $y = 0$, which (for $A$ invertible) is equivalent to $x = 0$. Thus $\Sigma$ is a positive definite symmetric matrix.

Conversely, if $\Sigma$ is a positive definite symmetric matrix then there is a square invertible matrix $A$ such that $AA^T = \Sigma$, so that there is a $MVN(\mu, \Sigma)$ distribution. ($A$ can be found via the Cholesky decomposition, e.g.)

When $A$ is singular $X$ will not have a density: there is an $a \ne 0$ such that $P(a^T X = a^T \mu) = 1$; $X$ is confined to a hyperplane.

Still true: the distribution of $X$ depends only on $\Sigma = AA^T$: if $AA^T = BB^T$ then $AZ + \mu$ and $BZ + \mu$ have the same distribution.
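To make the construction concrete, here is a minimal numerical sketch (assuming numpy; the variable names are illustrative): draw $Z \sim MVN(0, I)$, set $X = AZ + \mu$ with $A$ the Cholesky factor of $\Sigma$, and check the empirical mean and covariance.

```python
# Minimal sketch of X = A Z + mu with A A^T = Sigma (numpy assumed).
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])          # symmetric positive definite

A = np.linalg.cholesky(Sigma)           # lower triangular, A @ A.T = Sigma
Z = rng.standard_normal((2, 100_000))   # columns are MVN(0, I) draws
X = A @ Z + mu[:, None]                 # columns are MVN(mu, Sigma) draws

print(X.mean(axis=1))                   # approximately mu
print(np.cov(X))                        # approximately Sigma
```

Any other square root $B$ with $BB^T = \Sigma$ (e.g. the symmetric one) would produce the same distribution, illustrating the last point above.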
Expectation, moments

Defn: If $X \in \mathbb{R}^p$ has density $f$ then
$$E(g(X)) = \int g(x) f(x)\, dx$$
for any $g$ from $\mathbb{R}^p$ to $\mathbb{R}$.

FACT: if $Y = g(X)$ for a smooth, increasing $g$ (mapping $\mathbb{R} \to \mathbb{R}$) then
$$E(Y) = \int y f_Y(y)\, dy = \int g(x) f_Y(g(x)) g'(x)\, dx = E(g(X))$$
by the change of variables formula for integration, since $f_Y(g(x))\, g'(x) = f_X(x)$. This is good because otherwise we might have two different values for $E(e^X)$.

Linearity: $E(aX + bY) = aE(X) + bE(Y)$ for real $X$ and $Y$.
Defn: The $r$th moment (about the origin) of a real rv $X$ is $\mu_r' = E(X^r)$ (provided it exists). We generally use $\mu$ for $E(X)$.

Defn: The $r$th central moment is
$$\mu_r = E[(X - \mu)^r].$$
We call $\sigma^2 = \mu_2$ the variance.

Defn: For an $\mathbb{R}^p$-valued random vector $X$, $\mu_X = E(X)$ is the vector whose $i$th entry is $E(X_i)$ (provided all entries exist). Fact: the same idea is used for random matrices.

Defn: The ($p \times p$) variance-covariance matrix of $X$ is
$$Var(X) = E\left[(X - \mu)(X - \mu)^T\right]$$
which exists provided each component $X_i$ has a finite second moment.
Example moments: If $Z \sim N(0,1)$ then
$$E(Z) = \int z e^{-z^2/2}\, dz/\sqrt{2\pi} = \left. -\frac{e^{-z^2/2}}{\sqrt{2\pi}} \right|_{-\infty}^{\infty} = 0$$
and (integrating by parts)
$$E(Z^r) = \int z^r e^{-z^2/2}\, dz/\sqrt{2\pi} = \left. -\frac{z^{r-1} e^{-z^2/2}}{\sqrt{2\pi}} \right|_{-\infty}^{\infty} + (r-1) \int z^{r-2} e^{-z^2/2}\, dz/\sqrt{2\pi}$$
so that
$$\mu_r = (r-1)\mu_{r-2}$$
for $r \ge 2$. Remembering that $\mu_1 = 0$ and $\mu_0 = \int z^0 e^{-z^2/2}\, dz/\sqrt{2\pi} = 1$ we find that
$$\mu_r = \begin{cases} 0 & r \text{ odd} \\ (r-1)(r-3)\cdots 1 & r \text{ even.} \end{cases}$$
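A quick Monte Carlo check of the recursion (a sketch assuming numpy): for $Z \sim N(0,1)$ the even moments should be $E(Z^2) = 1$, $E(Z^4) = 3$, $E(Z^6) = 15$.

```python
# Check mu_r = (r-1) mu_{r-2} for the standard normal by simulation.
import numpy as np

rng = np.random.default_rng(1)
z = rng.standard_normal(1_000_000)

for r in (2, 4, 6):
    exact = np.prod(np.arange(r - 1, 0, -2))  # (r-1)(r-3)...1
    print(r, np.mean(z**r), exact)            # simulated vs exact
```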
If now $X \sim N(\mu, \sigma^2)$, that is, $X \sim \sigma Z + \mu$, then $E(X) = \sigma E(Z) + \mu = \mu$ and
$$\mu_r(X) = E[(X - \mu)^r] = \sigma^r E(Z^r).$$

In particular we see that our choice of notation $N(\mu, \sigma^2)$ for the distribution of $\sigma Z + \mu$ is justified; $\sigma^2$ is indeed the variance.

Similarly for $X \sim MVN(\mu, \Sigma)$ we have $X = AZ + \mu$ with $Z \sim MVN(0, I)$, so $E(X) = \mu$ and
$$Var(X) = E\left\{(X - \mu)(X - \mu)^T\right\} = E\left\{AZ(AZ)^T\right\} = A E(ZZ^T) A^T = AIA^T = \Sigma.$$

Note the use of an easy calculation: $E(Z) = 0$ and $Var(Z) = E(ZZ^T) = I$.
Moments and independence

Theorem: If $X_1, \ldots, X_p$ are independent and each $X_i$ is integrable then $X = X_1 \cdots X_p$ is integrable and
$$E(X_1 \cdots X_p) = E(X_1) \cdots E(X_p).$$

Moment Generating Functions

Defn: The moment generating function of a real-valued $X$ is
$$M_X(t) = E(e^{tX})$$
defined for those real $t$ for which the expected value is finite.

Defn: The moment generating function of $X \in \mathbb{R}^p$ is
$$M_X(u) = E[e^{u^T X}]$$
defined for those vectors $u$ for which the expected value is finite.
Example: If $Z \sim N(0,1)$ then
$$M_Z(t) = \frac{1}{\sqrt{2\pi}} \int e^{tz - z^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int e^{-(z-t)^2/2 + t^2/2}\, dz = \frac{1}{\sqrt{2\pi}} \int e^{-u^2/2 + t^2/2}\, du = e^{t^2/2}.$$

Theorem: ($p = 1$) If $M$ is finite for all $t$ in a neighbourhood of 0 then:

1. Every moment of $X$ is finite.
2. $M$ is $C^\infty$ (in fact $M$ is analytic).
3. $\mu_k' = \frac{d^k}{dt^k} M_X(0)$.

Note: $C^\infty$ means has continuous derivatives of all orders. Analytic means has a convergent power series expansion in a neighbourhood of each $t \in (-\epsilon, \epsilon)$. The proof, and many other facts about mgfs, rely on techniques of complex variables.
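A simulation sketch of the example (numpy assumed): $E(e^{tZ})$ should match $e^{t^2/2}$ for each $t$.

```python
# Monte Carlo check that M_Z(t) = exp(t^2/2) for Z ~ N(0,1).
import numpy as np

rng = np.random.default_rng(2)
z = rng.standard_normal(1_000_000)
for t in (0.5, 1.0, 2.0):
    print(t, np.mean(np.exp(t * z)), np.exp(t**2 / 2))
```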
Characterization & MGFs

Theorem: Suppose $X$ and $Y$ are $\mathbb{R}^p$-valued random vectors such that $M_X(u) = M_Y(u)$ for $u$ in some open neighbourhood of 0 in $\mathbb{R}^p$. Then $X$ and $Y$ have the same distribution. The proof relies on techniques of complex variables.
MGFs and Sums

If $X_1, \ldots, X_p$ are independent and $Y = \sum X_i$ then the mgf of $Y$ is the product of the mgfs of the individual $X_i$:
$$E(e^{tY}) = \prod_i E(e^{tX_i}) \quad \text{or} \quad M_Y = \prod M_{X_i}.$$
(Also true for multivariate $X_i$.)

Example: If $Z_1, \ldots, Z_p$ are independent $N(0,1)$ then
$$E(e^{\sum a_i Z_i}) = \prod_i E(e^{a_i Z_i}) = \prod_i e^{a_i^2/2} = \exp\left(\sum a_i^2/2\right).$$

Conclusion: If $Z \sim MVN_p(0, I)$ then
$$M_Z(u) = \exp\left(\sum u_i^2/2\right) = \exp(u^T u/2).$$

Example: If $X \sim N(\mu, \sigma^2)$ then $X = \sigma Z + \mu$ and
$$M_X(t) = E(e^{t(\sigma Z + \mu)}) = e^{t\mu} e^{\sigma^2 t^2/2}.$$
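The product rule can be checked the same way (a sketch assuming numpy): the mgf of $Y = \sum a_i Z_i$ should be $\exp(a^T a/2)$, the product of the individual mgfs $\exp(a_i^2/2)$.

```python
# mgf of a sum of independent normals = product of individual mgfs.
import numpy as np

rng = np.random.default_rng(3)
a = np.array([0.3, -0.7, 0.5])
Z = rng.standard_normal((1_000_000, 3))
Y = Z @ a                                   # Y = sum_i a_i Z_i
t = 1.0
print(np.mean(np.exp(t * Y)), np.exp(t**2 * (a @ a) / 2))
```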
Theorem: Suppose $X = AZ + \mu$ and $Y = A^* Z^* + \mu^*$ where $Z \sim MVN_p(0, I)$ and $Z^* \sim MVN_q(0, I)$. Then $X$ and $Y$ have the same distribution if and only if the following two conditions hold:

1. $\mu = \mu^*$.
2. $AA^T = A^*(A^*)^T$.

Alternatively: if $X$, $Y$ are each MVN then $E(X) = E(Y)$ and $Var(X) = Var(Y)$ imply that $X$ and $Y$ have the same distribution.

Proof: If 1 and 2 hold, the mgf of $X$ is
$$E\left(e^{t^T X}\right) = E\left(e^{t^T(AZ + \mu)}\right) = e^{t^T \mu} E\left(e^{(A^T t)^T Z}\right) = e^{t^T \mu + (A^T t)^T (A^T t)/2} = e^{t^T \mu + t^T \Sigma t/2}.$$
Thus $M_X = M_Y$. Conversely, if $X$ and $Y$ have the same distribution then they have the same mean and variance.
Thus the mgf is determined by $\mu$ and $\Sigma$.

Theorem: If $X \sim MVN_p(\mu, \Sigma)$ then there is a $p \times p$ matrix $A$ such that $X$ has the same distribution as $AZ + \mu$ for $Z \sim MVN_p(0, I)$. We may assume that $A$ is symmetric and nonnegative definite, or that $A$ is upper triangular, or that $A$ is lower triangular.

Proof: Pick any $A$ such that $AA^T = \Sigma$, such as $A = PD^{1/2}P^T$ from the spectral decomposition. Then $AZ + \mu \sim MVN_p(\mu, \Sigma)$.

From the symmetric square root we can produce an upper triangular square root by the Gram-Schmidt process: if $A$ has rows $a_1^T, \ldots, a_p^T$ then let $v_p = a_p/\sqrt{a_p^T a_p}$. Choose $v_{p-1}$ proportional to $a_{p-1} - bv_p$ where $b = a_{p-1}^T v_p$, scaled so that $v_{p-1}$ has unit length. Continue in this way; you automatically get $a_j^T v_k = 0$ if $j > k$. If $P$ has columns $v_1, \ldots, v_p$ then $P$ is orthogonal and $AP$ is an upper triangular square root of $\Sigma$.
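Numerically, both square roots are easy to produce (a sketch assuming numpy): the following builds the symmetric root from the spectral decomposition and a triangular root from the Cholesky decomposition, and confirms $AA^T = \Sigma$ for each.

```python
# Two square roots of the same Sigma: symmetric (spectral) and triangular
# (Cholesky). Both satisfy A @ A.T == Sigma.
import numpy as np

Sigma = np.array([[4.0, 1.0, 0.5],
                  [1.0, 3.0, 0.2],
                  [0.5, 0.2, 2.0]])

evals, P = np.linalg.eigh(Sigma)             # Sigma = P diag(evals) P^T
A_sym = P @ np.diag(np.sqrt(evals)) @ P.T    # symmetric nonneg. definite root
A_tri = np.linalg.cholesky(Sigma)            # lower triangular root

print(np.allclose(A_sym @ A_sym.T, Sigma))   # True
print(np.allclose(A_tri @ A_tri.T, Sigma))   # True
```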
Variances, Covariances, Correlations

Defn: The covariance between $X$ and $Y$ is
$$Cov(X, Y) = E\left\{(X - \mu_X)(Y - \mu_Y)^T\right\}.$$
This is a matrix.

Properties:

$Cov(X, X) = Var(X)$.

Cov is bilinear:
$$Cov(AX + BW, Y) = A\,Cov(X, Y) + B\,Cov(W, Y)$$
and
$$Cov(X, CY + DZ) = Cov(X, Y)C^T + Cov(X, Z)D^T.$$
Properties of the MVN distribution

1: All margins are multivariate normal: if
$$X = \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}, \quad \mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix}, \quad \Sigma = \begin{bmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{bmatrix}$$
then $X \sim MVN(\mu, \Sigma) \Rightarrow X_1 \sim MVN(\mu_1, \Sigma_{11})$.

2: $MX + \nu \sim MVN(M\mu + \nu, M\Sigma M^T)$: an affine transformation of a MVN vector is MVN.

3: If $\Sigma_{12} = Cov(X_1, X_2) = 0$ then $X_1$ and $X_2$ are independent.

4: All conditionals are normal: the conditional distribution of $X_1$ given $X_2 = x_2$ is
$$MVN\left(\mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\right).$$
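Property 4 translates directly into code. The helper below is a sketch assuming numpy; `cond_mvn` is an illustrative name, not a library function.

```python
# Conditional distribution of X1 | X2 = x2 for a partitioned MVN.
import numpy as np

def cond_mvn(mu1, mu2, S11, S12, S22, x2):
    """Return (mean, cov) of X1 given X2 = x2."""
    K = S12 @ np.linalg.inv(S22)      # Sigma_12 Sigma_22^{-1}
    mean = mu1 + K @ (x2 - mu2)       # mu_1 + K (x2 - mu_2)
    cov = S11 - K @ S12.T             # Sigma_11 - K Sigma_21 (Sigma_21 = S12.T)
    return mean, cov

# Tiny example: all blocks 1x1, so this reduces to simple regression.
m, v = cond_mvn(np.zeros(1), np.ones(1),
                np.array([[2.0]]), np.array([[0.8]]), np.array([[1.0]]),
                np.array([2.0]))
print(m, v)   # mean 0.8, variance 2 - 0.8^2 = 1.36
```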
Proof of (1): If $X = AZ + \mu$ then $X_1 = [I\; 0]X$ for $I$ the identity matrix of the correct dimension. So
$$X_1 = ([I\; 0]A)Z + [I\; 0]\mu.$$
Compute the mean and variance to check the rest.

Proof of (2): If $X = AZ + \mu$ then $MX + \nu = (MA)Z + (M\mu + \nu)$.

Proof of (3): If
$$u = \begin{bmatrix} u_1 \\ u_2 \end{bmatrix}$$
then $M_X(u) = M_{X_1}(u_1)M_{X_2}(u_2)$.
Proof of (4): first case: assume $\Sigma_{22}$ has an inverse. Define
$$W = X_1 - \Sigma_{12}\Sigma_{22}^{-1}X_2.$$
Then
$$\begin{bmatrix} W \\ X_2 \end{bmatrix} = \begin{bmatrix} I & -\Sigma_{12}\Sigma_{22}^{-1} \\ 0 & I \end{bmatrix} \begin{bmatrix} X_1 \\ X_2 \end{bmatrix}.$$

Thus $(W, X_2)^T$ is
$$MVN\left(\begin{bmatrix} \mu_1 - \Sigma_{12}\Sigma_{22}^{-1}\mu_2 \\ \mu_2 \end{bmatrix}, \Sigma^*\right)$$
where
$$\Sigma^* = \begin{bmatrix} \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} & 0 \\ 0 & \Sigma_{22} \end{bmatrix}.$$
Now the joint density of $W$ and $X_2$ factors:
$$f_{W,X_2}(w, x_2) = f_W(w)f_{X_2}(x_2).$$

By change of variables the joint density of $X$ is
$$f_{X_1,X_2}(x_1, x_2) = c\, f_W(x_1 - Mx_2)f_{X_2}(x_2)$$
where $c = 1$ is the constant Jacobian of the linear transformation from $(W, X_2)$ to $(X_1, X_2)$ and $M = \Sigma_{12}\Sigma_{22}^{-1}$.

Thus the conditional density of $X_1$ given $X_2 = x_2$ is
$$\frac{f_W(x_1 - Mx_2)f_{X_2}(x_2)}{f_{X_2}(x_2)} = f_W(x_1 - Mx_2).$$

As a function of $x_1$ this density has the form of the advertised multivariate normal density.
Specialization to the bivariate case: write
$$\Sigma = \begin{bmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{bmatrix}$$
where we define
$$\rho = \frac{Cov(X_1, X_2)}{\sqrt{Var(X_1)Var(X_2)}}.$$
Note that $\sigma_i^2 = Var(X_i)$.

Then $W = X_1 - \rho\frac{\sigma_1}{\sigma_2}X_2$ is independent of $X_2$. The marginal distribution of $W$ is $N(\mu_1 - \rho\sigma_1\mu_2/\sigma_2, \tau^2)$ where
$$\tau^2 = Var(X_1) - 2\rho\frac{\sigma_1}{\sigma_2}Cov(X_1, X_2) + \left(\rho\frac{\sigma_1}{\sigma_2}\right)^2 Var(X_2).$$
This simplifies to $\sigma_1^2(1 - \rho^2)$. Notice that it follows that $-1 \le \rho \le 1$.

More generally, for any $X$ and $Y$:
$$0 \le Var(X - \lambda Y) = Var(X) - 2\lambda Cov(X, Y) + \lambda^2 Var(Y).$$

The RHS is minimized at
$$\lambda = \frac{Cov(X, Y)}{Var(Y)}.$$

The minimum value is
$$0 \le Var(X)(1 - \rho_{XY}^2)$$
where
$$\rho_{XY} = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}$$
defines the correlation between $X$ and $Y$.
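A numerical illustration of the minimization (a sketch assuming numpy, on simulated data): with $\hat\lambda = Cov(X,Y)/Var(Y)$, the sample variance of $X - \hat\lambda Y$ should match $Var(X)(1 - \rho^2)$.

```python
# Var(X - lambda Y) is minimized at lambda = Cov(X,Y)/Var(Y),
# with minimum value Var(X)(1 - rho^2).
import numpy as np

rng = np.random.default_rng(4)
y = rng.standard_normal(100_000)
x = 0.6 * y + rng.standard_normal(100_000)   # a correlated pair

C = np.cov(x, y)                             # 2x2 sample covariance
lam = C[0, 1] / C[1, 1]
rho2 = C[0, 1]**2 / (C[0, 0] * C[1, 1])
print(np.var(x - lam * y, ddof=1), C[0, 0] * (1 - rho2))  # nearly equal
```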
Multiple Correlation

Now suppose $X_2$ is scalar but $X_1$ is a vector.

Defn: The multiple correlation between $X_1$ and $X_2$ is
$$R^2(X_1, X_2) = \max_a \rho^2_{a^T X_1, X_2}$$
over all $a \ne 0$. Thus: maximize
$$\frac{Cov^2(a^T X_1, X_2)}{Var(a^T X_1)Var(X_2)} = \frac{a^T \Sigma_{12}\Sigma_{21}a}{(a^T \Sigma_{11}a)\Sigma_{22}}.$$

Put $b = \Sigma_{11}^{1/2}a$. For $\Sigma_{11}$ invertible the problem is equivalent to maximizing
$$\frac{b^T Q b}{b^T b}$$
where
$$Q = \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{21}\Sigma_{11}^{-1/2}.$$

Solution: find the largest eigenvalue of $Q$.
Note $Q = vv^T$ where
$$v = \Sigma_{11}^{-1/2}\Sigma_{12}$$
is a vector. Set $vv^T x = \lambda x$ and multiply by $v^T$ to get
$$(v^T v)(v^T x) = \lambda v^T x,$$
so either $v^T x = 0$ or $\lambda = v^T v$. If $v^T x = 0$ then we see $\lambda = 0$, so the largest eigenvalue is $v^T v$.

Summary: the maximum squared correlation is
$$R^2(X_1, X_2) = \frac{v^T v}{\Sigma_{22}} = \frac{\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}}{\Sigma_{22}}.$$
It is achieved when the eigenvector is $x = v = b$, so
$$a = \Sigma_{11}^{-1/2}\Sigma_{11}^{-1/2}\Sigma_{12} = \Sigma_{11}^{-1}\Sigma_{12}.$$

Notice: since $R^2$ is the squared correlation between two scalars ($a^T X_1$ and $X_2$) we have $0 \le R^2 \le 1$, with $R^2 = 1$ iff $X_2$ is a linear combination of the entries of $X_1$.
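The closed form can be compared with a brute-force search over directions $a$ (a sketch assuming numpy; the covariance blocks are arbitrary illustrative values).

```python
# R^2 = Sigma_21 Sigma_11^{-1} Sigma_12 / Sigma_22, versus brute force.
import numpy as np

S11 = np.array([[2.0, 0.5],
                [0.5, 1.0]])
S12 = np.array([[0.8],
                [0.3]])                      # Cov(X1, X2), X2 scalar
S22 = 1.5                                    # Var(X2)

R2 = (S12.T @ np.linalg.solve(S11, S12)).item() / S22
print(R2)

rng = np.random.default_rng(5)
best = max((a @ S12).item()**2 / ((a @ S11 @ a) * S22)
           for a in rng.standard_normal((10_000, 2)))
print(best)   # approaches R2 from below
```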
Correlation matrices, partial correlations

The correlation between two scalars $X$ and $Y$ is
$$\rho_{XY} = \frac{Cov(X, Y)}{\sqrt{Var(X)Var(Y)}}.$$

If $X$ has variance $\Sigma$ then the correlation matrix of $X$ is $R_X$ with entries
$$R_{ij} = \frac{Cov(X_i, X_j)}{\sqrt{Var(X_i)Var(X_j)}} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\Sigma_{jj}}}.$$

If $X_1$, $X_2$ are MVN with the usual partitioned variance-covariance matrix then the conditional variance of $X_1$ given $X_2$ is
$$\Sigma_{11\cdot 2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}.$$

From this define the partial correlation matrix $R_{11\cdot 2}$ with entries
$$(R_{11\cdot 2})_{ij} = \frac{(\Sigma_{11\cdot 2})_{ij}}{\sqrt{(\Sigma_{11\cdot 2})_{ii}(\Sigma_{11\cdot 2})_{jj}}}.$$

Note: these are used even when $X_1$, $X_2$ are NOT MVN.
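A final sketch (assuming numpy; `partial_corr` is an illustrative helper, not a library routine) computes $R_{11\cdot 2}$ from a partitioned covariance matrix.

```python
# Partial correlation matrix R_{11.2} from a partitioned covariance.
import numpy as np

def partial_corr(S11, S12, S22):
    """Correlation matrix of the conditional variance Sigma_{11.2}."""
    S11_2 = S11 - S12 @ np.linalg.solve(S22, S12.T)  # Sigma_11 - S12 S22^{-1} S21
    d = np.sqrt(np.diag(S11_2))
    return S11_2 / np.outer(d, d)

S11 = np.array([[1.0, 0.5],
                [0.5, 1.0]])
S12 = np.array([[0.4],
                [0.4]])
S22 = np.array([[1.0]])
print(partial_corr(S11, S12, S22))   # partial correlation of the X1 pair given X2
```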