2. Multivariate Distributions


Random vectors: mean, covariance matrix, linear transformations, dependence measures (a short introduction to the probability tools for multivariate statistics).

Multidimensional normal distribution, mixture models (some well-known examples of multivariate probability distributions).

Advanced Course in Statistics. Lecturer: Amparo Baíllo

2.1. Random vectors

Multivariate data are the result of observing a random vector, a vector X = (X_1, ..., X_p)' whose components X_j, j = 1, ..., p, are random variables (r.v.) on the same probability space (Ω, A, P). Similarly, a random matrix is a matrix whose elements are r.v. The probability distribution of a random vector or matrix is characterized by the joint distribution of its components. In particular, the distribution function of a random vector X is

F(x_1, ..., x_p) = P{X_1 ≤ x_1, ..., X_p ≤ x_p}, for (x_1, ..., x_p) ∈ R^p.

In general, we will only work with continuous random vectors, whose probability distribution is characterized by the density function f = f(x_1, ..., x_p), satisfying:

1. f(x_1, ..., x_p) ≥ 0 for all (x_1, ..., x_p) ∈ R^p;
2. ∫_{R^p} f(x_1, ..., x_p) dx_1 ... dx_p = 1;
3. f(x_1, ..., x_p) = ∂^p F(x_1, ..., x_p) / (∂x_1 ... ∂x_p).

The marginal distribution of each component X_j, j = 1, ..., p, is its probability distribution as an individual random variable. Its density function is

f_j(x_j) = ∫_{R^{p-1}} f(x_1, ..., x_p) dx_1 ... dx_{j-1} dx_{j+1} ... dx_p, for x_j ∈ R.

More generally, given the partition X = (X^(1), X^(2))', with X^(1) = (X_1, ..., X_r)' and X^(2) = (X_{r+1}, ..., X_p)', the marginal density of X^(1) is

f_{X^(1)}(x_1, ..., x_r) = ∫_{R^{p-r}} f(x_1, ..., x_p) dx_{r+1} ... dx_p.

Two random matrices X_1 and X_2 are independent if the elements of X_1 (as a collection of r.v.) are independent of the elements of X_2. (The elements within X_1 or X_2 need not be independent.) In particular, given the partition X = (X^(1), X^(2))', the vectors X^(1) and X^(2) are independent if

F(x_1, ..., x_p) = F_{X^(1)}(x_1, ..., x_r) F_{X^(2)}(x_{r+1}, ..., x_p), for all x_1, ..., x_p,

or, equivalently, if

f(x_1, ..., x_p) = f_{X^(1)}(x_1, ..., x_r) f_{X^(2)}(x_{r+1}, ..., x_p), for all x_1, ..., x_p.

2.1.1 Expectation

The expected value of a random vector (resp. matrix) is the vector (resp. matrix) of expected values of each of its components (the marginal expectations). For the random vector X = (X_1, ..., X_p)',

μ := E(X) = (E(X_1), ..., E(X_p))' = (μ_1, ..., μ_p)',

where μ_j := E(X_j) = ∫_R x f_j(x) dx. The expectation is a linear function:

1. If A is a q × p constant matrix, X is a p-dimensional random vector and b is a q-dimensional constant vector, then E(AX + b) = A E(X) + b.
2. If X and Y are random matrices of the same dimension, then E(X + Y) = E(X) + E(Y).
3. If X is a q × p random matrix and A, B are constant matrices of adequate dimensions, then E(AXB) = A E(X) B.

If X_1 and X_2 are conformable independent matrices, then E(X_1 X_2) = E(X_1) E(X_2).

2.1.2 Covariance matrix

The variance-covariance matrix (or simply covariance matrix) of a random vector X = (X_1, ..., X_p)' with expectation μ is

Σ = V(X) := E((X − μ)(X − μ)') = E(XX') − μμ'

  = [ σ_11  σ_12  ...  σ_1p ]
    [ σ_21  σ_22  ...  σ_2p ]
    [ ...                   ]
    [ σ_p1  σ_p2  ...  σ_pp ],

where σ_jj = V(X_j) is the variance of the r.v. X_j and σ_jk = Cov(X_j, X_k) is the covariance of X_j and X_k, j, k = 1, ..., p. Thus Σ is a symmetric matrix.

Some properties of the covariance matrix:

1. If A is a q × p constant matrix, X is a p-dimensional random vector and b is a q-dimensional constant vector, then V(AX + b) = A V(X) A'.
2. Σ = V(X) is always nonnegative definite.
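As a quick numerical check of property 1 (a minimal simulation sketch; A, b and Σ below are arbitrary choices for illustration), the empirical covariance of AX + b should match A V(X) A':

library(mvtnorm)
set.seed(1)
Sigma = matrix(c(2, 0.5, 0.5, 1), ncol = 2)   # V(X)
A = matrix(c(1, 0, 1, 1), ncol = 2)           # arbitrary 2 x 2 matrix
b = c(-1, 3)                                  # arbitrary shift
X = rmvnorm(n = 1e5, sigma = Sigma)
Y = t(A %*% t(X) + b)                         # row i is A X_i + b
cov(Y)                                        # empirical V(AX + b)
A %*% Sigma %*% t(A)                          # theoretical A V(X) A'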

2.1.3 Correlation matrix

Let X = (X_1, ..., X_p)' be a random vector with covariance matrix Σ and with 0 < σ_jj = V(X_j) < ∞, j = 1, ..., p. Define D := diag(σ_11, ..., σ_pp). Then the correlation matrix of X is

ρ = [ 1     ρ_12  ...  ρ_1p ]
    [ ρ_21  1     ...  ρ_2p ]
    [ ...                   ]
    [ ρ_p1  ρ_p2  ...  1    ] = D^{-1/2} Σ D^{-1/2},

where ρ_jk is the correlation of X_j and X_k, j, k = 1, ..., p, and D^{-1/2} := diag(σ_11^{-1/2}, ..., σ_pp^{-1/2}).

Observe that, if Z := D^{-1/2}(X − μ), where μ = E(X), then V(Z) = ρ.
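In R, the rescaling D^{-1/2} Σ D^{-1/2} is available as the base function cov2cor(); a small illustration with an arbitrary Σ:

Sigma = matrix(c(4, 1.2, 1.2, 1), ncol = 2)
Dinv = diag(1 / sqrt(diag(Sigma)))   # D^{-1/2}
Dinv %*% Sigma %*% Dinv              # correlation matrix by hand
cov2cor(Sigma)                       # the same with base R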

2.1.4 Dependence measures

More generally, the (cross-)covariance between the p-dimensional random vector X_1 and the q-dimensional random vector X_2, with means μ_1 and μ_2 respectively, is the p × q matrix given by

Cov(X_1, X_2) = E((X_1 − μ_1)(X_2 − μ_2)').

Some properties of the cross-covariance:

1. If A and B are constant matrices and c and d are constant vectors, then Cov(AX_1 + c, BX_2 + d) = A Cov(X_1, X_2) B'.
2. If X_1, X_2 and X_3 are random vectors, then Cov(X_1 + X_2, X_3) = Cov(X_1, X_3) + Cov(X_2, X_3).
3. If X_1 and X_2 are independent, then Cov(X_1, X_2) = 0_{p×q}.

Pearson's product-moment covariance measures linear dependence and, for the multivariate normal distribution, a diagonal covariance matrix implies independence of the components of the random vector. In general, however, Pearson's correlation matrix does not characterize independence.

Székely et al. (2007) introduced two dependence coefficients, distance covariance and distance correlation, that measure all types of dependence between random vectors X and Y of arbitrary (and possibly different) dimensions.

Suppose that X in R^p and Y in R^q are random vectors. The characteristic function of X is

f̂_X(t) := E(e^{i⟨t,X⟩}) = ∫_{R^p} e^{i⟨t,x⟩} dF_X(x).

Let f̂_Y be the characteristic function of Y, and denote the joint characteristic function of (X, Y) by f̂_{X,Y}. Then X and Y are independent if and only if f̂_{X,Y} = f̂_X f̂_Y.

Distance covariance is defined as a measure of the discrepancy between f̂_{X,Y} and f̂_X f̂_Y:

||f̂_{X,Y}(t, s) − f̂_X(t) f̂_Y(s)||²_w = ∫_{R^{p+q}} |f̂_{X,Y}(t, s) − f̂_X(t) f̂_Y(s)|² w(t, s) dt ds.

The only integrable weight function w that makes this definition scale and rotation invariant is proportional to the reciprocal of ||t||_p^{1+p} ||s||_q^{1+q}, where ||·||_p denotes the Euclidean norm in R^p. The distance covariance between random vectors X and Y with E||X||_p < ∞ and E||Y||_q < ∞ is the square root of

V²(X, Y) = (1/(c_p c_q)) ∫_{R^{p+q}} |f̂_{X,Y}(t, s) − f̂_X(t) f̂_Y(s)|² / (||t||_p^{1+p} ||s||_q^{1+q}) dt ds,   (1)

with c_p := π^{(p+1)/2} / Γ((p+1)/2).

Similarly, the distance variance is defined as the square root of V²(X) := V²(X, X). The distance correlation between random vectors X and Y with E||X||_p < ∞ and E||Y||_q < ∞ is the square root of

R²(X, Y) := V²(X, Y) / √(V²(X) V²(Y)),   if V²(X) V²(Y) > 0,
R²(X, Y) := 0,                           if V²(X) V²(Y) = 0.   (2)

Theorem 3 in Székely et al. (2007): If E||X||_p < ∞ and E||Y||_q < ∞, then 0 ≤ R ≤ 1, and R(X, Y) = 0 if and only if X and Y are independent.

For an observed random sample {(X_i, Y_i), i = 1, ..., n} of (X, Y), natural estimators of the unknown characteristic functions are

f̂_X^n(t) := ∫_{R^p} e^{i⟨t,x⟩} dF_X^n(x) = (1/n) ∑_{i=1}^n e^{i⟨t,X_i⟩},   f̂_Y^n(s) := (1/n) ∑_{i=1}^n e^{i⟨s,Y_i⟩}

and

f̂_{X,Y}^n(t, s) := (1/n) ∑_{i=1}^n e^{i⟨t,X_i⟩ + i⟨s,Y_i⟩},

where F_X^n denotes the empirical distribution function of X_1, ..., X_n.

The empirical distance covariance is defined as the square root of

V²_n(X, Y) := ||f̂_{X,Y}^n(t, s) − f̂_X^n(t) f̂_Y^n(s)||²_w.

Székely et al. (2007) used the asymptotic properties of the empirical distance covariance to test the independence of X and Y:

H_0: X and Y are independent
H_1: X and Y are dependent

Corollary 2 of Székely et al. (2007): If E||X||_p < ∞, E||Y||_q < ∞ and X and Y are independent, then

n V²_n(X, Y) / S_2 →_d Q as n → ∞,   (3)

where Q is a certain, known quadratic form of centered Gaussian random variables with E(Q) = 1 and

S_2 := ( (1/n²) ∑_{i,k=1}^n ||X_i − X_k||_p ) ( (1/n²) ∑_{i,k=1}^n ||Y_i − Y_k||_q ).

The test statistic in (3) is a particular case of the so-called energy statistics, functions of distances between statistical observations (see Székely and Rizzo 2013).
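The test based on (3) is implemented in the R package energy by Rizzo and Székely. A minimal sketch (the sample size and the nonlinear dependence below are arbitrary choices for illustration):

library(energy)
set.seed(1)
n = 100
x = matrix(rnorm(n * 2), ncol = 2)   # X in R^2
y = cbind(x[, 1]^2, rnorm(n))        # Y depends on X, but only nonlinearly
cor(x[, 1], y[, 1])                  # Pearson correlation is close to 0
dcor(x, y)                           # distance correlation detects the dependence
dcov.test(x, y, R = 199)             # permutation test of H_0: independence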

2.2. Examples of multidimensional distributions

2.2.1 Multidimensional normal distribution

The random vector X = (X_1, ..., X_p)' follows a p-dimensional normal distribution with mean μ and covariance matrix Σ, and we denote it by X ~ N_p(μ, Σ), if its density function is

f(x; μ, Σ) = (2π)^{-p/2} |Σ|^{-1/2} e^{−(x − μ)' Σ^{-1} (x − μ)/2},   (4)

where x = (x_1, ..., x_p)' and −∞ < x_i < ∞, i = 1, ..., p.

Example (Bivariate normal density): We evaluate the bivariate (p = 2) normal density in terms of the individual parameters μ_1 = E(X_1), μ_2 = E(X_2), σ_11 = V(X_1), σ_22 = V(X_2) and ρ_12 = Cor(X_1, X_2) = σ_12 / √(σ_11 σ_22).

The determinant and inverse of the matrix

Σ = [ σ_11               √(σ_11 σ_22) ρ_12 ]
    [ √(σ_11 σ_22) ρ_12  σ_22              ]

are, respectively,

|Σ| = σ_11 σ_22 (1 − ρ_12²)

and

Σ^{-1} = 1/(σ_11 σ_22 (1 − ρ_12²)) [ σ_22                −√(σ_11 σ_22) ρ_12 ]
                                   [ −√(σ_11 σ_22) ρ_12  σ_11               ].

Thus

(x − μ)' Σ^{-1} (x − μ)
  = 1/(σ_11 σ_22 (1 − ρ_12²)) [ σ_22 (x_1 − μ_1)² + σ_11 (x_2 − μ_2)² − 2 ρ_12 √(σ_11 σ_22) (x_1 − μ_1)(x_2 − μ_2) ]
  = 1/(1 − ρ_12²) [ (x_1 − μ_1)²/σ_11 + (x_2 − μ_2)²/σ_22 − 2 ρ_12 (x_1 − μ_1)(x_2 − μ_2)/√(σ_11 σ_22) ].

Consequently, the bivariate normal density is

f(x_1, x_2) = 1/(2π √(σ_11 σ_22 (1 − ρ_12²))) exp{ −1/(2(1 − ρ_12²)) [ (x_1 − μ_1)²/σ_11 + (x_2 − μ_2)²/σ_22 − 2 ρ_12 (x_1 − μ_1)(x_2 − μ_2)/√(σ_11 σ_22) ] }.

Observe that, if ρ_12 = 0 (X_1 and X_2 are uncorrelated), then

f(x_1, x_2) = 1/(2π √(σ_11 σ_22)) exp{ −(1/2) [ (x_1 − μ_1)²/σ_11 + (x_2 − μ_2)²/σ_22 ] }
            = 1/√(2π σ_11) exp{ −(x_1 − μ_1)²/(2σ_11) } · 1/√(2π σ_22) exp{ −(x_2 − μ_2)²/(2σ_22) }
            = f_1(x_1) f_2(x_2).

Since the joint density f(x_1, x_2) can be expressed as the product of the marginal densities, we conclude that X_1 and X_2 are actually independent r.v.

split.screen(c(2, 3))

screen(1)  ## bivariate standard normal pdf
library(mvtnorm)
x = y = seq(-5, 5, length = 50)
f = function(x, y) { dmvnorm(cbind(x, y)) }
z = outer(x, y, f)
par(mai = c(0.1, 0.1, 0.1, 0.1))
persp(x, y, z, theta = 5, phi = 50, expand = 0.5, col = "lightblue")

screen(2)  ## contours of the bivariate normal pdf
x = y = seq(-5, 5, length = 150)
z = outer(x, y, f)
par(mai = c(0.5, 0.5, 0.5, 0.5))
contour(x, y, z, nlevels = 20, col = rainbow(20))

screen(3)  ## normal data
X = rmvnorm(n = 100, sigma = matrix(c(1, 0, 0, 1), ncol = 2))
par(mai = c(0.5, 0.5, 0.5, 0.5))
plot(X[, 1], X[, 2], pch = 19, xlab = expression(x[1]), ylab = expression(x[2]))

screen(4)  ## pdf with correlation 0.7
x = y = seq(-5, 5, length = 50)
Sigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
f = function(x, y) { dmvnorm(cbind(x, y), sigma = Sigma) }
z = outer(x, y, f)
par(mai = c(0.1, 0.1, 0.1, 0.1))
persp(x, y, z, theta = 5, phi = 50, expand = 0.5, col = "lightblue")

screen(5)  ## contours of the correlated pdf
x = y = seq(-5, 5, length = 150)
z = outer(x, y, f)
par(mai = c(0.5, 0.5, 0.5, 0.5))
contour(x, y, z, nlevels = 20, col = rainbow(20))

screen(6)  ## correlated normal data
X = rmvnorm(n = 100, sigma = Sigma)
par(mai = c(0.5, 0.5, 0.5, 0.5))
plot(X[, 1], X[, 2], pch = 19, xlab = expression(x[1]), ylab = expression(x[2]))

[Figure: perspective plots, contour plots and scatterplots of the two bivariate normal distributions (uncorrelated, and with correlation 0.7) produced by the code above.]

Properties of the multivariate normal distribution

Let X ~ N_p(μ, Σ).

1. The normal density has a global maximum at μ and is symmetric with respect to μ, in the sense that f(μ + a) = f(μ − a) for all a ∈ R^p.

2. Linear combinations of a multivariate normal are also normally distributed: if A is a q × p constant matrix and d is a q × 1 constant vector, then AX + d ~ N_q(Aμ + d, AΣA'). Consequently, all subsets of the components of X are normally distributed.

3. Zero correlation between normal vectors is equivalent to independence: if X = (X_1', X_2')', then X_1 and X_2 are independent if and only if Cov(X_1, X_2) = 0.

4. If Σ > 0, there exists a linear transformation of X with mean 0 and covariance matrix equal to the identity.

5. Contours of constant density for the multivariate normal distribution are ellipsoids centered at the population mean. If Σ > 0, then:

(a) The level sets of the probability density f are the ellipsoids {x ∈ R^p : (x − μ)' Σ^{-1} (x − μ) = c²}. These ellipsoids are centered at μ and have axes ±c √λ_i e_i, where (λ_i, e_i), i = 1, ..., p, are the eigenvalue-eigenvector pairs of Σ.

(b) (X − μ)' Σ^{-1} (X − μ) follows a χ²_p distribution. Thus P{(X − μ)' Σ^{-1} (X − μ) ≤ χ²_{p;α}} = 1 − α, for any 0 < α < 1.

The Mahalanobis distance d_M of a point x ∈ R^p to the mean μ of a p-dimensional distribution with covariance matrix Σ is defined by

d²_M(x) = (x − μ)' Σ^{-1} (x − μ).

It is a statistical distance in the sense that it takes into account the variability of the distribution (unlike the Euclidean distance).
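A minimal sketch of property 5 in R (the Σ below is an arbitrary illustration): eigen() returns the pairs (λ_i, e_i) defining the ellipsoid axes, and the base function mahalanobis() computes d²_M(x), whose empirical distribution can be compared with χ²_p:

library(mvtnorm)
set.seed(1)
Sigma = matrix(c(1, 0.7, 0.7, 1), ncol = 2)
mu = c(0, 0)
eigen(Sigma)                                    # eigenvalues and eigenvectors of Sigma
X = rmvnorm(n = 1e4, mean = mu, sigma = Sigma)
d2 = mahalanobis(X, center = mu, cov = Sigma)   # (x - mu)' Sigma^{-1} (x - mu)
mean(d2 <= qchisq(0.95, df = 2))                # should be close to 0.95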

6. If X ~ N_p(μ, Σ), then any linear combination of its components, a'X = a_1 X_1 + a_2 X_2 + ... + a_p X_p, is distributed as N(a'μ, a'Σa). Conversely, if a'X is distributed as N(a'μ, a'Σa) for every a ∈ R^p, then X must follow a N_p(μ, Σ) distribution.

7. Let X_1, ..., X_n be mutually independent N_p(μ_j, Σ) random vectors and let c_1, ..., c_n be real constants. Then V = c_1 X_1 + ... + c_n X_n follows a N_p(∑_{j=1}^n c_j μ_j, (∑_{j=1}^n c_j²) Σ) distribution.

8. Given a random sample X_1, ..., X_n from X ~ N_p(μ, Σ), the maximum likelihood estimators (m.l.e.) of μ and Σ are, respectively,

μ̂ = X̄ := (1/n) ∑_{i=1}^n X_i   and   Σ̂ = S_n := (1/n) ∑_{i=1}^n (X_i − X̄)(X_i − X̄)'.
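Both m.l.e. are one-liners in R; note that cov() uses the unbiased 1/(n − 1) scaling, so S_n requires the factor (n − 1)/n (a small sketch on simulated data):

library(mvtnorm)
set.seed(1)
n = 200
X = rmvnorm(n, mean = c(1, -1), sigma = matrix(c(2, 0.5, 0.5, 1), ncol = 2))
mu.hat = colMeans(X)               # sample mean vector
Sigma.hat = (n - 1) / n * cov(X)   # m.l.e. S_n (cov() divides by n - 1)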

9. The Central Limit Theorem: Let X_1, ..., X_n be independent observations from a population with mean μ and nonsingular covariance matrix Σ. Then

√n (X̄ − μ) →_d N_p(0, Σ) as n → ∞

and

n (X̄ − μ)' S^{-1} (X̄ − μ) →_d χ²_p as n → ∞.

The normality assumption on a sample from X can be assessed by:

- examining the univariate marginal distributions of the components of X, which should be Gaussian;
- examining the bivariate scatterplots of the pairs of components of X, which should have an elliptical appearance;
- checking whether the Mahalanobis distances d²_i = (x_i − x̄)' S_n^{-1} (x_i − x̄) follow a χ²_p distribution (see the sketch below).

If the data are clearly non-normal, we can consider the possibility of taking nonlinear transformations of the variables. There are multiple proposals in the literature to test the multivariate normality assumption (see Székely and Rizzo 2005; McAssey 2013 and references therein).
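A minimal sketch of the Mahalanobis-distance check in R (here on simulated data; with a real data set, replace X by the observed data matrix): under multivariate normality the ordered d²_i should line up with the χ²_p quantiles.

library(mvtnorm)
set.seed(1)
n = 100; p = 3
X = rmvnorm(n, sigma = diag(p))
d2 = mahalanobis(X, center = colMeans(X), cov = cov(X))
qqplot(qchisq(ppoints(n), df = p), sort(d2),
       xlab = "chi-squared quantiles", ylab = "ordered Mahalanobis distances")
abline(0, 1)   # points should lie close to this line under normality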

Example: Mass, snout-vent length and hind limb span of 25 lizards.

[Figure: pairwise scatterplots of the three variables.]

Example: Concentration of Selenium in the teeth and liver of 20 whales (Delphinapterus leucas) at Mackenzie Delta, Northwest Territories.

[Figure: scatterplot of Selenium concentration in teeth versus liver.]

2.2.2 Distributions associated to the multivariate normal

Correspondences between the univariate and the multivariate situations:

Univariate case    Multivariate case
N(μ, σ)            N_p(μ, Σ)
χ²_n               W_p(Σ, n)
F(m, n)            Λ(p, a, b)
t                  T²

Wishart distribution

Given a random sample of independent random vectors X_1, ..., X_n from a N_p(0, Σ) distribution, the Wishart distribution W_p(Σ, n) is that of the random p × p matrix

Q = ∑_{i=1}^n X_i X_i'.

Properties:

1. If Q_1 ~ W_p(Σ, n_1) and Q_2 ~ W_p(Σ, n_2) are independent, then Q_1 + Q_2 ~ W_p(Σ, n_1 + n_2).

2. Fisher's Theorem: If X_1, ..., X_n are independent N_p(μ, Σ) random vectors, then
   i) the sample mean vector X̄ and the sample covariance matrix S_n are independent;
   ii) X̄ ~ N_p(μ, (1/n) Σ);
   iii) n S_n ~ W_p(Σ, n − 1).
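Base R can draw directly from the Wishart distribution via rWishart() (a quick sketch with arbitrary parameters); the average of the draws should approach E(Q) = n Σ, here with n = 10 degrees of freedom:

set.seed(1)
Sigma = matrix(c(1, 0.3, 0.3, 1), ncol = 2)
Q = rWishart(n = 1000, df = 10, Sigma = Sigma)   # 2 x 2 x 1000 array of draws
apply(Q, c(1, 2), mean)                          # close to 10 * Sigma
10 * Sigma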

Wilks' Lambda

This is the distribution of the determinants ratio

Λ = |A| / |A + B| = 1 / |I + A^{-1} B| ~ Λ(p, a, b),

where A ~ W_p(Σ, a) and B ~ W_p(Σ, b) are independent, Σ is nonsingular and a ≥ p.

Properties:

1. Bartlett's approximation: For large a,

−(a + b − (p + b + 1)/2) log Λ(p, a, b) ≈ χ²_{pb}.

Hotelling's T²

This is the distribution of the r.v.

T² = n Z' Q^{-1} Z ~ T²(p, n),

where Z ~ N_p(0, I) and Q ~ W_p(I, n) are independent.

Properties:

1. If p = 1, then T²(1, n) is the square of a Student t distribution with n degrees of freedom.

2. ((n − p + 1)/(np)) T²(p, n) = F(p, n − p + 1).

3. Hotelling's distribution is invariant under affine transformations: if X ~ N_p(μ, Σ) and R ~ W_p(Σ, n) are independent, then n (X − μ)' R^{-1} (X − μ) ~ T²(p, n).

4. Given a random sample of independent random vectors X_1, ..., X_n from a N_p(μ, Σ) distribution,

n (X̄ − μ)' S_n^{-1} (X̄ − μ) ~ T²(p, n − 1).

5. Let X_1, ..., X_{n_1} and Y_1, ..., Y_{n_2} be two random samples of independent random vectors from a N_p(μ_1, Σ) and a N_p(μ_2, Σ) distribution, respectively. If μ_1 = μ_2, then

(n_1 n_2/(n_1 + n_2)) (X̄ − Ȳ)' S_p^{-1} (X̄ − Ȳ) ~ T²(p, n_1 + n_2 − 2),

where

S_p = (n_1 S_{x,n_1} + n_2 S_{y,n_2}) / (n_1 + n_2 − 2)   (5)

is the pooled covariance matrix.

These two properties will be used in hypothesis tests about mean vectors of Gaussian distributions.

Inferences about the mean

Case 1. Let X_1, ..., X_n be a sample of independent random vectors from a N_p(μ, Σ) distribution. Fix μ_0 ∈ R^p and consider the test

H_0: μ = μ_0.   (6)

Under H_0, the test statistic

n (X̄ − μ_0)' S_n^{-1} (X̄ − μ_0) ~ T²(p, n − 1)

or, equivalently,

((n − p)/p) (X̄ − μ_0)' S_n^{-1} (X̄ − μ_0) ~ F(p, n − p).

This provides us with a rejection region for the test (6).
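A minimal sketch of Case 1 in R, using the F form of the statistic (simulated data; mu0 is the hypothesized mean):

library(mvtnorm)
set.seed(1)
n = 50; p = 2
X = rmvnorm(n, mean = c(0.3, 0), sigma = diag(p))
mu0 = c(0, 0)
Sn = (n - 1) / n * cov(X)                              # m.l.e. of Sigma
Fstat = (n - p) / p * mahalanobis(colMeans(X), mu0, Sn)
pval = 1 - pf(Fstat, df1 = p, df2 = n - p)             # reject H_0 for small pval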

Case 2. Let X_1, ..., X_{n_1} and Y_1, ..., Y_{n_2} be two independent samples of independent random vectors from a N_p(μ_1, Σ) and a N_p(μ_2, Σ) distribution, respectively. Consider the test

H_0: μ_1 = μ_2.   (7)

Under H_0, the test statistic

(n_1 n_2/(n_1 + n_2)) (X̄ − Ȳ)' S_p^{-1} (X̄ − Ȳ) ~ T²(p, n_1 + n_2 − 2),

where S_p is given in (5). This is equivalent to

((n_1 + n_2 − p − 1)/(p (n_1 + n_2 − 2))) (n_1 n_2/(n_1 + n_2)) (X̄ − Ȳ)' S_p^{-1} (X̄ − Ȳ) ~ F(p, n_1 + n_2 − p − 1).

This provides us with a rejection region for the test (7).
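A matching sketch for Case 2 (again on simulated data; both samples share the covariance matrix, as the test assumes):

library(mvtnorm)
set.seed(2)
n1 = 40; n2 = 60; p = 2
X = rmvnorm(n1, mean = c(0, 0), sigma = diag(p))
Y = rmvnorm(n2, mean = c(0.5, 0), sigma = diag(p))
Sp = ((n1 - 1) * cov(X) + (n2 - 1) * cov(Y)) / (n1 + n2 - 2)   # pooled covariance (5)
T2 = n1 * n2 / (n1 + n2) * mahalanobis(colMeans(X), colMeans(Y), Sp)
Fstat = (n1 + n2 - p - 1) / (p * (n1 + n2 - 2)) * T2
pval = 1 - pf(Fstat, df1 = p, df2 = n1 + n2 - p - 1)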

Case 3. Assume we have g data matrices from g independent multivariate normal populations:

Sample   Size      Mean   Covariance   Distribution
X_1      n_1 × p   x̄_1    S_1          N_p(μ_1, Σ)
X_2      n_2 × p   x̄_2    S_2          N_p(μ_2, Σ)
...
X_g      n_g × p   x̄_g    S_g          N_p(μ_g, Σ)

The global sample mean vector and sample covariance matrix are

x̄ = (1/n) ∑_{i=1}^g n_i x̄_i,   S = (1/(n − g)) ∑_{i=1}^g n_i S_i,   with n = ∑_{i=1}^g n_i.

Consider the test

H_0: μ_1 = μ_2 = ... = μ_g.   (8)

Let us introduce the following matrices:

B = ∑_{i=1}^g n_i (x̄_i − x̄)(x̄_i − x̄)'   (between-groups dispersion),

W = ∑_{i=1}^g ∑_{k=1}^{n_i} (x_{ik} − x̄_i)(x_{ik} − x̄_i)' = ∑_{i=1}^g n_i S_i   (intra-groups dispersion).

Under H_0, B ~ W_p(Σ, g − 1) and W ~ W_p(Σ, n − g) are independent. The test statistic

Λ = |W| / |W + B| ~ Λ(p, n − g, g − 1)

can be approximated by the F distribution via Rao's asymptotic approximation: if Λ ~ Λ(p, a, b), then

((1 − Λ^{1/β}) / Λ^{1/β}) · ((αβ − 2γ)/(pb)) ≈ F(pb, αβ − 2γ),

where α = a + b − (p + b + 1)/2, β² = (p²b² − 4)/(p² + b² − 5), γ = (pb − 2)/4.
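In practice, this Wilks test is available through base R's manova(); summary() with test = "Wilks" reports Λ together with an F approximation. A small sketch on simulated data with g = 3 groups:

library(mvtnorm)
set.seed(1)
X = rbind(rmvnorm(30, mean = c(0, 0)),
          rmvnorm(30, mean = c(1, 0)),
          rmvnorm(30, mean = c(0, 1)))
group = factor(rep(1:3, each = 30))
fit = manova(X ~ group)
summary(fit, test = "Wilks")   # Wilks' Lambda and its F approximation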

2.2.3 Mixture models

Let k > 0 be an integer. A p-dimensional random vector X has a k-component finite mixture distribution if its probability density (or mass) function is given by

f(x) = ∑_{j=1}^k π_j f_j(x),   (9)

where f_j, j = 1, ..., k, are probability densities (or mass functions) and 0 ≤ π_j ≤ 1, j = 1, ..., k, are constants such that ∑_{j=1}^k π_j = 1. The f_j are the component densities of the mixture and the π_j are the mixing proportions or weights.

In the definition of a mixture model, the number k of components is considered fixed, but in many applications the value of k is unknown and has to be inferred from the data.

The key to generating random vectors with density (9) is as follows. Define a discrete r.v. Z taking values 1, 2, ..., k with probabilities π_1, π_2, ..., π_k, respectively, and suppose that the conditional density of X given Z = j is f_j. Then the unconditional density of X is (9).

Equivalently, we can define the discrete random vector Z = (Z_1, ..., Z_k)', with each Z_j taking the value 0 or 1, ∑_{j=1}^k Z_j = 1 and π_j equal to the probability that component Z_j of Z is 1. Then Z follows a multinomial distribution with parameters (π_1, ..., π_k), and we suppose that f_j is the conditional density of X given that the j-th component of Z is 1.

In many applications the component densities f_j are specified to belong to some parametric family; the resulting model is called a parametric mixture. In particular, the component densities are frequently assumed to belong to the same parametric family, as in mixtures of Gaussian densities. Parametric mixture models can be viewed as a semiparametric compromise between a single parametric family (case k = 1) and a nonparametric model such as kernel density estimation (case k = n).

Example: Mixture of three bivariate normal distributions

n = 200                      # sample size
Size = 1                     # one draw per multinomial observation
Prob = c(0.5, 0.3, 0.2)      # mixing weights
NumComp = length(Prob)       # number of components in the mixture
C = rmultinom(n, Size, Prob)     # component labels (one column per observation)
SizeComp = apply(C, 1, sum)      # sample size of each component

library(mvtnorm)
X = matrix(rep(0, n * 2), nrow = n)
X[C[1, ] == 1, ] = rmvnorm(n = SizeComp[1], sigma = matrix(c(1, 0, 0, 1), ncol = 2))
X[C[2, ] == 1, ] = rmvnorm(n = SizeComp[2], mean = c(3, 5), sigma = matrix(c(3, 1, 1, 1), ncol = 2))
X[C[3, ] == 1, ] = rmvnorm(n = SizeComp[3], mean = c(4, -3), sigma = matrix(c(0.5, 0.1, 0.1, 2), ncol = 2))

panel.hist = function(x, ...) {
  usr <- par("usr"); on.exit(par(usr))
  par(usr = c(usr[1:2], 0, 1.5))
  h <- hist(x, plot = FALSE)
  breaks <- h$breaks; nb <- length(breaks)
  y <- h$counts; y <- y / max(y)
  rect(breaks[-nb], 0, breaks[-1], y, col = "cyan", ...)
}
pairs(X, cex = 1.5, pch = 20, bg = "light blue",
      diag.panel = panel.hist, cex.labels = 2, font.labels = 2)

[Figure: pairwise scatterplot with marginal histograms of the simulated sample from the three-component normal mixture.]

Thus, a mixture is a candidate distribution to model a population with several subpopulations.

Example: Times between Old Faithful eruptions (Y, var 2) and duration of eruptions (X, var 1).

[Figure: scatterplot of the two variables with marginal histograms.]

2.3. Maximum likelihood estimation

Let x_1, ..., x_n denote a sample from a multivariate parametric model with density (or mass) function f(x; ψ), where ψ = (ψ_1, ..., ψ_k)' denotes the vector of unknown parameters. The maximum likelihood estimator (m.l.e.) of ψ is ψ̂, the maximizer of the likelihood function

L(ψ; x_1, ..., x_n) = ∏_{i=1}^n f(x_i; ψ).

MLE for the Gaussian distribution

Let X_1, ..., X_n be a random sample from a normal population with mean μ and covariance Σ. Then the m.l.e. of μ and Σ are, respectively,

μ̂ = X̄   and   Σ̂ = S_n = (1/n) ∑_{i=1}^n (X_i − X̄)(X_i − X̄)'.

Proof (Johnson and Wichern 2007): The likelihood is

L(μ, Σ) = ∏_{i=1}^n (2π)^{-p/2} |Σ|^{-1/2} e^{−(x_i − μ)' Σ^{-1} (x_i − μ)/2}
        = (2π)^{-np/2} |Σ|^{-n/2} e^{−∑_{i=1}^n (x_i − μ)' Σ^{-1} (x_i − μ)/2}.

The m.l.e. of μ is the minimizer of

∑_{i=1}^n (x_i − μ)' Σ^{-1} (x_i − μ)
  = ∑_{i=1}^n tr[Σ^{-1} (x_i − μ)(x_i − μ)']
  = tr[Σ^{-1} ( ∑_{i=1}^n (x_i − x̄)(x_i − x̄)' + n (x̄ − μ)(x̄ − μ)' )]
  = ∑_{i=1}^n tr[Σ^{-1} (x_i − x̄)(x_i − x̄)'] + n (x̄ − μ)' Σ^{-1} (x̄ − μ).

Since Σ^{-1} is positive definite, the distance (x̄ − μ)' Σ^{-1} (x̄ − μ) > 0 unless μ = x̄. Thus the m.l.e. of μ is μ̂ = x̄. It remains to maximize, over Σ,

L(μ̂, Σ) = (2π)^{-np/2} |Σ|^{-n/2} e^{−tr[Σ^{-1} ∑_{i=1}^n (x_i − x̄)(x_i − x̄)']/2}.

Auxiliary result: Given a p × p symmetric positive definite matrix B and a scalar b > 0, it holds that, for all positive definite p × p matrices Σ,

(1/|Σ|^b) e^{−tr(Σ^{-1} B)/2} ≤ (1/|B|^b) (2b)^{pb} e^{−bp},

with equality if Σ = (1/(2b)) B.

We apply this auxiliary result with b = n/2 and B = ∑_{i=1}^n (x_i − x̄)(x_i − x̄)' and conclude that the maximum occurs at Σ̂ = S_n.

MLE for a parametric mixture model

Consider a parametric mixture model

f(x; ψ) = ∑_{j=1}^k π_j f_j(x; θ_j),   (10)

where ψ = (π_1, ..., π_{k−1}, ξ')' and ξ is the vector containing all the parameters in θ_1, ..., θ_k known a priori to be different. We want to obtain the m.l.e. of the parameters in model (10) based on a sample x_1, ..., x_n from f. The log-likelihood for ψ is

log L(ψ; x_1, ..., x_n) = ∑_{i=1}^n log( ∑_{j=1}^k π_j f_j(x_i; θ_j) ).

Computing the m.l.e. would require solving the likelihood equation

∂ log L(ψ)/∂ψ = 0,

not an easy task (see Section 2.8 in McLachlan and Peel 2000).

The Expectation-Maximization (EM) algorithm of Dempster et al. (1977) provides an iterative scheme for computing the m.l.e. of the parameters ψ in a parametric mixture. The EM algorithm is designed for incomplete data, so the key is to regard the mixture data x_1, ..., x_n as incomplete, since the associated component label vectors z_1, ..., z_n are not available. Here z_i = (z_{i1}, ..., z_{ik})' is a k-dimensional vector with z_{ij} = 1 or 0 according to whether x_i did or did not arise from the j-th component of the mixture, i = 1, ..., n, j = 1, ..., k.

The complete data sample is therefore declared to be x_{c1}, ..., x_{cn}, where x_{ci} = (x_i', z_i')'. Then the complete-data log-likelihood for ψ is given by

log L_c(ψ) = ∑_{i=1}^n ∑_{j=1}^k z_{ij} (log π_j + log f_j(x_i; θ_j)).

E-step: The algorithm starts with an initial guess ψ^(0) for ψ. In general, denote by ψ^(g) the approximation to ψ after the g-th iteration of the algorithm. The E-step requires computing the conditional expectation of log L_c(ψ) given the sample x_1, ..., x_n, under the current approximation to ψ:

Q(ψ; ψ^(g)) = E_{ψ^(g)}(log L_c(ψ) | x_1, ..., x_n)
            = ∑_{i=1}^n ∑_{j=1}^k E_{ψ^(g)}(Z_{ij} | x_1, ..., x_n) (log π_j + log f_j(x_i; θ_j)).

It can be proved that, for i = 1, ..., n and j = 1, ..., k,

E_{ψ^(g)}(Z_{ij} | x_1, ..., x_n) = P_{ψ^(g)}{Z_{ij} = 1 | x_1, ..., x_n} = π_j^(g) f_j(x_i; θ_j^(g)) / ∑_{l=1}^k π_l^(g) f_l(x_i; θ_l^(g)) =: τ_{ij}^(g).

This is the posterior probability that the i-th member of the sample, X_i, belongs to the j-th component of the mixture.

Then

Q(ψ; ψ^(g)) = ∑_{i=1}^n ∑_{j=1}^k τ_{ij}^(g) (log π_j + log f_j(x_i; θ_j)).

M-step: The updated estimate ψ^(g+1) is obtained as the global maximizer of Q(ψ; ψ^(g)) with respect to ψ. Specifically,

π_j^(g+1) = (1/n) ∑_{i=1}^n τ_{ij}^(g),

and ξ^(g+1) is obtained as an appropriate root of

∑_{i=1}^n ∑_{j=1}^k τ_{ij}^(g) ∂ log f_j(x_i; θ_j)/∂ξ = 0.

The E- and M-steps are alternated repeatedly until the difference L(ψ^(g+1)) − L(ψ^(g)) (≥ 0) is small enough.

In the case of a normal mixture with heteroscedastic components,

f(x; ψ) = ∑_{j=1}^k π_j f(x; μ_j, Σ_j),

the M-step update ξ^(g+1) has a closed form:

μ_j^(g+1) = ∑_{i=1}^n τ_{ij}^(g) x_i / ∑_{i=1}^n τ_{ij}^(g)

and

Σ_j^(g+1) = ∑_{i=1}^n τ_{ij}^(g) (x_i − μ_j^(g+1))(x_i − μ_j^(g+1))' / ∑_{i=1}^n τ_{ij}^(g).

Remark: We have assumed that the number k of components in the mixture fitted to the sample is known or fixed in advance. There are techniques for choosing the optimal number of components (see, e.g., Chapter 6 in McLachlan and Peel 2000; Claeskens and Hjort 2008).
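A minimal EM sketch in R for a bivariate normal mixture, following the updates above (an illustration, not production code: k, the initialization and the stopping rule are arbitrary choices, and a careful implementation would guard against degenerate components):

library(mvtnorm)
set.seed(1)
X = rbind(rmvnorm(100, mean = c(0, 0)), rmvnorm(100, mean = c(4, 4)))
n = nrow(X); k = 2
pi.g = rep(1 / k, k)                      # initial mixing weights
mu.g = X[sample(n, k), , drop = FALSE]    # initial means: k random observations
Sigma.g = list(cov(X), cov(X))            # initial covariance matrices
loglik = -Inf
repeat {
  # E-step: posterior probabilities tau[i, j]
  dens = sapply(1:k, function(j) pi.g[j] * dmvnorm(X, mu.g[j, ], Sigma.g[[j]]))
  loglik.new = sum(log(rowSums(dens)))    # log-likelihood at current parameters
  tau = dens / rowSums(dens)
  # M-step: closed-form updates for the weights, means and covariances
  for (j in 1:k) {
    w = tau[, j]
    pi.g[j] = mean(w)
    mu.g[j, ] = colSums(w * X) / sum(w)
    Xc = sweep(X, 2, mu.g[j, ])
    Sigma.g[[j]] = crossprod(Xc * sqrt(w)) / sum(w)
  }
  # stop when the log-likelihood gain is negligible
  if (loglik.new - loglik < 1e-8) break
  loglik = loglik.new
}
pi.g; mu.g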

We can use the R package mclust to fit a normal mixture to a data set.

Example: Times between Old Faithful eruptions (Y) and duration of eruptions (X).

library(mclust)
Datos = read.table("Datos-geyser.txt", header = TRUE)
XY = cbind(Datos$X, Datos$Y)
# Normal mixture fitting with 2 components
faithfulDens = densityMclust(XY, G = 2, modelNames = "VVV")
summary(faithfulDens, parameters = TRUE)

Density estimation via Gaussian finite mixture modeling

Mclust VVV (ellipsoidal, varying volume, shape, and orientation) model with 2 components:

  log.likelihood   n   df   BIC   ICL

Clustering table:

Mixing probabilities:

Means:
     [,1] [,2]
[1,]
[2,]

Variances:
[,,1]
     [,1] [,2]
[1,]
[2,]
[,,2]
     [,1] [,2]
[1,]
[2,]

plot(faithfulDens, XY, xlab = "X", ylab = "Y")

[Figure: contour plot of the fitted two-component mixture density over the Old Faithful data.]

plot(faithfulDens, type = "persp", col = grey(0.8))

[Figure: perspective plot of the fitted mixture density.]

References

Claeskens, G. and Hjort, N.L. (2008). Model Selection and Model Averaging. Cambridge University Press.

Dempster, A.P., Laird, N.M. and Rubin, D.B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39.

Johnson, R.A. and Wichern, D.W. (2007). Applied Multivariate Statistical Analysis. Prentice Hall.

McAssey, M.P. (2013). An empirical goodness-of-fit test for multivariate distributions. Journal of Applied Statistics, 40.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley.

Peña, D. (2002). Análisis de datos multivariantes. McGraw-Hill.

Székely, G.J. and Rizzo, M.L. (2005). A new test for multivariate normality. Journal of Multivariate Analysis, 93.

Székely, G.J. and Rizzo, M.L. (2013). Energy statistics: a class of statistics based on distances. Journal of Statistical Planning and Inference, 143.

Székely, G.J., Rizzo, M.L. and Bakirov, N.K. (2007). Measuring and testing independence by correlation of distances. Annals of Statistics, 35.


More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Multivariate Analysis and Likelihood Inference

Multivariate Analysis and Likelihood Inference Multivariate Analysis and Likelihood Inference Outline 1 Joint Distribution of Random Variables 2 Principal Component Analysis (PCA) 3 Multivariate Normal Distribution 4 Likelihood Inference Joint density

More information

Canonical Correlation Analysis of Longitudinal Data

Canonical Correlation Analysis of Longitudinal Data Biometrics Section JSM 2008 Canonical Correlation Analysis of Longitudinal Data Jayesh Srivastava Dayanand N Naik Abstract Studying the relationship between two sets of variables is an important multivariate

More information

2. Matrix Algebra and Random Vectors

2. Matrix Algebra and Random Vectors 2. Matrix Algebra and Random Vectors 2.1 Introduction Multivariate data can be conveniently display as array of numbers. In general, a rectangular array of numbers with, for instance, n rows and p columns

More information

Stat 206: the Multivariate Normal distribution

Stat 206: the Multivariate Normal distribution Stat 6: the Multivariate Normal distribution James Johndrow (adapted from Iain Johnstone s notes) 16-11- Introduction The multivariate normal distribution plays a central role in multivariate statistics

More information

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics

Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics Test Code: STA/STB (Short Answer Type) 2013 Junior Research Fellowship for Research Course in Statistics The candidates for the research course in Statistics will have to take two shortanswer type tests

More information

Statistical Methods in Particle Physics

Statistical Methods in Particle Physics Statistical Methods in Particle Physics Lecture 3 October 29, 2012 Silvia Masciocchi, GSI Darmstadt s.masciocchi@gsi.de Winter Semester 2012 / 13 Outline Reminder: Probability density function Cumulative

More information

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians

University of Cambridge Engineering Part IIB Module 4F10: Statistical Pattern Processing Handout 2: Multivariate Gaussians Engineering Part IIB: Module F Statistical Pattern Processing University of Cambridge Engineering Part IIB Module F: Statistical Pattern Processing Handout : Multivariate Gaussians. Generative Model Decision

More information

Asymptotic Statistics-VI. Changliang Zou

Asymptotic Statistics-VI. Changliang Zou Asymptotic Statistics-VI Changliang Zou Kolmogorov-Smirnov distance Example (Kolmogorov-Smirnov confidence intervals) We know given α (0, 1), there is a well-defined d = d α,n such that, for any continuous

More information

Algorithms for Uncertainty Quantification

Algorithms for Uncertainty Quantification Algorithms for Uncertainty Quantification Tobias Neckel, Ionuț-Gabriel Farcaș Lehrstuhl Informatik V Summer Semester 2017 Lecture 2: Repetition of probability theory and statistics Example: coin flip Example

More information

Bayesian Inference. Chapter 9. Linear models and regression

Bayesian Inference. Chapter 9. Linear models and regression Bayesian Inference Chapter 9. Linear models and regression M. Concepcion Ausin Universidad Carlos III de Madrid Master in Business Administration and Quantitative Methods Master in Mathematical Engineering

More information

Factor Analysis and Kalman Filtering (11/2/04)

Factor Analysis and Kalman Filtering (11/2/04) CS281A/Stat241A: Statistical Learning Theory Factor Analysis and Kalman Filtering (11/2/04) Lecturer: Michael I. Jordan Scribes: Byung-Gon Chun and Sunghoon Kim 1 Factor Analysis Factor analysis is used

More information

Random Matrices and Multivariate Statistical Analysis

Random Matrices and Multivariate Statistical Analysis Random Matrices and Multivariate Statistical Analysis Iain Johnstone, Statistics, Stanford imj@stanford.edu SEA 06@MIT p.1 Agenda Classical multivariate techniques Principal Component Analysis Canonical

More information

Quick Tour of Basic Probability Theory and Linear Algebra

Quick Tour of Basic Probability Theory and Linear Algebra Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra CS224w: Social and Information Network Analysis Fall 2011 Quick Tour of and Linear Algebra Quick Tour of and Linear Algebra Outline Definitions

More information

MAS223 Statistical Inference and Modelling Exercises

MAS223 Statistical Inference and Modelling Exercises MAS223 Statistical Inference and Modelling Exercises The exercises are grouped into sections, corresponding to chapters of the lecture notes Within each section exercises are divided into warm-up questions,

More information

A Squared Correlation Coefficient of the Correlation Matrix

A Squared Correlation Coefficient of the Correlation Matrix A Squared Correlation Coefficient of the Correlation Matrix Rong Fan Southern Illinois University August 25, 2016 Abstract Multivariate linear correlation analysis is important in statistical analysis

More information

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline. MFM Practitioner Module: Risk & Asset Allocation September 11, 2013 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y

More information