Stat 206: the Multivariate Normal distribution

James Johndrow (adapted from Iain Johnstone's notes)

Introduction

The multivariate normal distribution plays a central role in multivariate statistics for several reasons.

1. It is considerably more mathematically tractable than other multivariate distributions.

2. The multivariate generalization of the central limit theorem (see the notes from lecture) says that properly centered and scaled sums of random vectors are asymptotically multivariate normal. Since many scientifically meaningful multivariate statistics are sums of vectors, this means that many statistics have approximately a multivariate normal distribution in large samples.

3. Much more general sampling models can be constructed using multivariate normal distributions as building blocks (e.g. mixtures).

This is where we will begin to depart from the book a bit, and this departure will persist throughout much of the course. The book devotes a lot of space to multivariate tests that generally rely on the assumption of joint normality, and often additional assumptions, to derive the distribution of the test statistic. For a variety of reasons, this assumption is practically stronger than the commonly used assumption of Gaussian residuals in the linear model, and somewhat harder to work around. As a result, tests that require joint normality often aren't very useful, and we won't spend a lot of time on them. On the other hand, the book devotes relatively little attention to the virtues of the multivariate normal distribution as the asymptotic distribution of many statistics of interest, and to the ease with which more complicated sampling models can be constructed from it. We will devote somewhat more time to these issues, the latter mainly in the context of clustering.

Definition

The $p$-dimensional multivariate normal distribution with mean $\mu$ and covariance $\Sigma$ has density

$$f(x) = |2\pi\Sigma|^{-1/2} \exp\left(-(x-\mu)^\top \Sigma^{-1} (x-\mu)/2\right),$$

where $x, \mu \in \mathbb{R}^p$ and $\Sigma$ is a symmetric and positive-definite matrix. When a random vector $X$ has this density, we write $X \sim N(\mu, \Sigma)$.
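To make the definition concrete, here is a minimal sketch in R, assuming the mvtnorm package is available (dmvnorm is its density function), that evaluates the density at a point and checks it against the formula above:

library(mvtnorm)

mu <- c(0, 1)
Sigma <- matrix(c(2, .5, .5, 1), 2, 2)
x <- c(.3, -.2)

# density via the package
dmvnorm(x, mean = mu, sigma = Sigma)
# density via the formula: |2 pi Sigma|^{-1/2} exp(-(x - mu)' Sigma^{-1} (x - mu)/2)
det(2*pi*Sigma)^(-1/2) * exp(-t(x - mu) %*% solve(Sigma) %*% (x - mu) / 2)

The two values agree.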

Since, by definition, densities integrate to one over their support, we can ascertain that

$$\int_{x \in \mathbb{R}^p} \exp\left(-(x-\mu)^\top \Sigma^{-1} (x-\mu)/2\right) dx = |2\pi\Sigma|^{1/2},$$

the multivariate generalization of the Gaussian integral. Some authors define the multivariate normal by its moment-generating function:

$$E[e^{t^\top X}] = \exp\left(\mu^\top t + t^\top \Sigma t / 2\right).$$

The multivariate normal distribution is, in a fundamental sense, the distribution of affine transformations of independent Gaussians.

Theorem 1. Let $Z_1, \ldots, Z_p$ be iid $N(0, 1)$, let $\Sigma = U \Lambda U^\top$ be a positive-definite, symmetric matrix, and let $\mu \in \mathbb{R}^p$ be a real vector. Then if $Z = (Z_1, \ldots, Z_p)^\top$ and $\Sigma^{1/2} = U \Lambda^{1/2} U^\top$ is the square root of $\Sigma$, we have

$$X = \mu + \Sigma^{1/2} Z \sim N(\mu, \Sigma).$$

Proof. By independence,

$$f(z_1, \ldots, z_p) = \prod_j f_j(z_j) = (2\pi)^{-p/2} \exp\left(-z^\top z / 2\right).$$

Make the transformation $x = \mu + \Sigma^{1/2} z$, so that $z = \Sigma^{-1/2}(x - \mu)$. Since

$$\frac{\partial z}{\partial x} = \frac{\partial}{\partial x} \Sigma^{-1/2}(x - \mu) = \Sigma^{-1/2},$$

the Jacobian determinant is $|\Sigma^{-1/2}| = |\Sigma|^{-1/2}$, giving

$$f(x) = |\Sigma|^{-1/2} (2\pi)^{-p/2} \exp\left(-(\Sigma^{-1/2}(x-\mu))^\top (\Sigma^{-1/2}(x-\mu))/2\right) = |2\pi\Sigma|^{-1/2} \exp\left(-(x-\mu)^\top \Sigma^{-1}(x-\mu)/2\right).$$

Using this and the following result on the mean and covariance of linear functions, we can obtain the mean and covariance of the multivariate normal.

Theorem 2. Let $X$ be a random $p$-vector with mean $\mu$ and covariance $\Sigma$, and let $A$ be a $q \times p$ matrix and $a$ a $q$-vector. Then

$$E[a + AX] = a + A\mu, \qquad \mathrm{cov}(a + AX) = A \Sigma A^\top.$$

The proof is by direct calculation, and is omitted as the task is rather dull; feel free to grind through it if interested, as it is a good test of facility with matrix operations.
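Theorem 1 doubles as a simulation recipe, and Theorem 2 can then be checked empirically. Here is a minimal sketch in base R, building the square root from the eigendecomposition exactly as in the theorem:

set.seed(1)
n <- 1e5; p <- 2
mu <- c(1, -1)
Sigma <- matrix(c(1, .8, .8, 2), 2, 2)

ee <- eigen(Sigma)
Sig.half <- ee$vectors %*% diag(sqrt(ee$values)) %*% t(ee$vectors)  # U Lambda^{1/2} U'

Z <- matrix(rnorm(n*p), n, p)            # rows are iid N(0, I_p) vectors
X <- sweep(Z %*% Sig.half, 2, mu, '+')   # each row is mu + Sigma^{1/2} z (Sig.half is symmetric)

colMeans(X)   # approximately mu
cov(X)        # approximately Sigma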

Together, Theorems 1 and 2 give us:

Corollary 1. If $X \sim N(\mu, \Sigma)$ then $E[X] = \mu$ and $\mathrm{cov}(X) = \Sigma$.

Proof. By Theorem 1, $X = \mu + \Sigma^{1/2} Z$ for $Z \sim N(0, I)$. By Theorem 2,

$$E[X] = \mu + \Sigma^{1/2} E[Z] = \mu, \qquad \mathrm{cov}(X) = \Sigma^{1/2} I (\Sigma^{1/2})^\top = \Sigma^{1/2} \Sigma^{1/2} = \Sigma.$$

This implies something very strong about the multivariate normal distribution that is not typical of multivariate distributions in general.

Theorem 3. There is exactly one multivariate normal distribution with mean $\mu$ and covariance $\Sigma$.

Proof. Combine Theorem 1 with the fact that two random variables with the same density have the same distribution.

Thus, analogous to the univariate case, a multivariate normal distribution is completely characterized by its first two moments. We also have:

Corollary 2. If $X \sim N(\mu, \Sigma)$ and $A$ is full-rank, then $a + AX \sim N(a + A\mu, A\Sigma A^\top)$.

Proof. Analogous to the proof of Theorem 1.

We conclude this section with a warning. It is possible to have two dependent random vectors $X_1$ and $X_2$, each of which is marginally multivariate normal, but which are not jointly multivariate normal. We will revisit this later in the notes.

Geometry of the normal density and Gaussian dependence structure

Another reason for the use of the Mahalanobis topology in multivariate statistics is that it describes the shape of the multivariate normal. Let $x, y \in \mathbb{R}^p$ be two points that have identical values of the multivariate normal density $f$ with mean $\mu$ and covariance $\Sigma$. Then

$$f(x) = |2\pi\Sigma|^{-1/2} \exp(-(x-\mu)^\top \Sigma^{-1}(x-\mu)/2) = |2\pi\Sigma|^{-1/2} \exp(-(y-\mu)^\top \Sigma^{-1}(y-\mu)/2) = f(y),$$

so

$$\exp(-(x-\mu)^\top \Sigma^{-1}(x-\mu)/2) = \exp(-(y-\mu)^\top \Sigma^{-1}(y-\mu)/2) \implies (x-\mu)^\top \Sigma^{-1}(x-\mu) = (y-\mu)^\top \Sigma^{-1}(y-\mu).$$

Therefore the sets of all points having identical value of the density (the level sets of $f$) are given by

$$\mathcal{X}_{c^2} = \{x : (x-\mu)^\top \Sigma^{-1}(x-\mu) = c^2\} = \{x : \|x - \mu\|_\Sigma = c\}$$

for the Mahalanobis distance $\|x - \mu\|_\Sigma$: the equation of an ellipsoid centered at $\mu$. The following allows us to compute the probability associated with the interior of any level set.

Remark 1. If $X \sim N(\mu, \Sigma)$, then $(X - \mu)^\top \Sigma^{-1} (X - \mu) \sim \chi^2_p$, a $\chi^2$ random variable with $p$ degrees of freedom.

Proof. $\Sigma^{-1/2}(X - \mu) \sim N(0, I)$, so $(X-\mu)^\top \Sigma^{-1}(X-\mu) = \sum_j Z_j^2$ for $Z_j$ iid $N(0, 1)$.

Therefore, the region $(x-\mu)^\top \Sigma^{-1}(x-\mu) \le F^{-1}_{\chi^2_p}(1-\alpha)$, where $F_{\chi^2_p}$ is the CDF of $\chi^2_p$, is an elliptical set containing $1 - \alpha$ of the probability assigned by the multivariate normal distribution.
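Remark 1 is easy to check by simulation. Here is a minimal sketch, assuming the mvtnorm package for the draws (base R's mahalanobis() computes exactly the quadratic form above):

library(mvtnorm)

set.seed(1)
p <- 2
mu <- c(0, 0)
Sigma <- matrix(c(1, .5, .5, 1), 2, 2)

X <- rmvnorm(1e4, mean = mu, sigma = Sigma)
d2 <- mahalanobis(X, center = mu, cov = Sigma)   # (x - mu)' Sigma^{-1} (x - mu) for each row

# fraction of points inside the 95% ellipse; should be close to .95
mean(d2 <= qchisq(.95, df = p))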

What does the multivariate normal density look like? Let's make some plots for the case $p = 2$ of a bivariate normal density with mean $0$ and covariance

$$\Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$$

for $\rho = 0, .25, .5, .95$.

[Figure 1: Level sets of the bivariate normal distribution with covariance matrix having diagonal elements 1 and off-diagonal elements as indicated in the subfigure titles.]

These plots show contours of equal probability, with the value of the density proportional to the color of the contour. In other words, we are visualizing the level sets of $f$ and the value of $f$ on each level set. As you can see, as the covariance/correlation increases, the major axis of the level sets rotates toward the line $y = x$ and the shape becomes elongated (that is, the eccentricity of the ellipse increases). This fits with our intuitive notion of correlation: when correlation is high between random variables $X$ and $Y$, they tend to take similar values. Negative correlation also induces the expected shape: the density concentrates around the major axis of the level set as the correlation grows more negative, and the orientation of the major axis rotates toward the line $y = -x$. That is, negatively correlated, jointly normal random variables tend to have values that are similar in magnitude, but opposite in sign.

[Figure 2: Level sets of the bivariate normal density for negative correlations.]

There is an interesting connection between the level sets of $f$ and the eigenvalues of $\Sigma$ (stated without proof, but see the book for examples).

Remark 2. Suppose $X \sim N(\mu, \Sigma)$, $\lambda_1, \ldots, \lambda_p$ are the eigenvalues of $\Sigma$ in descending order, and $e_1, \ldots, e_p$ are the associated eigenvectors. Then the $c^2$-level set $\mathcal{X}_{c^2}$ of the density $f$ of the random variable $X$ is an ellipsoid centered at $\mu$ with axes $\pm c \sqrt{\lambda_j}\, e_j$.

Think about this remark for a moment in the context of the previous two figures. When $\rho = 0$, the level sets of $f$ are actually circles, so the lengths of the two axes are the same. As $\rho \to 1$, the major axis elongates and rotates toward the line $y = x$, whereas as $\rho \to -1$, the rotation is toward the line $y = -x$. You may speculate that in the limit as $\rho = 1$ or $-1$, the density is zero except at points where $y = x$ (or $y = -x$). This intuition is essentially correct, and in fact the density concentrates on a line as $|\rho|$ increases. Now, let's interpret this in light of Remark 2 with an example.

Example 1. The matrix

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$

has linearly dependent columns, so it cannot have two nonzero eigenvalues. It's easy to figure out what the eigenvectors and eigenvalues are for this matrix:

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = 2 \begin{pmatrix} 1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix}, \qquad \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix} = 0 \begin{pmatrix} -1/\sqrt{2} \\ 1/\sqrt{2} \end{pmatrix},$$

so the eigenvectors are $(1/\sqrt{2}, 1/\sqrt{2})^\top$ with eigenvalue $2$ and $(-1/\sqrt{2}, 1/\sqrt{2})^\top$ with eigenvalue $0$. The matrix

$$\begin{pmatrix} 1 & -1 \\ -1 & 1 \end{pmatrix}$$

has the same eigenvectors, but $(1/\sqrt{2}, 1/\sqrt{2})^\top$ has eigenvalue $0$ and $(-1/\sqrt{2}, 1/\sqrt{2})^\top$ has eigenvalue $2$. In other words, the column spans of these matrices are one-dimensional linear spaces (lines), and these two lines are orthogonal. In summary, high correlation makes the density look linear in certain dimensions. We'll come back to this a bit later.
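Before moving on, here is a quick numerical check of Example 1 using base R's eigen() (eigenvectors are returned up to sign):

A <- matrix(c(1, 1, 1, 1), 2, 2)
eigen(A)   # eigenvalues 2 and 0; eigenvectors (1, 1)/sqrt(2) and (-1, 1)/sqrt(2)

B <- matrix(c(1, -1, -1, 1), 2, 2)
eigen(B)   # same eigenvectors, with the eigenvalues swapped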

Parameter estimation

Parameter estimation is often done by maximum likelihood, and the maximum likelihood estimators are available in closed form.

Remark 3. The maximum likelihood estimators of the mean $\mu$ and covariance $\Sigma$ of the multivariate normal distribution are

$$\hat{\mu} = \bar{x}, \qquad \hat{\Sigma} = S_n.$$

The book has a proof, though I think it is not the most straightforward one.
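As a quick illustration of Remark 3, here is a minimal sketch (assuming the mvtnorm package for the draws). Note that R's cov() uses the divisor $n - 1$, so it must be rescaled to obtain $S_n$:

library(mvtnorm)

set.seed(1)
X <- rmvnorm(500, mean = c(0, 2), sigma = matrix(c(1, .3, .3, 1), 2, 2))
n <- nrow(X)

mu.hat <- colMeans(X)               # xbar, the MLE of mu
Sigma.hat <- cov(X) * (n - 1) / n   # S_n, the MLE of Sigma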

Marginal and conditional distributions

The marginal and conditional distributions of a multivariate normal are also very nice. We will partition vectors and matrices in the following way. For a random $p$-vector $X$, put $X = (Y, Z)^\top$, with $Y$ a random $q$-vector and $Z$ a random $(p-q)$-vector. Write the mean of $X$ as $\mu = (\alpha, \beta)^\top$, with $\alpha$ a $q$-vector and $\beta$ a $(p-q)$-vector, and the covariance of $X$ as

$$\Sigma = \begin{pmatrix} \Psi & \Lambda \\ \Lambda^\top & \Phi \end{pmatrix}$$

for a $q \times q$ matrix $\Psi$, a $q \times (p-q)$ matrix $\Lambda$, and a $(p-q) \times (p-q)$ matrix $\Phi$. Also, when considering a partitioned vector or matrix, $0$ will be understood to mean that one block of the vector or matrix is all zero. For example,

$$\Sigma = \begin{pmatrix} \Psi & \Lambda \\ \Lambda^\top & 0 \end{pmatrix}$$

means that the lower-right $(p-q) \times (p-q)$ block of $\Sigma$ consists entirely of zeros. We have the following.

Theorem 4. Suppose $X = (Y, Z)^\top \sim N(\mu, \Sigma)$. Then $Y \sim N(\alpha, \Psi)$ and $Z \sim N(\beta, \Phi)$.

Proof. Corollary 2 applied to $A = (I_q, 0)$, with $A$ a $q \times p$ matrix.

In other words, if a random vector has a multivariate normal distribution, we can read off the marginal distribution of any subset of its coordinates by taking the corresponding entries of $\mu$ and $\Sigma$. Conditional distributions are also convenient. The following is given without proof (and with my apologies to anyone who worries about Borel's paradox); see the book for a proof.

Theorem 5. Suppose $X = (Y, Z)^\top \sim N(\mu, \Sigma)$. Then

$$Y \mid Z = z \sim N\left(\alpha + \Lambda \Phi^{-1}(z - \beta),\ \Psi - \Lambda \Phi^{-1} \Lambda^\top\right).$$

It's worth noting that the conditional mean is exactly what one obtains under a multivariate linear regression of $y$ on $z$. Finally, we have a well-known result on the correspondence between correlation and independence for jointly normal random variables.

Theorem 6. Suppose $X = (Y, Z)^\top \sim N(\mu, \Sigma)$. Then $Y \perp Z$ if and only if $\Lambda = 0$, the $q \times (p-q)$ matrix of zeros.

This is a rather unique property of the multivariate normal. It is not hard to generate examples of dependent random variables with zero correlation:

Example 2. Define a random $2$-vector as follows. First, choose $\theta \sim \mathrm{Uniform}(-\pi, \pi)$, then set

$$(X, Y)^\top = (\sin(\theta) + \epsilon_1,\ \cos(\theta) + \epsilon_2)^\top$$

for $\epsilon_1, \epsilon_2$ independent and marginally $N(0, \tau^2)$. Then $\mathrm{cor}(X, Y) = 0$, but $X$ and $Y$ are not independent. Let's see what this looks like:

library(ggplot2)
thet <- runif(1000, -pi, pi)
x <- sin(thet) + rnorm(1000, 0, .1)
y <- cos(thet) + rnorm(1000, 0, .1)
df <- data.frame(x=x, y=y)
ggplot(df, aes(x=x, y=y)) + geom_point(size=.3)

[Figure 3: Data sampled from the distribution in the previous example.]

The sample correlation in this example is close to zero, but clearly $X$ and $Y$ are not independent, since they are functions of the same random variable $\theta$. Here is another example, this time where $X$ and $Y$ are marginally normal, but not jointly normal.

Example 3. Let $X \sim N(0, 1)$ and $\xi$ a discrete random variable, independent of $X$, with probability mass function

$$P[\xi = 1] = P[\xi = -1] = 1/2.$$

Put $Y = \xi X$. Then $Y \sim N(0, 1)$ marginally and $\mathrm{cor}(X, Y) = 0$, but $X$ and $Y$ are not independent. Again, let's sample from this distribution and visualize:

x <- rnorm(1000)
u <- runif(1000)
xi <- -1*(u < .5) + 1*(u > .5)
y <- xi*x
df <- data.frame(x=x, y=y)
ggplot(df, aes(x=x, y=y)) + geom_point(size=.3)

[Figure 4: Data sampled from the distribution in the previous example.]

So having normal marginal distributions is not enough for joint normality. Marginals tell us nothing at all about dependence, and in fact it is quite easy to generate an arbitrary joint distribution with normal marginals. Suppose $X$ is a random vector with arbitrary joint distribution and marginal (continuous) CDFs $F_1, \ldots, F_p$, so that $P[X_j \le t] = F_j(t)$. Let $\Phi^{-1}$ be the standard normal quantile function (aka inverse CDF). Then

$$Y = (\Phi^{-1}(F_1(X_1)), \ldots, \Phi^{-1}(F_p(X_p)))^\top$$

has $Y_j \sim N(0, 1)$ for every $j$, but in general $Y$ is not jointly normal. (Some of you may recognize this as a simple consequence of Sklar's theorem regarding copulas. In fact, every copula can give rise to a joint distribution with normal marginals, but only a Gaussian copula can give rise to a multivariate normal distribution.) As such, the suggestion in the book that one assess normality by checking or testing for normality of the marginals seems misplaced. One can easily apply some transformation to the marginals to make them very nearly normal, but this does not mean that the resulting vector is even close to jointly normal. That said, transforming the marginals to be approximately normal is an important first step to assessing the plausibility of joint normality, for example by making bivariate scatter plots. These bivariate plots won't look remotely normal if the marginals are far from normal, but that doesn't indicate that there is no simple transformation to joint normality.
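To see the $\Phi^{-1}(F(\cdot))$ construction in action, here is a minimal sketch revisiting the circle data from Example 2; each marginal CDF is estimated with base R's ecdf(), and the small offset keeps qnorm() finite at the largest observation:

set.seed(1)
thet <- runif(1000, -pi, pi)
x <- sin(thet) + rnorm(1000, 0, .1)
y <- cos(thet) + rnorm(1000, 0, .1)
n <- 1000

# transform each marginal to be approximately N(0, 1)
z1 <- qnorm(ecdf(x)(x) - 1/(2*n))
z2 <- qnorm(ecdf(y)(y) - 1/(2*n))

plot(z1, z2)   # marginals look normal, but the joint is still a noisy circle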

The book suggests a few transformations that often make the marginals close to normal:

$z = \log(x/(1-x))$ for $x \in [0, 1]$;
$z = \sqrt{x}$ for $x \in \{0, 1, 2, \ldots\}$;
$z = \tfrac{1}{2} \log((1+x)/(1-x))$ for $x \in [-1, 1]$.

Another useful family are the power transformations, of which the Box-Cox transformations

$$z = \frac{x^\lambda - 1}{\lambda},$$

with $z = \log(x)$ for $\lambda = 0$, are best known. The parameter $\lambda$ is chosen to maximize the normal likelihood, with the sample mean and covariance of the transformed data substituted for the mean and covariance (see the book for details).

Finally, one can transform marginals using the empirical CDF. Recall that if $X \sim F$, then the random variable $F(X)$ has a uniform distribution on $[0, 1]$, and that if $G^{-1}$ is an inverse CDF (quantile function), then $G^{-1}(U) \sim G$ for $U \sim \mathrm{Uniform}(0, 1)$. This is generally referred to as the probability integral transform. Suppose we have observations on a univariate random variable $X$, and we know the CDF of $X$ is $F$. Then, for a sample $x_1, \ldots, x_n$ of data, we could transform each observation to get $u_1 = F(x_1), \ldots, u_n = F(x_n)$, which would be a sample from a uniform distribution. Then, we could make the second transformation $\Phi^{-1}(u_1), \ldots, \Phi^{-1}(u_n)$, where $\Phi^{-1}$ is the standard normal quantile function, to get a sample from a univariate normal distribution. In general, we don't know $F$, but we can estimate it by the empirical CDF

$$\hat{F}(t) = n^{-1} \sum_{i=1}^n 1_{\{x_i \le t\}},$$

the proportion of observed data less than or equal to any value $t$. R has a built-in function to compute $\hat{F}$. Then, we can transform to approximately standard normal data by

$$z = \Phi^{-1}(\hat{F}(x)).$$
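Here is a minimal sketch of that last transformation using base R's ecdf(); the $1/(2n)$ offset (the same one used on the real data later in these notes) keeps qnorm() away from infinity at the largest observation:

set.seed(1)
x <- rexp(200)    # skewed positive data
n <- length(x)

Fhat <- ecdf(x)
z <- qnorm(Fhat(x) - 1/(2*n))

qqnorm(z)   # close to a straight line: the transformed data are nearly standard normal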

Distributions of sample mean and covariance

When a sample is iid multivariate normal, the sample mean is also normal, and the sample covariance has a Wishart distribution.

Theorem 7. Suppose $X_1, \ldots, X_n$ (a collection of $n$ vectors of length $p$) are iid multivariate normal with mean $\mu$ and covariance $\Sigma$. Then

$$\bar{X} \sim N(\mu, n^{-1}\Sigma), \qquad (n-1)S \sim W_p(n-1, \Sigma),$$

and $\bar{X} \perp (n-1)S$.

The Wishart distribution is a distribution on matrix-valued random variables, and arises as the distribution of $\sum_i Z_i Z_i^\top$ for $Z_i \sim N(0, \Sigma)$. Its parameters are a degrees of freedom $\nu$ and a positive-definite scale matrix $\Sigma$, and we write $\Phi \sim W_p(\nu, \Sigma)$ to indicate that a random $p \times p$ matrix has the Wishart distribution with parameters $\nu$ and $\Sigma$ (note the difference from the book's notation). $W_p(\nu, \Sigma)$ has density

$$f(\Phi) = \frac{|\Phi|^{(\nu - p - 1)/2} \exp\left(-\mathrm{tr}(\Sigma^{-1}\Phi)/2\right)}{2^{\nu p/2}\, |\Sigma|^{\nu/2}\, \Gamma_p(\nu/2)}.$$

We can generate random matrices from a Wishart distribution in R using the package MCMCpack, like this:

library(MCMCpack)
p <- 2
n <- 5
Sigma <- matrix(c(1, .9, .9, 1), 2, 2)
Phi <- rwish(n-1, Sigma)
print(Phi)

[output: a random 2 x 2 positive-definite matrix]

We also have several strong asymptotic results, mentioned in a previous lecture, under the much weaker condition that the sampling process has finite mean and covariance.

Theorem 8. Suppose $X_1, \ldots, X_n$ is a random sample from a distribution with finite mean $\mu$ and covariance $\Sigma$. The following results hold as $n \to \infty$ with $p$ fixed:

$$\bar{X}_n = n^{-1} \sum_{i=1}^n X_i \overset{p}{\to} \mu$$
$$S,\ S_n \overset{p}{\to} \Sigma$$
$$\sqrt{n}(\bar{X}_n - \mu) \overset{D}{\to} N(0, \Sigma) \quad \text{(the multivariate CLT)}$$
$$n(\bar{X}_n - \mu)^\top S_n^{-1} (\bar{X}_n - \mu) \overset{D}{\to} \chi^2_p.$$

The book provides some justification of the first two. The proof of the multivariate central limit theorem is not trivial, but among the simpler proofs is one using characteristic functions, and variations of it can be found in most asymptotics textbooks. I would add to the book's (informal) justifications for these results that the last line apparently follows from Slutsky's theorem once we have the CLT.
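To connect the two characterizations of the Wishart above, here is a minimal sketch (again assuming MCMCpack is installed) that builds a draw directly as a sum of outer products and compares it with rwish:

library(MCMCpack)

set.seed(1)
p <- 2; nu <- 50
Sigma <- matrix(c(1, .9, .9, 1), 2, 2)

ee <- eigen(Sigma)
Sig.half <- ee$vectors %*% diag(sqrt(ee$values)) %*% t(ee$vectors)

Z <- matrix(rnorm(nu*p), nu, p) %*% Sig.half   # rows are iid N(0, Sigma)
Phi1 <- t(Z) %*% Z        # sum_i Z_i Z_i', a W_p(nu, Sigma) draw
Phi2 <- rwish(nu, Sigma)  # a draw from the same distribution via MCMCpack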

Assessing normality

The book makes a number of suggestions of how to assess joint normality. As we've already discussed, I won't focus much on the marginals. If one plans to model data as jointly normal, of course it will be necessary to transform the marginals to make them nearly normal. This is all very standard, and procedures for doing so (e.g. Box-Cox transformations, use of the empirical CDF) are covered in univariate statistics. The harder issue is whether the data are plausibly jointly normal, even after transforming the marginals.

Let's get back to some real data. Here are some bivariate plots of the first four stocks from the djia data. Before plotting, I use the empirical CDF to transform the marginals to be approximately normal, so when looking at the scatter plots, any non-normal appearance is entirely due to the nature of dependence, not non-normality of the marginals.

library(GGally)
load('../../datasets-other/djia/djia.rdata')
df <- djia.ldr[,3:6]
n <- nrow(df)
for (j in 1:4) {
  Fj <- ecdf(df[,j])
  df[,j] <- qnorm(Fj(df[,j]) - 1/(2*n))
}
ggpairs(df, lower=list(continuous=wrap("points", size=.1)))

[Figure 5: Bivariate scatter plots and marginal density plots for four stocks (axp, ba, cat, csco) in the djia data.]

Does that look bivariate Gaussian to you? Probably not, in my view. But this is just exploratory. There are a number of tests for multivariate normality that the book does not discuss. Some of these tests are implemented in the R package MVN. The test statistics and associated distribution theory are messy, so we won't go into that here (see https://cran.r-project.org/web/packages/MVN/vignettes/MVN.pdf if interested). A brief intuition is that most such test statistics are some function of the squared Mahalanobis distance

$$(x_i - \bar{x}_n)^\top (S_n)^{-1} (x_i - \bar{x}_n),$$

which we already know is asymptotically $\chi^2$ distributed. Let's try one of these tests, Mardia's test, on the components of the djia data that we plotted above. Mardia's test actually consists of two test statistics, a skewness and a kurtosis statistic. These are based on multivariate generalizations of higher moments, and the intuition is that since the multivariate normal is completely determined by its first two moments (mean and covariance), higher-order moments should be a function of $\mu, \Sigma$. Measuring the discrepancy between observed and expected values of the higher-order moments, based on estimates of $\mu$ and $\Sigma$, gives us a test statistic that is approximately $\chi^2$ distributed under the null. We can perform the test with the MVN package.

library(MVN)
mardiaTest(df, qqplot=F)

[output: Mardia's Multivariate Normality Test on df; the skewness statistics have p-values on the order of 1e-6. Result: Data are not multivariate normal]

The package nicely points out the obvious conclusion based on the output: we reject the null at the .05 level (or the .01 level, or any level people tend to choose, really). These data appear not to be jointly normal. A few important notes. (1) These tests are based on the asymptotic distribution of the test statistic, so p-values will be approximate in small samples. Fortunately, the sample size here is relatively large compared to $(p^2 + 3p)/2$ (the number of free parameters in the covariance and mean put together). (2) The djia data are not iid, since they exhibit serial dependence. Thus, the p-values in the output are too small. The way to think about this is that because there is dependence between sequential observations, each observation carries less information than an observation from an iid sample. Nonetheless, the conclusion of the test is so clear-cut that I suspect it is robust to the failure to account for autocorrelation.

How about the lizard data? Let's remake that plot from the first lecture, this time transforming the marginals.

lizard <- read.csv('../../datasets/t1-3.dat', sep=' ', header=F)
lizard <- lizard[,c(3,5,7)]
names(lizard) <- c('mass','svl','hls')
n <- nrow(lizard); p <- ncol(lizard)
for (j in 1:p) {
  Fj <- ecdf(lizard[,j])
  lizard[,j] <- qnorm(Fj(lizard[,j]) - 1/(2*n))
}
ggpairs(lizard, lower=list(continuous=wrap("points", size=.1)))

[Figure 6: Bivariate plots of the lizard data (mass, svl, hls).]

These have high pairwise correlation, but they look more plausibly normal. Let's try the test.

mardiaTest(lizard, qqplot=F)

[output: Mardia's Multivariate Normality Test on the lizard data; neither the skewness nor the kurtosis p-value is small. Result: Data are multivariate normal]

It looks like these data probably are multivariate normal; that is, we fail to reject the null hypothesis that they are multivariate normal. Again, a word of caution: there are only 25 observations here, so the p-values are somewhat approximate. Does this make scientific sense? I think so. A gross generalization is that things like body dimensions (what we are measuring for the lizards) often are nearly normally distributed in the population; indeed, one of the earliest studies of bivariate association used data on height and weight. On the other hand, things like stock prices tend not to be. The nature of dependence in financial markets simply isn't well-approximated by a multivariate normal distribution. One possible explanation for this is that the multivariate normal distribution has no tail dependence: no matter how high the correlation, the largest observations for each coordinate tend to be nearly independent in large samples. On the other hand, we expect the largest (or smallest) observations from financial time series to be the most dependent, so intuitively the multivariate normal seems like a bad fit.

It's important to keep in mind that even when the multivariate normal is not a good fit for the data, it does not mean that the sample mean and sample covariance are not good summaries of the data. Indeed, many of the methods we'll use in this course don't assume multivariate normality. Multivariate normality (or asymptotic multivariate normality) is somewhat more important in testing, for example, than it is when using methods like PCA for data reduction and summary.
