Stat 206: Sampling theory, sample moments, Mahalanobis topology
James Johndrow (adapted from Iain Johnstone's notes)
2016-11-02

Notation

My notation is different from the book's. This is partly because I am going to be writing on the board, and having less complicated notation makes that easier. I am also just a minimalist when it comes to notation. My notation is:

symbol   description         what it is
X        upper case letter   matrix, random variable
x        lower case letter   vector

Compare this to the book's notation:

symbol   description                   what it is
X        upper case bold letter        matrix (except data)
X        upper case bold letter        vector random variable
x        lower case bold letter        vector
x        lower case non-bold letter    scalar
X        bigger upper case bold letter data matrix

There are clearly pros and cons. Here are possible points where confusion may arise using my notation:

1. Random variables. I'll use $X$ to refer both to random variables and to the data matrix. It will usually be obvious from the context which I am talking about. In particular, if I write $X \sim f(x)$, $E[X]$, et cetera, i.e. anytime I make a probability statement, I am referring to the random variable.

2. Subscripting. We will sometimes talk about a collection of random vectors $X_1, \ldots, X_n$ where each $X_i$ is a vector. We will also talk about the data matrix entries $X_{ij}$, which might look like the $j$th entry of the $i$th random vector. Again, it will hopefully be clear from context which I mean, and if not, I'll make an effort to point it out. As a rule of thumb, something with a single subscript $i$ will usually be the $i$th vector in a collection, and something with a single subscript $j$ will (usually) refer to the $j$th component of a vector (also see the next point).

3. Indexing. The book uses $j$ to index observations and $k$ to index variables. I will use $i$ to index observations and $j$ to index variables. The notation I use is more common in statistics, so if I tried
to switch to the book's notation I would inevitably fall back into my usual habit, causing even more confusion. I apologize in advance for having to keep track of different notation.

Random sampling

By and large we will assume that our data $x = (x_1, \ldots, x_p)'$ 1 are independent realizations of a vector random variable $X$ with a density $f : \mathbb{R}^p \to \mathbb{R}$, that is, $X \sim f$. When we write $X \sim f$, we mean the data distribution has a density satisfying $\int_{\mathbb{R}^p} f(x)\,dx = 1$. 2

We commonly need to partition $X$ as

$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix}$

1 A prime ($'$) will refer to the transpose of a vector or matrix, i.e. the object with row and column indices switched. By default, vectors are column vectors, so their transposes are row vectors.
2 A common situation in which independence is violated is in time series applications or longitudinal studies, but the principles we learn by studying independence can be applied to develop methods for non-independent samples.

where $X_1, X_2$ are random vectors of dimension $p_1, p_2$ with $p_1 + p_2 = p$. Then the marginal density of $X_1$ is

$f_1(x_1) = \int_{\mathbb{R}^{p_2}} f(x_1, x_2)\,dx_2$

and the conditional density of $x_2$ given $x_1 = x_1^0$ is

$f(x_2 \mid x_1) = \frac{f(x_1^0, x_2)}{f_1(x_1^0)}.$

Statistical independence occurs when $f(x_2 \mid x_1^0) = f(x_2)$ for all $x_1^0 \in \mathbb{R}^{p_1}$. When $X_1$ is independent of $X_2$ we write $X_1 \perp X_2$.

Theorem 1. If $X_1 \perp X_2$ then $f(x) = f_1(x_1) f_2(x_2)$.

Bayes' theorem allows us to reverse conditional probabilities. Suppose we have random variables $(\Theta, X)$ where $\Theta \sim f(\theta)$ is the prior and $X \mid \Theta \sim f(x \mid \theta)$ is the likelihood or sampling model. Then the joint density of $(\Theta, X)$ is $f(\theta) f(x \mid \theta)$ and the marginal density of $X$ is

$f(x) = \int f(\theta) f(x \mid \theta)\,d\theta.$

Then
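A minimal numerical sketch of Bayes' theorem in the discrete case (the notes' figure code is in R, but these sketches use Python with NumPy for brevity). The prior and likelihood values below are made up purely for illustration: a parameter $\Theta$ taking two values, and the likelihood of one observed $x$ under each.

```python
import numpy as np

# Hypothetical two-point example: Theta in {0, 1}.
prior = np.array([0.3, 0.7])          # f(theta)
lik_x = np.array([0.9, 0.2])          # f(x | theta) at one observed x

# Marginal density of X: f(x) = sum_theta f(x | theta) f(theta)
marginal = np.sum(lik_x * prior)

# Posterior: f(theta | x) = f(x | theta) f(theta) / f(x)
posterior = lik_x * prior / marginal

print(marginal)          # 0.41
print(posterior)         # sums to 1
```

The posterior puts more mass on $\theta = 0$ than the prior did, because the observed $x$ is much more likely under $\theta = 0$.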
Theorem 2 (Bayes). The posterior density of $\Theta \mid X$ is the conditional distribution of parameters given observables, and is given by

$f(\theta \mid x) = \frac{f(x \mid \theta) f(\theta)}{f(x)}.$

Note: each of $\Theta$ and $X$ could be a multivariate vector or a discrete quantity (though in the latter case, we would replace densities with pmfs).

The mean $\mu$ and variance of the vector variable $X$ (when they exist) 3 are defined analogously to the univariate case.

1. The population mean vector $\mu = EX$ has components

$\mu_j = \int x_j f(x)\,dx.$

3 In general we assume both the mean and variance exist and are finite.

2. The population covariance matrix

$\Sigma = \mathrm{cov}(X) = E[(X - \mu)(X - \mu)'].$

The matrix $\Sigma$ is $p \times p$ and has entries

$\sigma_{jk} = \mathrm{cov}(X_j, X_k) = E[(X_j - \mu_j)(X_k - \mu_k)] = \int (x_j - \mu_j)(x_k - \mu_k) f(x)\,dx.$

It follows that $\Sigma$ is symmetric, i.e. $\Sigma = \Sigma'$, and non-negative definite (defined formally in the next lecture).

If we only wish to specify the first and second order moments of a random vector $X$, it is convenient to write $X \sim (\mu, \Sigma)$, keeping in mind that this does not specify a particular distribution for $X$. Some key properties of means and covariances that we use frequently are:

Remark 1. $\Sigma = E[XX'] - \mu\mu'$.

Proof. Expanding,

$(X - \mu)(X - \mu)' = XX' - \mu X' - X\mu' + \mu\mu'.$

Taking the expectation,

$\Sigma = E[XX'] - \mu\, EX' - EX\, \mu' + \mu\mu' = E[XX'] - \mu\mu'.$
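A Monte Carlo sketch of Remark 1, $\Sigma = E[XX'] - \mu\mu'$. The mean and covariance below are made-up values, not anything from the notes; we draw a large sample and check that the moment identity holds approximately.

```python
import numpy as np

# Made-up 2-dimensional example distribution.
rng = np.random.default_rng(0)
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

# Rows of X are iid draws of the random vector.
X = rng.multivariate_normal(mu, Sigma, size=200_000)

EXX = X.T @ X / len(X)                          # estimate of E[XX']
Sigma_hat = EXX - np.outer(X.mean(0), X.mean(0))  # E[XX'] - mu mu'

print(np.round(Sigma_hat, 2))                   # close to Sigma
```

With 200,000 draws the estimate agrees with $\Sigma$ to about two decimal places.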
Another property is linearity.

Theorem 3 (linearity of expectation (vector)).

$E[AX + b] = AE[X] + b = A\mu + b$
$\mathrm{cov}(AX) = A\,\mathrm{cov}(X)\,A' = A\Sigma A'.$

Proof.

$E[(AX)_j] = E\Big[\sum_k a_{jk} X_k\Big] = \sum_k a_{jk} E[X_k]$   (1)
$= (AEX)_j = (A\mu)_j.$   (2)

Now

$E[(AX - A\mu)(AX - A\mu)'] = E[A(X - \mu)(X - \mu)'A'] = A\,[E(X - \mu)(X - \mu)']\,A' = A\Sigma A',$

where the next to last step, if written fully, would involve repeated use of linearity of expectation as in (1).

Linear combinations are just a special case. If $a \in \mathbb{R}^p$ is a constant vector then $a'X$ has moments

$Ea'X = a'\mu$
$\mathrm{var}(a'X) = a'\Sigma a = \sum_{j,k} a_j \sigma_{jk} a_k.$

Partitioning vectors and matrices

We often want to partition vectors and matrices in similar fashion to what we did for random variables. If we partition $X$ as before, e.g.

$X = \begin{pmatrix} X_1 \\ X_2 \end{pmatrix},$

then the mean of $X$ and the covariance matrix are partitioned conformably:

$\mu = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}, \qquad \Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

Writing this out in a little more detail, we have
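A Monte Carlo sketch of Theorem 3: for a fixed matrix $A$ and vector $b$, $E[AX + b] = A\mu + b$ and $\mathrm{cov}(AX) = A\Sigma A'$. The distribution, $A$, and $b$ below are all made-up examples.

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.0, 3.0])
Sigma = np.array([[1.0, 0.3],
                  [0.3, 2.0]])
A = np.array([[1.0, 1.0],
              [2.0, -1.0]])
b = np.array([5.0, -5.0])

X = rng.multivariate_normal(mu, Sigma, size=200_000)
Y = X @ A.T + b                  # each row is A x_i + b

print(np.round(Y.mean(0), 2))                # close to A mu + b
print(np.round(np.cov(Y.T, bias=True), 2))   # close to A Sigma A'
```

Note that adding the constant $b$ shifts the mean but leaves the covariance unchanged, exactly as the theorem states (there is no $b$ in the covariance formula).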
$\mu = EX = \begin{pmatrix} EX_1 \\ EX_2 \end{pmatrix} = \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix}$

$\Sigma = E(X - \mu)(X - \mu)' = E\begin{pmatrix} X_1 - \mu_1 \\ X_2 - \mu_2 \end{pmatrix}\begin{pmatrix} (X_1 - \mu_1)' & (X_2 - \mu_2)' \end{pmatrix}$
$= \begin{pmatrix} E(X_1 - \mu_1)(X_1 - \mu_1)' & E(X_1 - \mu_1)(X_2 - \mu_2)' \\ E(X_2 - \mu_2)(X_1 - \mu_1)' & E(X_2 - \mu_2)(X_2 - \mu_2)' \end{pmatrix} = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$

Notice that, by symmetry, $\Sigma_{12} = \Sigma_{21}'$.

It is sometimes useful to consider instead the correlation between components of $X$, which has the same interpretation regardless of the marginal variance of $X$:

$\rho_{jk} = \mathrm{cor}(X_j, X_k) = \frac{\mathrm{cov}(X_j, X_k)}{\sqrt{\mathrm{var}(X_j)}\sqrt{\mathrm{var}(X_k)}} \in [-1, 1],$

and the correlation matrix, often denoted $P$, the $p \times p$ matrix with entries $P_{jk} = \rho_{jk}$. If $V = \mathrm{diag}(\sigma_{11}, \sigma_{22}, \ldots, \sigma_{pp})$, 4 where $\sigma_{jj}$ are the diagonal entries of $\Sigma$ (the marginal variances), then we can express $P$ as

4 This notation means the diagonal entries are given by the values inside the parentheses and all the off-diagonal entries are zero.

$P = V^{-1/2} \Sigma V^{-1/2},$

where $V^{-1/2} = \mathrm{diag}(\sigma_{11}^{-1/2}, \ldots, \sigma_{pp}^{-1/2})$.

Sample moments

We can now give some basic properties of the sample mean and sample covariance. The sample mean $\bar{x}$ is given by

$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i = \Big(\frac{1}{n}\sum_{i=1}^n x_{i1}, \ldots, \frac{1}{n}\sum_{i=1}^n x_{ip}\Big)' = (\bar{x}_1, \ldots, \bar{x}_p)'.$

Since the $x_i$ are iid realizations of a random variable $X \sim f$,

$E[\bar{X}] = E\Big[\frac{1}{n}\sum_{i=1}^n X_i\Big] = \frac{1}{n}\sum_{i=1}^n E[X_i] = (E[X_1], \ldots, E[X_p])' = (\mu_1, \ldots, \mu_p)'$
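A small deterministic sketch of $P = V^{-1/2} \Sigma V^{-1/2}$ for a made-up covariance matrix, showing how the covariance is rescaled into a correlation matrix with unit diagonal.

```python
import numpy as np

# Made-up 3x3 covariance matrix (symmetric, positive definite).
Sigma = np.array([[4.0,  1.2,  0.0],
                  [1.2,  1.0, -0.3],
                  [0.0, -0.3,  9.0]])

# V^{-1/2} = diag(sigma_jj^{-1/2}), with sigma_jj the marginal variances.
V_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(Sigma)))
P = V_inv_sqrt @ Sigma @ V_inv_sqrt

print(np.round(P, 3))
```

The diagonal of $P$ is all ones, and each off-diagonal entry is $\sigma_{jk}/\sqrt{\sigma_{jj}\sigma_{kk}}$; here $P_{12} = 1.2/(2 \cdot 1) = 0.6$ and $P_{23} = -0.3/(1 \cdot 3) = -0.1$.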
so the expectation of the sample mean is the mean $\mu$ of the random vector $X$ with density $f$.

The sample covariance matrix $S_n$ is defined by

$(S_n)_{jk} = \frac{1}{n}\sum_{i=1}^n (x_{ij} - \bar{x}_j)(x_{ik} - \bar{x}_k).$

So we can express the sample covariance matrix as a sum of matrices:

$S_n = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})'.$

Now we'll state an important result about the sample moments and prove part of it (see the book for the rest of the proof).

Theorem 4. The covariance of the sample mean is

$\mathrm{cov}(\bar{X}) = \frac{1}{n}\Sigma$

and the expectation of the sample covariance is

$E[S_n] = \frac{n-1}{n}\Sigma,$

so $n(n-1)^{-1} S_n = S$ is an unbiased estimator of $\Sigma$.

Proof. We prove the second part; see the book for the first part. We have

$\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})' = \sum_{i=1}^n (X_i - \bar{X})X_i' = \sum_{i=1}^n X_i X_i' - n\bar{X}\bar{X}',$

since $\sum_{i=1}^n (X_i - \bar{X}) = 0$ and $\bar{X}' = n^{-1}\sum_{i=1}^n X_i'$. So then

$E\Big[\sum_{i=1}^n (X_i - \bar{X})(X_i - \bar{X})'\Big] = \sum_{i=1}^n E[X_i X_i'] - nE[\bar{X}\bar{X}'].$

Now applying Remark 1 (to $X_i$, which has mean $\mu$ and covariance $\Sigma$, and to $\bar{X}$, which has mean $\mu$ and covariance $n^{-1}\Sigma$ by the first part), we have

$E[S_n] = \frac{1}{n}\sum_{i=1}^n E[X_i X_i'] - E[\bar{X}\bar{X}'] = \Sigma + \mu\mu' - (n^{-1}\Sigma + \mu\mu') = \frac{n-1}{n}\Sigma.$
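A deterministic sketch contrasting $S_n$ (divide by $n$) with the unbiased $S = n(n-1)^{-1} S_n$ (divide by $n-1$), on a small made-up data matrix whose rows are observations. NumPy's `np.cov` uses the $n-1$ divisor by default, which serves as a check.

```python
import numpy as np

# Made-up data matrix: n = 4 observations, p = 2 variables.
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [2.0, 4.0],
              [0.0, 2.0]])
n = X.shape[0]
xbar = X.mean(axis=0)

Sn = (X - xbar).T @ (X - xbar) / n   # sample covariance S_n (biased)
S = n / (n - 1) * Sn                 # unbiased estimator S

# np.cov expects variables in rows, and divides by n-1 by default.
print(np.allclose(S, np.cov(X.T)))
```

For large $n$ the distinction matters little, since $(n-1)/n \to 1$, but the bias is visible for small samples like this one.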
We also can define the sample correlation matrix $R$ by

$R_{jk} = \frac{s_{jk}}{\sqrt{s_{jj}}\sqrt{s_{kk}}},$

with $s_{jk} = (S)_{jk}$. If we put $D^{1/2} = \mathrm{diag}(s_{11}^{1/2}, \ldots, s_{pp}^{1/2})$ then

$R = D^{-1/2} S D^{-1/2}.$

Finally, note that the law of large numbers and the central limit theorem also work for vectors. We will give these results without proof; if you are interested, there are many references. For our purposes, it is important just to know that these key asymptotic results also hold for vectors.

Theorem 5 (Multivariate weak law of large numbers). Let $X_1, \ldots, X_n$ be a sequence of iid length-$p$ random vectors with finite mean $\mu$. Let $\bar{X}_n = n^{-1}\sum_{i=1}^n X_i$. Then

$P\big[\|\bar{X}_n - \mu\| \ge \epsilon\big] \to 0$ as $n \to \infty$

for all $\epsilon > 0$, where $\|x\| = \sum_{j=1}^p |x_j|$ is the $L_1$ norm.

Theorem 6 (Multivariate central limit theorem). Let $X_1, \ldots, X_n$ be a sequence of iid length-$p$ random vectors with finite mean $\mu$ and finite covariance $\Sigma$. Then

$\sqrt{n}(\bar{X}_n - \mu) \xrightarrow{D} \mathrm{No}(0, \Sigma).$

In Theorem 6, $\mathrm{No}(0, \Sigma)$ is the multivariate normal distribution with mean 0 and covariance $\Sigma$, which we will soon characterize.

Finally, we will briefly mention the notion of generalized variance, which the book (and numerous other sources) defines as $|S|$, the determinant of $S$. This is a sensible way to summarize the variability of the sample in a single number, and we will revisit it after we have reviewed a bit more linear algebra, which makes its properties easier to understand.
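A sketch of the sample correlation matrix $R = D^{-1/2} S D^{-1/2}$ computed from made-up data, with NumPy's `np.corrcoef` as an independent check (it computes the same matrix).

```python
import numpy as np

# Made-up data: rows are observations, columns are variables.
rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))

S = np.cov(X.T)                                  # unbiased sample covariance
D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(S)))  # D^{-1/2}
R = D_inv_sqrt @ S @ D_inv_sqrt

print(np.allclose(R, np.corrcoef(X.T)))          # True
```

Note that the $n$ versus $n-1$ divisor cancels in the ratio $s_{jk}/\sqrt{s_{jj} s_{kk}}$, so $R$ is the same whether it is built from $S$ or from $S_n$.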
Vector norms and Mahalanobis topology

When dealing with univariate random variables, the notion of magnitude is relatively straightforward. We generally take the magnitude of a real number to be its absolute value, which is of course equal to the Euclidean or $L_2$ norm $\sqrt{x^2}$ when $x$ is unidimensional. However, for vectors the definition of magnitude is more subtle. One notion we have already mentioned is the $L_1$ norm, $\|x\| = \sum_j |x_j|$, the sum of the absolute values of the entries. But this isn't equivalent to the $L_2$ norm for vectors:

$\|x\|_2 = \langle x, x\rangle^{1/2} = \sqrt{\sum_j x_j^2} \le \sum_j \sqrt{x_j^2} = \sum_j |x_j|$

by the triangle inequality, where $\langle x, y\rangle = x'y$ is the inner (or dot) product. The $L_2$ norm is arguably the default way to measure the magnitude of vectors, and it induces a metric on $\mathbb{R}^p$ via

$d(x, y) = \|x - y\|_2 = \sqrt{\sum_j (x_j - y_j)^2}.$   (3)

This is referred to as the Euclidean metric, which corresponds to the familiar straight-line distance in $\mathbb{R}^p$.

When considering distance between data points, it may not make sense to use the Euclidean metric. To understand why, it helps to know a little about quadratic forms.

Definition 1. For a $p$-vector $x$ and a $p \times p$ symmetric matrix $\Lambda$, a quadratic form is given by the matrix product $x'\Lambda x$.

The expectation of quadratic forms is simple.

Theorem 7. Let $X$ be a random vector with finite mean $\mu$ and finite covariance $\Sigma$. Then

$E[X'\Lambda X] = \mathrm{tr}(\Lambda\Sigma) + \mu'\Lambda\mu.$

Proof. Note: we use properties of the trace of a matrix that will be discussed in the next lecture.

$E[X'\Lambda X] = \mathrm{tr}(E[X'\Lambda X]) = E[\mathrm{tr}(X'\Lambda X)] = E[\mathrm{tr}(\Lambda XX')] = \mathrm{tr}\,E[\Lambda XX'] = \mathrm{tr}(\Lambda E[XX']) = \mathrm{tr}(\Lambda(\Sigma + \mu\mu')) = \mathrm{tr}(\Lambda\Sigma) + \mathrm{tr}(\Lambda\mu\mu') = \mathrm{tr}(\Lambda\Sigma) + \mu'\Lambda\mu.$

In Theorem 7, $\mathrm{tr}$ is the trace of a matrix, which is the sum of its diagonal elements. Using this result, we can understand the average distance between points in an iid sample.
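A Monte Carlo sketch of Theorem 7, $E[X'\Lambda X] = \mathrm{tr}(\Lambda\Sigma) + \mu'\Lambda\mu$, with made-up $\mu$, $\Sigma$, and a symmetric $\Lambda$.

```python
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([1.0, 2.0])
Sigma = np.array([[1.0, 0.4],
                  [0.4, 2.0]])
Lam = np.array([[2.0, 0.5],
                [0.5, 1.0]])

X = rng.multivariate_normal(mu, Sigma, size=500_000)

# x_i' Lam x_i for each row x_i of X.
qf = np.einsum('ij,jk,ik->i', X, Lam, X)

# Theorem 7's formula: tr(Lam Sigma) + mu' Lam mu.
theory = np.trace(Lam @ Sigma) + mu @ Lam @ mu

print(np.round(qf.mean(), 2), np.round(theory, 2))
```

Here $\mathrm{tr}(\Lambda\Sigma) = 4.4$ and $\mu'\Lambda\mu = 8$, so the theoretical value is 12.4, and the sample average of the quadratic forms matches it closely.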
Theorem 8 (Expected Euclidean distance). Suppose $X, Y$ are independent, identically distributed random vectors with mean $\mu$ and covariance $\Sigma$. Then

$E[\|X - Y\|_2^2] = 2\sum_j \sigma_{jj}.$

Proof. The vector $X - Y$ has mean 0 and, by independence, covariance $2\Sigma$, so applying Theorem 7,

$E[\|X - Y\|_2^2] = E[(X - Y)'I(X - Y)] = \mathrm{tr}(I \cdot 2\Sigma) + (E[X - Y])'I(E[X - Y]) = 2\,\mathrm{tr}(\Sigma) = 2\sum_j \sigma_{jj}.$

Thus, the Euclidean distance between sample points will depend on the variances. This is an undesirable property, since we'd like to be able to interpret distances between sample points in the same way for all samples: we'd like to have a common scale that means something similar no matter how our data were generated. So it makes sense to have a distance metric that scales by the inverse of the variances. We'll actually do a bit more than that, and focus on Mahalanobis distances. 5

Definition 2 (Mahalanobis distance). Given a $p \times p$ symmetric, positive-definite matrix $\Lambda$, the Mahalanobis distance $\|x - y\|_\Lambda$ between $p$-vectors $x$ and $y$ with respect to $\Lambda$ is given by

5 Don't worry about the terms symmetric and positive definite for now; we'll soon define them.

$\|x - y\|_\Lambda = \sqrt{(x - y)'\Lambda^{-1}(x - y)} = d_\Lambda(x, y).$

If we put $\Lambda = \Sigma$ in the definition above, and measure distances using $\|x - y\|_\Sigma$, we have

$E[\|X - Y\|_\Sigma^2] = E[(X - Y)'\Sigma^{-1}(X - Y)] = \mathrm{tr}(\Sigma^{-1} \cdot 2\Sigma) = 2\,\mathrm{tr}(I) = 2p,$

no matter the value of $\Sigma$. 6 Additional motivation for using $\|x - y\|_\Sigma$ will be offered when we study the multivariate normal distribution.

How do distances in the Mahalanobis metric compare to the straight-line distances that we are used to? The best way to answer this is geometric. In the usual Euclidean metric, the set of all points $x$ equidistant from a single point $y$ are those that lie on the perimeters of circles centered at $y$, and the set of all points a distance $m$ from $y$ is given by the equation of the circle

$(x - y)'(x - y) = m^2.$
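A deterministic sketch of the Mahalanobis distance $\|x - y\|_\Lambda = \sqrt{(x-y)'\Lambda^{-1}(x-y)}$, using the same $2 \times 2$ matrix $\Lambda$ (correlation .9) that appears in the figure, with made-up points $x$ and $y$.

```python
import numpy as np

Lam = np.array([[1.0, 0.9],
                [0.9, 1.0]])
x = np.array([1.0, 1.0])   # made-up point, along the high-correlation direction
y = np.array([0.0, 0.0])

d = x - y
# Solve Lam z = d rather than forming Lam^{-1} explicitly.
mahalanobis = np.sqrt(d @ np.linalg.solve(Lam, d))
euclidean = np.sqrt(d @ d)

print(mahalanobis, euclidean)
```

Here the Euclidean distance is $\sqrt{2} \approx 1.41$ while the Mahalanobis distance is $\sqrt{2/1.9} \approx 1.03$: movement along the direction in which $\Lambda$ says the data vary together counts for less, which is exactly the ellipse geometry described below.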
In the Mahalanobis metric, the set of all points equidistant from $y$ is given by an ellipse with axis lengths proportional to the inverse variances and orientation determined by the off-diagonal entries of $\Sigma^{-1}$.

6 The notation $\Sigma^{-1}$ refers to the inverse of the matrix $\Sigma$, that is, the matrix that when multiplied by $\Sigma$ gives the identity.
Another way to think about this is that the action of the matrix $\Sigma^{1/2}$ on a vector $x$ via the matrix product $\Sigma^{1/2}x$ is to rotate and stretch the original vector. 7 You can think equivalently in terms of rotating and scaling the coordinate axes in $p$-dimensional space. Figure 1 shows the set of points that are distance 1 from the origin in the Euclidean metric (a circle) and the set of points equidistant from the origin in the Mahalanobis metric $d_\Lambda$ for

$\Lambda = \begin{pmatrix} 1 & .9 \\ .9 & 1 \end{pmatrix}.$

7 Don't know what $\Sigma^{1/2}$ is? Don't worry, we'll get to that shortly.

library(ellipse)
library(ggplot2)

pts <- ellipse(matrix(c(1, .9, .9, 1), 2, 2), centre = c(0, 0))
df <- data.frame(x = pts[, 1], y = pts[, 2])
pts2 <- ellipse(matrix(c(1, 0, 0, 1), 2, 2), centre = c(0, 0))
df2 <- data.frame(x = pts2[, 1], y = pts2[, 2])
df$cor <- .9
df2$cor <- 0
df <- rbind(df, df2)
df$cor <- as.factor(df$cor)
ggplot(df, aes(x = x, y = y, col = cor)) + geom_path()

Figure 1: the set of points equidistant from the origin in the Euclidean metric (cor = 0) and the Mahalanobis metric defined in the text (cor = .9).

Summary

We have covered a number of properties of random vectors and multivariate samples. Importantly, what we have done so far required only (1) iid observations of a random variable $X$ with a density $f$, and (2) that $X$ has finite mean and covariance. We made no other assumptions about $X$ or $f$. We will soon shift focus to the study of the multivariate normal distribution. Because $\mu$ and $\Sigma$ play an important role in understanding the multivariate normal, it is easy to lose sight of the fact that the sample mean and sample covariance have meaning and certain statistical properties regardless of whether $f$ is the density of a multivariate normal. Keep this in mind as we move along.