An introduction to multivariate data


Angela Montanari (angela.montanari@unibo.it)

1 The data matrix

The starting point of any analysis of multivariate data is a data matrix, i.e. a collection of n observations on a set of p characters $X_1, \ldots, X_k, \ldots, X_p$ (they may be numeric variables, binary variables, or suitably coded categorical variables):

$$X = \begin{pmatrix} x_{11} & \ldots & x_{1k} & \ldots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \ldots & x_{ik} & \ldots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \ldots & x_{nk} & \ldots & x_{np} \end{pmatrix}$$

The element $x_{ik}$ represents the value of the k-th variable on the i-th observed unit. Each row of $X$ corresponds to an observed unit; each column of $X$ corresponds to an observed variable. The n statistical units can be thought of as n points in the p-dimensional space $\mathbb{R}^p$.

The following data matrix ($24 \times 3$) contains the length, width and height (in mm) of the carapace of 24 male painted turtles (Jolicoeur and Mosimann, 1960). There is one row per turtle and one column per variable.

Multivariate analysis is concerned either with studying the relationships between variables or with studying the similarities between units.

Table 1: Data matrix of the turtle carapace measurements (columns: length, width, height, in mm; the individual values are not reproduced in this transcription).
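To make the row and column conventions concrete, here is a minimal NumPy sketch of a data matrix of this kind; the four rows shown are made-up placeholder measurements, not the actual Jolicoeur and Mosimann values.

```python
import numpy as np

# Illustrative data matrix: one row per unit (turtle), one column per variable
# (carapace length, width, height in mm). The values are made-up placeholders,
# not the actual Jolicoeur-Mosimann measurements.
X = np.array([
    [ 98.0, 81.0, 38.0],
    [103.0, 84.0, 39.0],
    [110.0, 88.0, 42.0],
    [125.0, 96.0, 46.0],
])

n, p = X.shape        # n observed units, p observed variables
print(n, p)           # 4 3
print(X[1, 2])        # x_{2,3}: value of the 3rd variable on the 2nd unit -> 39.0
```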

Among the first set of methods we will consider:

- Principal component analysis
- Factor analysis
- Discriminant analysis

In the second set we will deal with:

- Clustering methods

Starting from the data matrix $X$, a series of different matrices can be derived. We will first concentrate on the matrices dealing with relationships between variables, leaving the theme of measuring dissimilarities between units to when we deal with clustering.

2 The average vector

When dealing with numeric variables, we might be interested in associating to each variable its arithmetic mean. The p means can be collected in a p-dimensional vector

$$\bar{x} = \begin{pmatrix} \bar{x}_1 \\ \vdots \\ \bar{x}_k \\ \vdots \\ \bar{x}_p \end{pmatrix} = \frac{1}{n} X' 1_n,$$

where $1_n$ denotes the n-dimensional vector of ones, so that, equivalently, $\bar{x}' = \frac{1}{n} 1_n' X$.
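A short NumPy check of the matrix expression for the mean vector, reusing the placeholder matrix from the sketch above:

```python
import numpy as np

X = np.array([[ 98.0, 81.0, 38.0],
              [103.0, 84.0, 39.0],
              [110.0, 88.0, 42.0],
              [125.0, 96.0, 46.0]])   # placeholder data matrix
n = X.shape[0]
ones = np.ones(n)                     # the vector 1_n

xbar = X.T @ ones / n                 # mean vector (1/n) X' 1_n
print(xbar)
print(np.allclose(xbar, X.mean(axis=0)))   # True: same as the column-wise means
```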

3 The mean centered data matrix

In certain applications it might be useful to express the variables as deviations from the mean. The data matrix becomes

$$\tilde{X} = \begin{pmatrix} \tilde{x}_{11} & \ldots & \tilde{x}_{1k} & \ldots & \tilde{x}_{1p} \\ \vdots & & \vdots & & \vdots \\ \tilde{x}_{i1} & \ldots & \tilde{x}_{ik} & \ldots & \tilde{x}_{ip} \\ \vdots & & \vdots & & \vdots \\ \tilde{x}_{n1} & \ldots & \tilde{x}_{nk} & \ldots & \tilde{x}_{np} \end{pmatrix}$$

where $\tilde{x}_{ik} = x_{ik} - \bar{x}_k$. In matrix form

$$\tilde{X} = X - 1_n \bar{x}' = X - \frac{1}{n} 1_n 1_n' X = \left( I_n - \frac{1}{n} 1_n 1_n' \right) X = AX$$

where $A$ is the so-called centering matrix. $A$ is square ($n \times n$), symmetric and idempotent. Each column of $\tilde{X}$ has zero sum (zero mean). The matrix $\tilde{X}$ defines a translation of the origin of the original reference system: the shape of the point cloud remains unchanged, but the origin of the axes is moved to $\bar{x}$.

4 The standardized data matrix

If one wants to eliminate the effect of different scales on the observed variables, one can resort to the standardized data matrix

$$Z = \begin{pmatrix} z_{11} & \ldots & z_{1k} & \ldots & z_{1p} \\ \vdots & & \vdots & & \vdots \\ z_{i1} & \ldots & z_{ik} & \ldots & z_{ip} \\ \vdots & & \vdots & & \vdots \\ z_{n1} & \ldots & z_{nk} & \ldots & z_{np} \end{pmatrix}$$

where

$$z_{ik} = \frac{\tilde{x}_{ik}}{\sqrt{Var(X_k)}} = \frac{x_{ik} - \bar{x}_k}{\sqrt{Var(X_k)}}.$$

If we denote by $D$ the $p \times p$ diagonal matrix having the variances of the observed variables on the main diagonal, the standardized data matrix can be defined as $Z = \tilde{X} D^{-1/2}$. Each column of $Z$ has zero mean and unit variance.
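The centering matrix and the standardized data matrix can be checked numerically; a minimal sketch on the placeholder matrix (note that NumPy's `var` divides by n by default, matching the 1/n convention of these notes):

```python
import numpy as np

X = np.array([[ 98.0, 81.0, 38.0],
              [103.0, 84.0, 39.0],
              [110.0, 88.0, 42.0],
              [125.0, 96.0, 46.0]])   # placeholder data matrix
n = X.shape[0]

A = np.eye(n) - np.ones((n, n)) / n   # centering matrix A = I_n - (1/n) 1_n 1_n'
print(np.allclose(A, A.T))            # symmetric
print(np.allclose(A @ A, A))          # idempotent

X_tilde = A @ X                       # mean-centered data matrix
print(np.allclose(X_tilde.sum(axis=0), 0))   # each column has zero sum

# Standardized data matrix Z = X_tilde D^{-1/2}
D_inv_sqrt = np.diag(1 / np.sqrt(X.var(axis=0)))
Z = X_tilde @ D_inv_sqrt
print(np.allclose(Z.mean(axis=0), 0), np.allclose(Z.var(axis=0), 1))   # True True
```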

5 The covariance matrix

The covariance between two variables $X_k$ and $X_h$ is defined as

$$Cov(X_k, X_h) = \frac{1}{n} \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)(x_{ih} - \bar{x}_h) = \frac{1}{n} \sum_{i=1}^{n} \tilde{x}_{ik} \tilde{x}_{ih} = \frac{1}{n} \tilde{x}_k' \tilde{x}_h$$

where $\tilde{x}_k$ and $\tilde{x}_h$ are the k-th and the h-th columns of $\tilde{X}$ respectively. It is worth remembering that the covariance between a variable and itself is simply the variance of that variable. Variances and covariances can then be summarized in the so-called covariance matrix $S$:

$$S = \frac{1}{n} \tilde{X}' \tilde{X} = \frac{1}{n} \left( X - 1_n \bar{x}' \right)' \left( X - 1_n \bar{x}' \right) = \begin{pmatrix} Var(X_1) & \ldots & Cov(X_1, X_k) & \ldots & Cov(X_1, X_p) \\ \vdots & & \vdots & & \vdots \\ Cov(X_k, X_1) & \ldots & Var(X_k) & \ldots & Cov(X_k, X_p) \\ \vdots & & \vdots & & \vdots \\ Cov(X_p, X_1) & \ldots & Cov(X_p, X_k) & \ldots & Var(X_p) \end{pmatrix} = \begin{pmatrix} s_{11} & \ldots & s_{1k} & \ldots & s_{1p} \\ \vdots & & \vdots & & \vdots \\ s_{k1} & \ldots & s_{kk} & \ldots & s_{kp} \\ \vdots & & \vdots & & \vdots \\ s_{p1} & \ldots & s_{pk} & \ldots & s_{pp} \end{pmatrix}$$

where the diagonal elements are the variances and the off-diagonal elements are the covariances. The covariance matrix has many relevant properties:

- it is square ($p \times p$);
- it is symmetric;
- it is positive semi definite;
- its trace is the so-called total variance, $tr(S) = \sum_{k=1}^{p} Var(X_k)$.

In order to gain an intuition of why the covariance matrix is positive semi definite, consider the simple case where only two variables have been observed. Because of the symmetry property, their covariance matrix is

$$S = \begin{bmatrix} s_{11} & s_{12} \\ s_{12} & s_{22} \end{bmatrix}$$

This matrix is positive semi definite if its determinant is greater than or equal to 0:

$$\det S = s_{11} s_{22} - s_{12}^2 \geq 0.$$

After dividing both sides of the inequality by $s_{11} s_{22}$ we obtain

$$1 \geq \frac{s_{12}^2}{s_{11} s_{22}}.$$

This inequality is always true, as $\frac{s_{12}^2}{s_{11} s_{22}} = r_{12}^2$ is the squared correlation coefficient between $X_1$ and $X_2$ which, by definition, can only take values between 0 and 1, both included. If $r_{12}^2$ is equal to 1, $S$ is positive semi definite; for all values of $r_{12}^2$ other than 1, $S$ is positive definite.
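A numerical illustration of these properties on the placeholder matrix; note that `np.cov` divides by n − 1 unless `bias=True` is passed, whereas these notes divide by n:

```python
import numpy as np

X = np.array([[ 98.0, 81.0, 38.0],
              [103.0, 84.0, 39.0],
              [110.0, 88.0, 42.0],
              [125.0, 96.0, 46.0]])   # placeholder data matrix
n = X.shape[0]
X_tilde = X - X.mean(axis=0)          # mean-centered matrix

S = X_tilde.T @ X_tilde / n           # S = (1/n) X_tilde' X_tilde
print(np.allclose(S, S.T))                                 # symmetric
print(np.allclose(S, np.cov(X, rowvar=False, bias=True)))  # bias=True divides by n
print(np.isclose(np.trace(S), X.var(axis=0).sum()))        # trace = total variance
print((np.linalg.eigvalsh(S) >= -1e-9).all())              # positive semi definite
```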

6 The correlation matrix

The covariance between two standardized variables $Z_k$ and $Z_h$ is defined as

$$Cov(Z_k, Z_h) = \frac{1}{n} \sum_{i=1}^{n} (z_{ik} - \bar{z}_k)(z_{ih} - \bar{z}_h) = \frac{1}{n} \sum_{i=1}^{n} z_{ik} z_{ih} = \frac{1}{n} z_k' z_h$$

because of the zero mean property of standardized variables. If we replace $z_{ik}$ and $z_{ih}$ by their expressions as functions of the observed variables (see Section 4) we obtain

$$Cov(Z_k, Z_h) = \frac{1}{n} \sum_{i=1}^{n} z_{ik} z_{ih} = \frac{\frac{1}{n} \sum_{i=1}^{n} (x_{ik} - \bar{x}_k)(x_{ih} - \bar{x}_h)}{\sqrt{Var(X_k) Var(X_h)}} = \frac{Cov(X_k, X_h)}{\sqrt{Var(X_k) Var(X_h)}} = r_{kh}.$$

This means that the covariance between two standardized variables coincides with their correlation. The correlation of a variable with itself is equal to 1, as is the variance of a standardized variable. In matrix form we have

$$R = \frac{1}{n} Z'Z = \frac{1}{n} D^{-1/2} \tilde{X}' \tilde{X} D^{-1/2} = D^{-1/2} S D^{-1/2} = \begin{pmatrix} 1 & \ldots & r_{1k} & \ldots & r_{1p} \\ \vdots & & \vdots & & \vdots \\ r_{k1} & \ldots & 1 & \ldots & r_{kp} \\ \vdots & & \vdots & & \vdots \\ r_{p1} & \ldots & r_{pk} & \ldots & 1 \end{pmatrix}$$

$R$ is the correlation matrix; it has many relevant properties:

- it is square ($p \times p$);
- it is symmetric;
- it is positive semi definite (as it is the covariance matrix of the standardized variables);
- all its diagonal elements are equal to 1; therefore its trace is $tr(R) = p$.
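The equivalent expressions for $R$ can be checked numerically; a minimal sketch on the placeholder matrix:

```python
import numpy as np

X = np.array([[ 98.0, 81.0, 38.0],
              [103.0, 84.0, 39.0],
              [110.0, 88.0, 42.0],
              [125.0, 96.0, 46.0]])   # placeholder data matrix
n = X.shape[0]
X_tilde = X - X.mean(axis=0)
S = X_tilde.T @ X_tilde / n

D_inv_sqrt = np.diag(1 / np.sqrt(np.diag(S)))   # D^{-1/2}
R = D_inv_sqrt @ S @ D_inv_sqrt                 # R = D^{-1/2} S D^{-1/2}

Z = X_tilde @ D_inv_sqrt                        # standardized data matrix
print(np.allclose(R, Z.T @ Z / n))              # R = (1/n) Z'Z
print(np.allclose(R, np.corrcoef(X, rowvar=False)))
print(np.allclose(np.diag(R), 1.0), np.isclose(np.trace(R), X.shape[1]))
```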

7 Multivariate random variables and derived linear combinations

The data matrix $X$ may be thought of as describing a sequence of n empirical realizations of a p-dimensional random vector $x$. In the following we will first describe multivariate statistical methods considering random vectors (i.e. at the population level) and then derive their sample counterparts.

Let's assume we are dealing with a p-dimensional random vector $x$; we will denote by $\mu$ its p-dimensional expectation and by $\Sigma$ its $p \times p$ covariance matrix. It is worth remembering that, for mean centered random variables, $\Sigma = E(xx')$.

Most of the multivariate statistical methods we will deal with in the following are based on linear combinations of the components of a random vector. We will write such a linear combination as $y = a'x$, where $a$ is the p-dimensional vector of coefficients. Note that $y$ is a scalar random variable. In case the interest is in more than one linear combination, say m, the vectors of coefficients will be the columns of a $p \times m$ matrix $A$, and the m linear combinations will be the components of the m-dimensional random vector $y = A'x$.

Linear combinations defined by an orthogonal matrix $A$ describe an axes rotation in the multidimensional space. Again the simpler two-variable case may help in understanding why. Figure 1 presents the coordinates of a point P, both in the original reference system $X_1, X_2$ and in the new reference system $Y_1, Y_2$ obtained after rotating the system $X_1, X_2$ by an angle $\alpha$. The coordinates of the point P in the original reference system are $(x_1, x_2)$.

[Figure 1: Axes rotation. The original axes $X_1, X_2$ and the rotated axes $Y_1, Y_2$ are shown, together with the coordinates of the point P in both systems.]

The coordinates of the same point in the rotated reference system, $(y_1, y_2)$, can be obtained from the original ones as

$$y_1 = x_1 \cos\alpha + x_2 \sin\alpha$$
$$y_2 = -x_1 \sin\alpha + x_2 \cos\alpha$$

or, with a notation coherent with the one we used before, as

$$y_1 = a_1' x, \qquad y_2 = a_2' x$$

where, because of the identity $\sin^2\alpha + \cos^2\alpha = 1$, both $a_1 = (\cos\alpha, \sin\alpha)'$ and $a_2 = (-\sin\alpha, \cos\alpha)'$ are unit norm vectors. The rotated coordinates are therefore a linear combination of the original ones.
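A small NumPy sketch of such a rotation, with an arbitrary angle $\alpha = \pi/6$; it verifies that $A$ is orthogonal and that the rotation preserves distances from the origin:

```python
import numpy as np

alpha = np.pi / 6                       # arbitrary rotation angle (30 degrees)
a1 = np.array([ np.cos(alpha), np.sin(alpha)])   # coefficients of y_1
a2 = np.array([-np.sin(alpha), np.cos(alpha)])   # coefficients of y_2
A = np.column_stack([a1, a2])           # orthogonal matrix with columns a_1, a_2

print(np.allclose(A.T @ A, np.eye(2)))  # unit-norm, mutually orthogonal columns

x = np.array([2.0, 1.0])                # a point P in the original coordinates
y = A.T @ x                             # its coordinates in the rotated system
print(np.isclose(np.linalg.norm(y), np.linalg.norm(x)))   # distances preserved
```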

The expected value and the variance of a single linear combination are

$$E(y) = E(a'x) = a' E(x) = a' \mu$$

and

$$V(y) = V(a'x) = a' V(x) a = a' \Sigma a.$$

The expected value and the covariance matrix of multiple linear combinations are

$$E(y) = E(A'x) = A' E(x) = A' \mu$$

and

$$V(y) = V(A'x) = A' V(x) A = A' \Sigma A.$$

Exercise 1. Given a bi-dimensional random vector $x = (x_1, x_2)'$ with expected value $\mu = (\mu_1, \mu_2)'$ and covariance matrix

$$\Sigma = \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix}$$

consider the two linear combinations $y_1 = x_1 - x_2$ and $y_2 = x_1 + x_2$ and derive their expected value and covariance (the solution will be provided in class).

Exercise 2. Consider three independent standardized variables $Z_1, Z_2, Z_3$. Assume you transform them as follows, obtaining three new variables $Y_1, Y_2, Y_3$:

$$Y_1 = Z_1, \qquad Y_2 = Y_1 + Z_2, \qquad Y_3 = 10 Z_3.$$

Derive the covariance matrix of the new $Y$ variables.
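These formulas can be verified by Monte Carlo simulation. In the sketch below, $\mu$, $\Sigma$ and the coefficient matrix $A$ are arbitrary illustrative choices (deliberately not those of the exercises), and Gaussian sampling is used only for convenience:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary illustrative choices (not those of the exercises)
mu = np.array([1.0, 2.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
A = np.array([[1.0, 3.0],
              [2.0, -1.0]])             # columns a_1, a_2 define y = A'x

E_y = A.T @ mu                          # theoretical expectation A' mu
V_y = A.T @ Sigma @ A                   # theoretical covariance matrix A' Sigma A

# Monte Carlo check: sample many x's and compare the sample moments of y = A'x
x = rng.multivariate_normal(mu, Sigma, size=500_000)
y = x @ A                               # row i holds (A' x_i)'
print(np.allclose(y.mean(axis=0), E_y, rtol=0.05))                      # True
print(np.allclose(np.cov(y, rowvar=False, bias=True), V_y, rtol=0.05))  # True
```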

Because of the properties of the covariance matrix, the variance of a linear combination is a positive semi definite quadratic form. It is interesting to study its properties as the vector $a$ varies. For this purpose let's consider again the simple case consisting of two variables only, with expectation and covariance matrix as in Exercise 1. After performing the products we obtain:

$$V(y) = a' \Sigma a = \begin{bmatrix} a_1 & a_2 \end{bmatrix} \begin{bmatrix} \sigma_{11} & \sigma_{12} \\ \sigma_{12} & \sigma_{22} \end{bmatrix} \begin{bmatrix} a_1 \\ a_2 \end{bmatrix} = a_1^2 \sigma_{11} + 2 a_1 a_2 \sigma_{12} + a_2^2 \sigma_{22}$$

If we read this variance as a function of $a_1$, we easily recognize, in the polynomial of degree 2, the equation of a parabola. The coefficient of $a_1^2$ is positive since it is a variance; the equation therefore describes a concave up parabola. The same happens if we read the variance as a function of $a_2$. This means that, as $a$ varies, $V(y)$ never reaches a finite maximum, but only a minimum.
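A quick numerical illustration with an arbitrary $2 \times 2$ covariance matrix: read as a function of $a_1$ with $a_2$ fixed, the quadratic form grows without bound and attains its minimum at the vertex $a_1 = -\sigma_{12} a_2 / \sigma_{11}$:

```python
import numpy as np

# Arbitrary illustrative covariance matrix
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

def V(a):
    """Variance of y = a'x, i.e. the quadratic form a' Sigma a."""
    return a @ Sigma @ a

a2 = 1.0                                 # hold a_2 fixed, vary a_1
for a1 in (-100.0, -1.0, 0.0, 1.0, 100.0):
    print(a1, V(np.array([a1, a2])))     # grows without bound as |a_1| increases

# The only stationary point (in a_1) is the vertex, a minimum:
a1_star = -Sigma[0, 1] * a2 / Sigma[0, 0]
print(V(np.array([a1_star, a2])) <= V(np.array([a1_star + 0.5, a2])))   # True
```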
