A Peak to the World of Multivariate Statistical Analysis

Size: px

Start display at page:

Download "A Peak to the World of Multivariate Statistical Analysis"

Ursula Gertrude Dean
5 years ago
Views:

1 A Peak to the World of Multivariate Statistical Analysis Real

2 Contents Real Real

3 Real

4 Why is it important to know a bit about the theory behind the methods? Real

5 Real Figure: Multivariate data is laughing with you (or at you).

6 Real

7 Important Points in Description of the research questions Description of the dataset Univariate and bivariate statistical analysis to present the variables Application of multivariate statistical methods to answer research questions (justification and output) Conclusions and answers to the question raised at the beginning Critical evaluation of the analysis Remember that No findings is a finding! Real

8 The first step of all statistical analysis is the univariate and bivariate analysis. First calculate the univariate means, medians, variances, max and min values. Then calculate the correlation coefficients. And take a look at your data literally! Make histograms of continuous variables and pie charts of categorical variables. Make boxplots to detect univariate outliers, and make scatter plots to detect bivariate structures. Real Note that visualization is not always easy when the data contains a large number of individuals, but do not skip plotting your data! It is very important that you get familiar with your data before you conduct any large multivariate analysis.

9 Real

10 PCA-transformation Principal Component Analysis (PCA) looks for few linear combinations of p variables, losing in the process as little information as possible. More precisely, is an orthogonal linear transformation that transforms a p-variate random vector to a new coordinate system such that, the obtained new variables are uncorrelated, and the greatest possible variance lies on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on. Real

11 PCA-transformation Let x denote a p-variate random vector with finite mean E[x] = µ, and finite covariance matrix E[(x µ)(x µ) T ] = Σ. The Principal Component Transformation is the transformation x y = Γ T (x µ), where Γ R p p is orthogonal, Γ T ΣΓ = Λ = diag(λ 1,,λ p ) is diagonal and λ 1 λ p. Real The ith component of y is called the ith principal component of x.

12 Real

13 Principal Components Theorem Let x denote a p-variate random vector with finite mean vector µ, and finite covariance matrix Σ. Let λ 1 λ 2 λ p denote the eigenvalues of Σ, and let y i denote the ith principal component of x. Then 1. E[y i ] = 0, 2. var(y i ) = E[y 2 i ] = λ i, 3. cov(y i, y j ) = E[y i y j ] = 0, i j, 4. var(y 1 ) var(y p ) 0. Real

14 Proof. Let x denote a p-variate random vector with finite mean vector µ, and finite covariance matrix Σ. Let y = Γ T (x µ), where Γ R p p is orthogonal, Γ T ΣΓ = Λ = diag(λ 1,,λ p ) and λ 1 λ p. Let γ i denote the ith column vector of Γ. Now 1. and 2., 3., 4. E[y i ] = E[γ T i (x µ)] = E[γ T i x] E[γ T i µ] = γ T i E[x] γ T i µ = γ T i µ γ T i µ = 0, Real E[(y E[y])(y E[y]) T ] = E[yy T ] = E[Γ T (x µ)(γ T (x µ)) T ] = Γ T E[(x µ)((x µ)) T ]Γ = Γ T ΣΓ = Λ.

15 Maximizing Variance Theorem Let x denote a p-variate random vector with finite mean vector µ, and finite covariance matrix Σ, and let y 1 denote the first principal component of x. Assume that a R p, a T a = 1. Then var(y 1 ) var(a T x). Real

16 Proof. Let x denote a p-variate random vector with finite mean vector µ, and finite covariance matrix Σ. Let y = Γ T (x µ), where Γ R p p is orthogonal, Γ T ΣΓ = Λ = diag(λ 1,,λ p ) is diagonal and λ 1 λ p. Let γ i denote the ith column of Γ. Assume that a R p, a T a = 1. Since the set {γ 1,...,γ p } is an orthonormal basis of R p, the vector a can be given as a = c 1 γ 1 + +c p γ p. Now, since γ T i γ i = 1, and γ T i γ j = 0 if j i, we have that Real var(a T x) = a T Σa = p c j γj T j=1 ( p λ i γ i γi T i=1 ) p c k γ k = k=1 p λ i ci 2, i=1 and since a satisfies a T a = 1, we have that p i=1 c2 i = 1. Thus, since λ 1 is the largest eigenvalue, the variance var(a T x) is maximized when c 1 = 1, and c i = 0, i 1, and consequently a = γ 1. This completes the proof.

17 Maximizing Variance Theorem Let x denote a p-variate random vector with finite mean vector µ, and finite covariance matrix Σ, and let y k denote the kth principal component of x. Let b R p, b T b = 1. Assume that b T x is uncorrelated with the first k 1 principal components of x. Then var(y k ) var(b T x). Real

18 Proof. This is homework! (The proof is almost identical to the previous proof. Just note that if b T x is uncorrelated with the first k 1 principal components of x, then b can be given as linear combination of the vectors γ k,...,γ p.) Real

19 Total Variance The sum of the first k eigenvalues divided by the sum of all eigenvalues λ 1 + +λ k λ 1 + +λ p represents the proportion of total variance explained by the first k principal components. (Total variation is here understood as the trace of Σ.) Real Note that if y = Γ T (x µ), then p k x = µ+γy = µ+ y i γ i µ+ y i γ i. i=1 i=1

20 How many components to choose? Some rules of thumb: Choose as many components as is needed in order to explain at least 90% (or 80% or 95 %) of the total variance. Real Leave out the components that correspond to "small" eigenvalues.

21 Real

22 Sample version of PCA is obtained by replacing the covariance matrix and the mean vector by their sample estimates. Each p-variate data point is transformed using the sample mean vector and the eigenvector matrix of the sample covariance matrix. Real

23 Sample PCA Let X denote a n p data matrix of n independent and identically distributed p-variate observations x 1, x 2,..., x n from some continuous distribution with finite mean vector µ, and finite covariance matrix Σ. Let x denote the sample mean vector and let G denote the eigenvector matrix of the sample covariance matrix ˆΣ, where the column vectors of G are the eigenvectors of ˆΣ such that the first vector corresponds to the largest eigenvalue, the second column vector corresponds to the second largest eigenvalue, and so on. Real The sample is now given by Y = (X 1 n x T )G. (Note that now y r = G T (x r x).)

24 Real

25 Dimension reduction Outlier detection Clustering Dimension reduction in regression analysis... Real

26 Real

27 Simulated In this example, we simulated a sample from bivariate normal distribution with mean (3, 2) T, and covariance matrix [ ] B = Real was performed. After PCA, the greatest variation is seen in the first axis.

28 t(x)[,2] Real t(x)[,1] Figure: Bivariate normal distribution.

29 t(zhat2)[,2] Real t(zhat2)[,1] Figure: Bivariate normal distribution after PCA.

30 Real Real

31 - Growth To see how PCA works in practice, let s take a look at a real data example. The data set used in this example is part of a larger sample of height measurements that were collected retrospectively from health centers and schools for construction of the Finnish growth charts. The used data set comprised 525 boys and 571 girls, fullterm, healthy singletons, followed until approximately age 19, with measurements from three to 44 occasions. Real

32 The original observations were used to estimate each individual growth curve from birth to age 19 by fitting splines. The individuals that did not have enough measurements for fitting the splines were excluded. After that, the remaining observations consisted of 829 (481 boys and 348 girls) estimated height curves. The measurements (based on estimated curves) at ages 8, 9, 10, 11, 12, 13, 14, 15, 16, 17 and 18 years were used in the analysis. Thus PCA was applied to a 11-dimensional sample with 829 observations. Real

33 PCA was first used for dimension reduction. The first principal component explained 77 %, the second 17 % and the third 4 % of the variance of the data. Thus the first, second and third principal component together already explained 98 % of the variance, and dimension was reduced to three. Real

34 Three Principal Components Mean height comp 2. comp Real comp Age Figure: Mean curve of the estimated data points and the three first principal component curves (the three first column vectors of Γ). The first principal component curve puts emphasis on overall growth (shape of the curve is similar to the mean curve), the second on late growth, and the third on growth around age 14.

35 To see how the method works on the individual level, the estimated height growth curves of one randomly chosen boy and one randomly chosen girl were presented as sums of their principal component curves. The method seems to work very well also on individual level. In these examples only two principal are needed for being very close to the curve based on splines. Real

36 Height spline average 1 comp 2 comp 3 comp Real Age Figure: Estimated growth curve of one randomly chosen boy.

37 Height spline average 1 comp 2 comp 3 comp Real Age Figure: Estimated growth curve of one randomly chosen girl.

38 Scatter plot after PCA was considered to see if PCA can separate the gender groups. Real

39 Comp Comp.2 Real Comp Figure: Scatter plot after PCA. Dark grey squares are used for the boys and light grey triangles for the girls. PCA does not work perfectly in separating the two groups, but one can still see clear differences between the groups. Boys grow later than girls! (Notice the outlying points.)

40 Real

41 Some Principal Components are not in general independent PCA is a very nonrobust method. Traditional PCA is not suitable for qualitative variables. is invariant under orthogonal transformations up to heterogeneous sign changes, but it is not affine invariant. In fact, is highly sensitive for scaling of the variables. For PCA and beyond, take the course MS-E2112 Multivariate Statistical Analysis. Real

42 Real

43 I K. V. Mardia, J. T. Kent, J. M. Bibby, Multivariate Analysis, Academic Press, London, 2003 (reprint of 1979). Real

44 II R. V. Hogg, J. W. McKean, A. T. Craig, to Mathematical Statistics, Pearson Education, Upper Sadle River, R. A. Horn, C. R. Johnson, Matrix Analysis, Cambridge University Press, New York, R. A. Horn, C. R. Johnson, Topics in Matrix Analysis, Cambridge University Press, New York, Real

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1: Introduction, Multivariate Location and Scatter

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1:, Multivariate Location Contents , pauliina.ilmonen(a)aalto.fi Lectures on Mondays 12.15-14.00 (2.1. - 6.2., 20.2. - 27.3.), U147 (U5) Exercises