
Multivariate Statistical Analysis

Selected lecture notes for the students of the Faculty of Mathematics, Physics and Informatics, Comenius University in Bratislava

Radoslav Harman
Department of Applied Mathematics and Statistics

April 19, 2018

1 Principal Components Analysis

1.1 Mathematical background

We assume that the reader is already familiar with the fundamental notions and results of matrix algebra and multivariate probability, but we will give a brief review of some of the facts that are particularly important for multivariate statistics.

Recall that a $p \times p$ matrix $\Sigma$ is non-negative definite (sometimes also called positive semidefinite) if it is symmetric and satisfies $a^T \Sigma a \geq 0$ for any vector $a \in \mathbb{R}^p$. If $\Sigma u = \lambda u$ for some $\lambda \in \mathbb{R}$ and $u \in \mathbb{R}^p$, $u \neq 0$, then $u$ is an eigenvector of $\Sigma$ and $\lambda$ is the eigenvalue of $\Sigma$ corresponding to $u$. Vectors $u_1, \ldots, u_p$ form an orthonormal system if $u_1, \ldots, u_p$ are mutually orthogonal and normalized such that $\|u_i\| = 1$ for all $i = 1, \ldots, p$. A matrix $U$ of the type $p \times p$ is an orthogonal matrix if $UU^T = I_p$, where $I_p$ denotes the $p \times p$ identity matrix. That is, $U$ is orthogonal if and only if the columns of $U$ form an orthonormal system of vectors. The linear mapping corresponding to an orthogonal matrix is a rotation or a composition of a reflection and a rotation.

Theorem 1.1 (Spectral decomposition of a non-negative definite matrix). For any non-negative definite $p \times p$ matrix $\Sigma$ there exists an orthonormal system $u_1, \ldots, u_p$ of eigenvectors. The matrix $\Sigma$ can be written in the form
$$\Sigma = \sum_{i=1}^{p} \lambda_i u_i u_i^T = U \Lambda U^T, \qquad (1)$$
where $\lambda_i$ is the eigenvalue of $\Sigma$ corresponding to the eigenvector $u_i$ for all $i = 1, \ldots, p$, $U = (u_1, \ldots, u_p)$ is the orthogonal matrix of normalized eigenvectors and $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$ is the diagonal matrix with the eigenvalues on the diagonal. If $\lambda_1 > \lambda_2 > \cdots > \lambda_p$, then the eigenvectors $u_1, \ldots, u_p$ are uniquely defined (up to a possible change of the sign).

A $p \times p$ matrix $\Sigma$ is positive definite if it is symmetric and satisfies $a^T \Sigma a > 0$ for any vector $0 \neq a \in \mathbb{R}^p$. A matrix $\Sigma$ is positive definite if and only if $\Sigma$ is a non-negative definite non-singular matrix, which is the case if and only if $\Sigma$ is a non-negative definite matrix with all eigenvalues strictly positive.

An orthogonal projector onto a $k$-dimensional linear space $\mathcal{A} \subseteq \mathbb{R}^p$ is the unique symmetric matrix $P$ of the type $p \times p$ such that $Py \in \mathcal{A}$ for all $y \in \mathbb{R}^p$, $Px = x$ for all $x \in \mathcal{A}$, and $x - Px$ is orthogonal to $Px$ for all $x \in \mathbb{R}^p$, which we denote $(x - Px) \perp Px$. If $A$ is a $p \times k$ matrix with rank $k$, where $k \leq p$, then $A^T A$ is a non-singular matrix and $P = A(A^T A)^{-1} A^T$ is the orthogonal projector onto the linear space $\mathcal{C}(A)$ generated by the columns of $A$.

For a $p$-dimensional random vector $X = (X_1, \ldots, X_p)^T$, the variance-covariance matrix is the $p \times p$ matrix $\Sigma$ with elements $\Sigma_{ij} = \mathrm{cov}(X_i, X_j)$, $i, j = 1, \ldots, p$. The variance-covariance matrix is always non-negative definite and, typically (for instance, if the random vector $X$ has a distribution continuous with respect to the Lebesgue measure in $\mathbb{R}^p$), it is also non-singular, i.e., positive definite. Geometrically, $\Sigma$ determines the shape of multivariate data generated as independent samples of $X$. More generally, by $\mathrm{Cov}(X, Z)$ we will denote the matrix of all mutual covariances of the components of a pair of random vectors $X$ and $Z$.

For multivariate statistical analysis, it is very important to know how the variance-covariance matrix and the matrix $\mathrm{Cov}$ change under linear transformations of the random vector(s).

Theorem 1.2 (The effect of a linear transformation on the variance-covariance matrix). If $X$ is a $p$-dimensional random vector with variance-covariance matrix $\Sigma$ and $A$ is an $m \times p$ matrix, then the $m$-dimensional random vector $Y = AX$ has the variance-covariance matrix $A \Sigma A^T$. More generally: if $X$ is a $p$-dimensional random vector, $Z$ is an $r$-dimensional random vector, $A$ is an $m \times p$ matrix and $B$ is a $k \times r$ matrix, then $\mathrm{Cov}(AX, BZ) = A\,\mathrm{Cov}(X, Z)\,B^T$.

Principal components are based on a rotation (i.e., a specific linear transformation) of an underlying random vector, as detailed in the next section.
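The facts reviewed in this subsection are easy to check numerically. The following is a minimal sketch (not part of the original notes) in Python with NumPy; the matrices Sigma, A and B below are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# An arbitrary non-negative definite matrix Sigma (illustrative choice).
M = rng.normal(size=(4, 4))
Sigma = M @ M.T

# Spectral decomposition (Theorem 1.1): Sigma = U diag(lambda) U^T.
lam, U = np.linalg.eigh(Sigma)          # eigh returns ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]          # reorder so lambda_1 >= ... >= lambda_p
assert np.allclose(Sigma, U @ np.diag(lam) @ U.T)
assert np.allclose(U @ U.T, np.eye(4))  # U is orthogonal

# Orthogonal projector onto the column space of a p x k matrix A.
A = rng.normal(size=(4, 2))
P = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(P, P.T) and np.allclose(P @ P, P)

# Theorem 1.2: the empirical covariance of BX is close to B Sigma B^T.
B = rng.normal(size=(3, 4))
X = rng.multivariate_normal(np.zeros(4), Sigma, size=100_000)
print(np.cov((X @ B.T).T))   # approximately B Sigma B^T
print(B @ Sigma @ B.T)
```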

1.2 Theoretical principal components

Let $\mu$ be the mean value vector and let $\Sigma$ be the variance-covariance matrix of a random vector $X = (X_1, \ldots, X_p)^T$ which corresponds to $p$-dimensional measurements or observations on $n$ objects. (We will only consider random variables with finite mean values and variances, that is, we will assume that all random vectors have well-defined, finite variance-covariance matrices.) Let $u_1, \ldots, u_p$ be an orthonormal system of eigenvectors of $\Sigma$, let $\lambda_1 \geq \cdots \geq \lambda_p$ be the corresponding eigenvalues and let $U = (u_1, \ldots, u_p)$; cf. Theorem 1.1. The vectors $u_1, \ldots, u_p$ determine what we will call the principal variance directions and $\lambda_1, \ldots, \lambda_p$ determine the variances in the principal directions.

Principal variance directions can be illustrated on an example of a random vector with a two-dimensional normal distribution. For instance, if
$$\Sigma = \begin{pmatrix} 3.25 & 1.30 \\ 1.30 & 1.75 \end{pmatrix},$$
then $u_1 \approx (0.87,\ 0.50)^T$, $u_2 \approx (-0.50,\ 0.87)^T$, $\lambda_1 \approx 4$ and $\lambda_2 \approx 1$. In Figure 1 we see 5000 points randomly generated from $N_2((0,0)^T, \Sigma)$. Arrows denote the directions of the vectors $u_1$ and $u_2$. The lengths of the arrows are proportional to $\sqrt{\lambda_1}$ and $\sqrt{\lambda_2}$. Clearly, the eigenvectors and eigenvalues of the variance-covariance matrix $\Sigma$ capture important aspects of the shape of the distribution of $X$.

The essence of principal components analysis is the rotation of $X$ (or of a random sample) to the coordinate system determined by the eigenvectors of $\Sigma$ (or by the eigenvectors of the sample variance-covariance matrix $S_n$, see Subsection 1.3). An immediate consequence of Theorems 1.1 and 1.2 is:

Theorem 1.3 (De-correlation of a random vector). The random vector $Y = U^T(X - \mu)$ has the zero mean value and its variance-covariance matrix is $\mathrm{Var}(Y) = \mathrm{diag}(\lambda_1, \ldots, \lambda_p)$. That is, the components of $Y$ are uncorrelated, their variances are $\lambda_1 \geq \cdots \geq \lambda_p$ and their standard deviations are $\sqrt{\lambda_1} \geq \cdots \geq \sqrt{\lambda_p}$.

The previous theorem gives a theoretical basis for the following definition.

Definition 1.1 (Principal components of a random vector). The random vector $Y = (Y_1, \ldots, Y_p)^T$ from Theorem 1.3 is called the vector of (theoretical) principal components of the random vector $X$. For $i = 1, \ldots, p$, the random variable $Y_i = u_i^T(X - \mu)$ is called the $i$-th principal component of the random vector $X$. (Note that if some of the eigenvalues of $\Sigma$ are equal, then there is an infinite number of possible choices of the corresponding eigenvectors, i.e., of feasible definitions of the vectors of principal components.)

It is simple to show that for all $i \in \{1, \ldots, p\}$ the random vector $Y_i u_i$ is the orthogonal projection of $X$ onto the one-dimensional subspace (i.e., a line) defined by all real multiples of $u_i$. In other words, the principal components $Y_1, \ldots, Y_p$ form (random) coordinates of $X$ in the coordinate system defined by the orthonormal vectors $u_1, \ldots, u_p$.
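The two-dimensional example above can be reproduced numerically. The following NumPy sketch (not part of the original notes) verifies the stated eigenvalues and illustrates Theorem 1.3 on 5000 simulated points; the random seed is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(1)

# The two-dimensional example from the text.
Sigma = np.array([[3.25, 1.30],
                  [1.30, 1.75]])
lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]           # decreasing order
print(lam)                                # approx. [4.0, 1.0]
print(U[:, 0])                            # approx. +/-(0.87, 0.50)

# De-correlation (Theorem 1.3): Y = U^T (X - mu) has covariance diag(lambda).
X = rng.multivariate_normal(np.zeros(2), Sigma, size=5000)
Y = (X - X.mean(axis=0)) @ U
print(np.cov(Y.T))                        # approx. diag(4, 1)
```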

[Figure 1 appears here: a scatter plot of the 5000 simulated points in the $(X_1, X_2)$-plane with arrows along the two principal variance directions.]

Figure 1: Principal variance directions defined by a pair of orthogonal eigenvectors of a two-dimensional normal distribution. In the figure, the lengths of the eigenvectors are proportional to the standard deviations of the corresponding principal components.

Theorem 1.3 states that the principal components $Y_1, \ldots, Y_p$ of a random vector $X$ are uncorrelated. Thus, the transformation to principal components is occasionally referred to as the de-correlation of $X$. Importantly, the variances of the principal components are in decreasing order. Note also that the sum of the variances of the principal components is the same as the sum of the variances of the variables $X_1, \ldots, X_p$, which is probably the basis for the expression that (all) principal components explain (all) variation in the data.

The mutual relations of the coordinates of the original random vector $X$ and the principal components of $X$ are given by the following theorem.

Theorem 1.4 (Relations of original variables and principal components). Let $Y = (Y_1, \ldots, Y_p)^T$ be the vector of principal components of the random vector $X$ from Theorem 1.3. Then, for any pair $i, j \in \{1, \ldots, p\}$, we have
$$\mathrm{cov}(X_i, Y_j) = u_{ij} \lambda_j, \qquad \rho(X_i, Y_j) = u_{ij} \sqrt{\lambda_j} / \sigma_i, \qquad (2)$$
where $u_{ij} = (u_j)_i$ is the $(i, j)$-th element of the matrix $U$ (that is, the $i$-th coordinate of the eigenvector $u_j$), and $\sigma_i = \sqrt{\mathrm{Var}(X_i)}$.

Proof. From $X_i = e_i^T X$, where $e_i$ is the $i$-th standard unit vector, and from $Y_j = u_j^T(X - \mu)$ we obtain
$$\mathrm{cov}(X_i, Y_j) = \mathrm{cov}(e_i^T X,\, u_j^T(X - \mu)) = e_i^T \mathrm{Var}(X)\, u_j = e_i^T \Big( \sum_{k=1}^p \lambda_k u_k u_k^T \Big) u_j \overset{(*)}{=} e_i^T \lambda_j u_j = \lambda_j e_i^T u_j = \lambda_j u_{ij}.$$
The equality denoted by the asterisk follows from the fact that $u_1, \ldots, u_p$ are mutually orthogonal and have unit length.

Exercise 1.1. Prove the following claim. Let $b_1, b_2 \in \mathbb{R}^p$, $b_1, b_2 \neq 0$, and let $X$ be a $p$-dimensional random vector with a positive definite covariance matrix $\Sigma$. Then the random variables $b_1^T X$ and $b_2^T X$ are uncorrelated if and only if $b_1 \perp \Sigma b_2$.

The following theorem provides an important optimization/probabilistic interpretation of principal components.

Theorem 1.5 (Maximum variance justification of principal components). The first principal component $Y_1$ of the $p$-dimensional random vector $X$ has the largest variance among all normed linear combinations of the components of $X$. Formally:
$$\mathrm{Var}(Y_1) = \max\{\mathrm{Var}(b^T X) : b \in \mathbb{R}^p, \|b\| = 1\}.$$
For $k \geq 2$, the $k$-th principal component $Y_k$ of the random vector $X$ has the largest variance among all normed linear combinations of the components of $X$ that are uncorrelated with $Y_1, \ldots, Y_{k-1}$ (see also Exercise 1.1). Formally:
$$\mathrm{Var}(Y_k) = \max\{\mathrm{Var}(b^T X) : b \in \mathbb{R}^p, \|b\| = 1, b \perp u_1, \ldots, b \perp u_{k-1}\}.$$

Proof. Let $u_1, \ldots, u_p$ be an orthonormal system of eigenvectors of the covariance matrix $\Sigma$ of the $p$-dimensional random vector $X$, and let $\lambda_1 > \cdots > \lambda_p$ be the corresponding eigenvalues. Let $b \in \mathbb{R}^p$, $\|b\| = 1$, and let $b = \sum_{i=1}^p c_i u_i$ be the expression of $b$ in the orthonormal basis $u_1, \ldots, u_p$ of the space $\mathbb{R}^p$.

Since the vectors $u_1, \ldots, u_p$ are orthonormal, we obtain $u_j^T b = \sum_i c_i u_j^T u_i = c_j$ for any $j \in \{1, \ldots, p\}$, and $\sum_i c_i^2 = \sum_i c_i^2 u_i^T u_i = \sum_i (c_i u_i)^T \sum_k (c_k u_k) = b^T b = 1$. Therefore
$$\mathrm{Var}(b^T X) = b^T \Big( \sum_{i=1}^p \lambda_i u_i u_i^T \Big) b = \sum_{i=1}^p \lambda_i c_i^2 \ \overset{(*)}{\leq}\ \lambda_1 = \mathrm{Var}(Y_1). \qquad (3)$$
The inequality denoted by the asterisk follows from the fact that $\sum_{i=1}^p c_i^2 = 1$, i.e., $\sum_{i=1}^p \lambda_i c_i^2$ is a weighted average of the eigenvalues, which of course cannot be larger than the largest eigenvalue $\lambda_1$. Since $Y_1 = u_1^T(X - \mu)$, that is, $Y_1$ is itself a normed linear combination of the coordinates of $X$ (up to a shift by a constant, which does not change the variance), we obtain the first part of the theorem.

If we have $2 \leq k \leq p$ and the additional condition $b \perp u_1, \ldots, b \perp u_{k-1}$, then $c_i = 0$ for all $i = 1, \ldots, k-1$, which implies
$$\mathrm{Var}(b^T X) = \sum_{i=1}^p \lambda_i c_i^2 = \sum_{i=k}^p \lambda_i c_i^2 \leq \lambda_k = \mathrm{Var}(Y_k)$$
in a way analogous to (3).

The transformation of a random vector to principal components has several alternative geometric/optimization interpretations. For instance, let $X \sim N_p(0, \Sigma)$ and by $P_{\mathcal{A}}$ denote the orthogonal projector onto a linear space $\mathcal{A}$. Let $\mathcal{A} \subseteq \mathbb{R}^p$ be the $k$-dimensional hyperplane that optimally fits the distribution of $X$ in the sense of least squares, i.e., $\mathcal{A}$ minimizes $E(\|X - P_{\mathcal{A}} X\|^2)$. Then it is possible to show that $\mathcal{A}$ is spanned by the eigenvectors $u_1, \ldots, u_k$ of $\Sigma$ corresponding to the $k$ largest eigenvalues.

A measure of the proportion of variance explained by the first $k$ principal components $Y_1, \ldots, Y_k$, $k \leq p$, or a goodness-of-fit measure of the $k$-dimensional hyperplane $\mathcal{A}$ from the previous paragraph, is the quantity
$$\alpha_k = \frac{\lambda_1 + \cdots + \lambda_k}{\lambda_1 + \cdots + \lambda_p}.$$
The quantity $\alpha_k$ is important for the selection of the number of principal components that capture a significant amount of the variability of the original $p$-dimensional data. It turns out that the variance-covariance matrix of multidimensional data is often such that $\alpha_k$ is close to 1 for values of $k$ that are small relative to $p$. Geometrically, this means that the ellipsoid of dispersion is thick in a few orthogonal directions and thin in all others.
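A small numerical illustration of $\alpha_k$ and of the least-squares interpretation (not part of the original notes; the covariance matrix used is an arbitrary illustrative choice): projecting onto the span of the first $k$ eigenvectors should leave a mean squared error close to $\lambda_{k+1} + \cdots + \lambda_p$.

```python
import numpy as np

rng = np.random.default_rng(2)

# An arbitrary positive definite covariance matrix (illustrative choice).
M = rng.normal(size=(5, 5))
Sigma = M @ M.T + np.eye(5)

lam, U = np.linalg.eigh(Sigma)
lam, U = lam[::-1], U[:, ::-1]                  # decreasing eigenvalues

# Proportion of variance explained by the first k principal components.
alpha = np.cumsum(lam) / lam.sum()
print(alpha)                                     # alpha_1, ..., alpha_p

# Projecting onto span(u_1, ..., u_k) leaves mean squared error
# approximately lambda_{k+1} + ... + lambda_p.
k = 2
P = U[:, :k] @ U[:, :k].T                        # orthogonal projector
X = rng.multivariate_normal(np.zeros(5), Sigma, size=100_000)
mse = np.mean(np.sum((X - X @ P.T) ** 2, axis=1))
print(mse, lam[k:].sum())                        # approximately equal
```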

1.3 Sample principal components

In real applications, the mean value $\mu$ and the variance-covariance matrix $\Sigma$ are rarely known, and the theoretical principal components from Definition 1.1 cannot be calculated. Usually, we only have a random sample $X_1, \ldots, X_n$, $n \geq p$, from an otherwise unknown distribution, and the parameters $\mu$, $\Sigma$ need to be estimated by the vector of mean values $\bar{X}_n = \frac{1}{n}\sum_{i=1}^n X_i$ and the sample variance-covariance matrix
$$S_n = \frac{1}{n}\sum_{i=1}^n (X_i - \bar{X}_n)(X_i - \bar{X}_n)^T.$$

Definition 1.2 (Sample eigenvalues and eigenvectors). The ordered eigenvalues of $S_n$ will be called sample eigenvalues and denoted by $\hat\lambda_1^{(n)} > \cdots > \hat\lambda_p^{(n)}$. The corresponding normalized eigenvectors of $S_n$ will be called sample eigenvectors and denoted by $\hat{u}_1^{(n)}, \ldots, \hat{u}_p^{(n)}$. (In theory, some of the eigenvalues of $S_n$ could be equal, but for the case of an independent sample of size $n \geq p$ from a continuous $p$-dimensional distribution, which is typical for applications, the probability of this event is zero. Moreover, even in the case of distinct eigenvalues, the eigenvectors are not uniquely defined in the sense that if $\hat{u}_i^{(n)}$ is a normalized eigenvector, then so is $-\hat{u}_i^{(n)}$. We assume that we have some consistent way of selecting which of these two eigenvectors is the normalized eigenvector; it would be counter-productive to be too mathematically rigorous at this point.)

Note that (from the point of view before collecting the data) the sample eigenvalues $\hat\lambda_i^{(n)}$ are random variables and the sample eigenvectors $\hat{u}_i^{(n)}$ are random vectors. In general, $\hat\lambda_i^{(n)} \to \lambda_i$ and $\hat{u}_i^{(n)} \to u_i$ as $n \to \infty$, but the analysis of the stochastic convergence is usually very non-trivial. For the case of a normal sample, the convergence is given by the following result. (Note, however, that in applications normality is usually not required; the aim of principal component analysis is almost always a reduction of dimensionality, with the aim to compress the data or to discover new knowledge, see Subsection 1.4, not hypothesis testing.)

Theorem 1.6. For any $i \in \{1, \ldots, p\}$, we have the following convergence in distribution:
$$\sqrt{n-1}\,\big(\hat\lambda_i^{(n)} - \lambda_i\big) \to N(0, 2\lambda_i^2), \qquad \sqrt{n-1}\,\big(\hat{u}_i^{(n)} - u_i\big) \to N_p\Big(0, \sum_{j \neq i} \frac{\lambda_i \lambda_j}{(\lambda_i - \lambda_j)^2}\, u_j u_j^T\Big).$$

Thus, if the sample size is large enough ($n \gg p$), the values of $\hat\lambda_i^{(n)}$ and $\hat{u}_i^{(n)}$ can usually be taken as reliable approximations of the eigenvalues and eigenvectors of $\Sigma$, and the speed of convergence of the estimators is approximately $\sqrt{n}$.

Note, however, that the variances and covariances of the limiting distributions depend on the estimated parameters. Moreover, the variance-covariance matrix of the $\hat{u}_i^{(n)}$'s tends to be large if some of the eigenvalues are very similar.

For actual realizations $x_1, \ldots, x_n$ of the random vectors $X_1, \ldots, X_n$ (the realizations $x_1, \ldots, x_n$ are sometimes called feature vectors of the objects), it is often useful to compute the vectors
$$y_i = (\hat{u}_1^{(n)}, \ldots, \hat{u}_p^{(n)})^T (x_i - \bar{x}), \quad i = 1, \ldots, n,$$
which are called principal component scores. These vectors are estimates of the coordinates of $x_i$, $i = 1, \ldots, n$, in the coordinate system determined by the principal component directions. In other words, assume that $\mathcal{A}$ is the $k$-dimensional hyperplane that best fits the data in the sense of least squares in the $p$-dimensional space ($\mathcal{A}$ is sometimes called the perpendicular regression hyperplane) and let $z_1, \ldots, z_n$ be the orthogonal projections of the feature vectors $x_1, \ldots, x_n$ onto $\mathcal{A}$. Then the first $k$ coordinates of the scores $y_1, \ldots, y_n$ correspond to the coordinates of $z_1, \ldots, z_n$ in the hyperplane $\mathcal{A}$. They can be used to represent the $n$ objects in the $k$-dimensional space, usually in the plane ($k = 2$).

The proportion $\alpha_k$, $k \in \{1, \ldots, p\}$, of the variance explained by the first $k$ principal components can be estimated by
$$\hat\alpha_k^{(n)} = \frac{\hat\lambda_1^{(n)} + \cdots + \hat\lambda_k^{(n)}}{\hat\lambda_1^{(n)} + \cdots + \hat\lambda_p^{(n)}}.$$
For the random variables $\hat\alpha_k^{(n)}$, a theorem similar to Theorem 1.6 can be proved in the case of normality, stating that the random variables $\hat\alpha_k^{(n)}$ converge to the values $\alpha_k$ with a speed proportional to $\sqrt{n}$.

The values $\hat\lambda_i^{(n)}$ and $\hat\alpha_k^{(n)}$ form the basis of several rules of thumb for choosing an appropriate number $k$, i.e., for selecting how many principal components to retain to capture most of the variability in the data. Often, $k$ is simply chosen to be the smallest number with the property $\hat\alpha_k^{(n)} > c$, where $c$ is some constant, for instance 0.8. Kaiser's rule suggests taking the largest $k$ such that all the values $\hat\lambda_1^{(n)}, \ldots, \hat\lambda_k^{(n)}$ are larger than the average of all the values $\hat\lambda_1^{(n)}, \ldots, \hat\lambda_p^{(n)}$. (Note that the average of the eigenvalues $\hat\lambda_i^{(n)}$, $i = 1, \ldots, p$, is equal to the average of the sample variances of the variables $X_1, \ldots, X_p$, which is in turn equal to $\mathrm{tr}(S_n)/p$.) Another popular method is to draw the so-called scree plot, which is a piece-wise linear function connecting the points $(0, 0), (1, \hat\alpha_1^{(n)}), \ldots, (p, \hat\alpha_p^{(n)})$. If this line forms an elbow at a point $k$, it suggests that $k$ could be an appropriate number of principal components to summarize the original data.
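The whole sample-based procedure of this subsection can be sketched in a few lines of NumPy (not part of the original notes; the function name sample_pca and the synthetic data are illustrative assumptions):

```python
import numpy as np

def sample_pca(x):
    """Sample principal components of an (n, p) data matrix x.

    Returns sample eigenvalues, sample eigenvectors (as columns),
    principal component scores and the explained-variance proportions.
    A minimal sketch; it uses the 1/n covariance estimator S_n as in the text.
    """
    n, p = x.shape
    xc = x - x.mean(axis=0)
    S = (xc.T @ xc) / n                     # S_n with the 1/n factor
    lam, U = np.linalg.eigh(S)
    lam, U = lam[::-1], U[:, ::-1]          # decreasing sample eigenvalues
    scores = xc @ U                          # y_i = U^T (x_i - x_bar)
    alpha = np.cumsum(lam) / lam.sum()       # alpha_hat_1, ..., alpha_hat_p
    return lam, U, scores, alpha

# Illustration on synthetic data (arbitrary choice of distribution).
rng = np.random.default_rng(3)
x = rng.multivariate_normal(np.zeros(3), [[4, 1, 0], [1, 2, 0], [0, 0, 0.3]], size=500)
lam, U, scores, alpha = sample_pca(x)
print(alpha)                                 # choose k, e.g. smallest k with alpha_k > 0.8
print(lam > lam.mean())                      # components retained by Kaiser's rule
```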

1.4 Applications of principal components

The simplest principal component analysis, as described in this chapter, is intended to be applied to a random sample of $p$-dimensional vectors without a specific structure and without a division of the variables into sets of dependent and independent variables. Although some theoretical results about principal components assume normality, it is routinely applied also to non-normal data. (Indeed, when the sample principal components are used for the visual display of data, i.e., for the exploration of the structure of the data, the assumption of normality makes little sense. Moreover, real-world, highly multidimensional data rarely follow a multivariate normal distribution.)

The results of principal component analysis are sometimes used as an end in itself, for instance as a means for the visualization of multidimensional data using the first two components of the vectors of scores (it can be expected that in the near future 3-dimensional data visualisation based on principal components will also become common), or for getting an insight into the data based on a possible interpretation of the coefficient vectors $\hat{u}_1^{(n)}, \ldots, \hat{u}_k^{(n)}$. Note, however, that there are also many other methods for the visualisation of multidimensional data, such as various projection pursuit methods, or the so-called multidimensional scaling. Sometimes the results of principal component analysis (usually the vectors of the first $k$ coordinates of the scores) form an input to a subsequent statistical procedure that benefits from a small-dimensional representation of the originally large-dimensional dataset; an example is reducing the number of explanatory variables for a regression analysis.

Principal components analysis is used across all fields that handle multivariate data. A particularly famous application uses principal components for face recognition, see, e.g., http://en.wikipedia.org/wiki/eigenface

The most problematic aspect of principal components is their dependence on the scale of the individual variables, i.e., principal components are not scale invariant. For instance, if we change the units with which we measure some distance variable (say, from meters to millimetres), the principal components can change significantly.

This is particularly problematic if the variables have very different magnitudes or if we simultaneously use variables of a completely different nature (such as distances, times and weights), where it is impossible to express all variables in the same units. This is the reason why principal components analysis is sometimes based on the correlation matrix instead of the variance-covariance matrix, that is, all variables are scaled to unit standard deviation before the application of principal components.

2 Multidimensional Scaling

Suppose that we study a set of $n$ objects and the relations of the objects are described by an $n \times n$ matrix $D$ of their mutual dissimilarities. Multidimensional scaling is a class of methods that assign vectors $x_1, \ldots, x_n \in \mathbb{R}^k$ to the objects, such that the mutual distances of the pairs $x_i$, $x_j$ are close to the dissimilarities $D_{ij}$ of the objects $i$ and $j$. The usual aim is to discover a hidden structure in the data by means of visualizing the map of dissimilarities in a two- or three-dimensional space, i.e., $k = 2$ or $k = 3$.

As we saw in the previous section, a reasonable $k$-dimensional representation of the data can be obtained from sample principal components, but for the direct application of principal components analysis we need to know the $p$-dimensional vectors of features of all objects. The so-called metric multidimensional scaling (there is also an interesting variant called non-metric multidimensional scaling) is closely related to principal component analysis; here, however, only the matrix $D$ of dissimilarities is required. Note that methods of multidimensional scaling usually do not make assumptions about the probability distribution that generated the data.

2.1 Theory for Classical Multidimensional Scaling

First, we will describe a classical solution of the following problem: How do we construct an $n$-tuple $x_1, \ldots, x_n$ of points in $\mathbb{R}^k$, $k \leq n$, if we only know the Euclidean distances between these points? Clearly, the solution is not unique, because for any solution, an orthogonal rotation or a shift also provides a feasible solution. Therefore, we can search for a solution $x_1, \ldots, x_n$ with the center of mass in $0_k$, that is, $\sum_{i=1}^n x_i = 0_k$. It is simple to prove the following lemma.

Lemma 2.1. Let $X = (x_1, \ldots, x_n)^T$ be an $n \times k$ matrix, let $\sum_{i=1}^n x_i = 0_k$, and let $B = XX^T$. Then (i) the sum of all elements of any row (and any column) of $B$ is zero; (ii) $\|x_i - x_j\|^2 = B_{ii} + B_{jj} - 2B_{ij}$.

The matrix $B = XX^T$ from the previous lemma is called the Gram matrix of the vectors $x_1, \ldots, x_n$ (its $ij$-th element is the scalar product of $x_i$ and $x_j$). The following theorem (sometimes called the Young-Householder theorem) shows that the Gram matrix of the vectors $x_1, \ldots, x_n$ can be computed, element by element, from the mutual distances of $x_1, \ldots, x_n$.

Theorem 2.1. Let $X = (x_1, \ldots, x_n)^T$ be a matrix of the type $n \times k$, let $\sum_{i=1}^n x_i = 0_k$ and let $B = XX^T$. Denote $D_{ij} = \|x_i - x_j\|$ for $i, j \in \{1, \ldots, n\}$. Then
$$B_{ij} = -\frac{1}{2}\left( D_{ij}^2 - \frac{1}{n}\sum_{r=1}^{n} D_{ir}^2 - \frac{1}{n}\sum_{l=1}^{n} D_{lj}^2 + \frac{1}{n^2}\sum_{l=1}^{n}\sum_{r=1}^{n} D_{lr}^2 \right), \qquad (4)$$
for all $i, j \in \{1, \ldots, n\}$.

Proof. Fix $i, j \in \{1, \ldots, n\}$. Using Lemma 2.1, we have
$$\sum_{r=1}^{n} D_{ir}^2 = \sum_{r=1}^{n} (B_{ii} + B_{rr} - 2B_{ir}) = nB_{ii} + \mathrm{tr}(B), \qquad (5)$$
$$\sum_{l=1}^{n} D_{lj}^2 = \sum_{l=1}^{n} (B_{ll} + B_{jj} - 2B_{lj}) = nB_{jj} + \mathrm{tr}(B), \qquad (6)$$
$$\sum_{l=1}^{n}\sum_{r=1}^{n} D_{lr}^2 = \sum_{r=1}^{n} \big(nB_{rr} + \mathrm{tr}(B)\big) = 2n\,\mathrm{tr}(B). \qquad (7)$$
From (7) we obtain $\mathrm{tr}(B) = (2n)^{-1}\sum_{l=1}^n\sum_{r=1}^n D_{lr}^2$, which can be substituted into (5) and (6), yielding
$$B_{ii} = \frac{1}{n}\sum_{r=1}^{n} D_{ir}^2 - \frac{1}{2n^2}\sum_{l=1}^{n}\sum_{r=1}^{n} D_{lr}^2, \qquad (8)$$
$$B_{jj} = \frac{1}{n}\sum_{l=1}^{n} D_{lj}^2 - \frac{1}{2n^2}\sum_{l=1}^{n}\sum_{r=1}^{n} D_{lr}^2. \qquad (9)$$
From Lemma 2.1 we see that $B_{ij} = -\frac{1}{2}\left( D_{ij}^2 - B_{ii} - B_{jj} \right)$, which together with (8) and (9) provides the equality from the statement of the theorem.

Exercise 2.1. Show that the matrix $B$ from the previous theorem can be obtained using the following matrix computation. Let $P = I_n - \frac{1}{n} 1_n 1_n^T$ (note that $P$ is a projector) and let $A$ be the matrix with elements $A_{ij} = -\frac{1}{2} D_{ij}^2$, $i, j \in \{1, \ldots, n\}$. Then $B = PAP$.
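The double-centering formula of Theorem 2.1 and Exercise 2.1 can be checked numerically as follows (a minimal sketch, not part of the original notes; the point configuration is synthetic):

```python
import numpy as np

rng = np.random.default_rng(4)

# Centered configuration of n points in R^k (arbitrary synthetic example).
n, k = 6, 2
X = rng.normal(size=(n, k))
X -= X.mean(axis=0)                          # center of mass at the origin

# Matrix of Euclidean distances D and the Gram matrix B = X X^T.
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
B = X @ X.T

# Double centering (Exercise 2.1): B = P A P with A = -D^2 / 2.
P = np.eye(n) - np.ones((n, n)) / n
A = -0.5 * D**2
assert np.allclose(P @ A @ P, B)             # Theorem 2.1 holds element-wise
```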

Therefore, the distances of the vectors directly provide the Gram matrix of the vectors. We will show that from the Gram matrix it is a simple step to obtain a solution $\tilde{x}_1, \ldots, \tilde{x}_n$ of the original problem. In other words, if $B = XX^T$ for some $n \times k$ matrix $X$, then we can easily find some $n \times k$ matrix $\tilde{X}$ such that $B = \tilde{X}\tilde{X}^T$.

Clearly, the Gram matrix $B = XX^T$ is non-negative definite, that is, $B = U\Lambda U^T$, where $U = (u_1, \ldots, u_n)$ is an orthogonal matrix of eigenvectors of $B$ and $\Lambda$ is a diagonal matrix with the eigenvalues $\lambda_1 \geq \cdots \geq \lambda_n \geq 0$ on the diagonal. Since $X$ is of the type $n \times k$, where $k \leq n$, the rank of $B$ is at most $k$, i.e., $\lambda_{k+1} = \cdots = \lambda_n = 0$. Now we can easily verify that the matrix $\tilde{X} = (\sqrt{\lambda_1}\, u_1, \ldots, \sqrt{\lambda_k}\, u_k)$ satisfies $B = \tilde{X}\tilde{X}^T$. That is, the $k$-dimensional columns $\tilde{x}_1, \ldots, \tilde{x}_n$ of $\tilde{X}^T$ are the required solutions.

It is interesting to note that, as a method of dimensionality reduction, multidimensional scaling is essentially equivalent to principal components analysis. In other words, if we do have $p$-dimensional vectors $x_1, \ldots, x_n$ of features of $n$ objects (or measurements of $p$ variables on $n$ objects), we can decide to compute the matrix $D$ of mutual Euclidean distances of $x_1, \ldots, x_n$, and then use classical multidimensional scaling with some $k \leq p$, as described above. Then the resulting vectors $\tilde{x}_1, \ldots, \tilde{x}_n \in \mathbb{R}^k$ are just orthogonally rotated vectors of the first $k$ coordinates of the principal component scores.

2.2 Application of Classical Multidimensional Scaling

For real objects, the matrix $D$ of dissimilarities is usually not based on the Euclidean distances of some unknown vectors. Nevertheless, if we tentatively assume that $D$ is a matrix of Euclidean distances of some vectors in $\mathbb{R}^k$ (or in a $k$-dimensional subspace of $\mathbb{R}^p$) and imitate the theoretical construction of the vectors $\tilde{x}_1, \ldots, \tilde{x}_n$ from the previous subsection, we often obtain a reasonable fit with the dissimilarities given by $D$.

If we use the matrix $D$ of dissimilarities, it is always possible to compute the symmetric matrix $B$ with the elements given in Theorem 2.1. If $D$ is not a perfect matrix of Euclidean distances, then $B$ is not necessarily non-negative definite. However, if the $k$ largest eigenvalues $\lambda_1, \ldots, \lambda_k$ of $B$ are positive and large compared to the absolute values of the remaining eigenvalues (a common rule of thumb is that $\sum_{i=1}^k \lambda_i / \sum_{i=1}^n \lambda_i > 0.8$), then the mutual Euclidean distances of the $k$-dimensional rows of $(\sqrt{\lambda_1}\, u_1, \ldots, \sqrt{\lambda_k}\, u_k)$ give a good fit with the dissimilarities $D_{ij}$.
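Putting Subsections 2.1 and 2.2 together gives a short recipe for classical multidimensional scaling. The following sketch (not part of the original notes; the function name classical_mds is an assumption) recovers a planar configuration from its distance matrix alone:

```python
import numpy as np

def classical_mds(D, k=2):
    """Classical (metric) multidimensional scaling.

    D is an (n, n) matrix of dissimilarities; returns an (n, k) configuration
    whose Euclidean distances approximate D.  A minimal sketch of the
    construction in Subsections 2.1-2.2 (no input validation).
    """
    n = D.shape[0]
    P = np.eye(n) - np.ones((n, n)) / n
    B = P @ (-0.5 * D**2) @ P                # double-centered "Gram" matrix
    lam, U = np.linalg.eigh(B)
    lam, U = lam[::-1], U[:, ::-1]           # decreasing eigenvalues
    lam_k = np.clip(lam[:k], 0, None)        # guard against small negative values
    return U[:, :k] * np.sqrt(lam_k)         # rows are the fitted points

# Example: recover a 2-D configuration from its pairwise distances only.
rng = np.random.default_rng(5)
X = rng.normal(size=(10, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
Xt = classical_mds(D, k=2)
print(np.allclose(np.linalg.norm(Xt[:, None] - Xt[None, :], axis=2), D))
```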

Classical multidimensional scaling, as a means to visualize data given by their matrix of dissimilarities, has been used for instance in psychology to create a perceptual map of stimuli, in marketing to create a product (dis)similarity map, in social networks to create friendship/collaboration maps, and so on. Note that there are several more advanced alternatives to classical multidimensional scaling: the so-called general metric multidimensional scaling and non-metric multidimensional scaling. These methods are beyond the scope of this introductory lecture.

3 Canonical Correlations

3.1 Mathematical background for Canonical Correlations

The theory of canonical correlations can be easily explained using the notion of the square-root matrix. Let $\Sigma$ be a non-negative definite $p \times p$ matrix and let $\Sigma = \sum_{i=1}^p \lambda_i u_i u_i^T$ be the decomposition of $\Sigma$ from Theorem 1.1. Let $\Sigma^{1/2} := \sum_{i=1}^p \sqrt{\lambda_i}\, u_i u_i^T$. Clearly, $\Sigma^{1/2}$ is a non-negative definite matrix satisfying $\Sigma^{1/2}\Sigma^{1/2} = \Sigma$, i.e., it is natural to call it the square-root matrix of $\Sigma$. (It is also possible to show that $\Sigma^{1/2}$ is unique even in the case when the orthonormal system $u_1, \ldots, u_p$ of eigenvectors of $\Sigma$ is not uniquely defined.) If $\Sigma$ is positive definite, its square-root matrix is also positive definite and its inverse $\Sigma^{-1/2} := (\Sigma^{1/2})^{-1}$ satisfies $\Sigma^{-1/2} = \sum_{i=1}^p \lambda_i^{-1/2} u_i u_i^T$.

Let $u \in \mathbb{R}^q$ be a vector of norm 1 and let $M$ be a non-negative definite matrix of the type $q \times q$. Then $u^T M u \leq \lambda_1(M)$, where $\lambda_1(M)$ is the largest eigenvalue of $M$; this is sometimes called the Rayleigh-Ritz theorem. Similarly, assume that $M$ has $m \leq q$ distinct positive eigenvalues $\lambda_1(M) > \cdots > \lambda_m(M)$ (and $q - m$ zero eigenvalues). Let $2 \leq k \leq m$. If $u$ is orthogonal to the $k-1$ eigenvectors of $M$ corresponding to the $k-1$ largest eigenvalues of $M$, then $u^T M u \leq \lambda_k(M)$.

Recall also that, if $F$ is any matrix, then the linear space generated by the columns of $F$ is the same as the linear space generated by the columns of the matrix $FF^T$. If $A$, $B$ are $s \times r$ matrices, then $\mathrm{tr}(A^T B) = \mathrm{tr}(BA^T) = \mathrm{tr}(AB^T) = \mathrm{tr}(B^T A)$. Note that the largest eigenvalue of a non-negative definite matrix of rank 1 is equal to the trace of the matrix.
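The square-root matrix and its inverse are straightforward to compute from the spectral decomposition; a minimal NumPy sketch (not part of the original notes, with an arbitrary positive definite matrix):

```python
import numpy as np

rng = np.random.default_rng(6)

# Symmetric square root of a positive definite matrix via Theorem 1.1.
M = rng.normal(size=(4, 4))
Sigma = M @ M.T + np.eye(4)                  # arbitrary positive definite matrix

lam, U = np.linalg.eigh(Sigma)
Sigma_half = U @ np.diag(np.sqrt(lam)) @ U.T
Sigma_minus_half = U @ np.diag(1 / np.sqrt(lam)) @ U.T

assert np.allclose(Sigma_half @ Sigma_half, Sigma)
assert np.allclose(Sigma_minus_half, np.linalg.inv(Sigma_half))
```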

3.2 Theoretical Canonical Correlations

Consider the random vector $X = (X_{(1)}^T, X_{(2)}^T)^T$, where $X_{(1)}$ is the subvector of dimension $p$ and $X_{(2)}$ is the subvector of dimension $q$. Correspondingly, let the covariance matrix of $X$ be divided into sub-blocks as follows:
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
We will assume that $\Sigma_{11}$ and $\Sigma_{22}$ are positive definite, i.e., there exist the matrices $\Sigma_{11}^{-1/2}$ and $\Sigma_{22}^{-1/2}$. Let
$$B := \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}, \qquad N_1 := BB^T = \Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1/2}, \qquad N_2 := B^T B = \Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1/2}.$$
Clearly, $N_1$, $N_2$ are both non-negative definite. Moreover, $N_1$, $N_2$ have the same rank, because $\mathrm{rank}(N_1) = \mathrm{rank}(BB^T) = \mathrm{rank}(B) = \mathrm{rank}(B^T) = \mathrm{rank}(B^T B) = \mathrm{rank}(N_2)$. We will denote the common rank of the matrices $N_1$, $N_2$ by the symbol $m$.

Lemma 3.1. Let $\alpha_1, \ldots, \alpha_m$ be an orthonormal system of eigenvectors of $N_1$, with the corresponding eigenvalues $\lambda_1(N_1) > \cdots > \lambda_m(N_1) > 0$. Similarly, let $\beta_1, \ldots, \beta_m$ be an orthonormal system of eigenvectors of $N_2$, with the corresponding eigenvalues $\lambda_1(N_2) > \cdots > \lambda_m(N_2) > 0$. Then, for every $i = 1, \ldots, m$, we have $\lambda_i(N_1) = \lambda_i(N_2) =: \lambda_i$ and
$$\beta_i = \frac{B^T\alpha_i}{\sqrt{\lambda_i}} \quad \text{or} \quad \beta_i = -\frac{B^T\alpha_i}{\sqrt{\lambda_i}}.$$

Proof. For the proof that all non-zero eigenvalues of $N_1$ and $N_2$ are equal, it is enough to show that each eigenvalue of $N_1$ is an eigenvalue of $N_2$ (clearly, then, the symmetry of the problem implies that each eigenvalue of $N_2$ is an eigenvalue of $N_1$). For any $i \in \{1, \ldots, m\}$, a normalized eigenvector $\alpha_i$ of $N_1 = BB^T$ and the corresponding eigenvalue $\lambda_i(N_1)$ we can write:
$$BB^T\alpha_i = \lambda_i(N_1)\alpha_i, \qquad B^T BB^T\alpha_i = \lambda_i(N_1)B^T\alpha_i, \qquad N_2 B^T\alpha_i = \lambda_i(N_1)B^T\alpha_i. \qquad (10)$$

In addition, observe that
$$\|B^T\alpha_i\|^2 = \alpha_i^T BB^T\alpha_i = \alpha_i^T N_1 \alpha_i = \alpha_i^T \lambda_i(N_1)\alpha_i = \lambda_i(N_1) > 0, \qquad (11)$$
which implies $B^T\alpha_i \neq 0$. That is, (10) entails that $\lambda_i(N_1)$ is an eigenvalue of $N_2$ and $B^T\alpha_i$ is a corresponding eigenvector of $N_2$ (not necessarily normalised). Moreover, since all eigenvalues of $N_2$ are assumed to be distinct, the corresponding orthonormal system of eigenvectors of $N_2$ is uniquely determined, up to a possible reversal of the direction. Therefore
$$\beta_i = \pm\frac{B^T\alpha_i}{\|B^T\alpha_i\|} = \pm\frac{B^T\alpha_i}{\sqrt{\lambda_i}},$$
where the second equality follows from (11).

Definition 3.1 (Canonical variables and canonical correlations). For $i = 1, \ldots, m$ let
$$a_i = \Sigma_{11}^{-1/2}\alpha_i, \quad b_i = \Sigma_{22}^{-1/2}\beta_i, \quad U_i = a_i^T X_{(1)}, \quad V_i = b_i^T X_{(2)}, \quad \rho_i = \sqrt{\lambda_i},$$
where $\alpha_i$, $\beta_i$ and $\lambda_i$ are defined in Lemma 3.1, with $\beta_i = B^T\alpha_i/\sqrt{\lambda_i}$. Then the random variables $U_1, \ldots, U_m$, $V_1, \ldots, V_m$ are called canonical variables and the numbers $\rho_1, \ldots, \rho_m, 0, \ldots, 0$ are called canonical correlations. (The number of zeros here is $\min(p, q) - m$, that is, all canonical correlations of order higher than $m = \mathrm{rank}(N_1) = \mathrm{rank}(N_2)$ are defined to be zero.)

It is possible to show that the vectors $a_i$ of coefficients that define the first group of canonical variables are eigenvectors of $\Sigma_{11}^{-1}\Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$, and the canonical correlations $\rho_i$ are the square roots of the corresponding eigenvalues. Analogously, the vectors $b_i$ of coefficients that define the second group of canonical variables are eigenvectors of $\Sigma_{22}^{-1}\Sigma_{21}\Sigma_{11}^{-1}\Sigma_{12}$ and, as above, the canonical correlations $\rho_i$ are the square roots of the corresponding eigenvalues. That is, we can obtain canonical variables and correlations without actually computing the square-root matrices; however, the square-root matrices are useful for the theoretical results, because they allow us to work exclusively with non-negative definite matrices.

Theorem 3.1 (Mutual correlations of canonical variables). For all $i, j \in \{1, \ldots, m\}$ we have
$$\mathrm{cov}(U_i, U_j) = \rho(U_i, U_j) = \delta_{ij}, \qquad \mathrm{cov}(V_i, V_j) = \rho(V_i, V_j) = \delta_{ij}, \qquad \mathrm{cov}(U_i, V_j) = \rho(U_i, V_j) = \delta_{ij}\rho_i,$$
where $\delta_{ij}$ is the Kronecker delta ($\delta_{ii} = 1$ and $\delta_{ij} = 0$ if $i \neq j$).

Proof. We have
$$\mathrm{cov}(U_i, U_j) = \mathrm{cov}(a_i^T X_{(1)}, a_j^T X_{(1)}) = a_i^T \Sigma_{11} a_j = \alpha_i^T \Sigma_{11}^{-1/2}\Sigma_{11}\Sigma_{11}^{-1/2}\alpha_j = \alpha_i^T\alpha_j = \delta_{ij}$$
and, similarly, we obtain $\mathrm{cov}(V_i, V_j) = \delta_{ij}$. In particular, this implies $\mathrm{Var}(U_i) = \mathrm{Var}(V_i) = 1$, that is, the covariances are equal to the correlations. We can conclude the proof by observing that
$$\mathrm{cov}(U_i, V_j) = \mathrm{cov}(a_i^T X_{(1)}, b_j^T X_{(2)}) = a_i^T\Sigma_{12}b_j = \alpha_i^T\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}\beta_j = \alpha_i^T B\beta_j \overset{(*)}{=} \sqrt{\lambda_i}\,\beta_i^T\beta_j = \delta_{ij}\rho_i,$$
where for the equality denoted by the asterisk we used $\beta_i = B^T\alpha_i/\sqrt{\lambda_i}$.

In a manner similar to principal components, canonical variables and canonical correlations have a probabilistic justification:

Theorem 3.2 (Maximum correlation justification of canonical variables and correlations). The canonical variables $U_1 = a_1^T X_{(1)}$ and $V_1 = b_1^T X_{(2)}$ have the maximum possible correlation among all pairs $a^T X_{(1)}$, $b^T X_{(2)}$ of non-zero linear combinations of the components of the vectors $X_{(1)}$ and $X_{(2)}$. Formally:
$$\rho(U_1, V_1) \geq \rho(a^T X_{(1)}, b^T X_{(2)}) \quad \text{for all } 0 \neq a \in \mathbb{R}^p,\ 0 \neq b \in \mathbb{R}^q.$$
For $k \geq 2$, the canonical variables $U_k = a_k^T X_{(1)}$ and $V_k = b_k^T X_{(2)}$ have the largest correlation coefficient among all pairs $a^T X_{(1)}$, $b^T X_{(2)}$ of non-zero linear combinations of the components of the vectors $X_{(1)}$ and $X_{(2)}$ that are uncorrelated with $U_1, \ldots, U_{k-1}$ and $V_1, \ldots, V_{k-1}$, respectively. Formally:
$$\rho(U_k, V_k) \geq \rho(a^T X_{(1)}, b^T X_{(2)}) \quad \text{for all } 0 \neq a \in \mathbb{R}^p,\ 0 \neq b \in \mathbb{R}^q \text{ such that } a^T\Sigma_{11}a_i = b^T\Sigma_{22}b_i = 0 \text{ for all } i = 1, \ldots, k-1.$$

Proof. As $\rho(U_1, V_1) = \sqrt{\lambda_1}$ and
$$\rho(a^T X_{(1)}, b^T X_{(2)}) = \frac{a^T\Sigma_{12}b}{\sqrt{a^T\Sigma_{11}a}\,\sqrt{b^T\Sigma_{22}b}},$$
which is invariant with respect to multiples of the vectors $a$, $b$, for the proof of the first part of the theorem it is enough to show that $\lambda_1 \geq (a^T\Sigma_{12}b)^2$ under the constraint $a^T\Sigma_{11}a = b^T\Sigma_{22}b = 1$.

Denote $u := \Sigma_{11}^{1/2}a$ and $v := \Sigma_{22}^{1/2}b$ and note that $a^T\Sigma_{11}a = b^T\Sigma_{22}b = 1$ implies $\|u\| = \|v\| = 1$. We obtain:
$$(a^T\Sigma_{12}b)^2 = (u^T\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}v)^2 = u^T\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}vv^T\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}u$$
$$\overset{(*)}{\leq} \lambda_1\big(\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}vv^T\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}\big) = \mathrm{tr}\big(\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}vv^T\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}\big)$$
$$= v^T\Sigma_{22}^{-1/2}\Sigma_{21}\Sigma_{11}^{-1/2}\Sigma_{11}^{-1/2}\Sigma_{12}\Sigma_{22}^{-1/2}v = v^T N_2 v \overset{(*)}{\leq} \lambda_1(N_2) = \lambda_1.$$
The inequalities denoted by the asterisk follow from the Rayleigh-Ritz inequality. The second part of the theorem follows similarly to the first one, observing that the conditions $b^T\Sigma_{22}b_i = 0$ for all $i = 1, \ldots, k-1$ imply that $v$ is orthogonal to the eigenvectors of $N_2$ that correspond to the $k-1$ largest eigenvalues, hence $v^T N_2 v \leq \lambda_k(N_2) = \lambda_k$.

Note that, unlike principal components, canonical correlations $\rho_1, \ldots, \rho_m$ do not depend on the units of the individual variables, i.e., they are scale invariant.

3.3 Sample Canonical Correlations

In practice, we do not know the theoretical variance-covariance matrices of the random vector that we observe and we need to base the estimate of the canonical correlations on a random sample $X_1, \ldots, X_n$ from the underlying $(p+q)$-dimensional distribution. The following definition is motivated by the remark after Definition 3.1.

Definition 3.2 (Sample canonical correlations). Let $m = \min(p, q)$. Sample canonical correlations $\hat\rho_1, \ldots, \hat\rho_m$ are defined to be the square roots of the $m$ largest eigenvalues of $S_{22}^{-1}S_{21}S_{11}^{-1}S_{12}$ (or, equivalently, of $S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}$), where $S_{11}$, $S_{12}$, $S_{21}$, and $S_{22}$ are the $p \times p$, $p \times q$, $q \times p$, and $q \times q$ sub-blocks of the sample variance-covariance matrix
$$S = \begin{pmatrix} S_{11} & S_{12} \\ S_{21} & S_{22} \end{pmatrix}.$$
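Definition 3.2 translates directly into code. The following sketch (not part of the original notes; the function name and the synthetic two-block data are illustrative assumptions) computes the sample canonical correlations from the eigenvalues of $S_{11}^{-1}S_{12}S_{22}^{-1}S_{21}$:

```python
import numpy as np

def sample_canonical_correlations(x, p):
    """Sample canonical correlations of an (n, p+q) data matrix x,
    where the first p columns form X_(1) and the rest form X_(2).

    A minimal sketch of Definition 3.2 (no input validation); it returns
    the square roots of the m = min(p, q) largest eigenvalues of
    S11^{-1} S12 S22^{-1} S21.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    S = (xc.T @ xc) / n
    S11, S12 = S[:p, :p], S[:p, p:]
    S21, S22 = S[p:, :p], S[p:, p:]
    M = np.linalg.solve(S11, S12) @ np.linalg.solve(S22, S21)
    eig = np.sort(np.linalg.eigvals(M).real)[::-1]
    m = min(p, x.shape[1] - p)
    return np.sqrt(np.clip(eig[:m], 0, None))

# Illustration on synthetic data with two correlated blocks (arbitrary choice).
rng = np.random.default_rng(7)
z = rng.normal(size=(1000, 1))
x = np.hstack([z + 0.5 * rng.normal(size=(1000, 2)),
               z + 0.5 * rng.normal(size=(1000, 3))])
print(sample_canonical_correlations(x, p=2))   # first value well above the rest
```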

In the case of normal errors and a large sample size, we can test the hypothesis that all theoretical canonical correlations are zero, i.e., that the sub-vector of the first $p$ variables is independent of the sub-vector of the last $q$ variables:

Theorem 3.3. If $X_1, \ldots, X_n$ follow $N_{p+q}(\mu, \Sigma)$ and the upper-right $p \times q$ submatrix $\Sigma_{12}$ of $\Sigma$ is zero (this is the case if and only if the lower-left $q \times p$ submatrix $\Sigma_{21}$ of $\Sigma$ is zero), then the statistic
$$W = \det\big(I - S_{22}^{-1}S_{21}S_{11}^{-1}S_{12}\big)$$
has asymptotically the Wilks distribution $\Lambda(p, n - 1 - q, q)$ and the statistic
$$Z = -\big(n - (p + q + 3)/2\big)\ln(W)$$
has asymptotically the $\chi^2_{pq}$ distribution.

The test using the statistic $Z$ from the previous theorem is called Bartlett's $\chi^2$ test.

3.4 Applications of Canonical Correlations

Canonical correlations are used in situations where the variables can be logically divided into two distinct groups. For instance, in a psychological survey we can have a set of variables measuring one personality dimension, and a set of variables measuring a second personality dimension. The first canonical correlation $\rho_1$ then measures the overall degree of correlation of the first group of variables with the second group of variables. Note that for $q = 1$ the first canonical correlation corresponds to the so-called coefficient of multiple correlation, which, in the context of linear regression, is closely related to the coefficient of determination (the coefficient of determination is its square). (It is a useful theoretical exercise to simplify the theory of canonical correlations for $q = 1$.)

4 Factor Analysis

In factor analysis, we assume that the random vector $X = (X_1, \ldots, X_p)^T$ of observed (or manifest) variables can be described in terms of an $m$-dimensional random vector $F = (F_1, \ldots, F_m)^T$ of hidden (or latent) variables, where $m$ is substantially smaller than $p$. More precisely, we assume that the following statistical model holds:
$$X_i = \sum_{j=1}^{m} a_{ij}F_j + U_i, \quad i = 1, \ldots, p, \qquad (12)$$
or, in matrix form,
$$X = AF + U. \qquad (13)$$

The elements $a_{ij}$ of $A$ are called factor loadings (that is, the $p \times m$ matrix $A$ is called the matrix of factor loadings), the random vector $F$ is called the vector of common factors, and the random vector $U$ is called the vector of specific factors, specific variates, or uniqueness. The factor loadings are assumed to be unknown parameters of the model. Hence, we consider a model similar to multivariate regression, but in factor analysis the regressors or explanatory variables $F_1, \ldots, F_m$ are assumed to be random, and cannot be directly observed. (Another difference is that in factor analysis we allow the variances of the components of the error vector $U$ to differ, while in regression analysis the errors are usually assumed to be homoscedastic.)

The usual theoretical assumptions are that the common factors $F$ and the uniqueness $U$ are uncorrelated, i.e., $\mathrm{Cov}(F, U) = 0_{m \times p}$, that the common factors themselves are standardized and uncorrelated, which means that the variance-covariance matrix of $F$ is $I_m$ (this is then called the orthogonal model, as opposed to a non-orthogonal model, where no restrictions on the variance-covariance matrix of $F$ are imposed), and that the variance-covariance matrix of $U$ is a diagonal matrix $D = \mathrm{diag}(d_1, \ldots, d_p)$. Note that the variances $d_1, \ldots, d_p$ are additional unknown parameters of the model. The mean values of $F$ and $U$ are assumed to be 0, which directly implies that the mean value of $X$ is 0. In this text, we also adopt the simplifying assumption that the manifest variables $X_1, \ldots, X_p$ are normalized, i.e., $\mathrm{Var}(X_i) = 1$, which is common in applications (especially in some behavioral sciences, where the $X_i$ represent standardized responses of subjects).

The theoretical assumptions, together with the standard transformation rules stated in Theorem 1.2, imply that the correlation matrix $\Psi$ of the vector $X$ (note that since the manifest variables are standardized, the correlation matrix of $X$ coincides with the variance-covariance matrix $\Sigma$ of $X$) is
$$\Psi = AA^T + D. \qquad (14)$$

Note that the elements of the matrix $A$ are simply the correlations between the manifest variables and the hidden factors:

Theorem 4.1 (Interpretation of factor loadings). Under the model of factor analysis described above, we have $a_{ij} = \rho(X_i, F_j)$ for all $i = 1, \ldots, p$ and $j = 1, \ldots, m$.

Proof. Consider the model (13). We have $\mathrm{Cov}(X, F) = \mathrm{Cov}(AF + U, F) = A\,\mathrm{Cov}(F, F) + \mathrm{Cov}(U, F) = A$, taking into account the assumptions $\mathrm{Cov}(F, F) = I_m$ and $\mathrm{Cov}(U, F) = 0_{p \times m}$. Since the variables of $X$ as well as of $F$ are normalized, we obtain that $\rho(X_i, F_j) = \mathrm{cov}(X_i, F_j) = a_{ij}$.

A crucial observation is that the model (13) is over-parametrized: based on the observations of the manifest variables $X$, we will never be able to distinguish between the models $X = AF + U$ and $X = (AV)(V^T F) + U$, where $V$ is any $m \times m$ orthogonal matrix. In other words, the factor loadings $A$ with the factors $F$ provide as good a fit to the observations as the factor loadings $AV$ with the rotated factors $V^T F$. (Note that if $F$ satisfies the assumptions of factor analysis, then so does $V^T F$.)

The previous point is the crux of the main criticism of factor analysis from the point of view of traditional statistical philosophy. Note that statisticians traditionally assume a single, fixed model, which is unknown, yet possible to identify with increasing precision as the sample size increases. Clearly, the model of factor analysis does not fit into this view. Rather, in factor analysis, we can adopt the attitude that all models of the form $X = (AV)(V^T F) + U$, where $V$ is an orthogonal matrix, are equally good representations of reality, and it is perfectly justifiable to select one of these models that leads to our understanding of the origin of the data, based on a clear interpretation of what the factors $V^T F$ represent. In fact, there is a large body of literature on methods called rotation of factors that help us find a suitable matrix $V$; see the section on the rotation of factors.

4.1 Estimation of factor loadings

The first step in factor analysis is to use the observations of the manifest variables to find appropriate estimates of the parameters $a_{ij}$ and $d_i$. Here, two methods are most popular: the method of maximum likelihood and the method of principal factors, which we very briefly describe in the following subsection.

4.1.1 Principal factors

Consider the formula $\Psi = AA^T + D$. If we knew $D = \mathrm{diag}(d_1, \ldots, d_p)$, then we would be able to evaluate the so-called reduced correlation matrix $\Psi - D = AA^T$, which would enable the computation of a matrix $A$ of factor loadings. Note that $d_i = 1 - h_i^2$, where the quantities $h_i^2 = \sum_{j=1}^m a_{ij}^2$ are called communalities, i.e., the estimation of the variances of the specific factors is essentially equivalent to the estimation of the communalities.

The method of principal factors uses the sample correlation matrix $R$ as an estimator of $\Psi$, and $\hat{d}_1 = 1/(R^{-1})_{11}, \ldots, \hat{d}_p = 1/(R^{-1})_{pp}$ as (initial) estimators of $d_1, \ldots, d_p$. Thus, we can obtain an estimate $\tilde{R} = R - \mathrm{diag}(\hat{d}_1, \ldots, \hat{d}_p)$ of the reduced correlation matrix, which can be viewed as the sample correlation matrix with the diagonal elements replaced by the estimates $\hat{h}_1^2 = 1 - 1/(R^{-1})_{11}, \ldots, \hat{h}_p^2 = 1 - 1/(R^{-1})_{pp}$ of the communalities. Then we calculate the eigenvalues $\tilde\lambda_1 \geq \cdots \geq \tilde\lambda_p$ of $\tilde{R}$ and the corresponding orthonormal eigenvectors $\tilde{u}_1, \ldots, \tilde{u}_p$. If the first $m$ eigenvalues of $\tilde{R}$ are non-negative, and the remaining eigenvalues are small, we can estimate the matrix of factor loadings as
$$\hat{A} = \big(\sqrt{\tilde\lambda_1}\,\tilde{u}_1, \ldots, \sqrt{\tilde\lambda_m}\,\tilde{u}_m\big).$$
We remark that once we have $\hat{A}$, we can repeat the process using the new estimates $\hat{h}_i^2 = \sum_{j=1}^m (\hat{A})_{ij}^2$, $i = 1, \ldots, p$, of the communalities and iterate it until possible convergence. In the method of principal factors, the number $m$ of factors is considered satisfactory based on criteria analogous to those used in the method of principal components, for instance, if $\sum_{j=1}^m \tilde\lambda_j / \sum_{i=1}^p \tilde\lambda_i > 0.8$.
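One pass of the principal-factor method can be sketched as follows (not part of the original notes; the function name principal_factors is an assumption):

```python
import numpy as np

def principal_factors(R, m):
    """One pass of the principal-factor method of Subsection 4.1.1.

    R is a (p, p) sample correlation matrix and m the number of factors.
    Returns the estimated loadings A_hat (p, m) and specific variances d_hat.
    A minimal sketch; the iterative refinement of communalities is omitted.
    """
    R_inv = np.linalg.inv(R)
    d_hat = 1.0 / np.diag(R_inv)                  # initial specific variances
    R_reduced = R - np.diag(d_hat)                # reduced correlation matrix
    lam, u = np.linalg.eigh(R_reduced)
    lam, u = lam[::-1], u[:, ::-1]                # decreasing eigenvalues
    A_hat = u[:, :m] * np.sqrt(np.clip(lam[:m], 0, None))
    return A_hat, 1.0 - np.sum(A_hat**2, axis=1)  # updated d_i = 1 - h_i^2
```

In practice one would iterate as described above: recompute the communalities from the returned loadings, rebuild the reduced correlation matrix and repeat until the estimates stabilize.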

4.2 Rotation of Factors

As mentioned above, if we have any estimate $\hat{A}$ of $A$, and $V$ is any $m \times m$ orthogonal matrix, then $\tilde{A} = \hat{A}V$ is as good an estimate of $A$ as $\hat{A}$. For a given estimate $\hat{A}$, the methods of factor rotation provide, in a sense, an optimal rotation $V$, in order to achieve a suitable form of the new matrix $\hat{A}V$ of factor loadings. A general aim is to achieve the so-called simple structure of the matrix of factor loadings. Omitting details, a matrix of factor loadings has a simple structure if it contains many entries close to $0$, $1$ and $-1$, because, viewed as correlations, such factor loadings usually lead to a simple interpretation of the common factors.

Definition 4.1 (Varimax and quartimax rotation methods). The varimax rotation of the $p \times m$ matrix $\hat{A}$ of factor loadings is the orthogonal matrix $V$ of the type $m \times m$ that maximizes the value of
$$f_v(V) = \frac{1}{m}\sum_{t=1}^{m}\frac{1}{p}\sum_{j=1}^{p}\left( (\hat{A}V)_{jt}^2 - \frac{1}{p}\sum_{k=1}^{p}(\hat{A}V)_{kt}^2 \right)^2. \qquad (15)$$
The quartimax rotation of the $p \times m$ matrix $\hat{A}$ of factor loadings is the orthogonal matrix $V$ of the type $m \times m$ that maximizes the value of
$$f_q(V) = \frac{1}{mp}\sum_{t=1}^{m}\sum_{j=1}^{p}\left( (\hat{A}V)_{jt}^2 - \frac{1}{mp}\sum_{s=1}^{m}\sum_{k=1}^{p}(\hat{A}V)_{ks}^2 \right)^2. \qquad (16)$$

Let $B_V$ be the $p \times m$ matrix with elements corresponding to the squares of the elements of $\hat{A}V$, that is, $B_V = \hat{A}V \odot \hat{A}V$, where $\odot$ is the Hadamard (entry-wise) product of matrices. Observe that it is possible to interpret (15) as the average sample variance of the entries of the columns of $B_V$, and (16) can be interpreted as the sample variance of all elements of $B_V$. Clearly, the elements of $\hat{A}V$ are correlations, i.e., the elements of $B_V$ are bounded by 0 from below and by 1 from above. Therefore, the maximization of the variance of the elements of $B_V$ forces the elements to be close to 0 or 1, i.e., forces the elements of $\hat{A}V$ to be close to the numbers $0$, $1$, and $-1$.

Using the formula $\frac{1}{N}\sum_{i=1}^N (x_i - \bar{x})^2 = \frac{1}{N}\sum_{i=1}^N x_i^2 - \bar{x}^2$ for any $x_1, \ldots, x_N$ and the observation
$$\sum_{s=1}^{m}\sum_{k=1}^{p}(\hat{A}V)_{ks}^2 = \|\hat{A}V\|^2 = \mathrm{tr}\big(\hat{A}V(\hat{A}V)^T\big) = \mathrm{tr}(\hat{A}\hat{A}^T) = \|\hat{A}\|^2,$$
the quartimax utility function can be simplified as follows:
$$f_q(V) = \frac{1}{mp}\sum_{t=1}^{m}\sum_{j=1}^{p}(\hat{A}V)_{jt}^4 - \frac{1}{(mp)^2}\|\hat{A}\|^4.$$
Therefore, the quartimax method maximizes the sum of the fourth powers of the factor loadings, hence the name.

The problem of finding the optimal $V$ in the sense of the varimax or the quartimax rotation method is a difficult problem of mathematical programming and the description of the appropriate numerical methods goes beyond this text. However, all advanced statistical packages implement some form of factor rotation optimization.
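The criteria (15) and (16) are easy to evaluate for a given rotation $V$; finding the maximizing $V$ requires a numerical optimizer, which is left aside here. A minimal sketch of the two objective functions (not part of the original notes; the loading matrix and the rotation below are arbitrary illustrations):

```python
import numpy as np

def varimax_value(A, V):
    """Varimax criterion (15): average sample variance of the squared
    loadings within each column of A V."""
    L2 = (A @ V) ** 2                              # the matrix B_V
    return np.mean(np.var(L2, axis=0))             # column variances, then average

def quartimax_value(A, V):
    """Quartimax criterion (16): sample variance of all squared loadings."""
    L2 = (A @ V) ** 2
    return np.var(L2)

# Evaluate the criteria at a random orthogonal V (illustration only).
rng = np.random.default_rng(8)
A = rng.normal(size=(6, 2))
V, _ = np.linalg.qr(rng.normal(size=(2, 2)))
print(varimax_value(A, V), quartimax_value(A, V))
```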

4.3 Estimation of Factor Scores

Lemma 4.1. Let the random vector $(X^T, F^T)^T$ follow the $(p+m)$-dimensional normal distribution with the zero mean value. Then the conditional distribution of $F$ given $X = x$ is
$$N_m\big(\mathrm{Cov}(F, X)\Sigma_X^{-1}x,\ \Sigma_F - \mathrm{Cov}(F, X)\Sigma_X^{-1}\mathrm{Cov}(X, F)\big) = N_m\big(A^T(AA^T + D)^{-1}x,\ I_m - A^T(AA^T + D)^{-1}A\big).$$

Consider a random sample $X_1, \ldots, X_n$ of the $p$-dimensional observable variables on $n$ objects, satisfying the model of factor analysis. Suppose that our aim is to estimate the realizations of the common factors $F_1, \ldots, F_n$ for the individual objects; these realizations are called factor scores. Lemma 4.1 motivates the following estimators:
$$\hat{F}_r = \hat{A}^T\big(\hat{A}\hat{A}^T + \hat{D}\big)^{-1}X_r, \quad r = 1, \ldots, n, \qquad (17)$$
where $\hat{A}$ is the selected estimate of the matrix of factor loadings and $\hat{D}$ is the estimate of the diagonal matrix of the variances of the specific variates. This kind of estimation of factor scores is sometimes called the method of regression analysis (the reason is that it can also be motivated by techniques similar to multivariate regression analysis).

If the number $p$ is very large, it may be difficult to compute the inverse of the $p \times p$ correlation matrix $\hat{R} = \hat{A}\hat{A}^T + \hat{D}$ (it was especially difficult in the past). However, the matrix $\hat{A}^T\hat{R}^{-1}$ that appears in formula (17) can be computed by inverting only a usually much smaller $m \times m$ matrix and a $p \times p$ diagonal matrix:

Lemma 4.2. Let $R = AA^T + D$ be a non-singular matrix, where $A$ is a $p \times m$ matrix and $D$ is a non-singular $p \times p$ matrix. Then $A^T R^{-1} = (I_m + A^T D^{-1}A)^{-1}A^T D^{-1}$.

Proof.
$$(I_m + A^T D^{-1}A)^{-1}A^T D^{-1}R = (I_m + A^T D^{-1}A)^{-1}A^T D^{-1}(AA^T + D) = (I_m + A^T D^{-1}A)^{-1}(A^T D^{-1}A + I_m)A^T = A^T.$$
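Formula (17) together with Lemma 4.2 yields the following sketch for computing factor scores without inverting the $p \times p$ matrix $\hat{R}$ (not part of the original notes; the function name factor_scores is an assumption):

```python
import numpy as np

def factor_scores(X, A_hat, d_hat):
    """Regression-type factor scores (17), computed via Lemma 4.2.

    X is an (n, p) matrix of centered observations, A_hat a (p, m) loading
    matrix and d_hat a length-p vector of specific variances.  Only an
    m x m matrix and a diagonal matrix are inverted.  A minimal sketch.
    """
    AtDinv = A_hat.T / d_hat                      # A^T D^{-1} (column-wise scaling)
    m = A_hat.shape[1]
    W = np.linalg.solve(np.eye(m) + AtDinv @ A_hat, AtDinv)   # equals A^T R^{-1}
    return X @ W.T                                # rows are F_hat_r = W x_r
```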

5 Model-based Cluster Analysis

The main aim of cluster analysis is to reveal relationships or similarities of $n$ objects characterized either by multidimensional vectors of features or by a matrix of mutual dissimilarities. The aim of the partitioning methods of cluster analysis is to divide the objects into $k$ groups called clusters in such a way that, loosely speaking, the elements within the same cluster are as similar as possible and the elements of different clusters are as dissimilar as possible. (Partitioning methods are sometimes called non-hierarchical methods to highlight the contrast with the hierarchical methods of cluster analysis. The aim of a hierarchical cluster analysis is to create a tree of similarities of objects called a dendrogram. These methods are beyond the scope of this text, but I will give a brief description of hierarchical methods at a lecture.) We will focus on a modern and powerful partitioning method called (normal) model-based cluster analysis. (Traditional and still very popular partitioning methods are the methods of k-means and k-medoids; they will be reviewed briefly at a lecture. An interesting method of a completely different kind is the so-called density-based scan, but there is also an (over)abundance of other clustering methods.)

First, let us introduce formal notation. Let $n$ be the number of objects and let $k$ be the required number of clusters; that is, the objects will be denoted by the numbers $1, \ldots, n$ and the clusters by the numbers $1, \ldots, k$. We will assume that $n \geq k$; the case $n < k$ is meaningless and in real situations the number $n$ of objects is usually greater than $k$ by several orders of magnitude.

Definition 5.1 (Clustering and clusters). Let $\Gamma_{n,k}$ be the set of all vectors $\gamma \in \mathbb{R}^n$ with elements from $\{1, \ldots, k\}$, such that each of the values $1, \ldots, k$ occurs at least once in $\gamma$. Any vector $\gamma \in \Gamma_{n,k}$ will be called a clustering. The set $C_j(\gamma) = \{i \in \{1, \ldots, n\} : \gamma_i = j\}$ will be called the $j$-th cluster for the clustering $\gamma$ (that is, for $j \in \{1, \ldots, k\}$ and $i \in \{1, \ldots, n\}$ the equality $\gamma_i = j$ means that the clustering $\gamma$ assigns the $i$-th object to the $j$-th cluster). The number $n_j(\gamma)$ of elements of $C_j(\gamma)$ will be called the size of the $j$-th cluster for the clustering $\gamma$.

If the objects are characterized by feature vectors $x_1, \ldots, x_n \in \mathbb{R}^p$, we can define the following characteristics for each cluster $C_j(\gamma)$, $\gamma \in \Gamma_{n,k}$.

Definition 5.2 (Centroid of a cluster, variance-covariance matrix of a cluster). Let $\gamma \in \Gamma_{n,k}$ and let $j \in \{1, \ldots, k\}$. The centroid of the cluster $C_j(\gamma)$ (note that the centroid of a cluster is its center of mass) is
$$\bar{x}_j(\gamma) = n_j^{-1}(\gamma)\sum_{i \in C_j(\gamma)} x_i.$$
The variance-covariance matrix of the cluster $C_j(\gamma)$ is
$$S_j(\gamma) = n_j^{-1}(\gamma)\sum_{i \in C_j(\gamma)} (x_i - \bar{x}_j(\gamma))(x_i - \bar{x}_j(\gamma))^T.$$

Partitioning methods differ in the way they define the notion of an optimal clustering $\gamma \in \Gamma_{n,k}$ and in the way they search for this clustering (finding the optimal clustering is typically a difficult problem of discrete optimization). For model-based clustering, the optimal clustering $\gamma$ is defined by means of an underlying model assumption.

More precisely, we assume that the $n$ objects are characterized by feature vectors $x_1, \ldots, x_n \in \mathbb{R}^p$ that are realizations of random vectors $X_1, \ldots, X_n$, $X_i \sim N_p(\mu_{\gamma_i}, \Sigma_{\gamma_i})$, where $\gamma \in \Gamma_{n,k}$, $\mu_1, \ldots, \mu_k \in \mathbb{R}^p$, and $\Sigma_1, \ldots, \Sigma_k \in S_{++}^p$ are unknown parameters. In particular, note that in this interpretation $\gamma$ is also considered to be a model parameter. In model-based clustering, we obtain the optimal clustering as an estimate $\hat\gamma$ of the parameter $\gamma$ by the principle of maximum likelihood.

By $\Gamma_{n,k}^R$ denote the set of all regular clusterings, i.e., clusterings $\gamma \in \Gamma_{n,k}$ such that the matrices $S_1(\gamma), \ldots, S_k(\gamma)$ are non-singular. By $\hat\gamma, \hat\mu_1, \ldots, \hat\mu_k, \hat\Sigma_1, \ldots, \hat\Sigma_k$ let us denote the model parameters that, on the set $\Gamma_{n,k}^R \times \mathbb{R}^p \times \cdots \times \mathbb{R}^p \times S_{++}^p \times \cdots \times S_{++}^p$, maximize the likelihood function
$$L(\gamma, \mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f_{\mu_{\gamma_i}, \Sigma_{\gamma_i}}(x_i), \qquad (18)$$
where
$$f_{\mu, \Sigma}(x) = \frac{\exp\left(-\frac{1}{2}(x - \mu)^T\Sigma^{-1}(x - \mu)\right)}{(2\pi)^{p/2}\sqrt{\det(\Sigma)}}, \quad x \in \mathbb{R}^p,$$
is the density of the distribution $N_p(\mu, \Sigma)$ with a positive definite variance-covariance matrix $\Sigma$. In the sequel, we derive an optimization problem of a simple form which provides the maximum likelihood estimate $\hat\gamma$, without the need to explicitly compute $\hat\mu_1, \ldots, \hat\mu_k, \hat\Sigma_1, \ldots, \hat\Sigma_k$.

Consider the log-likelihood function of (18):
$$\ln L(\gamma, \mu_1, \ldots, \mu_k, \Sigma_1, \ldots, \Sigma_k \mid x_1, \ldots, x_n) = \ln\prod_{i=1}^{n} f_{\mu_{\gamma_i}, \Sigma_{\gamma_i}}(x_i)$$
$$= \sum_{i=1}^{n}\left[ -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln\det(\Sigma_{\gamma_i}) - \frac{1}{2}(x_i - \mu_{\gamma_i})^T\Sigma_{\gamma_i}^{-1}(x_i - \mu_{\gamma_i}) \right]$$
$$= \sum_{j=1}^{k}\sum_{i \in C_j(\gamma)}\left[ -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln\det(\Sigma_j) - \frac{1}{2}(x_i - \mu_j)^T\Sigma_j^{-1}(x_i - \mu_j) \right].$$
For any fixed $\gamma \in \Gamma_{n,k}^R$ and $j \in \{1, \ldots, k\}$, the sum
$$\sum_{i \in C_j(\gamma)}\left[ -\frac{p}{2}\ln(2\pi) - \frac{1}{2}\ln\det(\Sigma_j) - \frac{1}{2}(x_i - \mu_j)^T\Sigma_j^{-1}(x_i - \mu_j) \right]$$