SOME TOOLS FOR LINEAR DIMENSION REDUCTION


Joni Virta
Master's thesis
May 2014
DEPARTMENT OF MATHEMATICS AND STATISTICS
UNIVERSITY OF TURKU


UNIVERSITY OF TURKU
Department of Mathematics and Statistics
VIRTA, JONI: SOME TOOLS FOR LINEAR DIMENSION REDUCTION
Master's thesis, 52 p., 2 appendices 7 p.
Statistics
May 2014

Dimension reduction refers to a family of methods commonly used in multivariate statistical analysis. The common objective of all dimension reduction methods is essentially the same: to reduce the number of variables in the data while still preserving their information content, however it is measured. In linear dimension reduction this is done by replacing the original variables with a smaller number of linear combinations of them. Dimension reduction can further be divided into two subcategories: supervised dimension reduction, where some (usually one) of the variables clearly have the role of a response and we are interested in explaining their behavior using the other variables, and unsupervised dimension reduction, where no such distinction is made and all variables are considered equal.

In this thesis the theory behind three linear dimension reduction methods, principal component analysis (PCA), independent component analysis (ICA) and sliced inverse regression (SIR), is reviewed. Although both PCA and ICA are unsupervised methods, their criteria for creating the new variables differ to an extent that makes them suitable in different situations. The third method, SIR, is supervised and thus aims to replace the original explanatory variables with a smaller number of variables that best explain the variability of some response. Additionally, a variant of SIR that uses splines in finding the new variables is briefly discussed.

The concept of simultaneous diagonalization of two scatter matrices is also given some thought. A relatively new approach, its importance for dimension reduction stems from the fact that most dimension reduction methods, while seemingly disconnected, can be formulated in a unified way using it.

In the final section two simulation studies are conducted. In the first, the performances of the two unsupervised methods (PCA and ICA) in finding directions relevant to explaining a response variable in a regression situation are compared to that of the supervised SIR. In the second simulation the superiority of the spline-based SIR to the regular SIR under some specific models is demonstrated.

Keywords: multivariate statistics, dimension reduction, principal component analysis, independent component analysis, sliced inverse regression.
Asiasanat: monimuuttujamenetelmät, dimension supistus, pääkomponenttianalyysi, riippumattomien komponenttien analyysi, sliced inverse regression.


Contents

1 Introduction
  1.1 An overview of the subject
  1.2 Recent literature on the subject
  1.3 About referencing
2 Necessary theoretical framework
  2.1 About notation
  2.2 The basics of dimension reduction
  2.3 Eigendecomposition
  2.4 Singular value decomposition
  2.5 Location and scatter functionals
  2.6 The diagonalization of two scatter matrices
3 Principal Component Analysis (PCA)
  3.1 The basic idea behind PCA
  3.2 Finding the principal components
  3.3 On the use of the principal components
  3.4 Determining the correct dimension k
  3.5 Some specific applications of PCA
  3.6 PCA as a diagonalization of two scatter matrices
4 Independent Component Analysis (ICA)
  4.1 Background on ICA
  4.2 The model for ICA
  4.3 The FOBI solution
  4.4 Some applications of ICA
  4.5 ICA as a diagonalization of two scatter matrices
5 Sliced Inverse Regression (SIR)
  5.1 Introducing SIR
  5.2 The model for SIR
  5.3 Finding the e.d.r. space
  5.4 Some remarks on using SIR
  5.5 Spline-based approach to SIR
  5.6 SIR as a diagonalization of two scatter matrices
6 Simulation studies
  6.1 Using the dimension reduction methods with regression
  6.2 Comparing the regular SIR with the spline-based version
7 Concluding remarks and future research

Appendices
A Proofs
B Some important R code

List of Figures

1 Geometrical interpretation of PCA
2 A scree plot and a scatter plot of the first two principal components
3 Illustration of the use of ICA in signal processing
4 Illustration of the use of ICA in group separation
5 Graph of the estimated inverse regression curve
6 Illustration of the limitations of SIR
7 Paired scatter plots of the response and the components found by different methods from the simulated data
8 Paired scatter plots of the response and the components found by different methods from the Boston data

List of Tables

1 The results of comparing spline-based SIR with the regular SIR using model ...
2 The results of comparing spline-based SIR with the regular SIR using model ...
3 Testing spline-based SIR with smaller numbers of slices

1 Introduction

1.1 An overview of the subject

When dealing with multivariate data it is not unusual that the number of variables involved is measured in thousands. In addition to the apparent computational issues, an extensive number of variables can also limit the pool of techniques and methods applicable. For example, in the area of nonparametric statistics the number of observations required for the successful use of certain methods quickly increases with the dimension of the data, as stated in Li (1991).

Traditionally two different approaches have been utilized in tackling the problem. With sufficient knowledge, one could choose to retain only the variables relevant to the phenomenon at hand. For example, in regression analysis it could be possible to leave out variables having no effect on the response variable, thus reducing the total number of variables. This is of course not feasible when the meanings or relationships of the variables are not known. The second approach, dimension reduction, does not outright omit any variables but instead tries to retain as much information as possible while still decreasing the number of variables involved. In linear dimension reduction this is accomplished by replacing the original data with linear combinations of the original variables. The way these linear combinations are chosen depends on the dimension reduction method used and, as is often the case, only a few combinations are needed to retain most of the information carried by the original variables. Note that the way of measuring the information is not constant across the methods and depends on the specific dimension reduction technique used.

The purpose of this thesis is first to consider the subject of linear dimension reduction in general and to review some of the methods commonly used for linear dimension reduction (henceforth referred to simply as dimension reduction). Three dimension reduction methods, all having fundamentally different premises, are reviewed and discussed in detail. The theory behind the methods is also considered in the light of modern approaches and findings in the field. Second, some simulation studies are conducted to assess the performances of the three methods in some specific situations.

Before the specific dimension reduction methods are introduced and examined in detail, some required basic theory is first revised. Section 2 begins with a short introduction to the notational conventions used throughout the thesis. After that a distinction between two types of dimension reduction is made.

Supervised dimension reduction refers to a situation where some (usually one) of the variables can clearly be labelled as responses, while the rest of the variables are thought to be explanatory; in unsupervised dimension reduction all the variables involved are equal in the response/explanatory variable sense. Two very useful matrix decompositions, the eigendecomposition and the singular value decomposition, are also given a short revision. Both of these play a central role in dimension reduction, and the eigendecomposition in particular is utilized in one way or another in all the dimension reduction methods covered. Closing Section 2 is an introduction to the concept of location and scatter functionals and the use of the latter in dimension reduction. The method used there, which revolves around the simultaneous diagonalization of a pair of so-called scatter matrices, was already examined in Caussinus and Ruiz-Gazen (1993), but a comprehensive treatise was given in Tyler et al. (2009) and the subject was later explored in more detail in the setting of dimension reduction in Ilmonen et al. (2012) and Liski et al. (2014a). The remarkable fact that every dimension reduction method covered in this thesis (and plenty of others) can be seen as a special case of the approach is also addressed.

The next three sections are then devoted to the individual dimension reduction methods, starting from Section 3 and principal component analysis (PCA), which is possibly the best known of all unsupervised dimension reduction methods. PCA revolves around finding a linear transformation that renders the data uncorrelated and in addition maximizes the variances of the individual components (under suitable restrictions). While PCA is included in most statistical software packages and using it is straightforward enough, care must be taken since PCA does not utilize third or higher moments, and some patterns in the data can thus go undetected by it. The scale of the original variables also affects the resulting components and PCA is thus not affine invariant.

Section 4 introduces the second unsupervised dimension reduction method, independent component analysis (ICA). While both ICA and PCA are unsupervised in nature, the former is a model-based approach. Namely, in ICA the observed variables are thought to be linear combinations of some original independent variables, and the purpose of ICA is to retrieve these original independent components. This makes ICA suitable for the field of signal processing, where the observations are indeed thought to be linear combinations of some latent signals (however, ICA treats the observations as independent, which they usually are not). The solution presented to the ICA problem, the fourth order blind identification (FOBI), is based on the use of fourth moments, and the linear combinations are chosen as directions having maximal kurtosis.

The last and the only supervised dimension reduction method considered, sliced inverse regression (SIR), is covered in Section 5. The name of the method refers to the inverse view it takes on regression; in order to find the linear combinations best explaining the univariate response, SIR regresses each explanatory variable against the response. This leads to the so-called inverse regression curve, which has an essential role in finding the effective dimension reduction space (e.d.r. space) giving the relevant linear combinations. A method originally explored in Zhu and Yu (2007) that uses splines to interpolate the inverse regression curve is also reviewed. That is, the discrete conditional means of the explanatory variables used in regular SIR to estimate the inverse regression curve are replaced with piece-wise low-degree polynomials.

In Section 6 the three methods are then used in simulation tests. In the first simulation the efficiencies of the three methods are compared in a regression situation. Although PCA and ICA are unsupervised and do not utilize the joint distribution of the response and the explanatory variables, their performance in finding the relevant linear combinations is nevertheless evaluated and compared to the supervised SIR. The second simulation compares the efficiency of the regular SIR to that of the spline-based version of SIR. Tests are conducted using different forms of relationship between the response and the explanatory variables, and both methods' abilities to extract the relevant components are evaluated. The simulations, as well as all the figures and plots in the thesis, were produced using the R software package (R Development Core Team, 2011).

1.2 Recent literature on the subject

While the core methods presented have been around for a while, principal component analysis even more so, the field of dimension reduction has constantly continued to evolve and grow. The following is a short review of some selected recent publications related to dimension reduction.

For a comprehensive account of the current state of supervised dimension reduction the reader may refer to Ma and Zhu (2013). In their article the authors consider the estimation of the e.d.r. space under the three most common approaches to supervised dimension reduction: inverse regression based methods, non-parametric methods and semiparametric methods. The current literature pertaining to each of the three approaches is reviewed. The authors also discuss the problem of making inference in supervised dimension reduction; since the inference now consists of the estimation of a subspace instead of the usual estimation of parameters, the situation has to be approached from a different angle. The problem of determining the correct dimension k is also discussed and some specific methods for estimating it are reviewed.

The estimation of the dimension k of the dimension reduction space is of central importance also in Liski et al. (2014b), where the authors discuss it in the case of principal component analysis. Their approach is based on the use of different M scatter estimates, that is, matrices that characterize some aspect of the spread of the data. The eigenvalues of the M scatter estimate of choice are used to compute the value of a statistic testing the null hypothesis that the last p - k principal components are uninformative. Three different step-wise procedures for performing the tests are given, namely the usual forward and backward procedures and a divide-and-conquer method resembling a binary search algorithm. Asymptotic results regarding the estimation of k and the subspace itself are also given, and concluding the paper is an array of simulation studies comparing the performances of different combinations of M scatter estimates and test procedures in estimating k.

While dimension reduction is most commonly used with standard independent and identically distributed (i.i.d.) data, it has also been successfully utilized in more complicated situations, e.g. with time series and spatially correlated data. In both cases there are some additional structures in the data that have to be accounted for. For an example of using dimension reduction with time series see Miettinen et al. (2014), where the authors assume that the observed time series are linear combinations of some latent uncorrelated series. So-called second order blind identification (SOBI) methods can then be used to try to recover these original time series. For an example of spatial correlation and dimension reduction, refer to Nordhausen et al. (2014), where the authors present a dimension reduction method applicable to spatially correlated observations. The data used by them consist of high-dimensional chemical element measurements analysed from samples of terrestrial moss collected at various sites in the Kola peninsula. The observations are clearly not independent, the geographical proximity causing dependence between the measurements. To overcome this obstacle the authors present spatial blind source separation to reduce the dimension and to find different features and geographical patterns in the data.

The diagonalization of two scatter matrices discussed later in this thesis is in itself a relatively new approach. Although some basics of the method were considered already in Caussinus and Ruiz-Gazen (1993), the majority of publications on the subject are fairly recent. An extensive discussion of the method was given in Tyler et al. (2009), in which the authors give the fundamental properties of the diagonalization and its use in projecting the data into an invariant coordinate system (ICS).

As suggested by its name, this coordinate system has the property of being in some sense invariant, or unchanged, under affine transformations of the original data, and it can be used to reveal structures in the data that are undetectable in the original coordinate system. The theory is accompanied by examples of extracting hidden features from data. The diagonalization's connection to the basic dimension reduction methods is also briefly touched on by showing how solutions to the independent component analysis can be formulated using it.

Applying the diagonalization in finding invariant coordinate systems was further considered in Ilmonen et al. (2012) in the context of developing affine equivariant and affine invariant statistical procedures. A method is called affine equivariant if it reflects an affine transformation of the data in some meaningful way, while being affine invariant means that the output of the method is unchanged by any affine transformation of the original data. As affine transformations are used in everything from the scaling of variables to rotations, both affine equivariance and affine invariance are highly desired properties for statistical procedures. The authors of Ilmonen et al. (2012) give ways of creating invariant coordinate system functionals that can be used to transform the data into a coordinate system that is affine invariant (possibly up to some smaller group of transformations). A suitable invariant coordinate system functional and a fitting statistical procedure can then be combined to yield an invariant statistical procedure. Examples of both univariate and multivariate invariant statistics and statistical procedures are given. Lastly, the relationship between the diagonalization of two scatter matrices and the independent component analysis is again given some thought, and the authors also show how sliced inverse regression can be seen as a special case of the diagonalization.

While the invariant coordinate selection was applied only in unsupervised situations in Tyler et al. (2009) and Ilmonen et al. (2012), it is in Liski et al. (2014a) extended to the supervised case, that is, to data consisting of both explanatory variables and a univariate response. This is accomplished by letting the second scatter matrix in the diagonalization be supervised (that is, dependent on the response), the first scatter matrix usually being the regular covariance matrix. The authors give several examples of supervised scatter and location functionals and further give asymptotic properties for the scatter functionals and for the matrix giving the transformation into the invariant coordinate system. The authors also show how several different supervised dimension reduction methods, such as principal Hessian directions (PHD) and sliced average variance estimation (SAVE), can be formulated using the diagonalization of two scatter matrices. Concluding the paper is a simulation study where the authors compare the performances of several different supervised scatter matrices in dimension reduction via the diagonalization. The previously mentioned supervised dimension reduction methods are also included in the comparison.

1.3 About referencing

As some theoretical parts of the following sections are largely based on single sources, the following custom regarding references is practised throughout the thesis: if the theory parts of a section or subsection are based on a single source or reference, this is stated at the beginning of the section or subsection. Otherwise standard referencing practices are followed.

2 Necessary theoretical framework

2.1 About notation

The exposition of the subject is somewhat theoretical in nature and involves certain recurring mathematical objects. Therefore it makes sense to set some notational conventions. Throughout the following, all vectors are denoted by bold lower case letters (e.g. x, µ) and all matrices by bold upper case letters (e.g. S, H). In addition, all vectors are taken to be column vectors.

The p × n matrix X = (x_1, ..., x_n) is assumed to be a random sample of size n from a p-variate distribution and is the data we are working with. The random vectors x_1, ..., x_n are assumed to be independent and identically distributed (i.i.d.). Sometimes the vector x is also considered without a subscript to denote an arbitrary p-variate random vector. Standardized random vectors are denoted with the subscript st, such as x_st. When a distinction between explanatory variables and a response variable has to be made, the vector x is taken to contain the former while the univariate response is denoted by y.

Linear combinations of the original variables play a central role in all dimension reduction methods, and the p-dimensional vector of linear combinations is denoted by z = (z_1, z_2, ..., z_p)^T. The dimension reduction itself is accomplished by retaining only some k-dimensional subvector of z that ideally contains the same information as the original p-dimensional vector x. The new dimension k of course has to satisfy k < p.

When dealing with location and scatter functionals, arbitrary probability distributions of some random vector x are considered and denoted by F_x. Similarly, F_{Ax+b} is used to represent the distribution of the transformed random variable Ax + b.

Any assumptions regarding A and b are, when necessary, stated in the text. Note also that transformations of the form Ax + b are called affine transformations.

Certain matrix types are also given symbols of their own. U and V are used to represent orthogonal matrices, that is, matrices satisfying U^T U = U U^T = I, and D is used for diagonal matrices. If the diagonal elements of a diagonal matrix need to be explicitly stated, the notation diag(d_1, ..., d_p) is used to represent a diagonal matrix with d_1, ..., d_p as its diagonal elements. The notation diag(A) is also used for a diagonal matrix having the same diagonal elements as the square matrix A. In addition to the usual Cov(x), the upper case Σ is in some cases used for covariance matrices of random vectors, with a subscript denoting the random vector in question (e.g. Σ_x), while S is used similarly for the sample covariance matrix. Additionally, S, S_1, S_2, ... are in the context of the simultaneous diagonalization of two scatter matrices also used to denote arbitrary scatter matrices. If needed, the meaning of S in question is explicitly stated in the text.

Finally, some specific sets of square matrices have to be defined. The set S refers to the set of all symmetric, positive definite p × p matrices and the set A is the set of all non-singular p × p matrices. In some contexts it is also necessary to consider some particular p × p matrix transformation groups. The following list includes five commonly used transformation groups.

- C_D0 = {cI : c ∈ R_+}, homogeneous rescaling; scales all elements of a matrix by the same factor.
- C_D = {diag(c_1, c_2, ..., c_p) : c_i ∈ R_+, i = 1, ..., p}, heterogeneous rescaling; scales the elements by the same factor row-wise or column-wise.
- C_J = {diag(c_1, c_2, ..., c_p) : c_i = ±1, i = 1, ..., p}, heterogeneous sign-changing; changes the signs of the elements row-wise or column-wise.
- C_P = {P : P is a permutation matrix}, permutation; changes the order of rows or columns.
- C_U = {U : U is an orthogonal matrix}, orthogonal transformation; rotates (and possibly reflects) the column or row vectors around the origin without affecting their magnitudes.

Notice that whether the operation in the last four transformations affects the rows or the columns is determined by whether the matrix multiplication is done from the left or from the right, respectively. Notice also that all the sets satisfy the definition of a mathematical group; they are all associative and closed under matrix multiplication, as well as contain the identity matrix I and the inverse of each element. A small illustration of these transformations is given in the sketch below.
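
To make the effect of the groups concrete, the following is a minimal R sketch (not part of the thesis; the matrices and the vector are arbitrary, hypothetical examples) applying elements of C_D, C_J and C_P to a random vector.

    set.seed(1)
    x <- rnorm(3)                       # an arbitrary 3-variate "observation"

    C_D <- diag(c(2, 0.5, 10))          # heterogeneous rescaling (an element of C_D)
    C_J <- diag(c(1, -1, 1))            # heterogeneous sign-changing (an element of C_J)
    C_P <- diag(3)[c(2, 3, 1), ]        # a permutation matrix (an element of C_P)

    C_D %*% x                           # components rescaled
    C_J %*% x                           # sign of the second component flipped
    C_P %*% x                           # components reordered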

15 and closed under matrix multiplication as well as contain the identity matrix I and the inverse of each element. The groups C P, C D and C J (or a subset of them) can further be combined into a single transformation group C PDJ = {C : each row and column of C has exactly one non-zero element}. Transforming x Cx, C C PDJ then results in simultaneous reordering, rescaling and sign-changing of the components of x. This group is encountered when searching for the eigendecomposition of a matrix later on. 2.2 The basics of dimension reduction The goal in dimension reduction is the reducing of the number of variables without giving up the information content of the data. This is accomplished by finding a suitable p k transformation matrix B so that after the transformation x B T x R k the resulting vectors carry as much information as the original vectors x R p. The definition of information is characteristic to each dimension reduction method and some methods use, for example, variance or non-gaussianity as the measure of information. Furthermore, we naturally want k < p which means the dimension of the data has been reduced from p to k. Note that the transformation matrix B is not unique as any other matrix with the same column space provides the same dimension reduction as B. This can be countered by instead working with the corresponding unique projection matrixp B = B(B T B) 1 B T giving theprojection onto thecolumn space of B (see Liski et al. (2014a)). The dimension reduction is then carried out by transforming the p-dimensional observations into the k-dimensional subspace as z = P B x. However, due to the way the presented dimension reduction methods work, the projection matrix approach is not used in this thesis. The different dimension reduction methods presented later can further be divided into unsupervised and supervised dimension reduction, depending on the roles of the variables involved. For more information on the two classes see, for example, Liski et al. (2014a), on which the following short introduction to them is based. If the dimension reduction is based only on the random vector x and is not in any way concerned with building a model between x and some response variable y, the method is said to be unsupervised. Note that the term dimension reduction is sometimes used when actually referring to the unsupervised dimension reduction. Unsupervised dimension reduction is simply based on the above problem of finding a p k matrix B so that the 8

Basic examples of unsupervised dimension reduction include principal component analysis, independent component analysis and factor analysis, the first two of which are covered in later sections.

Supervised dimension reduction methods, on the other hand, are relevant when the data consist of some response variable y in addition to the x's. The aim is again to reduce the dimension of the vector x ∈ R^p, but we are also interested in predicting the values of the response y using the x's. The dimension reduction method should then somehow take into account the relationship between y and x. The problem can be formulated as finding a p × k matrix B = (b_1, ..., b_k) such that

y ⊥ x | B^T x.    (1)

That is, all the information that the original p-dimensional x has on y is contained in the k linear combinations b_j^T x, j = 1, ..., k. If such vectors b_j are found, the dimension of the data (the explanatory variables) has been successfully reduced from p to k. For examples of supervised dimension reduction methods consider the sliced inverse regression covered later, but also sliced average variance estimation (SAVE), principal Hessian directions (PHD), canonical correlation analysis (CCA) and directional regression (DR). Note that although in this thesis only univariate response variables are considered, there is no reason why multivariate responses could not be used in supervised dimension reduction. Indeed, in the aforementioned canonical correlation analysis the observations consist of vectors x ∈ R^p and y ∈ R^q.

2.3 Eigendecomposition

A central element in all dimension reduction methods is the decomposition of a matrix into a product of other, in some sense simpler, matrices. While numerous different decompositions exist, the two most useful to us are the eigendecomposition and the singular value decomposition. Though the reader is assumed to be familiar with these concepts, the key points of both decompositions are briefly discussed in the following subsections. For the proofs of the following results the reader may refer to Harville (1997), on which the discussion of the two decompositions is based.

The eigendecomposition (or spectral decomposition) of a p × p matrix A arises from the problem of finding solutions to the equation

Aq = dq,    (2)

where each solution pair (q_j, d_j), j = 1, ..., p, consists of an eigenvector q_j and the corresponding eigenvalue d_j.

Assume now that A has p linearly independent eigenvectors. Then A can be expressed as

A = Q D Q^{-1},    (3)

where the columns of the p × p matrix Q are the p eigenvectors q_j of A and the diagonal matrix D has on its diagonal the p eigenvalues d_j of A in an order corresponding to the ordering of the eigenvectors in Q. This is called the eigendecomposition of the matrix A.

For symmetric matrices, e.g. covariance matrices, additional results can be proved. If S is a symmetric p × p matrix, it can be expressed as

S = U D U^T,    (4)

where the columns u_j of the orthogonal matrix U are the p eigenvectors of S and the diagonal matrix D again has on its diagonal the corresponding eigenvalues d_j of S, j = 1, ..., p. The converse result also holds and thus the eigenvectors of a matrix are orthogonal if and only if the matrix is symmetric. Furthermore, it can be proven that the eigenvalues of a positive definite matrix are positive and thus d_j > 0, j = 1, ..., p, for any non-singular covariance matrix. The decomposition in (4) can also be stated as

S = d_1 u_1 u_1^T + ... + d_p u_p u_p^T.

The eigendecomposition also helps us in standardizing the p-variate random vector x. The standardization, which is achieved by transforming

x_st = Σ_x^{-1/2} (x - E(x)),

requires the inverse square root of the covariance matrix Σ_x, which is not uniquely defined, and indeed Σ_x^{-1/2} denotes any matrix G satisfying

G Σ_x G^T = I.    (5)

The construction of matrices G satisfying (5) is discussed in Ilmonen et al. (2012), where the authors propose several ways for uniquely choosing Σ_x^{-1/2}. One unique and symmetric choice is

Σ_x^{-1/2} = U D^{-1/2} U^T,

where the matrices U and D refer to the eigendecomposition of Σ_x and the diagonal matrix D^{-1/2} = diag(d_1^{-1/2}, ..., d_p^{-1/2}) is well-defined due to the positive definiteness of the non-singular covariance matrix Σ_x. Note that when the inverse square root Σ_x^{-1/2} is later used, it is, unless otherwise stated, always assumed to be the symmetric version U D^{-1/2} U^T. Consequently, also its inverse, Σ_x^{1/2}, is then symmetric.
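
The symmetric inverse square root and the resulting standardization are straightforward to compute numerically. The following R sketch (a toy example on simulated data, not the code of Appendix B) constructs the sample version of Σ_x^{-1/2} = U D^{-1/2} U^T and verifies that the standardized data have sample covariance equal to the identity matrix.

    set.seed(3)
    n <- 500; p <- 3
    X <- matrix(rnorm(n * p), ncol = p) %*% matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 3), 3, 3)
    # here the observations are the rows of X (the thesis uses the p x n convention)

    Sigma <- cov(X)                            # sample covariance matrix
    e <- eigen(Sigma, symmetric = TRUE)        # U = e$vectors, d_j = e$values
    Sigma_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)

    X_st <- sweep(X, 2, colMeans(X)) %*% Sigma_inv_sqrt   # x_st = Sigma^{-1/2}(x - mean)
    round(cov(X_st), 10)                       # approximately the identity matrix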

Finally, notice that in constructing the matrices Q and U in (3) and (4), respectively, the order of the eigenvectors as the columns of Q or U is not fixed. Any ordering can be reflected in the ordering of the corresponding eigenvalues on the diagonal of D. Furthermore, for the not-generally-orthogonal matrix Q neither the signs nor the magnitudes of the eigenvectors are fixed, as is evident from (2); if a vector q_j satisfies the equation (2) for some d_j, so does any vector λq_j, for any λ ∈ R. The matrix Q in (3) is then unique only up to the group of transformations C_PDJ (applied column-wise) presented earlier. For an orthogonal matrix U the scales of the eigenvectors are fixed to preserve the vectors' orthonormality, but their signs can still be chosen at will, and thus the matrix U in (4) is unique up to the group of transformations C_PJ. In case some of the eigenvalues of a matrix are equal, the uniqueness of the matrix Q or U is a slightly more complex matter. For some discussion on this see Chapter 21.4 of Harville (1997). For the rest of this thesis, unless stated otherwise, it is assumed that the eigenvalues of the matrices used are distinct. The eigendecomposition has its uses in all of the following dimension reduction methods and some additional properties of it are explained in later sections.

2.4 Singular value decomposition

While the eigendecomposition is applicable only to square matrices, the singular value decomposition is a more general method and works for all m × n matrices. It is not as heavily featured in the following sections as the eigendecomposition and thus only some basic properties of it are given in the following discussion, which is also based on Harville (1997).

Let A be an m × n matrix of rank r. Then it can be expressed as

A = U D V^T,    (6)

where U and V are respectively m × m and n × n orthogonal matrices and the m × n block matrix D = (D_1 0; 0 0) has an r × r diagonal matrix D_1 as its upper left block and is otherwise filled with zeroes. The diagonal elements d_i, i = 1, ..., r, of D_1 are called the singular values of A. The sparsity of D further gives a more compact way of writing the singular value decomposition. If we partition the two orthogonal matrices as U = (U_1, U_2) and V = (V_1, V_2), where U_1 and V_1 are m × r and n × r matrices respectively, we get

A = U D V^T = U_1 D_1 V_1^T.
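
In R the decomposition is available through svd(), which by default returns the thin form; for a matrix of full column rank this coincides with the compact form U_1 D_1 V_1^T above. The following short sketch (an assumed generic example, not thesis code) also previews the connection to the eigendecomposition of A^T A discussed next.

    set.seed(4)
    A <- matrix(rnorm(6 * 4), nrow = 6)        # a 6 x 4 matrix, full column rank a.s.

    s <- svd(A)                                # thin SVD: s$u is 6 x 4, s$v is 4 x 4
    all.equal(A, s$u %*% diag(s$d) %*% t(s$v)) # A = U_1 D_1 V_1^T, up to numerical error

    # the squared singular values equal the eigenvalues of A^T A
    all.equal(s$d^2, eigen(crossprod(A), symmetric = TRUE)$values)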

The proof of the existence of the decomposition is here omitted, but the decomposition itself can be found by observing the matrix products A A^T and A^T A, which are both symmetric matrices. Letting A have the singular value decomposition given in (6), we have for the two products

A A^T = U D D^T U^T  and  A^T A = V D^T D V^T.    (7)

Both D D^T and D^T D are now diagonal matrices and the equations in (7) thus describe the eigendecompositions of the symmetric matrices A A^T and A^T A. The columns of the matrices U and V in (6) are then the eigenvectors of A A^T and A^T A, respectively. The equations in (7) also show that the matrices A A^T and A^T A have the same non-zero eigenvalues, that is, the first r diagonal elements of the matrices D D^T and D^T D, respectively. Further, the r non-zero eigenvalues are equal to the squared singular values d_i^2, i = 1, ..., r, of the matrix A.

2.5 Location and scatter functionals

In the univariate case the sample mean and sample variance are the most common measures of location and spread, respectively. But as they characterize only certain aspects of the underlying distribution, the use of additional statistics is advocated. If, for example, more robust measures of location and dispersion are desired, the median and the median absolute deviation could be considered. Similarly, in multivariate statistics there is no need to limit our attention to the ordinary sample mean vector and sample covariance matrix only. In fact, several other statistics and families of statistics describing location and spread exist, the usefulness of which depends on the situation.

Next, the definitions of the so-called location and scatter functionals are given. They are defined as functionals to emphasize the fact that they are functions of probability distributions and therefore describe some properties of the distributions in question. The concept of affine equivariance also needs to be addressed before defining the functionals. A loose definition would be that a statistic or functional is said to be affine equivariant if an affine transformation of the input random vector is reflected in the statistic or the functional in some meaningful way.

A location functional T(F_x) ∈ R^p is a vector-valued functional which is affine equivariant in the sense that

T(F_{Ax+b}) = A T(F_x) + b,    (8)

for all matrices A ∈ A and vectors b ∈ R^p. In essence, location functionals can be seen as describing some aspect of the location or center of a multivariate distribution.

For example, the mean vector E(x) is a location functional. Note that the natural assumption that a change in the scale in which the observations are measured should be reflected in the location estimates can be seen as a special case of the equivariance in (8).

A scatter functional S(F_x) ∈ S is a matrix-valued functional which is affine equivariant in the sense that

S(F_{Ax+b}) = A S(F_x) A^T,    (9)

for all matrices A ∈ A and vectors b ∈ R^p. Scatter functionals, the most common example of which is the ordinary covariance matrix Cov(x), are used to measure some aspects of multivariate spread or dispersion.

In addition to the location and scatter functionals describing properties of probability distributions, the respective sample statistics T(X) and S(X) can be defined. Correspondingly they satisfy

T(A X + b 1_n^T) = A T(X) + b  and  S(A X + b 1_n^T) = A S(X) A^T,

for all A ∈ A and all b ∈ R^p. For examples of different scatter and location functionals the reader may refer to Tyler et al. (2009), where the authors also discuss various means of constructing them; a small numerical check of the equivariance properties is sketched at the end of this subsection.

Finally, one special family of scatter functionals is also considered in the section discussing the independent component analysis. Namely, a scatter functional S(F_x) is said to possess the independence property if it is a diagonal matrix whenever the components of x are independent. For example, both the ordinary covariance matrix Cov(x) and the covariance matrix based on fourth moments Cov_4(x) examined later have this property.
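
Before moving on to the diagonalization, here is a small numerical check (an illustrative example, not drawn from the thesis) of the equivariance properties (8) and (9) for the sample mean vector and the sample covariance matrix.

    set.seed(5)
    n <- 200; p <- 3
    X <- matrix(rnorm(n * p), ncol = p)              # observations as rows
    A <- matrix(rnorm(p * p), p, p)                  # an arbitrary non-singular matrix
    b <- c(1, -2, 3)
    Y <- X %*% t(A) + matrix(b, n, p, byrow = TRUE)  # y_i = A x_i + b

    all.equal(colMeans(Y), as.vector(A %*% colMeans(X) + b))   # equivariance (8)
    all.equal(cov(Y), A %*% cov(X) %*% t(A))                   # equivariance (9)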

2.6 The diagonalization of two scatter matrices

Concluding each of the following sections reviewing the different dimension reduction methods is a subsection that shows how the respective method can be formulated as a simultaneous diagonalization of two matrices and, apart from PCA, as a simultaneous diagonalization of two scatter matrices. That is, the only thing differentiating the distinct dimension reduction methods is the choice of the two scatter matrices used. Apart from this choice the procedure stays otherwise the same and thus provides a useful, unifying framework for studying the properties of the individual methods. For a brief account of the history of the approach see the introduction.

In this subsection, which is based on Tyler et al. (2009), the basics and the general principle of the approach are discussed; in the following sections reviewing the individual dimension reduction methods the choices of scatter matrices and other specific aspects are looked into in more detail.

Before starting with the method itself, consider first the standard eigendecomposition of a p × p matrix A presented earlier. The relationship between the matrix and its eigenvectors q_j and eigenvalues d_j is summarized by the equation

A q_j = d_j q_j,  j = 1, ..., p.    (10)

The comparison of the two scatter matrices is carried out by finding a solution to an equation similar to this. If S_1 ∈ S and S_2 ∈ S are symmetric, positive definite p × p matrices, then a pair (h_j, ρ_j) is said to be an eigenvector and a corresponding eigenvalue of S_2 relative to S_1 if ρ_j and h_j satisfy the equation

S_2 h_j = ρ_j S_1 h_j.    (11)

It can be shown, using the definition (10), that finding solutions to (11) is equivalent to two other eigenvector-eigenvalue problems (proofs in Appendix A). First, any solution pair (h_j, ρ_j) to the equation (11) is also an eigenvector-eigenvalue pair of the not-generally-symmetric matrix S_1^{-1} S_2. Second, any eigenvector-eigenvalue pair (h_j, ρ_j) satisfying the equation (11) is closely related to the eigenvectors and eigenvalues of the symmetric matrix M = S_1^{-1/2} S_2 S_1^{-1/2} ∈ S, where S_1^{-1/2} ∈ S is the unique symmetric inverse square root of S_1. Namely, the eigenvalues ρ_j are the same in both problems, and the eigenvectors q_j of the matrix M and the eigenvectors h_j of S_2 relative to S_1 corresponding to the same eigenvalue ρ_j have the relationship q_j ∝ S_1^{1/2} h_j. Note also that the matrix M is positive definite and thus the eigenvalues ρ_j are also positive. The proof of this is omitted here, but a more general proof of the positive definiteness of matrices of the form A^T B A, where the p × p matrix B is positive definite and A ∈ A, can be found in Chapter 14.2 of Harville (1997).

Note that, as discussed in the subsection concerning the eigendecomposition, having two or more eigenvalues that are equal complicates matters somewhat, and therefore the constraint regarding the uniqueness of the eigenvalues is now required. Namely, in the following it is assumed that the eigenvalues of the matrix S_1^{-1} S_2 are distinct.

As M is symmetric, its eigenvectors q_j are orthogonal, and then from the equation

q_i^T q_j ∝ (S_1^{1/2} h_i)^T (S_1^{1/2} h_j) = h_i^T S_1 h_j

and from the orthogonality of the vectors q_j it follows that h_i^T S_1 h_j = 0 for i ≠ j.

From the definition in (11) it can be seen that h_i^T S_2 h_j ∝ h_i^T S_1 h_j, and so also h_i^T S_2 h_j = 0 for i ≠ j. Thus choosing two scatter matrices S_1 and S_2 and solving the equation (11) (for example via the eigendecomposition of the matrix M) yields the following simultaneous diagonalization of the matrices S_1 and S_2:

H^T S_1 H = D_1  and  H^T S_2 H = D_2.    (12)

As stated in Ilmonen et al. (2012), the matrix H, as a matrix of eigenvectors, is not unique, for the reasons given in the subsection concerning the eigendecomposition. It is unique only up to the group of transformations C_PDJ. But with the additional, non-restricting conditions given in Ilmonen et al. (2012), its uniqueness can be established.

i) First, the magnitudes of the eigenvectors can be fixed by requiring that the diagonalization of the first scatter matrix yields the identity matrix, H^T S_1 H = I. This is achieved by normalizing the vectors S_1^{1/2} h_j to be unit vectors, so that (S_1^{1/2} h_j)^T (S_1^{1/2} h_j) = h_j^T S_1 h_j = 1. This makes the matrix H unique up to the group of transformations C_PJ. Note also that the diagonal elements of the second diagonal matrix are then the eigenvalues of S_1^{-1} S_2, that is, D_2 = D = diag(ρ_1, ..., ρ_p). This holds because from (11) it follows that S_2 H = S_1 H D, and pre-multiplying by H^T we have D_2 = D.

ii) Second, the order of the eigenvectors (as the columns of H) can be fixed by requiring that the eigenvalues ρ_j be ordered on the diagonal of D_2 in descending order. Now the matrix H is unique up to the group of transformations C_J.

iii) And finally, the signs of the eigenvectors h_j can be fixed by requiring the additional condition H^T (T_1 - T_2) ≥ 0 to hold, where T_1 and T_2 are any two location functionals and the inequality is taken componentwise. This final condition makes the matrix H unique.

As pointed out in Ilmonen et al. (2012), the final condition fixing the signs of the eigenvectors prohibits the use of symmetric distributions, as for symmetric distributions all location functionals are equal to the symmetry point of the distribution. In these cases it is possible to drop the final condition and settle for the uniqueness of H up to the signs of the eigenvectors.

After finding the matrix H it can be used to transform the data into the invariant coordinate system mentioned earlier by x ↦ H^T x, and this is extensively discussed in Tyler et al. (2009). A small sketch of computing H for two given scatter matrices follows.
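
The whole construction fits in a few lines of R. The sketch below is only an illustration: the second scatter matrix is a simple fourth-moment-type reweighted covariance chosen for demonstration (it is an assumed example, not necessarily the Cov_4 functional used later in the thesis). It builds M = S_1^{-1/2} S_2 S_1^{-1/2}, recovers H and verifies the normalized diagonalization.

    set.seed(6)
    n <- 1000; p <- 4
    X  <- matrix(rnorm(n * p), ncol = p) %*% matrix(rnorm(p * p), p, p)
    Xc <- sweep(X, 2, colMeans(X))

    S1 <- cov(X)                                 # first scatter matrix
    r2 <- rowSums((Xc %*% solve(S1)) * Xc)       # squared Mahalanobis distances
    S2 <- crossprod(Xc * r2, Xc) / n             # a reweighted (fourth-moment type) scatter

    e1    <- eigen(S1, symmetric = TRUE)
    S1_is <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors)  # S_1^{-1/2}
    M     <- S1_is %*% S2 %*% S1_is

    eM <- eigen(M, symmetric = TRUE)             # eigenvalues already in decreasing order
    H  <- S1_is %*% eM$vectors                   # columns h_j, since q_j = S_1^{1/2} h_j

    round(t(H) %*% S1 %*% H, 8)                  # the identity matrix, as required below in (13)
    round(t(H) %*% S2 %*% H, 8)                  # diagonal matrix of the eigenvalues rho_j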

However, we are mainly concerned with the equations in (12) showcasing the simultaneous diagonalization of the two scatter matrices, and especially with the normalized form given by fixing the magnitudes and the order of the eigenvectors,

H^T S_1 H = I  and  H^T S_2 H = D,    (13)

where the eigenvalues on the diagonal of D are in descending order. This is the common form on which all the presented dimension reduction methods are shown to be based in the following sections (although for PCA, S_1 is not an actual scatter matrix). Usually S_1 is chosen to be the regular covariance matrix Cov(x) and the matrix S_2 measures some other aspect of the spread, specific to the particular dimension reduction method at hand.

The previous process of finding the eigendecomposition of S_2 relative to S_1 can also be given an intuitive explanation. From the fact that the solution can be found from the eigendecomposition of S_1^{-1} S_2, it is evident that the process somehow compares the magnitudes of the two matrices. In fact, the diagonalization extracts the specific directions in which the spreads of the data, as measured by S_1 and S_2, differ the most, and this ability of the diagonalization to compare different kinds of variation can be seen as the cornerstone of its usefulness in dimension reduction.

3 Principal Component Analysis (PCA)

3.1 The basic idea behind PCA

The first, and perhaps the simplest, of the presented dimension reduction methods is principal component analysis. An unsupervised method, PCA is based on the assumption that the observed p original variables in x contain redundant information as measured by correlation. For a simple example consider the extreme case of two variables being perfectly correlated. Knowing the value of the first variable would then allow us to exactly determine the value of the second variable, and we could discard the second variable without losing any information.

Such special cases aside, the objective of PCA is to replace the p original variables with a smaller number k of uncorrelated linear combinations of them, i.e. principal components, that preserve the information content of the original variables but have none of this redundancy. In addition to the uncorrelatedness, the principal components are chosen by requiring that the first principal component captures the maximal amount of the original variables' information content, the second component the maximal amount left after that, and so on until a total of p principal components is constructed.

The measure of the information content of the data, as stated before, varies between methods and is in the case of principal component analysis the variance. That is, data or variables with high variance are interpreted as having high information content, and respectively low variance is interpreted as a sign of low information content. PCA can thus also be seen as a re-distribution of the variance of the original data. Further, it is often the case that the first k principal components have a high variance, i.e. information content, relative to the rest of the components. The following analyses can then be carried out using only these k new variables and the last p - k uninformative components can be dropped, thus completing the dimension reduction. The choice of k, the number of kept components, of course depends on the situation and the data used, and many different criteria have been developed for determining it, some of which are briefly discussed later.

PCA also has a simple geometrical interpretation. It is later seen that the transformation giving the principal components is an orthogonal one, and the procedure can thus be thought of as plotting the observations in R^p and finding, one by one, the orthogonal directions having maximal variability. The principal components are then given as a result of transforming the original data to the new coordinate system given by these directions. Moreover, the orthogonality of the new coordinate axes means that the transformation giving the principal components is represented by a rotation (and possibly a reflection) of the original axes around the origin. This procedure is illustrated in a simple two-dimensional case in Figure 1.

Next, the theory behind extracting the principal components is discussed. The approach is, unless stated otherwise, based on Jolliffe (2002).

3.2 Finding the principal components

Let x = (x_1, x_2, ..., x_p) be a random vector having the covariance matrix Σ_x. Further assume that the principal components satisfying the previous requirements of uncorrelatedness and step-wise maximal variances are given by some linear combinations a_1^T x, a_2^T x, ..., a_p^T x, where a_j ∈ R^p, j = 1, ..., p. The variance of, for example, the first linear combination is then given by a_1^T Σ_x a_1. Notice that in maximizing this with respect to a_1 some constraint on a_1 has to be applied, or otherwise there is no upper bound for the variance. The usual choice for the constraint is to require the coefficient vectors to be unit vectors, a_j^T a_j = 1, j = 1, ..., p.

Figure 1: The graph shows the scatter plot of n = 30 bivariate observations onto which the new coordinate axes giving the principal components are plotted. As can be seen, the first principal component has taken the direction having the most variability in the data. Notice also that the data being two-dimensional, combined with the requirement of orthogonality, means that the direction of the second principal component is determined up to its sign by the direction of the first principal component. The lengths of the arrows in the graph are proportional to the variances of the respective principal components.

Notice also that the covariance between two linear combinations a_i^T x and a_j^T x is given by

Cov(a_i^T x, a_j^T x) = E[(a_i^T x)(x^T a_j)] - E[a_i^T x] E[x^T a_j] = a_i^T Σ_x a_j.

This means that the linear combinations a_i^T x and a_j^T x are uncorrelated if and only if a_i^T Σ_x a_j = 0. With these in mind, the process of finding the principal components can now be formulated as a series of optimization problems with additional constraints.

- Find a_1 ∈ R^p that maximizes the variance given by a_1^T Σ_x a_1 under the constraint of unit length a_1^T a_1 = 1. The first principal component is then given by z_1 = a_1^T x.

- Find a_2 ∈ R^p that maximizes a_2^T Σ_x a_2 under the constraints of unit length a_2^T a_2 = 1 and uncorrelatedness a_1^T Σ_x a_2 = 0. The second principal component is then given by z_2 = a_2^T x.

- Find a_3 ∈ R^p that maximizes a_3^T Σ_x a_3 under the constraints a_3^T a_3 = 1 and a_1^T Σ_x a_3 = a_2^T Σ_x a_3 = 0. The third principal component is then given by z_3 = a_3^T x.

- Continue accordingly until a total of p principal components has been constructed.

While the process may seem complicated, what is convenient is that the solution is found directly from the eigendecomposition of the covariance matrix Σ_x. One way to show this is the method of Lagrange multipliers, a technique used in optimization problems with additional constraints (like the previous a_1^T a_1 = 1). For a comprehensive account of the method of Lagrange multipliers the reader can refer to Loomis and Sternberg (1990). Here it suffices to say that if we want to maximize the variance a_1^T Σ_x a_1 with respect to a_1 under the constraint a_1^T a_1 = 1, then under certain conditions the solution can be found amongst the vectors satisfying the following equations (note that the second equation is of vector form and thus consists of p individual equations):

a_1^T a_1 - 1 = 0
Σ_x a_1 - λ_1 a_1 = 0,    (14)

for some scalar λ_1 and a_1 ∈ R^p. Note that the second equation can also be written as Σ_x a_1 = λ_1 a_1, showing the connection between finding the principal components and the eigendecomposition of Σ_x (see equation (2)). The vector a_1 maximizing the quantity a_1^T Σ_x a_1 is then found amongst the p eigenvectors of Σ_x, and the correct eigenvector can be chosen by observing that the quantity to be maximized, that is, the variance of a_1^T x, is for each eigenvector equal to the corresponding eigenvalue, a_1^T Σ_x a_1 = a_1^T λ_1 a_1 = λ_1. Thus a_1 is chosen to be the eigenvector corresponding to the largest eigenvalue of Σ_x. The additional constraint a_1^T a_1 = 1 in (14) requires the solution eigenvector to be normalized to have unit length.

The rest of the principal components can be found similarly using the method of Lagrange multipliers. Each additional constraint regarding the uncorrelatedness of the principal components adds another scalar λ_j to the equation group, but the process remains relatively unchanged. In all cases the equations can be reduced to the form of (2), showing that the solution is found from the eigenvectors of Σ_x. The additional constraints then imply that the j-th principal component is given by the eigenvector corresponding to the j-th largest eigenvalue of Σ_x.
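
In practice the whole derivation collapses into a single eigendecomposition of the estimated covariance matrix. The following R sketch (simulated data; not the simulation code of Section 6 or Appendix B) computes the principal components this way and checks that their variances equal the eigenvalues and agree with R's built-in prcomp().

    set.seed(7)
    n <- 200; p <- 4
    X <- matrix(rnorm(n * p), ncol = p) %*% matrix(rnorm(p * p), p, p)

    e <- eigen(cov(X), symmetric = TRUE)      # eigenvalues in decreasing order
    A <- e$vectors                            # columns a_1, ..., a_p
    Z <- sweep(X, 2, colMeans(X)) %*% A       # principal components z_j = a_j^T (x - mean)

    all.equal(diag(cov(Z)), e$values)         # component variances equal the eigenvalues
    all.equal(e$values, prcomp(X)$sdev^2)     # and agree with prcomp()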


Linear Dimensionality Reduction Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis

More information

8. Diagonalization.

8. Diagonalization. 8. Diagonalization 8.1. Matrix Representations of Linear Transformations Matrix of A Linear Operator with Respect to A Basis We know that every linear transformation T: R n R m has an associated standard

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

7. Variable extraction and dimensionality reduction

7. Variable extraction and dimensionality reduction 7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

A Introduction to Matrix Algebra and the Multivariate Normal Distribution

A Introduction to Matrix Algebra and the Multivariate Normal Distribution A Introduction to Matrix Algebra and the Multivariate Normal Distribution PRE 905: Multivariate Analysis Spring 2014 Lecture 6 PRE 905: Lecture 7 Matrix Algebra and the MVN Distribution Today s Class An

More information

Basic Concepts in Matrix Algebra

Basic Concepts in Matrix Algebra Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1

More information

Principal Components Theory Notes

Principal Components Theory Notes Principal Components Theory Notes Charles J. Geyer August 29, 2007 1 Introduction These are class notes for Stat 5601 (nonparametrics) taught at the University of Minnesota, Spring 2006. This not a theory

More information

Appendix A: Matrices

Appendix A: Matrices Appendix A: Matrices A matrix is a rectangular array of numbers Such arrays have rows and columns The numbers of rows and columns are referred to as the dimensions of a matrix A matrix with, say, 5 rows

More information

Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Combinations of features Given a data matrix X n p with p fairly large, it can

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

Preface. Figures Figures appearing in the text were prepared using MATLAB R. For product information, please contact:

Preface. Figures Figures appearing in the text were prepared using MATLAB R. For product information, please contact: Linear algebra forms the basis for much of modern mathematics theoretical, applied, and computational. The purpose of this book is to provide a broad and solid foundation for the study of advanced mathematics.

More information

Principal Component Analysis. Applied Multivariate Statistics Spring 2012

Principal Component Analysis. Applied Multivariate Statistics Spring 2012 Principal Component Analysis Applied Multivariate Statistics Spring 2012 Overview Intuition Four definitions Practical examples Mathematical example Case study 2 PCA: Goals Goal 1: Dimension reduction

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Stat 206: Linear algebra

Stat 206: Linear algebra Stat 206: Linear algebra James Johndrow (adapted from Iain Johnstone s notes) 2016-11-02 Vectors We have already been working with vectors, but let s review a few more concepts. The inner product of two

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Linear Algebra and Robot Modeling

Linear Algebra and Robot Modeling Linear Algebra and Robot Modeling Nathan Ratliff Abstract Linear algebra is fundamental to robot modeling, control, and optimization. This document reviews some of the basic kinematic equations and uses

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. I assume the reader is familiar with basic linear algebra, including the

More information

Principal component analysis

Principal component analysis Principal component analysis Angela Montanari 1 Introduction Principal component analysis (PCA) is one of the most popular multivariate statistical methods. It was first introduced by Pearson (1901) and

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

arxiv: v1 [math.na] 5 May 2011

arxiv: v1 [math.na] 5 May 2011 ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and

More information

Foundations of Computer Vision

Foundations of Computer Vision Foundations of Computer Vision Wesley. E. Snyder North Carolina State University Hairong Qi University of Tennessee, Knoxville Last Edited February 8, 2017 1 3.2. A BRIEF REVIEW OF LINEAR ALGEBRA Apply

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Chapter Two Elements of Linear Algebra

Chapter Two Elements of Linear Algebra Chapter Two Elements of Linear Algebra Previously, in chapter one, we have considered single first order differential equations involving a single unknown function. In the next chapter we will begin to

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

Data Preprocessing Tasks

Data Preprocessing Tasks Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can

More information

An Introduction to Matrix Algebra

An Introduction to Matrix Algebra An Introduction to Matrix Algebra EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #8 EPSY 905: Matrix Algebra In This Lecture An introduction to matrix algebra Ø Scalars, vectors, and matrices

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Example Linear Algebra Competency Test

Example Linear Algebra Competency Test Example Linear Algebra Competency Test The 4 questions below are a combination of True or False, multiple choice, fill in the blank, and computations involving matrices and vectors. In the latter case,

More information

On Independent Component Analysis

On Independent Component Analysis On Independent Component Analysis Université libre de Bruxelles European Centre for Advanced Research in Economics and Statistics (ECARES) Solvay Brussels School of Economics and Management Symmetric Outline

More information

Vectors and Matrices Statistics with Vectors and Matrices

Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1: Introduction, Multivariate Location and Scatter

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1: Introduction, Multivariate Location and Scatter MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1:, Multivariate Location Contents , pauliina.ilmonen(a)aalto.fi Lectures on Mondays 12.15-14.00 (2.1. - 6.2., 20.2. - 27.3.), U147 (U5) Exercises

More information

Notes on Linear Algebra and Matrix Theory

Notes on Linear Algebra and Matrix Theory Massimo Franceschet featuring Enrico Bozzo Scalar product The scalar product (a.k.a. dot product or inner product) of two real vectors x = (x 1,..., x n ) and y = (y 1,..., y n ) is not a vector but a

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1 2 3 -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1950's. -PCA is based on covariance or correlation

More information

Scatter Matrices and Independent Component Analysis

Scatter Matrices and Independent Component Analysis AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 175 189 Scatter Matrices and Independent Component Analysis Hannu Oja 1, Seija Sirkiä 2, and Jan Eriksson 3 1 University of Tampere, Finland

More information

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where:

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where: VAR Model (k-variate VAR(p model (in the Reduced Form: where: Y t = A + B 1 Y t-1 + B 2 Y t-2 + + B p Y t-p + ε t Y t = (y 1t, y 2t,, y kt : a (k x 1 vector of time series variables A: a (k x 1 vector

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

Multivariate Distributions

Multivariate Distributions Copyright Cosma Rohilla Shalizi; do not distribute without permission updates at http://www.stat.cmu.edu/~cshalizi/adafaepov/ Appendix E Multivariate Distributions E.1 Review of Definitions Let s review

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

A Selective Review of Sufficient Dimension Reduction

A Selective Review of Sufficient Dimension Reduction A Selective Review of Sufficient Dimension Reduction Lexin Li Department of Statistics North Carolina State University Lexin Li (NCSU) Sufficient Dimension Reduction 1 / 19 Outline 1 General Framework

More information

Introduction to Matrices

Introduction to Matrices POLS 704 Introduction to Matrices Introduction to Matrices. The Cast of Characters A matrix is a rectangular array (i.e., a table) of numbers. For example, 2 3 X 4 5 6 (4 3) 7 8 9 0 0 0 Thismatrix,with4rowsand3columns,isoforder

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

22.3. Repeated Eigenvalues and Symmetric Matrices. Introduction. Prerequisites. Learning Outcomes

22.3. Repeated Eigenvalues and Symmetric Matrices. Introduction. Prerequisites. Learning Outcomes Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one

More information

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Neuroscience Introduction

Neuroscience Introduction Neuroscience Introduction The brain As humans, we can identify galaxies light years away, we can study particles smaller than an atom. But we still haven t unlocked the mystery of the three pounds of matter

More information

APPENDIX 1 BASIC STATISTICS. Summarizing Data

APPENDIX 1 BASIC STATISTICS. Summarizing Data 1 APPENDIX 1 Figure A1.1: Normal Distribution BASIC STATISTICS The problem that we face in financial analysis today is not having too little information but too much. Making sense of large and often contradictory

More information

Repeated Eigenvalues and Symmetric Matrices

Repeated Eigenvalues and Symmetric Matrices Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

On Invariant Within Equivalence Coordinate System (IWECS) Transformations

On Invariant Within Equivalence Coordinate System (IWECS) Transformations On Invariant Within Equivalence Coordinate System (IWECS) Transformations Robert Serfling Abstract In exploratory data analysis and data mining in the very common setting of a data set X of vectors from

More information

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2 Final Review Sheet The final will cover Sections Chapters 1,2,3 and 4, as well as sections 5.1-5.4, 6.1-6.2 and 7.1-7.3 from chapters 5,6 and 7. This is essentially all material covered this term. Watch

More information

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education MTH 3 Linear Algebra Study Guide Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education June 3, ii Contents Table of Contents iii Matrix Algebra. Real Life

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis .. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information