SOME TOOLS FOR LINEAR DIMENSION REDUCTION


Joni Virta
Master's thesis
May 2014
DEPARTMENT OF MATHEMATICS AND STATISTICS
UNIVERSITY OF TURKU


UNIVERSITY OF TURKU
Department of Mathematics and Statistics
VIRTA, JONI: SOME TOOLS FOR LINEAR DIMENSION REDUCTION
Master's thesis, 52 p., 2 appendices 7 p.
Statistics
May 2014

Dimension reduction refers to a family of methods commonly used in multivariate statistical analysis. The common objective of all dimension reduction methods is essentially the same: to reduce the number of variables in the data while still preserving their information content, however it is measured. In linear dimension reduction this is done by replacing the original variables with a smaller number of linear combinations of them. Dimension reduction can further be divided into two subcategories: supervised dimension reduction, where some (usually one) of the variables clearly have the role of a response and we are interested in explaining their behavior using the other variables, and unsupervised dimension reduction, where no such distinction is made and all variables are considered equal.

In this thesis the theory behind three linear dimension reduction methods, principal component analysis (PCA), independent component analysis (ICA) and sliced inverse regression (SIR), is reviewed. Although both PCA and ICA are unsupervised methods, their criteria for creating the new variables differ to an extent that makes them suitable in different situations. The third method, SIR, is supervised and thus aims to replace the original explanatory variables with a smaller number of variables that best explain the variability of some response. Additionally, a variant of SIR that uses splines in finding the new variables is briefly discussed.

The concept of simultaneous diagonalization of two scatter matrices is also given some thought. A relatively new approach, its importance for dimension reduction stems from the fact that most dimension reduction methods, while seemingly disconnected, can be formulated in a unified way using it.

In the final section two simulation studies are conducted. In the first, the performances of the two unsupervised methods (PCA and ICA) in finding directions relevant to explaining a response variable in a regression situation are compared to that of the supervised SIR. In the second simulation the superiority of the spline-based SIR to the regular SIR under some specific models is demonstrated.

Keywords: multivariate statistics, dimension reduction, principal component analysis, independent component analysis, sliced inverse regression.
Asiasanat: monimuuttujamenetelmät, dimension supistus, pääkomponenttianalyysi, riippumattomien komponenttien analyysi, sliced inverse regression.


Contents

1 Introduction
  1.1 An overview of the subject
  1.2 Recent literature on the subject
  1.3 About referencing
2 Necessary theoretical framework
  2.1 About notation
  2.2 The basics of dimension reduction
  2.3 Eigendecomposition
  2.4 Singular value decomposition
  2.5 Location and scatter functionals
  2.6 The diagonalization of two scatter matrices
3 Principal Component Analysis (PCA)
  3.1 The basic idea behind PCA
  3.2 Finding the principal components
  3.3 On the use of the principal components
  3.4 Determining the correct dimension k
  3.5 Some specific applications of PCA
  3.6 PCA as a diagonalization of two scatter matrices
4 Independent Component Analysis (ICA)
  4.1 Background on ICA
  4.2 The model for ICA
  4.3 The FOBI solution
  4.4 Some applications of ICA
  4.5 ICA as a diagonalization of two scatter matrices
5 Sliced Inverse Regression (SIR)
  5.1 Introducing SIR
  5.2 The model for SIR
  5.3 Finding the e.d.r. space
  5.4 Some remarks on using SIR
  5.5 Spline-based approach to SIR
  5.6 SIR as a diagonalization of two scatter matrices
6 Simulation studies
  6.1 Using the dimension reduction methods with regression
  6.2 Comparing the regular SIR with the spline-based version
7 Concluding remarks and future research

Appendices
A Proofs
B Some important R code

List of Figures

1 Geometrical interpretation of PCA
2 A scree plot and a scatter plot of the first two principal components
3 Illustration of the use of ICA in signal processing
4 Illustration of the use of ICA in group separation
5 Graph of the estimated inverse regression curve
6 Illustration of the limitations of SIR
7 Paired scatter plots of the response and the components found by different methods from the simulated data
8 Paired scatter plots of the response and the components found by different methods from the Boston data

List of Tables

1 The results of comparing spline-based SIR with the regular SIR using model ...
2 The results of comparing spline-based SIR with the regular SIR using model ...
3 Testing spline-based SIR with smaller numbers of slices

1 Introduction

1.1 An overview of the subject

When dealing with multivariate data it is not unusual that the number of variables involved is measured in thousands. In addition to the apparent computational issues, an extensive number of variables can also limit the pool of techniques and methods applicable. For example, in the area of nonparametric statistics the number of observations required for the successful use of certain methods quickly increases with the dimension of the data, as stated in Li (1991).

Traditionally two different approaches have been utilized in tackling the problem. With sufficient knowledge, one could choose to retain only the variables relevant to the phenomenon at hand. For example, in regression analysis it could be possible to leave out variables having no effect on the response variable, thus reducing the total number of variables. This is of course not feasible when the meanings or relationships of the variables are not known. The second approach, dimension reduction, does not outright omit any variables but instead tries to retain as much information as possible while still decreasing the number of variables involved. In linear dimension reduction this is accomplished by replacing the original data with linear combinations of the original variables. The way these linear combinations are chosen depends on the dimension reduction method used and, as is often the case, only a few combinations are needed to retain most of the information carried by the original variables. Note that the way of measuring the information is not constant across the methods and depends on the specific dimension reduction technique used.

The purpose of this thesis is first to consider the subject of linear dimension reduction in general and to review some of the methods commonly used for linear dimension reduction (henceforth referred to simply as dimension reduction). Three dimension reduction methods, all having fundamentally different premises, are reviewed and discussed in detail. The theory behind the methods is also considered in the light of modern approaches and findings in the field. Second, some simulation studies are conducted to assess the performances of the three methods in some specific situations.

Before the specific dimension reduction methods are introduced and examined in detail, some required basic theory is first revised. Section 2 begins with a short introduction to the notational conventions used throughout the thesis. After that a distinction between two types of dimension reduction is made.

Supervised dimension reduction refers to a situation where some (usually one) of the variables can clearly be labelled as responses, while the rest of the variables are thought to be explanatory; in unsupervised dimension reduction all the variables involved are equal in the response/explanatory variable sense. Two very useful matrix decompositions, the eigendecomposition and the singular value decomposition, are also given a short revision. Both of these play a central role in dimension reduction, and the eigendecomposition in particular is utilized in one way or another in all the dimension reduction methods covered. Closing Section 2 is an introduction to the concept of location and scatter functionals and the use of the latter in dimension reduction. The method used there, which revolves around the simultaneous diagonalization of a pair of so-called scatter matrices, was already examined in Caussinus and Ruiz-Gazen (1993), but a comprehensive treatise was given in Tyler et al. (2009) and the subject was later explored in more detail in the setting of dimension reduction in Ilmonen et al. (2012) and Liski et al. (2014a). The remarkable fact that every dimension reduction method covered in this thesis (and plenty of others) can be seen as a special case of the approach is also addressed.

The next three sections are then devoted to the individual dimension reduction methods, starting from Section 3 and principal component analysis (PCA), which is possibly the best known of all unsupervised dimension reduction methods. PCA revolves around finding a linear transformation that renders the data uncorrelated and in addition maximizes the variances of the individual components (under suitable restrictions). While PCA is included in most statistical software packages and using it is straightforward enough, care must be taken since PCA does not utilize third or higher moments, and some patterns in the data can thus go undetected by it. The scale of the original variables also affects the resulting components and PCA is thus not affine invariant.

Section 4 introduces the second unsupervised dimension reduction method, independent component analysis (ICA). While both ICA and PCA are unsupervised in nature, the former is a model-based approach. Namely, in ICA the observed variables are thought to be linear combinations of some original independent variables, and the purpose of ICA is to retrieve these original independent components. This makes ICA suitable for the field of signal processing, where the observations are indeed thought to be linear combinations of some latent signals (however, ICA treats the observations as independent, which they usually are not). The solution presented to the ICA problem, the fourth order blind identification (FOBI), is based on the use of fourth moments, and the linear combinations are chosen as directions having maximal kurtosis.

The last and the only supervised dimension reduction method considered, sliced inverse regression (SIR), is covered in Section 5. The name of the method refers to the inverse view it takes on regression; in order to find the linear combinations best explaining the univariate response, SIR regresses each explanatory variable against the response. This leads to the so-called inverse regression curve, which has an essential role in finding the effective dimension reduction space (e.d.r. space) giving the relevant linear combinations. A method originally explored in Zhu and Yu (2007) that uses splines to interpolate the inverse regression curve is also reviewed. That is, the discrete conditional means of the explanatory variables used in regular SIR to estimate the inverse regression curve are replaced with piece-wise low-degree polynomials.

In Section 6 the three methods are then used in simulation tests. In the first simulation the efficiencies of the three methods are compared in a regression situation. Although PCA and ICA are unsupervised and do not utilize the joint distribution of the response and the explanatory variables, their performance in finding the relevant linear combinations is nevertheless evaluated and compared to the supervised SIR. The second simulation compares the efficiency of the regular SIR to that of the spline-based version of SIR. Tests are conducted using different forms of relationship between the response and the explanatory variables, and both methods' abilities to extract the relevant components are evaluated. The simulations, as well as all the figures and plots in the thesis, were produced using the R software package (R Development Core Team, 2011).

1.2 Recent literature on the subject

While the core methods presented have been around for a while, principal component analysis even more so, the field of dimension reduction has constantly continued to evolve and grow. The following is a short review of some selected recent publications related to dimension reduction.

For a comprehensive account of the current state of supervised dimension reduction the reader may refer to Ma and Zhu (2013). In their article the authors consider the estimation of the e.d.r. space under the three most common approaches to supervised dimension reduction: inverse regression based methods, non-parametric methods and semiparametric methods. The current literature pertaining to each of the three approaches is reviewed. The authors also discuss the problem of making inference in supervised dimension reduction; since the inference now consists of the estimation of a subspace instead of the usual estimation of parameters, the situation has to be approached from a different angle. The problem of determining the correct dimension k is also discussed and some specific methods for estimating it are reviewed.

The estimation of the dimension k of the dimension reduction space is of central importance also in Liski et al. (2014b), where the authors discuss it in the case of principal component analysis. Their approach is based on the use of different M scatter estimates, that is, matrices that characterize some aspect of the spread of the data. The eigenvalues of the M scatter estimate of choice are used to compute the value of a statistic testing the null hypothesis that the last p - k principal components are uninformative. Three different step-wise procedures for performing the tests are given, namely the usual forward and backward procedures and a divide-and-conquer method resembling a binary search algorithm. Asymptotic results regarding the estimation of k and the subspace itself are also given, and concluding the paper is an array of simulation studies comparing the performances of different combinations of M scatter estimates and test procedures in estimating k.

While dimension reduction is most commonly used with standard independent and identically distributed (i.i.d.) data, it has also been successfully utilized in more complicated situations, e.g. with time series and spatially correlated data. In both cases there are some additional structures in the data that have to be accounted for. For an example of using dimension reduction with time series see Miettinen et al. (2014), where the authors assume that the observed time series are linear combinations of some latent uncorrelated series. So-called second order blind identification (SOBI) methods can then be used to try to recover these original time series. For an example of spatial correlation and dimension reduction, refer to Nordhausen et al. (2014), where the authors present a dimension reduction method applicable to spatially correlated observations. The data used by them consist of high-dimensional chemical element measurements analysed from samples of terrestrial moss collected at various sites in the Kola peninsula. The observations are clearly not independent, the geographical proximity causing dependence between the measurements. To overcome this obstacle the authors present spatial blind source separation to reduce the dimension and to find different features and geographical patterns in the data.

The diagonalization of two scatter matrices discussed later in this thesis is in itself a relatively new approach. Although some basics of the method were considered already in Caussinus and Ruiz-Gazen (1993), the majority of publications on the subject are fairly recent. An extensive discussion of the method was given in Tyler et al. (2009), in which the authors give the fundamental properties of the diagonalization and its use in projecting the data into an invariant coordinate system (ICS).

As suggested by its name, this coordinate system has the property of being in some sense invariant, or unchanged, under affine transformations of the original data, and it can be used to reveal structures in the data that are undetectable in the original coordinate system. The theory is accompanied by examples of extracting hidden features from data. The diagonalization's connection to the basic dimension reduction methods is also briefly touched on by showing how solutions to the independent component analysis can be formulated using it.

Applying the diagonalization in finding invariant coordinate systems was further considered in Ilmonen et al. (2012) in the context of developing affine equivariant and affine invariant statistical procedures. A method is called affine equivariant if it reflects an affine transformation of the data in some meaningful way, while being affine invariant means that the output of the method is unchanged by any affine transformation of the original data. As affine transformations are used in everything from the scaling of variables to rotations, both affine equivariance and affine invariance are highly desired properties for statistical procedures. The authors of Ilmonen et al. (2012) give ways of creating invariant coordinate system functionals that can be used to transform the data into a coordinate system that is affine invariant (possibly up to some smaller group of transformations). A suitable invariant coordinate system functional and a fitting statistical procedure can then be combined to yield an invariant statistical procedure. Examples of both univariate and multivariate invariant statistics and statistical procedures are given. Lastly, the relationship between the diagonalization of two scatter matrices and the independent component analysis is again given some thought, and the authors also show how sliced inverse regression can be seen as a special case of the diagonalization.

While the invariant coordinate selection was applied only in unsupervised situations in Tyler et al. (2009) and Ilmonen et al. (2012), it is in Liski et al. (2014a) extended to the supervised case, that is, to data consisting of both explanatory variables and a univariate response. This is accomplished by letting the second scatter matrix in the diagonalization be supervised (that is, dependent on the response), the first scatter matrix usually being the regular covariance matrix. The authors give several examples of supervised scatter and location functionals and further give asymptotic properties for the scatter functionals and for the matrix giving the transformation into the invariant coordinate system. The authors also show how several different supervised dimension reduction methods, such as principal Hessian directions (PHD) and sliced average variance estimation (SAVE), can be formulated using the diagonalization of two scatter matrices. Concluding the paper is a simulation study where the authors compare the performances of several different supervised scatter matrices in dimension reduction via the diagonalization. The previously mentioned supervised dimension reduction methods are also included in the comparison.

1.3 About referencing

As some theoretical parts of the following sections are largely based on single sources, the following custom regarding references is practised throughout the thesis: if the theory parts of a section or subsection are based on a single source or reference, this is stated at the beginning of the section or subsection. Otherwise standard referencing practices are followed.

2 Necessary theoretical framework

2.1 About notation

The exposition of the subject is somewhat theoretical in nature and involves certain recurring mathematical objects. Therefore it makes sense to set some notational conventions. Throughout the following, all vectors are denoted by bold lower case letters (e.g. x, µ) and all matrices by bold upper case letters (e.g. S, H). In addition, all vectors are taken to be column vectors.

The p × n matrix X = (x_1, ..., x_n) is assumed to be a random sample of size n from a p-variate distribution and is the data we are working with. The random vectors x_1, ..., x_n are assumed to be independent and identically distributed (i.i.d.). Sometimes the vector x is also considered without a subscript to denote an arbitrary p-variate random vector. Standardized random vectors are denoted with the subscript st, such as x_st. When a distinction between explanatory variables and a response variable has to be made, the vector x is taken to contain the former while the univariate response is denoted by y.

Linear combinations of the original variables play a central role in all dimension reduction methods, and the p-dimensional vector of linear combinations is denoted by z = (z_1, z_2, ..., z_p)^T. The dimension reduction itself is accomplished by retaining only some k-dimensional subvector of z that ideally contains the same information as the original p-dimensional vector x. The new dimension k of course has to satisfy k < p.

When dealing with location and scatter functionals, arbitrary probability distributions of some random vector x are considered and denoted by F_x. Similarly, F_{Ax+b} is used to represent the distribution of the transformed random variable Ax + b.

Any assumptions regarding A and b are, when necessary, stated in the text. Note also that transformations of the form Ax + b are called affine transformations.

Certain matrix types are also given symbols of their own. U and V are used to represent orthogonal matrices, that is, matrices satisfying U^T U = U U^T = I, and D is used for diagonal matrices. If the diagonal elements of a diagonal matrix need to be explicitly stated, the notation diag(d_1, ..., d_p) is used to represent a diagonal matrix with d_1, ..., d_p as its diagonal elements. The notation diag(A) is also used for a diagonal matrix having the same diagonal elements as the square matrix A. In addition to the usual Cov(x), the upper case Σ is in some cases used for covariance matrices of random vectors, with a subscript denoting the random vector in question (e.g. Σ_x), while S is used similarly for the sample covariance matrix. Additionally, S, S_1, S_2, ... are in the context of the simultaneous diagonalization of two scatter matrices also used to denote arbitrary scatter matrices. If needed, the meaning of S in question is explicitly stated in the text.

Finally, some specific sets of square matrices have to be defined. The set S refers to the set of all symmetric, positive definite p × p matrices and the set A is the set of all non-singular p × p matrices. In some contexts it is also necessary to consider some particular p × p matrix transformation groups. The following list includes five commonly used transformation groups.

- C_D0 = {cI : c ∈ R_+}, homogeneous rescaling; scales all elements of a matrix by the same factor.
- C_D = {diag(c_1, c_2, ..., c_p) : c_i ∈ R_+, i = 1, ..., p}, heterogeneous rescaling; scales the elements by the same factor row-wise or column-wise.
- C_J = {diag(c_1, c_2, ..., c_p) : c_i = ±1, i = 1, ..., p}, heterogeneous sign-changing; changes the signs of the elements row-wise or column-wise.
- C_P = {P : P is a permutation matrix}, permutation; changes the order of rows or columns.
- C_U = {U : U is an orthogonal matrix}, orthogonal transformation; rotates (and possibly reflects) the column or row vectors around the origin without affecting their magnitudes.

Notice that whether the operation in the last four transformations affects the rows or the columns is determined by whether the matrix multiplication is done from the left or from the right, respectively. Notice also that all the sets satisfy the definition of a mathematical group; they are all associative and closed under matrix multiplication, as well as contain the identity matrix I and the inverse of each element. A small illustration of these transformations is given in the sketch below.
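
To make the effect of the groups concrete, the following is a minimal R sketch (not part of the thesis; the matrices and the vector are arbitrary, hypothetical examples) applying elements of C_D, C_J and C_P to a random vector.

    set.seed(1)
    x <- rnorm(3)                       # an arbitrary 3-variate "observation"

    C_D <- diag(c(2, 0.5, 10))          # heterogeneous rescaling (an element of C_D)
    C_J <- diag(c(1, -1, 1))            # heterogeneous sign-changing (an element of C_J)
    C_P <- diag(3)[c(2, 3, 1), ]        # a permutation matrix (an element of C_P)

    C_D %*% x                           # components rescaled
    C_J %*% x                           # sign of the second component flipped
    C_P %*% x                           # components reordered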

15 and closed under matrix multiplication as well as contain the identity matrix I and the inverse of each element. The groups C P, C D and C J (or a subset of them) can further be combined into a single transformation group C PDJ = {C : each row and column of C has exactly one non-zero element}. Transforming x Cx, C C PDJ then results in simultaneous reordering, rescaling and sign-changing of the components of x. This group is encountered when searching for the eigendecomposition of a matrix later on. 2.2 The basics of dimension reduction The goal in dimension reduction is the reducing of the number of variables without giving up the information content of the data. This is accomplished by finding a suitable p k transformation matrix B so that after the transformation x B T x R k the resulting vectors carry as much information as the original vectors x R p. The definition of information is characteristic to each dimension reduction method and some methods use, for example, variance or non-gaussianity as the measure of information. Furthermore, we naturally want k < p which means the dimension of the data has been reduced from p to k. Note that the transformation matrix B is not unique as any other matrix with the same column space provides the same dimension reduction as B. This can be countered by instead working with the corresponding unique projection matrixp B = B(B T B) 1 B T giving theprojection onto thecolumn space of B (see Liski et al. (2014a)). The dimension reduction is then carried out by transforming the p-dimensional observations into the k-dimensional subspace as z = P B x. However, due to the way the presented dimension reduction methods work, the projection matrix approach is not used in this thesis. The different dimension reduction methods presented later can further be divided into unsupervised and supervised dimension reduction, depending on the roles of the variables involved. For more information on the two classes see, for example, Liski et al. (2014a), on which the following short introduction to them is based. If the dimension reduction is based only on the random vector x and is not in any way concerned with building a model between x and some response variable y, the method is said to be unsupervised. Note that the term dimension reduction is sometimes used when actually referring to the unsupervised dimension reduction. Unsupervised dimension reduction is simply based on the above problem of finding a p k matrix B so that the 8

Basic examples of unsupervised dimension reduction include principal component analysis, independent component analysis and factor analysis, the first two of which are covered in later sections.

Supervised dimension reduction methods, on the other hand, are relevant when the data consist of some response variable y in addition to the x's. The aim is again to reduce the dimension of the vector x ∈ R^p, but we are also interested in predicting the values of the response y using the x's. The dimension reduction method should then somehow take into account the relationship between y and x. The problem can be formulated as finding a p × k matrix B = (b_1, ..., b_k) such that

y ⊥ x | B^T x.    (1)

That is, all the information that the original p-dimensional x has on y is contained in the k linear combinations b_j^T x, j = 1, ..., k. If such vectors b_j are found, the dimension of the data (the explanatory variables) has been successfully reduced from p to k. For examples of supervised dimension reduction methods consider the sliced inverse regression covered later, but also sliced average variance estimation (SAVE), principal Hessian directions (PHD), canonical correlation analysis (CCA) and directional regression (DR). Note that although in this thesis only univariate response variables are considered, there is no reason why multivariate responses could not be used in supervised dimension reduction. Indeed, in the aforementioned canonical correlation analysis the observations consist of vectors x ∈ R^p and y ∈ R^q.

2.3 Eigendecomposition

A central element in all dimension reduction methods is the decomposition of a matrix into a product of other, in some sense simpler, matrices. While numerous different decompositions exist, the two most useful to us are the eigendecomposition and the singular value decomposition. Though the reader is assumed to be familiar with these concepts, the key points of both decompositions are briefly discussed in the following subsections. For the proofs of the following results the reader may refer to Harville (1997), on which the discussion of the two decompositions is based.

The eigendecomposition (or spectral decomposition) of a p × p matrix A arises from the problem of finding solutions to the equation

Aq = dq,    (2)

where each solution pair (q_j, d_j), j = 1, ..., p, consists of an eigenvector q_j and the corresponding eigenvalue d_j.

Assume now that A has p linearly independent eigenvectors. Then A can be expressed as

A = Q D Q^{-1},    (3)

where the columns of the p × p matrix Q are the p eigenvectors q_j of A and the diagonal matrix D has on its diagonal the p eigenvalues d_j of A in an order corresponding to the ordering of the eigenvectors in Q. This is called the eigendecomposition of the matrix A.

For symmetric matrices, e.g. covariance matrices, additional results can be proved. If S is a symmetric p × p matrix, it can be expressed as

S = U D U^T,    (4)

where the columns u_j of the orthogonal matrix U are the p eigenvectors of S and the diagonal matrix D again has on its diagonal the corresponding eigenvalues d_j of S, j = 1, ..., p. The converse result also holds and thus the eigenvectors of a matrix are orthogonal if and only if the matrix is symmetric. Furthermore, it can be proven that the eigenvalues of a positive definite matrix are positive and thus d_j > 0, j = 1, ..., p, for any non-singular covariance matrix. The decomposition in (4) can also be stated as

S = d_1 u_1 u_1^T + ... + d_p u_p u_p^T.

The eigendecomposition also helps us in standardizing the p-variate random vector x. The standardization, which is achieved by transforming

x_st = Σ_x^{-1/2} (x - E(x)),

requires the inverse square root of the covariance matrix Σ_x, which is not uniquely defined, and indeed Σ_x^{-1/2} denotes any matrix G satisfying

G Σ_x G^T = I.    (5)

The construction of matrices G satisfying (5) is discussed in Ilmonen et al. (2012), where the authors propose several ways for uniquely choosing Σ_x^{-1/2}. One unique and symmetric choice is

Σ_x^{-1/2} = U D^{-1/2} U^T,

where the matrices U and D refer to the eigendecomposition of Σ_x and the diagonal matrix D^{-1/2} = diag(d_1^{-1/2}, ..., d_p^{-1/2}) is well-defined due to the positive definiteness of the non-singular covariance matrix Σ_x. Note that when the inverse square root Σ_x^{-1/2} is later used, it is, unless otherwise stated, always assumed to be the symmetric version U D^{-1/2} U^T. Consequently, also its inverse, Σ_x^{1/2}, is then symmetric.
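
The symmetric inverse square root and the resulting standardization are straightforward to compute numerically. The following R sketch (a toy example on simulated data, not the code of Appendix B) constructs the sample version of Σ_x^{-1/2} = U D^{-1/2} U^T and verifies that the standardized data have sample covariance equal to the identity matrix.

    set.seed(3)
    n <- 500; p <- 3
    X <- matrix(rnorm(n * p), ncol = p) %*% matrix(c(2, 1, 0, 0, 1, 1, 0, 0, 3), 3, 3)
    # here the observations are the rows of X (the thesis uses the p x n convention)

    Sigma <- cov(X)                            # sample covariance matrix
    e <- eigen(Sigma, symmetric = TRUE)        # U = e$vectors, d_j = e$values
    Sigma_inv_sqrt <- e$vectors %*% diag(1 / sqrt(e$values)) %*% t(e$vectors)

    X_st <- sweep(X, 2, colMeans(X)) %*% Sigma_inv_sqrt   # x_st = Sigma^{-1/2}(x - mean)
    round(cov(X_st), 10)                       # approximately the identity matrix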

Finally, notice that in constructing the matrices Q and U in (3) and (4), respectively, the order of the eigenvectors as the columns of Q or U is not fixed. Any ordering can be reflected in the ordering of the corresponding eigenvalues on the diagonal of D. Furthermore, for the not-generally-orthogonal matrix Q neither the signs nor the magnitudes of the eigenvectors are fixed, as is evident from (2); if a vector q_j satisfies the equation (2) for some d_j, so does any vector λq_j, for any λ ∈ R. The matrix Q in (3) is then unique only up to the group of transformations C_PDJ (applied column-wise) presented earlier. For an orthogonal matrix U the scales of the eigenvectors are fixed to preserve the vectors' orthonormality, but their signs can still be chosen at will, and thus the matrix U in (4) is unique up to the group of transformations C_PJ. In case some of the eigenvalues of a matrix are equal, the uniqueness of the matrix Q or U is a slightly more complex matter. For some discussion on this see Chapter 21.4 of Harville (1997). For the rest of this thesis, unless stated otherwise, it is assumed that the eigenvalues of the matrices used are distinct. The eigendecomposition has its uses in all of the following dimension reduction methods and some additional properties of it are explained in later sections.

2.4 Singular value decomposition

While the eigendecomposition is applicable only to square matrices, the singular value decomposition is a more general method and works for all m × n matrices. It is not as heavily featured in the following sections as the eigendecomposition and thus only some basic properties of it are given in the following discussion, which is also based on Harville (1997).

Let A be an m × n matrix of rank r. Then it can be expressed as

A = U D V^T,    (6)

where U and V are respectively m × m and n × n orthogonal matrices and the m × n block matrix D = (D_1 0; 0 0) has an r × r diagonal matrix D_1 as its upper left block and is otherwise filled with zeroes. The diagonal elements d_i, i = 1, ..., r, of D_1 are called the singular values of A. The sparsity of D further gives a more compact way of writing the singular value decomposition. If we partition the two orthogonal matrices as U = (U_1, U_2) and V = (V_1, V_2), where U_1 and V_1 are m × r and n × r matrices respectively, we get

A = U D V^T = U_1 D_1 V_1^T.
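
In R the decomposition is available through svd(), which by default returns the thin form; for a matrix of full column rank this coincides with the compact form U_1 D_1 V_1^T above. The following short sketch (an assumed generic example, not thesis code) also previews the connection to the eigendecomposition of A^T A discussed next.

    set.seed(4)
    A <- matrix(rnorm(6 * 4), nrow = 6)        # a 6 x 4 matrix, full column rank a.s.

    s <- svd(A)                                # thin SVD: s$u is 6 x 4, s$v is 4 x 4
    all.equal(A, s$u %*% diag(s$d) %*% t(s$v)) # A = U_1 D_1 V_1^T, up to numerical error

    # the squared singular values equal the eigenvalues of A^T A
    all.equal(s$d^2, eigen(crossprod(A), symmetric = TRUE)$values)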

The proof of the existence of the decomposition is here omitted, but the decomposition itself can be found by observing the matrix products A A^T and A^T A, which are both symmetric matrices. Letting A have the singular value decomposition given in (6), we have for the two products

A A^T = U D D^T U^T  and  A^T A = V D^T D V^T.    (7)

Both D D^T and D^T D are now diagonal matrices and the equations in (7) thus describe the eigendecompositions of the symmetric matrices A A^T and A^T A. The columns of the matrices U and V in (6) are then the eigenvectors of A A^T and A^T A, respectively. The equations in (7) also show that the matrices A A^T and A^T A have the same non-zero eigenvalues, that is, the first r diagonal elements of the matrices D D^T and D^T D, respectively. Further, the r non-zero eigenvalues are equal to the squared singular values d_i^2, i = 1, ..., r, of the matrix A.

2.5 Location and scatter functionals

In the univariate case the sample mean and sample variance are the most common measures of location and spread, respectively. But as they characterize only certain aspects of the underlying distribution, the use of additional statistics is advocated. If, for example, more robust measures of location and dispersion are desired, the median and the median absolute deviation could be considered. Similarly, in multivariate statistics there is no need to limit our attention to the ordinary sample mean vector and sample covariance matrix only. In fact, several other statistics and families of statistics describing location and spread exist, the usefulness of which depends on the situation.

Next, the definitions of the so-called location and scatter functionals are given. They are defined as functionals to emphasize the fact that they are functions of probability distributions and therefore describe some properties of the distributions in question. The concept of affine equivariance also needs to be addressed before defining the functionals. A loose definition would be that a statistic or functional is said to be affine equivariant if an affine transformation of the input random vector is reflected in the statistic or the functional in some meaningful way.

A location functional T(F_x) ∈ R^p is a vector-valued functional which is affine equivariant in the sense that

T(F_{Ax+b}) = A T(F_x) + b,    (8)

for all matrices A ∈ A and vectors b ∈ R^p. In essence, location functionals can be seen as describing some aspect of the location or center of a multivariate distribution.

For example, the mean vector E(x) is a location functional. Note that the natural assumption that a change in the scale in which the observations are measured should be reflected in the location estimates can be seen as a special case of the equivariance in (8).

A scatter functional S(F_x) ∈ S is a matrix-valued functional which is affine equivariant in the sense that

S(F_{Ax+b}) = A S(F_x) A^T,    (9)

for all matrices A ∈ A and vectors b ∈ R^p. Scatter functionals, the most common example of which is the ordinary covariance matrix Cov(x), are used to measure some aspects of multivariate spread or dispersion.

In addition to the location and scatter functionals describing properties of probability distributions, the respective sample statistics T(X) and S(X) can be defined. Correspondingly they satisfy

T(A X + b 1_n^T) = A T(X) + b  and  S(A X + b 1_n^T) = A S(X) A^T,

for all A ∈ A and all b ∈ R^p. For examples of different scatter and location functionals the reader may refer to Tyler et al. (2009), where the authors also discuss various means of constructing them; a small numerical check of the equivariance properties is sketched at the end of this subsection.

Finally, one special family of scatter functionals is also considered in the section discussing the independent component analysis. Namely, a scatter functional S(F_x) is said to possess the independence property if it is a diagonal matrix whenever the components of x are independent. For example, both the ordinary covariance matrix Cov(x) and the covariance matrix based on fourth moments Cov_4(x) examined later have this property.
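
Before moving on to the diagonalization, here is a small numerical check (an illustrative example, not drawn from the thesis) of the equivariance properties (8) and (9) for the sample mean vector and the sample covariance matrix.

    set.seed(5)
    n <- 200; p <- 3
    X <- matrix(rnorm(n * p), ncol = p)              # observations as rows
    A <- matrix(rnorm(p * p), p, p)                  # an arbitrary non-singular matrix
    b <- c(1, -2, 3)
    Y <- X %*% t(A) + matrix(b, n, p, byrow = TRUE)  # y_i = A x_i + b

    all.equal(colMeans(Y), as.vector(A %*% colMeans(X) + b))   # equivariance (8)
    all.equal(cov(Y), A %*% cov(X) %*% t(A))                   # equivariance (9)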

2.6 The diagonalization of two scatter matrices

Concluding each of the following sections reviewing the different dimension reduction methods is a subsection that shows how the respective method can be formulated as a simultaneous diagonalization of two matrices and, apart from PCA, as a simultaneous diagonalization of two scatter matrices. That is, the only thing differentiating the distinct dimension reduction methods is the choice of the two scatter matrices used. Apart from this choice the procedure stays otherwise the same and thus provides a useful, unifying framework for studying the properties of the individual methods. For a brief account of the history of the approach see the introduction.

In this subsection, which is based on Tyler et al. (2009), the basics and the general principle of the approach are discussed; in the following sections reviewing the individual dimension reduction methods the choices of scatter matrices and other specific aspects are looked into in more detail.

Before starting with the method itself, consider first the standard eigendecomposition of a p × p matrix A presented earlier. The relationship between the matrix and its eigenvectors q_j and eigenvalues d_j is summarized by the equation

A q_j = d_j q_j,  j = 1, ..., p.    (10)

The comparison of the two scatter matrices is carried out by finding a solution to an equation similar to this. If S_1 ∈ S and S_2 ∈ S are symmetric, positive definite p × p matrices, then a pair (h_j, ρ_j) is said to be an eigenvector and a corresponding eigenvalue of S_2 relative to S_1 if ρ_j and h_j satisfy the equation

S_2 h_j = ρ_j S_1 h_j.    (11)

It can be shown, using the definition (10), that finding solutions to (11) is equivalent to two other eigenvector-eigenvalue problems (proofs in Appendix A). First, any solution pair (h_j, ρ_j) to the equation (11) is also an eigenvector-eigenvalue pair of the not-generally-symmetric matrix S_1^{-1} S_2. Second, any eigenvector-eigenvalue pair (h_j, ρ_j) satisfying the equation (11) is closely related to the eigenvectors and eigenvalues of the symmetric matrix M = S_1^{-1/2} S_2 S_1^{-1/2} ∈ S, where S_1^{-1/2} ∈ S is the unique symmetric inverse square root of S_1. Namely, the eigenvalues ρ_j are the same in both problems, and the eigenvectors q_j of the matrix M and the eigenvectors h_j of S_2 relative to S_1 corresponding to the same eigenvalue ρ_j have the relationship q_j ∝ S_1^{1/2} h_j. Note also that the matrix M is positive definite and thus the eigenvalues ρ_j are also positive. The proof of this is omitted here, but a more general proof of the positive definiteness of matrices of the form A^T B A, where the p × p matrix B is positive definite and A ∈ A, can be found in Chapter 14.2 of Harville (1997).

Note that, as discussed in the subsection concerning the eigendecomposition, having two or more eigenvalues that are equal complicates matters somewhat, and therefore the constraint regarding the uniqueness of the eigenvalues is now required. Namely, in the following it is assumed that the eigenvalues of the matrix S_1^{-1} S_2 are distinct.

As M is symmetric, its eigenvectors q_j are orthogonal, and then from the equation

q_i^T q_j ∝ (S_1^{1/2} h_i)^T (S_1^{1/2} h_j) = h_i^T S_1 h_j

and from the orthogonality of the vectors q_j it follows that h_i^T S_1 h_j = 0 for i ≠ j.

From the definition in (11) it can be seen that h_i^T S_2 h_j ∝ h_i^T S_1 h_j, and so also h_i^T S_2 h_j = 0 for i ≠ j. Thus choosing two scatter matrices S_1 and S_2 and solving the equation (11) (for example via the eigendecomposition of the matrix M) yields the following simultaneous diagonalization of the matrices S_1 and S_2:

H^T S_1 H = D_1  and  H^T S_2 H = D_2.    (12)

As stated in Ilmonen et al. (2012), the matrix H, as a matrix of eigenvectors, is not unique, for the reasons given in the subsection concerning the eigendecomposition. It is unique only up to the group of transformations C_PDJ. But with the additional, non-restricting conditions given in Ilmonen et al. (2012), its uniqueness can be established.

i) First, the magnitudes of the eigenvectors can be fixed by requiring that the diagonalization of the first scatter matrix yields the identity matrix, H^T S_1 H = I. This is achieved by normalizing the vectors S_1^{1/2} h_j to be unit vectors, so that (S_1^{1/2} h_j)^T (S_1^{1/2} h_j) = h_j^T S_1 h_j = 1. This makes the matrix H unique up to the group of transformations C_PJ. Note also that the diagonal elements of the second diagonal matrix are then the eigenvalues of S_1^{-1} S_2, that is, D_2 = D = diag(ρ_1, ..., ρ_p). This holds because from (11) it follows that S_2 H = S_1 H D, and pre-multiplying by H^T we have D_2 = D.

ii) Second, the order of the eigenvectors (as the columns of H) can be fixed by requiring that the eigenvalues ρ_j be ordered on the diagonal of D_2 in descending order. Now the matrix H is unique up to the group of transformations C_J.

iii) And finally, the signs of the eigenvectors h_j can be fixed by requiring the additional condition H^T (T_1 - T_2) ≥ 0 to hold, where T_1 and T_2 are any two location functionals and the inequality is taken componentwise. This final condition makes the matrix H unique.

As pointed out in Ilmonen et al. (2012), the final condition fixing the signs of the eigenvectors prohibits the use of symmetric distributions, as for symmetric distributions all location functionals are equal to the symmetry point of the distribution. In these cases it is possible to drop the final condition and settle for the uniqueness of H up to the signs of the eigenvectors.

After finding the matrix H it can be used to transform the data into the invariant coordinate system mentioned earlier by x ↦ H^T x, and this is extensively discussed in Tyler et al. (2009). A small sketch of computing H for two given scatter matrices follows.
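
The whole construction fits in a few lines of R. The sketch below is only an illustration: the second scatter matrix is a simple fourth-moment-type reweighted covariance chosen for demonstration (it is an assumed example, not necessarily the Cov_4 functional used later in the thesis). It builds M = S_1^{-1/2} S_2 S_1^{-1/2}, recovers H and verifies the normalized diagonalization.

    set.seed(6)
    n <- 1000; p <- 4
    X  <- matrix(rnorm(n * p), ncol = p) %*% matrix(rnorm(p * p), p, p)
    Xc <- sweep(X, 2, colMeans(X))

    S1 <- cov(X)                                 # first scatter matrix
    r2 <- rowSums((Xc %*% solve(S1)) * Xc)       # squared Mahalanobis distances
    S2 <- crossprod(Xc * r2, Xc) / n             # a reweighted (fourth-moment type) scatter

    e1    <- eigen(S1, symmetric = TRUE)
    S1_is <- e1$vectors %*% diag(1 / sqrt(e1$values)) %*% t(e1$vectors)  # S_1^{-1/2}
    M     <- S1_is %*% S2 %*% S1_is

    eM <- eigen(M, symmetric = TRUE)             # eigenvalues already in decreasing order
    H  <- S1_is %*% eM$vectors                   # columns h_j, since q_j = S_1^{1/2} h_j

    round(t(H) %*% S1 %*% H, 8)                  # the identity matrix, as required below in (13)
    round(t(H) %*% S2 %*% H, 8)                  # diagonal matrix of the eigenvalues rho_j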

However, we are mainly concerned with the equations in (12) showcasing the simultaneous diagonalization of the two scatter matrices, and especially with the normalized form given by fixing the magnitudes and the order of the eigenvectors,

H^T S_1 H = I  and  H^T S_2 H = D,    (13)

where the eigenvalues on the diagonal of D are in descending order. This is the common form on which all the presented dimension reduction methods are shown to be based in the following sections (although for PCA, S_1 is not an actual scatter matrix). Usually S_1 is chosen to be the regular covariance matrix Cov(x) and the matrix S_2 measures some other aspect of the spread, specific to the particular dimension reduction method at hand.

The previous process of finding the eigendecomposition of S_2 relative to S_1 can also be given an intuitive explanation. From the fact that the solution can be found from the eigendecomposition of S_1^{-1} S_2, it is evident that the process somehow compares the magnitudes of the two matrices. In fact, the diagonalization extracts the specific directions in which the spreads of the data, as measured by S_1 and S_2, differ the most, and this ability of the diagonalization to compare different kinds of variation can be seen as the cornerstone of its usefulness in dimension reduction.

3 Principal Component Analysis (PCA)

3.1 The basic idea behind PCA

The first, and perhaps the simplest, of the presented dimension reduction methods is principal component analysis. An unsupervised method, PCA is based on the assumption that the observed p original variables in x contain redundant information as measured by correlation. For a simple example consider the extreme case of two variables being perfectly correlated. Knowing the value of the first variable would then allow us to exactly determine the value of the second variable, and we could discard the second variable without losing any information.

Such special cases aside, the objective of PCA is to replace the p original variables with a smaller number k of uncorrelated linear combinations of them, i.e. principal components, that preserve the information content of the original variables but have none of this redundancy. In addition to the uncorrelatedness, the principal components are chosen by requiring that the first principal component captures the maximal amount of the original variables' information content, the second component the maximal amount left after that, and so on until a total of p principal components is constructed.

The measure of the information content of the data, as stated before, varies between methods and is in the case of principal component analysis the variance. That is, data or variables with high variance are interpreted as having high information content, and respectively low variance is interpreted as a sign of low information content. PCA can thus also be seen as a re-distribution of the variance of the original data. Further, it is often the case that the first k principal components have a high variance, i.e. information content, relative to the rest of the components. The following analyses can then be carried out using only these k new variables and the last p - k uninformative components can be dropped, thus completing the dimension reduction. The choice of k, the number of kept components, of course depends on the situation and the data used, and many different criteria have been developed for determining it, some of which are briefly discussed later.

PCA also has a simple geometrical interpretation. It is later seen that the transformation giving the principal components is an orthogonal one, and the procedure can thus be thought of as plotting the observations in R^p and finding, one by one, the orthogonal directions having maximal variability. The principal components are then given as a result of transforming the original data to the new coordinate system given by these directions. Moreover, the orthogonality of the new coordinate axes means that the transformation giving the principal components is represented by a rotation (and possibly a reflection) of the original axes around the origin. This procedure is illustrated in a simple two-dimensional case in Figure 1.

Next, the theory behind extracting the principal components is discussed. The approach is, unless stated otherwise, based on Jolliffe (2002).

3.2 Finding the principal components

Let x = (x_1, x_2, ..., x_p) be a random vector having the covariance matrix Σ_x. Further assume that the principal components satisfying the previous requirements of uncorrelatedness and step-wise maximal variances are given by some linear combinations a_1^T x, a_2^T x, ..., a_p^T x, where a_j ∈ R^p, j = 1, ..., p. The variance of, for example, the first linear combination is then given by a_1^T Σ_x a_1. Notice that in maximizing this with respect to a_1 some constraint on a_1 has to be applied, or otherwise there is no upper bound for the variance. The usual choice for the constraint is to require the coefficient vectors to be unit vectors, a_j^T a_j = 1, j = 1, ..., p.

Figure 1: The graph shows the scatter plot of n = 30 bivariate observations onto which the new coordinate axes giving the principal components are plotted. As can be seen, the first principal component has taken the direction having the most variability in the data. Notice also that the data being two-dimensional, combined with the requirement of orthogonality, means that the direction of the second principal component is determined up to its sign by the direction of the first principal component. The lengths of the arrows in the graph are proportional to the variances of the respective principal components.

Notice also that the covariance between two linear combinations a_i^T x and a_j^T x is given by

Cov(a_i^T x, a_j^T x) = E[(a_i^T x)(x^T a_j)] - E[a_i^T x] E[x^T a_j] = a_i^T Σ_x a_j.

This means that the linear combinations a_i^T x and a_j^T x are uncorrelated if and only if a_i^T Σ_x a_j = 0. With these in mind, the process of finding the principal components can now be formulated as a series of optimization problems with additional constraints.

- Find a_1 ∈ R^p that maximizes the variance given by a_1^T Σ_x a_1 under the constraint of unit length a_1^T a_1 = 1. The first principal component is then given by z_1 = a_1^T x.

- Find a_2 ∈ R^p that maximizes a_2^T Σ_x a_2 under the constraints of unit length a_2^T a_2 = 1 and uncorrelatedness a_1^T Σ_x a_2 = 0. The second principal component is then given by z_2 = a_2^T x.

- Find a_3 ∈ R^p that maximizes a_3^T Σ_x a_3 under the constraints a_3^T a_3 = 1 and a_1^T Σ_x a_3 = a_2^T Σ_x a_3 = 0. The third principal component is then given by z_3 = a_3^T x.

- Continue accordingly until a total of p principal components has been constructed.

While the process may seem complicated, what is convenient is that the solution is found directly from the eigendecomposition of the covariance matrix Σ_x. One way to show this is the method of Lagrange multipliers, a technique used in optimization problems with additional constraints (like the previous a_1^T a_1 = 1). For a comprehensive account of the method of Lagrange multipliers the reader can refer to Loomis and Sternberg (1990). Here it suffices to say that if we want to maximize the variance a_1^T Σ_x a_1 with respect to a_1 under the constraint a_1^T a_1 = 1, then under certain conditions the solution can be found amongst the vectors satisfying the following equations (note that the second equation is of vector form and thus consists of p individual equations):

a_1^T a_1 - 1 = 0
Σ_x a_1 - λ_1 a_1 = 0,    (14)

for some scalar λ_1 and a_1 ∈ R^p. Note that the second equation can also be written as Σ_x a_1 = λ_1 a_1, showing the connection between finding the principal components and the eigendecomposition of Σ_x (see equation (2)). The vector a_1 maximizing the quantity a_1^T Σ_x a_1 is then found amongst the p eigenvectors of Σ_x, and the correct eigenvector can be chosen by observing that the quantity to be maximized, that is, the variance of a_1^T x, is for each eigenvector equal to the corresponding eigenvalue, a_1^T Σ_x a_1 = a_1^T λ_1 a_1 = λ_1. Thus a_1 is chosen to be the eigenvector corresponding to the largest eigenvalue of Σ_x. The additional constraint a_1^T a_1 = 1 in (14) requires the solution eigenvector to be normalized to have unit length.

The rest of the principal components can be found similarly using the method of Lagrange multipliers. Each additional constraint regarding the uncorrelatedness of the principal components adds another scalar λ_j to the equation group, but the process remains relatively unchanged. In all cases the equations can be reduced to the form of (2), showing that the solution is found from the eigenvectors of Σ_x. The additional constraints then imply that the j-th principal component is given by the eigenvector corresponding to the j-th largest eigenvalue of Σ_x.
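
In practice the whole derivation collapses into a single eigendecomposition of the estimated covariance matrix. The following R sketch (simulated data; not the simulation code of Section 6 or Appendix B) computes the principal components this way and checks that their variances equal the eigenvalues and agree with R's built-in prcomp().

    set.seed(7)
    n <- 200; p <- 4
    X <- matrix(rnorm(n * p), ncol = p) %*% matrix(rnorm(p * p), p, p)

    e <- eigen(cov(X), symmetric = TRUE)      # eigenvalues in decreasing order
    A <- e$vectors                            # columns a_1, ..., a_p
    Z <- sweep(X, 2, colMeans(X)) %*% A       # principal components z_j = a_j^T (x - mean)

    all.equal(diag(cov(Z)), e$values)         # component variances equal the eigenvalues
    all.equal(e$values, prcomp(X)$sdev^2)     # and agree with prcomp()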


Linear Dimensionality Reduction Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis

More information

8. Diagonalization.

8. Diagonalization. 8. Diagonalization 8.1. Matrix Representations of Linear Transformations Matrix of A Linear Operator with Respect to A Basis We know that every linear transformation T: R n R m has an associated standard

More information

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works

CS168: The Modern Algorithmic Toolbox Lecture #8: How PCA Works CS68: The Modern Algorithmic Toolbox Lecture #8: How PCA Works Tim Roughgarden & Gregory Valiant April 20, 206 Introduction Last lecture introduced the idea of principal components analysis (PCA). The

More information

7. Variable extraction and dimensionality reduction

7. Variable extraction and dimensionality reduction 7. Variable extraction and dimensionality reduction The goal of the variable selection in the preceding chapter was to find least useful variables so that it would be possible to reduce the dimensionality

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

A Introduction to Matrix Algebra and the Multivariate Normal Distribution

A Introduction to Matrix Algebra and the Multivariate Normal Distribution A Introduction to Matrix Algebra and the Multivariate Normal Distribution PRE 905: Multivariate Analysis Spring 2014 Lecture 6 PRE 905: Lecture 7 Matrix Algebra and the MVN Distribution Today s Class An

More information

Basic Concepts in Matrix Algebra

Basic Concepts in Matrix Algebra Basic Concepts in Matrix Algebra An column array of p elements is called a vector of dimension p and is written as x p 1 = x 1 x 2. x p. The transpose of the column vector x p 1 is row vector x = [x 1

More information

Principal Components Theory Notes

Principal Components Theory Notes Principal Components Theory Notes Charles J. Geyer August 29, 2007 1 Introduction These are class notes for Stat 5601 (nonparametrics) taught at the University of Minnesota, Spring 2006. This not a theory

More information

Appendix A: Matrices

Appendix A: Matrices Appendix A: Matrices A matrix is a rectangular array of numbers Such arrays have rows and columns The numbers of rows and columns are referred to as the dimensions of a matrix A matrix with, say, 5 rows

More information

Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining

Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes. October 3, Statistics 202: Data Mining Dimension reduction, PCA & eigenanalysis Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Combinations of features Given a data matrix X n p with p fairly large, it can

More information

Unconstrained Ordination

Unconstrained Ordination Unconstrained Ordination Sites Species A Species B Species C Species D Species E 1 0 (1) 5 (1) 1 (1) 10 (4) 10 (4) 2 2 (3) 8 (3) 4 (3) 12 (6) 20 (6) 3 8 (6) 20 (6) 10 (6) 1 (2) 3 (2) 4 4 (5) 11 (5) 8 (5)

More information

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395

Data Mining. Dimensionality reduction. Hamid Beigy. Sharif University of Technology. Fall 1395 Data Mining Dimensionality reduction Hamid Beigy Sharif University of Technology Fall 1395 Hamid Beigy (Sharif University of Technology) Data Mining Fall 1395 1 / 42 Outline 1 Introduction 2 Feature selection

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations.

Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations. Previously Focus was on solving matrix inversion problems Now we look at other properties of matrices Useful when A represents a transformations y = Ax Or A simply represents data Notion of eigenvectors,

More information

Machine Learning 2nd Edition

Machine Learning 2nd Edition INTRODUCTION TO Lecture Slides for Machine Learning 2nd Edition ETHEM ALPAYDIN, modified by Leonardo Bobadilla and some parts from http://www.cs.tau.ac.il/~apartzin/machinelearning/ The MIT Press, 2010

More information

Preface. Figures Figures appearing in the text were prepared using MATLAB R. For product information, please contact:

Preface. Figures Figures appearing in the text were prepared using MATLAB R. For product information, please contact: Linear algebra forms the basis for much of modern mathematics theoretical, applied, and computational. The purpose of this book is to provide a broad and solid foundation for the study of advanced mathematics.

More information

Principal Component Analysis. Applied Multivariate Statistics Spring 2012

Principal Component Analysis. Applied Multivariate Statistics Spring 2012 Principal Component Analysis Applied Multivariate Statistics Spring 2012 Overview Intuition Four definitions Practical examples Mathematical example Case study 2 PCA: Goals Goal 1: Dimension reduction

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten

More information

Stat 206: Linear algebra

Stat 206: Linear algebra Stat 206: Linear algebra James Johndrow (adapted from Iain Johnstone s notes) 2016-11-02 Vectors We have already been working with vectors, but let s review a few more concepts. The inner product of two

More information

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu

Dimension Reduction Techniques. Presented by Jie (Jerry) Yu Dimension Reduction Techniques Presented by Jie (Jerry) Yu Outline Problem Modeling Review of PCA and MDS Isomap Local Linear Embedding (LLE) Charting Background Advances in data collection and storage

More information

Linear Algebra and Robot Modeling

Linear Algebra and Robot Modeling Linear Algebra and Robot Modeling Nathan Ratliff Abstract Linear algebra is fundamental to robot modeling, control, and optimization. This document reviews some of the basic kinematic equations and uses

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. I assume the reader is familiar with basic linear algebra, including the

More information

Principal component analysis

Principal component analysis Principal component analysis Angela Montanari 1 Introduction Principal component analysis (PCA) is one of the most popular multivariate statistical methods. It was first introduced by Pearson (1901) and

More information

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ). .8.6 µ =, σ = 1 µ = 1, σ = 1 / µ =, σ =.. 3 1 1 3 x Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ ). The Gaussian distribution Probably the most-important distribution in all of statistics

More information

arxiv: v1 [math.na] 5 May 2011

arxiv: v1 [math.na] 5 May 2011 ITERATIVE METHODS FOR COMPUTING EIGENVALUES AND EIGENVECTORS MAYSUM PANJU arxiv:1105.1185v1 [math.na] 5 May 2011 Abstract. We examine some numerical iterative methods for computing the eigenvalues and

More information

Foundations of Computer Vision

Foundations of Computer Vision Foundations of Computer Vision Wesley. E. Snyder North Carolina State University Hairong Qi University of Tennessee, Knoxville Last Edited February 8, 2017 1 3.2. A BRIEF REVIEW OF LINEAR ALGEBRA Apply

More information

Factor Analysis (10/2/13)

Factor Analysis (10/2/13) STA561: Probabilistic machine learning Factor Analysis (10/2/13) Lecturer: Barbara Engelhardt Scribes: Li Zhu, Fan Li, Ni Guan Factor Analysis Factor analysis is related to the mixture models we have studied.

More information

Least Squares Optimization

Least Squares Optimization Least Squares Optimization The following is a brief review of least squares optimization and constrained optimization techniques. Broadly, these techniques can be used in data analysis and visualization

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Yingyu Liang yliang@cs.wisc.edu Computer Sciences Department University of Wisconsin, Madison [based on slides from Nina Balcan] slide 1 Goals for the lecture you should understand

More information

Chapter Two Elements of Linear Algebra

Chapter Two Elements of Linear Algebra Chapter Two Elements of Linear Algebra Previously, in chapter one, we have considered single first order differential equations involving a single unknown function. In the next chapter we will begin to

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Principal Component Analysis

Principal Component Analysis Principal Component Analysis Laurenz Wiskott Institute for Theoretical Biology Humboldt-University Berlin Invalidenstraße 43 D-10115 Berlin, Germany 11 March 2004 1 Intuition Problem Statement Experimental

More information

Data Preprocessing Tasks

Data Preprocessing Tasks Data Tasks 1 2 3 Data Reduction 4 We re here. 1 Dimensionality Reduction Dimensionality reduction is a commonly used approach for generating fewer features. Typically used because too many features can

More information

An Introduction to Matrix Algebra

An Introduction to Matrix Algebra An Introduction to Matrix Algebra EPSY 905: Fundamentals of Multivariate Modeling Online Lecture #8 EPSY 905: Matrix Algebra In This Lecture An introduction to matrix algebra Ø Scalars, vectors, and matrices

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

Vector Space Models. wine_spectral.r

Vector Space Models. wine_spectral.r Vector Space Models 137 wine_spectral.r Latent Semantic Analysis Problem with words Even a small vocabulary as in wine example is challenging LSA Reduce number of columns of DTM by principal components

More information

Example Linear Algebra Competency Test

Example Linear Algebra Competency Test Example Linear Algebra Competency Test The 4 questions below are a combination of True or False, multiple choice, fill in the blank, and computations involving matrices and vectors. In the latter case,

More information

On Independent Component Analysis

On Independent Component Analysis On Independent Component Analysis Université libre de Bruxelles European Centre for Advanced Research in Economics and Statistics (ECARES) Solvay Brussels School of Economics and Management Symmetric Outline

More information

Vectors and Matrices Statistics with Vectors and Matrices

Vectors and Matrices Statistics with Vectors and Matrices Vectors and Matrices Statistics with Vectors and Matrices Lecture 3 September 7, 005 Analysis Lecture #3-9/7/005 Slide 1 of 55 Today s Lecture Vectors and Matrices (Supplement A - augmented with SAS proc

More information

Discriminant analysis and supervised classification

Discriminant analysis and supervised classification Discriminant analysis and supervised classification Angela Montanari 1 Linear discriminant analysis Linear discriminant analysis (LDA) also known as Fisher s linear discriminant analysis or as Canonical

More information

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008

Gaussian processes. Chuong B. Do (updated by Honglak Lee) November 22, 2008 Gaussian processes Chuong B Do (updated by Honglak Lee) November 22, 2008 Many of the classical machine learning algorithms that we talked about during the first half of this course fit the following pattern:

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1: Introduction, Multivariate Location and Scatter

MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1: Introduction, Multivariate Location and Scatter MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 1:, Multivariate Location Contents , pauliina.ilmonen(a)aalto.fi Lectures on Mondays 12.15-14.00 (2.1. - 6.2., 20.2. - 27.3.), U147 (U5) Exercises

More information

Notes on Linear Algebra and Matrix Theory

Notes on Linear Algebra and Matrix Theory Massimo Franceschet featuring Enrico Bozzo Scalar product The scalar product (a.k.a. dot product or inner product) of two real vectors x = (x 1,..., x n ) and y = (y 1,..., y n ) is not a vector but a

More information

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1

Statistics 202: Data Mining. c Jonathan Taylor. Week 2 Based in part on slides from textbook, slides of Susan Holmes. October 3, / 1 Week 2 Based in part on slides from textbook, slides of Susan Holmes October 3, 2012 1 / 1 Part I Other datatypes, preprocessing 2 / 1 Other datatypes Document data You might start with a collection of

More information

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes

Part I. Other datatypes, preprocessing. Other datatypes. Other datatypes. Week 2 Based in part on slides from textbook, slides of Susan Holmes Week 2 Based in part on slides from textbook, slides of Susan Holmes Part I Other datatypes, preprocessing October 3, 2012 1 / 1 2 / 1 Other datatypes Other datatypes Document data You might start with

More information

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the

-Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1 2 3 -Principal components analysis is by far the oldest multivariate technique, dating back to the early 1900's; ecologists have used PCA since the 1950's. -PCA is based on covariance or correlation

More information

Scatter Matrices and Independent Component Analysis

Scatter Matrices and Independent Component Analysis AUSTRIAN JOURNAL OF STATISTICS Volume 35 (2006), Number 2&3, 175 189 Scatter Matrices and Independent Component Analysis Hannu Oja 1, Seija Sirkiä 2, and Jan Eriksson 3 1 University of Tampere, Finland

More information

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where:

VAR Model. (k-variate) VAR(p) model (in the Reduced Form): Y t-2. Y t-1 = A + B 1. Y t + B 2. Y t-p. + ε t. + + B p. where: VAR Model (k-variate VAR(p model (in the Reduced Form: where: Y t = A + B 1 Y t-1 + B 2 Y t-2 + + B p Y t-p + ε t Y t = (y 1t, y 2t,, y kt : a (k x 1 vector of time series variables A: a (k x 1 vector

More information

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data.

Structure in Data. A major objective in data analysis is to identify interesting features or structure in the data. Structure in Data A major objective in data analysis is to identify interesting features or structure in the data. The graphical methods are very useful in discovering structure. There are basically two

More information

Multivariate Distributions

Multivariate Distributions Copyright Cosma Rohilla Shalizi; do not distribute without permission updates at http://www.stat.cmu.edu/~cshalizi/adafaepov/ Appendix E Multivariate Distributions E.1 Review of Definitions Let s review

More information

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent

Unsupervised Machine Learning and Data Mining. DS 5230 / DS Fall Lecture 7. Jan-Willem van de Meent Unsupervised Machine Learning and Data Mining DS 5230 / DS 4420 - Fall 2018 Lecture 7 Jan-Willem van de Meent DIMENSIONALITY REDUCTION Borrowing from: Percy Liang (Stanford) Dimensionality Reduction Goal:

More information

A Selective Review of Sufficient Dimension Reduction

A Selective Review of Sufficient Dimension Reduction A Selective Review of Sufficient Dimension Reduction Lexin Li Department of Statistics North Carolina State University Lexin Li (NCSU) Sufficient Dimension Reduction 1 / 19 Outline 1 General Framework

More information

Introduction to Matrices

Introduction to Matrices POLS 704 Introduction to Matrices Introduction to Matrices. The Cast of Characters A matrix is a rectangular array (i.e., a table) of numbers. For example, 2 3 X 4 5 6 (4 3) 7 8 9 0 0 0 Thismatrix,with4rowsand3columns,isoforder

More information

Table of Contents. Multivariate methods. Introduction II. Introduction I

Table of Contents. Multivariate methods. Introduction II. Introduction I Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation

More information

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2

MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 MA 575 Linear Models: Cedric E. Ginestet, Boston University Regularization: Ridge Regression and Lasso Week 14, Lecture 2 1 Ridge Regression Ridge regression and the Lasso are two forms of regularized

More information

22.3. Repeated Eigenvalues and Symmetric Matrices. Introduction. Prerequisites. Learning Outcomes

22.3. Repeated Eigenvalues and Symmetric Matrices. Introduction. Prerequisites. Learning Outcomes Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one

More information

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag

A Tutorial on Data Reduction. Principal Component Analysis Theoretical Discussion. By Shireen Elhabian and Aly Farag A Tutorial on Data Reduction Principal Component Analysis Theoretical Discussion By Shireen Elhabian and Aly Farag University of Louisville, CVIP Lab November 2008 PCA PCA is A backbone of modern data

More information

Linear Algebra and Eigenproblems

Linear Algebra and Eigenproblems Appendix A A Linear Algebra and Eigenproblems A working knowledge of linear algebra is key to understanding many of the issues raised in this work. In particular, many of the discussions of the details

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017

CPSC 340: Machine Learning and Data Mining. More PCA Fall 2017 CPSC 340: Machine Learning and Data Mining More PCA Fall 2017 Admin Assignment 4: Due Friday of next week. No class Monday due to holiday. There will be tutorials next week on MAP/PCA (except Monday).

More information

Neuroscience Introduction

Neuroscience Introduction Neuroscience Introduction The brain As humans, we can identify galaxies light years away, we can study particles smaller than an atom. But we still haven t unlocked the mystery of the three pounds of matter

More information

APPENDIX 1 BASIC STATISTICS. Summarizing Data

APPENDIX 1 BASIC STATISTICS. Summarizing Data 1 APPENDIX 1 Figure A1.1: Normal Distribution BASIC STATISTICS The problem that we face in financial analysis today is not having too little information but too much. Making sense of large and often contradictory

More information

Repeated Eigenvalues and Symmetric Matrices

Repeated Eigenvalues and Symmetric Matrices Repeated Eigenvalues and Symmetric Matrices. Introduction In this Section we further develop the theory of eigenvalues and eigenvectors in two distinct directions. Firstly we look at matrices where one

More information

PCA, Kernel PCA, ICA

PCA, Kernel PCA, ICA PCA, Kernel PCA, ICA Learning Representations. Dimensionality Reduction. Maria-Florina Balcan 04/08/2015 Big & High-Dimensional Data High-Dimensions = Lot of Features Document classification Features per

More information

On Invariant Within Equivalence Coordinate System (IWECS) Transformations

On Invariant Within Equivalence Coordinate System (IWECS) Transformations On Invariant Within Equivalence Coordinate System (IWECS) Transformations Robert Serfling Abstract In exploratory data analysis and data mining in the very common setting of a data set X of vectors from

More information

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2

Final Review Sheet. B = (1, 1 + 3x, 1 + x 2 ) then 2 + 3x + 6x 2 Final Review Sheet The final will cover Sections Chapters 1,2,3 and 4, as well as sections 5.1-5.4, 6.1-6.2 and 7.1-7.3 from chapters 5,6 and 7. This is essentially all material covered this term. Watch

More information

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education

MTH Linear Algebra. Study Guide. Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education MTH 3 Linear Algebra Study Guide Dr. Tony Yee Department of Mathematics and Information Technology The Hong Kong Institute of Education June 3, ii Contents Table of Contents iii Matrix Algebra. Real Life

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis

December 20, MAA704, Multivariate analysis. Christopher Engström. Multivariate. analysis. Principal component analysis .. December 20, 2013 Todays lecture. (PCA) (PLS-R) (LDA) . (PCA) is a method often used to reduce the dimension of a large dataset to one of a more manageble size. The new dataset can then be used to make

More information

Stat 5101 Lecture Notes

Stat 5101 Lecture Notes Stat 5101 Lecture Notes Charles J. Geyer Copyright 1998, 1999, 2000, 2001 by Charles J. Geyer May 7, 2001 ii Stat 5101 (Geyer) Course Notes Contents 1 Random Variables and Change of Variables 1 1.1 Random

More information

Experimental Design and Data Analysis for Biologists

Experimental Design and Data Analysis for Biologists Experimental Design and Data Analysis for Biologists Gerry P. Quinn Monash University Michael J. Keough University of Melbourne CAMBRIDGE UNIVERSITY PRESS Contents Preface page xv I I Introduction 1 1.1

More information

Independent Component (IC) Models: New Extensions of the Multinormal Model

Independent Component (IC) Models: New Extensions of the Multinormal Model Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research

More information

Linear Algebra for Machine Learning. Sargur N. Srihari

Linear Algebra for Machine Learning. Sargur N. Srihari Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it

More information