International Journal of Pure and Applied Mathematics
Volume 19, No. 3 (2005), 359-366

A NOTE ON BETWEEN-GROUP PCA

Anne-Laure Boulesteix
Department of Statistics, University of Munich
Akademiestrasse 1, Munich, D-80799, Germany
e-mail: boulesteix@stat.uni-muenchen.de

Abstract: In the context of binary classification with continuous predictors, we prove two properties concerning the connections between Partial Least Squares (PLS) dimension reduction and between-group PCA, and between linear discriminant analysis and between-group PCA. Such methods are of great interest for the analysis of high-dimensional data with continuous predictors, such as microarray gene expression data.

AMS Subject Classification: 46N30, 11R29

Key Words: applied statistics, classification, dimension reduction, feature extraction, linear discriminant analysis, partial least squares, principal component analysis

Received: February 2, 2005. (c) 2005, Academic Publications Ltd.

1. Introduction

Classification, i.e. the prediction of a categorical variable using predictor variables, is an important field of applied statistics. Suppose we have a p × 1 random vector x = (X_1, ..., X_p)^T, where X_1, ..., X_p are continuous predictor variables. Y is a categorical response variable which can take the values 1, ..., K (K ≥ 2); it is also referred to as the group membership.

Many dimension reduction and classification methods are based on linear transformations of the random vector x of the type

    Z = a^T x,    (1)

where a is a p × 1 vector. In linear dimension reduction, the focus is on the linear transformations themselves, whereas linear classification methods aim to predict the response variable Y via linear transformations of x. However, the two approaches are strongly connected, since the linear transformations output by dimension reduction methods can sometimes be used as new predictor variables for classification. In this article, we study the connection between some well-known dimension reduction and classification methods.

Principal component analysis (PCA) consists in finding uncorrelated linear transformations of the random vector x which have high variance. The same analysis can be performed on the random vector E(x | Y) instead of x. In this paper, this approach is denoted as between-group PCA and examined in Section 2. An alternative approach to linear dimension reduction is Partial Least Squares (PLS), which aims to find linear transformations which have high covariance with the response Y. In Section 3, the PLS approach is briefly presented and a connection between between-group PCA and the first PLS component is shown for the case K = 2.

If one assumes that x has a multivariate normal distribution within each group and that the within-group covariance matrix is the same for all groups, decision theory tells us that the optimal decision function is a linear transformation of x. This approach is called linear discriminant analysis. An overview of discriminant analysis can be found in Hastie and Tibshirani [4]. For K = 2, we show in Section 4 that, under a stronger assumption, the linear transformation of x obtained in linear discriminant analysis is the same as in between-group PCA.

Throughout the paper, µ denotes the mean of the random vector x and Σ its covariance matrix. For k = 1, ..., K, µ_k denotes the within-group mean vector of x and Σ_k the within-group covariance matrix for group k. In addition, we assume µ_i ≠ µ_j for i ≠ j. The vectors x_i = (x_{i1}, ..., x_{ip})^T, for i = 1, ..., n, denote independent identically distributed realizations of the random vector x, and Y_i denotes the group membership of the i-th realization. µ̂_1, ..., µ̂_K are the sample within-group mean vectors calculated from the data set (x_i, Y_i)_{i=1,...,n}.

2. Between-Group PCA

Linear dimension reduction consists in defining new random variables Z_1, ..., Z_m as linear combinations of X_1, ..., X_p, where m is the number of new variables. For j = 1, ..., m, Z_j has the form Z_j = a_j^T x, where a_j is a p × 1 vector. In principal component analysis (PCA), the vectors a_1, ..., a_m ∈ R^p are defined successively as follows.

Definition 1. (Principal Components) a_1 is the p × 1 vector a maximizing VAR(a^T x) = a^T Σ a under the constraint a^T a = 1. For j = 2, ..., m, a_j is the p × 1 vector a maximizing VAR(a^T x) under the constraints a^T a = 1 and a^T a_i = 0 for i = 1, ..., j - 1.

The vectors a_1, ..., a_m defined in Definition 1 are the (normalized) eigenvectors of the matrix Σ. The number of eigenvectors with strictly positive eigenvalues equals rank(Σ), which is p if X_1, ..., X_p are linearly independent. a_1 is the eigenvector of Σ with the greatest eigenvalue, a_2 is the eigenvector of Σ with the second greatest eigenvalue, and so on. For an extensive overview of PCA, see e.g. Jolliffe [5].

In PCA, the new variables Z_1, ..., Z_m are built independently of Y, and the number of new variables m is at most p. If one wants to build new variables which contain information on the categorical response variable Y, an alternative to PCA is to look for linear combinations of x which maximize VAR(E(a^T x | Y)) instead of VAR(a^T x). In the following, this approach is denoted as between-group PCA. Σ_B denotes the between-group covariance matrix:

    Σ_B = COV(E(x | Y)).    (2)

In between-group PCA, a_1, ..., a_m are defined as follows.

Definition 2. (Between-Group Principal Components) a_1 is the p × 1 vector a maximizing VAR(E(a^T x | Y)) = a^T Σ_B a under the constraint a^T a = 1. For j = 2, ..., m, a_j is the p × 1 vector a maximizing VAR(E(a^T x | Y)) under the constraints a^T a = 1 and a^T a_i = 0 for i = 1, ..., j - 1.

The vectors a_1, ..., a_m defined in Definition 2 are the eigenvectors of the matrix Σ_B. Since Σ_B is of rank at most K - 1, there are at most K - 1 eigenvectors with strictly positive eigenvalues. Since E(a^T x | Y) = a^T E(x | Y), between-group PCA can be seen as PCA performed on the random vector E(x | Y) instead of x.
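As a concrete illustration of Definitions 1 and 2 (a sketch added here, not part of the original note), the directions a_1, ..., a_m can be obtained as the leading unit-norm eigenvectors of an estimated covariance matrix, Σ̂ for ordinary PCA or Σ̂_B for between-group PCA. A minimal NumPy version, with illustrative variable names, might look as follows.

```python
import numpy as np

def leading_directions(cov: np.ndarray, m: int) -> np.ndarray:
    """Return the m unit-norm eigenvectors of `cov` with the largest
    eigenvalues, as columns (Definition 1 with Sigma replaced by an estimate)."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1]        # indices of eigenvalues, largest first
    return eigvecs[:, order[:m]]

# Example usage on simulated data (illustrative only).
rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))                 # n = 50 observations, p = 4 predictors
Sigma_hat = np.cov(X, rowvar=False)          # p x p sample covariance matrix
A = leading_directions(Sigma_hat, m=2)       # columns estimate a_1 and a_2
print(A.shape)                               # (4, 2); columns are orthonormal
```

The same function applied to an estimate of Σ_B yields the between-group principal components of Definition 2.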

In the next section, the special case K = 2 is examined. If K = 2, Σ_B has only one eigenvector with strictly positive eigenvalue. This eigenvector is denoted as a_B and can be derived from simple computations on Σ_B. Writing p_1 = P(Y = 1) and p_2 = P(Y = 2) for the group probabilities, so that µ = p_1 µ_1 + p_2 µ_2, we have

    Σ_B = p_1 (µ_1 - µ)(µ_1 - µ)^T + p_2 (µ_2 - µ)(µ_2 - µ)^T
        = p_1 (µ_1 - p_1 µ_1 - p_2 µ_2)(µ_1 - p_1 µ_1 - p_2 µ_2)^T + p_2 (µ_2 - p_1 µ_1 - p_2 µ_2)(µ_2 - p_1 µ_1 - p_2 µ_2)^T
        = p_1 p_2^2 (µ_1 - µ_2)(µ_1 - µ_2)^T + p_2 p_1^2 (µ_1 - µ_2)(µ_1 - µ_2)^T
        = p_1 p_2 (µ_1 - µ_2)(µ_1 - µ_2)^T,

and therefore

    Σ_B (µ_1 - µ_2) = p_1 p_2 (µ_1 - µ_2)(µ_1 - µ_2)^T (µ_1 - µ_2).

Since

    p_1 p_2 (µ_1 - µ_2)^T (µ_1 - µ_2) > 0,    (3)

(µ_1 - µ_2) is an eigenvector of Σ_B with strictly positive eigenvalue. Since a_B has to satisfy a_B^T a_B = 1, we obtain

    a_B = (µ_1 - µ_2) / ||µ_1 - µ_2||.    (4)

In practice, µ_1 and µ_2 are often unknown and must be estimated from the available data set (x_i, Y_i)_{i=1,...,n}. a_B may be estimated by replacing µ_1 and µ_2 by µ̂_1 and µ̂_2 in equation (4):

    â_B = (µ̂_1 - µ̂_2) / ||µ̂_1 - µ̂_2||.    (5)

Between-group PCA is applied by Culhane et al. [1] in the context of high-dimensional microarray data. However, Culhane et al. [1] formulate the method as a data matrix decomposition (singular value decomposition) and do not define the between-group principal components theoretically. In the following section, we examine the connection between between-group PCA and Partial Least Squares.
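As a quick numerical check of equation (5) (an added sketch with illustrative names, not part of the original note), the estimate â_B can be computed either directly from the group means or as the top eigenvector of the sample between-group covariance matrix; under the rank-one structure derived above, the two routes agree up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated binary data set (illustrative): n observations, p continuous predictors.
n, p = 100, 5
y = rng.integers(1, 3, size=n)                        # group labels in {1, 2}
X = rng.normal(size=(n, p)) + np.where(y[:, None] == 1, 0.0, 1.0)

mu1_hat = X[y == 1].mean(axis=0)
mu2_hat = X[y == 2].mean(axis=0)

# Estimate of a_B from equation (5): normalized difference of the group means.
a_B_hat = (mu1_hat - mu2_hat) / np.linalg.norm(mu1_hat - mu2_hat)

# Same direction via the top eigenvector of the sample between-group covariance
# matrix Sigma_B = p_1 (mu_1 - mu)(mu_1 - mu)^T + p_2 (mu_2 - mu)(mu_2 - mu)^T.
mu_hat = X.mean(axis=0)
p1, p2 = np.mean(y == 1), np.mean(y == 2)
Sigma_B_hat = (p1 * np.outer(mu1_hat - mu_hat, mu1_hat - mu_hat)
               + p2 * np.outer(mu2_hat - mu_hat, mu2_hat - mu_hat))
eigvals, eigvecs = np.linalg.eigh(Sigma_B_hat)
top = eigvecs[:, np.argmax(eigvals)]

# The two routes give the same direction up to sign (eigenvectors are sign-ambiguous).
print(np.isclose(abs(a_B_hat @ top), 1.0))
```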

3. A Connection Between PLS and Between-Group PCA

Partial Least Squares (PLS) dimension reduction is another linear dimension reduction method. It is especially appropriate for constructing new components which are linked to the response variable Y. Studies of the PLS approach from the point of view of statisticians can be found in, e.g., Stone and Brooks [9], Frank and Friedman [2] and Garthwaite [3]. In the PLS framework, Z_1, ..., Z_m are not random variables which are theoretically defined and then estimated from a data set: their definition is based on a specific data set. Here, we focus on the binary case (Y = 1, 2), although the PLS approach can be generalized to multicategorical response variables (de Jong [6]). For the data set (x_i, Y_i)_{i=1,...,n}, the vectors a_1, ..., a_m are defined as follows (Stone and Brooks [9]).

Definition 3. (PLS Components) Let ĉov denote the sample covariance computed from (x_i, Y_i)_{i=1,...,n}. a_1 is the p × 1 vector a maximizing ĉov(a^T x, Y) under the constraint a^T a = 1. For j = 2, ..., m, a_j is the p × 1 vector a maximizing ĉov(a^T x, Y) under the constraints a^T a = 1 and ĉov(a^T x, a_i^T x) = 0 for i = 1, ..., j - 1.

In the following, the vector a_1 defined in Definition 3 is denoted as a_PLS. An exact algorithm to compute the PLS components can be found in Martens and Naes [7]. We now study the connection between the first PLS component and the first between-group principal component.

Proposition 1. For a given data set (x_i, Y_i)_{i=1,...,n}, the first PLS component equals the first between-group principal component: a_PLS = â_B.

Proof. For all a ∈ R^p, ĉov(a^T x, Y) = a^T ĉov(x, Y), and, with n_1 and n_2 denoting the numbers of observations in groups 1 and 2,

    ĉov(x, Y) = (1/n) Σ_{i=1}^n x_i Y_i - (1/n^2) (Σ_{i=1}^n x_i)(Σ_{i=1}^n Y_i)
              = (1/n)(n_1 µ̂_1 + 2 n_2 µ̂_2) - (1/n^2)(n_1 µ̂_1 + n_2 µ̂_2)(n_1 + 2 n_2)
              = (1/n^2)(n n_1 µ̂_1 + 2 n n_2 µ̂_2 - n_1^2 µ̂_1 - 2 n_1 n_2 µ̂_1 - n_1 n_2 µ̂_2 - 2 n_2^2 µ̂_2)
              = n_1 n_2 (µ̂_2 - µ̂_1) / n^2.

The only unit vector maximizing n_1 n_2 a^T (µ̂_2 - µ̂_1) / n^2 is a_PLS = (µ̂_2 - µ̂_1) / ||µ̂_2 - µ̂_1|| = â_B.

Thus, the first component obtained by PLS dimension reduction is the same as the first component obtained by between-group PCA. This is an argument to support the (controversial) use of PLS dimension reduction in the context of binary classification. The connection between between-group PCA and linear discriminant analysis is examined in the next section.
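The following sketch (again illustrative, with simulated data and names not taken from the paper) checks Proposition 1 numerically: the first PLS direction of Definition 3 is the normalized sample covariance vector ĉov(x, Y), and it is collinear with the estimated between-group direction â_B.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 80, 6
y = rng.integers(1, 3, size=n)                         # binary response coded 1/2
X = rng.normal(size=(n, p)) + np.where(y[:, None] == 1, 0.0, 0.5)

# First PLS direction (Definition 3): the unit vector maximizing cov_hat(a^T x, Y)
# is the normalized sample covariance vector cov_hat(x, Y).
cov_xy = (X * y[:, None]).mean(axis=0) - X.mean(axis=0) * y.mean()
a_PLS = cov_xy / np.linalg.norm(cov_xy)

# Estimated first between-group principal component, equation (5).
mu1_hat = X[y == 1].mean(axis=0)
mu2_hat = X[y == 2].mean(axis=0)
a_B_hat = (mu1_hat - mu2_hat) / np.linalg.norm(mu1_hat - mu2_hat)

# As in Proposition 1, the two directions are collinear (identical up to sign).
print(np.isclose(abs(a_PLS @ a_B_hat), 1.0))
```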

4. A Connection Between LDA and Between-Group PCA

If x is assumed to have a multivariate normal distribution with mean µ_k and covariance matrix Σ_k within class k, then

    P(Y = k | x) = p_k f(x | Y = k) / f(x),

    ln P(Y = k | x) = ln p_k - ln f(x) - ln((2π)^{p/2} |Σ_k|^{1/2}) - (1/2)(x - µ_k)^T Σ_k^{-1} (x - µ_k),

where f denotes the density function. The Bayes classification rule predicts the class of an observation x_0 as

    C(x_0) = arg max_k P(Y = k | x_0)
           = arg max_k ( ln p_k - ln((2π)^{p/2} |Σ_k|^{1/2}) - (1/2)(x_0 - µ_k)^T Σ_k^{-1} (x_0 - µ_k) ).

For K = 2, the discriminant function d_12 is

    d_12(x) = ln P(Y = 1 | x) - ln P(Y = 2 | x)
            = -(1/2)(x - µ_1)^T Σ_1^{-1} (x - µ_1) + (1/2)(x - µ_2)^T Σ_2^{-1} (x - µ_2)
              + ln p_1 - ln p_2 - ln((2π)^{p/2} |Σ_1|^{1/2}) + ln((2π)^{p/2} |Σ_2|^{1/2}).

If one assumes Σ_1 = Σ_2 = Σ, d_12 is a linear function of x (hence the term linear discriminant analysis):

    d_12(x) = (x - (µ_1 + µ_2)/2)^T Σ^{-1} (µ_1 - µ_2) + ln p_1 - ln p_2
            = a_LDA^T x + b,

where

    a_LDA = Σ^{-1} (µ_1 - µ_2)    (6)

and

    b = -(1/2)(µ_1 + µ_2)^T Σ^{-1} (µ_1 - µ_2) + ln p_1 - ln p_2.    (7)

Proposition 2. If Σ is assumed to be of the form Σ = σ^2 I_p, where I_p is the p × p identity matrix and σ is a scalar, then a_LDA and a_B are collinear.

Proof. The proof follows from equations (4) and (6).

Thus, we have shown the strong connection between linear discriminant analysis and between-group PCA in the case K = 2. In practice, a_B is estimated by â_B and a_LDA is estimated by â_LDA = (µ̂_1 - µ̂_2)/σ̂^2, where σ̂ is an estimator of σ. Thus, â_B and â_LDA are also collinear.
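To illustrate Proposition 2 numerically (an added sketch, not from the paper; the means and covariance matrices below are arbitrary), the following code compares the LDA direction Σ^{-1}(µ_1 - µ_2) of equation (6) with a_B from equation (4): the two are collinear when Σ = σ^2 I_p, but for a general Σ they are not.

```python
import numpy as np

rng = np.random.default_rng(3)
p = 4
mu1 = rng.normal(size=p)
mu2 = rng.normal(size=p)
diff = mu1 - mu2

a_B = diff / np.linalg.norm(diff)                  # equation (4)

def a_LDA(Sigma: np.ndarray) -> np.ndarray:
    """LDA direction Sigma^{-1} (mu_1 - mu_2), equation (6)."""
    return np.linalg.solve(Sigma, diff)

def collinear(u: np.ndarray, v: np.ndarray) -> bool:
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    return bool(np.isclose(abs(u @ v), 1.0))

sigma2 = 2.5
print(collinear(a_LDA(sigma2 * np.eye(p)), a_B))   # True: Sigma = sigma^2 I_p (Proposition 2)

M = rng.normal(size=(p, p))
Sigma_general = M @ M.T + np.eye(p)                # a generic positive definite covariance
print(collinear(a_LDA(Sigma_general), a_B))        # in general False
```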

The assumption about the structure of Σ is quite strong. However, such an assumption can be wise in practice when the available data set contains a large number of variables p and a small number of observations n. If p > n, which often occurs in practice (for instance in microarray data analysis), Σ̂ cannot be inverted, since it has rank at most n - 1 and dimensions p × p. In this case, it is sensible to make strong assumptions on Σ. Proposition 2 tells us that between-group PCA takes only between-group correlations into account, not within-group correlations.

5. Discussion

We showed the strong connection between PLS dimension reduction for classification, between-group PCA and linear discriminant analysis for the case K = 2. PCA and PLS are useful techniques in practice, especially when the number of observations n is smaller than the number of variables p, for instance in the context of microarray data analysis (Nguyen and Rocke [8]). The connection between PLS and between-group PCA can also support the use of PLS dimension reduction in the classification framework. The conclusion of this theoretical study is that PLS and between-group PCA, which are the two main dimension reduction methods allowing n < p, are tightly connected to a special case of linear discriminant analysis with strong assumptions. In future work, one could examine the connection between the three approaches for multicategorical response variables. The connection between linear dimension reduction methods and alternative methods such as shrinkage methods could also be investigated.

References

[1] A.C. Culhane, G. Perriere, E. Considine, T. Cotter, D. Higgins, Between-group analysis of microarray data, Bioinformatics, 18 (2002), 1600-1608.

[2] I.E. Frank, J.H. Friedman, A statistical view of some chemometrics regression tools, Technometrics, 35 (1993), 109-135.

[3] P.H. Garthwaite, An interpretation of partial least squares, Journal of the American Statistical Association, 89 (1994), 122-127.

[4] T. Hastie, R. Tibshirani, J.H. Friedman, The Elements of Statistical Learning, Springer-Verlag, New York (2001).

[5] I.T. Jolliffe, Principal Component Analysis, Springer-Verlag, New York (1986).

[6] S. de Jong, SIMPLS: an alternative approach to partial least squares regression, Chemometrics and Intelligent Laboratory Systems, 18 (1993), 251-253.

[7] H. Martens, T. Naes, Multivariate Calibration, Wiley, New York (1989).

[8] D. Nguyen, D.M. Rocke, Tumor classification by partial least squares using microarray gene expression data, Bioinformatics, 18 (2002), 39-50.

[9] M. Stone, R.J. Brooks, Continuum regression: cross-validated sequentially constructed prediction embracing ordinary least squares, partial least squares and principal component regression, Journal of the Royal Statistical Society B, 52 (1990), 237-269.