New Routes from Minimal Approximation Error to Principal Components


Neural Process Lett (2008) 27

New Routes from Minimal Approximation Error to Principal Components

Abhilash Alexander Miranda · Yann-Aël Le Borgne · Gianluca Bontempi

Published online: 15 January 2008
© Springer Science+Business Media, LLC

Abstract  We introduce two new methods of deriving the classical PCA in the framework of minimizing the mean square error upon performing a lower-dimensional approximation of the data. These methods are based on two forms of the mean square error function. One of the novelties of the presented methods is that the commonly employed process of subtraction of the mean of the data becomes part of the solution of the optimization problem and not a pre-analysis heuristic. We also derive the optimal basis and the minimum error of approximation in this framework and demonstrate the elegance of our solution in comparison with a recent solution in the framework.

Keywords  Principal components analysis · Eigenvalue · Matrix trace

1 Introduction

The problem of approximating a given set of data using a weighted linear combination of a fewer number of vectors than the original dimensionality is classic. Many applications that require such a dimensionality reduction desire that the new representation retain the maximum variability in the data for further analysis. A popular method that attains simultaneous dimensionality reduction, minimum mean square error of approximation, and retainment of maximum variance of the original data representation in the new representation is called Principal Components Analysis (PCA) [7,11]. The most popular framework for deriving PCA starts with the analysis of variance. A very common derivation of PCA in this framework generates the basis by iteratively finding the orthogonal directions of maximum retained variances [7,10,11,14]. Since variance is implied in the statement of the problem here, the mean is subtracted from the data as a preliminary step. The second most predominant framework derives PCA by minimizing the

A. A. Miranda (B) · Y.-A. Le Borgne · G. Bontempi
Machine Learning Group, Département d'Informatique, Université Libre de Bruxelles, Boulevard du Triomphe CP212, 1050 Brussels, Belgium
e-mail: abalexa@ulb.ac.be

mean square error of approximation [1–3]. Aided by the derivation in the variance-based framework above, it has become acceptable to resort to mean subtraction of the data prior to any analysis in this framework too in order to keep the analysis simple. In this letter our focus is on the latter framework, within which we demonstrate two distinct and elegant analytical methods of deriving the PCA. In each of these methods of derivation, subtraction of the data mean becomes part of the solution instead of being an initial assumption. The letter is organized as follows: in Sect. 2 we describe the motivation behind the need for yet another derivation of the classical PCA. In particular, we highlight the issue of mean centering in Sect. 2.1. The notations are introduced in Sect. 2.2 and the PCA problem and its interpretations are discussed in Sect. 3. After reviewing a recent solution in Sect. 4, we make it evident in Sect. 5 that our two methods are due to two forms of the optimization function. Then we introduce these two methods of solving the PCA problem in Sects. 6 and 7 and arrive at a simple common form of the optimization function in both these methods. This is analyzed further in Sect. 8, where we show the relation of the variance to the optimal basis as well as the minimum approximation error attained in PCA. In Sect. 8.3, we revisit the recent solution in our framework of PCA introduced in Sect. 4 and equate it with our approach.

2 Motivation

There are many standard textbooks of multivariate and statistical analysis [10,11,14] detailing PCA as a technique that seeks the best approximation of a given set of data points using a linear combination of a set of vectors which retain maximum variance along their directions. Since this framework of PCA starts by finding the covariances, the mean has to be subtracted from the data and becomes the de facto origin of the new coordinate system. The subsequent analysis is simple: find the eigenvector corresponding to the largest eigenvalue of the covariance matrix as the first basis vector. Then find the second basis vector, on which the data components bear zero correlation with the data components on the first basis vector. This turns out to be the eigenvector corresponding to the second largest eigenvalue. In successively finding the basis vectors that have uncorrelated components as the eigenvectors of decreasing retained variances, the second order cross moments between the components are successively eliminated.¹ Computationally, a widely employed trick in this framework finds the eigenvectors using the singular value decomposition of the mean centered data matrix, which effectively diagonalizes the covariance matrix without actually computing it [11,14]. The set of orthogonal vectors corresponding to the largest few singular values, proportional to the variances, yields those directions which retain the maximum variance in the new representation of the data.

The second framework derives the PCA approximation by using its property of minimizing the mean square error. We think that this framework is more effective in introducing PCA to a novice because the two outcomes of optimal dimensionality reduction, viz. error minimization and retained variance maximization, are attained here simultaneously. Following the path of the retained variance maximization framework and to keep the analysis simple, many textbooks [2,9,10,20] advocate a mean subtraction for this framework too without sensible justification. Pearson stated in his own classical paper [18]: "The second moment of a system about a series of parallel lines is always least for the line going through the centroid. Hence: The best-fitting straight line for a system of points in a space of any order goes through the centroid of the system."

¹ Elimination of higher order cross moments is dealt with in Independent Components Analysis (ICA) [9].

A procedure equivalent to a rephrasing of this statement is followed in a much referenced textbook [3], which reasons that since the mean is the zero-dimensional hyperplane which satisfies the minimum average square error criterion, any higher dimensional hyperplane should be excused to pass through it too. In order to keep our analysis coherent with the concept of simultaneous dimensionality reduction, retained variance maximization and approximation error minimization, we do not invite the reader to such geometric intuitions. Note that the error minimization framework can also be viewed as a total least squares regression problem with all variables thought to be free, so that the task is to fit a lower dimensional hyperplane that minimizes the perpendicular distances from the data points to the hyperplane [21]. We will also be reviewing [1], who derives PCA in the same framework as that of ours. Unlike in their approach, we neither undertake a complete orthogonal decomposition nor force any basis vectors to bear a common statistic enticed by the prospect of an eventual mean subtraction. Also, for the benefit of practitioners who would like to deal with data as realizations of a random variable, our treatment in the data samples domain can be readily extended to a population domain.

2.1 To Mean Center or Not

In the framework of finding the basis of a lower dimensional space which minimizes the mean square error of approximation, the process of mean subtraction has so far been part of the heuristics that the data needs to be centered before installing the new low-dimensional coordinate system, motivated by the philosophy according to [18] that, had the mean of the data not been subtracted, the best fitting hyperplane would pass through the origin and not through the centroid. But there exist situations where a hyperplane is merely expected to partition the data space into orthogonal subspaces, and as a result subtraction of the mean is not desired. Note that in such situations the term principal component does not strictly hold, as the basis vectors for the new space are not obtained from the data covariance matrix and the main concern there is the decomposition of the data rather than its approximation. One such set of situations is addressed by the Fukunaga–Koontz Transform [5,16], which works by not requiring a subtraction of the mean but instead finds the principal components of the autocorrelation matrices of two classes of data. It is widely used in automatic target recognition, where eigenvalue decomposition generates a basis for a target space orthogonal to the clutter space. But such is the issue of mean subtraction in using this transform that the researchers of [12] and [8] use autocorrelation and covariance matrices, respectively, for the same task without a justification of the impact of their choice to mean center or not. A similar approach called Eigenspace Separation Transformation [19], aimed at classification, also does not involve mean subtraction. A family of techniques called Orthogonal Subspace Projection, widely applied in noise rejection of signals, uses data that are not mean centered for the generalized PCA that follows [6]. Although the theory of PCA demands mean subtraction for optimal low dimensional approximation, for many applications it is not without consequence. For example, the researchers of ecology and climate studies have extensively debated the purpose and result of mean centering for their PCA-based data analysis. In [17], the characteristics and apparent advantages of the principal components generated without mean subtraction are compared for data sampled homogeneously in the original space or otherwise. The claim made therein is that if data form distinct clusters, the influence of variance within a cluster on another can be minimized by not subtracting the mean. Another ongoing debate named the "Hockey Stick" controversy [15] involves the appropriateness of mean subtraction for PCA in a much cited global warming study [13].

It should be borne in mind that this letter is neither solely about the aforementioned issue of mean centering that researchers using PCA often take for granted, nor does it change the results of PCA previously known to them. But we demonstrate in a new comprehensive framework that (i) the mean subtraction becomes a solution to the optimization problem in PCA, and we reach this solution through two simple distinct methods that borrow little from traditional textbook derivations of PCA, and (ii) the derivation of the basis for the low dimensional space converges to minimum approximation error and maximum retained variance in the framework. Consequently, we believe that many problems which raise questions about their choice regarding mean subtraction can be revisited with ease using our proposed PCA framework.

2.2 Notations

The notations that will be used throughout this letter are summarized in the table below:

J_q : error function
q : new dimensionality
p : original dimensionality
n : number of samples
x_k ∈ R^p : k-th data sample
x̂_k ∈ R^p : approximation of x_k
θ ∈ R^p : new general origin
x̄_k = x_k − θ ∈ R^p
e_i ∈ R^p : i-th orthonormal basis vector of R^p
W = [e_1 ⋯ e_q] ∈ R^{p×q}
B = I − W W^T ∈ R^{p×p}
W̄ = [e_{q+1} ⋯ e_p] ∈ R^{p×(p−q)}
z_k ∈ R^q : dependent on x_k
b ∈ R^{p−q} : a constant
Tr(A) : trace of the matrix A
rank(A) : rank of the matrix A
μ ∈ R^p : sample mean
S ∈ R^{p×p} : sample covariance matrix
λ_i : i-th largest eigenvalue of S
r = rank(S)

3 Problem Definition in the Sample Domain

Let x_k ∈ R^p, k = 1, …, n, be a given set of data points. Suppose we are interested in orthonormal vectors e_i ∈ R^p, i = 1, …, q ≤ p, whose resultant of weighted linear combination x̂_k ∈ R^p can approximate x_k with a minimum average (sample mean) square error, or in other words minimize

J_q(\hat{x}_k) = \frac{1}{n} \sum_{k=1}^{n} \| x_k - \hat{x}_k \|^2.    (1)

The problem stated above means that we need an approximation x_k ≈ x̂_k such that

\hat{x}_k = \sum_{i=1}^{q} \left( e_i^T x_k \right) e_i    (2)

so that we attain the minimum for J_q. This approximation assumes that the origin of all orthonormal e_i is the same as that of the coordinate system in which the data is defined.
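As a concrete illustration of (1) and (2), the following minimal NumPy sketch (our own, not the paper's; the variable names and the synthetic data are assumptions) projects n samples onto q orthonormal vectors anchored at the coordinate origin and evaluates the mean square error:

```python
import numpy as np

# Illustrative sketch of (1)-(2): approximate each sample x_k by its
# projection onto q orthonormal vectors e_i through the origin.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) + 3.0          # n=100 samples in R^p (p=5), nonzero mean
n, p = X.shape
q = 2

# Any orthonormal e_1..e_q illustrates the form of (2); here we take the
# first q columns of a random orthogonal matrix obtained via QR.
Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
W = Q[:, :q]                                  # W = [e_1 ... e_q], shape (p, q)

X_hat = X @ W @ W.T                           # row-wise version of (2)
J_q = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # error function (1)
print(J_q)
```

Because the data here has a nonzero mean, this origin-anchored form already hints at why a general origin θ, introduced next, can lower the error.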
We assume orthonormality here because (i) orthogonality guarantees linearly independent e_i so

that they form a basis for R^q, and (ii) normalizing e_i maintains notational simplicity in not having to divide the scalars e_i^T x_k in (2) by the norm ‖e_i‖, which is unity due to our assumption. We reformulate the approximation as

\hat{x}_k = \theta + \sum_{i=1}^{q} \left( e_i^T (x_k - \theta) \right) e_i    (3)

to assume that the new representation using basis vectors e_i has a general origin θ ∈ R^p and not the origin as in the approximation (2). Hence, the PCA problem may be defined as

\{ e_i, \theta \} = \operatorname*{argmin}_{e_i,\, \theta} \frac{1}{n} \sum_{k=1}^{n} \| x_k - \hat{x}_k \|^2 \quad : \quad e_i^T e_j = 0, \; i \neq j; \quad e_i^T e_i = 1 \quad \forall\, i, j,    (4)

which seeks a set of orthonormal basis vectors e_i with a new origin θ which minimizes the error function in (1) in order to find a low-dimensional approximation W^T(x_k − θ) ∈ R^q for any x_k ∈ R^p, where

W = [ e_1 \cdots e_q ].    (5)

It is now easy to see that (3) becomes

\hat{x}_k = \theta + W W^T (x_k - \theta).    (6)

Hence the displacement vector directed from the approximation x̂_k towards x_k is x_k − x̂_k = (x_k − θ) − WW^T(x_k − θ), which using x̄_k = x_k − θ can be written concisely as x_k − x̂_k = x̄_k − WW^T x̄_k. By setting B = I − WW^T for simplicity of notation, we write the displacement vector as

x_k - \hat{x}_k = B \bar{x}_k.    (7)

4 Review of a Recent Solution

The most recent PCA solution in the framework of approximation error minimization, derived in [1], is reviewed here. They derive PCA by undertaking a complete decomposition

\hat{x}_k = W z_k + \bar{W} b    (8)

into basis vectors contained in the columns of the matrix W of (5) and W̄ = [e_{q+1} ⋯ e_p] ∈ R^{p×(p−q)}, such that the components of z_k ∈ R^q depend on x_k, whereas the components of b ∈ R^{p−q} are constants common to all data points. By taking the derivative of the error function with respect to b, they find that

b = \bar{W}^T \mu    (9)

so that the common components are those of the sample mean vector μ. This implies that by subtracting the sample mean they are no longer obliged to retain the p − q dimensions corresponding to the columns of W̄, which preserve little information regarding the variation in the data. The first drawback of this approach is that it couples the process of dimensionality reduction with mean subtraction, although the two will be shown to be independent in our derivation. By taking the derivative of the error function with respect to z_k, they also show that z_k = W^T x_k. Hence the approximation they are seeking is

\hat{x}_k = W W^T x_k + \bar{W} \bar{W}^T \mu.    (10)
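The decomposition (8)–(10) can be checked numerically. In this sketch (ours; all names and the synthetic data are illustrative assumptions), the discarded p − q coordinates are replaced by the constants b = W̄^T μ of (9), which turns out to be the same as anchoring the rank-q projection at the mean, the equivalence established later in Sect. 8.3:

```python
import numpy as np

# Sketch of the reviewed decomposition (8)-(10): keep data-dependent
# coordinates z_k = W^T x_k on the first q basis vectors and replace the
# remaining p-q coordinates by the constants b = W_bar^T mu of (9).
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4)) + 2.0           # n=200 samples in R^4
mu = X.mean(axis=0)

Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # a full orthonormal basis of R^p
q = 2
W, W_bar = Q[:, :q], Q[:, q:]

b = W_bar.T @ mu                              # (9): common components from the mean
X_hat = X @ W @ W.T + W_bar @ b               # (10), applied row-wise
# Same as shifting only the origin: WW^T x_k + (I - WW^T) mu, cf. Sect. 8.3
X_hat_theta = X @ W @ W.T + (np.eye(4) - W @ W.T) @ mu
print(np.allclose(X_hat, X_hat_theta))
```

The agreement holds because W̄W̄^T = I − WW^T for any complete orthonormal basis, which is exactly the coupling between dimensionality reduction and mean subtraction criticized above.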

The second drawback of their approach is the requirement of yet another constrained minimization of the error function before they reach the solution for the optimal columns of W.

5 Methods of PCA

We have discussed the need for a new derivation of PCA by (i) explaining the lack of proper justification in the literature for subtracting the mean in a minimum mean square error framework, (ii) reminding of its chronic necessity for the benefit of many applications in Sect. 2, and (iii) reviewing a recent attempt to solve this problem in Sect. 4. Our derivations of the solution for the problem in (4) are due to two simple forms of the error function J_q of (1), which we state as follows:

Form 1:   J_q(\hat{x}_k) = \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x}_k)^T (x_k - \hat{x}_k)    (11)

Form 2:   J_q(\hat{x}_k) = \operatorname{Tr} \left( \frac{1}{n} \sum_{k=1}^{n} (x_k - \hat{x}_k)(x_k - \hat{x}_k)^T \right)    (12)

We analyze Form 1 in (11) in Sect. 6 to arrive at a simplified J_q which is exactly the same as we get by following a different method of analyzing Form 2 in (12) in Sect. 7. These two methods pursue different paths towards the common error function, viz., the first using straightforward expansion of the terms in J_q and the second using the property of the matrix trace. The common form of J_q is subsequently treated in Sect. 8 to reveal the rest of the solution to our original problem.

6 Analysis of Form 1 of the Error Function

Using (7), the error function J_q of Form 1 in (11) can be developed as

J_q(B, \theta) = \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k^T B^T B \bar{x}_k.    (13)

The property that B = I − WW^T is idempotent and symmetric, i.e.,

B = B^2 = B^T,    (14)

or, B is simply an orthogonal projector, may be used to reduce J_q further as

J_q(B, \theta) = \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k^T B \bar{x}_k.    (15)

Expanding J_q above using x̄_k = x_k − θ gives

J_q(B, \theta) = \frac{1}{n} \sum_{k=1}^{n} \left[ x_k^T B x_k - 2 \theta^T B x_k + \theta^T B \theta \right].    (16)
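Both forms can also be verified numerically. The sketch below (our own; data and names are assumptions, with the origin already placed at the sample mean, which Sects. 6 and 7 show to be optimal) confirms that (11) and (12) agree, and, anticipating Sect. 8, that choosing W from the principal eigenvectors of S leaves only the discarded eigenvalues as error:

```python
import numpy as np

# Numerical sketch: Form 1 (11) and Form 2 (12) of J_q coincide, and with W
# built from the q principal eigenvectors of S the error drops to the sum of
# the remaining eigenvalues.
rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5)) * np.array([3.0, 2.0, 1.0, 0.5, 0.1]) + 7.0
n, p = X.shape
q = 2
mu = X.mean(axis=0)
S = (X - mu).T @ (X - mu) / n                # sample covariance matrix

Q, _ = np.linalg.qr(rng.normal(size=(p, p)))
W = Q[:, :q]                                 # an arbitrary orthonormal W
X_hat = (X - mu) @ W @ W.T + mu              # approximation (6) with theta = mu
E = X - X_hat                                # displacement vectors (7)
form1 = np.mean(np.sum(E * E, axis=1))       # Form 1, eq. (11)
form2 = np.trace(E.T @ E / n)                # Form 2, eq. (12)

lam, V = np.linalg.eigh(S)                   # eigenvalues in ascending order
W_opt = V[:, ::-1][:, :q]                    # the q principal eigenvectors
J_min = np.trace(S) - np.trace(W_opt.T @ S @ W_opt)
print(np.isclose(form1, form2), np.isclose(J_min, np.sort(lam)[:p - q].sum()))
```

Any other orthonormal W gives `form1 >= J_min`, which is the optimality statement the next sections derive analytically.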

In order to get the θ which minimizes J_q, we find the partial derivative ∂J_q/∂θ = −(2/n) B Σ_{k=1}^{n} (x_k − θ), and setting it to zero results in

\theta = \frac{1}{n} \sum_{k=1}^{n} x_k = \mu,    (17)

which is as simple as regarding the sample mean of the data points as the new origin. Henceforth, we can assume that x̄_k is the data point x_k from which the sample mean has been subtracted.

6.1 Simplifying the Error Function

We may analyze the error function in (15) as follows:

J_q(W) = \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k^T (I - W W^T) \bar{x}_k
       = \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k^T \bar{x}_k - \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k^T W W^T \bar{x}_k
       = \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k^T \bar{x}_k - \operatorname{Tr} \left( W^T \left[ \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k \bar{x}_k^T \right] W \right).

We have the sample covariance matrix

S = \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k \bar{x}_k^T \Big|_{\theta = \mu}    (18)

so that the term \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k^T \bar{x}_k \big|_{\theta=\mu} equals Tr(S), and we can write

J_q(W) = \operatorname{Tr}(S) - \operatorname{Tr}(W^T S W).    (19)

7 Analysis of Form 2 of the Error Function

We now analyze Form 2 of the error function J_q by substituting (7) in (12) as

J_q(B, \theta) = \operatorname{Tr} \left( B \left[ \frac{1}{n} \sum_{k=1}^{n} \bar{x}_k \bar{x}_k^T \right] B^T \right).    (20)

7.1 Finding θ

As in the previous section, we denote the sample mean and sample covariance matrix by μ and S, respectively, and we may develop the bracketed term in (20):

\frac{1}{n} \sum_{k=1}^{n} \bar{x}_k \bar{x}_k^T = \frac{1}{n} \sum_{k=1}^{n} (x_k - \theta)(x_k - \theta)^T
= \frac{1}{n} \sum_{k=1}^{n} \left[ x_k x_k^T - x_k \theta^T - \theta x_k^T + \theta \theta^T \right]
= S + \mu \mu^T - \mu \theta^T - \theta \mu^T + \theta \theta^T,    (21)

where we have used the sample autocorrelation matrix [4] given by \frac{1}{n} \sum_{k=1}^{n} x_k x_k^T = S + \mu \mu^T. We get J_q(B) = Tr( B ( S + μμ^T − μθ^T − θμ^T + θθ^T ) B^T ) upon substituting (21) in (20). Using (14) and the cyclic permutation property of the trace of matrix products,² we get

J_q(B) = \operatorname{Tr} \left( B \left( S + \mu \mu^T - \mu \theta^T - \theta \mu^T + \theta \theta^T \right) \right)    (22)

and using the property of the derivative of the trace³ and the chain rule of derivatives,⁴ we find that ∂J_q/∂θ = 2B(−μ + θ), which when equated to zero results in

\theta = \mu,    (23)

leading to the same solution as that of Form 1 in (17).

7.2 Simplifying the Error Function

Having found θ, we can substitute it in (22) to get J_q(B) = Tr(BS). On substitution for B in terms of W, we may write J_q(W) = Tr(S) − Tr(WW^T S). Utilizing the cyclic permutation property of the matrix trace again, we get

J_q(W) = \operatorname{Tr}(S) - \operatorname{Tr}(W^T S W).    (24)

8 Optimal Basis and Minimum Error

Note that we have arrived at the same equation in both (19) and (24) of Form 1 and Form 2, respectively, whereby substituting W as defined in (5) in either of them gives

J_q(e_i) = \operatorname{Tr}(S) - \sum_{i=1}^{q} e_i^T S e_i.    (25)

8.1 Relation of Variance to Optimal Basis

Let us now find the variance λ_i of the data projected on the basis vector e_i. It is the average of the square of the difference between the projections e_i^T x_k of the data points and the projection

² Tr(ΦΥΨ) = Tr(ΨΦΥ) = Tr(ΥΨΦ).
³ ∂ Tr(ΦΨ)/∂Φ = Ψ^T.
⁴ ∂(·)/∂u = [∂(·)/∂(u v^T)] v.

e_i^T μ of the sample mean, i.e.,

\lambda_i = \frac{1}{n} \sum_{k=1}^{n} \left( e_i^T x_k - e_i^T \mu \right)^2
= \frac{1}{n} \sum_{k=1}^{n} \left( e_i^T x_k - e_i^T \mu \right) \left( e_i^T x_k - e_i^T \mu \right)^T
= e_i^T \left[ \frac{1}{n} \sum_{k=1}^{n} (x_k - \mu)(x_k - \mu)^T \right] e_i = e_i^T S e_i.    (26)

Thus, the term \sum_{i=1}^{q} e_i^T S e_i in (25) gives the portion of the total variance Tr(S) retained along the directions of the orthonormal e_i. Hence, we are looking for vectors e_i of the form λ_i = e_i^T (S e_i), which is satisfied if S e_i = λ_i e_i. Such a relation implies that (e_i, λ_i) form an eigen-pair of S. Note that since there is no unique basis for any nontrivial vector space, any basis that spans the q-dimensional space generated by the eigenvectors of S is a solution for the e_i too. In (25), since

\operatorname*{argmin}_{e_i} J_q = \operatorname*{argmax}_{e_i} \sum_{i=1}^{q} e_i^T S e_i,    (27)

the vectors e_i have to be the eigenvectors corresponding to the q largest ("principal") eigenvalues of S. This is the classical result of the PCA.

8.2 Relation of Variance to Minimum Approximation Error

It follows from (26) that the term \sum_{i=1}^{q} e_i^T S e_i = \sum_{i=1}^{q} \lambda_i of (25) is the sum of the q principal eigenvalues of S; this is the maximum variance that could be retained upon approximation using any q basis vectors. Also, Tr(S) = \sum_{i=1}^{r} \lambda_i, r = rank(S), is the total variance in the data. Substituting these in J_q in (25) gives the difference of the total variance and the maximum retained variance; the result is the minimum of the eliminated variance. Hence, for λ_i ≥ λ_j, j > i, the minimum mean square approximation error can be expressed as

J_q = \underbrace{\sum_{i=1}^{r} \lambda_i}_{\text{total variance}} - \underbrace{\sum_{i=1}^{q} \lambda_i}_{\text{retained variance}} = \underbrace{\sum_{i=q+1}^{r} \lambda_i}_{\text{eliminated variance}}.    (28)

8.3 Comparison of the Reviewed Solution with the Present Work

In order to compare the solution of [1] reviewed in Sect. 4, let us first write the approximation in (6) as x̂_k = WW^T x_k + Bθ. We know from (17) and (23) that θ = μ and, hence,

\hat{x}_k = W W^T x_k + B \mu.    (29)

Since W̄ W̄^T = B, the approximation according to [1] in (10) of Sect. 4 is equivalent to the approximation in (29). While the drawbacks of the approach of [1] highlighted in Sect. 4 exist, let us outline the difference in these two approaches: we have demonstrated in the proposed framework that the new origin θ ∈ R^p of the low dimensional coordinate system should be the mean μ ∈ R^p so that the

error of the approximation is reduced. But [1] necessitates an orthogonal projection of certain data-independent components b ∈ R^{p−q} onto μ ∈ R^p to achieve the same objective. The framework presented in this letter has shown that such a dimensionality reduction coupled with mean subtraction is unnecessary for deriving PCA.

8.4 Population PCA

For population PCA [10,14], where the samples that form the data are assumed to be realizations of a random variable, we have made it easy for the reader to follow our analysis by just replacing all occurrences of \frac{1}{n} \sum_{k=1}^{n} by E, the expectation operator, and by using bold faces for random variables, as in x_k → x, x̂_k → x̂, and x̄_k → x̄.

9 Conclusion

Motivated by the need to justify the heuristics of pre-analysis mean centering in PCA and related questions, we have demonstrated through two distinct methods that the mean subtraction becomes part of the solution of the standard PCA problem in an approximation error minimization framework. We believe that the framework, in which we have compared a recent solution with ours, is more effective in justifying mean subtraction in PCA. Also, we have shown that the framework is comprehensive in the sense that the two outcomes of optimal dimensionality reduction, viz. approximation error minimization and retained variance maximization, are attained here simultaneously.

Acknowledgments  This work was funded by the Project TANIA (WALEO II) of the Walloon Region, Belgium. The authors thank their colleague Olivier Caelen for his appreciable comments. Thanks are also due to Dr. P. P. Mohanlal of ISRO Inertial Systems Unit, India for his valuable insights. The authors are very grateful to the Editor-in-Chief and three anonymous reviewers for their excellent suggestions on an earlier version of this letter.

References

1. Bishop CM (2006) Pattern recognition and machine learning. Information science and statistics. Springer, New York
2. Diamantaras KI, Kung SY (1996) Principal component neural networks: theory and applications. John Wiley, New York
3. Duda RO, Hart PE, Stork DG (2001) Pattern classification, 2nd ed. Wiley Interscience, New York
4. Fukunaga K (1990) Introduction to statistical pattern recognition, 2nd ed. Computer science and scientific computing. Academic Press, San Diego
5. Fukunaga K, Koontz WLG (1970) Application of the Karhunen–Loève expansion to feature selection and ordering. IEEE Trans Comput C-19(4)
6. Harsanyi JC, Chang C-I (1994) Hyperspectral image classification and dimensionality reduction: an orthogonal subspace projection approach. IEEE Trans Geosci Remote Sens 32(4)
7. Hotelling H (1933) Analysis of a complex of statistical variables into principal components. J Educ Psychol 24:417–441
8. Huo X, Elad M, Flesia AG, Muise B, Stanfill R, Mahalanobis A et al (2003) Optimal reduced-rank quadratic classifiers using the Fukunaga–Koontz transform with applications to automated target recognition. Proc SPIE 5094
9. Hyvarinen A, Karhunen J, Oja E (2001) Independent component analysis. Adaptive and learning systems for signal processing, communications and control. Wiley-Interscience, New York
10. Johnson RA, Wichern DW (1992) Applied multivariate statistical analysis, 3rd ed. Prentice-Hall, Inc., Upper Saddle River
11. Jolliffe IT (2002) Principal component analysis, 2nd ed. Springer, New York

12. Mahalanobis A, Muise RR, Stanfill SR, Van Nevel A (2004) Design and application of quadratic correlation filters for target detection. IEEE Trans Aerosp Electron Syst 40(3)
13. Mann ME, Bradley RS, Hughes MK (1998) Global-scale temperature patterns and climate forcing over the past six centuries. Nature 392:779–787
14. Mardia K, Kent J, Bibby J (1979) Multivariate analysis. Academic Press, London
15. McIntyre S, McKitrick R (2005) Reply to comment by Huybers on "Hockey sticks, principal components, and spurious significance". Geophys Res Lett 32
16. Miranda AA, Whelan PF (2005) Fukunaga–Koontz transform for small sample size problems. In: Proceedings of the IEE Irish signals and systems conference, pp 156–161, Dublin
17. Noy-Meir I (1973) Data transformations in ecological ordination: I. Some advantages of non-centering. J Ecol 61(2)
18. Pearson K (1901) On lines and planes of closest fit to systems of points in space. Philos Mag 2:559–572
19. Plett GL, Doi T, Torrieri D (1997) Mine detection using scattering parameters and an artificial neural network. IEEE Trans Neural Netw 8(6)
20. Ripley BD (1996) Pattern recognition and neural networks. Cambridge University Press, Cambridge
21. Van Huffel S (ed) (1997) Recent advances in total least squares techniques and errors-in-variables modeling. SIAM, Philadelphia


More information

6.3 Testing Series With Positive Terms

6.3 Testing Series With Positive Terms 6.3. TESTING SERIES WITH POSITIVE TERMS 307 6.3 Testig Series With Positive Terms 6.3. Review of what is kow up to ow I theory, testig a series a i for covergece amouts to fidig the i= sequece of partial

More information

U8L1: Sec Equations of Lines in R 2

U8L1: Sec Equations of Lines in R 2 MCVU U8L: Sec. 8.9. Equatios of Lies i R Review of Equatios of a Straight Lie (-D) Cosider the lie passig through A (-,) with slope, as show i the diagram below. I poit slope form, the equatio of the lie

More information

1 Duality revisited. AM 221: Advanced Optimization Spring 2016

1 Duality revisited. AM 221: Advanced Optimization Spring 2016 AM 22: Advaced Optimizatio Sprig 206 Prof. Yaro Siger Sectio 7 Wedesday, Mar. 9th Duality revisited I this sectio, we will give a slightly differet perspective o duality. optimizatio program: f(x) x R

More information

4. Hypothesis testing (Hotelling s T 2 -statistic)

4. Hypothesis testing (Hotelling s T 2 -statistic) 4. Hypothesis testig (Hotellig s T -statistic) Cosider the test of hypothesis H 0 : = 0 H A = 6= 0 4. The Uio-Itersectio Priciple W accept the hypothesis H 0 as valid if ad oly if H 0 (a) : a T = a T 0

More information

A Risk Comparison of Ordinary Least Squares vs Ridge Regression

A Risk Comparison of Ordinary Least Squares vs Ridge Regression Joural of Machie Learig Research 14 (2013) 1505-1511 Submitted 5/12; Revised 3/13; Published 6/13 A Risk Compariso of Ordiary Least Squares vs Ridge Regressio Paramveer S. Dhillo Departmet of Computer

More information

Linear Classifiers III

Linear Classifiers III Uiversität Potsdam Istitut für Iformatik Lehrstuhl Maschielles Lere Liear Classifiers III Blaie Nelso, Tobias Scheffer Cotets Classificatio Problem Bayesia Classifier Decisio Liear Classifiers, MAP Models

More information

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss

ECE 901 Lecture 12: Complexity Regularization and the Squared Loss ECE 90 Lecture : Complexity Regularizatio ad the Squared Loss R. Nowak 5/7/009 I the previous lectures we made use of the Cheroff/Hoeffdig bouds for our aalysis of classifier errors. Hoeffdig s iequality

More information

3. Z Transform. Recall that the Fourier transform (FT) of a DT signal xn [ ] is ( ) [ ] = In order for the FT to exist in the finite magnitude sense,

3. Z Transform. Recall that the Fourier transform (FT) of a DT signal xn [ ] is ( ) [ ] = In order for the FT to exist in the finite magnitude sense, 3. Z Trasform Referece: Etire Chapter 3 of text. Recall that the Fourier trasform (FT) of a DT sigal x [ ] is ω ( ) [ ] X e = j jω k = xe I order for the FT to exist i the fiite magitude sese, S = x [

More information

Chapter Vectors

Chapter Vectors Chapter 4. Vectors fter readig this chapter you should be able to:. defie a vector. add ad subtract vectors. fid liear combiatios of vectors ad their relatioship to a set of equatios 4. explai what it

More information

Symmetric Matrices and Quadratic Forms

Symmetric Matrices and Quadratic Forms 7 Symmetric Matrices ad Quadratic Forms 7.1 DIAGONALIZAION OF SYMMERIC MARICES SYMMERIC MARIX A symmetric matrix is a matrix A such that. A = A Such a matrix is ecessarily square. Its mai diagoal etries

More information

September 2012 C1 Note. C1 Notes (Edexcel) Copyright - For AS, A2 notes and IGCSE / GCSE worksheets 1

September 2012 C1 Note. C1 Notes (Edexcel) Copyright   - For AS, A2 notes and IGCSE / GCSE worksheets 1 September 0 s (Edecel) Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright www.pgmaths.co.uk - For AS, A otes ad IGCSE / GCSE worksheets September 0 Copyright

More information

1 Adiabatic and diabatic representations

1 Adiabatic and diabatic representations 1 Adiabatic ad diabatic represetatios 1.1 Bor-Oppeheimer approximatio The time-idepedet Schrödiger equatio for both electroic ad uclear degrees of freedom is Ĥ Ψ(r, R) = E Ψ(r, R), (1) where the full molecular

More information

Axis Aligned Ellipsoid

Axis Aligned Ellipsoid Machie Learig for Data Sciece CS 4786) Lecture 6,7 & 8: Ellipsoidal Clusterig, Gaussia Mixture Models ad Geeral Mixture Models The text i black outlies high level ideas. The text i blue provides simple

More information

Matrix Representation of Data in Experiment

Matrix Representation of Data in Experiment Matrix Represetatio of Data i Experimet Cosider a very simple model for resposes y ij : y ij i ij, i 1,; j 1,,..., (ote that for simplicity we are assumig the two () groups are of equal sample size ) Y

More information

Infinite Sequences and Series

Infinite Sequences and Series Chapter 6 Ifiite Sequeces ad Series 6.1 Ifiite Sequeces 6.1.1 Elemetary Cocepts Simply speakig, a sequece is a ordered list of umbers writte: {a 1, a 2, a 3,...a, a +1,...} where the elemets a i represet

More information

Vector Quantization: a Limiting Case of EM

Vector Quantization: a Limiting Case of EM . Itroductio & defiitios Assume that you are give a data set X = { x j }, j { 2,,, }, of d -dimesioal vectors. The vector quatizatio (VQ) problem requires that we fid a set of prototype vectors Z = { z

More information

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach

Lecture 7: Density Estimation: k-nearest Neighbor and Basis Approach STAT 425: Itroductio to Noparametric Statistics Witer 28 Lecture 7: Desity Estimatio: k-nearest Neighbor ad Basis Approach Istructor: Ye-Chi Che Referece: Sectio 8.4 of All of Noparametric Statistics.

More information

4.1 Sigma Notation and Riemann Sums

4.1 Sigma Notation and Riemann Sums 0 the itegral. Sigma Notatio ad Riema Sums Oe strategy for calculatig the area of a regio is to cut the regio ito simple shapes, calculate the area of each simple shape, ad the add these smaller areas

More information

Chandrasekhar Type Algorithms. for the Riccati Equation of Lainiotis Filter

Chandrasekhar Type Algorithms. for the Riccati Equation of Lainiotis Filter Cotemporary Egieerig Scieces, Vol. 3, 00, o. 4, 9-00 Chadrasekhar ype Algorithms for the Riccati Equatio of Laiiotis Filter Nicholas Assimakis Departmet of Electroics echological Educatioal Istitute of

More information

Variable selection in principal components analysis of qualitative data using the accelerated ALS algorithm

Variable selection in principal components analysis of qualitative data using the accelerated ALS algorithm Variable selectio i pricipal compoets aalysis of qualitative data usig the accelerated ALS algorithm Masahiro Kuroda Yuichi Mori Masaya Iizuka Michio Sakakihara (Okayama Uiversity of Sciece) (Okayama Uiversity

More information

Factor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis

Factor Analysis. Lecture 10: Factor Analysis and Principal Component Analysis. Sam Roweis Lecture 10: Factor Aalysis ad Pricipal Compoet Aalysis Sam Roweis February 9, 2004 Whe we assume that the subspace is liear ad that the uderlyig latet variable has a Gaussia distributio we get a model

More information

Achieving Stationary Distributions in Markov Chains. Monday, November 17, 2008 Rice University

Achieving Stationary Distributions in Markov Chains. Monday, November 17, 2008 Rice University Istructor: Achievig Statioary Distributios i Markov Chais Moday, November 1, 008 Rice Uiversity Dr. Volka Cevher STAT 1 / ELEC 9: Graphical Models Scribe: Rya E. Guerra, Tahira N. Saleem, Terrace D. Savitsky

More information

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES.

ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ALGEBRAIC GEOMETRY COURSE NOTES, LECTURE 5: SINGULARITIES. ANDREW SALCH 1. The Jacobia criterio for osigularity. You have probably oticed by ow that some poits o varieties are smooth i a sese somethig

More information

Correlation Regression

Correlation Regression Correlatio Regressio While correlatio methods measure the stregth of a liear relatioship betwee two variables, we might wish to go a little further: How much does oe variable chage for a give chage i aother

More information

Dimensionality Reduction vs. Clustering

Dimensionality Reduction vs. Clustering Dimesioality Reductio vs. Clusterig Lecture 9: Cotiuous Latet Variable Models Sam Roweis Traiig such factor models (e.g. FA, PCA, ICA) is called dimesioality reductio. You ca thik of this as (o)liear regressio

More information

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n

We are mainly going to be concerned with power series in x, such as. (x)} converges - that is, lims N n Review of Power Series, Power Series Solutios A power series i x - a is a ifiite series of the form c (x a) =c +c (x a)+(x a) +... We also call this a power series cetered at a. Ex. (x+) is cetered at

More information

Estimation of Population Mean Using Co-Efficient of Variation and Median of an Auxiliary Variable

Estimation of Population Mean Using Co-Efficient of Variation and Median of an Auxiliary Variable Iteratioal Joural of Probability ad Statistics 01, 1(4: 111-118 DOI: 10.593/j.ijps.010104.04 Estimatio of Populatio Mea Usig Co-Efficiet of Variatio ad Media of a Auxiliary Variable J. Subramai *, G. Kumarapadiya

More information

On Nonsingularity of Saddle Point Matrices. with Vectors of Ones

On Nonsingularity of Saddle Point Matrices. with Vectors of Ones Iteratioal Joural of Algebra, Vol. 2, 2008, o. 4, 197-204 O Nosigularity of Saddle Poit Matrices with Vectors of Oes Tadeusz Ostrowski Istitute of Maagemet The State Vocatioal Uiversity -400 Gorzów, Polad

More information

Lecture 12: February 28

Lecture 12: February 28 10-716: Advaced Machie Learig Sprig 2019 Lecture 12: February 28 Lecturer: Pradeep Ravikumar Scribes: Jacob Tyo, Rishub Jai, Ojash Neopae Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer:

More information

Lecture 8: October 20, Applications of SVD: least squares approximation

Lecture 8: October 20, Applications of SVD: least squares approximation Mathematical Toolkit Autum 2016 Lecturer: Madhur Tulsiai Lecture 8: October 20, 2016 1 Applicatios of SVD: least squares approximatio We discuss aother applicatio of sigular value decompositio (SVD) of

More information

Riesz-Fischer Sequences and Lower Frame Bounds

Riesz-Fischer Sequences and Lower Frame Bounds Zeitschrift für Aalysis ud ihre Aweduge Joural for Aalysis ad its Applicatios Volume 1 (00), No., 305 314 Riesz-Fischer Sequeces ad Lower Frame Bouds P. Casazza, O. Christese, S. Li ad A. Lider Abstract.

More information

Machine Learning for Data Science (CS4786) Lecture 9

Machine Learning for Data Science (CS4786) Lecture 9 Machie Learig for Data Sciece (CS4786) Lecture 9 Pricipal Compoet Aalysis Course Webpage : http://www.cs.corell.eu/courses/cs4786/207fa/ DIM REDUCTION: LINEAR TRANSFORMATION x > y > Pick a low imesioal

More information

(3) If you replace row i of A by its sum with a multiple of another row, then the determinant is unchanged! Expand across the i th row:

(3) If you replace row i of A by its sum with a multiple of another row, then the determinant is unchanged! Expand across the i th row: Math 5-4 Tue Feb 4 Cotiue with sectio 36 Determiats The effective way to compute determiats for larger-sized matrices without lots of zeroes is to ot use the defiitio, but rather to use the followig facts,

More information

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution

Discrete-Time Systems, LTI Systems, and Discrete-Time Convolution EEL5: Discrete-Time Sigals ad Systems. Itroductio I this set of otes, we begi our mathematical treatmet of discrete-time s. As show i Figure, a discrete-time operates or trasforms some iput sequece x [

More information

Stochastic Matrices in a Finite Field

Stochastic Matrices in a Finite Field Stochastic Matrices i a Fiite Field Abstract: I this project we will explore the properties of stochastic matrices i both the real ad the fiite fields. We first explore what properties 2 2 stochastic matrices

More information

LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK)

LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK) LECTURE 8: ORTHOGONALITY (CHAPTER 5 IN THE BOOK) Everythig marked by is ot required by the course syllabus I this lecture, all vector spaces is over the real umber R. All vectors i R is viewed as a colum

More information

Inverse Matrix. A meaning that matrix B is an inverse of matrix A.

Inverse Matrix. A meaning that matrix B is an inverse of matrix A. Iverse Matrix Two square matrices A ad B of dimesios are called iverses to oe aother if the followig holds, AB BA I (11) The otio is dual but we ofte write 1 B A meaig that matrix B is a iverse of matrix

More information

Singular value decomposition. Mathématiques appliquées (MATH0504-1) B. Dewals, Ch. Geuzaine

Singular value decomposition. Mathématiques appliquées (MATH0504-1) B. Dewals, Ch. Geuzaine Lecture 11 Sigular value decompositio Mathématiques appliquées (MATH0504-1) B. Dewals, Ch. Geuzaie V1.2 07/12/2018 1 Sigular value decompositio (SVD) at a glace Motivatio: the image of the uit sphere S

More information

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3

Machine Learning Theory Tübingen University, WS 2016/2017 Lecture 3 Machie Learig Theory Tübige Uiversity, WS 06/07 Lecture 3 Tolstikhi Ilya Abstract I this lecture we will prove the VC-boud, which provides a high-probability excess risk boud for the ERM algorithm whe

More information

c 2006 Society for Industrial and Applied Mathematics

c 2006 Society for Industrial and Applied Mathematics SIAM J. MATRIX ANAL. APPL. Vol. 7, No. 3, pp. 851 860 c 006 Society for Idustrial ad Applied Mathematics EXTREMAL EIGENVALUES OF REAL SYMMETRIC MATRICES WITH ENTRIES IN AN INTERVAL XINGZHI ZHAN Abstract.

More information

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise)

Lecture 22: Review for Exam 2. 1 Basic Model Assumptions (without Gaussian Noise) Lecture 22: Review for Exam 2 Basic Model Assumptios (without Gaussia Noise) We model oe cotiuous respose variable Y, as a liear fuctio of p umerical predictors, plus oise: Y = β 0 + β X +... β p X p +

More information

THE KALMAN FILTER RAUL ROJAS

THE KALMAN FILTER RAUL ROJAS THE KALMAN FILTER RAUL ROJAS Abstract. This paper provides a getle itroductio to the Kalma filter, a umerical method that ca be used for sesor fusio or for calculatio of trajectories. First, we cosider

More information

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring

Machine Learning Regression I Hamid R. Rabiee [Slides are based on Bishop Book] Spring Machie Learig Regressio I Hamid R. Rabiee [Slides are based o Bishop Book] Sprig 015 http://ce.sharif.edu/courses/93-94//ce717-1 Liear Regressio Liear regressio: ivolves a respose variable ad a sigle predictor

More information

Some New Iterative Methods for Solving Nonlinear Equations

Some New Iterative Methods for Solving Nonlinear Equations World Applied Scieces Joural 0 (6): 870-874, 01 ISSN 1818-495 IDOSI Publicatios, 01 DOI: 10.589/idosi.wasj.01.0.06.830 Some New Iterative Methods for Solvig Noliear Equatios Muhammad Aslam Noor, Khalida

More information

Preponderantly increasing/decreasing data in regression analysis

Preponderantly increasing/decreasing data in regression analysis Croatia Operatioal Research Review 269 CRORR 7(2016), 269 276 Prepoderatly icreasig/decreasig data i regressio aalysis Darija Marković 1, 1 Departmet of Mathematics, J. J. Strossmayer Uiversity of Osijek,

More information

CHAPTER 5. Theory and Solution Using Matrix Techniques

CHAPTER 5. Theory and Solution Using Matrix Techniques A SERIES OF CLASS NOTES FOR 2005-2006 TO INTRODUCE LINEAR AND NONLINEAR PROBLEMS TO ENGINEERS, SCIENTISTS, AND APPLIED MATHEMATICIANS DE CLASS NOTES 3 A COLLECTION OF HANDOUTS ON SYSTEMS OF ORDINARY DIFFERENTIAL

More information

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001.

Physics 324, Fall Dirac Notation. These notes were produced by David Kaplan for Phys. 324 in Autumn 2001. Physics 324, Fall 2002 Dirac Notatio These otes were produced by David Kapla for Phys. 324 i Autum 2001. 1 Vectors 1.1 Ier product Recall from liear algebra: we ca represet a vector V as a colum vector;

More information

Topic 9: Sampling Distributions of Estimators

Topic 9: Sampling Distributions of Estimators Topic 9: Samplig Distributios of Estimators Course 003, 2016 Page 0 Samplig distributios of estimators Sice our estimators are statistics (particular fuctios of radom variables), their distributio ca be

More information

Beurling Integers: Part 2

Beurling Integers: Part 2 Beurlig Itegers: Part 2 Isomorphisms Devi Platt July 11, 2015 1 Prime Factorizatio Sequeces I the last article we itroduced the Beurlig geeralized itegers, which ca be represeted as a sequece of real umbers

More information

A class of spectral bounds for Max k-cut

A class of spectral bounds for Max k-cut A class of spectral bouds for Max k-cut Miguel F. Ajos, José Neto December 07 Abstract Let G be a udirected ad edge-weighted simple graph. I this paper we itroduce a class of bouds for the maximum k-cut

More information

An Introduction to Randomized Algorithms

An Introduction to Randomized Algorithms A Itroductio to Radomized Algorithms The focus of this lecture is to study a radomized algorithm for quick sort, aalyze it usig probabilistic recurrece relatios, ad also provide more geeral tools for aalysis

More information

A statistical method to determine sample size to estimate characteristic value of soil parameters

A statistical method to determine sample size to estimate characteristic value of soil parameters A statistical method to determie sample size to estimate characteristic value of soil parameters Y. Hojo, B. Setiawa 2 ad M. Suzuki 3 Abstract Sample size is a importat factor to be cosidered i determiig

More information

Problem Set 4 Due Oct, 12

Problem Set 4 Due Oct, 12 EE226: Radom Processes i Systems Lecturer: Jea C. Walrad Problem Set 4 Due Oct, 12 Fall 06 GSI: Assae Gueye This problem set essetially reviews detectio theory ad hypothesis testig ad some basic otios

More information

REVISION SHEET FP1 (MEI) ALGEBRA. Identities In mathematics, an identity is a statement which is true for all values of the variables it contains.

REVISION SHEET FP1 (MEI) ALGEBRA. Identities In mathematics, an identity is a statement which is true for all values of the variables it contains. The mai ideas are: Idetities REVISION SHEET FP (MEI) ALGEBRA Before the exam you should kow: If a expressio is a idetity the it is true for all values of the variable it cotais The relatioships betwee

More information

Chimica Inorganica 3

Chimica Inorganica 3 himica Iorgaica Irreducible Represetatios ad haracter Tables Rather tha usig geometrical operatios, it is ofte much more coveiet to employ a ew set of group elemets which are matrices ad to make the rule

More information

Clustering. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar.

Clustering. CM226: Machine Learning for Bioinformatics. Fall Sriram Sankararaman Acknowledgments: Fei Sha, Ameet Talwalkar. Clusterig CM226: Machie Learig for Bioiformatics. Fall 216 Sriram Sakararama Ackowledgmets: Fei Sha, Ameet Talwalkar Clusterig 1 / 42 Admiistratio HW 1 due o Moday. Email/post o CCLE if you have questios.

More information

Statistical Inference Based on Extremum Estimators

Statistical Inference Based on Extremum Estimators T. Rotheberg Fall, 2007 Statistical Iferece Based o Extremum Estimators Itroductio Suppose 0, the true value of a p-dimesioal parameter, is kow to lie i some subset S R p : Ofte we choose to estimate 0

More information

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices

Random Matrices with Blocks of Intermediate Scale Strongly Correlated Band Matrices Radom Matrices with Blocks of Itermediate Scale Strogly Correlated Bad Matrices Jiayi Tog Advisor: Dr. Todd Kemp May 30, 07 Departmet of Mathematics Uiversity of Califoria, Sa Diego Cotets Itroductio Notatio

More information

C. Complex Numbers. x 6x + 2 = 0. This equation was known to have three real roots, given by simple combinations of the expressions

C. Complex Numbers. x 6x + 2 = 0. This equation was known to have three real roots, given by simple combinations of the expressions C. Complex Numbers. Complex arithmetic. Most people thik that complex umbers arose from attempts to solve quadratic equatios, but actually it was i coectio with cubic equatios they first appeared. Everyoe

More information

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ.

w (1) ˆx w (1) x (1) /ρ and w (2) ˆx w (2) x (2) /ρ. 2 5. Weighted umber of late jobs 5.1. Release dates ad due dates: maximimizig the weight of o-time jobs Oce we add release dates, miimizig the umber of late jobs becomes a sigificatly harder problem. For

More information

18.01 Calculus Jason Starr Fall 2005

18.01 Calculus Jason Starr Fall 2005 Lecture 18. October 5, 005 Homework. Problem Set 5 Part I: (c). Practice Problems. Course Reader: 3G 1, 3G, 3G 4, 3G 5. 1. Approximatig Riema itegrals. Ofte, there is o simpler expressio for the atiderivative

More information

Chi-Squared Tests Math 6070, Spring 2006

Chi-Squared Tests Math 6070, Spring 2006 Chi-Squared Tests Math 6070, Sprig 2006 Davar Khoshevisa Uiversity of Utah February XXX, 2006 Cotets MLE for Goodess-of Fit 2 2 The Multiomial Distributio 3 3 Applicatio to Goodess-of-Fit 6 3 Testig for

More information

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n.

Resampling Methods. X (1/2), i.e., Pr (X i m) = 1/2. We order the data: X (1) X (2) X (n). Define the sample median: ( n. Jauary 1, 2019 Resamplig Methods Motivatio We have so may estimators with the property θ θ d N 0, σ 2 We ca also write θ a N θ, σ 2 /, where a meas approximately distributed as Oce we have a cosistet estimator

More information

Introduction to Optimization Techniques

Introduction to Optimization Techniques Itroductio to Optimizatio Techiques Basic Cocepts of Aalysis - Real Aalysis, Fuctioal Aalysis 1 Basic Cocepts of Aalysis Liear Vector Spaces Defiitio: A vector space X is a set of elemets called vectors

More information

Soo King Lim Figure 1: Figure 2: Figure 3: Figure 4: Figure 5: Figure 6: Figure 7:

Soo King Lim Figure 1: Figure 2: Figure 3: Figure 4: Figure 5: Figure 6: Figure 7: 0 Multivariate Cotrol Chart 3 Multivariate Normal Distributio 5 Estimatio of the Mea ad Covariace Matrix 6 Hotellig s Cotrol Chart 6 Hotellig s Square 8 Average Value of k Subgroups 0 Example 3 3 Value

More information

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering

CEE 522 Autumn Uncertainty Concepts for Geotechnical Engineering CEE 5 Autum 005 Ucertaity Cocepts for Geotechical Egieerig Basic Termiology Set A set is a collectio of (mutually exclusive) objects or evets. The sample space is the (collectively exhaustive) collectio

More information

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis

Recursive Algorithms. Recurrences. Recursive Algorithms Analysis Recursive Algorithms Recurreces Computer Sciece & Egieerig 35: Discrete Mathematics Christopher M Bourke cbourke@cseuledu A recursive algorithm is oe i which objects are defied i terms of other objects

More information

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled

The picture in figure 1.1 helps us to see that the area represents the distance traveled. Figure 1: Area represents distance travelled 1 Lecture : Area Area ad distace traveled Approximatig area by rectagles Summatio The area uder a parabola 1.1 Area ad distace Suppose we have the followig iformatio about the velocity of a particle, how

More information

CMSE 820: Math. Foundations of Data Sci.

CMSE 820: Math. Foundations of Data Sci. Lecture 17 8.4 Weighted path graphs Take from [10, Lecture 3] As alluded to at the ed of the previous sectio, we ow aalyze weighted path graphs. To that ed, we prove the followig: Theorem 6 (Fiedler).

More information

Bivariate Sample Statistics Geog 210C Introduction to Spatial Data Analysis. Chris Funk. Lecture 7

Bivariate Sample Statistics Geog 210C Introduction to Spatial Data Analysis. Chris Funk. Lecture 7 Bivariate Sample Statistics Geog 210C Itroductio to Spatial Data Aalysis Chris Fuk Lecture 7 Overview Real statistical applicatio: Remote moitorig of east Africa log rais Lead up to Lab 5-6 Review of bivariate/multivariate

More information

Stochastic Simulation

Stochastic Simulation Stochastic Simulatio 1 Itroductio Readig Assigmet: Read Chapter 1 of text. We shall itroduce may of the key issues to be discussed i this course via a couple of model problems. Model Problem 1 (Jackso

More information

2 Geometric interpretation of complex numbers

2 Geometric interpretation of complex numbers 2 Geometric iterpretatio of complex umbers 2.1 Defiitio I will start fially with a precise defiitio, assumig that such mathematical object as vector space R 2 is well familiar to the studets. Recall that

More information

A Note On The Exponential Of A Matrix Whose Elements Are All 1

A Note On The Exponential Of A Matrix Whose Elements Are All 1 Applied Mathematics E-Notes, 8(208), 92-99 c ISSN 607-250 Available free at mirror sites of http://wwwmaththuedutw/ ame/ A Note O The Expoetial Of A Matrix Whose Elemets Are All Reza Farhadia Received

More information