Regularized Discriminant Analysis for High Dimensional, Low Sample Size Data


Jieping Ye
Arizona State University
Tempe, AZ

Tie Wang
Arizona State University
Tempe, AZ

ABSTRACT

Linear and Quadratic Discriminant Analysis have been used widely in many areas of data mining, machine learning, and bioinformatics. Friedman proposed a compromise between Linear and Quadratic Discriminant Analysis, called Regularized Discriminant Analysis (RDA), which has been shown to be more flexible in dealing with various class distributions. RDA applies regularization techniques by employing two regularization parameters, which are chosen to jointly maximize the classification performance. The optimal pair of parameters is commonly estimated via cross-validation from a set of candidate pairs. This estimation is computationally prohibitive for high dimensional data, especially when the candidate set is large, which limits the applications of RDA to low dimensional data. In this paper, a novel algorithm for RDA is presented for high dimensional data. It can estimate the optimal regularization parameters from a large set of parameter candidates efficiently. Experiments on a variety of datasets confirm the claimed theoretical estimate of the efficiency, and also show that, for a properly chosen pair of regularization parameters, RDA performs favorably in classification, in comparison with other existing classification methods.

Categories and Subject Descriptors: H.2.8 [Database Management]: Database Applications - Data Mining

General Terms: Algorithms

Keywords: Dimensionality reduction, Quadratic Discriminant Analysis, regularization, cross-validation.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. KDD'06, August 20-23, 2006, Philadelphia, Pennsylvania, USA. Copyright 2006 ACM ... $5.00.

1. INTRODUCTION

Statistical discriminant analysis is a frequently used and widely applicable tool in a variety of areas [6, 12, 26, 27]. The aim of discriminant analysis is to assign a data point to one of several classes (groups) on the basis of a number of feature variables. Numerous methods for discriminant analysis have been proposed and applied in the past. The most frequently used methods are parametric approaches, especially Linear and Quadratic Discriminant Analysis. Linear Discriminant Analysis (LDA) is based on the assumption that the variables are multivariate normally distributed in each class, with different mean vectors and a common covariance matrix. It has been used in various applications [1, 16, 19, 24]. In Quadratic Discriminant Analysis (QDA), the variables are assumed to be multivariate normally distributed in each class with different mean vectors and different covariance matrices [14]. QDA provides a less restrictive procedure by allowing different covariance matrices and may thus fit the data better than LDA. However, LDA involves a much smaller number of parameters to estimate than QDA, and is thus more robust and reliable than QDA in the parameter estimation. Friedman [8] proposed a compromise between LDA and QDA, called Regularized Discriminant Analysis (or RDA in short), which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA.
The regularized covariance matrix of the i-th class has the following form:

Σ̂_i = β (α Σ_i + (1 − α) S_w) + (1 − β) I_d,

where Σ_i is the covariance of the i-th class; S_w, the so-called pooled covariance matrix as used in LDA, is also known as the within-class scatter matrix [9]; I_d is the identity matrix of size d by d; and d is the dimensionality of the data. Here α ∈ [0, 1] and β ∈ [0, 1] are two regularization parameters. The trace term in the original formulation in [8] is absorbed into the β parameter for simplicity.

RDA provides a fairly rich class of regularization alternatives. The four corners defining the extremes of the (α, β) plane represent well-known classification procedures. The upper right corner (α = 1, β = 1) represents QDA. The upper left corner (α = 0, β = 1) represents LDA. The line connecting the lower left and lower right corners, i.e., β = 0 with 0 ≤ α ≤ 1, corresponds to the nearest-centroid classifier well known in pattern recognition, where a test data point is assigned to the class with the closest (in Euclidean distance) centroid. Varying α with β fixed at 1 produces models between QDA and LDA.

For a given training dataset, α and β are commonly estimated via cross-validation. Selecting an optimal value for a parameter pair such as (α, β) is called model selection [14]. The computational cost of model selection for RDA is high, especially when the data dimensionality, d, is large, since it requires expensive matrix computations for each candidate pair. This restricts RDA to low dimensional data.
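To make the roles of the two parameters concrete, here is a minimal NumPy sketch (ours, not from the paper) that builds the regularized covariance matrix of one class from the formula above; the comments mark the four corners of the (α, β) plane.

import numpy as np

def regularized_cov(Sigma_i, S_w, alpha, beta):
    # beta * (alpha * Sigma_i + (1 - alpha) * S_w) + (1 - beta) * I_d
    # (alpha, beta) = (1, 1): QDA (individual class covariance)
    # (alpha, beta) = (0, 1): LDA (pooled covariance)
    # beta = 0, any alpha:    nearest-centroid classifier (identity covariance)
    d = Sigma_i.shape[0]
    return beta * (alpha * Sigma_i + (1.0 - alpha) * S_w) + (1.0 - beta) * np.eye(d)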

In this paper, we make the first attempt at extending the applicability of RDA to high dimensional, low sample size (HDLSS) data, as HDLSS data are emerging from various fields. In high throughput gene expression experiments, technologies have been designed to measure the gene expression levels of tens of thousands of genes in a single microarray chip. However, the sample size in each dataset is typically small, ranging from tens to low hundreds, due to the cost of the experiments. In image-based object or face recognition applications, two or three dimensional images are usually converted to column representations, resulting in high dimensional data, while the number of images is usually small. In text document classification, the number of features equals the number of distinct words in the documents, which is typically in the thousands, while the number of documents in the study may be much smaller. A common characteristic of all these datasets is that the dimensionality, d, of the data vector is much larger than the sample size. This leads to various statistical issues, known as the high dimensional, low sample size problem [13].

In this paper, we propose an efficient algorithm for RDA on HDLSS data. The primary contributions of this work include:

- We show that the classification rule in RDA can be decomposed into two components: the first component involves matrices of low dimensionality, while the second component involves matrices of high dimensionality. More importantly, we show that for a given test data point, the second component in the classification rule is constant for all classes, so it has no effect on classification and can simply be removed. We call this the decomposition property of RDA.

- We present an efficient algorithm for RDA by applying the decomposition property above to speed up the model selection process of RDA. The basic idea is to divide the computations in RDA into two successive stages. The first stage has a relatively high computational cost, but it is independent of α and β. The second stage has a relatively low computational cost. When searching for the optimal parameter pair from a set of candidates via cross-validation, we only need to repeat the second stage, thus dramatically reducing the computational cost of model selection, especially when the candidate set is large.

- We have conducted experimental studies on a variety of HDLSS data, including text documents, face images, and microarray gene expression data. Results confirm our theoretical estimate of the computational cost of the proposed algorithm in model selection. Experiments also demonstrate that, with properly chosen regularization parameters, RDA is effective in classification, in comparison with several other well-known classification algorithms.

The rest of the paper is organized as follows. An overview of QDA and RDA is given in Section 2. An efficient algorithm for RDA is presented in Section 3. Experimental results are given in Section 4. Conclusions are presented in Section 5.

2. AN OVERVIEW OF QDA AND RDA

For convenience, we present in Table 1 the important notations that will be used in the rest of the paper.

Notation   Description
n          sample size
d          number of features (dimensions)
k          number of classes
A          data matrix
A_i        data matrix of the i-th class
n_i        size of the i-th class
μ_i        centroid of the i-th class
Σ_i        covariance matrix of the i-th class
Σ̂_i        regularized covariance matrix of the i-th class
μ          global centroid of the training set
S_b        between-class scatter matrix
S_w        within-class scatter matrix
S_t        total scatter matrix
t          rank of the matrix S_t
α          the first regularization parameter
β          the second regularization parameter
g(x)       class label of the data point x

Table 1: Important notations used in the paper.
In this section, we briefly review Quadratic Discriminant Analysis (QDA), some issues related to the application of QDA to HDLSS data, and Regularized Discriminant Analysis (RDA). Note that LDA is a special case of QDA when all classes share a common class covariance.

We are given a training dataset of n data points {(x_i, y_i)}_{i=1}^n, where x_i ∈ R^d is the feature vector of the i-th data point, d is the data dimensionality, y_i = g(x_i) ∈ {1, 2, ..., k} is the class label of x_i, and k is the number of classes. Let A = [x_1, x_2, ..., x_n] ∈ R^{d×n} be the data matrix, which can be decomposed into k classes as A = [A_1, A_2, ..., A_k], where A_i contains all data points from the i-th class. Denote n_i = |A_i| as the size of the i-th class. We have Σ_{i=1}^k n_i = n. Assuming the class densities follow the normal distribution, we apply the following classification rule [8, 14]: a test point x ∈ R^d is classified as class C(x) defined by

C(x) = argmin_i { (x − μ_i)^T Σ_i^{−1} (x − μ_i) + ln |Σ_i| },   (1)

where the centroid μ_i of the i-th class is defined as

μ_i = (1/n_i) A_i e^{(i)},   (2)

e^{(i)} ∈ R^{n_i} is a vector of all ones, and the covariance matrix Σ_i of the i-th class is defined as

Σ_i = (1/n_i) Σ_{x∈A_i} (x − μ_i)(x − μ_i)^T.   (3)

Note that we have assumed an equal prior for all classes in Eq. (1) for simplicity. The decision boundary under the above classification rule is quadratic, and the algorithm is thus called Quadratic Discriminant Analysis (QDA). In the special case where all classes share a common covariance, that is, Σ_i = Σ_j for any classes i and j, QDA reduces to the well-known Linear Discriminant Analysis (LDA) [5, 7, 9, 14]. The traditional QDA formulation in Eq. (1) requires all class covariance matrices to be nonsingular.
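As an illustration, the rule of Eqs. (1)-(3) can be written directly in NumPy when every class covariance is nonsingular. The sketch below is ours, not code from the paper; slogdet is used to compute ln |Σ_i| stably.

import numpy as np

def qda_fit(A, y):
    # A: d x n data matrix (columns are data points), y: length-n label array.
    classes = np.unique(y)
    mus, Sigmas = [], []
    for c in classes:
        Ac = A[:, y == c]
        mu = Ac.mean(axis=1)                      # Eq. (2)
        Z = Ac - mu[:, None]
        mus.append(mu)
        Sigmas.append(Z @ Z.T / Ac.shape[1])      # Eq. (3), 1/n_i normalization
    return classes, mus, Sigmas

def qda_predict(x, classes, mus, Sigmas):
    # Eq. (1): argmin_i (x - mu_i)^T Sigma_i^{-1} (x - mu_i) + ln|Sigma_i|,
    # assuming equal priors and nonsingular Sigma_i.
    scores = []
    for mu, S in zip(mus, Sigmas):
        diff = x - mu
        _, logdet = np.linalg.slogdet(S)
        scores.append(diff @ np.linalg.solve(S, diff) + logdet)
    return classes[int(np.argmin(scores))]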

However, for many applications involving HDLSS data, such as text document classification, face recognition, and microarray gene expression data analysis, all class covariance matrices may be singular, since the data dimensionality may be much larger than the sample size of every class in the training dataset. Furthermore, the estimates of the class covariance matrices may be biased and unreliable. As pointed out in [8], this bias is more pronounced when their eigenvalues tend toward equality, and correspondingly less severe when their eigenvalues are highly disparate. In both cases, the phenomenon becomes more pronounced as the sample size decreases. Thus, HDLSS data presents a major challenge for the application of QDA.

In [8], Friedman proposed a compromise between LDA and QDA, called Regularized Discriminant Analysis (RDA), which allows one to shrink the separate covariances of QDA toward a common covariance as in LDA by employing regularization techniques. Regularization has been commonly used in the solution of ill-posed inverse problems [20], where the number of parameters exceeds the sample size. In such cases, the parameter estimates can be highly unstable, giving rise to high variance. By employing a method of regularization, one attempts to improve the estimates by regulating this bias-variance trade-off. Quadratic Discriminant Analysis is ill-posed if n_i < d for any class i. One method of regularization is to replace the individual class covariance matrix Σ_i by S_i(α) as follows:

S_i(α) = α Σ_i + (1 − α) S_w,   (4)

where S_w is the weighted average of the class covariance matrices, called the pooled covariance matrix, or within-class scatter matrix [9], which is defined as

S_w = (1/n) Σ_{i=1}^k Σ_{x∈A_i} (x − μ_i)(x − μ_i)^T = (1/n) Σ_{i=1}^k n_i Σ_i.   (5)

The regularization parameter α takes on values between 0 and 1. It controls the degree of shrinkage of the individual class covariance matrix estimates toward the pooled estimate. The value α = 1 gives rise to QDA, whereas α = 0 yields LDA. However, the regularization in Eq. (4) is still fairly limited. First, it might not provide enough regularization: if the total sample size, n, is less than the data dimensionality, d, then even LDA is ill-posed [17, 22]. Second, biasing the class covariance matrices toward commonality may not be the most effective way to shrink them. Recall that ridge regression regularizes ordinary linear least squares regression by shrinking toward a multiple of the identity matrix [14, 15]. To this end, a further regularization is given by

Σ̂_i = β S_i(α) + (1 − β) I_d,   (6)

where I_d is the identity matrix of size d by d and β is an additional regularization parameter, which controls shrinkage toward a multiple of the identity matrix. In this paper, we apply a variant of the regularized class covariance matrix in Eq. (6), given by

Σ̂_i = β (α Σ_i + (1 − α) S_t) + (1 − β) I_d,   (7)

where the total scatter matrix S_t is defined as

S_t = (1/n) Σ_{i=1}^n (x_i − μ)(x_i − μ)^T,   (8)

α ∈ [0, 1], and β ∈ [0, 1]. The minor difference here lies in the matrix S_t used in Eq. (7), where S_w is employed in Eq. (6). It is interesting to note that, when β → 1, the classification rule based on the regularized class covariance matrix in Eq. (6) may be numerically unstable, while the one based on the matrix in Eq. (7) is stable even for HDLSS data (see Section 3). The use of S_t instead of S_w has recently been explored in LDA for improving numerical stability [3, 22]. A test point x ∈ R^d is classified as class Ĉ(x) given by

Ĉ(x) = argmin_i { (x − μ_i)^T Σ̂_i^{−1} (x − μ_i) + ln |Σ̂_i| },   (9)

where Σ̂_i is defined in Eq. (7). The performance of RDA may depend critically on the values of the parameters α and β. Cross-validation is commonly used to estimate the optimal α and β from a finite set, Λ = {(α_i, β_j)}, where i = 1, ..., r and j = 1, ..., s. The number of candidate pairs (α, β) is |Λ| = rs.
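For reference, a direct implementation of the rule in Eq. (9) (again our sketch, not the authors' code) forms Σ̂_i in the full d-dimensional space and factors it; the centroids and covariances are assumed to be supplied by the caller, e.g., as in the QDA sketch above.

import numpy as np

def rda_predict_naive(x, mus, Sigmas, S_t, alpha, beta):
    # Direct RDA rule, Eq. (9), with the Eq. (7) covariance
    # Sigma_hat_i = beta*(alpha*Sigma_i + (1-alpha)*S_t) + (1-beta)*I_d.
    # Forms and factors a d x d matrix per class: O(d^3), prohibitive for HDLSS data.
    d = S_t.shape[0]
    best, best_score = None, np.inf
    for i, (mu, Sigma_i) in enumerate(zip(mus, Sigmas)):
        Sig_hat = beta * (alpha * Sigma_i + (1 - alpha) * S_t) + (1 - beta) * np.eye(d)
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sig_hat)
        score = diff @ np.linalg.solve(Sig_hat, diff) + logdet
        if score < best_score:
            best, best_score = i, score
    return best

Every new (α, β) pair repeats all of the d × d work here, which is what makes a naive search over a large candidate set prohibitive.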
In practice, a large number, rs, of candidate pairs is often desirable to achieve good classification performance. However, with a large number of parameter pairs, the computational cost of model selection for RDA may be prohibitive for HDLSS data, since it requires expensive matrix computations for each candidate pair. A direct implementation of RDA such as the one used in [8] involves the formation of Σ̂_i and the inversion of Σ̂_i for all i. The computation of the inversion of all class covariance matrices takes O(d^3) time and is prohibitive for HDLSS data, where the data dimensionality d is large. This limits the applications of RDA to low dimensional data.

3. EFFICIENT MODEL SELECTION FOR RDA

In this section, we first establish a key property of RDA, which shows that the classification rule in RDA can be decomposed into two components. The first component involves matrices of low dimensionality, while the second component involves matrices of high dimensionality. More importantly, we show that for a given test data point, the second component of the classification rule is constant for all classes; it therefore has no effect on the classification and can simply be removed. Thus, the computational cost of RDA can be significantly reduced. We call this the decomposition property of RDA.

We show below that the essence of the decomposition property of RDA is that the first component of the classification rule lies in the orthogonal complement of the null space of S_t, which has low dimensionality for HDLSS data, while the second component lies in the null space of S_t, which has high dimensionality. Define the between-class scatter matrix S_b, used in discriminant analysis [9], as follows:

S_b = (1/n) Σ_{i=1}^k n_i (μ_i − μ)(μ_i − μ)^T.   (10)

It follows from the definitions that

S_t = S_b + S_w.   (11)

We have the following result concerning the relationship between the null space of S_t and the null spaces of Σ_i and S_b:

Lemma 3.1. Let Σ_i, S_b, S_t, and S_w be defined as above. The null space of S_t, denoted N(S_t), is a subset of the null space N(S_b) of S_b and a subset of the null space N(Σ_i) of Σ_i, for all i. That is, N(S_t) ⊆ N(S_b) and N(S_t) ⊆ N(Σ_i), for all i.

Proof. Consider any x ∈ N(S_t). That is, S_t x = 0 and x^T S_t x = 0. From Eqs. (5) and (11), we have

S_t = Σ_{i=1}^k (n_i/n) Σ_i + S_b.   (12)

It follows that

0 = x^T S_t x = Σ_{i=1}^k (n_i/n) x^T Σ_i x + x^T S_b x.

Since Σ_i, for all i, and S_b are positive semi-definite, we have x^T Σ_i x = 0, for all i, and x^T S_b x = 0. It follows that Σ_i x = 0 and S_b x = 0. Therefore, x also lies in the null spaces of Σ_i and S_b. Hence, N(S_t) ⊆ N(S_b) and N(S_t) ⊆ N(Σ_i).

Let S_t = U D U^T be the Singular Value Decomposition (SVD) [10] of S_t, where U is orthogonal, D = diag(D_t, 0), D_t ∈ R^{t×t} is diagonal, and t = rank(S_t). Note that t ≤ n. Partition U as U = [U_1, U_2], where U_1 ∈ R^{d×t} and U_2 ∈ R^{d×(d−t)}. Then U_2 lies in the null space of S_t, i.e., S_t U_2 = 0. We have the following result concerning the decomposition structure of Σ̂_i:

Lemma 3.2. Let U = [U_1, U_2] be defined as above and let Σ̂_i be defined as in Eq. (7). Then Σ̂_i can be expressed as

Σ̂_i = U diag(M_i, (1 − β) I_{d−t}) U^T,   (13)

where

M_i = β (α Σ̃_i + (1 − α) D_t) + (1 − β) I_t,   (14)

and Σ̃_i = U_1^T Σ_i U_1.

Proof. Recall from Eq. (7) that

Σ̂_i = β (α Σ_i + (1 − α) S_t) + (1 − β) I_d.

It follows that

U^T Σ̂_i U = β (α U^T Σ_i U + (1 − α) U^T S_t U) + (1 − β) I_d.

From Lemma 3.1, Σ_i U_2 = 0, for all i. It follows that

Σ̂_i = U ( β (α U^T Σ_i U + (1 − α) U^T S_t U) + (1 − β) I_d ) U^T = U diag(M_i, (1 − β) I_{d−t}) U^T.

Lemma 3.2 implies that all regularized class covariance matrices share a similar decomposition structure, which leads to the decomposition property of RDA as summarized in the following proposition:

Proposition 3.1. Let U_1, U_2, and M_i be defined as above. Then the classification rule in Eq. (9) is equivalent to:

Ĉ(x) = argmin_i { (x − μ_i)^T U_1 M_i^{−1} U_1^T (x − μ_i) + ln |M_i| + (1 − β)^{−1} (x − μ_i)^T U_2 U_2^T (x − μ_i) + (d − t) ln(1 − β) }.   (15)

Proof. Denote F_i = (x − μ_i)^T Σ̂_i^{−1} (x − μ_i). It follows from Lemma 3.2 that

F_i = (x − μ_i)^T U diag(M_i^{−1}, (1 − β)^{−1} I_{d−t}) U^T (x − μ_i) = (x − μ_i)^T U_1 M_i^{−1} U_1^T (x − μ_i) + (1 − β)^{−1} (x − μ_i)^T U_2 U_2^T (x − μ_i).   (16)

The result follows directly from Lemma 3.2 and Eq. (16), as ln |Σ̂_i| = ln |M_i| + ln |(1 − β) I_{d−t}| = ln |M_i| + (d − t) ln(1 − β).

Proposition 3.1 implies that the classification rule in RDA can be decomposed into two components, as in Eq. (15). The first component, i.e., (x − μ_i)^T U_1 M_i^{−1} U_1^T (x − μ_i), involves U_1, which lies in the orthogonal complement of the null space of S_t, while the second component, i.e., (1 − β)^{−1} (x − μ_i)^T U_2 U_2^T (x − μ_i) + (d − t) ln(1 − β), involves U_2, which lies in the null space of S_t. Note that the null space of S_t is of dimension d − t, which is much larger than the dimension, t, of the orthogonal complement of the null space of S_t, for HDLSS data. However, two issues need to be resolved before we apply the classification rule in Eq. (16). First, the computation may be numerically unstable as β → 1, due to the presence of (1 − β)^{−1} in the computation. Second, finding the best parameter pair (α, β) from a set, Λ, of candidate pairs may be expensive, since U_2 ∈ R^{d×(d−t)} is of large size for HDLSS data. Interestingly, both issues can be addressed simultaneously by simply removing the second term in Eq. (16), based on the lemma below:

Lemma 3.3. Let U_2, μ_i, and μ be defined as above. Then U_2^T (μ_i − μ) = 0. Thus, U_2^T (x − μ_i) = U_2^T (x − μ), for any x.

Proof. Note that U_2 lies in the null space of S_t. From Lemma 3.1, U_2 also lies in the null space of S_b. That is, U_2^T S_b = 0. S_b in Eq. (10) can be expressed as S_b = H_b H_b^T, where

H_b = (1/√n) [√n_1 (μ_1 − μ), √n_2 (μ_2 − μ), ..., √n_k (μ_k − μ)].
It follows from U_2^T S_b = 0 that U_2^T H_b = 0, i.e., U_2^T [√n_1 (μ_1 − μ), √n_2 (μ_2 − μ), ..., √n_k (μ_k − μ)] = 0, and U_2^T (μ_i − μ) = 0. Hence,

U_2^T (x − μ_i) = U_2^T (x − μ).   (17)

From Lemma 3.3, the classification rule in Eq. (15) can be further simplified by removing the second component, as

Ĉ(x) = argmin_i { (x − μ_i)^T U_1 M_i^{−1} U_1^T (x − μ_i) + ln |M_i| } = argmin_i { (x̃ − μ̃_i)^T M_i^{−1} (x̃ − μ̃_i) + ln |M_i| },   (18)

where x̃ = U_1^T x and μ̃_i = U_1^T μ_i.
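The decomposition property is easy to check numerically. The following self-contained sketch (ours; it uses a random HDLSS toy dataset) verifies Lemma 3.3: the projection U_2^T (x − μ_i) is the same vector for every class i, so the term dropped in going from Eq. (15) to Eq. (18) cannot change the argmin.

import numpy as np

rng = np.random.default_rng(0)
d, k, n_per = 50, 3, 5                     # HDLSS toy problem: d >> n
A = rng.standard_normal((d, k * n_per))    # columns are data points
y = np.repeat(np.arange(k), n_per)

mu = A.mean(axis=1, keepdims=True)
mus = [A[:, y == c].mean(axis=1) for c in range(k)]
H_t = (A - mu) / np.sqrt(A.shape[1])       # Eq. (28): S_t = H_t @ H_t.T
U, s, _ = np.linalg.svd(H_t, full_matrices=True)
t = int(np.sum(s > 1e-10))
U2 = U[:, t:]                              # basis of the null space of S_t

x = rng.standard_normal(d)                 # an arbitrary test point
# Lemma 3.3: U2^T (x - mu_i) is the same vector for every class i, ...
proj = [U2.T @ (x - mus[c]) for c in range(k)]
for c in range(1, k):
    assert np.allclose(proj[c], proj[0])
# ... so the second component of Eq. (15) is constant across classes and
# dropping it (Eq. (18)) cannot change the argmin.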

3.1 The computation of M_i^{−1} and |M_i|

The main computations in Eq. (18) are the inversion of M_i and the determinant of M_i, for all i, which take O(t^3) = O(n^3) time, as M_i ∈ R^{t×t} and t ≤ n. Recall from Section 2 that the direct implementation of RDA computes the inversion of Σ̂_i for all i directly, with a time complexity of O(d^3), which is significantly higher than O(n^3) for HDLSS data. In the following, we present an efficient way of computing the inversion of M_i and the determinant of M_i, for all i, with a time complexity of O(n^3/k), thus further reducing the complexity of the algorithm.

Define the matrix H_i ∈ R^{d×n_i} as follows:

H_i = (1/√n_i) [A_i − μ_i (e^{(i)})^T],   (19)

where A_i ∈ R^{d×n_i} is the data matrix of the i-th class, μ_i is the centroid of the i-th class, and e^{(i)} is the vector of all ones of length n_i. Then the class covariance matrix, Σ_i, of the i-th class can be expressed as

Σ_i = H_i H_i^T.   (20)

It follows that

Σ̃_i = U_1^T H_i H_i^T U_1 = H̃_i H̃_i^T,   (21)

where H̃_i = U_1^T H_i. Denote by D_αβ the diagonal matrix

D_αβ = (1 − α) β D_t + (1 − β) I_t.   (22)

From Eq. (14),

M_i = αβ H̃_i H̃_i^T + D_αβ = D_αβ^{0.5} ( (√(αβ) D_αβ^{−0.5} H̃_i)(√(αβ) D_αβ^{−0.5} H̃_i)^T + I_t ) D_αβ^{0.5} = D_αβ^{0.5} ( X_i X_i^T + I_t ) D_αβ^{0.5},   (23)

where X_i = √(αβ) D_αβ^{−0.5} H̃_i ∈ R^{t×n_i}. It follows from the Sherman-Morrison-Woodbury formula [10] that

M_i^{−1} = D_αβ^{−0.5} ( I_t − X_i (I_{n_i} + X_i^T X_i)^{−1} X_i^T ) D_αβ^{−0.5}.   (24)

Note that the matrix inversion in Eq. (24) is on

N_i = I_{n_i} + X_i^T X_i ∈ R^{n_i×n_i}.   (25)

However, the inverse M_i^{−1} will not be formed explicitly, as the multiplication (x̃ − μ̃_i)^T M_i^{−1} (x̃ − μ̃_i) would then take O(n^2) time for each x. Note from Eq. (24) that

(x̃ − μ̃_i)^T M_i^{−1} (x̃ − μ̃_i) = (x̃ − μ̃_i)^T D_αβ^{−1} (x̃ − μ̃_i) − (x̃ − μ̃_i)^T D_αβ^{−0.5} X_i N_i^{−1} X_i^T D_αβ^{−0.5} (x̃ − μ̃_i),   (26)

which takes O(n_i^3) time for computing N_i^{−1} and O(n n_i) time for all other computations, for each x. The total complexity is thus O(n n_i + n_i^3) time, for each x. Thus, the computation of (x̃ − μ̃_i)^T M_i^{−1} (x̃ − μ̃_i) for all i takes O(n^2 + Σ_{i=1}^k n_i^3) time. Assuming all classes are of approximately equal size, that is, n_i ≈ n/k, the time complexity of the computation is O(n^2 + n^3/k^2). One key observation here is that the computation of N_i^{−1} is independent of the test point x. Note that the total number of test points in v-fold cross-validation is n/v. In this case, the total computational cost for all test points is O(n^3 + n^3/k^2), instead of O(n^3 + n^4/k^2).

Next, we consider the computation of |M_i|, which is independent of the test point x. From Eq. (23),

|M_i| = |X_i X_i^T + I_t| · |D_αβ| = |X_i^T X_i + I_{n_i}| · |D_αβ|,   (27)

where the last equality follows from the following lemma:

Lemma 3.4. Let X ∈ R^{t×n} be any matrix of size t by n. Then the following equality always holds: |X X^T + I_t| = |X^T X + I_n|.

Proof. Let X = R D S^T be the SVD of X, where R and S are orthogonal and D = diag(Σ_m, 0) with Σ_m = diag(λ_1, ..., λ_m) and m = rank(X). It follows that

|X X^T + I_t| = |R (D D^T + I_t) R^T| = |D D^T + I_t| = |Σ_m^2 + I_m| · |I_{t−m}| = Π_{j=1}^m (λ_j^2 + 1),

|X^T X + I_n| = |S (D^T D + I_n) S^T| = |D^T D + I_n| = |Σ_m^2 + I_m| · |I_{n−m}| = Π_{j=1}^m (λ_j^2 + 1).

Thus |X X^T + I_t| = |X^T X + I_n|.

From Eq. (27), the time complexity of the computation of |M_i|, for all i, is

O( Σ_{i=1}^k (t n_i^2 + n_i^3) ) = O( Σ_{i=1}^k (n n_i^2 + n_i^3) ),

which is O(n^3/k), assuming all classes are of approximately equal size.
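A NumPy sketch of Eqs. (22)-(27) follows (ours, not from the paper; it assumes an interior parameter pair so that the diagonal D_αβ stays strictly positive): the only matrix ever inverted is the n_i × n_i matrix N_i, and the determinant uses Lemma 3.4, so no t × t inverse is formed.

import numpy as np

def class_model(H_tilde_i, D_t_diag, alpha, beta):
    # Precompute, for one class, everything Eq. (26) and Eq. (27) need.
    # H_tilde_i: t x n_i reduced class matrix; D_t_diag: diagonal of D_t.
    # Assumes (alpha, beta) != (1, 1) so that d_ab > 0 entrywise.
    n_i = H_tilde_i.shape[1]
    d_ab = (1 - alpha) * beta * D_t_diag + (1 - beta)            # Eq. (22)
    X_i = np.sqrt(alpha * beta) * (H_tilde_i / np.sqrt(d_ab)[:, None])  # Eq. (23)
    N_i = np.eye(n_i) + X_i.T @ X_i                              # Eq. (25)
    N_inv = np.linalg.inv(N_i)
    # Eq. (27) with Lemma 3.4: |M_i| = |N_i| * |D_alpha_beta|
    _, log_det_M = np.linalg.slogdet(N_i)
    log_det_M += np.sum(np.log(d_ab))
    return d_ab, X_i, N_inv, log_det_M

def quad_form(xt, mu_t, d_ab, X_i, N_inv):
    # Eq. (26): (x~ - mu~_i)^T M_i^{-1} (x~ - mu~_i), without forming M_i^{-1}.
    diff = (xt - mu_t) / np.sqrt(d_ab)    # D_ab^{-1/2} (x~ - mu~_i)
    z = X_i.T @ diff
    return diff @ diff - z @ (N_inv @ z)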
3.2 The computation of U_1

Recall that the key matrix in the decomposition property of RDA in Proposition 3.1 is U_1, which lies in the orthogonal complement of the null space of S_t. We have applied the SVD for computing U_1 as S_t = U D U^T, where U = [U_1, U_2] is a partition of U. When the data dimensionality d is large, the full SVD computation of S_t ∈ R^{d×d} is expensive. However, from Lemma 3.3, only the first component of the classification rule, which involves U_1, is effective in RDA, while U_2, the null space of S_t, can simply be omitted.

Thus, U_1 can be computed efficiently without the full SVD computation of S_t, as follows. Define the matrix H_t as

H_t = (1/√n) (A − μ e^T),   (28)

where μ is the global centroid and e is the vector of all ones. It follows from the definition that S_t = H_t H_t^T. Note that H_t ∈ R^{d×n}, which is much smaller than S_t for HDLSS data. Let H_t = Û Σ̂ V̂^T be the reduced SVD of H_t, where Û ∈ R^{d×t} and V̂ ∈ R^{n×t} have orthonormal columns and Σ̂ ∈ R^{t×t} is diagonal with t = rank(H_t). It follows that S_t = H_t H_t^T = Û Σ̂^2 Û^T. Thus, U_1 = Û and D_t = Σ̂^2. The time complexity of the reduced SVD computation of H_t is O(dn^2) [10], instead of O(d^3) for the full SVD computation.
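In NumPy, the reduced SVD step of Eq. (28) takes only a few lines (our sketch); for d ≫ n, the economy SVD of the d × n matrix H_t replaces the d × d eigendecomposition of S_t.

import numpy as np

def compute_U1(A, tol=1e-10):
    # A: d x n data matrix (columns are data points).
    # Returns U1 (d x t) and the diagonal of D_t, via the reduced SVD of
    # H_t, Eq. (28), at O(d n^2) cost instead of O(d^3).
    d, n = A.shape
    mu = A.mean(axis=1, keepdims=True)
    H_t = (A - mu) / np.sqrt(n)                               # S_t = H_t @ H_t.T
    U_hat, s, _ = np.linalg.svd(H_t, full_matrices=False)     # economy SVD
    t = int(np.sum(s > tol * s[0]))                           # numerical rank of H_t
    return U_hat[:, :t], s[:t] ** 2                           # U1 and diag(D_t)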

3.3 The main algorithm

Let Λ = {(α_i, β_j)}, where i = 1, ..., r and j = 1, ..., s, be the candidate set for the regularization parameters. In model selection, v-fold cross-validation is applied, where the data is divided into v subsets of (approximately) equal size. All subsets are mutually exclusive, and in the h-th fold, the h-th subset is held out for testing and all other subsets are used for training. For each (α_i, β_j), we compute the cross-validation accuracy, Accu(i, j), defined as the mean of the accuracies over all folds. The best regularization pair (α_{i*}, β_{j*}) is the one with (i*, j*) = argmax_{i,j} Accu(i, j). The pseudo-code of the proposed RDA algorithm is given below.

Algorithm RDA
Input: data matrix A; candidate parameters {α_i}_{i=1}^r and {β_j}_{j=1}^s
Output: the optimal parameter pair (α_{i*}, β_{j*})
1.  For h = 1 : v  /* v-fold cross-validation */
2.    Construct A^h and A^ĥ;  /* A^h = training set of the h-th fold; A^ĥ = the h-th subset, held out for testing */
3.    Construct H_t using A^h as in Eq. (28);
4.    Compute the reduced SVD of H_t: H_t = Û Σ̂ V̂^T;
5.    t ← rank(H_t); U_1 ← Û; D_t ← Σ̂^2;
6.    A^h_L ← U_1^T A^h; A^ĥ_L ← U_1^T A^ĥ;  /* the null space, U_2, of S_t is removed */
7.    Form {H̃_u}_{u=1}^k based on A^h_L as in Eq. (19);
8.    For i = 1 : r  /* α_1, α_2, ..., α_r */
9.      For j = 1 : s  /* β_1, β_2, ..., β_s */
10.       D_αβ ← (1 − α_i) β_j D_t + (1 − β_j) I_t;
11.       For u = 1 : k
12.         X_u ← √(α_i β_j) D_αβ^{−0.5} H̃_u;
13.         N_u^{−1} ← (I + X_u^T X_u)^{−1}, as in Eq. (25);
14.         Compute |M_u| as in Eq. (27);
15.       EndFor
16.       temp ← 0;  /* temp counts the number of test points correctly classified */
17.       For each x̃ ∈ A^ĥ_L  /* x̃ = U_1^T x and x ∈ A^ĥ */
18.         C(x) ← argmin_u { (x̃ − μ̃_u)^T M_u^{−1} (x̃ − μ̃_u)
19.                          + ln |M_u| };  /* the multiplication is done as in Eq. (26) */
20.         If (C(x) == g(x)) temp ← temp + 1;
21.       EndFor
22.       Accu(h, i, j) ← temp / |A^ĥ_L|;  /* |A^ĥ_L| denotes the number of test points */
23.     EndFor
24.   EndFor
25. EndFor
26. Accu(i, j) ← (1/v) Σ_{h=1}^v Accu(h, i, j);  /* Accu(i, j) denotes the cross-validation accuracy */
27. (i*, j*) ← argmax_{i,j} Accu(i, j);
28. Output (α_{i*}, β_{j*}) as the best parameter pair.

3.4 Time Complexity

Line 4 takes O(n^2 d) time for the reduced SVD computation [10]. Lines 5 and 6 take O(dn^2) time for the matrix multiplications. The For loop from Line 11 to Line 15 takes O(n^3/k) time. There are about n/v elements in A^ĥ_L, so Lines 18 to 20 within the For loop run about n/v times. Following the multiplication in Eq. (26), the computations from Line 18 to Line 20 take O(n^2) time, and the For loop from Line 17 to Line 21 takes O(n^3/v) time. Thus, the double For loops from Line 8 to Line 26 take O(n^3 rs (1/k + 1/v)) time. The total running time of the algorithm is thus

T(r, s) = O( v (n^2 d + n^3 rs (1/k + 1/v)) ) = O( v n^2 (d + n rs (1/k + 1/v)) ).

It follows that

T(r, s)/T(1, 1) = (d + n rs (1/k + 1/v)) / (d + n (1/k + 1/v)) ≤ 1 + n rs (1/k + 1/v)/d.

For HDLSS data, where the sample size n is much smaller than the data dimensionality d, i.e., n ≪ d, the overhead of estimating the optimal regularization pair over a large search space may be small. For example, with n = 400, d = 10000, k = 40, and v = 5 (roughly the ORL setting), searching rs = 900 candidate pairs increases the cost by a factor of at most 1 + 400 · 900 · (1/40 + 1/5)/10000 ≈ 9.1, far less than 900. Note that the first stage of RDA takes O(vn^2 d) time, which is expensive for HDLSS data. However, it is independent of the parameters. In the second stage of RDA, the most expensive steps are the computations of M_i^{−1} and |M_i|, which take O(v n^3 rs (1/k + 1/v)) time. This is independent of the data dimensionality d, which is the key reason why the proposed algorithm is applicable to HDLSS data with a large candidate set of parameters.
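Tying Sections 3.1-3.4 together, the following compact sketch (ours; it mirrors the pseudocode above but is not the authors' released code) runs one cross-validation fold: the SVD-based first stage is executed once, and only the cheap second stage is repeated for each (α, β) pair.

import numpy as np

def rda_fold(A_train, y_train, A_test, y_test, alphas, betas, tol=1e-10):
    # One cross-validation fold of the two-stage RDA procedure.
    # A_train: d x n training matrix (columns are data points).
    # Returns an accuracy grid of shape (len(alphas), len(betas)).
    d, n = A_train.shape
    classes = np.unique(y_train)

    # Stage 1 (parameter independent): reduced SVD of H_t, Eq. (28).
    mu = A_train.mean(axis=1, keepdims=True)
    U_hat, s, _ = np.linalg.svd((A_train - mu) / np.sqrt(n), full_matrices=False)
    t = int(np.sum(s > tol * s[0]))
    U1, D_t = U_hat[:, :t], s[:t] ** 2
    A_tr, A_te = U1.T @ A_train, U1.T @ A_test    # drop the null space of S_t
    mus = [A_tr[:, y_train == c].mean(axis=1) for c in classes]
    Hs = [(A_tr[:, y_train == c] - m[:, None]) / np.sqrt(np.sum(y_train == c))
          for c, m in zip(classes, mus)]          # Eq. (19) in the reduced space

    # Stage 2 (cheap, repeated for every candidate pair; assumes
    # (alpha, beta) != (1, 1) so that d_ab below stays positive).
    acc = np.zeros((len(alphas), len(betas)))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            d_ab = (1 - a) * b * D_t + (1 - b)    # Eq. (22)
            scores = []
            for m, H in zip(mus, Hs):
                X = np.sqrt(a * b) * (H / np.sqrt(d_ab)[:, None])
                N = np.eye(H.shape[1]) + X.T @ X              # Eq. (25)
                N_inv = np.linalg.inv(N)
                _, logdet = np.linalg.slogdet(N)              # Lemma 3.4
                logdet += np.sum(np.log(d_ab))                # Eq. (27)
                diff = (A_te - m[:, None]) / np.sqrt(d_ab)[:, None]
                z = X.T @ diff
                q = np.sum(diff * diff, axis=0) - np.sum(z * (N_inv @ z), axis=0)
                scores.append(q + logdet)                     # Eq. (26) + ln|M_u|
            pred = classes[np.argmin(np.stack(scores), axis=0)]
            acc[i, j] = np.mean(pred == y_test)
    return acc

For v-fold model selection one would call rda_fold once per fold and average the returned accuracy grids, as in Lines 26-28 of the pseudocode.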
3.5 RDA versus ULDA

We conclude this section by showing an interesting relationship between RDA and Uncorrelated LDA (ULDA) [23]. ULDA is an extension of the original formulation in [16] to high dimensional, small sample size data. It follows the basic framework of LDA [5, 7, 9, 14], which computes the optimal transformation (projection) by minimizing the ratio of the within-class distance to the between-class distance, thus achieving maximum discrimination. One key property of ULDA is that the features in the transformed space are uncorrelated, thus ensuring minimum redundancy among the features in the reduced space. It has been applied successfully in microarray gene expression data analysis [24].

It was shown in [22] that the optimal transformation G of ULDA consists of the first q eigenvectors of S_t^+ S_b, where q = rank(S_b). With the computed G, a test point x is classified by ULDA as class h, where h = argmin_i ||G^T (x − μ_i)||^2. It has also been shown in [22] that

argmin_i (x − μ_i)^T S_t^+ (x − μ_i) = argmin_i ||G^T (x − μ_i)||^2.

Interestingly, we can show that the limit of RDA as α → 0 and β → 1 is equivalent to ULDA, as summarized in the following theorem:

Theorem 3.1. The classification rule in RDA approaches that of ULDA as α → 0 and β → 1. That is, if G is the transformation of ULDA, then

lim_{α→0, β→1} Ĉ(x) = argmin_i ||G^T (x − μ_i)||^2.

Proof. From Eq. (18), the classification rule in RDA is equivalent to

Ĉ(x) = argmin_i { (x̃ − μ̃_i)^T M_i^{−1} (x̃ − μ̃_i) + ln |M_i| }.   (29)

From Eq. (14), we have lim_{α→0, β→1} M_i = D_t. Thus

lim_{α→0, β→1} Ĉ(x) = argmin_i { (x̃ − μ̃_i)^T D_t^{−1} (x̃ − μ̃_i) + ln |D_t| } = argmin_i (x − μ_i)^T U_1 D_t^{−1} U_1^T (x − μ_i) = argmin_i (x − μ_i)^T S_t^+ (x − μ_i),

which is equivalent to argmin_i ||G^T (x − μ_i)||^2, the classification rule in ULDA.

Theorem 3.1 shows that ULDA is a special case of RDA with α = 0 and β = 1. With a properly chosen parameter pair (α, β) through cross-validation, RDA is expected to outperform ULDA, which is confirmed by the empirical results presented in the next section. Note that the limit of Σ̂_i^{−1} as α → 0 and β → 1 does not exist for HDLSS data, as the limit of Σ̂_i = β (α Σ_i + (1 − α) S_t) + (1 − β) I_d is singular when d > n. However, Theorem 3.1 shows that the limit below exists:

lim_{α→0, β→1} (x − μ_i)^T Σ̂_i^{−1} (x − μ_i) = (x − μ_i)^T S_t^+ (x − μ_i),

due to the decomposition property of RDA as in Eq. (18).

4. EXPERIMENTS

In this section, we experimentally evaluate the performance of the proposed RDA algorithm. v-fold cross-validation with v = 5 has been used in RDA for model selection. All of our experiments have been performed on a P4 3.00GHz Windows XP machine with 2GB memory.

4.1 Datasets

We have used three types of HDLSS data for the evaluation, including text documents, face images, and gene expression data. The important statistics of these datasets are summarized below (see also Table 2):

- re0 and re1 are two text document datasets derived from the Reuters-21578 text categorization test collection, Distribution 1.0 [18]. re0 includes 320 documents belonging to 4 different classes. re1 has 5 classes, each with 98 instances.

- ORL is a face image dataset, which contains 400 face images of 40 individuals. The image size is 92 × 112. The face images are perfectly centralized. The major challenge on this dataset is the variation of the face pose. There is no lighting variation, with minimal facial expression variation and no occlusion. We use the whole image as an instance (i.e., the dimension of an instance is 92 × 112 = 10304).

- PIX is a face image dataset, which contains 300 face images of 30 individuals. We subsample the images with a sample step of 5 × 5, and the dimension of each instance is reduced to 100 × 100 = 10000.

- ALL is a gene expression dataset consisting of six diagnostic groups [25]. The breakdown of the samples is: 15 samples for BCR, 27 samples for E2A, 64 samples for Hyperdip, 20 samples for MLL, 43 samples for T, and 79 samples for TEL.

- ALLAML4 is a gene expression dataset, which contains the gene expression profiles of two types of acute leukemia: acute lymphoblastic leukemia (ALL) and acute myeloblastic leukemia (AML). The ALL part of the dataset comes from two sample types, B-cell and T-cell, and the AML part is split into bone marrow samples and peripheral blood samples. This dataset was first studied in the seminal paper of Golub et al. [11], which addressed the binary classification problem between the AML samples and the ALL samples. ALLAML4 is a four-class dataset (B-cell, T-cell, AML-BM, and AML-PB).

[Table 2: Statistics for our test datasets: sample size (n), dimensionality (d), and number of classes (k) for re0, re1, ORL, PIX, ALL, and ALLAML4.]

4.2 Efficiency

In this experiment, we test the efficiency of the proposed RDA algorithm. Table 3 shows the computational time (in seconds) of RDA for different numbers of parameter pairs r × s. We set r = s for simplicity, with r taking values from 1 to 32; thus the size of the candidate set Λ ranges from 1 to 1024. It is clear from the table that the computational cost of RDA grows slowly when r × s is small. When r × s is large, the cost, T(r, s), of the proposed algorithm is still significantly smaller than rs·T(1, 1), the computational cost of RDA without applying the optimizations proposed in this paper. (Note that the cost of RDA would be even higher if the decomposition property of RDA from this paper were not applied.)

[Table 3: Computational time (in seconds) of RDA for different numbers of parameter pairs (r × s) on the six test datasets.]
For example, we can observe that T(16, 16)/T(1, 1) on different datasets is less than 7, which is significantly smaller than 16 × 16 = 256, while T(32, 32)/T(1, 1) is less than 25 in all cases, much smaller than 32 × 32 = 1024. Among all datasets, the document datasets have relatively larger rates of increase in running time than the others, while the gene expression datasets have the smallest rates. Note that the ratio of the sample size to the data dimensionality, i.e., n/d, is relatively large for both document datasets, while it is relatively small for both gene expression datasets. These results are consistent with the theoretical estimate of the efficiency in Section 3.4.

4.3 Classification performance

In this experiment, we evaluate RDA in classification and compare it with Uncorrelated LDA (ULDA) [23] and Support Vector Machines (SVM) [2, 4, 21]. For each dataset, we first set the percentage of the data for training to be either 1/2 or 1/3 (by a random partition). Then we apply the proposed RDA algorithm, as well as ULDA and SVM, on the training data to learn the model, which is further applied to the remaining

test data to get the classification accuracy. To give a better estimation of accuracy, the procedure is repeated 30 times and the resulting accuracies are averaged. Note that for RDA, we choose the optimal model from 900 parameter pairs, with r = 30 and s = 30. Because of the improved efficiency of the proposed RDA algorithm, it is practical to select the optimal model from such a large search space. The classification accuracies of the 30 different partitions for all six datasets are shown in Figures 1-3. In Table 4, we report the mean accuracy and standard deviation over the 30 different partitions for each dataset, where ratio denotes the percentage of the data for training and is either 1/3 or 1/2 in our experiments.

[Figure 1: Comparison of RDA, ULDA, and SVM in classification accuracy on re0 and re1, for ratio = 1/2 and ratio = 1/3. The x-axis denotes 30 different partitions into training and testing sets, where ratio is the percentage of the data used for training.]

[Figure 2: Comparison of RDA, ULDA, and SVM in classification accuracy on ORL and PIX, for ratio = 1/2 and ratio = 1/3. The x-axis denotes 30 different partitions into training and testing sets.]

[Figure 3: Comparison of RDA, ULDA, and SVM in classification accuracy on ALL and ALLAML4, for ratio = 1/2 and ratio = 1/3. The x-axis denotes 30 different partitions into training and testing sets.]

[Table 4: Comparison of classification accuracy (in percentage) of RDA, ULDA, and SVM: mean and standard deviation over 30 different partitions, with training ratios of 1/2 and 1/3, for each of the six datasets.]

For all the datasets, the performance using ratio = 1/2 is better than that using ratio = 1/3 in terms of classification accuracy. This conforms to our expectation that the classification performance may be improved with a larger amount of training data. We can observe from the accuracy curves in Figures 1-3 that ULDA and RDA often follow similar trends. For both document datasets re1 and re0, all three algorithms achieve comparable performance on re1 (all three accuracy curves in Fig. 1 are very close to each other), while RDA and ULDA outperform SVM on re0. For both face image datasets, RDA and ULDA outperform SVM by a large margin, while RDA achieves slightly higher accuracies than ULDA. As for the two gene expression datasets, the accuracy curves are similar for all three algorithms, while RDA achieves a smaller overall variance than ULDA and SVM. Overall, RDA is very competitive with ULDA and SVM in classification. Recall from Theorem 3.1 that ULDA is a special case of RDA with α = 0 and β = 1. With properly chosen parameters, RDA is expected to outperform ULDA, which is confirmed by our empirical results above.

5. CONCLUSIONS

We present in this paper a novel algorithm for RDA that is applicable to high dimensional, low sample size data. RDA is a compromise between LDA and QDA, regulated by two regularization parameters. A major advantage of the proposed algorithm is its low computational cost in selecting the optimal parameters from a large candidate set, in comparison with the traditional RDA formulation. Thus it facilitates efficient model selection for RDA. The key to the proposed efficient model selection procedure lies in the decomposition property of RDA established in this paper. We evaluate the proposed algorithm using document, image, and gene expression datasets. RDA is compared with ULDA and SVM in classification. Results confirm the high efficiency of the proposed RDA algorithm. Our experiments also demonstrate that with the proposed efficient model selection algorithm, RDA can be effectively applied to high dimensional, low sample size data. The relative performance of RDA over SVM varies considerably across different types of data. RDA outperforms SVM for both

face image datasets by a large margin, while they are comparable for both gene expression datasets. One direction of future work is to study the effect of the characteristics of the data on the performance of RDA. We also plan to apply RDA to other applications involving HDLSS data, such as gene expression pattern images, protein expression data, etc.

Acknowledgements

Research of JY is sponsored, in part, by the Center for Evolutionary Functional Genomics of the Biodesign Institute at Arizona State University.

6. REFERENCES

[1] P.N. Belhumeur, J.P. Hespanha, and D.J. Kriegman. Eigenfaces vs. Fisherfaces: Recognition using class specific linear projection. IEEE Trans. Pattern Analysis and Machine Intelligence, 19(7):711-720, 1997.
[2] C.J.C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121-167, 1998.
[3] L.F. Chen, H.Y.M. Liao, M.T. Ko, J.C. Lin, and G.J. Yu. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33:1713-1726, 2000.
[4] N. Cristianini and J. Shawe-Taylor. Support Vector Machines and other Kernel-based Learning Methods. Cambridge University Press, 2000.
[5] R.O. Duda, P.E. Hart, and D. Stork. Pattern Classification. Wiley, 2000.
[6] S. Dudoit, J. Fridlyand, and T.P. Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association, 97(457):77-87, 2002.
[7] R.A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.
[8] J.H. Friedman. Regularized discriminant analysis. Journal of the American Statistical Association, 84(405):165-175, 1989.
[9] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, USA, 1990.
[10] G.H. Golub and C.F. Van Loan. Matrix Computations. The Johns Hopkins University Press, USA, third edition, 1996.
[11] T.R. Golub et al. Molecular classification of cancer: class discovery and class prediction by gene

expression monitoring. Science, 286:531-537, 1999.
[12] U. Grouven, F. Bergel, and A. Schultz. Implementation of linear and quadratic discriminant analysis incorporating costs of misclassification. Computer Methods and Programs in Biomedicine, 49(1):55-60, 1996.
[13] P. Hall, J.S. Marron, and A. Neeman. Geometric representation of high dimension, low sample size data. Journal of the Royal Statistical Society, Series B, 67:427-444, 2005.
[14] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
[15] A. Hoerl and R. Kennard. Ridge regression: Biased estimation for nonorthogonal problems. Technometrics, 12(1):55-67, 1970.
[16] Z. Jin, J.Y. Yang, Z.S. Hu, and Z. Lou. Face recognition based on the uncorrelated discriminant transformation. Pattern Recognition, 34:1405-1416, 2001.
[17] W.J. Krzanowski, P. Jonathan, W.V. McCarthy, and M.R. Thomas. Discriminant analysis with singular covariance matrices: methods and applications to spectroscopic data. Applied Statistics, 44:101-115, 1995.
[18] D.D. Lewis. Reuters-21578 text categorization test collection, Distribution 1.0, 1999.
[19] D.L. Swets and J. Weng. Using discriminant eigenfeatures for image retrieval. IEEE Trans. Pattern Analysis and Machine Intelligence, 18(8):831-836, 1996.
[20] A.N. Tikhonov and V.Y. Arsenin. Solutions of Ill-posed Problems. John Wiley and Sons, Washington D.C., 1977.
[21] V.N. Vapnik. Statistical Learning Theory. Wiley, 1998.
[22] J. Ye. Characterization of a family of algorithms for generalized discriminant analysis on undersampled problems. Journal of Machine Learning Research, 6:483-502, 2005.
[23] J. Ye, R. Janardan, Q. Li, and H. Park. Feature extraction via generalized uncorrelated linear discriminant analysis. In ICML Conference Proceedings, 2004.
[24] J. Ye, T. Li, T. Xiong, and R. Janardan. Using uncorrelated discriminant analysis for tissue classification with gene expression data. IEEE/ACM Trans. Computational Biology and Bioinformatics, 1(4):181-190, 2004.
[25] E.J. Yeoh et al. Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell, 1(2):133-143, 2002.
[26] L. Zhang and L. Luo. Splice site prediction with quadratic discriminant analysis using diversity measure. Nucleic Acids Research, 31(21):6214-6220, 2003.
[27] M. Zhang. Identification of protein coding regions in the human genome by quadratic discriminant analysis. Proceedings of the National Academy of Sciences, USA, 94:565-568, 1997.


More information

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition)

See Book Chapter 11 2 nd Edition (Chapter 10 1 st Edition) Count Data Models See Book Chapter 11 2 nd Edton (Chapter 10 1 st Edton) Count data consst of non-negatve nteger values Examples: number of drver route changes per week, the number of trp departure changes

More information

Chapter 12 Analysis of Covariance

Chapter 12 Analysis of Covariance Chapter Analyss of Covarance Any scentfc experment s performed to know somethng that s unknown about a group of treatments and to test certan hypothess about the correspondng treatment effect When varablty

More information

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers

Psychology 282 Lecture #24 Outline Regression Diagnostics: Outliers Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

Non-linear Canonical Correlation Analysis Using a RBF Network

Non-linear Canonical Correlation Analysis Using a RBF Network ESANN' proceedngs - European Smposum on Artfcal Neural Networks Bruges (Belgum), 4-6 Aprl, d-sde publ., ISBN -97--, pp. 57-5 Non-lnear Canoncal Correlaton Analss Usng a RBF Network Sukhbnder Kumar, Elane

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Lecture Notes on Linear Regression

Lecture Notes on Linear Regression Lecture Notes on Lnear Regresson Feng L fl@sdueducn Shandong Unversty, Chna Lnear Regresson Problem In regresson problem, we am at predct a contnuous target value gven an nput feature vector We assume

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018

INF 5860 Machine learning for image classification. Lecture 3 : Image classification and regression part II Anne Solberg January 31, 2018 INF 5860 Machne learnng for mage classfcaton Lecture 3 : Image classfcaton and regresson part II Anne Solberg January 3, 08 Today s topcs Multclass logstc regresson and softma Regularzaton Image classfcaton

More information

NUMERICAL DIFFERENTIATION

NUMERICAL DIFFERENTIATION NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the

More information

Homework Assignment 3 Due in class, Thursday October 15

Homework Assignment 3 Due in class, Thursday October 15 Homework Assgnment 3 Due n class, Thursday October 15 SDS 383C Statstcal Modelng I 1 Rdge regresson and Lasso 1. Get the Prostrate cancer data from http://statweb.stanford.edu/~tbs/elemstatlearn/ datasets/prostate.data.

More information

8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS

8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 493 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces you have studed thus far n the text are real vector spaces because the scalars

More information

Linear Regression Analysis: Terminology and Notation

Linear Regression Analysis: Terminology and Notation ECON 35* -- Secton : Basc Concepts of Regresson Analyss (Page ) Lnear Regresson Analyss: Termnology and Notaton Consder the generc verson of the smple (two-varable) lnear regresson model. It s represented

More information

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore

Predictive Analytics : QM901.1x Prof U Dinesh Kumar, IIMB. All Rights Reserved, Indian Institute of Management Bangalore Sesson Outlne Introducton to classfcaton problems and dscrete choce models. Introducton to Logstcs Regresson. Logstc functon and Logt functon. Maxmum Lkelhood Estmator (MLE) for estmaton of LR parameters.

More information

18.1 Introduction and Recap

18.1 Introduction and Recap CS787: Advanced Algorthms Scrbe: Pryananda Shenoy and Shjn Kong Lecturer: Shuch Chawla Topc: Streamng Algorthmscontnued) Date: 0/26/2007 We contnue talng about streamng algorthms n ths lecture, ncludng

More information

x i1 =1 for all i (the constant ).

x i1 =1 for all i (the constant ). Chapter 5 The Multple Regresson Model Consder an economc model where the dependent varable s a functon of K explanatory varables. The economc model has the form: y = f ( x,x,..., ) xk Approxmate ths by

More information

LECTURE 9 CANONICAL CORRELATION ANALYSIS

LECTURE 9 CANONICAL CORRELATION ANALYSIS LECURE 9 CANONICAL CORRELAION ANALYSIS Introducton he concept of canoncal correlaton arses when we want to quantfy the assocatons between two sets of varables. For example, suppose that the frst set of

More information

Chapter 13: Multiple Regression

Chapter 13: Multiple Regression Chapter 13: Multple Regresson 13.1 Developng the multple-regresson Model The general model can be descrbed as: It smplfes for two ndependent varables: The sample ft parameter b 0, b 1, and b are used to

More information

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu

BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS. M. Krishna Reddy, B. Naveen Kumar and Y. Ramu BOOTSTRAP METHOD FOR TESTING OF EQUALITY OF SEVERAL MEANS M. Krshna Reddy, B. Naveen Kumar and Y. Ramu Department of Statstcs, Osmana Unversty, Hyderabad -500 007, Inda. nanbyrozu@gmal.com, ramu0@gmal.com

More information

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models

Computation of Higher Order Moments from Two Multinomial Overdispersion Likelihood Models Computaton of Hgher Order Moments from Two Multnomal Overdsperson Lkelhood Models BY J. T. NEWCOMER, N. K. NEERCHAL Department of Mathematcs and Statstcs, Unversty of Maryland, Baltmore County, Baltmore,

More information

Comparison of Wiener Filter solution by SVD with decompositions QR and QLP

Comparison of Wiener Filter solution by SVD with decompositions QR and QLP Proceedngs of the 6th WSEAS Int Conf on Artfcal Intellgence, Knowledge Engneerng and Data Bases, Corfu Island, Greece, February 6-9, 007 7 Comparson of Wener Flter soluton by SVD wth decompostons QR and

More information

Why Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one)

Why Bayesian? 3. Bayes and Normal Models. State of nature: class. Decision rule. Rev. Thomas Bayes ( ) Bayes Theorem (yes, the famous one) Why Bayesan? 3. Bayes and Normal Models Alex M. Martnez alex@ece.osu.edu Handouts Handoutsfor forece ECE874 874Sp Sp007 If all our research (n PR was to dsappear and you could only save one theory, whch

More information

Speeding up Computation of Scalar Multiplication in Elliptic Curve Cryptosystem

Speeding up Computation of Scalar Multiplication in Elliptic Curve Cryptosystem H.K. Pathak et. al. / (IJCSE) Internatonal Journal on Computer Scence and Engneerng Speedng up Computaton of Scalar Multplcaton n Ellptc Curve Cryptosystem H. K. Pathak Manju Sangh S.o.S n Computer scence

More information

Lecture 6: Introduction to Linear Regression

Lecture 6: Introduction to Linear Regression Lecture 6: Introducton to Lnear Regresson An Manchakul amancha@jhsph.edu 24 Aprl 27 Lnear regresson: man dea Lnear regresson can be used to study an outcome as a lnear functon of a predctor Example: 6

More information

/ n ) are compared. The logic is: if the two

/ n ) are compared. The logic is: if the two STAT C141, Sprng 2005 Lecture 13 Two sample tests One sample tests: examples of goodness of ft tests, where we are testng whether our data supports predctons. Two sample tests: called as tests of ndependence

More information

The Study of Teaching-learning-based Optimization Algorithm

The Study of Teaching-learning-based Optimization Algorithm Advanced Scence and Technology Letters Vol. (AST 06), pp.05- http://dx.do.org/0.57/astl.06. The Study of Teachng-learnng-based Optmzaton Algorthm u Sun, Yan fu, Lele Kong, Haolang Q,, Helongang Insttute

More information

Economics 130. Lecture 4 Simple Linear Regression Continued

Economics 130. Lecture 4 Simple Linear Regression Continued Economcs 130 Lecture 4 Contnued Readngs for Week 4 Text, Chapter and 3. We contnue wth addressng our second ssue + add n how we evaluate these relatonshps: Where do we get data to do ths analyss? How do

More information

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U)

ANSWERS. Problem 1. and the moment generating function (mgf) by. defined for any real t. Use this to show that E( U) var( U) Econ 413 Exam 13 H ANSWERS Settet er nndelt 9 deloppgaver, A,B,C, som alle anbefales å telle lkt for å gøre det ltt lettere å stå. Svar er gtt . Unfortunately, there s a prntng error n the hnt of

More information

Pattern Recognition 42 (2009) Contents lists available at ScienceDirect. Pattern Recognition. journal homepage:

Pattern Recognition 42 (2009) Contents lists available at ScienceDirect. Pattern Recognition. journal homepage: Pattern Recognton 4 (9) 764 -- 779 Contents lsts avalable at ScenceDrect Pattern Recognton ournal homepage: www.elsever.com/locate/pr Perturbaton LDA: Learnng the dfference between the class emprcal mean

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

VQ widely used in coding speech, image, and video

VQ widely used in coding speech, image, and video at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng

More information

Sparse Gaussian Processes Using Backward Elimination

Sparse Gaussian Processes Using Backward Elimination Sparse Gaussan Processes Usng Backward Elmnaton Lefeng Bo, Lng Wang, and Lcheng Jao Insttute of Intellgent Informaton Processng and Natonal Key Laboratory for Radar Sgnal Processng, Xdan Unversty, X an

More information

More metrics on cartesian products

More metrics on cartesian products More metrcs on cartesan products If (X, d ) are metrc spaces for 1 n, then n Secton II4 of the lecture notes we defned three metrcs on X whose underlyng topologes are the product topology The purpose of

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

Lecture 4: Constant Time SVD Approximation

Lecture 4: Constant Time SVD Approximation Spectral Algorthms and Representatons eb. 17, Mar. 3 and 8, 005 Lecture 4: Constant Tme SVD Approxmaton Lecturer: Santosh Vempala Scrbe: Jangzhuo Chen Ths topc conssts of three lectures 0/17, 03/03, 03/08),

More information

Time-Varying Systems and Computations Lecture 6

Time-Varying Systems and Computations Lecture 6 Tme-Varyng Systems and Computatons Lecture 6 Klaus Depold 14. Januar 2014 The Kalman Flter The Kalman estmaton flter attempts to estmate the actual state of an unknown dscrete dynamcal system, gven nosy

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Feb 14: Spatial analysis of data fields

Feb 14: Spatial analysis of data fields Feb 4: Spatal analyss of data felds Mappng rregularly sampled data onto a regular grd Many analyss technques for geophyscal data requre the data be located at regular ntervals n space and/or tme. hs s

More information

Difference Equations

Difference Equations Dfference Equatons c Jan Vrbk 1 Bascs Suppose a sequence of numbers, say a 0,a 1,a,a 3,... s defned by a certan general relatonshp between, say, three consecutve values of the sequence, e.g. a + +3a +1

More information

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE

CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE CHAPTER 5 NUMERICAL EVALUATION OF DYNAMIC RESPONSE Analytcal soluton s usually not possble when exctaton vares arbtrarly wth tme or f the system s nonlnear. Such problems can be solved by numercal tmesteppng

More information

Classification as a Regression Problem

Classification as a Regression Problem Target varable y C C, C,, ; Classfcaton as a Regresson Problem { }, 3 L C K To treat classfcaton as a regresson problem we should transform the target y nto numercal values; The choce of numercal class

More information