Efficient and Robust Feature Extraction by Maximum Margin Criterion
Haifeng Li, Tao Jiang
Department of Computer Science
University of California
Riverside, CA 92521
{hli,jiang}@cs.ucr.edu

Keshu Zhang
Department of Electrical Engineering
University of New Orleans
New Orleans, LA 70148
kzhang@uno.edu

Abstract

A new feature extraction criterion, maximum margin criterion (MMC), is proposed in this paper. This new criterion is general in the sense that, when combined with a suitable constraint, it can actually give rise to the most popular feature extractor in the literature, linear discriminant analysis (LDA). We derive a new feature extractor based on MMC using a different constraint that does not depend on the nonsingularity of the within-class scatter matrix S_w. Such a dependence is a major drawback of LDA, especially when the sample size is small. The kernelized (nonlinear) counterpart of this linear feature extractor is also established in this paper. Our preliminary experimental results on face images demonstrate that the new feature extractors are efficient and stable.

1 Introduction

In statistical pattern recognition, high dimensionality is a major cause of the practical limitations of many pattern recognition technologies. In the past several decades, many dimensionality reduction techniques have been proposed. Linear discriminant analysis (LDA, also called Fisher's linear discriminant) [1] is one of the most popular linear dimensionality reduction methods. In many applications, LDA has proven to be very powerful. LDA is given by a linear transformation matrix W ∈ R^{D×d} maximizing the so-called Fisher criterion (a kind of Rayleigh coefficient)

    J_F(W) = |W^T S_b W| / |W^T S_w W|    (1)

where S_b = Σ_{i=1}^c p_i (m_i − m)(m_i − m)^T and S_w = Σ_{i=1}^c p_i S_i are the between-class scatter matrix and the within-class scatter matrix, respectively; c is the number of classes; m_i and p_i are the mean vector and the a priori probability of class i, respectively; m = Σ_{i=1}^c p_i m_i is the overall mean vector; S_i is the within-class scatter matrix of class i; and D and d are the dimensionalities of the data before and after the transformation, respectively.
To maximize (1), the transformation matrix W must be constituted by the largest eigenvectors of S_w^{-1} S_b. The purpose of LDA is to maximize the between-class scatter while simultaneously minimizing the within-class scatter. Two-class LDA has a close connection to optimal linear Bayes classifiers: in the two-class case, the transformation matrix W is just a vector, which points in the same direction as the discriminant in the corresponding optimal Bayes classifier. However, it has been shown that LDA is suboptimal for multi-class problems [2]. A major drawback of LDA is that it cannot be applied when S_w is singular because of the small sample size problem [3]. The small sample size problem
arises whenever the number of samples is smaller than the dimensionality of the samples. For example, an image in a face recognition system can easily have 4096 dimensions, which requires more than 4096 training points to ensure that S_w is nonsingular. So LDA is not a stable method in practice when the training data are scarce. In recent years, many researchers have noticed this problem and tried to overcome the computational difficulty with LDA. Tian et al. [4] used the pseudo-inverse matrix S_w^+ instead of the inverse matrix S_w^{-1}. For the same purpose, Hong and Yang [5] tried to add a singular value perturbation to S_w to make it nonsingular. Neither of these methods is theoretically sound, because Fisher's criterion is not valid when S_w is singular: in that case, any positive S_b makes Fisher's criterion infinitely large. Thus, these naive attempts to calculate the (pseudo or approximate) inverse of S_w may lead to arbitrary (meaningless) results. Besides, it is also known that an eigenvector can be very sensitive to small perturbations if its corresponding eigenvalue is close to another eigenvalue of the same matrix [6]. In 1992, Liu et al. [7] modified Fisher's criterion by using the total scatter matrix S_t = S_b + S_w as the denominator instead of S_w. It has been proven that the modified criterion is exactly equivalent to Fisher's criterion. However, when S_w is singular, the modified criterion reaches its maximum value (i.e., 1) no matter what the transformation W is. Such an arbitrary transformation cannot guarantee maximum class separability unless W^T S_b W is maximized. Besides, this method still needs to calculate an inverse matrix, which is time consuming. In 2000, Chen et al. [8] proposed the LDA+PCA method. When S_w is of full rank, LDA+PCA simply takes the largest eigenvectors of S_t^{-1} S_b to form the transformation matrix. Otherwise, a two-stage procedure is employed. First, the data are transformed into the null space V_0 of S_w.
Second, it tries to maximize the between-class scatter in V_0, which is accomplished by performing principal component analysis (PCA) on the between-class scatter matrix in V_0. Although this method solves the small sample size problem, it is clearly suboptimal because it maximizes the between-class scatter in the null space of S_w instead of in the original input space. Besides, the performance of LDA+PCA drops significantly when n − c is close to the dimensionality D, where n is the number of samples and c is the number of classes. The reason is that the dimensionality of the null space V_0 is too small in this situation, so too much information is lost when the discriminant vectors are extracted in V_0. LDA+PCA also needs to calculate the rank of S_w, which is an ill-defined operation due to floating-point imprecision. Finally, this method is complicated and slow because too much computation is involved. Kernel Fisher discriminant (KFD) [9] is a well-known nonlinear extension of LDA. The instability problem is more severe for KFD because S_w in the (nonlinear) feature space F is always singular (the rank of S_w is at most n − c). Similar to [5], KFD simply adds a perturbation µI to S_w. Of course, it has the same stability problem as [5], because eigenvectors are sensitive to small perturbations. Although the authors argued that this perturbation acts as a kind of regularization, i.e., a capacity control in F, the real influence of this form of regularization is not yet fully understood. Besides, it is hard to determine an optimal µ since there are no theoretical guidelines. In this paper, a simpler, more efficient, and more stable method is proposed to calculate the most discriminant vectors, based on a new feature extraction criterion: the maximum margin criterion (MMC). Based on MMC, new linear and nonlinear feature extractors are established. It can be shown that MMC represents class separability better than PCA. As a connection to Fisher's criterion, we may also derive LDA from MMC by incorporating a suitable constraint.
On the other hand, the new feature extractors derived above (based on MMC) do not suffer from the small sample size problem, which is known to cause serious stability problems for LDA (based on Fisher's criterion). Unlike LDA+PCA, the new feature extractors based on MMC maximize the between-class scatter in the input space instead of in the null space of S_w. Hence, they have a better overall performance than LDA+PCA, as confirmed by our preliminary experimental results.
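As a concrete illustration of the definitions above, the scatter matrices S_b and S_w, and the classical LDA directions (eigenvectors of S_w^{-1} S_b) they give rise to, may be computed as in the following numpy sketch. This is not the authors' code: the function names are ours, and class priors p_i are estimated as n_i/n.

```python
import numpy as np

def scatter_matrices(X, y):
    """X: (n, D) data matrix; y: (n,) class labels.
    Returns (S_b, S_w) with priors p_i estimated as n_i / n."""
    n, D = X.shape
    m = X.mean(axis=0)                      # overall mean vector m
    S_b = np.zeros((D, D))
    S_w = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        p = len(Xc) / n                     # prior p_i
        diff = (Xc.mean(axis=0) - m)[:, None]
        S_b += p * (diff @ diff.T)          # between-class scatter term
        S_w += p * np.cov(Xc.T, bias=True)  # p_i * S_i, within-class scatter
    return S_b, S_w

def lda_directions(X, y, d):
    """Classical LDA: top-d eigenvectors of S_w^{-1} S_b (needs S_w nonsingular)."""
    S_b, S_w = scatter_matrices(X, y)
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(-evals.real)
    return evecs[:, order[:d]].real
```

With these priors, the familiar identity S_t = S_b + S_w holds exactly, which gives a cheap sanity check on the implementation.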
2 Maximum Margin Criterion

Suppose that we are given empirical data

    (x_1, y_1), ..., (x_n, y_n) ∈ X × {C_1, ..., C_c}

Here, the domain X ⊆ R^D is some nonempty set from which the patterns x_i are taken. The y_i are called labels or targets. By studying these samples, we want to predict the label y ∈ {C_1, ..., C_c} of some new pattern x ∈ X. In other words, we choose y such that (x, y) is in some sense similar to the training examples. For this purpose, some measure must be employed to assess similarity or dissimilarity. We want to keep as much of this similarity/dissimilarity information as possible after the dimensionality reduction, i.e., after transforming x from R^D to R^d, where d ≪ D. If some distance metric is used to measure dissimilarity, we would hope that a pattern is close to those in the same class but far from those in different classes. So, a good feature extractor should maximize the distances between classes after the transformation. Therefore, we may define the feature extraction criterion as

    J = (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j d(C_i, C_j)    (2)

We call (2) the maximum margin criterion (MMC). It is in fact the summation of c(c−1)/2 interclass margins. Like the weighted pairwise Fisher criteria in [2], one may also define a weighted maximum margin criterion; due to the page limit, we omit that discussion in this paper. One may use the distance between mean vectors as the distance between classes, i.e.

    d(C_i, C_j) = d(m_i, m_j)    (3)

where m_i and m_j are the mean vectors of class C_i and class C_j, respectively. However, (3) is not suitable because it neglects the scatter of the classes. Even if the distance between the mean vectors is large, it is not easy to separate two classes that have large spread and overlap with each other. Taking the scatter of the classes into account, we define the interclass distance (or margin) as

    d(C_i, C_j) = d(m_i, m_j) − s(C_i) − s(C_j)    (4)

where s(C_i) is some measure of the scatter of class C_i. In statistics, the generalized variance |S_i| or the overall variance tr(S_i) is usually used to measure the scatter of data. In this paper, we use the overall variance tr(S_i) because it is easy to analyze.
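The interclass margin (4), with the squared Euclidean distance between the means and tr(S) as the scatter measure, can be evaluated directly. The following snippet is an illustrative sketch (the function name is ours), using biased covariance estimates for the class scatter matrices:

```python
import numpy as np

def interclass_margin(Xi, Xj):
    """Margin (4): ||m_i - m_j||^2 - tr(S_i) - tr(S_j), with the squared
    Euclidean distance between means and tr(S) as the scatter measure."""
    mi, mj = Xi.mean(axis=0), Xj.mean(axis=0)
    tr_Si = np.trace(np.cov(Xi.T, bias=True))  # overall variance of class i
    tr_Sj = np.trace(np.cov(Xj.T, bias=True))  # overall variance of class j
    return np.sum((mi - mj) ** 2) - tr_Si - tr_Sj
```

Two tight, well-separated classes yield a large positive margin; spread-out, overlapping classes can drive it toward zero or below, which is exactly the behavior (3) fails to capture.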
The weakness of the overall variance is that it ignores the covariance structure altogether. Note that, with the overall/generalized variance, the expression (4) measures the average margin between two classes, whereas the minimum margin is used in support vector machines (SVMs) [10]. With (4) and s(C_i) taken to be tr(S_i), we may decompose (2) into two parts:

    J = (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j (d(m_i, m_j) − tr(S_i) − tr(S_j))
      = (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j d(m_i, m_j) − (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j (tr(S_i) + tr(S_j))

The second part is easily simplified to tr(S_w):

    (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j (tr(S_i) + tr(S_j)) = Σ_{i=1}^c p_i tr(S_i) = tr( Σ_{i=1}^c p_i S_i ) = tr(S_w)    (5)
By employing the Euclidean distance, we may similarly simplify the first part to tr(S_b):

    (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j d(m_i, m_j)
      = (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j (m_i − m_j)^T (m_i − m_j)
      = (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j (m_i − m + m − m_j)^T (m_i − m + m − m_j)

After expanding this, we can simplify it to Σ_{i=1}^c p_i (m_i − m)^T (m_i − m) by using the fact that Σ_{j=1}^c p_j (m − m_j) = 0. So

    (1/2) Σ_{i=1}^c Σ_{j=1}^c p_i p_j d(m_i, m_j) = tr( Σ_{i=1}^c p_i (m_i − m)(m_i − m)^T ) = tr(S_b)    (6)

Now we obtain

    J = tr(S_b − S_w)    (7)

Since tr(S_b) measures the overall variance of the class mean vectors, a large tr(S_b) implies that the class mean vectors scatter in a large space. On the other hand, a small tr(S_w) implies that every class has a small spread. Thus, a large J indicates that patterns are close to each other when they are from the same class but far from each other when they are from different classes. Hence, this criterion may represent class separability better than PCA. Recall that PCA tries to maximize the total scatter after a linear transformation. But a data set with a large within-class scatter can also have a large total scatter even when it has a small between-class scatter, because S_t = S_b + S_w. Obviously, such data are not easy to classify. Compared with LDA+PCA, we maximize the between-class scatter in the input space rather than in the null space of S_w when S_w is singular. So, our method can keep more discriminative information than LDA+PCA does.

3 Linear Feature Extraction

When performing dimensionality reduction, we want to find a (linear or nonlinear) mapping from the measurement space M to some feature space F such that J is maximized after the transformation. In this section, we discuss how to find an optimal linear feature extractor; in the next section, we generalize it to the nonlinear case. Consider a linear mapping W ∈ R^{D×d}. We would like to maximize

    J(W) = tr(S_b^W − S_w^W)

where S_b^W and S_w^W are the between-class and within-class scatter matrices in the feature space F. Since W is a linear mapping, it is easy to show that S_b^W = W^T S_b W and S_w^W = W^T S_w W.
So, we have

    J(W) = tr( W^T (S_b − S_w) W )    (8)

In this formulation, we would otherwise have the freedom to multiply W by any nonzero constant. Thus, we additionally require that W be constituted by unit vectors, i.e., W = [w_1 w_2 ... w_d] with w_k^T w_k = 1. This means that we need to solve the following constrained optimization problem:

    max Σ_{k=1}^d w_k^T (S_b − S_w) w_k    subject to w_k^T w_k − 1 = 0,  k = 1, ..., d
Note that we may also use other constraints here. For example, we may require tr(W^T S_w W) = 1 and then maximize tr(W^T S_b W). It is easy to show that maximizing MMC with this constraint in fact results in LDA; the only difference is that it involves a constrained optimization, whereas traditional LDA solves an unconstrained optimization. The motivation for using the constraint w_k^T w_k = 1 is that it allows us to avoid calculating the inverse of S_w, and thus the potential small sample size problem. To solve the above optimization problem, we may introduce a Lagrangian

    L(w_k, λ_k) = Σ_{k=1}^d w_k^T (S_b − S_w) w_k − Σ_{k=1}^d λ_k (w_k^T w_k − 1)    (9)

with multipliers λ_k. The Lagrangian L has to be maximized with respect to λ_k and w_k. At the stationary point, the derivatives of L with respect to w_k must vanish:

    ∂L(w_k, λ_k)/∂w_k = ((S_b − S_w) − λ_k I) w_k = 0,  k = 1, ..., d    (10)

which leads to

    (S_b − S_w) w_k = λ_k w_k,  k = 1, ..., d    (11)

This means that the λ_k are the eigenvalues of S_b − S_w and the w_k are the corresponding eigenvectors. Thus

    J(W) = Σ_{k=1}^d w_k^T (S_b − S_w) w_k = Σ_{k=1}^d λ_k w_k^T w_k = Σ_{k=1}^d λ_k    (12)

Therefore, J(W) is maximized when W is composed of the first d largest eigenvectors of S_b − S_w. Here we need not calculate the inverse of S_w, which allows us to avoid the small sample size problem easily. We may also require W to be orthonormal, which may help preserve the shape of the distribution.

4 Nonlinear Feature Extraction with Kernels

In this section, we follow the approach of nonlinear SVMs [10] to kernelize the above linear feature extractor. More precisely, we first reformulate the maximum margin criterion in terms of only the dot products ⟨Φ(x), Φ(y)⟩ of input patterns. Then we replace the dot product by some positive definite kernel k(x, y), e.g., the Gaussian kernel e^{−γ‖x−y‖²}.
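Putting the linear case together: the MMC extractor reduces to a single symmetric eigendecomposition of S_b − S_w, with no matrix inversion. A minimal numpy sketch (illustrative, not the authors' implementation; priors estimated as n_i/n):

```python
import numpy as np

def mmc_directions(X, y, d):
    """Top-d eigenvectors of S_b - S_w; no inverse of S_w is required,
    so a singular S_w (small sample size) causes no trouble."""
    n, D = X.shape
    m = X.mean(axis=0)
    S_b = np.zeros((D, D))
    S_w = np.zeros((D, D))
    for c in np.unique(y):
        Xc = X[y == c]
        p = len(Xc) / n
        diff = (Xc.mean(axis=0) - m)[:, None]
        S_b += p * (diff @ diff.T)
        S_w += p * np.cov(Xc.T, bias=True)
    evals, evecs = np.linalg.eigh(S_b - S_w)   # symmetric -> eigh, ascending order
    W = evecs[:, ::-1][:, :d]                  # d largest eigenvectors (unit norm)
    return W, evals[::-1][:d]
```

Because eigh returns orthonormal eigenvectors, the unit-norm constraint w_k^T w_k = 1 is satisfied automatically, and the returned eigenvalues sum to the achieved J(W) as in (12).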
Consider the maximum margin criterion in the feature space F:

    J^Φ(W) = Σ_{k=1}^d w_k^T (S_b^Φ − S_w^Φ) w_k

where S_b^Φ and S_w^Φ are the between-class and within-class scatter matrices in F, i.e.,

    S_b^Φ = Σ_{i=1}^c p_i (m_i^Φ − m^Φ)(m_i^Φ − m^Φ)^T,    S_w^Φ = Σ_{i=1}^c p_i S_i^Φ

with

    S_i^Φ = (1/n_i) Σ_{j=1}^{n_i} (Φ(x_j^{(i)}) − m_i^Φ)(Φ(x_j^{(i)}) − m_i^Φ)^T,
    m_i^Φ = (1/n_i) Σ_{j=1}^{n_i} Φ(x_j^{(i)}),    m^Φ = Σ_{i=1}^c p_i m_i^Φ,

where x_j^{(i)} is the j-th pattern of class C_i, which has n_i samples. For us, an important fact is that each w_k lies in the span of Φ(x_1), Φ(x_2), ..., Φ(x_n). Therefore, we can find an expansion for w_k of the form w_k = Σ_{l=1}^n α_l^{(k)} Φ(x_l). Using this expansion and the definition of m_i^Φ, we have

    w_k^T m_i^Φ = (1/n_i) Σ_{l=1}^n α_l^{(k)} Σ_{j=1}^{n_i} ⟨Φ(x_l), Φ(x_j^{(i)})⟩
Replacing the dot product by some kernel function k(x, y) and defining (m̃_i)_l = (1/n_i) Σ_{j=1}^{n_i} k(x_l, x_j^{(i)}), we get w_k^T m_i^Φ = α_k^T m̃_i with (α_k)_l = α_l^{(k)}. Similarly, we have

    w_k^T m^Φ = w_k^T Σ_{i=1}^c p_i m_i^Φ = α_k^T Σ_{i=1}^c p_i m̃_i = α_k^T m̃

with m̃ = Σ_{i=1}^c p_i m̃_i. This means w_k^T (m_i^Φ − m^Φ) = α_k^T (m̃_i − m̃), and

    w_k^T S_b^Φ w_k = Σ_{i=1}^c p_i (w_k^T (m_i^Φ − m^Φ))(w_k^T (m_i^Φ − m^Φ))^T
                    = Σ_{i=1}^c p_i α_k^T (m̃_i − m̃)(m̃_i − m̃)^T α_k
                    = α_k^T S̃_b α_k

where S̃_b = Σ_{i=1}^c p_i (m̃_i − m̃)(m̃_i − m̃)^T. Similarly, one can simplify W^T S_w^Φ W. First, we have w_k^T (Φ(x_j^{(i)}) − m_i^Φ) = α_k^T (k_j^{(i)} − m̃_i) with (k_j^{(i)})_l = k(x_l, x_j^{(i)}). Considering w_k^T S_i^Φ w_k = (1/n_i) Σ_{j=1}^{n_i} (w_k^T (Φ(x_j^{(i)}) − m_i^Φ))(w_k^T (Φ(x_j^{(i)}) − m_i^Φ))^T, we have

    w_k^T S_i^Φ w_k = (1/n_i) Σ_{j=1}^{n_i} α_k^T (k_j^{(i)} − m̃_i)(k_j^{(i)} − m̃_i)^T α_k
                    = (1/n_i) α_k^T K̃_i ( Σ_{j=1}^{n_i} (e_j − (1/n_i) 1_{n_i})(e_j − (1/n_i) 1_{n_i})^T ) K̃_i^T α_k
                    = (1/n_i) α_k^T K̃_i ( Σ_{j=1}^{n_i} (e_j e_j^T − (1/n_i) e_j 1_{n_i}^T − (1/n_i) 1_{n_i} e_j^T) + (1/n_i) 1_{n_i} 1_{n_i}^T ) K̃_i^T α_k
                    = (1/n_i) α_k^T K̃_i (I_{n_i×n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T) K̃_i^T α_k

where (K̃_i)_{lj} = k(x_l, x_j^{(i)}), I_{n_i×n_i} is the n_i × n_i identity matrix, 1_{n_i} is the n_i-dimensional vector of 1's, and e_j is the canonical basis vector of n_i dimensions. Thus, we obtain

    w_k^T S_w^Φ w_k = Σ_{i=1}^c (p_i/n_i) α_k^T K̃_i (I_{n_i×n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T) K̃_i^T α_k = α_k^T S̃_w α_k

where S̃_w = Σ_{i=1}^c (p_i/n_i) K̃_i (I_{n_i×n_i} − (1/n_i) 1_{n_i} 1_{n_i}^T) K̃_i^T. So the maximum margin criterion in the feature space F is

    J(W) = Σ_{k=1}^d α_k^T (S̃_b − S̃_w) α_k    (13)

As in Section 3, this criterion is maximized by the largest eigenvectors of S̃_b − S̃_w.
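A minimal numpy sketch of the kernelized extractor just derived, assuming a Gaussian kernel; the function names and the projection helper are ours, not the authors' code:

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    """k(x, y) = exp(-gamma * ||x - y||^2) for all pairs of rows of A and B."""
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def kernel_mmc(X, y, d, gamma=1.0):
    """Expansion coefficients alpha_k: top-d eigenvectors of S~_b - S~_w."""
    n = len(X)
    blocks = []
    m_tilde = np.zeros(n)
    for c in np.unique(y):
        K_i = gaussian_kernel(X, X[y == c], gamma)  # n x n_i block K~_i
        n_i = K_i.shape[1]
        p_i = n_i / n
        m_i = K_i.mean(axis=1)                      # m~_i
        m_tilde += p_i * m_i                        # m~ = sum_i p_i m~_i
        blocks.append((K_i, n_i, p_i, m_i))
    S_b = np.zeros((n, n))
    S_w = np.zeros((n, n))
    for K_i, n_i, p_i, m_i in blocks:
        diff = (m_i - m_tilde)[:, None]
        S_b += p_i * (diff @ diff.T)                 # S~_b
        C = np.eye(n_i) - np.ones((n_i, n_i)) / n_i  # I - (1/n_i) 1 1^T
        S_w += (p_i / n_i) * (K_i @ C @ K_i.T)       # S~_w
    evals, evecs = np.linalg.eigh(S_b - S_w)         # both matrices symmetric
    return evecs[:, ::-1][:, :d]                     # alpha_1, ..., alpha_d

def project(X_train, alphas, X_new, gamma=1.0):
    """Feature of a new x: z_k = sum_l alpha_l^{(k)} k(x_l, x)."""
    return gaussian_kernel(X_new, X_train, gamma) @ alphas
```

Note that S̃_b and S̃_w are n × n, so the cost is governed by the number of samples rather than the (possibly infinite) dimensionality of F.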
[Figure 1: Experimental results obtained using a linear SVM on the original data (RAW) and on the data extracted by LDA+PCA, the linear feature extractor based on MMC (MMC), and the nonlinear feature extractor based on MMC (KMMC), which employs a Gaussian kernel. (a) Comparison in terms of error rate vs. number of classes. (b) Comparison in terms of training time (seconds) vs. number of classes.]

5 Experiments

To evaluate the performance of our new methods (both the linear and the nonlinear feature extractors), we ran both LDA+PCA and our methods on the ORL face dataset [11]. The ORL dataset consists of 10 face images for each of 40 subjects, for a total of 400 images, with some variation in pose, facial expression, and details. The resolution of the images is 92 × 112, with 256 gray levels. First, we resized the images to 28 × 23 to save experimental time. Then, we reduced the dimensionality of each image set to c − 1, where c is the number of classes. Finally, we trained and tested a linear SVM on the dimensionality-reduced data. As a control, we also trained and tested a linear SVM on the original data before its dimensionality was reduced. In order to demonstrate the effectiveness and the efficiency of our methods, we conducted a series of experiments and compared our results with those obtained using LDA+PCA. The error rates are shown in Fig. 1(a). When trained with 3 samples and tested with 7 other samples per class, our method is generally better than LDA+PCA. In fact, our method is usually better than LDA+PCA for other numbers of training samples as well; to save space, we do not show all the results here. Note that our methods can even achieve lower error rates than a linear SVM on the original data (without dimensionality reduction), whereas LDA+PCA does not demonstrate such a clear superiority over RAW. Fig. 1(a) also shows that the kernelized (nonlinear) feature extractor based on MMC is significantly better than the linear one, in particular when the number of classes c is large.
Besides accuracy, our methods are also much more efficient than LDA+PCA in terms of the training time required. Fig. 1(b) shows that our linear feature extractor is about 4 times faster than LDA+PCA; the same speedup was observed for other numbers of training samples. Note that our nonlinear feature extractor is also faster than LDA+PCA in this case, although it is in general very time-consuming to calculate the kernel matrix. An explanation of the speedup is that the kernel matrix size equals the number of samples, which is quite small in this case. Furthermore, our method performs much better than LDA+PCA when n − c is close to the dimensionality D. Because the amount of training data was limited, we resized the images to 168 dimensions to create such a situation. The experimental results are shown in Fig. 2. In this situation, the performance of LDA+PCA drops significantly, because the null space of S_w has a small dimensionality; when LDA+PCA tries to maximize the between-class scatter in this small null space, it loses a lot of information. On the other hand, our method tries to maximize the between-class scatter in the original input space. From Fig. 2, we can
see that LDA+PCA is ineffective in this situation, performing even worse than a random guess, while our method still produced acceptable results. Thus, the experimental results show that our method is better than LDA+PCA in terms of both accuracy and efficiency.

[Figure 2: Comparison between our new methods and LDA+PCA in terms of error rate vs. number of classes when n − c is close to D. (a) Each class contains three training samples. (b) Each class contains four training samples.]

6 Conclusion

In this paper, we proposed both linear and nonlinear feature extractors based on the maximum margin criterion. The new methods do not suffer from the small sample size problem. The experimental results show that they are very efficient, accurate, and robust.

Acknowledgments

We thank D. Gunopulos, C. Domeniconi, and J. Peng for valuable discussions and comments. This work was partially supported by NSF grants CCR and ACI.

References

[1] R. A. Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179-188, 1936.
[2] M. Loog, R. P. W. Duin, and R. Haeb-Umbach. Multiclass linear dimension reduction by weighted pairwise Fisher criteria. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(7):762-766, 2001.
[3] K. Fukunaga. Introduction to Statistical Pattern Recognition. Academic Press, New York, 2nd edition, 1990.
[4] Q. Tian, M. Barbero, Z. Gu, and S. Lee. Image classification by the Foley-Sammon transform. Optical Engineering, 25(7), 1986.
[5] Z. Hong and J. Yang. Optimal discriminant plane for a small number of samples and design method of classifier on the plane. Pattern Recognition, 24(4):317-324, 1991.
[6] G. W. Stewart. Introduction to Matrix Computations. Academic Press, New York, 1973.
[7] K. Liu, Y. Cheng, and J. Yang. A generalized optimal set of discriminant vectors. Pattern Recognition, 25(7):731-739, 1992.
[8] L. Chen, H. Liao, M. Ko, J. Lin, and G. Yu. A new LDA-based face recognition system which can solve the small sample size problem. Pattern Recognition, 33(10):1713-1726, 2000.
[9] S. Mika, G. Rätsch, J. Weston, B.
Schölkopf, and K.-R. Müller. Fisher discriminant analysis with kernels. In Y.-H. Hu, J. Larsen, E. Wilson, and S. Douglas, editors, Neural Networks for Signal Processing IX. IEEE, 1999.
[10] V. N. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York, 1998.
[11] F. Samaria and A. Harter. Parameterisation of a stochastic model for human face identification. In Proceedings of the 2nd IEEE Workshop on Applications of Computer Vision, Sarasota, FL, 1994.
More informationFor now, let us focus on a specific model of neurons. These are simplified from reality but can achieve remarkable results.
Neural Networks : Dervaton compled by Alvn Wan from Professor Jtendra Malk s lecture Ths type of computaton s called deep learnng and s the most popular method for many problems, such as computer vson
More informationChapter 11: Simple Linear Regression and Correlation
Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests
More informationCHAPTER III Neural Networks as Associative Memory
CHAPTER III Neural Networs as Assocatve Memory Introducton One of the prmary functons of the bran s assocatve memory. We assocate the faces wth names, letters wth sounds, or we can recognze the people
More informationAPPENDIX A Some Linear Algebra
APPENDIX A Some Lnear Algebra The collecton of m, n matrces A.1 Matrces a 1,1,..., a 1,n A = a m,1,..., a m,n wth real elements a,j s denoted by R m,n. If n = 1 then A s called a column vector. Smlarly,
More informationModule 3 LOSSY IMAGE COMPRESSION SYSTEMS. Version 2 ECE IIT, Kharagpur
Module 3 LOSSY IMAGE COMPRESSION SYSTEMS Verson ECE IIT, Kharagpur Lesson 6 Theory of Quantzaton Verson ECE IIT, Kharagpur Instructonal Objectves At the end of ths lesson, the students should be able to:
More informationx = , so that calculated
Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to
More informationSemi-supervised Classification with Active Query Selection
Sem-supervsed Classfcaton wth Actve Query Selecton Jao Wang and Swe Luo School of Computer and Informaton Technology, Beng Jaotong Unversty, Beng 00044, Chna Wangjao088@63.com Abstract. Labeled samples
More informationTensor Subspace Analysis
Tensor Subspace Analyss Xaofe He 1 Deng Ca Partha Nyog 1 1 Department of Computer Scence, Unversty of Chcago {xaofe, nyog}@cs.uchcago.edu Department of Computer Scence, Unversty of Illnos at Urbana-Champagn
More informationImage classification. Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing i them?
Image classfcaton Gven te bag-of-features representatons of mages from dfferent classes ow do we learn a model for dstngusng tem? Classfers Learn a decson rule assgnng bag-offeatures representatons of
More informationReport on Image warping
Report on Image warpng Xuan Ne, Dec. 20, 2004 Ths document summarzed the algorthms of our mage warpng soluton for further study, and there s a detaled descrpton about the mplementaton of these algorthms.
More informationLECTURE 9 CANONICAL CORRELATION ANALYSIS
LECURE 9 CANONICAL CORRELAION ANALYSIS Introducton he concept of canoncal correlaton arses when we want to quantfy the assocatons between two sets of varables. For example, suppose that the frst set of
More information1 Convex Optimization
Convex Optmzaton We wll consder convex optmzaton problems. Namely, mnmzaton problems where the objectve s convex (we assume no constrants for now). Such problems often arse n machne learnng. For example,
More informationSupport Vector Machines CS434
Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We
More informationChapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems
Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons
More informationCOS 521: Advanced Algorithms Game Theory and Linear Programming
COS 521: Advanced Algorthms Game Theory and Lnear Programmng Moses Charkar February 27, 2013 In these notes, we ntroduce some basc concepts n game theory and lnear programmng (LP). We show a connecton
More informationThe exam is closed book, closed notes except your one-page cheat sheet.
CS 89 Fall 206 Introducton to Machne Learnng Fnal Do not open the exam before you are nstructed to do so The exam s closed book, closed notes except your one-page cheat sheet Usage of electronc devces
More informationLecture 10: Dimensionality reduction
Lecture : Dmensonalt reducton g The curse of dmensonalt g Feature etracton s. feature selecton g Prncpal Components Analss g Lnear Dscrmnant Analss Intellgent Sensor Sstems Rcardo Guterrez-Osuna Wrght
More informationSupport Vector Machines CS434
Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? + + + + + + + + + Intuton of Margn Consder ponts
More informationAdvanced Introduction to Machine Learning
Advanced Introducton to Machne Learnng 10715, Fall 2014 The Kernel Trck, Reproducng Kernel Hlbert Space, and the Representer Theorem Erc Xng Lecture 6, September 24, 2014 Readng: Erc Xng @ CMU, 2014 1
More informationCS 468 Lecture 16: Isometry Invariance and Spectral Techniques
CS 468 Lecture 16: Isometry Invarance and Spectral Technques Justn Solomon Scrbe: Evan Gawlk Introducton. In geometry processng, t s often desrable to characterze the shape of an object n a manner that
More informationSubspace Learning Based on Tensor Analysis. by Deng Cai, Xiaofei He, and Jiawei Han
Report No. UIUCDCS-R-2005-2572 UILU-ENG-2005-1767 Subspace Learnng Based on Tensor Analyss by Deng Ca, Xaofe He, and Jawe Han May 2005 Subspace Learnng Based on Tensor Analyss Deng Ca Xaofe He Jawe Han
More informationChapter 6 Support vector machine. Séparateurs à vaste marge
Chapter 6 Support vector machne Séparateurs à vaste marge Méthode de classfcaton bnare par apprentssage Introdute par Vladmr Vapnk en 1995 Repose sur l exstence d un classfcateur lnéare Apprentssage supervsé
More informationCS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015
CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research
More informationSupport Vector Machines
CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at
More informationTransfer Functions. Convenient representation of a linear, dynamic model. A transfer function (TF) relates one input and one output: ( ) system
Transfer Functons Convenent representaton of a lnear, dynamc model. A transfer functon (TF) relates one nput and one output: x t X s y t system Y s The followng termnology s used: x y nput output forcng
More informationGlobal Sensitivity. Tuesday 20 th February, 2018
Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values
More informationC4B Machine Learning Answers II. = σ(z) (1 σ(z)) 1 1 e z. e z = σ(1 σ) (1 + e z )
C4B Machne Learnng Answers II.(a) Show that for the logstc sgmod functon dσ(z) dz = σ(z) ( σ(z)) A. Zsserman, Hlary Term 20 Start from the defnton of σ(z) Note that Then σ(z) = σ = dσ(z) dz = + e z e z
More informationInexact Newton Methods for Inverse Eigenvalue Problems
Inexact Newton Methods for Inverse Egenvalue Problems Zheng-jan Ba Abstract In ths paper, we survey some of the latest development n usng nexact Newton-lke methods for solvng nverse egenvalue problems.
More informationLINEAR REGRESSION ANALYSIS. MODULE IX Lecture Multicollinearity
LINEAR REGRESSION ANALYSIS MODULE IX Lecture - 31 Multcollnearty Dr. Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur 6. Rdge regresson The OLSE s the best lnear unbased
More informationNUMERICAL DIFFERENTIATION
NUMERICAL DIFFERENTIATION 1 Introducton Dfferentaton s a method to compute the rate at whch a dependent output y changes wth respect to the change n the ndependent nput x. Ths rate of change s called the
More informationPsychology 282 Lecture #24 Outline Regression Diagnostics: Outliers
Psychology 282 Lecture #24 Outlne Regresson Dagnostcs: Outlers In an earler lecture we studed the statstcal assumptons underlyng the regresson model, ncludng the followng ponts: Formal statement of assumptons.
More informationarxiv:cs.cv/ Jun 2000
Correlaton over Decomposed Sgnals: A Non-Lnear Approach to Fast and Effectve Sequences Comparson Lucano da Fontoura Costa arxv:cs.cv/0006040 28 Jun 2000 Cybernetc Vson Research Group IFSC Unversty of São
More informationPARTICIPATION FACTOR IN MODAL ANALYSIS OF POWER SYSTEMS STABILITY
POZNAN UNIVE RSITY OF TE CHNOLOGY ACADE MIC JOURNALS No 86 Electrcal Engneerng 6 Volodymyr KONOVAL* Roman PRYTULA** PARTICIPATION FACTOR IN MODAL ANALYSIS OF POWER SYSTEMS STABILITY Ths paper provdes a
More informationISSN: ISO 9001:2008 Certified International Journal of Engineering and Innovative Technology (IJEIT) Volume 3, Issue 1, July 2013
ISSN: 2277-375 Constructon of Trend Free Run Orders for Orthogonal rrays Usng Codes bstract: Sometmes when the expermental runs are carred out n a tme order sequence, the response can depend on the run
More informationEcon107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)
I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes
More informationEstimating the Fundamental Matrix by Transforming Image Points in Projective Space 1
Estmatng the Fundamental Matrx by Transformng Image Ponts n Projectve Space 1 Zhengyou Zhang and Charles Loop Mcrosoft Research, One Mcrosoft Way, Redmond, WA 98052, USA E-mal: fzhang,cloopg@mcrosoft.com
More informationLecture 3: Dual problems and Kernels
Lecture 3: Dual problems and Kernels C4B Machne Learnng Hlary 211 A. Zsserman Prmal and dual forms Lnear separablty revsted Feature mappng Kernels for SVMs Kernel trck requrements radal bass functons SVM
More informationEfficient, General Point Cloud Registration with Kernel Feature Maps
Effcent, General Pont Cloud Regstraton wth Kernel Feature Maps Hanchen Xong, Sandor Szedmak, Justus Pater Insttute of Computer Scence Unversty of Innsbruck 30 May 2013 Hanchen Xong (Un.Innsbruck) 3D Regstraton
More informationChapter 12 Analysis of Covariance
Chapter Analyss of Covarance Any scentfc experment s performed to know somethng that s unknown about a group of treatments and to test certan hypothess about the correspondng treatment effect When varablty
More informationComposite Hypotheses testing
Composte ypotheses testng In many hypothess testng problems there are many possble dstrbutons that can occur under each of the hypotheses. The output of the source s a set of parameters (ponts n a parameter
More informationVQ widely used in coding speech, image, and video
at Scalar quantzers are specal cases of vector quantzers (VQ): they are constraned to look at one sample at a tme (memoryless) VQ does not have such constrant better RD perfomance expected Source codng
More informationPattern Recognition 42 (2009) Contents lists available at ScienceDirect. Pattern Recognition. journal homepage:
Pattern Recognton 4 (9) 764 -- 779 Contents lsts avalable at ScenceDrect Pattern Recognton ournal homepage: www.elsever.com/locate/pr Perturbaton LDA: Learnng the dfference between the class emprcal mean
More informationUsing T.O.M to Estimate Parameter of distributions that have not Single Exponential Family
IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran
More informationECE559VV Project Report
ECE559VV Project Report (Supplementary Notes Loc Xuan Bu I. MAX SUM-RATE SCHEDULING: THE UPLINK CASE We have seen (n the presentaton that, for downlnk (broadcast channels, the strategy maxmzng the sum-rate
More informationDr. Shalabh Department of Mathematics and Statistics Indian Institute of Technology Kanpur
Analyss of Varance and Desgn of Experment-I MODULE VII LECTURE - 3 ANALYSIS OF COVARIANCE Dr Shalabh Department of Mathematcs and Statstcs Indan Insttute of Technology Kanpur Any scentfc experment s performed
More informationComparative Studies of Law of Conservation of Energy. and Law Clusters of Conservation of Generalized Energy
Comparatve Studes of Law of Conservaton of Energy and Law Clusters of Conservaton of Generalzed Energy No.3 of Comparatve Physcs Seres Papers Fu Yuhua (CNOOC Research Insttute, E-mal:fuyh1945@sna.com)
More informationLogistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI
Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton
More informationn α j x j = 0 j=1 has a nontrivial solution. Here A is the n k matrix whose jth column is the vector for all t j=0
MODULE 2 Topcs: Lnear ndependence, bass and dmenson We have seen that f n a set of vectors one vector s a lnear combnaton of the remanng vectors n the set then the span of the set s unchanged f that vector
More informationSome Comments on Accelerating Convergence of Iterative Sequences Using Direct Inversion of the Iterative Subspace (DIIS)
Some Comments on Acceleratng Convergence of Iteratve Sequences Usng Drect Inverson of the Iteratve Subspace (DIIS) C. Davd Sherrll School of Chemstry and Bochemstry Georga Insttute of Technology May 1998
More informationLecture 12: Discrete Laplacian
Lecture 12: Dscrete Laplacan Scrbe: Tanye Lu Our goal s to come up wth a dscrete verson of Laplacan operator for trangulated surfaces, so that we can use t n practce to solve related problems We are mostly
More informationLaboratory 3: Method of Least Squares
Laboratory 3: Method of Least Squares Introducton Consder the graph of expermental data n Fgure 1. In ths experment x s the ndependent varable and y the dependent varable. Clearly they are correlated wth
More informationLecture 3. Ax x i a i. i i
18.409 The Behavor of Algorthms n Practce 2/14/2 Lecturer: Dan Spelman Lecture 3 Scrbe: Arvnd Sankar 1 Largest sngular value In order to bound the condton number, we need an upper bound on the largest
More informationNumber of cases Number of factors Number of covariates Number of levels of factor i. Value of the dependent variable for case k
ANOVA Model and Matrx Computatons Notaton The followng notaton s used throughout ths chapter unless otherwse stated: N F CN Y Z j w W Number of cases Number of factors Number of covarates Number of levels
More information