Feature Selection in Multi-instance Learning


The Ninth International Symposium on Operations Research and Its Applications (ISORA'10)
Chengdu-Jiuzhaigou, China, August 19-23, 2010
Copyright 2010 ORSC & APORC

Feature Selection in Multi-instance Learning

Chun-Hua Zhang^1, Jun-Yan Tan^2, Nai-Yang Deng^2
1 Information School, Renmin University of China, Beijing, China
2 College of Science, China Agricultural University, Beijing, China

Abstract  This paper focuses on feature selection in multi-instance learning. A new version of the support vector machine, named p-MISVM, is proposed. In the p-MISVM model, the problem to be solved is non-differentiable and non-convex. By using the constrained concave-convex procedure (CCCP), a linearization algorithm is presented that solves a succession of fast linear programs and converges to a local optimal solution. Furthermore, lower bounds on the absolute values of the nonzero components of every local optimal solution are established, which can eliminate zero components in any numerical solution. Numerical experiments show that p-MISVM is effective in selecting relevant features compared with the popular MICA.

Keywords  support vector machine; feature selection; p-norm; multi-instance learning

1 Introduction

Feature selection is very important in many applications of data mining. By restricting the input space to a small subset of input variables, it has obvious benefits in terms of data storage, computational requirements, and the cost of future data collection. This paper focuses on feature selection in multi-instance learning via a new version of the support vector machine (SVM).

Multi-instance learning (MIL) is a growing field of research in data mining. In the MIL problem, the training set is composed of many bags, each containing many instances. A bag is positively labeled if it contains at least one positive instance; otherwise it is labeled as a negative bag. The task is to find a decision function from the training set that correctly labels unseen bags. The MIL problem was first introduced by Dietterich et al. [1] in drug activity prediction.
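The bag-labeling rule described above is simple enough to state in a few lines; a minimal sketch (the function name is ours, not from the paper):

```python
def bag_label(instance_labels):
    """MIL labeling rule: a bag is positive (+1) iff it contains
    at least one positive instance, otherwise it is negative (-1)."""
    return 1 if any(label == 1 for label in instance_labels) else -1

# A single positive instance makes the whole bag positive:
assert bag_label([-1, -1, 1]) == 1
assert bag_label([-1, -1, -1]) == -1
```

This asymmetry (a positive bag guarantees only one positive instance, while a negative bag guarantees all instances are negative) is what the constraints of the p-MISVM model in Section 2 encode.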
So far, MIL has been applied to many fields such as image retrieval [2], face detection [3], scene classification and text categorization, and is often found to be superior to conventional supervised learning approaches. Maron and Lozano-Pérez [4] proposed a framework called the Diverse Density algorithm. Since then, various variants of standard single-instance learning algorithms, such as boosting ([3], [5]), SVM ([2], [6]), logistic regression ([7]) and nearest neighbor ([8]), have been modified to adapt to the MIL problem.

This work is supported by the Key Project of the National Natural Science Foundation of China (No. ) and the National Natural Science Foundation of China (No. ).
Corresponding author. E-mail: tanjunyan0@126.com
Corresponding author. E-mail: dengnaiyang@cau.edu.cn

Based on the standard SVM, some methods including MI-SVM and mi-SVM [9] have been proposed for the MIL problem. There are few works on feature selection in MIL. In [10], the MICA algorithm is introduced, which employs the 1-norm rather than the 2-norm used in MI-SVM and mi-SVM. Because the 1-norm SVM formulation is known to lead to sparse solutions ([11], [12]), MICA selects few features when a linear classifier is used. Recently, an effective method named the p-norm SVM (0 < p < 1) was proposed for feature selection in standard classification problems in [13], which motivates us to apply it to the MIL problem.

This paper proposes the p-norm multi-instance SVM (p-MISVM), which replaces the 2-norm penalty by the p-norm (0 < p < 1) penalty in the objective function of the primal problem of MI-SVM. The p-MISVM conducts feature selection and classification simultaneously. However, there are two difficulties in solving the p-MISVM model: (i) it is impossible to solve the primal problem via its dual problem, and the primal problem itself is hard to solve, because it is neither differentiable nor convex; (ii) feature selection needs to find the nonzero components of the solution to the primal problem, but algorithms usually provide only an approximate solution, in which the nonzero components cannot be identified theoretically. Firstly, for difficulty (i), by using the constrained concave-convex procedure (CCCP) ([14], [15]), a linearization algorithm is presented that solves a succession of fast linear programs and converges to a local optimal solution of the primal problem. Furthermore, for difficulty (ii), lower bounds on the absolute values of the nonzero entries of every local optimal solution are established, which can eliminate zero entries in any numerical solution. Lastly, the performance of p-MISVM is illustrated on simulated datasets.

Now we describe our notation. For a vector x in R^n, [x]_i (i = 1, 2, ..., n) denotes the i-th component of x. |x| denotes the vector in R^n whose components are the absolute values of the components of x.
||x||_p denotes (|[x]_1|^p + ... + |[x]_n|^p)^{1/p}. Strictly speaking, ||x||_p is not a norm when 0 < p < 1, but we still use the term "p-norm", because the form is the same except for the value of p. ||x||_0 is the number of nonzero components of x.

This paper is organized as follows. In Section 2, the p-MISVM for feature selection is introduced. In Section 3, the CCCP is applied to solve the p-MISVM model. In Section 4, lower bounds on the absolute values of nonzero entries of local optimal solutions are established. In Section 5, numerical experiments are given to demonstrate the effectiveness of our method. We conclude the paper in Section 6.

2 p-norm multi-instance support vector machine

For feature selection, p-MISVM is an embedded method: the training data are given to a learning machine, which returns a predictor together with a subset of features on which it performs predictions. In fact, feature selection is performed in the process of learning. Consider the multi-instance classification problem with the training set

    T = {(X_1, y_1), ..., (X_l, y_l)},        (1)

where X_i = {x_{i1}, ..., x_{il_i}}, x_{ij} ∈ R^n (i = 1, ..., l, j = 1, ..., l_i), y_i ∈ {−1, 1}. Here, when y_i = 1, X_i is called a positive bag, and (X_i, y_i) implies that there exists at least one instance with positive label in X_i; when y_i = −1, X_i is called a negative bag, and no instance in X_i has a positive label. The task is to find a function g(x) such that

the label of any instance in R^n can be deduced by the decision function f(x) = sgn(g(x)). For convenience, the training set (1) is represented as

    {(X_1, y_1), ..., (X_m, y_m), (x_{r+1}, y_{r+1}), ..., (x_{r+s}, y_{r+s})},        (2)

where y_1 = ... = y_m = 1 and y_{r+1} = ... = y_{r+s} = −1; (X_i, 1) implies that X_i contains at least one instance with positive label, and (x_i, −1) implies that the label of the instance x_i is negative. The instances in the positive bags X_1, ..., X_m are x_1, ..., x_r, and I(i) (i = 1, ..., m) denotes the index set of the instances in X_i. The feature vector

    g_i = ([x_1]_i, [x_2]_i, ..., [x_{r+s}]_i)^T,  (i = 1, ..., n)        (3)

collects the values of the i-th feature over all instances. Suppose the decision function is f(x) = sgn((w·x) + b); then p-MISVM solves the optimization problem

    min_{w,b,ξ}   ||w||_p^p + C_1 Σ_{i=1}^{m} ξ_i + C_2 Σ_{i=r+1}^{r+s} ξ_i,
    s.t.   max_{j∈I(i)} ((w·x_j) + b) ≥ 1 − ξ_i,   i = 1, ..., m,        (4)
           (w·x_i) + b ≤ −1 + ξ_i,   i = r+1, ..., r+s,
           ξ_i ≥ 0,   i = 1, ..., m, r+1, ..., r+s,

where C_1 (C_1 > 0), C_2 (C_2 > 0) and p (0 < p < 1) are parameters. Our new method is as follows:

Algorithm 1 (p-MISVM)
(1) Given a training set (2), select the parameters C_1 > 0, C_2 > 0 and p (0 < p < 1);
(2) Solve the optimization problem (4) and get its global solution (w*, b*, ξ*);
(3) Select the feature set {i : [w*]_i ≠ 0, i = 1, ..., n};
(4) Construct the decision function f(x) = sgn((w*·x) + b*).

Note that Algorithm 1 faces the two difficulties (i) and (ii) described in Section 1; the following sections address them in turn.

3 CCCP for the p-MISVM model

The constrained concave-convex procedure (CCCP) ([14], [15]) is an optimization tool for problems whose objective and constraint functions can be expressed as differences of convex functions. Consider the following optimization problem:

    min_x   f_0(x) − g_0(x)
    s.t.    f_i(x) − g_i(x) ≤ c_i,   i = 1, ..., m,        (5)

where f_i, g_i (i = 0, ..., m) are real-valued, convex and differentiable functions on R^n, and c_i ∈ R.
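When this procedure is specialized to p-MISVM in Section 3, the non-smooth convex piece of the constraints is the map w -> max_{j∈I(i)} (w·x_j) over a bag; its subgradient simply picks out a maximizing instance. A minimal sketch, assuming NumPy (the helper name is ours):

```python
import numpy as np

def bag_max_subgradient(w, bag):
    """Subgradient of the convex, non-smooth map w -> max_j (w . x_j)
    over the rows x_j of `bag`: any maximizing instance is a valid
    subgradient (as is any convex combination of maximizing instances)."""
    scores = bag @ w
    j = int(np.argmax(scores))
    return bag[j], j

bag = np.array([[1.0, 0.0],
                [0.0, 2.0]])
subgrad, j = bag_max_subgradient(np.array([1.0, 0.1]), bag)
# scores are [1.0, 0.2], so the first instance (index 0) is selected.
assert j == 0
```

This selection of a representative instance per positive bag is exactly the role played by the coefficients β_j in the linearized subproblem of Section 3.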
Given an initial x^(0), CCCP computes x^(t+1) from x^(t) by replacing each g_i(x) with its first-order Taylor expansion at x^(t), and then setting x^(t+1) to the solution of the following optimization problem:

    min_x   f_0(x) − [g_0(x^(t)) + ∇g_0(x^(t))·(x − x^(t))]
    s.t.    f_i(x) − [g_i(x^(t)) + ∇g_i(x^(t))·(x − x^(t))] ≤ c_i,   i = 1, ..., m.        (6)
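For intuition, the iteration can be exercised on a one-dimensional toy DC objective; this is purely our own illustration (the objective, helper names and starting point are not from the paper):

```python
def cccp(minimize_surrogate, grad_g, x0, tol=1e-8, max_iter=100):
    """Generic unconstrained CCCP loop for min f(x) - g(x), f and g convex:
    linearize g at the current point, then minimize the convex surrogate."""
    x = x0
    for _ in range(max_iter):
        x_new = minimize_surrogate(grad_g(x))
        if abs(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Toy DC objective F(x) = x^2 - 2|x|: f(x) = x^2, g(x) = 2|x|,
# with local minima at x = -1 and x = +1.
# The surrogate at x_t is x^2 - g'(x_t) * x, minimized at g'(x_t) / 2.
grad_g = lambda x: 2.0 if x > 0 else (-2.0 if x < 0 else 0.0)
minimize_surrogate = lambda slope: slope / 2.0

x_star = cccp(minimize_surrogate, grad_g, x0=0.3)
# From x0 = 0.3 the iteration settles at x = 1: a local optimum whose
# basin depends on the starting point, mirroring the local convergence
# guarantee cited above.
assert abs(x_star - 1.0) < 1e-6
```

Starting instead from a negative x0 the same loop converges to the other local minimum, x = -1.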

In (6), ∇g(x) denotes the gradient of g at x. For non-smooth functions, the gradient should be replaced by a subgradient. It is shown in [15] that CCCP converges to a local minimum of (5).

Consider the problem (4). We first introduce the variable v = ([v]_1, ..., [v]_n) to eliminate the absolute values from the objective function, which leads to the following equivalent problem:

    min_{w,b,ξ,v}   ||v||_p^p + C_1 Σ_{i=1}^{m} ξ_i + C_2 Σ_{i=r+1}^{r+s} ξ_i,        (7)
    s.t.   max_{j∈I(i)} ((w·x_j) + b) ≥ 1 − ξ_i,   i = 1, ..., m,        (8)
           (w·x_i) + b ≤ −1 + ξ_i,   i = r+1, ..., r+s,        (9)
           ξ_i ≥ 0,   i = 1, ..., m, r+1, ..., r+s,        (10)
           −v ≤ w ≤ v,        (11)

where ||v||_p^p = [v]_1^p + ... + [v]_n^p, due to the last constraint (11). Furthermore, the objective function and the constraint functions of problem (7)-(11) can be regarded as differences of two convex functions, so the problem can be solved with CCCP. Note that max_{j∈I(i)} (w·x_j) in (8) is convex but non-smooth in w. To use the CCCP, we replace the gradient by a subgradient. It is easy to see that for i = 1, ..., m, the subdifferential is

    ∂_w max_{j∈I(i)} (w·x_j) = { Σ_{j∈I(i)} β_j x_j : β_j ≥ 0, Σ_{j∈I(i)} β_j = 1, β_j = 0 if (w·x_j) < max_{k∈I(i)} (w·x_k) }.

At the k-th iteration, denote the current estimates of w, b, ξ, v and the corresponding β_j by w^(k), b^(k), ξ^(k), v^(k) and β_j^(k), respectively. In the experiments, we initialize β_j^(0) = 1/|I(i)| for j ∈ I(i). For convenience, at later iterations we pick the subgradient with

    β_j^(k) = 1 if j = argmax_{k'∈I(i)} (w^(k)·x_{k'}),  and  β_j^(k) = 0 otherwise.

Writing x_i^(k) = Σ_{j∈I(i)} β_j^(k) x_j, the optimization problem becomes

    min_{w,b,ξ,v}   p Σ_{i=1}^{n} ([v^(k)]_i)^{p−1} [v]_i + C_1 Σ_{i=1}^{m} ξ_i + C_2 Σ_{i=r+1}^{r+s} ξ_i,
    s.t.   (w·x_i^(k)) + b ≥ 1 − ξ_i,   i = 1, ..., m,
           (w·x_i) + b ≤ −1 + ξ_i,   i = r+1, ..., r+s,        (12)
           ξ_i ≥ 0,   i = 1, ..., m, r+1, ..., r+s,
           −v ≤ w ≤ v,

which is a standard linear program. The following algorithm is then established:

Algorithm 2
(1) Given a training set (2), select the parameters C_1 > 0, C_2 > 0 and p (0 < p < 1);
(2) Let k = 0, select x_i^(k) = (1/|I(i)|) Σ_{j∈I(i)} x_j, i = 1, ..., m, and v^(k) = 0;
(3) Solve the optimization problem (12), and get its solution (w^(k+1), b^(k+1), ξ^(k+1), v^(k+1));

(4) Compute g(x_j) = (w^(k+1)·x_j) + b^(k+1) for j ∈ I(i) and i = 1, ..., m, and select x_i^(k+1) = argmax_{j∈I(i)} g(x_j);
(5) If x_i^(k) − x_i^(k+1) = 0 for all i = 1, ..., m, then let w* = w^(k+1), b* = b^(k+1), ξ* = ξ^(k+1) and stop; otherwise, let k = k + 1 and go to step (3).

4 Lower bounds on the nonzero entries of local optimal solutions of problem (4)

Using the same strategy as in [16], we obtain the following Theorem 1, which can be used to identify the nonzero components of local optimal solutions of problem (4), even though Algorithm 2 can only find an approximate local optimal solution.

Theorem 1. For any local optimal solution (w*, b*, ξ*) of problem (4), if

    |[w*]_i| ≤ ( p / ( sqrt(C_1^2 + C_2^2) sqrt(s) ||g_i|| ) )^{1/(1−p)},

then [w*]_i = 0 (i = 1, 2, ..., n), where g_i is defined in (3).

Proof: Suppose ||w*||_0 = k. Without loss of generality, let w* = ([w*]_1, [w*]_2, ..., [w*]_k, 0, ..., 0)^T and z* = ([w*]_1, [w*]_2, ..., [w*]_k)^T. For each instance x, write the truncated instance x̃ = ([x]_1, [x]_2, ..., [x]_k). We consider the following optimization problem:

    min_{z,b,ξ}   ||z||_p^p + C_1 Σ_{i=1}^{m} ξ_i + C_2 Σ_{i=r+1}^{r+s} ξ_i,
    s.t.   max_{j∈I(i)} ((z·x̃_j) + b) ≥ 1 − ξ_i,   i = 1, ..., m,        (13)
           (z·x̃_i) + b ≤ −1 + ξ_i,   i = r+1, ..., r+s,
           ξ_i ≥ 0,   i = 1, ..., m, r+1, ..., r+s.

It has been pointed out in [10] that the constraint max_{j∈I(i)} ((z·x̃_j) + b) ≥ 1 − ξ_i is equivalent to the existence of convex combination coefficients v_{ij} ≥ 0, Σ_{j∈I(i)} v_{ij} = 1, such that (z·Σ_{j∈I(i)} v_{ij} x̃_j) + b ≥ 1 − ξ_i. Then the above problem (13) is equivalent to

    min_{z,b,ξ,v}   ||z||_p^p + C_1 Σ_{i=1}^{m} ξ_i + C_2 Σ_{i=r+1}^{r+s} ξ_i,
    s.t.   (z·Σ_{j∈I(i)} v_{ij} x̃_j) + b ≥ 1 − ξ_i,   i = 1, ..., m,
           (z·x̃_i) + b ≤ −1 + ξ_i,   i = r+1, ..., r+s,        (14)
           ξ_i ≥ 0,   i = 1, ..., m, r+1, ..., r+s,
           v_{ij} ≥ 0, j ∈ I(i), and Σ_{j∈I(i)} v_{ij} = 1,   i = 1, ..., m.

The Lagrange function of (14) is

    L(z, b, ξ, v, α, ζ, λ, µ) = ||z||_p^p + C_1 Σ_{i=1}^{m} ξ_i + C_2 Σ_{i=r+1}^{r+s} ξ_i
        − Σ_{i=1}^{m} α_i ((z·Σ_{j∈I(i)} v_{ij} x̃_j) + b − 1 + ξ_i)
        + Σ_{i=r+1}^{r+s} α_i ((z·x̃_i) + b + 1 − ξ_i)
        − Σ_i ζ_i ξ_i − Σ_{i=1}^{m} Σ_{j∈I(i)} λ_{ij} v_{ij} − Σ_{i=1}^{m} µ_i (Σ_{j∈I(i)} v_{ij} − 1).

Since (z*, b*, ξ*, v*) is a local optimal solution of (14), the KKT conditions give

    p |[z*]_i|^{p−1} sgn([z*]_i) = [ Σ_{i'=1}^{m} α_{i'} Σ_{j∈I(i')} v_{i'j} x̃_j − Σ_{i'=r+1}^{r+s} α_{i'} x̃_{i'} ]_i,        (15)
    0 ≤ α_i ≤ C_1,   i = 1, ..., m,        (16)
    0 ≤ α_i ≤ C_2,   i = r+1, ..., r+s.        (17)

According to (15)-(17) and the Cauchy-Schwarz inequality, for each component i we have

    p |[z*]_i|^{p−1} = | Σ_{i'=1}^{m} α_{i'} [Σ_{j∈I(i')} v_{i'j} x_j]_i − Σ_{i'=r+1}^{r+s} α_{i'} [x_{i'}]_i |        (18)
        ≤ ||(α_1, ..., α_m, −α_{r+1}, ..., −α_{r+s})|| · ||([Σ_{j∈I(1)} v_{1j} x_j]_i, ..., [Σ_{j∈I(m)} v_{mj} x_j]_i, [x_{r+1}]_i, ..., [x_{r+s}]_i)||        (19)
        ≤ sqrt(C_1^2 + C_2^2) sqrt(s) ||g_i||,        (20)

which means that every nonzero component satisfies |[z*]_i| ≥ ( p / ( sqrt(C_1^2 + C_2^2) sqrt(s) ||g_i|| ) )^{1/(1−p)}, and the conclusion is obtained.

According to Theorem 1, we can identify the nonzero components of local optimal solutions of (4). Based on Algorithm 2 and Theorem 1, the new Algorithm 3 is established as follows:

Algorithm 3
(1) Given a training set (2), select the parameters C_1 > 0, C_2 > 0 and p (0 < p < 1);
(2) Use Algorithm 2 to get the local optimal solution (w*, b*, ξ*) of problem (4);
(3) Compute L_i = ( p / ( sqrt(C_1^2 + C_2^2) sqrt(s) ||g_i|| ) )^{1/(1−p)} for i = 1, ..., n, and select the feature set {i : |[w*]_i| > L_i, i = 1, ..., n};
(4) Construct the decision function f(x) = sgn((w̃*·x̃) + b*), where w̃* consists of the nonzero components of w* and x̃ contains the corresponding components of x.

Note that the experiments in the following section are conducted according to Algorithm 3.

5 Numerical experiments

In this section, experiments on four simulated datasets are conducted, comparing p-MISVM with MICA. The four simulated datasets (I, II, III, IV) are generated by the following steps:

1. According to two different distributions, independently generate n_0 positive and negative feature vectors g_i^+ ∈ R^{l+}, g_i^- ∈ R^{l-}, i = 1, 2, ..., n_0, where l+ and l- are respectively the numbers of positive and negative points;

2. According to other distributions, independently generate some stochastic feature vectors that are irrelevant to the class;
3. Each positive bag contains three points generated stochastically in a rectangular region of radius 2 centered at a positive point generated in the first step; each negative bag is just a negative point generated in the first step.

The four datasets are described in Table 1.

Table 1: Four simulated datasets

    Data   features   relevant features   Distribution of g^+   Distribution of g^-
    I      20         3                   N(1, 0.5)             N(-1, 0.5)
    II     20         3                   U(-0.5, 1)            U(-1, 0)
    III                                   N(1, 0.5)             N(-1, 0.5)
    IV                                    U(-0.5, 1)            U(-1, 0)

Table 2: Results on the four simulated datasets

    Dataset   Method    No. of selected   Percent of relevant   Average        Parameters
                        features          features (%)          accuracy (%)
    I         p-MISVM                                                          p = 0.5, C = 2.8
    I         MICA                                                             C = 2
    II        p-MISVM                                                          p = 0.5, C = 1
    II        MICA                                                             C = 0.7
    III       p-MISVM                                                          p = 0.6, C = 0.57
    III       MICA                                                             C = 2
    IV        p-MISVM                                                          p = 0.6, C = 0.7
    IV        MICA                                                             C = 0.7

According to Algorithm 3, 100 experiments are conducted for every dataset. There are three parameters, C_1, C_2 and p, in Algorithm 3. We set C_1 = C_2 = C in our experiments, and the best values of these parameters are chosen by ten-fold cross-validation. Our experimental results are listed in Table 2, where the best results are shown in bold. Clearly, p-MISVM performs best among the two methods. In Table 2, the fourth column shows the percentage of correct features among the selected features, so the bigger the value, the better the result. The average accuracy is computed by averaging the test accuracy over the 100 experiments. It is easy to see that p-MISVM selects the fewest features with high accuracy, compared with MICA.

6 Conclusion

Feature selection is very important in many applications of data mining. This paper introduces a new version of SVM, named p-MISVM, for feature selection and multi-instance

classification. By using the CCCP method, a linearization algorithm is proposed to obtain an approximate local optimal solution of p-MISVM, and lower bounds on the absolute values of the nonzero components of every local optimal solution are established, which can eliminate zero components in any numerical solution. The numerical experiments show that the p-norm support vector machine is effective in selecting relevant features, compared with the popular MICA.

References

[1] Dietterich, T. G., Lathrop, R. H., Lozano-Pérez, T. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89: 31-71, 1997.
[2] Andrews, S., Tsochantaridis, I., Hofmann, T. Support vector machines for multiple instance learning. In Advances in Neural Information Processing Systems 15, 2003.
[3] Viola, P., Platt, J., Zhang, C. Multiple instance boosting for object detection. In Advances in Neural Information Processing Systems 18, 2006.
[4] Maron, O., Lozano-Pérez, T. A framework for multiple-instance learning. In Advances in Neural Information Processing Systems 10, 1998.
[5] Xu, X., Frank, E. Logistic regression and boosting for labeled bags of instances. In Proceedings of the 8th Pacific-Asia Conference on Knowledge Discovery and Data Mining, 2004.
[6] Fung, G., Dundar, M., Krishnapuram, B., Rao, R. B. Multiple instance learning for computer aided diagnosis. In Advances in Neural Information Processing Systems 19, 2007.
[7] Settles, B., Craven, M., Ray, S. Multiple instance active learning. In Advances in Neural Information Processing Systems 20, 2008.
[8] Wang, J., Zucker, J. Solving the multiple-instance problem: a lazy learning approach. In Proceedings of the 17th International Conference on Machine Learning, 2000.
[9] Andrews, S., Tsochantaridis, I., Hofmann, T. Support vector machines for multiple-instance learning. Advances in Neural Information Processing Systems 15. Cambridge, MA: MIT Press, 2003.
[10] Mangasarian, O. L., Wild, E. W. Multiple instance classification via successive linear programming. Journal of Optimization Theory and Applications, 137(1), 2008.
[11] Bradley, P. S., Mangasarian, O. L. Feature selection via concave minimization and support vector machines. In Proceedings of the 13th ICML, 82-90, 1998.
[12] Zhu, J., Rosset, S., Hastie, T., Tibshirani, R. 1-norm SVMs. Advances in Neural Information Processing Systems 16, 2004.
[13] Tan, J. Y., Zhang, C. H., Deng, N. Y. Cancer gene identification via p-norm support vector machine. ISB2010, to appear.
[14] Yuille, A., Rangarajan, A. The concave-convex procedure. Neural Computation, 15, 2003.
[15] Smola, A. J., Vishwanathan, S. V. N., Hofmann, T. Kernel methods for missing variables. In Proceedings of the Tenth International Workshop on Artificial Intelligence and Statistics, Barbados, 2005.
[16] Chen, X. J., Xu, F. M., Ye, Y. Y. Lower bound theory of nonzero entries in solutions of l2-lp minimization. Technical report, Department of Applied Mathematics, The Hong Kong Polytechnic University, 2009.


Support Vector Machines. Jie Tang Knowledge Engineering Group Department of Computer Science and Technology Tsinghua University 2012 Support Vector Machnes Je Tang Knowledge Engneerng Group Department of Computer Scence and Technology Tsnghua Unversty 2012 1 Outlne What s a Support Vector Machne? Solvng SVMs Kernel Trcks 2 What s a

More information

Numerical Heat and Mass Transfer

Numerical Heat and Mass Transfer Master degree n Mechancal Engneerng Numercal Heat and Mass Transfer 06-Fnte-Dfference Method (One-dmensonal, steady state heat conducton) Fausto Arpno f.arpno@uncas.t Introducton Why we use models and

More information

Wavelet chaotic neural networks and their application to continuous function optimization

Wavelet chaotic neural networks and their application to continuous function optimization Vol., No.3, 04-09 (009) do:0.436/ns.009.307 Natural Scence Wavelet chaotc neural networks and ther applcaton to contnuous functon optmzaton Ja-Ha Zhang, Yao-Qun Xu College of Electrcal and Automatc Engneerng,

More information

Markov Chain Monte Carlo Lecture 6

Markov Chain Monte Carlo Lecture 6 where (x 1,..., x N ) X N, N s called the populaton sze, f(x) f (x) for at least one {1, 2,..., N}, and those dfferent from f(x) are called the tral dstrbutons n terms of mportance samplng. Dfferent ways

More information

Week 5: Neural Networks

Week 5: Neural Networks Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple

More information

Chapter 2 A Class of Robust Solution for Linear Bilevel Programming

Chapter 2 A Class of Robust Solution for Linear Bilevel Programming Chapter 2 A Class of Robust Soluton for Lnear Blevel Programmng Bo Lu, Bo L and Yan L Abstract Under the way of the centralzed decson-makng, the lnear b-level programmng (BLP) whose coeffcents are supposed

More information

Image classification. Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing i them?

Image classification. Given the bag-of-features representations of images from different classes, how do we learn a model for distinguishing i them? Image classfcaton Gven te bag-of-features representatons of mages from dfferent classes ow do we learn a model for dstngusng tem? Classfers Learn a decson rule assgnng bag-offeatures representatons of

More information

Which Separator? Spring 1

Which Separator? Spring 1 Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal

More information

CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING INTRODUCTION

CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING INTRODUCTION CONTRAST ENHANCEMENT FOR MIMIMUM MEAN BRIGHTNESS ERROR FROM HISTOGRAM PARTITIONING N. Phanthuna 1,2, F. Cheevasuvt 2 and S. Chtwong 2 1 Department of Electrcal Engneerng, Faculty of Engneerng Rajamangala

More information

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan

Kernels in Support Vector Machines. Based on lectures of Martin Law, University of Michigan Kernels n Support Vector Machnes Based on lectures of Martn Law, Unversty of Mchgan Non Lnear separable problems AND OR NOT() The XOR problem cannot be solved wth a perceptron. XOR Per Lug Martell - Systems

More information

VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES

VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES VARIATION OF CONSTANT SUM CONSTRAINT FOR INTEGER MODEL WITH NON UNIFORM VARIABLES BÂRZĂ, Slvu Faculty of Mathematcs-Informatcs Spru Haret Unversty barza_slvu@yahoo.com Abstract Ths paper wants to contnue

More information

A New Evolutionary Computation Based Approach for Learning Bayesian Network

A New Evolutionary Computation Based Approach for Learning Bayesian Network Avalable onlne at www.scencedrect.com Proceda Engneerng 15 (2011) 4026 4030 Advanced n Control Engneerng and Informaton Scence A New Evolutonary Computaton Based Approach for Learnng Bayesan Network Yungang

More information

Problem Set 9 Solutions

Problem Set 9 Solutions Desgn and Analyss of Algorthms May 4, 2015 Massachusetts Insttute of Technology 6.046J/18.410J Profs. Erk Demane, Srn Devadas, and Nancy Lynch Problem Set 9 Solutons Problem Set 9 Solutons Ths problem

More information

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION

CHAPTER-5 INFORMATION MEASURE OF FUZZY MATRIX AND FUZZY BINARY RELATION CAPTER- INFORMATION MEASURE OF FUZZY MATRI AN FUZZY BINARY RELATION Introducton The basc concept of the fuzz matr theor s ver smple and can be appled to socal and natural stuatons A branch of fuzz matr

More information

Power law and dimension of the maximum value for belief distribution with the max Deng entropy

Power law and dimension of the maximum value for belief distribution with the max Deng entropy Power law and dmenson of the maxmum value for belef dstrbuton wth the max Deng entropy Bngy Kang a, a College of Informaton Engneerng, Northwest A&F Unversty, Yanglng, Shaanx, 712100, Chna. Abstract Deng

More information

Convexity preserving interpolation by splines of arbitrary degree

Convexity preserving interpolation by splines of arbitrary degree Computer Scence Journal of Moldova, vol.18, no.1(52), 2010 Convexty preservng nterpolaton by splnes of arbtrary degree Igor Verlan Abstract In the present paper an algorthm of C 2 nterpolaton of dscrete

More information

Linear Classification, SVMs and Nearest Neighbors

Linear Classification, SVMs and Nearest Neighbors 1 CSE 473 Lecture 25 (Chapter 18) Lnear Classfcaton, SVMs and Nearest Neghbors CSE AI faculty + Chrs Bshop, Dan Klen, Stuart Russell, Andrew Moore Motvaton: Face Detecton How do we buld a classfer to dstngush

More information

10) Activity analysis

10) Activity analysis 3C3 Mathematcal Methods for Economsts (6 cr) 1) Actvty analyss Abolfazl Keshvar Ph.D. Aalto Unversty School of Busness Sldes orgnally by: Tmo Kuosmanen Updated by: Abolfazl Keshvar 1 Outlne Hstorcal development

More information

Supporting Information

Supporting Information Supportng Informaton The neural network f n Eq. 1 s gven by: f x l = ReLU W atom x l + b atom, 2 where ReLU s the element-wse rectfed lnear unt, 21.e., ReLUx = max0, x, W atom R d d s the weght matrx to

More information

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems

Chapter 5. Solution of System of Linear Equations. Module No. 6. Solution of Inconsistent and Ill Conditioned Systems Numercal Analyss by Dr. Anta Pal Assstant Professor Department of Mathematcs Natonal Insttute of Technology Durgapur Durgapur-713209 emal: anta.bue@gmal.com 1 . Chapter 5 Soluton of System of Lnear Equatons

More information

RBF Neural Network Model Training by Unscented Kalman Filter and Its Application in Mechanical Fault Diagnosis

RBF Neural Network Model Training by Unscented Kalman Filter and Its Application in Mechanical Fault Diagnosis Appled Mechancs and Materals Submtted: 24-6-2 ISSN: 662-7482, Vols. 62-65, pp 2383-2386 Accepted: 24-6- do:.428/www.scentfc.net/amm.62-65.2383 Onlne: 24-8- 24 rans ech Publcatons, Swtzerland RBF Neural

More information

The Geometry of Logit and Probit

The Geometry of Logit and Probit The Geometry of Logt and Probt Ths short note s meant as a supplement to Chapters and 3 of Spatal Models of Parlamentary Votng and the notaton and reference to fgures n the text below s to those two chapters.

More information

MDL-Based Unsupervised Attribute Ranking

MDL-Based Unsupervised Attribute Ranking MDL-Based Unsupervsed Attrbute Rankng Zdravko Markov Computer Scence Department Central Connectcut State Unversty New Brtan, CT 06050, USA http://www.cs.ccsu.edu/~markov/ markovz@ccsu.edu MDL-Based Unsupervsed

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009

College of Computer & Information Science Fall 2009 Northeastern University 20 October 2009 College of Computer & Informaton Scence Fall 2009 Northeastern Unversty 20 October 2009 CS7880: Algorthmc Power Tools Scrbe: Jan Wen and Laura Poplawsk Lecture Outlne: Prmal-dual schema Network Desgn:

More information

General viscosity iterative method for a sequence of quasi-nonexpansive mappings

General viscosity iterative method for a sequence of quasi-nonexpansive mappings Avalable onlne at www.tjnsa.com J. Nonlnear Sc. Appl. 9 (2016), 5672 5682 Research Artcle General vscosty teratve method for a sequence of quas-nonexpansve mappngs Cuje Zhang, Ynan Wang College of Scence,

More information

8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS

8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS SECTION 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS 493 8.4 COMPLEX VECTOR SPACES AND INNER PRODUCTS All the vector spaces you have studed thus far n the text are real vector spaces because the scalars

More information

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}.

We present the algorithm first, then derive it later. Assume access to a dataset {(x i, y i )} n i=1, where x i R d and y i { 1, 1}. CS 189 Introducton to Machne Learnng Sprng 2018 Note 26 1 Boostng We have seen that n the case of random forests, combnng many mperfect models can produce a snglodel that works very well. Ths s the dea

More information

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS

A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS HCMC Unversty of Pedagogy Thong Nguyen Huu et al. A PROBABILITY-DRIVEN SEARCH ALGORITHM FOR SOLVING MULTI-OBJECTIVE OPTIMIZATION PROBLEMS Thong Nguyen Huu and Hao Tran Van Department of mathematcs-nformaton,

More information

Application of B-Spline to Numerical Solution of a System of Singularly Perturbed Problems

Application of B-Spline to Numerical Solution of a System of Singularly Perturbed Problems Mathematca Aeterna, Vol. 1, 011, no. 06, 405 415 Applcaton of B-Splne to Numercal Soluton of a System of Sngularly Perturbed Problems Yogesh Gupta Department of Mathematcs Unted College of Engneerng &

More information

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method

Comparison of the Population Variance Estimators. of 2-Parameter Exponential Distribution Based on. Multiple Criteria Decision Making Method Appled Mathematcal Scences, Vol. 7, 0, no. 47, 07-0 HIARI Ltd, www.m-hkar.com Comparson of the Populaton Varance Estmators of -Parameter Exponental Dstrbuton Based on Multple Crtera Decson Makng Method

More information

Multilayer Perceptron (MLP)

Multilayer Perceptron (MLP) Multlayer Perceptron (MLP) Seungjn Cho Department of Computer Scence and Engneerng Pohang Unversty of Scence and Technology 77 Cheongam-ro, Nam-gu, Pohang 37673, Korea seungjn@postech.ac.kr 1 / 20 Outlne

More information

CSCI B609: Foundations of Data Science

CSCI B609: Foundations of Data Science CSCI B609: Foundatons of Data Scence Lecture 13/14: Gradent Descent, Boostng and Learnng from Experts Sldes at http://grgory.us/data-scence-class.html Grgory Yaroslavtsev http://grgory.us Constraned Convex

More information

Solving Nonlinear Differential Equations by a Neural Network Method

Solving Nonlinear Differential Equations by a Neural Network Method Solvng Nonlnear Dfferental Equatons by a Neural Network Method Luce P. Aarts and Peter Van der Veer Delft Unversty of Technology, Faculty of Cvlengneerng and Geoscences, Secton of Cvlengneerng Informatcs,

More information

Neural networks. Nuno Vasconcelos ECE Department, UCSD

Neural networks. Nuno Vasconcelos ECE Department, UCSD Neural networs Nuno Vasconcelos ECE Department, UCSD Classfcaton a classfcaton problem has two types of varables e.g. X - vector of observatons (features) n the world Y - state (class) of the world x X

More information

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso

Supplement: Proofs and Technical Details for The Solution Path of the Generalized Lasso Supplement: Proofs and Techncal Detals for The Soluton Path of the Generalzed Lasso Ryan J. Tbshran Jonathan Taylor In ths document we gve supplementary detals to the paper The Soluton Path of the Generalzed

More information

The Two-scale Finite Element Errors Analysis for One Class of Thermoelastic Problem in Periodic Composites

The Two-scale Finite Element Errors Analysis for One Class of Thermoelastic Problem in Periodic Composites 7 Asa-Pacfc Engneerng Technology Conference (APETC 7) ISBN: 978--6595-443- The Two-scale Fnte Element Errors Analyss for One Class of Thermoelastc Problem n Perodc Compostes Xaoun Deng Mngxang Deng ABSTRACT

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4)

Econ107 Applied Econometrics Topic 3: Classical Model (Studenmund, Chapter 4) I. Classcal Assumptons Econ7 Appled Econometrcs Topc 3: Classcal Model (Studenmund, Chapter 4) We have defned OLS and studed some algebrac propertes of OLS. In ths topc we wll study statstcal propertes

More information

A Method for Filling up the Missed Data in Information Table

A Method for Filling up the Missed Data in Information Table A Method for Fllng up the Mssed Data Gao Xuedong, E Xu, L Teke & Zhang Qun A Method for Fllng up the Mssed Data n Informaton Table Gao Xuedong School of Management, nversty of Scence and Technology Beng,

More information

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family

Using T.O.M to Estimate Parameter of distributions that have not Single Exponential Family IOSR Journal of Mathematcs IOSR-JM) ISSN: 2278-5728. Volume 3, Issue 3 Sep-Oct. 202), PP 44-48 www.osrjournals.org Usng T.O.M to Estmate Parameter of dstrbutons that have not Sngle Exponental Famly Jubran

More information

Support Vector Novelty Detection

Support Vector Novelty Detection Support Vector Novelty Detecton Dscusson of Support Vector Method for Novelty Detecton (NIPS 2000) and Estmatng the Support of a Hgh- Dmensonal Dstrbuton (Neural Computaton 13, 2001) Bernhard Scholkopf,

More information

829. An adaptive method for inertia force identification in cantilever under moving mass

829. An adaptive method for inertia force identification in cantilever under moving mass 89. An adaptve method for nerta force dentfcaton n cantlever under movng mass Qang Chen 1, Mnzhuo Wang, Hao Yan 3, Haonan Ye 4, Guola Yang 5 1,, 3, 4 Department of Control and System Engneerng, Nanng Unversty,

More information

Sparse Gaussian Processes Using Backward Elimination

Sparse Gaussian Processes Using Backward Elimination Sparse Gaussan Processes Usng Backward Elmnaton Lefeng Bo, Lng Wang, and Lcheng Jao Insttute of Intellgent Informaton Processng and Natonal Key Laboratory for Radar Sgnal Processng, Xdan Unversty, X an

More information

Learning with Tensor Representation

Learning with Tensor Representation Report No. UIUCDCS-R-2006-276 UILU-ENG-2006-748 Learnng wth Tensor Representaton by Deng Ca, Xaofe He, and Jawe Han Aprl 2006 Learnng wth Tensor Representaton Deng Ca Xaofe He Jawe Han Department of Computer

More information

An Admission Control Algorithm in Cloud Computing Systems

An Admission Control Algorithm in Cloud Computing Systems An Admsson Control Algorthm n Cloud Computng Systems Authors: Frank Yeong-Sung Ln Department of Informaton Management Natonal Tawan Unversty Tape, Tawan, R.O.C. ysln@m.ntu.edu.tw Yngje Lan Management Scence

More information

Discretization of Continuous Attributes in Rough Set Theory and Its Application*

Discretization of Continuous Attributes in Rough Set Theory and Its Application* Dscretzaton of Contnuous Attrbutes n Rough Set Theory and Its Applcaton* Gexang Zhang 1,2, Lazhao Hu 1, and Wedong Jn 2 1 Natonal EW Laboratory, Chengdu 610036 Schuan, Chna dylan7237@sna.com 2 School of

More information

A new Approach for Solving Linear Ordinary Differential Equations

A new Approach for Solving Linear Ordinary Differential Equations , ISSN 974-57X (Onlne), ISSN 974-5718 (Prnt), Vol. ; Issue No. 1; Year 14, Copyrght 13-14 by CESER PUBLICATIONS A new Approach for Solvng Lnear Ordnary Dfferental Equatons Fawz Abdelwahd Department of

More information

Ensemble Methods: Boosting

Ensemble Methods: Boosting Ensemble Methods: Boostng Ncholas Ruozz Unversty of Texas at Dallas Based on the sldes of Vbhav Gogate and Rob Schapre Last Tme Varance reducton va baggng Generate new tranng data sets by samplng wth replacement

More information

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2)

MATH 829: Introduction to Data Mining and Analysis The EM algorithm (part 2) 1/16 MATH 829: Introducton to Data Mnng and Analyss The EM algorthm (part 2) Domnque Gullot Departments of Mathematcal Scences Unversty of Delaware Aprl 20, 2016 Recall 2/16 We are gven ndependent observatons

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

Multiple Sound Source Location in 3D Space with a Synchronized Neural System

Multiple Sound Source Location in 3D Space with a Synchronized Neural System Multple Sound Source Locaton n D Space wth a Synchronzed Neural System Yum Takzawa and Atsush Fukasawa Insttute of Statstcal Mathematcs Research Organzaton of Informaton and Systems 0- Mdor-cho, Tachkawa,

More information

Linear Feature Engineering 11

Linear Feature Engineering 11 Lnear Feature Engneerng 11 2 Least-Squares 2.1 Smple least-squares Consder the followng dataset. We have a bunch of nputs x and correspondng outputs y. The partcular values n ths dataset are x y 0.23 0.19

More information