Semi-supervised Classification with Active Query Selection

Jiao Wang and Siwei Luo

School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Wangjiao088@163.com

Abstract. Labeled samples are crucial in semi-supervised classification, but which samples should we choose to be the labeled samples? In other words, which samples, if labeled, would provide the most information? We propose a method to solve this problem. First, we give each unlabeled example an initial class label using unsupervised learning. Then, by maximizing the mutual information, we choose the samples with the most information to be user-specified labeled samples. After that, we run a semi-supervised algorithm with the user-specified labeled samples to get the final classification. Experimental results on synthetic data show that our algorithm can get satisfying classification results with active query selection.

1 Introduction

Recently, there has been great interest in semi-supervised classification. The goal of semi-supervised learning is to use unlabeled data to improve the performance of standard supervised learning algorithms. Since in many fields obtaining labeled data is hard or expensive, semi-supervised learning methods with a small labeled sample size are of great use. In case the unsupervised learning methods can separate the points well (see e.g. Fig. 1a), there is no need for semi-supervised methods. However, in case of noise (see e.g. Fig. 1b), or in case two modes which belong to two different classes overlap (see e.g. Fig. 1c), semi-supervised learning with a few labeled points in each class can improve the performance significantly. A number of algorithms have been proposed for semi-supervised learning, including EM [8], co-training [1, 14], tri-training [15], random field models [9, 12], and graph based approaches [2, 6, 13]. Different methods have different assumptions, and can be used in different situations. Especially, when data resides on a low-dimensional manifold within a high-dimensional representation space, semi-supervised learning methods should be adjusted to work on the manifold.
Belkin gives a solution to this problem with manifold regularization methods in [4]. Query selection is extensively studied in the supervised framework. In [10], the queries are selected to minimize the version space size for support vector machines. In [7], a committee of classifiers is employed, and a point is queried whenever the committee members disagree. Many other methods have been proposed to actively choose the samples in supervised learning, but little has been done to choose samples in semi-supervised learning.

D.-Y. Yeung et al. (Eds.): SSPR&SPR 2006, LNCS 4109, pp. 741-746, 2006. © Springer-Verlag Berlin Heidelberg 2006
The labeled samples play an important role in semi-supervised learning. A question then arises: which samples should be the labeled samples? Among the existing semi-supervised learning methods, some choose the labeled samples manually [6]; to do this, one has to have some domain knowledge of which samples most need to be labeled. Some choose the labeled samples randomly, which may not capture the right samples. In [11], Zhu et al. choose the samples actively by greedily selecting queries from the unlabeled data to minimize the estimated expected classification error, but Zhu's active learning method can only be used together with his semi-supervised learning method.

In this paper, we give a more general and automatic query selection method in the semi-supervised framework. Our method can be applied to most of the existing semi-supervised learning methods; it is a pre-process for them. The main idea is to consider which samples, if labeled, would give the most information. Following this idea, we use the mutual information I(Y; y_i) (y_i represents one sample's class label and Y represents all the samples' class labels) as a measure for active query selection. By maximizing the mutual information, we get the sample which most needs to be labeled. Using this method, we can choose the samples to be labeled actively and automatically, and it does not need any domain knowledge.

In this paper, in order to explain our method, we work with the Laplacian Eigenmaps [3] and manifold regularization [4] of Belkin to show the entire process. We can see how the active query selection method works on a manifold. We do not claim that this method can only be used on manifolds; indeed, we aim to illustrate that applying our method to any semi-supervised method would yield satisfying results.

This paper is organized as follows: In section 2, we introduce our algorithm briefly. Section 3 gives details of every part of our algorithm. Experimental results on synthetic data are shown in section 4, followed by conclusions in section 5.

Fig. 1.
(a) Example of a situation in which unsupervised learning methods (here we use Laplacian Eigenmaps) work well. (b)(c) Examples of situations in which unsupervised learning cannot give satisfying results and some labeled samples are needed to help.

2 Our Algorithm

To explain the entire process of our active query selection method, we work with the Laplacian Eigenmaps [3] and manifold regularization [4] of Belkin. The steps are as follows:
Step 1. Give each (unlabeled) sample an initial class label using unsupervised learning. Here, we use Laplacian Eigenmaps to map the samples to a real-valued function f.

Step 2. By maximizing the mutual information I(Y; y_i) (y_i represents one sample's class label and Y represents all the samples' class labels), we actively choose the samples with the most uncertain class labels to be user-specified labeled samples.

Step 3. Give the chosen samples their class labels.

Step 4. Run the semi-supervised algorithm (here, the manifold regularization algorithm) with the user-specified labeled samples to get the final classification.

3 Details of Our Method

3.1 Using Laplacian Eigenmaps to Get Initial Class Labels

Given a sample set x_1, ..., x_n ∈ R^m, construct its neighborhood graph G = (V, E), whose vertices are the sample points V = {x_1, ..., x_n}, and whose edge weights {w_ij}_{i,j=1}^n represent appropriate pairwise similarity relationships between samples. For example, w_ij can be the radial basis function

    w_ij = exp( -(1/σ) Σ_{d=1}^m (x_id - x_jd)² )                    (1)

where σ is a scale parameter. The radial basis function ensures that nearby points are assigned large edge weights.

We first consider the two-class situation. Assume that f is a real-valued function whose value is bounded from 0 to 1 (0 and 1 each represent a class label). Let y_i = f(x_i) and Y = (y_1, y_2, ..., y_n)^T. Laplacian Eigenmaps try to minimize the following objective function:

    Σ_{i,j} (y_i - y_j)² w_ij                                        (2)

By minimizing this objective function, we get y_1, ..., y_n, the initial class label of each sample, with y_i ∈ [0, 1].

3.2 Using Mutual Information to Choose the Samples with the Most Uncertain Class Labels

This is our active query selection step. We use the mutual information I(Y; y_i) (y_i represents one sample's class label and Y represents all the samples' class labels) as a measure for query selection. By maximizing the mutual information, we get the sample which would give the most information, that is, the sample which most needs to be labeled.
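Steps 1 and 2 above can be made concrete with a minimal NumPy sketch. This is an illustrative sketch, not the authors' implementation: the function names `rbf_weights`, `initial_labels`, `query_index` and the default σ are our own choices, and the initial soft labels are taken from the second eigenvector of the graph Laplacian, rescaled to [0, 1].

```python
# Illustrative sketch of Steps 1-2 (not the authors' code).
import numpy as np

def rbf_weights(X, sigma=1.0):
    """Edge weights w_ij = exp(-||x_i - x_j||^2 / sigma), as in eq. (1)."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    W = np.exp(-sq / sigma)
    np.fill_diagonal(W, 0.0)           # drop self-loops
    return W

def initial_labels(W):
    """Step 1: minimizing sum_ij (y_i - y_j)^2 w_ij leads to the graph
    Laplacian Delta = D - W; the eigenvector of the second-smallest
    eigenvalue gives a real-valued f, rescaled here to [0, 1]."""
    Delta = np.diag(W.sum(axis=1)) - W
    _, vecs = np.linalg.eigh(Delta)    # eigenvalues in ascending order
    f = vecs[:, 1]                     # skip the constant eigenvector
    return (f - f.min()) / (f.max() - f.min() + 1e-12)

def query_index(p):
    """Step 2: I(Y; y_i) grows with p(y_i)(1 - p(y_i)), so the best
    query is the sample whose soft label is closest to 0.5."""
    return int(np.argmax(p * (1.0 - p)))
```

On well-separated clusters the rescaled eigenvector pushes the soft labels toward 0 and 1, and `query_index` then tends to pick a sample sitting near the boundary between them.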
In order to calculate I(Y; y_i), inspired by the work of [5], we define a Gaussian random field on the vertices of V:

    p(y) ∝ exp( -λ y^T Δ y / 2 )                                     (3)

where Δ = D - W, and D is a diagonal matrix given by D_ii = Σ_{j=1}^n w_ij.

The mutual information between Y and y_* is the expected decrease in entropy of Y when y_* is observed:

    I(Y; y_*) = H(Y) - E{H(Y | y_*)} = (1/2) log( 1 + p(y_*)(1 - p(y_*)) x_*^T H^{-1} x_* )     (4)

where H = -∇∇ log p(Y) is the Hessian matrix. The best sample to label is the one that maximizes I(Y; y_*), and the mutual information is largest when p(y_*) ≈ 0.5, i.e., for the samples with the most information.

3.3 Using the User-Specified Labeled Samples to Get the Final Classification

After we actively choose the samples to label, we can run a semi-supervised classification method to get the final result. In the manifold regularization method of Belkin, the author minimizes the following cost function:

    min_{f ∈ H_K} H[f] = (1/l) Σ_{i=1}^l (f(x_i) - y_i)² + γ_A ||f||²_K + γ_I Σ_{i,j=1}^n (f(x_i) - f(x_j))² w_ij     (5)

where l is the number of labeled samples, γ_A and γ_I are regularization parameters, and ||f||²_K is a constraint that ensures the smoothness of the learned function. Here, the l samples in the above cost function are not chosen randomly or manually as in the original work of Belkin; rather, they are chosen with the active query selection method discussed in 3.2.

4 Experimental Results

As we pointed out at the beginning of this paper, unsupervised learning cannot work well in case of noise, or in case two modes which belong to two different classes overlap. In these situations, semi-supervised learning with a few labeled samples can help. Using some synthetic data, we show that our active query selection method can choose the most informative samples to label. Fig. 2(a) is a noisy version of Fig. 1(a), and without labeled samples, Laplacian Eigenmaps cannot find a satisfying classification (the yellow curve).
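Returning to Step 4, the cost (5) admits a closed-form minimizer under a simplification. The sketch below is not Belkin's LapRLS: instead of learning f in an RKHS, we optimize the vector of function values directly, so the ambient penalty γ_A ||f||²_K collapses to a ridge term; labels are mapped to ±1 internally and the final class is the sign of f. The function name and parameter defaults are our own illustrative choices.

```python
# Simplified graph-only sketch of eq. (5) (not Belkin's implementation):
# minimize (1/l) sum_{i in L} (f_i - y_i)^2 + gamma_a ||f||^2
#          + gamma_i f^T (D - W) f      over the vector f in R^n.
import numpy as np

def manifold_regularization(W, labeled_idx, labels01,
                            gamma_a=1e-3, gamma_i=1.0):
    """Closed form: (J/l + gamma_a I + gamma_i Delta) f = J y / l,
    where J is the diagonal indicator of the labeled samples."""
    n = W.shape[0]
    l = len(labeled_idx)
    Delta = np.diag(W.sum(axis=1)) - W          # graph Laplacian
    J = np.zeros((n, n))
    J[labeled_idx, labeled_idx] = 1.0           # indicator of labeled set
    y = np.zeros(n)
    y[labeled_idx] = 2.0 * np.asarray(labels01) - 1.0   # {0,1} -> {-1,+1}
    A = J / l + gamma_a * np.eye(n) + gamma_i * Delta
    f = np.linalg.solve(A, y / l)
    return (f > 0).astype(int)                  # final two-class labels
```

With one queried label per cluster, the Laplacian term propagates each label across its neighborhood, which is the behavior eq. (5) is designed to produce.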
Using our active query selection method, the algorithm chooses some samples to be labeled; these samples are shown in (b) in purple. After that, the user gives the class labels of these chosen samples (the red and blue samples in (c); each color represents a class). Then, with these user-specified labeled samples, the manifold regularization method finds a more satisfying classification, as shown in (c) (the yellow curve).

Fig. 2. (a) Laplacian Eigenmaps cannot find a satisfying classification without labeled samples. (b) The samples automatically chosen to be labeled. (c) The manifold regularization results with the labeled samples.

5 Conclusions

A key problem of semi-supervised learning is to choose the most informative samples to be labeled at the very beginning of semi-supervised algorithms. Using mutual information, we give a solution to this problem. Our sample selection method can be applied to most of the existing semi-supervised learning methods; in this paper, we combine it with manifold regularization to show how it works. We also run experiments on synthetic data, with satisfying results. In future work, we will try this method in some real-world experiments. Another problem of semi-supervised learning is how many labeled samples are suitable: for example, should we choose five samples to label, or ten? In future work, we will consider this problem in the framework of the active query selection of this paper.

Acknowledgements

The research is supported by the National Natural Science Foundation of China (6037309), the Research Fund for the Doctoral Program of Higher Education of China (005000400) and the Co-Construction Project of Key Subject of Beijing.

References

1. A. Blum and T. Mitchell: Combining labeled and unlabeled data with co-training. In Proceedings of the 11th Annual Conference on Computational Learning Theory, Madison, WI, pp. 92-100 (1998).
2. A. Blum and S. Chawla: Learning from Labeled and Unlabeled Data using Graph Mincuts. ICML (2001).
3. M. Belkin and P. Niyogi: Laplacian Eigenmaps for Dimensionality Reduction and Data Representation. Neural Computation, June (2003).
4. M. Belkin, P. Niyogi, and V. Sindhwani: On Manifold Regularization. Department of Computer Science, University of Chicago, TR-2004-05.
5. B. Krishnapuram, D. Williams, Ya Xue, A. Hartemink, L. Carin, and M. A. T. Figueiredo: On Semi-Supervised Classification. NIPS (2004).
6. D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schoelkopf: Learning with Local and Global Consistency. NIPS (2003).
7. Y. Freund, H. S. Seung, E. Shamir, and N. Tishby: Selective sampling using the query by committee algorithm. Machine Learning, 28, 133-168 (1997).
8. K. Nigam: Using Unlabeled Data to Improve Text Classification. PhD thesis, Carnegie Mellon University Computer Science Dept. (2001).
9. M. Szummer and T. Jaakkola: Partially labeled classification with Markov random walks. NIPS (2001).
10. S. Tong and D. Koller: Support vector machine active learning with applications to text classification. ICML (2000).
11. Xiaojin Zhu, J. Lafferty, and Z. Ghahramani: Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. ICML (2003).
12. Xiaojin Zhu, Z. Ghahramani, and J. Lafferty: Semi-supervised learning using Gaussian fields and harmonic functions. ICML (2003).
13. Xiaojin Zhu: Semi-Supervised Learning with Graphs. PhD thesis, Carnegie Mellon University Computer Science Dept. (2005).
14. Z.-H. Zhou and M. Li: Semi-supervised regression with co-training. International Joint Conference on Artificial Intelligence (2005).
15. Z.-H. Zhou and M. Li: Tri-training: exploiting unlabeled data using three classifiers. IEEE Trans. Knowledge and Data Engineering, 17, 1529-1541 (2005).