Soft-Supervised Learning for Text Classification


Amarnag Subramanya & Jeff Bilmes
Dept. of Electrical Engineering, University of Washington, Seattle, WA 98195, USA.

Abstract

We propose a new graph-based semi-supervised learning (SSL) algorithm and demonstrate its application to document categorization. Each document is represented by a vertex within a weighted undirected graph, and our proposed framework minimizes the weighted Kullback-Leibler divergence between distributions that encode the class membership probabilities of each vertex. The proposed objective is convex, with guaranteed convergence using an alternating minimization procedure. Further, it generalizes in a straightforward manner to multi-class problems. We present results on two standard tasks, namely Reuters-21578 and WebKB, showing that the proposed algorithm significantly outperforms the state-of-the-art.

1 Introduction

Semi-supervised learning (SSL) employs small amounts of labeled data together with relatively large amounts of unlabeled data to train classifiers. In many problems, such as speech recognition, document classification, and sentiment recognition, annotating training data is both time-consuming and tedious, while unlabeled data are easily obtained, making these problems useful applications of SSL. Classic examples of SSL algorithms include self-training (Yarowsky, 1995) and co-training (Blum and Mitchell, 1998). Graph-based SSL algorithms are an important class of SSL techniques that have attracted much attention of late (Blum and Chawla, 2001; Zhu et al., 2003). Here one assumes that the data (both labeled and unlabeled) are embedded within a low-dimensional manifold expressed by a graph. In other words, each data sample is represented by a vertex within a weighted graph, with the weights providing a measure of similarity between vertices.

Most graph-based SSL algorithms fall under one of two categories: those that use the graph structure to spread labels from labeled to unlabeled samples (Szummer and Jaakkola, 2001; Zhu and Ghahramani, 2002), and those that optimize a loss function based on smoothness constraints derived from the graph (Blum and Chawla, 2001; Zhu et al., 2003; Joachims, 2003; Belkin et al., 2005). Sometimes the two categories are similar in that they can be shown to optimize the same underlying objective (Zhu and Ghahramani, 2002; Zhu et al., 2003). In general, graph-based SSL algorithms are non-parametric and transductive (excluding Manifold Regularization (Belkin et al., 2005)). A learning algorithm is said to be transductive if it is expected to work only on a closed data set, where the test set is revealed at the time of training. In practice, however, transductive learners can be modified to handle unseen data (Zhu, 2005a; Sindhwani et al., 2005).

A common drawback of many graph-based SSL algorithms (e.g., Blum and Chawla, 2001; Joachims, 2003; Belkin et al., 2005) is that they assume binary classification tasks and thus require the use of sub-optimal (and often computationally expensive) approaches such as one vs. rest to solve multi-class problems, let alone structured domains such as strings and trees.

There are also issues related to degenerate solutions, in which all unlabeled samples are classified as belonging to a single class (Blum and Chawla, 2001; Joachims, 2003; Zhu and Ghahramani, 2002). For more background on graph-based and general SSL and their applications, see (Zhu, 2005a; Chapelle et al., 2007; Blitzer and Zhu, 2008).

In this paper we propose a new algorithm for graph-based SSL and use the task of text classification to demonstrate its benefits over the current state-of-the-art. Text classification involves automatically assigning a given document to a fixed number of semantic categories. Each document may belong to one, many, or none of the categories. In general, text classification is a multi-class problem (more than 2 categories). Training fully-supervised text classifiers requires large amounts of labeled data whose annotation can be expensive (Dumais et al., 1998). As a result, there has been interest in using SSL techniques for text classification (Joachims, 1999; Joachims, 2003). However, past work in semi-supervised text classification has relied primarily on one vs. rest approaches to overcome the inherent multi-class nature of this problem. We believe such an approach may be sub-optimal because, disregarding data overlap, the different classifiers have training procedures that are independent of one another.

In order to address the above drawback, we propose a new framework based on optimizing a loss function composed of Kullback-Leibler divergence (KL-divergence) (Cover and Thomas, 1991) terms between probability distributions defined for each graph vertex. The use of probability distributions, rather than fixed integer labels, not only leads to a straightforward multi-class generalization, but also allows us to exploit other well-defined functions of distributions, such as entropy, to improve system performance and to allow for a measure of uncertainty. For example, with a single integer, at most all we know is its assignment. With a distribution, we can continuously move from knowing an assignment with certainty (i.e., an entropy of zero) to expressions of doubt or multiple valid possibilities (i.e., an entropy greater than zero). This is particularly useful for document classification, as we will see. We also show how one can use the alternating minimization (Csiszar and Tusnady, 1984) algorithm to optimize our objective, leading to a relatively simple, fast, easy-to-implement, guaranteed-to-converge, iterative, closed-form update for each iteration.

2 Proposed Graph-Based Learning Framework

We consider the transductive learning problem: given a training set D = {D_l, D_u}, where D_l and D_u are the sets of labeled and unlabeled samples respectively, the task is to infer the labels for the samples in D_u. In other words, D_u is the test set. Here D_l = {(x_i, y_i)} for i = 1, ..., l, D_u = {x_i} for i = l+1, ..., l+u, x_i ∈ X (the input space of the classifier, which corresponds to vectors of features), and y_i ∈ Y (the space of classifier outputs, which in our case is a space of non-negative integers). Thus |Y| = 2 yields binary classification while |Y| > 2 yields multi-class. We define n = l + u, the total number of samples in the training set.

Given D, most graph-based SSL algorithms utilize an undirected weighted graph G = (V, E), where V = {1, ..., n} are the data points in D and E ⊆ V × V is the set of undirected edges between vertices. We use w_ij ∈ W to denote the weight of the edge between vertices i and j; W is referred to as the weight (or affinity) matrix of G. As will be seen shortly, the input features x_i affect the final classification results via W, i.e., the graph. Thus graph construction is crucial to the success of any graph-based SSL algorithm. Graph construction is "more of an art than a science" (Zhu, 2005b) and is an active research area (Alexandrescu and Kirchhoff, 2007).
In general, the weights are formed as w_ij = sim(x_i, x_j) δ(j ∈ K(i)), where K(i) is the set of i's k nearest neighbors (KNN), sim(x_i, x_j) is a given measure of similarity between x_i and x_j, and δ(c) returns 1 if c is true and 0 otherwise. Getting the similarity measure right is crucial to the success of any SSL algorithm, as it determines the graph. Note that setting K(i) = V (i.e., k = n) results in a fully-connected graph. Some popular similarity measures include

  sim(x_i, x_j) = exp( −||x_i − x_j||² / σ² )   or
  sim(x_i, x_j) = cos(x_i, x_j) = ⟨x_i, x_j⟩ / ( ||x_i||₂ ||x_j||₂ ),

where ||x||₂ is the L2 norm and ⟨x_i, x_j⟩ is the inner product of x_i and x_j. The first similarity measure is an RBF kernel applied to the squared Euclidean distance, while the second is cosine similarity. In this paper all graphs are constructed using cosine similarity.
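To make the construction above concrete, the following is a minimal sketch (ours, not the authors' code) of building a symmetric k-NN affinity matrix from feature vectors using cosine similarity; the function and variable names are illustrative only, and symmetrizing via an element-wise maximum is just one common choice for making the graph undirected.

```python
import numpy as np

def build_knn_graph(X, k):
    """Build a symmetric k-NN affinity matrix W using cosine similarity.

    X: (n, d) array of feature vectors (e.g., TFIDF), one row per document.
    k: number of nearest neighbors (k < n); larger k gives a denser graph.
    """
    n = X.shape[0]
    k = min(k, n - 1)                  # a vertex cannot be its own neighbor here
    # Cosine similarity: normalize rows to unit L2 norm, then take inner products.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = Xn @ Xn.T                      # S[i, j] = cos(x_i, x_j)
    np.fill_diagonal(S, -np.inf)       # exclude self when picking neighbors

    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(-S[i])[:k]   # indices of the k most similar vertices
        W[i, nbrs] = S[i, nbrs]
    # Symmetrize so the graph is undirected (w_ij = w_ji).
    return np.maximum(W, W.T)
```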

We next introduce our proposed approach. For every i ∈ V, we define a probability distribution p_i over the elements of Y. In addition, let r_j, j = 1, ..., l, be another set of probability distributions, again over the elements of Y (recall, Y is the space of classifier outputs). Here {r_j} represents the labels of the supervised portion of the training data. If the label for a given labeled data point consists only of a single integer, then the entropy of the corresponding r_j is zero (the probability of that integer will be unity, with the remaining probabilities being zero). If, on the other hand, the label for a given labeled data point consists of a set of integers (e.g., if the object is a member of multiple classes), then r_j is able to represent this property accordingly (see below). We emphasize again that both p_i and r_j are probability distributions, with r_j fixed throughout training.

The goal of learning in this paper is to find the best set of distributions p_i that attempt to: 1) agree with the labeled data r_j wherever it is available; 2) agree with each other (when they are close according to the graph); and 3) be smooth in some way. These criteria are captured in the following new multi-class SSL optimization procedure:

  min_p C_1(p), where
  C_1(p) = Σ_{i=1..l} D_KL(r_i || p_i) + μ Σ_{i=1..n} Σ_{j=1..n} w_ij D_KL(p_i || p_j) − ν Σ_{i=1..n} H(p_i),   (1)

where p ≜ (p_1, ..., p_n) denotes the entire set of distributions to be learned, H(p_i) = −Σ_y p_i(y) log p_i(y) is the standard Shannon entropy of p_i, D_KL(p_i || q_j) is the KL-divergence between p_i and q_j, and μ and ν are hyperparameters whose selection we discuss in Section 5.

The distributions r_i are derived from D_l (as mentioned above), and this can be done in one of the following ways: (a) if ŷ_i is the single supervised label for input x_i, then r_i(y) = δ(y = ŷ_i), which means that r_i gives unity probability for y equaling the label ŷ_i; (b) if ŷ_i = {ŷ_i^(1), ..., ŷ_i^(k)}, k ≤ |Y|, is a set of possible outputs for input x_i, meaning the object validly falls into all of the corresponding categories, we set r_i(y) = (1/k) δ(y ∈ ŷ_i), meaning that r_i is uniform over only the possible categories and zero otherwise; (c) if the labels are somehow provided in the form of a set of non-negative scores, or even a probability distribution itself, we simply set r_i equal to those scores, (possibly) normalized to become a valid probability distribution. Among these three cases, case (b) is particularly relevant to text classification, as a given document may belong to (and in practice may be labeled as belonging to) many classes. The final classification results, i.e., the final labels for D_u, are then given by ŷ_i = argmax_{y ∈ Y} p_i(y).

We next provide further intuition about our objective function. SSL on a graph consists of finding a labeling of D_u that is consistent with both the labels provided in D_l and the geometry of the data induced by the graph. The first term of C_1 penalizes the solution p_i, i ∈ {1, ..., l}, when it is far away from the labeled training data D_l, but it does not insist that p_i = r_i, as allowing for deviations from r_i can help, especially with noisy labels (Bengio et al., 2007) or when the graph is extremely dense in certain regions. As explained above, our framework allows for the case where supervised training is uncertain or ambiguous. We consider it reasonable to call our approach soft-supervised learning, generalizing the notion of semi-supervised learning, since there is even more of a continuum here between fully supervised and fully unsupervised learning than what typically exists with SSL. Soft-supervised learning allows uncertainty to be expressed (via a probability distribution) about any of the labels individually.
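As an aside, cases (a) and (b) above take only a few lines to realize; the sketch below is illustrative (the class indices in the example are hypothetical, not taken from the paper).

```python
import numpy as np

def make_label_distribution(labels, num_classes):
    """Return the fixed distribution r_i for a labeled document.

    labels: a single class index (case a) or an iterable of class indices (case b).
    """
    r = np.zeros(num_classes)
    if np.isscalar(labels):
        r[labels] = 1.0                # case (a): all mass on the single label
    else:
        labels = list(labels)
        r[labels] = 1.0 / len(labels)  # case (b): uniform over the given labels
    return r

# Example: a document tagged with three classes (indices 1, 3, 8 out of 11, made up
# for illustration) gets probability 1/3 on each of those classes and 0 elsewhere.
r_i = make_label_distribution([1, 3, 8], num_classes=11)
```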
The second term of C_1 penalizes a lack of consistency with the geometry of the data and can be seen as a graph regularizer. If w_ij is large, we prefer a solution in which p_i and p_j are close in the KL-divergence sense. While KL-divergence is asymmetric, the fact that G is undirected implies that W is symmetric (w_ij = w_ji), and as a result the second term is inherently symmetric. The last term encourages each p_i to be close to the uniform distribution if not preferred to the contrary by the first two terms. This acts as a guard against the degenerate solutions commonly encountered in SSL (Blum and Chawla, 2001; Joachims, 2003). For example, consider the case where part of the graph is almost completely disconnected from any labeled vertex (which is possible in the k-nearest-neighbor case).

In such situations, the third term ensures that the nodes in this disconnected region are encouraged to yield a uniform distribution, validly expressing the fact that we do not know the labels of these nodes based on the nature of the graph. More generally, we conjecture that by maximizing the entropy of each p_i, the classifier has a better chance of producing high-entropy results in graph regions of low confidence (e.g., close to the decision boundary and/or in low-density regions). This overcomes a common drawback of a large number of state-of-the-art classifiers, which tend to be confident even in regions close to the decision boundary.

We conclude this section by summarizing some of the features of our proposed framework. It should be clear that C_1 uses the manifold assumption for SSL (see chapter 2 in (Chapelle et al., 2007)): it assumes that the input data can be embedded within a low-dimensional manifold (the graph). As the objective is defined in terms of probability distributions over integers, rather than just integers (or real-valued relaxations of integers (Joachims, 2003; Zhu et al., 2003)), the framework generalizes in a straightforward manner to multi-class problems. Further, all the parameters are estimated jointly (compare this to one vs. rest approaches, which involve solving |Y| independent problems). Furthermore, the objective is capable of handling uncertainty in the labeled training data (Pearl, 1990). Of course, this objective would be useless if it were not possible to optimize it efficiently and easily on large data sets. We next describe a method that can do this.

3 Learning with Alternating Minimization

As long as μ, ν ≥ 0, the objective C_1(p) is convex. This follows since D_KL(p_i || p_j) is convex in the pair (p_i, p_j) (Cover and Thomas, 1991), negative entropy is convex, and a positive-weighted linear combination of a set of convex functions is convex. Thus, the problem of minimizing C_1 over the space of collections of probability distributions (a convex set) constitutes a convex programming problem (Bertsekas, 2004). This property is extremely beneficial since there is a unique global optimum and there is a variety of methods that can be used to find that global optimum. One possible method might take the derivative of the objective along with Lagrange multipliers to ensure that we stay within the space of probability distributions; this can sometimes yield a closed-form, single-step analytical expression for the globally optimal solution. Unfortunately, however, our problem does not admit such a closed-form solution, because the gradient of C_1(p) with respect to p_i(y) is of the form k_1 p_i(y) log p_i(y) + k_2 p_i(y) + k_3 (where k_1, k_2, k_3 are fixed constants). Sometimes optimizing the dual of the objective can also produce a solution, but unfortunately the dual of our objective does not yield a closed-form solution either. The typical next step, then, is to resort to iterative techniques such as gradient descent, along with modifications to ensure that the solution stays within the set of probability distributions (the gradient of C_1 alone will not necessarily point in a direction where p remains a valid distribution); one such modification is the method of multipliers (MOM). Another option would be to use computationally complex (and complicated) algorithms like interior point methods (IPM). While all of the above methods (described in detail in (Bertsekas, 2004)) are feasible ways to solve our problem, they each have their own drawbacks. Using MOM, for example, requires the careful tuning of a number of additional parameters such as learning rates, growth factors, and so on. IPM involves inverting a matrix of the order of the number of variables and constraints during each iteration.
We instead adopt a different strategy based on alternating minimization (Csiszar and Tusnady, 1984). This approach has a single additional optimization parameter (contrasted with MOM), admits a closed-form solution for each iteration that does not involve any matrix inversion (contrasted with IPM), and yields guaranteed convergence to the global optimum. In order to render our approach amenable to AM, however, we relax our objective C_1 by defining a new (third) set of distributions q_i, i = 1, ..., n, over all training samples, denoted collectively, like the above, using the notation q ≜ (q_1, ..., q_n). We define a new objective to be optimized as follows:

  min_{p,q} C_2(p, q), where
  C_2(p, q) = Σ_{i=1..l} D_KL(r_i || q_i) + μ Σ_{i=1..n} Σ_{j ∈ N(i)} w'_ij D_KL(p_i || q_j) − ν Σ_{i=1..n} H(p_i),

where N(i) denotes the neighborhood of vertex i and the modified weights w'_ij are defined below.

Before going further, the reader may be wondering at this juncture how it could be desirable to have apparently complicated the objective function in an attempt to yield a more computationally and methodologically superior machine learning procedure. This is indeed the case, as will be spelled out below. First, in C_2 we have defined a new weight matrix [W']_ij = w'_ij of the same size as the original, where W' = W + α I_n, I_n is the n × n identity matrix, and α ≥ 0 is a non-negative constant (this is the optimization-related parameter mentioned above). This has the effect that w'_ii ≥ w_ii. In the original objective C_1, w_ii is irrelevant since D_KL(p_i || p_i) = 0 for all p_i, but since there are now two distributions for each training point, there should be encouragement for the two to approach each other. Like C_1, the first term of C_2 ensures that the labeled training data is respected and the last term is a smoothness regularizer, but these are expressed via different sets of distributions, q_i and p_i respectively; this choice is what makes possible the relatively simple analytical update equations given below. Next, we see that the two objective functions in fact have identical solutions when the optimization enforces the constraint that p and q are equal:

  min_{(p,q): p=q} C_2(p, q) = min_p C_1(p).

Indeed, as α gets large, the only solutions considered viable are those where p = q. We thus have that

  lim_{α→∞} min_{p,q} C_2(p, q) = min_p C_1(p).

Therefore, the two objectives should yield the same solution as long as α ≫ w_ij for all i, j. A key advantage of this relaxed objective is that it is amenable to alternating minimization, a method that produces a sequence of sets of distributions (p^(n), q^(n)) as follows:

  p^(n) = argmin_p C_2(p, q^(n−1)),    q^(n) = argmin_q C_2(p^(n), q).

It can be shown (we omit the rather lengthy proof due to space constraints) that the sequence generated using the above minimizations converges to the minimum of C_2(p, q), i.e.,

  lim_{n→∞} C_2(p^(n), q^(n)) = inf_{p,q} C_2(p, q),

provided we start with a properly initialized distribution, i.e., q_i^(0)(y) > 0 for all y ∈ Y. The update equations for p^(n) and q^(n) are given by

  p_i^(n)(y) = (1/Z_i) exp( β_i^(n−1)(y) / γ_i ),
  q_i^(n)(y) = [ r_i(y) δ(i ≤ l) + μ Σ_j w'_ij p_j^(n)(y) ] / [ δ(i ≤ l) + μ Σ_j w'_ij ],

where

  γ_i = ν + μ Σ_j w'_ij,
  β_i^(n−1)(y) = −ν + μ Σ_j w'_ij ( log q_j^(n−1)(y) − 1 ),

and where Z_i is a normalizing constant ensuring that p_i is a valid probability distribution. Note that each iteration of the proposed framework has a closed-form solution and is relatively simple to implement, even for very large graphs. Henceforth we refer to the proposed objective optimized using alternating minimization as AM.

4 Connections to Other Approaches

Label propagation (LP) (Zhu and Ghahramani, 2002) is a graph-based SSL algorithm that performs Markov random walks on the graph and has a straightforward extension to multi-class problems. The update equation for LP (which we also use for our LP implementation) may be written as

  p_i^(n)(y) = [ r_i(y) δ(i ≤ l) + δ(i > l) Σ_j w_ij p_j^(n−1)(y) ] / [ δ(i ≤ l) + δ(i > l) Σ_j w_ij ].

Note the similarity to the update equation for q^(n) in our AM case. It has been shown that the squared-loss based SSL algorithm (Zhu et al., 2003) and LP have similar updates (Bengio et al., 2007).
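For concreteness, the following is a minimal, illustrative sketch of the alternating-minimization iterations above (a vectorized reimplementation under the stated assumptions W' = W + αI and r_i fixed for the first l vertices, not the authors' released code).

```python
import numpy as np

def am_updates(W, R, l, mu, nu, alpha, num_iters=50):
    """Alternating minimization of C_2 with the closed-form updates above.

    W: (n, n) symmetric affinity matrix.  R: (n, C) label distributions r_i
    (rows beyond the first l are ignored).  l: number of labeled vertices.
    Returns the learned distributions p, one row per vertex.
    """
    n, C = R.shape
    Wp = W + alpha * np.eye(n)                  # W' = W + alpha * I_n
    labeled = (np.arange(n) < l).astype(float)  # delta(i <= l), 0-indexed
    degree = Wp.sum(axis=1)                     # sum_j w'_ij
    gamma = nu + mu * degree                    # gamma_i

    q = np.full((n, C), 1.0 / C)                # q^(0)(y) > 0 for all y
    for _ in range(num_iters):
        # p-update: p_i(y) proportional to exp{ beta_i(y) / gamma_i }.
        beta = -nu + mu * (Wp @ (np.log(q) - 1.0))
        logits = beta / gamma[:, None]
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)             # normalize by Z_i
        # q-update: weighted average of r_i (if labeled) and neighbors' p_j.
        q = (labeled[:, None] * R + mu * (Wp @ p)) / (labeled + mu * degree)[:, None]
    return p

# Final labels: y_hat_i = argmax_y p[i, y].
```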

The proposed objective C_1 is similar in spirit to the squared-loss based objectives of (Zhu et al., 2003; Bengio et al., 2007). Our method, however, differs in that we optimize the KL-divergence over probability distributions. We show in Section 5 that the KL-divergence based loss significantly outperforms the squared loss. We believe that this could be due to the following: 1) squared loss is appropriate under a Gaussian loss model, which may not be optimal under many circumstances (e.g., classification); 2) the KL-divergence D_KL(p || q) is based on a relative error (relative to p) rather than an absolute error; and 3) under certain natural assumptions, KL-divergence is asymptotically consistent with respect to the underlying probability distributions.

AM is also similar to the spectral graph transducer (Joachims, 2003) in that both attempt to find labelings over the unlabeled data that respect the smoothness constraints of the graph. While spectral graph transduction is an approximate solution to a discrete optimization problem (which is NP-hard), AM is an exact solution obtained by optimizing a convex function over a continuous space. Further, while spectral graph transduction assumes binary classification problems, AM naturally extends to multi-class situations without loss of convexity.

Entropy Minimization (EnM) (Grandvalet and Bengio, 2004) uses the entropy of the unlabeled data as a regularizer while optimizing a parametric loss function defined over the labeled data. While the objectives of both AM and EnM make use of the entropy of the unlabeled data, there are several important differences: (a) EnM is not graph-based; (b) EnM is parametric whereas our proposed approach is non-parametric; and, most importantly, (c) EnM attempts to minimize entropy while the proposed approach aims to maximize entropy. While this may seem a triviality, it has catastrophic consequences in terms of both the mathematics and the meaning: the objective in the case of EnM is not convex, whereas in our case we have a convex formulation with simple update equations and convergence guarantees.

(Wang et al., 2008) is a graph-based SSL algorithm that also employs alternating-minimization-style optimization. However, it is inherently squared-loss based, which our proposed approach outperforms (see Section 5). Further, they neither provide nor state convergence guarantees, and one side of their update approximates an NP-complete optimization procedure.

The information regularization (IR) (Corduneanu and Jaakkola, 2003) algorithm also makes use of a KL-divergence based loss for SSL. Here the input space is divided into regions {R_i}, which might or might not overlap. For a given point x_i ∈ R, IR attempts to minimize the KL-divergence between p_i(y | x_i) and p̂_R(y), the agglomerative distribution for region R. Given a graph, one can define a region to be a vertex and its neighbors, thus making IR amenable to graph-based SSL. In (Corduneanu and Jaakkola, 2003), the agglomeration is performed by simple averaging (the arithmetic mean). While IR suggests (without proof of convergence) the use of alternating minimization for optimization, one of the steps of the optimization does not admit a closed-form solution. This is a serious practical drawback, especially in the case of large data sets. (Tsuda, 2005) (hereafter referred to as PD) is an extension of the IR algorithm to hypergraphs in which the agglomeration is performed using the geometric mean; this leads to closed-form solutions in both steps of the alternating minimization. There are several important differences between IR and PD on one side and our proposed approach on the other: (a) neither IR nor PD uses an entropy regularizer, and (b) the update equation for one of the steps of the optimization in the case of PD (equation 13 in (Tsuda, 2005)) is actually a special case of our update equation for p_i(y) and may be obtained by setting w'_ij = 1/2. Further, our work here may easily be extended to hypergraphs.
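As a toy numeric illustration of point 2) above (our own example, not from the paper): two pairs of distributions can have identical squared loss yet very different KL-divergence, because KL depends on the ratio of probabilities rather than their difference.

```python
import numpy as np

def kl(p, q):
    """KL-divergence D_KL(p || q) in nats for two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def sq(p, q):
    """Squared loss between the same two distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2))

# Both pairs differ by 0.1 in each coordinate, so the squared loss is the same
# (0.02), but the KL-divergence is much larger when the error is large relative
# to the probabilities involved.
print(sq([0.5, 0.5], [0.4, 0.6]), kl([0.5, 0.5], [0.4, 0.6]))       # 0.02, ~0.020
print(sq([0.11, 0.89], [0.01, 0.99]), kl([0.11, 0.89], [0.01, 0.99]))  # 0.02, ~0.169
```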
5 Results

We compare our algorithm (AM) with other state-of-the-art SSL-based text categorization algorithms, namely, (a) SVM (Joachims, 1999), (b) Transductive SVM (TSVM) (Joachims, 1999), (c) Spectral Graph Transduction (SGT) (Joachims, 2003), and (d) Label Propagation (LP) (Zhu and Ghahramani, 2002). Note that only SGT and LP are graph-based algorithms, while SVM is fully supervised (i.e., it does not make use of any of the unlabeled data). We implemented SVM and TSVM using SVM-Light (Joachims, b) and SGT using SGT-Light (Joachims, a). In the case of SVM, TSVM and SGT, we trained |Y| classifiers (one for each class) in a one vs. rest manner, precisely following (Joachims, 2003).

5.1 Reuters-21578

We used the ModApte split of the Reuters-21578 dataset, collected from the Reuters newswire in 1987 (Lewis et al., 1987). The corpus has 9,603 training documents (not to be confused with D) and 3,299 test documents (which represent D_u). Of the 135 potential topic categories, only the 10 most frequent are used (Joachims, 1999). Categories outside the 10 most frequent were collapsed into one class and assigned the label "other". For each document in the training and test sets, we extract features x_i in the following manner: stop-words are removed, followed by the removal of case and of information about inflection (i.e., stemming) (Porter, 1980). We then compute TFIDF features for each document (Salton and Buckley, 1987). All graphs were constructed using cosine similarity with TFIDF features.

For this task, Y = {earn, acq, money, grain, crude, trade, interest, ship, wheat, corn}. For LP and AM, we use the output space Y ∪ {other}. For documents in D_l that are labeled with multiple categories, we initialize r_i to have equal non-zero probability for each such category. For example, if document i is annotated as belonging to the classes {acq, grain, wheat}, then r_i(acq) = r_i(grain) = r_i(wheat) = 1/3.

We created 21 transduction sets by randomly sampling l documents from the training set with the constraint that each of the 11 categories (the top 10 categories and the class "other") is represented at least once in each set. These samples constitute D_l. All algorithms used the same transduction sets. In the case of SGT, LP and AM, the first transduction set was used to tune the hyperparameters, which we then held fixed for all the remaining 20 transduction sets. For all the graph-based approaches, we ran a search over K ∈ {2, 10, 50, 100, 250, 500, 1000, 2000, n} (note that K = n represents a fully connected graph). In addition, in the case of AM, we set α = 2 for all experiments and ran a search over μ ∈ {1e-8, 1e-4, 0.01, 0.1, 1, 10, 100} and ν ∈ {1e-8, 1e-6, 1e-4, 0.01, 0.1}; for SGT the search was over c ∈ {3000, 3200, 3400, 3800, 5000, ...} (see (Joachims, 2003)).

We report precision-recall break-even point (PRBEP) results on the 3,299 test documents in Table 1. PRBEP has been a popular measure in information retrieval (see, e.g., (Raghavan et al., 1989)); it is defined as the value at which precision and recall are equal. Results for each category in Table 1 were obtained by averaging the PRBEP over the 20 transduction sets, and the final row ("average") was obtained by macro-averaging (the average of the averages).

[Table 1: P/R Break-Even Points (PRBEP) for the top 10 categories in the Reuters data set with l = 20. Columns: Category, SVM, TSVM, SGT, LP, AM; rows: earn, acq, money, grain, crude, trade, interest, ship, wheat, corn, average. All results are averages over 20 randomly generated transduction sets. The last row is the macro-average over all the categories. AM is the proposed approach.]

The optimal values of the hyperparameters were, in the case of LP, K = 100; in the case of AM, K = 2000, μ = 1e-4, ν = 1e-2; and in the case of SGT, K = 100. The results show that AM outperforms the state-of-the-art on 6 out of 10 categories and is competitive on 3 of the remaining 4 categories. Further, it significantly outperforms all other approaches in terms of the macro-average. AM is statistically significant over its best competitor, SGT, according to the difference of proportions significance test.

Figure 1 shows the variation of average PRBEP against the number of labeled documents (l). For each value of l, we tuned the hyperparameters over the first transduction set and used these values for all the other 20 sets. Figure 1 also shows error bars (± standard deviation) for all the experiments. As expected, the performance of all the approaches improves with an increasing number of labeled documents.
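For reference, the PRBEP of a single category can be computed from per-document scores as in the following sketch (illustrative code, not from the paper): retrieving exactly as many documents as there are true positives makes precision and recall equal by construction.

```python
import numpy as np

def prbep(scores, labels):
    """Precision-recall break-even point for one binary category.

    scores: (n,) relevance scores (e.g., p_i(y) for the category).
    labels: (n,) binary ground truth (1 = document belongs to the category).
    """
    labels = np.asarray(labels)
    pos = int(labels.sum())
    if pos == 0:
        return 0.0
    # Rank documents by score and keep the top `pos`; at this cutoff
    # precision = recall = (# true positives retrieved) / pos.
    top = np.argsort(-np.asarray(scores))[:pos]
    return labels[top].sum() / pos

# Example: prbep([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]) == 0.5
```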
Once again, AM outperforms the other approaches for all values of l.

[Figure 1: Average PRBEP over all classes vs. number of labeled documents (l) for the Reuters data set; curves for AM, SGT, LP, TSVM and SVM.]

5.2 WebKB Collection

World Wide Knowledge Base (WebKB) is a collection of 8,282 web pages obtained from four academic domains.

The web pages in the WebKB collection are labeled using two different polychotomies: the first is according to topic and the second is according to web domain. In our experiments we only considered the first polychotomy, which consists of 7 categories: course, department, faculty, project, staff, student, and other. Following (Nigam et al., 1998), we only use documents from the categories course, faculty, project, and student, which gives 4,199 documents for the four categories. Each of the documents is in HTML format, containing text as well as other information such as HTML tags, links, etc. We used both textual and non-textual information to construct the feature vectors. In this case we did not use either stop-word removal or stemming, as this has been found to hurt performance on this task (Nigam et al., 1998). As in the case of the Reuters data set, we extracted TFIDF features for each document and constructed the graph using cosine similarity.

As in (Bekkerman et al., 2003), we created four roughly equal random partitions of the data set. In order to obtain D_l, we first randomly choose a split and then sample l documents from that split; the other three splits constitute D_u. We believe this is more realistic than sampling the labeled web pages from a single university and testing on web pages from the other universities (Joachims, 1999). This method of creating transduction sets allows us to better evaluate the generalization performance of the various algorithms. Once again, we created 21 transduction sets, and the first set was used to tune the hyperparameters. Further, we ran a search over the same grid as used in the case of Reuters. We report precision-recall break-even point (PRBEP) results on the 3,148 test documents in Table 2.

[Table 2: P/R Break-Even Points (PRBEP) for the WebKB data set with l = 48. Columns: Class, SVM, TSVM, SGT, LP, AM; rows: course, faculty, project, student, average. All results are averages over 20 randomly generated transduction sets. The last row is the macro-average over all the classes. AM is the proposed approach.]

For this task, we found that the optimal values of the hyperparameters were: in the case of LP, K = 1000; in the case of AM, K = 1000, μ = 1e-2, ν = 1e-4; and in the case of SGT, K = 100. Once again, AM is significant over its closest competitor, LP. Figure 2 shows the variation of PRBEP with the number of labeled documents (l) and was generated in a similar fashion as in the case of the Reuters data set.

[Figure 2: Average PRBEP over all classes vs. number of labeled documents (l) for the WebKB collection; curves for AM, SGT, LP, TSVM and SVM.]

6 Discussion

We note that LP may be cast into an AM-like framework by using the following sequence of updates:

  p_i^(n)(y) = δ(i ≤ l) r_i(y) + δ(i > l) q_i^(n−1)(y),
  q_i^(n)(y) = Σ_j w_ij p_j^(n)(y) / Σ_j w_ij.
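The sketch below (illustrative only; the edge weights of the small graph are made up to loosely mirror the 5-node example discussed next, since the exact weights of Figure 3 are not reproduced here) runs these LP updates, making the clamping of the labeled vertices explicit.

```python
import numpy as np

def lp_am_style(W, R, l, num_iters=15):
    """Label propagation written as the alternating updates above.

    W: (n, n) symmetric affinity matrix.  R: (n, C) label distributions for the
    first l (labeled) vertices.  Returns the final q distributions.
    """
    n, C = R.shape
    labeled = np.arange(n) < l
    q = np.full((n, C), 1.0 / C)               # q^(0); any positive start works
    for _ in range(num_iters):
        # p-update: clamp labeled vertices to r_i, copy q for unlabeled ones.
        p = np.where(labeled[:, None], R, q)
        # q-update: weighted average of the neighbors' p distributions.
        q = (W @ p) / W.sum(axis=1, keepdims=True)
    return q

# A toy 5-node binary-classification graph: nodes 0 and 1 are labeled with
# classes 1 and 2, node 4 is a pendant node; the weights are hypothetical.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
R = np.zeros((5, 2))
R[0, 0] = 1.0
R[1, 1] = 1.0
print(lp_am_style(W, R, l=2))
```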

To compare the behavior of AM and LP, we applied this form of LP, along with AM, to a simple 5-node binary-classification SSL graph in which two nodes are labeled (nodes 1 and 2) and the remaining nodes are unlabeled (see Figure 3, top). Since this is binary classification (|Y| = 2), each distribution p_i or q_i can be depicted using a single real number between 0 and 1, corresponding to the probability that the vertex belongs to class 2. We show how both LP and AM evolve starting from exactly the same random starting point q^(0) (Figure 3, bottom). The figure shows that both algorithms clearly converge. Each alternate iteration of LP is such that the labeled vertices oscillate, due to LP's clamping back to the labeled distribution, but that is not the case for AM. We see, moreover, qualitative differences in the solutions as well; e.g., AM's solution for the pendant node 5 is less confident than LP's solution. More empirical comparative analysis of this sort between the two algorithms will appear in future work.

[Figure 3: Graph (top), and alternating values of p^(n), q^(n) for increasing n, for AM and LP (bottom). Node 1 and node 2 are labeled; nodes 3, 4, and 5 are unlabeled.]

We have proposed a new algorithm for semi-supervised text categorization. Empirical results show that the proposed approach significantly outperforms the state-of-the-art. In addition, the proposed approach is relatively simple to implement and has guaranteed convergence properties. While in this work we use relatively simple features to construct the graph, the use of more sophisticated features and/or similarity measures could lead to further improved results.

Acknowledgments

This work was supported by ONR MURI grant N..., by NSF grant IIS..., by the Companions project (IST programme under EC grant IST-FP...), and by a Microsoft Research Fellowship.

References

Alexandrescu, A. and Kirchhoff, K. (2007). Data-driven graph construction for semi-supervised graph-based learning in NLP. In Proc. of the Human Language Technologies Conference (HLT-NAACL).

Bekkerman, R., El-Yaniv, R., Tishby, N., and Winter, Y. (2003). Distributional word clusters vs. words for text categorization. J. Mach. Learn. Res., 3.

Belkin, M., Niyogi, P., and Sindhwani, V. (2005). On manifold regularization. In Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS).

Bengio, Y., Delalleau, O., and Le Roux, N. (2007). Label propagation and quadratic criterion. In Semi-Supervised Learning. MIT Press.

Bertsekas, D. (2004). Nonlinear Programming. Athena Scientific Publishing.

Blitzer, J. and Zhu, J. (2008). ACL 2008 tutorial on semi-supervised learning. wikidot.com/.

Blum, A. and Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA.

Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory.

Chapelle, O., Scholkopf, B., and Zien, A. (2007). Semi-Supervised Learning. MIT Press.

Corduneanu, A. and Jaakkola, T. (2003). On information regularization. In Uncertainty in Artificial Intelligence.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley Series in Telecommunications. Wiley, New York.

Csiszar, I. and Tusnady, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions.

Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management, New York, NY, USA.

Grandvalet, Y. and Bengio, Y. (2004). Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems (NIPS).

Joachims, T. SGT Light.

Joachims, T. SVM Light. joachims.org.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. of the International Conference on Machine Learning (ICML).

Joachims, T. (2003). Transductive learning via spectral graph partitioning. In Proc. of the International Conference on Machine Learning (ICML).

Lewis, D. et al. (1987). Reuters-21578. http://.../testcollections/reuters.

Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (1998). Learning to classify text from labeled and unlabeled documents. In AAAI '98/IAAI '98: Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence.

Pearl, J. (1990). Jeffrey's rule, passage of experience and neo-Bayesianism. In Knowledge Representation and Defeasible Reasoning. Kluwer Academic Publishers.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3).

Raghavan, V., Bollmann, P., and Jung, G. S. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans. Inf. Syst., 7(3).

Salton, G. and Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY, USA.

Sindhwani, V., Niyogi, P., and Belkin, M. (2005). Beyond the point cloud: from transductive to semi-supervised learning. In Proc. of the International Conference on Machine Learning (ICML).

Szummer, M. and Jaakkola, T. (2001). Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems, volume 14.

Tsuda, K. (2005). Propagating distributions on a hypergraph by dual information regularization. In Proceedings of the 22nd International Conference on Machine Learning.

Wang, J., Jebara, T., and Chang, S.-F. (2008). Graph transduction via alternating minimization. In Proc. of the International Conference on Machine Learning (ICML).

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods.
In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.

Zhu, X. (2005a). Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.

Zhu, X. (2005b). Semi-Supervised Learning with Graphs. PhD thesis, Carnegie Mellon University.

Zhu, X. and Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University.

Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. of the International Conference on Machine Learning (ICML).


Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

Spectral Clustering. Shannon Quinn

Spectral Clustering. Shannon Quinn Spectral Clusterng Shannon Qunn (wth thanks to Wllam Cohen of Carnege Mellon Unverst, and J. Leskovec, A. Raaraman, and J. Ullman of Stanford Unverst) Graph Parttonng Undrected graph B- parttonng task:

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Lecture 10 Support Vector Machines. Oct

Lecture 10 Support Vector Machines. Oct Lecture 10 Support Vector Machnes Oct - 20-2008 Lnear Separators Whch of the lnear separators s optmal? Concept of Margn Recall that n Perceptron, we learned that the convergence rate of the Perceptron

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm Desgn and Optmzaton of Fuzzy Controller for Inverse Pendulum System Usng Genetc Algorthm H. Mehraban A. Ashoor Unversty of Tehran Unversty of Tehran h.mehraban@ece.ut.ac.r a.ashoor@ece.ut.ac.r Abstract:

More information

Temperature. Chapter Heat Engine

Temperature. Chapter Heat Engine Chapter 3 Temperature In prevous chapters of these notes we ntroduced the Prncple of Maxmum ntropy as a technque for estmatng probablty dstrbutons consstent wth constrants. In Chapter 9 we dscussed the

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Solutions to exam in SF1811 Optimization, Jan 14, 2015

Solutions to exam in SF1811 Optimization, Jan 14, 2015 Solutons to exam n SF8 Optmzaton, Jan 4, 25 3 3 O------O -4 \ / \ / The network: \/ where all lnks go from left to rght. /\ / \ / \ 6 O------O -5 2 4.(a) Let x = ( x 3, x 4, x 23, x 24 ) T, where the varable

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

The Study of Teaching-learning-based Optimization Algorithm

The Study of Teaching-learning-based Optimization Algorithm Advanced Scence and Technology Letters Vol. (AST 06), pp.05- http://dx.do.org/0.57/astl.06. The Study of Teachng-learnng-based Optmzaton Algorthm u Sun, Yan fu, Lele Kong, Haolang Q,, Helongang Insttute

More information

Appendix B: Resampling Algorithms

Appendix B: Resampling Algorithms 407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Lecture 20: November 7

Lecture 20: November 7 0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

x = , so that calculated

x = , so that calculated Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to

More information