Soft-Supervised Learning for Text Classification


Amarnag Subramanya & Jeff Bilmes
Dept. of Electrical Engineering, University of Washington, Seattle, WA 98195, USA.

Abstract

We propose a new graph-based semi-supervised learning (SSL) algorithm and demonstrate its application to document categorization. Each document is represented by a vertex within a weighted undirected graph, and our proposed framework minimizes the weighted Kullback-Leibler divergence between distributions that encode the class membership probabilities of each vertex. The proposed objective is convex, with guaranteed convergence using an alternating minimization procedure. Further, it generalizes in a straightforward manner to multi-class problems. We present results on two standard tasks, namely Reuters-21578 and WebKB, showing that the proposed algorithm significantly outperforms the state-of-the-art.

1 Introduction

Semi-supervised learning (SSL) employs small amounts of labeled data together with relatively large amounts of unlabeled data to train classifiers. In many problems, such as speech recognition, document classification, and sentiment recognition, annotating training data is both time-consuming and tedious, while unlabeled data are easily obtained, making these problems useful applications of SSL. Classic examples of SSL algorithms include self-training (Yarowsky, 1995) and co-training (Blum and Mitchell, 1998). Graph-based SSL algorithms are an important class of SSL techniques that have attracted much attention of late (Blum and Chawla, 2001; Zhu et al., 2003). Here one assumes that the data (both labeled and unlabeled) are embedded within a low-dimensional manifold expressed by a graph. In other words, each data sample is represented by a vertex within a weighted graph, with the weights providing a measure of similarity between vertices.

Most graph-based SSL algorithms fall under one of two categories: those that use the graph structure to spread labels from labeled to unlabeled samples (Szummer and Jaakkola, 2001; Zhu and Ghahramani, 2002), and those that optimize a loss function based on smoothness constraints derived from the graph (Blum and Chawla, 2001; Zhu et al., 2003; Joachims, 2003; Belkin et al., 2005). Sometimes the two categories are similar in that they can be shown to optimize the same underlying objective (Zhu and Ghahramani, 2002; Zhu et al., 2003). In general, graph-based SSL algorithms are non-parametric and transductive (excluding Manifold Regularization (Belkin et al., 2005)). A learning algorithm is said to be transductive if it is expected to work only on a closed data set, where the test set is revealed at the time of training. In practice, however, transductive learners can be modified to handle unseen data (Zhu, 2005a; Sindhwani et al., 2005).

A common drawback of many graph-based SSL algorithms (e.g., Blum and Chawla, 2001; Joachims, 2003; Belkin et al., 2005) is that they assume binary classification tasks and thus require the use of sub-optimal (and often computationally expensive) approaches such as one vs. rest to solve multi-class problems, let alone structured domains such as strings and trees.

There are also issues related to degenerate solutions, in which all unlabeled samples are classified as belonging to a single class (Blum and Chawla, 2001; Joachims, 2003; Zhu and Ghahramani, 2002). For more background on graph-based and general SSL and their applications, see (Zhu, 2005a; Chapelle et al., 2007; Blitzer and Zhu, 2008).

In this paper we propose a new algorithm for graph-based SSL and use the task of text classification to demonstrate its benefits over the current state-of-the-art. Text classification involves automatically assigning a given document to a fixed number of semantic categories. Each document may belong to one, many, or none of the categories. In general, text classification is a multi-class problem (more than 2 categories). Training fully-supervised text classifiers requires large amounts of labeled data whose annotation can be expensive (Dumais et al., 1998). As a result, there has been interest in using SSL techniques for text classification (Joachims, 1999; Joachims, 2003). However, past work in semi-supervised text classification has relied primarily on one vs. rest approaches to overcome the inherent multi-class nature of this problem. We believe such an approach may be sub-optimal because, disregarding data overlap, the different classifiers have training procedures that are independent of one another.

In order to address the above drawback, we propose a new framework based on optimizing a loss function composed of Kullback-Leibler divergence (KL-divergence) (Cover and Thomas, 1991) terms between probability distributions defined for each graph vertex. The use of probability distributions, rather than fixed integer labels, not only leads to a straightforward multi-class generalization, but also allows us to exploit other well-defined functions of distributions, such as entropy, to improve system performance and to allow for a measure of uncertainty. For example, with a single integer, at most all we know is its assignment. With a distribution, we can continuously move from knowing an assignment with certainty (i.e., an entropy of zero) to expressions of doubt or multiple valid possibilities (i.e., an entropy greater than zero). This is particularly useful for document classification, as we will see. We also show how one can use the alternating minimization (Csiszar and Tusnady, 1984) algorithm to optimize our objective, leading to a relatively simple, fast, easy-to-implement, guaranteed-to-converge, iterative, closed-form update for each iteration.

2 Proposed Graph-Based Learning Framework

We consider the transductive learning problem: given a training set D = {D_l, D_u}, where D_l and D_u are the sets of labeled and unlabeled samples respectively, the task is to infer the labels for the samples in D_u. In other words, D_u is the test set. Here D_l = {(x_i, y_i)} for i = 1, ..., l, D_u = {x_i} for i = l+1, ..., l+u, x_i ∈ X (the input space of the classifier, which corresponds to vectors of features), and y_i ∈ Y (the space of classifier outputs, which in our case is a space of non-negative integers). Thus |Y| = 2 yields binary classification while |Y| > 2 yields multi-class. We define n = l + u, the total number of samples in the training set.

Given D, most graph-based SSL algorithms utilize an undirected weighted graph G = (V, E), where V = {1, ..., n} are the data points in D and E ⊆ V × V is the set of undirected edges between vertices. We use w_ij ∈ W to denote the weight of the edge between vertices i and j; W is referred to as the weight (or affinity) matrix of G. As will be seen shortly, the input features x_i affect the final classification results via W, i.e., the graph. Thus graph construction is crucial to the success of any graph-based SSL algorithm. Graph construction is "more of an art than a science" (Zhu, 2005b) and is an active research area (Alexandrescu and Kirchhoff, 2007).
In general, the weights are formed as w_ij = sim(x_i, x_j) δ(j ∈ K(i)), where K(i) is the set of i's k nearest neighbors (KNN), sim(x_i, x_j) is a given measure of similarity between x_i and x_j, and δ(c) returns 1 if c is true and 0 otherwise. Getting the similarity measure right is crucial to the success of any SSL algorithm, as it determines the graph. Note that setting K(i) = V (i.e., k = n) results in a fully-connected graph. Some popular similarity measures include

  sim(x_i, x_j) = exp( −||x_i − x_j||² / σ² )   or
  sim(x_i, x_j) = cos(x_i, x_j) = ⟨x_i, x_j⟩ / ( ||x_i||₂ ||x_j||₂ ),

where ||x||₂ is the L2 norm and ⟨x_i, x_j⟩ is the inner product of x_i and x_j. The first similarity measure is an RBF kernel applied to the squared Euclidean distance, while the second is cosine similarity. In this paper all graphs are constructed using cosine similarity.
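To make the construction above concrete, the following is a minimal sketch (ours, not the authors' code) of building a symmetric k-NN affinity matrix from feature vectors using cosine similarity; the function and variable names are illustrative only, and symmetrizing via an element-wise maximum is just one common choice for making the graph undirected.

```python
import numpy as np

def build_knn_graph(X, k):
    """Build a symmetric k-NN affinity matrix W using cosine similarity.

    X: (n, d) array of feature vectors (e.g., TFIDF), one row per document.
    k: number of nearest neighbors (k < n); larger k gives a denser graph.
    """
    n = X.shape[0]
    k = min(k, n - 1)                  # a vertex cannot be its own neighbor here
    # Cosine similarity: normalize rows to unit L2 norm, then take inner products.
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)
    S = Xn @ Xn.T                      # S[i, j] = cos(x_i, x_j)
    np.fill_diagonal(S, -np.inf)       # exclude self when picking neighbors

    W = np.zeros((n, n))
    for i in range(n):
        nbrs = np.argsort(-S[i])[:k]   # indices of the k most similar vertices
        W[i, nbrs] = S[i, nbrs]
    # Symmetrize so the graph is undirected (w_ij = w_ji).
    return np.maximum(W, W.T)
```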

We next introduce our proposed approach. For every i ∈ V, we define a probability distribution p_i over the elements of Y. In addition, let r_j, j = 1, ..., l, be another set of probability distributions, again over the elements of Y (recall, Y is the space of classifier outputs). Here {r_j} represents the labels of the supervised portion of the training data. If the label for a given labeled data point consists only of a single integer, then the entropy of the corresponding r_j is zero (the probability of that integer will be unity, with the remaining probabilities being zero). If, on the other hand, the label for a given labeled data point consists of a set of integers (e.g., if the object is a member of multiple classes), then r_j is able to represent this property accordingly (see below). We emphasize again that both p_i and r_j are probability distributions, with r_j fixed throughout training.

The goal of learning in this paper is to find the best set of distributions p_i that attempt to: 1) agree with the labeled data r_j wherever it is available; 2) agree with each other (when they are close according to the graph); and 3) be smooth in some way. These criteria are captured in the following new multi-class SSL optimization procedure:

  min_p C_1(p), where
  C_1(p) = Σ_{i=1..l} D_KL(r_i || p_i) + μ Σ_{i=1..n} Σ_{j=1..n} w_ij D_KL(p_i || p_j) − ν Σ_{i=1..n} H(p_i),   (1)

where p ≜ (p_1, ..., p_n) denotes the entire set of distributions to be learned, H(p_i) = −Σ_y p_i(y) log p_i(y) is the standard Shannon entropy of p_i, D_KL(p_i || q_j) is the KL-divergence between p_i and q_j, and μ and ν are hyperparameters whose selection we discuss in Section 5.

The distributions r_i are derived from D_l (as mentioned above), and this can be done in one of the following ways: (a) if ŷ_i is the single supervised label for input x_i, then r_i(y) = δ(y = ŷ_i), which means that r_i gives unity probability for y equaling the label ŷ_i; (b) if ŷ_i = {ŷ_i^(1), ..., ŷ_i^(k)}, k ≤ |Y|, is a set of possible outputs for input x_i, meaning the object validly falls into all of the corresponding categories, we set r_i(y) = (1/k) δ(y ∈ ŷ_i), meaning that r_i is uniform over only the possible categories and zero otherwise; (c) if the labels are somehow provided in the form of a set of non-negative scores, or even a probability distribution itself, we simply set r_i equal to those scores, (possibly) normalized to become a valid probability distribution. Among these three cases, case (b) is particularly relevant to text classification, as a given document may belong to (and in practice may be labeled as belonging to) many classes. The final classification results, i.e., the final labels for D_u, are then given by ŷ_i = argmax_{y ∈ Y} p_i(y).

We next provide further intuition about our objective function. SSL on a graph consists of finding a labeling of D_u that is consistent with both the labels provided in D_l and the geometry of the data induced by the graph. The first term of C_1 penalizes the solution p_i, i ∈ {1, ..., l}, when it is far away from the labeled training data D_l, but it does not insist that p_i = r_i, as allowing for deviations from r_i can help, especially with noisy labels (Bengio et al., 2007) or when the graph is extremely dense in certain regions. As explained above, our framework allows for the case where supervised training is uncertain or ambiguous. We consider it reasonable to call our approach soft-supervised learning, generalizing the notion of semi-supervised learning, since there is even more of a continuum here between fully supervised and fully unsupervised learning than what typically exists with SSL. Soft-supervised learning allows uncertainty to be expressed (via a probability distribution) about any of the labels individually.
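As an aside, cases (a) and (b) above take only a few lines to realize; the sketch below is illustrative (the class indices in the example are hypothetical, not taken from the paper).

```python
import numpy as np

def make_label_distribution(labels, num_classes):
    """Return the fixed distribution r_i for a labeled document.

    labels: a single class index (case a) or an iterable of class indices (case b).
    """
    r = np.zeros(num_classes)
    if np.isscalar(labels):
        r[labels] = 1.0                # case (a): all mass on the single label
    else:
        labels = list(labels)
        r[labels] = 1.0 / len(labels)  # case (b): uniform over the given labels
    return r

# Example: a document tagged with three classes (indices 1, 3, 8 out of 11, made up
# for illustration) gets probability 1/3 on each of those classes and 0 elsewhere.
r_i = make_label_distribution([1, 3, 8], num_classes=11)
```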
The second term of C_1 penalizes a lack of consistency with the geometry of the data and can be seen as a graph regularizer. If w_ij is large, we prefer a solution in which p_i and p_j are close in the KL-divergence sense. While KL-divergence is asymmetric, the fact that G is undirected implies that W is symmetric (w_ij = w_ji), and as a result the second term is inherently symmetric. The last term encourages each p_i to be close to the uniform distribution if not preferred to the contrary by the first two terms. This acts as a guard against the degenerate solutions commonly encountered in SSL (Blum and Chawla, 2001; Joachims, 2003). For example, consider the case where part of the graph is almost completely disconnected from any labeled vertex (which is possible in the k-nearest-neighbor case).

In such situations, the third term ensures that the nodes in this disconnected region are encouraged to yield a uniform distribution, validly expressing the fact that we do not know the labels of these nodes based on the nature of the graph. More generally, we conjecture that by maximizing the entropy of each p_i, the classifier has a better chance of producing high-entropy results in graph regions of low confidence (e.g., close to the decision boundary and/or in low-density regions). This overcomes a common drawback of a large number of state-of-the-art classifiers, which tend to be confident even in regions close to the decision boundary.

We conclude this section by summarizing some of the features of our proposed framework. It should be clear that C_1 uses the manifold assumption for SSL (see chapter 2 in (Chapelle et al., 2007)): it assumes that the input data can be embedded within a low-dimensional manifold (the graph). As the objective is defined in terms of probability distributions over integers, rather than just integers (or real-valued relaxations of integers (Joachims, 2003; Zhu et al., 2003)), the framework generalizes in a straightforward manner to multi-class problems. Further, all the parameters are estimated jointly (compare this to one vs. rest approaches, which involve solving |Y| independent problems). Furthermore, the objective is capable of handling uncertainty in the labeled training data (Pearl, 1990). Of course, this objective would be useless if it were not possible to optimize it efficiently and easily on large data sets. We next describe a method that can do this.

3 Learning with Alternating Minimization

As long as μ, ν ≥ 0, the objective C_1(p) is convex. This follows since D_KL(p_i || p_j) is convex in the pair (p_i, p_j) (Cover and Thomas, 1991), negative entropy is convex, and a positive-weighted linear combination of a set of convex functions is convex. Thus, the problem of minimizing C_1 over the space of collections of probability distributions (a convex set) constitutes a convex programming problem (Bertsekas, 2004). This property is extremely beneficial since there is a unique global optimum and there is a variety of methods that can be used to find that global optimum. One possible method might take the derivative of the objective along with Lagrange multipliers to ensure that we stay within the space of probability distributions; this can sometimes yield a closed-form, single-step analytical expression for the globally optimal solution. Unfortunately, however, our problem does not admit such a closed-form solution, because the gradient of C_1(p) with respect to p_i(y) is of the form k_1 p_i(y) log p_i(y) + k_2 p_i(y) + k_3 (where k_1, k_2, k_3 are fixed constants). Sometimes optimizing the dual of the objective can also produce a solution, but unfortunately the dual of our objective does not yield a closed-form solution either. The typical next step, then, is to resort to iterative techniques such as gradient descent, along with modifications to ensure that the solution stays within the set of probability distributions (the gradient of C_1 alone will not necessarily point in a direction where p remains a valid distribution); one such modification is the method of multipliers (MOM). Another option would be to use computationally complex (and complicated) algorithms like interior point methods (IPM). While all of the above methods (described in detail in (Bertsekas, 2004)) are feasible ways to solve our problem, they each have their own drawbacks. Using MOM, for example, requires the careful tuning of a number of additional parameters such as learning rates, growth factors, and so on. IPM involves inverting a matrix of the order of the number of variables and constraints during each iteration.
We instead adopt a different strategy based on alternating minimization (Csiszar and Tusnady, 1984). This approach has a single additional optimization parameter (contrasted with MOM), admits a closed-form solution for each iteration that does not involve any matrix inversion (contrasted with IPM), and yields guaranteed convergence to the global optimum. In order to render our approach amenable to AM, however, we relax our objective C_1 by defining a new (third) set of distributions q_i, i = 1, ..., n, over all training samples, denoted collectively, like the above, using the notation q ≜ (q_1, ..., q_n). We define a new objective to be optimized as follows:

  min_{p,q} C_2(p, q), where
  C_2(p, q) = Σ_{i=1..l} D_KL(r_i || q_i) + μ Σ_{i=1..n} Σ_{j ∈ N(i)} w'_ij D_KL(p_i || q_j) − ν Σ_{i=1..n} H(p_i),

where N(i) denotes the neighborhood of vertex i and the modified weights w'_ij are defined below.

Before going further, the reader may be wondering at this juncture how it could be desirable to have apparently complicated the objective function in an attempt to yield a more computationally and methodologically superior machine learning procedure. This is indeed the case, as will be spelled out below. First, in C_2 we have defined a new weight matrix [W']_ij = w'_ij of the same size as the original, where W' = W + α I_n, I_n is the n × n identity matrix, and α ≥ 0 is a non-negative constant (this is the optimization-related parameter mentioned above). This has the effect that w'_ii ≥ w_ii. In the original objective C_1, w_ii is irrelevant since D_KL(p_i || p_i) = 0 for all p_i, but since there are now two distributions for each training point, there should be encouragement for the two to approach each other. Like C_1, the first term of C_2 ensures that the labeled training data is respected and the last term is a smoothness regularizer, but these are expressed via different sets of distributions, q_i and p_i respectively; this choice is what makes possible the relatively simple analytical update equations given below. Next, we see that the two objective functions in fact have identical solutions when the optimization enforces the constraint that p and q are equal:

  min_{(p,q): p=q} C_2(p, q) = min_p C_1(p).

Indeed, as α gets large, the only solutions considered viable are those where p = q. We thus have that

  lim_{α→∞} min_{p,q} C_2(p, q) = min_p C_1(p).

Therefore, the two objectives should yield the same solution as long as α ≫ w_ij for all i, j. A key advantage of this relaxed objective is that it is amenable to alternating minimization, a method that produces a sequence of sets of distributions (p^(n), q^(n)) as follows:

  p^(n) = argmin_p C_2(p, q^(n−1)),    q^(n) = argmin_q C_2(p^(n), q).

It can be shown (we omit the rather lengthy proof due to space constraints) that the sequence generated using the above minimizations converges to the minimum of C_2(p, q), i.e.,

  lim_{n→∞} C_2(p^(n), q^(n)) = inf_{p,q} C_2(p, q),

provided we start with a properly initialized distribution, i.e., q_i^(0)(y) > 0 for all y ∈ Y. The update equations for p^(n) and q^(n) are given by

  p_i^(n)(y) = (1/Z_i) exp( β_i^(n−1)(y) / γ_i ),
  q_i^(n)(y) = [ r_i(y) δ(i ≤ l) + μ Σ_j w'_ij p_j^(n)(y) ] / [ δ(i ≤ l) + μ Σ_j w'_ij ],

where

  γ_i = ν + μ Σ_j w'_ij,
  β_i^(n−1)(y) = −ν + μ Σ_j w'_ij ( log q_j^(n−1)(y) − 1 ),

and where Z_i is a normalizing constant ensuring that p_i is a valid probability distribution. Note that each iteration of the proposed framework has a closed-form solution and is relatively simple to implement, even for very large graphs. Henceforth we refer to the proposed objective optimized using alternating minimization as AM.

4 Connections to Other Approaches

Label propagation (LP) (Zhu and Ghahramani, 2002) is a graph-based SSL algorithm that performs Markov random walks on the graph and has a straightforward extension to multi-class problems. The update equation for LP (which we also use for our LP implementation) may be written as

  p_i^(n)(y) = [ r_i(y) δ(i ≤ l) + δ(i > l) Σ_j w_ij p_j^(n−1)(y) ] / [ δ(i ≤ l) + δ(i > l) Σ_j w_ij ].

Note the similarity to the update equation for q^(n) in our AM case. It has been shown that the squared-loss based SSL algorithm (Zhu et al., 2003) and LP have similar updates (Bengio et al., 2007).
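For concreteness, the following is a minimal, illustrative sketch of the alternating-minimization iterations above (a vectorized reimplementation under the stated assumptions W' = W + αI and r_i fixed for the first l vertices, not the authors' released code).

```python
import numpy as np

def am_updates(W, R, l, mu, nu, alpha, num_iters=50):
    """Alternating minimization of C_2 with the closed-form updates above.

    W: (n, n) symmetric affinity matrix.  R: (n, C) label distributions r_i
    (rows beyond the first l are ignored).  l: number of labeled vertices.
    Returns the learned distributions p, one row per vertex.
    """
    n, C = R.shape
    Wp = W + alpha * np.eye(n)                  # W' = W + alpha * I_n
    labeled = (np.arange(n) < l).astype(float)  # delta(i <= l), 0-indexed
    degree = Wp.sum(axis=1)                     # sum_j w'_ij
    gamma = nu + mu * degree                    # gamma_i

    q = np.full((n, C), 1.0 / C)                # q^(0)(y) > 0 for all y
    for _ in range(num_iters):
        # p-update: p_i(y) proportional to exp{ beta_i(y) / gamma_i }.
        beta = -nu + mu * (Wp @ (np.log(q) - 1.0))
        logits = beta / gamma[:, None]
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits)
        p /= p.sum(axis=1, keepdims=True)             # normalize by Z_i
        # q-update: weighted average of r_i (if labeled) and neighbors' p_j.
        q = (labeled[:, None] * R + mu * (Wp @ p)) / (labeled + mu * degree)[:, None]
    return p

# Final labels: y_hat_i = argmax_y p[i, y].
```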

The proposed objective C_1 is similar in spirit to the squared-loss based objectives of (Zhu et al., 2003; Bengio et al., 2007). Our method, however, differs in that we optimize the KL-divergence over probability distributions. We show in Section 5 that the KL-divergence based loss significantly outperforms the squared loss. We believe that this could be due to the following: 1) squared loss is appropriate under a Gaussian loss model, which may not be optimal under many circumstances (e.g., classification); 2) the KL-divergence D_KL(p || q) is based on a relative error (relative to p) rather than an absolute error; and 3) under certain natural assumptions, KL-divergence is asymptotically consistent with respect to the underlying probability distributions.

AM is also similar to the spectral graph transducer (Joachims, 2003) in that both attempt to find labelings over the unlabeled data that respect the smoothness constraints of the graph. While spectral graph transduction is an approximate solution to a discrete optimization problem (which is NP-hard), AM is an exact solution obtained by optimizing a convex function over a continuous space. Further, while spectral graph transduction assumes binary classification problems, AM naturally extends to multi-class situations without loss of convexity.

Entropy Minimization (EnM) (Grandvalet and Bengio, 2004) uses the entropy of the unlabeled data as a regularizer while optimizing a parametric loss function defined over the labeled data. While the objectives of both AM and EnM make use of the entropy of the unlabeled data, there are several important differences: (a) EnM is not graph-based; (b) EnM is parametric whereas our proposed approach is non-parametric; and, most importantly, (c) EnM attempts to minimize entropy while the proposed approach aims to maximize entropy. While this may seem a triviality, it has catastrophic consequences in terms of both the mathematics and the meaning: the objective in the case of EnM is not convex, whereas in our case we have a convex formulation with simple update equations and convergence guarantees.

(Wang et al., 2008) is a graph-based SSL algorithm that also employs alternating-minimization-style optimization. However, it is inherently squared-loss based, which our proposed approach outperforms (see Section 5). Further, they neither provide nor state convergence guarantees, and one side of their update approximates an NP-complete optimization procedure.

The information regularization (IR) (Corduneanu and Jaakkola, 2003) algorithm also makes use of a KL-divergence based loss for SSL. Here the input space is divided into regions {R_i}, which might or might not overlap. For a given point x_i ∈ R, IR attempts to minimize the KL-divergence between p_i(y | x_i) and p̂_R(y), the agglomerative distribution for region R. Given a graph, one can define a region to be a vertex and its neighbors, thus making IR amenable to graph-based SSL. In (Corduneanu and Jaakkola, 2003), the agglomeration is performed by simple averaging (the arithmetic mean). While IR suggests (without proof of convergence) the use of alternating minimization for optimization, one of the steps of the optimization does not admit a closed-form solution. This is a serious practical drawback, especially in the case of large data sets. (Tsuda, 2005) (hereafter referred to as PD) is an extension of the IR algorithm to hypergraphs in which the agglomeration is performed using the geometric mean; this leads to closed-form solutions in both steps of the alternating minimization. There are several important differences between IR and PD on one side and our proposed approach on the other: (a) neither IR nor PD uses an entropy regularizer, and (b) the update equation for one of the steps of the optimization in the case of PD (equation 13 in (Tsuda, 2005)) is actually a special case of our update equation for p_i(y) and may be obtained by setting w'_ij = 1/2. Further, our work here may easily be extended to hypergraphs.
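As a toy numeric illustration of point 2) above (our own example, not from the paper): two pairs of distributions can have identical squared loss yet very different KL-divergence, because KL depends on the ratio of probabilities rather than their difference.

```python
import numpy as np

def kl(p, q):
    """KL-divergence D_KL(p || q) in nats for two discrete distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum(p * np.log(p / q)))

def sq(p, q):
    """Squared loss between the same two distributions."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    return float(np.sum((p - q) ** 2))

# Both pairs differ by 0.1 in each coordinate, so the squared loss is the same
# (0.02), but the KL-divergence is much larger when the error is large relative
# to the probabilities involved.
print(sq([0.5, 0.5], [0.4, 0.6]), kl([0.5, 0.5], [0.4, 0.6]))       # 0.02, ~0.020
print(sq([0.11, 0.89], [0.01, 0.99]), kl([0.11, 0.89], [0.01, 0.99]))  # 0.02, ~0.169
```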
5 Results

We compare our algorithm (AM) with other state-of-the-art SSL-based text categorization algorithms, namely, (a) SVM (Joachims, 1999), (b) Transductive SVM (TSVM) (Joachims, 1999), (c) Spectral Graph Transduction (SGT) (Joachims, 2003), and (d) Label Propagation (LP) (Zhu and Ghahramani, 2002). Note that only SGT and LP are graph-based algorithms, while SVM is fully supervised (i.e., it does not make use of any of the unlabeled data). We implemented SVM and TSVM using SVM-Light (Joachims, b) and SGT using SGT-Light (Joachims, a). In the case of SVM, TSVM and SGT, we trained |Y| classifiers (one for each class) in a one vs. rest manner, precisely following (Joachims, 2003).

5.1 Reuters-21578

We used the ModApte split of the Reuters-21578 dataset, collected from the Reuters newswire in 1987 (Lewis et al., 1987). The corpus has 9,603 training documents (not to be confused with D) and 3,299 test documents (which represent D_u). Of the 135 potential topic categories, only the 10 most frequent are used (Joachims, 1999). Categories outside the 10 most frequent were collapsed into one class and assigned the label "other". For each document in the training and test sets, we extract features x_i in the following manner: stop-words are removed, followed by the removal of case and of information about inflection (i.e., stemming) (Porter, 1980). We then compute TFIDF features for each document (Salton and Buckley, 1987). All graphs were constructed using cosine similarity with TFIDF features.

For this task, Y = {earn, acq, money, grain, crude, trade, interest, ship, wheat, corn}. For LP and AM, we use the output space Y ∪ {other}. For documents in D_l that are labeled with multiple categories, we initialize r_i to have equal non-zero probability for each such category. For example, if document i is annotated as belonging to the classes {acq, grain, wheat}, then r_i(acq) = r_i(grain) = r_i(wheat) = 1/3.

We created 21 transduction sets by randomly sampling l documents from the training set with the constraint that each of the 11 categories (the top 10 categories and the class "other") is represented at least once in each set. These samples constitute D_l. All algorithms used the same transduction sets. In the case of SGT, LP and AM, the first transduction set was used to tune the hyperparameters, which we then held fixed for all the remaining 20 transduction sets. For all the graph-based approaches, we ran a search over K ∈ {2, 10, 50, 100, 250, 500, 1000, 2000, n} (note that K = n represents a fully connected graph). In addition, in the case of AM, we set α = 2 for all experiments and ran a search over μ ∈ {1e-8, 1e-4, 0.01, 0.1, 1, 10, 100} and ν ∈ {1e-8, 1e-6, 1e-4, 0.01, 0.1}; for SGT the search was over c ∈ {3000, 3200, 3400, 3800, 5000, ...} (see (Joachims, 2003)).

We report precision-recall break-even point (PRBEP) results on the 3,299 test documents in Table 1. PRBEP has been a popular measure in information retrieval (see, e.g., (Raghavan et al., 1989)); it is defined as the value at which precision and recall are equal. Results for each category in Table 1 were obtained by averaging the PRBEP over the 20 transduction sets, and the final row ("average") was obtained by macro-averaging (the average of the averages).

[Table 1: P/R Break-Even Points (PRBEP) for the top 10 categories in the Reuters data set with l = 20. Columns: Category, SVM, TSVM, SGT, LP, AM; rows: earn, acq, money, grain, crude, trade, interest, ship, wheat, corn, average. All results are averages over 20 randomly generated transduction sets. The last row is the macro-average over all the categories. AM is the proposed approach.]

The optimal values of the hyperparameters were, in the case of LP, K = 100; in the case of AM, K = 2000, μ = 1e-4, ν = 1e-2; and in the case of SGT, K = 100. The results show that AM outperforms the state-of-the-art on 6 out of 10 categories and is competitive on 3 of the remaining 4 categories. Further, it significantly outperforms all other approaches in terms of the macro-average. AM is statistically significant over its best competitor, SGT, according to the difference of proportions significance test.

Figure 1 shows the variation of average PRBEP against the number of labeled documents (l). For each value of l, we tuned the hyperparameters over the first transduction set and used these values for all the other 20 sets. Figure 1 also shows error bars (± standard deviation) for all the experiments. As expected, the performance of all the approaches improves with an increasing number of labeled documents.
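For reference, the PRBEP of a single category can be computed from per-document scores as in the following sketch (illustrative code, not from the paper): retrieving exactly as many documents as there are true positives makes precision and recall equal by construction.

```python
import numpy as np

def prbep(scores, labels):
    """Precision-recall break-even point for one binary category.

    scores: (n,) relevance scores (e.g., p_i(y) for the category).
    labels: (n,) binary ground truth (1 = document belongs to the category).
    """
    labels = np.asarray(labels)
    pos = int(labels.sum())
    if pos == 0:
        return 0.0
    # Rank documents by score and keep the top `pos`; at this cutoff
    # precision = recall = (# true positives retrieved) / pos.
    top = np.argsort(-np.asarray(scores))[:pos]
    return labels[top].sum() / pos

# Example: prbep([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0]) == 0.5
```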
Once again, AM outperforms the other approaches for all values of l.

[Figure 1: Average PRBEP over all classes vs. number of labeled documents (l) for the Reuters data set; curves for AM, SGT, LP, TSVM and SVM.]

5.2 WebKB Collection

World Wide Knowledge Base (WebKB) is a collection of 8,282 web pages obtained from four academic domains.

The web pages in the WebKB collection are labeled using two different polychotomies: the first is according to topic and the second is according to web domain. In our experiments we only considered the first polychotomy, which consists of 7 categories: course, department, faculty, project, staff, student, and other. Following (Nigam et al., 1998), we only use documents from the categories course, faculty, project, and student, which gives 4,199 documents for the four categories. Each of the documents is in HTML format, containing text as well as other information such as HTML tags, links, etc. We used both textual and non-textual information to construct the feature vectors. In this case we did not use either stop-word removal or stemming, as this has been found to hurt performance on this task (Nigam et al., 1998). As in the case of the Reuters data set, we extracted TFIDF features for each document and constructed the graph using cosine similarity.

As in (Bekkerman et al., 2003), we created four roughly equal random partitions of the data set. In order to obtain D_l, we first randomly choose a split and then sample l documents from that split; the other three splits constitute D_u. We believe this is more realistic than sampling the labeled web pages from a single university and testing on web pages from the other universities (Joachims, 1999). This method of creating transduction sets allows us to better evaluate the generalization performance of the various algorithms. Once again, we created 21 transduction sets, and the first set was used to tune the hyperparameters. Further, we ran a search over the same grid as used in the case of Reuters. We report precision-recall break-even point (PRBEP) results on the 3,148 test documents in Table 2.

[Table 2: P/R Break-Even Points (PRBEP) for the WebKB data set with l = 48. Columns: Class, SVM, TSVM, SGT, LP, AM; rows: course, faculty, project, student, average. All results are averages over 20 randomly generated transduction sets. The last row is the macro-average over all the classes. AM is the proposed approach.]

For this task, we found that the optimal values of the hyperparameters were: in the case of LP, K = 1000; in the case of AM, K = 1000, μ = 1e-2, ν = 1e-4; and in the case of SGT, K = 100. Once again, AM is significant over its closest competitor, LP. Figure 2 shows the variation of PRBEP with the number of labeled documents (l) and was generated in a similar fashion as in the case of the Reuters data set.

[Figure 2: Average PRBEP over all classes vs. number of labeled documents (l) for the WebKB collection; curves for AM, SGT, LP, TSVM and SVM.]

6 Discussion

We note that LP may be cast into an AM-like framework by using the following sequence of updates:

  p_i^(n)(y) = δ(i ≤ l) r_i(y) + δ(i > l) q_i^(n−1)(y),
  q_i^(n)(y) = Σ_j w_ij p_j^(n)(y) / Σ_j w_ij.
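The sketch below (illustrative only; the edge weights of the small graph are made up to loosely mirror the 5-node example discussed next, since the exact weights of Figure 3 are not reproduced here) runs these LP updates, making the clamping of the labeled vertices explicit.

```python
import numpy as np

def lp_am_style(W, R, l, num_iters=15):
    """Label propagation written as the alternating updates above.

    W: (n, n) symmetric affinity matrix.  R: (n, C) label distributions for the
    first l (labeled) vertices.  Returns the final q distributions.
    """
    n, C = R.shape
    labeled = np.arange(n) < l
    q = np.full((n, C), 1.0 / C)               # q^(0); any positive start works
    for _ in range(num_iters):
        # p-update: clamp labeled vertices to r_i, copy q for unlabeled ones.
        p = np.where(labeled[:, None], R, q)
        # q-update: weighted average of the neighbors' p distributions.
        q = (W @ p) / W.sum(axis=1, keepdims=True)
    return q

# A toy 5-node binary-classification graph: nodes 0 and 1 are labeled with
# classes 1 and 2, node 4 is a pendant node; the weights are hypothetical.
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
R = np.zeros((5, 2))
R[0, 0] = 1.0
R[1, 1] = 1.0
print(lp_am_style(W, R, l=2))
```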

To compare the behavior of AM and LP, we applied this form of LP, along with AM, to a simple 5-node binary-classification SSL graph in which two nodes are labeled (nodes 1 and 2) and the remaining nodes are unlabeled (see Figure 3, top). Since this is binary classification (|Y| = 2), each distribution p_i or q_i can be depicted using a single real number between 0 and 1, corresponding to the probability that the vertex belongs to class 2. We show how both LP and AM evolve starting from exactly the same random starting point q^(0) (Figure 3, bottom). The figure shows that both algorithms clearly converge. Each alternate iteration of LP is such that the labeled vertices oscillate, due to LP's clamping back to the labeled distribution, but that is not the case for AM. We see, moreover, qualitative differences in the solutions as well; e.g., AM's solution for the pendant node 5 is less confident than LP's solution. More empirical comparative analysis of this sort between the two algorithms will appear in future work.

[Figure 3: Graph (top), and alternating values of p^(n), q^(n) for increasing n, for AM and LP (bottom). Node 1 and node 2 are labeled; nodes 3, 4, and 5 are unlabeled.]

We have proposed a new algorithm for semi-supervised text categorization. Empirical results show that the proposed approach significantly outperforms the state-of-the-art. In addition, the proposed approach is relatively simple to implement and has guaranteed convergence properties. While in this work we use relatively simple features to construct the graph, the use of more sophisticated features and/or similarity measures could lead to further improved results.

Acknowledgments

This work was supported by ONR MURI grant N..., by NSF grant IIS..., by the Companions project (IST programme under EC grant IST-FP...), and by a Microsoft Research Fellowship.

References

Alexandrescu, A. and Kirchhoff, K. (2007). Data-driven graph construction for semi-supervised graph-based learning in NLP. In Proc. of the Human Language Technologies Conference (HLT-NAACL).

Bekkerman, R., El-Yaniv, R., Tishby, N., and Winter, Y. (2003). Distributional word clusters vs. words for text categorization. J. Mach. Learn. Res., 3.

Belkin, M., Niyogi, P., and Sindhwani, V. (2005). On manifold regularization. In Proc. of the Conference on Artificial Intelligence and Statistics (AISTATS).

Bengio, Y., Delalleau, O., and Le Roux, N. (2007). Label propagation and quadratic criterion. In Semi-Supervised Learning. MIT Press.

Bertsekas, D. (2004). Nonlinear Programming. Athena Scientific Publishing.

Blitzer, J. and Zhu, J. (2008). ACL 2008 tutorial on semi-supervised learning. wikidot.com/.

Blum, A. and Chawla, S. (2001). Learning from labeled and unlabeled data using graph mincuts. In Proc. 18th International Conf. on Machine Learning. Morgan Kaufmann, San Francisco, CA.

Blum, A. and Mitchell, T. (1998). Combining labeled and unlabeled data with co-training. In COLT: Proceedings of the Workshop on Computational Learning Theory.

Chapelle, O., Scholkopf, B., and Zien, A. (2007). Semi-Supervised Learning. MIT Press.

Corduneanu, A. and Jaakkola, T. (2003). On information regularization. In Uncertainty in Artificial Intelligence.

Cover, T. M. and Thomas, J. A. (1991). Elements of Information Theory. Wiley Series in Telecommunications. Wiley, New York.

Csiszar, I. and Tusnady, G. (1984). Information geometry and alternating minimization procedures. Statistics and Decisions.

Dumais, S., Platt, J., Heckerman, D., and Sahami, M. (1998). Inductive learning algorithms and representations for text categorization. In CIKM '98: Proceedings of the Seventh International Conference on Information and Knowledge Management, New York, NY, USA.

Grandvalet, Y. and Bengio, Y. (2004). Semi-supervised learning by entropy minimization. In Advances in Neural Information Processing Systems (NIPS).

Joachims, T. SGT Light.

Joachims, T. SVM Light. joachims.org.

Joachims, T. (1999). Transductive inference for text classification using support vector machines. In Proc. of the International Conference on Machine Learning (ICML).

Joachims, T. (2003). Transductive learning via spectral graph partitioning. In Proc. of the International Conference on Machine Learning (ICML).

Lewis, D. et al. (1987). Reuters-21578. http://.../testcollections/reuters.

Nigam, K., McCallum, A., Thrun, S., and Mitchell, T. (1998). Learning to classify text from labeled and unlabeled documents. In AAAI '98/IAAI '98: Proceedings of the Fifteenth National/Tenth Conference on Artificial Intelligence/Innovative Applications of Artificial Intelligence.

Pearl, J. (1990). Jeffrey's rule, passage of experience and neo-Bayesianism. In Knowledge Representation and Defeasible Reasoning. Kluwer Academic Publishers.

Porter, M. (1980). An algorithm for suffix stripping. Program, 14(3).

Raghavan, V., Bollmann, P., and Jung, G. S. (1989). A critical investigation of recall and precision as measures of retrieval system performance. ACM Trans. Inf. Syst., 7(3).

Salton, G. and Buckley, C. (1987). Term weighting approaches in automatic text retrieval. Technical report, Ithaca, NY, USA.

Sindhwani, V., Niyogi, P., and Belkin, M. (2005). Beyond the point cloud: from transductive to semi-supervised learning. In Proc. of the International Conference on Machine Learning (ICML).

Szummer, M. and Jaakkola, T. (2001). Partially labeled classification with Markov random walks. In Advances in Neural Information Processing Systems, volume 14.

Tsuda, K. (2005). Propagating distributions on a hypergraph by dual information regularization. In Proceedings of the 22nd International Conference on Machine Learning.

Wang, J., Jebara, T., and Chang, S.-F. (2008). Graph transduction via alternating minimization. In Proc. of the International Conference on Machine Learning (ICML).

Yarowsky, D. (1995). Unsupervised word sense disambiguation rivaling supervised methods.
In Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics.

Zhu, X. (2005a). Semi-supervised learning literature survey. Technical Report 1530, Computer Sciences, University of Wisconsin-Madison.

Zhu, X. (2005b). Semi-Supervised Learning with Graphs. PhD thesis, Carnegie Mellon University.

Zhu, X. and Ghahramani, Z. (2002). Learning from labeled and unlabeled data with label propagation. Technical report, Carnegie Mellon University.

Zhu, X., Ghahramani, Z., and Lafferty, J. (2003). Semi-supervised learning using Gaussian fields and harmonic functions. In Proc. of the International Conference on Machine Learning (ICML).


Resource Allocation with a Budget Constraint for Computing Independent Tasks in the Cloud Resource Allocaton wth a Budget Constrant for Computng Independent Tasks n the Cloud Wemng Sh and Bo Hong School of Electrcal and Computer Engneerng Georga Insttute of Technology, USA 2nd IEEE Internatonal

More information

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING

ADVANCED MACHINE LEARNING ADVANCED MACHINE LEARNING 1 ADVANCED ACHINE LEARNING ADVANCED ACHINE LEARNING Non-lnear regresson technques 2 ADVANCED ACHINE LEARNING Regresson: Prncple N ap N-dm. nput x to a contnuous output y. Learn a functon of the type: N

More information

On the Multicriteria Integer Network Flow Problem

On the Multicriteria Integer Network Flow Problem BULGARIAN ACADEMY OF SCIENCES CYBERNETICS AND INFORMATION TECHNOLOGIES Volume 5, No 2 Sofa 2005 On the Multcrtera Integer Network Flow Problem Vassl Vasslev, Marana Nkolova, Maryana Vassleva Insttute of

More information

3.1 ML and Empirical Distribution

3.1 ML and Empirical Distribution 67577 Intro. to Machne Learnng Fall semester, 2008/9 Lecture 3: Maxmum Lkelhood/ Maxmum Entropy Dualty Lecturer: Amnon Shashua Scrbe: Amnon Shashua 1 In the prevous lecture we defned the prncple of Maxmum

More information

NP-Completeness : Proofs

NP-Completeness : Proofs NP-Completeness : Proofs Proof Methods A method to show a decson problem Π NP-complete s as follows. (1) Show Π NP. (2) Choose an NP-complete problem Π. (3) Show Π Π. A method to show an optmzaton problem

More information

Spectral Clustering. Shannon Quinn

Spectral Clustering. Shannon Quinn Spectral Clusterng Shannon Qunn (wth thanks to Wllam Cohen of Carnege Mellon Unverst, and J. Leskovec, A. Raaraman, and J. Ullman of Stanford Unverst) Graph Parttonng Undrected graph B- parttonng task:

More information

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications

MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.265/15.070J Fall 2013 Lecture 12 10/21/2013. Martingale Concentration Inequalities and Applications MASSACHUSETTS INSTITUTE OF TECHNOLOGY 6.65/15.070J Fall 013 Lecture 1 10/1/013 Martngale Concentraton Inequaltes and Applcatons Content. 1. Exponental concentraton for martngales wth bounded ncrements.

More information

Lecture 10 Support Vector Machines. Oct

Lecture 10 Support Vector Machines. Oct Lecture 10 Support Vector Machnes Oct - 20-2008 Lnear Separators Whch of the lnear separators s optmal? Concept of Margn Recall that n Perceptron, we learned that the convergence rate of the Perceptron

More information

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI

Logistic Regression. CAP 5610: Machine Learning Instructor: Guo-Jun QI Logstc Regresson CAP 561: achne Learnng Instructor: Guo-Jun QI Bayes Classfer: A Generatve model odel the posteror dstrbuton P(Y X) Estmate class-condtonal dstrbuton P(X Y) for each Y Estmate pror dstrbuton

More information

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS Avalable onlne at http://sck.org J. Math. Comput. Sc. 3 (3), No., 6-3 ISSN: 97-537 COMPARISON OF SOME RELIABILITY CHARACTERISTICS BETWEEN REDUNDANT SYSTEMS REQUIRING SUPPORTING UNITS FOR THEIR OPERATIONS

More information

10-701/ Machine Learning, Fall 2005 Homework 3

10-701/ Machine Learning, Fall 2005 Homework 3 10-701/15-781 Machne Learnng, Fall 2005 Homework 3 Out: 10/20/05 Due: begnnng of the class 11/01/05 Instructons Contact questons-10701@autonlaborg for queston Problem 1 Regresson and Cross-valdaton [40

More information

Hidden Markov Models & The Multivariate Gaussian (10/26/04)

Hidden Markov Models & The Multivariate Gaussian (10/26/04) CS281A/Stat241A: Statstcal Learnng Theory Hdden Markov Models & The Multvarate Gaussan (10/26/04) Lecturer: Mchael I. Jordan Scrbes: Jonathan W. Hu 1 Hdden Markov Models As a bref revew, hdden Markov models

More information

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA

4 Analysis of Variance (ANOVA) 5 ANOVA. 5.1 Introduction. 5.2 Fixed Effects ANOVA 4 Analyss of Varance (ANOVA) 5 ANOVA 51 Introducton ANOVA ANOVA s a way to estmate and test the means of multple populatons We wll start wth one-way ANOVA If the populatons ncluded n the study are selected

More information

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm

Design and Optimization of Fuzzy Controller for Inverse Pendulum System Using Genetic Algorithm Desgn and Optmzaton of Fuzzy Controller for Inverse Pendulum System Usng Genetc Algorthm H. Mehraban A. Ashoor Unversty of Tehran Unversty of Tehran h.mehraban@ece.ut.ac.r a.ashoor@ece.ut.ac.r Abstract:

More information

Temperature. Chapter Heat Engine

Temperature. Chapter Heat Engine Chapter 3 Temperature In prevous chapters of these notes we ntroduced the Prncple of Maxmum ntropy as a technque for estmatng probablty dstrbutons consstent wth constrants. In Chapter 9 we dscussed the

More information

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X

3.1 Expectation of Functions of Several Random Variables. )' be a k-dimensional discrete or continuous random vector, with joint PMF p (, E X E X1 E X Statstcs 1: Probablty Theory II 37 3 EPECTATION OF SEVERAL RANDOM VARIABLES As n Probablty Theory I, the nterest n most stuatons les not on the actual dstrbuton of a random vector, but rather on a number

More information

Support Vector Machines

Support Vector Machines CS 2750: Machne Learnng Support Vector Machnes Prof. Adrana Kovashka Unversty of Pttsburgh February 17, 2016 Announcement Homework 2 deadlne s now 2/29 We ll have covered everythng you need today or at

More information

Global Sensitivity. Tuesday 20 th February, 2018

Global Sensitivity. Tuesday 20 th February, 2018 Global Senstvty Tuesday 2 th February, 28 ) Local Senstvty Most senstvty analyses [] are based on local estmates of senstvty, typcally by expandng the response n a Taylor seres about some specfc values

More information

Chapter 11: Simple Linear Regression and Correlation

Chapter 11: Simple Linear Regression and Correlation Chapter 11: Smple Lnear Regresson and Correlaton 11-1 Emprcal Models 11-2 Smple Lnear Regresson 11-3 Propertes of the Least Squares Estmators 11-4 Hypothess Test n Smple Lnear Regresson 11-4.1 Use of t-tests

More information

Chapter Newton s Method

Chapter Newton s Method Chapter 9. Newton s Method After readng ths chapter, you should be able to:. Understand how Newton s method s dfferent from the Golden Secton Search method. Understand how Newton s method works 3. Solve

More information

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression

MACHINE APPLIED MACHINE LEARNING LEARNING. Gaussian Mixture Regression 11 MACHINE APPLIED MACHINE LEARNING LEARNING MACHINE LEARNING Gaussan Mture Regresson 22 MACHINE APPLIED MACHINE LEARNING LEARNING Bref summary of last week s lecture 33 MACHINE APPLIED MACHINE LEARNING

More information

Solutions to exam in SF1811 Optimization, Jan 14, 2015

Solutions to exam in SF1811 Optimization, Jan 14, 2015 Solutons to exam n SF8 Optmzaton, Jan 4, 25 3 3 O------O -4 \ / \ / The network: \/ where all lnks go from left to rght. /\ / \ / \ 6 O------O -5 2 4.(a) Let x = ( x 3, x 4, x 23, x 24 ) T, where the varable

More information

A new construction of 3-separable matrices via an improved decoding of Macula s construction

A new construction of 3-separable matrices via an improved decoding of Macula s construction Dscrete Optmzaton 5 008 700 704 Contents lsts avalable at ScenceDrect Dscrete Optmzaton journal homepage: wwwelsevercom/locate/dsopt A new constructon of 3-separable matrces va an mproved decodng of Macula

More information

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law:

Introduction to Vapor/Liquid Equilibrium, part 2. Raoult s Law: CE304, Sprng 2004 Lecture 4 Introducton to Vapor/Lqud Equlbrum, part 2 Raoult s Law: The smplest model that allows us do VLE calculatons s obtaned when we assume that the vapor phase s an deal gas, and

More information

Support Vector Machines CS434

Support Vector Machines CS434 Support Vector Machnes CS434 Lnear Separators Many lnear separators exst that perfectly classfy all tranng examples Whch of the lnear separators s the best? Intuton of Margn Consder ponts A, B, and C We

More information

The Study of Teaching-learning-based Optimization Algorithm

The Study of Teaching-learning-based Optimization Algorithm Advanced Scence and Technology Letters Vol. (AST 06), pp.05- http://dx.do.org/0.57/astl.06. The Study of Teachng-learnng-based Optmzaton Algorthm u Sun, Yan fu, Lele Kong, Haolang Q,, Helongang Insttute

More information

Appendix B: Resampling Algorithms

Appendix B: Resampling Algorithms 407 Appendx B: Resamplng Algorthms A common problem of all partcle flters s the degeneracy of weghts, whch conssts of the unbounded ncrease of the varance of the mportance weghts ω [ ] of the partcles

More information

The Minimum Universal Cost Flow in an Infeasible Flow Network

The Minimum Universal Cost Flow in an Infeasible Flow Network Journal of Scences, Islamc Republc of Iran 17(2): 175-180 (2006) Unversty of Tehran, ISSN 1016-1104 http://jscencesutacr The Mnmum Unversal Cost Flow n an Infeasble Flow Network H Saleh Fathabad * M Bagheran

More information

4DVAR, according to the name, is a four-dimensional variational method.

4DVAR, according to the name, is a four-dimensional variational method. 4D-Varatonal Data Assmlaton (4D-Var) 4DVAR, accordng to the name, s a four-dmensonal varatonal method. 4D-Var s actually a drect generalzaton of 3D-Var to handle observatons that are dstrbuted n tme. The

More information

Lecture 20: November 7

Lecture 20: November 7 0-725/36-725: Convex Optmzaton Fall 205 Lecturer: Ryan Tbshran Lecture 20: November 7 Scrbes: Varsha Chnnaobreddy, Joon Sk Km, Lngyao Zhang Note: LaTeX template courtesy of UC Berkeley EECS dept. Dsclamer:

More information

A Robust Method for Calculating the Correlation Coefficient

A Robust Method for Calculating the Correlation Coefficient A Robust Method for Calculatng the Correlaton Coeffcent E.B. Nven and C. V. Deutsch Relatonshps between prmary and secondary data are frequently quantfed usng the correlaton coeffcent; however, the tradtonal

More information

x = , so that calculated

x = , so that calculated Stat 4, secton Sngle Factor ANOVA notes by Tm Plachowsk n chapter 8 we conducted hypothess tests n whch we compared a sngle sample s mean or proporton to some hypotheszed value Chapter 9 expanded ths to

More information