Classification with Reject Option in Text Categorisation Systems


Giorgio Fumera, Ignazio Pillai, Fabio Roli
Dept. of Electrical and Electronic Eng., Piazza d'Armi, 09123 Cagliari, Italy
{fumera,pillai,roli}@diee.unica.it

Abstract

The aim of this paper is to evaluate the potential usefulness of the reject option for text categorisation (TC) tasks. The reject option is a technique used in statistical pattern recognition for improving classification reliability. Our work is motivated by the fact that, although the reject option proved to be useful in several pattern recognition problems, it has not yet been considered for TC tasks. Since TC tasks differ from usual pattern recognition problems in the performance measures used and in the fact that documents can belong to more than one category, we developed a specific rejection technique for TC problems. The performance improvement achievable by using the reject option was experimentally evaluated on the Reuters dataset, which is a standard benchmark for TC systems.

1. Introduction

Text categorisation (TC) consists in assigning text documents, on the basis of their content, to one or more predefined thematic categories. This is a typical information retrieval (IR) task, and is currently an active research field, with applications like document indexing and filtering, and hierarchical categorisation of web pages. Approaches to TC have shifted since the '90s from knowledge engineering to the machine learning approach, gaining in flexibility and saving expert labor power. Using the machine learning approach, a classifier is constructed on the basis of a training set of labeled documents, by using inductive learning methods. Several classification techniques have been applied to TC tasks, both symbolic, rule-based methods from the machine learning field, and non-symbolic methods (like neural networks) from statistical pattern recognition (see [8] for a comprehensive overview). So far, works in the TC literature have focused on two main topics: feature selection/extraction, and comparison between different classification techniques [6,9,10,11,12].

In this paper we focus on a technique used in statistical pattern recognition for improving classification reliability, namely, classification with reject option. The reject option consists in withholding the classification of an input pattern if it is likely to be misclassified. Rejected patterns are then handled in a different way; for instance, they are manually classified. The reject option is useful in pattern recognition tasks in which the cost of an error is higher than that of a rejection. To the best of our knowledge, so far no work has considered the reject option in TC tasks. Nevertheless, we believe that the reject option can be useful also in TC. Accordingly, the objective of this paper is to develop a method for introducing the reject option in TC tasks, and to evaluate its potential usefulness. To this aim, we consider the usual decomposition of N-category problems into N independent binary problems [8]: each binary problem consists in deciding if an input document does or does not belong to one of the predefined categories. We introduce the reject option by allowing the classifier to withhold assigning or not assigning an input document to the categories for which the decision is considered not sufficiently reliable. To this end, we define a reject rule based on applying two threshold values to the scores provided by the classifier for each category. We then evaluate the performance improvement achievable by using the reject option on the Reuters dataset, which is a standard benchmark for TC systems.

The paper is structured as follows. Sections 2 and 3 provide the theoretical background about TC and about the reject option in statistical pattern recognition. In sections 4 and 5 we describe our method for implementing the reject option in TC tasks. The experimental results are reported in section 6. Conclusions are drawn in section 7.

2. Text categorisation

Formally, TC can be defined as the task of assigning a Boolean value T or F to each pair (d_j, c_i) ∈ D × C, indicating whether document d_j is to be labeled as belonging to category c_i, where D is a set of documents, and C is a set of predefined categories. The goal is to find a labeling function (classifier) φ: D × C → {T, F} which best approximates an unknown target function Φ, according to a given performance measure [8].

The most used performance measures, precision (π) and recall (ρ), are derived from IR. Precision π_i for the i-th category is defined as the probability that, if a random document d is classified as belonging to c_i, this decision is correct: π_i = P(Φ(d,c_i)=T | φ(d,c_i)=T). Recall ρ_i for the i-th class is defined as the probability that, if d is to be classified as belonging to c_i, this decision is taken: ρ_i = P(φ(d,c_i)=T | Φ(d,c_i)=T). For a given classifier φ, these probabilities are usually estimated from a test set of prelabeled documents, as a function of the number of true positive (TP_i), false positive (FP_i), true negative (TN_i) and false negative (FN_i) documents, defined as shown in Table 1. The estimates are computed as:

π_i = TP_i / (TP_i + FP_i),   ρ_i = TP_i / (TP_i + FN_i).   (1)

Table 1. Contingency table for class c_i

                 Φ(d,c_i)=T   Φ(d,c_i)=F
  φ(d,c_i)=T     TP_i         FP_i
  φ(d,c_i)=F     FN_i         TN_i

Global precision and recall are computed either as micro- or macro-averaged values (depending on application requirements). Micro-averaging consists in applying definition (1) to the global values of TP, FP, TN and FN, obtained by summing over all classes:

π^μ = Σ_{i=1}^{N} TP_i / Σ_{i=1}^{N} (TP_i + FP_i),   ρ^μ = Σ_{i=1}^{N} TP_i / Σ_{i=1}^{N} (TP_i + FN_i),   (2)

while in macro-averaging, the values of π_i and ρ_i given in Eq. (1) are averaged over the N classes:

π^M = (1/N) Σ_{i=1}^{N} π_i,   ρ^M = (1/N) Σ_{i=1}^{N} ρ_i.   (3)

Precision and recall (both macro- and micro-averaged) take on values in the range [0,1], and are not independent measures: a higher precision corresponds to a lower recall, and vice-versa. The optimal classifier would be the one for which π=1, ρ=1. In real tasks, for varying values of the classifier parameters, precision and recall describe a curve in the π-ρ plane like the one of Fig. 1. The working point on this curve is application-dependent.

Fig. 1. A typical behaviour of the π-ρ curve

Several effectiveness measures which combine π and ρ have also been defined. A widely used one is the F1 measure, defined as the harmonic mean of π and ρ:

F1 = 2πρ / (π + ρ).   (4)

In TC systems documents are usually represented with a weight vector, each weight corresponding to a term (either a word or a phrase) of the so-called vocabulary, which is the set of all terms occurring in the set D (note that in real tasks D is the available training set). Typically, only terms obtained after stop-word removal and stemming are considered. Weights are mainly computed as term frequencies (tf), in which they represent the frequency of occurrence of the corresponding terms in a document, and as inverted term frequencies (tfidf), defined as: tfidf(t_k, d_j) = #(t_k, d_j) · log(|D| / #_D(t_k)), where #(t_k, d_j) denotes the number of occurrences of term t_k in document d_j, and #_D(t_k) denotes the number of documents in D in which t_k occurs.

3. Classification with reject option in statistical pattern recognition

In statistical pattern recognition, classification with reject option has been formalised under the minimum risk theory in [1,2,14]. When the costs of correct classifications, rejections, and misclassifications are respectively c_C, c_R and c_E (obviously, c_C < c_R < c_E), the expected risk (i.e. the expected value of the cost of classifying any pattern) is minimised by Chow's rule, which is defined as follows. Consider the class for which the a posteriori probability of x is maximum: ω* = argmax_j P(ω_j | x); if P(ω* | x) ≥ T, assign x to ω*, otherwise reject x. T is the so-called reject threshold, whose optimal value depends on the classification costs: T = (c_E − c_R) / (c_E − c_C) [2]. Note that when the cost of rejections equals that of misclassifications, one obtains T = 0: therefore, no pattern is rejected, and Chow's rule reduces to the well-known Bayes rule. We point out that the above approach to the reject option cannot be immediately applied to TC tasks, for two main reasons. First, Chow's rule applies to classification problems in which each pattern belongs to only one class, while this is not the case of TC problems, as explained in previous sections. Moreover, Chow's rule is aimed at minimising the expected risk of a classifier, while the performance measures in TC tasks are different from the expected risk. In the next section we discuss the way in which the reject option can be introduced in TC tasks.
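To make the measures of section 2 concrete, here is a minimal Python sketch (not part of the original system; all counts are hypothetical) that computes per-category, macro- and micro-averaged precision, recall and F1 exactly as in Eqs. (1)-(4):

```python
# Illustrative computation of the measures in Eqs. (1)-(4).
# The (TP_i, FP_i, FN_i) counts below are made up for the example.
def prf(tp, fp, fn):
    """Per-category precision, recall and F1 (Eqs. (1) and (4))."""
    pi = tp / (tp + fp) if tp + fp > 0 else 0.0
    rho = tp / (tp + fn) if tp + fn > 0 else 0.0
    f1 = 2 * pi * rho / (pi + rho) if pi + rho > 0 else 0.0
    return pi, rho, f1

# One (TP_i, FP_i, FN_i) triple per category (hypothetical counts);
# the second category is deliberately rare and poorly recalled.
counts = [(80, 10, 20), (5, 2, 15), (40, 8, 4)]

# Macro-averaging: average the per-category values (Eq. (3)).
per_cat = [prf(tp, fp, fn) for tp, fp, fn in counts]
pi_M = sum(p for p, _, _ in per_cat) / len(per_cat)
rho_M = sum(r for _, r, _ in per_cat) / len(per_cat)

# Micro-averaging: sum the counts first, then apply Eq. (1) (Eq. (2)).
TP = sum(tp for tp, _, _ in counts)
FP = sum(fp for _, fp, _ in counts)
FN = sum(fn for _, _, fn in counts)
pi_mu, rho_mu, f1_mu = prf(TP, FP, FN)

print(f"macro: pi={pi_M:.3f} rho={rho_M:.3f}")
print(f"micro: pi={pi_mu:.3f} rho={rho_mu:.3f} F1={f1_mu:.3f}")
```

With these made-up counts, the rare second category drags the macro-averaged recall down to about 0.65 while the micro-averaged recall stays near 0.76, which is the same effect noted for the rare Reuters categories in section 6.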

4. A method for introducing the reject option in text categorisation

As pointed out in [8], N-category TC problems are usually considered as N independent binary problems: for each category c_i, the classifier has to decide whether an input document d does or does not belong to c_i, independently of the other categories. In this framework, it is reasonable to introduce the reject option by allowing a classifier to withhold making a decision for categories where the decision is considered unreliable. A document can then be rejected from any subset of the N categories, while it can be accepted and automatically categorised as belonging or not to the other categories. Therefore, each document has to be exceptionally handled (typically, by manual classification) only for the categories from which it has been rejected, if any. A text classifier with reject option can then be viewed as implementing N functions φ_i: D × C → {T, F, R}, one for each category, where R stands for the reject decision. The goal of the reject option should be that of increasing the effectiveness measure (Eqs. (1)-(4)) by turning as many incorrect decisions (i.e. FP_i and FN_i) as possible into rejections. Obviously, this can be achieved at the expense of exceptionally handling the rejected documents: a trade-off between the achievable performance improvement and the cost of rejections must then be found. However, the cost of handling the rejected documents is clearly application-dependent. In this work, we evaluate the performance improvement that can be achieved by the reject option, with respect to the rate of rejected decisions. By denoting as r_i the number of documents rejected from the i-th category, the rate r of rejected decisions is defined as:

r = (1/|D|) Σ_{i=1}^{N} r_i,   (5)

where |D| is the total number of documents. From now on, we will call r the reject rate. Accordingly, we will consider the following goal for the reject option: maximise the performance measure for any given value of the reject rate (5).

Let us now discuss the way in which the decision of rejecting an input document from a category can be implemented. Most classification techniques used so far in TC (like the naïve Bayes and neural networks) are based on computing for any document d a score s_i ∈ [0,1] for each category c_i, representing the evidence of the fact d ∈ c_i. Scores can be obtained, for instance, from the output of neural network classifiers, or as probabilities computed using a naïve Bayes classifier. The decision of assigning or not assigning d to c_i is usually made by introducing a threshold τ_i for each category, so that d is labeled as belonging to category c_i only if s_i(d) ≥ τ_i, i = 1,...,N [8,12]. In other words, if s_i(d) ≥ τ_i, then φ(d,c_i)=T, otherwise φ(d,c_i)=F. Note that the threshold values are usually computed from a set of validation documents. To introduce the reject option, it is reasonable to assume that the reliability of the above decision is higher for values of s_i far from τ_i than for values near to τ_i. In other words, the lower |s_i − τ_i|, the more likely is an erroneous decision for category i. Accordingly, we implement the reject option by introducing two threshold values for each category, denoted as τ_i^H and τ_i^L, with τ_i^H ≥ τ_i^L, such that if s_i is between these values, the document is rejected from category i. Formally, the classifier φ_i is then defined as follows:

if s_i(d) ≥ τ_i^H, φ(d,c_i) = T,
if τ_i^L < s_i(d) < τ_i^H, φ(d,c_i) = R,   (6)
if s_i(d) ≤ τ_i^L, φ(d,c_i) = F.

Note that, if τ_i^H = τ_i^L, one obtains the standard classifier without reject option. We point out that our approach for introducing the reject option by using two threshold values for each category is analogous to that proposed in [13] for two-class pattern recognition problems.
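The reject rule of Eq. (6) and the reject rate of Eq. (5) reduce to a few lines of code. The following Python sketch assumes the per-category scores s_i(d) are already available as an array; the threshold and score values are illustrative, not those learned in our experiments:

```python
import numpy as np

T, F, R = "T", "F", "R"  # the three decisions of Eq. (6)

def decide(scores, tau_L, tau_H):
    """Apply Eq. (6) per category; scores, tau_L, tau_H are length-N arrays."""
    out = np.where(scores >= tau_H, T, F)
    out = np.where((scores > tau_L) & (scores < tau_H), R, out)
    return out

# Hypothetical scores of one document for N = 4 categories.
scores = np.array([0.91, 0.45, 0.08, 0.60])
tau_L = np.array([0.30, 0.40, 0.30, 0.30])
tau_H = np.array([0.70, 0.70, 0.70, 0.55])
print(decide(scores, tau_L, tau_H))  # ['T' 'R' 'F' 'T']

def reject_rate(decisions):
    """Reject rate of Eq. (5): rejected decisions divided by |D|.
    decisions is a |D| x N array of 'T'/'F'/'R' labels."""
    return (decisions == R).sum() / decisions.shape[0]
```

Note that, as in Eq. (5), the rate is normalised by the number of documents rather than by the number of decisions, so a document rejected from several categories contributes several times.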
In order to experimentally evaluate the performance improvement achievable by the reject option, with respect to the number of rejected decisions, we developed algorithms for maximising the performance measure F1 (4) for any given value of the reject rate (5). These algorithms are presented in section 5.

5. Algorithms for determining threshold values

When the reject option is implemented in a TC problem as described in section 4, the optimal values of the 2N thresholds {τ_1^H, τ_1^L, ..., τ_N^H, τ_N^L} are the ones that maximise the performance measure for any given value of the reject rate (5). In practice, it is convenient to implement an algorithm that maximises the performance measure while keeping the reject rate r below a given value, which we will denote as r_MAX. In this work, we focused on the F1 performance measure, both macro- and micro-averaged. The macro-averaged F1 is defined as

F1^M = (1/N) Σ_{i=1}^{N} F1_i = (1/N) Σ_{i=1}^{N} 2 π_i ρ_i / (π_i + ρ_i),   (7)

while the micro-averaged measure is defined as:

F1^m = 2 π^μ ρ^μ / (π^μ + ρ^μ) = 2 / (2 + (FP + FN) / TP),   (8)

where FP = Σ_{i=1}^{N} FP_i, FN = Σ_{i=1}^{N} FN_i, TP = Σ_{i=1}^{N} TP_i. To maximise F1^M, we chose to independently maximise each F1_i, enforcing the constraint r ≤ r_MAX by requiring a maximum reject rate r_i for each category equal to r_MAX/N: r_i ≤ r_MAX/N. We also exploited the fact that F1_i is a non-increasing function of any τ_i^L (this property follows from Eqs. (7), (6), (3), (1)). Since for each class only two thresholds τ_i^H and τ_i^L must be computed, a simple exhaustive search of their optimal values can be performed, using a predefined discretisation step Δτ.

To maximise F1^m, we exploited the following properties (we omit the proof for the sake of brevity): F1^m is a non-increasing function of any τ_i^L (from (8), (6), (2)); consider the values of the thresholds of category i, (τ_i^L*, τ_i^H*), which maximise F1^m for any fixed value of the thresholds of the other classes; now, if the thresholds of the other classes are changed so as to increase F1^m, then values of τ_i^L and τ_i^H such that τ_i^H < τ_i^H* cannot further increase F1^m (from (8), (6), (2)). On the basis of the above properties, we developed an algorithm for maximising F1^m under the constraint r ≤ r_MAX, based on an iterative search in the space of the 2N threshold values. The thresholds are initially set to zero. The above properties allow us to reduce the search space by considering at each step only higher values of the thresholds with respect to the current ones. To further reduce the search space, we consider at each step only points in the search space obtained by incrementing the thresholds of only one category at a time, while keeping the ones of the other categories at their current values. Accordingly, at each step, the thresholds of only one category are changed, provided that F1^m can be increased with r ≤ r_MAX. If no threshold values satisfying this condition are found, the algorithm terminates. The algorithm can be schematised as follows:

initialise τ_i^H = 0, τ_i^L = 0, i = 1,...,N
repeat
    for each category c_i:
        evaluate F1^m and r for τ_i^H + kΔτ, for each k in {0, 1, ..., maxk}, and for the maximum τ_i^L such that r_i ≤ r_MAX/N and F1^m does not decrease with respect to τ_i^H + kΔτ and to the initial value of τ_i^L;
        select the pair of threshold values (τ_i^L*, τ_i^H*) that gives the maximum improvement with respect to the current F1^m, if any, and set them as the new values for the corresponding category;
until no higher F1^m with r ≤ r_MAX is found with respect to the previous step.

The parameter Δτ is a predefined discretisation step, while maxk determines the number of different values of τ_i^H which are explored at each step. (A simplified code sketch of this procedure is given below.)

6. Experimental results

The goal of the experiments presented in this section was to evaluate the potential usefulness of the reject option in real TC tasks. More precisely, our aim was to assess the improvement of the performance of a TC system achievable by the reject option, implemented as described in section 4, with respect to the rate of rejected decisions. We conducted our experiments on the Reuters dataset, which is a standard benchmark for TC systems [8,4,11,3]. This dataset consists of a set of 21,578 newswire stories (provided in standard text files) classified under categories related to economics. In particular, we used the ApteMod version of this dataset, which was obtained by eliminating unlabelled documents and selecting the categories which have at least one document in the training set and one in the test set. This version consists of a training set of 7,769 documents and a test set of 3,019 documents, belonging to 90 categories. The vocabulary, extracted from the training set, consists of 16,635 terms, after stemming and stop-word removal. Our experiments have been performed with three kinds of classifiers, commonly used in the TC literature [4,9,10,11]: k-nearest neighbour (k-NN), multi-layer perceptron neural network (MLP) and support vector machine (SVM) classifiers. Pre-processing of the dataset (stop-word removal, stemming and feature extraction from the training set) was performed with the software SMART [7]. In particular, both tf and tfidf weights have been considered.
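As a concrete illustration of the threshold-selection procedure of section 5, the following simplified Python sketch scans candidate threshold pairs on a grid and greedily accepts improving moves. It is an illustrative rendering, not the code used in the experiments: it omits the search-space pruning properties described above, and the score matrix S, the label matrix Y and the constants are assumed, not taken from the actual system:

```python
import numpy as np

def micro_f1(S, Y, tau_L, tau_H):
    """Micro-averaged F1 (Eq. (8)) computed on non-rejected decisions only.
    S: |D| x N matrix of scores in [0,1]; Y: |D| x N boolean ground truth."""
    pos = S >= tau_H                 # decisions taken as T (Eq. (6))
    neg = S <= tau_L                 # decisions taken as F (Eq. (6))
    TP = (pos & Y).sum()
    FP = (pos & ~Y).sum()
    FN = (neg & Y).sum()
    return 2 * TP / (2 * TP + FP + FN) if TP > 0 else 0.0

def search_thresholds(S, Y, r_max, d_tau=0.001, max_k=100):
    """Simplified greedy search for the 2N thresholds of section 5.
    One category at a time, scan candidate (tau_H, tau_L) pairs on a grid
    and keep the move that most improves micro-F1 while the per-category
    reject rate stays below r_max / N."""
    n_docs, n_cat = S.shape
    tau_L, tau_H = np.zeros(n_cat), np.zeros(n_cat)
    best = micro_f1(S, Y, tau_L, tau_H)
    improved = True
    while improved:
        improved = False
        for i in range(n_cat):
            cand = None
            for k in range(max_k + 1):              # only raise tau_H
                h = tau_H[i] + k * d_tau
                for l in np.arange(tau_L[i], h + d_tau / 2, d_tau):
                    # fraction of documents rejected from category i
                    rej = ((S[:, i] > l) & (S[:, i] < h)).sum()
                    if rej / n_docs > r_max / n_cat:
                        continue                     # budget violated
                    old = tau_L[i], tau_H[i]
                    tau_L[i], tau_H[i] = l, h
                    f1 = micro_f1(S, Y, tau_L, tau_H)
                    tau_L[i], tau_H[i] = old
                    if f1 > best:
                        best, cand, improved = f1, (l, h), True
            if cand is not None:
                tau_L[i], tau_H[i] = cand
    return tau_L, tau_H, best
```

Here S would hold the validation-set scores (one row per document, one column per category) and Y the true labels; the thresholds returned are then applied to test documents through the rule of Eq. (6).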
Due to the high number of features, feature selection was performed on the training set by using the information gain criterion [10,11]. Feature selection was performed separately for each classifier, and for macro- and micro-averaged performance measures. Then, 70% of the documents in the original training set were randomly extracted (keeping the same proportion between the number of documents in each class) to be used for determining classifier parameters and for training the classifiers, while the remaining 30% of documents were used as a validation set for estimating threshold values. In particular, for categories with less than five documents, we always kept at least one document both in the training and in the validation set; for categories with only one document in the original training set, the document was duplicated in the validation set. For statistical significance, the experiments were repeated for ten different randomly generated training sets. Reported results refer to the test set performance, averaged over these ten runs.

For MLPs, we used one hidden layer with 64 units, and 90 output units, as in [11]. The tf features were used. The number of input units was equal to that of features, which was set to 1,000. MLPs were trained with the standard backpropagation algorithm, with a learning rate equal to 0.01. For k-NNs, the distance-weighted version of Yang [10] was used, with 2,500 tfidf features. The value of k was 45 for macro- and 65 for micro-averaged measures. On the basis of experiments presented in [4,11], we used SVMs with linear kernel, trained with the SVM-light software [5]. 10,000 tfidf features were used. One SVM for each category was trained, and the outputs were scaled to the range [0,1] using a sigmoidal function. Threshold estimation was performed on the validation set, with the algorithms described in section 5. We used a threshold increment unit Δτ equal to 0.001, and a value of maxk for F1^m equal to 100. Values of the maximum allowed reject rate r_MAX between 0 and 5% were considered. In Figs. 2-7 we report the test set macro- and micro-averaged values of F1 versus the reject rate, averaged over the ten runs, for the three classifiers used.

Fig. 2. Test set F1^M vs reject rate (MLP)
Fig. 3. Test set F1^m vs reject rate (MLP)
Fig. 4. Test set F1^M vs reject rate (k-NN)
Fig. 5. Test set F1^m vs reject rate (k-NN)
Fig. 6. Test set F1^M vs reject rate (SVM)
Fig. 7. Test set F1^m vs reject rate (SVM)

Note that macro-averaging gives much worse performances than micro-averaging. This is due to the fact that Reuters categories have a very different generality, and macro-averaging is dominated by the performance of the system on rare categories, as noted in [12] and [8]. Let us analyse the performance improvement achieved by using the reject option. We first point out that the reject rate achieved on the test set is always much lower than the maximum required reject rate, which was 5%. In particular, the maximum value of the reject rate (i.e., the rate of rejected decisions) on the test set is about 4% for MLPs (Figs. 2,3), 3% for k-NN (Figs. 4,5), and 3.5% for SVMs (Figs. 6,7). Note now that F1 always increases for increasing values of the reject rate (except for the final part of the curve in Fig. 2). Moreover, a significant increment of F1 was always obtained by using the reject option, especially for macro-averaged F1. In particular, the maximum increment of macro-averaged F1 was about 20% with MLPs, 16% with k-NN, and 14% with SVM classifiers; the increment of micro-averaged F1 was 13% with MLPs, 12% with k-NN, and 8% with SVM classifiers.

7. Conclusions

In this work we proposed a method for introducing the reject option in text categorisation problems, and experimentally evaluated the performance improvement achievable by using the reject option on a real TC task. The experimental results showed that the reject option can significantly improve the performance of a text categorisation system, at the expense of a reasonably small rate of rejected decisions for each category. As pointed out in section 4, the cost of handling rejected documents strongly depends on the particular application. The approach proposed in this work was based on the usual decomposition of N-category TC problems into N independent binary problems. On the basis of our experimental results, this approach proved to be useful for applications in which the cost of rejections depends mainly on the rate of rejected decisions. Our future work will be devoted to analysing the case in which the cost of rejections is significantly affected also by the number of rejected documents. In this case our approach should be modified, in order to guarantee a small number of rejected documents.

References

[1] C.K. Chow, An optimum character recognition system using decision functions, IRE Trans. Electronic Computers 6, 1957, pp. 247-254.
[2] C.K. Chow, On optimum error and reject tradeoff, IEEE Trans. on Inf. Theory 16, 1970, pp. 41-46.
[3] V. Dasigi et al., Information fusion for text classification - an experimental comparison, Pattern Recognition 34, 2001.
[4] T. Joachims, Text categorization with support vector machines: learning with many relevant features, Proc. 10th European Conf. on Machine Learning, Chemnitz, Germany, 1998, pp. 137-142.
[5] T. Joachims, Making large-scale SVM learning practical, in Advances in Kernel Methods - Support Vector Learning, MIT Press, 1999.
[6] D.D. Lewis, An evaluation of phrasal and clustered representations on a text categorization task, Proc. 15th ACM Int. Conf. on Research and Development in Inf. Retrieval, Kobenhavn, DK, 1992.
[7] G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, 1989.
[8] F. Sebastiani, Machine learning in automated text categorization, ACM Computing Surveys 34, 2002, pp. 1-47.
[9] E. Wiener, J.O. Pedersen, and A.S. Weigend, A neural network approach to topic spotting, Proc. 4th Annual Symp. on Doc. Analysis and Inf. Retrieval, Las Vegas, USA, 1995.
[10] Y. Yang and J.O. Pedersen, A comparative study on feature selection in text categorization, Proc. 14th Int. Conf. on Machine Learning, Nashville, USA, 1997, pp. 412-420.
[11] Y. Yang and X. Liu, A re-examination of text categorization methods, Proc. 22nd ACM Int. Conf. on Research and Development in Inf. Retrieval, Berkeley, US, 1999.
[12] Y. Yang, A study on thresholding strategies for text categorization, Proc. of SIGIR-01, Louisiana, US, 2001.
[13] F. Tortorella, An optimal reject rule for binary classifiers, Proc. Int. Workshop on Statistical Pattern Recognition, Alicante, Spain, 2000.
[14] G. Fumera, F. Roli and G. Giacinto, Reject option with multiple thresholds, Pattern Recognition 33, 2000, pp. 2099-2101.
