Evaluation for sets of classes

Evaluation for Text Categorization

Classification accuracy, the usual measure in ML, is the proportion of correct decisions. It is not appropriate if the population rate of the class is low.
Precision, Recall and F1 are better measures.

Evaluation for sets of classes

How can we combine evaluation w.r.t. single classes into an evaluation for prediction over multiple classes? Two aggregate measures:
Macro-averaging computes a simple average over the classes of the precision, recall, and F1 measures.
Micro-averaging pools per-document decisions across classes and then computes precision, recall, and F1 on the pooled contingency table.
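As a side sketch (not part of the original slides), precision, recall, and F1 for a single class can be computed from the true positive, false positive, and false negative counts; the function and variable names below are my own:

```python
def precision_recall_f1(tp, fp, fn):
    """Per-class measures from a 2x2 contingency table.

    tp: documents correctly assigned to the class
    fp: documents wrongly assigned to the class
    fn: documents of the class that were missed
    """
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1
```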

Macro and Micro Averaging

Macro-averaging gives the same weight to each class.
Micro-averaging gives the same weight to each per-document decision.

Example

Class 1       Pred: yes   Pred: no
Truth: yes        10          10
Truth: no         10         970

Class 2       Pred: yes   Pred: no
Truth: yes        90          10
Truth: no         10         890

POOLED        Pred: yes   Pred: no
Truth: yes       100          20
Truth: no         20        1860

Macro-averaged precision: (.5 + .9)/2 = .7
Micro-averaged precision: 100/120 = .833
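A small Python check of this example (my own sketch; the counts are copied from the tables above):

```python
# Macro vs. micro averaging on the two-class example above.
# Each entry holds (tp, fp) from the "Pred: yes" column of one class table.
tables = [
    {"tp": 10, "fp": 10},   # Class 1
    {"tp": 90, "fp": 10},   # Class 2
]

per_class_precision = [t["tp"] / (t["tp"] + t["fp"]) for t in tables]
macro_precision = sum(per_class_precision) / len(per_class_precision)

pooled_tp = sum(t["tp"] for t in tables)
pooled_fp = sum(t["fp"] for t in tables)
micro_precision = pooled_tp / (pooled_tp + pooled_fp)

print(macro_precision)  # 0.7
print(micro_precision)  # 0.833...
```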

Benchmark Collections (used in Text Categorization)

Reuters-21578: the most widely used in text categorization. It consists of newswire articles which are labeled with some number of topical classifications (zero or more out of 115 classes). 9603 train + 3299 test documents.
Reuters RCV1: newstories, larger than the previous one (about 810K documents) with a hierarchically structured set of 103 leaf classes.
Ohsumed: an ML set of 348K docs classified under a hierarchically structured set of 14K classes (MeSH thesaurus). Titles + abstracts of scientific medical papers.
20 Newsgroups: 18491 articles from the 20 Usenet newsgroups.

The inductive construction of classifiers

Two different phases to build a classifier $h_i$ for category $c_i \in C$:
1. Definition of a function $CSV_i : D \rightarrow \mathbb{R}$, a categorization status value, representing the strength of the evidence that a given document $d$ belongs to $c_i$.
2. Definition of a threshold $\tau_i$ such that
   $CSV_i(d) \geq \tau_i$ is interpreted as a decision to classify $d$ under $c_i$
   $CSV_i(d) < \tau_i$ is interpreted as a decision not to classify $d$ under $c_i$

CSV and Proportional thresholding

Two different ways to determine the thresholds $\tau_i$ once $CSV_i$ is given are [Yang01]:
1. CSV thresholding: $\tau_i$ is a value returned by the CSV function. It may or may not be equal for all the categories, and it is obtained on a validation set.
2. Proportional thresholding: the $\tau_i$ are the values such that the validation set frequencies for each class are as close as possible to the same frequencies in the training set.
CSV thresholding is theoretically better motivated and generally produces superior effectiveness, but it is computationally more expensive.
Thresholding is needed only for hard classification. In soft classification the decision is taken by the expert, and the CSV scores can be used for ranking purposes.
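A minimal sketch of CSV thresholding (my own illustration, not from the slides): for one category, pick the threshold that maximizes F1 on a validation set of (score, true label) pairs.

```python
def choose_threshold(val_scores, val_labels):
    """Pick the CSV threshold that maximizes F1 on a validation set.

    val_scores: CSV_i(d) for each validation document d
    val_labels: True if d really belongs to category c_i
    """
    best_tau, best_f1 = 0.0, -1.0
    for tau in sorted(set(val_scores)):
        tp = sum(s >= tau and y for s, y in zip(val_scores, val_labels))
        fp = sum(s >= tau and not y for s, y in zip(val_scores, val_labels))
        fn = sum(s < tau and y for s, y in zip(val_scores, val_labels))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_tau, best_f1 = tau, f1
    return best_tau
```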

Probabilistic Classifiers

Probabilistic classifiers view $CSV_i(d)$ in terms of $P(c_i \mid d)$, and compute it by means of Bayes' theorem:
$$P(c_i \mid d) = \frac{P(d \mid c_i)\, P(c_i)}{P(d)}$$
Maximum A Posteriori Hypothesis (MAP): $\arg\max_{c_i} P(c_i \mid d)$.
Classes are viewed as generators of documents.
The prior probability $P(c_i)$ is the probability that a document $d$ is in $c_i$.

Naive Bayes Classifiers

Task: classify a new instance $D$ based on a tuple of attribute values $D = \langle x_1, x_2, \ldots, x_n \rangle$ into one of the classes $c_j \in C$:
$$c_{MAP} = \arg\max_{c_j \in C} P(c_j \mid x_1, x_2, \ldots, x_n)
         = \arg\max_{c_j \in C} \frac{P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)}{P(x_1, x_2, \ldots, x_n)}
         = \arg\max_{c_j \in C} P(x_1, x_2, \ldots, x_n \mid c_j)\, P(c_j)$$

Naïve Bayes Classifier: Assumptions

$P(c_j)$ can be estimated from the frequency of classes in the training examples.
$P(x_1, x_2, \ldots, x_n \mid c_j)$ has $O(|X|^n \cdot |C|)$ parameters and could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities $P(x_i \mid c_j)$.

The Naïve Bayes Classifier

(Figure: class Flu with features $X_1, \ldots, X_5$: runny nose, sinus, cough, fever, muscle ache.)
Conditional Independence Assumption: features are independent of each other given the class:
$$P(x_1, \ldots, x_5 \mid C) = P(x_1 \mid C) \cdot P(x_2 \mid C) \cdots P(x_5 \mid C)$$
Only $n \cdot |C|$ parameters (+ $|C|$) to estimate.

Learning the Model

(Figure: class variable $C$ with observed features $X_1, \ldots, X_6$.)
Maximum likelihood estimates: the most likely value of each parameter given the training data, i.e. simply use the frequencies in the data:
$$\hat P(c_j) = \frac{N(C = c_j)}{N} \qquad \hat P(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j)}{N(C = c_j)}$$

Problem with Maximum Likelihood

$$P(x_1, \ldots, x_5 \mid C) = P(x_1 \mid C) \cdot P(x_2 \mid C) \cdots P(x_5 \mid C)$$
What if we have seen no training cases where the patient had no flu but did have muscle aches? Then
$$\hat P(X_5 = t \mid C = nf) = \frac{N(X_5 = t, C = nf)}{N(C = nf)} = 0$$
Zero probabilities cannot be conditioned away, no matter the other evidence:
$$\ell = \arg\max_c \hat P(c) \prod_i \hat P(x_i \mid c)$$
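A small sketch of these maximum-likelihood counts on a made-up toy dataset (the data and names below are my own, purely for illustration of the zero-probability problem):

```python
from collections import Counter, defaultdict

# Toy training data for the flu example: (features, class).
# Features: (runny_nose, sinus, cough, fever, muscle_ache); class: "flu" / "nf".
train = [
    ((1, 1, 1, 1, 1), "flu"),
    ((1, 0, 1, 1, 0), "flu"),
    ((0, 0, 1, 0, 0), "nf"),
    ((1, 0, 0, 0, 0), "nf"),
]

class_counts = Counter(c for _, c in train)
feat_counts = defaultdict(Counter)          # feat_counts[c][(i, value)]
for x, c in train:
    for i, v in enumerate(x):
        feat_counts[c][(i, v)] += 1

def p_hat(i, v, c):
    """Unsmoothed MLE estimate of P(X_i = v | C = c)."""
    return feat_counts[c][(i, v)] / class_counts[c]

# No "nf" example has muscle_ache = 1, so the MLE estimate is exactly zero,
# which wipes out the whole product for class "nf" at classification time.
print(p_hat(4, 1, "nf"))   # 0.0
```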

Smoothing to Avoid Overfitting

$$\hat P(x_i \mid c_j) = \frac{N(X_i = x_i, C = c_j) + 1}{N(C = c_j) + k}$$
where $k$ is the number of values of $X_i$.

Stochastic Language Models

Model the probability of generating strings (each word in turn) in the language.

Model M:
the     0.2
a       0.1
man     0.01
woman   0.01
said    0.03
likes   0.02

s = "the man likes the woman"
P(s | M) = 0.2 × 0.01 × 0.02 × 0.2 × 0.01 = 0.00000008  (multiply the per-word probabilities)
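A hedged sketch of the add-one (Laplace) smoothed estimate, continuing the toy flu counts above (function and argument names are my own):

```python
def smoothed_p(count_xv_c, count_c, k):
    """Add-one (Laplace) smoothed estimate of P(X_i = x | C = c).

    count_xv_c: N(X_i = x, C = c)
    count_c:    N(C = c)
    k:          number of distinct values X_i can take
    """
    return (count_xv_c + 1) / (count_c + k)

# The zero from the flu example becomes a small positive probability:
print(smoothed_p(0, 2, 2))   # (0+1)/(2+2) = 0.25 for a binary feature
```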

Stochastic Language Models

Model the probability of generating any string.

            Model M1   Model M2
the          0.2        0.2
class        0.01       0.0001
sayst        0.0001     0.03
pleaseth     0.0001     0.02
yon          0.0001     0.1
maiden       0.0005     0.01
woman        0.01       0.0001

s = "the class pleaseth yon maiden"
P(s | M1) = 0.2 × 0.01 × 0.0001 × 0.0001 × 0.0005
P(s | M2) = 0.2 × 0.0001 × 0.02 × 0.1 × 0.01
P(s | M2) > P(s | M1)

Two Models

Model 1: Multivariate binomial
One feature $X_w$ for each word in the dictionary; $X_w$ = true in document $d$ if $w$ appears in $d$.
Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears.
This is the model you get from the binary independence model in probabilistic relevance feedback on hand-classified data.
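A quick Python check of the comparison above (my own sketch; the per-word probabilities are copied from the two models):

```python
from math import prod

m1 = {"the": 0.2, "class": 0.01, "sayst": 0.0001, "pleaseth": 0.0001,
      "yon": 0.0001, "maiden": 0.0005, "woman": 0.01}
m2 = {"the": 0.2, "class": 0.0001, "sayst": 0.03, "pleaseth": 0.02,
      "yon": 0.1, "maiden": 0.01, "woman": 0.0001}

s = "the class pleaseth yon maiden".split()

p_m1 = prod(m1[w] for w in s)
p_m2 = prod(m2[w] for w in s)
print(p_m1, p_m2, p_m2 > p_m1)   # the string is far more likely under M2
```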

Two Models

Model 2: Multinomial
One feature $X_i$ for each word position in the document; the feature's values are all the words in the dictionary; the value of $X_i$ is the word in position $i$.
Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about the words in other positions.
Second assumption: word appearance does not depend on position: $P(X_i = w \mid c) = P(X_j = w \mid c)$.

Parameter estimation

Binomial model: $\hat P(X_w = t \mid c_j)$ = fraction of documents of topic $c_j$ in which word $w$ appears.
Multinomial model: $\hat P(X_i = w \mid c_j)$ = fraction of times in which word $w$ appears across all documents of topic $c_j$.
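A minimal sketch contrasting the two estimates on a tiny made-up corpus (the documents and function names are my own):

```python
from collections import Counter

# Tiny hypothetical training corpus for one topic c.
docs_c = [
    "the man likes the woman".split(),
    "the woman said hello".split(),
]

def binomial_estimate(word, docs):
    """Fraction of documents of the topic in which the word appears."""
    return sum(word in d for d in docs) / len(docs)

def multinomial_estimate(word, docs):
    """Fraction of word occurrences across all documents of the topic."""
    counts = Counter(w for d in docs for w in d)
    return counts[word] / sum(counts.values())

print(binomial_estimate("the", docs_c))      # 1.0  (appears in both docs)
print(multinomial_estimate("the", docs_c))   # 3/9 = 0.333...
```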

Naïve Bayes: Learning

From the training corpus, extract the Vocabulary.
Calculate the required $P(c_j)$ and $P(x_k \mid c_j)$ terms:
For each $c_j$ in $C$ do
  docs_j ← the subset of documents for which the target class is $c_j$
  $P(c_j)$ ← |docs_j| / |total # documents|
  Text_j ← a single document containing all of docs_j
  for each word $x_k$ in Vocabulary
    $n_k$ ← number of occurrences of $x_k$ in Text_j
    $P(x_k \mid c_j) \leftarrow \dfrac{n_k + \alpha}{n + \alpha\,|Vocabulary|}$
where $n$ is the total number of word occurrences in Text_j.

Naïve Bayes: Classifying

positions ← all word positions in the current document which contain tokens found in the Vocabulary.
Return $c_{NB}$, where
$$c_{NB} = \arg\max_{c_j \in C} P(c_j) \prod_{i \in positions} P(x_i \mid c_j)$$
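A compact, runnable sketch of this multinomial Naïve Bayes procedure (my own implementation of the pseudocode above, with hypothetical toy documents):

```python
from collections import Counter, defaultdict

def train_nb(labeled_docs, alpha=1.0):
    """labeled_docs: list of (list_of_tokens, class_label) pairs."""
    vocabulary = {w for doc, _ in labeled_docs for w in doc}
    classes = {c for _, c in labeled_docs}
    prior, cond = {}, defaultdict(dict)
    for c in classes:
        docs_c = [doc for doc, label in labeled_docs if label == c]
        prior[c] = len(docs_c) / len(labeled_docs)          # P(c_j)
        text_c = Counter(w for doc in docs_c for w in doc)  # Text_j
        n = sum(text_c.values())                            # words in Text_j
        for w in vocabulary:
            cond[c][w] = (text_c[w] + alpha) / (n + alpha * len(vocabulary))
    return vocabulary, prior, cond

def classify_nb(doc, vocabulary, prior, cond):
    """Return argmax_c P(c) * prod over known positions of P(x_i | c)."""
    scores = {}
    for c in prior:
        p = prior[c]
        for w in doc:
            if w in vocabulary:
                p *= cond[c][w]
        scores[c] = p
    return max(scores, key=scores.get)

# Hypothetical toy corpus.
train = [("chinese beijing chinese".split(), "china"),
         ("tokyo japan chinese".split(), "japan")]
vocab, prior, cond = train_nb(train)
print(classify_nb("chinese chinese tokyo".split(), vocab, prior, cond))  # "china"
```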

Naive Bayes: Time Complexity

Training time: $O(|D| L_d + |C||V|)$, where $L_d$ is the average length of a document in $D$. This assumes $V$ and all $D_j$, $n_j$, and $n_{jk}$ (the per-class document sets and word counts) are pre-computed in $O(|D| L_d)$ time during one pass through all of the data. It is generally just $O(|D| L_d)$, since usually $|C||V| < |D| L_d$.
Test time: $O(|C| L_t)$, where $L_t$ is the average length of a test document.
Very efficient overall, linearly proportional to the time needed to just read in all the data.

Underflow Prevention

Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since $\log(xy) = \log(x) + \log(y)$, it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. The class with the highest final un-normalized log probability score is still the most probable:
$$c_{NB} = \arg\max_{c_j \in C} \Big[ \log P(c_j) + \sum_{i \in positions} \log P(x_i \mid c_j) \Big]$$
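A sketch of the log-space version of the classification rule (my own; it mirrors the classify_nb function from the earlier sketch but sums log probabilities instead of multiplying them):

```python
from math import log

def classify_nb_log(doc, vocabulary, prior, cond):
    """argmax_c [ log P(c) + sum over known positions of log P(x_i | c) ].

    Summing logs avoids floating-point underflow on long documents;
    prior and cond are the estimates produced by train_nb above.
    """
    scores = {}
    for c in prior:
        score = log(prior[c])
        for w in doc:
            if w in vocabulary:
                score += log(cond[c][w])
        scores[c] = score
    return max(scores, key=scores.get)
```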