CS276A Text Retrieval and Mining
Lecture 11

Recap of the last lecture
Probabilistic models in Information Retrieval: Probability Ranking Principle, Binary Independence Model, Bayesian Networks for IR [very superficially]. These models were based around random variables that were binary [1/0], denoting the presence or absence of a word v in a document. Today we move to probabilistic language models: modeling the probability that a word token in a document is v ... first for text categorization.

Probabilistic models: Naive Bayes Text Classification
Today: Introduction to Text Classification; Probabilistic Language Models; Naive Bayes text categorization.

Is this spam?
From: "" <takworlld@hotmail.com>
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down. Stop paying rent TODAY! There is no need to spend hundreds or even thousands for similar courses. I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook. Change your life NOW!
=================================================
Click Below to order:
=================================================

Categorization/Classification
Given: a description of an instance, x in X, where X is the instance language or instance space. Issue: how to represent text documents. A fixed set of categories: C = {c1, c2, ..., cn}. Determine: the category of x: c(x) in C, where c(x) is a categorization function whose domain is X and whose range is C. We want to know how to build categorization functions ("classifiers").

Document Classification
Test Data, Classes, Training Data. Training classes with characteristic terms: ML (AI): learning, intelligence, algorithm, reinforcement, network, ...; Planning: planning, temporal, reasoning, plan, language, ...; Semantics (Programming): programming, semantics, language, proof, ...; Garb.Coll.: garbage, collection, memory, optimization, region, ...; plus further classes such as (HCI), Multimedia, GUI. A test document containing "planning, language, proof, intelligence" must be assigned to one of these classes. (Note: in real life there is often a hierarchy, not present in the above problem statement; and you get papers on ML approaches to Garb. Coll.)
Text Categorization Examples
Assign labels to each document or web-page. Labels are most often topics such as Yahoo-categories, e.g., "finance," "sports," "news>world>asia>business". Labels may be genres, e.g., "editorials", "movie-reviews", "news". Labels may be opinion, e.g., like, hate, neutral. Labels may be domain-specific binary, e.g., "interesting-to-me" : "not-interesting-to-me"; spam : not-spam; contains adult language : doesn't.

Classification Methods (1)
Manual classification. Used by Yahoo!, Looksmart, about.com, ODP, Medline. Very accurate when the job is done by experts. Consistent when the problem size and team is small. Difficult and expensive to scale.

Classification Methods (2)
Automatic document classification: hand-coded rule-based systems. One technique used by the CS dept's spam filter, Reuters, CIA, Verity, ... E.g., assign a category if the document contains a given boolean combination of words. Commercial systems have complex query languages (everything in IR query languages + accumulators). Accuracy is often very high if a rule has been carefully refined over time by a subject expert. Building and maintaining these rules is expensive.

Classification Methods (3)
Supervised learning of a document-label assignment function. Many systems partly rely on machine learning (Autonomy, MSN, Verity, Enkata, Yahoo!, ...): k-Nearest Neighbors (simple, powerful); Naive Bayes (simple, common method); support-vector machines (new, more powerful); plus many other methods. No free lunch: requires hand-classified training data. But data can be built up (and refined) by amateurs. Note that many commercial systems use a mixture of methods.

Bayesian Methods
Our focus this lecture. Learning and classification methods based on probability theory. Bayes' theorem plays a critical role in probabilistic learning and classification. Build a generative model that approximates how data is produced. Uses the prior probability of each category given no information about an item. Categorization produces a posterior probability distribution over the possible categories given a description of an item.

Bayes' Rule once more
P(C, X) = P(C|X) P(X) = P(X|C) P(C), hence

    P(C|X) = P(X|C) P(C) / P(X)
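The rule above can be checked with a tiny worked example. All numbers below are invented for a hypothetical two-class spam/ham setup; they are not from the slides:

```python
# Bayes' rule: P(C|X) = P(X|C) P(C) / P(X), where the evidence P(X)
# is obtained by summing P(X|C) P(C) over all classes.
# Illustrative, assumed numbers for event X = "message contains 'buy'".

p_c = {"spam": 0.3, "ham": 0.7}            # prior P(C)
p_x_given_c = {"spam": 0.8, "ham": 0.1}    # likelihood P(X|C)

p_x = sum(p_x_given_c[c] * p_c[c] for c in p_c)                 # P(X)
posterior = {c: p_x_given_c[c] * p_c[c] / p_x for c in p_c}     # P(C|X)

print(round(posterior["spam"], 3))
```

The posterior is a proper distribution: the two class probabilities sum to one, which is exactly what dividing by P(X) buys us.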
Maximum a posteriori Hypothesis

    h_MAP = argmax_{h in H} P(h|D)
          = argmax_{h in H} P(D|h) P(h) / P(D)
          = argmax_{h in H} P(D|h) P(h)        (as P(D) is constant)

Maximum likelihood Hypothesis
If all hypotheses are a priori equally likely, we only need to consider the P(D|h) term:

    h_ML = argmax_{h in H} P(D|h)

Naive Bayes Classifiers
Task: classify a new instance D based on a tuple of attribute values D = <x1, x2, ..., xn> into one of the classes cj in C.

    c_MAP = argmax_{cj in C} P(cj | x1, x2, ..., xn)
          = argmax_{cj in C} P(x1, x2, ..., xn | cj) P(cj) / P(x1, x2, ..., xn)
          = argmax_{cj in C} P(x1, x2, ..., xn | cj) P(cj)

Naive Bayes Classifier: Assumption
P(cj) can be estimated from the frequency of classes in the training examples. P(x1, x2, ..., xn | cj) has O(|X|^n |C|) parameters and could only be estimated if a very, very large number of training examples was available. Naive Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi|cj).

The Naive Bayes Classifier
Flu example: binary features runnynose, sinus, cough, fever, muscle-ache. Conditional Independence Assumption: features are independent of each other given the class:

    P(X1, ..., X5 | C) = P(X1|C) P(X2|C) ... P(X5|C)

This model is appropriate for binary variables, just like last lecture.

Learning the Model
First attempt: maximum likelihood estimates, i.e., simply use the frequencies in the data:

    P(cj) = N(C = cj) / N
    P(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
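A minimal sketch of the maximum-likelihood estimates above, on an invented miniature version of the flu data set (the feature names follow the slide's example; the rows themselves are made up):

```python
# Maximum-likelihood Naive Bayes estimates for binary features:
#   P(c)        = N(C=c) / N
#   P(x_i=v|c)  = N(X_i=v, C=c) / N(C=c)
# Tiny invented data set: each row is (feature dict, class label).
data = [
    ({"fever": 1, "cough": 1}, "flu"),
    ({"fever": 1, "cough": 0}, "flu"),
    ({"fever": 0, "cough": 1}, "noflu"),
    ({"fever": 0, "cough": 0}, "noflu"),
]

def ml_estimates(data):
    n = len(data)
    classes = {c for _, c in data}
    prior = {c: sum(1 for _, cc in data if cc == c) / n for c in classes}
    cond = {}  # cond[(feature, value, class)] = P(X=value | class)
    for c in classes:
        rows = [x for x, cc in data if cc == c]
        for feat in rows[0]:
            for v in (0, 1):
                count = sum(1 for x in rows if x[feat] == v)
                cond[(feat, v, c)] = count / len(rows)
    return prior, cond

prior, cond = ml_estimates(data)
print(prior["flu"], cond[("fever", 1, "flu")])  # -> 0.5 1.0
```

Note that the estimate for fever = 1 given class noflu comes out exactly 0 on this data, which is the zero-count problem the next slide turns to.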
Problem with Max Likelihood
With the flu model P(X1, ..., X5 | C) = P(X1|C) P(X2|C) ... P(X5|C): what if we have seen no training cases where the patient had no flu and muscle aches?

    P(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0

Zero probabilities cannot be conditioned away, no matter the other evidence!

    l = argmax_c P(c) * prod_i P(xi | c)

Smoothing to Avoid Overfitting

    P(xi | cj) = ( N(Xi = xi, C = cj) + 1 ) / ( N(C = cj) + k )

where k is the number of values of Xi. A somewhat more subtle version, with m controlling the extent of smoothing and P(xi,k) the overall fraction of the data where Xi = xi,k:

    P(xi,k | cj) = ( N(Xi = xi,k, C = cj) + m P(xi,k) ) / ( N(C = cj) + m )

Stochastic Language Models
Model the probability of generating strings (each word in turn) in the language (commonly all strings over the alphabet). E.g., a unigram model M: the 0.2, a 0.1, man 0.01, woman 0.01, said 0.03, likes 0.02. For s = "the man likes the woman", multiply the per-word probabilities to get P(s|M) = 0.2 x 0.01 x 0.02 x 0.2 x 0.01.

Stochastic Language Models (2)
Model the probability of generating any string, under two models.
Model M1: the 0.2, class 0.01, sayst ..., pleaseth ..., yon ..., maiden ..., woman 0.01.
Model M2: the 0.2, class ..., sayst 0.03, pleaseth 0.02, yon 0.1, maiden 0.01, woman ...
For s = "the class pleaseth yon maiden": P(s|M2) > P(s|M1).

Unigram and higher-order models

    P(w1 w2 w3 w4) = P(w1) P(w2|w1) P(w3|w1 w2) P(w4|w1 w2 w3)

Unigram language models:

    P(w1) P(w2) P(w3) P(w4)

Bigram (generally, n-gram) language models:

    P(w1) P(w2|w1) P(w3|w2) P(w4|w3)

Other language models: grammar-based models (PCFGs, etc.); probably not the first thing to try in IR. Easy. Effective!

Naive Bayes via a class conditional language model = multinomial NB
Cat -> w1 w2 w3 w4 w5 w6. Effectively, the probability of each class is done as a class-specific unigram language model.
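The add-one smoothing formula and the "multiply per word" unigram scoring above can be sketched together; the toy corpus and sentence below are invented:

```python
from collections import Counter

def unigram_lm(text, alpha=1.0):
    """Add-alpha smoothed unigram model (Laplace/add-one for alpha=1):
    P(w) = (count(w) + alpha) / (N + alpha * |V|)."""
    counts = Counter(text.split())
    n = sum(counts.values())          # N: total tokens in the corpus
    vocab = set(counts)               # V: observed vocabulary
    def prob(w):
        # Counter returns 0 for unseen words, so smoothing keeps P(w) > 0.
        return (counts[w] + alpha) / (n + alpha * len(vocab))
    return prob

p = unigram_lm("the man likes the woman the man said")

# P(s|M): multiply the per-word probabilities of the sentence.
score = 1.0
for w in "the man likes".split():
    score *= p(w)
print(score)
```

The key effect of smoothing shows up on an unseen word: its probability is small but strictly positive, so a single novel word no longer zeroes out a whole document score.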
Using Naive Bayes Classifiers to Classify Text: Basic method
Attributes are text positions, values are words.

    c_NB = argmax_{cj in C} P(cj) prod_i P(xi | cj)
         = argmax_{cj in C} P(cj) P(x1 = "our" | cj) ... P(xn = "text" | cj)

Still too many possibilities. Assume that classification is independent of the positions of the words: use the same parameters for each position. The result is the bag of words model (over tokens, not types).

Naive Bayes: Learning
From the training corpus, extract the Vocabulary. Calculate the required P(cj) and P(xk|cj) terms. For each cj in C do:
    docs_j <- subset of documents for which the target class is cj
    P(cj) <- |docs_j| / total # documents
    Text_j <- single document containing all docs_j
    for each word xk in Vocabulary:
        n_k <- number of occurrences of xk in Text_j
        P(xk | cj) <- (n_k + alpha) / (n + alpha |Vocabulary|)

Naive Bayes: Classifying
positions <- all word positions in the current document which contain tokens found in Vocabulary. Return c_NB, where

    c_NB = argmax_{cj in C} P(cj) prod_{i in positions} P(xi | cj)

Naive Bayes: Time Complexity
Training time: O(|D| L_d + |C||V|), where L_d is the average length of a document in D. Assumes V and all D_j, n_j, and n_jk are pre-computed in O(|D| L_d) time during one pass through all of the data. Why? Generally just O(|D| L_d), since usually |C||V| < |D| L_d. Test time: O(|C| L_t), where L_t is the average length of a test document. Very efficient overall: linearly proportional to the time needed to just read in all the data.

Underflow Prevention
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow. Since log(xy) = log(x) + log(y), it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities. The class with the highest final un-normalized log probability score is still the most probable:

    c_NB = argmax_{cj in C} [ log P(cj) + sum_{i in positions} log P(xi | cj) ]

Recap: Two Models
Model 1: Multivariate binomial. One feature Xw for each word in the dictionary; Xw = true in document d if w appears in d. Naive Bayes assumption: given the document's topic, the appearance of one word in the document tells us nothing about the chances that another word appears. This is the model you get from the binary independence model in probabilistic relevance feedback on hand-classified data (Maron in IR was a very early user of NB).
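The learning and classifying procedures above, together with the log-space trick from the underflow slide, fit in a short sketch. The training documents and class names below are invented:

```python
import math
from collections import Counter

def train_nb(docs, alpha=1.0):
    """docs: list of (text, class). Returns priors and per-class smoothed
    word probabilities, following P(xk|cj) = (n_k + alpha) / (n + alpha |V|)."""
    vocab = {w for text, _ in docs for w in text.split()}
    prior, cond = {}, {}
    for c in {c for _, c in docs}:
        texts = [t for t, cc in docs if cc == c]
        prior[c] = len(texts) / len(docs)
        counts = Counter(w for t in texts for w in t.split())
        n = sum(counts.values())
        cond[c] = {w: (counts[w] + alpha) / (n + alpha * len(vocab))
                   for w in vocab}
    return prior, cond

def classify_nb(text, prior, cond):
    """argmax_c of log P(c) + sum of log P(xi|c); tokens outside the
    vocabulary are skipped, as in the 'positions' step above."""
    best, best_score = None, -math.inf
    for c in prior:
        score = math.log(prior[c]) + sum(
            math.log(cond[c][w]) for w in text.split() if w in cond[c])
        if score > best_score:
            best, best_score = c, score
    return best

docs = [("buy cheap pills now", "spam"),
        ("cheap cheap offer now", "spam"),
        ("meeting agenda for monday", "ham"),
        ("monday project meeting notes", "ham")]
prior, cond = train_nb(docs)
print(classify_nb("cheap pills offer", prior, cond))  # -> spam
```

Training makes one pass to count words per class and test-time scoring is linear in the document length, matching the complexity analysis on the slide.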
Two Models (cont.)
Model 2: Multinomial = class conditional unigram. One feature Xi for each word position in the document; the feature's values are all words in the dictionary. The value of Xi is the word in position i. Naive Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions. Second assumption: word appearance does not depend on position:

    P(Xi = w | c) = P(Xj = w | c)

for all positions i, j, words w, and classes c. Just have one multinomial feature predicting all words.

Parameter estimation
Binomial model: P(Xw = t | cj) = fraction of documents of topic cj in which word w appears. Multinomial model: P(Xi = w | cj) = fraction of times in which word w appears across all documents of topic cj. Can create a mega-document for topic j by concatenating all documents in this topic; use the frequency of w in the mega-document.

Classification
Multinomial vs multivariate binomial? Multinomial is in general better. See results figures later.

Feature selection via Mutual Information
We might not want to use all words, but just reliable, good discriminating terms. In the training set, choose the k words which best discriminate the categories. One way is using terms with maximal Mutual Information with the classes. For each word w and each category c:

    I(w, c) = sum_{e_w in {0,1}} sum_{e_c in {0,1}} p(e_w, e_c) log [ p(e_w, e_c) / ( p(e_w) p(e_c) ) ]

Feature selection via MI (contd.)
For each category we build a list of the k most discriminating terms. For example (on 20 Newsgroups): sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...; rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ... Greedy: does not account for correlations between terms. In general, feature selection is necessary for binomial NB, but not for multinomial NB. Why?

Chi-Square Feature Selection

                    Doc in category    Doc not in category
    Term present          A                    B
    Term absent           C                    D

    chi^2 = N (AD - BC)^2 / ( (A+B)(A+C)(B+D)(C+D) ),  where N = A + B + C + D

What is the value for complete independence of term and category?
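Both selection criteria work from the same 2x2 contingency counts. A small sketch, with invented counts and A, B, C, D laid out as in the chi-square table above:

```python
import math

# Invented 2x2 counts for one (term, category) pair:
# A: term present, doc in category     B: term present, doc not in category
# C: term absent,  doc in category     D: term absent,  doc not in category
A, B, C, D = 40, 10, 20, 130

def chi_square(A, B, C, D):
    """chi^2 = N (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D)), N = A+B+C+D."""
    N = A + B + C + D
    return N * (A * D - B * C) ** 2 / ((A + B) * (A + C) * (B + D) * (C + D))

def mutual_information(A, B, C, D):
    """I(w, c) = sum over e_w, e_c of p(e_w,e_c) log p(e_w,e_c)/(p(e_w)p(e_c)),
    with the four joint cells and their row/column marginals from the table."""
    N = A + B + C + D
    mi = 0.0
    for joint, row, col in [(A, A + B, A + C), (B, A + B, B + D),
                            (C, C + D, A + C), (D, C + D, B + D)]:
        if joint:
            mi += (joint / N) * math.log((joint * N) / (row * col))
    return mi

print(chi_square(A, B, C, D), mutual_information(A, B, C, D))
```

For an independent term/category pair (e.g., counts 10, 30, 15, 45, where every cell equals its expected value) both scores come out 0, which answers the "complete independence" question on the slide.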
Feature Selection
Mutual Information: clear information-theoretic interpretation; may select rare uninformative terms. Chi-square: statistical foundation; may select very slightly informative frequent terms that are not very useful for classification. Commonest terms: no particular foundation; in practice often is 90% as good.

Evaluating Categorization
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances). Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system. Results can vary based on sampling error due to different training and test sets. Average results over multiple training and test sets (splits of the overall data) for the best results.

Example: AutoYahoo!
Classify 13,589 Yahoo! webpages in the "Science" subtree into 95 different topics (hierarchy depth 2).

Example: WebKB (CMU)
Classify webpages from CS departments into: student, faculty, course, project.

WebKB Experiment
Train on ~5,000 hand-labeled web pages (Cornell, Washington, U.Texas, Wisconsin). Crawl and classify a new site (CMU). Results, accuracy per class (the slide also tabulated the number extracted and the number correct per class):

    Student 72%, Faculty 42%, Person 79%, Project 73%, Course 89%, Department 100%

NB Model Comparison
[results figure]
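The advice above, averaging accuracy over multiple random train/test splits, can be sketched as follows. The data set and the majority-class "classifier" below are placeholder assumptions, just to exercise the harness:

```python
import random

def accuracy(predictions, labels):
    """Classification accuracy c/n: correct decisions over total instances."""
    correct = sum(1 for p, y in zip(predictions, labels) if p == y)
    return correct / len(labels)

def averaged_accuracy(data, train_and_predict, n_splits=5, test_frac=0.3, seed=0):
    """Average accuracy over several random train/test splits of the data."""
    rng = random.Random(seed)
    scores = []
    for _ in range(n_splits):
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * test_frac)
        test, train = shuffled[:cut], shuffled[cut:]
        preds = train_and_predict(train, [x for x, _ in test])
        scores.append(accuracy(preds, [y for _, y in test]))
    return sum(scores) / len(scores)

# Trivial majority-class "classifier", a stand-in for a real learner.
def majority(train, test_xs):
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return [top] * len(test_xs)

data = [(i, "a" if i % 3 else "b") for i in range(30)]
print(averaged_accuracy(data, majority))
```

Averaging over several splits reduces the sampling-error variance the slide warns about, at the cost of training the classifier once per split.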
Sample Learning Curve (Yahoo Science Data)
[learning-curve figure]

Violation of NB Assumptions
Conditional independence. Positional independence.

Naive Bayes Posterior Probabilities
Classification results of naive Bayes (the class with maximum posterior probability) are usually fairly accurate. However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not. Output probabilities are generally very close to 0 or 1.

When does Naive Bayes work?
Sometimes NB performs well even if the Conditional Independence assumptions are badly violated. Classification is about predicting the correct class label and NOT about accurately estimating probabilities. Assume two classes c1 and c2, and a new case A arrives. NB will classify A to c1 if P(A, c1) > P(A, c2). [Table contrasting the actual probabilities with the probabilities estimated by NB.] Despite the big error in estimating the probabilities, the classification is still correct. Correct estimation implies accurate prediction, but accurate prediction does NOT imply correct estimation.

Naive Bayes is Not So Naive
Naive Bayes took first and second place in the KDD-CUP 97 competition, among 16 (then) state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model: predict if the recipient of mail will actually respond to the advertisement; 750,000 records.
Robust to irrelevant features: irrelevant features cancel each other without affecting results; decision trees, by contrast, can heavily suffer from this.
Very good in domains with many equally important features; decision trees suffer from fragmentation in such cases, especially with little data.
A good dependable baseline for text classification (but not the best!).
Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes Optimal Classifier for the problem.
Very fast: learning with one pass over the data; testing linear in the number of attributes and document collection size.
Low storage requirements.
Resources
Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997.
Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
More informationDepartment of Computer Science Artificial Intelligence Research Laboratory. Iowa State University MACHINE LEARNING
MACHINE LEANING Vasant Honavar Bonformatcs and Computatonal Bology rogram Center for Computatonal Intellgence, Learnng, & Dscovery Iowa State Unversty honavar@cs.astate.edu www.cs.astate.edu/~honavar/
More informationThe Gaussian classifier. Nuno Vasconcelos ECE Department, UCSD
he Gaussan classfer Nuno Vasconcelos ECE Department, UCSD Bayesan decson theory recall that we have state of the world X observatons g decson functon L[g,y] loss of predctng y wth g Bayes decson rule s
More informationKernel Methods and SVMs Extension
Kernel Methods and SVMs Extenson The purpose of ths document s to revew materal covered n Machne Learnng 1 Supervsed Learnng regardng support vector machnes (SVMs). Ths document also provdes a general
More information1/10/18. Definitions. Probabilistic models. Why probabilistic models. Example: a fair 6-sided dice. Probability
/0/8 I529: Machne Learnng n Bonformatcs Defntons Probablstc models Probablstc models A model means a system that smulates the object under consderaton A probablstc model s one that produces dfferent outcomes
More informationCS 3710: Visual Recognition Classification and Detection. Adriana Kovashka Department of Computer Science January 13, 2015
CS 3710: Vsual Recognton Classfcaton and Detecton Adrana Kovashka Department of Computer Scence January 13, 2015 Plan for Today Vsual recognton bascs part 2: Classfcaton and detecton Adrana s research
More informationNaïve Bayes for Text Classification
Naïve Bayes for Text Cassifiation adapted by Lye Ungar from sides by Mith Marus, whih were adapted from sides by Massimo Poesio, whih were adapted from sides by Chris Manning : Exampe: Is this spam? From:
More informationGenerative classification models
CS 675 Intro to Machne Learnng Lecture Generatve classfcaton models Mlos Hauskrecht mlos@cs.ptt.edu 539 Sennott Square Data: D { d, d,.., dn} d, Classfcaton represents a dscrete class value Goal: learn
More informationStat 543 Exam 2 Spring 2016
Stat 543 Exam 2 Sprng 2016 I have nether gven nor receved unauthorzed assstance on ths exam. Name Sgned Date Name Prnted Ths Exam conssts of 11 questons. Do at least 10 of the 11 parts of the man exam.
More informationSupport Vector Machines. Vibhav Gogate The University of Texas at dallas
Support Vector Machnes Vbhav Gogate he Unversty of exas at dallas What We have Learned So Far? 1. Decson rees. Naïve Bayes 3. Lnear Regresson 4. Logstc Regresson 5. Perceptron 6. Neural networks 7. K-Nearest
More informationProbability Density Function Estimation by different Methods
EEE 739Q SPRIG 00 COURSE ASSIGMET REPORT Probablty Densty Functon Estmaton by dfferent Methods Vas Chandraant Rayar Abstract The am of the assgnment was to estmate the probablty densty functon (PDF of
More informationWeek 5: Neural Networks
Week 5: Neural Networks Instructor: Sergey Levne Neural Networks Summary In the prevous lecture, we saw how we can construct neural networks by extendng logstc regresson. Neural networks consst of multple
More informationOn the Dirichlet Mixture Model for Mining Protein Sequence Data
On the Drchlet Mxture Model for Mnng Proten Sequence Data Xugang Ye Natonal Canter for Botechnology Informaton Bologsts need to fnd from the raw data lke ths Background Background the nformaton lke ths
More informationSpace of ML Problems. CSE 473: Artificial Intelligence. Parameter Estimation and Bayesian Networks. Learning Topics
/7/7 CSE 73: Artfcal Intellgence Bayesan - Learnng Deter Fox Sldes adapted from Dan Weld, Jack Breese, Dan Klen, Daphne Koller, Stuart Russell, Andrew Moore & Luke Zettlemoyer What s Beng Learned? Space
More informationStat 543 Exam 2 Spring 2016
Stat 543 Exam 2 Sprng 206 I have nether gven nor receved unauthorzed assstance on ths exam. Name Sgned Date Name Prnted Ths Exam conssts of questons. Do at least 0 of the parts of the man exam. I wll score
More informationCIS526: Machine Learning Lecture 3 (Sept 16, 2003) Linear Regression. Preparation help: Xiaoying Huang. x 1 θ 1 output... θ M x M
CIS56: achne Learnng Lecture 3 (Sept 6, 003) Preparaton help: Xaoyng Huang Lnear Regresson Lnear regresson can be represented by a functonal form: f(; θ) = θ 0 0 +θ + + θ = θ = 0 ote: 0 s a dummy attrbute
More informationELG4179: Wireless Communication Fundamentals S.Loyka. Frequency-Selective and Time-Varying Channels
Frequeny-Seletve and Tme-Varyng Channels Ampltude flutuatons are not the only effet. Wreless hannel an be frequeny seletve (.e. not flat) and tmevaryng. Frequeny flat/frequeny-seletve hannels Frequeny
More information10/15/2015 A FAST REVIEW OF DISCRETE PROBABILITY (PART 2) Probability, Conditional Probability & Bayes Rule. Discrete random variables
Probability, Conditional Probability & Bayes Rule A FAST REVIEW OF DISCRETE PROBABILITY (PART 2) 2 Discrete random variables A random variable can take on one of a set of different values, each with an
More informationLogistic Classifier CISC 5800 Professor Daniel Leeds
lon 9/7/8 Logstc Classfer CISC 58 Professor Danel Leeds Classfcaton strategy: generatve vs. dscrmnatve Generatve, e.g., Bayes/Naïve Bayes: 5 5 Identfy probablty dstrbuton for each class Determne class
More informationBayesian Decision Theory
Bayesan Decson heory Berln hen 2005 References:. E. Alpaydn Introducton to Machne Learnng hapter 3 2. om M. Mtchell Machne Learnng hapter 6 Revew: Basc Formulas for robabltes roduct Rule: probablty A B
More informationWhich Separator? Spring 1
Whch Separator? 6.034 - Sprng 1 Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng Whch Separator? Mamze the margn to closest ponts 6.034 - Sprng 3 Margn of a pont " # y (w $ + b) proportonal
More informationStat 642, Lecture notes for 01/27/ d i = 1 t. n i t nj. n j
Stat 642, Lecture notes for 01/27/05 18 Rate Standardzaton Contnued: Note that f T n t where T s the cumulatve follow-up tme and n s the number of subjects at rsk at the mdpont or nterval, and d s the
More informationQuestion Classification Using Language Modeling
Queston Classfcaton Usng Language Modelng We L Center for Intellgent Informaton Retreval Department of Computer Scence Unversty of Massachusetts, Amherst, MA 01003 ABSTRACT Queston classfcaton assgns a
More informationParametric fractional imputation for missing data analysis. Jae Kwang Kim Survey Working Group Seminar March 29, 2010
Parametrc fractonal mputaton for mssng data analyss Jae Kwang Km Survey Workng Group Semnar March 29, 2010 1 Outlne Introducton Proposed method Fractonal mputaton Approxmaton Varance estmaton Multple mputaton
More informationInformation Retrieval Language models for IR
Informaton Retreval Language models for IR From Mannng and Raghavan s course [Borros sldes from Vktor Lavrenko and Chengxang Zha] 1 Recap Tradtonal models Boolean model Vector space model robablstc models
More informationDepartment of Computer Science Artificial Intelligence Research Laboratory. Iowa State University MACHINE LEARNING
MACHINE LEARNING Vasant Honavar Bonformatcs and Computatonal Bology Program Center for Computatonal Intellgence, Learnng, & Dscovery Iowa State Unversty honavar@cs.astate.edu www.cs.astate.edu/~honavar/
More informationLimited Dependent Variables
Lmted Dependent Varables. What f the left-hand sde varable s not a contnuous thng spread from mnus nfnty to plus nfnty? That s, gven a model = f (, β, ε, where a. s bounded below at zero, such as wages
More informationIntroduction to Econometrics (3 rd Updated Edition, Global Edition) Solutions to Odd-Numbered End-of-Chapter Exercises: Chapter 13
Introducton to Econometrcs (3 rd Updated Edton, Global Edton by James H. Stock and Mark W. Watson Solutons to Odd-Numbered End-of-Chapter Exercses: Chapter 13 (Ths verson August 17, 014 Stock/Watson -
More informationMaxent Models and Discriminative Estimation. Generative vs. Discriminative models
+ Maxent Moels an Dsrmnatve Estmaton Generatve vs. Dsrmnatve moels + Introuton n So far we ve looke at generatve moels n Language moels Nave Bayes 2 n But there s now muh use of ontonal or srmnatve probablst
More information