Information Retrieval. Lecture 10: Text Classification; The Naïve Bayes algorithm
1 Introduction to Information Retrieval
Lecture 10: Text Classification; The Naïve Bayes algorithm
2 Relevance feedback revisited
In relevance feedback, the user marks a number of documents as relevant/nonrelevant. We then try to use this information to return better search results.
Suppose we just tried to learn a filter for nonrelevant documents.
This is an instance of a text classification problem:
Two classes: relevant, nonrelevant
For each document, decide whether it is relevant or nonrelevant
The notion of classification is very general and has many applications within and beyond information retrieval.
3 Standing queries
The path from information retrieval to text classification:
You have an information need, say: Unrest in the Niger delta region
You want to rerun an appropriate query periodically to find new news items on this topic
You will be sent new documents that are found
I.e., it's classification, not ranking
Such queries are called standing queries
Long used by information professionals
A modern mass instantiation is Google Alerts
13.0
4 Spam filtering: Another text classification task
From: ""
Subject: real estate is the only way... gem oalvgkay
Anyone can buy real estate with no money down
Stop paying rent TODAY!
There is no need to send hundreds or even thousands for similar courses
I am 22 years old and I have already purchased 6 properties using the methods outlined in this truly INCREDIBLE ebook.
Change your life NOW!
Click Below to order: http://
13.0
5 Text classification: Naïve Bayes
Today:
Introduction to text classification (also widely known as text categorization — same thing)
Probabilistic language models
Naïve Bayes text classification
6 Categorization/Classification
Given:
A description of an instance, x ∈ X, where X is the instance language or instance space. (Issue: how to represent text documents.)
A fixed set of classes: C = {c1, c2, ..., cJ}
Determine:
The category of x: c(x) ∈ C, where c(x) is a classification function whose domain is X and whose range is C.
We want to know how to build classification functions (classifiers).
13.1
7 Document Classification
Test data: "planning language proof intelligence"
Classes: ML, AI, Programming, HCI, Planning, Semantics, Garb.Coll., Multimedia, GUI
Training data:
ML: learning intelligence algorithm reinforcement network...
Planning: planning temporal reasoning plan language...
Semantics: programming semantics language proof...
Garb.Coll.: garbage collection memory optimization region...
Note: in real life there is often a hierarchy, not present in the above problem statement; and also, you get papers on ML approaches to Garb. Coll.
13.1
8 More Text Classification Examples
Many search engine functionalities use classification. Assign labels to each document or web page:
Labels are most often topics, such as Yahoo-categories, e.g., "finance", "sports", "news>world>asia>business"
Labels may be genres, e.g., "editorials", "movie-reviews", "news"
Labels may be opinion on a person/product, e.g., like, hate, neutral
Labels may be domain-specific, e.g., "interesting-to-me" : "not-interesting-to-me"; contains adult language : doesn't; language identification: English, French, Chinese, ...; search vertical: about Linux versus not; link spam : not link spam
9 Classification Methods (1)
Manual classification
Used by Yahoo! originally (now present but downplayed), Looksmart, about.com, ODP, PubMed
Very accurate when the job is done by experts
Consistent when the problem size and team is small
Difficult and expensive to scale
Means we need automatic classification methods for big problems
13.0
10 Classification Methods (2)
Automatic document classification: hand-coded rule-based systems
One technique used by CS dept's spam filter, Reuters, CIA, etc.
Companies (e.g., Verity) provide IDEs for writing such rules
E.g., assign category if document contains a given boolean combination of words
Standing queries: commercial systems have complex query languages (everything in IR query languages + accumulators)
Accuracy is often very high if a rule has been carefully refined over time by a subject expert
Building and maintaining these rules is expensive
13.0
11 A Verity topic (a complex classification rule)
Note: maintenance issues (author, etc.); hand-weighting of terms
12 Classification Methods (3)
Supervised learning of a document-label assignment function
Many systems partly rely on machine learning (Autonomy, MSN, Verity, Enkata, Yahoo!, ...)
k-Nearest Neighbors (simple, powerful)
Naive Bayes (simple, common method)
Support-vector machines (new, more powerful)
plus many other methods
No free lunch: requires hand-classified training data
But data can be built up and refined by amateurs
Note that many commercial systems use a mixture of methods
13 Probabilistic relevance feedback
Recall this idea: rather than reweighting in a vector space, if the user has told us some relevant and some irrelevant documents, then we can proceed to build a probabilistic classifier, such as the Naive Bayes model we will look at today:
P(tk|R) = |Drk| / |Dr|
P(tk|NR) = |Dnrk| / |Dnr|
tk is a term; Dr is the set of known relevant documents; Drk is the subset that contain tk; Dnr is the set of known irrelevant documents; Dnrk is the subset that contain tk
14 Recall a few probability basics
For events a and b:
p(a, b) = p(a ∩ b) = p(a|b) p(b) = p(b|a) p(a)
Bayes' Rule: p(a|b) = p(b|a) p(a) / p(b), where p(a) is the prior and p(a|b) is the posterior
Odds: O(a) = p(a) / p(ā) = p(a) / (1 − p(a))
13.2
15 Bayesian Methods
Our focus this lecture: learning and classification methods based on probability theory.
Bayes' theorem plays a critical role in probabilistic learning and classification.
Build a generative model that approximates how data is produced.
Uses the prior probability of each category given no information about an item.
Categorization produces a posterior probability distribution over the possible categories given a description of an item.
13.2
16 Bayes' Rule
P(C, D) = P(C|D) P(D) = P(D|C) P(C)
⟹ P(C|D) = P(D|C) P(C) / P(D)
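The rule is easy to check numerically; the two-class priors and likelihoods below are illustrative, not from the slides.

```python
# Bayes' rule P(C|D) = P(D|C) P(C) / P(D), with P(D) obtained by the law of
# total probability over the classes. The numbers are made-up examples.
def posteriors(priors, likelihoods):
    """Return P(c|d) for each class c, given P(c) and P(d|c)."""
    p_d = sum(p * l for p, l in zip(priors, likelihoods))  # P(d) = sum_c P(d|c) P(c)
    return [p * l / p_d for p, l in zip(priors, likelihoods)]

post = posteriors(priors=[0.9, 0.1], likelihoods=[0.2, 0.6])
# post == [0.75, 0.25]: the frequent class wins despite its lower likelihood
```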
17 Naive Bayes Classifiers
Task: classify a new instance D, described by a tuple of attribute values D = ⟨x1, x2, ..., xn⟩, into one of the classes cj ∈ C:
c_MAP = argmax_{cj ∈ C} P(cj | x1, x2, ..., xn)
      = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj) / P(x1, x2, ..., xn)
      = argmax_{cj ∈ C} P(x1, x2, ..., xn | cj) P(cj)
18 Naïve Bayes Classifier: Naïve Bayes Assumption
P(cj) can be estimated from the frequency of classes in the training examples.
P(x1, x2, ..., xn | cj) has O(|X|^n · |C|) parameters and could only be estimated if a very, very large number of training examples was available.
Naïve Bayes Conditional Independence Assumption: assume that the probability of observing the conjunction of attributes is equal to the product of the individual probabilities P(xi | cj).
19 The Naïve Bayes Classifier
Flu example, with binary features X1..X5: runnynose, sinus, cough, fever, muscle-ache
Conditional Independence Assumption: features detect term presence and are independent of each other given the class:
P(X1, ..., X5 | C) = P(X1|C) · P(X2|C) · ... · P(X5|C)
This model is appropriate for binary variables (multivariate Bernoulli model)
13.3
20 Learning the Model
First attempt: maximum likelihood estimates — simply use the frequencies in the data:
P̂(cj) = N(C = cj) / N
P̂(xi | cj) = N(Xi = xi, C = cj) / N(C = cj)
13.3
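These estimates are just relative frequencies; a minimal sketch on a toy labelled sample (data and names are ours):

```python
# MLE class priors: P-hat(c_j) = N(C = c_j) / N, i.e. relative frequencies.
from collections import Counter

labels = ["flu", "flu", "flu", "no-flu"]
n = len(labels)
prior = {c: count / n for c, count in Counter(labels).items()}
# prior == {"flu": 0.75, "no-flu": 0.25}; likewise, a feature value never
# observed with a class gets estimate 0 -- the problem treated next.
```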
21 Problem with Max Likelihood
P(X1, ..., X5 | C) = P(X1|C) · P(X2|C) · ... · P(X5|C)
What if we have seen no training cases where the patient had no flu and muscle aches?
P̂(X5 = t | C = nf) = N(X5 = t, C = nf) / N(C = nf) = 0
Zero probabilities cannot be conditioned away, no matter the other evidence!
ℓ = argmax_c P̂(c) Π_i P̂(xi | c)
22 Smoothing to Avoid Overfitting
Laplace (add-one) smoothing:
P̂(xi | cj) = (N(Xi = xi, C = cj) + 1) / (N(C = cj) + k), where k is the number of values of Xi
Somewhat more subtle version, with prior pi,k = overall fraction in data where Xi = xi,k, and m = extent of smoothing:
P̂(xi,k | cj) = (N(Xi = xi,k, C = cj) + m·pi,k) / (N(C = cj) + m)
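The add-one estimate can be sketched directly; the counts below are illustrative.

```python
# Laplace smoothing: P-hat(x_i|c_j) = (N(X_i = x_i, C = c_j) + 1) / (N(C = c_j) + k),
# where k is the number of values X_i can take.
def laplace(count_joint, count_class, k):
    return (count_joint + 1) / (count_class + k)

p_unseen = laplace(count_joint=0, count_class=10, k=2)  # 1/12: no longer zero
p_seen = laplace(count_joint=7, count_class=10, k=2)    # 8/12
```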
23 Stochastic Language Models
Model the probability of generating strings (each word in turn) in the language (commonly all strings over the vocabulary). E.g., a unigram model:
Model M: 0.2 the | 0.1 a | 0.01 man | 0.01 woman | 0.03 said | 0.02 likes
P(the man likes the woman | M): multiply the per-word probabilities.
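Model M above can be written down directly and the string probability computed by multiplying per-word probabilities:

```python
# Unigram model from the slide: P(s|M) = product of P(w|M) over the words of s.
model_m = {"the": 0.2, "a": 0.1, "man": 0.01, "woman": 0.01,
           "said": 0.03, "likes": 0.02}

def string_prob(sentence, model):
    p = 1.0
    for w in sentence.split():
        p *= model[w]  # each word is generated independently of its context
    return p

p = string_prob("the man likes the woman", model_m)
# 0.2 * 0.01 * 0.02 * 0.2 * 0.01 = 8e-08
```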
24 Stochastic Language Models
Model the probability of generating any string. Two models over the same vocabulary:
Model M1: 0.2 the, 0.01 class, 0.01 woman; low probabilities for sayst, pleaseth, yon, maiden
Model M2: 0.2 the, 0.03 sayst, 0.1 yon, 0.01 maiden; low probabilities for class, woman
For the string "the class pleaseth yon maiden": P(s | M2) > P(s | M1)
25 Unigram and higher-order models
Unigram language models: P(t1) P(t2) P(t3) P(t4)
Bigram (generally, n-gram) language models: P(t1) P(t2|t1) P(t3|t2) P(t4|t3)
Other language models: grammar-based models (PCFGs, etc.) — probably not the first thing to try in IR
Unigrams: easy. Effective!
26 Naïve Bayes via a class conditional language model (multinomial NB)
Cat → w1 w2 w3 w4 w5 w6
Effectively, the probability of each class is computed as a class-specific unigram language model
27 Using Multinomial Naive Bayes Classifiers to Classify Text: Basic method
Attributes are text positions, values are words:
c_NB = argmax_{cj ∈ C} P(cj) Π_i P(xi | cj)
     = argmax_{cj ∈ C} P(cj) P(x1 = "our" | cj) · ... · P(xn = "text" | cj)
Still too many possibilities.
Assume that classification is independent of the positions of the words: use the same parameters for each position. The result is a bag-of-words model (over tokens, not types).
28 Naïve Bayes: Learning
From the training corpus, extract Vocabulary.
Calculate the required P(cj) and P(xk | cj) terms:
For each cj in C do
  docsj ← subset of documents for which the target class is cj
  P(cj) ← |docsj| / total # documents
  Textj ← single document containing all docsj
  for each word xk in Vocabulary
    nk ← number of occurrences of xk in Textj
    P(xk | cj) ← (nk + α) / (n + α·|Vocabulary|)
29 Naïve Bayes: Classifying
positions ← all word positions in the current document which contain tokens found in Vocabulary
Return c_NB, where
c_NB = argmax_{cj ∈ C} P(cj) Π_{i ∈ positions} P(xi | cj)
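The learning and classifying steps of the two slides above can be sketched end-to-end. The toy corpus is the China/Japan worked example from IIR section 13.2; the function names are ours, and α = 1.

```python
# Multinomial NB: train (priors plus add-one-smoothed conditionals over a
# per-class "mega-document"), then classify by summing log probabilities.
import math
from collections import Counter

def train(docs):
    """docs: list of (tokens, class). Returns (vocabulary, priors, conditionals)."""
    vocab = {w for toks, _ in docs for w in toks}
    prior, cond = {}, {}
    for c in {label for _, label in docs}:
        class_docs = [toks for toks, label in docs if label == c]
        prior[c] = len(class_docs) / len(docs)
        counts = Counter(w for toks in class_docs for w in toks)  # mega-document
        n = sum(counts.values())
        cond[c] = {w: (counts[w] + 1) / (n + len(vocab)) for w in vocab}
    return vocab, prior, cond

def classify(tokens, vocab, prior, cond):
    def score(c):  # log P(c) + sum over in-vocabulary positions of log P(w|c)
        return math.log(prior[c]) + sum(
            math.log(cond[c][w]) for w in tokens if w in vocab)
    return max(prior, key=score)

docs = [(["chinese", "beijing", "chinese"], "china"),
        (["chinese", "chinese", "shanghai"], "china"),
        (["chinese", "macao"], "china"),
        (["tokyo", "japan", "chinese"], "japan")]
vocab, prior, cond = train(docs)
label = classify(["chinese", "chinese", "chinese", "tokyo", "japan"],
                 vocab, prior, cond)
# label == "china", as in the IIR 13.2 worked example
```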
30 Naive Bayes: Time Complexity
Training time: O(|D| L_d + |C||V|), where L_d is the average length of a document in D.
Assumes V and all D_i, n_i, and n_ij are pre-computed in O(|D| L_d) time during one pass through all of the data.
Generally just O(|D| L_d), since usually |C||V| < |D| L_d.
Test time: O(|C| L_t), where L_t is the average length of a test document.
Very efficient overall: linearly proportional to the time needed to just read in all the data. Why?
31 Underflow Prevention: log space
Multiplying lots of probabilities, which are between 0 and 1 by definition, can result in floating-point underflow.
Since log(xy) = log x + log y, it is better to perform all computations by summing logs of probabilities rather than multiplying probabilities.
The class with the highest final un-normalized log probability score is still the most probable:
c_NB = argmax_{cj ∈ C} [ log P(cj) + Σ_{i ∈ positions} log P(xi | cj) ]
Note that the model is now just a max of a sum of weights.
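A short demonstration of why the log transform matters in practice:

```python
# Multiplying many small probabilities underflows to exactly 0.0 in IEEE-754
# doubles; the equivalent sum of logs stays finite and comparable.
import math

probs = [1e-5] * 80                 # e.g. 80 tokens, each with probability 1e-5
product = 1.0
for p in probs:
    product *= p                    # 1e-400 is below the double range -> 0.0
log_score = sum(math.log(p) for p in probs)   # about -921.03, no underflow
```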
32 Note: Two Models
Model 1: Multivariate Bernoulli
One feature Xw for each word in the dictionary
Xw = true in document d if w appears in d
Naive Bayes assumption: given the document's topic, appearance of one word in the document tells us nothing about chances that another word appears
This is the model used in the binary independence model in classic probabilistic relevance feedback on hand-classified data (Maron in IR was a very early user of NB)
33 Two Models
Model 2: Multinomial = class conditional unigram
One feature Xi for each word position in the document; the feature's values are all words in the dictionary
Value of Xi is the word in position i
Naïve Bayes assumption: given the document's topic, the word in one position in the document tells us nothing about words in other positions
Second assumption: word appearance does not depend on position: P(Xi = w | c) = P(Xj = w | c) for all positions i, j, word w, and class c
Just have one multinomial feature predicting all words
34 Parameter estimation
Multivariate Bernoulli model:
P̂(Xw = t | cj) = fraction of documents of topic cj in which word w appears
Multinomial model:
P̂(Xi = w | cj) = fraction of times in which word w appears across all documents of topic cj
Can create a mega-document for topic j by concatenating all documents in this topic; use the frequency of w in the mega-document
35 Classification
Multinomial vs. Multivariate Bernoulli?
The multinomial model is almost always more effective in text applications! (See results figures later.)
See IIR sections 13.2 and 13.3 for worked examples with each model
36 Feature Selection: Why?
Text collections have a large number of features: 10,000 – 1,000,000 unique words and more
May make using a particular classifier feasible: some classifiers can't deal with 100,000s of features
Reduces training time: training time for some methods is quadratic or worse in the number of features
Can improve generalization performance: eliminates noise features; avoids overfitting
13.5
37 Feature selection: how?
Two ideas:
Hypothesis testing statistics: are we confident that the value of one categorical variable is associated with the value of another? (Chi-square test)
Information theory: how much information does the value of one categorical variable give you about the value of another? (Mutual information)
They're similar, but χ² measures confidence in association, based on available statistics, while MI measures the extent of association assuming perfect knowledge of probabilities
13.5
38 χ² statistic (CHI)
χ² is interested in (f_o − f_e)²/f_e summed over all table entries: is the observed number what you'd expect given the marginals?
The 2×2 table counts documents by Term = jaguar vs. Term ≠ jaguar and Class = auto vs. Class ≠ auto (observed counts f_o, expected counts f_e).
χ²(j, a) = Σ (O − E)²/E ≈ 12.9
The null hypothesis is rejected with confidence .999, since 12.9 > the critical value for .999 confidence (p < .001).
39 χ² statistic (CHI)
There is a simpler formula for 2×2 χ²:
χ² = N (AD − BC)² / ((A+B)(A+C)(B+D)(C+D))
where A = #(t, c), B = #(t, c̄), C = #(t̄, c), D = #(t̄, c̄), and N = A + B + C + D
Value for complete independence of term and category?
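The shortcut is easy to check in code, including the answer to the closing question; the counts below are illustrative.

```python
# Chi-square for a 2x2 term/category table: N (AD - BC)^2 / ((A+B)(A+C)(B+D)(C+D)),
# with A = #(t,c), B = #(t, not c), C = #(not t, c), D = #(not t, not c).
def chi_square(a, b, c, d):
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (a + c) * (b + d) * (c + d))

# Complete independence means AD = BC, so chi-square is exactly 0:
x_indep = chi_square(10, 20, 30, 60)   # 0.0
# A strongly associated term scores high:
x_assoc = chi_square(40, 10, 10, 40)   # 36.0, far above the .999 critical value
```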
40 Feature selection via Mutual Information
In the training set, choose the k words which best discriminate (give most information on) the categories.
The mutual information between a word and a class is:
I(w, c) = Σ_{e_w ∈ {0,1}} Σ_{e_c ∈ {0,1}} p(e_w, e_c) log [ p(e_w, e_c) / (p(e_w) p(e_c)) ]
for each word w and each category c
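The same double sum can be computed from a 2×2 table of document counts; the counts and function name are ours, and cells with a zero count contribute 0 by the usual convention.

```python
# I(w,c) = sum over e_w, e_c in {0,1} of p(e_w,e_c) log [ p(e_w,e_c) / (p(e_w) p(e_c)) ]
import math

def mutual_information(n11, n10, n01, n00):
    """n11 = docs with word & class, n10 = word only, n01 = class only, n00 = neither."""
    n = n11 + n10 + n01 + n00
    cells = [(n11, n11 + n10, n11 + n01),   # e_w=1, e_c=1
             (n10, n11 + n10, n10 + n00),   # e_w=1, e_c=0
             (n01, n01 + n00, n11 + n01),   # e_w=0, e_c=1
             (n00, n01 + n00, n10 + n00)]   # e_w=0, e_c=0
    return sum((njk / n) * math.log((n * njk) / (nw * nc))
               for njk, nw, nc in cells if njk)

mi_zero = mutual_information(10, 20, 30, 60)   # independent term -> 0.0
mi_pos = mutual_information(40, 10, 10, 40)    # associated term -> positive
```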
41 Feature selection via MI (contd.)
For each category we build a list of the k most discriminating terms.
For example, on 20 Newsgroups:
sci.electronics: circuit, voltage, amp, ground, copy, battery, electronics, cooling, ...
rec.autos: car, cars, engine, ford, dealer, mustang, oil, collision, autos, tires, toyota, ...
Greedy: does not account for correlations between terms. Why?
42 Feature Selection
Mutual information: clear information-theoretic interpretation; may select rare uninformative terms
Chi-square: statistical foundation; may select very slightly informative frequent terms that are not very useful for classification
Just use the commonest terms? No particular foundation, but in practice this is often 90% as good
43 Feature selection for NB
In general, feature selection is necessary for multivariate Bernoulli NB; otherwise you suffer from noise, multi-counting.
Feature selection really means something different for multinomial NB: it means dictionary truncation.
The multinomial NB model only has 1 feature.
This feature selection normally isn't needed for multinomial NB, but may help a fraction with quantities that are badly estimated.
44 Evaluating Categorization
Evaluation must be done on test data that are independent of the training data (usually a disjoint set of instances).
Classification accuracy: c/n, where n is the total number of test instances and c is the number of test instances correctly classified by the system.
Adequate if one class per document; otherwise, F measure for each class.
Results can vary based on sampling error due to different training and test sets. Average results over multiple training and test set splits of the overall data for the best results.
See IIR 13.6 for evaluation on Reuters
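Accuracy and per-class F1 as described above, on a tiny hand-made example (labels are ours):

```python
# accuracy = c / n; F1 combines per-class precision and recall.
def accuracy(gold, pred):
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def f1(gold, pred, cls):
    tp = sum(g == cls and p == cls for g, p in zip(gold, pred))
    fp = sum(g != cls and p == cls for g, p in zip(gold, pred))
    fn = sum(g == cls and p != cls for g, p in zip(gold, pred))
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0

gold = ["spam", "spam", "ham", "ham", "ham"]
pred = ["spam", "ham", "ham", "ham", "spam"]
acc = accuracy(gold, pred)        # 3/5 = 0.6
f1_spam = f1(gold, pred, "spam")  # precision 1/2, recall 1/2 -> 0.5
```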
45 WebKB Experiment (1998)
Classify webpages from CS departments into: student, faculty, course, project
Train on ~5,000 hand-labeled web pages (Cornell, Washington, U. Texas, Wisconsin)
Crawl and classify a new site (CMU)
Results (accuracy by class, over extracted/correct counts):
Student 72%, Faculty 42%, Person 79%, Project 73%, Course 89%, Department 100%
46 NB Model Comparison: WebKB
48 Naïve Bayes on spam
13.6
49 SpamAssassin
Naïve Bayes has found a home in spam filtering
Paul Graham's A Plan for Spam: a mutant with more mutant offspring... a Naive Bayes-like classifier with weird parameter estimation
Widely used in spam filters; classic Naive Bayes is superior when appropriately used (according to David D. Lewis)
But also many other things: black hole lists, etc.
Many topic filters also use NB classifiers
50 Violation of NB Assumptions
Conditional independence
Positional independence
Examples?
51 Example: Sensors
Reality (joint distribution over the two sensors and the weather):
P(+,+,r) = 3/8    P(−,−,r) = 1/8
P(+,+,s) = 1/8    P(−,−,s) = 3/8
NB model: Raining? → M1, M2
NB factors: P(s) = 1/2, P(+|s) = 1/4, P(+|r) = 3/4
Predictions: P(r,+,+) = ½ · ¾ · ¾; P(s,+,+) = ½ · ¼ · ¼; P(r|+,+) = 9/10; P(s|+,+) = 1/10
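The slide's numbers can be reproduced exactly with rational arithmetic, showing how the (violated) independence assumption inflates the posterior from the true 3/4 to 9/10:

```python
# True posterior from the joint table vs. the NB posterior from the factors.
from fractions import Fraction as F

# reality: joint probabilities of (sensor1, sensor2, weather)
p_ppr, p_pps = F(3, 8), F(1, 8)
true_rain = p_ppr / (p_ppr + p_pps)    # P(r|+,+) = 3/4

# NB factors from the slide: P(r) = P(s) = 1/2, P(+|r) = 3/4, P(+|s) = 1/4
nb_r = F(1, 2) * F(3, 4) * F(3, 4)     # P(r,+,+) under NB
nb_s = F(1, 2) * F(1, 4) * F(1, 4)     # P(s,+,+) under NB
nb_rain = nb_r / (nb_r + nb_s)         # 9/10: overconfident
```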
52 Naïve Bayes Posterior Probabilities
Classification results of naïve Bayes (the class with maximum posterior probability) are usually fairly accurate.
However, due to the inadequacy of the conditional independence assumption, the actual posterior-probability numerical estimates are not: output probabilities are commonly very close to 0 or 1.
Correct estimation ⟹ accurate prediction, but correct probability estimation is NOT necessary for accurate prediction; you just need the right ordering of probabilities.
53 Naive Bayes is Not So Naive
Naïve Bayes took first and second place in the KDD-CUP 97 competition, among 16 then state-of-the-art algorithms. Goal: a financial services industry direct mail response prediction model — predict whether the recipient of mail will actually respond to the advertisement; 750,000 records.
Robust to irrelevant features: irrelevant features cancel each other without affecting results (decision trees, in contrast, can suffer heavily from this).
Very good in domains with many equally important features (decision trees suffer from fragmentation in such cases, especially with little data).
A good dependable baseline for text classification (but not the best!).
Optimal if the independence assumptions hold: if the assumed independence is correct, then it is the Bayes-optimal classifier for the problem.
Very fast: learning with one pass of counting over the data; testing linear in the number of attributes and document collection size.
Low storage requirements.
54 Resources
IIR 13
Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 34(1):1-47, 2002.
Yiming Yang & Xin Liu. A re-examination of text categorization methods. Proceedings of SIGIR, 1999.
Andrew McCallum and Kamal Nigam. A Comparison of Event Models for Naive Bayes Text Classification. In AAAI/ICML-98 Workshop on Learning for Text Categorization, 1998.
Tom Mitchell. Machine Learning. McGraw-Hill, 1997. (A clear, simple explanation of Naïve Bayes.)
Open Calais: automatic semantic tagging. Free (but they can keep your data), provided by Thomson/Reuters.
Weka: a data mining software package that includes an implementation of Naive Bayes.
Reuters-21578: the most famous text classification evaluation set, still widely used by lazy people — but now it's too small for realistic experiments; you should use Reuters RCV1.
Information Retrieval and Organisation
Information Retrieval and Organisation Chapter 13 Text Classification and Naïve Bayes Dell Zhang Birkbeck, University of London Motivation Relevance Feedback revisited The user marks a number of documents
More informationPart 9: Text Classification; The Naïve Bayes algorithm Francesco Ricci
Part 9: Text Classification; The Naïve Bayes algorithm Francesco Ricci Most of these slides comes from the course: Information Retrieval and Web Search, Christopher Manning and Prabhakar Raghavan 1 Content
More informationBayes Theorem & Naïve Bayes. (some slides adapted from slides by Massimo Poesio, adapted from slides by Chris Manning)
Bayes Theorem & Naïve Bayes (some slides adapted from slides by Massimo Poesio, adapted from slides by Chris Manning) Review: Bayes Theorem & Diagnosis P( a b) Posterior Likelihood Prior P( b a) P( a)
More informationAutomatic Clasification: Naïve Bayes
Automatic Clasification: Naïve Bayes Wm&R a.a. 2012/13 R. Basili (slides borrowed by: H. Schutze Dipartimento di Informatica Sistemi e produzione Università di Roma Tor Vergata Email: basili@info.uniroma2.it
More informationAI*IA 2003 Fusion of Multiple Pattern Classifiers PART III
AI*IA 23 Fusion of Multile Pattern Classifiers PART III AI*IA 23 Tutorial on Fusion of Multile Pattern Classifiers by F. Roli 49 Methods for fusing multile classifiers Methods for fusing multile classifiers
More informationnaive bayes document classification
naive bayes document classification October 31, 2018 naive bayes document classification 1 / 50 Overview 1 Text classification 2 Naive Bayes 3 NB theory 4 Evaluation of TC naive bayes document classification
More information10/15/2015 A FAST REVIEW OF DISCRETE PROBABILITY (PART 2) Probability, Conditional Probability & Bayes Rule. Discrete random variables
Probability, Conditional Probability & Bayes Rule A FAST REVIEW OF DISCRETE PROBABILITY (PART 2) 2 Discrete random variables A random variable can take on one of a set of different values, each with an
More informationIntroduction to Probability for Graphical Models
Introduction to Probability for Grahical Models CSC 4 Kaustav Kundu Thursday January 4, 06 *Most slides based on Kevin Swersky s slides, Inmar Givoni s slides, Danny Tarlow s slides, Jaser Snoek s slides,
More informationNamed Entity Recognition using Maximum Entropy Model SEEM5680
Named Entity Recognition using Maximum Entroy Model SEEM5680 Named Entity Recognition System Named Entity Recognition (NER): Identifying certain hrases/word sequences in a free text. Generally it involves
More informationRanked Retrieval (2)
Text Technologies for Data Science INFR11145 Ranked Retrieval (2) Instructor: Walid Magdy 31-Oct-2017 Lecture Objectives Learn about Probabilistic models BM25 Learn about LM for IR 2 1 Recall: VSM & TFIDF
More informationCombining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)
Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment
More information4. Score normalization technical details We now discuss the technical details of the score normalization method.
SMT SCORING SYSTEM This document describes the scoring system for the Stanford Math Tournament We begin by giving an overview of the changes to scoring and a non-technical descrition of the scoring rules
More informationBERNOULLI TRIALS and RELATED PROBABILITY DISTRIBUTIONS
BERNOULLI TRIALS and RELATED PROBABILITY DISTRIBUTIONS A BERNOULLI TRIALS Consider tossing a coin several times It is generally agreed that the following aly here ) Each time the coin is tossed there are
More informationTests for Two Proportions in a Stratified Design (Cochran/Mantel-Haenszel Test)
Chater 225 Tests for Two Proortions in a Stratified Design (Cochran/Mantel-Haenszel Test) Introduction In a stratified design, the subects are selected from two or more strata which are formed from imortant
More informationText Categorization CSE 454. (Based on slides by Dan Weld, Tom Mitchell, and others)
Text Categorization CSE 454 (Based on slides by Dan Weld, Tom Mitchell, and others) 1 Given: Categorization A description of an instance, x X, where X is the instance language or instance space. A fixed
More informationHotelling s Two- Sample T 2
Chater 600 Hotelling s Two- Samle T Introduction This module calculates ower for the Hotelling s two-grou, T-squared (T) test statistic. Hotelling s T is an extension of the univariate two-samle t-test
More informationMATH 2710: NOTES FOR ANALYSIS
MATH 270: NOTES FOR ANALYSIS The main ideas we will learn from analysis center around the idea of a limit. Limits occurs in several settings. We will start with finite limits of sequences, then cover infinite
More informationOn split sample and randomized confidence intervals for binomial proportions
On slit samle and randomized confidence intervals for binomial roortions Måns Thulin Deartment of Mathematics, Usala University arxiv:1402.6536v1 [stat.me] 26 Feb 2014 Abstract Slit samle methods have
More informationBehavioral Data Mining. Lecture 2
Behavioral Data Mining Lecture 2 Autonomy Corp Bayes Theorem Bayes Theorem P(A B) = probability of A given that B is true. P(A B) = P(B A)P(A) P(B) In practice we are most interested in dealing with events
More informationNaïve Bayes for Text Classification
Naïve Bayes for Tet Classifiation adapted by Lyle Ungar from slides by Mith Marus, whih were adapted from slides by Massimo Poesio, whih were adapted from slides by Chris Manning : Eample: Is this spam?
More informationINFO 4300 / CS4300 Information Retrieval. slides adapted from Hinrich Schütze s, linked from
INFO 4300 / CS4300 Information Retrieval slides adapted from Hinrich Schütze s, linked from http://informationretrieval.org/ IR 26/26: Feature Selection and Exam Overview Paul Ginsparg Cornell University,
More informationDistributed Rule-Based Inference in the Presence of Redundant Information
istribution Statement : roved for ublic release; distribution is unlimited. istributed Rule-ased Inference in the Presence of Redundant Information June 8, 004 William J. Farrell III Lockheed Martin dvanced
More informationFeedback-error control
Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller
More informationOutline. EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Simple Error Detection Coding
Outline EECS150 - Digital Design Lecture 26 Error Correction Codes, Linear Feedback Shift Registers (LFSRs) Error detection using arity Hamming code for error detection/correction Linear Feedback Shift
More informationEvaluation for sets of classes
Evaluaton for Tet Categorzaton Classfcaton accuracy: usual n ML, the proporton of correct decsons, Not approprate f the populaton rate of the class s low Precson, Recall and F 1 Better measures 21 Evaluaton
More informationNaïve Bayes, Maxent and Neural Models
Naïve Bayes, Maxent and Neural Models CMSC 473/673 UMBC Some slides adapted from 3SLP Outline Recap: classification (MAP vs. noisy channel) & evaluation Naïve Bayes (NB) classification Terminology: bag-of-words
More informationSolved Problems. (a) (b) (c) Figure P4.1 Simple Classification Problems First we draw a line between each set of dark and light data points.
Solved Problems Solved Problems P Solve the three simle classification roblems shown in Figure P by drawing a decision boundary Find weight and bias values that result in single-neuron ercetrons with the
More informationGeneral Linear Model Introduction, Classes of Linear models and Estimation
Stat 740 General Linear Model Introduction, Classes of Linear models and Estimation An aim of scientific enquiry: To describe or to discover relationshis among events (variables) in the controlled (laboratory)
More informationRadial Basis Function Networks: Algorithms
Radial Basis Function Networks: Algorithms Introduction to Neural Networks : Lecture 13 John A. Bullinaria, 2004 1. The RBF Maing 2. The RBF Network Architecture 3. Comutational Power of RBF Networks 4.
More informationAn AI-ish view of Probability, Conditional Probability & Bayes Theorem
An AI-ish view of Probability, Conditional Probability & Bayes Theorem Review: Uncertainty and Truth Values: a mismatch Let action A t = leave for airport t minutes before flight. Will A 15 get me there
More information10/18/2017. An AI-ish view of Probability, Conditional Probability & Bayes Theorem. Making decisions under uncertainty.
An AI-ish view of Probability, Conditional Probability & Bayes Theorem Review: Uncertainty and Truth Values: a mismatch Let action A t = leave for airport t minutes before flight. Will A 15 get me there
More informationAn Analysis of Reliable Classifiers through ROC Isometrics
An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit
More informationGenerative Models. CS4780/5780 Machine Learning Fall Thorsten Joachims Cornell University
Generative Models CS4780/5780 Machine Learning Fall 2012 Thorsten Joachims Cornell University Reading: Mitchell, Chapter 6.9-6.10 Duda, Hart & Stork, Pages 20-39 Bayes decision rule Bayes theorem Generative
More informationEconomics 101. Lecture 7 - Monopoly and Oligopoly
Economics 0 Lecture 7 - Monooly and Oligooly Production Equilibrium After having exlored Walrasian equilibria with roduction in the Robinson Crusoe economy, we will now ste in to a more general setting.
More informationNotes on Instrumental Variables Methods
Notes on Instrumental Variables Methods Michele Pellizzari IGIER-Bocconi, IZA and frdb 1 The Instrumental Variable Estimator Instrumental variable estimation is the classical solution to the roblem of
More informationGraphical Models (Lecture 1 - Introduction)
Grahical Models Lecture - Introduction Tibério Caetano tiberiocaetano.com Statistical Machine Learning Grou NICTA Canberra LLSS Canberra 009 Tibério Caetano: Grahical Models Lecture - Introduction / 7
More informationChapter 7 Sampling and Sampling Distributions. Introduction. Selecting a Sample. Introduction. Sampling from a Finite Population
Chater 7 and s Selecting a Samle Point Estimation Introduction to s of Proerties of Point Estimators Other Methods Introduction An element is the entity on which data are collected. A oulation is a collection
More informationLearning Sequence Motif Models Using Gibbs Sampling
Learning Sequence Motif Models Using Gibbs Samling BMI/CS 776 www.biostat.wisc.edu/bmi776/ Sring 2018 Anthony Gitter gitter@biostat.wisc.edu These slides excluding third-arty material are licensed under
More informationSTA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2