Manning & Schuetze, FSNLP (c)1999, 2001

…a decision tree that detects spam. Finding the right features is paramount for this task, so design your feature set carefully.

Exercise 16.4 [ ]
Another important application of text categorization is the detection of adult content, that is, content that is not appropriate for children because it is sexually explicit. Collect training and test sets of adult and non-adult material from the World Wide Web and build a decision tree that can block access to adult material.

Exercise 16.5 [ ]
Collect a reasonable amount of text written by yourself and by a friend. You may want to break up individual texts (e.g., term papers) into smaller pieces to get a large enough set. Build a decision tree that automatically determines whether you are the author of a piece of text. Note that it is often the little words that give an author away (for example, the relative frequencies of words like because or though).

Exercise 16.6 [ ]
Download a set of English and non-English texts from the World Wide Web or use some other multilingual source. Build a decision tree that can distinguish between English and non-English texts. (See also exercise 6.10.)

16.2 Maximum Entropy Modeling

Maximum entropy modeling is a framework for integrating information from many heterogeneous information sources for classification. The data for a classification problem is described as a (potentially large) number of features. These features can be quite complex and allow the experimenter to make use of prior knowledge about what types of information are expected to be important for classification. Each feature corresponds to a constraint on the model. We then compute the maximum entropy model, the model with maximum entropy of all the models that satisfy the constraints. This term may initially seem perverse, since we have spent most of the book trying to minimize the (cross) entropy of data according to models, but the idea is that we do not want to go beyond the data. If we chose a model with less entropy, we would add information to the model that is not justified by the empirical evidence available to us. Choosing the maximum entropy model is motivated by the desire to preserve as much uncertainty as possible.
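
Stated compactly as an optimization problem (a summary in standard notation, not a display from the book; the expectation constraints are made precise in equation (16.6) below), the chosen model is

    p* = argmax_p H(p) = argmax_p [ -∑_{x⃗,c} p(x⃗, c) log p(x⃗, c) ]
    subject to   E_p f_i = E_p̃ f_i   for i = 1, …, K,

that is, of all distributions that match the feature expectations observed in the training data, we pick the one with the highest entropy.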

We have simplified matters in this chapter by neglecting the problem of feature selection (we use the same 20 features throughout). In maximum entropy modeling, feature selection and training are usually integrated. Ideally, this enables us to specify all potentially relevant information at the beginning, and then to let the training procedure worry about how to come up with the best model for classification. We will only introduce the basic method here and refer the reader to the Further Reading for feature selection.

The features f_i are binary functions that can be used to characterize any property of a pair (x⃗, c), where x⃗ is a vector representing an input element (in our case the 20-dimensional vector of word weights representing an article as in table 16.3), and c is the class label (1 if the article is in the "earnings" category, 0 otherwise). For text categorization, we define features as follows:

(16.3)    f_i(x⃗_j, c) = { 1   if s_ij > 0 and c = 1
                          { 0   otherwise

Recall that s_ij is the term weight for word i in Reuters article j. Note that the use of binary features is different from the rest of this chapter: The other classifiers use the magnitude of the weight, not just the presence or absence of a word.⁵

5. The maximum entropy approach is not in principle limited to binary features. Generalized iterative scaling, which we introduce below, requires merely that weights are non-negative and that their sum is bounded. However, binary features have generally been employed because they improve the efficiency of the computationally intensive reestimation procedure.

For a given set of features, we first compute the expectation of each feature based on the training set. Each feature then defines the constraint that the expectation of the feature in our final maximum entropy model must be the same as this empirical expectation. Of all probability distributions that obey these constraints, we attempt to find the maximum entropy distribution, the one with the highest entropy. One can show that there exists a unique such maximum entropy distribution and there exists an algorithm, generalized iterative scaling, which is guaranteed to converge to it.

The model class for the particular variety of maximum entropy modeling that we introduce here is loglinear models of the following form:

(16.4)    p(x⃗, c) = (1/Z) ∏_{i=1}^{K} α_i^{f_i(x⃗, c)}

where K is the number of features, α_i is the weight for feature f_i, and Z is a normalizing constant (commonly called the "partition function"), used to ensure that a probability distribution results. To use the model for text categorization, we compute p(x⃗, 0) and p(x⃗, 1) and, in the simplest case, choose the class label with the greater probability.
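
To make equation (16.4) concrete, here is a minimal Python sketch that evaluates the loglinear model for a given weight vector. The function and variable names are my own, and the two features and weights correspond to the toy example worked out in table 16.6 below (log_2 α_1 = 2.0 and log_2 α_2 = 1.0, so α_1 = 4 and α_2 = 2):

    from math import prod

    def maxent_prob(alphas, features, x, c, event_space):
        """Evaluate equation (16.4): p(x, c) = (1/Z) * prod_i alpha_i ** f_i(x, c),
        where Z sums the unnormalized score over the whole event space."""
        def score(x_, c_):
            return prod(a ** f(x_, c_) for a, f in zip(alphas, features))
        Z = sum(score(x_, c_) for x_, c_ in event_space)
        return score(x, c) / Z

    # x is a single indicator for the word "profit"; c = 1 means topic "earnings".
    f1 = lambda x, c: 1 if (x == 1 and c == 1) else 0   # "profit" occurs in an "earnings" article
    f2 = lambda x, c: 1 - f1(x, c)                      # fires in all remaining cases
    alphas = [4.0, 2.0]
    space = [(x, c) for x in (0, 1) for c in (0, 1)]
    print(maxent_prob(alphas, [f1, f2], 1, 1, space))   # -> 0.4, as in table 16.6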

Note that, in this section, features contain information about the class of the object in addition to the measurements of the object we want to classify. Here, we are following most publications on maximum entropy modeling in defining "feature" in this sense. The more common use of the term feature (which we have adopted for the rest of the book) is that it only refers to some characteristic of the object, independent of the class the object is a member of.

Equation (16.4) defines a loglinear model because, if we take logs on both sides, then log p is a linear combination of the logs of the weights:

(16.5)    log p(x⃗, c) = -log Z + ∑_{i=1}^{K} f_i(x⃗, c) log α_i

Loglinear models are a general and very important class of models for classification with categorical variables. Other examples of the class are logistic regression (McCullagh and Nelder 1989), decomposable models (Bruce and Wiebe 1999), and the HMMs and PCFGs of earlier chapters. We introduce the maximum entropy modeling approach here because maximum entropy models have recently been widely used in Statistical NLP and because it is an application of the important maximum entropy principle.

16.2.1 Generalized iterative scaling

Generalized iterative scaling is a procedure for finding the maximum entropy distribution p of form (16.4) that obeys the following set of constraints:

(16.6)    E_p f_i = E_p̃ f_i

In other words, the expected value of f_i for p is the same as the expected value for the empirical distribution p̃ (in other words, for the training set). The algorithm requires that the sum of the features for each possible (x⃗, c) be equal to a constant C:⁶

(16.7)    ∀ x⃗, c:   ∑_i f_i(x⃗, c) = C

6. See Berger et al. (1996) for Improved Iterative Scaling, a variant of generalized iterative scaling that does not impose this constraint.
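
As a tiny worked instance of constraint (16.6) (my own illustration, using the same toy data that reappears in table 16.6 below): let x⃗ be a single indicator for the word profit, let f_1 fire only for (x = 1, c = 1), and let the training set be ((0,0), (0,1), (1,0), (1,1), (1,1)). Then

    E_p̃ f_1 = (0 + 0 + 0 + 1 + 1) / 5 = 0.4

so any model p that satisfies the constraints must have E_p f_1 = 0.4; among all such models, generalized iterative scaling finds the one of maximum entropy.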

In order to fulfill this requirement, we define C as the greatest possible feature sum (over all possible data, not just observed data):

(16.8)    C ≡ max_{x⃗,c} ∑_{i=1}^{K} f_i(x⃗, c)

and add a feature f_{K+1} that is defined as follows:

(16.9)    f_{K+1}(x⃗, c) = C - ∑_{i=1}^{K} f_i(x⃗, c)

Note that this feature is not binary, in contrast to the others.

E_p f_i is defined as (section 2.1.5):

(16.10)    E_p f_i = ∑_{x⃗,c} p(x⃗, c) f_i(x⃗, c)

where the sum is over the event space, that is, all possible vectors x⃗ and class labels c. The empirical expectation is easy to compute:

    E_p̃ f_i = ∑_{x⃗,c} p̃(x⃗, c) f_i(x⃗, c) = (1/N) ∑_{j=1}^{N} f_i(x⃗_j, c_j)

where N is the number of elements in the training set and we use the fact that the empirical probability for a pair that doesn't occur in the training set is 0.

In general, E_p f_i cannot be computed efficiently for the maximum entropy distribution, since it would involve summing over all possible combinations of x⃗ and c, a huge or infinite set. Instead, people use the following approximation, where only empirically observed x⃗ are considered (Lau 1994: 25):

    E_p f_i ≈ ∑_{x⃗,c} p̃(x⃗) p(c | x⃗) f_i(x⃗, c) = (1/N) ∑_{j=1}^{N} ∑_c p(c | x⃗_j) f_i(x⃗_j, c)

where c still ranges over all possible classes, in our case c ∈ {0, 1}.
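
The quantities just defined translate almost directly into code. The following Python sketch is illustrative only (the function names and the representation of features as callables taking a pair (x, c) are my own choices, not the book's):

    from math import prod

    def add_filler_feature(features, event_space):
        """Add f_{K+1} of equation (16.9) so that every (x, c) has feature sum C, equation (16.8).
        event_space should enumerate all possible (x, c) pairs; for the K binary word
        features of equation (16.3), C is simply K."""
        C = max(sum(f(x, c) for f in features) for x, c in event_space)
        def filler(x, c, _fs=tuple(features)):
            return C - sum(f(x, c) for f in _fs)
        return features + [filler], C

    def empirical_expectations(features, training_set):
        """E_p~ f_i = (1/N) * sum_j f_i(x_j, c_j)."""
        N = len(training_set)
        return [sum(f(x, c) for x, c in training_set) / N for f in features]

    def approx_model_expectations(features, alphas, training_set, classes=(0, 1)):
        """Approximate E_p f_i by summing only over the observed x_j (Lau 1994):
        E_p f_i ~= (1/N) * sum_j sum_c p(c | x_j) f_i(x_j, c)."""
        N = len(training_set)
        def p_cond(x):
            # p(c | x) under the current weights; the global normalizer cancels here.
            scores = {c: prod(a ** f(x, c) for a, f in zip(alphas, features)) for c in classes}
            total = sum(scores.values())
            return {c: s / total for c, s in scores.items()}
        sums = [0.0] * len(features)
        for x, _ in training_set:
            p_c = p_cond(x)
            for i, f in enumerate(features):
                sums[i] += sum(p_c[c] * f(x, c) for c in classes)
        return [s / N for s in sums]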

Now we have all the pieces to state the generalized iterative scaling algorithm:

1. Initialize {α_i^(1)}. Any initialization will do, but usually we choose α_i^(1) = 1, 1 ≤ i ≤ K + 1. Compute E_p̃ f_i as shown above. Set n = 1.

2. Compute p^(n)(x⃗, c) for the distribution p^(n) given by the {α_i^(n)} for each element (x⃗, c) in the training set:

(16.11)    p^(n)(x⃗, c) = (1/Z) ∏_{i=1}^{K+1} (α_i^(n))^{f_i(x⃗, c)},    where  Z = ∑_{x⃗,c} ∏_{i=1}^{K+1} (α_i^(n))^{f_i(x⃗, c)}

3. Compute E_{p^(n)} f_i for all 1 ≤ i ≤ K + 1 according to equation (16.10).

4. Update the parameters α_i:

(16.12)    α_i^(n+1) = α_i^(n) ( E_p̃ f_i / E_{p^(n)} f_i )^{1/C}

5. If the parameters of the procedure have converged, stop; otherwise increment n and go to 2.

We present the algorithm in this form for readability. In an actual implementation, it is more convenient to do the computations using logarithms. One can show that this procedure converges to a distribution p that obeys the constraints (16.6), and that of all such distributions it is the one that maximizes the entropy H(p) and the likelihood of the data. Darroch and Ratcliff (1972) show that this distribution always exists and is unique.

A toy example of a maximum entropy distribution that generalized iterative scaling will converge to is shown in table 16.6.

  x⃗ (profit)   c (earnings)   f_1   f_2   β = f_1 log α_1 + f_2 log α_2   2^β
  (0)          0              0     1     1                               2
  (0)          1              0     1     1                               2
  (1)          0              0     1     1                               2
  (1)          1              1     0     2                               4

Table 16.6 An example of a maximum entropy distribution in the form of equation (16.4). The vector x⃗ consists of a single element, indicating the presence or absence of the word profit in the article. There are two classes (member of "earnings" or not). Feature f_1 is 1 if and only if the article is in "earnings" and profit occurs. f_2 is the "filler" feature f_{K+1}. For one particular choice of the parameters, namely log α_1 = 2.0 and log α_2 = 1.0, we get after normalization (Z = 2 + 2 + 2 + 4 = 10) the following maximum entropy distribution: p(0,0) = p(0,1) = p(1,0) = 2/Z = 0.2 and p(1,1) = 4/Z = 0.4. An example of a data set with the same empirical distribution is ((0,0), (0,1), (1,0), (1,1), (1,1)).
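
To see the whole procedure at work, here is a small self-contained Python sketch that runs generalized iterative scaling on exactly this toy problem. All names are illustrative; a fixed iteration count stands in for the convergence test of step 5, and, because the event space has only four points, step 3 computes E_{p^(n)} f_i exactly from equation (16.10) rather than with the approximation given earlier:

    from math import prod

    # Toy problem of table 16.6: x indicates the word "profit", c = 1 means "earnings".
    # The training data is the five-pair set listed in the caption of table 16.6.
    data = [(0, 0), (0, 1), (1, 0), (1, 1), (1, 1)]
    N = len(data)
    space = [(x, c) for x in (0, 1) for c in (0, 1)]    # the full (tiny) event space

    f1 = lambda x, c: 1 if (x == 1 and c == 1) else 0   # "profit" occurs in an "earnings" article
    C = 1                                               # greatest feature sum, equation (16.8)
    f2 = lambda x, c: C - f1(x, c)                      # filler feature f_{K+1}, equation (16.9)
    features = [f1, f2]

    emp = [sum(f(x, c) for x, c in data) / N for f in features]   # E_p~ f_i = [0.4, 0.6]

    alphas = [1.0, 1.0]                                 # step 1: alpha_i^(1) = 1
    for _ in range(100):                                # steps 2-5
        scores = {xc: prod(a ** f(*xc) for a, f in zip(alphas, features)) for xc in space}
        Z = sum(scores.values())
        p = {xc: s / Z for xc, s in scores.items()}                       # step 2, eq. (16.11)
        model = [sum(p[xc] * f(*xc) for xc in space) for f in features]   # step 3, eq. (16.10)
        alphas = [a * (e / m) ** (1.0 / C)                                # step 4, eq. (16.12)
                  for a, e, m in zip(alphas, emp, model)]

    print({xc: round(prob, 3) for xc, prob in p.items()})
    # -> {(0, 0): 0.2, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 0.4}, the distribution of table 16.6

In this particular toy case the update reaches the distribution of table 16.6 after a single iteration and then leaves it unchanged, which is the fixed-point behavior that exercise 16.8 below asks you to verify by hand.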

Exercise 16.7 [ ]
What are the classification decisions for the distribution in table 16.6? Compute P(earnings | profit) and P(earnings | ¬profit).

                        Is topic "earnings"?
  Does profit occur?    YES     NO
  YES                    20      9
  NO                      8     13

Table 16.7 An empirical distribution whose corresponding maximum entropy distribution is the one in table 16.6.

Exercise 16.8 [ ]
Show that the distribution in table 16.6 is a fixed point for generalized iterative scaling. That is, computing one iteration should leave the distribution unchanged.

Exercise 16.9 [ ]
Consider the distribution in table 16.7. Show that for the features defined in table 16.6, this distribution has the same feature expectations E_p̃ f_i as the one in table 16.6.

Exercise 16.10 [ ]
Compute a number of iterations of generalized iterative scaling for the data in table 16.7 (using the features defined in table 16.6). The procedure should converge towards the distribution in table 16.6.

Exercise 16.11 [ ]
Select one of exercises 16.3 through 16.6 and build a maximum entropy model for the corresponding text categorization task.

16.2.2 Application to text categorization

We have already suggested how to define appropriate features for text categorization in equation (16.3). For the task of identifying Reuters "earnings" articles we end up with 20 features, each corresponding to one of the selected words, and the f_{K+1} feature introduced at the start of the last subsection, defined so that the features add up to C = 20. We trained on the 9603 training set articles. Table 16.8 shows the weights found by generalized iterative scaling after convergence (500 iterations). The features with the highest weights are cts, profit, net and loss. If we use P(earnings | x⃗) > P(¬earnings | x⃗) as our decision rule, we get the classification results in table 16.9. Classification accuracy is 96.2%.
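
Under equations (16.3) and (16.4), this decision rule reduces to comparing two unnormalized scores, since the normalizer Z (and p(x⃗) itself) is the same for both classes. A minimal Python sketch, with illustrative names; in an actual run the weights would be the trained α_i of table 16.8 below:

    from math import prod

    def classify(word_present, alphas, alpha_filler, C=20):
        """Decide "earnings" vs. not for one article under equation (16.4).
        word_present[i] is True if word i occurs in the article (s_ij > 0),
        alphas[i] is the trained weight alpha_i for that word's feature, and
        alpha_filler is alpha_{K+1}."""
        k = sum(word_present)
        # c = 1: the k matching word features fire and the filler contributes C - k.
        score_earn = (prod(a for a, present in zip(alphas, word_present) if present)
                      * alpha_filler ** (C - k))
        # c = 0: by equation (16.3) no word feature fires, so the filler contributes C.
        score_not = alpha_filler ** C
        return "earnings" if score_earn > score_not else "not earnings"

Taking logarithms of the two scores turns this comparison into a linear test over the word indicators, which is the "linear separator" view of the model discussed at the end of this section.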

  Word w_i    Feature weight α_i    log_e α_i
  vs          2.696                  0.992
  mln         1.079                  0.076
  cts         12.303                 2.510
  ;           0.448                 -0.803
  &           0.450                 -0.798
  000         0.756                 -0.280
  loss        4.032                  1.394
  '           0.993                 -0.007
  "           1.502                  0.407
  3           0.435                 -0.832
  profit      9.701                  2.272
  dlrs        0.678                 -0.388
  1           1.193                  0.177
  pct         0.590                 -0.528
  is          0.418                 -0.871
  s           0.359                 -1.025
  that        0.703                 -0.352
  net         6.155                  1.817
  lt          3.566                  1.271
  at          0.490                 -0.713
  f_{K+1}     0.967                 -0.034

Table 16.8 Feature weights in maximum entropy modeling for the category "earnings" in Reuters.

                          "earnings" correct?
  "earnings" assigned?    YES      NO
  YES                     1014     53
  NO                        73     2159

Table 16.9 Classification results for the distribution corresponding to table 16.8 on the test set. Classification accuracy is 96.2%.
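
As a quick check, the reported accuracy follows directly from the counts in table 16.9:

    accuracy = (1014 + 2159) / (1014 + 53 + 73 + 2159) = 3173 / 3299 ≈ 0.962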

An important question in an implementation is when to stop the iteration. One way to test for convergence is to compute the log difference between empirical and estimated feature expectations (log E_p̃ f_i - log E_{p^(n)} f_i), which should approach zero. Ristad (1996) recommends also looking at the largest α_i when doing iterative scaling. If the largest weight becomes too large, then this indicates a problem with either the data representation or the implementation.

When is the maximum entropy framework presented in this section appropriate as a classification method? A characteristic of the maximum entropy systems that are currently practical for large data sets is the restriction to binary features. This is a shortcoming in some situations. In text categorization, we often need a notion of strength of evidence which goes beyond simply recording presence or absence of evidence. But this does not appear to have hurt us much here. (This perhaps partly reflects how easy this classification problem is: Simply classifying based on whether the cts feature is non-zero yields an accuracy of 91.2%.) Generalized iterative scaling can also be computationally expensive due to slow convergence (but see (Lau 1994) for suggestions for speeding up convergence).

For binary classification, the loglinear model defines a linear separator that is in principle no more powerful than Naive Bayes or linear regression, classifiers that can be trained more efficiently. However, it is important to stress that, apart from the theoretical power of a classification method, the training procedure is crucial. Unlike Naive Bayes, generalized iterative scaling takes dependence between features into account: if one duplicated a feature, then the weight of each instance of it would be halved. If feature dependence is not expected to be a problem, then Naive Bayes is a better choice than maximum entropy modeling.

Finally, the lack of smoothing can also cause problems. For example, if we have a feature that always predicts a certain class, then this feature may get an excessively high weight. One way to deal with this is to smooth the empirical data by adding events that did not occur. In practice, features that occur fewer than five times are usually eliminated.

One of the strengths of maximum entropy modeling is that it offers a framework for specifying all possibly relevant information. The attraction of the method lies in the fact that arbitrarily complex features can be defined if the experimenter believes that these features may contribute useful information for the classification decision. For example, Berger et al. (1996: 57) define a feature for the translation of the preposition in from English to French that is 1 if and only if in is translated as pendant and in is followed by the word weeks within three words. There is also no need to worry about heterogeneity of features or weighting fea-