Maxent Models and Discriminative Estimation


1 Maxent Models and Discriminative Estimation
Generative vs. Discriminative models
Christopher Manning

2 Christopher Manning
Introduction
So far we've looked at generative models: language models, Naive Bayes. But there is now much use of conditional or discriminative probabilistic models in NLP, speech, IR (and ML generally), because:
- They give high accuracy performance.
- They make it easy to incorporate lots of linguistically important features.
- They allow automatic building of language-independent, retargetable NLP modules.

3 Christopher Manning
Joint vs. Conditional Models
We have some data {(d, c)} of paired observations d and hidden classes c.
Joint (generative) models place probabilities over both the observed data and the hidden stuff, and generate the observed data from the hidden stuff: P(c, d).
All the classic StatNLP models are of this type: n-gram models, Naive Bayes classifiers, hidden Markov models, probabilistic context-free grammars, IBM machine translation alignment models.

4 Christopher Manning
Joint vs. Conditional Models
Discriminative (conditional) models take the data as given, and put a probability over hidden structure given the data: P(c | d).
Examples: logistic regression, conditional loglinear or maximum entropy models, conditional random fields.
Also, SVMs, (averaged) perceptron, etc. are discriminative classifiers, but not directly probabilistic.

5 Christopher Manning
Bayes Net/Graphical Models
Bayes net diagrams draw circles for random variables and lines for direct dependencies. Some variables are observed; some are hidden. Each node is a little classifier (conditional probability table) based on incoming arcs.
[Diagrams: c -> d1, d2, d3 is Naive Bayes (generative); d1, d2, d3 -> c is logistic regression (discriminative).]

6 Christopher Manning
Conditional vs. Joint Likelihood
A joint model gives probabilities P(d, c) and tries to maximize this joint likelihood. It turns out to be trivial to choose weights: just relative frequencies.
A conditional model gives probabilities P(c | d). It takes the data as given and models only the conditional probability of the class. We seek to maximize conditional likelihood. This is harder to do (as we'll see), but more closely related to classification error.

7 Christopher Manning
Conditional models work well: Word Sense Disambiguation
Even with exactly the same features, changing from joint to conditional estimation increases performance. That is, we use the same smoothing and the same word-class features; we just change the numbers (parameters).
[Table: training-set and test-set accuracy under the joint-likelihood vs. conditional-likelihood objectives. Klein and Manning (2002), using Senseval-1 data.]

8 Maxent Models and Discriminative Estimation
Generative vs. Discriminative models
Christopher Manning

9 Discriminative Model Features
Making features from text for discriminative NLP models
Christopher Manning

10 Christopher Manning
Features
In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict.
A feature is a function with a bounded real value: f: C × D → ℝ

11 Christopher Manning
Features
In these slides and most maxent work: features f are elementary pieces of evidence that link aspects of what we observe d with a category c that we want to predict.
A feature is a function with a bounded real value.

12 Christopher Manning
Example features
f1(c, d) ≡ [c = LOCATION ∧ w−1 = "in" ∧ isCapitalized(w)]
f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]
Examples: LOCATION (in Arcadia), LOCATION (in Québec), DRUG (taking Zantac), PERSON (saw Sue).
Models will assign to each feature a weight:
- A positive weight votes that this configuration is likely correct.
- A negative weight votes that this configuration is likely incorrect.

13 Christopher Manning
Example features
f1(c, d) ≡ [c = LOCATION ∧ w−1 = "in" ∧ isCapitalized(w)]
f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]
Examples: LOCATION (in Arcadia), LOCATION (in Québec), DRUG (taking Zantac), PERSON (saw Sue).
Models will assign to each feature a weight:
- A positive weight votes that this configuration is likely correct.
- A negative weight votes that this configuration is likely incorrect.
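To make the indicator idea concrete, here is a minimal Python sketch (not the lecture's code) of the three example features as functions over a (class, datum) pair. The datum representation and the helper predicates are illustrative assumptions.

```python
import unicodedata

def is_capitalized(w):
    return w[:1].isupper()

def has_accented_latin_char(w):
    # A character counts as accented if it canonically decomposes
    # into a base letter plus combining marks (e.g. "é").
    return any(unicodedata.decomposition(ch) for ch in w)

def f1(c, d):  # [c = LOCATION and w-1 = "in" and isCapitalized(w)]
    return int(c == "LOCATION" and d["prev"] == "in" and is_capitalized(d["word"]))

def f2(c, d):  # [c = LOCATION and hasAccentedLatinChar(w)]
    return int(c == "LOCATION" and has_accented_latin_char(d["word"]))

def f3(c, d):  # [c = DRUG and ends(w, "c")]
    return int(c == "DRUG" and d["word"].endswith("c"))

d = {"word": "Québec", "prev": "in"}
print(f1("LOCATION", d), f2("LOCATION", d), f3("DRUG", d))  # -> 1 1 1
```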

14 Christopher Manning
Feature Expectations
We will crucially make use of two expectations: actual or predicted counts of a feature firing.
Empirical count (expectation) of a feature:
$E_{\text{empirical}}(f_i) = \sum_{(c,d) \in \text{observed}(C,D)} f_i(c,d)$
Model expectation of a feature:
$E(f_i) = \sum_{(c,d) \in (C,D)} P(c,d)\, f_i(c,d)$
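A small sketch of these two expectations, assuming features are Python functions as in the previous sketch. For the model expectation, the sum is taken, as is common in practice, over the observed d's with a hypothetical conditional model p_c_given_d.

```python
def empirical_expectation(f, data):
    # data: list of observed (c, d) pairs
    return sum(f(c, d) for c, d in data)

def model_expectation(f, data, classes, p_c_given_d):
    # Sum of P(c' | d) * f(c', d) over observed d's and all classes c'.
    return sum(p_c_given_d(c2, d) * f(c2, d)
               for _, d in data for c2 in classes)
```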

15 Christopher Manning
Features
In NLP uses, usually a feature specifies (1) an indicator function, a yes/no boolean matching function, of properties of the input, and (2) a particular class:
f_i(c, d) ≡ [Φ(d) ∧ c = c_j]   [value is 0 or 1]
Each feature picks out a data subset and suggests a label for it.
We will say that Φ(d) is a feature of the data d when, for each c_j, the conjunction Φ(d) ∧ c = c_j is a feature of the data-class pair (c, d).

16 Christopher Manning
Features
In NLP uses, usually a feature specifies (1) an indicator function, a yes/no boolean matching function, of properties of the input, and (2) a particular class:
f_i(c, d) ≡ [Φ(d) ∧ c = c_j]   [value is 0 or 1]
Each feature picks out a data subset and suggests a label for it.

17 Christopher Manning
Feature-Based Models
The decision about a data point is based only on the features active at that point.
Text Categorization. Data: "... BUSINESS: Stocks hit a yearly low ..." Label: BUSINESS. Features: {..., stocks, hit, a, yearly, low, ...}
Word-Sense Disambiguation. Data: "... to restructure bank:MONEY debt." Label: MONEY. Features: {..., w−1 = restructure, w+1 = debt, L = 12, ...}
POS Tagging. Data: "DT JJ NN ... The previous fall ..." Label: NN. Features: {w = fall, t−1 = JJ, w−1 = previous}

18 Christopher Manning
Example: Text Categorization
(Zhang and Oles 2001)
Features are presence of each word in a document and the document class (they do feature selection to use reliable indicator words).
Tests on classic Reuters data set (and others):
Naïve Bayes: 77.0% F1
Linear regression: 86.0%
Logistic regression: 86.4%
Support vector machine: 86.5%
The paper emphasizes the importance of regularization (smoothing) for successful use of discriminative methods, which was not used in much early NLP/IR work.

19 Christopher Manning
Other Maxent Classifier Examples
You can use a maxent classifier whenever you want to assign data points to one of a number of classes:
- Sentence boundary detection (Mikheev 2000): Is a period the end of a sentence or an abbreviation?
- Sentiment analysis (Pang and Lee 2002): word unigrams, bigrams, POS counts, ...
- PP attachment (Ratnaparkhi 1998): Attach to verb or noun? Features of head noun, preposition, etc.
- Parsing decisions in general (Ratnaparkhi 1997; Johnson et al.; etc.)

20 Discriminative Model Features
Making features from text for discriminative NLP models
Christopher Manning

21 Feature-Based Linear Classifiers
How to put features into a classifier

22 Christopher Manning
Feature-Based Linear Classifiers
Linear classifiers at classification time:
- Linear function from feature sets {f_i} to classes {c}.
- Assign a weight λ_i to each feature f_i.
- We consider each class for an observed datum d.
- For a pair (c, d), features vote with their weights: vote(c) = Σ λ_i f_i(c, d).
Example classes for "in Québec": PERSON, LOCATION, DRUG.
- Choose the class c which maximizes Σ λ_i f_i(c, d).

23 Christopher Manning
Feature-Based Linear Classifiers
Linear classifiers at classification time:
- Linear function from feature sets {f_i} to classes {c}.
- Assign a weight λ_i to each feature f_i.
- We consider each class for an observed datum d.
- For a pair (c, d), features vote with their weights: vote(c) = Σ λ_i f_i(c, d).
For "in Québec": vote(PERSON) = 0, vote(LOCATION) = 1.8 - 0.6 = 1.2, vote(DRUG) = 0.3.
- Choose the class c which maximizes Σ λ_i f_i(c, d) = LOCATION.

24 Christopher Manning
Feature-Based Linear Classifiers
There are many ways to choose weights for features:
- Perceptron: find a currently misclassified example, and nudge the weights in the direction of its correct classification.
- Margin-based methods (Support Vector Machines).

25 Christopher Manning
Feature-Based Linear Classifiers
Exponential (log-linear, maxent, logistic, Gibbs) models: make a probabilistic model from the linear combination Σ λ_i f_i(c, d):

$P(c \mid d, \lambda) = \dfrac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$

exp makes the votes positive; the sum over c' normalizes the votes.

P(LOCATION | in Québec) = e^{1.8} e^{-0.6} / (e^{1.8} e^{-0.6} + e^{0.3} + e^{0}) = 0.586
P(DRUG | in Québec) = e^{0.3} / (e^{1.8} e^{-0.6} + e^{0.3} + e^{0}) = 0.238
P(PERSON | in Québec) = e^{0} / (e^{1.8} e^{-0.6} + e^{0.3} + e^{0}) = 0.176

The weights are the parameters of the probability model, combined via a "soft max" function.
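These numbers can be checked directly. A minimal sketch, hard-coding the vote totals implied by the three weights above:

```python
from math import exp

votes = {"LOCATION": 1.8 - 0.6,  # f1 and f2 both fire on "in Québec"
         "DRUG": 0.3,            # f3 fires ("Québec" ends in "c")
         "PERSON": 0.0}          # no feature fires

Z = sum(exp(v) for v in votes.values())   # normalizer
for c, v in votes.items():                # soft max of the votes
    print(c, round(exp(v) / Z, 3))
# LOCATION 0.586, DRUG 0.238, PERSON 0.176
```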

26 Christopher Manning
Feature-Based Linear Classifiers
Exponential (log-linear, maxent, logistic, Gibbs) models:
Given this model form, we will choose parameters {λ_i} that maximize the conditional likelihood of the data according to this model.
We construct not only classifications, but probability distributions over classifications.
There are other (good!) ways of discriminating classes (SVMs, boosting, even perceptrons), but these methods are not as trivial to interpret as distributions over classes.

27 Christopher Manning
Aside: logistic regression
Maxent models in NLP are essentially the same as multiclass logistic regression models in statistics (or machine learning).
If you haven't seen these before, don't worry, this presentation is self-contained! If you have seen these before, you might think about:
- The parameterization is slightly different, in a way that is advantageous for NLP-style models with tons of sparse features, but statistically inelegant.
- The key role of feature functions in NLP and in this presentation.
- The features are more general, with f also being a function of the class. When might this be useful?

28 Christopher Manning
Quiz Question
Assuming exactly the same setup (3-class decision: LOCATION, PERSON, or DRUG; 3 features as before), maxent, what are:
P(PERSON | by Goéric) = ?
P(LOCATION | by Goéric) = ?
P(DRUG | by Goéric) = ?
1.8  f1(c, d) ≡ [c = LOCATION ∧ w−1 = "in" ∧ isCapitalized(w)]
-0.6 f2(c, d) ≡ [c = LOCATION ∧ hasAccentedLatinChar(w)]
0.3  f3(c, d) ≡ [c = DRUG ∧ ends(w, "c")]

$P(c \mid d, \lambda) = \dfrac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$

29 Feature-Based Linear Classifiers
How to put features into a classifier

30 Building a Maxent Model
The nuts and bolts

31 Christopher Manning
Building a Maxent Model
We define features (indicator functions) over data points. Features represent sets of data points which are distinctive enough to deserve model parameters: words, but also "word contains number", "word ends with ing", etc.
We will simply encode each Φ feature as a unique String. A datum will give rise to a set of Strings: the active Φ features. Each feature f_i(c, d) ≡ [Φ(d) ∧ c = c_j] gets a real number weight.
We concentrate on Φ features, but the math uses i indices of f_i.
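A sketch of this String-encoding bookkeeping; the extractor and its feature names are invented for illustration.

```python
feature_index = {}   # (phi_string, class) -> integer index i

def index_of(phi, c, add=True):
    key = (phi, c)
    if key not in feature_index and add:
        feature_index[key] = len(feature_index)
    return feature_index.get(key, -1)

def active_phi_features(d):
    # Illustrative extractor: a datum gives rise to a set of Strings.
    w = d["word"]
    feats = {"word=" + w}
    if any(ch.isdigit() for ch in w):
        feats.add("word-contains-number")
    if w.endswith("ing"):
        feats.add("word-ends-with-ing")
    return feats

print([index_of(phi, "VERB") for phi in active_phi_features({"word": "running"})])
```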

32 Christopher Manning
Building a Maxent Model
Features are often added during model development to target errors. Often, the easiest thing to think of are features that mark bad combinations.
Then, for any given feature weights, we want to be able to calculate:
- Data conditional likelihood.
- Derivative of the likelihood wrt each feature weight. This uses expectations of each feature according to the model.
We can then find the optimum feature weights (discussed later).

33 Building a Maxent Model
The nuts and bolts

34 Naive Bayes vs. Maxent models
Generative vs. Discriminative models: the problem of overcounting evidence
Christopher Manning

35 Christopher Manning
Text classification: Asia or Europe
[Training data: Europe documents containing "Monaco"; Asia documents containing "Hong Kong" and "Monaco". NB model with a class node and word feature X1 = M; the slide works out the NB factors P(A), P(E), P(M|A), P(M|E) and the predictions P(A,M), P(E,M), P(A|M), P(E|M).]

36 Christopher Manning
Text classification: Asia or Europe
[Same training data; NB model with word features X1 = H, X2 = K. The slide works out the NB factors P(A), P(E), P(H|A), P(K|A), P(H|E), P(K|E) and the predictions P(A,H,K), P(E,H,K), P(A|H,K), P(E|H,K).]

37 Christopher Manning
Text classification: Asia or Europe
[Same training data; NB model with word features H, K, M. The slide works out the NB factors P(A), P(E), P(M|A), P(M|E), P(H|A), P(K|A), P(H|E), P(K|E) and the predictions P(A,H,K,M), P(E,H,K,M), P(A|H,K,M), P(E|H,K,M).]

38 Christopher Manning
Naive Bayes vs. Maxent Models
Naive Bayes models multi-count correlated evidence: each feature is multiplied in, even when you have multiple features telling you the same thing.
Maximum Entropy models (pretty much) solve this problem. As we will see, this is done by weighting features so that model expectations match the observed (empirical) expectations.

39 Naive Bayes vs. Maxent models
Generative vs. Discriminative models: the problem of overcounting evidence
Christopher Manning

40 Maxent Models and Discriminative Estimation
Maximizing the likelihood

41 Christopher Manning
Exponential Model Likelihood
Maximum (Conditional) Likelihood Models: given a model form, choose values of parameters to maximize the (conditional) likelihood of the data.

$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log \dfrac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$

42 Christopher Manning
The Likelihood Value
The (log) conditional likelihood of iid data (C, D) according to a maxent model is a function of the data and the parameters λ:

$\log P(C \mid D, \lambda) = \log \prod_{(c,d) \in (C,D)} P(c \mid d, \lambda) = \sum_{(c,d) \in (C,D)} \log P(c \mid d, \lambda)$

If there aren't many values of c, it's easy to calculate:

$\log P(C \mid D, \lambda) = \sum_{(c,d) \in (C,D)} \log \dfrac{\exp \sum_i \lambda_i f_i(c,d)}{\sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}$

43 Christopher Manning
The Likelihood Value
We can separate this into two components:

$\log P(C \mid D, \lambda) = \underbrace{\sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c,d)}_{N(\lambda)} - \underbrace{\sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}_{M(\lambda)}$

The derivative is the difference between the derivatives of each component.

44 Christopher Manning
The Derivative I: Numerator

$\dfrac{\partial N(\lambda)}{\partial \lambda_i} = \dfrac{\partial \sum_{(c,d) \in (C,D)} \log \exp \sum_i \lambda_i f_i(c,d)}{\partial \lambda_i} = \dfrac{\partial \sum_{(c,d) \in (C,D)} \sum_i \lambda_i f_i(c,d)}{\partial \lambda_i} = \sum_{(c,d) \in (C,D)} f_i(c,d)$

The derivative of the numerator is the empirical count(f_i, C).

45 Christopher Manning
The Derivative II: Denominator

$\dfrac{\partial M(\lambda)}{\partial \lambda_i} = \dfrac{\partial \sum_{(c,d) \in (C,D)} \log \sum_{c'} \exp \sum_i \lambda_i f_i(c',d)}{\partial \lambda_i}$
$= \sum_{(c,d) \in (C,D)} \dfrac{1}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)} \sum_{c'} \dfrac{\partial \exp \sum_i \lambda_i f_i(c',d)}{\partial \lambda_i}$
$= \sum_{(c,d) \in (C,D)} \sum_{c'} \dfrac{\exp \sum_i \lambda_i f_i(c',d)}{\sum_{c''} \exp \sum_i \lambda_i f_i(c'',d)} \, f_i(c',d)$
$= \sum_{(c,d) \in (C,D)} \sum_{c'} P(c' \mid d, \lambda) \, f_i(c',d)$

The derivative of the denominator is the predicted count(f_i, λ).

46 Christopher Manning
The Derivative III

$\dfrac{\partial \log P(C \mid D, \lambda)}{\partial \lambda_i} = \text{actual count}(f_i, C) - \text{predicted count}(f_i, \lambda)$

The optimum parameters are the ones for which each feature's predicted expectation equals its empirical expectation. The optimum distribution is:
- Always unique (but parameters may not be unique).
- Always exists (if feature counts are from actual data).
These models are also called maximum entropy models because we find the model having maximum entropy and satisfying the constraints: $E_p(f_j) = E_{\tilde{p}}(f_j), \; \forall j$

47 Christopher Manning
Finding the optimal parameters
We want to choose parameters λ1, λ2, λ3, ... that maximize the conditional log-likelihood of the training data:

$\text{CLogLik}(D) = \sum_{i=1}^{n} \log P(c_i \mid d_i)$

To be able to do that, we've worked out how to calculate the function value and its partial derivatives (its gradient).
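Putting this section together, a sketch of CLogLik and its gradient (empirical count minus predicted count), assuming the features of a (c, d) pair are available as a dense NumPy vector fvec(c, d):

```python
import numpy as np

def log_probs(lam, d, classes, fvec):
    # log P(c | d, lambda) for every class, computed stably
    scores = np.array([fvec(c, d) @ lam for c in classes])
    scores -= scores.max()
    return scores - np.log(np.exp(scores).sum())

def clog_lik_and_grad(lam, data, classes, fvec):
    ll, grad = 0.0, np.zeros_like(lam)
    for c, d in data:
        lp = log_probs(lam, d, classes, fvec)
        ll += lp[classes.index(c)]
        grad += fvec(c, d)                   # empirical count
        p = np.exp(lp)
        for j, c2 in enumerate(classes):     # minus predicted count
            grad -= p[j] * fvec(c2, d)
    return ll, grad
```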

48 Christopher Manning
A likelihood surface

49 Christopher Manning
Finding the optimal parameters
Use your favorite numerical optimization package. Commonly (and in our code), you minimize the negative of CLogLik.
1. Gradient descent (GD); stochastic gradient descent (SGD)
2. Iterative proportional fitting methods: Generalized Iterative Scaling (GIS) and Improved Iterative Scaling (IIS)
3. Conjugate gradient (CG), perhaps with preconditioning
4. Quasi-Newton methods: limited memory variable metric (LMVM) methods, in particular, L-BFGS
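As an example of option 4, a sketch that hands the negated objective and gradient from the previous sketch to SciPy's L-BFGS implementation (a stand-in for, not a copy of, the lecture's own code):

```python
import numpy as np
from scipy.optimize import minimize

def train(data, classes, fvec, dim):
    def neg(lam):
        ll, grad = clog_lik_and_grad(lam, data, classes, fvec)
        return -ll, -grad                  # minimize the negative of CLogLik
    result = minimize(neg, np.zeros(dim), jac=True, method="L-BFGS-B")
    return result.x                        # the fitted weights lambda
```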

50 Maxent Models and Discriminative Estimation
Maximizing the likelihood

51 Maxent Models and Discriminative Estimation
The maximum entropy model presentation

52 Christopher Manning
Maximum Entropy Models
An equivalent approach: lots of distributions out there, most of them very spiked, specific, overfit. We want a distribution which is uniform except in specific ways we require. Uniformity means high entropy; we can search for distributions which have the properties we desire, but also have high entropy.
"Ignorance is preferable to error and he is less remote from the truth who believes nothing than he who believes what is wrong." Thomas Jefferson (1781)

53 Christopher Manning
Maximum Entropy
Entropy: the uncertainty of a distribution.
Quantifying uncertainty ("surprise"): for an event x with probability $p_x$, the surprise is $\log(1/p_x)$.
Entropy: expected surprise (over p):

$H(p) = E_p\!\left[\log_2 \frac{1}{p_x}\right] = -\sum_x p_x \log_2 p_x$

[Figure: H(p) for a coin flip as a function of P(HEADS).] A coin flip is most uncertain for a fair coin.
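A quick numeric check of the coin-flip claim:

```python
from math import log2

def coin_entropy(p):
    # expected surprise, in bits, for P(HEADS) = p
    return sum(-q * log2(q) for q in (p, 1 - p) if q > 0)

for p in (0.1, 0.3, 0.5, 0.9):
    print(p, round(coin_entropy(p), 3))   # maximal (1 bit) at p = 0.5
```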

54 Christopher Manning
Maxent Examples I
What do we want from a distribution?
- Minimize commitment = maximize entropy.
- Resemble some reference distribution (data).
Solution: maximize entropy H, subject to feature-based constraints: $E_p[f_i] = E_{\hat{p}}[f_i]$.
[Figure: unconstrained maximum at 0.5; with the constraint that p(HEADS) = 0.3.]
Adding constraints (features):
- Lowers maximum entropy
- Raises maximum likelihood of data
- Brings the distribution further from uniform
- Brings the distribution closer to data

55 Christopher Manning
Maxent Examples II
[Figure: H(p_H, p_T) with the constraints p_H + p_T = 1 and p_H = 0.3; the curve -x log x peaks at 1/e.]

56 Christopher Manning
Maxent Examples III
Let's say we have the following event space:
NN   NNS  NNP  NNPS  VBZ  VBD
and the following empirical data:
3    5    11   13    3    1
Maximize H:
1/e  1/e  1/e  1/e   1/e  1/e
...but we want probabilities: E[p_NN + p_NNS + p_NNP + p_NNPS + p_VBZ + p_VBD] = 1, giving
1/6  1/6  1/6  1/6   1/6  1/6

57 Christopher Manning
Maxent Examples IV
Too uniform! N* are more common than V*, so we add the feature f_N = {NN, NNS, NNP, NNPS}, with E[f_N] = 32/36:
NN    NNS   NNP   NNPS  VBZ   VBD
8/36  8/36  8/36  8/36  2/36  2/36
...and proper nouns are more frequent than common nouns, so we add f_P = {NNP, NNPS}, with E[f_P] = 24/36:
4/36  4/36  12/36 12/36 2/36  2/36
...and we could keep refining the models, e.g., by adding a feature to distinguish singular vs. plural nouns, or verb types.
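These numbers can be reproduced by solving the constrained entropy maximization directly; a sketch using SciPy's SLSQP solver:

```python
import numpy as np
from scipy.optimize import minimize

f_N = np.array([1, 1, 1, 1, 0, 0])   # NN NNS NNP NNPS VBZ VBD: any noun
f_P = np.array([0, 0, 1, 1, 0, 0])   # proper nouns only

def neg_entropy(p):
    return np.sum(p * np.log(np.clip(p, 1e-12, 1)))

cons = [{"type": "eq", "fun": lambda p: p.sum() - 1},
        {"type": "eq", "fun": lambda p: f_N @ p - 32 / 36},
        {"type": "eq", "fun": lambda p: f_P @ p - 24 / 36}]
res = minimize(neg_entropy, np.full(6, 1 / 6), constraints=cons,
               bounds=[(0, 1)] * 6, method="SLSQP")
print(np.round(res.x * 36, 2))   # -> [4. 4. 12. 12. 2. 2.], i.e. in 36ths
```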

58 Christopher Manning
Convexity
[Figure: a convex function vs. a non-convex function.]
Convexity guarantees a single, global maximum, because any higher points are greedily reachable.

59 Christopher Manning
Convexity II
Constrained $H(p) = -\sum x \log x$ is convex:
- $-x \log x$ is convex
- $-\sum x \log x$ is convex (a sum of convex functions is convex)
- The feasible region of constrained H is a linear subspace (which is convex)
- The constrained entropy surface is therefore convex.
The maximum likelihood exponential model (dual) formulation is also convex.

60 Feature Overlap/Feature Interaction
How overlapping features work in maxent models

61 Christopher Manning
Feature Overlap
Maxent models handle overlapping features well. Unlike a NB model, there is no double counting!
Empirical counts:
      A  a
  B   2  1
  b   2  1
With no features (All = 1), every cell gets 1/4. With the feature A (E[A] = 2/3):
      A    a
  B   1/3  1/6
  b   1/3  1/6
Adding the same feature A again changes nothing: the distribution is the same, and the old weight is simply split between the two copies of the feature (λ'A + λ''A).

62 Christopher Manning
Example: Named Entity Feature Overlap
Grace is correlated with PERSON, but does not add much evidence on top of already knowing prefix features.
Local context: Prev = at, Cur = Grace, Next = Road; State = Other, ?, ?; Tag = IN, NNP, NNP; Sig = x, Xx, Xx.
[Feature-weights table (PERS vs. LOC) for: previous word "at"; current word "Grace"; beginning bigram "<G"; current POS tag NNP; prev and cur tags IN NNP; previous state Other; current signature Xx; prev state + cur sig O-Xx; prev-cur-next sig x-Xx-Xx; p. state + p-cur sig O-x-Xx.]

63 Christopher Manning
Feature Interaction
Maxent models handle overlapping features well, but do not automatically model feature interactions.
Empirical counts:
      A  a
  B   1  1
  b   1  0
With no features (All = 1), every cell gets 1/4. With E[A] = 2/3:
      A    a
  B   1/3  1/6
  b   1/3  1/6
Adding E[B] = 2/3 as well gives the independent product distribution:
      A    a
  B   4/9  2/9
  b   2/9  1/9
The parameters are λA on the A column and λB on the B row; no parameter touches the joint A∧B cell.

64 Christopher Manning
Feature Interaction
If you want interaction terms, you have to add them:
Empirical counts:
      A  a
  B   1  1
  b   1  0
With E[A] = 2/3: 1/3 and 1/6 in the A and a columns, as before. Adding E[B] = 2/3: the product distribution 4/9, 2/9, 2/9, 1/9. Adding the interaction E[A∧B] = 1/3 recovers the empirical distribution:
      A    a
  B   1/3  1/3
  b   1/3  0
A disjunctive feature A∨B would also have done it, alone:
      A    a
  B   1/3  1/3
  b   1/3  0
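A sketch that checks these tables by fitting the model numerically, with and without the interaction feature. (With the interaction, the maximum-likelihood weights grow without bound as the a∧b cell is driven toward 0, so the fitted values are only approximately 1/3, 1/3, 1/3, 0.)

```python
import numpy as np
from scipy.optimize import minimize

outcomes = [("A", "B"), ("a", "B"), ("A", "b"), ("a", "b")]
counts = np.array([1.0, 1.0, 1.0, 0.0])            # empirical counts

def fit(feats):
    F = np.array([feats(o) for o in outcomes], dtype=float)  # 4 x k
    def neg_ll(lam):
        s = F @ lam
        return -(counts @ (s - np.log(np.exp(s).sum())))
    lam = minimize(neg_ll, np.zeros(F.shape[1]), method="L-BFGS-B").x
    s = F @ lam
    return np.exp(s) / np.exp(s).sum()

no_inter = fit(lambda o: [o[0] == "A", o[1] == "B"])
with_inter = fit(lambda o: [o[0] == "A", o[1] == "B", o == ("A", "B")])
print(np.round(no_inter, 3))    # ~ [4/9 2/9 2/9 1/9]: independence
print(np.round(with_inter, 3))  # ~ [1/3 1/3 1/3 0]: matches empirical
```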

65 Christopher Manning
Suppose we have a 1-feature maxent model built over observed data as shown. What is the constructed model's probability distribution over the four possible outcomes?
Empirical counts:
      A  a
  B   2  1
  b   2  1
[Tables for the single feature, its expectation, and the resulting cell probabilities are left blank on the slide.]

66 Christopher Manning
Feature Interaction
For loglinear/logistic regression models in statistics, it is standard to do a greedy stepwise search over the space of all possible interaction terms. This combinatorial space is exponential in size, but that's okay, as most statistics models only have 4-8 features.
In NLP, our models commonly use hundreds of thousands of features, so that's not okay. Commonly, interaction terms are added by hand, based on linguistic intuitions.

67 Christopher Manning
Example: NER Interaction
Previous-state and current-signature have interactions, e.g. P=PERS-C=Xx indicates C=PERS much more strongly than C=Xx and P=PERS independently. This feature type allows the model to capture this interaction.
Local context: Prev = at, Cur = Grace, Next = Road; State = Other, ?, ?; Tag = IN, NNP, NNP; Sig = x, Xx, Xx.
[Feature-weights table (PERS vs. LOC) over the same features as on the Named Entity Feature Overlap slide.]
