6.891: Lecture 16 (November 2nd, 2003) Global Linear Models: Part III


1 6.891: Lecture 16 (November 2nd, 2003) Global Linear Models: Part III

2 Overview
- Recap: global linear models, and boosting
- Log-linear models for parameter estimation
- An application: LFG parsing
- Global and local features
- The perceptron revisited
- Log-linear models revisited

3 Three Components of Global Linear Models
- Φ is a function that maps a structure (x, y) to a feature vector Φ(x, y) ∈ R^d
- GEN is a function that maps an input x to a set of candidates GEN(x)
- W is a parameter vector (also a member of R^d)
- Training data is used to set the value of W

4 Putting it all Together
- X is the set of sentences, Y is the set of possible outputs (e.g., trees)
- Need to learn a function F : X → Y
- GEN, Φ, W define
    F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
- Choose the highest-scoring candidate as the most plausible structure
- Given examples (x_i, y_i), how to set W?
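The decoding rule is easy to state concretely. Below is a minimal sketch (mine, not from the lecture), assuming GEN(x) can be enumerated as a list and Φ(x, y) returns a sparse dict of feature values; the names dot, Phi, and GEN are illustrative.

```python
def dot(phi, W):
    # Inner product between a sparse feature dict and a parameter dict.
    return sum(W.get(f, 0.0) * v for f, v in phi.items())

def F(x, GEN, Phi, W):
    # F(x) = argmax_{y in GEN(x)} Phi(x, y) . W:
    # choose the highest-scoring candidate as the most plausible structure.
    return max(GEN(x), key=lambda y: dot(Phi(x, y), W))
```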

5 [Diagram: a worked example. GEN maps the sentence "She announced a program to promote safety in trucks and vans" to a set of candidate parse trees; Φ maps each candidate to a feature vector (e.g., ⟨2, 0, 0, 5⟩, ⟨1, 0, 1, 5⟩, ⟨0, 0, 3, 0⟩, …); each vector is scored by its inner product with W; and the arg max selects the highest-scoring parse.]

6 The Training Data
- We can think of the training data (x_i, y_i), together with GEN, as providing a set of good/bad parse pairs
- On each example there are several bad parses: z ∈ GEN(x_i) such that z ≠ y_i
- There are n_i bad parses on the i-th training example (i.e., n_i = |GEN(x_i)| − 1)
- Some definitions: z_{i,j} is the j-th bad parse for the i-th sentence
- This gives triples (x_i, y_i, z_{i,j}) for i = 1 … n, j = 1 … n_i

7 Margins and Boosting
- We can think of the training data (x_i, y_i), together with GEN, as providing a set of good/bad parse pairs (x_i, y_i, z_{i,j}) for i = 1 … n, j = 1 … n_i
- The margin on example z_{i,j} under parameters W is
    m_{i,j}(W) = Φ(x_i, y_i) · W − Φ(x_i, z_{i,j}) · W
- Exponential loss:
    ExpLoss(W) = Σ_{i,j} e^{−m_{i,j}(W)}
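To make the definitions concrete, here is a small sketch (my own, not from the slides) that computes the margins m_{i,j}(W) and the exponential loss; the dense-list feature representation is just an assumption for illustration.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exp_loss(W, examples):
    """examples: list of (phi_good, phi_bad_list) pairs, i.e. Phi(x_i, y_i)
    and [Phi(x_i, z_{i,1}), Phi(x_i, z_{i,2}), ...] as dense lists."""
    total = 0.0
    for phi_good, phi_bad_list in examples:
        for phi_bad in phi_bad_list:
            m_ij = dot(phi_good, W) - dot(phi_bad, W)   # the margin m_{i,j}(W)
            total += math.exp(-m_ij)
    return total
```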

8 Boosting: A New Parameter Estimation Method
- Exponential loss: ExpLoss(W) = Σ_{i,j} e^{−m_{i,j}(W)}
- Feature selection methods: try to make good progress in minimizing ExpLoss, but keep most parameters W_k = 0
- This is a feature selection method: only a small number of features are selected
- In a couple of lectures we'll talk much more about overfitting and generalization

9 Overview
- Recap: global linear models, and boosting
- Log-linear models for parameter estimation
- An application: LFG parsing
- Global and local features
- The perceptron revisited
- Log-linear models revisited

10 Back to Maximum Likelihood Estimation [Johnson et al., 1999]
- We can use the parameters to define a probability for each parse:
    P(y | x, W) = e^{Φ(x, y) · W} / Σ_{y' ∈ GEN(x)} e^{Φ(x, y') · W}
- The log-likelihood is then
    L(W) = Σ_i log P(y_i | x_i, W)
- A first estimation method: take maximum likelihood estimates, i.e.,
    W_ML = argmax_W L(W)

11 Adding Gaussian Priors [Johnson et al., 1999]
- A first estimation method: take maximum likelihood estimates, i.e., W_ML = argmax_W L(W)
- Unfortunately, this is very likely to overfit: we could use feature selection methods, as in boosting
- Another way of preventing overfitting: choose the parameters as
    W_MAP = argmax_W ( L(W) − C Σ_k W_k^2 )
  for some constant C
- Intuition: this adds a penalty for large parameter values
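As a concrete illustration of the regularized objective L(W) − C Σ_k W_k², here is a sketch (my own encoding, not from the slides): each example is represented by the feature vectors of all of its candidates in GEN(x_i), given as dense lists, plus the index of the correct one.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_sum_exp(scores):
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def map_objective(W, examples, C):
    """Return L(W) - C * sum_k W_k^2, where each example is
    (candidate_feature_vectors, index_of_correct_candidate)."""
    L = 0.0
    for candidates, gold in examples:
        scores = [dot(phi, W) for phi in candidates]
        L += scores[gold] - log_sum_exp(scores)   # log P(y_i | x_i, W)
    return L - C * sum(w * w for w in W)
```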

12 The Bayesian Justification for Gaussian Priors
- In Bayesian methods, we combine the likelihood P(data | W) with a prior over parameters, P(W):
    P(W | data) = P(data | W) P(W) / ∫_W P(data | W) P(W) dW
- The MAP (maximum a posteriori) estimates are
    W_MAP = argmax_W P(W | data)
          = argmax_W ( log P(data | W) + log P(W) )
  i.e., log-likelihood plus log-prior
- Gaussian prior: P(W) ∝ e^{−C Σ_k W_k^2}, so
    log P(W) = −C Σ_k W_k^2 + constant

13 The Relationship to Margins
- The log-likelihood can be written as
    L(W) = Σ_i log P(y_i | x_i, W) = −Σ_i log ( 1 + Σ_j e^{−m_{i,j}(W)} )
  where m_{i,j}(W) = Φ(x_i, y_i) · W − Φ(x_i, z_{i,j}) · W
- Compare this to the exponential loss:
    ExpLoss(W) = Σ_{i,j} e^{−m_{i,j}(W)}

14 The derivation:
    L(W) = Σ_i log P(y_i | x_i, W)
         = Σ_i log ( e^{Φ(x_i, y_i) · W} / Σ_{y' ∈ GEN(x_i)} e^{Φ(x_i, y') · W} )
         = Σ_i log ( 1 / Σ_{y' ∈ GEN(x_i)} e^{Φ(x_i, y') · W − Φ(x_i, y_i) · W} )
         = −Σ_i log ( 1 + Σ_{y' ∈ GEN(x_i), y' ≠ y_i} e^{Φ(x_i, y') · W − Φ(x_i, y_i) · W} )
         = −Σ_i log ( 1 + Σ_j e^{−m_{i,j}(W)} )

15 Summary
- Choose parameters as
    W_MAP = argmax_W ( L(W) − C Σ_k W_k^2 )
  where
    L(W) = Σ_i log P(y_i | x_i, W)
         = Σ_i log ( e^{Φ(x_i, y_i) · W} / Σ_{y' ∈ GEN(x_i)} e^{Φ(x_i, y') · W} )
         = −Σ_i log ( 1 + Σ_j e^{−m_{i,j}(W)} )
- Can use (conjugate) gradient ascent; a sketch of the gradient follows below
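Gradient ascent needs dL/dW; for the unregularized L(W) this is the gold feature vector minus the model's expected feature vector, summed over examples. A sketch under the same toy encoding as above (my own illustration, not taken from the slides):

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def log_likelihood_gradient(W, examples):
    """dL/dW_k = sum_i [ Phi_k(x_i, y_i) - sum_{y'} P(y' | x_i, W) Phi_k(x_i, y') ]."""
    grad = [0.0] * len(W)
    for candidates, gold in examples:
        scores = [dot(phi, W) for phi in candidates]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        Z = sum(exps)
        probs = [e / Z for e in exps]              # P(y' | x_i, W) for each candidate
        for k in range(len(W)):
            expected = sum(p * phi[k] for p, phi in zip(probs, candidates))
            grad[k] += candidates[gold][k] - expected
    return grad
```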

16 Summary: A Comparison to Boosting
- Both methods combine a loss function (a measure of how well the parameters match the training data) with some method of preventing overfitting
- Loss functions:
    ExpLoss(W) = Σ_{i,j} e^{−m_{i,j}(W)}
    −L(W) = Σ_i log ( 1 + Σ_j e^{−m_{i,j}(W)} )
- Protection against overfitting:
    Boosting: feature selection
    Log-linear models: a penalty for large parameter values

17 (At least) two other algorithms are possible: minimizing the negative log-likelihood −L(W) with a feature selection method, or minimizing a combination of ExpLoss and a penalty for large parameter values.

18 An Application: LFG Parsing
- [Johnson et al., 1999] introduced these methods for LFG parsing
- LFG (Lexical Functional Grammar) is a detailed syntactic formalism
- Many of the structures in LFG are directed graphs which are not trees
- This makes coming up with a generative model difficult (see also [Abney, 1997])

19 An Application: LFG Parsing
- [Johnson et al., 1999] used an existing, hand-crafted LFG parser and grammar from Xerox
- Domains were: 1) Xerox printer documentation; 2) the Verbmobil corpus
- The parser was used to generate all possible parses for each sentence; annotators marked which one was correct in each case
- On Verbmobil: the baseline (random) score is 9.7% of parses correct; the log-linear model gets 58.7% correct
- On printer documentation: the baseline is 15.2% correct; the log-linear model scores 58.8%

20 Overview
- Recap: global linear models, and boosting
- Log-linear models for parameter estimation
- An application: LFG parsing
- Global and local features
- The perceptron revisited
- Log-linear models revisited

21 Global and Local Features
- So far: the algorithms have depended on the size of GEN
- Strategies for keeping the size of GEN manageable:
    Reranking methods: use a baseline model to generate its N top analyses
    LFG parsing: hope that the grammar produces a relatively small number of possible analyses

22 Global and Local Features
- Global linear models are global in a couple of ways:
    Feature vectors are defined over entire structures
    Parameter estimation methods are explicitly related to errors on entire structures
- Next topic: global training methods with local features
    Our global features will be defined through local features
    Parameter estimates will be global
    GEN will be large!
    Dynamic programming is used for search and parameter estimation: this is possible for some combinations of GEN and Φ

23 Tagging Problems
TAGGING: strings to tagged sequences, e.g.,
    a b e e a f h j  →  a/C b/D e/C e/C a/D f/C h/D j/C
Example 1: Part-of-speech tagging
    Profits/N soared/V at/P Boeing/N Co./N ,/, easily/ADV topping/V forecasts/N on/P Wall/N Street/N ,/, as/P their/POSS CEO/N Alan/N Mulally/N announced/V first/ADJ quarter/N results/N ./.
Example 2: Named entity recognition
    Profits/NA soared/NA at/NA Boeing/SC Co./CC ,/NA easily/NA topping/NA forecasts/NA on/NA Wall/SL Street/CL ,/NA as/NA their/NA CEO/NA Alan/SP Mulally/CP announced/NA first/NA quarter/NA results/NA ./NA

24 Tagging
Going back to tagging:
- Inputs x are sentences w_{[1:n]} = {w_1 … w_n}
- GEN(w_{[1:n]}) = T^n, i.e., all tag sequences of length n
- Note: GEN has an exponential number of members
- How do we define Φ?

25 Representation: Histories
- A history is a 4-tuple ⟨t_{-2}, t_{-1}, w_{[1:n]}, i⟩
    t_{-2}, t_{-1} are the previous two tags
    w_{[1:n]} are the n words in the input sentence
    i is the index of the word being tagged
- Example: Hispaniola/N quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere.
    t_{-2}, t_{-1} = DT, JJ
    w_{[1:n]} = ⟨Hispaniola, quickly, became, …, Hemisphere, .⟩
    i = 6

26 Local Feature-Vector Representations
- Take a history/tag pair (h, t)
- φ_s(h, t) for s = 1 … d are local features representing the tagging decision t in context h
Example: POS tagging
- Word/tag features:
    φ_100(h, t) = 1 if current word w_i is "base" and t = VB, 0 otherwise
    φ_101(h, t) = 1 if current word w_i ends in "ing" and t = VBG, 0 otherwise
- Contextual features:
    φ_103(h, t) = 1 if ⟨t_{-2}, t_{-1}, t⟩ = ⟨DT, JJ, VB⟩, 0 otherwise
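A sketch of these indicator features in code (my own encoding, not from the slides: a history is the tuple (t_minus2, t_minus1, words, i) with a 1-based word index):

```python
def phi_100(h, t):
    t_minus2, t_minus1, words, i = h
    return 1 if words[i - 1] == "base" and t == "VB" else 0

def phi_101(h, t):
    t_minus2, t_minus1, words, i = h
    return 1 if words[i - 1].endswith("ing") and t == "VBG" else 0

def phi_103(h, t):
    t_minus2, t_minus1, words, i = h
    return 1 if (t_minus2, t_minus1, t) == ("DT", "JJ", "VB") else 0
```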

27 A tagged sentence with n words has n history/tag pairs:
    Hispaniola/N quickly/RB became/VB an/DT important/JJ base/NN
    History                      Tag
    ⟨*, *, w_{[1:n]}, 1⟩         N
    ⟨*, N, w_{[1:n]}, 2⟩         RB
    ⟨N, RB, w_{[1:n]}, 3⟩        VB
    ⟨RB, VB, w_{[1:n]}, 4⟩       DT
    ⟨VB, DT, w_{[1:n]}, 5⟩       JJ
    ⟨DT, JJ, w_{[1:n]}, 6⟩       NN

28 A tagged sentence with n words has n history/tag pairs:
    Hispaniola/N quickly/RB became/VB an/DT important/JJ base/NN
    History                      Tag
    ⟨*, *, w_{[1:n]}, 1⟩         N
    ⟨*, N, w_{[1:n]}, 2⟩         RB
    ⟨N, RB, w_{[1:n]}, 3⟩        VB
    ⟨RB, VB, w_{[1:n]}, 4⟩       DT
    ⟨VB, DT, w_{[1:n]}, 5⟩       JJ
    ⟨DT, JJ, w_{[1:n]}, 6⟩       NN
Define global features through local features:
    Φ(t_{[1:n]}, w_{[1:n]}) = Σ_{i=1}^{n} φ(h_i, t_i)
where t_i is the i-th tag and h_i is the i-th history.
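A sketch of building the global feature vector as a sum of local feature vectors, one per history/tag pair (assuming local features are given as a list of functions φ_s(h, t), as in the sketch after slide 26):

```python
def global_features(words, tags, local_features):
    """Phi_s(w_[1:n], t_[1:n]) = sum_{i=1}^{n} phi_s(h_i, t_i)."""
    Phi = [0.0] * len(local_features)
    padded = ["*", "*"] + list(tags)               # the two tags before position 1 are "*"
    for i in range(1, len(words) + 1):
        h = (padded[i - 1], padded[i], words, i)   # h_i = <t_{i-2}, t_{i-1}, w_[1:n], i>
        for s, phi_s in enumerate(local_features):
            Phi[s] += phi_s(h, tags[i - 1])
    return Phi
```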

29 Global and Local Features
- Typically, local features are indicator functions, e.g.,
    φ_101(h, t) = 1 if current word w_i ends in "ing" and t = VBG, 0 otherwise
- and global features are then counts:
    Φ_101(w_{[1:n]}, t_{[1:n]}) = number of times a word ending in "ing" is tagged as VBG in (w_{[1:n]}, t_{[1:n]})

30 Putting it all Together
- GEN(w_{[1:n]}) is the set of all tagged sequences of length n
- GEN, Φ, W define
    F(w_{[1:n]}) = argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} W · Φ(w_{[1:n]}, t_{[1:n]})
                 = argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} W · Σ_{i=1}^{n} φ(h_i, t_i)
                 = argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} Σ_{i=1}^{n} W · φ(h_i, t_i)
- Some notes:
    The score for a tagged sequence is a sum of local scores
    Dynamic programming can be used to find the argmax (because each history only considers the previous two tags); see the sketch below
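Here is a sketch of that dynamic program, a trigram Viterbi search (my own code; score(t2, t1, words, i, t) stands in for the local score W · φ(h, t) with history ⟨t2, t1, w_{[1:n]}, i⟩):

```python
def viterbi(words, tagset, score):
    """Highest-scoring tag sequence under a sum of local scores.
    score(t2, t1, words, i, t) returns W . phi(h, t) for history <t2, t1, words, i>
    (1-based positions) and proposed tag t."""
    n = len(words)
    S = lambda k: {"*"} if k <= 0 else tagset          # tags available at position k
    pi = {(0, "*", "*"): 0.0}                          # best score of a prefix ending in tags (u, v)
    bp = {}
    for k in range(1, n + 1):
        for u in S(k - 1):
            for v in S(k):
                best_score, best_t = max(
                    ((pi[(k - 1, t, u)] + score(t, u, words, k, v), t) for t in S(k - 2)),
                    key=lambda pair: pair[0],
                )
                pi[(k, u, v)] = best_score
                bp[(k, u, v)] = best_t
    # Pick the best final tag pair, then follow back-pointers.
    u, v = max(((u, v) for u in S(n - 1) for v in S(n)), key=lambda uv: pi[(n, uv[0], uv[1])])
    tags = ["*"] * (n + 1)
    tags[n - 1], tags[n] = u, v
    for k in range(n - 2, 0, -1):
        tags[k] = bp[(k + 2, tags[k + 1], tags[k + 2])]
    return tags[1:]
```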

31 A Variant of the Perceptron Algorithm
Inputs: training set (x_i, y_i) for i = 1 … n
Initialization: W = 0
Define: F(x) = argmax_{y ∈ GEN(x)} Φ(x, y) · W
Algorithm:
    For t = 1 … T, i = 1 … n:
        z_i = F(x_i)
        If z_i ≠ y_i then W = W + Φ(x_i, y_i) − Φ(x_i, z_i)
Output: parameters W
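A sketch of this algorithm in code (assuming, for illustration only, that GEN(x) can be enumerated as a list and Φ(x, y) returns a sparse dict of feature counts):

```python
from collections import defaultdict

def dot(W, phi):
    return sum(W[f] * v for f, v in phi.items())

def perceptron(examples, GEN, Phi, T=5):
    """examples: list of (x, y) pairs. Returns the learned parameter dict W."""
    W = defaultdict(float)
    for _ in range(T):
        for x, y in examples:
            z = max(GEN(x), key=lambda cand: dot(W, Phi(x, cand)))   # z_i = F(x_i)
            if z != y:
                for f, v in Phi(x, y).items():   # W = W + Phi(x_i, y_i)
                    W[f] += v
                for f, v in Phi(x, z).items():   #       - Phi(x_i, z_i)
                    W[f] -= v
    return W
```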

32 Training a Tagger Using the Perceptron Algorithm
Inputs: training set (w^i_{[1:n_i]}, t^i_{[1:n_i]}) for i = 1 … n
Initialization: W = 0
Algorithm:
    For t = 1 … T, i = 1 … n:
        z_{[1:n_i]} = argmax_{u_{[1:n_i]} ∈ T^{n_i}} W · Φ(w^i_{[1:n_i]}, u_{[1:n_i]})
        (z_{[1:n_i]} can be computed with the dynamic programming (Viterbi) algorithm)
        If z_{[1:n_i]} ≠ t^i_{[1:n_i]} then W = W + Φ(w^i_{[1:n_i]}, t^i_{[1:n_i]}) − Φ(w^i_{[1:n_i]}, z_{[1:n_i]})
Output: parameter vector W

33 An Example
- Say the correct tags for the i-th sentence are
    the/DT man/NN bit/VBD the/DT dog/NN
- Under the current parameters, the output is
    the/DT man/NN bit/NN the/DT dog/NN
- Assume also that the features track: (1) all tag bigrams; (2) word/tag pairs
- Parameters incremented: ⟨NN, VBD⟩, ⟨VBD, DT⟩, ⟨VBD → bit⟩
- Parameters decremented: ⟨NN, NN⟩, ⟨NN, DT⟩, ⟨NN → bit⟩

34 Experiments
- Wall Street Journal part-of-speech tagging data:
    Perceptron = 2.89% error, max-ent = 3.28% error (an 11.9% relative error reduction)
- [Ramshaw and Marcus, 1995] chunking data:
    Perceptron = 93.63%, max-ent = 93.29% (a 5.1% relative error reduction)

35 How Does this Differ from Log-Linear Taggers?
- Log-linear taggers (in an earlier lecture) used very similar local representations
- How does the perceptron model differ?
- Why might these differences be important?

36 Log-Linear Tagging Models
- Take a history/tag pair (h, t)
- φ_s(h, t) for s = 1 … d are features; W_s for s = 1 … d are parameters
- Conditional distribution:
    P(t | h) = e^{W · φ(h, t)} / Z(h, W)
  where Z(h, W) = Σ_{t' ∈ T} e^{W · φ(h, t')}
- Parameters estimated using maximum likelihood, e.g., iterative scaling, gradient descent
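A sketch of this locally normalized conditional (assuming a small tag set T and a helper score(h, t) that computes W · φ(h, t); both names are illustrative):

```python
import math

def conditional(h, T, score):
    """P(t | h, W) = e^{W . phi(h, t)} / Z(h, W), with Z(h, W) summing over all tags t' in T."""
    exp_scores = {t: math.exp(score(h, t)) for t in T}
    Z = sum(exp_scores.values())
    return {t: s / Z for t, s in exp_scores.items()}
```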

37 Log-Linear Tagging Models
- Word sequence w_{[1:n]} = [w_1, w_2, …, w_n]
- Tag sequence t_{[1:n]} = [t_1, t_2, …, t_n]
- Histories h_i = ⟨t_{i-2}, t_{i-1}, w_{[1:n]}, i⟩
    log P(t_{[1:n]} | w_{[1:n]}) = Σ_{i=1}^{n} log P(t_i | h_i)
                                 = Σ_{i=1}^{n} W · φ(h_i, t_i)   [linear score]
                                   − Σ_{i=1}^{n} log Z(h_i, W)   [local normalization terms]
- Compare this to the perceptron, where GEN, Φ, W define
    F(w_{[1:n]}) = argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} Σ_{i=1}^{n} W · φ(h_i, t_i)   [linear score]

38 Problems with Locally Normalized Models
- The label bias problem [Lafferty, McCallum and Pereira 2001]; see also [Klein and Manning 2002]
- An example of a conditional distribution that locally normalized models can't capture (under a bigram tag representation):
    a b c  →  A B C    with P(A B C | a b c) = 1
    a b e  →  A D E    with P(A D E | a b e) = 1
- It is impossible to find parameters that satisfy both
    P(A | a) × P(B | b, A) × P(C | c, B) = 1
    P(A | a) × P(D | b, A) × P(E | e, D) = 1
  (the first requires P(B | b, A) = 1, the second requires P(D | b, A) = 1)

39 Overview
- Recap: global linear models, and boosting
- Log-linear models for parameter estimation
- An application: LFG parsing
- Global and local features
- The perceptron revisited
- Log-linear models revisited

40 Global Log-Linear Models
- We can use the parameters to define a probability for each tagged sequence:
    P(t_{[1:n]} | w_{[1:n]}, W) = e^{Σ_i W · φ(h_i, t_i)} / Z(w_{[1:n]}, W)
  where
    Z(w_{[1:n]}, W) = Σ_{t_{[1:n]} ∈ GEN(w_{[1:n]})} e^{Σ_i W · φ(h_i, t_i)}
  is a global normalization term
- This is a global log-linear model with Φ(w_{[1:n]}, t_{[1:n]}) = Σ_i φ(h_i, t_i)

41 Now we have:
    log P(t_{[1:n]} | w_{[1:n]}) = Σ_{i=1}^{n} W · φ(h_i, t_i)   [linear score]
                                   − log Z(w_{[1:n]}, W)         [global normalization term]
- When finding the highest-probability tag sequence, the global term is irrelevant:
    argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} ( Σ_{i=1}^{n} W · φ(h_i, t_i) − log Z(w_{[1:n]}, W) )
        = argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} Σ_{i=1}^{n} W · φ(h_i, t_i)

42 Parameter Estimation
- For parameter estimation, we must calculate the gradient of
    log P(t_{[1:n]} | w_{[1:n]}) = Σ_{i=1}^{n} W · φ(h_i, t_i) − log Σ_{t'_{[1:n]} ∈ GEN(w_{[1:n]})} e^{Σ_i W · φ(h'_i, t'_i)}
  with respect to W
- Taking derivatives gives
    dL/dW = Σ_{i=1}^{n} φ(h_i, t_i) − Σ_{i=1}^{n} Σ_{t'_{[1:n]} ∈ GEN(w_{[1:n]})} P(t'_{[1:n]} | w_{[1:n]}, W) φ(h'_i, t'_i)
- This can be calculated using dynamic programming (very similar to the forward-backward algorithm for EM training); see the sketch below
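A sketch of that dynamic program. To keep it short it is written for a first-order chain (the history is just the previous tag); the trigram case used in these slides works the same way, with states ranging over tag pairs. The helpers score and phi are assumptions for illustration, and feature vectors are dense lists of length d.

```python
import math

def log_sum_exp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def log_z_and_expected_counts(words, T, score, phi, d):
    """score(prev, words, i, t) = W . phi(h, t); phi(...) returns a length-d list.
    Returns (log Z(w_[1:n], W), expected feature counts under the model)."""
    n = len(words)
    # Forward scores: alpha[i-1][t] = log sum over all prefixes ending with tag t at position i.
    alpha = [{t: score("*", words, 1, t) for t in T}]
    for i in range(2, n + 1):
        alpha.append({t: log_sum_exp([alpha[-1][p] + score(p, words, i, t) for p in T]) for t in T})
    log_Z = log_sum_exp(list(alpha[-1].values()))
    # Backward scores: beta[i-1][t] = log sum over all suffixes following tag t at position i.
    beta = [None] * n
    beta[n - 1] = {t: 0.0 for t in T}
    for i in range(n - 1, 0, -1):
        beta[i - 1] = {t: log_sum_exp([score(t, words, i + 1, nxt) + beta[i][nxt] for nxt in T]) for t in T}
    # Expected counts: marginal probability of each (prev, t) transition times its local features.
    expected = [0.0] * d
    for i in range(1, n + 1):
        for p in (["*"] if i == 1 else T):
            for t in T:
                log_alpha_prev = 0.0 if i == 1 else alpha[i - 2][p]
                marginal = math.exp(log_alpha_prev + score(p, words, i, t) + beta[i - 1][t] - log_Z)
                for k, v in enumerate(phi(p, words, i, t)):
                    expected[k] += marginal * v
    return log_Z, expected
```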

43 Summary of Perceptron vs. Global Log-Linear Model
- Both are global linear models, where
    GEN(w_{[1:n]}) = the set of all possible tag sequences for w_{[1:n]}
    Φ(w_{[1:n]}, t_{[1:n]}) = Σ_i φ(h_i, t_i)
- In both cases,
    F(w_{[1:n]}) = argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} W · Φ(w_{[1:n]}, t_{[1:n]})
                 = argmax_{t_{[1:n]} ∈ GEN(w_{[1:n]})} Σ_i W · φ(h_i, t_i)
  can be computed using dynamic programming

44 Dynamic programming is also used in training:
- The perceptron requires the highest-scoring tag sequence for each training example
- The global log-linear model requires the gradient, and therefore expected counts

45 Results
From [Sha and Pereira, 2003]. Task = shallow parsing (base noun-phrase recognition).
    Model                                                 Accuracy
    SVM combination                                       94.39%
    Conditional random field (global log-linear model)    94.38%
    Generalized winnow                                    93.89%
    Perceptron                                            94.09%
    Local log-linear model                                93.70%
