SFU NatLangLab
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University
October 9, 2018
Part 1: Classification tasks in NLP
Outline:
- Classification tasks in NLP
- Naive Bayes Classifier
- Log-linear models
Prepositional Phrases

noun attach: I bought the shirt with pockets
verb attach: I washed the shirt with soap

As in the case of other attachment decisions in parsing:
- it depends on the meaning of the entire sentence
- needs world knowledge, etc.

Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words.
Ambiguity Resolution: Prepositional Phrases in English
Learning Prepositional Phrase Attachment: Annotated Data

v          n1           p      n2         Attachment
join       board        as     director   V
is         chairman     of     N.V.       N
using      crocidolite  in     filters    V
bring      attention    to     problem    V
is         asbestos     in     products   N
making     paper        for    filters    N
including  three        with   cancer     N
...
Prepositional Phrase Attachment

Method                              Accuracy (%)
Always noun attachment              59.0
Most likely for each preposition    72.2
Average Human (4 head words only)   88.2
Average Human (whole sentence)      93.2
Back-off Smoothing

Random variable a represents the attachment: a = n1 or a = v (two-class classification).

We want to compute the probability of noun attachment, p(a = n1 | v, n1, p, n2). The probability of verb attachment is then 1 - p(a = n1 | v, n1, p, n2).
Back-off Smoothing

1. If f(v, n1, p, n2) > 0:

   \hat{p}(a = n_1 \mid v, n_1, p, n_2) = \frac{f(a = n_1, v, n_1, p, n_2)}{f(v, n_1, p, n_2)}

2. Else if f(v, n1, p) + f(v, p, n2) + f(n1, p, n2) > 0:

   \hat{p}(a = n_1 \mid v, n_1, p, n_2) = \frac{f(a = n_1, v, n_1, p) + f(a = n_1, v, p, n_2) + f(a = n_1, n_1, p, n_2)}{f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2)}

3. Else if f(v, p) + f(n1, p) + f(p, n2) > 0:

   \hat{p}(a = n_1 \mid v, n_1, p, n_2) = \frac{f(a = n_1, v, p) + f(a = n_1, n_1, p) + f(a = n_1, p, n_2)}{f(v, p) + f(n_1, p) + f(p, n_2)}

4. Else if f(p) > 0 (try choosing attachment based on the preposition alone):

   \hat{p}(a = n_1 \mid v, n_1, p, n_2) = \frac{f(a = n_1, p)}{f(p)}

5. Else default to noun attachment: \hat{p}(a = n_1 \mid v, n_1, p, n_2) = 1.0

Noun attachment is chosen when \hat{p} \geq 0.5, verb attachment otherwise.
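A minimal sketch of this back-off procedure in Python, assuming the training counts have already been collected into two dictionaries; the names count, count_n, and predict are illustrative, not from Collins and Brooks:

```python
from collections import defaultdict

count = defaultdict(float)    # f(...): counts of head-word tuples in training data
count_n = defaultdict(float)  # f(a = n1, ...): same tuples seen with noun attachment

def p_noun_attach(v, n1, p, n2):
    """Estimate p(a = n1 | v, n1, p, n2), backing off to smaller tuples."""
    # 1. Full 4-tuple.
    if count[(v, n1, p, n2)] > 0:
        return count_n[(v, n1, p, n2)] / count[(v, n1, p, n2)]
    # 2. All triples containing the preposition.
    denom = count[(v, n1, p)] + count[(v, p, n2)] + count[(n1, p, n2)]
    if denom > 0:
        num = count_n[(v, n1, p)] + count_n[(v, p, n2)] + count_n[(n1, p, n2)]
        return num / denom
    # 3. All pairs containing the preposition.
    denom = count[(v, p)] + count[(n1, p)] + count[(p, n2)]
    if denom > 0:
        num = count_n[(v, p)] + count_n[(n1, p)] + count_n[(p, n2)]
        return num / denom
    # 4. Preposition alone.
    if count[(p,)] > 0:
        return count_n[(p,)] / count[(p,)]
    # 5. Default: noun attachment.
    return 1.0

def predict(v, n1, p, n2):
    # Decision rule: noun attachment iff the estimate is at least 0.5.
    return "N" if p_noun_attach(v, n1, p, n2) >= 0.5 else "V"
```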
Prepositional Phrase Attachment: Results

Results (Collins and Brooks 1995): 84.5% accuracy with the use of some limited word classes for dates, numbers, etc.

Toutanova, Manning, and Ng, 2004: use a sophisticated smoothing model for PP attachment; 86.18% with words & stems, 87.54% with word classes.

Merlo, Crocker and Berthouzoz, 1997: test on multiple PPs, generalizing disambiguation of 1 PP to 2-3 PPs:
  1 PP: 84.3%   2 PPs: 69.6%   3 PPs: 43.6%
Part 2: Probabilistic Classifiers
Naive Bayes Classifier

x is the input, represented as d features f_j, 1 ≤ j ≤ d, assumed conditionally independent given the class; y is the output classification.

  P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)}   (Bayes rule)

  P(x \mid y) = \prod_{j=1}^{d} P(f_j \mid y)

  P(y \mid x) \propto P(y) \prod_{j=1}^{d} P(f_j \mid y)
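A minimal sketch of prediction with this model, assuming the prior P(y) and per-feature likelihoods P(f_j | y) have already been estimated; the toy tables and the smoothing constant below are illustrative only:

```python
import math

# Assumed, illustrative parameter tables:
#   prior[y]            = P(y)
#   likelihood[y][j][f] = P(f_j = f | y) for feature position j
prior = {"N": 0.6, "V": 0.4}
likelihood = {
    "N": [{"is": 0.2, "join": 0.1}, {"chairman": 0.3}],
    "V": [{"is": 0.1, "join": 0.3}, {"board": 0.2}],
}

def naive_bayes_predict(features, smooth=1e-6):
    """Pick argmax_y P(y) * prod_j P(f_j | y), computed in log space."""
    best_y, best_score = None, float("-inf")
    for y, p_y in prior.items():
        score = math.log(p_y)
        for j, f in enumerate(features):
            # Fall back to a tiny probability for unseen feature values.
            score += math.log(likelihood[y][j].get(f, smooth))
        if score > best_score:
            best_y, best_score = y, score
    return best_y

print(naive_bayes_predict(["join", "board"]))  # -> "V" under these toy numbers
```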
Log linear model

The model classifies an input x into output labels y ∈ Y.
Let there be m features, f_k(x, y) for k = 1, ..., m.
Define a parameter vector v ∈ R^m.
Each (x, y) pair is mapped to a score:

  s(x, y) = \sum_k v_k f_k(x, y)

Using inner product notation, v \cdot f(x, y) = \sum_k v_k f_k(x, y), so s(x, y) = v \cdot f(x, y).

To get a probability from the score: renormalize!

  \Pr(y \mid x; v) = \frac{\exp(s(x, y))}{\sum_{y' \in Y} \exp(s(x, y'))}
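A small sketch of the score and renormalization step, assuming sparse features come from a user-supplied feature_fn(x, y) (an illustrative name) that returns a dictionary; the max-subtraction is a standard numerical-stability trick, not part of the slide:

```python
import math

def log_linear_prob(x, y, labels, v, feature_fn):
    """Pr(y | x; v) = exp(v . f(x, y)) / sum_y' exp(v . f(x, y'))."""
    def score(label):
        # Inner product v . f(x, label) over sparse features.
        return sum(v.get(k, 0.0) * val for k, val in feature_fn(x, label).items())
    scores = {label: score(label) for label in labels}
    # Subtract the max score before exponentiating for numerical stability.
    m = max(scores.values())
    z = sum(math.exp(s - m) for s in scores.values())
    return math.exp(scores[y] - m) / z

# Toy usage with an illustrative bag-of-words feature function.
def feature_fn(x, y):
    return {(w, y): 1.0 for w in x.split()}

v = {("with", "N"): 0.5, ("with", "V"): -0.2}
print(log_linear_prob("shirt with pockets", "N", ["N", "V"], v, feature_fn))
```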
Log linear model

The name log-linear model comes from:

  \log \Pr(y \mid x; v) = \underbrace{v \cdot f(x, y)}_{\text{linear term}} - \underbrace{\log \sum_{y' \in Y} \exp\left( v \cdot f(x, y') \right)}_{\text{normalization term}}

Once the weights v are learned, we can perform predictions using these features.

The goal: find v that maximizes the log likelihood L(v) of the labeled training set containing (x_i, y_i) for i = 1, ..., n:

  L(v) = \sum_i \log \Pr(y_i \mid x_i; v)
       = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp\left( v \cdot f(x_i, y') \right)
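A corresponding sketch of L(v) over a labelled training set, under the same illustrative sparse-feature conventions as above:

```python
import math

def dot(v, f):
    # Sparse inner product v . f.
    return sum(v.get(k, 0.0) * val for k, val in f.items())

def log_likelihood(data, labels, v, feature_fn):
    """L(v) = sum_i [ v . f(x_i, y_i) - log sum_y' exp(v . f(x_i, y')) ]."""
    total = 0.0
    for x, y in data:
        scores = [dot(v, feature_fn(x, label)) for label in labels]
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += dot(v, feature_fn(x, y)) - log_z
    return total
```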
Log linear model

Maximize:

  L(v) = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp\left( v \cdot f(x_i, y') \right)

Calculate the gradient:

  \frac{dL(v)}{dv} = \sum_i f(x_i, y_i) - \sum_i \frac{1}{\sum_{y''} \exp\left( v \cdot f(x_i, y'') \right)} \sum_{y'} f(x_i, y') \exp\left( v \cdot f(x_i, y') \right)

                   = \sum_i f(x_i, y_i) - \sum_i \sum_{y'} f(x_i, y') \frac{\exp\left( v \cdot f(x_i, y') \right)}{\sum_{y''} \exp\left( v \cdot f(x_i, y'') \right)}

                   = \underbrace{\sum_i f(x_i, y_i)}_{\text{observed counts}} - \underbrace{\sum_i \sum_{y'} f(x_i, y') \Pr(y' \mid x_i; v)}_{\text{expected counts}}
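A sketch of the gradient as observed minus expected feature counts, again under the same illustrative feature_fn convention:

```python
import math
from collections import defaultdict

def dot(v, f):
    return sum(v.get(k, 0.0) * val for k, val in f.items())

def gradient(data, labels, v, feature_fn):
    """dL/dv = sum_i f(x_i, y_i) - sum_i sum_y' f(x_i, y') Pr(y' | x_i; v)."""
    grad = defaultdict(float)
    for x, y in data:
        # Observed counts.
        for k, val in feature_fn(x, y).items():
            grad[k] += val
        # Expected counts under the current model.
        scores = {label: dot(v, feature_fn(x, label)) for label in labels}
        m = max(scores.values())
        z = sum(math.exp(s - m) for s in scores.values())
        for label in labels:
            p = math.exp(scores[label] - m) / z
            for k, val in feature_fn(x, label).items():
                grad[k] -= p * val
    return grad
```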
Gradient ascent

Init: v^(0) = 0, t = 0

Iterate until convergence:
  Calculate: \Delta = \frac{dL(v)}{dv} evaluated at v = v^(t)
  Find \beta^* = \arg\max_\beta L(v^(t) + \beta \Delta)
  Set v^(t+1) \leftarrow v^(t) + \beta^* \Delta
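A minimal gradient-ascent loop; the exact line search over beta is replaced here by a fixed step size, which is a common simplification rather than the procedure on the slide. The gradient function is assumed to be supplied (for example, the sketch above):

```python
from collections import defaultdict

def gradient_ascent(data, labels, feature_fn, gradient, step=0.1, iterations=100):
    """v^(0) = 0; v^(t+1) = v^(t) + step * dL/dv evaluated at v^(t)."""
    v = defaultdict(float)  # v^(0) = 0
    for _ in range(iterations):
        grad = gradient(data, labels, v, feature_fn)
        # A fixed step size stands in for the arg-max line search over beta.
        for k, g in grad.items():
            v[k] += step * g
    return v
```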
Learning the weights v: Generalized Iterative Scaling

  f^\# = \max_{x,y} \sum_j f_j(x, y)   (the maximum possible total feature value; needed for scaling)

Initialize v^(0).

For each iteration t:
  expected[j] \leftarrow 0 for j = 1, ..., number of features
  For i = 1 to n (training examples):
    For each label y and each feature f_j:
      expected[j] += f_j(x_i, y) \cdot P(y \mid x_i; v^{(t)})
  For each feature f_j:
    observed[j] = \sum_{(x,y) \in \text{training data}} f_j(x, y)
  For each feature f_j:
    v_j^{(t+1)} \leftarrow v_j^{(t)} + \frac{1}{f^\#} \log \frac{observed[j]}{expected[j]}

cf. Goodman, NIPS '01
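A sketch of these GIS updates, assuming non-negative feature values and the same illustrative feature_fn convention as earlier; it omits the slack feature that standard GIS adds to make the feature totals constant:

```python
import math
from collections import defaultdict

def gis_train(data, labels, feature_fn, iterations=50):
    """Sketch of Generalized Iterative Scaling under the assumptions above."""
    # f#: maximum total feature value over any (x, y) pair in the data.
    f_sharp = max(sum(feature_fn(x, y).values())
                  for x, _ in data for y in labels)
    # Observed counts: sum of each feature over the labelled training pairs.
    observed = defaultdict(float)
    for x, y in data:
        for k, val in feature_fn(x, y).items():
            observed[k] += val
    v = defaultdict(float)
    for _ in range(iterations):
        # Expected counts under the current weights v^(t).
        expected = defaultdict(float)
        for x, _ in data:
            scores = {y: sum(v[k] * val for k, val in feature_fn(x, y).items())
                      for y in labels}
            m = max(scores.values())
            z = sum(math.exp(s - m) for s in scores.values())
            for y in labels:
                p = math.exp(scores[y] - m) / z
                for k, val in feature_fn(x, y).items():
                    expected[k] += p * val
        # GIS update: v_j += (1 / f#) * log(observed[j] / expected[j]).
        for k in observed:
            if expected[k] > 0:
                v[k] += (1.0 / f_sharp) * math.log(observed[k] / expected[k])
    return v
```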
Acknowledgements

Many slides are borrowed from or inspired by lecture notes by Michael Collins, Chris Dyer, Kevin Knight, Philipp Koehn, Adam Lopez, Graham Neubig and Luke Zettlemoyer from their NLP course materials. All mistakes are my own. A big thank you to all the students who read through these notes and helped me improve them.