Natural Language Processing


SFU NatLangLab
Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University
October 9, 2018

Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University
Part 1: Classification tasks in NLP

Classification tasks in NLP · Naive Bayes Classifier · Log linear models

Prepositional Phrases
noun attach: I bought the shirt with pockets
verb attach: I washed the shirt with soap
As with other attachment decisions in parsing, the right choice depends on the meaning of the entire sentence, needs world knowledge, etc. Maybe there is a simpler solution: we can attempt to solve it using heuristics or associations between words.

Ambiguity Resolution: Prepositional Phrases in English
Learning Prepositional Phrase Attachment: Annotated Data

v          n1           p     n2        Attachment
join       board        as    director  V
is         chairman     of    N.V.      N
using      crocidolite  in    filters   V
bring      attention    to    problem   V
is         asbestos     in    products  N
making     paper        for   filters   N
including  three        with  cancer    N
...

Prepositional Phrase Attachment

Method                               Accuracy (%)
Always noun attachment               59.0
Most likely for each preposition     72.2
Average Human (4 head words only)    88.2
Average Human (whole sentence)       93.2

Back-off Smoothing
Random variable a represents attachment: $a = n_1$ or $a = v$ (two-class classification).
We want to compute the probability of noun attachment: $p(a = n_1 \mid v, n_1, p, n_2)$.
The probability of verb attachment is $1 - p(a = n_1 \mid v, n_1, p, n_2)$.

Back-off Smoothing

1. If $f(v, n_1, p, n_2) > 0$:
$$\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, n_1, p, n_2)}{f(v, n_1, p, n_2)}$$

2. Else if $f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2) > 0$:
$$\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, n_1, p) + f(a_{n_1}, v, p, n_2) + f(a_{n_1}, n_1, p, n_2)}{f(v, n_1, p) + f(v, p, n_2) + f(n_1, p, n_2)}$$

3. Else if $f(v, p) + f(n_1, p) + f(p, n_2) > 0$:
$$\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, v, p) + f(a_{n_1}, n_1, p) + f(a_{n_1}, p, n_2)}{f(v, p) + f(n_1, p) + f(p, n_2)}$$

4. Else if $f(p) > 0$ (try choosing attachment based on the preposition alone):
$$\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = \frac{f(a_{n_1}, p)}{f(p)}$$

5. Else: $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) = 1.0$

Decision: choose noun attachment whenever $\hat{p}(a_{n_1} \mid v, n_1, p, n_2) \geq 0.5$.
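
A minimal sketch of this back-off estimator in Python, assuming the counts have already been collected from the training data into one table per tuple type; the table layout and all names below are illustrative assumptions, not something given on the slides.

```python
def backoff_p_noun(v, n1, p, n2, f, f_n1):
    """Estimate p(a = n1 | v, n1, p, n2) with the back-off scheme above.

    f and f_n1 are (hypothetical) dicts mapping a tuple-type name, e.g.
    'v,n1,p,n2', to a collections.Counter over word tuples; f counts all
    training examples, f_n1 only the noun-attachment ones.
    """
    # 1. the full 4-tuple
    d = f['v,n1,p,n2'][(v, n1, p, n2)]
    if d > 0:
        return f_n1['v,n1,p,n2'][(v, n1, p, n2)] / d
    # 2. back off to the 3-tuples that contain the preposition
    d = f['v,n1,p'][(v, n1, p)] + f['v,p,n2'][(v, p, n2)] + f['n1,p,n2'][(n1, p, n2)]
    if d > 0:
        num = (f_n1['v,n1,p'][(v, n1, p)] + f_n1['v,p,n2'][(v, p, n2)]
               + f_n1['n1,p,n2'][(n1, p, n2)])
        return num / d
    # 3. back off to the 2-tuples that contain the preposition
    d = f['v,p'][(v, p)] + f['n1,p'][(n1, p)] + f['p,n2'][(p, n2)]
    if d > 0:
        return (f_n1['v,p'][(v, p)] + f_n1['n1,p'][(n1, p)] + f_n1['p,n2'][(p, n2)]) / d
    # 4. the preposition alone
    if f['p'][(p,)] > 0:
        return f_n1['p'][(p,)] / f['p'][(p,)]
    # 5. default: noun attachment
    return 1.0
```

The attachment decision is then noun attachment whenever the returned estimate is at least 0.5, verb attachment otherwise.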

Prepositional Phrase Attachment: Results
Collins and Brooks (1995): 84.5% accuracy with the use of some limited word classes for dates, numbers, etc.
Toutanova, Manning, and Ng (2004): use a sophisticated smoothing model for PP attachment; 86.18% with words & stems, 87.54% with word classes.
Merlo, Crocker and Berthouzoz (1997): test on multiple PPs, generalizing disambiguation of 1 PP to 2-3 PPs; 1 PP: 84.3%, 2 PPs: 69.6%, 3 PPs: 43.6%.

Natural Language Processing
Anoop Sarkar
anoopsarkar.github.io/nlp-class
Simon Fraser University
Part 2: Probabilistic Classifiers

Classification tasks in NLP · Naive Bayes Classifier · Log linear models

Naive Bayes Classifier
x is the input, represented as d independent features $f_j$, $1 \leq j \leq d$; y is the output classification.
$$P(y \mid x) = \frac{P(y)\, P(x \mid y)}{P(x)} \quad \text{(Bayes Rule)}$$
$$P(x \mid y) = \prod_{j=1}^{d} P(f_j \mid y)$$
$$P(y \mid x) \propto P(y) \prod_{j=1}^{d} P(f_j \mid y)$$
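
As a concrete illustration, a minimal Naive Bayes classifier over such features might look like the sketch below; the add-one smoothing and all class and variable names are assumptions added for the example, not part of the slide.

```python
import math
from collections import Counter, defaultdict

class NaiveBayes:
    """Predicts argmax_y P(y) * prod_j P(f_j | y), computed in log space."""

    def fit(self, data, alpha=1.0):
        # data: list of (features, label) pairs; features is a list of f_j values
        self.label_counts = Counter(y for _, y in data)
        self.feat_counts = defaultdict(Counter)   # feat_counts[y][f] = count of f with label y
        self.vocab = set(f for feats, _ in data for f in feats)
        for feats, y in data:
            for f in feats:
                self.feat_counts[y][f] += 1
        self.total = sum(self.label_counts.values())
        self.alpha = alpha                        # add-alpha smoothing (an assumption)
        return self

    def log_prob(self, feats, y):
        lp = math.log(self.label_counts[y] / self.total)                    # log P(y)
        denom = sum(self.feat_counts[y].values()) + self.alpha * len(self.vocab)
        for f in feats:
            lp += math.log((self.feat_counts[y][f] + self.alpha) / denom)   # log P(f_j | y)
        return lp

    def predict(self, feats):
        return max(self.label_counts, key=lambda y: self.log_prob(feats, y))
```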

Classification tasks in NLP · Naive Bayes Classifier · Log linear models

Log linear model
The model classifies input into output labels $y \in Y$.
Let there be m features, $f_k(x, y)$ for $k = 1, \ldots, m$.
Define a parameter vector $v \in \mathbb{R}^m$.
Each (x, y) pair is mapped to a score:
$$s(x, y) = \sum_k v_k f_k(x, y)$$
Using inner product notation: $v \cdot f(x, y) = \sum_k v_k f_k(x, y)$, so $s(x, y) = v \cdot f(x, y)$.
To get a probability from the score: renormalize!
$$\Pr(y \mid x; v) = \frac{\exp(s(x, y))}{\sum_{y' \in Y} \exp(s(x, y'))}$$
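
A short sketch of the score and the renormalization step, assuming a hypothetical feature function f(x, y) that returns a NumPy vector and a list of candidate labels:

```python
import numpy as np

def score(v, f, x, y):
    """s(x, y) = v . f(x, y)"""
    return np.dot(v, f(x, y))

def prob(v, f, x, y, labels):
    """Pr(y | x; v) = exp(s(x, y)) / sum over y' in Y of exp(s(x, y'))."""
    scores = np.array([score(v, f, x, yp) for yp in labels])
    scores -= scores.max()            # shift for numerical stability; the ratios are unchanged
    exp_scores = np.exp(scores)
    return exp_scores[labels.index(y)] / exp_scores.sum()
```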

Log linear model
The name log-linear model comes from:
$$\log \Pr(y \mid x; v) = \underbrace{v \cdot f(x, y)}_{\text{linear term}} - \underbrace{\log \sum_{y' \in Y} \exp\left(v \cdot f(x, y')\right)}_{\text{normalization term}}$$
Once the weights v are learned, we can perform predictions using these features.
The goal: to find v that maximizes the log likelihood L(v) of the labeled training set containing $(x_i, y_i)$ for $i = 1 \ldots n$:
$$L(v) = \sum_i \log \Pr(y_i \mid x_i; v) = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp\left(v \cdot f(x_i, y')\right)$$

Log linear model
Maximize:
$$L(v) = \sum_i v \cdot f(x_i, y_i) - \sum_i \log \sum_{y'} \exp\left(v \cdot f(x_i, y')\right)$$
Calculate the gradient:
$$\frac{dL(v)}{dv} = \sum_i f(x_i, y_i) - \sum_i \frac{1}{\sum_{y''} \exp\left(v \cdot f(x_i, y'')\right)} \sum_{y'} f(x_i, y') \exp\left(v \cdot f(x_i, y')\right)$$
$$= \sum_i f(x_i, y_i) - \sum_i \sum_{y'} f(x_i, y') \frac{\exp\left(v \cdot f(x_i, y')\right)}{\sum_{y''} \exp\left(v \cdot f(x_i, y'')\right)}$$
$$= \underbrace{\sum_i f(x_i, y_i)}_{\text{Observed counts}} - \underbrace{\sum_i \sum_{y'} f(x_i, y') \Pr(y' \mid x_i; v)}_{\text{Expected counts}}$$
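
Continuing the sketch above, the gradient is just observed minus expected feature counts (again assuming the hypothetical f(x, y) returning a NumPy vector):

```python
import numpy as np

def gradient(v, f, data, labels):
    """dL/dv = sum_i f(x_i, y_i) - sum_i sum_{y'} Pr(y' | x_i; v) f(x_i, y')."""
    grad = np.zeros_like(v, dtype=float)
    for x, y in data:
        grad += f(x, y)                                           # observed counts
        scores = np.array([np.dot(v, f(x, yp)) for yp in labels])
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()                                      # Pr(y' | x; v)
        for yp, p in zip(labels, probs):
            grad -= p * f(x, yp)                                  # expected counts
    return grad
```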

Gradient ascent
Init: $v^{(0)} = 0$, $t \leftarrow 0$
Iterate until convergence:
  Calculate: $\Delta = \left.\frac{dL(v)}{dv}\right|_{v = v^{(t)}}$
  Find $\beta^* = \arg\max_\beta L(v^{(t)} + \beta \Delta)$
  Set $v^{(t+1)} \leftarrow v^{(t)} + \beta^* \Delta$
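
A compact sketch of this loop, reusing the gradient function from the previous sketch; the slide only says to maximize over β, so the small grid of candidate step sizes below is an assumption.

```python
import numpy as np

def log_likelihood(v, f, data, labels):
    """L(v) = sum_i [ v . f(x_i, y_i) - log sum_{y'} exp(v . f(x_i, y')) ]."""
    ll = 0.0
    for x, y in data:
        scores = np.array([np.dot(v, f(x, yp)) for yp in labels])
        ll += scores[labels.index(y)] - np.logaddexp.reduce(scores)
    return ll

def gradient_ascent(f, data, labels, dim, iters=100):
    v = np.zeros(dim)                                   # v^(0) = 0
    for _ in range(iters):
        delta = gradient(v, f, data, labels)            # dL/dv at v^(t)
        # crude line search over beta: try a few step sizes, keep the best
        best_beta, best_ll = 0.0, log_likelihood(v, f, data, labels)
        for beta in (1.0, 0.5, 0.1, 0.01):
            ll = log_likelihood(v + beta * delta, f, data, labels)
            if ll > best_ll:
                best_beta, best_ll = beta, ll
        if best_beta == 0.0:
            break                                       # no improving step: converged
        v = v + best_beta * delta                       # v^(t+1) = v^(t) + beta * delta
    return v
```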

Learning the weights v: Generalized Iterative Scaling
$f^\# = \max_{x,y} \sum_j f_j(x, y)$ (the maximum possible total feature value; needed for scaling)
Initialize $v^{(0)}$
For each iteration t:
  expected[j] ← 0 for j = 1 .. number of features
  For i = 1 to n (training examples):
    For each label y and each feature $f_j$:
      expected[j] += $f_j(x_i, y) \cdot P(y \mid x_i; v^{(t)})$
  For each feature $f_j$:
    observed[j] = $\sum_{(x,y) \in \text{training data}} f_j(x, y) \cdot c(x, y)$
  For each feature $f_j$:
    $v^{(t+1)}_j \leftarrow v^{(t)}_j + \frac{1}{f^\#} \log \frac{\text{observed}[j]}{\text{expected}[j]}$
(cf. Goodman, NIPS '01)
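
A rough sketch of GIS under the same assumptions (dense, non-negative NumPy feature vectors; expected counts computed with the current model as on the gradient slide). This is illustrative only, not a tuned implementation:

```python
import numpy as np

def gis(f, data, labels, dim, iters=50, eps=1e-10):
    # f# = max over (x, y) of sum_j f_j(x, y); GIS assumes non-negative features
    f_sharp = max(f(x, yp).sum() for x, _ in data for yp in labels)
    v = np.zeros(dim)
    observed = sum(f(x, y) for x, y in data)            # observed[j]; fixed across iterations
    for _ in range(iters):
        expected = np.zeros(dim)
        for x, _ in data:
            scores = np.array([np.dot(v, f(x, yp)) for yp in labels])
            probs = np.exp(scores - scores.max())
            probs /= probs.sum()                        # P(y | x_i; v^(t))
            for yp, p in zip(labels, probs):
                expected += p * f(x, yp)                # expected[j]
        # v_j^(t+1) = v_j^(t) + (1 / f#) * log(observed[j] / expected[j]); eps avoids log(0)
        v = v + (1.0 / f_sharp) * np.log((observed + eps) / (expected + eps))
    return v
```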

Acknowledgements
Many slides are borrowed from or inspired by the lecture notes of Michael Collins, Chris Dyer, Kevin Knight, Philipp Koehn, Adam Lopez, Graham Neubig and Luke Zettlemoyer from their NLP course materials. All mistakes are my own. A big thank you to all the students who read through these notes and helped me improve them.