Natural Language Processing

1 Natural Language Processing Global linear models Based on slides from Michael Collins

2 Globally-normalized models Why do we decompose into a sequence of decisions? Can we directly estimate the probability of an entire sequence?
p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in \mathcal{Y}} \exp(f(x, y')^\top \theta)}
where \mathcal{Y} is the set of all tag sequences for x.

3 Global linear models Motivations: optimize directly what you care about; flexibility in the feature function; avoid the label bias problem. Question: how do we compute the denominator?

4 Label bias problem Consider a history-based model for two-token names that are either PERSON or LOC. States: b-person, e-person, b-loc, e-loc, other.

5 Label bias problem Corpus: Harvey Ford (person 9 times, location 1 time); Harvey Park (location 9 times, person 1 time); Myrtle Ford (person 9 times, location 1 time); Myrtle Park (location 9 times, person 1 time). [Figure: state-transition diagram over other, b-person, e-person, b-loc, e-loc] The second word provides good information.

6 Label bias problem Conditional probabilities: p(b-person | other, w = Harvey) = 0.5, p(b-loc | other, w = Harvey) = 0.5, p(b-person | other, w = Myrtle) = 0.5, p(b-loc | other, w = Myrtle) = 0.5, p(e-person | b-person, w = Ford) = 1, p(e-person | b-person, w = Park) = 1, p(e-loc | b-loc, w = Ford) = 1, p(e-loc | b-loc, w = Park) = 1. [Figure: state-transition diagram over other, b-person, e-person, b-loc, e-loc] Information from the second word is lost.
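
To see the problem numerically, here is a tiny sketch (my own illustration using the probabilities above, not part of the slides) that scores both tag sequences for each name; because every local distribution already sums to one, the informative second word cannot change the ranking:

```python
# Minimal illustration of the label bias problem with the slide's probabilities.
# Locally normalized (history-based) conditional probabilities:
p_first = {("b-person", "Harvey"): 0.5, ("b-loc", "Harvey"): 0.5,
           ("b-person", "Myrtle"): 0.5, ("b-loc", "Myrtle"): 0.5}
# Once the first tag is chosen, the next transition is deterministic,
# regardless of whether the second word is "Ford" or "Park":
p_second = {("e-person", "b-person", "Ford"): 1.0, ("e-person", "b-person", "Park"): 1.0,
            ("e-loc", "b-loc", "Ford"): 1.0, ("e-loc", "b-loc", "Park"): 1.0}

def sequence_prob(tags, words):
    """p(tags | words) under the locally normalized model."""
    (t1, t2), (w1, w2) = tags, words
    return p_first[(t1, w1)] * p_second[(t2, t1, w2)]

for words in [("Harvey", "Ford"), ("Harvey", "Park")]:
    person = sequence_prob(("b-person", "e-person"), words)
    loc = sequence_prob(("b-loc", "e-loc"), words)
    print(words, "PERSON:", person, "LOC:", loc)
# Both sequences get probability 0.5 for either second word:
# the informative second word cannot break the tie.
```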

7 Label bias problem [Figure: a lattice of five states over four observations] What the local transition probabilities say: state 1 prefers to go to state 2; state 2 prefers to stay in state 2.

8 Label bias problem [Figure: the same lattice] Probability of path 1->1->1->1: 0.4 x 0.45 x 0.5 = 0.09

9 Label bias problem [Figure: the same lattice] Probability of path 2->2->2->2: 0.2 x 0.3 x 0.3 = 0.018. Other paths: 1->1->1->1: 0.09

10 Label bias problem [Figure: the same lattice] Probability of path 1->2->1->2: 0.6 x 0.2 x 0.5 = 0.06. Other paths: 1->1->1->1: 0.09; 2->2->2->2: 0.018

11 Label bias problem [Figure: the same lattice] Probability of path 1->1->2->2: 0.4 x 0.55 x 0.3 = 0.066. Other paths: 1->1->1->1: 0.09; 2->2->2->2: 0.018; 1->2->1->2: 0.06

12 Label bias problem [Figure: the same lattice] Most likely path: 1->1->1->1, although locally it seems state 1 wants to go to state 2 and state 2 wants to remain in state 2.

13 Label bias problem The model prefers states that have low-entropy next-state distributions.

14 Global linear models Main difference: we don't look at a history of decisions; instead, we define features over the full structure. More flexible! We will see this in parsing. But we directly consider the exponential output space.

15 Global linear models A global linear model contains three components: a feature function f mapping a pair (x, y) to a feature vector f(x, y); a generating function GEN enumerating all candidate outputs; and a parameter vector \theta.
F(x) = \arg\max_{y \in GEN(x)} f(x, y)^\top \theta
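
As a minimal sketch of how the three components fit together (the feature function, GEN, and weights here are hypothetical arguments, and the brute-force search over GEN is only for illustration; real decoding uses Viterbi, as discussed below):

```python
import itertools

def gen(x, tagset):
    """GEN: enumerate all candidate tag sequences for sentence x (exponentially many!)."""
    return itertools.product(tagset, repeat=len(x))

def score(x, y, f, theta):
    """Linear score f(x, y) . theta, where f returns a dict of feature counts."""
    return sum(theta.get(name, 0.0) * value for name, value in f(x, y).items())

def predict(x, tagset, f, theta):
    """F(x) = argmax_{y in GEN(x)} f(x, y) . theta  (brute force, illustration only)."""
    return max(gen(x, tagset), key=lambda y: score(x, y, f, theta))
```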

16 Tagging: GEN Inputs: sentences x = w_1 ... w_n. Tag set: T. GEN(x) = T^n. In other setups, GEN may produce: all parse trees; the top-K parse trees produced by a weaker model; all translations.

17 Tagging: feature function A history h is a 4-tuple <t_{i-2}, t_{i-1}, w, i>: t_{i-2}, t_{i-1} are the previous two tags, w is the input sentence, and i is the index of the word being tagged. Example: Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/?? from which Spain expanded its empire into the rest of the Western Hemisphere. Here t_{i-2}, t_{i-1} = DT, JJ; w = <Hispaniola, ..., Hemisphere>; i = 6.

18 Local features For every history/tag pair (h, t), g_s(h, t) for s = 1, ..., d are local features for tagging decision t given history h:
g_100(h, t) = 1 if w_i = "base" and t = VB, 0 otherwise
g_101(h, t) = 1 if w_i ends with "ing" and t = VBG, 0 otherwise
g_102(h, t) = 1 if (t_{i-2}, t_{i-1}, t) = (DT, JJ, VB), 0 otherwise

19 Multiple local features Hispaniola/NNP quickly/RB became/VB an/DT important/JJ base/NN
History <t_{i-2}, t_{i-1}, w, i> and tag t at each position (w is the input sentence):
i = 1: <*, *, w, 1>, t = NNP
i = 2: <*, NNP, w, 2>, t = RB
i = 3: <NNP, RB, w, 3>, t = VB
i = 4: <RB, VB, w, 4>, t = DT
i = 5: <VB, DT, w, 5>, t = JJ
i = 6: <DT, JJ, w, 6>, t = NN

20 Global feature function The global feature function is
f_j(x, y) = \sum_i g_j(h_i, t_i), i.e. f(x, y) = \sum_i g(h_i, t_i)
Features are now counts rather than binary, e.g. how many times a word ending in "ing" was tagged as VBG.
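
A small sketch of local indicator features and of the global count vector obtained by summing them over positions (the feature names are hypothetical, not the course's exact feature set):

```python
from collections import Counter

def local_features(x, y, i):
    """g(h_i, t_i): binary indicator features for tagging decision i."""
    t2 = y[i - 2] if i >= 2 else "*"
    t1 = y[i - 1] if i >= 1 else "*"
    word, tag = x[i], y[i]
    feats = Counter()
    feats[f"word={word},tag={tag}"] = 1
    if word.endswith("ing") and tag == "VBG":
        feats["suffix=ing,tag=VBG"] = 1
    feats[f"trigram={t2},{t1},{tag}"] = 1
    return feats

def global_features(x, y):
    """f(x, y) = sum_i g(h_i, t_i): features become counts over the whole sequence."""
    total = Counter()
    for i in range(len(x)):
        total.update(local_features(x, y, i))
    return total

# Example: how many times a word ending in "ing" was tagged VBG in this sequence.
x = ["Spain", "was", "expanding", "its", "empire"]
y = ["NNP", "VBD", "VBG", "PRP$", "NN"]
print(global_features(x, y)["suffix=ing,tag=VBG"])  # -> 1
```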

21 Global log-linear model
p_\theta(y \mid x) = \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)}
             = \frac{\exp\big(\sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta\big)}{\sum_{y' \in GEN(x)} \exp\big(\sum_i g(x, i, y'_{i-2}, y'_{i-1}, y'_i)^\top \theta\big)}

22 Decoding As long as the feature function decomposes locally, we can still apply Viterbi; otherwise we can do greedy/beam decoding.

23 Decoding Find the structure that maximizes the linear score:
\arg\max_y p_\theta(y \mid x) = \arg\max_y \frac{\exp(f(x, y)^\top \theta)}{\sum_{y' \in GEN(x)} \exp(f(x, y')^\top \theta)}
 = \arg\max_y \exp(f(x, y)^\top \theta)
 = \arg\max_y f(x, y)^\top \theta
 = \arg\max_y \sum_i g(x, i, y_{i-2}, y_{i-1}, y_i)^\top \theta

24 Viterbi still works! The dependencies did not change. Definition: Y_i is the set of possible tags in position i.
Base case: \pi(1, *, y) = g(x, 1, *, *, y)^\top \theta
For all i \in \{2 \dots n\}, for all u \in Y_{i-1}, v \in Y_i:
\pi(i, u, v) = \max_{t \in Y_{i-2}} \big[ \pi(i-1, t, u) + g(x, i, t, u, v)^\top \theta \big]
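
As a rough sketch, here is Viterbi for a bigram-factored score (the slide's recursion carries one extra tag of history but has the same shape); the score argument is a hypothetical callable standing in for g(x, i, ., .)^\top \theta, and positions are 0-indexed:

```python
def viterbi(x, tagset, score):
    """Find argmax_y sum_i score(x, i, y[i-1], y[i]) for a bigram-factored score.
    score(x, i, prev_tag, tag) is any local scoring function (e.g. g(...) . theta)."""
    n = len(x)
    # pi[i][t] = best score of any tag prefix of length i+1 ending in tag t
    pi = [{t: score(x, 0, "*", t) for t in tagset}]
    back = [{}]
    for i in range(1, n):
        pi.append({})
        back.append({})
        for t in tagset:
            best_prev = max(tagset, key=lambda u: pi[i - 1][u] + score(x, i, u, t))
            pi[i][t] = pi[i - 1][best_prev] + score(x, i, best_prev, t)
            back[i][t] = best_prev
    # Follow back-pointers from the best final tag.
    last = max(tagset, key=lambda t: pi[n - 1][t])
    tags = [last]
    for i in range(n - 1, 0, -1):
        tags.append(back[i][tags[-1]])
    return list(reversed(tags))
```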

25 Learning
L(\theta) = \sum_i \log p_\theta(y_i \mid x_i)
\nabla L_i(\theta) = f(x_i, y_i) - \sum_{y'} p_\theta(y' \mid x_i) f(x_i, y')
How do we compute the second term?

26 Learning Let's look at bigram features only:
\sum_y p_\theta(y \mid x) f(x, y) = \sum_y p_\theta(y \mid x) \sum_i g(x, i, y_{i-1}, y_i)
 = \sum_i \sum_{a,b} \sum_{y: y_{i-1}=a, y_i=b} p_\theta(y \mid x)\, g(x, i, a, b)
 = \sum_i \sum_{a,b} g(x, i, a, b) \sum_{y: y_{i-1}=a, y_i=b} p_\theta(y \mid x)
 = \sum_i \sum_{a,b} g(x, i, a, b)\, q_i(a, b)
The q terms are the probabilities that the tag sequence has a and b in positions i-1 and i.

27 Learning If we can compute the q terms efficiently, then we can compute the gradients and learn. The q terms are computed with a dynamic programming algorithm called forward-backward, which is also used in unsupervised parameter estimation for HMMs and is similar to Viterbi.

28 Forward-backward Definitions:
\psi(y', y, j) = \exp(g(x, j, y', y)^\top \theta)
\psi(y_1, \dots, y_m) = \prod_{j=1}^m \psi(y_{j-1}, y_j, j) = \prod_{j=1}^m \exp(g(x, j, y_{j-1}, y_j)^\top \theta) = \exp\big(\sum_{j=1}^m g(x, j, y_{j-1}, y_j)^\top \theta\big)
p(y_1, \dots, y_m \mid x) = \frac{\psi(y_1, \dots, y_m)}{\sum_{z_1, \dots, z_m} \psi(z_1, \dots, z_m)}

29 Forward-backward What will be computed?
Z = \sum_{y_1, \dots, y_m} \psi(y_1, \dots, y_m)
\mu(j, a) = \sum_{y_1, \dots, y_m: y_j = a} \psi(y_1, \dots, y_m)
\mu(j, a, b) = \sum_{y_1, \dots, y_m: y_j = a, y_{j+1} = b} \psi(y_1, \dots, y_m)
q_j(a, b) = \frac{\mu(j, a, b)}{Z}

30 Forward-backward What will be computed?
Sum of the scores of all paths that end at tag y in position j:
\alpha(j, y) = \sum_{y_1, \dots, y_j: y_j = y} \prod_{k=1}^{j} \psi(y_{k-1}, y_k, k)
Sum of the scores of all paths that start at tag y in position j:
\beta(j, y) = \sum_{y_j, \dots, y_m: y_j = y} \prod_{k=j}^{m-1} \psi(y_k, y_{k+1}, k+1)

31 α terms Compute with dynamic programming:
\alpha(1, y) = \psi(*, y, 1)
for j \in \{2 \dots m\}, y \in Y: \alpha(j, y) = \sum_{y' \in Y} \alpha(j-1, y')\, \psi(y', y, j)
[Figure: a lattice over tags y_1, y_2, y_3 with the first column filled in: \alpha(1, y_1) = \psi(*, y_1, 1), \alpha(1, y_2) = \psi(*, y_2, 1), \alpha(1, y_3) = \psi(*, y_3, 1)]

32 α terms [Figure: the same lattice, now with the \alpha(2, \cdot) column filled in]
\alpha(2, y_1) = \alpha(1, y_1)\psi(y_1, y_1, 2) + \alpha(1, y_2)\psi(y_2, y_1, 2) + \alpha(1, y_3)\psi(y_3, y_1, 2)
             = \psi(*, y_1, 1)\psi(y_1, y_1, 2) + \psi(*, y_2, 1)\psi(y_2, y_1, 2) + \psi(*, y_3, 1)\psi(y_3, y_1, 2)
\alpha(2, y_2) = \psi(*, y_1, 1)\psi(y_1, y_2, 2) + \psi(*, y_2, 1)\psi(y_2, y_2, 2) + \psi(*, y_3, 1)\psi(y_3, y_2, 2)
\alpha(2, y_3) = \psi(*, y_1, 1)\psi(y_1, y_3, 2) + \psi(*, y_2, 1)\psi(y_2, y_3, 2) + \psi(*, y_3, 1)\psi(y_3, y_3, 2)

33 α terms The distributive law works. [Figure: the same lattice, with the \alpha(3, \cdot) column filled in]
\alpha(3, y_1) = \alpha(2, y_1)\psi(y_1, y_1, 3) + \alpha(2, y_2)\psi(y_2, y_1, 3) + \alpha(2, y_3)\psi(y_3, y_1, 3),
which expands into the sum of \psi(*, y_a, 1)\psi(y_a, y_b, 2)\psi(y_b, y_1, 3) over all nine combinations of y_a, y_b \in \{y_1, y_2, y_3\}, i.e. the scores of all length-3 prefixes ending in y_1.

34 β terms
\beta(m, y) = 1
for j \in \{m-1 \dots 1\}, y \in Y: \beta(j, y) = \sum_{y' \in Y} \beta(j+1, y')\, \psi(y, y', j+1)
Use the distributive property again to get the final results:
Z = \sum_{y \in Y} \alpha(m, y)
\mu(j, a) = \alpha(j, a)\, \beta(j, a)
\mu(j, a, b) = \alpha(j, a)\, \psi(a, b, j+1)\, \beta(j+1, b)
q_j(a, b) = \frac{\mu(j, a, b)}{Z}
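
A direct transcription of these recursions into code as a sketch; psi(prev, cur, j) is an assumed callable returning the potential \exp(g(x, j, prev, cur)^\top \theta), with prev = "*" at j = 1, and a real implementation would work in log space to avoid overflow:

```python
def forward_backward(m, tagset, psi):
    """Compute alpha, beta, Z, and the pairwise marginals q_j(a, b) (1-indexed positions)."""
    # Forward: alpha[j][y] = sum of scores of all paths ending with tag y at position j.
    alpha = [dict() for _ in range(m + 1)]
    for y in tagset:
        alpha[1][y] = psi("*", y, 1)
    for j in range(2, m + 1):
        for y in tagset:
            alpha[j][y] = sum(alpha[j - 1][yp] * psi(yp, y, j) for yp in tagset)
    # Backward: beta[j][y] = sum of scores of all paths starting with tag y at position j.
    beta = [dict() for _ in range(m + 1)]
    for y in tagset:
        beta[m][y] = 1.0
    for j in range(m - 1, 0, -1):
        for y in tagset:
            beta[j][y] = sum(beta[j + 1][yp] * psi(y, yp, j + 1) for yp in tagset)
    Z = sum(alpha[m][y] for y in tagset)
    # q_j(a, b) = alpha(j, a) * psi(a, b, j+1) * beta(j+1, b) / Z
    q = {(j, a, b): alpha[j][a] * psi(a, b, j + 1) * beta[j + 1][b] / Z
         for j in range(1, m) for a in tagset for b in tagset}
    return alpha, beta, Z, q
```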

35 Complexity If the length of the sentence is m, the runtime is O(m|Y|^2). We can compute the gradient, apply SGD, and learn. Limitation: the features must decompose.

36 Structured perceptron Similar to CRFs, but needs only Viterbi. For every training example (x, y): find the best y' according to the model (Viterbi); if y' differs from the gold y, update the weights by adding the features of y and subtracting the features of y'. Regularization is needed (averaged perceptron).
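
A minimal sketch of the training loop, assuming hypothetical viterbi_decode(x, tagset, theta) and global_features(x, y) helpers in the spirit of the earlier sketches; the weight averaging that the slide recommends is omitted for brevity:

```python
from collections import defaultdict

def perceptron_train(data, tagset, global_features, viterbi_decode, epochs=5):
    """Structured perceptron: only needs argmax decoding, no partition function.
    data is a list of (words, gold_tags) pairs."""
    theta = defaultdict(float)
    for _ in range(epochs):
        for x, gold in data:
            pred = viterbi_decode(x, tagset, theta)   # best y' under current weights
            if list(pred) != list(gold):
                # Add features of the gold structure, subtract features of the prediction.
                for name, value in global_features(x, gold).items():
                    theta[name] += value
                for name, value in global_features(x, pred).items():
                    theta[name] -= value
    return theta
```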

37 Global vs. Local models
p_L(y_i \mid x, y_1 \dots y_{i-1}) = \frac{\exp(s(x, y_1 \dots y_i))}{Z_L(x, y_1 \dots y_{i-1})}
Z_L(x, y_1 \dots y_{i-1}) = \sum_t \exp(s(x, y_1 \dots y_{i-1}, t))
p_L(y_1 \dots y_n \mid x) = \frac{\exp\big(\sum_{i=1}^n s(x, y_1 \dots y_i)\big)}{\prod_{i=1}^n Z_L(x, y_1 \dots y_{i-1})}
p_G(y_1 \dots y_n \mid x) = \frac{\exp\big(\sum_{i=1}^n s(x, y_1 \dots y_i)\big)}{Z_G(x)}
Z_G(x) = \sum_{y_1 \dots y_n} \exp\big(\sum_{i=1}^n s(x, y_1 \dots y_i)\big)
It can be shown that some distributions can only be expressed with p_G.

38 Graphical models Viterbi and forward-backward are instances of the max-product and sum-product algorithms for inference in graphical models (see an advanced ML class). [Figure: graphical models for an HMM (Y_1 ... Y_n with emissions X_1 ... X_n), an MEMM (Y_1 ... Y_n conditioned on X_{1:n}), and a CRF (Y_1 ... Y_n with the whole x_{1:n})]
CRF: p(y) = \frac{1}{Z} \prod_{i=1}^n \psi(x, y_{i-1}, y_i)

39 Deep learning models for tagging

40 Deep learning model We have a more powerful learning model Can we make things better? simpler?

41 Ling et al., 2015 A simple POS-tagging model:
x_i: one-hot representation of word i
e_i = W^{emb} x_i: word embedding
h^f_i = LSTM(h^f_{i-1}, e_i),  h^b_i = LSTM(h^b_{i+1}, e_i)
c_i = \sigma(W [h^f_i; h^b_i] + b)  (\sigma a nonlinearity)
p(y_i \mid x) = softmax(W^s c_i + b)
L(\theta) = \sum_i \log p(y_i = y^*_i \mid x)  (y^*_i the gold tag)
All tag decisions are independent! Works OK for POS tagging (accuracy comparable to a linear CRF), but doesn't work as well for NER tagging. Why?
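
A rough PyTorch sketch in the spirit of this model (the layer sizes, names, and the tanh nonlinearity are my assumptions, not the paper's exact architecture):

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Independent per-token tag classifier over Bi-LSTM states (a sketch)."""
    def __init__(self, vocab_size, num_tags, emb_dim=100, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)                 # e_i = W_emb x_i
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True,
                              batch_first=True)                      # h_f, h_b
        self.hidden = nn.Linear(2 * hidden_dim, hidden_dim)          # c_i
        self.out = nn.Linear(hidden_dim, num_tags)                   # per-position tag scores

    def forward(self, word_ids):              # word_ids: (batch, seq_len)
        states, _ = self.bilstm(self.emb(word_ids))
        c = torch.tanh(self.hidden(states))
        return self.out(c)                    # logits; one independent softmax per position

# Training uses a per-token cross-entropy loss: every tag decision is independent.
# loss = nn.CrossEntropyLoss()(logits.view(-1, num_tags), gold_tags.view(-1))
```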

42 Bi-RNNs (Bi-LSTMs) Provide a representation of a word in context and have become the de facto standard contextual word representation for NLP tasks.

43 Lample et al., 2016 Greedy taggers: a history x is a tuple <t_1 ... t_{i-1}, w, i>; predict t_i from the history, p(t_i \mid x) \propto \exp(\mathrm{score}(x, t_i)), where in the log-linear case score(x, t_i) = f(x, t_i)^\top \theta. Just replace the log-linear score with a non-linear neural network: we need to define a neural network that reads the history and produces a label.

44 Neural structured prediction
s(x, y) = f(x, y)^\top \theta = \sum_i g(x, i, y_{i-1}, y_i)^\top \theta = s(x, 1, y_1, y_2) + s(x, 2, y_2, y_3) + s(x, 3, y_3, y_4)
Replace the linear score with a non-linear neural network:
s(x, y) = \sum_i NN(x, i, y_{i-1}, y_i)
[Figure: a chain over y_1, y_2, y_3, y_4]

45 Neural structured prediction What has changed? Decoding: as long as the neural network scores decompose, we can still apply Viterbi. Learning:
p_\theta(y \mid x) = \frac{\exp\big(\sum_i NN_\theta(x, i, y_{i-1}, y_i)\big)}{\sum_{y'} \exp\big(\sum_i NN_\theta(x, i, y'_{i-1}, y'_i)\big)}
L^{(i)}(\theta) = \log p_\theta(y^{(i)} \mid x^{(i)})

46 Neural structured prediction How do we compute the denominator (partition function)? Implement dynamic programming as part of the neural network: all dynamic programming operations are differentiable (+ and x). In the structured perceptron things are slightly simpler: compute the best structure y' (outside of the network) and define the loss max(0, max_{y'} score(x, y') - score(x, y)).

47 Lample et al., 2016 Bi-LSTM CRF for NER:
x_i: one-hot representation of word i
e_i = W^{emb} x_i: word embedding
l_i = LSTM(l_{i-1}, e_i),  r_i = LSTM(r_{i+1}, e_i)
c_i = [l_i; r_i]
p_i = W^{(proj)} (W c_i + b)
s(x, y) = \sum_{i=1}^n p_{i, y_i} + \sum_{i=0}^n A_{y_i, y_{i+1}}
A is a learned transition matrix (#tags^2 parameters). SOTA or close to it on POS tagging, NER, and other tasks.
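
To make the CRF layer concrete, here is a small NumPy sketch (my own simplification, without the start/stop tags used in the paper) of the sequence score and of the forward-algorithm partition function computed in log space:

```python
import numpy as np

def sequence_score(emissions, A, tags):
    """s(x, y) = sum_i P[i, y_i] + sum_i A[y_i, y_{i+1}]  (no start/stop symbols here).
    emissions: (seq_len, num_tags) Bi-LSTM scores; A: (num_tags, num_tags); tags: list of ids."""
    emit = sum(emissions[i, t] for i, t in enumerate(tags))
    trans = sum(A[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return emit + trans

def log_partition(emissions, A):
    """log Z(x) via the forward algorithm in log space (logsumexp recursion)."""
    log_alpha = emissions[0]                        # shape: (num_tags,)
    for i in range(1, emissions.shape[0]):
        # log_alpha_new[b] = logsumexp_a(log_alpha[a] + A[a, b]) + emissions[i, b]
        scores = log_alpha[:, None] + A             # (num_tags, num_tags)
        m = scores.max(axis=0)
        log_alpha = m + np.log(np.exp(scores - m).sum(axis=0)) + emissions[i]
    m = log_alpha.max()
    return m + np.log(np.exp(log_alpha - m).sum())

# Training maximizes log p(y | x) = sequence_score(...) - log_partition(...).
```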

48 State-of-the-art To match or outperform the state of the art: regularization techniques, character-based information, and pre-training with a lot of data.

49 Summary (tagging / parsing). Generative: HMMs / PCFGs. Greedy log-linear (history-based): greedy tagging / transition-based parsing. Decoding: Viterbi / CKY. Global learning: forward-backward / inside-outside.
