CS395T: Structured Models for NLP Lecture 5: Sequence Models II

1 CS395T: Structured Models for NLP Lecture 5: Sequence Models II Greg Durrett Some slides adapted from Dan Klein, UC Berkeley

2 Recall: HMMs. Input $x = (x_1, \ldots, x_n)$, output $y = (y_1, \ldots, y_n)$. [Bayes net: $y_1 \rightarrow y_2 \rightarrow \cdots \rightarrow y_n$, with each $y_i$ emitting $x_i$.]
$P(y, x) = P(y_1) \prod_{i=2}^{n} P(y_i \mid y_{i-1}) \prod_{i=1}^{n} P(x_i \mid y_i)$
Training: maximum likelihood estimation (with smoothing). Inference problem: $\operatorname{argmax}_y P(y \mid x) = \operatorname{argmax}_y P(y, x)/P(x)$. Exponentially many possible $y$ here! Viterbi: $\operatorname{score}_i(s) = \max_{y_{i-1}} P(s \mid y_{i-1})\, P(x_i \mid s)\, \operatorname{score}_{i-1}(y_{i-1})$
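As a concrete reference for the recurrence above, here is a minimal Viterbi sketch in log space (not from the slides; assumes numpy and that the HMM tables are given as log-probability arrays):

```python
import numpy as np

def viterbi(init, trans, emit, obs):
    """Most likely tag sequence under an HMM, computed in log space.

    init[s]      = log P(y_1 = s)
    trans[s_, s] = log P(y_i = s | y_{i-1} = s_)
    emit[s, o]   = log P(x_i = o | y_i = s)
    obs          = observation indices x_1..x_n
    """
    n, S = len(obs), init.shape[0]
    score = np.full((n, S), -np.inf)    # score[i, s]: best log prob of a prefix ending in tag s
    back = np.zeros((n, S), dtype=int)  # backpointers for recovering the argmax
    score[0] = init + emit[:, obs[0]]
    for i in range(1, n):
        for s in range(S):
            cand = score[i - 1] + trans[:, s] + emit[s, obs[i]]
            back[i, s] = np.argmax(cand)
            score[i, s] = cand[back[i, s]]
    best = [int(np.argmax(score[n - 1]))]  # follow backpointers from the best final tag
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```

Working in log space turns the products of the recurrence into sums and avoids underflow on long sentences.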

3 This Lecture: generative vs. discriminative models; CRFs for sequence modeling; named entity recognition (NER); structured SVM (if time); beam search.

4 Named Entity Recognition.
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou today for the G20 meeting.
(PERSON, LOC, and ORG spans.) BIO tagset: begin, inside, outside. POS tagging is a plausible generative model of language; NER with this vanilla tag set is not. What's different about modeling P(y|x) directly vs. modeling P(x,y) and computing the posterior later?

5 Generative vs. Discriminative Models (slide credit: Dan Klein)

6 Generative vs. Discriminative Models (slide credit: Dan Klein)

7 Generative vs. Discriminative Models. What does the model say when both lights are red? "Lights are working": wrong! (slide credit: Dan Klein)

8 Generative vs. Discriminative Models. What if P(b) were 1/2 instead of 1/7 (the NB estimate)? "Lights are broken": correct! Data likelihood P(x,y) is lower, but the posterior P(y|x) is more accurate. (slide credit: Dan Klein)

9 Conditional Random Fields. HMMs are expressible as Bayes nets (factor graphs): [chain $y_1, \ldots, y_n$ with emissions $x_1, \ldots, x_n$]. This reflects the following decomposition: $P(y, x) = P(y_1) P(x_1 \mid y_1) P(y_2 \mid y_1) P(x_2 \mid y_2) \cdots$ Locally normalized model: each factor is a probability distribution that normalizes.

10 Conditional Random Fields. HMMs: $P(y, x) = P(y_1) P(x_1 \mid y_1) P(y_2 \mid y_1) P(x_2 \mid y_2) \cdots$ CRFs: discriminative models with the following globally-normalized form:
$P(y \mid x) = \frac{1}{Z} \prod_k \exp(\phi_k(x, y))$
where $Z$ is the normalizer and each $\phi_k$ is any real-valued scoring function of its arguments. Naive Bayes : logistic regression :: HMMs : CRFs; local vs. global normalization corresponds to generative vs. discriminative. How do we max over y? Intractable in general; can we fix this?

11 Sequential CRFs. HMMs: $P(y, x) = P(y_1) P(x_1 \mid y_1) P(y_2 \mid y_1) P(x_2 \mid y_2) \cdots$ [Bayes net as before.] CRFs: $P(y \mid x) \propto \prod_k \exp(\phi_k(x, y))$. [Factor graph: an initial factor $\phi_o$ on $y_1$, transition factors $\phi_t$ between adjacent tags, emission factors $\phi_e$ linking each $y_i$ to $x_i$.]
$P(y \mid x) \propto \exp(\phi_o(y_1)) \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(x_i, y_i))$

12 Sequential CRFs. We condition on x, so every factor can depend on all of x: replace $\phi_e(x_i, y_i)$ with $\phi_e(y_i, i, x)$, where the token index lets us look at the current word. (x can't depend arbitrarily on y in a generative model; that would make inference hard.)

13 Sequential CRFs. In fact, we typically don't show x at all. Don't include the initial distribution; it can be baked into the other factors. Sequential CRFs:
$P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$

14 Computing (arg)maxes. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
$\operatorname{argmax}_y P(y \mid x)$: can use Viterbi exactly as in the HMM case.
$\max_{y_1, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots e^{\phi_t(y_1, y_2)} e^{\phi_e(y_2, 2, x)} e^{\phi_e(y_1, 1, x)}$
$= \max_{y_2, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots e^{\phi_e(y_2, 2, x)} \max_{y_1} e^{\phi_t(y_1, y_2)} \underbrace{e^{\phi_e(y_1, 1, x)}}_{\operatorname{score}_1(y_1)}$
$= \max_{y_3, \ldots, y_n} e^{\phi_t(y_{n-1}, y_n)} e^{\phi_e(y_n, n, x)} \cdots \max_{y_2} e^{\phi_t(y_2, y_3)} e^{\phi_e(y_2, 2, x)} \max_{y_1} e^{\phi_t(y_1, y_2)} \operatorname{score}_1(y_1)$
$\exp(\phi_t)$ and $\exp(\phi_e)$ play the role of the Ps now; same dynamic program.

15 Computing Marginals. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
Normalizing constant $Z = \sum_y \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$, analogous to P(x) for HMMs.
For both HMMs and CRFs: $P(y_i = s \mid x) = \frac{\operatorname{forward}_i(s)\, \operatorname{backward}_i(s)}{\sum_{s'} \operatorname{forward}_i(s')\, \operatorname{backward}_i(s')}$; for HMMs, $\operatorname{forward}_i(s)\operatorname{backward}_i(s) = P(y_i = s, x)$. The products sum out the other y's, and the denominator equals Z for CRFs, P(x) for HMMs.
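A minimal sketch of this computation (not the course's reference implementation; assumes numpy/scipy and dense score matrices):

```python
import numpy as np
from scipy.special import logsumexp

def tag_marginals(phi_t, phi_e):
    """Tag marginals P(y_i = s | x) for a sequential CRF, in log space.

    phi_t[s_, s] = transition score phi_t(y_{i-1} = s_, y_i = s)
    phi_e[i, s]  = emission score phi_e(y_i = s, i, x)
    """
    n, S = phi_e.shape
    alpha = np.zeros((n, S))  # alpha[i, s]: log-sum of scores of prefixes ending in tag s
    beta = np.zeros((n, S))   # beta[i, s]:  log-sum of scores of suffixes after tag s at i
    alpha[0] = phi_e[0]
    for i in range(1, n):
        alpha[i] = phi_e[i] + logsumexp(alpha[i - 1][:, None] + phi_t, axis=0)
    for i in range(n - 2, -1, -1):
        beta[i] = logsumexp(phi_t + (phi_e[i + 1] + beta[i + 1])[None, :], axis=1)
    log_Z = logsumexp(alpha[n - 1])
    return np.exp(alpha + beta - log_Z)  # each row sums to 1
```

This is forward-backward with sums replaced by log-sum-exps for numerical stability; `alpha` and `beta` are the forward and backward tables from the slide.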

16 Inference in General CRFs. Can do inference in any tree-structured CRF. Sum-product algorithm: a generalization of forward-backward to arbitrary tree-structured graphs. We'll come back to this in a few lectures when we deal with other kinds of graphs.

17 Feature Functions. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
The phis can have sophisticated features! They generally look like linear models: $\phi_e(y_i, i, x) = w^\top f_e(y_i, i, x)$ and $\phi_t(y_{i-1}, y_i) = w^\top f_t(y_{i-1}, y_i)$, so
$P(y \mid x) \propto \exp\!\left(w^\top \left[\sum_{i=2}^{n} f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} f_e(y_i, i, x)\right]\right)$
Log-linear model, structurally like logistic regression!

18 Training CRFs. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
Assume $\phi_t$ and $\phi_e$ are both linear feature functions $w^\top f(\text{args})$:
$\mathcal{L}(y, x) = \log P(y \mid x) = \sum_{i=2}^{n} w^\top f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} w^\top f_e(y_i, i, x) - \log Z$
The gradient is gold features minus expected features under the model, as in logistic regression:
$\frac{\partial}{\partial w_j} \mathcal{L}(y, x) = \sum_{i=2}^{n} f_{t,j}(y_{i-1}, y_i) + \sum_{i=1}^{n} f_{e,j}(y_i, i, x) - \mathbb{E}_{y'}\!\left[\sum_{i=2}^{n} f_{t,j}(y'_{i-1}, y'_i) + \sum_{i=1}^{n} f_{e,j}(y'_i, i, x)\right]$
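In code, the emission part of this gradient might look like the following sketch (hedged: the `feats_e` helper returning active feature names and the `marginals` table are illustrative assumptions, not the assignment's actual API):

```python
from collections import Counter

def emission_gradient(sentence, gold_tags, tag_set, feats_e, marginals):
    """Gradient of log P(y|x) w.r.t. emission-feature weights:
    gold feature counts minus counts expected under the model.

    feats_e(tag, i, sentence) -> active feature names (hypothetical helper)
    marginals[i][tag]         -> P(y_i = tag | x) from forward-backward
    """
    grad = Counter()
    for i, gold in enumerate(gold_tags):
        for f in feats_e(gold, i, sentence):
            grad[f] += 1.0              # gold features
        for tag in tag_set:
            p = marginals[i][tag]
            for f in feats_e(tag, i, sentence):
                grad[f] -= p            # expected features
    return grad  # transition features would need pairwise marginals (next slide)
```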

19 Training CRFs. How to compute the expectations? Forward-backward helps you compute $P(y_i = s \mid x)$; take a weighted sum over all features at all tags and positions. Transition features: need to compute $P(y_i = s_1, y_{i+1} = s_2 \mid x)$, using forward-backward as well, but you can build a pretty good system without transition features.

20 Implementation Tips. Often there are many features, but only a few are active on a single sentence, even across many different labels. Maintain the gradient as a sparse vector for efficiency; Counter in utils.py is a way to do this (see the sketch below).
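For instance, a gradient step that only touches the weights whose features actually fired (a sketch; representing `weights` as a plain dict is an assumption):

```python
def sparse_sgd_step(weights, grad, lr=0.1):
    """Apply a sparse gradient (e.g. a Counter) without scanning all weights."""
    for feat, g in grad.items():
        weights[feat] = weights.get(feat, 0.0) + lr * g
```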

21 Basic Features for NER. $P(y \mid x) = \frac{1}{Z} \prod_{i=2}^{n} \exp(\phi_t(y_{i-1}, y_i)) \prod_{i=1}^{n} \exp(\phi_e(y_i, i, x))$
Example (B-LOC on "Hangzhou", O on its neighbors): Barack Obama will travel to Hangzhou today for the G20 meeting.
Transitions: $f_t(y_{i-1}, y_i) = \operatorname{Ind}[y_{i-1} \,\&\, y_i]$
Emissions: $f_e(y_6, 6, x) = \operatorname{Ind}[\text{B-LOC \& current word = Hangzhou}]$, $\operatorname{Ind}[\text{B-LOC \& previous word = to}]$

22 Features for NER. What should $\phi_e(y_i, i, x)$ pick up on?
LOC: Leicestershire is a nice place to visit
PER: Leonardo DiCaprio won an award
LOC: I took a vacation to Boston
ORG: Apple released a new version
LOC, PER: Texas governor Greg Abbott said
ORG: According to the New York Times

23 Features for NER. Word features: capitalization (Leicestershire), word shape, prefixes/suffixes (Boston), lexical indicators. Context features: words before/after (Apple released a new version; According to the New York Times), tags before/after. Word clusters. Gazetteers. A sketch of such features follows.
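A hedged sketch of emission features as string indicators (the feature names and exact set here are illustrative, not the assignment's scheme):

```python
def emission_features(tag, i, words):
    """Indicator features for labeling words[i] with `tag`."""
    w = words[i]
    prev = words[i - 1] if i > 0 else "<s>"
    nxt = words[i + 1] if i + 1 < len(words) else "</s>"
    shape = "".join("X" if c.isupper() else ("d" if c.isdigit() else "x") for c in w)
    return [
        tag + ":word=" + w,
        tag + ":cap=" + str(w[:1].isupper()),   # capitalization
        tag + ":shape=" + shape,                # word shape
        tag + ":prefix3=" + w[:3],              # prefixes/suffixes
        tag + ":suffix3=" + w[-3:],
        tag + ":prev=" + prev,                  # context words
        tag + ":next=" + nxt,
    ]
```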

24 Nonlocal Features. "The news agency Tanjug reported on the outcome of the meeting." vs. "The delegation met the president at the airport, Tanjug said." ORG or PER? Various ways to capture this information; we'll talk about this in a few lectures. Finkel and Manning (2008), Ratinov and Roth (2009)

25 Semi-Markov Models. Barack Obama | will travel to | Hangzhou | today for the | G20 | meeting. with span labels PER O LOC O ORG O. Chunk-level prediction rather than token-level BIO: y is a set of touching spans covering the sentence. Viterbi looks like looping over all spans that could lead to a given point (see the sketch below). Pros: features can look at a whole span at once. Cons: there's an extra factor of n during inference. Sarawagi and Cohen (2004)
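A sketch of that semi-Markov Viterbi loop (the `score_span` scoring hook is a hypothetical stand-in for $w^\top f$ over a whole labeled span; capping the span length bounds the extra factor of n):

```python
def semi_markov_viterbi(score_span, n, labels, max_len):
    """best[j] = score of the best segmentation of words[0:j];
    each step extends the segmentation by one whole labeled span."""
    NEG = float("-inf")
    best = [0.0] + [NEG] * n
    back = [None] * (n + 1)
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):   # all spans words[i:j] ending at j
            for lab in labels:
                s = best[i] + score_span(i, j, lab)
                if s > best[j]:
                    best[j], back[j] = s, (i, lab)
    spans, j = [], n                               # recover the labeled spans
    while j > 0:
        i, lab = back[j]
        spans.append((lab, i, j))
        j = i
    return spans[::-1]
```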

26 Evaluating NER.
B-PER I-PER O O O B-LOC O O O B-ORG O O
Barack Obama will travel to Hangzhou today for the G20 meeting.
(PERSON, LOC, ORG.) Predicting all O's still gets 66% token accuracy on this example! What we really want to know: how many named entity chunk predictions did we get right? Precision: of the chunks we predicted, how many are right? Recall: of the gold named entities, how many did we find? F-measure: harmonic mean of these two. Partial credit? Typically no, but more complex metrics exist.
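Chunk-level scoring in a minimal sketch (exact-match credit only; the BIO-to-chunk conversion is simplified, e.g. a stray I- tag with no open chunk is just dropped):

```python
def bio_to_chunks(tags):
    """Extract (label, start, end) chunks from a BIO tag sequence."""
    chunks, start, label = set(), None, None
    for i, t in enumerate(list(tags) + ["O"]):  # sentinel closes any final chunk
        if start is not None and not t.startswith("I-"):
            chunks.add((label, start, i))
            start = None
        if t.startswith("B-"):
            start, label = i, t[2:]
    return chunks

def chunk_prf(gold, pred):
    """Precision, recall, and F1 over exact chunk matches; no partial credit."""
    tp = len(gold & pred)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

On the slide's example, predicting all O's yields zero chunks, so precision and recall are both 0 even though token accuracy is 66%.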

27 Evaluating NER. [Same example as above.] More correct: an ROC-style curve with recall and precision on the axes; measure the area under the curve as a way of evaluating the system holistically.

28 How well do NER systems do? [Results from Ratinov and Roth (2009) and Lample et al. (2016).]

29 Structured SVM. CRF: $\log P(y \mid x) \propto \sum_{i=2}^{n} w^\top f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} w^\top f_e(y_i, i, x)$
We can formulate an SVM using the same features:
$w^\top f(x, y) = \sum_{i=2}^{n} w^\top f_t(y_{i-1}, y_i) + \sum_{i=1}^{n} w^\top f_e(y_i, i, x)$
Minimize $\lambda \|w\|_2^2 + \sum_{j=1}^{m} \xi_j$ subject to $\forall j\ \xi_j \ge 0$ and $\forall j\ \forall y \in \mathcal{Y}\ \; w^\top f(x_j, y_j) \ge w^\top f(x_j, y) + \ell(y, y_j) - \xi_j$

30 Structured SVM. [Same objective and constraints as above.] Exponentially large state space of y! Use Viterbi for the loss-augmented decode: same as normal Viterbi, but boost each wrong label's score by 1 (if using Hamming loss). Only need Viterbi, not forward-backward. A sketch follows.
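A hedged numpy sketch of the loss-augmented decode (assumes the per-position and transition score matrices have already been computed from $w^\top f$):

```python
import numpy as np

def loss_augmented_decode(emit, trans, gold):
    """Viterbi over raw scores, with every wrong tag's score boosted by 1
    (Hamming loss), so the decode finds the most-violated sequence.

    emit[i, s]   = w . f_e(s, i, x);  trans[s_, s] = w . f_t(s_, s)
    gold         = gold tag indices, length n
    """
    n, S = emit.shape
    aug = emit + 1.0
    aug[np.arange(n), gold] -= 1.0       # no loss bonus for the gold tag
    score = np.full((n, S), -np.inf)
    back = np.zeros((n, S), dtype=int)
    score[0] = aug[0]
    for i in range(1, n):
        cand = score[i - 1][:, None] + trans + aug[i][None, :]
        back[i] = np.argmax(cand, axis=0)
        score[i] = cand[back[i], np.arange(S)]
    best = [int(np.argmax(score[-1]))]   # follow backpointers
    for i in range(n - 1, 0, -1):
        best.append(int(back[i, best[-1]]))
    return best[::-1]
```

Dropping the augmentation (`aug = emit`) gives ordinary score-based Viterbi for the CRF.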

31 Viterbi Time Complexity. [Tag lattice over "Fed raises interest rates 0.5 percent": candidate tags such as NNP/VBD/VBN for Fed, VBZ/NNS for raises, NN for interest, VBZ/NNS for rates, CD for 0.5, NN for percent.] For an n-word sentence with s tags to consider, what is the time complexity? $O(n s^2)$: n positions, and at each one we score s current tags against s previous tags. s is ~40 for POS, n is ~20.

32 Viterbi Time Complexity. [Same lattice.] Many tags are totally implausible. Can any of these words be determiners? Prepositions? Adjectives? Features quickly eliminate many outcomes from consideration; we don't need to consider these going forward.

33 Beam Search. Maintain a beam of k plausible states at the current timestep; expand all states, then keep only the top k hypotheses at the new timestep. [Figure: beams over "Fed raises": hypotheses scored e.g. NNP +0.9, VBD +1.2, VBN +0.7, NN +0.3, with low-scoring states like DT -5.3 and PRP -5.8 not expanded; after "raises", VBZ +1.2 and NNS -1.0 survive.] O(nks) time complexity with beam size of k. A sketch follows.
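A minimal beam-search sketch over the same kind of score tables (hypotheses carry full tag prefixes here rather than merged lattice states, which keeps the code short):

```python
import heapq

def beam_search(emit, trans, k):
    """Left-to-right beam search: keep only the k best partial tag
    sequences at each position instead of all Viterbi states.

    emit[i][s]   = score of tag s at position i
    trans[s_][s] = score of tag s following tag s_
    """
    S = len(emit[0])
    beam = heapq.nlargest(k, [(emit[0][s], [s]) for s in range(S)])
    for i in range(1, len(emit)):
        expanded = [(score + trans[seq[-1]][s] + emit[i][s], seq + [s])
                    for score, seq in beam for s in range(S)]
        beam = heapq.nlargest(k, expanded)   # prune to the k best hypotheses
    return max(beam)[1]
```

With k = 1 this reduces to greedy left-to-right decoding; growing k trades time for accuracy, and a large enough beam recovers the exact argmax.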

34 How good is beam search? With a big enough beam size: always exact! It usually works well even with smaller beams. What's the case when k = 1? How about when there's no transition model? It depends on the strength of the nonlocal interactions; we'll come back to this later!

35 Implementation Tips for CRFs. Caching is your friend! Cache feature vectors especially. Try to reduce redundant computation; e.g. if you compute both the gradient and the objective value, don't rerun the dynamic program. Exploit sparsity in feature vectors where possible: the weight vector needs to be stored explicitly, but all features and gradients are typically faster to handle sparsely. Think about your data structures if things are too slow.

36 Debugging Tips for CRFs. It's hard to know whether inference, learning, or the model is broken! Compute the objective: is optimization working? Inference: check the gradient computation (the most likely place for a bug). Are expectations being computed correctly? Do probabilities normalize / do expectations look reasonable? Learning: are you applying the gradient correctly? If the objective is going down but model performance is bad: Inference: check performance when you decode the training set. Model: if dev set performance is bad, work on the features more!

37 Next Time: unsupervised sequence modeling; writing tips as you prepare your report.
