Structured Prediction with Perceptron: Theory and Algorithms
1 Structured Prediction with Perceptron: Theory and Algorithms
Kai Zhao, Dept. of Computer Science, Graduate Center, the City University of New York
[title-slide figure: the tagged sentence "the man bit the dog" (DT NN VBD DT NN), a binary classification example (y = +1 / y = -1), and the Chinese translation 那人咬了狗 ("the man bit the dog")]
2 What is Structured Prediction?
binary classification: output is binary (y = +1 or y = -1)
multiclass classification: output is a (small) number
structured classification: output is a structure (sequence, tree, graph), e.g. the tag sequence DT NN VBD DT NN for "the man bit the dog", its parse tree (S (NP DT NN) (VP VBD (NP DT NN))), or its translation 那人咬了狗
NLP is all about structured prediction! part-of-speech tagging, parsing, summarization, translation
exponentially many classes: search (inference) efficiency is crucial!
3 Learning: Unstructured vs. Structured
binary/multiclass vs. structured learning:
generative: naive Bayes (count & divide) → HMMs
conditional / discriminative: logistic regression (maxent) → CRFs (expectations)
online + Viterbi: perceptron → structured perceptron (argmax)
max margin: SVM → structured SVM (loss-augmented argmax)
4 Why Perceptron (Online Learning)?
because we want scalability on big data!
learning time has to be linear in the number of examples, and we can make only a constant number of passes over the training data
only online learning (perceptron/MIRA) can guarantee this! SVM scales between O(n^2) and O(n^3); CRF has no guarantee
and inference on each example must be super fast
another advantage of perceptron: we just need argmax
5 Perceptron: from binary to structured
binary perceptron (Rosenblatt, 1959): 2 classes, trivial exact inference
multiclass perceptron (Freund/Schapire, 1999): constant # of classes, easy exact inference
structured perceptron (Collins, 2002): exponential # of classes (e.g. all tag sequences for "the man bit the dog", all translations of 那人咬了狗), hard exact inference
in each case: find z by exact inference, update weights if z ≠ y
6 Scalability Challenges
inference (on one example) is too slow (even with DP)
can we sacrifice search exactness for faster learning?
would inexact search interfere with learning?
if so, how should we modify learning?
7 Outline
Overview of Structured Learning
Challenges in Scalability
Structured Perceptron: convergence proof
Structured Perceptron with Inexact Search
Latent-Variable Structured Perceptron
8 Generic Perceptron
perceptron is the simplest machine learning algorithm
online learning: one example at a time
learning by doing: find the best output under the current weights, then update the weights at mistakes
loop: x_i → inference → z_i → update weights w (see the sketch below)
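A minimal sketch of this loop in Python. The helper names `inference` and `features` are placeholders for the task-specific argmax and feature map, not names from the slides:

```python
# Minimal sketch of the generic (structured) perceptron loop.
# inference(w, x) stands for argmax_z w . features(x, z).
from collections import defaultdict

def perceptron(data, inference, features, epochs=5):
    w = defaultdict(float)                 # sparse weight vector
    for _ in range(epochs):
        for x, y in data:                  # one example at a time
            z = inference(w, x)            # best output under current weights
            if z != y:                     # update weights at mistakes
                for f, v in features(x, y).items():
                    w[f] += v              # reward gold features
                for f, v in features(x, z).items():
                    w[f] -= v              # penalize predicted features
    return w
```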
9 Structured Perceptron
x_i: "the man bit the dog" → inference (with weights w) → z_i: DT NN NN DT NN
gold y_i: DT NN VBD DT NN → update weights
10 Example: POS Tagging
gold-standard y: DT NN VBD DT NN ("the man bit the dog")
current output z: DT NN NN DT NN ("the man bit the dog")
update direction: Φ(x, y) - Φ(x, z)
assume only two feature classes: tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i)
weights ++: (NN, VBD), (VBD, DT), (VBD, bit)
weights --: (NN, NN), (NN, DT), (NN, bit)
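A sketch of these two feature classes in Python, under the slide's assumptions (the feature-tuple encoding is illustrative):

```python
# The two feature classes from the slide: tag bigrams and word/tag pairs.
def features(words, tags):
    phi = {}
    prev = "<s>"                                 # start-of-sentence tag
    for word, tag in zip(words, tags):
        for f in [("bigram", prev, tag), ("wordtag", word, tag)]:
            phi[f] = phi.get(f, 0) + 1
        prev = tag
    return phi

x = "the man bit the dog".split()
gold = "DT NN VBD DT NN".split()
pred = "DT NN NN DT NN".split()
# The update +features(x, gold) - features(x, pred) gives +1 to
# (NN, VBD), (VBD, DT), (bit, VBD) and -1 to (NN, NN), (NN, DT),
# (bit, NN), exactly as on the slide.
```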
11 Inference: Dynamic Programming
exact inference: z = argmax, update weights if z ≠ y
tagging: O(nT^3) (Viterbi); CKY parsing: O(n^3)
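A bigram Viterbi sketch (O(nT^2); the slide's O(nT^3) corresponds to trigram tag features, the same recurrence one level deeper). The `score` argument is a placeholder for the dot product of the weights with the local features:

```python
# Bigram Viterbi decoding, O(n * T^2).
# score(w, words, i, prev, tag) is a placeholder for w . (local features).
def viterbi(w, words, tags, score):
    n = len(words)
    best = [{"<s>": 0.0}]                  # best[i][t]: best length-i prefix
    back = [{}]                            # ending in tag t, plus backpointers
    for i in range(n):
        best.append({}); back.append({})
        for tag in tags:
            cands = [(best[i][prev] + score(w, words, i, prev, tag), prev)
                     for prev in best[i]]
            best[i + 1][tag], back[i + 1][tag] = max(cands)
    tag = max(best[n], key=best[n].get)    # best tag for the last word
    out = [tag]
    for i in range(n, 1, -1):              # follow backpointers
        tag = back[i][tag]
        out.append(tag)
    return list(reversed(out))
```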
12 Perceptron vs. CRFs
perceptron is an online and Viterbi approximation of CRF: simpler to code; faster to converge; ~same accuracy
CRFs (Lafferty et al., 2001) maximize, over (x, y) ∈ D, the conditional likelihood exp(w · Φ(x, y)) / Z(x), where Z(x) = Σ_{z ∈ GEN(x)} exp(w · Φ(x, z))
going online gives stochastic gradient descent (SGD); the hard/Viterbi approximation replaces the expectations with an argmax
structured perceptron (Collins, 2002) is both: for (x, y) ∈ D, z = argmax_{z ∈ GEN(x)} w · Φ(x, z)
13 Perceptron Convergence Proof
binary classification: converges iff the data is separable
structured prediction: converges iff the data is separable
separability: there is an oracle vector that correctly labels all examples, one vs. the rest (correct label better than all incorrect labels)
theorem: if separable, then # of updates ≤ R^2 / δ^2 (R: diameter; δ: margin)
Novikoff => Freund & Schapire => Collins
14 Geometry of Convergence Proof, part 1
exact inference finds the exact 1-best z; update weights if z ≠ y
perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z)
let u be the unit oracle vector; by separation, each update gains at least the margin δ along u, so by induction ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ (part 1: lower bound)
15 Geometry of Convergence Proof, part 2
summary: the proof uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (guaranteed by exact search)
violation: the incorrect label scored higher, i.e. w^(k) · ΔΦ(x, y, z) ≤ 0 (the update makes an angle ≥ 90° with the current model)
with max diameter R (‖ΔΦ(x, y, z)‖ ≤ R), by induction ‖w^(k+1)‖^2 ≤ kR^2, so ‖w^(k+1)‖ ≤ √k R (part 2: upper bound)
combine with ‖w^(k+1)‖ ≥ kδ (part 1: lower bound): bound on # of updates k ≤ R^2 / δ^2
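A reconstruction of the slide's algebra in LaTeX (the garbled "apple" glyphs in the transcription stand for ≤):

```latex
\begin{align*}
&\text{Part 1 (lower bound): } u \cdot \Delta\Phi(x,y,z) \ge \delta
  \;\Rightarrow\; \|w^{(k+1)}\| \ge u \cdot w^{(k+1)} \ge k\delta \\
&\text{Part 2 (upper bound): } w^{(k)} \cdot \Delta\Phi \le 0 \text{ (violation)},
  \quad \|\Delta\Phi\| \le R \text{ (diameter)} \\
&\quad\Rightarrow\; \|w^{(k+1)}\|^2
  = \|w^{(k)}\|^2 + 2\,w^{(k)} \cdot \Delta\Phi + \|\Delta\Phi\|^2
  \le \|w^{(k)}\|^2 + R^2 \le kR^2 \\
&\text{Combining: } k\delta \le \|w^{(k+1)}\| \le \sqrt{k}\,R
  \;\Rightarrow\; k \le R^2/\delta^2.
\end{align*}
```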
16 Outline
Overview of Structured Learning
Challenges in Scalability
Structured Perceptron: convergence proof
Structured Perceptron with Inexact Search
Latent-Variable Perceptron
17 Scalability Challenge 1: Inference
binary classification (y = +1 / -1): constant # of classes, trivial exact inference
structured classification (e.g. "the man bit the dog" → DT NN VBD DT NN): exponential # of classes, hard exact inference
challenge: search efficiency (exponentially many classes)
often use dynamic programming (DP), but DP is still too slow for repeated use, e.g. parsing is O(n^3)
Q: can we sacrifice search exactness for faster learning?
18 Perceptron w/ Inexact Inference
inexact inference (greedy search, beam search): find z, update weights if z ≠ y. Q: does perceptron still work???
inexact inference is in routine use in NLP (e.g. beam search); how does structured perceptron work with inexact search?
so far most structured learning theory assumes exact search: would search errors break these learning properties?
if so, how to modify learning to accommodate inexact search? (a beam-search sketch follows)
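A beam-search sketch for tagging, reusing the `score` placeholder from the Viterbi sketch; beam size b = 1 reduces to greedy search:

```python
# Beam search for tagging; each hypothesis is (score, tag sequence).
# b = 1 is greedy search. score(...) is the same placeholder as above.
def beam_search(w, words, tags, score, b=4):
    beam = [(0.0, ["<s>"])]
    for i in range(len(words)):
        cands = [(s + score(w, words, i, seq[-1], tag), seq + [tag])
                 for s, seq in beam for tag in tags]
        beam = sorted(cands, reverse=True)[:b]   # prune to top-b hypotheses
    return beam[0][1][1:]                        # best sequence, minus "<s>"
```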
19 Bad News and Good News
A: it no longer works as is, but we can make it work by some magic.
bad news: no more guarantee of convergence; in practice perceptron degrades a lot due to search errors
good news: new update methods guarantee convergence; new perceptron variants that live with search errors; in practice they work really well w/ inexact search
20 Convergence with Exact Search
training example: "time flies" with correct label N V; output space {N, V} × {N, V}
with exact search, each update from w^(k) to w^(k+1) fixes a genuine mistake: structured perceptron converges with exact search
21 No Convergence w/ Greedy Search
same training example ("time flies", correct label N V, output space {N, V} × {N, V}): structured perceptron does not converge with greedy search
which part of the convergence proof no longer holds? the proof only uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (guaranteed by exact search)
22 Geometry of Convergence Proof, part 2 (revisited)
recall: violation means the incorrect label scored higher, i.e. the update ΔΦ(x, y, z) makes an angle ≥ 90° with w^(k), which gives the upper bound ‖w^(k+1)‖^2 ≤ kR^2 (R: max diameter)
but if z is only an inexact 1-best, inexact search doesn't guarantee violation!
23 Observation: Violation is all we need!
exact search is not really required by the proof; rather, it is only used to ensure violation
violation: incorrect label scored higher; any of the many violations (not just the exact 1-best) yields a valid update
the proof only uses 3 facts: 1. separation (margin); 2. diameter (always finite); 3. violation (but no need for exactness)
24 Violation-Fixing Perceptron
if we guarantee violation, we don't care about exactness!
violation is good b/c we can at least fix a mistake; any violation gives a possible update, with the same mistake bound as before!
standard perceptron: z = exact inference, update weights if z ≠ y
violation-fixing perceptron: find a violation z, update weights (sketch below)
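The training loop changes only in where z comes from. A sketch, where `find_violation` is a placeholder for any strategy (early update, max-violation, ...) that returns a gold/incorrect pair in which the incorrect side scores at least as high:

```python
# Violation-fixing perceptron: identical to the standard loop except that
# find_violation(w, x, y) may return partial structures (prefixes), with
# w . features(x, z_part) >= w . features(x, y_part), or None if no mistake.
from collections import defaultdict

def violation_fixing_perceptron(data, find_violation, features, epochs=5):
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            pair = find_violation(w, x, y)
            if pair is None:
                continue                   # no violation: nothing to fix
            y_part, z_part = pair
            for f, v in features(x, y_part).items():
                w[f] += v                  # reward the (partial) gold
            for f, v in features(x, z_part).items():
                w[f] -= v                  # penalize the (partial) mistake
    return w
```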
25 What if we can't guarantee violation?
this is why perceptron doesn't work well w/ inexact search: not every update is guaranteed to be a violation, so the proof breaks and there is no convergence guarantee
example: beam or greedy search; the model might prefer the correct label (under exact search), but the search prunes it away
such a non-violation update is a bad update because it doesn't fix any mistake: the new model still misguides the search
Q: how can we always guarantee violation?
26 Solution 1: Early Update (Collins/Roark 2004)
standard perceptron does not converge with greedy search (same "time flies" example, correct label N V, output space {N, V} × {N, V})
early update: stop and update at the first mistake
27 Early Update: Guarantees Violation
standard update doesn't converge b/c it doesn't guarantee violation
early update: at the first mistake, the incorrect prefix already scores higher: a violation!
28 Early Update: from Greedy to Beam
beam search is a generalization of greedy search (where b = 1): at each stage we keep the top-b hypotheses; widely used in tagging, parsing, translation...
early update: update when the correct label first falls off the beam (is pruned); violation guaranteed: up to this point the incorrect prefix scores higher
standard update (full update): no guarantee!
29 Solution 2: Max-Violation (Huang et al. 2012)
we now have a theory for early update (Collins/Roark), but it learns too slowly due to partial updates
max-violation: update at the prefix where the violation is maximum, the "worst mistake" in the search space
all these update methods are violation-fixing perceptrons (see the sketch below)
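A sketch of early update and max-violation on one beam-search pass. `beam_step` (extend every hypothesis one position and prune to the top b, best first) and `prefix_score` (w · features of a prefix) are hypothetical helpers, not names from the slides:

```python
# Early-update and max-violation search for a violation along the beam.
# beam_step(w, x, beam, b): one decoding step, returns the pruned beam;
# prefix_score(w, x, seq): model score of a prefix.
def find_violation(w, x, gold, beam_step, prefix_score, b=4, mode="max"):
    beam = [["<s>"]]
    best = (float("-inf"), None, None)     # (violation, gold prefix, mistake)
    for i in range(len(gold)):
        beam = beam_step(w, x, beam, b)
        gold_prefix = ["<s>"] + gold[:i + 1]
        viol = prefix_score(w, x, beam[0]) - prefix_score(w, x, gold_prefix)
        if viol > best[0]:
            best = (viol, gold_prefix, beam[0])
        if mode == "early" and gold_prefix not in beam:
            return gold_prefix, beam[0]    # gold just fell off: update here
    if mode == "max" and best[0] >= 0:
        return best[1], best[2]            # prefix with the biggest violation
    return None                            # gold survived; no violation found
```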
30 Four Experiments
part-of-speech tagging ("the man bit the dog" → DT NN VBD DT NN)
incremental parsing
bottom-up parsing w/ cube pruning
machine translation ("the man bit the dog" → 那人咬了狗)
31 Max-Violation > Early >> Standard
exp 1: part-of-speech tagging w/ beam search (on CTB5)
early and max-violation >> standard update at the smallest beams; this advantage shrinks as beam size increases
max-violation converges faster than early (and slightly better)
[plots: tagging accuracy on held-out vs. training time (hours) at beam = 1 (greedy), and best accuracy vs. beam size; curves: max-violation, early, standard]
32 Max-Violation > Early >> Standard (cont'd)
same tagging experiment at a larger beam size; the same ranking holds
[plots: tagging accuracy on held-out vs. training time (hours), and best accuracy vs. beam size; curves: max-violation, early, standard]
33 Max-Violation > Early >> Standard
exp 2: incremental dependency parser (Huang & Sagae 10), beam = 8
standard update is horrible due to search errors (79.0, omitted from the plot)
early update: 38 iterations, 15.4 hours (92.24)
max-violation: 10 iterations, 4.6 hours (92.25); max-violation is 3.3x faster than early update
[plots: parsing accuracy on held-out vs. training time (hours); curves: max-violation, early, standard]
34 Why is standard update so bad for parsing?
standard update works horribly with severe search error, due to a large number of invalid updates (non-violations)
parsing: exact DP is O(n^11), beam search is O(nb); tagging: exact DP is O(nT^3), beam search is O(nb)
[plots: parsing (b = 8) and tagging accuracy on held-out vs. training time (hours); % of invalid updates in standard update vs. beam size, parsing vs. tagging]
35 Exp 3: Bottom-up Parsing
CKY parsing with cube pruning for higher-order features (Zhang et al. 2013)
we extended our framework from graphs to hypergraphs
[plot: UAS on Penn-YM dev vs. epochs; curves: s-max, p-max, skip, standard]
36 Exp 4: Machine Translation
standard perceptron works poorly for machine translation b/c the invalid update ratio is very high (search quality is low)
max-violation converges faster than early update
first truly successful effort in large-scale training for translation (Yu et al. 2013)
[plots: BLEU vs. number of iterations (max-violation, early, standard, local), with non-local features; ratio of invalid updates for the standard perceptron vs. beam size, 50%-90%]
37 Comparison of Four Exps
the harder the search, the more advantageous the violation-fixing updates
[plots recapped from exps 1-4: tagging (accuracy vs. training time), incremental parsing (b = 8; max-violation, early, standard), bottom-up parsing (UAS vs. epochs; s-max, p-max, skip, standard), machine translation (BLEU vs. number of iterations; max-violation, early, standard, local)]
38 Outline
Overview of Structured Learning
Challenges in Scalability
Structured Perceptron: convergence proof
Structured Perceptron with Inexact Search
Latent-Variable Perceptron
39 Learning with Latent Variables
a.k.a. weakly-supervised or partially-observed learning
learning from "natural annotations"; more scalable
examples: translation, transliteration, semantic parsing...
parallel text: 布什与沙龙会谈 ↔ "Bush talked with Sharon", with a latent derivation (Liang et al. 2006; Yu et al. 2013; Xiao and Xiong 2013)
transliteration: コンピューター ↔ ko n p u : ta : ("computer") (Knight & Graehl, 1998; Kondrak et al. 2007, etc.)
QA pairs: "What is the largest state?" ↔ Alaska, with a latent derivation argmax(state, size) (Clark et al. 2010; Liang et al. 2013; Kwiatkowski et al. 2013)
40 Learning Latent Structures
binary/multiclass → structured learning → latent structures:
generative: naive Bayes → HMMs → EM (forward-backward)
conditional / discriminative: logistic regression (maxent) → CRFs → latent CRFs
online + Viterbi: perceptron → structured perceptron → latent perceptron
max margin: SVM → structured SVM → latent structured SVM
41 Latent Structured Perceptron
no explicit positive signal: hallucinate the correct derivation by the current weights during online learning
training example: 那人咬了狗 → "the man bit the dog" (current output z: "the dog bit the man")
d*: highest-scoring gold derivation in the forced decoding space (reward correct)
d̂: highest-scoring derivation in the full (beam) search space (penalize wrong)
update: w ← w + Φ(x, d*) - Φ(x, d̂) (Liang et al. 2006; Yu et al. 2013); a sketch follows
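A sketch of this update. `decode` is a placeholder with a hypothetical `constrain_to` argument that restricts the search to derivations producing the reference (forced decoding); `output_of` extracts the string a derivation yields:

```python
# Latent-variable perceptron update: reward the best reference-producing
# (forced) derivation, penalize the best unconstrained derivation.
# decode(w, x, constrain_to=...), output_of(d), and features(x, d) are
# placeholders for the decoder, its yield, and the feature map.
def latent_update(w, x, y, decode, output_of, features):
    d_star = decode(w, x, constrain_to=y)      # best gold derivation
    d_hat = decode(w, x, constrain_to=None)    # best derivation, full space
    if output_of(d_hat) != y:                  # mistake: fix it
        for f, v in features(x, d_star).items():
            w[f] = w.get(f, 0.0) + v           # reward correct
        for f, v in features(x, d_hat).items():
            w[f] = w.get(f, 0.0) - v           # penalize wrong
    return w
```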
42 Unconstrained Search
example: beam search for phrase-based decoding of "Bushi yu Shalong juxing le huitan"
hypotheses in the beam cover partial translations such as "Bush...", "...Sharon...", "...meeting...", "...talks...", "...talk..."
43 Constrained Search
forced decoding: must produce the exact reference translation "Bush held talks with Sharon" for "Bushi yu Shalong juxing le huitan"
there can be more than one gold derivation: a gold derivation lattice
44 Search Errors in Decoding
(recap of slide 41: no explicit positive signal; hallucinate the correct derivation by the current weights; reward the highest-scoring gold derivation d* from the forced decoding space, penalize the highest-scoring derivation d̂ from the full beam search space)
problem: search errors (Liang et al. 2006; Yu et al. 2013)
45 Search Error: Gold Derivations Pruned
in real decoding, beam search can prune away the entire gold derivation lattice (e.g. "Bush held talks with Sharon"); learning should address search errors here!
46 Fixing Search Error 1: Early Update
early update (Collins/Roark 04): update when the correct derivation falls off the beam; up to this point the incorrect prefix scores higher: that's a violation we want to fix; proof in (Huang et al. 2012)
standard perceptron does not guarantee violation: the correct sequence (pruned) might score higher at the end! such an invalid update reinforces the model error
47 Early Update w/ Latent Variable
the gold-standard derivations are not annotated; we treat any reference-producing derivation as good (a gold derivation lattice)
early update: when all correct derivations fall off the beam, stop decoding and update; violation guaranteed: the incorrect prefix scores higher up to this point
48 Fixing Search Error 2: Max-Violation
standard update is invalid once the correct sequences all fall off the beam
early update works but learns slowly due to partial updates
max-violation: update at the prefix where the violation (best in the beam minus best gold prefix) is maximum, the "worst mistake" in the search space, now extended to handle latent variables
49 Latent-Variable Perceptron
update points along the beam-search trajectory: early (where the correct sequences fall off the beam), max-violation (biggest violation), latest (last valid update), full/standard (at the end: an invalid update if the correct sequence was pruned!)
50 Roadmap of Techniques
structured perceptron (Collins, 2002)
latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009)
perceptron w/ inexact search (Collins & Roark, 2004; Huang et al. 2012)
latent-variable perceptron w/ inexact search (Yu et al. 2013; Zhao et al. 2014): MT, syntactic parsing, semantic parsing, transliteration
51 Experiments: Discriminative Training for MT
standard update (Liang et al.'s "bold") works poorly b/c the invalid update ratio is very high (search quality is low); this explains why Liang et al. 06 failed (std ~ bold; local ~ local)
max-violation converges faster than early update
[plots: BLEU vs. number of iterations (MaxForce, local, early, MERT); ratio of invalid updates for the standard latent-variable perceptron vs. beam size, 50%-90%, with non-local features]
52 Final Conclusions
online structured learning is simple and powerful
search efficiency is the key challenge
search errors do interfere with learning
but we can use the violation-fixing perceptron w/ inexact search
we can extend the perceptron to learn latent structures