Structured Prediction with Perceptron: Theory and Algorithms


1 Structured Prediction with Perceptron: Theory and Algorithms (title-slide examples: "the man bit the dog" / "the man hit the dog", DT NN VBD DT NN, 那人咬了狗, y = +1 / y = -1). Kai Zhao, Dept. of Computer Science, Graduate Center, the City University of New York

2 What is Structured Prediction? Binary classification: output is binary (y = +1 / y = -1). Multiclass classification: output is a (small) number. Structured classification: output is a structure (sequence, tree, graph), e.g. "the man bit the dog" → DT NN VBD DT NN, a parse tree (S → NP VP), or a translation (那人咬了狗). NLP is all about structured prediction: part-of-speech tagging, parsing, summarization, translation. Exponentially many classes: search (inference) efficiency is crucial!

3 Learning: Unstructured vs. Structured. Binary/multiclass vs. structured learning: generative — naive Bayes vs. HMMs (count & divide); conditional/discriminative — logistic regression (maxent) vs. CRFs (expectations); online + Viterbi — perceptron vs. structured perceptron (argmax); max margin — SVM vs. structured SVM (loss-augmented argmax).

4 Why Perceptron (Online Learning)? Because we want scalability on big data: learning time has to be linear in the number of examples, and we can make only a constant number of passes over the training data; only online learning (perceptron/MIRA) can guarantee this. SVM scales between O(n^2) and O(n^3); CRF has no such guarantee. Inference on each example must be super fast. Another advantage of the perceptron over SVM and CRF: it just needs the argmax.

5 Perceptron: from binary to structured. Binary perceptron (Rosenblatt, 1959): 2 classes (y = +1 / -1), trivial exact inference with w; find z, update weights if z ≠ y. Multiclass perceptron (Freund/Schapire, 1999): constant # of classes, easy exact inference; find z, update weights if z ≠ y. Structured perceptron (Collins, 2002): exponential # of classes (e.g. "the man bit the dog" → 那人咬了狗), hard exact inference; find z, update weights if z ≠ y.

6 Scalability Challenges. Inference (on one example) is too slow (even with DP). Can we sacrifice search exactness for faster learning? Would inexact search interfere with learning? If so, how should we modify learning?

7 Outline: Overview of Structured Learning; Challenges in Scalability; Structured Perceptron and its convergence proof; Structured Perceptron with Inexact Search; Latent-Variable Structured Perceptron.

8 Generic Perceptron. The perceptron is the simplest machine learning algorithm. Online learning: one example at a time, learning by doing; for each x_i, find the best output z_i under the current weights w (inference), compare it with the gold y_i, and update the weights on mistakes.
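A minimal sketch of this loop in Python, assuming a feature map `phi(x, y)` and an inference routine `argmax_inference(x, w)` are supplied (both names are illustrative, not from the slides):

```python
from collections import defaultdict

def train_structured_perceptron(data, phi, argmax_inference, epochs=5):
    """Generic structured perceptron (Collins, 2002).

    data:  list of (x, y) pairs, where y is the gold structure for x
    phi:   feature map phi(x, y) -> dict of feature counts
    argmax_inference: function (x, w) -> highest-scoring structure z under w
    """
    w = defaultdict(float)
    for _ in range(epochs):
        for x, y in data:
            z = argmax_inference(x, w)      # inference under current weights
            if z != y:                      # mistake: move towards gold, away from prediction
                for f, v in phi(x, y).items():
                    w[f] += v
                for f, v in phi(x, z).items():
                    w[f] -= v
    return w
```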

9 Structured Perceptron. Example: x_i = "the man bit the dog"; inference with w gives z_i = DT NN NN DT NN, while the gold y_i = DT NN VBD DT NN, so we update the weights.

10 Example: POS Tagging. Gold standard y: DT NN VBD DT NN ("the man bit the dog"); current output z: DT NN NN DT NN. Update w by Φ(x, y) − Φ(x, z). Assume only two feature classes: tag bigrams (t_{i-1}, t_i) and word/tag pairs (w_i, t_i). Weights ++: (NN, VBD), (VBD, DT), (VBD, bit). Weights −−: (NN, NN), (NN, DT), (NN, bit).
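A small sketch of this update for the slide's two feature classes (the feature encoding below is an assumption for illustration; `Counter` subtraction keeps only the features that differ):

```python
from collections import Counter

def phi(words, tags):
    """Tag-bigram and word/tag features (slide 10's two feature classes)."""
    feats = Counter()
    prev = "<s>"
    for w, t in zip(words, tags):
        feats[("bigram", prev, t)] += 1
        feats[("word/tag", w, t)] += 1
        prev = t
    return feats

words = "the man bit the dog".split()
gold  = ["DT", "NN", "VBD", "DT", "NN"]
pred  = ["DT", "NN", "NN",  "DT", "NN"]

reward  = phi(words, gold) - phi(words, pred)   # features whose weights go up (++)
penalty = phi(words, pred) - phi(words, gold)   # features whose weights go down (--)
print(dict(reward))   # ('bigram','NN','VBD'), ('bigram','VBD','DT'), ('word/tag','bit','VBD')
print(dict(penalty))  # ('bigram','NN','NN'),  ('bigram','NN','DT'),  ('word/tag','bit','NN')
```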

11 Inference: Dynamic Programming. With exact inference: find z, update weights if z ≠ y. Complexity: tagging O(nT³) (Viterbi), CKY parsing O(n³).
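For concreteness, a minimal bigram Viterbi decoder (O(nT²); the slide's O(nT³) corresponds to trigram features). The helpers `phi_local` and the weight dict `w` are assumptions matching the earlier sketches, not the slides' implementation:

```python
def viterbi(words, tagset, w, phi_local):
    """Bigram Viterbi decoding: argmax over tag sequences under weights w.

    phi_local(word, prev_tag, tag) -> dict of local features (assumed helper).
    """
    def score(word, prev, tag):
        return sum(w.get(f, 0.0) * v for f, v in phi_local(word, prev, tag).items())

    best = {"<s>": (0.0, [])}                  # previous tag -> (score, best partial sequence)
    for word in words:
        new = {}
        for tag in tagset:
            new[tag] = max(((s + score(word, prev, tag), seq + [tag])
                            for prev, (s, seq) in best.items()),
                           key=lambda p: p[0])
        best = new
    return max(best.values(), key=lambda p: p[0])[1]
```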

12 Perceptron vs. CRFs. The perceptron is an online, Viterbi approximation of the CRF: simpler to code, faster to converge, roughly the same accuracy. CRFs (Lafferty et al., 2001) maximize the conditional likelihood $\sum_{(x,y)\in D} \log \frac{\exp(w\cdot\Phi(x,y))}{Z(x)}$ with $Z(x)=\sum_{z\in \mathrm{GEN}(x)} \exp(w\cdot\Phi(x,z))$. Making this online (stochastic gradient descent, SGD) and hard/Viterbi (replacing the expectation by $\arg\max_{z\in \mathrm{GEN}(x)} w\cdot\Phi(x,z)$ for each $(x,y)\in D$) yields the structured perceptron (Collins, 2002).

13 Perceptron Convergence Proof. Binary classification: converges iff the data is separable. Structured prediction: converges iff the data is separable, i.e. there is an oracle vector that correctly labels all examples, "one vs. the rest" (the correct label scores better than all incorrect labels). Theorem: if separable with margin δ, the number of updates is at most R²/δ², where R is the diameter of the data (Novikoff ⇒ Freund & Schapire ⇒ Collins).

14 Geometry of Convergence Proof, part 1 (lower bound). Exact inference finds the exact 1-best z; update weights if z ≠ y. Perceptron update: w^(k+1) = w^(k) + ΔΦ(x, y, z), where ΔΦ(x, y, z) = Φ(x, y) − Φ(x, z) (correct label minus current prediction). Separation: for the unit oracle vector u, u · ΔΦ(x, y, z) ≥ δ, so u · w^(k+1) ≥ u · w^(k) + δ and, by induction over the k updates, ‖w^(k+1)‖ ≥ u · w^(k+1) ≥ kδ.

15 Geometry of Convergence Proof, part 2 (upper bound). Summary: the proof uses 3 facts: 1. separation (margin), 2. diameter (always finite), 3. violation (guaranteed by exact search). A violation means the incorrect label scored at least as high as the correct one, so the update vector ΔΦ(x, y, z) has a non-positive dot product with w^(k). With R the maximum diameter, this gives by induction ‖w^(k+1)‖² ≤ kR², i.e. ‖w^(k+1)‖ ≤ √k·R; combined with the part-1 lower bound ‖w^(k+1)‖ ≥ kδ, this bounds the number of updates: k ≤ R²/δ².
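For reference, the two parts of the argument can be written out as one short derivation in the slides' notation (u the unit oracle vector, δ the margin, R the diameter, ΔΦ = Φ(x,y) − Φ(x,z)); this is a sketch of the standard bound, not a full proof:

```latex
% Part 1 (lower bound): separation, by induction over the k updates
\[
u \cdot w^{(k+1)} = u \cdot w^{(k)} + u \cdot \Delta\Phi \ \ge\ u \cdot w^{(k)} + \delta
\ \Rightarrow\ \|w^{(k+1)}\| \ \ge\ u \cdot w^{(k+1)} \ \ge\ k\delta
\]
% Part 2 (upper bound): diameter (\|\Delta\Phi\| \le R) and violation (w^{(k)} \cdot \Delta\Phi \le 0)
\[
\|w^{(k+1)}\|^2 = \|w^{(k)}\|^2 + 2\, w^{(k)}\!\cdot\Delta\Phi + \|\Delta\Phi\|^2
\ \le\ \|w^{(k)}\|^2 + R^2
\ \Rightarrow\ \|w^{(k+1)}\|^2 \ \le\ kR^2
\]
% Combining both bounds
\[
k\delta \ \le\ \|w^{(k+1)}\| \ \le\ \sqrt{k}\, R
\ \Rightarrow\ k \ \le\ R^2/\delta^2
\]
```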

16 Outline: Overview of Structured Learning; Challenges in Scalability; Structured Perceptron and its convergence proof; Structured Perceptron with Inexact Search; Latent-Variable Perceptron.

17 Scalability Challenge 1: Inference. Binary classification (y = +1 / -1): constant # of classes, trivial exact inference. Structured classification (e.g. "the man bit the dog" → DT NN VBD DT NN): exponential # of classes, hard exact inference. The challenge is search efficiency (exponentially many classes); we often use dynamic programming (DP), but DP is still too slow for repeated use, e.g. parsing is O(n³). Q: can we sacrifice search exactness for faster learning?

18 Perceptron w/ Inexact Inference. With inexact inference (greedy search, beam search): find z, update weights if z ≠ y. Q: does the perceptron still work? Inexact inference is routinely used in NLP (e.g. beam search), but so far most structured learning theory assumes exact search. How does the structured perceptron work with inexact search? Would search errors break these learning properties? If so, how do we modify learning to accommodate inexact search?

19 Bad News and Good News. A: it no longer works as is, but we can make it work with some magic. Bad news: no more guarantee of convergence; in practice the perceptron degrades a lot due to search errors. Good news: new update methods guarantee convergence; the new perceptron variants live with search errors, and in practice they work really well with inexact search (greedy or beam).

20 Convergence with Exact Search. Toy training example: "time flies" with correct label N V and output space {N, V} × {N, V}. With exact search, each update from the current model w^(k) to w^(k+1) uses the full incorrect sequence, and the structured perceptron converges.

21 No Convergence w/ Greedy Search. On the same toy example ("time flies", correct label N V, output space {N, V} × {N, V}), the structured perceptron does not converge with greedy search. Which part of the convergence proof no longer holds? The proof only uses 3 facts: 1. separation (margin), 2. diameter (always finite), 3. violation (guaranteed by exact search).

22 Geometry of Convergence Proof, part 2 (revisited). With exact search the 1-best z is a guaranteed violation (the incorrect label scores higher, so w^(k) · ΔΦ(x, y, z) ≤ 0), which is exactly what gives the upper bound ‖w^(k+1)‖² ≤ kR² (with R the max diameter). With inexact search the returned 1-best need not be a violation, so this step breaks: inexact search doesn't guarantee violation!

23 Observation: Violation is all we need! Exact search is not really required by the proof; it is only used to ensure a violation (an incorrect label that scores higher than the correct one). The proof only uses 3 facts: 1. separation (margin), 2. diameter (always finite), 3. violation (but no need for exactness).

24 Violation-Fixing Perceptron. If we guarantee a violation, we don't care about exactness! A violation is good because we can at least fix a mistake, and among all possible updates, any violation gives the same mistake bound as before. Standard perceptron: exact inference for z, update weights if z ≠ y. Violation-fixing perceptron: find any violation z, then update weights.
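A sketch of the abstraction, assuming the same `phi` as before; `find_violation` is a placeholder for any routine that returns a pair (y_v, z_v) where z_v scores at least as high as y_v under w (full structures for the standard update, prefixes for early or max-violation updates):

```python
def violation_fixing_perceptron(data, phi, find_violation, w, epochs=5):
    """Violation-fixing perceptron: the only requirement on the search routine is
    that the returned pair (y_v, z_v) is a violation under the current weights."""
    for _ in range(epochs):
        for x, y in data:
            pair = find_violation(x, y, w)     # e.g. exact 1-best, early, or max-violation
            if pair is None:
                continue                       # no mistake / no violation on this example
            y_v, z_v = pair
            for f, v in phi(x, y_v).items():   # reward the (partial) gold structure
                w[f] = w.get(f, 0.0) + v
            for f, v in phi(x, z_v).items():   # penalize the (partial) violating structure
                w[f] = w.get(f, 0.0) - v
    return w
```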

25 What if we can't guarantee violation? This is why the perceptron doesn't work well with inexact search: not every update is guaranteed to be a violation, so the proof breaks and there is no convergence guarantee. Example: with beam or greedy search, the model might actually prefer the correct label (under exact search), but the search prunes it away. Such a non-violation update is bad because it doesn't fix any mistake; the new model still misguides the search. Q: how can we always guarantee violation?

26 Solution 1: Early Update (Collins/Roark 2004). On the toy example ("time flies", correct label N V, output space {N, V} × {N, V}), the standard perceptron does not converge with greedy search. Early update: stop and update at the first mistake.

27 Early Update: Guarantees Violation. On the same example, the standard update doesn't converge because it doesn't guarantee a violation; the early update stops at the point where the incorrect prefix scores higher than the correct one, which is a violation!

28 Early Update: from Greedy to Beam. Beam search is a generalization of greedy search (which is the b = 1 case): at each stage we keep the top-b hypotheses; it is widely used in tagging, parsing, translation, etc. Early update: update when the correct label first falls off the beam (is pruned); up to this point the incorrect prefix scores higher, so a violation is guaranteed. The standard (full) update has no such guarantee. (A sketch follows.)
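A minimal beam-search-with-early-update sketch for sequence labeling, reusing the assumed helper `phi_local` from the Viterbi sketch (again an illustration, not the original implementation). It returns the (gold prefix, violating prefix) pair to update on, in the format consumed by the violation-fixing loop above:

```python
def beam_search_early_update(words, gold_tags, tagset, w, phi_local, beam_size):
    """Beam search with early update (Collins & Roark, 2004).

    Returns (gold_prefix, violating_prefix), or None if the gold sequence
    survives the beam and is ranked first (no update needed)."""
    beam = [(0.0, [])]                                     # (score, tag prefix)
    for i, word in enumerate(words):
        expanded = []
        for s, prefix in beam:
            prev = prefix[-1] if prefix else "<s>"
            for tag in tagset:
                feats = phi_local(word, prev, tag)
                s_new = s + sum(w.get(f, 0.0) * v for f, v in feats.items())
                expanded.append((s_new, prefix + [tag]))
        expanded.sort(key=lambda p: p[0], reverse=True)
        beam = expanded[:beam_size]
        gold_prefix = gold_tags[:i + 1]
        if gold_prefix not in [p for _, p in beam]:        # gold falls off the beam
            return gold_prefix, beam[0][1]                 # early update on this prefix pair
    best = beam[0][1]
    # gold survived: if it is not ranked first, the full update is still a violation
    return (gold_tags, best) if best != gold_tags else None
```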

29 Solution 2: Max-Violation (Huang et al. 2012). We now have a theory for early update (Collins/Roark), but it learns too slowly due to partial updates. Max-violation: update at the prefix where the violation is maximum, the "worst mistake" in the search space. All of these update methods (early, max-violation, standard-when-valid) are violation-fixing perceptrons.
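A sketch of the max-violation variant under the same assumptions (`phi_prefix` is an assumed helper that scores a partial tag sequence; prefix scores are recomputed for clarity rather than efficiency):

```python
def max_violation_update(words, gold_tags, tagset, w, phi_prefix, beam_size):
    """Max-violation update (Huang et al., 2012): among all prefix lengths, pick the
    one where the best prefix in the beam beats the gold prefix by the largest margin."""
    def sc(tags):
        return sum(w.get(f, 0.0) * v for f, v in phi_prefix(words, tags).items())

    beam = [[]]
    best_pair, best_gap = None, float("-inf")
    for i in range(len(words)):
        expanded = [prefix + [t] for prefix in beam for t in tagset]
        expanded.sort(key=sc, reverse=True)
        beam = expanded[:beam_size]
        gold_prefix = gold_tags[:i + 1]
        gap = sc(beam[0]) - sc(gold_prefix)        # how badly the model prefers the wrong prefix
        if beam[0] != gold_prefix and gap > best_gap:
            best_gap, best_pair = gap, (gold_prefix, beam[0])
    if best_pair and best_gap >= 0:                # only update on a genuine violation
        gold_prefix, wrong_prefix = best_pair
        for f, v in phi_prefix(words, gold_prefix).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi_prefix(words, wrong_prefix).items():
            w[f] = w.get(f, 0.0) - v
    return w
```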

30 Four Experiments: part-of-speech tagging ("the man bit the dog" → DT NN VBD DT NN), incremental parsing, bottom-up parsing with cube pruning, and machine translation ("the man bit the dog" → 那人咬了狗).

31 Max-Violation > Early >> Standard. Experiment 1: part-of-speech tagging with beam search (on CTB5). Early and max-violation >> standard update at the smallest beams; this advantage shrinks as beam size increases. Max-violation converges faster than early update (and is slightly better). (Plots: tagging accuracy on held-out vs. training time at beam = 1 (greedy), and best accuracy vs. beam size, for max-violation, early, and standard.)

32 (Same comparison at a larger beam: plots of tagging accuracy on held-out vs. training time, and best accuracy vs. beam size, for max-violation, early, and standard.)

33 Max-Violation > Early >> Standard. Experiment 2: incremental dependency parsing (Huang & Sagae 10). The standard update is horrible due to search errors. Early update: 38 iterations, 15.4 hours (92.24); max-violation: 10 iterations, 4.6 hours (92.25), i.e. max-violation is 3.3x faster than early update. (Plots: parsing accuracy on held-out vs. training time at beam = 8 for max-violation, early, and standard (79.0, omitted).)

34 Why is the standard update so bad for parsing? The standard update works horribly under severe search error, due to the large number of invalid updates (non-violations). Parsing: exact DP O(n^11) vs. beam search O(nb); tagging: exact DP O(nT³) vs. beam search O(nb). (Plots: parsing accuracy (b = 8) and tagging accuracy vs. training time; % of invalid updates in the standard update vs. beam size, for parsing and tagging.)

35 Exp 3: Bottom-up Parsing. CKY parsing with cube pruning for higher-order features; we extended our framework from graphs to hypergraphs (Zhang et al. 2013). (Plot: UAS on Penn-YM dev vs. epochs, for s-max, p-max, skip, and standard.)

36 Exp 4: Machine Translation. The standard perceptron works poorly for machine translation because the invalid-update ratio is very high (search quality is low). Max-violation converges faster than early update; this is the first truly successful effort in large-scale discriminative training for translation (Yu et al. 2013). (Plots: BLEU vs. number of iterations for max-violation, early, standard, and local; ratio of invalid updates (50%-90%) for the standard perceptron, with non-local features, vs. beam size.)

37 Comparison of the Four Experiments: the harder the search, the more advantageous the violation-fixing updates. (Plots: tagging accuracy and incremental parsing accuracy (b = 8) on held-out vs. training time (max-violation, early, standard); UAS vs. epochs for bottom-up parsing (s-max, p-max, skip, standard); BLEU vs. number of iterations for machine translation (max-violation, early, standard, local).)

38 Outline: Overview of Structured Learning; Challenges in Scalability; Structured Perceptron and its convergence proof; Structured Perceptron with Inexact Search; Latent-Variable Perceptron.

39 Learning with Latent Variables, a.k.a. weakly-supervised or partially-observed learning: learning from "natural annotations", which is more scalable. Examples: translation from parallel text with a latent derivation (布什与沙龙会谈 → Bush talked with Sharon; Liang et al. 2006, Yu et al. 2013, Xiao and Xiong 2013); transliteration (コンピューター → ko n p u : ta :; Knight & Graehl 1998, Kondrak et al. 2007, etc.); semantic parsing from QA pairs with a latent derivation ("What is the largest state?" → argmax(state, size) → Alaska; Clark et al. 2010, Liang et al. 2013, Kwiatkowski et al. 2013).

40 Learning Latent Structures. Binary/multiclass vs. structured vs. latent-structure learning: generative — naive Bayes / HMMs / EM (forward-backward); conditional — logistic regression (maxent) / CRFs / latent CRFs; online + Viterbi — perceptron / structured perceptron / latent perceptron; max margin — SVM / structured SVM / latent structured SVM.

41 Latent Structured Perceptron. There is no explicit positive signal, so we "hallucinate" the correct derivation using the current weights. During online learning, for a training example (那人咬了狗, "the man bit the dog"), let d⁺ be the highest-scoring gold derivation in the forced decoding space and d̂ the highest-scoring derivation in the full (beam) search space (which may produce e.g. "the dog bit the man"); the update is w ← w + Φ(x, d⁺) − Φ(x, d̂), rewarding the correct derivation and penalizing the wrong one (Liang et al. 2006; Yu et al. 2013).
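A sketch of a single latent-variable update in the spirit of this slide; `forced_decode`, `beam_decode`, `phi`, and the `.output` attribute on derivations are all assumed interfaces, not the authors' code:

```python
def latent_perceptron_update(x, y_ref, w, phi, forced_decode, beam_decode):
    """One latent-variable structured perceptron update.

    forced_decode(x, y_ref, w): highest-scoring derivation constrained to produce
                                the reference y_ref (the "gold" derivation d+)
    beam_decode(x, w):          highest-scoring derivation in the full search space (d^)
    """
    d_plus = forced_decode(x, y_ref, w)     # hallucinated "correct" derivation
    d_hat = beam_decode(x, w)               # current 1-best derivation
    if d_hat.output != y_ref:               # mistake: reward d+, penalize d^
        for f, v in phi(x, d_plus).items():
            w[f] = w.get(f, 0.0) + v
        for f, v in phi(x, d_hat).items():
            w[f] = w.get(f, 0.0) - v
    return w
```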

42 Unconstrained Search. Example: beam search for phrase-based decoding of "Bushi yu Shalong juxing le huitan"; hypotheses in the beam cover partial translations such as "Bush", "Shalong"/"Sharon", "meeting"/"talks"/"talk", and so on.

43 Constrained Search. Forced decoding: must produce the exact reference translation, e.g. "Bush held talks with Sharon" for "Bushi yu Shalong juxing le huitan". One gold derivation is a path in the gold derivation lattice (e.g. Bush | held talks | with Sharon).

44 Search Errors in Decoding. As on slide 41: no explicit positive signal, so we hallucinate the correct derivation with the current weights, taking d⁺ (the highest-scoring gold derivation in the forced decoding space) and d̂ (the highest-scoring derivation in the full beam search space) and updating w ← w + Φ(x, d⁺) − Φ(x, d̂), rewarding the correct and penalizing the wrong derivation (Liang et al. 2006; Yu et al. 2013). Problem: search errors.

45 Search Error: Gold Derivations Pruned. In real decoding, beam search can prune every derivation in the gold derivation lattice (e.g. all those producing "Bush held talks with Sharon"); learning should address these search errors.

46 Fixing Search Error 1: Early Update. Early update (Collins/Roark 04): update when the correct derivation falls off the beam; up to this point the incorrect prefix scores higher, and that is a violation we want to fix (proof in Huang et al. 2012). The standard perceptron does not guarantee a violation: the correct sequence (pruned) might score higher at the end, and such an invalid update reinforces the model error.

47 Early Update w/ Latent Variable. The gold-standard derivations are not annotated, so we treat any reference-producing derivation as good (the gold derivation lattice, e.g. all derivations of "Bush held talks with Sharon"). Early update: when all correct derivations fall off the beam, stop decoding and update; a violation is guaranteed because the incorrect prefix scores higher up to this point.

48 Fixing Search Error 2: Max-Violation. The standard update is invalid once the correct sequences have all fallen off the beam. Early update works, but learns slowly due to partial updates. Max-violation: update at the prefix where the violation is maximum, the worst mistake in the search space, now extended to handle latent variables. (Diagram: at each prefix length, the best and worst hypotheses in the beam are compared with the best correct-derivation prefix h⁺; early, max-violation, and local updates pick guaranteed violation points.)

49 Latent-Variable Perceptron. (Diagram: update positions along the beam for the latent-variable perceptron: early (where the correct sequences fall off the beam), max-violation (the biggest violation), latest (the last valid update), and full/standard (the end of the search, which is an invalid update because the correct sequence has been pruned).)

50 Roadmap of Techniques: structured perceptron (Collins, 2002); latent-variable perceptron (Zettlemoyer and Collins, 2005; Sun et al., 2009); perceptron with inexact search (Collins & Roark, 2004; Huang et al. 2012); latent-variable perceptron with inexact search (Yu et al. 2013; Zhao et al. 2014), applied to MT, syntactic parsing, semantic parsing, and transliteration.

51 Experiments: Discriminative Training for MT. The standard update (Liang et al.'s "bold") works poorly because the invalid-update ratio is very high (search quality is low); this explains why Liang et al. 06 failed (their "bold" corresponds to the standard update, their "local" to local updates). Max-violation converges faster than early update. (Plots: BLEU vs. number of iterations for MaxForce (max-violation), local, early, and MERT; ratio of invalid updates (50%-90%) for the standard latent-variable perceptron, with non-local features, vs. beam size.)

52 Final Conclusions. Online structured learning is simple and powerful; search efficiency is the key challenge. Search errors do interfere with learning, but we can use the violation-fixing perceptron with inexact search, and we can extend the perceptron to learn latent structures. (Diagram: early, max-violation, latest, and full (standard) update positions along the beam; the full update becomes invalid once the correct sequence falls off the beam.)
