Efficient Higher-Order CRFs for Morphological Tagging
Efficient Higher-Order CRFs for Morphological Tagging
Thomas Müller, Helmut Schmid and Hinrich Schütze
Center for Information and Language Processing, University of Munich
Outline
1 Contributions
2 Motivation
3 Model
4 Experiments
Contributions
- Fast approximate CRF tagger for big tag sets
- Allows training CRFs with high orders
- Training time reductions from several days to several hours
- Accuracy improvements due to higher orders
Motivation
Introduction
Example: Die Rebellen haben kein Lösegeld verlangt
(The rebels have no ransom demanded)
Tags: ART NN VAF PI NN VVP, with morphological features such as nom.pl.masc for "Die Rebellen", pl.3 for "haben" and acc.sg.neut for "kein Lösegeld"
- Assign coarse POS and fine MORPH tags
- Tagging works for short-distance dependencies
- Higher orders are needed for long-distance dependencies, e.g.:
  weil er kein Lösegeld verlangt
  (because he no ransom demands)
  KO PP PI NN VVF
- Higher-order tagging is expensive, hence the Coarse-To-Fine approach
Coarse-To-Fine Decoding
Basic Idea:
1 Create a 0-order lattice
2 Calculate posterior probabilities
3 Prune states with low posteriors
4 Increase lattice order
5 Go to 2
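The 0-order stage of the loop above can be sketched on its own: with no transition features the model factorizes over positions, so the posterior of a tag is just a per-position softmax over its score λ·φ(x̄, y, t), and pruning keeps the tags above a threshold τ. A minimal Python sketch (the scores and the threshold are made-up illustration values, not taken from the slides):

```python
import math

def zero_order_posteriors(scores):
    """Posterior p(y | x, t) under a 0-order model: a per-position softmax,
    since without transition features the positions are independent."""
    posteriors = []
    for position_scores in scores:
        m = max(position_scores.values())           # subtract max for stability
        exp_scores = {tag: math.exp(s - m) for tag, s in position_scores.items()}
        z = sum(exp_scores.values())
        posteriors.append({tag: e / z for tag, e in exp_scores.items()})
    return posteriors

def prune(posteriors, tau):
    """Keep only the tags whose posterior reaches the threshold tau."""
    return [{tag for tag, p in pos.items() if p >= tau} for pos in posteriors]

# toy per-position scores for "Die" and "verlangt" (illustrative only)
scores = [{"ART": 2.0, "PDS": -0.5, "PRO": -4.0},
          {"VVF": 0.2, "VVP": 0.0, "VVI": -5.0}]
post = zero_order_posteriors(scores)
kept = prune(post, 0.05)   # low-posterior candidates such as PRO and VVI drop out
```

The surviving candidate sets are what the next, higher-order lattice is built from.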
Coarse-To-Fine Decoding II
Walkthrough on the example sentence "Die Rebellen haben kein Lösegeld verlangt":
- Create a 0-order lattice: every position holds its candidate tags (e.g. Die: ART, PDS, PRO; verlangt: VVP, VVF, VVI, VAP)
- Calculate posterior probabilities, e.g. Die: ART 0.90, PDS 0.10; haben: VAF 0.93, VAI 0.07; verlangt: VVF 0.54, VVP 0.46
- Prune states with low posteriors: candidates such as PRO 0.00, ADJ 0.01 and FM 0.01 are removed
- Increase lattice order to 1: the posteriors sharpen, e.g. verlangt: VVF 0.69, VVP 0.31
- Increase lattice order to 2: states become tag bigrams, e.g. (NN VVP) 0.55 vs. (NN VVF) 0.45
- Increase lattice order to 3: states become tag trigrams, e.g. (PI NN VVP) 0.63 vs. (PI NN VVF) 0.37
Model
Conditional Random Fields
Model:

$$p_{\text{ME}}(\bar{y} \mid \bar{x}) = \frac{1}{Z_{\text{ME}}(\lambda, \bar{x})} \exp \sum_t \lambda \cdot \phi(\bar{x}, y_t, t)$$

$$p_{\text{CRF}}(\bar{y} \mid \bar{x}) = \frac{1}{Z_{\text{CRF}}(\lambda, \bar{x})} \exp \sum_t \lambda \cdot \phi(\bar{x}, y_t, y_{t-1}, t)$$

Prune using the per-position posteriors:

$$p_{\text{ME}}(y \mid \bar{x}, t) = \frac{1}{Z_{\text{ME}}(\lambda, \bar{x})} \exp \lambda \cdot \phi(\bar{x}, y, t)$$

$$p_{\text{CRF}}(y \mid \bar{x}, t) = \frac{\sum_{\bar{y}' : y'_t = y} \exp \sum_{t'} \lambda \cdot \phi(\bar{x}, y'_{t'}, y'_{t'-1}, t')}{Z_{\text{CRF}}(\bar{x})}$$

Train using L1-regularized Stochastic Gradient Descent (SGD) [Tsuruoka et al., 2009]
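The normalizer Z_CRF(λ, x̄) in the CRF equation is computed with the forward algorithm rather than by summing over all tag sequences explicitly. A NumPy sketch under my own naming conventions (array shapes and function names are not from the paper), checked against brute-force enumeration on a tiny input:

```python
import numpy as np
from itertools import product

def log_partition(emissions, transitions):
    """log Z_CRF(lambda, x) by the forward algorithm.
    emissions[t, y]   = lambda . phi(x, y, t)   (per-position scores)
    transitions[p, y] = transition score from previous tag p to tag y."""
    alpha = emissions[0].copy()                  # log forward scores at t = 0
    for t in range(1, len(emissions)):
        # alpha_new[y] = logsumexp_p(alpha[p] + transitions[p, y]) + emissions[t, y]
        alpha = np.logaddexp.reduce(alpha[:, None] + transitions, axis=0) + emissions[t]
    return np.logaddexp.reduce(alpha)

def log_partition_brute_force(emissions, transitions):
    """The same quantity by enumerating every tag sequence (tiny inputs only)."""
    T, K = emissions.shape
    scores = []
    for seq in product(range(K), repeat=T):
        s = emissions[0, seq[0]]
        for t in range(1, T):
            s += transitions[seq[t - 1], seq[t]] + emissions[t, seq[t]]
        scores.append(s)
    return np.logaddexp.reduce(np.array(scores))
```

The forward pass is O(T·K²) while enumeration is O(Kᵀ), which is exactly why large tag sets make exact higher-order CRFs expensive: at order n the effective state space grows to Kⁿ.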
Coarse-To-Fine Decoding
We could now do the following:
- Train a 0-order model and filter
- Train a 1st-order model on the filtered data and filter
- ...
We do not want to do that, because:
- We would need to train multiple models
- We would get multiple weights for the same features
- We do not need optimal lower-order models
Train a single joint model instead!
Lattice Generation

function GetSumLattice(sentence, τ, n)
    candidates ← GetAllCandidates(sentence)
    lattice ← ZeroOrderLattice(candidates)
    for i = 1 to n do
        candidates ← lattice.prune(τ_{i-1})
        lattice ← SequenceLattice(candidates, i)
    end for
    return lattice
end function
Lattice Generation II
- With the current lattice generation we never do lower-order updates
- We thus never force the model to keep the gold tags in the lower-order lattices
- If a gold tag gets pruned, do an Early Update [Collins and Roark, 2004]
Lattice Generation III

function GetSumLattice(sentence, τ, n)
    gold-tags ← GetTags(sentence)
    candidates ← GetAllCandidates(sentence)
    lattice ← ZeroOrderLattice(candidates)
    for i = 1 to n do
        candidates ← lattice.prune(τ_{i-1})
        if gold-tags ⊄ candidates then
            return lattice
        end if
        lattice ← SequenceLattice(candidates, i)
    end for
    return lattice
end function
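The early-update check can be illustrated in isolation: walk the increasingly fine lattices and stop as soon as pruning has dropped a gold tag, returning a lattice in which the gold sequence is still scored so the model can be updated against it. A toy sketch (the list-of-candidate-sets representation is my own simplification, not the paper's lattice data structure):

```python
def early_update_lattice(candidates_per_order, gold_tags):
    """candidates_per_order[i][t] is the set of tags surviving at position t
    after pruning for the order-i lattice. Returns (lattice, order): the last
    lattice in which every gold tag is still reachable, so training can do an
    early update [Collins and Roark, 2004] instead of continuing to refine."""
    previous, previous_order = None, -1
    for order, candidates in enumerate(candidates_per_order):
        gold_reachable = all(g in c for g, c in zip(gold_tags, candidates))
        if not gold_reachable:
            # a gold tag was pruned away: stop early and update here
            return previous, previous_order
        previous, previous_order = candidates, order
    return previous, previous_order

# toy data: pruning at order 1 drops VVP, so gold ("ART", "VVP") triggers an early stop
lattices = [
    [{"ART", "PDS"}, {"VVF", "VVP"}],   # order 0, after pruning
    [{"ART"}, {"VVF"}],                 # order 1, after pruning
]
```

This is the behaviour the slide's note describes: without it, nothing in training penalizes the lower-order models for pruning away the gold path.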
Lattice Generation IV
[figure: fraction of unreachable gold candidates (up to about 0.2) over training epochs]
Lattice Generation V
How do we set τ_i, the pruning threshold at order i?
Fixed τ_i do not work, because:
- p(y | x̄, t) decreases with increasing tag set size
- During training, we start with uniform models and end with sparse models
Solution: set τ_i dynamically to achieve a certain average number of tags μ:

$$\tau_i \leftarrow \begin{cases} 1.1\,\tau_i & \text{if } \hat{\mu}_i > \mu \\ 0.9\,\tau_i & \text{if } \hat{\mu}_i < \mu \end{cases}$$
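The dynamic-threshold rule amounts to a small multiplicative controller. I assume the usual pruning semantics here, in which states with posterior below τ_i are removed, so raising τ_i keeps fewer tags; the 10% step mirrors the update rule on the slide:

```python
def update_threshold(tau, mu_hat, mu, step=0.1):
    """Nudge the order-i pruning threshold so the measured average number of
    surviving tags per position (mu_hat) drifts toward the target mu."""
    if mu_hat > mu:
        return tau * (1.0 + step)   # too many candidates survive: prune harder
    if mu_hat < mu:
        return tau * (1.0 - step)   # too few candidates survive: relax
    return tau
```

In training one would apply this periodically per order, using the measured average candidate count μ̂_i, which also adapts the thresholds as the model moves from near-uniform to sparse.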
Experiments
Languages

Language    Sentences   POS
Arabic      15,
Czech       38,
English     38,
Spanish     14,
German      40,
Hungarian   61,034      57
Baseline Taggers
- SVMTool [Giménez and Màrquez, 2004]: SVM-based left-to-right tagger
- CRFSuite [Okazaki, 2007]: first-order Conditional Random Field (trained with SGD)
- Morfette [Chrupała et al., 2008]: averaged perceptron
- Stanford Tagger [Toutanova et al., 2003]: bidirectional Maximum Entropy Markov Model
POS Experiments
[table: training time (TT) and accuracy (ACC) for CRF and PCRF per language and order n, for Arabic, Czech, Spanish, German, Hungarian and English; * marks significant differences]
- CRF training is fast for small tag sets (Czech and Spanish) but slow for big tag sets
- First order: PCRF is two to thirty times as fast as CRF
- Higher orders: small but significant improvements in accuracy for all languages
POS Experiments II - Accuracy
[table: accuracies for SVMTool, Morfette, CRFSuite, Stanford and PCRF orders 1-3 per language; * marks significant differences]
- PCRF outperforms the best baseline for 3 out of 6 languages
- Never significantly worse than the best baseline
POS Experiments II - Training Times
[table: training times for Morfette, CRFSuite and PCRF orders 1-2 per language]
- CRFSuite is the fastest baseline tagger for all languages
- PCRF is faster for bigger tag sets (> 38)
Languages

Language    Sentences   POS   POS+MORPH
Arabic      15,
Czech       38,
Spanish     14,
German      40,
Hungarian   61,
POS + MORPH Experiments
[table: accuracies for the Oracle and PCRF orders 1-3 per language; * marks significant differences]
- The Oracle is a first-order PCRF, but gold tags get reinserted when pruned
- Small losses for Spanish and Hungarian, greater losses for Arabic, English and German
- Higher-order models outperform the Oracle
POS + MORPH Experiments II - Accuracy
[table: accuracies for SVMTool, Morfette, CRFSuite and PCRF orders 1-3 per language]
- PCRF outperforms the best baselines
- Moderate improvements for less ambiguous languages (Spanish, Hungarian)
- Large improvements for more ambiguous languages (Arabic, Czech, German)
POS + MORPH Experiments II - Training Times
[table: training times for Morfette, CRFSuite and PCRF orders 1-3 per language]
- CRFSuite is slower than Morfette (by an order of magnitude for Hungarian and Czech)
- PCRF is usually twice as fast as Morfette
- Czech training takes a week for CRFSuite and 5 hours for PCRF
Conclusion
- Approximate CRF tagger for big tag sets
- Fast due to coarse-to-fine decoding (speedups of up to 30)
- Supports high-order CRFs
- Higher accuracy than a number of baselines thanks to high order
MarMoT - Morphological Tagger
Our open-source implementation MarMoT is available online.
Thank you for your attention!
References

Chrupała, G., Dinu, G., and van Genabith, J. (2008). Learning morphology with Morfette. In Proceedings of LREC.

Collins, M. and Roark, B. (2004). Incremental parsing with the perceptron algorithm. In Proceedings of ACL.

Giménez, J. and Màrquez, L. (2004). SVMTool: A general POS tagger generator based on Support Vector Machines. In Proceedings of LREC.

Okazaki, N. (2007). CRFsuite: A fast implementation of conditional random fields (CRFs).

Schmid, H. and Laws, F. (2008). Estimation of conditional probabilities with decision trees and an application to fine-grained POS tagging. In Proceedings of COLING.

Toutanova, K., Klein, D., Manning, C. D., and Singer, Y. (2003). Feature-rich part-of-speech tagging with a cyclic dependency network. In Proceedings of NAACL.

Tsuruoka, Y., Tsujii, J., and Ananiadou, S. (2009). Stochastic gradient descent training for L1-regularized log-linear models with cumulative penalty. In Proceedings of ACL.
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationFeature-Frequency Adaptive On-line Training for Fast and Accurate Natural Language Processing
Feature-Frequency Adaptive On-line Training for Fast and Accurate Natural Language Processing Xu Sun Peking University Wenjie Li Hong Kong Polytechnic University Houfeng Wang Peking University Qin Lu Hong
More informationIntelligent Systems (AI-2)
Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 19 Oct, 24, 2016 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models D. Page,
More informationMachine Learning for NLP
Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline
More informationBetter! Faster! Stronger*!
Jason Eisner Jiarong Jiang He He Better! Faster! Stronger*! Learning to balance accuracy and efficiency when predicting linguistic structures (*theorems) Hal Daumé III UMD CS, UMIACS, Linguistics me@hal3.name
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationMachine Learning Basics
Security and Fairness of Deep Learning Machine Learning Basics Anupam Datta CMU Spring 2019 Image Classification Image Classification Image classification pipeline Input: A training set of N images, each
More informationNotes on AdaGrad. Joseph Perla 2014
Notes on AdaGrad Joseph Perla 2014 1 Introduction Stochastic Gradient Descent (SGD) is a common online learning algorithm for optimizing convex (and often non-convex) functions in machine learning today.
More informationTuning as Linear Regression
Tuning as Linear Regression Marzieh Bazrafshan, Tagyoung Chung and Daniel Gildea Department of Computer Science University of Rochester Rochester, NY 14627 Abstract We propose a tuning method for statistical
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationStructured Prediction Models via the Matrix-Tree Theorem
Structured Prediction Models via the Matrix-Tree Theorem Terry Koo Amir Globerson Xavier Carreras Michael Collins maestro@csail.mit.edu gamir@csail.mit.edu carreras@csail.mit.edu mcollins@csail.mit.edu
More informationProbabilistic Context Free Grammars. Many slides from Michael Collins
Probabilistic Context Free Grammars Many slides from Michael Collins Overview I Probabilistic Context-Free Grammars (PCFGs) I The CKY Algorithm for parsing with PCFGs A Probabilistic Context-Free Grammar
More informationIntegrating Morphology in Probabilistic Translation Models
Integrating Morphology in Probabilistic Translation Models Chris Dyer joint work with Jon Clark, Alon Lavie, and Noah Smith January 24, 2011 lti das alte Haus the old house mach das do that 2 das alte
More informationClick Prediction and Preference Ranking of RSS Feeds
Click Prediction and Preference Ranking of RSS Feeds 1 Introduction December 11, 2009 Steven Wu RSS (Really Simple Syndication) is a family of data formats used to publish frequently updated works. RSS
More informationA Discriminative Model for Semantics-to-String Translation
A Discriminative Model for Semantics-to-String Translation Aleš Tamchyna 1 and Chris Quirk 2 and Michel Galley 2 1 Charles University in Prague 2 Microsoft Research July 30, 2015 Tamchyna, Quirk, Galley
More informationNatural Language Processing. Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu
Natural Language Processing Slides from Andreas Vlachos, Chris Manning, Mihai Surdeanu Projects Project descriptions due today! Last class Sequence to sequence models Attention Pointer networks Today Weak
More informationDual Decomposition for Natural Language Processing. Decoding complexity
Dual Decomposition for atural Language Processing Alexander M. Rush and Michael Collins Decoding complexity focus: decoding problem for natural language tasks motivation: y = arg max y f (y) richer model
More informationLogistic Regression Review Fall 2012 Recitation. September 25, 2012 TA: Selen Uguroglu
Logistic Regression Review 10-601 Fall 2012 Recitation September 25, 2012 TA: Selen Uguroglu!1 Outline Decision Theory Logistic regression Goal Loss function Inference Gradient Descent!2 Training Data
More informationMachine Learning. Neural Networks. (slides from Domingos, Pardo, others)
Machine Learning Neural Networks (slides from Domingos, Pardo, others) For this week, Reading Chapter 4: Neural Networks (Mitchell, 1997) See Canvas For subsequent weeks: Scaling Learning Algorithms toward
More informationLecture 9: PGM Learning
13 Oct 2014 Intro. to Stats. Machine Learning COMP SCI 4401/7401 Table of Contents I Learning parameters in MRFs 1 Learning parameters in MRFs Inference and Learning Given parameters (of potentials) and
More informationlecture 6: modeling sequences (final part)
Natural Language Processing 1 lecture 6: modeling sequences (final part) Ivan Titov Institute for Logic, Language and Computation Outline After a recap: } Few more words about unsupervised estimation of
More informationNeural networks. Chapter 19, Sections 1 5 1
Neural networks Chapter 19, Sections 1 5 Chapter 19, Sections 1 5 1 Outline Brains Neural networks Perceptrons Multilayer perceptrons Applications of neural networks Chapter 19, Sections 1 5 2 Brains 10
More informationHMM part 1. Dr Philip Jackson
Centre for Vision Speech & Signal Processing University of Surrey, Guildford GU2 7XH. HMM part 1 Dr Philip Jackson Probability fundamentals Markov models State topology diagrams Hidden Markov models -
More informationConditional Random Field
Introduction Linear-Chain General Specific Implementations Conclusions Corso di Elaborazione del Linguaggio Naturale Pisa, May, 2011 Introduction Linear-Chain General Specific Implementations Conclusions
More informationLinear discriminant functions
Andrea Passerini passerini@disi.unitn.it Machine Learning Discriminative learning Discriminative vs generative Generative learning assumes knowledge of the distribution governing the data Discriminative
More informationA Syntax-based Statistical Machine Translation Model. Alexander Friedl, Georg Teichtmeister
A Syntax-based Statistical Machine Translation Model Alexander Friedl, Georg Teichtmeister 4.12.2006 Introduction The model Experiment Conclusion Statistical Translation Model (STM): - mathematical model
More informationFeedforward Neural Networks
Feedforward Neural Networks Michael Collins 1 Introduction In the previous notes, we introduced an important class of models, log-linear models. In this note, we describe feedforward neural networks, which
More informationClassification with Perceptrons. Reading:
Classification with Perceptrons Reading: Chapters 1-3 of Michael Nielsen's online book on neural networks covers the basics of perceptrons and multilayer neural networks We will cover material in Chapters
More informationFeatures of Statistical Parsers
Features of tatistical Parsers Preliminary results Mark Johnson Brown University TTI, October 2003 Joint work with Michael Collins (MIT) upported by NF grants LI 9720368 and II0095940 1 Talk outline tatistical
More informationAN ABSTRACT OF THE DISSERTATION OF
AN ABSTRACT OF THE DISSERTATION OF Kai Zhao for the degree of Doctor of Philosophy in Computer Science presented on May 30, 2017. Title: Structured Learning with Latent Variables: Theory and Algorithms
More informationINF4820: Algorithms for Artificial Intelligence and Natural Language Processing. Hidden Markov Models
INF4820: Algorithms for Artificial Intelligence and Natural Language Processing Hidden Markov Models Murhaf Fares & Stephan Oepen Language Technology Group (LTG) October 27, 2016 Recap: Probabilistic Language
More informationReducing Multiclass to Binary: A Unifying Approach for Margin Classifiers
Reducing Multiclass to Binary: A Unifying Approach for Margin Classifiers Erin Allwein, Robert Schapire and Yoram Singer Journal of Machine Learning Research, 1:113-141, 000 CSE 54: Seminar on Learning
More information