Search-Based Structured Prediction

1 Search-Based Structured Prediction. By Harold C. Daumé III (Utah), John Langford (Yahoo), and Daniel Marcu (USC); submitted to Machine Learning, 2007. Presented by Eugene Weinstein, NYU/Courant Institute, October 2nd, 2007.

2 Structured Prediction Intro. Given: labeled training data $(x_1, y_1), \ldots, (x_m, y_m) \in X \times Y$. Task: learn a mapping from inputs $x \in X$ to outputs $y \in Y$. Special cases: binary classification, $Y = \{-1, +1\}$; multiclass classification, $Y = \{1, \ldots, k\}$. Natural language parsing example: $x$ is a sentence and $y$ is its parse tree (figure in the original slides).
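To make the setup concrete, here is a tiny sketch of training pairs for sequence labeling (illustrative data, not from the paper), where each input is a token sequence and each output a tag sequence of the same length:

```python
# Training pairs (x_i, y_i) in X x Y for part-of-speech tagging:
# inputs are token sequences, outputs are equal-length tag sequences.
train = [
    (["the", "man", "bit", "the", "dog"],
     ["DT", "NN", "VBD", "DT", "NN"]),
    (["dogs", "bark"],
     ["NNS", "VBP"]),
]
for x, y in train:
    assert len(x) == len(y)  # one output label per input token
```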

3 Exploiting Structure. Naive approach: treat each possible output in $Y$ as a discrete label and apply multiclass classification. But: enumerating all members of $Y$ is often intractable, and this cannot model closeness of examples (changing one node of a tree vs. changing the entire tree). Approach: exploit structure and dependencies within the output space, representing closeness of outputs with a loss function. (Figure: pairs of trees illustrating small loss for a one-node change and big loss for an entirely different tree.)
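Both objections can be made quantitative in a few lines; the tag-set size and sentence length below are illustrative assumptions:

```python
# 1) Enumeration is intractable: with k labels and length-L outputs,
#    |Y| = k**L grows exponentially.
k, L = 45, 20                    # e.g. a POS tag set, a 20-word sentence
print(f"|Y| = {k ** L:.2e}")     # on the order of 10**33 candidate outputs

# 2) A structured loss expresses closeness: Hamming loss counts how
#    many positions differ, so changing one node costs far less than
#    changing the entire output.
def hamming_loss(y_true, y_pred):
    return sum(a != b for a, b in zip(y_true, y_pred))

y      = ["DT", "NN", "VBD", "DT", "NN"]
almost = ["DT", "NN", "VBD", "DT", "NNS"]   # one position changed: loss 1
far    = ["VB", "JJ", "IN",  "CC", "RB"]    # everything changed: loss 5
print(hamming_loss(y, almost), hamming_loss(y, far))
```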

4 SP Overview. Discriminative structured prediction papers typically extend multiclass classification or regression techniques. Most classification schemes use SVM-like max-margin linear classifiers incorporating loss functions [Taskar, Guestrin, Koller 03], [Tsochantaridis, Hofmann, Joachims, Altun 04], [Sha, Saul 07]. Regression formulation of SP: [Cortes, Mohri, Weston 06]. Searn is a meta-algorithm; the claim is that given a multiclass classifier achieving good generalization, Searn achieves the same for structured prediction.

5 Search-based SP [Daumé 06] [Daumé, Langford, Marcu 07]. Searn: view structured prediction as a search problem. SP: a distribution $D$ over input/cost pairs $(x, c)$ with $c \in \mathbb{R}^{|Y|}$; e.g., $x_i$ is the input and $c_y$ is the loss of predicting $y$ when the true label is $y_i$. Define the loss of a cost-sensitive classifier $h : X \to Y$ as $L(D, h) = E_{(x,c) \sim D}\left[ c_{h(x)} \right]$. View outputs as vectors $y = (y^{(1)}, \ldots, y^{(l)})$, but the classification problems handled are not limited to sequences. A classifier defines a path through the space of input/output pairs, and the training process iteratively refines the classifier.
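A minimal sketch of estimating the cost-sensitive loss $L(D, h)$ from a finite sample; the toy cost vectors are assumptions for illustration:

```python
# Empirical estimate of L(D, h) = E_{(x,c)~D}[ c_{h(x)} ]: average,
# over sampled examples, the cost of the label the classifier picks.
def empirical_loss(sample, h):
    """sample: list of (x, c) pairs, c a dict mapping label -> cost."""
    return sum(c[h(x)] for x, c in sample) / len(sample)

# Toy sample with three labels; cost 0 marks the true label.
sample = [
    ("x1", {"a": 0.0, "b": 1.0, "c": 2.0}),
    ("x2", {"a": 1.0, "b": 0.0, "c": 1.0}),
]
h = lambda x: "a"                  # a constant (and imperfect) classifier
print(empirical_loss(sample, h))   # (0.0 + 1.0) / 2 = 0.5
```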

6 Searn Specifics. We need to provide three things, sketched below: a cost-sensitive multiclass learning algorithm, an initial classifier, and a loss function. The initial classifier should have low training error, but need not generalize well; it could be the best path from any standard search algorithm. Each Searn iteration finds a classifier that is not as good on the training set, but generalizes a little better.
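The three ingredients written out as Python signatures; the type names are illustrative assumptions, not the paper's notation:

```python
from typing import Callable, Dict, List, Sequence, Tuple

State = Tuple[Sequence, Tuple]            # (input x, partial output so far)
Example = Tuple[State, Dict[str, float]]  # a state plus per-action costs

# 1) Cost-sensitive multiclass learner: examples -> next-action classifier.
Learner = Callable[[List[Example]], Callable[[State], str]]

# 2) Initial classifier: may consult the true output at training time,
#    e.g. the best path returned by any standard search algorithm.
InitialPolicy = Callable[[State, Sequence], str]

# 3) Loss function over complete structured outputs.
Loss = Callable[[Sequence, Sequence], float]
```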

7 Searn Training. Search state space: (input, partial output), i.e. $s = (x, y^{(1)}, \ldots, y^{(l)})$. Initial classifier: pick the next label that minimizes the cost, assuming that all future decisions are also optimal: $h_0(s, c) = \arg\min_{y^{(l+1)}} \min_{y^{(l+2)}, \ldots, y^{(L)}} c_{(y^{(1)}, \ldots, y^{(L)})}$. Iterative step: use the current classifier $h$ to construct a set of examples to train the next classifier, then interpolate. For each state, try every possible next output; the cost assigned to each output tried is the loss difference $\ell_h(c, s, a) = E_{y \sim (s, a, h)}\, c_y - \min_{a'} E_{y \sim (s, a', h)}\, c_y$.
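Under Hamming loss the initial classifier has a closed form: if all future decisions are optimal, the cost-minimizing next label is simply the true label at the current position. A sketch under that assumption:

```python
# Optimal policy h_0 under Hamming loss: the completion that minimizes
# the remaining cost copies the true output, so the best next action
# is the true label at the current position.
def h0_hamming(state, y_true):
    x, partial = state
    return y_true[len(partial)]

y_true = ["DT", "NN", "VBD"]
state = (["the", "man", "bit"], ("DT",))   # one label already emitted
print(h0_hamming(state, y_true))           # -> "NN"
```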

8-12 Searn Training Illustration (figure sequence). A sequence-labeling example over positions $i = 1, \ldots, 5$ shows: the path predicted by the current classifier $h$, the current state $s$, each potential next state $a$, and the alternative paths $(s, a, h)$ obtained by taking $a$ and letting $h$ finish the sequence. Each candidate action is scored with $\ell_h(c, s, a) = E_{y \sim (s, a, h)}\, c_y - \min_{a'} E_{y \sim (s, a', h)}\, c_y$; in the illustrated example the candidates receive costs $\ell_h = 2$, $5$, $1$, and $0$.
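The per-action costs in the illustration can be computed by rollouts: take the candidate action, let the current classifier finish the output, and compare losses. A sketch assuming Hamming loss and a deterministic classifier (both simplifications):

```python
# Roll the current classifier h forward from a partial output until
# the output is complete.
def rollout(x, partial, h):
    out = list(partial)
    while len(out) < len(x):
        out.append(h(x, tuple(out)))
    return out

# l_h(c, s, a): loss of (take a, then follow h) minus the best such loss.
def action_costs(x, partial, h, actions, y_true, loss):
    raw = {a: loss(y_true, rollout(x, partial + (a,), h)) for a in actions}
    best = min(raw.values())
    return {a: raw[a] - best for a in raw}

h = lambda x, partial: "NN"          # toy current classifier
x = ["the", "man", "bit", "the", "dog"]
y_true = ["DT", "NN", "VBD", "DT", "NN"]
hamming = lambda yt, yp: sum(a != b for a, b in zip(yt, yp))
print(action_costs(x, ("DT",), h, ["NN", "VBD"], y_true, hamming))
# -> {'NN': 0, 'VBD': 1}
```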

13 Searn Meta-Algorithm. Input: $(x_1, y_1), \ldots, (x_m, y_m)$, initial classifier $h_0$, cost-sensitive learner $A$. A state consists of the input and a partial output; the losses computed with the current classifier build up the training examples for the next iteration.

    h <- h_0
    while h has a significant dependence on h_0:
        initialize the set of cost-sensitive examples S <- {}
        for i = 1, ..., m:
            compute the prediction (y^(1), ..., y^(L)) <- h(x_i)
            for l = 1, ..., L:
                s_l <- (x_i, y^(1), ..., y^(l))
                for each next output a after s_l:
                    c_a <- l_h(c, s_l, a)
                compute features and add the example: S <- S + {(f(s_l), c)}
        learn and interpolate: h' <- A(S); h <- beta*h' + (1 - beta)*h
    return h with h_0 removed
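Putting the pieces together, a compact sketch of the whole loop under simplifying assumptions (deterministic rollouts, a loss over complete outputs, and a stochastic realization of the interpolation step); the function names are illustrative, not the paper's reference implementation:

```python
import random

def searn(train, h0, A, loss, actions, beta=0.1, iterations=5):
    """train: list of (x, y_true) pairs with sequence outputs;
    h0(x, partial, y_true) -> action is the initial (optimal) policy;
    A(examples) -> classifier mapping a state (x, partial) to an action."""
    current = h0                                      # start from h_0
    for _ in range(iterations):
        S = []                                        # cost-sensitive examples
        for x, y in train:
            partial = ()
            for _pos in range(len(y)):
                # Cost of each candidate action: take it, let the current
                # policy finish the output, and take the loss difference.
                raw = {}
                for a in actions:
                    out = list(partial) + [a]
                    while len(out) < len(y):
                        out.append(current(x, tuple(out), y))
                    raw[a] = loss(y, out)
                best = min(raw.values())
                S.append(((x, partial), {a: raw[a] - best for a in raw}))
                partial += (current(x, partial, y),)  # follow current policy
        h_new, prev = A(S), current
        # Interpolation h <- beta*h_new + (1-beta)*h, realized
        # stochastically: each decision uses h_new with probability beta.
        current = (lambda hn, hp: lambda x, p, y:
                   hn((x, p)) if random.random() < beta else hp(x, p, y))(h_new, prev)
    return current
```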

14 Algorithm Analysis. $h_i$ is the classifier trained up to the $i$th iteration, and $\ell_{h_i}(h_i)$ is its loss on that iteration's training examples; $T$ is the maximum length of any output sequence. Theorem: if $c_{\max} = E_{(x,c) \sim D} \max_y c_y$ and $\ell_{avg} = \frac{1}{I} \sum_{i=1}^{I} \ell_{h_i}(h_i)$ is the average loss over $I$ iterations, then with $\beta = 1/T^3$ and $I = 2T^3 \ln T$ iterations the loss of the final classifier is bounded as $L(D, h_{last}) \le L(D, h_0) + 2 T \ell_{avg} \ln T + (1 + \ln T)\, c_{\max} / T$. The proof analyzes the mixture of old and new classifiers. In practice, $\beta$ can be larger (more aggressive learning).
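For a feel of the schedule the theorem prescribes, a quick numeric check (illustrative values):

```python
import math

T = 10                          # maximum output length
beta = 1 / T**3                 # interpolation weight: 0.001
iters = 2 * T**3 * math.log(T)  # about 4605 iterations
slack = (1 + math.log(T)) / T   # coefficient on c_max: about 0.33
print(beta, round(iters), round(slack, 3))
```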

15 Proof. Lemma 1: for the classifier $h_{new} = \beta h' + (1 - \beta) h$ obtained by interpolating $h'$ into $h$, if $c_{\max} = E_{(x,c) \sim D} \max_y c_y$, then $L(D, h_{new}) \le L(D, h) + T \beta\, \ell_h^{CS}(h') + \tfrac{1}{2} \beta^2 T^2 c_{\max}$. Proof: let $c'$ be the number of the $T$ decisions on which the mixture calls $h'$, and consider three cases: $h'$ is never called ($c' = 0$), called exactly once ($c' = 1$), or called more than once ($c' \ge 2$). Then the loss of $h_{new}$ is bounded as
$L(D, h_{new}) = \Pr(c' = 0)\, L(D, h_{new} \mid c' = 0) + \Pr(c' = 1)\, L(D, h_{new} \mid c' = 1) + \Pr(c' \ge 2)\, L(D, h_{new} \mid c' \ge 2)$
$\le (1-\beta)^T L(D, h) + T \beta (1-\beta)^{T-1} \left[ L(D, h) + \ell_h^{CS}(h') \right] + \left[ 1 - (1-\beta)^T - T \beta (1-\beta)^{T-1} \right] c_{\max}$.

16 Proof Cont'd. Continuing from the case decomposition and simplifying:
$L(D, h_{new}) \le (1-\beta)^T L(D, h) + T\beta(1-\beta)^{T-1}\left[L(D, h) + \ell_h^{CS}(h')\right] + \left[1 - (1-\beta)^T - T\beta(1-\beta)^{T-1}\right] c_{\max}$
$\le L(D, h) + T\beta\, \ell_h^{CS}(h') + \left[1 - (1-\beta)^T - T\beta(1-\beta)^{T-1}\right] \left(c_{\max} - L(D, h)\right)$  [binomial expansion; $(1-\beta)^{T-1} \le 1$]
$\le L(D, h) + T\beta\, \ell_h^{CS}(h') + \left[1 - (1-\beta)^T - T\beta(1-\beta)^{T-1}\right] c_{\max}$
$\le L(D, h) + T\beta\, \ell_h^{CS}(h') + \tfrac{1}{2} T^2 \beta^2 c_{\max}$  [binomial expansion; keep the leading $\binom{T}{2}\beta^2$ term, valid when $T\beta < 1/2$],
which is the bound claimed in Lemma 1.

17 Proof Cont'd. Lemma 2: after $C/\beta$ iterations of Searn, the loss of the final classifier learned is bounded as $L(D, h_{last}) \le L(D, h_0) + C T \ell_{avg} + c_{\max} \left( \tfrac{1}{2} C T^2 \beta + T \exp(-C) \right)$. Proof: invoking Lemma 1 repeatedly over the $C/\beta$ iterations gives $L(D, h) \le L(D, h_0) + C T \ell_{avg} + \tfrac{1}{2} C T^2 \beta\, c_{\max}$. Removing the initial (optimal) classifier at the end might incur a loss of up to $c_{\max} T$; the probability that $h_0$ is still called after $C/\beta$ iterations is at most $(1 - \beta)^{C/\beta} \le \exp(-C)$, which contributes the final $c_{\max} T \exp(-C)$ term.
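Slide 14's theorem follows from Lemma 2 by substituting the schedule $\beta = 1/T^3$ and $C = 2 \ln T$ (so that $C/\beta = 2T^3 \ln T$ iterations); a worked check of the arithmetic:

```latex
L(D, h_{last})
  \le L(D, h_0) + C T \ell_{avg}
      + c_{\max}\left(\tfrac{1}{2} C T^2 \beta + T e^{-C}\right)
   =  L(D, h_0) + 2 T \ell_{avg} \ln T
      + c_{\max}\left(\tfrac{1}{2} \cdot 2 \ln T \cdot \tfrac{T^2}{T^3}
                      + T e^{-2 \ln T}\right)
   =  L(D, h_0) + 2 T \ell_{avg} \ln T
      + c_{\max}\left(\tfrac{\ln T}{T} + \tfrac{1}{T}\right)
   =  L(D, h_0) + 2 T \ell_{avg} \ln T + (1 + \ln T)\,\tfrac{c_{\max}}{T}.
```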

18 Experiments. Handwriting recognition [Kassel 95]. Named entity recognition (Spanish newswire example): El presidente de la [Junta de Extremadura]_ORG, [Juan Carlos Rodríguez Ibarra]_PER, recibirá en la sede de la [Presidencia del Gobierno]_ORG extremeño a familiares de varios de los condenados por el proceso [Lasa-Zabala]_MISC, entre ellos a [Lourdes Díez Urraca]_PER, esposa del ex gobernador civil de [Guipúzcoa]_LOC [Julen Elgorriaga]_PER; y a [Antonio Rodríguez Galindo]_PER, hermano del general [Enrique Rodríguez Galindo]_PER. Syntactic chunking and part-of-speech (POS) tagging: [Great American]_NP [said]_VP [it]_NP [increased]_VP [its loan-loss reserves]_NP [by]_PP [$ 93 million]_NP [after]_PP [reviewing]_VP [its loan portfolio]_NP, [raising]_VP [its total loan and real estate reserves]_NP [to]_PP [$ 217 million]_NP. The same sentence in per-token (word, POS tag, chunk tag) format begins:

    Great     NNP   B-NP
    American  NNP   I-NP
    said      VBD   B-VP
    it        PRP   B-NP
    increased VBD   B-VP
    its       PRP$  B-NP
    loan-loss NN    I-NP
    reserves  NNS   I-NP
    by        IN    B-PP
    $         $     B-NP
    93        CD    I-NP
    million   CD    I-NP
    after     IN    B-PP
    reviewing VBG   B-VP
    its       PRP$  B-NP
    loan      NN    I-NP
    portfolio NN    I-NP
    .         .     O
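The bracketed chunk annotation and the per-token BIO encoding shown above are two views of the same structure; a small sketch of converting spans to BIO tags (illustrative helper, not from the paper):

```python
# Convert (tokens, chunk-label) spans to per-token BIO tags:
# the first token of a chunk gets "B-X", the rest "I-X".
def spans_to_bio(spans):
    """spans: list of (tokens, label) pairs; label None means outside."""
    tagged = []
    for tokens, label in spans:
        for j, tok in enumerate(tokens):
            if label is None:
                tagged.append((tok, "O"))
            else:
                tagged.append((tok, ("B-" if j == 0 else "I-") + label))
    return tagged

spans = [(["Great", "American"], "NP"), (["said"], "VP"), (["it"], "NP")]
for tok, tag in spans_to_bio(spans):
    print(tok, tag)   # Great B-NP / American I-NP / said B-VP / it B-NP
```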

19 Experiments. Results table (the numeric scores were not preserved in this transcription). Columns: Handwriting (Small, Large), NER (Small, Large), Chunk, and C+T (joint chunking + tagging). Rows, in three groups:
CLASSIFICATION: Perceptron, Log Reg, SVM-Lin, SVM-Quad
STRUCTURED: Str. Perc., CRF, SVM-struct, M³N-Lin, M³N-Quad
SEARN: Perceptron, Log Reg, SVM-Lin, SVM-Quad

20 Experiments. New vine-growth model for sentence summarization. DUC 2005 data set: 50 sets of 25 documents each. Evaluation: Rouge (n-gram overlap) vs. human summaries. Table 2 (summarization results; values are Rouge 2 scores, higher is better) compares Oracle and Searn in vine and extraction variants (with D05 and D03 training data), BayeSum, the baseline, and the best system at the 100-word setting; the numeric scores were not preserved in this transcription.
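Rouge-2 scores summaries by bigram overlap with human references; a minimal recall-style sketch, simplified relative to the official scorer:

```python
# Simplified Rouge-2 recall: fraction of reference bigrams that also
# appear in the candidate summary (the official scorer adds stemming,
# multiple references, and other refinements).
def bigrams(tokens):
    return [tuple(tokens[i:i + 2]) for i in range(len(tokens) - 1)]

def rouge2_recall(candidate, reference):
    cand = set(bigrams(candidate))
    ref = bigrams(reference)
    return sum(b in cand for b in ref) / len(ref)

ref = "the senate passed the budget bill".split()
cand = "senate passed the budget yesterday".split()
print(round(rouge2_recall(cand, ref), 2))   # 3 of 5 ref bigrams -> 0.6
```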

21 Bibliography
Harold C. Daumé III. Practical Structured Learning for Natural Language Processing. Ph.D. thesis, University of Southern California, 2006.
Harold C. Daumé III, John Langford, and Daniel Marcu. Search-Based Structured Prediction. Submitted to Machine Learning, 2007.
Robert Kassel. A Comparison of Approaches to On-line Handwritten Character Recognition. Ph.D. thesis, Massachusetts Institute of Technology, Spoken Language Systems Group, 1995.
Koby Crammer and Yoram Singer. On the Algorithmic Implementation of Multiclass Kernel-Based Vector Machines. Journal of Machine Learning Research, 2, 2001.
Ben Taskar, Carlos Guestrin, and Daphne Koller. Max-Margin Markov Networks. Neural Information Processing Systems (NIPS) 16, 2003.
Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support Vector Machine Learning for Interdependent and Structured Output Spaces. Proceedings of ICML, 2004.
Fei Sha and Lawrence K. Saul. Large Margin Hidden Markov Models for Automatic Speech Recognition. Neural Information Processing Systems (NIPS) 19, 2007.
William W. Cohen and Vitor Carvalho. Stacked Sequential Learning. Proceedings of IJCAI, 2005.
Michael Collins and Brian Roark. Incremental Parsing with the Perceptron Algorithm. Proceedings of ACL, 2004.
Alina Beygelzimer, Varsha Dani, Tom Hayes, John Langford, and Bianca Zadrozny. Error Limiting Reductions Between Classification Tasks. Proceedings of ICML, 2005.
