Structured Prediction


Machine Learning, Fall 2017 (structured perceptron, HMM, structured SVM). Professor Liang Huang. (Chap. 17 of CIML)

binary classification: output is binary, e.g. x = "the man bit the dog", y = +1 or -1
multiclass classification: output is a number (small # of classes)
structured classification: output is a structure (sequence, tree, graph), e.g. the tag sequence y = DT NN VBD DT NN or a parse tree for x = "the man bit the dog"
examples: part-of-speech tagging, parsing, summarization, translation
exponentially many classes: search (inference) efficiency is crucial!

Generic Perceptron
online learning: one example at a time
learning by doing: update weights at mistakes
for each example x_i: find the best output z_i under the current weights w (inference); if z_i differs from y_i, update the weights w
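A minimal Python sketch of this loop, assuming a sparse feature map phi(x, y) that returns a Counter and an inference(x, w) routine that computes the argmax under the current weights (both names are placeholders, not from the slides):

from collections import defaultdict

def train_perceptron(data, phi, inference, epochs=5):
    """Generic (structured) perceptron: one example at a time,
    update the weights only at mistakes."""
    w = defaultdict(float)                    # sparse weight vector
    for _ in range(epochs):
        for x, y in data:                     # online: one example at a time
            z = inference(x, w)               # best output under the current weights
            if z != y:                        # learning by doing: update at mistakes
                for f, v in phi(x, y).items():
                    w[f] += v                 # reward features of the correct output
                for f, v in phi(x, z).items():
                    w[f] -= v                 # penalize features of the predicted output
    return w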

Perceptron: from binary to structured
binary classification: 2 classes (y = +1 or -1); exact inference is trivial; update weights if z differs from y
multiclass classification: constant # of classes; exact inference is easy; update weights if z differs from y
structured classification: exponential # of classes; exact inference is hard (e.g. x = "the man bit the dog", y = DT NN VBD DT NN); update weights if z differs from y

From Perceptron to SVM (a timeline)
1959 Rosenblatt: invention of the perceptron
1962 Novikoff: convergence proof
1964 Vapnik & Chervonenkis: max margin (same journal!)
1964 Aizerman et al.: online kernels (kernelized perceptron)
1995 Cortes/Vapnik: SVM (kernelized perceptron + max margin + soft margin; fall of the USSR)
1999 Freund/Schapire: voted/averaged perceptron, revived (inseparable case)
2001 Lafferty et al.: CRF (multinomial logistic regression / max entropy), AT&T Research
2002 Collins: structured perceptron (ex-AT&T and students)
2003 Crammer/Singer: MIRA (conservative updates, online approx. max margin)
2003 Taskar et al.: M3N
2004 Tsochantaridis et al.: structured SVM
2005 McDonald et al.: structured MIRA
2006 Singer group: aggressive updates
2007-2010 Singer group: Pegasos (subgradient descent, minibatch/batch)

Multiclass Classification
one weight vector ("prototype") for each class
multiclass decision rule: ŷ = argmax_{z ∈ 1...M} w^(z) · x (best agreement with the prototype)
Q1: what about 2-class? Q2: do we still need augmented space?

Multiclass Perceptron: on an error, penalize the weight vector for the wrong class and reward the weight vector for the true class.
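A small sketch of the decision rule and update with dense vectors, one prototype per class; the names W, predict, and perceptron_update are illustrative, not from the slides:

def predict(W, x):
    """Decision rule: y_hat = argmax_z w^(z) . x (best agreement with the prototype)."""
    dot = lambda wz: sum(wi * xi for wi, xi in zip(wz, x))
    return max(range(len(W)), key=lambda z: dot(W[z]))

def perceptron_update(W, x, y):
    """On an error, reward the true class's prototype and penalize the wrong one."""
    z = predict(W, x)
    if z != y:
        for i, xi in enumerate(x):
            W[y][i] += xi    # reward the weight vector of the true class
            W[z][i] -= xi    # penalize the weight vector of the wrong class
    return W

# usage: W = [[0.0] * dim for _ in range(num_classes)], then call perceptron_update(W, x, y) per example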

Convergence of Multiclass
update rule: w ← w + Δɸ(x, y, z), where Δɸ(x, y, z) = ɸ(x, y) − ɸ(x, z)
separability: ∃ u, δ > 0 s.t. ∀ (x, y) ∈ D, u · Δɸ(x, y, z) ≥ δ for all z ≠ y

Example: POS Tagging
gold-standard y:   DT NN VBD DT NN   (the man bit the dog)
current output z:  DT NN NN  DT NN   (the man bit the dog)
assume only two feature classes: tag bigrams (t_{i-1}, t_i) and word/tag pairs (t_i, w_i)
weights ++: (NN, VBD), (VBD, DT), (VBD, bit)
weights --: (NN, NN), (NN, DT), (NN, bit)
ɸ(x, y) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (VBD, bit): 1, ..., (NN, dog): 1}
ɸ(x, z) = {(<s>, DT): 1, (DT, NN): 2, ..., (NN, </s>): 1, (DT, the): 1, ..., (NN, bit): 1, ..., (NN, dog): 1}
the update adds ɸ(x, y) − ɸ(x, z) to w
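The two feature vectors and their difference can be reproduced with sparse counters; a small illustrative script (only the templates, tag bigrams and tag/word pairs, come from the slide; everything else is an assumption):

from collections import Counter

def phi(words, tags):
    """Tag-bigram and tag/word features as a sparse count vector."""
    padded = ["<s>"] + list(tags) + ["</s>"]
    feats = Counter(zip(padded, padded[1:]))      # tag bigrams (t_{i-1}, t_i)
    feats.update(zip(tags, words))                # tag/word pairs (t_i, w_i)
    return feats

x = "the man bit the dog".split()
gold = "DT NN VBD DT NN".split()
pred = "DT NN NN DT NN".split()
print(dict(phi(x, gold) - phi(x, pred)))   # weights ++: (NN,VBD), (VBD,DT), (VBD,bit)
print(dict(phi(x, pred) - phi(x, gold)))   # weights --: (NN,NN), (NN,DT), (NN,bit)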

Structured Perceptron
for each example x_i: exact inference under the current weights w gives z_i; update the weights if z_i differs from y_i

Inference: Dynamic Programming
the exact inference step z = argmax_z w · ɸ(x, z) is computed by dynamic programming; update weights if z differs from y

Python implementation
Q: what about top-down recursive + memoization? (cf. CS 562, Lec 5-6: Probs & WFSTs)
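A possible bottom-up implementation of the bigram-tagger argmax (Viterbi), sketched under the same tag-bigram and tag/word weight keys as above; the function shape is my own sketch rather than the course code:

def viterbi(words, tags, w):
    """Exact argmax_z w . phi(x, z) for a bigram tagger, bottom-up DP."""
    def score(prev, t, word):
        return w.get((prev, t), 0.0) + w.get((t, word), 0.0)

    best = {"<s>": 0.0}                           # best score ending in each tag
    back = []                                     # backpointers, one dict per position
    for word in words:
        new_best, new_back = {}, {}
        for t in tags:
            prev = max(best, key=lambda p: best[p] + score(p, t, word))
            new_best[t] = best[prev] + score(prev, t, word)
            new_back[t] = prev
        best, back = new_best, back + [new_back]
    # transition into the end symbol, then follow backpointers
    last = max(best, key=lambda t: best[t] + w.get((t, "</s>"), 0.0))
    out = [last]
    for bp in reversed(back[1:]):
        out.append(bp[out[-1]])
    return list(reversed(out))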

Efficiency vs. Expressiveness
the inference step z_i = argmax over GEN(x_i) must be efficient
either the search space GEN(x) is small, or it must be factored
features must be local to y (but can be global to x), e.g. a bigram tagger can still look at all input words (cf. CRFs)

Averaged Perceptron
output the average of the weight vectors w^(0), w^(1), ..., w^(j) over all steps of training: more stable and accurate results
an approximation of the voted perceptron (Freund & Schapire, 1999)

Averaging Tricks
Daume (2006, PhD thesis); sparse vector: defaultdict
avoid explicitly summing the full weight vector after every example: the running sum w^(0) + w^(1) + w^(2) + ... + w^(t) can be maintained incrementally, since each update contributes to all later weight vectors
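One common way to make the averaging cheap is the two-accumulator variant sketched below (in the spirit of this trick, though not necessarily the exact bookkeeping from the thesis); it assumes the same Counter-valued phi and inference placeholders as in the earlier sketches:

from collections import defaultdict

def train_averaged(data, phi, inference, epochs=5):
    """Averaged perceptron with a lazy two-vector trick: w_a accumulates
    c * delta, so the average is never summed explicitly per step."""
    w = defaultdict(float)     # current weights
    w_a = defaultdict(float)   # counter-weighted accumulator
    c = 1                      # example counter
    for _ in range(epochs):
        for x, y in data:
            z = inference(x, w)
            if z != y:
                delta = phi(x, y).copy()
                delta.subtract(phi(x, z))      # sparse phi(x,y) - phi(x,z)
                for f, v in delta.items():
                    w[f] += v
                    w_a[f] += c * v
            c += 1
    # averaged weights: w - w_a / c
    return {f: w[f] - w_a[f] / c for f in w}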

Do we need smoothing?
smoothing is much easier in discriminative models: just make sure that for each feature template, its subset templates are also included
e.g., to include (t0, w0, w-1) you must also include (t0, w0), (t0, w-1), (w0, w-1), and maybe also (t0, t-1), because t is less sparse than w

Convergence with Exact Search
linear classification: converges iff the data is separable
structured: converges iff the data is separable & the search is exact
separable: there is an oracle vector that correctly labels all examples, one vs. the rest (the correct label scores better than all incorrect labels)
theorem: if separable, then # of updates ≤ R² / δ² (R: diameter, δ: separation margin); Rosenblatt 1957 ⇒ Collins 2002

Geometry of Convergence Proof, part 1
perceptron update: w^(k+1) = w^(k) + Δɸ(x, y, z), where z is the exact 1-best under the current model w^(k) and y the correct label
projecting onto the unit oracle vector u, each update gains at least the separation margin δ: u · w^(k+1) ≥ u · w^(k) + δ, so u · w^(k) ≥ kδ by induction
(part 1: bound on the projection onto u)

Geometry of Convergence Proof, part 2
an update only happens on a violation (the incorrect label z scored higher), so w^(k) · Δɸ(x, y, z) ≤ 0
with R the max diameter (‖Δɸ(x, y, z)‖ ≤ R): ‖w^(k+1)‖² ≤ ‖w^(k)‖² + R², so ‖w^(k)‖² ≤ kR² by induction
(part 2: upper bound on the norm)
parts 1 + 2 ⇒ update bound: k ≤ R² / δ²
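The same two bounds written out, assuming w^(0) = 0, a unit oracle vector u with u · Δɸ(x, y, z) ≥ δ on every example, and ‖Δɸ(x, y, z)‖ ≤ R:

% part 1: each update gains at least delta along the oracle direction
\[ u \cdot w^{(k+1)} = u \cdot w^{(k)} + u \cdot \Delta\phi(x,y,z) \ge u \cdot w^{(k)} + \delta
   \;\Rightarrow\; u \cdot w^{(k)} \ge k\,\delta \ \text{(by induction)} \]
% part 2: updates happen only on violations, so the norm grows slowly
\[ \|w^{(k+1)}\|^2 = \|w^{(k)}\|^2 + 2\,w^{(k)}\!\cdot\!\Delta\phi(x,y,z) + \|\Delta\phi(x,y,z)\|^2
   \le \|w^{(k)}\|^2 + R^2
   \;\Rightarrow\; \|w^{(k)}\|^2 \le k\,R^2 \ \text{(by induction)} \]
% combining both parts
\[ k\,\delta \le u \cdot w^{(k)} \le \|w^{(k)}\| \le \sqrt{k}\,R
   \;\Rightarrow\; k \le R^2/\delta^2 \]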

Experiments

Experiments: Tagging
(almost) identical features from (Ratnaparkhi, 1996)
trigram tagger: current tag t_i, previous tags t_{i-1}, t_{i-2}
current word w_i and its spelling features
surrounding words w_{i-1}, w_{i+1}, w_{i-2}, w_{i+2}, ...

Experiments: NP Chunking
B-I-O scheme:
Rockwell/B International/I Corp./I 's/B Tulsa/I unit/I said/O it/B signed/O a/B tentative/I agreement/I ...
features: unigram model; surrounding words and POS tags

Experiments: NP Chunking results
(Sha and Pereira, 2003), trigram tagger: voted perceptron 94.09% vs. CRF 94.38%

Structured SVM
structured perceptron: w · Δɸ(x, y, z) > 0
SVM: for all (x, y), functional margin y (w · x) ≥ 1
structured SVM, version 1 (simple loss): for all (x, y) and for all z, margin w · Δɸ(x, y, z) ≥ 1 (the correct y has to score higher than any wrong z by 1)
structured SVM, version 2 (structured loss): for all (x, y) and for all z, margin w · Δɸ(x, y, z) ≥ Δ(y, z) (the correct y has to score higher than any wrong z by Δ(y, z), a distance metric such as Hamming loss)

Loss-Augmented Decoding
want, for all z: w · ɸ(x, y) ≥ w · ɸ(x, z) + Δ(y, z)
same as: w · ɸ(x, y) ≥ max_z [ w · ɸ(x, z) + Δ(y, z) ]
loss-augmented decoding: argmax_z [ w · ɸ(x, z) + Δ(y, z) ]
if Δ(y, z) factors in z (e.g. Hamming loss), just modify the DP
CIML version: the modified DP should have a learning rate! λ = 1/(2C); very similar to Pegasos, but one should use the Pegasos framework instead
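For Hamming loss the DP change is just a +1 on the local score of every candidate tag that disagrees with the gold tag at that position; a sketch reusing the hypothetical bigram-tagger Viterbi from above:

def loss_augmented_viterbi(words, gold, tags, w):
    """argmax_z [ w . phi(x, z) + Hamming(y, z) ]: same DP as Viterbi,
    with +1 whenever a candidate tag differs from the gold tag."""
    def score(prev, t, i):
        s = w.get((prev, t), 0.0) + w.get((t, words[i]), 0.0)
        return s + (1.0 if t != gold[i] else 0.0)   # Hamming cost, factored per position

    best, back = {"<s>": 0.0}, []
    for i in range(len(words)):
        new_best, new_back = {}, {}
        for t in tags:
            prev = max(best, key=lambda p: best[p] + score(p, t, i))
            new_best[t] = best[prev] + score(prev, t, i)
            new_back[t] = prev
        best, back = new_best, back + [new_back]
    last = max(best, key=lambda t: best[t] + w.get((t, "</s>"), 0.0))
    out = [last]
    for bp in reversed(back[1:]):
        out.append(bp[out[-1]])
    return list(reversed(out))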

Correct Version following Pegasos
want, for all z: w · ɸ(x, y) ≥ w · ɸ(x, z) + Δ(y, z)
same as: w · ɸ(x, y) ≥ max_z [ w · ɸ(x, z) + Δ(y, z) ]
loss-augmented decoding: argmax_z [ w · ɸ(x, z) + Δ(y, z) ]
if Δ(y, z) factors in z (e.g. Hamming loss), just modify the DP
update: w ← (1 − 1/t) w + (NC / 2t) Δɸ(x, y, z), with N = |D|, C from the SVM objective, and t += 1 for each example
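One epoch of this Pegasos-style update as I read the slide (a sketch: the 1/t shrinkage and NC/2t step size are taken from the slide's numbers; phi and loss_augmented_inference are the hypothetical helpers from the earlier sketches, and w is a defaultdict(float)):

def svm_epoch(data, phi, loss_augmented_inference, w, t, C):
    """One pass of structured SVM training in the Pegasos style:
    shrink w by (1 - 1/t), then step toward phi(x,y) - phi(x,z) on violations."""
    N = len(data)
    for x, y in data:
        z = loss_augmented_inference(x, y, w)    # argmax_z w.phi(x,z) + loss(y,z)
        for f in w:
            w[f] *= (1.0 - 1.0 / t)              # regularization shrinkage
        if z != y:                               # margin violated by z
            delta = phi(x, y).copy()
            delta.subtract(phi(x, z))            # sparse phi(x,y) - phi(x,z)
            step = N * C / (2.0 * t)             # step size as given on the slide
            for f, v in delta.items():
                w[f] += step * v
        t += 1                                   # t increases once per example
    return w, t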

Struct. Perceptron vs. Struct. SVM
tagging, ATIS (train: 488 sentences); dev error: SVM < averaged perceptron << perceptron

perceptron:
epoch  updates  |W|  train_err  dev_err  avg_err
  1      102    291    3.90%     9.36%    6.14%
  2       91    334    3.33%     8.19%    4.97%
  3       78    347    2.92%     5.85%    4.97%
  4       81    368    3.11%     6.73%    5.85%
  5       78    378    2.70%     6.14%    5.56%
  6       63    385    2.26%     6.14%    5.56%
  7       69    385    2.43%     7.02%    5.56%
  8       60    388    2.15%     6.73%    5.56%
  9       59    390    2.04%     6.14%    5.56%
 10       64    394    2.15%     5.85%    5.26%

SVM (C=1):
epoch  updates  |W|  train_err  dev_err
  1      116    311    4.55%     5.85%
  2       82    328    3.05%     4.97%
  3       78    334    2.92%     5.56%
  4       77    339    2.92%     5.26%
  5       80    344    2.94%     5.56%
  6       73    345    2.75%     4.68%
  7       72    347    2.75%     4.97%
  8       75    352    2.86%     4.97%
  9       74    353    2.78%     4.97%
 10       72    354    2.78%     4.97%