Lecture 13: Discriminative Sequence Models (MEMM and Struct. Perceptron)


Intro to NLP, CS585, Fall 2014
http://people.cs.umass.edu/~brenocon/inlp2014/
Brendan O'Connor (http://brenocon.com)

Models for sequence tagging
x: words (inputs); y: POS tags (outputs). Training: (x, y) pairs. Testing: given x, predict y = argmax_y p(y | x).
Three models (the slide shows a hypothetical diagram for each), with their inference procedures:
HMM: p(y, x) = ∏_t p(y_t | y_{t-1}) p(x_t | y_t), a transition multinomial times an emission multinomial. Inference: Viterbi.
Independent log. reg., no context: p(y | x) = ∏_t p(y_t | x, t), with p(y_t | x_t) ∝ exp(θᵀ f(x_t, y_t)); any features you want from the input. Inference: independent per token.
ILR with context features: p(y_t | x, t) ∝ exp(θᵀ f(x, t, y_t)). Inference: independent per token.

Features are good
Typically, feature-based NLP taggers use many, many binary features. Is this the first token in the sentence? Second? Third? Last? Next to last? Word to the left? To the right? Last 3 letters of this word? Last 3 letters of the word on the left? On the right? Is this word capitalized? Does it contain punctuation?
Independent LogReg can't consider POS context, only word-based context. This seems non-ideal.
ILR with context features: p(y_t | x, t) ∝ exp(θᵀ f(x, t, y_t))
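
As a concrete illustration (a minimal Python sketch, not the course's code), here is one way such binary feature templates might look for the token at position t of a tokenized sentence x; the token_features helper and the feature names are made up for illustration:

def token_features(x, t):
    # Binary features for the token x[t]; x is a list of word strings.
    word = x[t]
    feats = {
        "first_token": 1 if t == 0 else 0,
        "second_token": 1 if t == 1 else 0,
        "last_token": 1 if t == len(x) - 1 else 0,
        "word=" + word.lower(): 1,
        "suffix3=" + word[-3:].lower(): 1,
        "is_capitalized": 1 if word[0].isupper() else 0,
        "has_punct": 1 if any(not c.isalnum() for c in word) else 0,
    }
    if t > 0:                        # word to the left, and its last 3 letters
        feats["prev_word=" + x[t - 1].lower()] = 1
        feats["prev_suffix3=" + x[t - 1][-3:].lower()] = 1
    if t < len(x) - 1:               # word to the right
        feats["next_word=" + x[t + 1].lower()] = 1
    return feats

For example, token_features("finna get good".split(), 1) fires features like word=get, suffix3=get, and prev_word=finna.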

From HMMs to MEMMs
HMM: p(y, x) = ∏_t p(y_t | y_{t-1}) p(x_t | y_t)
Indep Log Reg: p(y | x) = ∏_t p(y_t | x, t), with p(y_t | x, t) ∝ exp(θᵀ f(x, t, y_t))
Maximum entropy Markov model (MEMM): a conditional log-linear Markov model,
p(y | x) = ∏_t p(y_t | y_{t-1}, x, t), with p(y_t | y_{t-1}, x, t) ∝ exp(θᵀ f(x, t, y_{t-1}, y_t))
MEMMs can have features from the input, and the previous state!

Using an MEMM
MEMM: p(y | x) = ∏_t p(y_t | y_{t-1}, x, t), with p(y_t | y_{t-1}, x, t) ∝ exp(θᵀ f(x, t, y_{t-1}, y_t))
Training: the same training algorithm as logistic regression.
Testing: like an HMM, e.g. greedy or Viterbi inference.
Issue: asymmetry in the model: it underappreciates future tags. (Though it appreciates them a little bit: Viterbi gives different answers than greedy.)
For POS tagging, MEMMs are state of the art, especially with tricks (e.g. using both left-to-right and right-to-left models).
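
Greedy left-to-right MEMM decoding is simple to sketch. The p_local function below is hypothetical: it stands in for a trained logistic-regression model returning p(y_t = tag | y_{t-1} = prev, x, t).

def greedy_decode(x, tagset, p_local, start_tag="<S>"):
    # Tag each position with the locally best tag given the previous prediction.
    tags = []
    prev = start_tag
    for t in range(len(x)):
        best = max(tagset, key=lambda tag: p_local(tag, prev, x, t))
        tags.append(best)
        prev = best
    return tags

Viterbi decoding instead keeps the best-scoring prefix for every possible tag at each position, which is why it can give different answers than greedy.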

Global structure prediction
HMMs and MEMMs are ways to score possible output structures with local probabilistic models. Why not learn parameters for a global model of structures? Don't assume any generation process; instead, use features to score the entire structure at once (a "goodness function"): G(y) = θᵀ f(x, y).
Prediction: argmax_y G(y).
Is inference efficient? It depends on the structure of f.
How to learn? If you know how to evaluate the argmax, you can use the structured perceptron algorithm.
There are several different global structure prediction models out there. Related: conditional random fields, where p(y | x) ∝ exp(Goodness(y)).

Example: x = "finna get good", gold y = V V A. Two simple feature templates; f(x, y) is...
Transition features: for every pair of tag types (A, B), f_trans:A,B(x, y) = Σ_t 1{y_{t-1} = A, y_t = B}. Here: (V,V): 1, (V,A): 1.
Observation features: for every tag type A and word type w, f_obs:A,w(x, y) = Σ_t 1{y_t = A, x_t = w}. Here: (V, finna): 1, (V, get): 1, (A, good): 1.
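
A minimal sketch of these two templates as code; the extract_features helper and its tuple feature keys are my own naming, not the lecture's:

from collections import Counter

def extract_features(x, y):
    # Transition counts over adjacent tag pairs, plus observation counts
    # over (tag, word) pairs: exactly the two templates above.
    feats = Counter()
    for t in range(len(x)):
        if t > 0:
            feats[("trans", y[t - 1], y[t])] += 1
        feats[("obs", y[t], x[t])] += 1
    return feats

extract_features("finna get good".split(), ["V", "V", "A"]) yields a count of 1 for each of ("trans","V","V"), ("trans","V","A"), ("obs","V","finna"), ("obs","V","get"), and ("obs","A","good").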

Example continued: x = "finna get good", gold y = V V A. Mathematical convention is numeric indexing of θ and f(x, y), though it is sometimes convenient to implement them as hash tables. [The slide shows θ as a vector of transition and observation weights alongside the corresponding count vector f(x, y).]
f_trans:V,A(x, y) = Σ_{t=2}^{N} 1{y_{t-1} = V, y_t = A}
f_obs:V,finna(x, y) = Σ_{t=1}^{N} 1{y_t = V, x_t = finna}
Goodness(y) = θᵀ f(x, y)
If every element of f(x, y) is based on only a single tag, or neighboring tags, then the Viterbi algorithm can calculate argmax_y G(y).

Added after lecture: I did an example on the board to show that Viterbi can do it. Those feature counts decompose into local parts of the output structure. For "finna get good" with tags y1 y2 y3:
G(y) = θ_trans:y1,y2 + θ_trans:y2,y3 + θ_obs:y1,finna + θ_obs:y2,get + θ_obs:y3,good
Each term involves two neighboring tags, or just one tag. Therefore, to maximize G with respect to y, we can use the NK^2 version of Viterbi we have seen before.
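
A sketch of that O(N K^2) Viterbi maximization of the linear goodness score, assuming theta is a dict over the ("trans", A, B) and ("obs", A, w) keys used in the extract_features sketch above (numpy is assumed available):

import numpy as np

def viterbi(x, tagset, theta):
    # score[t, k]: best goodness of any tag prefix ending with tagset[k] at position t.
    N, K = len(x), len(tagset)
    score = np.full((N, K), -np.inf)
    back = np.zeros((N, K), dtype=int)         # backpointers
    for k, tag in enumerate(tagset):
        score[0, k] = theta.get(("obs", tag, x[0]), 0.0)
    for t in range(1, N):
        for k, tag in enumerate(tagset):
            obs = theta.get(("obs", tag, x[t]), 0.0)
            cand = [score[t - 1, j] + theta.get(("trans", prev, tag), 0.0) + obs
                    for j, prev in enumerate(tagset)]
            back[t, k] = int(np.argmax(cand))
            score[t, k] = cand[back[t, k]]
    # Trace backpointers from the best final tag.
    best = [int(np.argmax(score[N - 1]))]
    for t in range(N - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    return [tagset[k] for k in reversed(best)]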

Example: x = "finna get good", gold y = V V A, predicted y* = N V A.
Learning idea: we want the gold y to have a high score. Update the weights so that y would score higher, and y* lower, next time.
f(x, y): (V,V): 1, (V,A): 1, (V,finna): 1, (V,get): 1, (A,good): 1
f(x, y*): (N,V): 1, (V,A): 1, (N,finna): 1, (V,get): 1, (A,good): 1
f(x, y) - f(x, y*): (V,V): +1, (N,V): -1, (V,finna): +1, (N,finna): -1
Perceptron update rule: θ := θ + r [f(x, y) - f(x, y*)], where r is the learning rate (a hyperparameter).
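
The update rule as code, reusing the hypothetical extract_features sketch above and a plain dict theta; r is the learning rate:

def perceptron_update(theta, x, gold_y, pred_y, r=1.0):
    # theta := theta + r * (f(x, gold_y) - f(x, pred_y)); only differing features move.
    gold_f = extract_features(x, gold_y)
    pred_f = extract_features(x, pred_y)
    for k in set(gold_f) | set(pred_f):
        theta[k] = theta.get(k, 0.0) + r * (gold_f[k] - pred_f[k])

For the example above, this bumps ("trans","V","V") and ("obs","V","finna") up by r and ("trans","N","V") and ("obs","N","finna") down by r; the shared features cancel out.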

θ := θ + r [f(x, y) - f(x, y*)]
[The slide shows the weight vector θ (transition and observation blocks), the count vectors f(x, y) and f(x, y*), and the resulting update vector r (f(x, y) - f(x, y*)), which has +1 and -1 entries exactly where the two feature vectors differ.]

Structured Perceptron
Initialize θ = 0.
For each example (x, y) in the dataset, iterating many times through the dataset:
Predict y* = argmax_{y'} θᵀ f(x, y').
If y* = y, make no update. Otherwise, calculate the feature differences and update θ: θ := θ + r [f(x, y) - f(x, y*)].
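
Put together as a sketch, assuming the viterbi and perceptron_update helpers from the sketches above; data is a list of (word list, tag list) pairs:

def train_structured_perceptron(data, tagset, n_iters=5, r=1.0):
    theta = {}
    for _ in range(n_iters):                   # iterate many times through the dataset
        for x, gold_y in data:
            pred_y = viterbi(x, tagset, theta)     # y* = argmax_y theta' f(x, y)
            if pred_y != gold_y:                   # error-driven: no update if y* = y
                perceptron_update(theta, x, gold_y, pred_y, r)
    return theta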

Averaged perceptron
The perceptron is error-driven: it updates θ when there are mistakes. If the learning algorithm never gets 100% accuracy on the training set, it never converges. (The perceptron doesn't care about the magnitude of its mistakes, unlike log-linear models!)
One popular solution: averaging. Average all seen values of θ into a final solution:
θ̄ = (1/n) Σ_t θ^(t)
where t indexes every example. (If you pass through the dataset 5 times, you have 5 * (number of training examples) timesteps.) Computing this naively is inefficient, but there is a trick to make it much faster in practice.
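
The naive averaging described above as a sketch (the efficient trick mentioned on the slide is not shown); it reuses the earlier hypothetical helpers and accumulates θ after every example, mistake or not:

from collections import defaultdict

def train_averaged_perceptron(data, tagset, n_iters=5, r=1.0):
    theta = {}
    theta_sum = defaultdict(float)
    n_steps = 0
    for _ in range(n_iters):
        for x, gold_y in data:
            pred_y = viterbi(x, tagset, theta)
            if pred_y != gold_y:
                perceptron_update(theta, x, gold_y, pred_y, r)
            for k, v in theta.items():         # accumulate theta at every timestep
                theta_sum[k] += v
            n_steps += 1
    return {k: v / n_steps for k, v in theta_sum.items()}   # theta-bar = (1/n) sum_t theta^(t)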

Added after lecture, will review on Thursday:
The three main structured prediction models are: (1) the structured perceptron, (2) CRFs, and (3) structured SVMs. All of them work the same at test time (decoding via the Viterbi algorithm, by maximizing a linear goodness score); they differ only at training time. The averaged perceptron is probably the simplest to implement and use, and lots of practitioners in NLP who don't care about fancy machine learning use it. I actually like CRFs myself because they have a probabilistic interpretation, but that doesn't always matter. Training CRFs is slightly more complicated than training structured perceptrons (not that much more complicated, but about a lecture's worth of material), so I figured we could skip it in this class.
Instead of averaging, you can also do early stopping: keep a development set, evaluate accuracy on it every iteration through the data, and choose the θ that did best. I don't know which method is better (different researchers may prefer different methods). Averaging has the advantage that there aren't really any hyperparameters to tune (well, the learning rate to a certain extent).
Why does averaging work? θ bounces around the space a lot, because the perceptron doesn't know how to prefer solutions according to the magnitude of the errors it makes; the value of θ will be overfitted towards doing well on the most recent examples it has seen. If you average, you average away some of the noise. Averaging is used in other areas of machine learning too; it's a form of regularization.
Perceptron learning is actually a form of gradient descent: not on the logistic regression log-likelihood, but on the gradients of a different function (the 0-1 loss).
The Collins 2002 paper that introduced the structured perceptron is still great to read for more details: http://www.cs.columbia.edu/~mcollins/papers/tagperc.pdf
More on the classification perceptron: see Hal Daumé's book chapter draft, http://ciml.info/dl/v0_9/ciml-v0_9-ch03.pdf

Recap
Discriminative sequence models give scoring for output structures.
MEMMs combine Markovian hidden structure, plus features from words.
Global structure prediction models allow any scoring of the hidden structure. The structured perceptron is a simple and effective learning algorithm for them.