Predicting Sequences: Structured Perceptron. CS 6355: Structured Prediction



Conditional Random Fields: summary. A CRF is an undirected graphical model that decomposes the score over the structure into a collection of factors; each factor assigns a score to an assignment of the random variables it is connected to. Training and prediction: the final prediction is argmax_y w^T φ(x, y), and training maximizes the (regularized) likelihood. Connections to other models: a CRF is effectively a linear classifier, a generalization of logistic regression to structures, and an instance of a Markov Random Field with some random variables observed (we will see this soon).

Global features. The feature function decomposes over the sequence: for a chain y_0, y_1, y_2, y_3 over input x, the total score is the sum of the factor scores w^T φ(x, y_0, y_1) + w^T φ(x, y_1, y_2) + w^T φ(x, y_2, y_3).
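As a concrete illustration, here is a minimal Python sketch of this decomposition for a first-order chain. The names (sequence_score, phi_local, the start symbol, the feature templates) are illustrative assumptions, not notation from the lecture.

```python
from collections import Counter

def sequence_score(w, x, y, phi_local):
    """Score of labeling y for input x: the sum of factor scores
    w . phi(x, i, y[i-1], y[i]), one factor per position."""
    total = 0.0
    for i in range(len(y)):
        prev = y[i - 1] if i > 0 else "<start>"   # assumed start symbol
        feats = phi_local(x, i, prev, y[i])       # sparse feature counts for this factor
        total += sum(w.get(f, 0.0) * v for f, v in feats.items())
    return total

# One possible local feature function: transition + emission indicators.
def phi_local(x, i, prev_tag, tag):
    return Counter({("T", prev_tag, tag): 1, ("E", tag, x[i]): 1})
```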

Outline. Sequence models: Hidden Markov models (inference with HMMs, learning); conditional models and local classifiers; global models: Conditional Random Fields, and the structured perceptron for sequences.

HMM is also a linear classifier. Consider the HMM: the log probability of an input and a tag sequence is a sum of transition scores and emission scores.

Define indicator functions: I[z] = 1 if z is true, and 0 otherwise. With these indicators, the log probability can be written equivalently as a linear function: the log P terms are the weights, and the counts and indicators are the features. That is, the score can be written as w^T φ(x, y), and we are free to add more features.

HMM is a linear classifier: an example. Consider the sentence "The dog ate the homework" with the tag sequence Det Noun Verb Det Noun. Its log probability is a sum of transition scores and emission scores.

Writing out the transition and emission terms with their counts:

log P(x, y) = 2 · log P(Det → Noun) + 1 · log P(Noun → Verb) + 1 · log P(Verb → Det) + 1 · log P(The | Det) + 1 · log P(dog | Noun) + 1 · log P(ate | Verb) + 1 · log P(the | Det) + 1 · log P(homework | Noun)

The log P terms are w, the parameters of the model; the counts (2, 1, 1, 1, 1, 1, 1, 1) are φ(x, y), properties of this output and the input. So log P(x, y) is a linear scoring function: log P(x, y) = w^T φ(x, y).
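To make the dot-product view concrete, here is a small Python sketch. Only the counts and the w^T φ form come from the slide; the probability values are made up for illustration.

```python
import math

# w: log-probability parameters of the model (illustrative values only).
w = {
    ("T", "Det", "Noun"):  math.log(0.5),   # transition Det -> Noun
    ("T", "Noun", "Verb"): math.log(0.3),
    ("T", "Verb", "Det"):  math.log(0.4),
    ("E", "Det", "The"):   math.log(0.3),   # emission P(The | Det)
    ("E", "Noun", "dog"):  math.log(0.1),
    ("E", "Verb", "ate"):  math.log(0.2),
    ("E", "Det", "the"):   math.log(0.3),
    ("E", "Noun", "homework"): math.log(0.05),
}

# phi(x, y): counts of transition and emission events in this (x, y) pair.
phi = {
    ("T", "Det", "Noun"): 2, ("T", "Noun", "Verb"): 1, ("T", "Verb", "Det"): 1,
    ("E", "Det", "The"): 1,  ("E", "Noun", "dog"): 1,  ("E", "Verb", "ate"): 1,
    ("E", "Det", "the"): 1,  ("E", "Noun", "homework"): 1,
}

# log P(x, y) = w . phi(x, y)
log_p = sum(w[f] * count for f, count in phi.items())
print(log_p)
```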


Towards the structured perceptron. 1. The HMM is a linear classifier: can we treat it like any other linear classifier for training? If so, we could add features that capture global properties, as long as the output still decomposes for easy inference. 2. The Viterbi algorithm calculates max_y w^T φ(x, y); Viterbi only cares about scores of structures, which need not be normalized. 3. So we can push the learning algorithm to train with un-normalized scores; if we need normalization, we can always normalize by exponentiating and dividing by Z. That is, the learning algorithm can effectively focus on the score of y for a particular x: train a discriminative model instead of the generative one!
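Since point 2 relies on Viterbi only needing (possibly un-normalized) scores, here is a minimal first-order Viterbi sketch over linear factor scores. The function score_local, the handling of the first position with None, and the tag set are assumptions for illustration, not the lecture's notation.

```python
def viterbi(x, tags, score_local):
    """Compute argmax_y sum_i score_local(x, i, y[i-1], y[i]) for a first-order model.
    score_local(x, i, prev, cur) returns w . phi(x, i, prev, cur); prev is None at i = 0."""
    n = len(x)
    best = [{t: (score_local(x, 0, None, t), None) for t in tags}]
    for i in range(1, n):
        layer = {}
        for t in tags:
            prev_t, sc = max(
                ((p, best[i - 1][p][0] + score_local(x, i, p, t)) for p in tags),
                key=lambda pair: pair[1],
            )
            layer[t] = (sc, prev_t)   # best score ending in tag t, with back-pointer
        best.append(layer)
    # Backtrack from the best final tag.
    last = max(tags, key=lambda t: best[-1][t][0])
    y = [last]
    for i in range(n - 1, 0, -1):
        y.append(best[i][y[-1]][1])
    return list(reversed(y))
```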


Structured Perceptron algorithm. Given a training set D = {(x, y)}:
1. Initialize w = 0 ∈ ℝ^n.
2. For epoch = 1 ... T:
   For each training example (x, y) ∈ D:
      Predict y' = argmax_{y'} w^T φ(x, y')   (inference in the training loop!)
      If y' ≠ y, update w ← w + learning_rate · (φ(x, y) − φ(x, y')).
3. Return w.
Prediction: argmax_y w^T φ(x, y).
Notes: T is a hyperparameter of the algorithm, and in practice it is good to shuffle D before the inner loop. The update happens only on an error: the perceptron is a mistake-driven algorithm, and on a mistake it promotes y and demotes y'.
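A minimal Python sketch of this loop, assuming sparse feature dictionaries, a feature function phi(x, y), and a predict(w, x) routine that solves the argmax (e.g., with Viterbi as sketched above). All names are illustrative, not part of the original slides.

```python
import random
from collections import defaultdict

def structured_perceptron(data, phi, predict, T=10, lr=1.0, seed=0):
    """data: list of (x, y_gold) pairs; phi(x, y) -> dict of feature counts;
    predict(w, x) -> argmax_y w . phi(x, y), e.g. via Viterbi."""
    rng = random.Random(seed)
    w = defaultdict(float)                       # w = 0
    for epoch in range(T):
        rng.shuffle(data)                        # shuffle D before the inner loop
        for x, y in data:
            y_hat = predict(w, x)                # inference inside the training loop
            if y_hat != y:                       # update only on a mistake
                for f, v in phi(x, y).items():       # promote the gold structure
                    w[f] += lr * v
                for f, v in phi(x, y_hat).items():   # demote the prediction
                    w[f] -= lr * v
    return w
```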

Notes on the structured perceptron. It has a mistake bound for separable data, just like the perceptron. In practice, use averaging for better generalization: initialize a = 0; after each step, whether or not there is an update, set a ← a + w; still check for mistakes using w, not a; and return a at the end instead of w. (Exercise: optimize this for performance by modifying a only on errors.) The update is global: there is one weight vector for the entire sequence, not one per position. The same algorithm can be derived via constraint classification: create a binary classification data set and run the perceptron.

Structured Perceptron with averaging. Given a training set D = {(x, y)}:
1. Initialize w = 0 ∈ ℝ^n, a = 0 ∈ ℝ^n.
2. For epoch = 1 ... T:
   For each training example (x, y) ∈ D:
      Predict y' = argmax_{y'} w^T φ(x, y')
      If y' ≠ y, update w ← w + learning_rate · (φ(x, y) − φ(x, y'))
      Set a ← a + w
3. Return a.
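The same loop with averaging, again as a hedged sketch with illustrative names. This is the plain version that accumulates a after every example, not the error-only optimization suggested in the notes above.

```python
import random
from collections import defaultdict

def averaged_structured_perceptron(data, phi, predict, T=10, lr=1.0, seed=0):
    """Averaged variant: predictions during training still use w, but the
    returned weights are the running sum a of all intermediate w vectors."""
    rng = random.Random(seed)
    w = defaultdict(float)
    a = defaultdict(float)
    for epoch in range(T):
        rng.shuffle(data)
        for x, y in data:
            y_hat = predict(w, x)                # mistakes are checked with w, not a
            if y_hat != y:
                for f, v in phi(x, y).items():
                    w[f] += lr * v
                for f, v in phi(x, y_hat).items():
                    w[f] -= lr * v
            for f, v in w.items():               # accumulate after every step
                a[f] += v
    return a
```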

CRF vs. structured perceptron. The stochastic gradient descent update for a CRF on a training example (x_i, y_i) and the structured perceptron update on the same example have the same shape; the difference is an expectation over outputs versus a max. Caveat: adding regularization changes the CRF update, and averaging changes the perceptron update.
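For reference, the two updates in their standard form, reconstructed from the usual definitions since the slide's equations did not survive transcription; here η is the learning rate and ŷ is the highest-scoring output.

```latex
% CRF: stochastic gradient ascent on the log-likelihood of (x_i, y_i)
w \leftarrow w + \eta \Big( \phi(x_i, y_i) - \mathbb{E}_{y \sim P(y \mid x_i; w)}\big[\phi(x_i, y)\big] \Big)

% Structured perceptron: update on a mistake, with \hat{y} = \arg\max_y w^\top \phi(x_i, y)
w \leftarrow w + \eta \Big( \phi(x_i, y_i) - \phi(x_i, \hat{y}) \Big)
```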


The lay of the land. The HMM is a generative model: it assigns probabilities to sequences. From here, two roads diverge.
On one road, we note that Hidden Markov Models are actually just linear classifiers. We don't really care whether we are predicting probabilities; we are assigning scores to a full output for a given input (as in multiclass classification). Generalizing the algorithms for linear classifiers gives sophisticated models that can use arbitrary features: the structured perceptron and the structured SVM.
On the other road, we build discriminative/conditional models: model probabilities via exponential functions, which gives the log-linear representation of log-probabilities of sequences for a given input, and learn by maximizing likelihood. This again gives sophisticated models that can use arbitrary features: the conditional random field.
Both roads are applicable beyond sequences, and eventually they minimize similar objectives with different loss functions. (Coming soon.)

Sequence models: summary. The goal is to predict an output sequence given an input sequence. The Hidden Markov Model: prediction via the Viterbi algorithm. Conditional/discriminative models: local approaches with no inference during training (MEMM, the conditional Markov model), and global approaches with inference during training (CRF, structured perceptron). Prediction is not always tractable for general structures, and the same dichotomy holds for more general structures. To think about: what are the parts in a sequence model, and how does each model score these parts?