Sequential Data Modeling - Conditional Random Fields

Graham Neubig, Nara Institute of Science and Technology (NAIST)

Prediction Problems: Given x, predict y

Binary prediction (2 choices): x is a book review ("Oh, man I love this book!", "This book is so boring..."), y is whether it is positive (yes / no).
Multi-class prediction (several choices): x is a tweet ("On the way to the park!", "公園に行くなう!"), y is its language (English / Japanese).
Structured prediction (millions of choices): x is a sentence ("I read a book"), y is its parts-of-speech ("N VBD DET NN").

Logistic Regression

Example we will use: given an introductory sentence from Wikipedia, predict whether the article is about a person.
Given "Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods." predict Yes!
Given "Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture." predict No!
This is binary classification (of course!)

Review: Linear Prediction Model
Each element that helps us predict is a feature: contains "priest", contains "(<#>-<#>)", contains "site", contains "Kyoto Prefecture".
Each feature has a weight, positive if it indicates "yes", and negative if it indicates "no":
w_{contains "priest"} = 2
w_{contains "(<#>-<#>)"} = 1
w_{contains "site"} = -3
w_{contains "Kyoto Prefecture"} = -1
For a new example, sum the weights:
"Kuya (903-972) was a priest born in Kyoto Prefecture." gives 2 + (-1) + 1 = 2
If the sum is at least 0: "yes", otherwise: "no".

Review: Mathematical Formulation
y = sign(w·φ(x)) = sign(Σ_{i=1}^{I} w_i φ_i(x))
x: the input
φ(x): a vector of feature functions {φ_1(x), φ_2(x), ..., φ_I(x)}
w: the weight vector {w_1, w_2, ..., w_I}
y: the prediction, +1 if "yes", -1 if "no" (sign(v) is +1 if v >= 0, -1 otherwise)
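
As a concrete illustration, here is a minimal sketch of this linear model in Python. The sparse {feature name: value} dictionaries, the helper names create_features / predict_one, and the toy weights are assumptions for illustration, not code or values from the tutorial.

```python
from collections import defaultdict

def create_features(x):
    # Simple unigram features for a whitespace-tokenized sentence (an assumption).
    phi = defaultdict(float)
    for word in x.split():
        phi["contains " + word] += 1
    return phi

def predict_one(w, phi):
    # y = sign(w.phi(x)): +1 if the weighted sum is at least 0, otherwise -1.
    score = sum(w.get(name, 0.0) * value for name, value in phi.items())
    return 1 if score >= 0 else -1

w = {"contains priest": 2.0, "contains site": -3.0, "contains Kyoto": -1.0}
print(predict_one(w, create_features("Kuya was a priest born in Kyoto")))  # 1 ("yes")
```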

Perceptron and Probabilities
Sometimes we want the probability P(y|x): for estimating confidence in predictions, or for combining with other systems.
However, the perceptron only gives us a prediction: y = sign(w·φ(x))
In other words:
P(y|x) = 1 if w·φ(x) >= 0
P(y|x) = 0 if w·φ(x) < 0
[Plot: P(y|x) as a step function of w·φ(x)]

The Logistic Function
The logistic function is a "softened" version of the function used in the perceptron:
Perceptron: P(y|x) is a step function of w·φ(x)
Logistic function: P(y|x) = e^{w·φ(x)} / (1 + e^{w·φ(x)})
[Plots: the step function and the logistic curve of P(y|x) against w·φ(x)]
Can account for uncertainty. Differentiable.
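
A quick sketch of the logistic function itself, just to show the "softening" (plain Python, not from the tutorial):

```python
import math

def logistic(z):
    # P(y=+1|x) as a function of z = w·φ(x)
    return math.exp(z) / (1.0 + math.exp(z))

# Values move smoothly from 0 to 1 around z = 0 instead of jumping.
print([round(logistic(z), 3) for z in (-10, -1, 0, 1, 10)])
# -> [0.0, 0.269, 0.5, 0.731, 1.0]
```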

Logistic Regression
Train based on the conditional likelihood: find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:
ŵ = argmax_w ∏_i P(y_i | x_i; w)
How do we solve this?

Review: Perceptron Training Algorithm

create map w
for I iterations
    for each labeled pair x, y in the data
        phi = create_features(x)
        y' = predict_one(w, phi)
        if y' != y
            w += y * phi

In other words: try to classify each training example, and every time we make a mistake, update the weights.
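
A runnable Python version of this loop might look as follows; it is a sketch under the assumption that the data is already a list of (feature dict, label) pairs with labels in {+1, -1}.

```python
from collections import defaultdict

def train_perceptron(data, iterations=10):
    w = defaultdict(float)                       # "create map w"
    for _ in range(iterations):
        for phi, y in data:
            score = sum(w[name] * value for name, value in phi.items())
            y_pred = 1 if score >= 0 else -1     # predict_one
            if y_pred != y:                      # on a mistake, update
                for name, value in phi.items():
                    w[name] += y * value         # w += y * phi
    return w
```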

Stochastic Gradient Descent
An online training algorithm for probabilistic models (including logistic regression):

create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

In other words: for every training example, calculate the gradient (the direction that will increase the probability of y), and move in that direction, multiplied by the learning rate α.

Gradient of the Logistic Function
Take the derivative of the probability:
d/dw P(y=+1|x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
              = φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})^2
d/dw P(y=-1|x) = d/dw [ 1 - e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
              = -φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})^2
[Plot: the coefficient e^{w·φ(x)} / (1 + e^{w·φ(x)})^2 against w·φ(x), peaking at 0.25 when w·φ(x) = 0]
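
Putting the SGD loop and this gradient together, a sketch of logistic regression training might look like this; it again assumes sparse feature dicts and labels in {+1, -1}, and is not the tutorial's own code.

```python
import math
from collections import defaultdict

def train_logistic_sgd(data, iterations=10, alpha=1.0):
    w = defaultdict(float)
    for _ in range(iterations):
        for phi, y in data:
            z = sum(w[name] * value for name, value in phi.items())  # w·φ(x)
            # dP(y|x)/dw = y * e^z / (1 + e^z)^2 * φ(x) for y in {+1, -1}
            coeff = y * math.exp(z) / (1.0 + math.exp(z)) ** 2
            for name, value in phi.items():
                w[name] += alpha * coeff * value
    return w
```

Note that math.exp can overflow for very large scores; a production implementation would guard against that.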

Example: Initial Update
Set the learning rate α (here α = 1) and initialize w = 0.
x = "A site, located in Maizuru, Kyoto", y = -1
w·φ(x) = 0
d/dw P(y=-1|x) = -e^0 / (1 + e^0)^2 φ(x) = -0.25 φ(x)
w ← w - 0.25 φ(x)
Resulting weights:
w_{unigram "A"} = -0.25
w_{unigram "site"} = -0.25
w_{unigram ","} = -0.5
w_{unigram "located"} = -0.25
w_{unigram "in"} = -0.25
w_{unigram "Maizuru"} = -0.25
w_{unigram "Kyoto"} = -0.25

Example: Second Update
x = "Shoken, monk born in Kyoto", y = +1
w·φ(x) = -0.5 - 0.25 - 0.25 = -1
d/dw P(y=+1|x) = e^{-1} / (1 + e^{-1})^2 φ(x) = 0.196 φ(x)
w ← w + 0.196 φ(x)
Resulting weights:
w_{unigram "A"} = -0.25
w_{unigram "site"} = -0.25
w_{unigram ","} = -0.304
w_{unigram "located"} = -0.25
w_{unigram "in"} = -0.054
w_{unigram "Maizuru"} = -0.25
w_{unigram "Kyoto"} = -0.054
w_{unigram "Shoken"} = 0.196
w_{unigram "monk"} = 0.196
w_{unigram "born"} = 0.196
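
These two updates can be checked with a few lines of Python. The tokenization below (splitting commas into their own unigrams) is an assumption made to match the weights on the slides, and the printed values differ from the slides only in where the rounding happens.

```python
import math
from collections import defaultdict

def unigrams(tokens):
    phi = defaultdict(float)
    for t in tokens:
        phi["unigram " + t] += 1
    return phi

w = defaultdict(float)
examples = [
    (["A", "site", ",", "located", "in", "Maizuru", ",", "Kyoto"], -1),
    (["Shoken", ",", "monk", "born", "in", "Kyoto"], +1),
]
for tokens, y in examples:                      # two SGD steps with alpha = 1
    phi = unigrams(tokens)
    z = sum(w[k] * v for k, v in phi.items())
    coeff = y * math.exp(z) / (1.0 + math.exp(z)) ** 2
    for k, v in phi.items():
        w[k] += coeff * v

for name in ("unigram ,", "unigram in", "unigram Shoken"):
    print(name, round(w[name], 3))
# unigram , -0.303      (slide: -0.304, which rounds the coefficient to 0.196 first)
# unigram in -0.053     (slide: -0.054)
# unigram Shoken 0.197  (slide: 0.196)
```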

Calculating Optimal Sequences, Probabilities

Sequence Likelihood
Logistic regression considered the probability of a single label y ∈ {-1, +1}: P(y|x)
What if we want to consider the probability of a whole sequence?
X_i = "I visited Nara", Y_i = "PRN VBD NNP", and we model P(Y|X).

Calculating Multi-class Probabilities
Each sequence has its own feature vector:
φ("time flies" → N V) = {φ_{T,<S>,N}, φ_{T,N,V}, φ_{T,V,</S>}, φ_{E,N,time}, φ_{E,V,flies}}
φ("time flies" → V N) = {φ_{T,<S>,V}, φ_{T,V,N}, φ_{T,N,</S>}, φ_{E,V,time}, φ_{E,N,flies}}
φ("time flies" → N N) = {φ_{T,<S>,N}, φ_{T,N,N}, φ_{T,N,</S>}, φ_{E,N,time}, φ_{E,N,flies}}
φ("time flies" → V V) = {φ_{T,<S>,V}, φ_{T,V,V}, φ_{T,V,</S>}, φ_{E,V,time}, φ_{E,V,flies}}
Use a weight for each feature (w_{T,<S>,N}, w_{T,V,</S>}, w_{E,N,time}, ...) to calculate scores:
φ("time flies" → N V)·w = 3
φ("time flies" → N N)·w = 2
φ("time flies" → V N)·w = 0
φ("time flies" → V V)·w = 1

The Softmax Function
Turn the scores into probabilities by taking the exponent and normalizing (the softmax function):
P(Y|X) = e^{w·φ(Y,X)} / Σ_{Ỹ} e^{w·φ(Ỹ,X)}
Take the exponent and normalize:
exp(φ("time flies" → N V)·w) = 20.08, giving P(N V | time flies) = .6437
exp(φ("time flies" → N N)·w) = 7.39, giving P(N N | time flies) = .2369
exp(φ("time flies" → V N)·w) = 1.00, giving P(V N | time flies) = .0320
exp(φ("time flies" → V V)·w) = 2.72, giving P(V V | time flies) = .0872
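
The same numbers can be reproduced with a few lines of Python; this is a sketch using the scores 3, 2, 0, 1 from the previous slide, and the small differences from the slide values are only rounding.

```python
import math

scores = {"N V": 3.0, "N N": 2.0, "V N": 0.0, "V V": 1.0}
exps = {seq: math.exp(s) for seq, s in scores.items()}    # take the exponent
Z = sum(exps.values())                                    # normalizer
probs = {seq: e / Z for seq, e in exps.items()}
for seq in ("N V", "N N", "V N", "V V"):
    print(seq, round(exps[seq], 2), round(probs[seq], 4))
# N V 20.09 0.6439
# N N 7.39 0.2369
# V N 1.0 0.0321
# V V 2.72 0.0871
# (the slide's 20.08 / .6437 / .0320 / .0872 differ only in rounding)
```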

Calculating Edge Features
Like the perceptron, we can calculate features for each edge of the tag lattice:
<S> → time/N: φ_{E,N,time}, φ_{T,<S>,N}
<S> → time/V: φ_{E,V,time}, φ_{T,<S>,V}
time/N → flies/N: φ_{E,N,flies}, φ_{T,N,N}
time/N → flies/V: φ_{E,V,flies}, φ_{T,N,V}
time/V → flies/N: φ_{E,N,flies}, φ_{T,V,N}
time/V → flies/V: φ_{E,V,flies}, φ_{T,V,V}
flies/N → </S>: φ_{T,N,</S>}
flies/V → </S>: φ_{T,V,</S>}

Calculating Edge Probabilities
Calculate the score of each edge and take the exponent:
<S> → time/N: e^{w·φ} = 7.39, P = .881
<S> → time/V: e^{w·φ} = 1.00, P = .119
time/N → flies/N: e^{w·φ} = 1.00, P = .237
time/N → flies/V: e^{w·φ} = 1.00, P = .644
time/V → flies/N: e^{w·φ} = 1.00, P = .032
time/V → flies/V: e^{w·φ} = 1.00, P = .087
flies/N → </S>: e^{w·φ} = 1.00, P = .269
flies/V → </S>: e^{w·φ} = 2.72, P = .731
This is now the same form as the HMM:
we can use the Viterbi algorithm, and calculate probabilities using forward-backward.
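
For instance, a minimal Viterbi search over this lattice could look like the sketch below. The edge scores w·φ(edge) are the ones implied by the slides (the logs of the values above), and the data layout is an assumption for illustration.

```python
TAGS = ["N", "V"]
# edge_score[(position, prev_tag, tag)] = w·φ for that edge
edge_score = {
    (0, "<S>", "N"): 2.0, (0, "<S>", "V"): 0.0,
    (1, "N", "N"): 0.0, (1, "N", "V"): 0.0,
    (1, "V", "N"): 0.0, (1, "V", "V"): 0.0,
    (2, "N", "</S>"): 0.0, (2, "V", "</S>"): 1.0,
}

def viterbi(n_words):
    best = {"<S>": (0.0, [])}                 # tag -> (best score so far, path)
    for pos in range(n_words + 1):            # the last step goes to </S>
        nxt = {}
        targets = ["</S>"] if pos == n_words else TAGS
        for tag in targets:
            cands = [(s + edge_score[(pos, prev, tag)], path + [tag])
                     for prev, (s, path) in best.items()
                     if (pos, prev, tag) in edge_score]
            nxt[tag] = max(cands)             # keep the highest-scoring path
        best = nxt
    score, path = best["</S>"]
    return score, path[:-1]                   # drop the </S> marker

print(viterbi(2))   # -> (3.0, ['N', 'V']), the highest-scoring tag sequence
```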

Conditional Random Fields

Maximizing CRF Likelihood
We want to maximize the likelihood over whole sequences:
ŵ = argmax_w ∏_i P(Y_i | X_i; w),  where  P(Y|X) = e^{w·φ(Y,X)} / Σ_{Ỹ} e^{w·φ(Ỹ,X)}
For convenience, we consider the log likelihood:
log P(Y|X) = w·φ(Y,X) - log Σ_{Ỹ} e^{w·φ(Ỹ,X)}
We want to find the gradient for stochastic gradient descent:
d/dw log P(Y|X)

Deriving a CRF Gradient
log P(Y|X) = w·φ(Y,X) - log Σ_{Ỹ} e^{w·φ(Ỹ,X)} = w·φ(Y,X) - log Z

d/dw log P(Y|X) = φ(Y,X) - d/dw log Σ_{Ỹ} e^{w·φ(Ỹ,X)}
                = φ(Y,X) - (1/Z) Σ_{Ỹ} d/dw e^{w·φ(Ỹ,X)}
                = φ(Y,X) - Σ_{Ỹ} (e^{w·φ(Ỹ,X)} / Z) φ(Ỹ,X)
                = φ(Y,X) - Σ_{Ỹ} P(Ỹ|X) φ(Ỹ,X)

In Other Words...
d/dw log P(Y|X) = φ(Y,X) - Σ_{Ỹ} P(Ỹ|X) φ(Ỹ,X)
To get the gradient we:
add the correct feature vector
subtract the expectation of the features

Example
The four candidate sequences for "time flies", with their feature vectors and probabilities (from before):
φ(N V) = {φ_{T,<S>,N}, φ_{T,N,V}, φ_{T,V,</S>}, φ_{E,N,time}, φ_{E,V,flies}}, P = .644
φ(V N) = {φ_{T,<S>,V}, φ_{T,V,N}, φ_{T,N,</S>}, φ_{E,V,time}, φ_{E,N,flies}}, P = .032
φ(N N) = {φ_{T,<S>,N}, φ_{T,N,N}, φ_{T,N,</S>}, φ_{E,N,time}, φ_{E,N,flies}}, P = .237
φ(V V) = {φ_{T,<S>,V}, φ_{T,V,V}, φ_{T,V,</S>}, φ_{E,V,time}, φ_{E,V,flies}}, P = .087
Gradient entries (correct answer N V, minus the feature expectations):
φ_{T,<S>,N}, φ_{E,N,time}: 1 - .644 - .237 = .119
φ_{T,<S>,V}, φ_{E,V,time}: 0 - .032 - .087 = -.119
φ_{T,N,V}: 1 - .644 = .356
φ_{T,V,N}: 0 - .032 = -.032
φ_{T,N,N}: 0 - .237 = -.237
φ_{T,V,V}: 0 - .087 = -.087
φ_{T,V,</S>}, φ_{E,V,flies}: 1 - .644 - .087 = .269
φ_{T,N,</S>}, φ_{E,N,flies}: 0 - .032 - .237 = -.269
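
This can be checked by brute force with a short script: enumerate every tag sequence, compute the softmax, and subtract the feature expectations from φ(Y, X). The feature naming and the weights that produce the scores 3, 2, 0, 1 are assumptions chosen to match the slides.

```python
import math
from collections import defaultdict
from itertools import product

WORDS = ["time", "flies"]
TAGS = ["N", "V"]
w = defaultdict(float, {"T <S> N": 2.0, "T V </S>": 1.0})  # reproduces scores 3, 2, 0, 1

def features(tags):
    phi = defaultdict(float)
    prev = "<S>"
    for word, tag in zip(WORDS, tags):
        phi["T %s %s" % (prev, tag)] += 1      # transition feature
        phi["E %s %s" % (tag, word)] += 1      # emission feature
        prev = tag
    phi["T %s </S>" % prev] += 1
    return phi

candidates = [features(tags) for tags in product(TAGS, repeat=len(WORDS))]
exps = [math.exp(sum(w[k] * v for k, v in phi.items())) for phi in candidates]
Z = sum(exps)

gradient = defaultdict(float, features(("N", "V")))  # correct answer: N V
for phi, e in zip(candidates, exps):
    for k, v in phi.items():
        gradient[k] -= (e / Z) * v                   # subtract the expectation

print(round(gradient["T <S> N"], 3))   # 0.119
print(round(gradient["T N V"], 3))     # 0.356
print(round(gradient["T N N"], 3))     # -0.237
```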

Combinatorial Explosion
Problem: the number of hypotheses is exponential!
d/dw log P(Y|X) = φ(Y,X) - Σ_{Ỹ} P(Ỹ|X) φ(Ỹ,X)
The sum over Ỹ ranges over O(T^|X|) sequences, where T is the number of tags.

Calculate Feature Expectations Using Edge Probabilities!
If we know the edge probabilities, we can just multiply them:
<S> → time/N (e^{w·φ} = 7.39, P = .881): φ_{T,<S>,N}, φ_{E,N,time}: 1 - .881 = .119
<S> → time/V (e^{w·φ} = 1.00, P = .119): φ_{T,<S>,V}, φ_{E,V,time}: 0 - .119 = -.119
Same answer as when we explicitly expand all Ỹ:
φ_{T,<S>,N}, φ_{E,N,time}: 1 - .644 - .237 = .119
φ_{T,<S>,V}, φ_{E,V,time}: 0 - .032 - .087 = -.119

CRF Training Procedure
We can perform stochastic gradient descent, just like for logistic regression:

create map w
for I iterations
    for each labeled pair X, Y in the data
        gradient = φ(Y,X)
        calculate e^{w·φ} for each edge
        run the forward-backward algorithm to get P(edge) for each edge
        for each edge
            gradient -= P(edge) * φ(edge)
        w += α * gradient

The only major difference from logistic regression is the gradient calculation (α is the learning rate).
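
Below is a sketch of the edge-marginal computation that the forward-backward step performs, on the same two-word lattice as before. The potentials exp(w·φ(edge)) are the values implied by the earlier slides, and the data layout is an assumption for illustration, not the tutorial's reference implementation.

```python
import math

TAGS = ["N", "V"]
N_WORDS = 2
# potential[(position, prev_tag, tag)] = exp(w·φ(edge))
potential = {
    (0, "<S>", "N"): math.exp(2.0), (0, "<S>", "V"): 1.0,
    (1, "N", "N"): 1.0, (1, "N", "V"): 1.0,
    (1, "V", "N"): 1.0, (1, "V", "V"): 1.0,
    (2, "N", "</S>"): 1.0, (2, "V", "</S>"): math.exp(1.0),
}

# Forward: alpha[pos][tag] = total potential of all paths reaching tag at pos.
alpha = [{"<S>": 1.0}] + [dict() for _ in range(N_WORDS + 1)]
for pos in range(N_WORDS + 1):
    targets = ["</S>"] if pos == N_WORDS else TAGS
    for tag in targets:
        alpha[pos + 1][tag] = sum(a * potential[(pos, prev, tag)]
                                  for prev, a in alpha[pos].items())

# Backward: beta[pos][tag] = total potential of all paths from tag at pos to </S>.
beta = [dict() for _ in range(N_WORDS + 1)] + [{"</S>": 1.0}]
for pos in reversed(range(N_WORDS + 1)):
    sources = ["<S>"] if pos == 0 else TAGS
    for prev in sources:
        beta[pos][prev] = sum(potential[(pos, prev, tag)] * b
                              for tag, b in beta[pos + 1].items())

Z = alpha[N_WORDS + 1]["</S>"]        # = beta[0]["<S>"], the partition function
def edge_prob(pos, prev, tag):
    return alpha[pos][prev] * potential[(pos, prev, tag)] * beta[pos + 1][tag] / Z

print(round(edge_prob(0, "<S>", "N"), 3))   # 0.881
print(round(edge_prob(1, "N", "V"), 3))     # 0.644
# The gradient step then subtracts edge_prob(edge) * φ(edge) for every edge
# from φ(Y, X), exactly as in the pseudocode above.
```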

Learning Algorithms

Batch Learning
Online learning: update after each example.
Online stochastic gradient descent:

create map w
for I iterations
    for each labeled pair x, y in the data
        w += α * dP(y|x)/dw

Batch learning: update after all examples.
Batch stochastic gradient descent:

create map w
for I iterations
    for each labeled pair x, y in the data
        gradient += α * dP(y|x)/dw
    w += gradient

Batch Learning Algorithms: Newton/Quasi-Newton Methods
Newton-Raphson method: choose how far to update using the second-order derivatives (the Hessian matrix). Faster convergence, but O(|w|^2) time and memory.
Limited-memory Broyden-Fletcher-Goldfarb-Shanno (L-BFGS): approximates the second-order derivatives from first-order information. Probably the most widely used.
Library: http://www.chokkan.org/software/liblbfgs/
More information: http://homes.cs.washington.edu/~galen/files/quasinewton-notes.pdf
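
As one example of batch training in practice, SciPy's optimizer exposes L-BFGS; the toy logistic-regression objective below is purely illustrative and is not tied to the liblbfgs library linked above.

```python
import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy features
y = np.array([1.0, -1.0, -1.0, 1.0])                            # labels in {+1, -1}

def neg_log_likelihood(w):
    z = y * (X @ w)
    return np.sum(np.log1p(np.exp(-z)))        # -sum_i log P(y_i | x_i; w)

def gradient(w):
    z = y * (X @ w)
    return -(X.T @ (y / (1.0 + np.exp(z))))

result = minimize(neg_log_likelihood, np.zeros(2), jac=gradient, method="L-BFGS-B")
print(result.x)                                 # batch-trained weights
```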

Online Learning vs. Batch Learning
Online: in general, a simpler mathematical derivation; often converges faster.
Batch: more stable (the result does not depend on the order of the examples); trivially parallelizable.

Regularization

Cannot Distinguish Between Large and Small Classifiers
For these examples:
-1 he saw a bird in the park
+1 he saw a robbery in the park
which classifier is better?
Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
Classifier 2: bird -1, robbery +1
Probably classifier 2! It doesn't use irrelevant information.

Regularization
Regularization is a penalty on adding extra weights.
L2 regularization: a big penalty on large weights, a small penalty on small weights; high accuracy.
L1 regularization: a uniform penalty whether the weight is large or small; will cause many weights to become exactly zero, giving a small model.
[Plot: the L2 (quadratic) and L1 (absolute-value) penalties as a function of the weight]

Regularization in Logistic Regression/CRFs
To regularize logistic regression or a CRF, we add the penalty to the log likelihood of the whole corpus. With L2 regularization:
ŵ = argmax_w log(∏_i P(Y_i | X_i; w)) - c ||w||^2
c adjusts the strength of the regularization:
smaller c: more freedom to fit the data
larger c: less freedom to fit the data, better generalization
L1 regularization is also used; it is slightly more difficult to optimize.
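
As a minimal sketch of what this looks like in the earlier SGD loop, one step can follow the per-example gradient plus the derivative of the L2 penalty -c·||w||^2. Applying the corpus-level penalty once per example, and the helper name sgd_step_l2, are simplifying assumptions for illustration.

```python
import math
from collections import defaultdict

def sgd_step_l2(w, phi, y, alpha=1.0, c=0.01):
    # w: defaultdict(float) of weights, phi: feature dict, y in {+1, -1}
    z = sum(w[k] * v for k, v in phi.items())
    coeff = y * math.exp(z) / (1.0 + math.exp(z)) ** 2   # dP(y|x)/d(w·φ)
    for k in list(w):
        w[k] -= alpha * 2.0 * c * w[k]                   # gradient of -c * ||w||^2
    for k, v in phi.items():
        w[k] += alpha * coeff * v
```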

Conclusion

Conclusion
Logistic regression is a probabilistic classifier.
Conditional random fields are probabilistic models for discriminative structured prediction.
Both can be trained using:
online stochastic gradient descent (like the perceptron)
batch learning with a method such as L-BFGS
Regularization can help solve problems of overfitting.

Thank You!