Sequential Data Modeling - Conditional Random Fields
Graham Neubig
Nara Institute of Science and Technology (NAIST)
Prediction Problems
Given x, predict y
Prediction Problems
Given x, predict y:

A book review ("Oh, man I love this book!", "This book is so boring...")
→ Is it positive? (yes / no)
Binary Prediction (2 choices)

A tweet ("On the way to the park!", "公園に行くなう!")
→ Its language (English / Japanese)
Multi-class Prediction (several choices)

A sentence ("I read a book")
→ Its parts-of-speech ("N VBD DET NN")
Structured Prediction (millions of choices)
Logistic Regression
Example we will use:
Given an introductory sentence from Wikipedia, predict whether the article is about a person.

Given: Gonso was a Sanron sect priest (754-827) in the late Nara and early Heian periods.
Predict: Yes!

Given: Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture.
Predict: No!

This is binary classification (of course!)
Review: Linear Prediction Model
Each element that helps us predict is a feature:
  contains "priest"    contains "(<#>-<#>)"    contains "site"    contains "Kyoto Prefecture"
Each feature has a weight, positive if it indicates yes, and negative if it indicates no:
  w_contains "priest" = 2    w_contains "(<#>-<#>)" = 1
  w_contains "site" = -3     w_contains "Kyoto Prefecture" = -1
For a new example, sum the weights:
  "Kuya (903-972) was a priest born in Kyoto Prefecture."
  2 + -1 + 1 = 2
If the sum is at least 0: yes, otherwise: no
Review: Mathematical Formulation
y = sign(w·φ(x)) = sign(Σ_{i=1}^{I} w_i φ_i(x))
  x: the input
  φ(x): vector of feature functions {φ_1(x), φ_2(x), …, φ_I(x)}
  w: the weight vector {w_1, w_2, …, w_I}
  y: the prediction, +1 if yes, -1 if no
  (sign(v) is +1 if v >= 0, -1 otherwise)
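As a concrete illustration, here is a minimal Python sketch of this model, using the feature names and weights from the review slide above; the helper name predict is ours, not from the lecture.

# A minimal sketch of the linear prediction model.
# Weights and the test sentence come from the review slide above;
# the function name "predict" is illustrative.
w = {"contains priest": 2, "contains (<#>-<#>)": 1,
     "contains site": -3, "contains Kyoto Prefecture": -1}

def predict(w, phi):
    """Return +1 if w*phi(x) >= 0, otherwise -1."""
    score = sum(w.get(name, 0) * value for name, value in phi.items())
    return +1 if score >= 0 else -1

# "Kuya (903-972) was a priest born in Kyoto Prefecture."
phi = {"contains priest": 1, "contains (<#>-<#>)": 1,
       "contains Kyoto Prefecture": 1}
print(predict(w, phi))  # score = 2 + 1 - 1 = 2 >= 0, so +1 (yes)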
Perceptron and Probabilities
Sometimes we want the probability P(y|x):
  Estimating confidence in predictions
  Combining with other systems
However, the perceptron only gives us a prediction y = sign(w·φ(x))
In other words:
  P(y|x) = 1 if w·φ(x) >= 0
  P(y|x) = 0 if w·φ(x) < 0
[Plot: P(y|x) as a step function of w·φ(x)]
The Logistic Function
The logistic function is a softened version of the function used in the perceptron:
  P(y|x) = e^{w·φ(x)} / (1 + e^{w·φ(x)})
[Plots: the perceptron's step function vs. the smooth logistic curve of P(y|x) against w·φ(x)]
Can account for uncertainty
Differentiable
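To make the contrast concrete, a short Python sketch (ours, not from the slides) comparing the two functions:

import math

def perceptron_prob(score):
    """Step function: P(y|x) is exactly 1 or 0."""
    return 1.0 if score >= 0 else 0.0

def logistic(score):
    """P(y|x) = e^{w*phi(x)} / (1 + e^{w*phi(x)})."""
    return math.exp(score) / (1 + math.exp(score))

for s in [-10, -1, 0, 1, 10]:
    print(s, perceptron_prob(s), round(logistic(s), 4))
# -10: 0.0 vs 0.0   -1: 0.0 vs 0.2689   0: 1.0 vs 0.5
#   1: 1.0 vs 0.7311   10: 1.0 vs 1.0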
Logistic Regression
Train based on conditional likelihood: find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:
  ŵ = argmax_w Π_i P(y_i|x_i; w)
How do we solve this?
Review: Perceptron Training Algorithm
  create map w
  for I iterations
    for each labeled pair x, y in the data
      phi = create_features(x)
      y' = predict_one(w, phi)
      if y' != y
        w += y * phi
In other words: try to classify each training example, and every time we make a mistake, update the weights.
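A runnable Python version of this pseudocode might look as follows; it is a sketch under our own naming, assuming unigram features as in the slides' examples.

from collections import defaultdict

def create_features(x):
    """Unigram features, as in the examples above (an assumption)."""
    phi = defaultdict(int)
    for word in x.split():
        phi["unigram " + word] += 1
    return phi

def predict_one(w, phi):
    score = sum(w[name] * value for name, value in phi.items())
    return +1 if score >= 0 else -1

def train_perceptron(data, iterations):
    w = defaultdict(float)                # create map w
    for _ in range(iterations):
        for x, y in data:                 # y is +1 or -1
            phi = create_features(x)
            if predict_one(w, phi) != y:  # made a mistake:
                for name, value in phi.items():
                    w[name] += y * value  # update the weights
    return w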
Stochastic Gradient Descent
Online training algorithm for probabilistic models (including logistic regression):
  create map w
  for I iterations
    for each labeled pair x, y in the data
      w += α * dP(y|x)/dw
In other words: for every training example, calculate the gradient (the direction that will increase the probability of y) and move in that direction, multiplied by learning rate α.
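Sketched in Python (our code), with the gradient computation left as a pluggable function, since it is derived for logistic regression on the next slides:

from collections import defaultdict

def train_sgd(data, gradient, iterations, alpha=0.1):
    """gradient(w, x, y) should return dP(y|x)/dw as a feature map.
    The learning rate alpha = 0.1 is just an illustrative default."""
    w = defaultdict(float)
    for _ in range(iterations):
        for x, y in data:
            for name, value in gradient(w, x, y).items():
                w[name] += alpha * value
    return w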
Gradient of the Logistic Function
Take the derivative of the probability:
  d/dw P(y=+1|x) = d/dw [ e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                 = φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})^2
  d/dw P(y=-1|x) = d/dw [ 1 - e^{w·φ(x)} / (1 + e^{w·φ(x)}) ]
                 = -φ(x) e^{w·φ(x)} / (1 + e^{w·φ(x)})^2
[Plot: the coefficient e^{w·φ(x)} / (1 + e^{w·φ(x)})^2 against w·φ(x), a bell-shaped curve peaked at w·φ(x) = 0]
Example: Initial Update
Set α (= 1 here, matching the numbers below), initialize w = 0
x = "A site, located in Maizuru, Kyoto"    y = -1
w·φ(x) = 0
d/dw P(y=-1|x) = -e^0 / (1 + e^0)^2 φ(x) = -0.25 φ(x)
w ← w - 0.25 φ(x)
Updated weights:
  w_unigram "A" = -0.25       w_unigram "site" = -0.25
  w_unigram "," = -0.5        w_unigram "located" = -0.25
  w_unigram "in" = -0.25      w_unigram "Maizuru" = -0.25
  w_unigram "Kyoto" = -0.25
Example: Second Update
x = "Shoken, monk born in Kyoto"    y = +1
w·φ(x) = -0.5 - 0.25 - 0.25 = -1
d/dw P(y=+1|x) = e^{-1} / (1 + e^{-1})^2 φ(x) = 0.196 φ(x)
w ← w + 0.196 φ(x)
Updated weights:
  w_unigram "A" = -0.25       w_unigram "site" = -0.25
  w_unigram "," = -0.304      w_unigram "located" = -0.25
  w_unigram "in" = -0.054     w_unigram "Maizuru" = -0.25
  w_unigram "Kyoto" = -0.054  w_unigram "Shoken" = 0.196
  w_unigram "monk" = 0.196    w_unigram "born" = 0.196
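These two updates can be reproduced with a few lines of Python (our sketch; the slides' numbers imply α = 1, and they round the coefficient 0.1966 to 0.196 before updating):

import math
from collections import defaultdict

def unigrams(tokens):
    phi = defaultdict(int)
    for t in tokens:
        phi["unigram " + t] += 1
    return phi

def sgd_update(w, phi, y, alpha=1.0):
    score = sum(w[k] * v for k, v in phi.items())              # w·φ(x)
    coeff = y * math.exp(score) / (1 + math.exp(score)) ** 2   # dP(y|x)/d(w·φ(x))
    for k, v in phi.items():
        w[k] += alpha * coeff * v

w = defaultdict(float)
# First update: y = -1, w·φ(x) = 0, coefficient = -0.25
sgd_update(w, unigrams("A site , located in Maizuru , Kyoto".split()), -1)
print(w["unigram ,"], w["unigram Kyoto"])        # -0.5 -0.25
# Second update: y = +1, w·φ(x) = -1, coefficient = 0.1966
sgd_update(w, unigrams("Shoken , monk born in Kyoto".split()), +1)
print(round(w["unigram ,"], 3), round(w["unigram Kyoto"], 3))
# -0.303 -0.053 (the slides show -0.304/-0.054 from the rounded 0.196)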
Calculating Optimal Sequences, Probabilities
Sequence Likelihood
Logistic regression considered the probability of y ∈ {-1, +1}:
  P(y|x)
What if we want to consider the probability of a sequence?
  X_i: I visited Nara
  Y_i: PRN VBD NNP
  P(Y|X)
Calculating Multi-class Probabilities
Each sequence has its own feature vector:
  φ("time flies", N V) = {φ_{T,<S>,N}, φ_{T,N,V}, φ_{T,V,</S>}, φ_{E,N,time}, φ_{E,V,flies}}
  φ("time flies", V N) = {φ_{T,<S>,V}, φ_{T,V,N}, φ_{T,N,</S>}, φ_{E,V,time}, φ_{E,N,flies}}
  φ("time flies", N N) = {φ_{T,<S>,N}, φ_{T,N,N}, φ_{T,N,</S>}, φ_{E,N,time}, φ_{E,N,flies}}
  φ("time flies", V V) = {φ_{T,<S>,V}, φ_{T,V,V}, φ_{T,V,</S>}, φ_{E,V,time}, φ_{E,V,flies}}
Use weights for each feature (e.g. w_{T,<S>,N}, w_{T,V,</S>}, w_{E,N,time}) to calculate scores:
  φ("time flies", N V)·w = 3    φ("time flies", N N)·w = 2
  φ("time flies", V N)·w = 0    φ("time flies", V V)·w = 1
The Softmax Function
Turn scores into probabilities by taking the exponent and normalizing (the softmax function):
  P(Y|X) = e^{w·φ(Y,X)} / Σ_{Ỹ} e^{w·φ(Ỹ,X)}
Take the exponent and normalize:
  exp(φ("time flies", N V)·w) = 20.08    P(N V|time flies) = .6437
  exp(φ("time flies", N N)·w) = 7.39     P(N N|time flies) = .2369
  exp(φ("time flies", V N)·w) = 1.00     P(V N|time flies) = .0320
  exp(φ("time flies", V V)·w) = 2.72     P(V V|time flies) = .0872
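A short Python check (ours) that reproduces these numbers from the four sequence scores on the previous slide:

import math

scores = {"N V": 3.0, "N N": 2.0, "V N": 0.0, "V V": 1.0}  # w·φ(Y,X) values
Z = sum(math.exp(s) for s in scores.values())              # 20.09 + 7.39 + 1.00 + 2.72
for Y, s in scores.items():
    print(Y, round(math.exp(s), 2), round(math.exp(s) / Z, 4))
# N V 20.09 0.6439   N N 7.39 0.2369   V N 1.0 0.0321   V V 2.72 0.0871
# (matches the slide up to rounding)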
Calculating Edge Features
Like the perceptron, we can calculate features for each edge of the tag lattice:
  <S> → N (time):  φ_{E,N,time}, φ_{T,<S>,N}
  <S> → V (time):  φ_{E,V,time}, φ_{T,<S>,V}
  N → N (flies):   φ_{E,N,flies}, φ_{T,N,N}
  N → V (flies):   φ_{E,V,flies}, φ_{T,N,V}
  V → N (flies):   φ_{E,N,flies}, φ_{T,V,N}
  V → V (flies):   φ_{E,V,flies}, φ_{T,V,V}
  N → </S>:        φ_{T,N,</S>}
  V → </S>:        φ_{T,V,</S>}
Calculating Edge Probabilities
Calculate scores and take the exponent for each edge:
  <S> → N (time):  e^{w·φ} = 7.39   P = .881
  <S> → V (time):  e^{w·φ} = 1.00   P = .119
  N → N (flies):   e^{w·φ} = 1.00   P = .237
  N → V (flies):   e^{w·φ} = 1.00   P = .644
  V → N (flies):   e^{w·φ} = 1.00   P = .032
  V → V (flies):   e^{w·φ} = 1.00   P = .087
  N → </S>:        e^{w·φ} = 1.00   P = .269
  V → </S>:        e^{w·φ} = 2.72   P = .731
This is now the same form as the HMM:
  Can use the Viterbi algorithm
  Calculate probabilities using forward-backward
Conditional Random Fields
Maximizing CRF Likelihood
We want to maximize the likelihood of the training sequences:
  ŵ = argmax_w Π_i P(Y_i|X_i; w)
  P(Y|X) = e^{w·φ(Y,X)} / Σ_{Ỹ} e^{w·φ(Ỹ,X)}
For convenience, we consider the log likelihood:
  log P(Y|X) = w·φ(Y,X) - log Σ_{Ỹ} e^{w·φ(Ỹ,X)}
We want to find the gradient for stochastic gradient descent:
  d/dw log P(Y|X)
Deriving a CRF Gradient
  log P(Y|X) = w·φ(Y,X) - log Σ_{Ỹ} e^{w·φ(Ỹ,X)}
             = w·φ(Y,X) - log Z
  d/dw log P(Y|X) = φ(Y,X) - d/dw log Σ_{Ỹ} e^{w·φ(Ỹ,X)}
                  = φ(Y,X) - (1/Z) Σ_{Ỹ} d/dw e^{w·φ(Ỹ,X)}
                  = φ(Y,X) - Σ_{Ỹ} (e^{w·φ(Ỹ,X)} / Z) φ(Ỹ,X)
                  = φ(Y,X) - Σ_{Ỹ} P(Ỹ|X) φ(Ỹ,X)
In Other Words...
  d/dw log P(Y|X) = φ(Y,X) - Σ_{Ỹ} P(Ỹ|X) φ(Ỹ,X)
To get the gradient we:
  add the correct feature vector
  subtract the expectation of the features
Example
The feature vectors of the four sequences are as two slides above; Y = "N V" is correct:
  P(N V) = .644    P(V N) = .032    P(N N) = .237    P(V V) = .087
Gradient for each feature:
  φ_{T,<S>,N}, φ_{E,N,time}:   1 - .644 - .237 = .119
  φ_{T,<S>,V}, φ_{E,V,time}:   0 - .032 - .087 = -.119
  φ_{T,N,V}:                   1 - .644 = .356
  φ_{T,V,N}:                   0 - .032 = -.032
  φ_{T,N,N}:                   0 - .237 = -.237
  φ_{T,V,V}:                   0 - .087 = -.087
  φ_{T,V,</S>}, φ_{E,V,flies}: 1 - .644 - .087 = .269
  φ_{T,N,</S>}, φ_{E,N,flies}: 0 - .032 - .237 = -.269
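The same numbers fall out of a few lines of Python (our sketch) that add the correct feature vector and subtract the expectation by explicit enumeration; the feature-name strings are our own encoding:

from collections import defaultdict

# Feature vectors of the four sequences (from the earlier slide); "N V" is correct.
features = {
    "N V": ["T:<S>,N", "T:N,V", "T:V,</S>", "E:N,time", "E:V,flies"],
    "V N": ["T:<S>,V", "T:V,N", "T:N,</S>", "E:V,time", "E:N,flies"],
    "N N": ["T:<S>,N", "T:N,N", "T:N,</S>", "E:N,time", "E:N,flies"],
    "V V": ["T:<S>,V", "T:V,V", "T:V,</S>", "E:V,time", "E:V,flies"],
}
P = {"N V": 0.644, "V N": 0.032, "N N": 0.237, "V V": 0.087}

gradient = defaultdict(float)
for f in features["N V"]:          # add the correct feature vector
    gradient[f] += 1.0
for Y, p in P.items():             # subtract the expectation of the features
    for f in features[Y]:
        gradient[f] -= p
print(round(gradient["T:<S>,N"], 3))   # 1 - .644 - .237 = 0.119
print(round(gradient["T:N,V"], 3))     # 1 - .644 = 0.356
print(round(gradient["T:V,V"], 3))     # 0 - .087 = -0.087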
Combinatorial Explosion
Problem! The number of hypotheses Ỹ is exponential:
  d/dw log P(Y|X) = φ(Y,X) - Σ_{Ỹ} P(Ỹ|X) φ(Ỹ,X)
  O(T^|X|) terms, where T = number of tags
Calculate Feature Expectations Using Edge Probabilities!
If we know the edge probabilities, just multiply them:
  <S> → N (time):  e^{w·φ} = 7.39   P = .881
  <S> → V (time):  e^{w·φ} = 1.00   P = .119
  φ_{T,<S>,N}, φ_{E,N,time}:  1 - .881 = .119
  φ_{T,<S>,V}, φ_{E,V,time}:  0 - .119 = -.119
Same answer as when we explicitly expand all Ỹ:
  φ_{T,<S>,N}, φ_{E,N,time}:  1 - .644 - .237 = .119
  φ_{T,<S>,V}, φ_{E,V,time}:  0 - .032 - .087 = -.119
CRF Training Procedure
We can perform stochastic gradient descent, as in logistic regression:
  create map w
  for I iterations
    for each labeled pair X, Y in the data
      gradient = φ(Y,X)
      calculate e^{w·φ(edge)} for each edge
      run the forward-backward algorithm to get P(edge) for each edge
      for each edge
        gradient -= P(edge) * φ(edge)
      w += α * gradient
The only major difference from logistic regression is the gradient calculation (α is the learning rate).
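Below is a hedged Python sketch of this procedure for the bigram tag lattice used in the examples; the tag set, feature-name strings, and function names are our own, and it works with raw (not log-space) scores, so it is only suitable for short sentences:

import math
from collections import defaultdict

TAGS = ["N", "V"]

def edge_features(prev, tag, word):
    """Transition feature plus an emission feature (none on the edge into </S>)."""
    feats = ["T:%s,%s" % (prev, tag)]
    if word is not None:
        feats.append("E:%s,%s" % (tag, word))
    return feats

def edge_score(w, prev, tag, word):
    return math.exp(sum(w[f] for f in edge_features(prev, tag, word)))

def crf_gradient(w, words, tags):
    n = len(words)
    grad = defaultdict(float)
    # gradient = phi(Y, X): one edge per word, plus the edge into </S>
    for i in range(n + 1):
        prev = tags[i - 1] if i > 0 else "<S>"
        tag = tags[i] if i < n else "</S>"
        for f in edge_features(prev, tag, words[i] if i < n else None):
            grad[f] += 1.0
    # forward scores alpha[i][tag]
    alpha = [defaultdict(float) for _ in range(n + 1)]
    alpha[0]["<S>"] = 1.0
    for i in range(n):
        for p in (["<S>"] if i == 0 else TAGS):
            for t in TAGS:
                alpha[i + 1][t] += alpha[i][p] * edge_score(w, p, t, words[i])
    # backward scores beta[i][tag]
    beta = [defaultdict(float) for _ in range(n + 1)]
    for t in TAGS:
        beta[n][t] = edge_score(w, t, "</S>", None)
    for i in range(n - 1, 0, -1):
        for p in TAGS:
            for t in TAGS:
                beta[i][p] += edge_score(w, p, t, words[i]) * beta[i + 1][t]
    Z = sum(alpha[n][t] * beta[n][t] for t in TAGS)  # partition function
    # gradient -= P(edge) * phi(edge) for every edge in the lattice
    for i in range(n + 1):
        for p in (["<S>"] if i == 0 else TAGS):
            for t in (TAGS if i < n else ["</S>"]):
                word = words[i] if i < n else None
                tail = beta[i + 1][t] if i < n else 1.0
                p_edge = alpha[i][p] * edge_score(w, p, t, word) * tail / Z
                for f in edge_features(p, t, word):
                    grad[f] -= p_edge
    return grad

def train_crf(data, iterations, alpha_lr=0.1):
    w = defaultdict(float)
    for _ in range(iterations):
        for words, tags in data:
            for f, g in crf_gradient(w, words, tags).items():
                w[f] += alpha_lr * g
    return w

w = train_crf([("time flies".split(), ["N", "V"])], iterations=20)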
Learning Algorithms
Batch Learning
Online learning: update after each example
  Online Stochastic Gradient Descent
    create map w
    for I iterations
      for each labeled pair x, y in the data
        w += α * dP(y|x)/dw
Batch learning: update after all examples
  Batch Gradient Descent
    create map w
    for I iterations
      gradient = 0
      for each labeled pair x, y in the data
        gradient += α * dP(y|x)/dw
      w += gradient
Batch Learning Algorithms: Newton/Quasi-Newton Methods
Newton-Raphson method: choose how far to update using the second-order derivatives (the Hessian matrix)
  Faster convergence, but O(|w|^2) time and memory
Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS): approximates the second-order derivatives from first-order information
  Most widely used?
  Library: http://www.chokkan.org/software/liblbfgs/
  More information: http://homes.cs.washington.edu/~galen/files/quasinewton-notes.pdf
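As an illustration, batch logistic regression can be trained with an off-the-shelf L-BFGS implementation; the SciPy-based sketch below is ours (toy dense features and labels), not a recipe from the lecture:

import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    """Negative conditional log likelihood of logistic regression, with gradient."""
    s = y * (X @ w)                                   # y in {-1, +1}
    nll = np.sum(np.logaddexp(0.0, -s))               # sum of log(1 + e^{-s})
    grad = -((y / (1 + np.exp(s)))[:, None] * X).sum(axis=0)
    return nll, grad

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # toy feature vectors
y = np.array([1.0, -1.0, 1.0])                        # toy labels
result = minimize(neg_log_likelihood, np.zeros(2), args=(X, y),
                  jac=True, method="L-BFGS-B")
print(result.x)                                       # learned weight vector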
Online Learning vs. Batch Learning
Online:
  In general, simpler mathematical derivation
  Often converges faster
Batch:
  More stable (does not change based on example order)
  Trivially parallelizable
Regularization
Cannot Distinguish Between Large and Small Classifiers
For these examples:
  -1  he saw a bird in the park
  +1  he saw a robbery in the park
Which classifier is better?
  Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
  Classifier 2: bird -1, robbery +1
Probably classifier 2! It doesn't use irrelevant information.
Regularization
A penalty on adding extra weights:
L2 regularization:
  Big penalty on large weights, small penalty on small weights
  High accuracy
L1 regularization:
  Uniform penalty whether weights are large or small
  Will cause many weights to become zero → small model
[Plot: the L2 penalty (quadratic curve) and L1 penalty (V-shaped curve) as functions of the weight value]
Regularization in Logistic Regression/CRFs
To regularize logistic regression or a CRF, we add the penalty to the log likelihood (for the whole corpus).
L2 regularization:
  ŵ = argmax_w (Σ_i log P(Y_i|X_i; w)) - c‖w‖^2
c adjusts the strength of the regularization:
  smaller: more freedom to fit the data
  larger: less freedom to fit the data, better generalization
L1 is also used; it is slightly more difficult to optimize.
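In stochastic gradient descent this simply adds a term to each weight update; a minimal sketch (ours), using the common approximation of applying the penalty's gradient -2cw at every step:

def l2_regularized_update(w, gradient, alpha=0.1, c=0.01):
    """One SGD step on the L2-regularized log likelihood.
    alpha and c are illustrative values; w and gradient are feature maps."""
    for name in list(w):
        gradient[name] = gradient.get(name, 0.0) - 2.0 * c * w[name]
    for name, g in gradient.items():
        w[name] = w.get(name, 0.0) + alpha * g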
Conclusion
Conclusion
Logistic regression is a probabilistic classifier.
Conditional random fields are probabilistic structured discriminative prediction models.
They can be trained using:
  Online stochastic gradient descent (like the perceptron)
  Batch learning with a method such as L-BFGS
Regularization can help solve problems of overfitting.
Thank You!