Sequential Data Modeling - Conditional Random Fields

1 Sequential Data Modeling - Conditional Random Fields
Graham Neubig
Nara Institute of Science and Technology (NAIST)

2 Prediction Problems
Given x, predict y

3 Prediction Problems
Given x, predict y:
A book review ("Oh, man I love this book!", "This book is so boring...") → Is it positive? (yes/no): Binary Prediction (2 choices)
A tweet ("On the way to the park!", "公園に行くなう!") → Its language (English/Japanese): Multi-class Prediction (several choices)
A sentence ("I read a book") → Its parts-of-speech ("N VBD DET NN"): Structured Prediction (millions of choices)

4 Logistic Regression

5 Example we will use:
Given an introductory sentence from Wikipedia, predict whether the article is about a person.
Given: "Gonso was a Sanron sect priest ( ) in the late Nara and early Heian periods." → Predict: Yes!
Given: "Shichikuzan Chigogataki Fudomyoo is a historical site located at Magura, Maizuru City, Kyoto Prefecture." → Predict: No!
This is binary classification (of course!)

6 Review: Linear Prediction Model
Each element that helps us predict is a feature:
contains "priest", contains "(<#>-<#>)", contains "site", contains "Kyoto Prefecture"
Each feature has a weight, positive if it indicates "yes" and negative if it indicates "no":
w_contains "priest" = 2    w_contains "(<#>-<#>)" = 1    w_contains "site" = -3    w_contains "Kyoto Prefecture" = -1
For a new example, sum the weights:
"Kuya ( ) was a priest born in Kyoto Prefecture" → 2 + 1 - 1 = 2
If the sum is at least 0: yes; otherwise: no

7 Review: Mathematical Formulation
y = sign(w·ϕ(x)) = sign(Σ_{i=1}^{I} w_i ϕ_i(x))
x: the input
ϕ(x): vector of feature functions {ϕ_1(x), ϕ_2(x), ..., ϕ_I(x)}
w: the weight vector {w_1, w_2, ..., w_I}
y: the prediction, +1 if "yes", -1 if "no"
(sign(v) is +1 if v >= 0 and -1 otherwise)
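
As a concrete illustration, here is a minimal Python sketch of this prediction rule, reusing the weights from the previous slide; the dict-based sparse feature representation is an assumption of the sketch, not something the slides fix.

def sign(v):
    # sign(v) is +1 if v >= 0, -1 otherwise, as defined above
    return 1 if v >= 0 else -1

def predict_one(w, phi):
    # y = sign(w . phi(x)): sum the weights of the active features
    return sign(sum(w.get(name, 0.0) * value for name, value in phi.items()))

# weights from the "is the article about a person?" example
w = {"contains priest": 2, "contains (<#>-<#>)": 1,
     "contains site": -3, "contains Kyoto Prefecture": -1}
# features of "Kuya ( ) was a priest born in Kyoto Prefecture"
phi = {"contains priest": 1, "contains (<#>-<#>)": 1, "contains Kyoto Prefecture": 1}
print(predict_one(w, phi))  # +1, since 2 + 1 - 1 = 2 >= 0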

8 Perceptron and Probabilities
Sometimes we want the probability P(y|x):
Estimating confidence in predictions
Combining with other systems
However, the perceptron only gives us a prediction y = sign(w·ϕ(x))
In other words, it acts as if:
P(y|x) = 1 if w·ϕ(x) >= 0
P(y|x) = 0 if w·ϕ(x) < 0
(Plot: P(y|x) as a step function of w·ϕ(x))

9 The Logistic Function
The logistic function is a "softened" version of the step function used in the perceptron:
P(y=+1|x) = e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})
(Plots: the perceptron step function and the logistic curve, P(y|x) against w·ϕ(x))
Can account for uncertainty
Differentiable
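
A quick sketch of the logistic function itself, with no assumptions beyond the formula above:

import math

def logistic(score):
    # P(y=+1|x) = e^{w.phi(x)} / (1 + e^{w.phi(x)}), where score = w.phi(x)
    return math.exp(score) / (1.0 + math.exp(score))

for score in (-2.0, 0.0, 2.0):
    print(score, round(logistic(score), 3))  # 0.119, 0.5, 0.881

Unlike the perceptron's step function, the output moves smoothly from 0 to 1, so scores near 0 read as uncertain predictions.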

10 Logistic Regression
Train based on conditional likelihood:
Find the parameters w that maximize the conditional likelihood of all answers y_i given the examples x_i:
ŵ = argmax_w Π_i P(y_i|x_i; w)
How do we solve this?

11 Review: Perceptron Training Algorithm
create map w
for I iterations
  for each labeled pair x, y in the data
    phi = create_features(x)
    y' = predict_one(w, phi)
    if y' != y
      w += y * phi
In other words:
Try to classify each training example
Every time we make a mistake, update the weights (see the runnable sketch below)
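
A runnable version of this pseudocode might look as follows; the unigram feature extractor and the y ∈ {-1, +1} label convention are assumptions carried over from the earlier examples.

from collections import defaultdict

def create_features(x):
    # unigram count features, assuming x is a pre-tokenized, space-separated string
    phi = defaultdict(int)
    for word in x.split():
        phi["unigram " + word] += 1
    return phi

def predict_one(w, phi):
    score = sum(w[name] * value for name, value in phi.items())
    return 1 if score >= 0 else -1

def train_perceptron(data, iterations):
    w = defaultdict(float)                       # create map w
    for _ in range(iterations):                  # for I iterations
        for x, y in data:                        # for each labeled pair x, y
            phi = create_features(x)
            if predict_one(w, phi) != y:         # every time we make a mistake...
                for name, value in phi.items():
                    w[name] += y * value         # ...update the weights: w += y * phi
    return w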

12 Stochastic Gradient Descent
Online training algorithm for probabilistic models (including logistic regression):
create map w
for I iterations
  for each labeled pair x, y in the data
    w += α * dP(y|x)/dw
In other words:
For every training example, calculate the gradient (the direction that will increase the probability of y)
Move in that direction, multiplied by the learning rate α
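
Here is a sketch of the same loop for logistic regression, plugging in the gradient derived on the next slide: the scalar factor e^z / (1 + e^z)² with z = w·ϕ(x), signed by y ∈ {-1, +1}. It reuses create_features from the perceptron sketch above.

import math
from collections import defaultdict

def train_lr_sgd(data, iterations, alpha):
    w = defaultdict(float)                            # create map w
    for _ in range(iterations):                       # for I iterations
        for x, y in data:                             # for each labeled pair x, y
            phi = create_features(x)                  # unigram features, as above
            z = sum(w[name] * value for name, value in phi.items())
            scale = y * math.exp(z) / (1.0 + math.exp(z)) ** 2   # dP(y|x)/d(w.phi)
            for name, value in phi.items():
                w[name] += alpha * scale * value      # w += alpha * dP(y|x)/dw
    return w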

13 Gradient of the Logistic Function
Take the derivative of the probability:
d/dw P(y=+1|x) = d/dw [ e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)}) ]
               = ϕ(x) e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})²
d/dw P(y=-1|x) = d/dw [ 1 - e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)}) ]
               = -ϕ(x) e^{w·ϕ(x)} / (1 + e^{w·ϕ(x)})²
(Plot: dP(y|x)/dw against w·ϕ(x), peaked at w·ϕ(x) = 0)

14 Example: Initial Update
Set α = 1, initialize w = 0
x = "A site , located in Maizuru , Kyoto", y = -1
w·ϕ(x) = 0
d/dw P(y=-1|x) = -e^0 / (1 + e^0)² ϕ(x) = -0.25 ϕ(x)
w ← w - 0.25 ϕ(x):
w_unigram "," = -0.5 ("," appears twice)
w_unigram A = w_unigram site = w_unigram located = w_unigram in = w_unigram Maizuru = w_unigram Kyoto = -0.25

15 Example: Second Update
x = "Shoken , monk born in Kyoto", y = +1
w·ϕ(x) = -0.5 - 0.25 - 0.25 = -1
d/dw P(y=+1|x) = e^{-1} / (1 + e^{-1})² ϕ(x) = 0.196 ϕ(x)
w ← w + 0.196 ϕ(x):
w_unigram "," = -0.5 + 0.196 = -0.304
w_unigram in = w_unigram Kyoto = -0.25 + 0.196 = -0.054
w_unigram Shoken = w_unigram monk = w_unigram born = 0.196
w_unigram A = w_unigram site = w_unigram located = w_unigram Maizuru = -0.25 (unchanged)
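
The two worked updates above can be reproduced with the SGD sketch; a self-contained check, treating commas as separate tokens as in the slides' feature lists:

import math
from collections import defaultdict

def sgd_update(w, x, y, alpha=1.0):
    phi = defaultdict(int)
    for word in x.split():                        # commas pre-separated by spaces
        phi["unigram " + word] += 1
    z = sum(w[name] * value for name, value in phi.items())
    scale = y * math.exp(z) / (1.0 + math.exp(z)) ** 2
    for name, value in phi.items():
        w[name] += alpha * scale * value

w = defaultdict(float)
sgd_update(w, "A site , located in Maizuru , Kyoto", -1)  # w.phi = 0, scale = -0.25
sgd_update(w, "Shoken , monk born in Kyoto", +1)          # w.phi = -1, scale = 0.196
print(round(w["unigram Kyoto"], 4))                       # -0.0534 (about -0.25 + 0.196)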

16 Calculating Optimal Sequences, Probabilities

17 Sequence Likelihood
Logistic regression considered the probability of a single label y ∈ {-1, +1}: P(y|x)
What if we want to consider the probability of a sequence P(Y|X)?
X_i: "I visited Nara"    Y_i: "PRN VBD NNP"

18 Calculating Multi-class Probabilities
Each sequence has its own feature vector:
φ(time/N flies/V) = {φ_T,<S>,N, φ_T,N,V, φ_T,V,</S>, φ_E,N,time, φ_E,V,flies}
φ(time/V flies/N) = {φ_T,<S>,V, φ_T,V,N, φ_T,N,</S>, φ_E,V,time, φ_E,N,flies}
φ(time/N flies/N) = {φ_T,<S>,N, φ_T,N,N, φ_T,N,</S>, φ_E,N,time, φ_E,N,flies}
φ(time/V flies/V) = {φ_T,<S>,V, φ_T,V,V, φ_T,V,</S>, φ_E,V,time, φ_E,V,flies}
Use the weights for each feature (w_T,<S>,N, w_T,V,</S>, w_E,N,time, ...) to calculate scores:
φ(time/N flies/V)·w = 3
φ(time/N flies/N)·w = 2
φ(time/V flies/N)·w = 0
φ(time/V flies/V)·w = 1

19 The Softmax Function
Turn the scores into probabilities by taking the exponent and normalizing (the softmax function):
P(Y|X) = e^{w·ϕ(Y,X)} / Σ_Ỹ e^{w·ϕ(Ỹ,X)}
Take the exponent and normalize:
exp(φ(time/N flies/V)·w) = 20.08 → P(N V | time flies) = .6437
exp(φ(time/N flies/N)·w) = 7.39 → P(N N | time flies) = .2369
exp(φ(time/V flies/V)·w) = 2.72 → P(V V | time flies) = .0872
exp(φ(time/V flies/N)·w) = 1.00 → P(V N | time flies) = .0321
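
In code, the softmax over the four candidate sequences is just exponentiation and normalization; the scores are those from the previous slide.

import math

scores = {"N V": 3.0, "N N": 2.0, "V V": 1.0, "V N": 0.0}
exps = {y: math.exp(s) for y, s in scores.items()}    # 20.09, 7.39, 2.72, 1.00
Z = sum(exps.values())                                # 31.19
probs = {y: e / Z for y, e in exps.items()}
print({y: round(p, 3) for y, p in probs.items()})
# {'N V': 0.644, 'N N': 0.237, 'V V': 0.087, 'V N': 0.032}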

20 Calculating Edge Features
Like the perceptron, we can calculate features for each edge of the tagging lattice:
<S> → time/N: φ_E,N,time, φ_T,<S>,N
<S> → time/V: φ_E,V,time, φ_T,<S>,V
time/N → flies/N: φ_E,N,flies, φ_T,N,N
time/N → flies/V: φ_E,V,flies, φ_T,N,V
time/V → flies/N: φ_E,N,flies, φ_T,V,N
time/V → flies/V: φ_E,V,flies, φ_T,V,V
flies/N → </S>: φ_T,N,</S>
flies/V → </S>: φ_T,V,</S>

21 Calculating Edge Probabilities
Calculate each edge's score and take the exponent:
<S> → time/N: e^{w·φ} = 7.39, P = .881
<S> → time/V: e^{w·φ} = 1.00, P = .119
time/N → flies/N: e^{w·φ} = 1.00, P = .237
time/N → flies/V: e^{w·φ} = 1.00, P = .644
time/V → flies/N: e^{w·φ} = 1.00, P = .032
time/V → flies/V: e^{w·φ} = 1.00, P = .087
flies/N → </S>: e^{w·φ} = 1.00, P = .269
flies/V → </S>: e^{w·φ} = 2.72, P = .731
This is now the same form as the HMM:
Can use the Viterbi algorithm
Calculate probabilities using forward-backward

22 Conditional Random Fields

23 Maximizing CRF Likelihood
We want to maximize the likelihood of the sequences:
ŵ = argmax_w Π_i P(Y_i|X_i; w)
P(Y|X) = e^{w·ϕ(Y,X)} / Σ_Ỹ e^{w·ϕ(Ỹ,X)}
For convenience, we consider the log likelihood:
log P(Y|X) = w·ϕ(Y,X) - log Σ_Ỹ e^{w·ϕ(Ỹ,X)}
We want to find the gradient for stochastic gradient descent:
d/dw log P(Y|X)

24 Deriving a CRF Gradient
log P(Y|X) = w·ϕ(Y,X) - log Σ_Ỹ e^{w·ϕ(Ỹ,X)}
           = w·ϕ(Y,X) - log Z
d/dw log P(Y|X) = ϕ(Y,X) - d/dw log Σ_Ỹ e^{w·ϕ(Ỹ,X)}
                = ϕ(Y,X) - (1/Z) Σ_Ỹ d/dw e^{w·ϕ(Ỹ,X)}
                = ϕ(Y,X) - Σ_Ỹ (e^{w·ϕ(Ỹ,X)} / Z) ϕ(Ỹ,X)
                = ϕ(Y,X) - Σ_Ỹ P(Ỹ|X) ϕ(Ỹ,X)
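
A brute-force sketch of this gradient, enumerating every candidate sequence (fine for two words and two tags, and exactly the thing the next slides avoid); the transition (T) and emission (E) feature templates mirror those of the running example.

import math
from collections import defaultdict
from itertools import product

def features(words, tags):
    # transition (T) and emission (E) features, as in the running example
    phi = defaultdict(int)
    for prev, curr in zip(["<S>"] + tags, tags + ["</S>"]):
        phi[("T", prev, curr)] += 1
    for word, tag in zip(words, tags):
        phi[("E", tag, word)] += 1
    return phi

def crf_gradient_bruteforce(words, gold_tags, w, tagset=("N", "V")):
    exps, phis = {}, {}
    for tags in product(tagset, repeat=len(words)):   # every candidate sequence
        phi = features(words, list(tags))
        phis[tags] = phi
        exps[tags] = math.exp(sum(w.get(f, 0.0) * v for f, v in phi.items()))
    Z = sum(exps.values())
    gradient = defaultdict(float)
    for f, v in features(words, list(gold_tags)).items():
        gradient[f] += v                              # add phi(Y, X)
    for tags, phi in phis.items():
        for f, v in phi.items():
            gradient[f] -= (exps[tags] / Z) * v       # subtract sum_Y~ P(Y~|X) phi(Y~, X)
    return gradient

# with w = 0 every sequence has probability .25
print(crf_gradient_bruteforce(["time", "flies"], ["N", "V"], {}))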

25 In Other Words...
d/dw log P(Y|X) = ϕ(Y,X) - Σ_Ỹ P(Ỹ|X) ϕ(Ỹ,X)
To get the gradient we:
add the correct feature vector
subtract the expectation of the features

26 Example
The four sequences and their feature vectors are as on slide 18, with probabilities:
φ(time/N flies/V): P = .644
φ(time/V flies/N): P = .032
φ(time/N flies/N): P = .237
φ(time/V flies/V): P = .087
Taking time/N flies/V as the correct answer Y, the gradient for each feature is:
φ_T,<S>,N, φ_E,N,time: 1 - .644 - .237 = .119
φ_T,N,V: 1 - .644 = .356
φ_T,<S>,V, φ_E,V,time: 0 - .032 - .087 = -.119
φ_T,V,</S>, φ_E,V,flies: 1 - .644 - .087 = .269
φ_T,N,</S>, φ_E,N,flies: 0 - .032 - .237 = -.269
φ_T,V,N: 0 - .032 = -.032
φ_T,N,N: 0 - .237 = -.237
φ_T,V,V: 0 - .087 = -.087

27 Combinatorial Explosion
Problem!: the number of hypotheses Ỹ in
d/dw log P(Y|X) = ϕ(Y,X) - Σ_Ỹ P(Ỹ|X) ϕ(Ỹ,X)
is exponential: O(T^|X|), where T is the number of tags.

28 Calculate Feature Expectations using Edge Probabilities!
If we know the edge probabilities, just multiply them!
<S> → time/N (features φ_E,N,time, φ_T,<S>,N): e^{w·φ} = 7.39, P = .881
<S> → time/V (features φ_E,V,time, φ_T,<S>,V): e^{w·φ} = 1.00, P = .119
φ_T,<S>,N, φ_E,N,time: 1 - .881 = .119
φ_T,<S>,V, φ_E,V,time: 0 - .119 = -.119
Same answer as when we explicitly expand all Ỹ!

29 CRF Training Procedure
Can perform stochastic gradient descent, like logistic regression:
create map w
for I iterations
  for each labeled pair X, Y in the data
    gradient = φ(Y,X)
    calculate e^{w·φ(edge)} for each edge
    run the forward-backward algorithm to get P(edge) for each edge
    for each edge
      gradient -= P(edge) * φ(edge)
    w += α * gradient
The only major difference from logistic regression is the gradient calculation (α is the learning rate).
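
A sketch of the edge-marginal step at the heart of this loop: forward-backward over the tagging lattice. It assumes the unnormalized edge scores e^{w·φ(edge)} have already been computed into a list of per-position dictionaries (this data layout is an assumption of the sketch, not fixed by the slides).

import math

def edge_marginals(edge_scores, tagset):
    # edge_scores[0][("<S>", t)]: first word; edge_scores[i][(p, t)] for 0 < i < n:
    # word i tagged t after p; edge_scores[n][(t, "</S>")]: final transition
    n = len(edge_scores) - 1
    alpha = [{t: edge_scores[0][("<S>", t)] for t in tagset}]       # forward scores
    for i in range(1, n):
        alpha.append({t: sum(alpha[i - 1][p] * edge_scores[i][(p, t)] for p in tagset)
                      for t in tagset})
    beta = [None] * n                                               # backward scores
    beta[n - 1] = {t: edge_scores[n][(t, "</S>")] for t in tagset}
    for i in range(n - 2, -1, -1):
        beta[i] = {t: sum(edge_scores[i + 1][(t, q)] * beta[i + 1][q] for q in tagset)
                   for t in tagset}
    Z = sum(edge_scores[0][("<S>", t)] * beta[0][t] for t in tagset)
    marg = [{("<S>", t): edge_scores[0][("<S>", t)] * beta[0][t] / Z for t in tagset}]
    for i in range(1, n):
        marg.append({(p, t): alpha[i - 1][p] * edge_scores[i][(p, t)] * beta[i][t] / Z
                     for p in tagset for t in tagset})
    marg.append({(t, "</S>"): alpha[n - 1][t] * edge_scores[n][(t, "</S>")] / Z
                 for t in tagset})
    return marg

# edge scores consistent with the "time flies" numbers on slides 20-21
edges = [{("<S>", "N"): math.exp(2), ("<S>", "V"): 1.0},
         {("N", "N"): 1.0, ("N", "V"): 1.0, ("V", "N"): 1.0, ("V", "V"): 1.0},
         {("N", "</S>"): 1.0, ("V", "</S>"): math.exp(1)}]
print(edge_marginals(edges, ("N", "V")))  # P(<S>->N) = .881, P(N->V) = .644, ...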

30 Learning Algorithms

31 Batch Learning and Online Learning
Online learning: update after each example
Online stochastic gradient descent:
create map w
for I iterations
  for each labeled pair x, y in the data
    w += α * dP(y|x)/dw
Batch learning: update after all examples
Batch gradient descent:
create map w
for I iterations
  gradient = 0
  for each labeled pair x, y in the data
    gradient += α * dP(y|x)/dw
  w += gradient
(A runnable contrast of the two loops is sketched below.)
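
A sketch of the batch variant, where compute_gradient stands in for the per-example gradient dP(y|x)/dw from the earlier slides (the name is a placeholder of this sketch):

from collections import defaultdict

def train_batch(data, compute_gradient, iterations, alpha):
    w = defaultdict(float)
    for _ in range(iterations):
        gradient = defaultdict(float)
        for x, y in data:                          # accumulate over all examples...
            for name, g in compute_gradient(w, x, y).items():
                gradient[name] += g
        for name, g in gradient.items():           # ...then update once
            w[name] += alpha * g
    return w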

32 Batch Learning Algorithms: Newton/Quasi-Newton Methods
Newton-Raphson method: choose how far to update using the second-order derivatives (the Hessian matrix)
Faster convergence, but |w| × |w| time and memory
Limited-memory Broyden-Fletcher-Goldfarb-Shanno algorithm (L-BFGS): guesses the second-order derivatives from the first-order ones
Most widely used?
Library:
More information:
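
The slide names the algorithm but no specific library; as one concrete (assumed) choice, SciPy ships an L-BFGS implementation. A toy sketch minimizing the negative log likelihood of a two-feature logistic regression problem:

import numpy as np
from scipy.optimize import minimize

X = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])  # hypothetical feature vectors
y = np.array([1.0, 1.0, -1.0])                      # labels in {-1, +1}

def neg_log_likelihood(w):
    # -sum_i log P(y_i|x_i; w), using P(y|x) = 1 / (1 + e^{-y w.phi(x)})
    return np.sum(np.log1p(np.exp(-y * (X @ w))))

result = minimize(neg_log_likelihood, np.zeros(2), method="L-BFGS-B")
print(result.x)                                     # learned weight vector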

33 Online Learning vs. Batch Learning
Online:
In general, simpler mathematical derivation
Often converges faster
Batch:
More stable (does not change based on the order of examples)
Trivially parallelizable

34 Regularization

35 Cannot Distinguish Between Large and Small Classifiers
For these examples:
-1  he saw a bird in the park
+1  he saw a robbery in the park
Which classifier is better?
Classifier 1: he +3, saw -5, a +0.5, bird -1, robbery +1, in +5, the -3, park -2
Classifier 2: bird -1, robbery +1

36 Cannot Distinguish Between Large and Small Classifiers
(Same examples and classifiers as the previous slide.)
Probably classifier 2! It doesn't use irrelevant information.

37 Regularization
A penalty on adding extra weights:
L2 regularization: big penalty on large weights, small penalty on small weights → high accuracy
L1 regularization: a uniform penalty whether weights are large or small; will cause many weights to become zero → a small model
(Plot: the L2 and L1 penalty curves)

38 Regularization in Logistic Regression/CRFs
To regularize logistic regression or CRFs, we add the penalty to the log likelihood (for the whole corpus).
L2 regularization:
ŵ = argmax_w (Σ_i log P(Y_i|X_i; w)) - c Σ_j w_j²
c adjusts the strength of the regularization:
smaller: more freedom to fit the data
larger: less freedom to fit the data, better generalization
L1 is also used; it is slightly more difficult to optimize.
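
In an SGD setting, the L2 penalty adds -2c·w_j to each weight's gradient, so every update also shrinks the weights toward zero; a sketch building on the earlier SGD loop, with c as above:

def apply_l2_penalty(w, alpha, c):
    # the gradient of the penalty -c * sum_j w_j^2 is -2 * c * w_j per weight
    for name in list(w):
        w[name] -= alpha * 2.0 * c * w[name]

One would call this right after each likelihood-gradient update; applying a corpus-level penalty once per example is itself an approximation of the batch objective above.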

39 Conclusion

40 Conclusion
Logistic regression is a probabilistic classifier
Conditional random fields are probabilistic structured discriminative prediction models
Both can be trained using:
Online stochastic gradient descent (like the perceptron)
Batch learning, using a method such as L-BFGS
Regularization can help solve problems of overfitting

41 Thank You!
