Natural Language Processing Course (Corso di Elaborazione del Linguaggio Naturale), Pisa, May 2011
Outline
1. Introduction
2. Linear-Chain
3. General
4. Specific
5. Implementations
6. Conclusions
Introduction
- Graphical probabilistic model
- Discriminative model
- State of the art in tasks such as sequence labeling
Graphical Models
A graphical model is a family of probability distributions that factorize according to an underlying graph. Graphical models provide a simple way to visualize the structure of a probabilistic model: by inspecting the graph we can gain insight into the properties of the model. There are three equivalent types of model:
- Directed graphical models (Bayesian networks)
- Undirected graphical models (Markov random fields)
- Factor graphs (a generalization of the first two)
Graphical Models (cont'd)
[Figure: example graphical models, a short chain over labels $y_1, y_2, y_3$ with observations $x_1, x_2, x_3$, and a longer chain over labels $y_t$ and observations $x_t$.]
Generative Models (Ng and Jordan, 2002)
Generative classifiers learn a model of the joint probability $p(x, y)$ of the inputs $x$ and the label $y$, and make their predictions by using Bayes' rule to calculate $p(y \mid x)$ and then picking the most likely label $y$. It is very hard to model the dependencies between different features over the observed variables. In Naive Bayes classifiers we assume conditional independence between features.
Discriminative Models
Discriminative classifiers model the posterior $p(y \mid x)$ directly. They do not need to model the distribution of features over the observed variables.
Linear-Chain CRF (Lafferty et al., 2001)

$$P(\mathbf{y} \mid \mathbf{x}; \theta) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$

- $T$ is the number of tokens in the sequence.
- $K$ is the number of features used in the model.
- $\theta$ is a vector of parameters to be estimated so that the model fits the data.
- $Z(\mathbf{x})$ is an instance-specific normalization function:

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$
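To make the formula concrete, here is a minimal Python sketch (not from the original slides) that scores a label sequence and computes its conditional probability, with $Z(\mathbf{x})$ obtained by brute-force enumeration over all label sequences; the feature functions, the weights, and the START convention are illustrative assumptions.

```python
import itertools
import math

def sequence_score(y, x, feature_funcs, theta):
    """Unnormalized score: sum_t sum_k theta_k * f_k(y_{t-1}, y_t, x_t)."""
    total = 0.0
    prev = "START"
    for t, x_t in enumerate(x):
        for k, f_k in enumerate(feature_funcs):
            total += theta[k] * f_k(prev, y[t], x_t)
        prev = y[t]
    return total

def conditional_probability(y, x, feature_funcs, theta, labels):
    """p(y | x) = exp(score(y, x)) / Z(x), with Z(x) computed by brute force.

    Only feasible for tiny sequences; the forward-backward algorithm
    (later slides) computes Z(x) in polynomial time."""
    num = math.exp(sequence_score(y, x, feature_funcs, theta))
    Z = sum(math.exp(sequence_score(list(cand), x, feature_funcs, theta))
            for cand in itertools.product(labels, repeat=len(x)))
    return num / Z
```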
Feature Functions

$$P(\mathbf{y} \mid \mathbf{x}; \theta) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$

$f_k$ is one of the $K$ feature functions:
- $f_{ij}(y, y', x_t) = 1_{\{y = i\}} 1_{\{y' = j\}}$ for each transition from state $i$ to state $j$.
- $f_{io}(y, y', x_t) = 1_{\{y = i\}} 1_{\{x = o\}}$ for each state-observation pair $(i, o)$.
- $1_{\{y = i\}}$ is an indicator function that returns 1 if $y = i$ and 0 otherwise.
- $x_t$ is the feature vector at time $t$; it contains all the components needed to compute features at time $t$. If we want to use $x_{t+1}$ (the next word) as a feature, then the feature vector $x_t$ is assumed to include $x_{t+1}$.
Feature Functions (cont'd)
[Figure: a classical linear-chain CRF over labels $y_1, y_2, y_3$ and observations $x_1, x_2, x_3$.]
The graph above is a classical linear-chain CRF. We can modify and extend this model as needed. For example:

$$f_{ij}(y, y', x_t) = 1_{\{y = i\}} 1_{\{y' = j\}} 1_{\{x = o\}}$$

e.g.

$$f(y_i, y_{i-1}, x_i) = \begin{cases} 1 & \text{if } x_i = \text{John},\ y_{i-1} = \text{O},\ y_i = \text{NN} \\ 0 & \text{otherwise} \end{cases}$$
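A hedged sketch of the two families of indicator features above, plus the "John / O / NN" example from the slide; the function names and the sample label/word values are hypothetical.

```python
def transition_feature(i, j):
    """Returns f_ij: fires when the previous label is i and the current label is j."""
    return lambda y_prev, y, x_t: 1.0 if (y_prev == i and y == j) else 0.0

def emission_feature(i, o):
    """Returns f_io: fires when the current label is i and the current observation is o."""
    return lambda y_prev, y, x_t: 1.0 if (y == i and x_t == o) else 0.0

def john_feature(y_prev, y, x_t):
    """The slide's example: fires when the word is 'John', the previous label is 'O'
    and the current label is 'NN'."""
    return 1.0 if (x_t == "John" and y_prev == "O" and y == "NN") else 0.0

# Hypothetical usage:
# feature_funcs = [transition_feature("O", "NN"), emission_feature("NN", "John"), john_feature]
```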
Training

To estimate the parameters $\theta$ we maximize the conditional log-likelihood:

$$\ell(\theta) = \sum_{i=1}^{N} \log p(\mathbf{y}^{(i)} \mid \mathbf{x}^{(i)})$$

Substituting the CRF model into the likelihood and adding a regularization term with parameter $\frac{1}{2\sigma^2}$, we obtain:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(\mathbf{x}^{(i)}) - \sum_{k=1}^{K} \frac{\theta_k^2}{2\sigma^2}$$

In general this function cannot be maximized in closed form. The partial derivatives are:

$$\frac{\partial \ell}{\partial \theta_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', x_t^{(i)}) \, p(y, y' \mid \mathbf{x}^{(i)}) - \frac{\theta_k}{\sigma^2}$$
Training (cont'd)
The function $\ell(\theta)$ is concave, that is, every local optimum is a global optimum. The log-likelihood can be maximized with several numerical algorithms:
- Gradient method of steepest ascent
- Iterative scaling
- Newton's method
- Quasi-Newton methods: BFGS and limited-memory BFGS
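Limited-memory BFGS is a commonly used choice in practice. A minimal sketch, assuming we already have hypothetical helpers neg_log_likelihood(theta) and neg_gradient(theta) that return the negated regularized objective and its gradient (computed with forward-backward, next slides); it simply hands them to SciPy's L-BFGS implementation.

```python
import numpy as np
from scipy.optimize import minimize

def train_crf(neg_log_likelihood, neg_gradient, num_features):
    """Maximize l(theta) by minimizing -l(theta) with limited-memory BFGS.

    neg_log_likelihood(theta) -> float and neg_gradient(theta) -> ndarray
    are assumed to be provided by the caller (hypothetical helpers)."""
    theta0 = np.zeros(num_features)
    result = minimize(neg_log_likelihood, theta0, jac=neg_gradient,
                      method="L-BFGS-B")
    return result.x  # estimated parameter vector theta
```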
Forward-Backward Algorithm

To solve two main problems in the CRF setting, the evaluation of $Z(\mathbf{x})$ and the computation of the marginal distributions needed in the gradient, we use two algorithms belonging to the same class, the forward-backward class.

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$

We define a function $g$ as:

$$g_t(y_{t-1}, y_t) = \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t)$$

Since $\theta$ and $\mathbf{x}$ are fixed, $g_t$ can be given $y_{t-1}$ and $y_t$ as its only parameters.
Forward-Backward Algorithm (cont'd) (Wallach, 2004)

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{t=1}^{T} \exp\left( g_t(y_{t-1}, y_t) \right)$$

For each $t$ from 1 to $T+1$ we define a $(|Y|+2) \times (|Y|+2)$ matrix, called the transition matrix:

$$M_t(u, v) = \exp g_t(u, v)$$

$M_1(u, v)$ is defined only for $u = \text{START}$, and $M_{T+1}(u, v)$ is defined only for $v = \text{END}$. Given this, all we have to do is multiply the matrices together, e.g. $M_1 M_2 = M_{1,2}$, then $M_{1,2} M_3$, and so on, and take the $(\text{START}, \text{END})$ entry of the resulting matrix.
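A minimal numpy sketch of this matrix-product view of $Z(\mathbf{x})$, assuming a hypothetical g(t, u, v) that returns the score $g_t(u, v)$ and that the label alphabet has been extended with START and END indices.

```python
import numpy as np

def partition_function(g, T, num_labels, start, end):
    """Z(x) as the (START, END) entry of the product M_1 M_2 ... M_{T+1},
    where M_t(u, v) = exp(g(t, u, v)). num_labels counts the ordinary labels
    plus the START and END states."""
    product = np.eye(num_labels)
    for t in range(1, T + 2):              # t = 1 .. T+1
        M_t = np.zeros((num_labels, num_labels))
        for u in range(num_labels):
            for v in range(num_labels):
                M_t[u, v] = np.exp(g(t, u, v))
        product = product @ M_t
    return product[start, end]
```

Taking only the (START, END) entry of the full product implicitly enforces that $M_1$ is used from the START row and $M_{T+1}$ into the END column.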
Forward-Backward Algorithm (cont'd)

To compute the marginals in the gradient, we use the forward-backward algorithm as in HMMs. We have to compute the expectation of $f_k$ under the model distribution:

$$\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', x_t^{(i)}) \, p(y, y' \mid \mathbf{x}^{(i)})$$

Base cases:

$$\alpha_0(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \text{START} \\ 0 & \text{otherwise} \end{cases}$$

$$\beta_{T+1}(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \text{END} \\ 0 & \text{otherwise} \end{cases}$$
Forward-Backward Algorithm (cont'd)

Recurrence relations:

$$\alpha_t(\mathbf{x})^{\top} = \alpha_{t-1}(\mathbf{x})^{\top} M_t(\mathbf{x})$$

$$\beta_t(\mathbf{x}) = M_{t+1}(\mathbf{x}) \, \beta_{t+1}(\mathbf{x})$$

From these we can write:

$$p(y_{t-1} = y', y_t = y \mid \mathbf{x}) = \frac{\alpha_{t-1}(y' \mid \mathbf{x}) \, M_t(y', y \mid \mathbf{x}) \, \beta_t(y \mid \mathbf{x})}{Z(\mathbf{x})}$$
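A hedged numpy sketch of the recurrences above: it builds the transition matrices, runs the forward and backward passes, and returns $Z(\mathbf{x})$ together with the pairwise marginals used in the gradient. As before, g(t, u, v), the START/END indices, and the dense-matrix representation are illustrative assumptions.

```python
import numpy as np

def forward_backward(g, T, num_labels, start, end):
    """Compute alpha, beta, Z(x) and pairwise marginals for one sequence."""
    # Transition matrices M_t(u, v) = exp(g_t(u, v)) for t = 1 .. T+1.
    M = [np.exp([[g(t, u, v) for v in range(num_labels)]
                 for u in range(num_labels)]) for t in range(1, T + 2)]

    # Forward pass: alpha_0 is an indicator on START.
    alpha = np.zeros((T + 2, num_labels))
    alpha[0, start] = 1.0
    for t in range(1, T + 2):
        alpha[t] = alpha[t - 1] @ M[t - 1]

    # Backward pass: beta_{T+1} is an indicator on END.
    beta = np.zeros((T + 2, num_labels))
    beta[T + 1, end] = 1.0
    for t in range(T, -1, -1):
        beta[t] = M[t] @ beta[t + 1]

    Z = alpha[T + 1, end]

    # Pairwise marginals p(y_{t-1} = u, y_t = v | x) for t = 1 .. T+1.
    marginals = np.zeros((T + 2, num_labels, num_labels))
    for t in range(1, T + 2):
        marginals[t] = np.outer(alpha[t - 1], beta[t]) * M[t - 1] / Z
    return alpha, beta, Z, marginals
```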
Viterbi Algorithm

To determine the most probable sequence of labels $y_1, \ldots, y_T$ we use the Viterbi algorithm:

$$\mathbf{y}^{*} = \arg\max_{\mathbf{y}} \sum_{t=1}^{T} g_t(y_{t-1}, y_t)$$

We define:

$$\mathrm{SCORE}(y_1, \ldots, y_p) = \sum_{t=1}^{p} g_t(y_{t-1}, y_t)$$

- $U(p)$ = score of the best sequence $y_1, \ldots, y_p$
- $U(p, v)$ = score of the best sequence $y_1, \ldots, y_p$ with $y_p = v$
Viterbi Algorithm (cont'd)

Formally:

$$U(p, v) = \max_{y_1, \ldots, y_{p-1}} \left[ \sum_{t=1}^{p-1} g_t(y_{t-1}, y_t) + g_p(y_{p-1}, v) \right]$$

Base case:

$$U(0, v) = \begin{cases} 0 & \text{if } v = \text{START} \\ -\infty & \text{otherwise} \end{cases}$$

We can then proceed recursively:

$$U(p, v) = \max_{y_{p-1}} \left[ U(p-1, y_{p-1}) + g_p(y_{p-1}, v) \right] \quad \forall v$$

Our goal is to find $U(T+1, \text{END})$.
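A minimal Viterbi sketch under the same assumptions as the forward-backward code (hypothetical g(t, u, v) scores and START/END indices); it fills the $U(p, v)$ table and backtracks to recover the best label sequence.

```python
import numpy as np

def viterbi(g, T, num_labels, start, end):
    """Return the most probable label sequence y_1 .. y_T (as label indices)."""
    U = np.full((T + 2, num_labels), -np.inf)     # U[p, v]
    backpointer = np.zeros((T + 2, num_labels), dtype=int)
    U[0, start] = 0.0                             # base case

    for p in range(1, T + 2):                     # positions 1 .. T+1
        for v in range(num_labels):
            scores = U[p - 1] + np.array([g(p, u, v) for u in range(num_labels)])
            backpointer[p, v] = int(np.argmax(scores))
            U[p, v] = scores[backpointer[p, v]]

    # Backtrack from U(T+1, END) to recover y_T, ..., y_1.
    path = []
    v = end
    for p in range(T + 1, 1, -1):
        v = backpointer[p, v]
        path.append(v)
    path.reverse()
    return path
```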
General CRF

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p)$$

where each factor is parametrized as:

$$\Psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p) = \exp\left( \sum_{k=1}^{K(p)} \theta_{pk} f_{pk}(\mathbf{x}_c, \mathbf{y}_c) \right)$$

$$Z(\mathbf{x}) = \sum_{\mathbf{y}} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(\mathbf{x}_c, \mathbf{y}_c; \theta_p)$$

$\mathcal{C} = \{C_1, \ldots, C_P\}$, where each $C_p$ is a clique template whose parameters are tied.
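As a quick sanity check (not on the original slide), the linear-chain CRF of the previous section is recovered as the special case with a single clique template whose cliques are the pairs $(y_{t-1}, y_t)$ together with $x_t$, with parameters tied across positions:

```latex
% Linear-chain CRF as a special case of the general template form:
% one clique template C_1 = { (y_{t-1}, y_t, x_t) : t = 1..T }, parameters tied across t.
\[
  p(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \prod_{t=1}^{T} \Psi_t(y_{t-1}, y_t, x_t),
  \qquad
  \Psi_t(y_{t-1}, y_t, x_t)
  = \exp\!\Big( \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t, x_t) \Big).
\]
```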
Important Tasks

Being a versatile learning framework, CRFs can be employed in several natural language tasks:
- Sequence labeling, e.g. Part-of-Speech tagging (linear-chain)
- Information extraction, e.g. Named Entity Recognition (skip-chain, semi-Markov)
- Text classification (multilabel CRF)
Skip-Chain CRF (Sutton and McCallum, 2004)
[Figure: a skip-chain CRF; in addition to the linear chain over labels $y_{t-1}, y_t, y_{t+1}, \ldots$ and observations $x_t$, long-distance "skip" edges connect the labels of distant, related tokens.]

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, \mathbf{x}) \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, \mathbf{x})$$

$$\Psi_t(y_t, y_{t-1}, \mathbf{x}) = \exp\left( \sum_{k} \theta_{1k} f_{1k}(y_t, y_{t-1}, x_t) \right)$$

$$\Psi_{uv}(y_u, y_v, \mathbf{x}) = \exp\left( \sum_{k} \theta_{2k} f_{2k}(y_u, y_v, \mathbf{x}, u, v) \right)$$
Semi-Markov CRF (Sarawagi and Cohen, 2005)

$$p(\mathbf{s} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{j=1}^{|\mathbf{s}|} \sum_{k=1}^{K} \theta_k g_k(y_j, y_{j-1}, \mathbf{x}, t_j, u_j) \right)$$

$\mathbf{s} = (s_1, \ldots, s_p)$ denotes a segmentation of $\mathbf{x}$, where $s_j = (t_j, u_j, y_j)$: $t_j$ is a start position, $u_j$ is an end position, and $y_j$ is the label of that segment.

Semi-Markov CRFs employ a modified version of the Viterbi algorithm that takes the segment model into account.
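A hedged sketch of what such a segment-level Viterbi might look like, assuming a hypothetical score(t, u, y_prev, y) that returns $\sum_k \theta_k g_k(y, y_{\text{prev}}, \mathbf{x}, t, u)$ for a candidate segment covering positions $t..u$, and a maximum segment length to keep the search bounded.

```python
import numpy as np

def semi_markov_viterbi(score, T, num_labels, max_len):
    """Segment-level Viterbi sketch.

    V[u, y] = best score of a segmentation of x_1..x_u whose last segment
    has label y."""
    V = np.full((T + 1, num_labels), -np.inf)
    back = {}                       # (u, y) -> (end of previous segment, its label)
    V[0, :] = 0.0                   # empty prefix; for simplicity any "previous"
                                    # label scores 0 here (a START state could be used)

    for u in range(1, T + 1):
        for y in range(num_labels):
            for t in range(max(1, u - max_len + 1), u + 1):     # segment t..u
                for y_prev in range(num_labels):
                    s = V[t - 1, y_prev] + score(t, u, y_prev, y)
                    if s > V[u, y]:
                        V[u, y] = s
                        back[(u, y)] = (t - 1, y_prev)

    # Backtrack the best segmentation as (start, end, label) triples.
    segments = []
    u, y = T, int(np.argmax(V[T]))
    while u > 0:
        t_prev, y_prev = back[(u, y)]
        segments.append((t_prev + 1, u, y))
        u, y = t_prev, y_prev
    return list(reversed(segments))
```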
Multilabel CRF (Ghamrawi and McCallum, 2005)

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k} \theta_k f_k(\mathbf{x}, \mathbf{y}) + \sum_{k'} \theta_{k'} f_{k'}(\mathbf{y}) \right)$$

$k \in \{ \langle v_i, y_j \rangle : 1 \le i \le |V|,\ 1 \le j \le |Y| \}$
$k' \in \{ \langle y_i, y_j, q \rangle : q \in \{0, 1, 2, 3\},\ 1 \le i, j \le |Y| \}$

$\mathbf{y}$ is a subset of the label set $Y$, represented by a vector of length $|Y|$.

[Figure: factor graph with label nodes $y_1, y_2$ connected to each other and to observations $x_1, x_2, x_3$.]
Multilabel CRF (cont'd)

$$p(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\left( \sum_{k} \theta_k f_k(\mathbf{x}, \mathbf{y}) + \sum_{k'} \theta_{k'} f_{k'}(\mathbf{x}, \mathbf{y}) \right)$$

$k \in \{ \langle v_i, y_j \rangle : 1 \le i \le |V|,\ 1 \le j \le |Y| \}$
$k' \in \{ \langle v_i, y_j, y_{j'} \rangle : 1 \le i \le |V|,\ 1 \le j, j' \le |Y| \}$

[Figure: factor graph with label nodes $y_1, y_2$ and observations $x_1, x_2, x_3$; here the pairwise label factors also depend on the observations.]
Implementations
- CRF++ by Taku Kudo: http://crfpp.sourceforge.net/
- CRF Project by Sunita Sarawagi: http://crf.sourceforge.net/
- Stanford CRF: http://nlp.stanford.edu/software/crf-ner.shtml
- Mallet by Andrew McCallum: http://mallet.cs.umass.edu/index.php
Conclusions
- CRFs take the best of HMMs and Maximum Entropy models
- State of the art in many NLP tasks
- A rich framework, like HMMs
Questions?
Thanks for your attention!