Fall 2010 Graduate Course on Dynamic Learning

Fall 2010 Graduate Course on Dynamic Learning Chapter 8: Conditional Random Fields November 1, 2010 Byoung-Tak Zhang School of Computer Science and Engineering & Cognitive Science and Brain Science Programs Seoul National University http://bi.snu.ac.kr/~btzhang/

Overview Motivating Applications Sequence Segmentation and Labeling Generative vs. Discriminative Models HMM, MEMM CRF From MRF to CRF Learning Algorithms HMM vs. CRF 2

Pos Tagging Motivating Application: Sequence Labeling [He/PRP] [reckons/vbz] [the/dt] [current/jj] [account/nn] [deficit/nn] [will/md] [narrow/vb] [to/to] [only/rb] [#/#] [1.8/CD] [billion/cd] [in/in] [September/NNP] [./.] Term Extraction Rockwell International Corp. s Tulsa unit said it signed a tentative agreement extending its contract with Boeing Co. to provide structural parts for Boeing s 747 jetliners. Information Extraction from Company Annual Report 3

Sequence Segmenting and Labeling Goal: mark up sequences with content tags Computational linguistics Text and speech processing Topic segmentation Part-of-speech (POS) tagging Information extraction Syntactic disambiguation Computational biology DNA and protein sequence cealignment Sequence homolog searching in databases Protein secondary structure prediction RNA secondary structure t analysis

Binary Classifier vs. Sequence Case restoration Labeling jack utilize outlook express to retrieve emails E.g. SVMs vs. CRFs - Jack utilize outlook express to retrieve emails. + Jack Utilize Outlook Express To Receive Emails jack utilize outlook express to receive emails JACK UTILIZE OUTLOOK EXPRESS TO RECEIVE EMAILS 5

Sequence Labeling Models: Overview HMM Generative model E.g. Ghahramani (1997), Manning and Schutze (1999) MEMM Conditional model E.g. Berger and Pietra (1996), McCallum and Freitag (2000) CRFs Conditional model without t label l bias problem Linear-Chain CRFs E.g. Lafferty and McCallum (2001), Wallach (2004) Non-Linear Chain CRFs Modeling more complex interaction between labels: DCRFs, 2D-CRFs E.g. Sutton and McCallum (2004), Zhu and Nie (2005) 6

Generative Models: HMM Based on joint probability distribution P(y,x) ) Includes a model of P(x) which is not needed for classification Interdependent features either enhance model structure to represent them ( complexity problems) or make simplifying independence assumptions (e.g. naive Bayes) Hidden Markov models (HMMs) and stochastic ti grammars Assign a joint probability to paired observation and label sequences The parameters typically trained to maximize the joint likelihood lih of train examples

Hidden Markov Model n Psx (, ) = Ps ( 1) Px ( 1 s1) Ps ( i si 1 ) Px ( i si ) i= 2 Cannot represent multiple interacting (overlapping) features or long range dependences between observed elements. 8

Conditional Models Difficulties and disadvantages of generative models Need to enumerate all possible observation sequences Not practical to represent multiple interacting features or long-range dependencies of the observations Very strict independence assumptions on the observations Conditional Models Conditional probability P(y x) rather than joint probability P(y, x) where y = label sequence and x = observation sequence. Based directly on conditional probability P(y x) Need no model for P(x) Specify the probability of possible label sequences given an observation sequence Allow arbitrary, non-independent features on the observation sequence X The probability of a transition between labels may depend on past and future observations Relax strong independence assumptions in generative models

Discriminative Models: MEMM Maximum entropy Markov model (MEMM) Exponential model Given a training set (X, Y) of observation sequences X and label sequences Y: Train a model θ that maximizes P(Y X, θ) For a new data sequence x, the predicted label y maximizes P(y x, θ) Notice the per-state normalization MEMMs have all the advantages of conditional models Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ( conservation of score mass ) Subject tto Label Bias Problem Bias toward states with fewer outgoing transitions

Maximum Entropy Markov Model n P ( s x ) = P ( s 1 x1) P ( si si 1, xi ) i= 2 Label bias problem: the probability transitions leaving any given state must sum to one 11

Conditional Markov Models (CMMs) aka MEMMs aka Maxent Taggers vs. HMMs Pso (, ) = Ps ( i si 1 ) Po ( i si 1 ) i O t-1 S t-1 S t S t+1... O t O t+1 P ( s o ) = P ( si si 1, oi 1 ) i O t-1 S t-1 S t S t+1... O t O t+1 12

MEMM to CRFs i i j j j 1 i 1 n 1 n = j j 1, j = j j Zλ ( xj ) Py (... y x... x) Py ( y x) exp( λ f ( x, y, y )) exp( λ if i( xy, )) i =, where F ( xy, ) = f ( x, y, y 1) Z ( x ) j λ j i i j j j j New model exp( λ if i ( x, y )) i Z λ ( x ) 13

HMM, MEMM, and CRF in Comparison HMM MEMM CRF

Conditional Random Field (CRF)

Random Field

Markov Random Field Random Field: Let F = { F1, F2,..., FM } be a family of random variables defined on the set S, in which each random variable Fi takes a value fi in a label set L. The family F is called a random field. Markov Random Field: F is said to be a Markov random field on S with respect to a neighborhood system N if and only if it satisfies the Markov property. undirected graph for joint probability p(x) allows no direct probabilistic bili interpretation t ti define potential functions Ψ on maximal cliques A map pjoint assignment to non-negative real number requires normalisation p(x) ) = 1 Z Ψ A (x A ) Z = Ψ A (x A ) A x A

Conditional Random Field: CRF Conditional probabilistic sequential models p(y x) Undirected graphical models Joint probability of an entire label sequence given a particular observation sequence Weights of different features at different states can be traded off against each other p(y x) = 1 Ψ A (x A,yy A ) Z(x) = Ψ A (x A, y A ) Z A y A 18

Example of CRFs

Definition of CRFs X is a random variable over data sequences to be labeled X is a random variable over data sequences to be labeled Y is a random variable over corresponding label sequences

Conditional Random Fields (CRFs) CRFs have all the advantages of MEMMs without label bias problem MEMM uses per-state exponential model for the conditional probabilities of next states given the current state CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence Undirected acyclic graph Allow some transitions vote more strongly than others gy depending on the corresponding observations

Conditional Random Field Graphical structure of a chain-structured CRFs for sequences. The variables corresponding to unshaded nodes are not generated by the model. Conditional Random Field: a Markov random field (Y) globally conditioned on another random field (X).

Conditional Random Field undirected graphical model globally conditioned on X Given an undirected d graph G = (V, E) such hthat ty Y ={Y v v V }, if p( Y X, Y, u v,{ u, v} V) p( Y X, Y,( u, v) E) v u v u The probability of Y v given X and those random variables corresponding to nodes neighboring v in G. Then (X, Y) is a conditional random field. 23

Conditional Distribution If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y=ygivenX=x y, = x, by fundamental theorem of random fields is: p θ (y x) exp λk fk( e, y e, x) + μkgk( v, y v, x) e E,k v V,k x is a data sequence y is a label sequence v is a vertex from vertex set V = set of label random variables e is an edge from edge set E over V f k and g k are given and fixed. g k is a Boolean vertex feature; f k is a Boolean edge feature k is the number of features θ = ( λ1, λ2,, λn; μ1, μ2,, μn); λk andμk are parameters to be estimated y e is the set of components of y defined by edge e y v is the set of components of y defined by vertex v

Conditional Distribution (cont d) CRFs use the observation-dependent normalization Z(x) for the conditional distributions: 1 pθ (y x) = exp λ f ( e,y,x) + μ g ( v,y Z(x) e E,k v V,k k k e k k v,x) Z(x) is a normalization over the data sequence x

Conditional Random Fields /k/ /k/ /iy/ /iy/ /iy/ X X X X X CRFs are based on the idea of Markov Random Fields Transition functions add associations between transitions from Modelled as one an label undirected to another graph State functions connecting help determine labels with the observations identity of the state Observations in a CRF are not modelled as random variables

Conditional Random Fields P ( y x ) = exp ( λ f ( x, y ) + μ g ( x, y, y i i t j j t t t i j Z( x) 1 )) Hammersley-Clifford Theorem states that a State Feature Weight Transition Feature Transition Feature Function State Weight Feature Function random λ=10 field is an MRF iff it can be μ=4 g(x, described /iy/,/k/) in f([x is stop], /t/) One the possible above weight value form for this state feature One possible One possible weight state value One possible transition feature feature function The function (Strong) exponential for this transition is For the our sum attributes feature of the and clique labels potentials of Indicates /k/ followed by /iy/ the undirected graph

Conditional Random Fields Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label A positive value if the attribute appears in the data A zero value if the attribute is not in the data Each feature function carries a weight that gives the strength of that feature function for the proposed label High positive weights indicate a good association between the feature and the proposed label High negative weights indicate a negative association between the feature and the proposed label Weights close to zero indicate the feature has little or no impact on the identity of the label

Formally. Definition CRF is a Markov random field. By the Hammersley-Clifford theorem, the probability of a label can be expressed as a Gibbs distribution, so that 1 py ( x, λμ, ) = exp( λ jf j( y, x)) Z n j y = f j y c i= 1 F ( y, x ) f ( y, xi, ) What is clique? By only taking consideration of the one-node and two-node cliques, we have j clique 1 py ( x, λμ, ) = exp( λjtj( y e, xi,) + μksk( y s, xi,)) Z j k 29

Definition (cont.) Moreover, let us consider the problem in a first-order chain model, we have 1 p( y x, λ, μ) = exp( λjtj( yi 1, yi, x,) i + μksk( yi, x,)) i Z j k For simplifying description, let f j (y, x) denote t j (y i-1, y i, x, i) and s k (y i, x, i) 1 py ( x, λμ, ) = exp( λjfj( yx, )) Z n F ( y, x) = f ( y, x, i) j j c i= 1 j 30

Labeling In labeling, the task is to find the label sequence that has the largest probability Then the key is to estimate the parameter lambda yˆ = arg max p ( y x) = arg max( λ F( y, x)) y λ 1 py ( x, λμ, ) = exp( λjfj( yx, )) Z j y 31

Optimization Defining a loss function that should be convex for avoiding local optimization Defining constraints Finding a optimization method to solve the loss function A formal expression for optimization problem min θ f( x) s.. t gi ( x) 0,0 i k h ( x ) = 0,0 j l j 32

Loss Function Empirical loss vs. structural loss minimize L = y f( x, λ) k minimize L = λ + y f( x, λ) k Loss function: Log-likelihood 1 py ( x, λμ, ) = exp( λjfj( yx, )) Z L Z F y x ( k) ( k) ( λ) = log + λj j(, ) k j ( k) ( k) ( k) λ Lλ = λ F( y, x ) log Zλ( x ) + const 2 k 2σ j 2 33

Parameter Estimation Log-likelihood ( k) ' δ L λ ( k ) ( k ) ( Zλ ( x )) = Fj ( y, x ) ( k) δλ j k ( ) ( k) ( k) Zλ x ( λ) = log + λj j(, ) ( k) ( k) = λ L Z F y x k j Differentiating the log-likelihood with respect to parameter λ j δ Z ( x ) exp F( y, x ) λ y ( ( k ) ( k exp( λ F( y, x )) F (, ) j y x )) ( k) ' ( Zλ ( x )) y = ( k ) ( k ) Z ( x ) exp λ F( y, x ) ( k) δ L ( k ) EpYX (, )[ F j( Y, X)] E ( k ) [ F (, )] pyx (, ) j Y x λ δλ = exp( λ F( y, x ) ( ) ( ) ) (, k = F k j y x ) j k y exp λ F( y, x ) y By adding the model penalty, it can be ( k ) ( k ) = ( p( y x ) Fj ( y, x )) rewritten as y δ L ( k) ( k ) λ = E ( k ) E F (, ) pyx ( ) j Y x [ F ( Y, X)] E ( k ) [ F (, )] Y x = pyx (, ) j pyx (, λ ) j 2 δλ σ j k λ y 34

Optimization ( k) ( k) L( λ) = log Z + λjfj( y, x ) k j δ L py (, X) j ( k ) pyx (, λ ) j δλ = E E j ( k ) [ (, )] [ (, )] F Y X F Y x E p(y,x) F j j(y, (y,x) can be calculated easily E p(y x) F j (y,x) can be calculated by making use of a forward-backward algorithm Z can be estimated in the forward-backward algorithm k 35

Calculating the Expectation First we define the transition matrix of y for position x as M [ y, y ] = exp λ f( y, y, x, i) i i 1 i i 1 i E ( k ) F (, ) ( ) (, ) p ( Y x ) j Y x = p y x Fj yx λ n = i= 1 y, y α = i i 1 i ( k ) ( k ) λ y ( k) ( k) i 1 i j i 1 i py (, y x ) f ( y, y, x ) T i 1( fi Mi Vi) βi Z λ ( x ) n+ 1 Zλ ( x) = Mi( x) = αn 1 i=11 T i α im i 0 < i n αi = 1 i = 0 T M 1 β 1 1 i < n = 1 i = n T i+ i+ β i py ( x ) = T ( k) αi 1βi Z λ ( x) All state features at position i 36

First-order Numerical Optimization Using Iterative Scaling g( (GIS, IIS) Initialize each λ j (= 0 for example) Until convergence δ L - Solve 0 for each parameter λ j = p δλ j j - Update each parameter using λ j λ j + λ j 37

Second-order Numerical Optimization Using newton optimization technique for the parameter estimation 2 ( 1) ( ) 1 ( L k k ) L + λ = λ + 2 λ λ Drawbacks: parameter value initialization And compute the second order (i.e. Hesse matrix), that is difficult Solutions: - Conjugate-gradient (CG) (Shewchuk, 1994) - Limited-memory quasi-newton (L-BFGS) (Nocedal and Wright, 1999) - Voted Perceptron (Colloins 2002) 38

Summary of CRFs Model Lafferty, 2001 Applications Efficient training (Wallach, 2003) Training via. Gradient Tree Boosting (Dietterich, 2004) Bayesian Conditional Random Fields (Qi, 2005) Name entity (McCallum, 2003) Shallow parsing (Sha, 2003) Table extraction (Pinto, 2003) Signature extraction (Kristjansson, 2004) Accurate Information Extraction from Research Papers (Peng, 2004) Object Recognition (Quattoni, 2004) Identify Biomedical Named Entities (Tsai, 2005) Limitationit ti Huge computational cost in parameter estimation 39

HMM vs. CRF HMM arg max P( φ S) φ = arg max P ( φ ) P ( S φ ) φ ( P y y P s y ) = arg max log trans ( i i 1) emit ( i i ) φ y i φ CRF arg max P ( φ S ) φ 1 λ f ( c, S) = arg max e φ Z = arg max λi fi( cs, ) φ ci, 1. Both optimizations are over sums this allows us to use any of the dynamic programming HMM/GHMM decoding algorithms for fast, memory-efficient parsing, with the CRF scoring scheme used in place of the HMM/GHMM scoring scheme. 2. The CRF functions f i (cs) (c,s) may in fact be implemented using any type of sensor, including such probabilistic sensors as Markov chains, interpolated Markov models (IMM s), decision trees, phylogenetic models, etc..., as well as any non-probabilistic sensor, such as n-mer counts or binary indicators.

Appendix

Markov Random Fields A(di (discrete-valued) d)markov random field ld(mrf) is a 4-tuple 4t M=(α, ( X, P M, G) where: α is a finite alphabet, X is a set of (observable or unobservable) variables taking values from α, P M is a probability distribution on variables in X, G=(X, E) is an undirected graph on X describing a set of dependence relations among variables, such hthat tp P M (X i {X k i }) = P M (X i N G (X i )), for N G (X i )th the neighbors of fxx i under G. That is, the conditional probabilities as given by P M must obey the dependence relations (a generalized Markov assumption ) given by the undirected graph G. A problem arises when actually inducing such a model in practice namely, that we can t just set the conditional probabilities P M (X i N G (X i )) arbitrarily and expect the joint probability P M (X) to be well-defined (Besag, 1974). h h bl f l ll f h hb h d Thus, the problem of estimating parameters locally for each neighborhood is confounded by constraints at the global level...

The Hammersley-Clifford Theorem Suppose P(x)>0 for all (joint) value assignments x to the variables in X. Then by the Hammersley-Clifford theorem, the likelihood of x under model M is given by: for normalization term Z: P 1 M (x) = e Z Z = e Q( where Q(x) ) has a unique expansion given by: x Q( x) What Q(x 0, x 1,..., x n 1 ) = x i Φ i (x i ) + x i x j Φ i,j (x i, x j )+... 0 i<n 0 i< j<n x )...+ x 0 x 1...x n 1 Φ 0,1,...,n 1 (x 0, x 1,..., x n 1 ) What is a clique? A clique is any subgraph in which all vertices are neighbors. and where any Φ i term not corresponding to a clique must be zero. (Besag, 1974) The reason this is useful is that it provides a way to evaluate probabilities (whether joint or conditional) based on the local functions Φ. Thus, we can train an MRF by learning individual Φ functions one for each clique.

Conditional Random Fields A Conditional random field (CRF) isamarkov random field of unobservables which are globally conditioned on a set of observables (Lafferty et al., 2001): Formally, a CRF is a 6-tuple M=(L,α,Y,X,Ω,G) where: L is a finite output alphabet of labels; e.g., {exon, intron}, α is a finite input alphabet e.g., {A, C, G, T}, Y is a set of unobserved variables taking values from L, X is a set of (fixed) observed variables taking values from α, Ω = {Φ c : L Y α X } is a set of potential functions, Φ c (y,x), G=(V, E) is an undirected graph describing a set of dependence relations E among variables V = X Y, where E (X X)=, such that (α, Y, e ΣΦ(c,x) /Z, G-X) is a Markov random field. Note that: 1. The observables X are not included in the MRF part of the CRF, which is only over the subgraph G-X. However, the X are deemed constants, and are globally visible to the Φ functions. 2. We have not specified a probability function P M, but have instead given local clique-specific functions Φ c which together define a coherent probability distribution via Hammersley-Clifford.

CRF s versus MRF s A conditional random field is effectively an MRF plus a set of external variables X, where the internal variables Y of the MRF are the unobservables ( ) and the external variables X are the observables ( ): Y the MRF the CRF X fixed, observable, variables X (not in the MRF) Thus, we could denote a CRF informally as: C=(M, X) for MRF M and external variables X, with the understanding that the graph G X Y of the CRF is simply the graph G Y of the underlying MRF M plus the vertices X and any edges connecting these to the elements of G Y. Note that in a CRF we do not explicitly model any direct relationships between the observables (i.e., among the X) (Lafferty et al., 2001).