Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
Paper by John Lafferty, Andrew McCallum, and Fernando Pereira (ICML 2001)
Presentation by Joe Drish, May 9, 2002

Main Goals
- Present a new framework for labeling sequence data: Conditional Random Fields (CRFs)
- Describe the label bias problem
- Motivate the use of CRFs to solve the label bias problem
- Define the structure and properties of CRFs
- [Describe two training procedures for learning CRF parameters]
- [Sketch a proof of the convergence of the two training procedures]
- Compare experimentally to hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs)

Labeling Sequence Data
- Given observed data sequences X = {x1, x2, ..., xn} and a corresponding label sequence yk for each data sequence xk, so Y = {y1, y2, ..., yn}
- Prediction task: given a sequence x and a model θ, predict y
- Learning task: given training sets X and Y, learn the best model θ
Notation:
- set of sequences: X, uppercase
- sequence of observations: x, lowercase boldface, also called a data sequence
- sequence of labels: y, lowercase boldface, also called a sequence of states
- single observation: x, lowercase, also called a data value

Example: label and observation sequence
  label y       observation x
  <head>        X-NNTP-Poster: NewsHound v1.33
  <head>
  <head>        Archive-name: acorn/faq/part2
  <head>        Frequency: monthly
  <head>
  <question>    2.6) What configuration of serial cable should
  <question>    I use?
  <answer>      Here follows a diagram of the necessary
  <answer>      connections programs to work properly. They
  <answer>      are as far as I know agreed upon by commercial
  <answer>      comms software developers fo
  <answer>
  <answer>      Pins 1, 4, and 8 must be connected together
  <answer>      is to avoid the well known serial port chip bugs.
The left column is the label sequence y; the right column is the observation sequence x.

Generative Modeling (HMMs)
- Given a training set X with label sequences Y: train a model θ that maximizes P(X, Y | θ)
- For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ), which by Bayes' rule is proportional to P(x | y, θ) P(y | θ)
[Diagram: HMM graphical model with state nodes Y_i = y_i and Y_{i+1} = y_{i+1} and observation node X_{i+1} = x_{i+1}; arrows run from states to observations. Notation: A = a means the random variable A takes the outcome a.]
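As a rough illustration of the generative view (not from the paper), the sketch below computes the joint probability P(x, y | θ) of a toy first-order HMM as a product of start, transition, and emission probabilities. All state names, symbols, and probability tables are invented for the rib/rob example.

    # Minimal sketch (not from the paper): joint probability under a toy HMM.
    start_p = {"1": 0.5, "4": 0.5}                      # P(y_1)
    trans_p = {("1", "2"): 1.0, ("2", "3"): 1.0,        # P(y_i | y_{i-1})
               ("4", "5"): 1.0, ("5", "3"): 1.0}
    emit_p  = {("1", "r"): 1.0, ("2", "i"): 1.0,        # P(x_i | y_i)
               ("3", "b"): 1.0, ("4", "r"): 1.0, ("5", "o"): 1.0}

    def hmm_joint(x, y):
        """P(x, y) = P(y_1) P(x_1|y_1) * prod_i P(y_i|y_{i-1}) P(x_i|y_i)."""
        p = start_p.get(y[0], 0.0) * emit_p.get((y[0], x[0]), 0.0)
        for i in range(1, len(x)):
            p *= trans_p.get((y[i-1], y[i]), 0.0) * emit_p.get((y[i], x[i]), 0.0)
        return p

    print(hmm_joint("rib", ["1", "2", "3"]))   # 0.5 with these toy tables
    print(hmm_joint("rob", ["4", "5", "3"]))   # 0.5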

Discriminative Modeling (MEMMs)
- Given a training set X with label sequences Y: train a model θ that maximizes P(Y | X, θ)
- For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ)
[Diagram: MEMM graphical model with state nodes Y_i = y_i and Y_{i+1} = y_{i+1} and observation node X_{i+1} = x_{i+1}; arrows run from observations to states.]

rib/rob models
Training data: {<rib, 123>, <rob, 453>}
[Diagram: three finite-state topologies over states 0-5, one path (0, 1, 2, 3) for "rib" and one path (0, 4, 5, 3) for "rob":
- HMM: the observation depends on the state
- MEMM: the state depends on the observation
- CRF: non-causal model]

Label Bias Problem
Consider this MEMM:
[Diagram: the rib/rob MEMM with states 0, 1, 2, 3 on the "rib" branch and 0, 4, 5, 3 on the "rob" branch]
- The label sequence 1, 2 should score higher when "ri" is observed than when "ro" is observed; that is, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
- Mathematically:
  P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
- In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1, and hence P(2 | 1 and x) = 1 for all x.
- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri): the second observation is effectively ignored.
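To make the arithmetic above concrete, the toy script below (an illustration, not the paper's code) encodes MEMM-style locally normalized next-state distributions for the rib/rob topology and shows that P(1, 2 | ri) equals P(1, 2 | ro): because state 1 has a single successor, its outgoing distribution must sum to one no matter which symbol is observed.

    # Toy illustration of the label bias problem (not from the paper).
    def next_state_dist(prev_state, obs):
        # obs is deliberately unused below: from states 1 and 4 there is only
        # one successor, so local normalization forces probability 1 on it.
        if prev_state == "0":
            return {"1": 0.5, "4": 0.5}
        if prev_state == "1":
            return {"2": 1.0}
        if prev_state == "4":
            return {"5": 1.0}
        return {}

    def path_prob(states, obs_seq, start="0"):
        """P(states | obs_seq) as a product of locally normalized factors."""
        p, prev = 1.0, start
        for y, x in zip(states, obs_seq):
            p *= next_state_dist(prev, x).get(y, 0.0)
            prev = y
        return p

    print(path_prob(["1", "2"], "ri"))   # 0.5
    print(path_prob(["1", "2"], "ro"))   # 0.5 -- identical: the observation is wasted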

Changing the Set of States
Example: merge states 1 and 4 into a single state 14:
[Diagram: states 0, 14, 2, 5, 3 with observations r, i, o, b; the branch decision is made at state 14]
  P(14 and 2 | ri) = P(2 | 14 and ri) P(14 | ri) = P(2 | 14 and i) P(14 | r) = (1)(1) = 1
  P(14 and 2 | ro) = P(2 | 14 and ro) P(14 | ro) = P(2 | 14 and o) P(14 | r) = (0)(1) = 0
This is a solution to the label bias problem, but changing the set of states in this way would be impractical in general.

Conditional Random Fields (CRFs)
Disadvantages of MEMMs:
- P(y | x) is a product of factors, one for each label
- Each factor can depend only on the previous label, not on future labels
So instead, let
  P(y | x) ∝ exp( Σ_k λ_k f_k(y, x) )
where each f_k is a property of part of y and x.
Example: f_k(x, y) = 1 if X_i is uppercase and the label Y_i is a proper noun.
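A hedged sketch of what one such binary feature might look like in code; the function name, label tag, and arguments are invented for illustration, not taken from the paper.

    def f_cap_propernoun(x, y, i):
        # Fires when the word at position i is capitalized and its label is NNP.
        return 1.0 if x[i][:1].isupper() and y[i] == "NNP" else 0.0

    print(f_cap_propernoun(["Lafferty", "spoke"], ["NNP", "VBD"], 0))   # 1.0

The (unnormalized) CRF score of a labeling is then exp of the weighted sum of such features over the whole sequence.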

Random Field Example
- Let G = (Y, E) be a graph where each vertex Y_v is a random variable
- Suppose P(Y_v | all other Y) = P(Y_v | neighbors(Y_v)); then Y is a random field
Example:
[Diagram: six vertices Y_1 ... Y_6, with Y_5 adjacent to Y_4 and Y_6]
  P(Y_5 | all other Y) = P(Y_5 | Y_4, Y_6)

Conditional Random Field Example
- Suppose P(Y_v | X, all other Y) = P(Y_v | X, neighbors(Y_v)); then X together with Y is a conditional random field
[Diagram: observations X connected to a chain of labels Y_1 - Y_2 - Y_3 - Y_4 - Y_5]
  P(Y_3 | X, all other Y) = P(Y_3 | X, Y_2, Y_4)
Think of X as observations and Y as labels.

Conditional Distribution
If the graph over Y is a tree, the distribution over the label sequence Y = y, given X = x, is:

  p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} µ_k g_k(v, y|_v, x) )    (1)

- x is a data sequence outcome
- y is a label sequence outcome
- v is a vertex from the vertex set V = the set of label random variables
- e is an edge from the edge set E over V
- f_k and g_k are given and fixed features; each g_k is a property of x and the label random variable associated with vertex v
- k indexes the features
- Θ = (λ_1, λ_2, ...; µ_1, µ_2, ...); λ_k and µ_k are parameters to be estimated
- y|_e is the set of components of y defined by edge e
- y|_v is the component of y defined by vertex v

Matrix Random Variable
- Add special start and stop states: y_0 = start and y_{n+1} = stop
- e_i is the edge with labels (Y_{i-1}, Y_i); v_i is the vertex with label Y_i
- 𝒴 is the set of possible label values; if Y = y then each y_i ∈ 𝒴
- The matrix random variable has the range 𝒴 × 𝒴
Suppose that p_Θ(Y | X) is a CRF given by (1), and assume Y is a chain. For each position i in the observation sequence x, define a matrix random variable M_i(x) = [M_i(y', y | x)] by:

  M_i(y', y | x) = exp( Λ_i(y', y | x) )
  Λ_i(y', y | x) = Σ_k λ_k f_k(e_i, Y|_{e_i} = (y', y), x) + Σ_k µ_k g_k(v_i, Y|_{v_i} = y, x)

[Diagram: chain of labels Y_{i-1}, Y_i, Y_{i+1} over observations X_{i-1}, X_i, X_{i+1}]

Conditional Probability for CRFs
The conditional probability of a label sequence y is

  p_Θ(y | x) = ( Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) ) / Z_Θ(x)
  Z_Θ(x) = ( M_1(x) M_2(x) ... M_{n+1}(x) )_{start, stop}

- n is the length of the sequence, y = y_1 ... y_n
- y_0 = start and y_{n+1} = stop
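The sketch below is a minimal illustration of the formula above, not the paper's implementation: it builds the matrices M_i for a tiny chain CRF with an invented label set {A, B}, two invented features, and arbitrary weights, multiplies the matrices to obtain the normalizer, and evaluates p_Θ(y | x). The check at the end confirms the probabilities of all labelings sum to one.

    import numpy as np

    labels = ["start", "A", "B", "stop"]          # invented label set
    idx = {y: k for k, y in enumerate(labels)}

    def log_potential(y_prev, y, x, i):
        # Lambda_i(y', y | x): one invented edge feature plus one invented
        # vertex feature; the weights 0.8 and 1.2 are arbitrary.
        score = 0.8 if (y_prev, y) == ("A", "B") else 0.0
        if i <= len(x):
            score += 1.2 if (y == "A" and x[i-1] == "r") else 0.0
        return score

    def M(i, x):
        # Matrix random variable M_i(x) with entries exp(Lambda_i(y', y | x));
        # transitions back into start or out of stop are forbidden.
        m = np.zeros((len(labels), len(labels)))
        for yp in labels:
            for y in labels:
                if y != "start" and yp != "stop":
                    m[idx[yp], idx[y]] = np.exp(log_potential(yp, y, x, i))
        return m

    def prob(y_seq, x):
        # p(y | x) = prod_i M_i(y_{i-1}, y_i | x) / Z(x),
        # with Z(x) = [M_1(x) M_2(x) ... M_{n+1}(x)]_{start, stop}.
        ys = ["start"] + list(y_seq) + ["stop"]
        num, Z = 1.0, np.eye(len(labels))
        for i in range(1, len(ys)):              # i = 1 .. n+1
            Mi = M(i, x)
            num *= Mi[idx[ys[i-1]], idx[ys[i]]]
            Z = Z @ Mi
        return num / Z[idx["start"], idx["stop"]]

    x = "ro"
    print(sum(prob([a, b], x) for a in "AB" for b in "AB"))   # ~1.0: a proper distribution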

Parameter Estimation
- Input: training data D = {(x^(i), y^(i))}, i = 1 ... N, with empirical distribution p̃(x, y)
- Output: parameters Θ = (λ_1, λ_2, ...; µ_1, µ_2, ...)
- Maximize the log-likelihood objective function:

  O(Θ) = Σ_{i=1}^{N} log p_Θ(y^(i) | x^(i)) ∝ Σ_{x,y} p̃(x, y) log p_Θ(y | x)

  This is an expectation under the empirical distribution.
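A small sketch of the objective, reusing the prob() function from the matrix sketch above; the tiny training set is invented. The gradient of this objective is the difference between empirical and model-expected feature counts, which is what the iterative scaling training procedures described in the paper exploit.

    import math

    # Tiny invented training set; prob() comes from the sketch above.
    train = [("ro", ["A", "B"]), ("rr", ["A", "A"])]

    def log_likelihood():
        # O(Theta) = sum_i log p_Theta(y^(i) | x^(i))
        return sum(math.log(prob(list(y), x)) for x, y in train)

    print(log_likelihood())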

rib/rob models (revisited)
Training data: {<rib, 123>, <rob, 453>}
[Diagram: the same HMM, MEMM, and CRF topologies as before: HMM — observation depends on the state; MEMM — state depends on the observation; CRF — non-causal model]
Emission probabilities: P(r | 1) = 29/32, and P(i | 1) = P(o | 1) = P(b | 1) = 1/32

Modeling the label bias problem
- Each state emits its designated symbol with probability 29/32 and each of the other symbols with probability 1/32 (as in the figure above)
- The MEMM and the CRF are trained with the same topologies
- A run consists of 2,000 training examples and 500 test examples, trained to convergence
- The CRF error is 4.6%, while the MEMM error is 42%: the MEMM fails to discriminate between the two branches

Mixed-order HMM
State transition probabilities are given by
  p_α(y_i | y_{i-1}, y_{i-2}) = α p_2(y_i | y_{i-1}, y_{i-2}) + (1 − α) p_1(y_i | y_{i-1})
Emission probabilities are given by
  p_α(x_i | y_i, x_{i-1}) = α p_2(x_i | y_i, x_{i-1}) + (1 − α) p_1(x_i | y_i)
With α = 0 this is a standard first-order HMM: the transition probabilities reduce to p_1(y_i | y_{i-1}) and the emission probabilities to p_1(x_i | y_i).
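An illustrative sketch of how the interpolated transition distribution could be sampled; the function name and the placeholder tables p1 and p2 are assumptions, not the paper's generator.

    import random

    def sample_next_label(y_prev1, y_prev2, alpha, p1, p2):
        # Draw y_i from p_alpha(y_i | y_{i-1}, y_{i-2}) = alpha*p2 + (1-alpha)*p1.
        # p1 maps y_{i-1} -> {label: prob}; p2 maps (y_{i-1}, y_{i-2}) -> {label: prob}.
        dist = p2[(y_prev1, y_prev2)] if random.random() < alpha else p1[y_prev1]
        labels, weights = zip(*dist.items())
        return random.choices(labels, weights=weights, k=1)[0]

    # Example with placeholder two-label tables:
    p1 = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
    p2 = {(a, b): {"A": 0.5, "B": 0.5} for a in "AB" for b in "AB"}
    print(sample_next_label("A", "B", alpha=0.25, p1=p1, p2=p2))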

Modeling mixed-order sources
Experimental setup:
- For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing
- On each randomly generated training set, a CRF is trained using Algorithm S, which is described in the paper
- On the same data an MEMM is trained using iterative scaling
- The Viterbi algorithm is used to label a test set (see the sketch below)
These experiments probe the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.
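Since the setup above decodes with the Viterbi algorithm, here is a generic log-space Viterbi sketch (not the paper's code). It assumes an additive per-position score(prev_label, label, i) supplied by whichever model is being decoded; the toy score function at the end is invented.

    def viterbi(n, labels, score, start="start"):
        # Finds the highest-scoring label sequence of length n under an additive
        # score(prev_label, label, i) -- e.g. log transition + log emission for
        # an HMM, or the CRF log-potentials Lambda_i.
        best = {start: 0.0}              # best score of a path ending in each label
        back = []                        # backpointers, one dict per position
        for i in range(n):
            new_best, ptr = {}, {}
            for y in labels:
                prev, s = max(((yp, b + score(yp, y, i)) for yp, b in best.items()),
                              key=lambda t: t[1])
                new_best[y], ptr[y] = s, prev
            best = new_best
            back.append(ptr)
        y = max(best, key=best.get)      # best final label
        path = [y]
        for ptr in reversed(back):       # follow backpointers to the start
            y = ptr[y]
            path.append(y)
        return list(reversed(path))[1:]  # drop the start symbol

    # Toy usage with an invented score function:
    obs = "rro"
    def score(prev, y, i):
        return 1.0 if (y == "A") == (obs[i] == "r") else 0.0
    print(viterbi(len(obs), ["A", "B"], score))   # ['A', 'A', 'B']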

MEMM versus CRF: the CRF usually outperforms the MEMM.

MEMM versus HMM: the HMM outperforms the MEMM.

CRF versus HMM
Each open square represents a data set with α < 1/2, and each solid circle a data set with α ≥ 1/2. When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM.

UPenn Tagging Task
45 tags (syntactic), 1M words of training data. Example:

  DT  NN       NN     NN           VBZ RB        JJ
  The asbestos fiber, crocidolite, is  unusually resilient

  IN   PRP VBZ    DT  NNS    IN   RB   JJ    NNS
  once it  enters the lungs, with even brief exposures

  TO PRP VBG     NNS      WDT  VBP  RP NNS     JJ
  to it  causing symptoms that show up decades later,

  NNS         VBD
  researchers said

POS tagging experiment 1
- Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
- Trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments
- Introduced parameters µ_{y,x} for each tag-word pair and λ_{y',y} for each tag-tag pair in the training set
- oov = out-of-vocabulary (words not observed in the training set)

  model   error    oov error
  HMM     5.69%    45.99%
  MEMM    6.37%    54.61%
  CRF     5.55%    48.05%

POS tagging experiment 2
- Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Added a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

                               using spelling features
  model   error    oov error   error   (delta)   oov error   (delta)
  HMM     5.69%    45.99%
  MEMM    6.37%    54.61%      4.81%   -25%      26.99%      -50%
  CRF     5.55%    48.05%      4.27%   -24%      23.76%      -50%
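A sketch of what the orthographic feature extraction described above might look like; the function and feature names are invented, and the exact feature set in the paper's implementation may differ.

    SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

    def spelling_features(word):
        # Orthographic predicates of the kind listed above (names are invented).
        feats = {
            "starts_with_digit": word[:1].isdigit(),
            "starts_with_upper": word[:1].isupper(),
            "contains_hyphen": "-" in word,
        }
        for suf in SUFFIXES:
            feats["suffix" + suf] = word.endswith(suf[1:])   # drop the leading dash
        return feats

    print(spelling_features("Causing"))   # starts_with_upper and suffix-ing are True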

Presentation Message
- Discriminative models are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well, and demonstrate good performance