Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data

Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data
Paper by John Lafferty, Andrew McCallum, and Fernando Pereira (ICML 2001)
Presentation by Joe Drish, May 9, 2002

Main Goals
- Present a new framework for labeling sequence data: Conditional Random Fields (CRFs)
- Describe the label bias problem
- Motivate the use of CRFs to solve the label bias problem
- Define the structure and properties of CRFs
- [Describe two training procedures for learning CRF parameters]
- [Sketch a proof of the convergence of the two training procedures]
- Compare experimentally to hidden Markov models (HMMs) and maximum entropy Markov models (MEMMs)

Labeling Sequence Data
- Given observed data sequences X = {x1, x2, ..., xn} and a corresponding label sequence yk for each data sequence xk, so Y = {y1, y2, ..., yn}
- Prediction task: given a sequence x and a model θ, predict y
- Learning task: given training sets X and Y, learn the best model θ
Notation:
- set of sequences: X, uppercase
- sequence of observations: x, lowercase boldface, also called a data sequence
- sequence of labels: y, lowercase boldface, also called a sequence of states
- single observation: x, lowercase, also called a data value

Example: label and observation sequence
  label y       observation x
  <head>        X-NNTP-Poster: NewsHound v1.33
  <head>
  <head>        Archive-name: acorn/faq/part2
  <head>        Frequency: monthly
  <head>
  <question>    2.6) What configuration of serial cable should
  <question>    I use?
  <answer>      Here follows a diagram of the necessary
  <answer>      connections programs to work properly. They
  <answer>      are as far as I know agreed upon by commercial
  <answer>      comms software developers fo
  <answer>
  <answer>      Pins 1, 4, and 8 must be connected together
  <answer>      is to avoid the well known serial port chip bugs.
The left column is the label sequence y; the right column is the observation sequence x.

Generative Modeling (HMMs)
- Given a training set X with label sequences Y: train a model θ that maximizes P(X, Y | θ)
- For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ), which by Bayes' rule is proportional to P(x | y, θ) P(y | θ)
[Diagram: HMM graphical model with state nodes Y_i = y_i and Y_{i+1} = y_{i+1} and observation node X_{i+1} = x_{i+1}; arrows run from states to observations. Notation: A = a means the random variable A takes the outcome a.]
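As a rough illustration of the generative view (not from the paper), the sketch below computes the joint probability P(x, y | θ) of a toy first-order HMM as a product of start, transition, and emission probabilities. All state names, symbols, and probability tables are invented for the rib/rob example.

    # Minimal sketch (not from the paper): joint probability under a toy HMM.
    start_p = {"1": 0.5, "4": 0.5}                      # P(y_1)
    trans_p = {("1", "2"): 1.0, ("2", "3"): 1.0,        # P(y_i | y_{i-1})
               ("4", "5"): 1.0, ("5", "3"): 1.0}
    emit_p  = {("1", "r"): 1.0, ("2", "i"): 1.0,        # P(x_i | y_i)
               ("3", "b"): 1.0, ("4", "r"): 1.0, ("5", "o"): 1.0}

    def hmm_joint(x, y):
        """P(x, y) = P(y_1) P(x_1|y_1) * prod_i P(y_i|y_{i-1}) P(x_i|y_i)."""
        p = start_p.get(y[0], 0.0) * emit_p.get((y[0], x[0]), 0.0)
        for i in range(1, len(x)):
            p *= trans_p.get((y[i-1], y[i]), 0.0) * emit_p.get((y[i], x[i]), 0.0)
        return p

    print(hmm_joint("rib", ["1", "2", "3"]))   # 0.5 with these toy tables
    print(hmm_joint("rob", ["4", "5", "3"]))   # 0.5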

Discriminative Modeling (MEMMs)
- Given a training set X with label sequences Y: train a model θ that maximizes P(Y | X, θ)
- For a new data sequence x, the predicted label sequence y maximizes P(y | x, θ)
[Diagram: MEMM graphical model with state nodes Y_i = y_i and Y_{i+1} = y_{i+1} and observation node X_{i+1} = x_{i+1}; arrows run from observations to states.]

rib/rob models
Training data: {<rib, 123>, <rob, 453>}
[Diagram: three finite-state topologies over states 0-5, one path (0, 1, 2, 3) for "rib" and one path (0, 4, 5, 3) for "rob":
- HMM: the observation depends on the state
- MEMM: the state depends on the observation
- CRF: non-causal model]

Label Bias Problem
Consider this MEMM:
[Diagram: the rib/rob MEMM with states 0, 1, 2, 3 on the "rib" branch and 0, 4, 5, 3 on the "rob" branch]
- The label sequence 1, 2 should score higher when "ri" is observed than when "ro" is observed; that is, we expect P(1 and 2 | ri) to be greater than P(1 and 2 | ro).
- Mathematically:
  P(1 and 2 | ro) = P(2 | 1 and ro) P(1 | ro) = P(2 | 1 and o) P(1 | r)
  P(1 and 2 | ri) = P(2 | 1 and ri) P(1 | ri) = P(2 | 1 and i) P(1 | r)
- In the training data, label value 2 is the only label value observed after label value 1, so P(2 | 1) = 1, and hence P(2 | 1 and x) = 1 for all x.
- Since P(2 | 1 and x) = 1 for all x, P(1 and 2 | ro) = P(1 and 2 | ri): the second observation is effectively ignored.
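To make the arithmetic above concrete, the toy script below (an illustration, not the paper's code) encodes MEMM-style locally normalized next-state distributions for the rib/rob topology and shows that P(1, 2 | ri) equals P(1, 2 | ro): because state 1 has a single successor, its outgoing distribution must sum to one no matter which symbol is observed.

    # Toy illustration of the label bias problem (not from the paper).
    def next_state_dist(prev_state, obs):
        # obs is deliberately unused below: from states 1 and 4 there is only
        # one successor, so local normalization forces probability 1 on it.
        if prev_state == "0":
            return {"1": 0.5, "4": 0.5}
        if prev_state == "1":
            return {"2": 1.0}
        if prev_state == "4":
            return {"5": 1.0}
        return {}

    def path_prob(states, obs_seq, start="0"):
        """P(states | obs_seq) as a product of locally normalized factors."""
        p, prev = 1.0, start
        for y, x in zip(states, obs_seq):
            p *= next_state_dist(prev, x).get(y, 0.0)
            prev = y
        return p

    print(path_prob(["1", "2"], "ri"))   # 0.5
    print(path_prob(["1", "2"], "ro"))   # 0.5 -- identical: the observation is wasted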

Changing the Set of States
Example: merge states 1 and 4 into a single state 14:
[Diagram: states 0, 14, 2, 5, 3 with observations r, i, o, b; the branch decision is made at state 14]
  P(14 and 2 | ri) = P(2 | 14 and ri) P(14 | ri) = P(2 | 14 and i) P(14 | r) = (1)(1) = 1
  P(14 and 2 | ro) = P(2 | 14 and ro) P(14 | ro) = P(2 | 14 and o) P(14 | r) = (0)(1) = 0
This is a solution to the label bias problem, but changing the set of states in this way would be impractical in general.

Conditional Random Fields (CRFs)
Disadvantages of MEMMs:
- P(y | x) is a product of factors, one for each label
- Each factor can depend only on the previous label, not on future labels
So instead, let
  P(y | x) ∝ exp( Σ_k λ_k f_k(y, x) )
where each f_k is a property of part of y and x.
Example: f_k(x, y) = 1 if X_i is uppercase and the label Y_i is a proper noun.
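A hedged sketch of what one such binary feature might look like in code; the function name, label tag, and arguments are invented for illustration, not taken from the paper.

    def f_cap_propernoun(x, y, i):
        # Fires when the word at position i is capitalized and its label is NNP.
        return 1.0 if x[i][:1].isupper() and y[i] == "NNP" else 0.0

    print(f_cap_propernoun(["Lafferty", "spoke"], ["NNP", "VBD"], 0))   # 1.0

The (unnormalized) CRF score of a labeling is then exp of the weighted sum of such features over the whole sequence.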

Random Field Example
- Let G = (Y, E) be a graph where each vertex Y_v is a random variable
- Suppose P(Y_v | all other Y) = P(Y_v | neighbors(Y_v)); then Y is a random field
Example:
[Diagram: six vertices Y_1 ... Y_6, with Y_5 adjacent to Y_4 and Y_6]
  P(Y_5 | all other Y) = P(Y_5 | Y_4, Y_6)

Conditional Random Field Example
- Suppose P(Y_v | X, all other Y) = P(Y_v | X, neighbors(Y_v)); then X together with Y is a conditional random field
[Diagram: observations X connected to a chain of labels Y_1 - Y_2 - Y_3 - Y_4 - Y_5]
  P(Y_3 | X, all other Y) = P(Y_3 | X, Y_2, Y_4)
Think of X as observations and Y as labels.

Conditional Distribution
If the graph over Y is a tree, the distribution over the label sequence Y = y, given X = x, is:

  p_θ(y | x) ∝ exp( Σ_{e∈E, k} λ_k f_k(e, y|_e, x) + Σ_{v∈V, k} µ_k g_k(v, y|_v, x) )    (1)

- x is a data sequence outcome
- y is a label sequence outcome
- v is a vertex from the vertex set V = the set of label random variables
- e is an edge from the edge set E over V
- f_k and g_k are given and fixed features; each g_k is a property of x and the label random variable associated with vertex v
- k indexes the features
- Θ = (λ_1, λ_2, ...; µ_1, µ_2, ...); λ_k and µ_k are parameters to be estimated
- y|_e is the set of components of y defined by edge e
- y|_v is the component of y defined by vertex v

Matrix Random Variable
- Add special start and stop states: y_0 = start and y_{n+1} = stop
- e_i is the edge with labels (Y_{i-1}, Y_i); v_i is the vertex with label Y_i
- 𝒴 is the set of possible label values; if Y = y then each y_i ∈ 𝒴
- The matrix random variable has the range 𝒴 × 𝒴
Suppose that p_Θ(Y | X) is a CRF given by (1), and assume Y is a chain. For each position i in the observation sequence x, define a matrix random variable M_i(x) = [M_i(y', y | x)] by:

  M_i(y', y | x) = exp( Λ_i(y', y | x) )
  Λ_i(y', y | x) = Σ_k λ_k f_k(e_i, Y|_{e_i} = (y', y), x) + Σ_k µ_k g_k(v_i, Y|_{v_i} = y, x)

[Diagram: chain of labels Y_{i-1}, Y_i, Y_{i+1} over observations X_{i-1}, X_i, X_{i+1}]

Conditional Probability for CRFs
The conditional probability of a label sequence y is

  p_Θ(y | x) = ( Π_{i=1}^{n+1} M_i(y_{i-1}, y_i | x) ) / Z_Θ(x)
  Z_Θ(x) = ( M_1(x) M_2(x) ... M_{n+1}(x) )_{start, stop}

- n is the length of the sequence, y = y_1 ... y_n
- y_0 = start and y_{n+1} = stop
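The sketch below is a minimal illustration of the formula above, not the paper's implementation: it builds the matrices M_i for a tiny chain CRF with an invented label set {A, B}, two invented features, and arbitrary weights, multiplies the matrices to obtain the normalizer, and evaluates p_Θ(y | x). The check at the end confirms the probabilities of all labelings sum to one.

    import numpy as np

    labels = ["start", "A", "B", "stop"]          # invented label set
    idx = {y: k for k, y in enumerate(labels)}

    def log_potential(y_prev, y, x, i):
        # Lambda_i(y', y | x): one invented edge feature plus one invented
        # vertex feature; the weights 0.8 and 1.2 are arbitrary.
        score = 0.8 if (y_prev, y) == ("A", "B") else 0.0
        if i <= len(x):
            score += 1.2 if (y == "A" and x[i-1] == "r") else 0.0
        return score

    def M(i, x):
        # Matrix random variable M_i(x) with entries exp(Lambda_i(y', y | x));
        # transitions back into start or out of stop are forbidden.
        m = np.zeros((len(labels), len(labels)))
        for yp in labels:
            for y in labels:
                if y != "start" and yp != "stop":
                    m[idx[yp], idx[y]] = np.exp(log_potential(yp, y, x, i))
        return m

    def prob(y_seq, x):
        # p(y | x) = prod_i M_i(y_{i-1}, y_i | x) / Z(x),
        # with Z(x) = [M_1(x) M_2(x) ... M_{n+1}(x)]_{start, stop}.
        ys = ["start"] + list(y_seq) + ["stop"]
        num, Z = 1.0, np.eye(len(labels))
        for i in range(1, len(ys)):              # i = 1 .. n+1
            Mi = M(i, x)
            num *= Mi[idx[ys[i-1]], idx[ys[i]]]
            Z = Z @ Mi
        return num / Z[idx["start"], idx["stop"]]

    x = "ro"
    print(sum(prob([a, b], x) for a in "AB" for b in "AB"))   # ~1.0: a proper distribution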

Parameter Estimation
- Input: training data D = {(x^(i), y^(i))}, i = 1 ... N, with empirical distribution p̃(x, y)
- Output: parameters Θ = (λ_1, λ_2, ...; µ_1, µ_2, ...)
- Maximize the log-likelihood objective function:

  O(Θ) = Σ_{i=1}^{N} log p_Θ(y^(i) | x^(i)) ∝ Σ_{x,y} p̃(x, y) log p_Θ(y | x)

  This is an expectation under the empirical distribution.
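A small sketch of the objective, reusing the prob() function from the matrix sketch above; the tiny training set is invented. The gradient of this objective is the difference between empirical and model-expected feature counts, which is what the iterative scaling training procedures described in the paper exploit.

    import math

    # Tiny invented training set; prob() comes from the sketch above.
    train = [("ro", ["A", "B"]), ("rr", ["A", "A"])]

    def log_likelihood():
        # O(Theta) = sum_i log p_Theta(y^(i) | x^(i))
        return sum(math.log(prob(list(y), x)) for x, y in train)

    print(log_likelihood())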

rib/rob models (revisited)
Training data: {<rib, 123>, <rob, 453>}
[Diagram: the same HMM, MEMM, and CRF topologies as before: HMM — observation depends on the state; MEMM — state depends on the observation; CRF — non-causal model]
Emission probabilities: P(r | 1) = 29/32, and P(i | 1) = P(o | 1) = P(b | 1) = 1/32

Modeling the label bias problem
- Each state emits its designated symbol with probability 29/32 and each of the other symbols with probability 1/32 (as in the figure above)
- The MEMM and the CRF are trained with the same topologies
- A run consists of 2,000 training examples and 500 test examples, trained to convergence
- The CRF error is 4.6%, while the MEMM error is 42%: the MEMM fails to discriminate between the two branches

Mixed-order HMM
State transition probabilities are given by
  p_α(y_i | y_{i-1}, y_{i-2}) = α p_2(y_i | y_{i-1}, y_{i-2}) + (1 − α) p_1(y_i | y_{i-1})
Emission probabilities are given by
  p_α(x_i | y_i, x_{i-1}) = α p_2(x_i | y_i, x_{i-1}) + (1 − α) p_1(x_i | y_i)
With α = 0 this is a standard first-order HMM: the transition probabilities reduce to p_1(y_i | y_{i-1}) and the emission probabilities to p_1(x_i | y_i).
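An illustrative sketch of how the interpolated transition distribution could be sampled; the function name and the placeholder tables p1 and p2 are assumptions, not the paper's generator.

    import random

    def sample_next_label(y_prev1, y_prev2, alpha, p1, p2):
        # Draw y_i from p_alpha(y_i | y_{i-1}, y_{i-2}) = alpha*p2 + (1-alpha)*p1.
        # p1 maps y_{i-1} -> {label: prob}; p2 maps (y_{i-1}, y_{i-2}) -> {label: prob}.
        dist = p2[(y_prev1, y_prev2)] if random.random() < alpha else p1[y_prev1]
        labels, weights = zip(*dist.items())
        return random.choices(labels, weights=weights, k=1)[0]

    # Example with placeholder two-label tables:
    p1 = {"A": {"A": 0.7, "B": 0.3}, "B": {"A": 0.4, "B": 0.6}}
    p2 = {(a, b): {"A": 0.5, "B": 0.5} for a in "AB" for b in "AB"}
    print(sample_next_label("A", "B", alpha=0.25, p1=p1, p2=p2))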

Modeling mixed-order sources
Experimental setup:
- For each randomly generated model, a sample of 1,000 sequences of length 25 is generated for training and testing
- On each randomly generated training set, a CRF is trained using Algorithm S, which is described in the paper
- On the same data an MEMM is trained using iterative scaling
- The Viterbi algorithm is used to label a test set (see the sketch below)
These experiments probe the advantages of the additional representational power of CRFs and MEMMs relative to HMMs.
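Since the setup above decodes with the Viterbi algorithm, here is a generic log-space Viterbi sketch (not the paper's code). It assumes an additive per-position score(prev_label, label, i) supplied by whichever model is being decoded; the toy score function at the end is invented.

    def viterbi(n, labels, score, start="start"):
        # Finds the highest-scoring label sequence of length n under an additive
        # score(prev_label, label, i) -- e.g. log transition + log emission for
        # an HMM, or the CRF log-potentials Lambda_i.
        best = {start: 0.0}              # best score of a path ending in each label
        back = []                        # backpointers, one dict per position
        for i in range(n):
            new_best, ptr = {}, {}
            for y in labels:
                prev, s = max(((yp, b + score(yp, y, i)) for yp, b in best.items()),
                              key=lambda t: t[1])
                new_best[y], ptr[y] = s, prev
            best = new_best
            back.append(ptr)
        y = max(best, key=best.get)      # best final label
        path = [y]
        for ptr in reversed(back):       # follow backpointers to the start
            y = ptr[y]
            path.append(y)
        return list(reversed(path))[1:]  # drop the start symbol

    # Toy usage with an invented score function:
    obs = "rro"
    def score(prev, y, i):
        return 1.0 if (y == "A") == (obs[i] == "r") else 0.0
    print(viterbi(len(obs), ["A", "B"], score))   # ['A', 'A', 'B']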

MEMM versus CRF: the CRF usually outperforms the MEMM.

MEMM versus HMM: the HMM outperforms the MEMM.

CRF versus HMM
Each open square represents a data set with α < 1/2, and each solid circle a data set with α ≥ 1/2. When the data is mostly second order (α ≥ 1/2), the discriminatively trained CRF usually outperforms the HMM.

UPenn Tagging Task
45 tags (syntactic), 1M words of training data. Example:

  DT  NN       NN     NN           VBZ RB        JJ
  The asbestos fiber, crocidolite, is  unusually resilient

  IN   PRP VBZ    DT  NNS    IN   RB   JJ    NNS
  once it  enters the lungs, with even brief exposures

  TO PRP VBG     NNS      WDT  VBP  RP NNS     JJ
  to it  causing symptoms that show up decades later,

  NNS         VBD
  researchers said

POS tagging experiment 1
- Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
- Trained first-order HMM, MEMM, and CRF models as in the synthetic data experiments
- Introduced parameters µ_{y,x} for each tag-word pair and λ_{y',y} for each tag-tag pair in the training set
- oov = out-of-vocabulary (words not observed in the training set)

  model   error    oov error
  HMM     5.69%    45.99%
  MEMM    6.37%    54.61%
  CRF     5.55%    48.05%

POS tagging experiment 2
- Compared HMMs, MEMMs, and CRFs on Penn treebank POS tagging
- Each word in a given input sentence must be labeled with one of 45 syntactic tags
- Added a small set of orthographic features: whether a spelling begins with a number or upper case letter, whether it contains a hyphen, and whether it contains one of the following suffixes: -ing, -ogy, -ed, -s, -ly, -ion, -tion, -ity, -ies

                               using spelling features
  model   error    oov error   error   (delta)   oov error   (delta)
  HMM     5.69%    45.99%
  MEMM    6.37%    54.61%      4.81%   -25%      26.99%      -50%
  CRF     5.55%    48.05%      4.27%   -24%      23.76%      -50%
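A sketch of what the orthographic feature extraction described above might look like; the function and feature names are invented, and the exact feature set in the paper's implementation may differ.

    SUFFIXES = ("-ing", "-ogy", "-ed", "-s", "-ly", "-ion", "-tion", "-ity", "-ies")

    def spelling_features(word):
        # Orthographic predicates of the kind listed above (names are invented).
        feats = {
            "starts_with_digit": word[:1].isdigit(),
            "starts_with_upper": word[:1].isupper(),
            "contains_hyphen": "-" in word,
        }
        for suf in SUFFIXES:
            feats["suffix" + suf] = word.endswith(suf[1:])   # drop the leading dash
        return feats

    print(spelling_features("Causing"))   # starts_with_upper and suffix-ing are True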

Presentation Message
- Discriminative models are prone to the label bias problem
- CRFs provide the benefits of discriminative models
- CRFs solve the label bias problem well, and demonstrate good performance