Fall 2010 Graduate Course on Dynamic Learning

Fall 2010 Graduate Course on Dynamic Learning Chapter 8: Conditional Random Fields November 1, 2010 Byoung-Tak Zhang School of Computer Science and Engineering & Cognitive Science and Brain Science Programs Seoul National University http://bi.snu.ac.kr/~btzhang/

Overview: Motivating Applications; Sequence Segmentation and Labeling; Generative vs. Discriminative Models; HMM, MEMM; CRF; From MRF to CRF; Learning Algorithms; HMM vs. CRF

Motivating Application: Sequence Labeling. POS tagging: [He/PRP] [reckons/VBZ] [the/DT] [current/JJ] [account/NN] [deficit/NN] [will/MD] [narrow/VB] [to/TO] [only/RB] [#/#] [1.8/CD] [billion/CD] [in/IN] [September/NNP] [./.] Term extraction: "Rockwell International Corp.'s Tulsa unit said it signed a tentative agreement extending its contract with Boeing Co. to provide structural parts for Boeing's 747 jetliners." Information extraction from a company annual report.

Sequence Segmenting and Labeling. Goal: mark up sequences with content tags. Computational linguistics: text and speech processing, topic segmentation, part-of-speech (POS) tagging, information extraction, syntactic disambiguation. Computational biology: DNA and protein sequence alignment, sequence homolog searching in databases, protein secondary structure prediction, RNA secondary structure analysis.

Binary Classifier vs. Sequence Labeling (e.g. SVMs vs. CRFs). Case restoration: given the input "jack utilize outlook express to retrieve emails", candidate restorations include "Jack utilize outlook express to retrieve emails" (-), "Jack Utilize Outlook Express To Receive Emails" (+), "jack utilize outlook express to receive emails", and "JACK UTILIZE OUTLOOK EXPRESS TO RECEIVE EMAILS".

Sequence Labeling Models: Overview. HMM: generative model, e.g. Ghahramani (1997), Manning and Schutze (1999). MEMM: conditional model, e.g. Berger and Pietra (1996), McCallum and Freitag (2000). CRFs: conditional model without the label bias problem. Linear-chain CRFs: e.g. Lafferty and McCallum (2001), Wallach (2004). Non-linear-chain CRFs: modeling more complex interactions between labels (DCRFs, 2D-CRFs), e.g. Sutton and McCallum (2004), Zhu and Nie (2005).

Generative Models: HMM. Based on the joint probability distribution P(y, x). Includes a model of P(x), which is not needed for classification. Interdependent features either require enhancing the model structure to represent them (complexity problems) or making simplifying independence assumptions (e.g. naive Bayes). Hidden Markov models (HMMs) and stochastic grammars assign a joint probability to paired observation and label sequences; the parameters are typically trained to maximize the joint likelihood of training examples.

Hidden Markov Model: $P(s, x) = P(s_1)\, P(x_1 \mid s_1) \prod_{i=2}^{n} P(s_i \mid s_{i-1})\, P(x_i \mid s_i)$. Cannot represent multiple interacting (overlapping) features or long-range dependencies between observed elements.
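To make the factorization concrete, here is a minimal Python sketch (not from the slides; the toy transition and emission tables are invented) that evaluates the HMM joint probability above.

# Minimal sketch (toy numbers, not from the slides): computing the HMM joint
# probability P(s, x) = P(s_1) P(x_1|s_1) * prod_{i=2..n} P(s_i|s_{i-1}) P(x_i|s_i).
init = {"N": 0.6, "V": 0.4}                      # P(s_1)
trans = {("N", "N"): 0.3, ("N", "V"): 0.7,       # P(s_i | s_{i-1})
         ("V", "N"): 0.8, ("V", "V"): 0.2}
emit = {("N", "dog"): 0.5, ("N", "runs"): 0.1,   # P(x_i | s_i)
        ("V", "dog"): 0.1, ("V", "runs"): 0.6}

def hmm_joint(states, obs):
    p = init[states[0]] * emit[(states[0], obs[0])]
    for i in range(1, len(states)):
        p *= trans[(states[i - 1], states[i])] * emit[(states[i], obs[i])]
    return p

print(hmm_joint(["N", "V"], ["dog", "runs"]))    # 0.6*0.5*0.7*0.6 = 0.126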

Conditional Models. Difficulties and disadvantages of generative models: need to enumerate all possible observation sequences; not practical to represent multiple interacting features or long-range dependencies of the observations; very strict independence assumptions on the observations. Conditional models use the conditional probability P(y|x) rather than the joint probability P(y, x), where y = label sequence and x = observation sequence. Based directly on P(y|x), they need no model for P(x); they specify the probability of possible label sequences given an observation sequence; they allow arbitrary, non-independent features on the observation sequence x; the probability of a transition between labels may depend on past and future observations; they relax the strong independence assumptions of generative models.

Discriminative Models: MEMM. A maximum entropy Markov model (MEMM) is an exponential model. Given a training set (X, Y) of observation sequences X and label sequences Y: train a model θ that maximizes P(Y|X, θ); for a new data sequence x, the predicted label sequence y maximizes P(y|x, θ). Notice the per-state normalization. MEMMs have all the advantages of conditional models. Per-state normalization: all the mass that arrives at a state must be distributed among the possible successor states ("conservation of score mass"). Subject to the label bias problem: bias toward states with fewer outgoing transitions.

Maximum Entropy Markov Model: $P(s \mid x) = P(s_1 \mid x_1) \prod_{i=2}^{n} P(s_i \mid s_{i-1}, x_i)$. Label bias problem: the transition probabilities leaving any given state must sum to one.

Conditional Markov Models (CMMs), aka MEMMs, aka Maxent Taggers, vs. HMMs. HMM: $P(s, o) = \prod_i P(s_i \mid s_{i-1})\, P(o_i \mid s_i)$. CMM/MEMM: $P(s \mid o) = \prod_i P(s_i \mid s_{i-1}, o_i)$. [Figure: graphical structures over states S_t and observations O_t; the HMM generates the observations from the states, while the CMM conditions the states on the observations.]

MEMM to CRFs. MEMM: $P(y_1 \ldots y_n \mid x_1 \ldots x_n) = \prod_j P(y_j \mid y_{j-1}, x_j) = \prod_j \frac{\exp\left(\sum_i \lambda_i f_i(x_j, y_j, y_{j-1})\right)}{Z_\lambda(x_j)}$. New model (CRF): $P(y \mid x) = \frac{\exp\left(\sum_i \lambda_i F_i(x, y)\right)}{Z_\lambda(x)}$, where $F_i(x, y) = \sum_j f_i(x, y_j, y_{j-1})$.
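A hedged sketch of the difference the new model introduces: the MEMM normalizes the exponential locally at every position, while the CRF divides once by a global Z_λ(x). The label set, feature scores, and weights below are invented for illustration.

import math
from itertools import product

LABELS = ["A", "B"]

def local_score(y_prev, y, x_t):
    """Toy per-position score exp(sum_i lambda_i f_i(x_t, y, y_prev)); numbers invented."""
    return math.exp(1.0 * (y == x_t) + 0.5 * (y == y_prev))

def memm_prob(ys, xs):
    """MEMM: normalize at every position (per-state normalization)."""
    p, y_prev = 1.0, None
    for y, x in zip(ys, xs):
        z_t = sum(local_score(y_prev, y2, x) for y2 in LABELS)
        p *= local_score(y_prev, y, x) / z_t
        y_prev = y
    return p

def crf_prob(ys, xs):
    """CRF: one global normalizer Z(x) over all label sequences."""
    def seq_score(seq):
        s, y_prev = 1.0, None
        for y, x in zip(seq, xs):
            s *= local_score(y_prev, y, x)
            y_prev = y
        return s
    z = sum(seq_score(seq) for seq in product(LABELS, repeat=len(xs)))
    return seq_score(ys) / z

xs = ["A", "B", "B"]
print(memm_prob(["A", "B", "B"], xs), crf_prob(["A", "B", "B"], xs))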

HMM, MEMM, and CRF in Comparison. [Figure: graphical model diagrams for the HMM, the MEMM, and the CRF.]

Conditional Random Field (CRF)

Random Field

Markov Random Field. Random field: let F = {F_1, F_2, ..., F_M} be a family of random variables defined on the set S, in which each random variable F_i takes a value f_i in a label set L. The family F is called a random field. Markov random field: F is said to be a Markov random field on S with respect to a neighborhood system N if and only if it satisfies the Markov property. An undirected graph for the joint probability p(x) allows no direct probabilistic interpretation of the edges; instead, define potential functions Ψ_A on maximal cliques A that map a joint assignment to a non-negative real number. This requires normalisation: $p(x) = \frac{1}{Z} \prod_A \Psi_A(x_A)$, with $Z = \sum_x \prod_A \Psi_A(x_A)$.
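A minimal sketch of this factorization for a three-variable chain with two pairwise maximal cliques; the potential values are invented, and Z is computed by summing over all assignments.

from itertools import product

# Toy MRF over binary variables x1 - x2 - x3 with maximal cliques {x1,x2} and {x2,x3}.
def psi_12(x1, x2):          # invented non-negative potential
    return 2.0 if x1 == x2 else 0.5

def psi_23(x2, x3):
    return 3.0 if x2 == x3 else 1.0

def unnormalized(x):
    x1, x2, x3 = x
    return psi_12(x1, x2) * psi_23(x2, x3)

Z = sum(unnormalized(x) for x in product([0, 1], repeat=3))

def p(x):
    return unnormalized(x) / Z

print(Z, p((0, 0, 0)))       # probabilities over all 8 assignments sum to 1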

Conditional Random Field: CRF. Conditional probabilistic sequential models p(y|x): undirected graphical models giving the joint probability of an entire label sequence given a particular observation sequence. Weights of different features at different states can be traded off against each other. $p(y \mid x) = \frac{1}{Z(x)} \prod_A \Psi_A(x_A, y_A)$, with $Z(x) = \sum_y \prod_A \Psi_A(x_A, y_A)$.

Example of CRFs

Definition of CRFs. X is a random variable over data sequences to be labeled. Y is a random variable over corresponding label sequences.

Conditional Random Fields (CRFs). CRFs have all the advantages of MEMMs without the label bias problem. An MEMM uses a per-state exponential model for the conditional probabilities of next states given the current state; a CRF has a single exponential model for the joint probability of the entire sequence of labels given the observation sequence. Undirected acyclic graph. Allows some transitions to vote more strongly than others depending on the corresponding observations.

Conditional Random Field. Graphical structure of a chain-structured CRF for sequences; the variables corresponding to unshaded nodes are not generated by the model. Conditional random field: a Markov random field (Y) globally conditioned on another random field (X).

Conditional Random Field: an undirected graphical model globally conditioned on X. Given an undirected graph G = (V, E) such that Y = {Y_v | v ∈ V}, if $p(Y_v \mid X, Y_u, u \neq v, \{u, v\} \subset V) = p(Y_v \mid X, Y_u, (u, v) \in E)$, i.e. the probability of Y_v given X and the remaining label variables equals its probability given X and those random variables corresponding to nodes neighboring v in G, then (X, Y) is a conditional random field.

Conditional Distribution. If the graph G = (V, E) of Y is a tree, the conditional distribution over the label sequence Y = y given X = x, by the fundamental theorem of random fields, is $p_\theta(y \mid x) \propto \exp\left(\sum_{e \in E,\,k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\,k} \mu_k g_k(v, y|_v, x)\right)$. Here x is a data sequence, y is a label sequence, v is a vertex from the vertex set V (the set of label random variables), and e is an edge from the edge set E over V. The f_k and g_k are given and fixed: g_k is a Boolean vertex feature, f_k is a Boolean edge feature, and k indexes the features. θ = (λ_1, λ_2, ..., λ_n; μ_1, μ_2, ..., μ_n), where the λ_k and μ_k are parameters to be estimated. y|_e is the set of components of y defined by edge e; y|_v is the set of components of y defined by vertex v.

Conditional Distribution (cont'd). CRFs use the observation-dependent normalization Z(x) for the conditional distributions: $p_\theta(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_{e \in E,\,k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V,\,k} \mu_k g_k(v, y|_v, x)\right)$, where Z(x) is a normalization over the data sequence x.
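A hedged sketch of the observation-dependent normalizer Z(x): for a short chain it can be computed by brute-force enumeration of all label sequences. The single edge feature, single vertex feature, and their weights below are assumptions made for illustration.

import math
from itertools import product

LABELS = ["N", "V"]
lam, mu = 0.8, 1.2            # invented weights for one edge and one vertex feature

def f_edge(y_prev, y_cur, x, i):       # Boolean edge feature
    return 1.0 if (y_prev, y_cur) == ("N", "V") else 0.0

def g_vertex(y_cur, x, i):             # Boolean vertex feature
    return 1.0 if (y_cur == "V") == x[i].endswith("s") else 0.0

def score(y, x):
    s = sum(mu * g_vertex(y[i], x, i) for i in range(len(x)))
    s += sum(lam * f_edge(y[i - 1], y[i], x, i) for i in range(1, len(x)))
    return math.exp(s)

def Z(x):
    return sum(score(y, x) for y in product(LABELS, repeat=len(x)))

x = ["dog", "runs"]
print({y: score(y, x) / Z(x) for y in product(LABELS, repeat=len(x))})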

Conditional Random Fields. [Figure: a linear-chain CRF over phone labels /k/ /k/ /iy/ /iy/ /iy/ with observations X.] CRFs are based on the idea of Markov random fields, modelled as an undirected graph connecting labels with observations. Transition functions add associations between transitions from one label to another. State functions help determine the identity of the state. Observations in a CRF are not modelled as random variables.

Conditional Random Fields. $P(y \mid x) = \frac{1}{Z(x)} \exp\left(\sum_t \left(\sum_i \lambda_i f_i(x, y_t) + \sum_j \mu_j g_j(x, y_t, y_{t-1})\right)\right)$. The Hammersley-Clifford theorem states that a random field is an MRF iff it can be described in the above form: the exponential is the sum of the clique potentials of the undirected graph. Example for our attributes and labels: a state feature function f([x is stop], /t/) with one possible weight value λ = 10, and a transition feature function g(x, /iy/, /k/), indicating /k/ followed by /iy/, with one possible (strong) weight value μ = 4.
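A small sketch of the slide's own example features: the state feature f([x is stop], /t/) with weight λ = 10 and the transition feature g(x, /iy/, /k/) with weight μ = 4. The encoding of the observation as a dictionary with an is_stop flag is an assumption made for illustration.

# Sketch of the slide's example features (encoding of the acoustic attribute is assumed):
# the state feature fires when the observation is a stop consonant and the label is /t/;
# the transition feature fires when /k/ is followed by /iy/.
LAMBDA, MU = 10.0, 4.0

def f_state(x_t, y_t):
    return 1.0 if x_t.get("is_stop", False) and y_t == "/t/" else 0.0

def g_trans(x_t, y_t, y_prev):
    return 1.0 if (y_prev, y_t) == ("/k/", "/iy/") else 0.0

def clique_potential(x_t, y_t, y_prev):
    """Contribution of position t to the exponent in P(y|x)."""
    return LAMBDA * f_state(x_t, y_t) + MU * g_trans(x_t, y_t, y_prev)

x_t = {"is_stop": True}
print(clique_potential(x_t, "/t/", "/k/"))      # 10.0: the state feature fires
print(clique_potential(x_t, "/iy/", "/k/"))     # 4.0: the transition feature fires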

Conditional Random Fields. Each attribute of the data we are trying to model fits into a feature function that associates the attribute and a possible label: a positive value if the attribute appears in the data, a zero value if the attribute is not in the data. Each feature function carries a weight that gives the strength of that feature function for the proposed label: high positive weights indicate a good association between the feature and the proposed label; high negative weights indicate a negative association between the feature and the proposed label; weights close to zero indicate the feature has little or no impact on the identity of the label.

Formal Definition. A CRF is a Markov random field. By the Hammersley-Clifford theorem, the probability of a label sequence can be expressed as a Gibbs distribution, so that $p(y \mid x, \lambda, \mu) = \frac{1}{Z} \exp\left(\sum_j \lambda_j F_j(y, x)\right)$, where $F_j(y, x) = \sum_{i=1}^{n} f_j(y, x, i)$. What is a clique? By taking into consideration only the one-node and two-node cliques, we have $p(y \mid x, \lambda, \mu) = \frac{1}{Z} \exp\left(\sum_j \lambda_j t_j(y_e, x, i) + \sum_k \mu_k s_k(y_s, x, i)\right)$.

Definition (cont.). Moreover, if we consider the problem in a first-order chain model, we have $p(y \mid x, \lambda, \mu) = \frac{1}{Z} \exp\left(\sum_{i,j} \lambda_j t_j(y_{i-1}, y_i, x, i) + \sum_{i,k} \mu_k s_k(y_i, x, i)\right)$. To simplify the description, let $f_j(y, x, i)$ denote either $t_j(y_{i-1}, y_i, x, i)$ or $s_k(y_i, x, i)$; then $p(y \mid x, \lambda) = \frac{1}{Z} \exp\left(\sum_j \lambda_j F_j(y, x)\right)$, where $F_j(y, x) = \sum_{i=1}^{n} f_j(y, x, i)$.
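A brief sketch of this simplification (feature definitions and weights invented): each f_j is either a transition feature or a state feature, and the global feature F_j(y, x) is obtained by summing f_j over positions.

# Sketch: global features F_j(y, x) = sum_i f_j(y, x, i), where each f_j is either a
# transition feature t_j(y_{i-1}, y_i, x, i) or a state feature s_k(y_i, x, i).
def f0(y, x, i):              # transition feature (uses y[i-1] and y[i])
    return 1.0 if i > 0 and (y[i - 1], y[i]) == ("N", "V") else 0.0

def f1(y, x, i):              # state feature (uses y[i] only)
    return 1.0 if y[i] == "V" and x[i].endswith("s") else 0.0

FEATURES = [f0, f1]
WEIGHTS = [0.8, 1.2]          # invented lambda_j

def global_features(y, x):
    return [sum(fj(y, x, i) for i in range(len(x))) for fj in FEATURES]

def unnormalized_log_score(y, x):
    return sum(w * F for w, F in zip(WEIGHTS, global_features(y, x)))

print(global_features(["N", "V"], ["dog", "runs"]),
      unnormalized_log_score(["N", "V"], ["dog", "runs"]))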

Labeling. In labeling, the task is to find the label sequence that has the largest probability; the key is then to estimate the parameters λ: $\hat{y} = \arg\max_y p_\lambda(y \mid x) = \arg\max_y \left(\lambda \cdot F(y, x)\right)$, since $p(y \mid x, \lambda) = \frac{1}{Z} \exp\left(\sum_j \lambda_j F_j(y, x)\right)$ and Z does not depend on y.
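Because Z does not depend on y, the argmax can be computed with a Viterbi-style dynamic program over the per-position weighted feature sums. A minimal sketch, assuming an invented per-position log-score:

def viterbi(xs, labels, log_score):
    """Find argmax_y sum_i log_score(y_{i-1}, y_i, xs, i); log_score is any
    per-position weighted feature sum (y_prev is None at i = 0)."""
    n = len(xs)
    best = {y: log_score(None, y, xs, 0) for y in labels}
    back = []
    for i in range(1, n):
        new_best, ptr = {}, {}
        for y in labels:
            cand = {yp: best[yp] + log_score(yp, y, xs, i) for yp in labels}
            yp_star = max(cand, key=cand.get)
            new_best[y], ptr[y] = cand[yp_star], yp_star
        best, back = new_best, back + [ptr]
    y = max(best, key=best.get)
    path = [y]
    for ptr in reversed(back):
        y = ptr[y]
        path.append(y)
    return list(reversed(path))

# Toy log-score (invented weights): favors "V" on words ending in "s", and the N->V transition.
def log_score(y_prev, y, xs, i):
    s = 1.2 if (y == "V") == xs[i].endswith("s") else 0.0
    s += 0.8 if (y_prev, y) == ("N", "V") else 0.0
    return s

print(viterbi(["dog", "runs"], ["N", "V"], log_score))   # expected ['N', 'V']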

Optimization. Define a loss function that should be convex to avoid local optima; define constraints; find an optimization method to solve for the minimum of the loss function. A formal expression of the optimization problem: $\min_\theta f(x)$ s.t. $g_i(x) \le 0,\; 0 \le i \le k$ and $h_j(x) = 0,\; 0 \le j \le l$.

Loss Function. Empirical loss vs. structural loss: minimize $L = \sum_k \lVert y^{(k)} - f(x^{(k)}, \lambda)\rVert$ vs. minimize $L = \lVert\lambda\rVert + \sum_k \lVert y^{(k)} - f(x^{(k)}, \lambda)\rVert$ (empirical loss plus a regularizer on λ). Loss function: log-likelihood. With $p(y \mid x, \lambda) = \frac{1}{Z}\exp\left(\sum_j \lambda_j F_j(y, x)\right)$, the log-likelihood is $L(\lambda) = \sum_k \left[ -\log Z_\lambda(x^{(k)}) + \sum_j \lambda_j F_j(y^{(k)}, x^{(k)}) \right]$, and with a Gaussian penalty, $L_\lambda = \sum_k \left[ \lambda \cdot F(y^{(k)}, x^{(k)}) - \log Z_\lambda(x^{(k)}) \right] - \sum_j \frac{\lambda_j^2}{2\sigma^2} + \text{const}$.

Parameter Estimation. Log-likelihood: $L(\lambda) = \sum_k \left[ -\log Z_\lambda(x^{(k)}) + \sum_j \lambda_j F_j(y^{(k)}, x^{(k)}) \right]$. Differentiating the log-likelihood with respect to parameter λ_j: $\frac{\partial L}{\partial \lambda_j} = \sum_k \left[ F_j(y^{(k)}, x^{(k)}) - \frac{\left(Z_\lambda(x^{(k)})\right)'}{Z_\lambda(x^{(k)})} \right]$, where $\left(Z_\lambda(x^{(k)})\right)' = \sum_y \exp\left(\lambda \cdot F(y, x^{(k)})\right) F_j(y, x^{(k)})$, so that $\frac{\left(Z_\lambda(x^{(k)})\right)'}{Z_\lambda(x^{(k)})} = \sum_y \frac{\exp\left(\lambda \cdot F(y, x^{(k)})\right)}{\sum_{y'} \exp\left(\lambda \cdot F(y', x^{(k)})\right)}\, F_j(y, x^{(k)}) = \sum_y p(y \mid x^{(k)})\, F_j(y, x^{(k)})$. Hence $\frac{\partial L}{\partial \lambda_j} = E_{\tilde{p}(Y, X)}[F_j(Y, X)] - \sum_k E_{p(Y \mid x^{(k)}, \lambda)}[F_j(Y, x^{(k)})]$. By adding the model penalty, it can be rewritten as $\frac{\partial L}{\partial \lambda_j} = E_{\tilde{p}(Y, X)}[F_j(Y, X)] - \sum_k E_{p(Y \mid x^{(k)}, \lambda)}[F_j(Y, x^{(k)})] - \frac{\lambda_j}{\sigma^2}$.
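A hedged brute-force check of this gradient on a tiny example: the derivative with respect to λ_j is the empirical feature count minus the model expectation, minus the λ_j/σ² penalty. The features, weights, and data below are invented.

import math
from itertools import product

LABELS = ["N", "V"]
FEATURES = [
    lambda y, x, i: 1.0 if i > 0 and (y[i - 1], y[i]) == ("N", "V") else 0.0,
    lambda y, x, i: 1.0 if y[i] == "V" and x[i].endswith("s") else 0.0,
]

def F(y, x):                                   # global feature vector F(y, x)
    return [sum(f(y, x, i) for i in range(len(x))) for f in FEATURES]

def log_score(y, x, lam):
    return sum(l * Fj for l, Fj in zip(lam, F(y, x)))

def gradient(data, lam, sigma2=10.0):
    grad = [0.0] * len(lam)
    for x, y_obs in data:
        ys = list(product(LABELS, repeat=len(x)))
        Z = sum(math.exp(log_score(y, x, lam)) for y in ys)
        expected = [0.0] * len(lam)
        for y in ys:
            p = math.exp(log_score(y, x, lam)) / Z
            for j, Fj in enumerate(F(y, x)):
                expected[j] += p * Fj
        for j in range(len(lam)):
            grad[j] += F(y_obs, x)[j] - expected[j]
    return [g - l / sigma2 for g, l in zip(grad, lam)]   # Gaussian-prior penalty

data = [(["dog", "runs"], ("N", "V"))]
print(gradient(data, [0.0, 0.0]))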

Optimization. $L(\lambda) = \sum_k \left[ -\log Z_\lambda(x^{(k)}) + \sum_j \lambda_j F_j(y^{(k)}, x^{(k)}) \right]$, $\frac{\partial L}{\partial \lambda_j} = E_{\tilde{p}(Y, X)}[F_j(Y, X)] - \sum_k E_{p(Y \mid x^{(k)}, \lambda)}[F_j(Y, x^{(k)})]$. $E_{\tilde{p}(Y, X)}[F_j(Y, X)]$ can be calculated easily. $E_{p(Y \mid x)}[F_j(Y, x)]$ can be calculated by making use of a forward-backward algorithm, and Z can be estimated in the same forward-backward pass.

Calculating the Expectation. First we define the transition matrix of y at position i for observation x as $M_i[y_{i-1}, y_i] = \exp\left(\sum_j \lambda_j f_j(y_{i-1}, y_i, x, i)\right)$, which collects all state and transition features at position i. Then $E_{p(Y \mid x^{(k)}, \lambda)}[F_j(Y, x^{(k)})] = \sum_y p(y \mid x^{(k)})\, F_j(y, x^{(k)}) = \sum_{i=1}^{n} \sum_{y_{i-1}, y_i} p(y_{i-1}, y_i \mid x^{(k)})\, f_j(y_{i-1}, y_i, x^{(k)}, i)$, with $p(y_{i-1}, y_i \mid x^{(k)}) = \frac{\alpha_{i-1}(y_{i-1})\, M_i[y_{i-1}, y_i]\, \beta_i(y_i)}{Z_\lambda(x^{(k)})}$. The normalizer is $Z_\lambda(x) = \left[\prod_{i=1}^{n+1} M_i(x)\right]_{\text{start},\,\text{stop}} = \alpha_n \cdot \mathbf{1}^T$, with the forward and backward vectors defined by $\alpha_i = \begin{cases} \alpha_{i-1} M_i & 0 < i \le n \\ \mathbf{1} & i = 0 \end{cases}$ and $\beta_i^T = \begin{cases} M_{i+1}\, \beta_{i+1}^T & 1 \le i < n \\ \mathbf{1} & i = n \end{cases}$, and the single-node marginal is $p(y_i \mid x^{(k)}) = \frac{\alpha_i(y_i)\, \beta_i(y_i)}{Z_\lambda(x^{(k)})}$.
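A minimal sketch of these recursions for a linear-chain CRF: M_i is built from weighted features, α and β are accumulated forward and backward, and Z(x) and the pairwise marginals follow. The feature set and weights are invented; indexing starts at position 0 with y_{-1} treated as a start symbol (None).

import math
from itertools import product

LABELS = ["N", "V"]
WEIGHTS = [0.8, 1.2]
FEATURES = [
    lambda yp, y, x, i: 1.0 if (yp, y) == ("N", "V") else 0.0,            # transition
    lambda yp, y, x, i: 1.0 if y == "V" and x[i].endswith("s") else 0.0,  # state
]

def M(yp, y, x, i):
    """Transition-matrix entry M_i[y_{i-1}, y_i] = exp(sum_j lambda_j f_j(...))."""
    return math.exp(sum(w * f(yp, y, x, i) for w, f in zip(WEIGHTS, FEATURES)))

def forward_backward(x):
    n = len(x)
    alpha = [{y: M(None, y, x, 0) for y in LABELS}]                 # alpha_0
    for i in range(1, n):
        alpha.append({y: sum(alpha[i - 1][yp] * M(yp, y, x, i) for yp in LABELS)
                      for y in LABELS})
    beta = [None] * n
    beta[n - 1] = {y: 1.0 for y in LABELS}                          # beta_{n-1}
    for i in range(n - 2, -1, -1):
        beta[i] = {y: sum(M(y, y2, x, i + 1) * beta[i + 1][y2] for y2 in LABELS)
                   for y in LABELS}
    Z = sum(alpha[n - 1][y] for y in LABELS)
    return alpha, beta, Z

def pairwise_marginal(x, i, yp, y, alpha, beta, Z):
    """p(y_{i-1} = yp, y_i = y | x) = alpha_{i-1}(yp) M_i[yp, y] beta_i(y) / Z."""
    return alpha[i - 1][yp] * M(yp, y, x, i) * beta[i][y] / Z

x = ["dog", "runs"]
alpha, beta, Z = forward_backward(x)
print(Z, pairwise_marginal(x, 1, "N", "V", alpha, beta, Z))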

First-order Numerical Optimization. Using iterative scaling (GIS, IIS): initialize each λ_j (e.g. to 0); until convergence, solve $\frac{\partial L}{\partial \lambda_j} = 0$ for each update $\Delta\lambda_j$, and update each parameter using $\lambda_j \leftarrow \lambda_j + \Delta\lambda_j$.

Second-order Numerical Optimization. Using the Newton optimization technique for parameter estimation: $\lambda^{(k+1)} = \lambda^{(k)} + \left(\frac{\partial^2 L}{\partial \lambda^2}\right)^{-1} \frac{\partial L}{\partial \lambda}$. Drawbacks: parameter value initialization, and computing the second-order term (i.e. the Hessian matrix) is difficult. Solutions: conjugate gradient (CG) (Shewchuk, 1994); limited-memory quasi-Newton (L-BFGS) (Nocedal and Wright, 1999); voted perceptron (Collins, 2002).

Summary of CRFs. Model: Lafferty (2001). Training and extensions: efficient training (Wallach, 2003); training via gradient tree boosting (Dietterich, 2004); Bayesian conditional random fields (Qi, 2005). Applications: named entity recognition (McCallum, 2003); shallow parsing (Sha, 2003); table extraction (Pinto, 2003); signature extraction (Kristjansson, 2004); accurate information extraction from research papers (Peng, 2004); object recognition (Quattoni, 2004); identifying biomedical named entities (Tsai, 2005). Limitation: huge computational cost in parameter estimation.

HMM vs. CRF. HMM: $\arg\max_\phi P(\phi \mid S) = \arg\max_\phi P(\phi)\, P(S \mid \phi) = \arg\max_\phi \sum_{y_i \in \phi} \log\left( P_{\text{trans}}(y_i \mid y_{i-1})\, P_{\text{emit}}(s_i \mid y_i) \right)$. CRF: $\arg\max_\phi P(\phi \mid S) = \arg\max_\phi \frac{1}{Z} e^{\sum_{c,i} \lambda_i f_i(c, S)} = \arg\max_\phi \sum_{c,i} \lambda_i f_i(c, S)$. 1. Both optimizations are over sums; this allows us to use any of the dynamic programming HMM/GHMM decoding algorithms for fast, memory-efficient parsing, with the CRF scoring scheme used in place of the HMM/GHMM scoring scheme. 2. The CRF functions f_i(c, S) may in fact be implemented using any type of sensor, including such probabilistic sensors as Markov chains, interpolated Markov models (IMMs), decision trees, phylogenetic models, etc., as well as any non-probabilistic sensor, such as n-mer counts or binary indicators.

Appendix

Markov Random Fields. A (discrete-valued) Markov random field (MRF) is a 4-tuple M = (α, X, P_M, G) where: α is a finite alphabet; X is a set of (observable or unobservable) variables taking values from α; P_M is a probability distribution on the variables in X; G = (X, E) is an undirected graph on X describing a set of dependence relations among variables, such that $P_M(X_i \mid \{X_{k \neq i}\}) = P_M(X_i \mid N_G(X_i))$, for $N_G(X_i)$ the neighbors of X_i under G. That is, the conditional probabilities as given by P_M must obey the dependence relations (a generalized "Markov assumption") given by the undirected graph G. A problem arises when actually inducing such a model in practice, namely, that we can't just set the conditional probabilities $P_M(X_i \mid N_G(X_i))$ arbitrarily and expect the joint probability P_M(X) to be well-defined (Besag, 1974). Thus, the problem of estimating parameters locally for each neighborhood is confounded by constraints at the global level.

The Hammersley-Clifford Theorem. Suppose P(x) > 0 for all (joint) value assignments x to the variables in X. Then by the Hammersley-Clifford theorem, the likelihood of x under model M is given by $P_M(x) = \frac{1}{Z} e^{Q(x)}$, for normalization term $Z = \sum_x e^{Q(x)}$, where Q(x) has a unique expansion given by $Q(x_0, x_1, \ldots, x_{n-1}) = \sum_{0 \le i < n} x_i \Phi_i(x_i) + \sum_{0 \le i < j < n} x_i x_j \Phi_{i,j}(x_i, x_j) + \ldots + x_0 x_1 \cdots x_{n-1} \Phi_{0,1,\ldots,n-1}(x_0, x_1, \ldots, x_{n-1})$, and where any Φ term not corresponding to a clique must be zero (Besag, 1974). What is a clique? A clique is any subgraph in which all vertices are neighbors. The reason this is useful is that it provides a way to evaluate probabilities (whether joint or conditional) based on the local functions Φ. Thus, we can train an MRF by learning the individual Φ functions, one for each clique.
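As a small illustration of the Q(x) expansion (values invented), consider three binary variables on the chain 0-1-2: only Φ terms on cliques may be non-zero, so the pair {0,2} and the triple {0,1,2} contribute nothing.

import math
from itertools import product

# Q(x) for binary x = (x0, x1, x2) on the chain 0 - 1 - 2: singleton terms plus the
# pairwise terms for the two cliques {0,1} and {1,2}; Phi_{0,2} and Phi_{0,1,2}
# must be zero because {0,2} and {0,1,2} are not cliques. Values are invented.
phi1 = [0.5, -0.2, 0.3]              # Phi_i(1); the terms vanish when x_i = 0
phi2 = {(0, 1): 1.0, (1, 2): -0.5}   # Phi_{i,j}(1,1)

def Q(x):
    q = sum(x[i] * phi1[i] for i in range(3))
    q += sum(x[i] * x[j] * w for (i, j), w in phi2.items())
    return q

Z = sum(math.exp(Q(x)) for x in product([0, 1], repeat=3))

def P(x):
    return math.exp(Q(x)) / Z

print(P((1, 1, 0)), sum(P(x) for x in product([0, 1], repeat=3)))  # second value = 1.0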

Conditional Random Fields. A conditional random field (CRF) is a Markov random field of unobservables which are globally conditioned on a set of observables (Lafferty et al., 2001). Formally, a CRF is a 6-tuple M = (L, α, Y, X, Ω, G) where: L is a finite output alphabet of labels, e.g. {exon, intron}; α is a finite input alphabet, e.g. {A, C, G, T}; Y is a set of unobserved variables taking values from L; X is a set of (fixed) observed variables taking values from α; Ω = {Φ_c : L^Y × α^X → ℝ} is a set of potential functions Φ_c(y, x); G = (V, E) is an undirected graph describing a set of dependence relations E among variables V = X ∪ Y, where E ∩ (X × X) = ∅, such that (α, Y, $e^{\sum_c \Phi_c(y, x)}/Z$, G−X) is a Markov random field. Note that: 1. The observables X are not included in the MRF part of the CRF, which is only over the subgraph G−X; however, the X are deemed constants, and are globally visible to the Φ functions. 2. We have not specified a probability function P_M, but have instead given local clique-specific functions Φ_c which together define a coherent probability distribution via Hammersley-Clifford.

CRFs versus MRFs. A conditional random field is effectively an MRF plus a set of external variables X, where the internal variables Y of the MRF are the unobservables and the external variables X are the observables. [Figure: the MRF over Y nested inside the CRF, with the fixed, observable variables X outside the MRF.] Thus, we could denote a CRF informally as C = (M, X) for MRF M and external variables X, with the understanding that the graph $G_{X \cup Y}$ of the CRF is simply the graph $G_Y$ of the underlying MRF M plus the vertices X and any edges connecting these to the elements of $G_Y$. Note that in a CRF we do not explicitly model any direct relationships between the observables (i.e., among the X) (Lafferty et al., 2001).