Partially Directed Graphs and Conditional Random Fields. Sargur Srihari (srihari@cedar.buffalo.edu)

Topics
- Conditional Random Fields
- Gibbs distribution and CRF
- Directed and undirected independencies
- View as a combination of BN and MN
- CRF for image segmentation
- CRF for text analytics
- Naïve Bayes and naïve Markov
- Learning the models

Conditional Distribution Representation
- Nodes correspond to $Y \cup X$: $Y$ are target variables and $X$ are observed variables
- Parameterized as an ordinary Markov network: a set of factors $\phi_1(D_1), \ldots, \phi_m(D_m)$
- Can be encoded as a log-linear model, or viewed simply as encoding a set of factors
- The model represents $P(Y \mid X)$ rather than $P(Y, X)$
- To naturally represent a conditional distribution, we avoid representing a probabilistic model over $X$: potentials involving only variables in $X$ are disallowed

Conditional Random Fields
- A Markov network encodes a joint distribution over $X$
- An MN can also be used to represent a conditional distribution $P(Y \mid X)$, where $Y$ is a set of target variables and $X$ is a set of observed variables
- This representation is called a CRF
- It has an analog in directed graphical models: conditional Bayesian networks

CRF Definition
- An undirected graph $H$ with nodes $X \cup Y$, annotated with a set of factors $\phi_1(D_1), \ldots, \phi_m(D_m)$ such that no scope $D_i$ lies entirely within $X$
- The network encodes a conditional distribution as
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \qquad \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \qquad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- $\tilde{P}(Y, X)$ is the unnormalized joint distribution, a product of factors
- $Z(X)$ is the partition function, which is now a function of $X$ (it sums $\tilde{P}$ over $Y$ only)
- Two variables in $H$ are connected by an edge whenever they appear together in the scope of a factor
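
To make the definition concrete, here is a minimal sketch (a toy example of my own, not from the slides) that builds $P(Y \mid X = x)$ for two binary target variables by multiplying assumed factor tables and normalizing over $Y$ alone, so that a different $Z(x)$ is induced for each observed assignment $x$:

```python
# Minimal sketch (assumed toy factors): a CRF conditional P(Y | X=x) is obtained
# by normalizing the product of factors over Y only, for each observed x.
import itertools
import numpy as np

# One pairwise factor phi(Y1, Y2) and two observation factors phi(Yi, Xi);
# all variables are binary. The table values are illustrative assumptions.
phi_pair = np.array([[4.0, 1.0], [1.0, 4.0]])   # favours Y1 == Y2
phi_obs  = np.array([[3.0, 1.0], [1.0, 3.0]])   # favours Yi == Xi

def unnormalized(y, x):
    """tilde P(y, x) = product of all factor values for assignment (y, x)."""
    y1, y2 = y
    x1, x2 = x
    return phi_pair[y1, y2] * phi_obs[y1, x1] * phi_obs[y2, x2]

def crf_conditional(x):
    """P(Y | X = x): normalize over assignments to Y only, so Z is a function of x."""
    scores = {y: unnormalized(y, x) for y in itertools.product([0, 1], repeat=2)}
    Z_x = sum(scores.values())                  # partition function Z(x)
    return {y: s / Z_x for y, s in scores.items()}

print(crf_conditional((1, 0)))   # a different Z(x) is induced for each x
```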

Deriving the CRF Definition
(1) Conditional distribution from Bayes' rule:
    $$P(Y \mid X) = \frac{P(Y, X)}{P(X)}$$
(2) From the definition of the Gibbs distribution, the numerator of (1) is
    $$P(Y, X) = \frac{1}{Z} \tilde{P}(Y, X), \quad \text{where } \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i) \text{ and } Z = \sum_{Y, X} \tilde{P}(Y, X)$$
(3) The denominator of (1) follows from the sum rule applied to the numerator:
    $$P(X) = \sum_{Y} P(Y, X) = \frac{1}{Z} \sum_{Y} \tilde{P}(Y, X)$$
(4) Combining (1)-(3) gives the CRF:
    $$P(Y \mid X) = \frac{\tilde{P}(Y, X)}{\sum_{Y} \tilde{P}(Y, X)} = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \text{where } Z(X) = \sum_{Y} \tilde{P}(Y, X)$$

Difference between CRF and Gibbs Distribution
- The difference is the normalization in the partition function $Z(X)$
- A Gibbs distribution factorizes into a set of factors and a single partition function $Z$:
  $$P_\Phi(X_1, \ldots, X_n) = \frac{1}{Z} \tilde{P}(X_1, \ldots, X_n), \quad \tilde{P}(X_1, \ldots, X_n) = \prod_{i=1}^{m} \phi_i(D_i), \quad Z = \sum_{X_1, \ldots, X_n} \tilde{P}(X_1, \ldots, X_n)$$
- A CRF induces a different value of $Z$ for every assignment $x$ to $X$; the summation is only over $Y$:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{m} \phi_i(D_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- In diagrams the difference is indicated by the feature variables being greyed out: $X$ is known (shown dark grey), and $Y$ has a distribution that depends on $X$
  (Figure: a network over $X_1, \ldots, X_5$ and $Y_1, \ldots, Y_5$ with the $X$ nodes greyed out.)

Example of a CRF
- CRF over $Y = \{Y_1, \ldots, Y_k\}$ and $X = \{X_1, \ldots, X_k\}$, with edges $Y_i - Y_{i+1}$ and $Y_i - X_i$:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- This is a linear chain-structured CRF, used for sequence labeling
- The observed feature variables $X_1, \ldots, X_5$ are assumed known when the model is used (hence greyed out in the figure)
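
The linear-chain factorization above can be illustrated with a brute-force sketch. The factor tables, chain length, and observation below are illustrative assumptions, and the sum over all $2^k$ label sequences is only feasible because $k$ is tiny:

```python
# Minimal sketch of the linear-chain factorization
# tilde P(Y, X) = prod_i phi(Y_i, Y_{i+1}) * prod_i phi(Y_i, X_i),
# with P(Y | X) obtained by brute-force summation over Y.
import itertools
import numpy as np

phi_trans = np.array([[2.0, 1.0], [1.0, 2.0]])  # phi(Y_i, Y_{i+1}), assumed values
phi_emit  = np.array([[5.0, 1.0], [1.0, 5.0]])  # phi(Y_i, X_i), assumed values

def tilde_p(y, x):
    """Unnormalized score of a label sequence y for observations x."""
    score = 1.0
    for i in range(len(y) - 1):
        score *= phi_trans[y[i], y[i + 1]]
    for yi, xi in zip(y, x):
        score *= phi_emit[yi, xi]
    return score

def label_posterior(x):
    """P(Y | X = x) over all 2^k binary label sequences."""
    ys = list(itertools.product([0, 1], repeat=len(x)))
    scores = np.array([tilde_p(y, x) for y in ys])
    probs = scores / scores.sum()               # divide by Z(x)
    return dict(zip(ys, probs))

post = label_posterior((1, 1, 0, 0, 1))
print(max(post, key=post.get))                  # most probable label sequence
```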

Main Strength of the CRF
- Avoids encoding a distribution over the variables in $X$
- Allows incorporating into the model a rich set of observed variables whose dependencies are complex or poorly understood
- Allows including continuous variables whose distributions may not have simple parametric forms
- Can incorporate domain knowledge: rich features without modeling their joint distribution

CRF for Image Segmentation
(Figure: original image of a cow on grass, with each superpixel a random variable; classification using node potentials alone; segmentation using a pairwise Markov network encoding.)
- Each image defines a probability distribution over the variables representing superpixel labels
- Rather than defining a joint distribution over pixel values, we define a conditional distribution over segment labels given the pixel values
- This avoids making a parametric assumption over the (continuous) pixel values
- We can use image-processing routines to define rich features, e.g., the presence or direction of an image gradient at a pixel; such features usually rely on multiple pixels, so defining a correct joint distribution or independence properties over the features is non-trivial

Directed and Undirected Dependencies
- A CRF defines a conditional distribution of $Y$ given $X$
- It can therefore be viewed as a partially directed graph, with an undirected component over $Y$ that has the variables in $X$ as parents

CRFs for Text Analysis
- An important use of the CRF framework
- Part-of-speech labeling
- Named entity tagging: people, places, organizations, etc.
- Extracting structured information from text, e.g., from a reference list: publications, titles, authors, journals, year
- These models share a similar structure

Named Entity (NE) Tagging
- Entities often span multiple words
- The type of an entity may not be apparent from individual words: "New York" is a location, "New York Times" is an organization
- For each word $X_i$, introduce a target variable $Y_i$ giving its entity type
- Outcomes for $Y_i$ (in BIO notation): B-PERSON, I-PERSON, B-LOCATION, I-LOCATION, B-ORGANIZATION, I-ORGANIZATION, OTHER
- B: beginning of entity, I: inside entity; B allows segmenting adjacent entities of the same type

CRF for NE Tagging
(Figure: linear-chain CRF for the word sequence "Mrs. Green spoke today in New York. Green chairs the finance committee" with labels B-PER I-PER OTH OTH OTH B-LOC I-LOC B-PER OTH OTH OTH OTH.)
- $X$ is the set of known variables (the words)
- There are two factors for each word: $\phi^1_t(Y_t, Y_{t+1})$ and $\phi^2_t(Y_t, X_1, \ldots, X_T)$
- $\phi^1_t(Y_t, Y_{t+1})$ represents the dependency between neighboring target variables
- $\phi^2_t(Y_t, X_1, \ldots, X_T)$ represents the dependency between target $Y_t$ and its context in the word sequence; it can depend on arbitrary features of the entire input word sequence $X_1, \ldots, X_T$ (a window of three words in the figure)

Linear-Chain CRF for NE
- The factor $\phi^2_t(Y_t, X_1, \ldots, X_T)$ represents the dependency between target $Y_t$ and its context in the word sequence
- It can depend on arbitrary features of the entire input word sequence $X_1, \ldots, X_T$
- Such factors are not encoded as table factors; log-linear models are used instead
- The factors are derived from feature functions such as $f_t(Y_t, X_t) = I\{Y_t = \text{B-ORGANIZATION}, X_t = \text{"Times"}\}$
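
The following sketch shows, under assumed feature functions and weights (the slides do not specify a concrete feature set), how a log-linear factor $\phi^2_t$ can be assembled from indicator features like the one above:

```python
# Sketch: building a log-linear factor from indicator feature functions, e.g.
# f_t(Y_t, X_t) = I{Y_t = B-ORGANIZATION, X_t = "Times"}.
# The feature list and weights below are illustrative assumptions.
import math

def f_is_times_org(y_t, words, t):
    return 1.0 if y_t == "B-ORGANIZATION" and words[t] == "Times" else 0.0

def f_capitalized_person(y_t, words, t):
    return 1.0 if y_t.endswith("PER") and words[t][0].isupper() else 0.0

features = [(f_is_times_org, 2.5), (f_capitalized_person, 1.2)]  # (feature, weight)

def phi2(y_t, words, t):
    """Log-linear factor phi^2_t(Y_t, X_1..X_T) = exp(sum_j w_j f_j(Y_t, X, t))."""
    return math.exp(sum(w * f(y_t, words, t) for f, w in features))

words = ["New", "York", "Times"]
print(phi2("B-ORGANIZATION", words, 2))   # larger than the score for OTHER below
print(phi2("OTHER", words, 2))
```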

Features for NE Tagging
- For word $X_i$: is capitalized, appears in a list of common person names, appears in an atlas of location names, ends with "ton", is exactly "York", is followed by "Times"
- For the word sequence: contains more than two sports-related terms, "New York" is a sports organization
- Hundreds or thousands of features, sparse (zero for most words)
- The same feature variable can be connected to multiple target variables: $Y_i$ depends on the identity of several words in a window

Performance of CRFs
- Linear-chain CRFs provide high per-token accuracies: high 90% range on many natural data sets
- High per-field precision and recall, where entire phrase categories and boundaries must be correct: 80-95% depending on the data set

Including Additional Information in NE
- The linear-chain graphical model is augmented: when a word occurs multiple times in a document, it should have the same label
- Include factors that connect identical words
- This results in the skip-chain CRF shown next

Skip-Chain CRF for NE Recognition
(Figure: skip-chain CRF for "Mrs. Green spoke today in New York. Green chairs the finance committee" with labels B-PER I-PER OTH OTH OTH B-LOC I-LOC B-PER OTH OTH OTH OTH, and a long-range factor connecting the two occurrences of "Green".)
- The first occurrence of "Green" has neighboring words that provide strong evidence that it is a person; the second occurrence is more ambiguous
- Augmenting the model with a long-range factor allows it to predict the second occurrence correctly
- The graphical structure over $Y$ can easily depend on the $X$s

Joint Inference: Part-of-Speech Labeling / Noun-Phrase Segmentation
- A pair of coupled linear-chain CRFs
- A noun phrase is composed of several words, and its segmentation depends on the POS tags and the words
(Figure: NP labels "B I O O O B I O B I I" and POS labels "ADJ N V IN V PRP N IN DT N N" over the sentence "British Airways rose after announcing its withdrawal from the UAL deal".)
Key: B = begin noun phrase, I = within noun phrase, O = not a noun phrase; N = noun, ADJ = adjective, V = verb, IN = preposition, PRP = possessive pronoun, DT = determiner (e.g., a, an, the)

Partially Directed Models
- Probabilistic graphical models are useful for complex systems: directed (Bayesian networks) and undirected (Markov networks)
- The two representations can be unified, incorporating both directed and undirected dependencies
- CRFs can be viewed as partially directed graphs
- CRFs can be generalized to chain graphs, which have subgraphs with chains: networks in which undirected components depend upon each other in a directed fashion

Directed and Undirected Dependencies
- A CRF defines a conditional distribution of $Y$ given $X$
- Thus it can be viewed as a partially directed graph

CRF as a Partially Directed Graph
- A CRF defines a conditional distribution of $Y$ given $X$
- A linear-chain CRF (undirected, over $X_1, \ldots, X_5$ and $Y_1, \ldots, Y_5$) can be viewed as a model with an undirected component over $Y$ that has the $X$ variables as parents
- The equivalent partially directed variant has factors defined over the $Y$s only, each of which has an $X$ value as a parent
- The two models are equivalent

CRF vs HMM
- Logistic CPDs (logistic regression) are the conditional analog of naïve Bayes
- The CRF is the conditional analog of the HMM

Models for Sequence Labeling
- HMM (generative; conditioning on the unknown $Y$):
  $$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i) \, P(Y_{i+1} \mid Y_i)$$
  Since the $Y_i$ are unknown, the joint distribution has to be estimated from data; determining $P(Y \mid X)$ depends on first determining $P(X, Y)$
- CRF (discriminative; conditioning on the known $X$): $P(Y \mid X)$ is obtained directly,
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- MEMM (discriminative, directed):
  $$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) \, P(Y_i \mid Y_{i-1})$$

CRF (Partially Directed) and MEMM
- Linear chain-structured CRF over $Y = \{Y_1, \ldots, Y_k\}$ and $X = \{X_1, \ldots, X_k\}$:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
  with an equivalent partially directed variant
- The fully directed version (a Bayesian network) is called a maximum entropy Markov model (MEMM); it is also a conditional model, but it is not equivalent:
  $$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) \, P(Y_i \mid Y_{i-1})$$
- In the MEMM, $Y_1$ is independent of $X_2$ if $Y_2$ is unknown; if $Y_2$ is known, there is a dependency due to the v-structure
- A sound conditional BN requires edges from all variables in $X$ to each $Y_i$
- In the CRF, the probability of $Y$ depends on the values of all variables $X = \{X_1, \ldots, X_k\}$
- The MEMM is more efficient, with fewer parameters to be learned

Models for Sequence Labeling
- A sequence of observations $X = \{X_1, \ldots, X_k\}$ needs a joint label $Y = \{Y_1, \ldots, Y_k\}$
- Both the CRF and the MEMM are discriminative models that directly obtain the conditional probability $P(Y \mid X)$; the HMM is a generative model that needs the joint probability $P(X, Y)$
- CRF:
  $$P(Y \mid X) = \frac{1}{Z(X)} \tilde{P}(Y, X), \quad \tilde{P}(Y, X) = \prod_{i=1}^{k-1} \phi_i(Y_i, Y_{i+1}) \prod_{i=1}^{k} \phi_i(Y_i, X_i), \quad Z(X) = \sum_{Y} \tilde{P}(Y, X)$$
- MEMM:
  $$P(Y \mid X) = \prod_{i=1}^{k} P(Y_i \mid X_i) \, P(Y_i \mid Y_{i-1})$$
  In the MEMM, $Y_1$ is independent of $X_2$ if we are not given $Y_2$: a later observation has no effect on the posterior probability of the current state. In activity recognition from a video sequence, where frames are labelled as running or walking, earlier frames may be blurry while later ones are clearer, so this matters
- HMM: needs the joint distribution
  $$P(X, Y) = \prod_{i=1}^{k} P(X_i \mid Y_i) \, P(Y_i \mid Y_{i-1}), \qquad P(Y \mid X) = \frac{P(X, Y)}{P(X)}$$
- The models have trade-offs in expressive power and learnability: the MEMM and HMM are more easily learned; as purely directed models their parameters can be computed in closed form using maximum likelihood. The CRF requires an iterative gradient-based approach, which is more expensive

CRF Example: Naïve Markov Model
- Binary-valued variables $X = \{X_1, \ldots, X_k\}$ and $Y$
- The $X_i$ are independent of each other and depend only on the class $Y$
- Pairwise potential between $Y$ and each $X_i$: $\phi_i(X_i, Y) = \exp\{w_i \, I\{X_i = 1, Y = 1\}\}$
- Single-node potential: $\phi_0(Y) = \exp\{w_0 \, I\{Y = 1\}\}$
- $I$ is the indicator function, which takes value 1 when its argument is true and 0 otherwise
- From the CRF definition, the unnormalized conditional is
  $$\tilde{P}(Y = 1 \mid x_1, \ldots, x_k) = \exp\Big\{w_0 + \sum_{i=1}^{k} w_i x_i\Big\}, \qquad \tilde{P}(Y = 0 \mid x_1, \ldots, x_k) = \exp\{0\} = 1$$
  which is equivalent to
  $$P(Y = 1 \mid x_1, \ldots, x_k) = \mathrm{sigmoid}\Big(w_0 + \sum_{i=1}^{k} w_i x_i\Big), \quad \text{where } \mathrm{sigmoid}(z) = \frac{e^z}{1 + e^z}$$
- This is a logistic CPD (logistic regression): not defined by a table but induced by parameters. It is efficient: linear (not exponential, as in a full BN) in the number of parents
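
A quick numerical check of this equivalence, with arbitrarily assumed weights $w_0, \ldots, w_k$ (a sketch, not part of the slides): normalizing the two unnormalized scores reproduces the sigmoid value exactly.

```python
# Sketch: normalizing the naive Markov potentials exp{w0 + sum_i w_i x_i} against
# exp{0} reproduces the logistic CPD. Weights and input are assumed values.
import numpy as np

def sigmoid(z):
    return np.exp(z) / (1.0 + np.exp(z))

w0, w = -1.0, np.array([2.0, -0.5, 1.5])      # assumed parameters
x = np.array([1, 0, 1])

score_y1 = np.exp(w0 + w @ x)                 # unnormalized tilde P(Y=1 | x)
score_y0 = np.exp(0.0)                        # unnormalized tilde P(Y=0 | x) = 1
p_y1 = score_y1 / (score_y1 + score_y0)

assert np.isclose(p_y1, sigmoid(w0 + w @ x))  # same value as the logistic CPD
print(p_y1)
```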

Naïve Markov and Naïve Bayes
- Binary variables $X = \{X_1, \ldots, X_k\}$ and $Y$
- Logistic regression is the conditional analog of the naïve Bayes classifier
- Logistic regression (discriminative model, parameters $w_0, w_1, \ldots, w_k$):
  $$P(Y = 1 \mid x_1, \ldots, x_k) = \mathrm{sigmoid}\Big(w_0 + \sum_{i=1}^{k} w_i x_i\Big)$$
- Naïve Bayes (generative model, CPDs $P(Y)$ and $P(X_i \mid Y)$):
  $$P(Y, X_1, \ldots, X_k) = P(Y) \prod_{i=1}^{k} P(X_i \mid Y), \qquad P(Y = 1 \mid X_1, \ldots, X_k) = \frac{P(Y = 1, X_1, \ldots, X_k)}{\sum_{y} P(Y = y, X_1, \ldots, X_k)}$$
- In the generative model we first have to obtain the $k$ CPDs conditioned on the unknown $Y$, from which we can then get the distribution conditioned on the known $X$

Logistic Regression Revisited
- Input $X$, target classes $Y = 0$ and $Y = 1$
- The posterior probability of $Y = 1$ is $P(Y = 1 \mid X) = y(X) = \sigma(w^T X)$, where $X$ is an $M$-dimensional feature vector and $\sigma(\cdot)$ is the logistic sigmoid function
- Goal: determine the $M$ parameters
- Known as logistic regression in statistics, although it is a model for classification rather than regression
- Properties of the logistic sigmoid (figure: plot of $\sigma(a)$ versus $a$):
  A. Symmetry: $\sigma(-a) = 1 - \sigma(a)$
  B. Inverse: $a = \ln(\sigma / (1 - \sigma))$, known as the logit; also known as the log odds, since it is the ratio $\ln[P(Y = 1 \mid x) / P(Y = 0 \mid x)]$
  C. Derivative: $d\sigma/da = \sigma(1 - \sigma)$

Determining Logistic Regression Parameters
- Maximum likelihood approach for two classes
- The data set consists of (input, target) pairs $(X_n, t_n)$, where $t_n \in \{0, 1\}$, $n = 1, \ldots, N$
- Since $t$ is binary, we can use the Bernoulli distribution for it: $p(t \mid w) = y^t (1 - y)^{1 - t}$, where $y = \sigma(w^T X)$
- The likelihood function associated with the $N$ observations is
  $$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
  where $\mathbf{t} = (t_1, \ldots, t_N)^T$ and $y_n = p(Y = 1 \mid X_n)$

Error Function for Logistic Regression
- Likelihood function:
  $$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
- The error function is the negative of the log-likelihood:
  $$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$$
- Known as the cross-entropy error function

Gradient of the Error Function
- Error function:
  $$E(w) = -\ln p(\mathbf{t} \mid w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}, \quad \text{where } y_n = \sigma(w^T X_n)$$
- Using the derivative of the logistic sigmoid, $d\sigma/da = \sigma(1 - \sigma)$, the gradient of the error function is
  $$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n \qquad \text{(error} \times \text{feature vector)}$$
- The contribution to the gradient from data point $n$ is the error between target $t_n$ and prediction $y_n = \sigma(w^T X_n)$, times the input $X_n$
- Analytical derivation: let $z = z_1 + z_2$ with $z_1 = t \ln \sigma(w^T X)$ and $z_2 = (1 - t)\ln[1 - \sigma(w^T X)]$. Then
  $$\frac{dz_1}{dw} = \frac{t \, \sigma(w^T X)[1 - \sigma(w^T X)]}{\sigma(w^T X)} X = t[1 - \sigma(w^T X)] X, \qquad \frac{dz_2}{dw} = -\frac{(1 - t)\, \sigma(w^T X)[1 - \sigma(w^T X)]}{1 - \sigma(w^T X)} X = -(1 - t)\, \sigma(w^T X) X$$
  so $-\dfrac{dz}{dw} = (\sigma(w^T X) - t)\, X = (y - t)\, X$
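
As a sanity check of the gradient formula, the following sketch (synthetic data and assumed shapes, not from the slides) compares $\sum_n (y_n - t_n) X_n$ against a finite-difference approximation of $E(w)$:

```python
# Sketch: verify grad E(w) = sum_n (y_n - t_n) X_n against finite differences
# on synthetic data (all values assumed).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))                  # N=20 inputs, M=3 features
t = (rng.random(20) < 0.5).astype(float)      # binary targets
w = rng.normal(size=3)

sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))

def E(w):
    y = sigmoid(X @ w)
    return -np.sum(t * np.log(y) + (1 - t) * np.log(1 - y))

grad = X.T @ (sigmoid(X @ w) - t)             # (y - t) times feature vectors

eps = 1e-6
num = np.array([(E(w + eps * e) - E(w - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(grad, num, atol=1e-4))      # True: formula matches finite differences
```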

Simple Sequential Algorithm
- There is no closed-form maximum likelihood solution for determining $w$
- Given the gradient of the error function $\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n$, solve using an iterative approach:
  $$w^{\tau + 1} = w^{\tau} - \eta \nabla E_n, \qquad \nabla E_n = (y_n - t_n) X_n \quad \text{(error} \times \text{feature vector)}$$
- The solution has severe over-fitting problems for linearly separable data, so the IRLS algorithm is used instead
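
A minimal sketch of this sequential update on synthetic data; the learning rate $\eta$, epoch count, and data-generating weights are arbitrary illustrative choices:

```python
# Sketch of the sequential update w <- w - eta * (y_n - t_n) X_n on synthetic data.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
true_w = np.array([1.5, -2.0, 0.5])           # assumed data-generating weights
t = (1.0 / (1.0 + np.exp(-(X @ true_w))) > rng.random(100)).astype(float)

w, eta = np.zeros(3), 0.1
for epoch in range(50):
    for n in rng.permutation(len(X)):         # visit data points in random order
        y_n = 1.0 / (1.0 + np.exp(-(w @ X[n])))
        w -= eta * (y_n - t[n]) * X[n]        # gradient of E_n = (y_n - t_n) X_n
print(w)                                      # should point roughly along true_w
```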

Multi-class Logistic Regression
- Work with the softmax function instead of the logistic sigmoid:
  $$p(Y = k \mid X) = y_k(X) = \frac{\exp(a_k)}{\sum_j \exp(a_j)}, \qquad a_k = w_k^T X$$
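
A small sketch of the softmax posterior for assumed weight vectors $w_k$; the max-subtraction line is a standard numerical-stability trick and does not change the probabilities:

```python
# Sketch of the softmax posterior p(Y=k|X) = exp(a_k) / sum_j exp(a_j), a_k = w_k^T X.
import numpy as np

def softmax_posterior(W, x):
    a = W @ x                                  # activations a_k = w_k^T x
    a -= a.max()                               # stabilizer; does not change the ratio
    e = np.exp(a)
    return e / e.sum()

W = np.array([[1.0, -0.5], [0.2, 0.3], [-1.0, 0.8]])   # K=3 classes, M=2 features (assumed)
x = np.array([0.4, 1.1])
p = softmax_posterior(W, x)
print(p, p.sum())                              # probabilities summing to 1
```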

Multi-class Likelihood Function
- 1-of-$K$ coding scheme: for feature vector $X_n$, the target vector $t_n$ for class $Y = k$ is a binary vector with all elements zero except element $k$
- Likelihood:
  $$p(T \mid w_1, \ldots, w_K) = \prod_{n=1}^{N} \prod_{k=1}^{K} p(C_k \mid X_n)^{t_{nk}} = \prod_{n=1}^{N} \prod_{k=1}^{K} y_{nk}^{t_{nk}}, \qquad y_{nk} = y_k(X_n)$$
- $T$ is an $N \times K$ matrix with elements $t_{nk}$

Multi-class Error Function
1. Error function: the negative log-likelihood
   $$E(w_1, \ldots, w_K) = -\ln p(T \mid w_1, \ldots, w_K) = -\sum_{n=1}^{N} \sum_{k=1}^{K} t_{nk} \ln y_{nk}$$
   known as the cross-entropy error function for multi-class
2. Gradient of the error function with respect to one parameter vector $w_j$:
   $$\nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} (y_{nj} - t_{nj}) X_n \qquad \text{(error} \times \text{feature vector)}$$
   using the derivatives of the softmax $y_k(X) = \exp(a_k)/\sum_j \exp(a_j)$, $a_k = w_k^T X$: $\partial y_k / \partial a_j = y_k (I_{kj} - y_j)$, where $I_{kj}$ are elements of the identity matrix
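
The per-class gradients can be stacked and computed in one matrix product; the sketch below uses synthetic data and 1-of-$K$ coded targets (all values assumed):

```python
# Sketch: multi-class gradient grad_{w_j} E = sum_n (y_nj - t_nj) X_n for all
# classes at once, as (Y - T)^T X, on synthetic data.
import numpy as np

rng = np.random.default_rng(2)
N, M, K = 30, 4, 3
X = rng.normal(size=(N, M))
T = np.eye(K)[rng.integers(0, K, size=N)]      # 1-of-K coded targets, N x K
W = rng.normal(size=(K, M))

A = X @ W.T                                    # activations, N x K
Y = np.exp(A - A.max(axis=1, keepdims=True))
Y /= Y.sum(axis=1, keepdims=True)              # softmax outputs y_nk

grad = (Y - T).T @ X                           # row j holds the gradient wrt w_j (K x M)
print(grad.shape)
```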

IRLS Algorithm for Multi-class
3. The Hessian matrix comprises blocks of size $M \times M$; block $j, k$ is given by
   $$\nabla_{w_k} \nabla_{w_j} E(w_1, \ldots, w_K) = \sum_{n=1}^{N} y_{nk} (I_{kj} - y_{nj}) X_n X_n^T$$
   The Hessian matrix is positive definite, therefore the error function has a unique minimum
4. A batch algorithm based on Newton-Raphson is used

Iterative Reweighted Least Squares (IRLS)
- Efficient approximation using Newton-Raphson iterative optimization:
  $$w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)$$
  where $H$ is the Hessian matrix, whose elements are the second derivatives of $E(w)$ with respect to the components of $w$
- Newton's method: since we are solving for a zero of the derivative of $E(w)$, we need the second derivative

IRLS Steps
- IRLS is applicable to both linear regression and logistic regression
- We discuss logistic regression, for which we need:
  1. The error function $E(w)$ (for logistic regression, from the Bernoulli likelihood function)
  2. The gradient $\nabla E(w)$
  3. The Hessian $H = \nabla \nabla E(w)$
  4. The Newton-Raphson update $w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)$

IRLS for Logistic Regression
- The posterior probability of class $Y = 1$ is $p(Y = 1 \mid X) = y(X) = \sigma(w^T X)$
- Likelihood function for the data set $\{X_n, t_n\}$, $t_n \in \{0, 1\}$:
  $$p(\mathbf{t} \mid w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$$
1. Error function: the negative log-likelihood yields the cross-entropy
   $$E(w) = -\sum_{n=1}^{N} \{t_n \ln y_n + (1 - t_n) \ln(1 - y_n)\}$$

IRLS for Logistic Regression
2. Gradient of the error function:
   $$\nabla E(w) = \sum_{n=1}^{N} (y_n - t_n) X_n = X^T (\mathbf{y} - \mathbf{t})$$
3. Hessian:
   $$H = \nabla \nabla E(w) = \sum_{n=1}^{N} y_n (1 - y_n) X_n X_n^T = X^T R X$$
   where $R$ is an $N \times N$ diagonal matrix with elements $R_{nn} = y_n(1 - y_n) = \sigma(w^T X_n)[1 - \sigma(w^T X_n)]$
- The Hessian is not constant: it depends on $w$ through $R$
- Since $H$ is positive definite (i.e., $u^T H u > 0$ for arbitrary $u$), the error function is a convex function of $w$ and so has a unique minimum
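
A short sketch (synthetic data, assumed weights) of forming $H = X^T R X$ and checking that it is positive definite for that data set:

```python
# Sketch: the Hessian H = X^T R X with R_nn = y_n (1 - y_n), plus a positive-
# definiteness check on synthetic data.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(50, 3))
w = rng.normal(size=3)

y = 1.0 / (1.0 + np.exp(-(X @ w)))
R = np.diag(y * (1.0 - y))                     # N x N diagonal weighting matrix
H = X.T @ R @ X                                # M x M Hessian

print(np.all(np.linalg.eigvalsh(H) > 0))       # True: positive definite here
```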

IRLS for Logistic Regression
4. Newton-Raphson update:
   $$w^{(\text{new})} = w^{(\text{old})} - H^{-1} \nabla E(w)$$
   Substituting $H = X^T R X$ and $\nabla E(w) = X^T(\mathbf{y} - \mathbf{t})$:
   $$w^{(\text{new})} = w^{(\text{old})} - (X^T R X)^{-1} X^T (\mathbf{y} - \mathbf{t}) = (X^T R X)^{-1}\{X^T R X\, w^{(\text{old})} - X^T (\mathbf{y} - \mathbf{t})\} = (X^T R X)^{-1} X^T R\, \mathbf{z}$$
   where $\mathbf{z}$ is an $N$-dimensional vector with $\mathbf{z} = X w^{(\text{old})} - R^{-1}(\mathbf{y} - \mathbf{t})$
- The update formula is a set of weighted normal equations
- Since the Hessian depends on $w$, apply the equations iteratively, each time using the new weight vector
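
A minimal IRLS sketch on synthetic data, written as the equivalent Newton step $w - H^{-1}\nabla E(w)$ rather than by explicitly forming $\mathbf{z}$ (algebraically the same as the weighted normal equations above); data, initialization, and iteration count are assumptions:

```python
# Sketch of IRLS for logistic regression: iterate the Newton step
# w <- w - (X^T R X)^{-1} X^T (y - t), which equals (X^T R X)^{-1} X^T R z.
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 3))
true_w = np.array([2.0, -1.0, 0.5])            # assumed data-generating weights
t = (1.0 / (1.0 + np.exp(-(X @ true_w))) > rng.random(200)).astype(float)

w = np.zeros(3)
for _ in range(10):
    y = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (y - t)                       # X^T (y - t)
    R = y * (1.0 - y)                          # diagonal of R as a vector
    H = X.T @ (R[:, None] * X)                 # X^T R X without forming an N x N matrix
    w = w - np.linalg.solve(H, grad)           # Newton-Raphson / IRLS step
print(w)                                       # close to the maximum likelihood solution
```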