Conditional Random Field

Corso di Elaborazione del Linguaggio Naturale (Natural Language Processing course), Pisa, May 2011

Outline: 1. Introduction, 2. Linear-Chain, 3. General, 4. Specific, 5. Implementations, 6. Conclusions

Introduction: CRFs are probabilistic graphical models; they are discriminative models; and they are the state of the art for tasks such as sequence labeling.

Graphical Models: A graphical model is a family of probability distributions that factorize according to an underlying graph. Graphs provide a simple way to visualize the structure of a probabilistic model, and by inspecting the graph we can gain insight into the properties of the model. There are three equivalent types of model: directed graphical models (Bayesian networks), undirected graphical models (Markov random fields), and factor graphs (a generalization of the first two).

Graphical Models (cont'd): [Figure: chain-structured graphical models; a short chain with labels $y_1, y_2, y_3$ over observations $x_1, x_2, x_3$, and a longer chain with labels $y_{t-1}, y_t, y_{t+1}, \ldots, y_{t+12}$ over the corresponding observations $x_{t-1}, x_t, x_{t+1}, \ldots, x_{t+12}$.]

Generative Models (Ng and Jordan, 2002): Generative classifiers learn a model of the joint probability $p(x, y)$ of the inputs $x$ and the label $y$, and make their predictions by using Bayes' rule to calculate $p(y \mid x)$ and then picking the most likely label $y$. It is very hard to model the dependencies between different features of the observed variables; in Naive Bayes classifiers we assume conditional independence between the features.
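
As a toy illustration of the generative recipe described above (not from the slides), here is a minimal sketch, assuming a hypothetical two-class, two-feature text problem: it builds the joint p(x, y) under the Naive Bayes independence assumption and applies Bayes' rule to obtain p(y | x).

```python
# Toy generative classifier: model p(x, y) with Naive Bayes, predict via Bayes' rule.
# All numbers below are made-up illustrative parameters, not from the slides.

priors = {"pos": 0.6, "neg": 0.4}                      # p(y)
likelihoods = {                                        # p(x_k | y), assumed independent
    "pos": {"contains_good": 0.7, "contains_bad": 0.1},
    "neg": {"contains_good": 0.2, "contains_bad": 0.6},
}

def joint(x, y):
    """p(x, y) = p(y) * prod_k p(x_k | y) under the Naive Bayes assumption."""
    p = priors[y]
    for feat, on in x.items():
        p_feat = likelihoods[y][feat]
        p *= p_feat if on else (1.0 - p_feat)
    return p

def posterior(x):
    """Bayes' rule: p(y | x) = p(x, y) / sum_y' p(x, y')."""
    z = sum(joint(x, y) for y in priors)
    return {y: joint(x, y) / z for y in priors}

x = {"contains_good": True, "contains_bad": False}
print(posterior(x))   # the prediction is the argmax of this distribution
```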

Discriminative Models: Discriminative classifiers model the posterior $p(y \mid x)$ directly. They do not need to model the distribution of the observed variables (features).

Linear-Chain CRF (Lafferty et al., 2001):

$$P(y \mid x; \theta) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$

$T$ is the number of tokens in the sequence. $K$ is the number of features we use in our model. $\theta$ is a parameter vector we have to estimate in order to obtain a model that fits the data. $Z(x)$ is an instance-specific normalization function:

$$Z(x) = \sum_{y} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$
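
To make the definition concrete, here is a brute-force sketch (not from the slides; the label set, feature names, and weights are made up): it scores every possible label sequence for a tiny input, computes Z(x) by explicit enumeration, and normalizes. Real implementations replace the enumeration with the forward-backward algorithm presented later.

```python
import itertools
import math

# Hypothetical toy setup: 2 labels, 3 tokens, hand-picked weights.
LABELS = ["O", "NN"]
START = "<s>"

def features(y_prev, y, x_t):
    """Names of the active features f_k(y_{t-1}, y_t; x_t) at one position."""
    return [f"trans:{y_prev}->{y}", f"emit:{y}:{x_t}"]

# theta: weights for the features above (made-up values for illustration).
theta = {"trans:O->NN": 0.5, "trans:<s>->O": 0.2,
         "emit:NN:John": 1.5, "emit:O:saw": 0.8}

def score(y_seq, x_seq):
    """sum_t sum_k theta_k f_k(y_{t-1}, y_t; x_t) for one candidate labeling."""
    s, y_prev = 0.0, START
    for y, x_t in zip(y_seq, x_seq):
        s += sum(theta.get(f, 0.0) for f in features(y_prev, y, x_t))
        y_prev = y
    return s

def prob(y_seq, x_seq):
    """P(y | x; theta) = exp(score) / Z(x), with Z(x) by brute-force enumeration."""
    Z = sum(math.exp(score(cand, x_seq))
            for cand in itertools.product(LABELS, repeat=len(x_seq)))
    return math.exp(score(y_seq, x_seq)) / Z

x = ["John", "saw", "Mary"]
print(prob(("NN", "O", "NN"), x))
```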

Feature Functions:

$$P(y \mid x; \theta) = \frac{1}{Z(x)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$

$f_k$ is one of the $K$ feature functions:

$f_{ij}(y, y', x_t) = 1_{\{y=i\}} 1_{\{y'=j\}}$ for each transition from state $i$ to state $j$.
$f_{io}(y, y', x_t) = 1_{\{y=i\}} 1_{\{x_t=o\}}$ for each state-observation pair $(i, o)$.

$1_{\{y=i\}}$ is an indicator function which returns 1 if $y = i$ and 0 elsewhere. $x_t$ is the feature vector at time $t$; it contains all the components that are needed for computing features at time $t$. If we want to use $x_{t+1}$ (the next word) as a feature, then the feature vector $x_t$ is assumed to include $x_{t+1}$.
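
A small sketch of how these two indicator families can be enumerated programmatically (hypothetical label set and vocabulary, not from the slides); each feature returns 1 only when its indicator conditions hold.

```python
# Enumerate the two standard indicator-feature families of a linear-chain CRF:
# transition features f_ij and state-observation features f_io.
LABELS = ["O", "NN", "VB"]
VOCAB = ["John", "saw", "Mary"]

def make_transition_feature(i, j):
    # f_ij(y, y', x_t) = 1{y = i} * 1{y' = j}
    return lambda y, y_prime, x_t: 1 if (y == i and y_prime == j) else 0

def make_emission_feature(i, o):
    # f_io(y, y', x_t) = 1{y = i} * 1{x_t = o}
    return lambda y, y_prime, x_t: 1 if (y == i and x_t == o) else 0

feature_functions = (
    [make_transition_feature(i, j) for i in LABELS for j in LABELS] +
    [make_emission_feature(i, o) for i in LABELS for o in VOCAB]
)

# Evaluate every feature at one position: current label y, previous label y',
# and the observation x_t; only the matching indicators return 1.
active = [k for k, f in enumerate(feature_functions) if f("NN", "O", "John")]
print(active)
```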

Feature Functions (cont'd): [Figure: a classical linear-chain CRF with labels $y_1, y_2, y_3$ over observations $x_1, x_2, x_3$.] We can modify and extend this model as we want, for example by combining transition and observation indicators:

$$f_{ijo}(y, y', x_t) = 1_{\{y=i\}} 1_{\{y'=j\}} 1_{\{x_t=o\}}$$

e.g.

$$f(y_i, y_{i-1}, x_i) = \begin{cases} 1 & \text{if } x_i = \text{John},\ y_{i-1} = \text{O},\ y_i = \text{NN} \\ 0 & \text{otherwise} \end{cases}$$

Training: To estimate the parameters $\theta$ we maximize the conditional log-likelihood

$$\ell(\theta) = \sum_{i=1}^{N} \log p(y^{(i)} \mid x^{(i)})$$

Substituting the CRF model into the likelihood and adding a regularization term with parameter $\frac{1}{2\sigma^2}$ (a Gaussian prior on the weights) we obtain:

$$\ell(\theta) = \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \log Z(x^{(i)}) - \sum_{k=1}^{K} \frac{\theta_k^2}{2\sigma^2}$$

In general this function cannot be maximized in closed form. The partial derivatives are:

$$\frac{\partial \ell}{\partial \theta_k} = \sum_{i=1}^{N} \sum_{t=1}^{T} f_k(y_t^{(i)}, y_{t-1}^{(i)}, x_t^{(i)}) - \sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', x_t^{(i)})\, p(y, y' \mid x^{(i)}) - \frac{\theta_k}{\sigma^2}$$

where $p(y, y' \mid x^{(i)})$ is the pairwise marginal of the labels at positions $t$ and $t-1$.
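
A brute-force sketch of the regularized objective on a toy example (hypothetical data, features, and σ², not from the slides): it computes log Z(x) by enumerating all label sequences, so it is only meant to make the formula concrete.

```python
import itertools
import math

LABELS = ["O", "NN"]
SIGMA2 = 10.0  # regularization parameter sigma^2 (assumed value)

def feats(y_prev, y, x_t):
    return [f"trans:{y_prev}->{y}", f"emit:{y}:{x_t}"]

def seq_score(theta, y_seq, x_seq):
    """sum_t sum_k theta_k f_k(y_t, y_{t-1}, x_t) for one labeled sequence."""
    s, y_prev = 0.0, "<s>"
    for y, x_t in zip(y_seq, x_seq):
        s += sum(theta.get(f, 0.0) for f in feats(y_prev, y, x_t))
        y_prev = y
    return s

def log_Z(theta, x_seq):
    return math.log(sum(math.exp(seq_score(theta, c, x_seq))
                        for c in itertools.product(LABELS, repeat=len(x_seq))))

def regularized_loglik(theta, data):
    """l(theta) = sum_i [score(y_i, x_i) - log Z(x_i)] - sum_k theta_k^2 / (2 sigma^2)."""
    ll = sum(seq_score(theta, y, x) - log_Z(theta, x) for x, y in data)
    return ll - sum(v * v for v in theta.values()) / (2 * SIGMA2)

data = [(["John", "saw"], ("NN", "O"))]  # one toy training pair (x, y)
theta = {"emit:NN:John": 1.0, "emit:O:saw": 0.5}
print(regularized_loglik(theta, data))
```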

Training (cont'd): The function $\ell(\theta)$ is concave, so every local optimum is a global optimum. The maximization of the log-likelihood can be carried out with several numerical algorithms: steepest-ascent gradient methods, iterative scaling, Newton's method, and quasi-Newton methods such as BFGS and limited-memory BFGS.
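
The slides do not prescribe a particular implementation; as one plausible sketch, a limited-memory BFGS optimizer can be driven through scipy.optimize.minimize. Here neg_loglik is a stand-in convex function playing the role of the negative regularized log-likelihood; in a real trainer it would wrap the CRF objective and gradient from the previous slide.

```python
import numpy as np
from scipy.optimize import minimize

# Stand-in for the negative regularized conditional log-likelihood -l(theta).
# Because l(theta) is concave, -l(theta) is convex and L-BFGS converges to the
# global optimum; a simple convex quadratic is used here just to show the call.
def neg_loglik(theta):
    return 0.5 * np.sum((theta - np.array([1.0, -2.0, 0.5])) ** 2)

def neg_loglik_grad(theta):
    return theta - np.array([1.0, -2.0, 0.5])

theta0 = np.zeros(3)                            # initial parameter vector
result = minimize(neg_loglik, theta0,
                  jac=neg_loglik_grad,          # analytic gradient
                  method="L-BFGS-B")            # limited-memory BFGS
print(result.x)                                 # fitted parameters
```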

Forward-Backward Algorithm: To solve two main problems in the CRF scenario, evaluating $Z(x)$ and computing the marginal distributions needed in the gradient, we use two algorithms belonging to the same class, the forward-backward class.

$$Z(x) = \sum_{y} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t) \right)$$

We define a function $g$ as:

$$g_t(y_{t-1}, y_t) = \sum_{k=1}^{K} \theta_k f_k(y_{t-1}, y_t; x_t)$$

Given that $\theta$ and $x$ are fixed, $g_t$ takes only $y_{t-1}$ and $y_t$ as parameters.

Forward-Backward Algorithm (cont'd) (Wallach, 2004):

$$Z(x) = \sum_{y} \prod_{t=1}^{T} \exp\left(g_t(y_{t-1}, y_t)\right)$$

For each $t$ from 1 to $T+1$ we define a $(|Y|+2) \times (|Y|+2)$ matrix called the transition matrix:

$$M_t(u, v) = \exp\left(g_t(u, v)\right)$$

$M_1(u, v)$ is defined only for $u = \text{START}$, and $M_{T+1}(u, v)$ is defined only for $v = \text{END}$. Given this, the only thing we have to do is multiply all the matrices, i.e. $M_1 M_2$, then $M_{1,2} M_3$, and so on, and take the $(\text{START}, \text{END})$ entry of the resulting matrix.
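
A numerical sketch of the matrix-product construction (not from the slides; the potentials g_t are random stand-ins for θ·f): it builds the (|Y|+2)×(|Y|+2) transition matrices, multiplies them, reads Z(x) from the (START, END) entry, and checks the result against brute-force enumeration.

```python
import itertools
import numpy as np

# Hypothetical toy setup: two labels plus START/END sentinel states.
LABELS = ["A", "B"]
STATES = ["START"] + LABELS + ["END"]          # |Y| + 2 states
T = 3                                          # number of tokens
rng = np.random.default_rng(0)

# g[t][(u, v)]: stand-in values for g_t(u, v) = sum_k theta_k f_k(u, v; x_t).
g = {t: {(u, v): rng.normal() for u in STATES for v in STATES}
     for t in range(1, T + 2)}

def transition_matrix(t):
    """M_t(u, v) = exp(g_t(u, v)), with START only at t=1 and END only at t=T+1."""
    M = np.zeros((len(STATES), len(STATES)))
    for i, u in enumerate(STATES):
        for j, v in enumerate(STATES):
            u_ok = (u == "START") if t == 1 else (u in LABELS)
            v_ok = (v == "END") if t == T + 1 else (v in LABELS)
            if u_ok and v_ok:
                M[i, j] = np.exp(g[t][(u, v)])
    return M

# Z(x) as the (START, END) entry of the product M_1 M_2 ... M_{T+1}.
prod = np.eye(len(STATES))
for t in range(1, T + 2):
    prod = prod @ transition_matrix(t)
Z_matrix = prod[STATES.index("START"), STATES.index("END")]

# Brute-force check: sum over all label sequences, including START/END transitions.
Z_brute = 0.0
for y in itertools.product(LABELS, repeat=T):
    path = ("START",) + y + ("END",)
    Z_brute += np.exp(sum(g[t][(path[t - 1], path[t])] for t in range(1, T + 2)))
print(Z_matrix, Z_brute)   # the two values agree (up to floating-point error)
```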

Forward-Backward Algorithm (cont'd): In order to infer the marginals in the gradient computation, we use the forward-backward algorithm as in HMMs. We have to compute the expectation of $f_k$ under the model distribution:

$$\sum_{i=1}^{N} \sum_{t=1}^{T} \sum_{y, y'} f_k(y, y', x_t^{(i)})\, p(y, y' \mid x^{(i)})$$

Base case:

$$\alpha_0(y \mid x) = \begin{cases} 1, & \text{if } y = \text{START} \\ 0, & \text{otherwise} \end{cases} \qquad \beta_{T+1}(y \mid x) = \begin{cases} 1, & \text{if } y = \text{END} \\ 0, & \text{otherwise} \end{cases}$$

Forward-Backward Algorithm (cont'd): Recurrence relations:

$$\alpha_t(x)^{\top} = \alpha_{t-1}(x)^{\top} M_t(x) \qquad \beta_t(x) = M_{t+1}(x)\, \beta_{t+1}(x)$$

so that we can write:

$$p(y_{t-1} = y', y_t = y \mid x) = \frac{\alpha_{t-1}(y' \mid x)\, M_t(y', y \mid x)\, \beta_t(y \mid x)}{Z(x)}$$
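
A self-contained sketch of these recurrences (again with random stand-in potentials, not from the slides): α and β are propagated through the transition matrices, Z(x) is read from α_{T+1}(END), and the pairwise marginal at one position is checked to sum to 1. In practice the recursions are carried out in log space (or with per-step scaling) to avoid underflow on long sequences.

```python
import numpy as np

LABELS, T = ["A", "B"], 4
STATES = ["START"] + LABELS + ["END"]
S = len(STATES)
rng = np.random.default_rng(1)

def transition_matrix(t):
    """M_t(u, v) = exp(g_t(u, v)) with START/END allowed only at the boundaries."""
    M = np.zeros((S, S))
    for i, u in enumerate(STATES):
        for j, v in enumerate(STATES):
            u_ok = (u == "START") if t == 1 else (u in LABELS)
            v_ok = (v == "END") if t == T + 1 else (v in LABELS)
            if u_ok and v_ok:
                M[i, j] = np.exp(rng.normal())   # stand-in for exp(theta . f)
    return M

M = {t: transition_matrix(t) for t in range(1, T + 2)}

# Forward: alpha_0 puts all mass on START, then alpha_t^T = alpha_{t-1}^T M_t.
alpha = {0: np.eye(S)[STATES.index("START")]}
for t in range(1, T + 2):
    alpha[t] = alpha[t - 1] @ M[t]

# Backward: beta_{T+1} puts all mass on END, then beta_t = M_{t+1} beta_{t+1}.
beta = {T + 1: np.eye(S)[STATES.index("END")]}
for t in range(T, -1, -1):
    beta[t] = M[t + 1] @ beta[t + 1]

Z = alpha[T + 1][STATES.index("END")]           # equals beta[0][START] as well

# Pairwise marginal p(y_{t-1}=u, y_t=v | x) = alpha_{t-1}(u) M_t(u,v) beta_t(v) / Z.
t = 2
marginal = np.outer(alpha[t - 1], beta[t]) * M[t] / Z
print(np.isclose(marginal.sum(), 1.0))          # marginals sum to 1 at position t
```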

Viterbi Algorithm: To determine the most probable sequence of labels $y_1, \ldots, y_T$ we use the Viterbi algorithm:

$$y^{*} = \arg\max_{y} \sum_{t=1}^{T} g_t(y_{t-1}, y_t)$$

We define:

$$\text{SCORE}(y_1, \ldots, y_p) = \sum_{t=1}^{p} g_t(y_{t-1}, y_t)$$

$U(p)$ = score of the best sequence $y_1, \ldots, y_p$.
$U(p, v)$ = score of the best sequence $y_1, \ldots, y_p$ where $y_p = v$.

Viterbi Algorithm (cont'd): Formally:

$$U(p, v) = \max_{y_1, \ldots, y_{p-1}} \left[ \sum_{t=1}^{p-1} g_t(y_{t-1}, y_t) + g_p(y_{p-1}, v) \right]$$

Base case:

$$U(0, v) = \begin{cases} 0, & \text{if } v = \text{START} \\ -\infty, & \text{otherwise} \end{cases}$$

We can proceed recursively in this manner:

$$U(p, v) = \max_{y_{p-1}} \left[ U(p-1, y_{p-1}) + g_p(y_{p-1}, v) \right] \quad \forall v$$

Our goal is to find $U(T+1, \text{END})$.
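
A self-contained sketch of the U(p, v) recursion with backpointers (random stand-in potentials, not from the slides); it returns both U(T+1, END) and the arg-max label sequence.

```python
import math
import random

# Hypothetical toy potentials g_t(u, v), standing in for sum_k theta_k f_k(u, v; x_t).
LABELS, T = ["A", "B", "C"], 5
STATES = ["START"] + LABELS + ["END"]
random.seed(0)
g = {t: {(u, v): random.gauss(0, 1) for u in STATES for v in STATES}
     for t in range(1, T + 2)}

# Base case: U(0, v) = 0 if v = START, -infinity otherwise.
U = [{v: (0.0 if v == "START" else -math.inf) for v in STATES}]
backptr = []

# Recursion: U(p, v) = max_{y_{p-1}} [ U(p-1, y_{p-1}) + g_p(y_{p-1}, v) ].
for p in range(1, T + 2):
    allowed = ["END"] if p == T + 1 else LABELS      # labels allowed at position p
    U_p, bp_p = {v: -math.inf for v in STATES}, {}
    for v in allowed:
        best_prev = max(STATES, key=lambda u: U[p - 1][u] + g[p][(u, v)])
        U_p[v] = U[p - 1][best_prev] + g[p][(best_prev, v)]
        bp_p[v] = best_prev
    U.append(U_p)
    backptr.append(bp_p)

print("U(T+1, END) =", U[T + 1]["END"])              # score of the best sequence

# Follow the backpointers from END to recover the arg-max labeling y*_1 .. y*_T.
path, v = [], "END"
for p in range(T + 1, 0, -1):
    v = backptr[p - 1][v]
    path.append(v)
best_labeling = list(reversed(path))[1:]             # drop the START sentinel
print("y* =", best_labeling)
```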

General CRF:

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, y_c; \theta_p)$$

where each factor is parametrized as:

$$\Psi_c(x_c, y_c; \theta_p) = \exp\left( \sum_{k=1}^{K(p)} \theta_{pk} f_{pk}(x_c, y_c) \right)$$

$$Z(x) = \sum_{y} \prod_{C_p \in \mathcal{C}} \prod_{\Psi_c \in C_p} \Psi_c(x_c, y_c; \theta_p)$$

$\mathcal{C} = \{C_1, \ldots, C_P\}$, where each $C_p$ is a clique template whose parameters are tied.
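
A brute-force sketch of the clique-template view (hypothetical graph, features, and weights, not from the slides): one template over single-position cliques and one over adjacent-pair cliques, each with parameters tied across all the cliques it covers, and Z(x) computed by enumeration.

```python
import itertools
import math

LABELS = ["O", "I"]
x = ["the", "cat", "sat"]                  # observations
node_cliques = [(0,), (1,), (2,)]          # clique template C_1: single positions
edge_cliques = [(0, 1), (1, 2)]            # clique template C_2: adjacent pairs

# Tied parameters: one weight vector per template, shared by all of its cliques.
theta_node = {("I", "cat"): 1.2, ("O", "the"): 0.7}
theta_edge = {("O", "I"): 0.9, ("I", "O"): 0.3}

def psi_node(c, y):
    """Psi_c for the node template: exp(sum_k theta_{1k} f_{1k}(x_c, y_c))."""
    (i,) = c
    return math.exp(theta_node.get((y[i], x[i]), 0.0))

def psi_edge(c, y):
    """Psi_c for the edge template: exp(sum_k theta_{2k} f_{2k}(x_c, y_c))."""
    i, j = c
    return math.exp(theta_edge.get((y[i], y[j]), 0.0))

def unnormalized(y):
    """Product over clique templates and over the cliques of each template."""
    p = 1.0
    for c in node_cliques:
        p *= psi_node(c, y)
    for c in edge_cliques:
        p *= psi_edge(c, y)
    return p

Z = sum(unnormalized(y) for y in itertools.product(LABELS, repeat=len(x)))
y = ("O", "I", "O")
print(unnormalized(y) / Z)                 # p(y | x)
```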

Important Tasks: As a versatile learning framework, CRFs can be employed in several natural language tasks: sequence labeling, e.g. part-of-speech tagging (linear-chain); information extraction, e.g. named entity recognition (skip-chain, semi-Markov); and text classification (multilabel CRF).

Skip-Chain CRF (Sutton and McCallum, 2004): [Figure: a linear chain over labels $y_{t-1}, y_t, y_{t+1}, \ldots, y_{t+12}$ and observations $x_{t-1}, x_t, \ldots, x_{t+12}$, with additional long-distance (skip) edges connecting distant label nodes.]

$$p(y \mid x) = \frac{1}{Z(x)} \prod_{t=1}^{T} \Psi_t(y_t, y_{t-1}, x) \prod_{(u,v) \in \mathcal{I}} \Psi_{uv}(y_u, y_v, x)$$

$$\Psi_t(y_t, y_{t-1}, x) = \exp\left( \sum_{k} \theta_{1k} f_{1k}(y_t, y_{t-1}, x_t) \right) \qquad \Psi_{uv}(y_u, y_v, x) = \exp\left( \sum_{k} \theta_{2k} f_{2k}(y_u, y_v, x, u, v) \right)$$
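
A brute-force sketch of the skip-chain factorization (hypothetical potentials and skip-edge set, not from the slides): identical capitalized tokens are linked by a skip edge whose factor rewards giving them the same label; the transition factor here ignores the previous label purely to keep the toy short.

```python
import itertools
import math

LABELS = ["O", "PER"]
x = ["Senator", "John", "Green", "said", "Green", "will", "run"]

# Skip edges I: pairs of positions holding the same capitalized word.
skip_edges = [(u, v) for u in range(len(x)) for v in range(u + 1, len(x))
              if x[u] == x[v] and x[u][0].isupper()]

def log_psi_t(y_t, y_prev, x_t):
    """Linear-chain factor log Psi_t: reward PER on capitalized tokens (y_prev unused in this toy)."""
    return 1.0 if (y_t == "PER" and x_t[0].isupper()) else 0.0

def log_psi_uv(y_u, y_v):
    """Skip-edge factor log Psi_uv: reward identical labels on the linked tokens."""
    return 2.0 if y_u == y_v else 0.0

def log_score(y):
    s = sum(log_psi_t(y[t], y[t - 1] if t > 0 else "O", x[t])
            for t in range(len(x)))
    s += sum(log_psi_uv(y[u], y[v]) for u, v in skip_edges)
    return s

Z = sum(math.exp(log_score(y)) for y in itertools.product(LABELS, repeat=len(x)))
best = max(itertools.product(LABELS, repeat=len(x)), key=log_score)
print(best, math.exp(log_score(best)) / Z)   # most probable labeling and its p(y | x)
```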

Semi-Markov CRF (Sarawagi and Cohen, 2005):

$$p(s \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{j=1}^{|s|} \sum_{k=1}^{K} \theta_k g_k(y_j, y_{j-1}, x, t_j, u_j) \right)$$

$s = (s_1, \ldots, s_p)$ denotes a segmentation of $x$, where $s_j = (t_j, u_j, y_j)$: $t_j$ is a start position, $u_j$ is an end position, and $y_j$ is the label of that segment. Semi-Markov CRFs employ a modified version of the Viterbi algorithm that takes the segment model into account.

Multilabel CRF (Ghamrawi and McCallum, 2005):

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{k} \theta_k f_k(x, y) + \sum_{k'} \theta_{k'} f_{k'}(y) \right)$$

$k \in \{ \langle v_i, y_j \rangle : 1 \le i \le |V|,\ 1 \le j \le |Y| \}$
$k' \in \{ \langle y_i, y_j, q \rangle : q \in \{0, 1, 2, 3\},\ 1 \le i, j \le |Y| \}$

$y$ is a subset of the label set $Y$, represented by a binary vector of length $|Y|$. [Figure: label nodes $y_1, y_2$ connected to each other and to observations $x_1, x_2, x_3$.]

Multilabel CRF (cont'd):

$$p(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{k} \theta_k f_k(x, y) + \sum_{k'} \theta_{k'} f_{k'}(x, y) \right)$$

$k \in \{ \langle v_i, y_j \rangle : 1 \le i \le |V|,\ 1 \le j \le |Y| \}$
$k' \in \{ \langle v_i, y_j, y_{j'} \rangle : 1 \le i \le |V|,\ 1 \le j, j' \le |Y| \}$

[Figure: label nodes $y_1, y_2$ connected to observations $x_1, x_2, x_3$.]

Implementations:
CRF++ by Taku Kudo, http://crfpp.sourceforge.net/
CRF Project by Sunita Sarawagi, http://crf.sourceforge.net/
Stanford CRF (NER), http://nlp.stanford.edu/software/crf-ner.shtml
Mallet by Andrew McCallum, http://mallet.cs.umass.edu/index.php

Conclusions: CRFs combine the best of HMMs and maximum entropy (ME) models; they are the state of the art in many NLP tasks; and they offer a framework as rich as that of HMMs.

Questions? Thanks for your attention!