
HMM, MEMM and CRF. 40-957 Special Topics in Artificial Intelligence: Probabilistic Graphical Models. Sharif University of Technology, Soleymani, Spring 2014

Sequence labeling: taking collectively a set of interrelated instances x_1, …, x_T and jointly labeling them. We get as input a sequence of observations X = x_{1:T} and need to label it with a joint labeling Y = y_{1:T}.

Generalization of mixture models to sequential data [Jordan]. Y: states (latent variables); X: observations. [Figure: a mixture model with latent variable Z generating observation X, unrolled over time into the chain Y_1 → Y_2 → … → Y_T with emissions X_1, X_2, …, X_T.]

HMM examples. Part-of-speech tagging: assigning POS tags to a sentence (e.g., "Students are expected to study"). Speech recognition.

HMM: probabilistic model. Transition probabilities between states: A_ij = P(Y_t = j | Y_{t-1} = i). Initial state distribution (start probabilities of the different states): π_i = P(Y_1 = i). Observation model: emission probabilities associated with each state, P(X_t | Y_t, Φ).

HMM: probabilistic model. Transition probabilities: P(Y_t | Y_{t-1} = i) ~ Multi(A_i1, …, A_iM) for each state i. Initial state distribution: P(Y_1) ~ Multi(π_1, …, π_M). Observation model (emission probabilities associated with each state): for discrete observations, P(X_t | Y_t = i) ~ Multi(B_i,1, …, B_i,K) for each state i; in general, P(X_t | Y_t = i) = f(· ; θ_i). Y: states (latent variables); X: observations.
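
A minimal sketch (in Python/NumPy, with made-up numbers) of this parameterization for an HMM with M = 2 states and K = 3 discrete symbols, together with ancestral sampling from the joint P(Y, X):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical HMM with M = 2 states and K = 3 discrete observation symbols.
pi = np.array([0.6, 0.4])                 # pi[i]   = P(Y_1 = i)
A  = np.array([[0.7, 0.3],
               [0.2, 0.8]])               # A[i, j] = P(Y_t = j | Y_{t-1} = i)
B  = np.array([[0.5, 0.4, 0.1],
               [0.1, 0.3, 0.6]])          # B[i, k] = P(X_t = k | Y_t = i)

def sample_hmm(pi, A, B, T):
    """Draw a state sequence y and an observation sequence x of length T."""
    M, K = B.shape
    y = np.empty(T, dtype=int)
    x = np.empty(T, dtype=int)
    y[0] = rng.choice(M, p=pi)                # initial state ~ Multi(pi)
    x[0] = rng.choice(K, p=B[y[0]])           # emission      ~ Multi(B[y_1, :])
    for t in range(1, T):
        y[t] = rng.choice(M, p=A[y[t - 1]])   # transition    ~ Multi(A[y_{t-1}, :])
        x[t] = rng.choice(K, p=B[y[t]])       # emission      ~ Multi(B[y_t, :])
    return y, x

y, x = sample_hmm(pi, A, B, T=10)
```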

Inference problems in sequential data. Decoding: argmax_{y_1,…,y_T} P(y_1, …, y_T | x_1, …, x_T). Evaluation: P(x_1, …, x_T). Filtering: P(y_t | x_1, …, x_t). Smoothing: P(y_t' | x_1, …, x_t) for t' < t. Prediction: P(y_t' | x_1, …, x_t) for t' > t.

Some questions. P(y_t | x_1, …, x_t) = ? The forward algorithm. P(x_1, …, x_T) = ? The forward algorithm. P(y_t | x_1, …, x_T) = ? The forward-backward algorithm. How do we adjust the HMM parameters? Complete data: each training example includes a state sequence and the corresponding observation sequence. Incomplete data: each training example includes only an observation sequence.

Forward algorithm. Define α_t(i) = P(x_1, …, x_t, Y_t = i), for i, j = 1, …, M. Initialization: α_1(i) = P(x_1, Y_1 = i) = P(x_1 | Y_1 = i) P(Y_1 = i). Iterations, for t = 2 to T: α_t(i) = Σ_j α_{t-1}(j) P(Y_t = i | Y_{t-1} = j) P(x_t | Y_t = i). In message-passing terms, α_t(i) is the forward message m_{t-1→t}(i). [Figure: chain Y_1, Y_2, …, Y_{T-1}, Y_T over X_1, X_2, …, X_{T-1}, X_T, with messages α_1(·), α_2(·), …, α_{T-1}(·), α_T(·) passed left to right.]
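
A minimal NumPy sketch of this recursion for the discrete-emission parameterization (pi, A, B) used above; it is unscaled, so on long sequences a real implementation would rescale the α's or work in log space:

```python
import numpy as np

def forward(x, pi, A, B):
    """Forward messages for a discrete-emission HMM.
    x: integer observation sequence; returns (alpha, P(x_1..x_T)),
    where alpha[t - 1, i] = P(x_1, ..., x_t, Y_t = i)."""
    T, M = len(x), len(pi)
    alpha = np.zeros((T, M))
    alpha[0] = pi * B[:, x[0]]                      # initialization: P(x_1 | Y_1 = i) P(Y_1 = i)
    for t in range(1, T):
        # sum_j alpha_{t-1}(j) P(Y_t = i | Y_{t-1} = j), then multiply by P(x_t | Y_t = i)
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    return alpha, alpha[-1].sum()                   # P(x_1..x_T) = sum_i alpha_T(i)
```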

Backward algorithm. Define β_t(i) = m_{t+1→t}(i) = P(x_{t+1}, …, x_T | Y_t = i), for states i, j. Initialization: β_T(i) = 1. Iterations, for t = T down to 2: β_{t-1}(i) = Σ_j β_t(j) P(Y_t = j | Y_{t-1} = i) P(x_t | Y_t = j). [Figure: chain Y_1, Y_2, …, Y_{T-1}, Y_T over X_1, X_2, …, X_{T-1}, X_T, with messages β_1(·), β_2(·), …, β_{T-1}(·), β_T(·) = 1 passed right to left.]
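
The corresponding backward pass under the same assumptions (unscaled, discrete emissions):

```python
import numpy as np

def backward(x, A, B):
    """Backward messages: beta[t - 1, i] = P(x_{t+1}, ..., x_T | Y_t = i)."""
    T, M = len(x), A.shape[0]
    beta = np.ones((T, M))                            # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):
        # sum_j P(Y = j | Y = i) P(x_{t+1} | Y = j) beta_{t+1}(j)
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    return beta
```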

Forward-backward algorithm. Forward: α_t(i) = P(x_1, x_2, …, x_t, Y_t = i), with α_1(i) = P(x_1 | Y_1 = i) P(Y_1 = i) and α_t(i) = Σ_j α_{t-1}(j) P(Y_t = i | Y_{t-1} = j) P(x_t | Y_t = i). Backward: β_t(i) = P(x_{t+1}, x_{t+2}, …, x_T | Y_t = i), with β_T(i) = 1 and β_{t-1}(i) = Σ_j β_t(j) P(Y_t = j | Y_{t-1} = i) P(x_t | Y_t = j). Combining them: P(x_1, x_2, …, x_T) = Σ_i α_T(i) β_T(i) = Σ_i α_T(i), and P(Y_t = i | x_1, x_2, …, x_T) = α_t(i) β_t(i) / Σ_i α_T(i).

Forward-backward algorithm. This will be used in expectation maximization to train an HMM. [Figure: chain Y_1, …, Y_T over X_1, …, X_T annotated with forward messages α_1(·), …, α_T(·) and backward messages β_1(·), …, β_T(·) = 1.] P(Y_t = i | x_1, …, x_T) = P(x_1, …, x_T, Y_t = i) / P(x_1, …, x_T) = α_t(i) β_t(i) / Σ_{j=1}^M α_T(j).
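
A short sketch of this posterior computation, assuming the forward and backward functions from the two previous snippets are in scope:

```python
def smoothing_posteriors(x, pi, A, B):
    """gamma[t - 1, i] = P(Y_t = i | x_1, ..., x_T) = alpha_t(i) beta_t(i) / P(x_1..x_T)."""
    alpha, px = forward(x, pi, A, B)   # forward-algorithm sketch above
    beta = backward(x, A, B)           # backward-algorithm sketch above
    return alpha * beta / px
```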

Decoding problem. Choose the state sequence that best explains the observations: argmax_{y_1,…,y_T} P(y_1, …, y_T | x_1, …, x_T). Viterbi algorithm: define the auxiliary variable δ_t(i) = max_{y_1,…,y_{t-1}} P(y_1, y_2, …, y_{t-1}, Y_t = i, x_1, x_2, …, x_t), the probability of the most probable path ending in state Y_t = i. Recursive relation: δ_t(i) = max_{j=1,…,M} δ_{t-1}(j) P(Y_t = i | Y_{t-1} = j) P(x_t | Y_t = i).

Decoding problem: Viterbi algorithm. Initialization, for i = 1, …, M: δ_1(i) = P(x_1 | Y_1 = i) P(Y_1 = i), ψ_1(i) = 0. Iterations, for t = 2, …, T and i = 1, …, M: δ_t(i) = max_j δ_{t-1}(j) P(Y_t = i | Y_{t-1} = j) P(x_t | Y_t = i), ψ_t(i) = argmax_j δ_{t-1}(j) P(Y_t = i | Y_{t-1} = j). Final computation: P* = max_{j=1,…,M} δ_T(j), y*_T = argmax_{j=1,…,M} δ_T(j). Traceback of the state sequence, for t = T-1 down to 1: y*_t = ψ_{t+1}(y*_{t+1}).
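
A sketch of the Viterbi recursion for the same discrete-emission HMM; it works in log space for numerical stability, which does not change the argmax:

```python
import numpy as np

def viterbi(x, pi, A, B):
    """Most probable state sequence argmax_y P(y | x) for a discrete-emission HMM."""
    T, M = len(x), len(pi)
    logA, logB = np.log(A), np.log(B)
    delta = np.zeros((T, M))                    # delta[t, i]: best log-prob of a path ending in state i
    psi = np.zeros((T, M), dtype=int)           # psi[t, i]:   backpointer (argmax over previous state)
    delta[0] = np.log(pi) + logB[:, x[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + logA   # scores[j, i] = delta_{t-1}(j) + log P(Y_t = i | Y_{t-1} = j)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + logB[:, x[t]]
    y = np.zeros(T, dtype=int)                  # traceback
    y[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        y[t] = psi[t + 1][y[t + 1]]
    return y, delta[-1].max()                   # MAP path and its log-probability
```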

Max-product algorithm. The max-product message is m^max_{j→i}(x_i) = max_{x_j} φ(x_j) φ(x_i, x_j) Π_{k∈N(j)\i} m^max_{k→j}(x_j). On the HMM chain this reduces to the Viterbi quantity: δ_t(i) = m^max_{t-1→t}(i) φ(x_i).

HMM learning. Supervised learning: we have a set of data samples, each containing a pair of sequences (an observation sequence and the corresponding state sequence). Unsupervised learning: we have a set of data samples, each containing only a sequence of observations.

HMM supervised learning by MLE. [Figure: chain Y_1, …, Y_T over X_1, …, X_T annotated with the parameters π, A, B.] Initial state probability: π_i = P(Y_1 = i), 1 ≤ i ≤ M. State transition probability: A_ji = P(Y_{t+1} = i | Y_t = j), 1 ≤ i, j ≤ M. Emission probability (discrete observations): B_ik = P(X_t = k | Y_t = i), 1 ≤ k ≤ K.

HMM: supervised parameter learning by MLE. Likelihood: P(D | θ) = Π_n [ P(y_1^(n) | π) Π_{t=2}^T P(y_t^(n) | y_{t-1}^(n), A) Π_{t=1}^T P(x_t^(n) | y_t^(n), B) ]. The maximum-likelihood estimates for discrete observations are relative frequencies: π_i = Σ_n I(y_1^(n) = i) / N, A_ij = Σ_n Σ_{t=2}^T I(y_{t-1}^(n) = i, y_t^(n) = j) / Σ_n Σ_{t=2}^T I(y_{t-1}^(n) = i), B_ik = Σ_n Σ_{t=1}^T I(y_t^(n) = i, x_t^(n) = k) / Σ_n Σ_{t=1}^T I(y_t^(n) = i).
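
These relative-frequency estimates amount to counting and normalizing; a sketch for the discrete-emission case, assuming the training set is a list of (state sequence, observation sequence) pairs of integer arrays (no smoothing of zero counts):

```python
import numpy as np

def hmm_mle(sequences, M, K):
    """Supervised MLE for a discrete-emission HMM.
    sequences: list of (y, x) pairs of equal-length integer sequences."""
    pi = np.zeros(M)
    A = np.zeros((M, M))
    B = np.zeros((M, K))
    for y, x in sequences:
        pi[y[0]] += 1                        # count I(y_1 = i)
        for t in range(1, len(y)):
            A[y[t - 1], y[t]] += 1           # count I(y_{t-1} = i, y_t = j)
        for t in range(len(y)):
            B[y[t], x[t]] += 1               # count I(y_t = i, x_t = k)
    pi /= pi.sum()                           # normalize counts into probabilities
    A /= A.sum(axis=1, keepdims=True)
    B /= B.sum(axis=1, keepdims=True)
    return pi, A, B
```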

Learning problem: how do we construct an HMM given only observations? Find θ = (A, B, π) maximizing P(x_1, …, x_T | θ). This is an incomplete-data problem, solved with the EM algorithm.

HMM learning by EM (Baum-Welch), discrete observations. Given θ^old = (π^old, A^old, Φ^old). E-step, for i, j = 1, …, M: γ^i_{n,t} = P(Y_t = i | X^(n); θ^old), ξ^{i,j}_{n,t} = P(Y_{t-1} = i, Y_t = j | X^(n); θ^old). M-step: π_i^new = Σ_n γ^i_{n,1} / N, A_{i,j}^new = Σ_n Σ_{t=2}^T ξ^{i,j}_{n,t} / Σ_n Σ_{t=1}^{T-1} γ^i_{n,t}, B_{i,k}^new = Σ_n Σ_{t=1}^T γ^i_{n,t} I(x_t^(n) = k) / Σ_n Σ_{t=1}^T γ^i_{n,t}. (Baum-Welch algorithm, Baum 1972.)

HMM learning by EM (Baum-Welch), assuming Gaussian emission probabilities. Given θ^old = (π^old, A^old, Φ^old). E-step as before, for i, j = 1, …, M: γ^i_{n,t} = P(Y_t = i | X^(n); θ^old), ξ^{i,j}_{n,t} = P(Y_{t-1} = i, Y_t = j | X^(n); θ^old). M-step: π_i^new and A_{i,j}^new as before; μ_i^new = Σ_n Σ_{t=1}^T γ^i_{n,t} x_t^(n) / Σ_n Σ_{t=1}^T γ^i_{n,t}; Σ_i^new = Σ_n Σ_{t=1}^T γ^i_{n,t} (x_t^(n) − μ_i^new)(x_t^(n) − μ_i^new)^T / Σ_n Σ_{t=1}^T γ^i_{n,t}. (Baum-Welch algorithm, Baum 1972.)
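
A compact sketch of one Baum-Welch iteration for a single discrete observation sequence, with the forward and backward passes inlined; it is unscaled, and for a dataset of N sequences the expected counts would be summed over n before normalizing:

```python
import numpy as np

def baum_welch_step(x, pi, A, B):
    """One EM iteration for a discrete-emission HMM on one observation sequence x
    (integer array). Returns updated (pi, A, B)."""
    x = np.asarray(x)
    T, M = len(x), len(pi)
    # E-step: forward/backward messages and posteriors
    alpha = np.zeros((T, M)); beta = np.ones((T, M))
    alpha[0] = pi * B[:, x[0]]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, x[t]]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, x[t + 1]] * beta[t + 1])
    px = alpha[-1].sum()                                  # P(x_1..x_T)
    gamma = alpha * beta / px                             # gamma[t, i] = P(Y_{t+1} = i | x), 0-based t
    xi = np.zeros((T - 1, M, M))                          # xi[t, i, j] = P(Y_{t+1} = i, Y_{t+2} = j | x)
    for t in range(T - 1):
        xi[t] = alpha[t][:, None] * A * (B[:, x[t + 1]] * beta[t + 1])[None, :] / px
    # M-step: re-estimate the parameters from the expected counts
    pi_new = gamma[0]
    A_new = xi.sum(axis=0) / gamma[:-1].sum(axis=0)[:, None]
    B_new = np.zeros_like(B)
    for k in range(B.shape[1]):
        B_new[:, k] = gamma[x == k].sum(axis=0) / gamma.sum(axis=0)
    return pi_new, A_new, B_new
```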

Forward-backward algorithm for the E-step: P(y_{t-1}, y_t | x_1, …, x_T) = P(x_1, …, x_T | y_{t-1}, y_t) P(y_{t-1}, y_t) / P(x_1, …, x_T) = P(x_1, …, x_{t-1} | y_{t-1}) P(x_t | y_t) P(x_{t+1}, …, x_T | y_t) P(y_t | y_{t-1}) P(y_{t-1}) / P(x_1, …, x_T) = α_{t-1}(y_{t-1}) P(x_t | y_t) P(y_t | y_{t-1}) β_t(y_t) / Σ_{i=1}^M α_T(i).

HMM shortcomings. In modeling the joint distribution P(Y, X), the HMM ignores many dependencies between the observations X_1, …, X_T (like most generative models, it must simplify the structure). [Figure: the HMM chain Y_1, …, Y_T over X_1, …, X_T.] In the sequence labeling task, we need to classify an observation sequence using the conditional probability P(Y | X); however, the HMM learns the joint distribution P(Y, X) even though only P(Y | X) is used for labeling.

Maximum Entropy Markov Model (MEMM). [Figure: the HMM chain with per-step emissions X_1, …, X_T next to the MEMM chain in which each Y_t conditions on the entire observation sequence X_{1:T}.] P(y_{1:T} | x_{1:T}) = P(y_1 | x_{1:T}) Π_{t=2}^T P(y_t | y_{t-1}, x_{1:T}), where P(y_t | y_{t-1}, x_{1:T}) = exp(w·f(y_t, y_{t-1}, x_{1:T})) / Z(y_{t-1}, x_{1:T}, w) and Z(y_{t-1}, x_{1:T}, w) = Σ_{y_t} exp(w·f(y_t, y_{t-1}, x_{1:T})). A discriminative model: it models only P(Y | X) and completely ignores modeling P(X), and it maximizes the conditional likelihood P(D_Y | D_X, θ).
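
A small sketch of this locally normalized conditional; the feature function f(y, y_prev, x, t) and the weight vector w are hypothetical placeholders for whatever features one defines:

```python
import numpy as np

def memm_local_conditional(w, f, y_prev, x, t, labels):
    """P(y_t | y_{t-1}, x_{1:T}) for every y_t in labels, as a softmax of w . f.
    Note the normalizer Z(y_{t-1}, x, w) is computed per (y_prev, position) pair."""
    scores = np.array([w @ f(y, y_prev, x, t) for y in labels])
    scores -= scores.max()                  # subtract max for numerical stability
    exps = np.exp(scores)
    return exps / exps.sum()
```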

Feature functions. A feature function f(y_t, y_{t-1}, x_{1:T}) can capture relations between the data space and the label space; in practice, however, features are often indicator functions marking the presence or absence of some pattern. The weight w_i captures how strongly feature f_i(y_t, y_{t-1}, x_{1:T}) is associated with the label.

MEMM disadvantages. The later observations in the sequence have absolutely no effect on the posterior probability of the current state: the model does not allow for any smoothing, and it is incapable of going back and changing its prediction about earlier observations. The label bias problem: there are cases in which a given observation is not useful for predicting the next state of the model; in particular, if a state has a unique outgoing transition, the observation is useless.

Label bias problem in MEMM. States with fewer outgoing arcs are preferred: decoding favors states whose transition distributions have lower entropy, so MEMMs should probably be avoided when many transitions are close to deterministic. Extreme case: when a state has only one outgoing arc, it does not matter what the observation is. The source of this problem: the probabilities of the outgoing arcs are normalized separately for each state, i.e., the transition probabilities out of any state must sum to 1. Solution: do not normalize the probabilities locally.

From MEMM to CRF: from local probabilities to local potentials. P(Y | X) = (1/Z(X)) Π_{t=1}^T φ(y_{t-1}, y_t, X) = (1/Z(X, w)) Π_{t=1}^T exp(w·f(y_t, y_{t-1}, x_{1:T})). [Figure: linear-chain CRF over Y_1, …, Y_T, each Y_t connected to the entire observation X_{1:T}.] The CRF is a discriminative model (like the MEMM): it can model dependence between each state and the entire observation sequence, and it uses a global normalizer Z(X, w), which overcomes the label bias problem of the MEMM. The MEMM uses an exponential model for each state, while the CRF has a single exponential model for the joint probability of the entire label sequence.

CRF: conditional distribution. P(y | x) = (1/Z(x, λ)) exp( Σ_{i=1}^T λ·f(y_i, y_{i-1}, x) ) = (1/Z(x, λ)) exp( Σ_{i=1}^T Σ_k λ_k f_k(y_i, y_{i-1}, x) ), where Z(x, λ) = Σ_y exp( Σ_{i=1}^T λ·f(y_i, y_{i-1}, x) ).
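
To make the definition concrete, the sketch below scores a label sequence and computes ln Z(x, λ) by brute-force enumeration of all label sequences (exponential in T, purely illustrative; efficient chain inference is what makes CRFs practical). The feature function f(y_i, y_prev, x, i) is a hypothetical placeholder and must handle y_prev = None at the first position:

```python
import numpy as np
from itertools import product

def crf_log_score(y, x, lam, f):
    """Unnormalized log-score sum_i lam . f(y_i, y_{i-1}, x)."""
    score, prev = 0.0, None
    for i, y_i in enumerate(y):
        score += lam @ f(y_i, prev, x, i)
        prev = y_i
    return score

def crf_log_partition(x, T, labels, lam, f):
    """ln Z(x, lam), summing exp(score) over every label sequence of length T."""
    scores = [crf_log_score(y, x, lam, f) for y in product(labels, repeat=T)]
    m = max(scores)
    return m + np.log(sum(np.exp(s - m) for s in scores))   # log-sum-exp
```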

CRF: MAP inference. Given the CRF parameters λ, find the y* that maximizes P(y | x): y* = argmax_y exp( Σ_{i=1}^T λ·f(y_i, y_{i-1}, x) ). Z(x) is not a function of y and so can be ignored. The max-product algorithm can be used for this MAP inference problem; it is the same as the Viterbi decoding used in HMMs.
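
If the features decompose into observation (unary) and transition (pairwise) parts, this MAP problem is exactly the Viterbi recursion on log-potentials; a sketch under that assumption, with the potentials precomputed as score matrices (Z(x) never enters, since it does not depend on y):

```python
import numpy as np

def crf_map(unary, pairwise):
    """MAP labeling of a linear-chain CRF.
    unary[t, i]    : lam . f for the observation features of label i at position t
    pairwise[i, j] : lam . f for the transition features i -> j (position-independent here)"""
    T, M = unary.shape
    delta = np.zeros((T, M))
    psi = np.zeros((T, M), dtype=int)
    delta[0] = unary[0]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + pairwise   # scores[j, i]
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + unary[t]
    y = np.zeros(T, dtype=int)                      # traceback, as in HMM Viterbi
    y[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        y[t] = psi[t + 1][y[t + 1]]
    return y
```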

CRF: inference. Exact inference is tractable for 1-D chain CRFs, with clique potentials φ(y_i, y_{i-1}) = exp(λ·f(y_i, y_{i-1}, x)). [Figure: the chain CRF over Y_1, …, Y_T with observation X_{1:T}, and its clique tree with cliques (Y_1, Y_2), (Y_2, Y_3), …, (Y_{T-1}, Y_T) and separators Y_2, …, Y_{T-1}.]

CRF: learning by maximum conditional likelihood. θ* = argmax_θ P(D_Y | D_X, θ), i.e., λ* = argmax_λ Π_n P(y^(n) | x^(n), λ), where P(y^(n) | x^(n), λ) = (1/Z(x^(n), λ)) exp( Σ_{i=1}^T λ·f(y_i^(n), y_{i-1}^(n), x^(n)) ). The conditional log-likelihood is L(λ) = Σ_n ln P(y^(n) | x^(n), λ), so ∇_λ L(λ) = Σ_n [ Σ_{i=1}^T f(y_i^(n), y_{i-1}^(n), x^(n)) − ∇_λ ln Z(x^(n), λ) ], where ∇_λ ln Z(x^(n), λ) = Σ_y P(y | x^(n), λ) Σ_{i=1}^T f(y_i, y_{i-1}, x^(n)). The gradient of the log-partition function of an exponential family is the expectation of the sufficient statistics.

CRF: learning. ∇_λ ln Z(x^(n), λ) = Σ_y P(y | x^(n), λ) Σ_{i=1}^T f(y_i, y_{i-1}, x^(n)) = Σ_{i=1}^T Σ_y P(y | x^(n), λ) f(y_i, y_{i-1}, x^(n)) = Σ_{i=1}^T Σ_{y_i, y_{i-1}} P(y_i, y_{i-1} | x^(n), λ) f(y_i, y_{i-1}, x^(n)). How do we find these expectations? P(y_i, y_{i-1} | x^(n), λ) must be computed for all i = 2, …, T.

CRF: learning. Inference to find P(y_i, y_{i-1} | x^(n), λ): the junction tree algorithm. Initialization of the clique potentials: ψ(y_i, y_{i-1}) = exp(λ·f(y_i, y_{i-1}, x^(n))), with separator potentials φ(y_i) = 1. After calibration, ψ(y_i, y_{i-1}) ∝ P(y_i, y_{i-1} | x^(n), λ), so P(y_i, y_{i-1} | x^(n), λ) = ψ(y_i, y_{i-1}) / Σ_{y_i, y_{i-1}} ψ(y_i, y_{i-1}). [Figure: clique tree with cliques (Y_1, Y_2), (Y_2, Y_3), …, (Y_{T-1}, Y_T) and separators Y_2, …, Y_{T-1}.]

CRF learning: gradient ascent on the conditional log-likelihood. λ^(t+1) = λ^(t) + η ∇_λ L(λ^(t)), with ∇_λ L(λ) = Σ_n [ Σ_{i=1}^T f(y_i^(n), y_{i-1}^(n), x^(n)) − ∇_λ ln Z(x^(n), λ) ]. In each gradient step, for each data sample, we run inference to find the P(y_i, y_{i-1} | x^(n), λ) required for computing the feature expectation: ∇_λ ln Z(x^(n), λ) = E_{P(y | x^(n), λ^(t))}[ Σ_i f(y_i, y_{i-1}, x^(n)) ] = Σ_i Σ_{y_i, y_{i-1}} P(y_i, y_{i-1} | x^(n), λ) f(y_i, y_{i-1}, x^(n)).
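
A sketch of this per-sequence gradient, observed feature counts minus expected feature counts; for clarity the expectation enumerates all label sequences instead of using the pairwise marginals from forward-backward/junction tree, and the feature function f is the same hypothetical placeholder as before:

```python
import numpy as np
from itertools import product

def crf_gradient(x, y_obs, T, labels, lam, f):
    """Gradient of ln P(y_obs | x, lam) for one training sequence."""
    def feat_sum(y):                              # sum_i f(y_i, y_{i-1}, x)
        total, prev = np.zeros_like(lam), None
        for i, y_i in enumerate(y):
            total += f(y_i, prev, x, i)
            prev = y_i
        return total

    seqs = list(product(labels, repeat=T))
    scores = np.array([lam @ feat_sum(y) for y in seqs])
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()                          # P(y | x, lam) for every label sequence
    expected = sum(p * feat_sum(y) for p, y in zip(probs, seqs))
    return feat_sum(y_obs) - expected             # empirical minus expected feature counts
```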

Summary. Discriminative vs. generative: when we have many correlated features, discriminative models (MEMM and CRF) are often better, since they avoid the challenge of explicitly modeling the distribution over the features; but if only limited training data are available, the stronger bias of the generative model may dominate and generative models may be preferred. Learning: HMMs and MEMMs are much more easily learned; the CRF requires an iterative gradient-based approach, which is considerably more expensive because inference must be run separately for every training sequence. MEMM vs. CRF (the label bias problem of the MEMM): in many cases CRFs are likely to be the safer choice, particularly where many transitions are close to deterministic, but the computational cost may be prohibitive for large data sets.