Graphical models for part of speech tagging

1 Graphical models for part of speech tagging
Indian Institute of Technology, Bombay and IBM Research Division, India Research Lab

2 Different models for POS tagging
- Hidden Markov Models (HMM)
- Maximum Entropy Markov Models (MEMM)
- Conditional Random Fields (CRF)

3 POS tagging: a sequence labeling problem
Input and output:
- input sequence x = x_1 x_2 ... x_n
- output sequence y = y_1 y_2 ... y_m: the labels of the input sequence, a semantic representation of the input
Other applications: automatic speech recognition; text processing, e.g., tagging, named entity recognition, and summarization by exploiting the layout structure of text.

4 Hidden Markov Models
Doubly stochastic models. Efficient dynamic programming algorithms exist for:
- finding Pr(S), the probability of an observation sequence S
- finding the highest-probability path P, i.e., the P maximizing Pr(S, P) (Viterbi)
- training the model (the Baum-Welch algorithm)
[Figure: a small example HMM over the symbols A and C, with per-state emission probabilities (e.g., A 0.5 / C 0.5, A 0.6 / C 0.4, A 0.3 / C 0.7) and transition probabilities between the states]
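
To make the slide concrete, here is a minimal Viterbi sketch in Python. The states, transition and emission tables, and the observation sequence are made-up examples (loosely echoing the A/C emissions in the figure), not parameters taken from the slides.

```python
# Minimal Viterbi decoding for a discrete HMM (illustrative sketch).

def viterbi(obs, states, start, trans, emit):
    """Return the highest-probability state path for an observation sequence."""
    # best[t][s] = probability of the best path ending in state s at time t
    best = [{s: start[s] * emit[s][obs[0]] for s in states}]
    back = [{}]  # backpointers for path recovery
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            # choose the predecessor state that maximizes the path probability
            prev, p = max(((r, best[t - 1][r] * trans[r][s]) for r in states),
                          key=lambda rp: rp[1])
            best[t][s] = p * emit[s][obs[t]]
            back[t][s] = prev
    # recover the best final state, then walk the backpointers
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["S1", "S2"]
start = {"S1": 0.6, "S2": 0.4}
trans = {"S1": {"S1": 0.7, "S2": 0.3}, "S2": {"S1": 0.4, "S2": 0.6}}
emit = {"S1": {"A": 0.5, "C": 0.5}, "S2": {"A": 0.6, "C": 0.4}}
print(viterbi(["A", "C", "A"], states, start, trans, emit))
```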

5 Hidden Markov Model (HMM): generative modeling
Source model P(Y), e.g., a 1st-order Markov chain: P(y) = ∏_i P(y_i | y_{i-1})
Noisy channel P(X | Y): P(x | y) = ∏_i P(x_i | y_i)
Parameter estimation: maximize the joint likelihood of the training examples, ∑_{(x,y)∈T} log₂ P(x, y)
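
A small sketch of what "maximize the joint likelihood" amounts to in the fully supervised case: with labeled pairs (x, y), the maximum likelihood estimates of P(y_i | y_{i-1}) and P(x_i | y_i) are just relative frequencies. The tiny corpus below is a made-up example.

```python
# Joint-likelihood parameter estimation for the generative model reduces to
# relative-frequency counts when the tags are observed.
from collections import Counter

corpus = [(["the", "dog", "barks"], ["DT", "NN", "VBZ"]),
          (["the", "cat", "sleeps"], ["DT", "NN", "VBZ"])]

trans, emit, prev_tags, tags = Counter(), Counter(), Counter(), Counter()
for words, labels in corpus:
    for i, (w, t) in enumerate(zip(words, labels)):
        emit[(t, w)] += 1
        tags[t] += 1
        if i > 0:
            trans[(labels[i - 1], t)] += 1
            prev_tags[labels[i - 1]] += 1

# P(y_i | y_{i-1}) and P(x_i | y_i) as relative frequencies
p_trans = {k: v / prev_tags[k[0]] for k, v in trans.items()}
p_emit = {k: v / tags[k[0]] for k, v in emit.items()}
print(p_trans[("DT", "NN")], p_emit[("NN", "dog")])  # 1.0 0.5
```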

6 Dependency (1st order)
[Figure: the 1st-order dependency structure: a chain of states Y_{k-2}, Y_{k-1}, Y_k, Y_{k+1} linked by P(Y_k | Y_{k-1}), each emitting its observation X_k with P(X_k | Y_k)]

7 Different models for POS tagging
- Hidden Markov Models (HMM)
- Maximum Entropy Markov Models (MEMM)
- Conditional Random Fields (CRF)

8 Disadvantage of HMMs (1)
No rich feature information. Rich information is required:
- when x_k is complex
- when the data for x_k is sparse
Example: POS tagging. How do we evaluate P(w_k | t_k) for unknown words w_k?
Useful features: suffixes (e.g., -ed, -tion, -ing) and capitalization.

9 Disadvantage of HMMs (2)
Generative model. Parameter estimation maximizes the joint likelihood of the training examples: ∑_{(x,y)∈T} log₂ P(X = x, Y = y)
Better approach: a discriminative model, which models P(y | x) directly. Maximize the conditional likelihood of the training examples: ∑_{(x,y)∈T} log₂ P(Y = y | X = x)

10 Maximum Entropy Markov Model
Discriminative sub-models: unify the two parameters of the generative model into one conditional model.
- Two parameters in the generative model: the source-model parameter P(y_k | y_{k-1}) and the noisy-channel parameter P(x_k | y_k)
- Unified conditional model: P(y_k | x_k, y_{k-1})
Employ the maximum entropy principle.
Maximum Entropy Markov Model: P(y | x) = ∏_i P(y_i | y_{i-1}, x_i)

11 General Maximum Entropy Model
Model: the distribution P(Y | X) with a set of features {f_1, f_2, ..., f_l} defined on X and Y.
Idea: collect information about the features from the training data; assume nothing about the distribution P(Y | X) other than the collected information; maximize the entropy as the selection criterion.

12 Features
Features are 0-1 indicator functions: 1 if (x, y) satisfies a predefined condition, 0 if not.
Example (POS tagging):
f_1(x, y) = 1 if x ends with -tion and y is NN, 0 otherwise
f_2(x, y) = 1 if x starts with a capital letter and y is NNP, 0 otherwise
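
The two example features translate directly into code; this is just a transcription of the slide's definitions.

```python
def f1(x, y):
    """1 if the word ends with -tion and the tag is NN, else 0."""
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x, y):
    """1 if the word starts with a capital letter and the tag is NNP, else 0."""
    return 1 if x[:1].isupper() and y == "NNP" else 0

print(f1("information", "NN"), f2("Bombay", "NNP"))  # 1 1
```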

13 Constraints
Empirical information: statistics from the training data T,
P̂(f_i) = (1/|T|) ∑_{(x,y)∈T} f_i(x, y)
Expected value under the distribution P(Y | X) we want to model,
P(f_i) = (1/|T|) ∑_{(x,y)∈T} ∑_{y'∈D_Y} P(Y = y' | X = x) f_i(x, y')
Constraints: P(f_i) = P̂(f_i)
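
A sketch of the two expectations the constraints equate. The names `features`, `model` (a function mapping (x, y) to P(Y = y | X = x)), and `label_set` are placeholders for whatever feature set and model are plugged in.

```python
def empirical_expectation(f, data):
    """P-hat(f): (1/|T|) * sum over training pairs (x, y) of f(x, y)."""
    return sum(f(x, y) for x, y in data) / len(data)

def model_expectation(f, data, model, label_set):
    """P(f): (1/|T|) * sum over x in T of sum over y' of P(y'|x) * f(x, y')."""
    return sum(model(x, y2) * f(x, y2)
               for x, _ in data for y2 in label_set) / len(data)
```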

14 Maximum Entropy: Objective
Entropy:
I = -(1/|T|) ∑_{(x,y)∈T} ∑_{y'∈D_Y} P(Y = y' | X = x) log₂ P(Y = y' | X = x)
  = -∑_x P̂(x) ∑_y P(Y = y | X = x) log₂ P(Y = y | X = x)
Maximization problem: max_{P(Y|X)} I, subject to P(f_i) = P̂(f_i)
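
A minimal sketch of the entropy objective itself, reusing the hypothetical `model`, `data`, and `label_set` placeholders from the previous sketch.

```python
import math

def conditional_entropy(model, data, label_set):
    """-(1/|T|) * sum_x sum_y P(y|x) log2 P(y|x): the quantity being maximized."""
    total = 0.0
    for x, _ in data:
        for y in label_set:
            p = model(x, y)
            if p > 0:
                total -= p * math.log2(p)
    return total / len(data)
```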

15 Dual Problem
Conditional model: P(Y = y | X = x) ∝ exp( ∑_{i=1}^{l} λ_i f_i(x, y) )
Maximum likelihood of the conditional data: max_{λ_1,...,λ_l} L(λ_1,...,λ_l) = ∑_{(x,y)∈T} log P(Y = y | X = x)
Solutions: improved iterative scaling (IIS) (Berger et al.); generalized iterative scaling (GIS) (McCallum et al.)
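
A sketch of the dual's exponential form and its training loop. The slide's solvers are IIS and GIS; for brevity this sketch substitutes plain gradient ascent on the conditional log-likelihood, whose gradient for each λ_i is exactly the empirical-minus-model expectation of f_i. All names here are illustrative assumptions.

```python
# P(y | x) proportional to exp(sum_i lambda_i * f_i(x, y)), trained by
# gradient ascent (a deliberate simplification of IIS/GIS).
import math

def p_cond(x, y, lam, features, label_set):
    score = lambda yy: math.exp(sum(l * f(x, yy) for l, f in zip(lam, features)))
    return score(y) / sum(score(y2) for y2 in label_set)

def train(data, features, label_set, lr=0.1, epochs=100):
    lam = [0.0] * len(features)
    for _ in range(epochs):
        for i, f in enumerate(features):
            # gradient = empirical count of f_i - model-expected count of f_i
            emp = sum(f(x, y) for x, y in data)
            mod = sum(p_cond(x, y2, lam, features, label_set) * f(x, y2)
                      for x, _ in data for y2 in label_set)
            lam[i] += lr * (emp - mod)
    return lam
```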

16 Maximum Entropy Markov Model
Use the maximum entropy approach to model the 1st-order conditional: P(Y_k = y_k | X_k = x_k, Y_{k-1} = y_{k-1})
Features:
- basic features, like the parameters in an HMM: bigram (1st order) or trigram (2nd order) features on the source model, and state-output pair features (X_k = x_k, Y_k = y_k)
- advantage: other advanced features on (x_k, y_k) can be incorporated
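
A decoding sketch for the MEMM: Viterbi where each step is scored by the local conditional P(y_k | x_k, y_{k-1}). Here `local_model(x_k, prev, y)` stands in for a trained maximum entropy classifier; it and the start label are assumptions of the sketch.

```python
def memm_viterbi(obs, label_set, local_model, start_label="<s>"):
    """Viterbi over per-position conditionals P(y_k | x_k, y_{k-1})."""
    best = [{y: local_model(obs[0], start_label, y) for y in label_set}]
    back = [{}]
    for k in range(1, len(obs)):
        best.append({})
        back.append({})
        for y in label_set:
            # best previous label under the locally normalized model
            prev, p = max(((r, best[k - 1][r] * local_model(obs[k], r, y))
                           for r in label_set), key=lambda rp: rp[1])
            best[k][y], back[k][y] = p, prev
    last = max(label_set, key=lambda y: best[-1][y])
    path = [last]
    for k in range(len(obs) - 1, 0, -1):
        path.append(back[k][path[-1]])
    return list(reversed(path))
```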

17 HMM vs. MEMM (1st order)
[Figure: the two graphical structures side by side. HMM: Y_{k-1} points to Y_k (parameter P(Y_k | Y_{k-1})) and Y_k points to X_k (parameter P(X_k | Y_k)). MEMM: both Y_{k-1} and X_k point into Y_k, with the single conditional P(Y_k | X_k, Y_{k-1}).]

18 Performance in POS Tagging
Data set: WSJ. Features: HMM features plus spelling features such as -ed, -tion, -s, -ing.
Results (Lafferty et al., 2001):
- 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy

19 Different models for POS tagging
- Hidden Markov Models (HMM)
- Maximum Entropy Markov Models (MEMM)
- Conditional Random Fields (CRF)

20 Disadvantage of MEMMs (1)
Complex algorithms for the maximum entropy solution: both IIS and GIS are difficult to implement and require many tricks in implementation.
Slow in training: time-consuming when the data set is large, especially for MEMMs.

21 Disadvantage of MEMMs (2)
The maximum entropy model is used as a sub-model, so entropy is optimized over the sub-models, not over the global model.
Label bias problem: in conditional models with per-state normalization, the effect of the observations is weakened for states with fewer outgoing transitions.

22 Label Bias Problem
Training data (X:Y): rib:123, rib:123, rib:123, rob:456
[Figure: two state paths, r → 1, i/o → 2, b → 3 and r → 4, i/o → 5, b → 6]
Parameters: P(1|r) = 0.4, P(4|r) = 0.6, P(2|i,1) = P(2|o,1) = 1, P(5|i,4) = P(5|o,4) = 1, P(3|b,2) = P(6|b,5) = 1
New input: rob
P(123|rob) = P(1|r) · P(2|o,1) · P(3|b,2) = 0.4
P(456|rob) = P(4|r) · P(5|o,4) · P(6|b,5) = 0.6
Because each intermediate state normalizes over a single outgoing transition, the observation (i vs. o) has no influence on the score of either path.
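
The slide's arithmetic, spelled out; note that swapping the input from rib to rob changes nothing, which is the label bias.

```python
# Per-state normalization means the 'o' observation cannot change either
# path's score, so the ratio is fixed by P(state | r) alone.
p_123 = 0.4 * 1.0 * 1.0   # P(1|r) * P(2|o,1) * P(3|b,2)
p_456 = 0.6 * 1.0 * 1.0   # P(4|r) * P(5|o,4) * P(6|b,5)
print(p_123, p_456)       # 0.4 0.6, whether the input was rib or rob
```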

23 Solution
Global optimization: optimize the parameters in one global model simultaneously, not in sub-models separately.
Alternatives: conditional random fields; application of the perceptron algorithm.

24 Conditional Random Field (CRF) (1)
Let G = (V, E) be a graph such that Y is indexed by the vertices of G: Y = (Y_v)_{v∈V}.
Then (X, Y) is a conditional random field if, conditioned globally on X, for every v ∈ V:
P(Y_v | X, Y_w, w ≠ v) = P(Y_v | X, Y_w, (w, v) ∈ E)

25 Conditional Random Field (CRF) (2)
Exponential model, for G = (V, E) a tree (or, more specifically, a chain) with the cliques being its edges and vertices:
P(Y = y | X = x) ∝ exp( ∑_{e∈E,k} λ_k f_k(e, y|_e, x) + ∑_{v∈V,k} μ_k g_k(v, y|_v, x) )
The first sum is determined by state transitions; the second by single states.
Parameter estimation: maximize the conditional likelihood of the training examples, ∑_{(x,y)∈T} log₂ P(Y = y | X = x), using IIS or GIS.
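
A sketch of what global normalization means for a linear-chain CRF: one partition function Z(x) computed over all label sequences by the forward algorithm, rather than per-state normalizers. The log-score function `s(prev, y, x, k)`, which collapses the edge (λ, f) and vertex (μ, g) terms into one callable, is an illustrative assumption.

```python
import math

def crf_log_prob(xs, ys, label_set, s, start="<s>"):
    """log P(y | x) = score(x, y) - log Z(x), with Z via the forward algorithm."""
    # global (unnormalized) log-score of the candidate label sequence
    score = sum(s(ys[k - 1] if k else start, ys[k], xs, k)
                for k in range(len(xs)))
    # forward pass: sum over all label sequences, in log space
    alpha = {y: s(start, y, xs, 0) for y in label_set}
    for k in range(1, len(xs)):
        alpha = {y: math.log(sum(math.exp(alpha[r] + s(r, y, xs, k))
                                 for r in label_set))
                 for y in label_set}
    log_z = math.log(sum(math.exp(a) for a in alpha.values()))
    return score - log_z
```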

26 MEMM vs. CRF
Similarities: both employ the maximum entropy principle; both incorporate rich feature information.
Difference: conditional random fields are always globally conditioned on X, resulting in a globally optimized model.

27 Performance in POS Tagging
Data set: WSJ. Features: HMM features plus spelling features such as -ed, -tion, -s, -ing.
Results (Lafferty et al., 2001):
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

28 Comparison of the three approaches to POS tagging
Results (Lafferty et al., 2001):
- 1st-order HMM: 94.31% accuracy, 54.01% OOV accuracy
- 1st-order MEMM: 95.19% accuracy, 73.01% OOV accuracy
- Conditional random fields: 95.73% accuracy, 76.24% OOV accuracy

29 References
A. Berger, S. Della Pietra, and V. Della Pietra. 1996. A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1).
J. Lafferty, A. McCallum, and F. Pereira. 2001. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001.
