Graphical models for part of speech tagging


Indian Institute of Technology, Bombay, and IBM India Research Lab. Graphical models for part of speech tagging.

Different Models for POS tagging: HMM, Maximum Entropy Markov Models, Conditional Random Fields.

POS tagging: A Sequence Labeling Problem. Input and output: an input sequence $x = x_1 x_2 \ldots x_n$ and an output sequence $y = y_1 y_2 \ldots y_m$, the labels of the input sequence (a semantic representation of the input); for POS tagging, $x$ is the words of a sentence and $y$ their tags. Other applications: automatic speech recognition; text processing, e.g., tagging, named entity recognition, and summarization by exploiting the layout structure of text.

Hidden Markov Models. Doubly stochastic models. Efficient dynamic programming algorithms exist for: computing $\Pr(S)$; finding the highest-probability path $P$ that maximizes $\Pr(S, P)$ (Viterbi); training the model (Baum-Welch algorithm). [Figure: a four-state HMM over the alphabet {A, C}, with transition probabilities between states $S_1$-$S_4$ and per-state emission probabilities.]
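As a concrete illustration of the Viterbi decoding step mentioned above, here is a minimal sketch in Python; the two-tag HMM, its probability tables, and the example sentence are invented for illustration and are not part of the original slides.

```python
# Minimal Viterbi decoding for a first-order HMM (illustrative sketch).
# States, transition/emission tables and the example sentence are invented;
# a real tagger would estimate these from a corpus.

def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[t][s]: highest probability of any path ending in state s at time t
    delta = [{s: start_p[s] * emit_p[s].get(obs[0], 1e-6) for s in states}]
    back = [{}]
    for t in range(1, len(obs)):
        delta.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, delta[t - 1][p] * trans_p[p][s]) for p in states),
                key=lambda pair: pair[1],
            )
            delta[t][s] = score * emit_p[s].get(obs[t], 1e-6)
            back[t][s] = prev
    # Trace back the highest-probability state sequence.
    last = max(states, key=lambda s: delta[-1][s])
    path = [last]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

states = ["N", "V"]
start_p = {"N": 0.7, "V": 0.3}
trans_p = {"N": {"N": 0.4, "V": 0.6}, "V": {"N": 0.8, "V": 0.2}}
emit_p = {"N": {"dogs": 0.5, "bark": 0.1}, "V": {"dogs": 0.05, "bark": 0.6}}
print(viterbi(["dogs", "bark"], states, start_p, trans_p, emit_p))  # ['N', 'V']
```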

Hidden Markov Model (HMM): Generative Modeling. Source model $P(Y)$ and noisy channel $P(X \mid Y)$; e.g., a 1st-order Markov chain with $P(y) = \prod_i P(y_i \mid y_{i-1})$ and $P(x \mid y) = \prod_i P(x_i \mid y_i)$. Parameter estimation: maximize the joint likelihood of the training examples, $\sum_{(x,y) \in T} \log_2 P(x, y)$.
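For a fully observed tagged corpus, maximizing this joint likelihood reduces to relative-frequency counting; a minimal sketch under that assumption, with a made-up two-sentence corpus:

```python
# Maximum-likelihood estimation of HMM parameters by relative-frequency
# counting (a sketch; the tiny tagged corpus below is invented).
from collections import Counter, defaultdict

corpus = [
    [("dogs", "N"), ("bark", "V")],
    [("cats", "N"), ("sleep", "V")],
]

trans, emit, prior = defaultdict(Counter), defaultdict(Counter), Counter()
for sent in corpus:
    tags = [t for _, t in sent]
    prior[tags[0]] += 1                 # counts for P(y_1)
    for (w, t) in sent:
        emit[t][w] += 1                 # counts for P(x_i | y_i)
    for a, b in zip(tags, tags[1:]):
        trans[a][b] += 1                # counts for P(y_i | y_{i-1})

def normalize(counter):
    total = sum(counter.values())
    return {k: v / total for k, v in counter.items()}

P_start = normalize(prior)
P_trans = {t: normalize(c) for t, c in trans.items()}
P_emit = {t: normalize(c) for t, c in emit.items()}
print(P_trans["N"])  # {'V': 1.0}
```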

[Figure: 1st-order dependency structure. Observations $X_{k-2}, X_{k-1}, X_k, X_{k+1}$ each depend on their hidden state through $P(X_k \mid Y_k)$, and the hidden states $Y_{k-2}, Y_{k-1}, Y_k, Y_{k+1}$ form a chain through $P(Y_k \mid Y_{k-1})$.]

Different Models for POS tagging: HMM, Maximum Entropy Markov Models, Conditional Random Fields.

Disadvantage of HMMs (1): no rich feature information. Rich information is required when $x_k$ is complex or when the data for $x_k$ is sparse. Example: POS tagging. How do we estimate $P(w_k \mid t_k)$ for unknown words $w_k$? Useful features: suffixes, e.g., -ed, -tion, -ing, etc.; capitalization.

Disadvantage of HMMs (2): generative model. Parameter estimation maximizes the joint likelihood of the training examples, $\sum_{(x,y) \in T} \log_2 P(X = x, Y = y)$. A better approach is a discriminative model, which models $P(y \mid x)$ directly and maximizes the conditional likelihood of the training examples, $\sum_{(x,y) \in T} \log_2 P(Y = y \mid X = x)$.

Maximum Entropy Markov Model: discriminative sub-models. Unify the two parameters of the generative model into one conditional model: the source-model parameter $P(y_k \mid y_{k-1})$ and the noisy-channel parameter $P(x_k \mid y_k)$ become the unified conditional model $P(y_k \mid x_k, y_{k-1})$. Employ the maximum entropy principle. The Maximum Entropy Markov Model: $P(y \mid x) = \prod_i P(y_i \mid y_{i-1}, x_i)$.

General Maximum Entropy Model. Model the distribution $P(Y \mid X)$ with a set of features $\{f_1, f_2, \ldots, f_l\}$ defined on $X$ and $Y$. Idea: collect feature statistics from the training data; assume nothing about the distribution $P(Y \mid X)$ other than the collected information; maximize the entropy as the selection criterion.

Features. 0-1 indicator functions: 1 if $(x, y)$ satisfies a predefined condition, 0 if not. Example (POS tagging):
$f_1(x, y) = 1$ if $x$ ends with -tion and $y$ is NN, 0 otherwise;
$f_2(x, y) = 1$ if $x$ starts with a capital letter and $y$ is NNP, 0 otherwise.
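These 0-1 features translate directly into code; a small sketch mirroring the two example features (the function names are arbitrary):

```python
# 0-1 indicator features of the kind used in maximum entropy models.

def f1(x, y):
    """1 if the word ends with -tion and the tag is NN, else 0."""
    return 1 if x.endswith("tion") and y == "NN" else 0

def f2(x, y):
    """1 if the word starts with a capital letter and the tag is NNP, else 0."""
    return 1 if x[:1].isupper() and y == "NNP" else 0

print(f1("transition", "NN"), f2("Bombay", "NNP"))  # 1 1
```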

Constraints. Empirical information (statistics from the training data $T$): $\hat{P}(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} f_i(x, y)$. Expected value under the distribution $P(Y \mid X)$ we want to model: $P(f_i) = \frac{1}{|T|} \sum_{(x,y) \in T} \sum_{y' \in D_Y} P(Y = y' \mid X = x) \, f_i(x, y')$. Constraints: $P(f_i) = \hat{P}(f_i)$ for every feature $f_i$.

Maximum Entropy: Objective. Maximization problem: maximize the conditional entropy
$I = -\sum_{x} \hat{P}(X = x) \sum_{y} P(Y = y \mid X = x) \log_2 P(Y = y \mid X = x)$,
where $\hat{P}(X = x)$ is the empirical distribution of $x$ in the training data $T$, subject to $P(f_i) = \hat{P}(f_i)$ for all features $f_i$.

Dual Problem. Conditional model: $P(Y = y \mid X = x) \propto \exp\left(\sum_{i=1}^{l} \lambda_i f_i(x, y)\right)$. Maximum likelihood of the conditional data: $\max_{\lambda_1, \ldots, \lambda_l} L(\lambda_1, \ldots, \lambda_l) = \sum_{(x,y) \in T} \log_2 P(Y = y \mid X = x)$. Solution methods: improved iterative scaling (IIS; Berger et al. 1996) and generalized iterative scaling (GIS; McCallum et al. 2000).
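The dual form is a log-linear model; below is a minimal sketch of evaluating $P(y \mid x)$ for fixed weights. The weights, tag set, and features are placeholders, and no training (IIS/GIS) is performed.

```python
# Log-linear (maximum entropy) conditional model: P(y|x) proportional to
# exp(sum_i lambda_i * f_i(x, y)), normalized over the candidate tag set.
import math

features = [
    lambda x, y: 1 if x.endswith("tion") and y == "NN" else 0,
    lambda x, y: 1 if x[:1].isupper() and y == "NNP" else 0,
]
weights = [1.5, 2.0]         # placeholder lambda_i values (not trained here)
tags = ["NN", "NNP", "VB"]   # illustrative tag set

def p_y_given_x(x):
    scores = {y: math.exp(sum(w * f(x, y) for w, f in zip(weights, features)))
              for y in tags}
    z = sum(scores.values())
    return {y: s / z for y, s in scores.items()}

print(p_y_given_x("transition"))  # probability mass shifts toward NN
```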

Maximum Entropy Markov Model. Use the maximum entropy approach to model the 1st-order conditional $P(Y_k = y_k \mid X_k = x_k, Y_{k-1} = y_{k-1})$. Features: basic features like the parameters of an HMM, i.e., bigram (1st-order) or trigram (2nd-order) features from the source model, and state-output pair features $(X_k = x_k, Y_k = y_k)$. Advantage: other advanced features on $(x_k, y_k)$ can be incorporated.
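A sketch of how an MEMM scores a tag sequence as a product of locally normalized maximum entropy factors; the features, weights, and tag set are illustrative placeholders only.

```python
# MEMM sketch: the tag sequence probability is a product of locally
# normalized maxent factors P(y_k | x_k, y_{k-1}).
import math

tags = ["NN", "VB"]
features = [
    lambda x, y, y_prev: 1 if x.endswith("s") and y == "NN" else 0,  # observation feature
    lambda x, y, y_prev: 1 if y_prev == "NN" and y == "VB" else 0,   # transition feature
]
weights = [1.0, 1.2]

def local_p(x, y, y_prev):
    score = lambda t: math.exp(sum(w * f(x, t, y_prev) for w, f in zip(weights, features)))
    z = sum(score(t) for t in tags)          # per-state normalization
    return score(y) / z

def sequence_p(words, tag_seq, start="<s>"):
    p, prev = 1.0, start
    for x, y in zip(words, tag_seq):
        p *= local_p(x, y, prev)
        prev = y
    return p

print(sequence_p(["dogs", "bark"], ["NN", "VB"]))
```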

HMM vs MEMM (1st order). [Figure: in the HMM, arrows run from $Y_{k-1}$ to $Y_k$ ($P(Y_k \mid Y_{k-1})$) and from $Y_k$ to $X_k$ ($P(X_k \mid Y_k)$); in the MEMM, arrows run from $Y_{k-1}$ and $X_k$ into $Y_k$ ($P(Y_k \mid X_k, Y_{k-1})$).]

Performance in POS Tagging. Data set: WSJ. Features: HMM features, spelling features like -ed, -tion, -s, -ing, etc. Results (Lafferty et al. 2001): 1st-order HMM, 94.31% accuracy, 54.01% OOV accuracy; 1st-order MEMM, 95.19% accuracy, 73.01% OOV accuracy.

Different Models for POS tagging: HMM, Maximum Entropy Markov Models, Conditional Random Fields.

Disadvantage of MEMMs (1): complex maximum entropy solution algorithms. Both IIS and GIS are difficult to implement and require many tricks in implementation. Slow in training: time-consuming when the data set is large, especially for MEMMs.

Disadvantage of MEMMs (2). The maximum entropy model is used as a sub-model, so the entropy is optimized over the sub-models, not over the global model. Label bias problem: in conditional models with per-state normalization, the effect of the observations is weakened for states with fewer outgoing transitions.

Label Bias Problem. Training data (X:Y): rib:123, rib:123, rib:123, rob:456, rob:456. [Figure: a finite-state model with two paths, states 1→2→3 reading r-i-b and states 4→5→6 reading r-o-b.] Per-state-normalized parameters: $P(1 \mid r) = 0.6$, $P(4 \mid r) = 0.4$, $P(2 \mid i, 1) = P(2 \mid o, 1) = 1$, $P(5 \mid i, 4) = P(5 \mid o, 4) = 1$, $P(3 \mid b, 2) = P(6 \mid b, 5) = 1$. New input: rob. $P(123 \mid rob) = P(1 \mid r)\,P(2 \mid o, 1)\,P(3 \mid b, 2) = 0.6 \cdot 1 \cdot 1 = 0.6$; $P(456 \mid rob) = P(4 \mid r)\,P(5 \mid o, 4)\,P(6 \mid b, 5) = 0.4 \cdot 1 \cdot 1 = 0.4$. The path 1-2-3 wins even though 'o' was only ever observed on the 4-5-6 path.
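The arithmetic of this example can be checked with a few lines of Python; the transition table below encodes the per-state-normalized estimates above, with an assumed start state 0.

```python
# Numeric check of the label-bias example: per-state normalized transition
# probabilities estimated from the toy training set (3x rib:123, 2x rob:456).

# P(next_state | observation, current_state); state 0 is the start state.
p = {
    (0, "r", 1): 0.6, (0, "r", 4): 0.4,   # 3 of the 5 first symbols go to state 1
    (1, "i", 2): 1.0, (1, "o", 2): 1.0,   # state 1 has a single outgoing arc
    (4, "i", 5): 1.0, (4, "o", 5): 1.0,   # state 4 has a single outgoing arc
    (2, "b", 3): 1.0, (5, "b", 6): 1.0,
}

def path_prob(word, path, start=0):
    prob, state = 1.0, start
    for ch, nxt in zip(word, path):
        prob *= p.get((state, ch, nxt), 0.0)
        state = nxt
    return prob

print(path_prob("rob", [1, 2, 3]))  # 0.6 -- wins despite never seeing 'o' on this path
print(path_prob("rob", [4, 5, 6]))  # 0.4
```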

Solution: Global Optimization. Optimize the parameters in one global model simultaneously, not in separate sub-models. Alternatives: conditional random fields; application of the perceptron algorithm.

Conditional Random Field (CRF), 1. Let $G = (V, E)$ be a graph such that $Y$ is indexed by the vertices of $G$: $Y = (Y_v)_{v \in V}$. Then $(X, Y)$ is a conditional random field if, conditioned globally on $X$, $P(Y_v \mid X, Y_w, w \neq v) = P(Y_v \mid X, Y_w, (w, v) \in E)$.

Conditional Random Field (CRF), 2: Exponential model. $G = (V, E)$ is a tree (more specifically, a chain) whose cliques are its edges and vertices:
$P(Y = y \mid X = x) \propto \exp\left( \sum_{e \in E, k} \lambda_k f_k(e, y|_e, x) + \sum_{v \in V, k} \mu_k g_k(v, y|_v, x) \right)$,
where the $f_k$ are determined by state transitions (edges) and the $g_k$ by individual states (vertices). Parameter estimation: maximize the conditional likelihood of the training examples, $\sum_{(x,y) \in T} \log_2 P(Y = y \mid X = x)$, with IIS or GIS.
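For contrast with the MEMM sketch, here is a minimal linear-chain CRF sketch with a single global normalization; the partition function is computed by brute force over all tag sequences for clarity, and the features, weights, and tag set are placeholders.

```python
# Linear-chain CRF sketch: one global score per sequence, normalized once
# over all tag sequences (brute-force partition function for clarity).
import itertools, math

tags = ["NN", "VB"]
edge_feats = [lambda y_prev, y, x, k: 1 if (y_prev, y) == ("NN", "VB") else 0]
node_feats = [lambda y, x, k: 1 if x[k].endswith("s") and y == "NN" else 0]
lam, mu = [1.2], [1.0]   # placeholder lambda_k and mu_k weights

def score(x, y):
    s = 0.0
    for k in range(len(x)):
        s += sum(m * g(y[k], x, k) for m, g in zip(mu, node_feats))
        if k > 0:
            s += sum(l * f(y[k - 1], y[k], x, k) for l, f in zip(lam, edge_feats))
    return s

def p(x, y):
    z = sum(math.exp(score(x, yy)) for yy in itertools.product(tags, repeat=len(x)))
    return math.exp(score(x, y)) / z

print(p(["dogs", "bark"], ["NN", "VB"]))
```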

MEMM vs CRF. Similarities: both employ the maximum entropy principle; both incorporate rich feature information. Differences: conditional random fields are always globally conditioned on $X$, resulting in a globally optimized model.

Performance in POS Tagging. Data set: WSJ. Features: HMM features, spelling features like -ed, -tion, -s, -ing, etc. Results (Lafferty et al. 2001): 1st-order MEMM, 95.19% accuracy, 73.01% OOV accuracy; conditional random fields, 95.73% accuracy, 76.24% OOV accuracy.

Comparison of the three approaches to POS Tagging. Results (Lafferty et al. 2001): 1st-order HMM, 94.31% accuracy, 54.01% OOV accuracy; 1st-order MEMM, 95.19% accuracy, 73.01% OOV accuracy; conditional random fields, 95.73% accuracy, 76.24% OOV accuracy.

References. A. Berger, S. Della Pietra, and V. Della Pietra (1996). A Maximum Entropy Approach to Natural Language Processing. Computational Linguistics, 22(1), 39-71. J. Lafferty, A. McCallum, and F. Pereira (2001). Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data. In Proc. ICML-2001, 282-289.