Probabilistic Models for Sequence Labeling

Besnik Fetahu
June 9, 2011

Outline
- Background & motivation: problem introduction; generative vs. discriminative models; existing approaches for sequence labeling; the label bias problem; factor graphs.
- Conditional Random Fields: parameter estimation; inference.
- Semi-Markov CRFs.
- Experimental results.

Problem Introduction: Segmentation and Labeling
Challenges of the segmentation and labeling problem:
- Probabilistic nature of the problem.
- Dependence on previous and future labels.
- Generalization.
- Data sparsity.
- Ambiguity.
- Combinatorial explosions.
Figure: dependence structure of the labeling problem.

Problem Introduction: Examples
Problems most often considered in this field:
1. Named Entity Recognition. "Jim bought 300 shares of ACME Corp. in 2006." Persons: Jim; Quantities: 300; Companies: ACME Corp.; Dates: 2006.
2. Part-of-Speech Tagging.
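
As a minimal illustration of the target representation, the NER example reduces to one label per token (the label names below are assumed for illustration, not taken from the slides):

```python
# NER as sequence labeling: one (hypothetical) label per token.
tokens = ["Jim", "bought", "300", "shares", "of", "ACME", "Corp.", "in", "2006", "."]
labels = ["PER", "O", "QTY", "O", "O", "ORG", "ORG", "O", "DATE", "O"]
```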

Generative vs. Discriminative Models: Generative
Probabilistic generative models:
- Model the joint distribution.
- Build a model for each label.
- Minimum variance, but biased parameter estimation.
- Aim: find p(y | x), via Bayes' rule from the joint.
Maximize the joint likelihood:

$$\hat{\theta}_{\mathrm{GEN}} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log\big( p_{y_i}\, f_{y_i}(x_i; \theta) \big)$$

Figure: generative probabilistic model.

Generative vs. Discriminative Models: Discriminative
Probabilistic discriminative models:
- Model the conditional distribution directly.
- Best classification performance; minimize classification loss.
- Parameters influence only the conditional distribution.
Maximize the conditional (logistic-regression) likelihood:

$$\hat{\theta}_{\mathrm{DISC}} = \arg\max_{\theta \in \Theta} \sum_{i=1}^{n} \log \frac{p_{y_i}\, f_{y_i}(x_i; \theta)}{\sum_{k} p_k\, f_k(x_i; \theta)}$$

Figure: discriminative probabilistic model.
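
To make the two objectives concrete, here is a minimal sketch assuming 1-D Gaussian class-conditional densities f_y(x; θ) and class priors p_y (data and parameters are hypothetical): it evaluates the joint log-likelihood that θ_GEN maximizes and the conditional log-likelihood that θ_DISC maximizes.

```python
import numpy as np

def gaussian_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def generative_ll(xs, ys, priors, mus, sigmas):
    # Joint log-likelihood: sum_i log( p_{y_i} f_{y_i}(x_i; theta) ).
    return sum(np.log(priors[y] * gaussian_pdf(x, mus[y], sigmas[y]))
               for x, y in zip(xs, ys))

def discriminative_ll(xs, ys, priors, mus, sigmas):
    # Conditional log-likelihood: the generative score renormalized over labels.
    total = 0.0
    for x, y in zip(xs, ys):
        joint = [priors[k] * gaussian_pdf(x, mus[k], sigmas[k])
                 for k in range(len(priors))]
        total += np.log(joint[y] / sum(joint))
    return total

xs, ys = [0.2, 1.9, -0.5], [0, 1, 0]
print(generative_ll(xs, ys, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))
print(discriminative_ll(xs, ys, [0.5, 0.5], [0.0, 2.0], [1.0, 1.0]))
```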

Existing Approaches: Hidden Markov Models (HMMs)
- Generative model.
- Considers many combinations of hidden states.
- Each observation depends directly on the state at its time step.
Evaluate the joint:

$$p(\mathbf{y}, \mathbf{x}) = \prod_{t=1}^{T} p(y_t \mid y_{t-1})\, p(x_t \mid y_t)$$

which can be rewritten in exponential-family form:

$$p(\mathbf{y}, \mathbf{x}) = \frac{1}{Z} \exp\Big\{ \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \Big\}$$

Figure: hidden Markov model.
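
A minimal sketch of the factored joint, with hypothetical transition and emission tables over two states and two symbols:

```python
import numpy as np

trans = np.array([[0.7, 0.3],    # p(y_t | y_{t-1}), rows indexed by y_{t-1}
                  [0.4, 0.6]])
emit = np.array([[0.9, 0.1],     # p(x_t | y_t), rows indexed by y_t
                 [0.2, 0.8]])
start = np.array([0.5, 0.5])     # p(y_1), i.e. the transition out of a dummy y_0

def hmm_joint(y, x):
    # p(y, x) = p(y_1) p(x_1|y_1) * prod_{t>1} p(y_t|y_{t-1}) p(x_t|y_t)
    p = start[y[0]] * emit[y[0], x[0]]
    for t in range(1, len(y)):
        p *= trans[y[t - 1], y[t]] * emit[y[t], x[t]]
    return p

print(hmm_joint([0, 0, 1], [0, 0, 1]))
```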

Existing Approaches: Maximum Entropy Markov Models (MEMMs)
- Discriminative model.
- One exponential model per state-observation transition:

$$p(y \mid y', x) = \frac{1}{Z(y', x)} \exp\Big\{ \sum_{k=1}^{K} \lambda_k f_k(y, y', x) \Big\}$$

Figure: maximum entropy Markov model.
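
A sketch of one such locally normalized transition distribution as a softmax over a few feature functions; the features and weights below are invented for illustration:

```python
import numpy as np

LABELS = ["NN", "NP", "V"]

def features(y, y_prev, word):
    # Hypothetical binary features f_k(y, y', x).
    return np.array([
        1.0 if word[0].isupper() and y == "NP" else 0.0,
        1.0 if y_prev == "NP" and y == "V" else 0.0,
        1.0 if word.endswith("s") and y == "NN" else 0.0,
    ])

lambdas = np.array([1.5, 0.8, 0.3])  # weights lambda_k (assumed trained)

def memm_transition(y_prev, word):
    scores = np.array([lambdas @ features(y, y_prev, word) for y in LABELS])
    exps = np.exp(scores)
    return exps / exps.sum()          # local normalizer Z(y', x)

print(dict(zip(LABELS, memm_transition("NP", "runs"))))
```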

Label Bias Problem: Problems with Previous Approaches
Reasons for label ambiguity:
1. Local model construction.
2. States competing only against each other.
3. Non-discriminatory state transitions.
Proposed workarounds:
1. Delay the branching of state transitions.
2. Start with a fully connected graph.
Disadvantages of these workarounds:
1. Determinization can lead to combinatorial explosion.
2. They exclude prior knowledge.

Label Bias Problem
Problems:
1. State transitioning: per-state normalization makes outgoing transitions compete only locally, not globally.
2. As a result, both paths in the example are equally probable, regardless of the observation.
Figure: label bias problem.

Factor Graphs: From Directed to Undirected Graphs
Generative models are represented as directed graphs:
1. Outputs precede inputs.
2. They describe how outputs generate inputs.
Discriminative models can be represented as factor graphs:
1. Define factors as the dependence of features on observations.
2. Allow an arbitrary number of features, e.g. capital letters, noun, ...

$$\Psi_k(y, x_k) = p(x_k \mid y)$$

Figure: factor graph.

CRF Construction: Definition
Let G = (V, E) be a graph such that Y = (Y_v)_{v ∈ V}, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional random field if, when conditioned on X, the random variables Y_v obey the Markov property with respect to the graph:

$$p(Y_v \mid X, Y_w, w \neq v) = p(Y_v \mid X, Y_w, w \sim v)$$

where w ∼ v means that w and v are neighbors in G.
Figure: factor graph.

CRF Construction: Properties
Properties of CRFs:
- Condition globally on the observation X.
- Similar to bipartite graphs: two sets of random variables as vertices, with factorized edges.
- Probabilities of labels y given observation x are normalized by the product of potential functions.
- Fixed set of features.
- Usually more expensive than HMMs (arbitrary dependencies on the observation sequence).

$$p_\theta(\mathbf{y} \mid \mathbf{x}) \propto \exp\Big( \sum_{e \in E,\, k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\, k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)$$
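
A brute-force sketch of the global normalization for a tiny chain; it enumerates all label sequences, so it is only viable at toy sizes, and the features and weights are invented for illustration:

```python
import numpy as np
from itertools import product

LABELS = [0, 1]

def score(y, x, lam=1.0, mu=0.5):
    # Hypothetical features: an edge feature fires when adjacent labels repeat,
    # a vertex feature fires when the label matches the observation.
    s = sum(lam for t in range(1, len(y)) if y[t] == y[t - 1])
    s += sum(mu for t in range(len(y)) if y[t] == x[t])
    return s

def crf_prob(y, x):
    # Z(x) sums exp(score) over ALL label sequences: global, not per-state.
    Z = sum(np.exp(score(yy, x)) for yy in product(LABELS, repeat=len(x)))
    return np.exp(score(y, x)) / Z

print(crf_prob([0, 0, 1], [0, 0, 1]))
```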

Parameter Estimation: Improved Iterative Scaling (IIS)
IIS algorithm:
- Start with an arbitrary value for each λ_k, μ_k.
- Repeat until convergence: solve for the increments δλ_k, δμ_k from the empirical expectations Ẽ[f_k], Ẽ[g_k]; then set λ_k ← λ_k + δλ_k and μ_k ← μ_k + δμ_k.
Properties of IIS: converges to the global optimum, but convergence is slow.
The increment δλ_k solves (shown for edge features; vertex features are analogous):

$$\tilde{E}[f_k] = \sum_{\mathbf{x}, \mathbf{y}} \tilde{p}(\mathbf{x})\, p(\mathbf{y} \mid \mathbf{x}) \sum_{i=1}^{n+1} f_k(e_i, \mathbf{y}|_{e_i}, \mathbf{x})\, e^{\delta\lambda_k\, T(\mathbf{x}, \mathbf{y})}$$

where T(x, y) is the total feature count of the pair (x, y).

Parameter Estimation: Stochastic Gradient Ascent (SGA)
Consider stochastic gradient ascent (it differs from descent only in that descent minimizes):
- Increase the log-likelihood one example at a time.
- Most features of an example are 0 and can be skipped; complexity O(nfp).
- Parameters change once per example.
- Works well in sparse settings.
Figure: stochastic gradient ascent.
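
A sketch of one sparse update, assuming the gradient of the conditional log-likelihood takes the usual observed-minus-expected form; all names below are illustrative:

```python
def sga_step(weights, active_feats, expected_counts, lr=0.1):
    """One stochastic gradient ascent step on one example.

    weights:         dict feature -> weight (lambda_k)
    active_feats:    dict feature -> empirical count in this example
    expected_counts: dict feature -> model expectation under p(y | x)
    Only features active in this example are touched, which is what
    makes the update cheap in sparse settings."""
    for k, observed in active_feats.items():
        grad_k = observed - expected_counts.get(k, 0.0)
        weights[k] = weights.get(k, 0.0) + lr * grad_k
    return weights

w = sga_step({}, {"cap&NP": 1.0, "suffix-s&NN": 1.0}, {"cap&NP": 0.4})
print(w)
```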

Parameter Estimation: State of the Art, Limited-Memory BFGS
Properties of L-BFGS:
- Uses second-order (curvature) information.
- Builds approximations to the inverse Hessian matrix.
- Quick convergence; great performance on unconstrained problems.
Approximation made in the gradient steps:

$$\delta^{(k)} = -B^{(k)} \nabla G(\theta^{(k)}), \qquad B^{(k)} y^{(k)} = \delta^{(k-1)}$$

where the matrix B^{(k)} is the approximated inverse Hessian.
Figure: quasi-Newton line search.
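
As a sketch of how L-BFGS is used in practice, here a small logistic-regression objective on synthetic data is minimized with SciPy's L-BFGS-B implementation; this setup is an assumption for illustration, not the presentation's own experiment:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = (X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100) > 0).astype(float)

def neg_ll_and_grad(theta):
    # Negative conditional log-likelihood of a log-linear (logistic) model.
    p = 1.0 / (1.0 + np.exp(-(X @ theta)))
    nll = -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    grad = X.T @ (p - y)           # gradient of the NLL
    return nll, grad

res = minimize(neg_ll_and_grad, np.zeros(3), jac=True, method="L-BFGS-B")
print(res.x)
```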

Inference with CRFs
Here we consider linear-chain structured CRFs:
- Viterbi decoding by dynamic programming (shortest path).
Each position i in the observation has a |Y| × |Y| matrix

$$[M_i(y', y \mid \mathbf{x})]_{|Y| \times |Y|} = e^{\Lambda_i(y', y \mid \mathbf{x})}$$

For Y = {NN, NP, V}:

$$M_i(\cdot, \cdot \mid \mathbf{x}) =
\begin{pmatrix}
e^{\Lambda_i(NN, NN \mid \mathbf{x})} & e^{\Lambda_i(NN, NP \mid \mathbf{x})} & e^{\Lambda_i(NN, V \mid \mathbf{x})} \\
e^{\Lambda_i(NP, NN \mid \mathbf{x})} & e^{\Lambda_i(NP, NP \mid \mathbf{x})} & e^{\Lambda_i(NP, V \mid \mathbf{x})} \\
e^{\Lambda_i(V, NN \mid \mathbf{x})} & e^{\Lambda_i(V, NP \mid \mathbf{x})} & e^{\Lambda_i(V, V \mid \mathbf{x})}
\end{pmatrix}$$

where

$$\Lambda_i(y', y \mid \mathbf{x}) = \sum_k \lambda_k f_k(e_i, \mathbf{y}|_{e_i} = (y', y), \mathbf{x}) + \sum_k \mu_k g_k(v_i, \mathbf{y}|_{v_i} = y, \mathbf{x})$$

Inference with CRFs: Forward-Backward
Figure: forward-backward inference calculation.

$$\alpha_0(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \mathrm{start} \\ 0 & \text{otherwise} \end{cases}
\qquad
\beta_{n+1}(y \mid \mathbf{x}) = \begin{cases} 1 & \text{if } y = \mathrm{stop} \\ 0 & \text{otherwise} \end{cases}$$

$$\alpha_i(\mathbf{x}) = \alpha_{i-1}(\mathbf{x})\, M_i(\mathbf{x}),
\qquad
\beta_i(\mathbf{x})^{T} = M_{i+1}(\mathbf{x})\, \beta_{i+1}(\mathbf{x})^{T}$$
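
A sketch of these recursions over the per-position matrices M_i; the matrices are random stand-ins (in a real CRF they come from the feature weights), and label index 0 doubles as the start/stop state in this toy example:

```python
import numpy as np

rng = np.random.default_rng(0)
n, Y = 4, 3                                   # sequence length, number of labels
M = [np.exp(rng.normal(size=(Y, Y))) for _ in range(n + 1)]  # M_1 .. M_{n+1}

alpha = np.zeros((n + 2, Y)); alpha[0, 0] = 1.0     # alpha_0 = indicator(start)
beta = np.zeros((n + 2, Y)); beta[n + 1, 0] = 1.0   # beta_{n+1} = indicator(stop)

for i in range(1, n + 2):                     # alpha_i = alpha_{i-1} M_i
    alpha[i] = alpha[i - 1] @ M[i - 1]
for i in range(n, -1, -1):                    # beta_i^T = M_{i+1} beta_{i+1}^T
    beta[i] = M[i] @ beta[i + 1]

Z = alpha[n + 1, 0]                           # normalizer Z(x) = alpha_{n+1}(stop)
print(Z, alpha[0] @ beta[0])                  # both equal Z(x)
```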

Semi-Markov Conditional Random Fields
Semi-Markov models:
- Semi-Markov chains: states persist for a duration d.
- Segment the observations; features are built on the segments.
- Faster inference than order-L CRFs.
Observation segmentation example:
"I went skiing with Fernando Pereira in British Columbia"
s = (1, 1, O), (2, 2, O), (3, 3, O), (4, 4, O), (5, 6, I), (7, 7, O), (8, 9, I)
y = O, O, O, O, I, I, O, I, I
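
A tiny sketch expanding a segmentation s = (t_j, u_j, y_j) into the per-token label sequence y, reproducing the example above:

```python
def segments_to_labels(segs):
    # Segments use 1-based inclusive bounds (t_j, u_j) with label y_j.
    labels = []
    for start, end, label in segs:
        labels.extend([label] * (end - start + 1))
    return labels

segments = [(1, 1, "O"), (2, 2, "O"), (3, 3, "O"), (4, 4, "O"),
            (5, 6, "I"), (7, 7, "O"), (8, 9, "I")]
print(segments_to_labels(segments))
# ['O', 'O', 'O', 'O', 'I', 'I', 'O', 'I', 'I']
```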

Semi-Markov CRFs: Modeling
- Segment: s_j = (t_j, u_j, y_j).
- Segment feature functions: g_k(j, x, s) = g_k(y_j, y_{j-1}, x, t_j, u_j).
- Inference:

$$P(\mathbf{s} \mid \mathbf{x}, W) = \frac{1}{Z(\mathbf{x})}\, e^{W \cdot G(\mathbf{x}, \mathbf{s})}$$

Figure: semi-Markov chains.
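
A sketch of the segment score W · G(x, s), where G sums one feature vector per segment; the segment features below are invented for illustration (exponentiating and dividing by Z(x) would give P(s | x, W)):

```python
import numpy as np

def segment_features(y, y_prev, tokens, t, u):
    span = tokens[t - 1:u]                         # 1-based inclusive bounds
    return np.array([
        float(all(w[0].isupper() for w in span)),  # whole segment capitalized
        float(u - t + 1),                          # segment length
        float(y == y_prev),                        # label repeats across segments
    ])

def semicrf_score(segments, tokens, W):
    G = np.zeros_like(W, dtype=float)
    y_prev = "START"                               # dummy previous label for j = 1
    for (t, u, y) in segments:
        G += segment_features(y, y_prev, tokens, t, u)
        y_prev = y
    return float(W @ G)

tokens = "I went skiing with Fernando Pereira in British Columbia".split()
segments = [(1, 1, "O"), (2, 2, "O"), (3, 3, "O"), (4, 4, "O"),
            (5, 6, "I"), (7, 7, "O"), (8, 9, "I")]
print(semicrf_score(segments, tokens, np.array([1.0, 0.1, 0.2])))
```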

Experimental Results: Setup
- Verification of the label bias problem.
- Synthetic data generated by randomly chosen HMMs: a mixture of first-order and second-order models.
- Transition probabilities:

$$p_\alpha(y_i \mid y_{i-1}, y_{i-2}) = \alpha\, p_2(y_i \mid y_{i-1}, y_{i-2}) + (1 - \alpha)\, p_1(y_i \mid y_{i-1})$$

- Emission probabilities:

$$p_\alpha(x_i \mid y_i, x_{i-1}) = \alpha\, p_2(x_i \mid y_i, x_{i-1}) + (1 - \alpha)\, p_1(x_i \mid y_i)$$

- Five labels a-e (|Y| = 5) and 26 observation values A-Z (|X| = 26).
- POS tagging experiments on the Penn Treebank.
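
A sketch of the mixed-order transition used to generate the synthetic data; p_1 and p_2 below are random stand-ins over |Y| = 5 labels:

```python
import numpy as np

rng = np.random.default_rng(1)
Y = 5
p1 = rng.dirichlet(np.ones(Y), size=Y)        # p1[y_prev][y]
p2 = rng.dirichlet(np.ones(Y), size=(Y, Y))   # p2[y_prev2][y_prev][y]

def sample_next(y_prev, y_prev2, alpha):
    # p_alpha(y | y_prev, y_prev2) = alpha * p2 + (1 - alpha) * p1
    probs = alpha * p2[y_prev2, y_prev] + (1 - alpha) * p1[y_prev]
    return rng.choice(Y, p=probs)

print([sample_next(0, 1, 0.5) for _ in range(5)])
```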

Experimental Results
[Figures: experimental result plots.]

Conclusion
- Generative vs. discriminative models.
- Arbitrary number of features.
- Global vs. local modeling.
- Convex optimization problem.
- Different solutions to parameter estimation.
- Factor graphs vs. directed graphs.
- Semi-Markov CRFs.

Bibliography
1. Lafferty, J., McCallum, A., and Pereira, F. Conditional Random Fields: Probabilistic Models for Segmenting and Labeling Sequence Data.
2. Sarawagi, S. and Cohen, W. Semi-Markov Conditional Random Fields for Information Extraction.
3. Bouchard, G. and Triggs, B. The Trade-Off Between Generative and Discriminative Classifiers.
4. Elkan, C. Log-Linear Models and Conditional Random Fields (tutorial).
5. Wallach, H. Conditional Random Fields: An Introduction.

Thank you! Questions?