Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov


Why PGMs? PGMs can model joint probabilities of many events. "Many techniques commonly used as part of a speech recognition system can be described by a graph: this includes Gaussian distributions, mixture models, decision trees, factor analysis, principal component analysis, linear discriminant analysis, and hidden Markov models... [including] many advanced models at the acoustic-, pronunciation-, and language-modeling levels." Bilmes (2004)

ACL 2010 Papers
1. Conditional Random Fields for Word Hyphenation
2. Practical Very Large Scale CRFs
3. Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm
4. Cross-Lingual Latent Topic Extraction
5. Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
6. PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names
7. A Latent Dirichlet Allocation Method for Selectional Preferences
8. Latent Variable Models of Selectional Preference
9. A Bayesian Method for Robust Estimation of Distributional Similarities
10. Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
11. Blocked Inference in Bayesian Tree Substitution Grammars
12. Decision Detection Using Hierarchical Graphical Models

Today's Lecture
A very brief overview of Probabilistic Graphical Models (PGMs)
Undirected models: MRFs and CRFs
Learning and decoding in MRFs and CRFs
Case study: Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables
For a concise introduction to PGMs and their uses in NLP: https://sites.google.com/site/spring2011pgm/

Markov Random Fields
Describe a joint probability, P(A,B,C,D), using an undirected graph. Edges describe interacting RVs, i.e., conditional dependencies; for instance, Alice and Bob worked on the homework together, while Alice and Charlie did not. (Random variable: a variable whose value results from a measurement on some type of random process, e.g., A = {Alice's grade for homework 2}.)
Include a term (factor, ϕ) in P(A,B,C,D) for every set of RVs that interact. For each configuration of RVs X_1, X_2, ..., X_n, ϕ(X_1, X_2, ..., X_n) gives an affinity score. Example factor ϕ_1(A,B) over grades: (a,a) → 30, (a,b) → 5, (a,c) → 1, (c,c) → 10.

Joint Probability via Local Factors
ϕ_1(A,B): a0 b0 → 30, a0 b1 → 5, a1 b0 → 1, a1 b1 → 10
ϕ_2(B,C): b0 c0 → 100, b0 c1 → 1, b1 c0 → 1, b1 c1 → 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Factors map subsets of RVs to non-negative real numbers. High values of P(a,b,c,d) mean the local factors have high scores. P(a,b,c,d) is invariant to scaling of the factors. Factors don't have to be tables: they can be functions (which could be used to produce tables), and they often depend on features.
Z is just a constant, so it is often left out of the discussion. We don't even need to calculate it if we just want to decode: find the most likely Y given X.
Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).
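To make the product of local factors and the role of Z concrete, here is a minimal brute-force sketch (not from the slides). It reuses the ϕ_1 and ϕ_2 tables above; the values of ϕ_3 and ϕ_4 are made-up placeholders, since the slides do not give them.

```python
from itertools import product

# Factor tables from the slides (phi1, phi2); phi3 and phi4 are assumed placeholder
# values so that the product over the 4-cycle A-B-C-D is well defined.
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}    # phi1(A,B)
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi2(B,C)
phi3 = {(0, 0): 1.0, (0, 1): 100.0, (1, 0): 100.0, (1, 1): 1.0}  # phi3(C,D), assumed
phi4 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi4(D,A), assumed

def unnormalized(a, b, c, d):
    # Product of local factors: high when each factor likes its local configuration.
    return phi1[(a, b)] * phi2[(b, c)] * phi3[(c, d)] * phi4[(d, a)]

# Partition function Z: sum of the unnormalized score over every configuration.
Z = sum(unnormalized(a, b, c, d) for a, b, c, d in product([0, 1], repeat=4))

def p(a, b, c, d):
    return unnormalized(a, b, c, d) / Z

# Example query: marginal P(B = 0), obtained by summing out A, C, D.
p_b0 = sum(p(a, 0, c, d) for a, c, d in product([0, 1], repeat=3))
print(Z, p(0, 0, 0, 0), p_b0)
```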

Factorization implies Independencies
P(a,b,c,d) ∝ ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a) = [ϕ_1(a,b) ϕ_2(b,c)] [ϕ_3(c,d) ϕ_4(d,a)] = F(b, a, c) G(d, a, c)
Intuition: if (a,c) are given, knowing b tells us nothing about d.
Theorem: X ⊥ Y | W iff P(X_1, ..., X_N) = F(X, W) G(Y, W).
This is how we get the independencies from the factors; can we tell just by looking at the graph?

Formal MN Independencies
X ⊥ Y | W (X and Y are independent given W) if we can NOT go from X to Y without passing through W: W blocks the interaction of X and Y. Unshaded nodes: unobserved (X, Y); shaded nodes: observed (W).
This is simpler than the global independencies in BNs (d-separation; see Chapter 3.3 in KF09).
Consequence: X is independent of everyone else given its neighbors (its Markov blanket).
To make an MN: connect X and Y if they should never be rendered independent via other rvs.
You show me a graph, I can tell you what is independent of what given what.

More on Factors: Factor Product (Figure 4.3, KF09)
ϕ_1(A,B) × ϕ_2(B,C) = ψ(A,B,C), where every entry ψ(a,b,c) is ϕ_1(a,b) * ϕ_2(b,c).
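A small sketch of this factor product on the tables above; the loop structure is illustrative, not a library API.

```python
# psi(A,B,C) = phi1(A,B) * phi2(B,C): multiply entries that agree on the shared variable B.
phi1 = {(0, 0): 30.0, (0, 1): 5.0, (1, 0): 1.0, (1, 1): 10.0}    # phi1(A,B)
phi2 = {(0, 0): 100.0, (0, 1): 1.0, (1, 0): 1.0, (1, 1): 100.0}  # phi2(B,C)

psi = {}
for (a, b1), v1 in phi1.items():
    for (b2, c), v2 in phi2.items():
        if b1 == b2:                   # entries must agree on B
            psi[(a, b1, c)] = v1 * v2  # every entry of psi is phi1(a,b) * phi2(b,c)

print(psi[(0, 0, 0)])  # 30 * 100 = 3000
```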

Log-Linear Models
Factor graphs are more explicit, but still require bulky tables for all the factor values. We can use features to capture patterns that we'd like reflected in the clique potentials:
P_Φ(X_1, ..., X_N) = (1/Z) ϕ_1(D_1) ϕ_2(D_2) ... ϕ_K(D_K)
Define ϕ_i(D_i) = exp(-w_i f_i(D_i)). The features f_i(D_i) tell us something indicative about some rvs, e.g. f_i(Y_1, Y_2) = 1 if the values of Y_1 and Y_2 are DT and NN, else f_i(Y_1, Y_2) = 0. We can have many feature functions for each D_i, each indicating a different property of our rvs.
So P_Φ(X_1, ..., X_N) = (1/Z) exp(-w_1 f_1(D_1)) ... exp(-w_K f_K(D_K)) = (1/Z) exp(-Σ_i w_i f_i(D_i)).
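A minimal sketch of a feature-based potential, following the slide's exp(-w·f) convention; the feature functions and weights below are hypothetical.

```python
import math

# Indicator features over a pair of tag variables (Y1, Y2); names and weights are illustrative.
def f_dt_nn(y1, y2):
    return 1.0 if (y1, y2) == ("DT", "NN") else 0.0

def f_same_tag(y1, y2):
    return 1.0 if y1 == y2 else 0.0

features = [f_dt_nn, f_same_tag]
weights = [-2.0, 0.5]  # hypothetical weights w_i

def phi(y1, y2):
    # Slide convention: phi(D) = exp(-sum_i w_i f_i(D)).
    return math.exp(-sum(w * f(y1, y2) for w, f in zip(weights, features)))

print(phi("DT", "NN"), phi("NN", "NN"), phi("DT", "VB"))
```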

Features in Markov Random Fields
Conditional tables vs. features. Table: ϕ_1(A,B) with entries (a,a) → 30, (a,b) → 5, (a,c) → 1, (c,c) → 10. Features: ϕ_1(X,Y) = 1(X = Y); ϕ_2(X,Y) = 1 if X and Y are one grade apart, 0 otherwise.

Conditional Random Fields
CRFs: an MRF over X (input) and Y (output). Model the conditional probability P(Y | X). Uses features.

Conditional Random Fields

Conditional Random Fields
A CRF does not need to model interactions among the inputs; fewer parameters make it easier to learn.

"We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models." (Lafferty, McCallum, and Pereira, 2001) 3906 citations on Google Scholar.

Conditional Random Fields for Sequences
(Figure: a linear chain Y_1 - Y_2 - Y_3 - Y_4, with each Y_t connected to its input x_t.)
A Markov Random Field with features, trained and used for conditional probabilities. Useful for modeling sequences.
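A toy linear-chain CRF sketch, assuming hypothetical emission and transition feature weights. It computes p(y | x) by enumerating every label sequence, which is only feasible for tiny examples but makes the input-dependent normalizer Z(x) explicit.

```python
from itertools import product
import math

LABELS = ["DT", "NN", "VB"]

# Hypothetical log-linear weights: emission features (word, tag) and transition features (tag, tag).
w_emit = {("the", "DT"): 2.0, ("dog", "NN"): 1.5, ("barks", "VB"): 1.5}
w_trans = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0}

def score(x, y):
    # Unnormalized log-score: sum of active feature weights along the chain.
    s = sum(w_emit.get((xi, yi), 0.0) for xi, yi in zip(x, y))
    s += sum(w_trans.get((y[i], y[i + 1]), 0.0) for i in range(len(y) - 1))
    return s

def p_given_x(x, y):
    # Enumerate all label sequences to get the input-dependent Z(x); fine for tiny examples only.
    logZ = math.log(sum(math.exp(score(x, yy)) for yy in product(LABELS, repeat=len(x))))
    return math.exp(score(x, y) - logZ)

x = ["the", "dog", "barks"]
print(p_given_x(x, ("DT", "NN", "VB")))
```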

Learning MRF parameters
Easy when exact inference is possible. Given some data D = {X^i}, i = 1..n, we want to learn the MRF parameters. Maximum likelihood principle: pick the parameters that maximize the (log) probability of the data: argmax_Θ L(D, Θ).

MLE for MRF parameters
Pick the parameters that maximize the (log) probability of the data:
L(D, Θ) = p(D | Θ) = Π_i p_Θ(X^i)
With log-linear factors, p_Θ(X) = (1/Z) Π_A exp(Σ_j θ_j f_j(x_A)), so
log p_Θ(X^i) = Σ_A Σ_j θ_j f_j(x^i_A) - log(Z), where Z = Σ_X Π_A exp(Σ_j θ_j f_j(x_A))
log p(D | Θ) = Σ_i log p_Θ(X^i)

MLE for MRF parameters
Compute the gradient of the log-likelihood, then apply a gradient-based optimization method (e.g., stochastic gradient descent) to find the parameters that maximize the log-likelihood.
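A sketch of maximum-likelihood training by gradient ascent on a toy two-variable log-linear MRF, where Z is computed exactly by enumeration (standing in for exact inference). The features, data, and the exp(+θ·f) sign convention are illustrative choices, not the slides' example.

```python
from itertools import product
import math

# Toy log-linear MRF over two binary variables with two indicator features; all names are illustrative.
def feats(x1, x2):
    return [1.0 if x1 == x2 else 0.0,   # f1: agreement
            1.0 if x1 == 1 else 0.0]    # f2: bias on X1

def log_unnorm(x, theta):
    return sum(t * f for t, f in zip(theta, feats(*x)))

def fit(data, steps=200, lr=0.1):
    theta = [0.0, 0.0]
    configs = list(product([0, 1], repeat=2))
    for _ in range(steps):
        # Exact expected feature counts under the current model (enumeration = exact inference).
        logZ = math.log(sum(math.exp(log_unnorm(x, theta)) for x in configs))
        probs = {x: math.exp(log_unnorm(x, theta) - logZ) for x in configs}
        expected = [sum(probs[x] * feats(*x)[k] for x in configs) for k in range(2)]
        empirical = [sum(feats(*x)[k] for x in data) / len(data) for k in range(2)]
        # Gradient of the average log-likelihood: empirical counts minus expected counts.
        theta = [t + lr * (e - m) for t, e, m in zip(theta, empirical, expected)]
    return theta

data = [(0, 0), (1, 1), (1, 1), (1, 0)]
print(fit(data))
```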

Learning CRF parameters
Use the conditional log-likelihood instead. Given some data D = {X^i, Y^i}, i = 1..n, we want to learn the CRF parameters.
Conditional log-likelihood: log p(Y^i | X^i, Θ) = Σ_A Σ_j θ_j f_j({x, y}_A) - log(Z(X^i))
Z is now input-dependent: log(Z(X^i)) = log Σ_Y Π_A exp(Σ_j θ_j f_j({x, y}_A))
Compare to the MRF case: log p_Θ(X^i) = Σ_A Σ_j θ_j f_j(x_A) - log(Z)

Training a CRF
Use a gradient-based method to optimize the conditional log-likelihood of the data. The derivative with respect to a single parameter θ_k for a training example {x^i, y^i}:
∂L(y^i, x^i)/∂θ_k = Σ_A f_k({x^i, y^i}_A) - Σ_A Σ_{y_A} p(y_A | x^i) f_k({x^i, y}_A)
i.e., feature counts minus expected feature counts. This is easy to compute when exact inference can be performed: we need to count features in the data and compute marginals.
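A sketch of this gradient (observed minus expected feature counts) for one training pair, with hypothetical features and exhaustive enumeration standing in for the marginals a real implementation would obtain by dynamic programming.

```python
from itertools import product
import math

LABELS = ["DT", "NN", "VB"]

# Hypothetical emission and transition indicator features for a tiny chain CRF.
FEATS = [("emit", "the", "DT"), ("emit", "dog", "NN"), ("trans", "DT", "NN")]

def feat_counts(x, y):
    counts = [0.0] * len(FEATS)
    for k, f in enumerate(FEATS):
        if f[0] == "emit":
            counts[k] = sum(1.0 for xi, yi in zip(x, y) if (xi, yi) == (f[1], f[2]))
        else:
            counts[k] = sum(1.0 for i in range(len(y) - 1) if (y[i], y[i + 1]) == (f[1], f[2]))
    return counts

def gradient(x, y_gold, theta):
    # Observed feature counts on the gold labeling ...
    observed = feat_counts(x, y_gold)
    # ... minus expected counts under p(y | x, theta); enumeration stands in for exact marginals.
    ys = list(product(LABELS, repeat=len(x)))
    scores = [math.exp(sum(t * c for t, c in zip(theta, feat_counts(x, y)))) for y in ys]
    Zx = sum(scores)
    expected = [sum(s / Zx * feat_counts(x, y)[k] for s, y in zip(scores, ys))
                for k in range(len(FEATS))]
    return [o - e for o, e in zip(observed, expected)]

print(gradient(["the", "dog"], ("DT", "NN"), [0.0, 0.0, 0.0]))
```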

CRFs for Classification
CRFs are models for predicting probabilities: p(Y | X) or p(Y_k | X). In ML we care about decisions (predictions): Is this spam? What is the POS tag of this word? How do we turn probabilities into decisions? For a sequence model: find the sequence of states with max probability, using the max-product algorithm (Viterbi decoding).
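A compact Viterbi (max-product) decoding sketch for a linear chain, assuming hypothetical emission and transition log-scores; in a trained CRF these scores would come from the learned feature weights.

```python
LABELS = ["DT", "NN", "VB"]

# Hypothetical local log-scores; in a real CRF these come from the learned feature weights.
def emit(word, tag):
    table = {("the", "DT"): 2.0, ("dog", "NN"): 1.5, ("barks", "VB"): 1.5}
    return table.get((word, tag), 0.0)

def trans(prev_tag, tag):
    table = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0}
    return table.get((prev_tag, tag), 0.0)

def viterbi(words):
    # delta[t][y]: best log-score of any tag sequence ending in y at position t; back[t][y]: its predecessor.
    delta = [{y: emit(words[0], y) for y in LABELS}]
    back = [{}]
    for t in range(1, len(words)):
        delta.append({})
        back.append({})
        for y in LABELS:
            best_prev = max(LABELS, key=lambda yp: delta[t - 1][yp] + trans(yp, y))
            delta[t][y] = delta[t - 1][best_prev] + trans(best_prev, y) + emit(words[t], y)
            back[t][y] = best_prev
    # Follow back-pointers from the best final tag.
    y_last = max(LABELS, key=lambda y: delta[-1][y])
    path = [y_last]
    for t in range(len(words) - 1, 0, -1):
        path.append(back[t][path[-1]])
    return list(reversed(path))

print(viterbi(["the", "dog", "barks"]))
```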

Decision Theory
Concerned with converting probabilities into decisions. Use the generalized Bayes rule, or minimum Bayes risk (MBR) rule: select the output y' that minimizes the risk, i.e., the expected loss:
risk(y') = E[l(y', y)] = Σ_{y''} p(y'' | x) l(y', y'')
MBR decoding depends on the loss function. Examples: for accuracy, output the label with the max marginal probability; for F-measure, MBR decoding can be intractable.
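A short sketch of MBR decoding under a per-position (accuracy/Hamming) loss: at each position, pick the label with the highest marginal probability. Marginals are computed by enumeration here, standing in for forward-backward; the scores are the same hypothetical ones used in the earlier chain sketches.

```python
from itertools import product
import math

LABELS = ["DT", "NN", "VB"]

def log_score(x, y):
    # Same hypothetical emission/transition log-scores as in the earlier chain sketches.
    emit = {("the", "DT"): 2.0, ("dog", "NN"): 1.5, ("barks", "VB"): 1.5}
    trans = {("DT", "NN"): 1.0, ("NN", "VB"): 1.0}
    s = sum(emit.get((xi, yi), 0.0) for xi, yi in zip(x, y))
    return s + sum(trans.get((y[i], y[i + 1]), 0.0) for i in range(len(y) - 1))

def mbr_decode(x):
    ys = list(product(LABELS, repeat=len(x)))
    probs = [math.exp(log_score(x, y)) for y in ys]
    Z = sum(probs)
    # Per-position marginals p(y_t = label | x); MBR under accuracy picks the argmax at each position.
    decoded = []
    for t in range(len(x)):
        marg = {lab: sum(p for p, y in zip(probs, ys) if y[t] == lab) / Z for lab in LABELS}
        decoded.append(max(marg, key=marg.get))
    return decoded

print(mbr_decode(["the", "dog", "barks"]))
```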

Case Study: Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables [Nakagawa, Inui, Kurohashi, HLT-NAACL 2010]

Sentiment Classification
Classify documents / sentences / phrases / words as subjective/objective or positive/negative:
"The picture quality is unparalleled" → Positive
"Watching this movie is like watching paint dry" → Negative
"The design is really kewl but this thing is expensive" → ?

Sentiment classification
Bag-of-words approaches: good, kewl, cool, small → positive; bad, dull, outrageous, big → negative.
Cannot handle compositionality: "great" vs. "great at taking your money" vs. "not great at taking your money"; "killed terrorist" vs. "killed terrorist targets".

Sentiment Compositionality
Annotations are available only at the sentence (or phrase) level, so we need to model hidden structure, and the model needs to be compositional. This paper: utilize a CRF to model dependencies between words, model hidden structure, and use a dependency parse to indicate sentential structure.

Dependency Subtrees and Compositionality

CRF Based on the Dependency Tree
(Figure: sentiment variables S_0, S_1, S_2, S_3, S_4 over the dependency tree of "<root> It prevents cancer and heart disease.")

Model Specifics Inference: Belief propagation Decision: Argmax over the root marginal Learning: MAP estimation of the parameters

Parameter Estimation
Gradient computation. Notation: p_l, the label of the entire sentence; w_l, the sequence of labels; h_l, the sequence of heads; s_l, the sequence of polarities.

Features
Phrase-level features (i ranges over all phrases): s_i, polarity; q_i, prior polarity; r_i, polarity reversal; m_i, number of words.
Word-level features (k ranges over all words in phrase i): u_{i,k}, surface form; b_{i,k}, base form; c_{i,k}, coarse POS tag; f_{i,k}, fine POS tag.

Results

Summary Probabilistic graphical models: Models of joint probability CRF: A discriminative model for conditional probabilities Training through MLE Decisions using CRFs