Probabilistic Graphical Models: MRFs and CRFs. CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

1 Probabilistic Graphical Models: MRFs and CRFs CSE628: Natural Language Processing Guest Lecturer: Veselin Stoyanov

2 Why PGMs? PGMs can model joint probabilities of many events. "Many techniques commonly used as part of a speech recognition system can be described by a graph; this includes Gaussian distributions, mixture models, decision trees, factor analysis, principal component analysis, linear discriminant analysis, and hidden Markov models... [including] many advanced models at the acoustic-, pronunciation-, and language-modeling levels." (Bilmes, 2004)

3 ACL 2010 Papers
1. Conditional Random Fields for Word Hyphenation
2. Practical Very Large Scale CRFs
3. Jointly Optimizing a Two-Step Conditional Random Field Model for Machine Transliteration and Its Fast Decoding Algorithm
4. Cross-Lingual Latent Topic Extraction
5. Topic Models for Word Sense Disambiguation and Token-Based Idiom Detection
6. PCFGs, Topic Models, Adaptor Grammars and Learning Topical Collocations and the Structure of Proper Names
7. A Latent Dirichlet Allocation Method for Selectional Preferences
8. Latent Variable Models of Selectional Preference
9. A Bayesian Method for Robust Estimation of Distributional Similarities
10. Bayesian Synchronous Tree-Substitution Grammar Induction and Its Application to Sentence Compression
11. Blocked Inference in Bayesian Tree Substitution Grammars
12. Decision Detection Using Hierarchical Graphical Models

4 Today's Lecture A very brief overview of Probabilistic Graphical Models (PGMs). Undirected models: MRFs and CRFs. Learning and decoding in MRFs and CRFs. Case study: Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables. For a concise introduction to PGMs and their uses in NLP:

5 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

6 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. (Random variable: a variable whose value results from a measurement on some type of random process, e.g., A = Alice's grade for homework 2.) Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

7 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. (Conditional dependencies: for instance, Alice and Bob worked on the homework together, while Alice and Charlie did not.) Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

8 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. For each configuration of RVs X_1, X_2, ..., X_n, ϕ(X_1, X_2, ..., X_n) gives an affinity score. Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

9 Markov Random Fields Describe the joint probability P(A,B,C,D) using an undirected graph. Edges describe interacting RVs. Example factor over grades: ϕ_1(A,B) with ϕ_1(a,a) = 30, ϕ_1(a,b) = 5, ϕ_1(a,c) = 1, ..., ϕ_1(c,c) = 10. Include a term (factor, ϕ) in P(A,B,C,D) for all sets of RVs that interact.

10 Joint Probability via Local Factors
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).

11 Joint Probability via Local Factors P(a,b,c,d) is invariant to scaling of the factors.
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).

12 Joint Probability via Local Factors P(a,b,c,d) is invariant to scaling of the factors. Factors don't have to be tables; they could be functions (that could be used to produce tables), and often depend on features.
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).

13 Joint Probability via Local Factors P(a,b,c,d) is invariant to scaling of the factors. Factors don't have to be tables; they could be functions (that could be used to produce tables), and often depend on features.
ϕ_1(A,B): ϕ_1(a0,b0) = 30, ϕ_1(a0,b1) = 5, ϕ_1(a1,b0) = 1, ϕ_1(a1,b1) = 10
ϕ_2(B,C): ϕ_2(b0,c0) = 100, ϕ_2(b0,c1) = 1, ϕ_2(b1,c0) = 1, ϕ_2(b1,c1) = 100
P(a,b,c,d) = (1/Z) ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a), where Z = Σ_{a,b,c,d} ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a)
Z is just a constant, so it is often left out of the discussion. We don't even need to calculate it if we just want to decode: find the most likely Y given X.
Factors map subsets of RVs to non-negative real numbers. High values for P(a,b,c,d) mean the local factors have high scores. Inference: we can use P(A,B,C,D) to answer queries (conditional probability, marginal probability, most probable (MAP) assignment).
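
To make the running example concrete, here is a minimal sketch in plain Python of computing Z, the joint, and a marginal by brute-force enumeration. The ϕ_1 and ϕ_2 values follow the tables above; ϕ_3 and ϕ_4 are not shown on the slides, so the values below are illustrative placeholders.

```python
import itertools

# Pairwise factor tables over binary variables. phi1/phi2 follow the
# slides; phi3/phi4 are illustrative placeholders (not from the slides).
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}     # phi_1(A,B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi_2(B,C)
phi3 = {(0, 0): 1, (0, 1): 100, (1, 0): 100, (1, 1): 1}   # phi_3(C,D), assumed
phi4 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}   # phi_4(D,A), assumed

def score(a, b, c, d):
    """Unnormalized score: the product of the local factors."""
    return phi1[a, b] * phi2[b, c] * phi3[c, d] * phi4[d, a]

# Partition function Z: sum the score over all 2^4 configurations.
Z = sum(score(*cfg) for cfg in itertools.product([0, 1], repeat=4))

def p(a, b, c, d):
    """Joint probability P(a,b,c,d) = score / Z."""
    return score(a, b, c, d) / Z

# Marginal query P(A=0): sum out B, C, D.
p_a0 = sum(p(0, b, c, d) for b, c, d in itertools.product([0, 1], repeat=3))
print(f"Z = {Z}, P(0,0,0,0) = {p(0, 0, 0, 0):.4f}, P(A=0) = {p_a0:.4f}")
```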

14 Factorization implies Independencies
P(a,b,c,d) ∝ ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a) = [ϕ_1(a,b) ϕ_2(b,c)] [ϕ_3(c,d) ϕ_4(d,a)] = F(b; a,c) G(d; a,c)
Intuition: if (a,c) are given, knowing b tells us nothing about d.
Theorem: X ⊥ Y | W iff P(X_1 ... X_N) ∝ F(X, W) G(Y, W)

15 Factorization implies Independencies
P(a,b,c,d) ∝ ϕ_1(a,b) ϕ_2(b,c) ϕ_3(c,d) ϕ_4(d,a) = [ϕ_1(a,b) ϕ_2(b,c)] [ϕ_3(c,d) ϕ_4(d,a)] = F(b; a,c) G(d; a,c)
Intuition: if (a,c) are given, knowing b tells us nothing about d.
Theorem: X ⊥ Y | W iff P(X_1 ... X_N) ∝ F(X, W) G(Y, W)
This is how we get the independencies from the factors. Can we tell just by looking at the graph?

16 Formal MN Independencies X ⊥ Y | W (X and Y are indep. given W) if we can NOT go from any X ∈ X to any Y ∈ Y without passing through W. W blocks the interaction of X and Y. Unshaded: unobserved (X, Y). Shaded: observed (W).

17 Formal MN Independencies X ⊥ Y | Z (X and Y are indep. given Z) if we can NOT go from any X ∈ X to any Y ∈ Y without touching Z. Z blocks the interaction of X and Y. To make an MN: connect X and Y if they should never be rendered independent via other RVs. Consequence: X is independent of everyone else given its neighbors (its Markov blanket). Simpler than the global independencies in BNs (d-separation; see Chapter 3.3 in KF09). You show me a graph, I can tell you what is independent of what given what. Unshaded: unobserved (X, Y). Shaded: observed (Z).

18 More on Factors: Factor Product ϕ_1(A,B) × ϕ_2(B,C) = ψ(A,B,C), where every entry ψ(a,b,c) is ϕ_1(a,b) · ϕ_2(b,c). (Figure 4.3, KF09)
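
A small sketch of the factor product in code (the dict representation is mine; the numbers reuse the toy tables above, not KF09's Figure 4.3):

```python
# Factor product: psi(A,B,C) = phi_1(A,B) * phi_2(B,C).
# Factors are dicts keyed by value tuples; entries multiply wherever
# the shared variable B agrees.
phi1 = {(0, 0): 30, (0, 1): 5, (1, 0): 1, (1, 1): 10}    # over (A, B)
phi2 = {(0, 0): 100, (0, 1): 1, (1, 0): 1, (1, 1): 100}  # over (B, C)

psi = {
    (a, b, c): phi1[a, b] * phi2[b, c]
    for a in (0, 1) for b in (0, 1) for c in (0, 1)
}
print(psi[0, 0, 0])  # 30 * 100 = 3000
```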

19 Log-Linear Models Factor graphs are more explicit, but still require bulky tables for all the factor values. Can use features to capture patterns that we'd like reflected in clique potentials:
P_Φ(X_1, ..., X_N) ∝ ϕ_1(D_1) ϕ_2(D_2) ... ϕ_K(D_K)
Define ϕ_i(D_i) = exp(-w_i f_i(D_i)). f_i(D_i) tells us something indicative about some RVs, e.g., f_i(Y_1, Y_2) = 1 if the values of Y_1 and Y_2 are DT and NN, else f_i(Y_1, Y_2) = 0.
So P_Φ(X_1, ..., X_N) ∝ exp(-w_1 f_1(D_1)) ... exp(-w_K f_K(D_K)) = exp(-Σ_i w_i f_i(D_i))

20 Log-Linear Models Factor graphs are more explicit, but still require bulky tables for all the factor values. Can use features to capture patterns that we'd like reflected in clique potentials:
P_Φ(X_1, ..., X_N) ∝ ϕ_1(D_1) ϕ_2(D_2) ... ϕ_K(D_K)
Define ϕ_i(D_i) = exp(-w_i f_i(D_i)). f_i(D_i) tells us something indicative about some RVs, e.g., f_i(Y_1, Y_2) = 1 if the values of Y_1 and Y_2 are DT and NN, else f_i(Y_1, Y_2) = 0. We can have many feature functions for each D_i that each indicate different properties of our RVs.
So P_Φ(X_1, ..., X_N) ∝ exp(-w_1 f_1(D_1)) ... exp(-w_K f_K(D_K)) = exp(-Σ_i w_i f_i(D_i))

21 Features in Markov Random Fields Conditional tables vs. features. Table: ϕ_1(A,B) with ϕ_1(a,a) = 30, ϕ_1(a,b) = 5, ϕ_1(a,c) = 1, ..., ϕ_1(c,c) = 10. Features: ϕ_1(X,Y) = 1(X = Y); ϕ_2(X,Y) = 1 if X and Y are one grade apart, 0 otherwise.
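
A sketch of how a log-linear factor can stand in for a table: two indicator features over grades, with illustrative weights chosen so that, under the slide's convention ϕ = exp(-w·f), the factor roughly reproduces the table entries (exp(3.4) ≈ 30, exp(1.6) ≈ 5):

```python
import math

def f_same(x, y):
    """Indicator feature: 1 if the two grades are equal."""
    return 1.0 if x == y else 0.0

def f_adjacent(x, y):
    """Indicator feature: 1 if the grades are one apart (a-b or b-c)."""
    grades = "abc"
    return 1.0 if abs(grades.index(x) - grades.index(y)) == 1 else 0.0

# Illustrative weights, not from the slides.
w_same, w_adjacent = -3.4, -1.6

def phi(x, y):
    """Log-linear factor: exp(-(w_same*f_same + w_adjacent*f_adjacent))."""
    return math.exp(-(w_same * f_same(x, y) + w_adjacent * f_adjacent(x, y)))

print(phi("a", "a"), phi("a", "b"), phi("a", "c"))  # ~30, ~5, 1
```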

22 Conditional Random Fields CRFs: an MRF over X (input) and Y (output). Model the conditional probability P(Y | X). Uses features.

23 Conditional Random Fields

24 Conditional Random Fields Does not need to model input interactions. Fewer parameters, so easier to learn.

25 3906 citations on Google Scholar

26 "We present conditional random fields, a framework for building probabilistic models to segment and label sequence data. Conditional random fields offer several advantages over hidden Markov models and stochastic grammars for such tasks, including the ability to relax strong independence assumptions made in those models." 3906 citations on Google Scholar

27 Conditional Random Fields for Sequences [Figure: linear chain Y_1 - Y_2 - Y_3 - Y_4, with each label Y_t connected to its input x_t] A Markov random field, with features, trained and used for conditional probabilities. Useful for modeling sequences.

28 Learning MRF parameters Easy when exact inference is possible. Given some data D = {X_i}, i = 1..n, we want to learn the MRF parameters Θ. Maximum likelihood principle: pick the parameters that maximize the (log) probability of the data: argmax_Θ L(D, Θ)

29 MLE for MRF parameters Pick the parameters Θ that maximize the (log) probability of the data:
log p(D | Θ) = Σ_i log p_Θ(X_i)
p_Θ(X) = (1/Z) Π_A exp(Σ_j θ_j f_j(x_A))
log p_Θ(X_i) = Σ_A Σ_j θ_j f_j(x_A) - log Z, where log Z = log Σ_X Π_A exp(Σ_j θ_j f_j(x_A))

30 MLE for MRF parameters Compute the gradient of the log-likelihood, then run a gradient-based optimization method (e.g., stochastic gradient descent) to find the parameters that maximize the log-likelihood.
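
A minimal sketch of this loop for a toy MRF with one feature over two binary RVs; Z and the expected feature count are computed by brute-force enumeration, which only works for tiny models. The data and learning rate are made up.

```python
import math
import itertools

# Toy MRF: p_w(x) = exp(w * f(x)) / Z(w), with a single feature
# f(x) = 1 if x1 == x2. Training data and learning rate are made up.
def f(x):
    return 1.0 if x[0] == x[1] else 0.0

configs = list(itertools.product([0, 1], repeat=2))
data = [(0, 0), (1, 1), (0, 0), (0, 1)]

w, lr = 0.0, 0.5
for step in range(200):
    scores = {x: math.exp(w * f(x)) for x in configs}
    Z = sum(scores.values())
    empirical = sum(f(x) for x in data) / len(data)            # counts in the data
    expected = sum((s / Z) * f(x) for x, s in scores.items())  # model expectation
    w += lr * (empirical - expected)                           # gradient ascent step

print(f"learned w = {w:.3f}")  # ~ln 3 = 1.099, matching the 3/4 agreement rate
```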

31 Learning CRF parameters Use the conditional log-likelihood instead. Given some data D = {X_i, Y_i}, i = 1..n, we want to learn the CRF parameters.
Conditional log-likelihood: log p(Y_i | X_i, Θ) = Σ_A Σ_j θ_j f_j({x, y}_A) - log Z_{X_i}
Z is now input-dependent: log Z_X = log Σ_Y Π_A exp(Σ_j θ_j f_j({x, y}_A))

32 Learning CRF parameters Use the conditional log-likelihood instead. Given some data D = {X_i, Y_i}, i = 1..n, we want to learn the CRF parameters.
Conditional log-likelihood: log p(Y_i | X_i, Θ) = Σ_A Σ_j θ_j f_j({x, y}_A) - log Z_{X_i}
Compare to MRF: log p_Θ(X_i) = Σ_A Σ_j θ_j f_j(x_A) - log Z
Z is now input-dependent: log Z_X = log Σ_Y Π_A exp(Σ_j θ_j f_j({x, y}_A))

33 Learning CRF parameters Use the conditional log-likelihood instead. Given some data D = {X_i, Y_i}, i = 1..n, we want to learn the CRF parameters.
Conditional log-likelihood: log p(Y_i | X_i, Θ) = Σ_A Σ_j θ_j f_j({x, y}_A) - log Z_{X_i}
Compare to MRF: log Z = log Σ_X Π_A exp(Σ_j θ_j f_j(x_A))
Z is now input-dependent: log Z_X = log Σ_Y Π_A exp(Σ_j θ_j f_j({x, y}_A))
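
For a linear-chain CRF, the input-dependent Z_X does not require enumerating all label sequences: the forward recursion computes it in O(T L^2). A sketch with made-up potentials (real implementations work in log space to avoid underflow); the brute-force check at the end confirms the recursion:

```python
import numpy as np
from itertools import product

# Made-up potentials for one input sentence x of length T with L labels:
# emit[t, y] = exp(weighted emission features), trans[y, y'] = exp(weighted
# transition features). In a real CRF these come from the feature weights.
rng = np.random.default_rng(0)
T, L = 4, 3
emit = rng.uniform(0.5, 2.0, size=(T, L))
trans = rng.uniform(0.5, 2.0, size=(L, L))

# Forward recursion: alpha[y] = total score of all label prefixes ending in y.
alpha = emit[0].copy()
for t in range(1, T):
    alpha = (alpha @ trans) * emit[t]  # sum_{y'} alpha[y'] trans[y',y] emit[t,y]
Z_x = alpha.sum()

# Sanity check: brute-force sum over all L^T label sequences.
brute = sum(
    np.prod([emit[t, ys[t]] for t in range(T)])
    * np.prod([trans[ys[t - 1], ys[t]] for t in range(1, T)])
    for ys in product(range(L), repeat=T)
)
print(Z_x, np.isclose(Z_x, brute))  # the two agree
```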

34 Training a CRF Use a gradient-based method to optimize the conditional log-likelihood of the data

35 Training a CRF Use a gradient-based method to optimize the conditional log-likelihood of the data. The derivative for a single parameter θ_k and training example {x_i, y_i}:
∂L(x_i, y_i)/∂θ_k = Σ_A f_k({x_i, y_i}_A) - Σ_A Σ_{y'} p(y' | x_i) f_k({x_i, y'}_A)
i.e., feature counts minus expected feature counts. Easy to compute when exact inference can be performed: we need to count features in the data and compute marginals.
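
A sketch of this gradient for a single training pair, with the expected counts obtained by enumerating every labeling of a short sequence. Real linear-chain CRFs get these expectations from forward-backward marginals instead; the features, words, and labels here are made up.

```python
import math
from itertools import product

# Toy CRF over a word sequence x with binary labels y: one feature per
# (word, label) pair plus a label-bigram feature.
def features(x, y):
    feats = {}
    for t, (word, label) in enumerate(zip(x, y)):
        feats[(word, label)] = feats.get((word, label), 0.0) + 1.0
        if t > 0:
            key = ("bigram", y[t - 1], label)
            feats[key] = feats.get(key, 0.0) + 1.0
    return feats

def gradient(theta, x, y_gold):
    """d log p(y_gold|x) / d theta_k = feature counts - expected counts."""
    labelings = list(product([0, 1], repeat=len(x)))
    scores = [math.exp(sum(theta.get(k, 0.0) * v
                           for k, v in features(x, y).items()))
              for y in labelings]
    Z = sum(scores)
    grad = dict(features(x, y_gold))     # empirical feature counts
    for y, s in zip(labelings, scores):  # minus expected feature counts
        for k, v in features(x, y).items():
            grad[k] = grad.get(k, 0.0) - (s / Z) * v
    return grad

g = gradient({}, x=("the", "dog", "barks"), y_gold=(0, 1, 1))
print(g[("dog", 1)])  # 0.5: gold count 1 minus expected count 0.5 under theta=0
```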

36 CRFs for Classification CRFs are models for predicting probabilities: p(Y | X) or p(y_k | X). In ML we care about decisions (predictions): Is this spam? What is the POS tag of this word? How do we turn probabilities into decisions? For a sequence model: the sequence of states with max probability, via the max-product algorithm (Viterbi decoding).

38 Decision Theory Concerned with converting probabilities into decisions. Use the generalized Bayes rule, or minimum Bayes risk (MBR) rule: select the output y' that minimizes the risk, i.e., the expected loss:
risk(y') = E[l(y', y)] = Σ_{y''} p(y'' | x) l(y', y'')
MBR decoding depends on the loss function. Examples: accuracy (output the label with the max marginal probability); F-measure (can be intractable).
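
A sketch of max-product (Viterbi) decoding for a linear-chain model, using the same made-up emission/transition potential layout as the forward-recursion sketch above:

```python
import numpy as np

# Made-up potentials: emit[t, y] and trans[y, y'], as in the forward sketch.
rng = np.random.default_rng(1)
T, L = 5, 3
emit = rng.uniform(0.5, 2.0, size=(T, L))
trans = rng.uniform(0.5, 2.0, size=(L, L))

# Viterbi in log space: delta[y] = best score of any prefix ending in y.
delta = np.log(emit[0])
back = np.zeros((T, L), dtype=int)  # back[t, y] = best previous label
log_trans = np.log(trans)
for t in range(1, T):
    cand = delta[:, None] + log_trans  # cand[y_prev, y]
    back[t] = cand.argmax(axis=0)
    delta = cand.max(axis=0) + np.log(emit[t])

# Backtrace the highest-scoring label sequence.
best = [int(delta.argmax())]
for t in range(T - 1, 0, -1):
    best.append(int(back[t, best[-1]]))
best.reverse()
print(best)  # most probable label sequence under the model
```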

39 Case Study: Dependency Tree-based Sentiment Classification using CRFs with Hidden Variables [Nakagawa, Inui, Kurohashi, HLT-NAACL 2010]

40 Sentiment Classification Classify documents / sentences / phrases / words as subjective/objective or positive/negative: "The picture quality is unparalleled" → Positive. "Watching this movie is like watching paint dry" → Negative. "The design is really kewl but this thing is expensive" → ?

41 Sentiment classification Bag-of-words approaches: good, kewl, cool, small → positive; bad, dull, outrageous, big → negative. Cannot handle compositionality: "great", "killed terrorist"

42 Sentiment classification Bag-of-words approaches: good, kewl, cool, small → positive; bad, dull, outrageous, big → negative. Cannot handle compositionality: "great at taking your money", "killed terrorist"

43 Sentiment classification Bag-of-words approaches: good, kewl, cool, small → positive; bad, dull, outrageous, big → negative. Cannot handle compositionality: "not great at taking your money", "killed terrorist targets"

44 Sentiment Compositionality Annotations only at the sentence (or phrase) level. Need to model hidden structure, and the model needs to be compositional. This paper: utilize a CRF to model dependencies between words, model hidden structure, and use a dependency parse to indicate sentential structure.

45 Dependency Subtrees and Compositionality

46 CRF Based on the Dependency Tree [Figure: dependency tree of "<root> It prevents cancer and heart disease." with polarity variables S_0 ... S_4 on its nodes]

48 CRF Based on the Dependency Tree

49 Model Specifics Inference: Belief propagation Decision: Argmax over the root marginal Learning: MAP estimation of the parameters

50 Parameter Estimation Gradient computation notation: p_l = label of the entire sentence; w_l = sequence of labels; h_l = sequence of heads; s_l = sequence of polarities

51 Features
s_i = polarity, q_i = prior polarity, r_i = polarity reversal, m_i = number of words (i ranges over all phrases)
u_{i,k} = surface form, b_{i,k} = base form, c_{i,k} = coarse POS tag, f_{i,k} = fine POS tag (k ranges over all words in phrase i)

52 Results

53 Summary Probabilistic graphical models: models of joint probability. CRF: a discriminative model for conditional probabilities. Training through MLE. Decisions using CRFs.
