Log-Linear Models with Structured Outputs
1 Log-Linear Models with Structured Outputs. Natural Language Processing, CS 4120/6120, Spring 2016, Northeastern University. David Smith (some slides from Andrew McCallum)
2 Overview Sequence labeling task (cf. POS tagging) Independent classifiers HMMs (Conditional) Maximum Entropy Markov Models Conditional Random Fields Beyond Sequence Labeling
3 Sequence Labeling Inputs: x = (x_1, ..., x_n). Labels: y = (y_1, ..., y_n). Typical goal: given x, predict y. Example sequence labeling tasks: part-of-speech tagging; named-entity recognition (NER): label people, places, organizations
4 NER Example:
5 First Solution: Maximum Entropy Classifier Conditional model p(y | x). Do not waste effort modeling p(x), since x is given at test time anyway. Allows more complicated input features, since we do not need to model dependencies between them. Feature functions f(x, y): f_1(x, y) = 1{word is "Boston" & y = location}; f_2(x, y) = 1{first letter capitalized & y = name}; f_3(x, y) = 1{x is an HTML link & y = location}
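The indicator features above can be written directly as functions. A minimal sketch in Python; the dict-based token representation (a "word" key) is an assumption of this example, not the slides':

```python
# Sketch of binary MaxEnt feature functions f_k(x, y).
# The token representation (dict with a "word" key) is invented here.
def f1(x, y):
    # fires when the word is "Boston" and the label is location
    return 1 if x["word"] == "Boston" and y == "location" else 0

def f2(x, y):
    # fires when the first letter is capitalized and the label is name
    return 1 if x["word"][:1].isupper() and y == "name" else 0

x = {"word": "Boston"}
print(f1(x, "location"), f2(x, "name"))  # 1 1
print(f1(x, "name"), f2(x, "location"))  # 0 0
```

Note that features fire on (input, label) pairs jointly, which is what lets the model weight the same input evidence differently per label.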
6 First Solution: MaxEnt Classifier How should we choose a classifier? Principle of maximum entropy We want a classifier that: Matches feature constraints from training data. Predictions maximize entropy. There is a unique, exponential family distribution that meets these criteria.
7 First Solution: MaxEnt Classifier Problem with using a maximum entropy classifier for sequence labeling: It makes decisions at each position independently!
8 Second Solution: HMM P(y, x) = ∏_t P(y_t | y_{t−1}) P(x_t | y_t). Defines a generative process. Can be viewed as a weighted finite state machine.
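As a quick check of that factorization, here is a toy HMM in Python; all transition and emission probabilities are made-up numbers, not from the lecture:

```python
# P(y, x) = prod_t P(y_t | y_{t-1}) * P(x_t | y_t) on a toy tag set.
# START is a dummy initial state; all probabilities are invented.
trans = {("START", "N"): 0.5, ("START", "V"): 0.5,
         ("N", "V"): 0.6, ("N", "N"): 0.4,
         ("V", "N"): 0.7, ("V", "V"): 0.3}
emit = {("N", "dog"): 0.3, ("N", "barks"): 0.1,
        ("V", "dog"): 0.05, ("V", "barks"): 0.2}

def joint(ys, xs):
    p, prev = 1.0, "START"
    for y, x in zip(ys, xs):
        p *= trans[(prev, y)] * emit[(y, x)]
        prev = y
    return p

# 0.5 * 0.3 * 0.6 * 0.2 ≈ 0.018
print(joint(["N", "V"], ["dog", "barks"]))
```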
9 Second Solution: HMM How can we represent multiple features in an HMM? Treat them as conditionally independent given the class label? The example features we talked about are not independent. Try to model a more complex generative process of the input features? We may lose tractability (i.e., lose dynamic programming for exact inference).
10 Second Solution: HMM Let's use a conditional model instead.
11 Third Solution: MEMM Use a series of maximum entropy classifiers that know the previous label. Define a Viterbi algorithm for inference. P(y | x) = ∏_t P_{y_{t−1}}(y_t | x)
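The Viterbi recursion mentioned above can be sketched as follows. The local distributions P_{y_{t−1}}(y_t | x) here are a fixed made-up table that ignores x, purely to show the dynamic program, not a trained MEMM:

```python
# Viterbi for a locally normalized model:
# delta[t][y] = max_{y'} delta[t-1][y'] * P_{y'}(y | x).
# "*" is a dummy start label; the probability table is invented.
labels = ["A", "B"]
P = {("*", "A"): 0.6, ("*", "B"): 0.4,
     ("A", "A"): 0.7, ("A", "B"): 0.3,
     ("B", "A"): 0.2, ("B", "B"): 0.8}

def viterbi(n):
    delta = {y: P[("*", y)] for y in labels}
    back = []
    for _ in range(n - 1):
        new, bp = {}, {}
        for y in labels:
            best = max(labels, key=lambda yp: delta[yp] * P[(yp, y)])
            new[y], bp[y] = delta[best] * P[(best, y)], best
        delta, back = new, back + [bp]
    y = max(labels, key=lambda yy: delta[yy])
    path = [y]
    for bp in reversed(back):   # follow backpointers
        y = bp[y]
        path.append(y)
    return path[::-1]

print(viterbi(3))  # ['A', 'A', 'A']
```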
12 Third Solution: MEMM Combines the advantages of maximum entropy and HMM! But there is a problem
13 Problem with MEMMs: Label Bias In some state space configurations, MEMMs essentially completely ignore the inputs. Example (ON BOARD): [FSM diagram: states 0–5, with arcs r, i, b along one path spelling "rib" and r, o, b along another spelling "rob"]. This is not a problem for HMMs, because the input sequence is generated by the model.
14 Fourth Solution: Conditional Random Field Conditionally-trained, undirected graphical model. For a standard linear-chain structure: P(y | x) = (1/Z) ∏_t Ψ(y_t, y_{t−1}, x), where Ψ(y_t, y_{t−1}, x) = exp(Σ_k λ_k f_k(y_t, y_{t−1}, x))
15 Fourth Solution: CRF Have the advantages of MEMMs, but avoid the label bias problem: CRFs are globally normalized, whereas MEMMs are locally normalized. Widely used and applied. CRFs give state-of-the-art results in many domains.
16 Fourth Solution: CRF Have the advantages of MEMMs, but avoid the label bias problem: CRFs are globally normalized, whereas MEMMs are locally normalized. Widely used and applied. CRFs give state-of-the-art results in many domains. Remember, Z is the normalization constant. How do we compute it?
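For a linear chain, the answer is the forward algorithm: Z sums over all label sequences in polynomial time rather than exponential. A sketch with invented potentials (standing in for exp(Σ_k λ_k f_k)), checked against brute-force enumeration:

```python
import itertools

labels = ["B", "I", "O"]
# Arbitrary positive potentials psi(y_prev, y); "*" is a dummy start.
psi = {("*", "B"): 2.0, ("*", "I"): 0.5, ("*", "O"): 1.0,
       ("B", "B"): 1.0, ("B", "I"): 3.0, ("B", "O"): 1.0,
       ("I", "B"): 1.0, ("I", "I"): 2.0, ("I", "O"): 1.0,
       ("O", "B"): 2.0, ("O", "I"): 0.1, ("O", "O"): 1.5}
n = 4  # sequence length

def brute_force_Z():
    # sum the unnormalized score of every one of 3^n label sequences
    total = 0.0
    for ys in itertools.product(labels, repeat=n):
        score, prev = 1.0, "*"
        for y in ys:
            score *= psi[(prev, y)]
            prev = y
        total += score
    return total

def forward_Z():
    # alpha[y] = total score of all length-t prefixes ending in y
    alpha = {y: psi[("*", y)] for y in labels}
    for _ in range(n - 1):
        alpha = {y: sum(alpha[yp] * psi[(yp, y)] for yp in labels)
                 for y in labels}
    return sum(alpha.values())

print(abs(brute_force_Z() - forward_Z()) < 1e-9)  # True
```

The forward recursion does O(n·|labels|²) work instead of O(|labels|^n).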
17 CRF Applications Part-of-speech tagging Named entity recognition Gene prediction Chinese word segmentation Morphological disambiguation Citation parsing Etc., etc. Document layout (e.g. table) classification
18 NER as Sequence Tagging The Phoenicians came from the Red Sea 17
19 NER as Sequence Tagging O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17
20 NER as Sequence Tagging Capitalized word O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17
21 NER as Sequence Tagging Capitalized word Ends in s O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17
22 NER as Sequence Tagging Capitalized word Ends in s Ends in ans O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17
23 NER as Sequence Tagging Capitalized word Ends in s Ends in ans Previous word the O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17
24 NER as Sequence Tagging Capitalized word Ends in s Ends in ans Previous word the Phoenicians in gazetteer O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17
25 NER as Sequence Tagging O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18
26 NER as Sequence Tagging Not capitalized O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18
27 NER as Sequence Tagging Not capitalized Tagged as VB O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18
28 NER as Sequence Tagging B-E to right Not capitalized Tagged as VB O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18
29 NER as Sequence Tagging O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19
30 NER as Sequence Tagging Word sea O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19
31 NER as Sequence Tagging Word sea preceded by the ADJ Word sea O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19
32 NER as Sequence Tagging Word sea preceded by the ADJ Hard constraint: I-L must follow B-L or I-L Word sea O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19
33 Overview What computations do we need? Smoothing log-linear models MEMMs vs. CRFs again Action-based parsing and dependency parsing
34 Recipe for Conditional Training of p(y | x) 1. Gather constraints/features from training data. 2. Initialize all parameters to zero. 3. Classify training data with current parameters; calculate expectations. 4. Gradient is in the direction of observed minus expected feature counts. 5. Take a step in the direction of the gradient. 6. Repeat from 3 until convergence.
35 Recipe for Conditional Training of p(y | x) 1. Gather constraints/features from training data. 2. Initialize all parameters to zero. 3. Classify training data with current parameters; calculate expectations. 4. Gradient is in the direction of observed minus expected feature counts. 5. Take a step in the direction of the gradient. 6. Repeat from 3 until convergence. Where have we seen expected counts before?
36 Recipe for Conditional Training of p(y | x) 1. Gather constraints/features from training data. 2. Initialize all parameters to zero. 3. Classify training data with current parameters; calculate expectations. 4. Gradient is in the direction of observed minus expected feature counts. 5. Take a step in the direction of the gradient. 6. Repeat from 3 until convergence. EM! Where have we seen expected counts before?
37 Gradient-Based Training λ := λ + rate * Gradient(F) After all training examples? (batch) After every example? (on-line) Use second derivative for faster learning? A big field: numerical optimization
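The recipe above can be sketched as one batch gradient step for a toy conditional MaxEnt model p(y | x) ∝ exp(Σ_k λ_k f_k(x, y)); the two features, the labels, and the two training examples are all invented for illustration:

```python
import math

labels = ["LOC", "PER", "O"]

def features(x, y):
    # two invented indicator features over (input word, label)
    return [1.0 if (x == "Boston" and y == "LOC") else 0.0,
            1.0 if (x[0].isupper() and y == "PER") else 0.0]

lam = [0.0, 0.0]                       # step 2: initialize to zero
data = [("Boston", "LOC"), ("Smith", "PER")]

def p(y, x):
    # p(y | x) under the current weights lam
    scores = {yy: math.exp(sum(l * f for l, f in zip(lam, features(x, yy))))
              for yy in labels}
    return scores[y] / sum(scores.values())

rate = 0.5
grad = [0.0, 0.0]
for x, y in data:
    obs = features(x, y)               # observed feature counts
    for k in range(len(lam)):
        # step 3: expected count of feature k under the model
        exp_k = sum(p(yy, x) * features(x, yy)[k] for yy in labels)
        grad[k] += obs[k] - exp_k      # step 4: observed - expected
lam = [l + rate * g for l, g in zip(lam, grad)]  # step 5
print(lam)  # both weights move in the positive direction
```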
38 Parsing as Structured Prediction
39 Shift-reduce parsing Stack Input remaining Action () Book that flight shift (Book) that flight reduce, Verb book, (Choice #1 of 2) (Verb) that flight shift (Verb that) flight reduce, Det that (Verb Det) flight shift (Verb Det flight) reduce, Noun flight (Verb Det Noun) reduce, NOM Noun (Verb Det NOM) reduce, NP Det NOM (Verb NP) reduce, VP Verb NP (Verb) reduce, S V (S) SUCCESS! Ambiguity may lead to the need for backtracking.
41 Shift-reduce parsing Stack Input remaining Action () Book that flight shift (Book) that flight reduce, Verb book, (Choice #1 of 2) (Verb) that flight shift (Verb that) flight reduce, Det that (Verb Det) flight shift (Verb Det flight) reduce, Noun flight (Verb Det Noun) reduce, NOM Noun (Verb Det NOM) reduce, NP Det NOM (Verb NP) reduce, VP Verb NP (Verb) reduce, S V (S) SUCCESS! Ambiguity may lead to the need for backtracking. Train log-linear model of p(action context)
42 Shift-reduce parsing Stack Input remaining Action () Book that flight shift (Book) that flight reduce, Verb book, (Choice #1 of 2) (Verb) that flight shift (Verb that) flight reduce, Det that (Verb Det) flight shift (Verb Det flight) reduce, Noun flight (Verb Det Noun) reduce, NOM Noun (Verb Det NOM) reduce, NP Det NOM (Verb NP) reduce, VP Verb NP (Verb) reduce, S V (S) SUCCESS! Ambiguity may lead to the need for backtracking. Compare to an MEMM Train log-linear model of p(action context)
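The trace above can be sketched as a greedy shift-reduce loop. The tiny grammar mirrors the slide's rules (the S → VP rule is assumed from the trace), and the greedy first-match reduce stands in for the learned p(action | context); there is no backtracking, so it only handles unambiguous cases like this one:

```python
# Minimal shift-reduce sketch; grammar rules are (lhs, rhs) pairs.
grammar = [("Verb", ["Book"]), ("Det", ["that"]), ("Noun", ["flight"]),
           ("NOM", ["Noun"]), ("NP", ["Det", "NOM"]),
           ("VP", ["Verb", "NP"]), ("S", ["VP"])]

def parse(words):
    stack, buf = [], list(words)
    while True:
        # try to reduce: first rule whose rhs matches the top of the stack
        for lhs, rhs in grammar:
            if len(stack) >= len(rhs) and stack[-len(rhs):] == rhs:
                stack[-len(rhs):] = [lhs]
                break
        else:
            if not buf:
                return stack          # no reduce possible, input consumed
            stack.append(buf.pop(0))  # otherwise: shift

print(parse(["Book", "that", "flight"]))  # ['S']
```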
43 Structured Log-Linear Models 25
44 Structured Log-Linear Models Linear model for scoring structures: score(out, in) = θ · features(out, in)
45 Structured Log-Linear Models Linear model for scoring structures: score(out, in) = θ · features(out, in). Get a probability distribution by normalizing: p(out | in) = (1/Z) e^score(out, in), where Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in)
46 Structured Log-Linear Models Linear model for scoring structures: score(out, in) = θ · features(out, in). Get a probability distribution by normalizing: p(out | in) = (1/Z) e^score(out, in), where Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in). Viz. logistic regression, Markov random fields, undirected graphical models.
47 Structured Log-Linear Models Linear model for scoring structures: score(out, in) = θ · features(out, in). Get a probability distribution by normalizing: p(out | in) = (1/Z) e^score(out, in), where Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in). Viz. logistic regression, Markov random fields, undirected graphical models. Computing Z is usually the bottleneck in NLP.
48 Structured Log-Linear Models Linear model for scoring structures: score(out, in) = θ · features(out, in). Get a probability distribution by normalizing: p(out | in) = (1/Z) e^score(out, in), where Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in). Viz. logistic regression, Markov random fields, undirected graphical models. Computing Z is usually the bottleneck in NLP. Inference: sampling, variational methods, dynamic programming, local search, ...
49 Structured Log-Linear Models Linear model for scoring structures: score(out, in) = θ · features(out, in). Get a probability distribution by normalizing: p(out | in) = (1/Z) e^score(out, in), where Z = Σ_{out′ ∈ GEN(in)} e^score(out′, in). Viz. logistic regression, Markov random fields, undirected graphical models. Computing Z is usually the bottleneck in NLP. Inference: sampling, variational methods, dynamic programming, local search, ... Training: maximum likelihood, minimum risk, etc.
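A minimal numeric sketch of that normalization, with an invented two-candidate GEN(in) and made-up feature vectors and weights:

```python
import math

# score(out, in) = theta · features(out, in), normalized over GEN(in).
theta = [1.0, -0.5]

def score(feats):
    return sum(t * f for t, f in zip(theta, feats))

# invented candidate set: output name -> feature vector
GEN = {"out_a": [2.0, 1.0], "out_b": [0.0, 2.0]}

Z = sum(math.exp(score(f)) for f in GEN.values())
p = {o: math.exp(score(f)) / Z for o, f in GEN.items()}

print(abs(sum(p.values()) - 1.0) < 1e-12)  # True: a proper distribution
```

Here GEN(in) is small enough to enumerate; the slide's point is that for real structured outputs it is exponentially large, which is why computing Z is the bottleneck.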
50 Structured Log-Linear Models With latent variables: several layers of linguistic structure, unknown correspondences. Naturally handled by probabilistic framework. Several inference setups, for example: p(out_1 | in) = Σ_{out_2, alignment} p(out_1, out_2, alignment | in)
51 Structured Log-Linear Models With latent variables: several layers of linguistic structure, unknown correspondences. Naturally handled by probabilistic framework. Several inference setups, for example: p(out_1 | in) = Σ_{out_2, alignment} p(out_1, out_2, alignment | in) Another computational problem
52 Edge-Factored Parsers No global features of a parse (McDonald et al. 2005). Each feature is attached to some edge. MST or CKY-like DP for fast O(n²) or O(n³) parsing. Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen
53 Edge-Factored Parsers Is this a good edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 28
54 Edge-Factored Parsers Is this a good edge? yes, lots of positive features... Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 28
55 Edge-Factored Parsers Is this a good edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 29
56 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen
57 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
58 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
59 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
60 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") A → N Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
61 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") A → N Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
62 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") A → N A → N preceding a conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
63 Edge-Factored Parsers How about this competing edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 33
64 Edge-Factored Parsers How about this competing edge? not as good, lots of red... Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 33
65 Edge-Factored Parsers How about this competing edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 34
66 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
67 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen
68 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
69 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
70 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
71 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) A_singular → N_plural Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
72 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) A_singular → N_plural Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
73 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) A_singular → N_plural A → N where N follows a conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
74 Edge-Factored Parsers Which edge is better? bright day or bright clocks? Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 38
75 Edge-Factored Parsers Which edge is better? Score of an edge e = θ · features(e). Standard algos ⇒ valid parse with max total score. Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
76 Edge-Factored Parsers Which edge is better? Score of an edge e = θ · features(e) (θ: our current weight vector). Standard algos ⇒ valid parse with max total score. Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen
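Edge scoring is just a sparse dot product between the weight vector and the edge's active features. A sketch in Python; the feature names and weights below are invented for illustration (including a deliberately low weight for the "undertrained" bright-clocks pairing):

```python
# score(e) = theta · features(e), with theta stored sparsely as a dict.
theta = {"head=jasny,child=den": 2.0,
         "head=jasny,child=hodiny": -1.0,
         "A->N": 0.5}

def edge_score(feats):
    # features absent from theta (undertrained/unseen) contribute 0
    return sum(theta.get(f, 0.0) for f in feats)

good = edge_score(["head=jasny,child=den", "A->N"])
bad = edge_score(["head=jasny,child=hodiny", "A->N"])
print(good, bad)  # 2.5 -0.5
```

Backed-off features like the POS pattern "A->N" fire on both edges, which is exactly how the model still scores word pairs it has rarely seen.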
77 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. find preferred tags
78 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. find preferred tags Observed input sentence (shaded)
79 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Possible tagging (i.e., assignment to remaining variables): v v v. find preferred tags Observed input sentence (shaded)
80 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Possible tagging (i.e., assignment to remaining variables). Another possible tagging. find preferred tags Observed input sentence (shaded)
81 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Possible tagging (i.e., assignment to remaining variables). Another possible tagging: v a n. find preferred tags Observed input sentence (shaded)
82 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Binary factor that measures compatibility of 2 adjacent tags (factor table over tags v, n, a). find preferred tags
83 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Binary factor that measures compatibility of 2 adjacent tags (factor table over tags v, n, a). Model reuses same parameters at this position. find preferred tags
84 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. (unary factor: v 0.2, n 0.2, a 0) find preferred tags
85 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Unary factor evaluates this tag (v 0.2, n 0.2, a 0). find preferred tags
86 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Unary factor evaluates this tag. Its values depend on corresponding word (v 0.2, n 0.2, a 0). find preferred tags
87 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Unary factor evaluates this tag. Its values depend on corresponding word (v 0.2, n 0.2, a 0): can't be adj. find preferred tags
88 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Unary factor evaluates this tag. Its values depend on corresponding word (v 0.2, n 0.2, a 0). find preferred tags
89 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Unary factor evaluates this tag. Its values depend on corresponding word (could be made to depend on entire observed sentence) (v 0.2, n 0.2, a 0). find preferred tags
90 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. (unary factors: v 0.3, n 0.02, a 0 | v 0.3, n 0, a 0.1 | v 0.2, n 0.2, a 0) find preferred tags
91 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. Unary factor evaluates this tag. Different unary factor at each position. (unary factors: v 0.3, n 0.02, a 0 | v 0.3, n 0, a 0.1 | v 0.2, n 0.2, a 0) find preferred tags
92 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. (binary factor tables over tags v, n, a; unary factors: v 0.3, n 0.02, a 0 | v 0.3, n 0, a 0.1 | v 0.2, n 0.2, a 0) find preferred tags
93 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. p(v a n) is proportional to the product of all factors' values on v a n. (binary factor tables over tags v, n, a; unary factors: v 0.3, n 0.02, a 0 | v 0.3, n 0, a 0.1 | v 0.2, n 0.2, a 0) find preferred tags
94 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. p(v a n) is proportional to the product of all factors' values on v a n. (binary factor tables over tags v, n, a; unary factors: v 0.3, n 0.02, a 0 | v 0.3, n 0, a 0.1 | v 0.2, n 0.2, a 0) find preferred tags
95 Local factors in a graphical model First, a familiar example: Conditional Random Field (CRF) for POS tagging. p(v a n) is proportional to the product of all factors' values on v a n = 1 × 3 × 0.3 × 0.1 × 0.2. (binary factor tables over tags v, n, a; unary factors: v 0.3, n 0.02, a 0 | v 0.3, n 0, a 0.1 | v 0.2, n 0.2, a 0) find preferred tags
96 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. find preferred links
97 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. find preferred links
98 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. find preferred links
99 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. Possible parse... find preferred links
100 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. Possible parse... and variable assignment: f t f f t f. find preferred links
101 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. Another parse... find preferred links
102 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. Another parse... and variable assignment: f f f t f t. find preferred links
103 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. An illegal parse... with a cycle! (t f f t f t) find preferred links
104 Graphical Models for Parsing First, a labeling example: v a n, CRF for POS tagging. Now let's do dependency parsing! O(n²) boolean variables for the possible links. An illegal parse... with a cycle and multiple parents! (f t f t t t) find preferred links
105 Local Factors for Parsing What factors determine parse probability? find preferred links
106 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation (t 2 / f 1). find preferred links
107 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation (t 2 / f 1, t 1 / f 8, t 1 / f 3, t 1 / f 2, t 1 / f 6, t 1 / f 2). find preferred links
108 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. But what if the best assignment isn't a tree? (t 2 / f 1, t 1 / f 8, t 1 / f 3, t 1 / f 2, t 1 / f 6, t 1 / f 2) find preferred links
109 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. find preferred links
110 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. Global TREE factor to require links to form a legal tree (ffffff 0, ffffft 0, fffftf 0, fftfft 1, tttttt 0). find preferred links
111 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. Global TREE factor to require links to form a legal tree: a hard constraint, potential is either 0 or 1 (ffffff 0, ffffft 0, fffftf 0, fftfft 1, tttttt 0). find preferred links
112 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. Global TREE factor to require links to form a legal tree: a hard constraint, potential is either 0 or 1 (ffffff 0, ffffft 0, fffftf 0, fftfft 1, tttttt 0). Assignment t f f f f t: we're legal! find preferred links
113 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation; optionally require the tree to be projective (no crossing links). Global TREE factor to require links to form a legal tree: a hard constraint, potential is either 0 or 1 (64 entries, 0/1: ffffff 0, ffffft 0, fffftf 0, fftfft 1, tttttt 0). So far, this is equivalent to edge-factored parsing. Assignment f f f t t f: we're legal! find preferred links
114 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation; optionally require the tree to be projective (no crossing links). Global TREE factor to require links to form a legal tree: a hard constraint, potential is either 0 or 1 (64 entries, 0/1: ffffff 0, ffffft 0, fffftf 0, fftfft 1, tttttt 0). So far, this is equivalent to edge-factored parsing. Assignment f f f t t f: we're legal! find preferred links Note: traditional parsers don't loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we!
115 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. Global TREE factor to require links to form a legal tree: a hard constraint, potential is either 0 or 1. Second-order effects: factors on 2 variables. Grandparent-parent-child chains. find preferred links
116 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. Global TREE factor to require links to form a legal tree: a hard constraint, potential is either 0 or 1. Second-order effects: factors on 2 variables. Grandparent-parent-child chains (pairwise factor table: f,f → 1; f,t → 1; t,f → 1; t,t → 3). find preferred links
117 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation. Global TREE factor to require links to form a legal tree: a hard constraint, potential is either 0 or 1. Second-order effects: factors on 2 variables. Grandparent-parent-child chains. No crossing links. Siblings. Hidden morphological tags. Word senses and subcategorization frames. find preferred links
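The hard TREE constraint can be sketched directly: a potential that is 1 iff the links marked "true" form a legal dependency tree (every word has exactly one head, no cycles, everything reachable from the root). The encoding of links as (head, child) pairs is this example's choice, not the slides':

```python
# TREE factor: 1 iff the "true" links form a tree over words 1..n
# rooted at 0; otherwise 0.
def tree_factor(n, links):
    heads = {}
    for h, c in links:
        if c in heads:
            return 0          # multiple parents for word c
        heads[c] = h
    if set(heads) != set(range(1, n + 1)):
        return 0              # some word has no head
    for c in range(1, n + 1):
        seen, cur = set(), c  # follow head pointers up to the root
        while cur != 0:
            if cur in seen:
                return 0      # cycle
            seen.add(cur)
            cur = heads[cur]
    return 1

print(tree_factor(3, {(0, 2), (2, 1), (2, 3)}))  # 1 (legal tree)
print(tree_factor(3, {(0, 2), (1, 3), (3, 1)}))  # 0 (cycle)
```

As the slide notes, real parsers never enumerate the 2^(n²) assignments this factor scores; combinatorial algorithms (MST, CKY-style DP) enforce the same constraint implicitly.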
118 Great Ideas in ML: Message Passing adapted from MacKay (2003) textbook 58
119 Great Ideas in ML: Message Passing Count the soldiers adapted from MacKay (2003) textbook 58
120 Great Ideas in ML: Message Passing Count the soldiers 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58
121 Great Ideas in ML: Message Passing Count the soldiers there's 1 of me 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook
122 Great Ideas in ML: Message Passing Count the soldiers there's 1 of me 5 behind you 4 behind you 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook
123 Great Ideas in ML: Message Passing Count the soldiers 1 before you 2 before you 5 behind you 4 behind you 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58
124 Great Ideas in ML: Message Passing Count the soldiers 1 before you 2 before you 3 before you 4 before you 5 before you 5 behind you 4 behind you 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58
125 Great Ideas in ML: Message Passing Count the soldiers 2 before you there's 1 of me only see my incoming messages 3 behind you adapted from MacKay (2003) textbook
126 Great Ideas in ML: Message Passing Count the soldiers 2 before you there's 1 of me Belief: Must be 2 + 1 + 3 = 6 of us only see my incoming messages 3 behind you adapted from MacKay (2003) textbook
127 Great Ideas in ML: Message Passing Count the soldiers 2 before you there's 1 of me Belief: Must be 2 + 1 + 3 = 6 of us only see my incoming messages 3 behind you adapted from MacKay (2003) textbook
128 Great Ideas in ML: Message Passing Count the soldiers 1 before you there's 1 of me Belief: Must be 1 + 1 + 4 = 6 of us only see my incoming messages 4 behind you adapted from MacKay (2003) textbook
129 Great Ideas in ML: Message Passing Count the soldiers 1 before you there's 1 of me Belief: Must be 1 + 1 + 4 = 6 of us only see my incoming messages 4 behind you adapted from MacKay (2003) textbook
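The chain-counting scheme above can be sketched directly: a forward pass carries "k behind you", a backward pass carries "k before you", and each soldier's belief combines the two incoming messages plus themselves:

```python
# Message passing on a chain of n soldiers.
n = 6
behind = [0] * n   # message arriving from the rear
before = [0] * n   # message arriving from the front
for i in range(1, n):
    behind[i] = behind[i - 1] + 1   # each soldier adds "there's 1 of me"
for i in range(n - 2, -1, -1):
    before[i] = before[i + 1] + 1
# belief = (before you) + (there's 1 of me) + (behind you)
beliefs = [before[i] + 1 + behind[i] for i in range(n)]
print(beliefs)  # [6, 6, 6, 6, 6, 6]
```

Every soldier reaches the same total from purely local messages, which is the point of the example.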
130 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 61
131 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here adapted from MacKay (2003) textbook 61
132 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 1 of me 11 here (= 7+3+1) adapted from MacKay (2003) textbook 61
133 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 62
134 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 3 here adapted from MacKay (2003) textbook 62
135 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here (= 3+3+1) 3 here adapted from MacKay (2003) textbook 62
136 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 63
137 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 7 here 3 here adapted from MacKay (2003) textbook 63
138 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 11 here (= 7+3+1) 7 here 3 here adapted from MacKay (2003) textbook 63
139 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 64
140 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree Belief: Must be 14 of us adapted from MacKay (2003) textbook 64
141 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us adapted from MacKay (2003) textbook 64
142 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us adapted from MacKay (2003) textbook 65
143 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us wouldn't work correctly with a loopy (cyclic) graph adapted from MacKay (2003) textbook 65
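The soldier-counting scheme can be sketched in a few lines. This is a minimal illustration, not the slides' code: the 14-node tree below is hypothetical (the slides do not give the tree explicitly). Each message reports the number of soldiers on the sender's side of an edge, and every node's belief agrees on the total.

```python
# Sketch of tree message passing for counting soldiers.
# The 14-node tree is a made-up example, not the one from the slides.

def count_message(tree, sender, receiver):
    """Number of soldiers on sender's side of the sender->receiver edge."""
    return 1 + sum(count_message(tree, nbr, sender)
                   for nbr in tree[sender] if nbr != receiver)

def belief(tree, node):
    """Each soldier's belief: itself plus the reports from all branches."""
    return 1 + sum(count_message(tree, nbr, node) for nbr in tree[node])

# A hypothetical 14-node tree (adjacency lists).
tree = {0: [1, 2, 3], 1: [0, 4, 5], 2: [0, 6], 3: [0, 7, 8],
        4: [1], 5: [1, 9, 10], 6: [2, 11], 7: [3], 8: [3, 12, 13],
        9: [5], 10: [5], 11: [6], 12: [8], 13: [8]}

beliefs = [belief(tree, v) for v in tree]
```

Every node computes the same total from purely local reports, which is exactly the point of the slides: no global coordination is needed on a tree.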
144 Great ideas in ML: Forward-Backward In the chain CRF, message passing = forward-backward = the sum-product algorithm. Running example: tagging "find preferred tags" with the tag set {v, n, a}. 66
145-153 Great ideas in ML: Forward-Backward At the middle word "preferred": the incoming α message is (v=7, n=2, a=1); passed through the transition factor it becomes (v=3, n=1, a=6). The word's unary factor is (v=0.3, n=0, a=0.1). The incoming β message is (v=2, n=1, a=7); passed through the transition factor it becomes (v=3, n=6, a=1). The belief at the word is the pointwise product α × unary × β = (v=1.8, n=0, a=4.2). 66
154 Sum-Product Equations
Message from variable v to factor f:
m_{v \to f}(x_v) = \prod_{f' \in N(v) \setminus \{f\}} m_{f' \to v}(x_v)
Message from factor f to variable v:
m_{f \to v}(x_v) = \sum_{x_{N(f) \setminus \{v\}}} \Big[ f(x_{N(f)}) \prod_{v' \in N(f) \setminus \{v\}} m_{v' \to f}(x_{v'}) \Big]
67
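On a chain, these equations reduce to forward-backward. Below is a minimal sketch with made-up potentials (not the numbers on the slides) that computes per-position marginals from α and β messages and checks them against brute-force enumeration over all tag sequences.

```python
import itertools
import random

# Sketch of forward-backward as sum-product on a tag chain.
# Tag set {v, n, a} as in the slides; potentials are made up for illustration.
tags = range(3)          # 0=v, 1=n, 2=a
n_words = 4
rng = random.Random(0)
unary = [[rng.uniform(0.1, 1.0) for _ in tags] for _ in range(n_words)]
trans = [[rng.uniform(0.1, 1.0) for _ in tags] for _ in tags]

# Forward (alpha) messages, left to right; unary factors folded in.
alpha = [unary[0][:]]
for i in range(1, n_words):
    alpha.append([unary[i][t] * sum(alpha[i - 1][s] * trans[s][t] for s in tags)
                  for t in tags])

# Backward (beta) messages, right to left.
beta = [[1.0 for _ in tags] for _ in range(n_words)]
for i in range(n_words - 2, -1, -1):
    beta[i] = [sum(trans[s][t] * unary[i + 1][t] * beta[i + 1][t] for t in tags)
               for s in tags]

Z = sum(alpha[-1])
# Belief at each position = alpha * beta, normalized by Z.
marginals = [[alpha[i][t] * beta[i][t] / Z for t in tags] for i in range(n_words)]

# Brute-force check: enumerate every tag sequence.
def seq_score(y):
    s = unary[0][y[0]]
    for i in range(1, n_words):
        s *= trans[y[i - 1]][y[i]] * unary[i][y[i]]
    return s

Z_bf = sum(seq_score(y) for y in itertools.product(tags, repeat=n_words))
```

The dynamic program costs O(n · |tags|²) versus O(|tags|ⁿ) for the brute force, which is the whole point of message passing on chains.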
155-156 Great ideas in ML: Forward-Backward Chain messages at "preferred" in "find preferred tags": α message (v=3, n=1, a=6), β message (v=2, n=1, a=7), unary factor (v=0.3, n=0, a=0.1). 68
157 Great ideas in ML: Forward-Backward Extend the CRF to a skip chain to capture a non-local factor. 68
158 Great ideas in ML: Forward-Backward The skip-chain factor means more influences on the belief: with the extra incoming message (v=3, n=1, a=6), the belief becomes (v=5.4, n=0, a=25.2). 68
159-161 Great ideas in ML: Forward-Backward But the graph becomes loopy, so the messages arriving around a loop are not independent. Pretend they are anyway: this is Loopy Belief Propagation. 69
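Loopy BP is just the same sum-product updates iterated to (hoped-for) convergence. Here is a minimal sketch on a three-variable binary cycle with made-up potentials, not taken from the slides; note that the resulting beliefs are only approximate marginals, unlike on a tree.

```python
import itertools

# Sketch of loopy belief propagation on the cycle 0-1-2-0 with binary
# variables and one (made-up) attractive potential on every edge.
pair = [[2.0, 1.0], [1.0, 3.0]]
nbrs = {0: [1, 2], 1: [0, 2], 2: [0, 1]}

# msgs[(i, j)][x_j]: message from variable i to variable j.
msgs = {(i, j): [0.5, 0.5] for i in nbrs for j in nbrs[i]}

def update(msgs):
    """One synchronous round of sum-product message updates."""
    new = {}
    for (i, j) in msgs:
        m = []
        for xj in (0, 1):
            total = 0.0
            for xi in (0, 1):
                incoming = 1.0
                for k in nbrs[i]:
                    if k != j:                 # every neighbor of i except j
                        incoming *= msgs[(k, i)][xi]
                total += pair[xi][xj] * incoming
            m.append(total)
        s = m[0] + m[1]
        new[(i, j)] = [m[0] / s, m[1] / s]     # normalize for stability
    return new

for _ in range(200):
    msgs = update(msgs)

def bp_belief(i):
    """Approximate marginal of variable i from its incoming messages."""
    b = [1.0, 1.0]
    for x in (0, 1):
        for k in nbrs[i]:
            b[x] *= msgs[(k, i)][x]
    s = b[0] + b[1]
    return [b[0] / s, b[1] / s]

# Exact marginal by brute force, for comparison (BP is approximate here).
weights = {}
for cfg in itertools.product((0, 1), repeat=3):
    w = 1.0
    for (i, j) in [(0, 1), (1, 2), (2, 0)]:
        w *= pair[cfg[i]][cfg[j]]
    weights[cfg] = w
Z_exact = sum(weights.values())
exact0 = sum(w for cfg, w in weights.items() if cfg[0] == 1) / Z_exact
```

On this single cycle the updates do converge, but the fixed-point beliefs overcount the loop's influence, which is the price of pretending the loop messages are independent.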
162 Terminological Clarification propagation 70
163 Terminological Clarification belief propagation 70
164-168 Terminological Clarification loopy belief propagation 70
169 Propagating Global Factors Loopy belief propagation is easy for local factors How do combinatorial factors (like TREE) compute the message to the link in question? Does the TREE factor think the link is probably t given the messages it receives from all the other links?? find preferred links 71
171 Propagating Global Factors Loopy belief propagation is easy for local factors How do combinatorial factors (like TREE) compute the message to the link in question? Does the TREE factor think the link is probably t given the messages it receives from all the other links? find preferred links? TREE factor ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 71
173 Propagating Global Factors How does the TREE factor compute the message to the link in question? Does the TREE factor think the link is probably t given the messages it receives from all the other links? Old-school parsing to the rescue: this is the outside probability of the link in an edge-factored parser! The TREE factor computes all outgoing messages at once (given all incoming messages). Projective case: total O(n^3) time by inside-outside. Non-projective case: total O(n^3) time by inverting the Kirchhoff matrix. find preferred links TREE factor ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 72
174-176 Graph Theory to the Rescue! Tutte's Matrix-Tree Theorem (1948): the determinant of the Kirchhoff (aka Laplacian) adjacency matrix of a directed graph G, with row and column r struck out, equals the sum of the scores of all directed spanning trees of G rooted at node r. Exactly the Z we need, and computable in O(n^3) time! 73
177-181 Kirchhoff (Laplacian) Matrix Construction: negate the edge scores, put the sum of each column (the total score into each child) on the diagonal, strike out the root row and column, and take the determinant. Before striking the root row and column, the matrix is

\begin{pmatrix}
0 & -s(1,0) & -s(2,0) & \cdots & -s(n,0) \\
0 & \sum_{j \neq 1} s(1,j) & -s(2,1) & \cdots & -s(n,1) \\
0 & -s(1,2) & \sum_{j \neq 2} s(2,j) & \cdots & -s(n,2) \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & -s(1,n) & -s(2,n) & \cdots & \sum_{j \neq n} s(n,j)
\end{pmatrix}

N.B.: This allows multiple children of the root, but see Koo et al. 74
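The construction can be sketched as follows. This is an assumed implementation, not the slides' code, with made-up edge scores s(head, child): build the Kirchhoff matrix over the children, take its determinant, and check that it equals the brute-force sum over all directed spanning trees rooted at node 0.

```python
import itertools

# Sketch of the Matrix-Tree computation of Z for a tiny graph.
# Node 0 is the root; nodes 1..n are words. Edge scores are made up.
n = 3
s = {(h, c): 1.0 + 0.1 * h + 0.01 * c
     for h in range(n + 1) for c in range(1, n + 1) if h != c}

# Kirchhoff matrix over children 1..n (root row/column already struck):
# diagonal = total score into each child, off-diagonal = negated scores.
K = [[0.0] * n for _ in range(n)]
for c in range(1, n + 1):
    for h in range(n + 1):
        if h == c:
            continue
        K[c - 1][c - 1] += s[(h, c)]
        if h >= 1:
            K[h - 1][c - 1] -= s[(h, c)]

def det(m):
    """Determinant by Gaussian elimination (fine for this small matrix)."""
    m = [row[:] for row in m]
    d = 1.0
    for i in range(len(m)):
        d *= m[i][i]
        for j in range(i + 1, len(m)):
            f = m[j][i] / m[i][i]
            for k in range(i, len(m)):
                m[j][k] -= f * m[i][k]
    return d

Z_mt = det(K)

# Brute force: every assignment of a head to each child that forms a tree.
def is_tree(heads):
    for c in heads:
        seen, h = set(), heads[c]
        while h != 0:
            if h in seen:
                return False        # found a cycle
            seen.add(h)
            h = heads[h]
    return True

Z_bf = 0.0
for choice in itertools.product(range(n + 1), repeat=n):
    heads = {c: choice[c - 1] for c in range(1, n + 1)}
    if any(h == c for c, h in heads.items()):
        continue
    if is_tree(heads):
        w = 1.0
        for c, h in heads.items():
            w *= s[(h, c)]
        Z_bf += w
```

The determinant is O(n^3) while the brute force is exponential, which is exactly the payoff the slides advertise.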
182 Transition-Based Parsing Linear time Online Train a classifier to predict next action Deterministic or beam-search strategies But... generally less accurate
183 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003)
Start state: ([ ], [1, ..., n], {})
Final state: (S, [ ], A)
Shift: (S, i|B, A) ⇒ (S|i, B, A)
Reduce: (S|i, B, A) ⇒ (S, B, A)
Right-Arc: (S|i, j|B, A) ⇒ (S|i|j, B, A ∪ {i → j})
Left-Arc: (S|i, j|B, A) ⇒ (S, j|B, A ∪ {i ← j})
184 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Stack Buffer Arcs [] S [who, did, you, see] B {}
185 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Stack Buffer Arcs [who] S [did, you, see] B {} Shift
186 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Stack Buffer Arcs [] S [did, you, see] B { who ←OBJ– did } Left-arc OBJ
187 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Stack Buffer Arcs [did] S [you, see] B { who ←OBJ– did } Shift
188 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Right-arc SBJ Stack Buffer Arcs [did, you] S [see] B { who ←OBJ– did, did –SBJ→ you }
189 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Reduce Stack Buffer Arcs [did] S [see] B { who ←OBJ– did, did –SBJ→ you }
190 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Right-arc VG Stack Buffer Arcs [did, see] S [] B { who ←OBJ– did, did –SBJ→ you, did –VG→ see }
191-192 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) At each step, choose the action with the best classifier score (100k - 1M features).
193 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Very fast linear-time performance: WSJ section 23 (2k sentences) parsed in 3 s.
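The derivation above can be replayed mechanically. Below is a minimal sketch of the arc-eager transition system run on "who did you see"; the `parse` helper and its oracle action list are illustrative names, not from the slides.

```python
# Sketch of the arc-eager transition system (Shift, Reduce, Right-Arc,
# Left-Arc) driven by a fixed oracle action sequence instead of a classifier.
def parse(words, actions):
    stack, buf, arcs = [], list(range(len(words))), set()
    for act, label in actions:
        if act == "shift":
            stack.append(buf.pop(0))
        elif act == "reduce":
            stack.pop()
        elif act == "right-arc":        # head = stack top, child = buffer front
            arcs.add((stack[-1], label, buf[0]))
            stack.append(buf.pop(0))
        elif act == "left-arc":         # head = buffer front, child = stack top
            arcs.add((buf[0], label, stack.pop()))
    return stack, buf, arcs

words = ["who", "did", "you", "see"]
actions = [("shift", None), ("left-arc", "OBJ"), ("shift", None),
           ("right-arc", "SBJ"), ("reduce", None), ("right-arc", "VG")]
stack, buf, arcs = parse(words, actions)
```

Each word enters the stack once and leaves at most once, which is why the parser runs in linear time; a real parser would replace the oracle list with a classifier choosing the next action.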
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationMIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,
MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run
More informationStatistical Methods for NLP
Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition
More informationMessage Passing and Junction Tree Algorithms. Kayhan Batmanghelich
Message Passing and Junction Tree Algorithms Kayhan Batmanghelich 1 Review 2 Review 3 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 1 of me 11
More informationStructured Output Prediction: Generative Models
Structured Output Prediction: Generative Models CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 17.3, 17.4, 17.5.1 Structured Output Prediction Supervised
More informationPattern Recognition and Machine Learning
Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability
More informationBasic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology
Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are
More informationp L yi z n m x N n xi
y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen
More informationBowl Maximum Entropy #4 By Ejay Weiss. Maxent Models: Maximum Entropy Foundations. By Yanju Chen. A Basic Comprehension with Derivations
Bowl Maximum Entropy #4 By Ejay Weiss Maxent Models: Maximum Entropy Foundations By Yanju Chen A Basic Comprehension with Derivations Outlines Generative vs. Discriminative Feature-Based Models Softmax
More informationStatistical Approaches to Learning and Discovery
Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University
More informationProbabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April
Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference
More informationMachine Learning for NLP
Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers
More informationDesign and Implementation of Speech Recognition Systems
Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from
More informationLecture 5 Neural models for NLP
CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm
More informationNatural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module
More informationCSE 490 U Natural Language Processing Spring 2016
CSE 490 U Natural Language Processing Spring 2016 Feature Rich Models Yejin Choi - University of Washington [Many slides from Dan Klein, Luke Zettlemoyer] Structure in the output variable(s)? What is the
More informationHidden Markov Models Part 1: Introduction
Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that
More information