Log-Linear Models with Structured Outputs

1 Log-Linear Models with Structured Outputs Natural Language Processing CS 4120/6120 Spring 2016 Northeastern University David Smith (some slides from Andrew McCallum)

2 Overview Sequence labeling task (cf. POS tagging) Independent classifiers HMMs (Conditional) Maximum Entropy Markov Models Conditional Random Fields Beyond Sequence Labeling

3 Sequence Labeling Inputs: x = (x_1, ..., x_n) Labels: y = (y_1, ..., y_n) Typical goal: Given x, predict y Example sequence labeling tasks Part-of-speech tagging Named-entity recognition (NER) Label people, places, organizations

4 NER Example:

5 First Solution: Maximum Entropy Classifier Conditional model p(y|x). Do not waste effort modeling p(x), since x is given at test time anyway. Allows more complicated input features, since we do not need to model dependencies between them. Feature functions f(x,y): f_1(x,y) = { word is Boston & y = location } f_2(x,y) = { first letter capitalized & y = name } f_3(x,y) = { x is an HTML link & y = location }
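
For concreteness, indicator features like these can be written as tiny functions of an (x, y) pair. This is a minimal sketch, not code from the lecture; the token fields (word, is_html_link) are made up for illustration.

```python
# A minimal sketch of binary feature functions for a maxent classifier.
# The token representation (a dict with hypothetical fields) is illustrative only.

def f1(x, y):
    # Fires when the word is "Boston" and the label is LOCATION.
    return 1 if x["word"] == "Boston" and y == "LOCATION" else 0

def f2(x, y):
    # Fires when the first letter is capitalized and the label is NAME.
    return 1 if x["word"][:1].isupper() and y == "NAME" else 0

def f3(x, y):
    # Fires when the token is an HTML link and the label is LOCATION.
    return 1 if x["is_html_link"] and y == "LOCATION" else 0

x = {"word": "Boston", "is_html_link": False}
print([f(x, "LOCATION") for f in (f1, f2, f3)])   # [1, 0, 0]
```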

6 First Solution: MaxEnt Classifier How should we choose a classifier? Principle of maximum entropy We want a classifier that: Matches feature constraints from training data. Predictions maximize entropy. There is a unique, exponential family distribution that meets these criteria.

7 First Solution: MaxEnt Classifier Problem with using a maximum entropy classifier for sequence labeling: It makes decisions at each position independently!

8 Second Solution: HMM P(y, x) = ∏_t P(y_t | y_{t-1}) P(x_t | y_t) Defines a generative process. Can be viewed as a weighted finite state machine.
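
A minimal sketch of this factorization with invented probability tables; a start distribution stands in for P(y_1 | y_0).

```python
# Sketch: P(y, x) = prod_t P(y_t | y_{t-1}) * P(x_t | y_t) for a toy HMM.
# The probability tables below are invented for illustration.

start = {"N": 0.6, "V": 0.4}                          # P(y_1)
trans = {("N", "V"): 0.7, ("N", "N"): 0.3,
         ("V", "N"): 0.8, ("V", "V"): 0.2}            # P(y_t | y_{t-1})
emit = {("N", "flight"): 0.2, ("V", "book"): 0.3,
        ("N", "book"): 0.05, ("V", "flight"): 0.01}   # P(x_t | y_t)

def joint(y, x):
    p = start[y[0]] * emit[(y[0], x[0])]
    for t in range(1, len(y)):
        p *= trans[(y[t - 1], y[t])] * emit[(y[t], x[t])]
    return p

print(joint(["V", "N"], ["book", "flight"]))   # 0.4 * 0.3 * 0.8 * 0.2
```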

9 Second Solution: HMM How can we represent multiple features in an HMM? Treat them as conditionally independent given the class label? The example features we talked about are not independent. Try to model a more complex generative process of the input features? We may lose tractability (i.e., lose dynamic programming for exact inference).

10 Second Solution: HMM Let s use a conditional model instead.

11 Third Solution: MEMM Use a series of maximum entropy classifiers that know the previous label. Define a Viterbi algorithm for inference. P(y | x) = ∏_t P_{y_{t-1}}(y_t | x)
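
A sketch of Viterbi decoding over locally normalized per-position distributions, as an MEMM would use; the label set and the toy log_p function are invented stand-ins for a trained maxent classifier.

```python
import math

# Sketch: Viterbi decoding for an MEMM, where each step uses a locally
# normalized distribution p(y_t | y_{t-1}, x, t). The toy scoring function
# below is a stand-in for a trained maxent classifier.

LABELS = ["O", "B-LOC", "I-LOC"]

def log_p(y_prev, y, x, t):
    # Hypothetical local model: slightly prefer O, forbid I-LOC after O or start.
    if y == "I-LOC" and y_prev in (None, "O"):
        return float("-inf")
    return math.log(0.5 if y == "O" else 0.25)

def viterbi(x):
    n = len(x)
    best = [{} for _ in range(n)]          # best[t][y] = (log prob, backpointer)
    for y in LABELS:
        best[0][y] = (log_p(None, y, x, 0), None)
    for t in range(1, n):
        for y in LABELS:
            best[t][y] = max(
                (best[t - 1][yp][0] + log_p(yp, y, x, t), yp) for yp in LABELS
            )
    # Follow backpointers from the best final label.
    y = max(LABELS, key=lambda lab: best[n - 1][lab][0])
    path = [y]
    for t in range(n - 1, 0, -1):
        y = best[t][y][1]
        path.append(y)
    return path[::-1]

print(viterbi(["The", "Red", "Sea"]))
```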

12 Third Solution: MEMM Combines the advantages of maximum entropy and HMM! But there is a problem

13 Problem with MEMMs: Label Bias In some state space configurations, MEMMs essentially completely ignore the inputs. Example (worked on the board): a small finite-state machine in which state 0 has two competing r-transitions, one path 0-1-2-3 reading "rib" and one path 0-4-5-3 reading "rob". This is not a problem for HMMs, because the input sequence is generated by the model.

14 Fourth Solution: Conditional Random Field Conditionally-trained, undirected graphical model. For a standard linear-chain structure: P(y | x) = (1/Z(x)) ∏_t Ψ(y_t, y_{t-1}, x), where Ψ(y_t, y_{t-1}, x) = exp( Σ_k λ_k f_k(y_t, y_{t-1}, x) )
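
A sketch of the linear-chain quantities above with invented weights: the log-potential at each position is Σ_k λ_k f_k(y_t, y_{t-1}, x), and Z(x) is computed by the forward algorithm rather than by enumerating label sequences (the brute-force sum is included only as a check).

```python
import math
from itertools import product

# Sketch of a linear-chain CRF with invented log-potentials log_psi(y_prev, y, x, t).
LABELS = ["O", "LOC"]

def log_psi(y_prev, y, x, t):
    # Hypothetical features: capitalized word prefers LOC; LOC tends to repeat.
    score = 0.8 if (x[t][0].isupper() and y == "LOC") else 0.0
    if y_prev == "LOC" and y == "LOC":
        score += 0.5
    return score

def score(y_seq, x):
    return sum(log_psi(None if t == 0 else y_seq[t - 1], y_seq[t], x, t)
               for t in range(len(x)))

def log_Z(x):
    # Forward algorithm in log space over label prefixes.
    alpha = {y: log_psi(None, y, x, 0) for y in LABELS}
    for t in range(1, len(x)):
        alpha = {y: math.log(sum(math.exp(alpha[yp] + log_psi(yp, y, x, t))
                                 for yp in LABELS)) for y in LABELS}
    return math.log(sum(math.exp(v) for v in alpha.values()))

x = ["the", "Red", "Sea"]
# Check: summing exp(score) over all label sequences matches the forward value.
brute = math.log(sum(math.exp(score(list(ys), x)) for ys in product(LABELS, repeat=len(x))))
print(round(log_Z(x), 6) == round(brute, 6))                 # True
print(math.exp(score(["O", "LOC", "LOC"], x) - log_Z(x)))    # p(y | x) for one labeling
```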

15 Fourth Solution: CRF Have the advantages of MEMMs, but avoid the label bias problem. CRFs are globally normalized, whereas MEMMs are locally normalized. Widely used and applied. CRFs give state-of-the-art results in many domains.

16 Fourth Solution: CRF Have the advantages of MEMMs, but avoid the label bias problem. CRFs are globally normalized, whereas MEMMs are locally normalized. Widely used and applied. CRFs give state-of-the-art results in many domains. Remember, Z is the normalization constant. How do we compute it?

17 CRF Applications Part-of-speech tagging Named entity recognition Gene prediction Chinese word segmentation Morphological disambiguation Citation parsing Document layout (e.g. table) classification Etc., etc.

18 NER as Sequence Tagging The Phoenicians came from the Red Sea 17

19 NER as Sequence Tagging O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17

20 NER as Sequence Tagging Capitalized word O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17

21 NER as Sequence Tagging Capitalized word Ends in s O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17

22 NER as Sequence Tagging Capitalized word Ends in s Ends in ans O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17

23 NER as Sequence Tagging Capitalized word Ends in s Ends in ans Previous word the O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17

24 NER as Sequence Tagging Capitalized word Ends in s Ends in ans Previous word the Phoenicians in gazetteer O B-E O O O B-L I-L The Phoenicians came from the Red Sea 17

25 NER as Sequence Tagging O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18

26 NER as Sequence Tagging Not capitalized O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18

27 NER as Sequence Tagging Not capitalized Tagged as VB O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18

28 NER as Sequence Tagging B-E to right Not capitalized Tagged as VB O B-E O O O B-L I-L The Phoenicians came from the Red Sea 18

29 NER as Sequence Tagging O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19

30 NER as Sequence Tagging Word sea O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19

31 NER as Sequence Tagging Word sea preceded by the ADJ Word sea O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19

32 NER as Sequence Tagging Word sea preceded by the ADJ Hard constraint: I-L must follow B-L or I-L Word sea O B-E O O O B-L I-L The Phoenicians came from the Red Sea 19
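
The hard constraint on this slide can be checked with a one-line rule; a minimal sketch using the tag names from the example.

```python
# Sketch: hard transition constraint for BIO-style NER tags, as in the
# example above (I-L must follow B-L or I-L).

def legal_transition(prev_tag, tag):
    if tag.startswith("I-"):
        entity = tag[2:]
        return prev_tag in ("B-" + entity, "I-" + entity)
    return True

tags = ["O", "B-E", "O", "O", "O", "B-L", "I-L"]
print(all(legal_transition(p, t) for p, t in zip(["O"] + tags, tags)))  # True
print(legal_transition("O", "I-L"))                                     # False
```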

33 Overview What computations do we need? Smoothing log-linear models MEMMs vs. CRFs again Action-based parsing and dependency parsing

34 Recipe for Conditional Training of p(y|x) 1. Gather constraints/features from training data. 2. Initialize all parameters to zero. 3. Classify training data with current parameters; calculate expectations. 4. Gradient is in the direction of observed feature counts minus expected feature counts. 5. Take a step in the direction of the gradient. 6. Repeat from 3 until convergence.

35 Recipe for Conditional Training of p(y|x) 1. Gather constraints/features from training data. 2. Initialize all parameters to zero. 3. Classify training data with current parameters; calculate expectations. 4. Gradient is in the direction of observed feature counts minus expected feature counts. 5. Take a step in the direction of the gradient. 6. Repeat from 3 until convergence. Where have we seen expected counts before?

36 Recipe for Conditional Training of p(y|x) 1. Gather constraints/features from training data. 2. Initialize all parameters to zero. 3. Classify training data with current parameters; calculate expectations. 4. Gradient is in the direction of observed feature counts minus expected feature counts. 5. Take a step in the direction of the gradient. 6. Repeat from 3 until convergence. Where have we seen expected counts before? EM!

37 Gradient-Based Training λ := λ + rate * Gradient(F) After all training examples? (batch) After every example? (on-line) Use second derivative for faster learning? A big field: numerical optimization
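
A sketch of the batch recipe above for a toy (unstructured) maxent classifier: the gradient of the conditional log-likelihood for each feature is its observed count minus its expected count under the current model. The features, labels, and two training pairs are invented.

```python
import math
from collections import defaultdict

# Sketch: batch gradient ascent for a toy conditional log-linear classifier.
# Gradient of the conditional log-likelihood = observed counts - expected counts.

LABELS = ["LOC", "PER"]
DATA = [({"cap": 1, "in_gazetteer": 1}, "LOC"),
        ({"cap": 1, "in_gazetteer": 0}, "PER")]          # invented training pairs

def features(x, y):
    return {(name, y): val for name, val in x.items()}

def p_y_given_x(x, w):
    scores = {y: sum(w[f] * v for f, v in features(x, y).items()) for y in LABELS}
    z = sum(math.exp(s) for s in scores.values())
    return {y: math.exp(s) / z for y, s in scores.items()}

w = defaultdict(float)
rate = 0.5
for step in range(100):
    grad = defaultdict(float)
    for x, y in DATA:
        for f, v in features(x, y).items():
            grad[f] += v                                  # observed counts
        probs = p_y_given_x(x, w)
        for y2 in LABELS:
            for f, v in features(x, y2).items():
                grad[f] -= probs[y2] * v                  # expected counts
    for f, g in grad.items():
        w[f] += rate * g                                  # lambda := lambda + rate * gradient

print({f: round(v, 2) for f, v in w.items()})
```

This is the batch variant; the on-line variant on the slide updates after every example instead of after a full pass.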

38 Parsing as Structured Prediction

39 Shift-reduce parsing
Stack | Input remaining | Action
() | Book that flight | shift
(Book) | that flight | reduce, Verb → book (Choice #1 of 2)
(Verb) | that flight | shift
(Verb that) | flight | reduce, Det → that
(Verb Det) | flight | shift
(Verb Det flight) | | reduce, Noun → flight
(Verb Det Noun) | | reduce, NOM → Noun
(Verb Det NOM) | | reduce, NP → Det NOM
(Verb NP) | | reduce, VP → Verb NP
(VP) | | reduce, S → VP
(S) | | SUCCESS!
Ambiguity may lead to the need for backtracking.

41 Shift-reduce parsing (same derivation as above) Train log-linear model of p(action | context)

42 Shift-reduce parsing (same derivation as above) Compare to an MEMM Train log-linear model of p(action | context)

43 Structured Log-Linear Models 25

44 Structured Log-Linear Models Linear model for scoring structures score(out, in) = θ · features(out, in) 25

45 Structured Log-Linear Models Linear model for scoring structures Get a probability distribution by normalizing score(out, in) = θ · features(out, in) p(out | in) = (1/Z) exp(score(out, in)) Z = Σ_{out' ∈ GEN(in)} exp(score(out', in)) 25

46 Structured Log-Linear Models Linear model for scoring structures Get a probability distribution by normalizing Viz. logistic regression, Markov random fields, undirected graphical models score(out, in) = θ · features(out, in) p(out | in) = (1/Z) exp(score(out, in)) Z = Σ_{out' ∈ GEN(in)} exp(score(out', in)) 25

47 Structured Log-Linear Models Linear model for scoring structures Get a probability distribution by normalizing Viz. logistic regression, Markov random fields, undirected graphical models Usually the bottleneck in NLP score(out, in) = θ · features(out, in) p(out | in) = (1/Z) exp(score(out, in)) Z = Σ_{out' ∈ GEN(in)} exp(score(out', in)) 25

48 Structured Log-Linear Models Linear model for scoring structures Get a probability distribution by normalizing Viz. logistic regression, Markov random fields, undirected graphical models Inference: sampling, variational methods, dynamic programming, local search,... Usually the bottleneck in NLP score(out, in) = θ · features(out, in) p(out | in) = (1/Z) exp(score(out, in)) Z = Σ_{out' ∈ GEN(in)} exp(score(out', in)) 25

49 Structured Log-Linear Models Linear model for scoring structures Get a probability distribution by normalizing Viz. logistic regression, Markov random fields, undirected graphical models Inference: sampling, variational methods, dynamic programming, local search,... Training: maximum likelihood, minimum risk, etc. Usually the bottleneck in NLP score(out, in) = θ · features(out, in) p(out | in) = (1/Z) exp(score(out, in)) Z = Σ_{out' ∈ GEN(in)} exp(score(out', in)) 25
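
A sketch of the normalization above for a case where GEN(in) is small enough to enumerate explicitly; the weights, feature function, and candidate outputs are invented. In real structured models this sum is exactly the bottleneck the slide mentions.

```python
import math

# Sketch: p(out | in) = exp(score(out, in)) / Z, with Z summed over GEN(in).
# Enumerating GEN(in) is only feasible here because the toy candidate set is tiny.

theta = {"len": -0.1, "has_root_verb": 1.5}          # invented weights

def features(out, inp):
    return {"len": len(out), "has_root_verb": 1.0 if "V" in out else 0.0}

def score(out, inp):
    return sum(theta[k] * v for k, v in features(out, inp).items())

def p(out, inp, gen):
    Z = sum(math.exp(score(o, inp)) for o in gen)
    return math.exp(score(out, inp)) / Z

GEN = [("V", "N"), ("N", "N"), ("V", "V")]           # toy output space
print({o: round(p(o, "book flight", GEN), 3) for o in GEN})
```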

50 Structured Log-Linear Models With latent variables Several layers of linguistic structure Unknown correspondences Naturally handled by probabilistic framework Several inference setups, for example: p(out1 | in) = Σ_{out2, alignment} p(out1, out2, alignment | in) 26

51 Structured Log-Linear Models With latent variables Several layers of linguistic structure Unknown correspondences Naturally handled by probabilistic framework Several inference setups, for example: p(out1 | in) = Σ_{out2, alignment} p(out1, out2, alignment | in) Another computational problem 26

52 Edge-Factored Parsers No global features of a parse (McDonald et al. 2005) Each feature is attached to some edge MST or CKY-like DP for fast O(n²) or O(n³) parsing Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 27

53 Edge-Factored Parsers Is this a good edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 28

54 Edge-Factored Parsers Is this a good edge? yes, lots of positive features... Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 28

55 Edge-Factored Parsers Is this a good edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 29

56 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") Byl jasný studený dubnový den a hodiny odbíjely třináctou It was a bright cold day in April and the clocks were striking thirteen 29

57 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 30

58 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 30

59 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 31

60 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") A → N Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 31

61 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") A → N Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 32

62 Edge-Factored Parsers Is this a good edge? jasný → den ("bright day") jasný → N ("bright NOUN") A → N A → N preceding conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 32

63 Edge-Factored Parsers How about this competing edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 33

64 Edge-Factored Parsers How about this competing edge? not as good, lots of red... Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 33

65 Edge-Factored Parsers How about this competing edge? Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 34

66 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks") Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 34

67 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C It was a bright cold day in April and the clocks were striking thirteen 34

68 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 35

69 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 35

70 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 36

71 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) A singular → N plural Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 36

72 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) A singular → N plural Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 37

73 Edge-Factored Parsers How about this competing edge? jasný → hodiny ("bright clocks")... undertrained... jasn → hodi ("bright clock", stems only) A singular → N plural A → N where N follows a conjunction Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 37

74 Edge-Factored Parsers Which edge is better? bright day or bright clocks? Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 38

75 Edge-Factored Parsers Which edge is better? Score of an edge e = θ · features(e) Standard algos → valid parse with max total score Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 39

76 Edge-Factored Parsers Which edge is better? Score of an edge e = θ · features(e), where θ is our current weight vector Standard algos → valid parse with max total score Byl jasný studený dubnový den a hodiny odbíjely třináctou V A A A N J N V C byl jasn stud dubn den a hodi odbí třin It was a bright cold day in April and the clocks were striking thirteen 39
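
A sketch of θ · features(e) with sparse binary features; the feature names echo the ones above (word pair, tag pair, stems, conjunction context) but the weights are invented.

```python
# Sketch: score of a dependency edge = dot product of the weight vector with
# the edge's sparse binary feature vector. Weights and feature names are illustrative.

theta = {
    "word:jasny->den": 2.0,      # "bright day" seen in training
    "tag:A->N": 1.2,
    "word:jasny->hodiny": 0.0,   # "bright clocks": undertrained
    "stem:jasn->hodi": -0.3,
    "tag:A->N|after_conj": -1.0,
}

def edge_score(feats, theta):
    # Binary features: the score is just the sum of the fired features' weights.
    return sum(theta.get(f, 0.0) for f in feats)

good_edge = ["word:jasny->den", "tag:A->N"]
bad_edge = ["word:jasny->hodiny", "stem:jasn->hodi", "tag:A->N|after_conj"]
print(edge_score(good_edge, theta), edge_score(bad_edge, theta))   # 3.2 -1.3
```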

77 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging find preferred tags 40

78 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging find preferred tags Observed input sentence (shaded) 40

79 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Possible tagging (i.e., assignment to remaining variables) v v v find preferred tags Observed input sentence (shaded) 40

80 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Possible tagging (i.e., assignment to remaining variables) Another possible tagging find preferred tags Observed input sentence (shaded) 41

81 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Possible tagging (i.e., assignment to remaining variables) Another possible tagging v a n find preferred tags Observed input sentence (shaded) 41

82 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Binary factor that measures compatibility of 2 adjacent tags v n a v n a find preferred tags 42

83 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Binary factor that measures compatibility of 2 adjacent tags v n a v n a v n a v n a Model reuses same parameters at this position find preferred tags 42

84 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging v 0.2 n 0.2 a 0 find preferred tags 43

85 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Unary factor evaluates this tag v 0.2 n 0.2 a 0 find preferred tags 43

86 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Unary factor evaluates this tag Its values depend on corresponding word v 0.2 n 0.2 a 0 find preferred tags 43

87 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Unary factor evaluates this tag Its values depend on corresponding word v 0.2 n 0.2 a 0 find preferred tags can t be adj 43

88 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Unary factor evaluates this tag Its values depend on corresponding word v 0.2 n 0.2 a 0 find preferred tags 44

89 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Unary factor evaluates this tag Its values depend on corresponding word v 0.2 n 0.2 a 0 find preferred tags (could be made to depend on entire observed sentence) 44

90 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging v 0.3 n 0.02 a 0 v 0.3 n 0 a 0.1 v 0.2 n 0.2 a 0 find preferred tags 45

91 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging Unary factor evaluates this tag Different unary factor at each position v 0.3 n 0.02 a 0 v 0.3 n 0 a 0.1 v 0.2 n 0.2 a 0 find preferred tags 45

92 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging v n a v n a v n a v n a v 0.3 n 0.02 a 0 v 0.3 n 0 a 0.1 v 0.2 n 0.2 a 0 find preferred tags 46

93 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging p(v a n) is proportional to the product of all factors values on v a n v n a v n a v n a v n a v a n v 0.3 n 0.02 a 0 v 0.3 n 0 a 0.1 v 0.2 n 0.2 a 0 find preferred tags 46

94 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging p(v a n) is proportional to the product of all factors values on v a n v n a v n a v n a v n a v a n v 0.3 n 0.02 a 0 v 0.3 n 0 a 0.1 v 0.2 n 0.2 a 0 find preferred tags 47

95 Local factors in a graphical model # First, a familiar example $ Conditional Random Field (CRF) for POS tagging p(v a n) is proportional to the product of all factors values on v a n v n a v n a v n a v n a v a n = 1*3*0.3*0.1*0.2 v 0.3 n 0.02 a 0 v 0.3 n 0 a 0.1 v 0.2 n 0.2 a 0 find preferred tags 47

96 Graphical Models for Parsing First, a labeling example v a n CRF for POS tagging Now let's do dependency parsing! O(n²) boolean variables for the possible links find preferred links 48

99 Graphical Models for Parsing First, a labeling example v a n CRF for POS tagging Now let's do dependency parsing! O(n²) boolean variables for the possible links Possible parse... find preferred links 49

100 Graphical Models for Parsing First, a labeling example v a n CRF for POS tagging Now let's do dependency parsing! O(n²) boolean variables for the possible links Possible parse... and variable assignment (links in the parse are t, the rest f) find preferred links 49

101 Graphical Models for Parsing First, a labeling example v a n CRF for POS tagging Now let's do dependency parsing! O(n²) boolean variables for the possible links Another parse... find preferred links 50

102 Graphical Models for Parsing First, a labeling example v a n CRF for POS tagging Now let's do dependency parsing! O(n²) boolean variables for the possible links Another parse... and variable assignment find preferred links 50

103 Graphical Models for Parsing First, a labeling example v a n CRF for POS tagging Now let's do dependency parsing! O(n²) boolean variables for the possible links An illegal parse... with a cycle! find preferred links 51

104 Graphical Models for Parsing First, a labeling example v a n CRF for POS tagging Now let's do dependency parsing! O(n²) boolean variables for the possible links An illegal parse... with a cycle and multiple parents! find preferred links 52

105 Local Factors for Parsing What factors determine parse probability? find preferred links 53

106 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation [e.g. a factor table on one link: t 2, f 1] find preferred links 53

107 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation [factor tables on each link: t 2 / f 1, t 1 / f 8, t 1 / f 3, t 1 / f 2, t 1 / f 6, t 1 / f 2] find preferred links 53

108 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation But what if the best assignment isn't a tree? [factor tables on each link: t 2 / f 1, t 1 / f 8, t 1 / f 3, t 1 / f 2, t 1 / f 6, t 1 / f 2] find preferred links 53

109 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation find preferred links 54

110 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation Global TREE factor to require links to form a legal tree ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 find preferred links 54

111 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation Global TREE factor to require links to form a legal tree A hard constraint: potential is either 0 or 1 ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 find preferred links 54

112 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation Global TREE factor to require links to form a legal tree A hard constraint: potential is either 0 or 1 t f f f f t ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 we re legal! find preferred links 55

113 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation optionally require the tree to be projective (no crossing links) Global TREE factor to require links to form a legal tree A hard constraint: potential is either 0 or 1 So far, this is equivalent to edge-factored parsing f f f t t f 64 entries (0/1) ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 we re legal! find preferred links 55

114 Global Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation optionally require the tree to be projective (no crossing links) Global TREE factor to require links to form a legal tree A hard constraint: potential is either 0 or 1 So far, this is equivalent to edge-factored parsing f f f t t f 64 entries (0/1) ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 we re legal! find preferred links Note: traditional parsers don t loop through this table to consider exponentially many trees one at a time. They use combinatorial algorithms; so should we! 55

115 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation Global TREE factor to require links to form a legal tree A hard constraint: potential is either 0 or 1 Second order effects: factors on 2 variables Grandparent parent child chains find preferred links 56

116 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation Global TREE factor to require links to form a legal tree A hard constraint: potential is either 0 or 1 Second order effects: factors on 2 variables Grandparent parent child chains [pairwise factor table over the two links: (f,f) 1, (f,t) 1, (t,f) 1, (t,t) 3] find preferred links 56

117 Local Factors for Parsing What factors determine parse probability? Unary factors to score each link in isolation Global TREE factor to require links to form a legal tree A hard constraint: potential is either 0 or 1 Second order effects: factors on 2 variables Grandparent parent child chains No crossing links Siblings Hidden morphological tags Word senses and subcategorization frames find preferred links 57

118 Great Ideas in ML: Message Passing adapted from MacKay (2003) textbook 58

119 Great Ideas in ML: Message Passing Count the soldiers adapted from MacKay (2003) textbook 58

120 Great Ideas in ML: Message Passing Count the soldiers 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58

121 Great Ideas in ML: Message Passing Count the soldiers there s 1 of me 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58

122 Great Ideas in ML: Message Passing Count the soldiers there s 1 of me 5 behind you 4 behind you 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58

123 Great Ideas in ML: Message Passing Count the soldiers 1 before you 2 before you 5 behind you 4 behind you 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58

124 Great Ideas in ML: Message Passing Count the soldiers 1 before you 2 before you 3 before you 4 before you 5 before you 5 behind you 4 behind you 3 behind you 2 behind you 1 behind you adapted from MacKay (2003) textbook 58

125 Great Ideas in ML: Message Passing Count the soldiers 2 before you there s 1 of me only see my incoming messages 3 behind you adapted from MacKay (2003) textbook 59

126 Great Ideas in ML: Message Passing Count the soldiers 2 before you there s 1 of me Belief: Must be = 6 of us only see my incoming messages 3 behind you adapted from MacKay (2003) textbook 59

128 Great Ideas in ML: Message Passing Count the soldiers 1 before you there s 1 of me Belief: Must be = 6 of us only see my incoming messages 4 behind you adapted from MacKay (2003) textbook 60

130 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 61

131 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here adapted from MacKay (2003) textbook 61

132 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 1 of me 11 here (= 7+3+1) adapted from MacKay (2003) textbook 61

133 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 62

134 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 3 here adapted from MacKay (2003) textbook 62

135 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here (= 3+3+1) 3 here adapted from MacKay (2003) textbook 62

136 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 63

137 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 7 here 3 here adapted from MacKay (2003) textbook 63

138 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 11 here (= 7+3+1) 7 here 3 here adapted from MacKay (2003) textbook 63

139 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree adapted from MacKay (2003) textbook 64

140 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree Belief: Must be 14 of us adapted from MacKay (2003) textbook 64

141 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us adapted from MacKay (2003) textbook 64

142 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us adapted from MacKay (2003) textbook 65

143 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 3 here Belief: Must be 14 of us wouldn t work correctly with a loopy (cyclic) graph adapted from MacKay (2003) textbook 65

144 Great ideas in ML: Forward-Backward In the CRF, message passing = forward-backward = sum-product algorithm [pairwise tag factors over {v, n, a} between adjacent positions] find preferred tags 66

145 Great ideas in ML: Forward-Backward In the CRF, message passing = forward-backward = sum-product algorithm belief at the middle variable: v 1.8, n 0, a 4.2 find preferred tags 66

146 Great ideas in ML: Forward-Backward In the CRF, message passing = forward-backward = sum-product algorithm α message: v 3, n 1, a 6 belief: v 1.8, n 0, a 4.2 β message: v 2, n 1, a 7 find preferred tags 66

147 Great ideas in ML: Forward-Backward In the CRF, message passing = forward-backward = sum-product algorithm α message: v 3, n 1, a 6 belief: v 1.8, n 0, a 4.2 unary factor: v 0.3, n 0, a 0.1 β message: v 2, n 1, a 7 find preferred tags 66

148 Great ideas in ML: Forward-Backward In the CRF, message passing = forward-backward = sum-product algorithm α: v 7, n 2, a 1 α message: v 3, n 1, a 6 belief: v 1.8, n 0, a 4.2 unary factor: v 0.3, n 0, a 0.1 β message: v 2, n 1, a 7 find preferred tags 66

152 Great ideas in ML: Forward-Backward In the CRF, message passing = forward-backward = sum-product algorithm α: v 7, n 2, a 1 α message: v 3, n 1, a 6 belief: v 1.8, n 0, a 4.2 unary factor: v 0.3, n 0, a 0.1 β message: v 2, n 1, a 7 β: v 3, n 6, a 1 find preferred tags 66

154 Sum-Product Equations Message from variable v to factor f: m_{v→f}(x) = ∏_{f' ∈ N(v)\{f}} m_{f'→v}(x) Message from factor f to variable v: m_{f→v}(x) = Σ_{x_{N(f)\{v}}} [ f(x_{N(f)}) ∏_{v' ∈ N(f)\{v}} m_{v'→f}(x_{v'}) ] 67
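
A sketch of the two message equations on a tiny hand-built factor graph: one binary factor between two tag variables plus a unary factor on each. The potential values are invented.

```python
from itertools import product

# Sketch of the sum-product message equations on a tiny factor graph.
# Domains, factors, and graph structure are hand-built for illustration.

DOMAIN = ["v", "n", "a"]

# A binary factor over (y1, y2) and a unary factor on each variable (toy values).
binary = {(a, b): (2.0 if (a, b) == ("v", "n") else 1.0)
          for a, b in product(DOMAIN, repeat=2)}
unary1 = {"v": 0.3, "n": 0.02, "a": 0.0}
unary2 = {"v": 0.3, "n": 0.0, "a": 0.1}

def msg_var_to_factor(other_incoming):
    # m_{v->f}(x) = product of the messages from the variable's other factors.
    out = {}
    for x in DOMAIN:
        p = 1.0
        for m in other_incoming:
            p *= m[x]
        out[x] = p
    return out

def msg_factor_to_var(factor, incoming_from_other_var):
    # m_{f->v}(x) = sum over the other variable's values of factor * its message.
    return {x: sum(factor[(y, x)] * incoming_from_other_var[y] for y in DOMAIN)
            for x in DOMAIN}

# Messages toward variable 2: unary1 -> var1 -> binary -> var2; combine with unary2.
m_var1_to_binary = msg_var_to_factor([unary1])
m_binary_to_var2 = msg_factor_to_var(binary, m_var1_to_binary)
belief_var2 = {x: m_binary_to_var2[x] * unary2[x] for x in DOMAIN}
print(belief_var2)
```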

155 Great ideas in ML: Forward-Backward α message: v 3, n 1, a 6 β message: v 2, n 1, a 7 unary factor: v 0.3, n 0, a 0.1 find preferred tags 68

157 Great ideas in ML: Forward-Backward Extend CRF to skip chain to capture non-local factor α message: v 3, n 1, a 6 β message: v 2, n 1, a 7 unary factor: v 0.3, n 0, a 0.1 find preferred tags 68

158 Great ideas in ML: Forward-Backward Extend CRF to skip chain to capture non-local factor More influences on belief belief: v 5.4, n 0, a 25.2 α message: v 3, n 1, a 6 β message: v 2, n 1, a 7 skip-chain message: v 3, n 1, a 6 unary factor: v 0.3, n 0, a 0.1 find preferred tags 68

159 Great ideas in ML: Forward-Backward Extend CRF to skip chain to capture non-local factor More influences on belief Graph becomes loopy belief: v 5.4, n 0, a 25.2 α message: v 3, n 1, a 6 β message: v 2, n 1, a 7 skip-chain message: v 3, n 1, a 6 unary factor: v 0.3, n 0, a 0.1 find preferred tags 69

160 Great ideas in ML: Forward-Backward Extend CRF to skip chain to capture non-local factor More influences on belief Graph becomes loopy Red messages not independent? Pretend they are! belief: v 5.4, n 0, a 25.2 α message: v 3, n 1, a 6 β message: v 2, n 1, a 7 skip-chain message: v 3, n 1, a 6 unary factor: v 0.3, n 0, a 0.1 find preferred tags 69

161 Great ideas in ML: Forward-Backward Extend CRF to skip chain to capture non-local factor More influences on belief Graph becomes loopy Red messages not independent? Pretend they are! Loopy Belief Propagation belief: v 5.4, n 0, a 25.2 α message: v 3, n 1, a 6 β message: v 2, n 1, a 7 skip-chain message: v 3, n 1, a 6 unary factor: v 0.3, n 0, a 0.1 find preferred tags 69

162 Terminological Clarification propagation 70

163 Terminological Clarification belief propagation 70

164 Terminological Clarification loopy belief propagation 70

167 Terminological Clarification loopy belief propagation loopy belief propagation 70

169 Propagating Global Factors Loopy belief propagation is easy for local factors How do combinatorial factors (like TREE) compute the message to the link in question? Does the TREE factor think the link is probably t given the messages it receives from all the other links?? find preferred links 71

171 Propagating Global Factors Loopy belief propagation is easy for local factors How do combinatorial factors (like TREE) compute the message to the link in question? Does the TREE factor think the link is probably t given the messages it receives from all the other links? find preferred links? TREE factor ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 71

172 Propagating Global Factors How does the TREE factor compute the message to the link in question? Does the TREE factor think the link is probably t given the messages it receives from all the other links? find preferred links? TREE factor ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 72

173 Propagating Global Factors How does the TREE factor compute the message to the link in question? Does the TREE factor think the link is probably t given the messages it receives from all the other links? Old-school parsing to the rescue! find preferred links? TREE factor ffffff 0 ffffft 0 fffftf 0 fftfft 1 tttttt 0 This is the outside probability of the link in an edge-factored parser! TREE factor computes all outgoing messages at once (given all incoming messages) Projective case: total O(n³) time by inside-outside Non-projective: total O(n³) time by inverting Kirchhoff matrix 72

174 Graph Theory to the Rescue! Tutte s Matrix-Tree Theorem (1948) The determinant of the Kirchoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r. 73

175 Graph Theory to the Rescue! Tutte s Matrix-Tree Theorem (1948) The determinant of the Kirchoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r. Exactly the Z we need! 73

176 Graph Theory to the Rescue! O(n 3 ) time! Tutte s Matrix-Tree Theorem (1948) The determinant of the Kirchoff (aka Laplacian) adjacency matrix of directed graph G without row and column r is equal to the sum of scores of all directed spanning trees of G rooted at node r. Exactly the Z we need! 73

177 Kirchoff (Laplacian) Matrix
[ 0   s(1,0)   s(2,0)   ...   s(n,0) ]
[ 0   0        s(2,1)   ...   s(n,1) ]
[ 0   s(1,2)   0        ...   s(n,2) ]
[ ...                                ]
[ 0   s(1,n)   s(2,n)   ...   0      ]
Negate edge scores. Sum columns (children). Strike root row/col. Take determinant. 74

179 Kirchoff (Laplacian) Matrix After negating the edge scores and summing each column (one column per child) onto its diagonal entry:
[ 0   -s(1,0)          -s(2,0)          ...   -s(n,0)          ]
[ 0   Σ_{j≠1} s(1,j)   -s(2,1)          ...   -s(n,1)          ]
[ 0   -s(1,2)          Σ_{j≠2} s(2,j)   ...   -s(n,2)          ]
[ ...                                                          ]
[ 0   -s(1,n)          -s(2,n)          ...   Σ_{j≠n} s(n,j)   ]
Negate edge scores. Sum columns (children). Strike root row/col. Take determinant. 74

181 Kirchoff (Laplacian) Matrix Same matrix as above: negate edge scores, sum columns (children), strike root row/col, take determinant. N.B.: This allows multiple children of root, but see Koo et al. 74
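
A sketch of the construction above in NumPy, using exp(edge score) as the weight s(h, m) of attaching modifier m to head h (h = 0 is the root): negate the edge scores, sum each modifier's column onto the diagonal, strike the root row and column, and take the determinant. The random scores are invented; the brute-force sum over arborescences is included only to check the determinant.

```python
import numpy as np
from itertools import product

# Sketch of the Matrix-Tree computation described above.
# s[h, m] = weight of the edge head h -> modifier m; node 0 is the root.

n = 3                                          # number of words (nodes 1..n)
rng = np.random.default_rng(0)
s = np.exp(rng.normal(size=(n + 1, n + 1)))    # invented positive edge weights

L = np.zeros((n + 1, n + 1))
for m in range(1, n + 1):
    for h in range(n + 1):
        if h != m:
            L[h, m] = -s[h, m]                 # negate edge scores
            L[m, m] += s[h, m]                 # sum each child's column onto the diagonal
Z = np.linalg.det(L[1:, 1:])                   # strike root row/col, take determinant

# Brute-force check: sum the weight of every directed spanning tree rooted at 0.
brute = 0.0
for heads in product(*[[h for h in range(n + 1) if h != m] for m in range(1, n + 1)]):
    # heads[m-1] is the head of word m; require every word to reach the root (no cycles).
    ok = True
    for m in range(1, n + 1):
        seen, cur = set(), m
        while cur != 0 and cur not in seen:
            seen.add(cur)
            cur = heads[cur - 1]
        ok = ok and cur == 0
    if ok:
        brute += np.prod([s[heads[m - 1], m] for m in range(1, n + 1)])
print(Z, np.isclose(Z, brute))                 # determinant matches the brute-force sum
```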

182 Transition-Based Parsing Linear time Online Train a classifier to predict next action Deterministic or beam-search strategies But... generally less accurate

183 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Start state: ([ ], [1, ..., n], {}) Final state: (S, [ ], A) Shift: (S, i|B, A) ⇒ (S|i, B, A) Reduce: (S|i, B, A) ⇒ (S, B, A) Right-Arc: (S|i, j|B, A) ⇒ (S|i|j, B, A ∪ {i → j}) Left-Arc: (S|i, j|B, A) ⇒ (S, j|B, A ∪ {j → i})
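
A sketch of the four transitions above as functions on a (stack, buffer, arcs) configuration; tokens are represented by indices and dependency labels are omitted.

```python
# Sketch: the four arc-eager transitions as functions on a configuration
# (stack, buffer, arcs), following the definitions above. Tokens are indices.

def shift(stack, buffer, arcs):
    return stack + [buffer[0]], buffer[1:], arcs

def reduce_(stack, buffer, arcs):
    return stack[:-1], buffer, arcs

def right_arc(stack, buffer, arcs):
    i, j = stack[-1], buffer[0]
    return stack + [j], buffer[1:], arcs | {(i, j)}    # add arc i -> j

def left_arc(stack, buffer, arcs):
    i, j = stack[-1], buffer[0]
    return stack[:-1], buffer, arcs | {(j, i)}         # add arc j -> i

# "who did you see" as indices 0..3, using the action sequence from the slides:
# shift, left-arc, shift, right-arc, reduce, right-arc.
config = ([], [0, 1, 2, 3], set())
for action in (shift, left_arc, shift, right_arc, reduce_, right_arc):
    config = action(*config)
print(config)   # stack [1, 3], empty buffer, arcs {(1, 0), (1, 2), (1, 3)} (set order may vary)
```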

184 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Stack Buffer Arcs [] S [who, did, you, see] B {}

185 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Stack Buffer Arcs [who] S [did, you, see] B {} Shift

186 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Left-arc OBJ Stack Buffer Arcs [] S [did, you, see] B { who <-OBJ- did }

187 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Shift Stack Buffer Arcs [did] S [you, see] B { who <-OBJ- did }

188 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Right-arc SBJ Stack Buffer Arcs [did, you] S [see] B { who <-OBJ- did, did -SBJ-> you }

189 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Reduce Stack Buffer Arcs [did] S [see] B { who <-OBJ- did, did -SBJ-> you }

190 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Right-arc VG Stack Buffer Arcs [did, see] S [] B { who <-OBJ- did, did -SBJ-> you, did -VG-> see }

191 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Right-arc SBJ Stack Buffer Arcs [did, you] S [see] B { who <-OBJ- did, did -SBJ-> you }

192 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Right-arc SBJ Stack Buffer Arcs [did, you] S [see] B { who <-OBJ- did, did -SBJ-> you } Choose action w/best classifier score 100k - 1M features

193 Transition-Based Parsing Arc-eager shift-reduce parsing (Nivre, 2003) Right-arc SBJ Stack Buffer Arcs [did, you] S [see] B { who <-OBJ- did, did -SBJ-> you } Choose action w/best classifier score 100k - 1M features Very fast linear-time performance: WSJ 23 (2k sentences) in 3 s
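
A sketch of the greedy decoding loop these slides describe: score every legal action with a classifier and apply the argmax until the buffer is consumed. The legality checks are simplified and the scoring function is a trivial stand-in for a model with 100k - 1M features.

```python
# Sketch: greedy transition-based parsing with a stand-in classifier score.
# Legality checks are simplified; a real parser uses rich feature templates.

ACTIONS = ["shift", "right_arc", "left_arc", "reduce"]

def legal(action, stack, buffer, has_head):
    if action == "shift":
        return bool(buffer)
    if action == "reduce":
        return bool(stack) and has_head[stack[-1]]
    # right_arc / left_arc need a stack top and a buffer front;
    # left_arc additionally requires that the stack top has no head yet.
    return bool(stack) and bool(buffer) and (action == "right_arc" or not has_head[stack[-1]])

def classifier_score(action, stack, buffer):
    # Stand-in for a trained model of p(action | context): fixed mild preferences.
    return {"left_arc": 3, "right_arc": 2, "shift": 1, "reduce": 0}[action]

def parse(n_words):
    stack, buffer, arcs = [], list(range(n_words)), set()
    has_head = [False] * n_words
    while buffer or len(stack) > 1:
        candidates = [a for a in ACTIONS if legal(a, stack, buffer, has_head)]
        if not candidates:
            break
        action = max(candidates, key=lambda a: classifier_score(a, stack, buffer))
        if action == "shift":
            stack.append(buffer.pop(0))
        elif action == "reduce":
            stack.pop()
        elif action == "right_arc":
            arcs.add((stack[-1], buffer[0])); has_head[buffer[0]] = True
            stack.append(buffer.pop(0))
        else:  # left_arc
            arcs.add((buffer[0], stack[-1])); has_head[stack[-1]] = True
            stack.pop()
    return arcs

print(parse(4))   # arcs chosen by the toy scorer
```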


More information

Log-Linear Models, MEMMs, and CRFs

Log-Linear Models, MEMMs, and CRFs Log-Linear Models, MEMMs, and CRFs Michael Collins 1 Notation Throughout this note I ll use underline to denote vectors. For example, w R d will be a vector with components w 1, w 2,... w d. We use expx

More information

The Noisy Channel Model and Markov Models

The Noisy Channel Model and Markov Models 1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle

More information

Lecture 7: Sequence Labeling

Lecture 7: Sequence Labeling http://courses.engr.illinois.edu/cs447 Lecture 7: Sequence Labeling Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Recap: Statistical POS tagging with HMMs (J. Hockenmaier) 2 Recap: Statistical

More information

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science

Natural Language Processing CS Lecture 06. Razvan C. Bunescu School of Electrical Engineering and Computer Science Natural Language Processing CS 6840 Lecture 06 Razvan C. Bunescu School of Electrical Engineering and Computer Science bunescu@ohio.edu Statistical Parsing Define a probabilistic model of syntax P(T S):

More information

Maxent Models and Discriminative Estimation

Maxent Models and Discriminative Estimation Maxent Models and Discriminative Estimation Generative vs. Discriminative models (Reading: J+M Ch6) Introduction So far we ve looked at generative models Language models, Naive Bayes But there is now much

More information

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark

Penn Treebank Parsing. Advanced Topics in Language Processing Stephen Clark Penn Treebank Parsing Advanced Topics in Language Processing Stephen Clark 1 The Penn Treebank 40,000 sentences of WSJ newspaper text annotated with phrasestructure trees The trees contain some predicate-argument

More information

Hidden Markov Models

Hidden Markov Models 10-601 Introduction to Machine Learning Machine Learning Department School of Computer Science Carnegie Mellon University Hidden Markov Models Matt Gormley Lecture 19 Nov. 5, 2018 1 Reminders Homework

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2014 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2014 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

with Local Dependencies

with Local Dependencies CS11-747 Neural Networks for NLP Structured Prediction with Local Dependencies Xuezhe Ma (Max) Site https://phontron.com/class/nn4nlp2017/ An Example Structured Prediction Problem: Sequence Labeling Sequence

More information

Structure Learning in Sequential Data

Structure Learning in Sequential Data Structure Learning in Sequential Data Liam Stewart liam@cs.toronto.edu Richard Zemel zemel@cs.toronto.edu 2005.09.19 Motivation. Cau, R. Kuiper, and W.-P. de Roever. Formalising Dijkstra's development

More information

Natural Language Processing

Natural Language Processing SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 9, 2018 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class

More information

Probabilistic Graphical Models

Probabilistic Graphical Models Probabilistic Graphical Models David Sontag New York University Lecture 4, February 16, 2012 David Sontag (NYU) Graphical Models Lecture 4, February 16, 2012 1 / 27 Undirected graphical models Reminder

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Stochastic Grammars Joakim Nivre Uppsala University Department of Linguistics and Philology joakim.nivre@lingfil.uu.se Statistical Methods for NLP 1(22) Structured Classification

More information

Dependency Parsing. Statistical NLP Fall (Non-)Projectivity. CoNLL Format. Lecture 9: Dependency Parsing

Dependency Parsing. Statistical NLP Fall (Non-)Projectivity. CoNLL Format. Lecture 9: Dependency Parsing Dependency Parsing Statistical NLP Fall 2016 Lecture 9: Dependency Parsing Slav Petrov Google prep dobj ROOT nsubj pobj det PRON VERB DET NOUN ADP NOUN They solved the problem with statistics CoNLL Format

More information

Information Extraction from Text

Information Extraction from Text Information Extraction from Text Jing Jiang Chapter 2 from Mining Text Data (2012) Presented by Andrew Landgraf, September 13, 2013 1 What is Information Extraction? Goal is to discover structured information

More information

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013

More on HMMs and other sequence models. Intro to NLP - ETHZ - 18/03/2013 More on HMMs and other sequence models Intro to NLP - ETHZ - 18/03/2013 Summary Parts of speech tagging HMMs: Unsupervised parameter estimation Forward Backward algorithm Bayesian variants Discriminative

More information

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016

Bayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016 Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several

More information

Partially Directed Graphs and Conditional Random Fields. Sargur Srihari

Partially Directed Graphs and Conditional Random Fields. Sargur Srihari Partially Directed Graphs and Conditional Random Fields Sargur srihari@cedar.buffalo.edu 1 Topics Conditional Random Fields Gibbs distribution and CRF Directed and Undirected Independencies View as combination

More information

Structured Prediction Models via the Matrix-Tree Theorem

Structured Prediction Models via the Matrix-Tree Theorem Structured Prediction Models via the Matrix-Tree Theorem Terry Koo Amir Globerson Xavier Carreras Michael Collins maestro@csail.mit.edu gamir@csail.mit.edu carreras@csail.mit.edu mcollins@csail.mit.edu

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/xilnmn Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Introduction to Machine Learning Midterm, Tues April 8

Introduction to Machine Learning Midterm, Tues April 8 Introduction to Machine Learning 10-701 Midterm, Tues April 8 [1 point] Name: Andrew ID: Instructions: You are allowed a (two-sided) sheet of notes. Exam ends at 2:45pm Take a deep breath and don t spend

More information

Lecture 15. Probabilistic Models on Graph

Lecture 15. Probabilistic Models on Graph Lecture 15. Probabilistic Models on Graph Prof. Alan Yuille Spring 2014 1 Introduction We discuss how to define probabilistic models that use richly structured probability distributions and describe how

More information

Context-Free Parsing: CKY & Earley Algorithms and Probabilistic Parsing

Context-Free Parsing: CKY & Earley Algorithms and Probabilistic Parsing Context-Free Parsing: CKY & Earley Algorithms and Probabilistic Parsing Natural Language Processing CS 4120/6120 Spring 2017 Northeastern University David Smith with some slides from Jason Eisner & Andrew

More information

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima.

Statistical Machine Learning Theory. From Multi-class Classification to Structured Output Prediction. Hisashi Kashima. http://goo.gl/jv7vj9 Course website KYOTO UNIVERSITY Statistical Machine Learning Theory From Multi-class Classification to Structured Output Prediction Hisashi Kashima kashima@i.kyoto-u.ac.jp DEPARTMENT

More information

Probabilistic Context-free Grammars

Probabilistic Context-free Grammars Probabilistic Context-free Grammars Computational Linguistics Alexander Koller 24 November 2017 The CKY Recognizer S NP VP NP Det N VP V NP V ate NP John Det a N sandwich i = 1 2 3 4 k = 2 3 4 5 S NP John

More information

Structured Prediction

Structured Prediction Structured Prediction Classification Algorithms Classify objects x X into labels y Y First there was binary: Y = {0, 1} Then multiclass: Y = {1,...,6} The next generation: Structured Labels Structured

More information

CRF for human beings

CRF for human beings CRF for human beings Arne Skjærholt LNS seminar CRF for human beings LNS seminar 1 / 29 Let G = (V, E) be a graph such that Y = (Y v ) v V, so that Y is indexed by the vertices of G. Then (X, Y) is a conditional

More information

Maximum Entropy Markov Models

Maximum Entropy Markov Models Wi nøt trei a høliday in Sweden this yër? September 19th 26 Background Preliminary CRF work. Published in 2. Authors: McCallum, Freitag and Pereira. Key concepts: Maximum entropy / Overlapping features.

More information

CPSC 340: Machine Learning and Data Mining

CPSC 340: Machine Learning and Data Mining CPSC 340: Machine Learning and Data Mining Linear Classifiers: multi-class Original version of these slides by Mark Schmidt, with modifications by Mike Gelbart. 1 Admin Assignment 4: Due in a week Midterm:

More information

STA 4273H: Statistical Machine Learning

STA 4273H: Statistical Machine Learning STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.utstat.utoronto.ca/~rsalakhu/ Sidney Smith Hall, Room 6002 Lecture 3 Linear

More information

Intelligent Systems (AI-2)

Intelligent Systems (AI-2) Intelligent Systems (AI-2) Computer Science cpsc422, Lecture 18 Oct, 21, 2015 Slide Sources Raymond J. Mooney University of Texas at Austin D. Koller, Stanford CS - Probabilistic Graphical Models CPSC

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Linear Models Joakim Nivre Uppsala University Department of Linguistics and Philology Slides adapted from Ryan McDonald, Google Research Machine Learning for NLP 1(26) Outline

More information

From perceptrons to word embeddings. Simon Šuster University of Groningen

From perceptrons to word embeddings. Simon Šuster University of Groningen From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written

More information

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October,

MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, MIDTERM: CS 6375 INSTRUCTOR: VIBHAV GOGATE October, 23 2013 The exam is closed book. You are allowed a one-page cheat sheet. Answer the questions in the spaces provided on the question sheets. If you run

More information

Statistical Methods for NLP

Statistical Methods for NLP Statistical Methods for NLP Information Extraction, Hidden Markov Models Sameer Maskey Week 5, Oct 3, 2012 *many slides provided by Bhuvana Ramabhadran, Stanley Chen, Michael Picheny Speech Recognition

More information

Message Passing and Junction Tree Algorithms. Kayhan Batmanghelich

Message Passing and Junction Tree Algorithms. Kayhan Batmanghelich Message Passing and Junction Tree Algorithms Kayhan Batmanghelich 1 Review 2 Review 3 Great Ideas in ML: Message Passing Each soldier receives reports from all branches of tree 3 here 7 here 1 of me 11

More information

Structured Output Prediction: Generative Models

Structured Output Prediction: Generative Models Structured Output Prediction: Generative Models CS6780 Advanced Machine Learning Spring 2015 Thorsten Joachims Cornell University Reading: Murphy 17.3, 17.4, 17.5.1 Structured Output Prediction Supervised

More information

Pattern Recognition and Machine Learning

Pattern Recognition and Machine Learning Christopher M. Bishop Pattern Recognition and Machine Learning ÖSpri inger Contents Preface Mathematical notation Contents vii xi xiii 1 Introduction 1 1.1 Example: Polynomial Curve Fitting 4 1.2 Probability

More information

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology

Basic Text Analysis. Hidden Markov Models. Joakim Nivre. Uppsala University Department of Linguistics and Philology Basic Text Analysis Hidden Markov Models Joakim Nivre Uppsala University Department of Linguistics and Philology joakimnivre@lingfiluuse Basic Text Analysis 1(33) Hidden Markov Models Markov models are

More information

p L yi z n m x N n xi

p L yi z n m x N n xi y i z n x n N x i Overview Directed and undirected graphs Conditional independence Exact inference Latent variables and EM Variational inference Books statistical perspective Graphical Models, S. Lauritzen

More information

Bowl Maximum Entropy #4 By Ejay Weiss. Maxent Models: Maximum Entropy Foundations. By Yanju Chen. A Basic Comprehension with Derivations

Bowl Maximum Entropy #4 By Ejay Weiss. Maxent Models: Maximum Entropy Foundations. By Yanju Chen. A Basic Comprehension with Derivations Bowl Maximum Entropy #4 By Ejay Weiss Maxent Models: Maximum Entropy Foundations By Yanju Chen A Basic Comprehension with Derivations Outlines Generative vs. Discriminative Feature-Based Models Softmax

More information

Statistical Approaches to Learning and Discovery

Statistical Approaches to Learning and Discovery Statistical Approaches to Learning and Discovery Graphical Models Zoubin Ghahramani & Teddy Seidenfeld zoubin@cs.cmu.edu & teddy@stat.cmu.edu CALD / CS / Statistics / Philosophy Carnegie Mellon University

More information

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April

Probabilistic Graphical Models. Guest Lecture by Narges Razavian Machine Learning Class April Probabilistic Graphical Models Guest Lecture by Narges Razavian Machine Learning Class April 14 2017 Today What is probabilistic graphical model and why it is useful? Bayesian Networks Basic Inference

More information

Machine Learning for NLP

Machine Learning for NLP Machine Learning for NLP Uppsala University Department of Linguistics and Philology Slides borrowed from Ryan McDonald, Google Research Machine Learning for NLP 1(50) Introduction Linear Classifiers Classifiers

More information

Design and Implementation of Speech Recognition Systems

Design and Implementation of Speech Recognition Systems Design and Implementation of Speech Recognition Systems Spring 2013 Class 7: Templates to HMMs 13 Feb 2013 1 Recap Thus far, we have looked at dynamic programming for string matching, And derived DTW from

More information

Lecture 5 Neural models for NLP

Lecture 5 Neural models for NLP CS546: Machine Learning in NLP (Spring 2018) http://courses.engr.illinois.edu/cs546/ Lecture 5 Neural models for NLP Julia Hockenmaier juliahmr@illinois.edu 3324 Siebel Center Office hours: Tue/Thu 2pm-3pm

More information

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Natural Language Processing Prof. Pawan Goyal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Maximum Entropy Models I Welcome back for the 3rd module

More information

CSE 490 U Natural Language Processing Spring 2016

CSE 490 U Natural Language Processing Spring 2016 CSE 490 U Natural Language Processing Spring 2016 Feature Rich Models Yejin Choi - University of Washington [Many slides from Dan Klein, Luke Zettlemoyer] Structure in the output variable(s)? What is the

More information

Hidden Markov Models Part 1: Introduction

Hidden Markov Models Part 1: Introduction Hidden Markov Models Part 1: Introduction CSE 6363 Machine Learning Vassilis Athitsos Computer Science and Engineering Department University of Texas at Arlington 1 Modeling Sequential Data Suppose that

More information