Integrating Morphology in Probabilistic Translation Models
1 Integrating Morphology in Probabilistic Translation Models. Chris Dyer, joint work with Jon Clark, Alon Lavie, and Noah Smith. January 24, 2011. LTI
2 das alte Haus the old house mach das do that 2
3 das alte Haus the old house mach das do that guten Tag hello 3
4
5 das alte Haus the old house mach das do that guten Tag hello 5
6 das alte Haus the old house Haus mach das do that guten Tag hello 6
7 das alte Haus the old house Haus mach das do that house guten Tag hello 7
8 das alte Haus the old house das mach das do that guten Tag hello 8
9 das alte Haus the old house das mach das do that the guten Tag hello 9
10 das alte Haus the old house das mach das do that that guten Tag hello 10
11 das alte Haus the old house markant mach das do that guten Tag hello 11
12 das alte Haus the old house markant mach das do that??? guten Tag hello 12
13 So far so good, but... 13
14 das alte Haus the old house alten mach das do that guten Tag hello 14
15 das alte Haus the old house alten mach das do that??? guten Tag hello 15
16 das alte Haus the old house alten mach das do that old? guten Tag hello 16
17 Problems 1. Source language inflectional richness. 17
18 das alte Haus the old house mach das do that guten Tag hello 18
20 das alte Haus the old house mach das do that old guten Tag hello 20
21 das alte Haus the old house alte mach das do that old guten Tag hello 21
22 das alte Haus the old house alten? mach das do that old guten Tag hello 22
23 Problems 1. Source language inflectional richness. 2. Target language inflectional richness. 23
24 Kopfschmerzen head ache Bauchschmerzen Rückenschmerzen abdominal pain Rücken back Kopf head 24
25 Kopfschmerzen head ache Bauchschmerzen Rückenschmerzen abdominal pain Rücken??? back Kopf head 25
26 Kopfschmerzen head ache Bauchschmerzen Rückenschmerzen abdominal pain Rücken back pain back Kopf head 26
27 Kopfschmerzen head ache Bauchschmerzen Rückenschmerzen abdominal pain Rücken back ache back Kopf head 27
28 Problems 1. Source language inflectional richness. 2. Target language inflectional richness. 3. Source language sublexical semantic compositionality. 28
29 General Solution MORPHOLOGY 29
30 f → Analysis → f′ → Translation → e′ → Synthesis → e 30
31 f → f′ → e 31
32 f = AlAbAmA → f′ → e 31
33 f = AlAbAmA → f′ = Al# Abama (looks like Al + OOV) → e 31
34 f = AlAbAmA → f′ = Al# Abama (looks like Al + OOV) → e = the Ibama 31
35 But... ambiguity! Morphology is an inherently ambiguous problem: competing linguistic theories; lexicalization; morphological analyzers (tools) make mistakes. Are minimal linguistic morphemes the optimal morphemes for MT? 32
36 Problems 1. Source language inflectional richness. 2. Target language inflectional richness. 3. Source language sublexical semantic compositionality. 4. Ambiguity everywhere! 33
37 General Solution MORPHOLOGY + PROBABILITY 34
38 Why probability? Probabilistic models formalize uncertainty: e.g., words can be formed via a morphological derivation according to a joint distribution p(word, derivation). The probability of a word is then naturally defined as the marginal probability p(word) = Σ_derivation p(word, derivation). Such a model can even be trained observing just words (EM!) 35
39 p(derived) = p(derived, de+rive+d) + p(derived, derived+∅) + p(derived, derive+d) + p(derived, deriv+ed)
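The marginalization on this slide can be sketched in a few lines of Python; the joint probabilities below are hypothetical, chosen only to illustrate the sum over derivations:

```python
# Marginalizing: p(word) = sum over derivations of p(word, derivation).
# These joint probabilities are invented for illustration.
joint = {
    ("de", "rive", "d"): 0.02,
    ("derived",):        0.06,
    ("derive", "d"):     0.01,
    ("deriv", "ed"):     0.03,
}

p_derived = sum(joint.values())  # marginal probability of the word "derived"
print(p_derived)
```

In an EM setting, the same sum appears in the E-step: each derivation's posterior responsibility is its joint probability divided by this marginal.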
40 Outline Introduction: 4 problems Three probabilistic modeling solutions Embracing uncertainty: multi-segmentations for decoding and learning Rich morphology via sparse lexical features Hierarchical Bayesian translation: infinite translation lexicons Conclusion 37
42 f 39
43 f = AlAbAmA 39
44 f = AlAbAmA → f′ ∈ {Al# Abama, AlAbama} 39
45 f = AlAbAmA → f′ ∈ {Al# Abama, AlAbama} → e ∈ {the Ibama, Alabama} 39
46 Two problems: We need to decode lots of similar source candidates efficiently: lattice / confusion network decoding, Kumar & Byrne (EMNLP, 2005), Bertoldi, Zens, Federico (ICASSP, 2007), Dyer et al. (ACL, 2008), inter alia. We need a model to generate a set of candidate sources: what are the right candidates? 40
48 Uncertainty is everywhere. Requirement: a probabilistic model p(f′ | f) that transforms f → f′. Possible solution: a discriminatively trained model, e.g., a CRF. Required data: example (f, f′) pairs from a linguistic expert or other source 42
49 Uncertainty is everywhere What is the best/right analysis... for MT? AlAntxAbAt (DEF+election+PL) 43
50 Uncertainty is everywhere What is the best/right analysis... for MT? AlAntxAbAt (DEF+election+PL) Some possibilities: Sadat & Habash (NAACL, 2007) AlAntxAb +At Al+ AntxAb +At Al+ AntxAbAt AlAntxAbAt 43
51 Uncertainty is everywhere What is the best/right analysis... for MT? AlAntxAbAt (DEF+election+PL) Some possibilities: Sadat & Habash (NAACL, 2007) AlAntxAb +At Al+ AntxAb +At Al+ AntxAbAt AlAntxAbAt Let's use them all! 43
52 Wait... multiple references?!? Train with an EM variant. Lattices can encode very large sets of references and support efficient inference. Bonus: the annotation task is much simpler. Don't know whether to label an example with A or B? Label it with both! Dyer (NAACL, 2009), Dyer (thesis, 2010) 44
54 Reference Segmentations (f → f′): freitag → freitag; tonbandaufnahme → a lattice over the segments ton, tonband, band, aufnahme 46
55 Phonotactic features! Good phonotactics: Rücken + schmerzen (Rückenschmerzen). Bad phonotactics: Rückensc + hmerzen; Rü + cke + nschme + rzen. 47
56 Just 20 features: phonotactic probability; lexical features (in vocab, OOV); lexical frequencies; is high frequency?; segment length
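A sketch of a few feature functions in this spirit; the lexicon, its toy log frequencies, and the feature names are hypothetical, and the talk's actual 20 features are not reproduced here:

```python
def segmentation_features(segments, lexicon):
    """Map a candidate segmentation (a list of segments) to feature values.
    `lexicon` maps in-vocabulary segments to toy log frequencies."""
    return {
        "num_segments":   len(segments),
        "num_in_vocab":   sum(1 for s in segments if s in lexicon),
        "num_oov":        sum(1 for s in segments if s not in lexicon),
        "total_log_freq": sum(lexicon.get(s, 0.0) for s in segments),
        "avg_seg_length": sum(len(s) for s in segments) / len(segments),
    }

# Hypothetical lexicon with log frequencies.
lexicon = {"ton": 2.1, "band": 3.0, "tonband": 1.2, "aufnahme": 2.7}
print(segmentation_features(["ton", "band", "aufnahme"], lexicon))
```

With so few, densely firing features, weights can be tuned on very little supervision, which is part of the appeal of this design.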
57 Input: tonbandaufnahme 49
58 [Figure: segmentation hypothesis lattice for the input tonbandaufnahme]
59 [Figure: pruned segmentation lattices for the input tonbandaufnahme at varying density settings (e.g., a = 0.4)]
60 [Figure: recall vs. precision of recovered segmentations]
61 Translation Evaluation (BLEU, TER for Unsegmented, best-segmentation, and Lattice (a=0.2) inputs). Unsegmented: in police raids found illegal guns, ammunition stahlkern, laserzielfernrohr and a machine gun. Lattice: in police raids found with illegal guns and ammunition steel core, a laser objective telescope and a machine gun. REF: police raids found illegal guns, steel core ammunition, a laser scope and a machine gun. 52
62 Outline Introduction: 4 problems Three probabilistic modeling solutions Embracing uncertainty: multi-segmentations for decoding and learning Rich morphology via sparse lexical features Hierarchical Bayesian translation: infinite translation lexicons Conclusion 53
63 What do we see when we look inside the IBM models? (or any multinomial-based generative model...like parsing models!) 54
64 What do we see when we look inside the IBM models? (or any multinomial-based generative model... like parsing models!) old → altes 0.3, alte, alt, alter, gammelig, gammeliges; car → Wagen 0.2, Auto, PKW
66 DLVM for Translation. Addresses problems: 1. Source language inflectional richness. 2. Target language inflectional richness. How? 1. Replace the locally normalized multinomial parameterization p(e | f) in a translation model with a globally normalized log-linear model. 2. Add lexical association features sensitive to sublexical units. C. Dyer, J. Clark, A. Lavie, and N. Smith (in review) 57
67 [Figure: plate diagrams comparing a fully directed alignment model (Brown et al., 1993; Vogel et al., 1996; Berg-Kirkpatrick et al., 2010) with our model: alignment variables a_1 ... a_n linking the source sentence s to target words t_1 ... t_n] 58
68 [Same figure] 59
69 old altes 0.3 car Wagen 0.2 alte alt alter gammelig Auto PKW gammeliges
70 old altes 0.3 car Wagen 0.2 alte alt alter gammelig Auto PKW gammeliges 0.1 New model: score(e,f) = 0.2·h1(e,f) + 0.9·h2(e,f) + 1.3·h3(e,f) + ... old alt+ Ω [0,2] gammelig+ Ω [0,2] 61
71 old altes 0.3 car Wagen 0.2 alte alt alter gammelig Auto PKW gammeliges 0.1 New model: score(e,f) = 0.2·h1(e,f) + 0.9·h2(e,f) + 1.3·h3(e,f) + ... old alt+ Ω [0,2] gammelig+ Ω [0,2] (~ Incremental vs. realizational) 62
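The score above is just a weighted sum of feature functions. A minimal sketch, with illustrative weights and feature values rather than the trained model's:

```python
def score(feature_values, weights):
    """Unnormalized log-linear score: sum_i w_i * h_i(e, f)."""
    return sum(weights[name] * h for name, h in feature_values.items())

weights = {"h1": 0.2, "h2": 0.9, "h3": 1.3}   # illustrative weights
features = {"h1": 1.0, "h2": 0.0, "h3": 1.0}  # h_i(e, f) for one (e, f) pair
print(score(features, weights))  # 0.2*1.0 + 0.9*0.0 + 1.3*1.0 = 1.5
```

Because the model is globally normalized rather than a per-source multinomial, features are free to overlap and to fire on sublexical units like the stems alt+ and gammelig+.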
72 Sublexical Features každoroční annual IDkaždoroční_annual PREFIXkaž_ann PREFIXkažd_annu PREFIXkaždo_annua SUFFIXí_l SUFFIXní_al 63
73 Sublexical Features každoroční annually IDkaždoroční_annually PREFIXkaž_ann PREFIXkažd_annu PREFIXkaždo_annua SUFFIXí_y SUFFIXní_ly 64
74 Sublexical Features každoročního annually IDkaždoročního_annually PREFIXkaž_ann PREFIXkažd_annu PREFIXkaždo_annua SUFFIXo_y SUFFIXho_ly 65
75 Sublexical Features každoročního annually IDkaždoročního_annually PREFIXkaž_ann PREFIXkažd_annu PREFIXkaždo_annua SUFFIXo_y SUFFIXho_ly Abstract away from inflectional variation! 66
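The ID/PREFIX/SUFFIX features on these slides can be sketched as a small extractor. The prefix lengths (3-5) and suffix lengths (1-2) are read off the slide's examples; this is an illustration, not the paper's exact feature set:

```python
def sublexical_features(src, tgt):
    """Character-level association features for a source/target word pair."""
    feats = [f"ID{src}_{tgt}"]                       # full lexical identity
    for n in (3, 4, 5):                              # character prefixes
        feats.append(f"PREFIX{src[:n]}_{tgt[:n]}")
    for n in (1, 2):                                 # character suffixes
        feats.append(f"SUFFIX{src[-n:]}_{tgt[-n:]}")
    return feats

print(sublexical_features("každoroční", "annual"))
```

Note how the PREFIX features here are shared with the pair každoročního/annually, which is exactly the abstraction over inflectional variation the slide points out.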
76 Evaluation Given a parallel corpus (no supervised alignments!), we can infer The weights in the log-linear translation model The MAP alignment The model is a translation model, but we evaluate it as applied to alignment 67
77 Alignment Evaluation (AER). Model 4: e→f 24.8, f→e 33.6, sym; DLVM: e→f 21.9, f→e 29.3, sym. Czech-English, 3.1M words training, 525 sentences gold alignments. 68
78 Translation Evaluation (Alignment; BLEU, METEOR, TER, with σ). Model: (σ=0.3); Our model: BLEU 16.5 (σ=0.2); Both: BLEU 17.4 (σ=0.5). Czech-English, WMT 2010 test set, 1 reference 69
79 Outline Introduction: 4 problems Three probabilistic modeling solutions Embracing uncertainty: multi-segmentations for decoding and learning Rich morphology via sparse lexical features Hierarchical Bayesian translation: infinite translation lexicons Conclusion 70
80 Bayesian Translation Addresses problems: 2. Target language inflectional richness. How? 1. Replace multinomials in a lexical translation model with a process that generates target language lexical items by combining stems and suffixes. 2. Fully inflected forms can be generated, but a hierarchical prior backs off to a component-wise generation. 71
81 Chinese Restaurant Process a b a c x... 72
82 Chinese Restaurant Process New customer a b a c x... 73
83 Chinese Restaurant Process: with 7 customers seated at labeled tables (a, b, a, c, x, ...), the next customer joins an existing table in proportion to its count, e.g. 2/(7+α) for a, or opens a new table serving x with probability αP_0(x)/(7+α) 74
84 Chinese Restaurant Process: α is the concentration parameter; P_0(x) is the base distribution 75
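The CRP predictive probability can be sketched as below. This collapses all tables that serve the same word into one count (the slide's picture keeps them separate), and the seat counts and uniform base distribution are illustrative:

```python
def crp_predictive(counts, alpha, p0, x):
    """P(next draw = x) given seat counts, concentration alpha, base p0."""
    n = sum(counts.values())
    return (counts.get(x, 0) + alpha * p0(x)) / (n + alpha)

counts = {"a": 2, "b": 1, "c": 1, "x": 1}  # customers seated so far
p0 = lambda w: 0.25                        # hypothetical uniform base over 4 words
print(crp_predictive(counts, alpha=1.0, p0=p0, x="a"))  # (2 + 0.25) / 6 = 0.375
```

The two terms in the numerator are exactly the slide's two cases: sit at an existing table (count) or open a new one (αP_0(x)).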
85 old altes 0.3 car Wagen 0.2 alte alt alter gammelig Auto PKW gammeliges
86 old altes 0.3 car Wagen 0.2 alte alt alter gammelig Auto PKW gammeliges 0.1 New model: old → inflected forms (alt+e, alt+es, ...) built from stems (alt+, gammelig+, ...) and suffixes (+es, +e, +en, +er, +∅) 77
87 Modeling assumptions: Observed words are formed by an unobserved process that concatenates a stem α and a suffix β, yielding αβ. A source word should have only a few translations αβ, and translate into only a few stems α. The suffix β occurs many times, with many different stems; β may be null; β will have a maximum length of r. Once a word has been translated into some inflected form, that inflected form, its stem, and its suffix should be more likely ("rich get richer"). 78
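These assumptions can be sketched as a CRP over inflected forms whose base distribution factors into a stem CRP times a suffix CRP. The real model uses Pitman-Yor priors with proper table bookkeeping; this simplified sketch (with hypothetical base probabilities) updates every level on each observation, which is enough to show the backoff behavior:

```python
class StemSuffixCRP:
    """Toy 'rich get richer' lexicon: a CRP over words stem+suffix whose
    base measure multiplies a stem CRP and a suffix CRP."""

    def __init__(self, alpha, stem_p0, suffix_p0):
        self.alpha = alpha
        self.stem_p0, self.suffix_p0 = stem_p0, suffix_p0
        self.word_counts, self.stem_counts, self.suffix_counts = {}, {}, {}

    def _crp(self, counts, p0, x):
        n = sum(counts.values())
        return (counts.get(x, 0) + self.alpha * p0(x)) / (n + self.alpha)

    def prob(self, stem, suffix):
        # Base measure for the word-level CRP: P(stem) * P(suffix).
        base = (self._crp(self.stem_counts, self.stem_p0, stem)
                * self._crp(self.suffix_counts, self.suffix_p0, suffix))
        return self._crp(self.word_counts, lambda w: base, stem + suffix)

    def observe(self, stem, suffix):
        for counts, key in ((self.word_counts, stem + suffix),
                            (self.stem_counts, stem),
                            (self.suffix_counts, suffix)):
            counts[key] = counts.get(key, 0) + 1

model = StemSuffixCRP(alpha=1.0, stem_p0=lambda s: 0.01, suffix_p0=lambda s: 0.1)
before = model.prob("alt", "+en")
model.observe("alt", "+e")        # seeing alt+e makes the stem alt more likely...
after = model.prob("alt", "+en")  # ...so the unseen form alt+en gains probability
print(before, after)
```

This is the hierarchical backoff in miniature: an inflected form never observed directly still benefits when its stem or suffix has been seen elsewhere.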
88 f → Translation → e′ → Synthesis → e (X: observed during training; Z: latent variable) 79
89 f → Translation → e′ = stem + suffix → Synthesis → e (X: observed during training; Z: latent variable) 80
90 Task: Translate the word old 81
91 Task: Translate the word old old alt+e alt+ gammelig+ inflected old 82
92 Task: Translate the word old alt old alt+e alt + alt+ gammelig +e gammelig+ inflected old stem old suffix old 83
93 Task: Translate the word old alt + old alt+e alt + alt+ gammelig +e gammelig+ inflected old stem old? 84
94 +en +e +s alt + + +er old alt+e alt + alt+ gammelig +e gammelig+ inflected old stem old? 85
95 +en +e +s alt + en + +er old alt+e alt + alt+ gammelig +e gammelig+ inflected old stem old +en 86
96 Evaluation Given a parallel corpus, we can infer The MAP alignment The MAP segmentation of each target word into <stem+suffix> 87
97 Alignment Evaluation (AER). Model 1 - EM: f→e 43.3; Model 1 - HPYP: f→e 37.5; Model 1 - EM: e→f 38.4; Model 1 - HPYP: e→f 36.6. English-French, 115k words, 447 sentences gold alignments. 88
98 Frequent suffixes (suffix: count): + (null): 20,837; +s: 334; +d: 217; +e: 156; +n: 156; +y: 130; +ed: 121; +ing
99 Assessment: Breaking the lexical independence assumption is computationally costly (the search space is much, much larger!). Dealing only with inflectional morphology simplifies the problems. Sparse priors are crucial for avoiding degenerate solutions. 90
100 In conclusion... 91
101 Why don't we have integrated morphology? 92
102 Why don't we have integrated morphology? Because we spend all our time working on English, which doesn't have much morphology! 93
103 Why don't we have integrated morphology? Translation with words is already hard: an n-word sentence has n! permutations. But if you're looking at a sentence with m letters, there are m! permutations. Search is... considerably harder: m > n, so m! ≫ n!. Modeling is harder too: it must also support all these permutations! 94
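The gap between n! and m! is easy to check numerically; the word and letter counts below are hypothetical for a single sentence:

```python
import math

n, m = 25, 125  # hypothetical word count vs. letter count for one sentence
log_word_perms = math.lgamma(n + 1)    # log(n!), computed stably in log space
log_letter_perms = math.lgamma(m + 1)  # log(m!)
print(log_word_perms, log_letter_perms)  # m! dwarfs n!
```

Working in log space via `lgamma` avoids overflow: 125! has over 200 decimal digits.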
104 Take away messages Morphology matters for MT Probabilistic models are a great fit for the uncertainty involved Breaking the lexical independence assumption is hard 95
105 Thank you! Toda! $kraf!
106 n ∼ Poisson(λ); a_i ∼ Uniform(1/|f|); e_i | f_{a_i}, T_{f_{a_i}} ∼ T_{f_{a_i}}; T_f | a, b, M ∼ PYP(a, b, M(· | f)); M(e = αβ | f) = G_f(α) H_f(β); G_f | a, b, P_0 ∼ PYP(a, b, P_0(·)); H_f | a, b, H_0 ∼ PYP(a, b, H_0(·)); H_0 | a, b, Q_0 ∼ PYP(a, b, Q_0(·)); P_0(α; p) = p^|α| (1−p) |V|^−|α|; Q_0(β; r): uniform over suffixes β with |β| ≤ r 97
More informationMachine Translation Evaluation
Machine Translation Evaluation Sara Stymne 2017-03-29 Partly based on Philipp Koehn s slides for chapter 8 Why Evaluation? How good is a given machine translation system? Which one is the best system for
More informationImproved Decipherment of Homophonic Ciphers
Improved Decipherment of Homophonic Ciphers Malte Nuhn and Julian Schamper and Hermann Ney Human Language Technology and Pattern Recognition Computer Science Department, RWTH Aachen University, Aachen,
More informationA Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions
A Probabilistic Forest-to-String Model for Language Generation from Typed Lambda Calculus Expressions Wei Lu and Hwee Tou Ng National University of Singapore 1/26 The Task (Logical Form) λx 0.state(x 0
More informationMulti-Source Neural Translation
Multi-Source Neural Translation Barret Zoph and Kevin Knight Information Sciences Institute Department of Computer Science University of Southern California {zoph,knight}@isi.edu Abstract We build a multi-source
More informationBayesian Tools for Natural Language Learning. Yee Whye Teh Gatsby Computational Neuroscience Unit UCL
Bayesian Tools for Natural Language Learning Yee Whye Teh Gatsby Computational Neuroscience Unit UCL Bayesian Learning of Probabilistic Models Potential outcomes/observations X. Unobserved latent variables
More informationMidterm sample questions
Midterm sample questions CS 585, Brendan O Connor and David Belanger October 12, 2014 1 Topics on the midterm Language concepts Translation issues: word order, multiword translations Human evaluation Parts
More informationBayesian Networks: Construction, Inference, Learning and Causal Interpretation. Volker Tresp Summer 2016
Bayesian Networks: Construction, Inference, Learning and Causal Interpretation Volker Tresp Summer 2016 1 Introduction So far we were mostly concerned with supervised learning: we predicted one or several
More informationBayesian Nonparametrics for Speech and Signal Processing
Bayesian Nonparametrics for Speech and Signal Processing Michael I. Jordan University of California, Berkeley June 28, 2011 Acknowledgments: Emily Fox, Erik Sudderth, Yee Whye Teh, and Romain Thibaux Computer
More information6.047 / Computational Biology: Genomes, Networks, Evolution Fall 2008
MIT OpenCourseWare http://ocw.mit.edu 6.047 / 6.878 Computational Biology: Genomes, etworks, Evolution Fall 2008 For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms.
More informationFast Collocation-Based Bayesian HMM Word Alignment
Fast Collocation-Based Bayesian HMM Word Alignment Philip Schulz ILLC University of Amsterdam P.Schulz@uva.nl Wilker Aziz ILLC University of Amsterdam W.Aziz@uva.nl Abstract We present a new Bayesian HMM
More informationDT2118 Speech and Speaker Recognition
DT2118 Speech and Speaker Recognition Language Modelling Giampiero Salvi KTH/CSC/TMH giampi@kth.se VT 2015 1 / 56 Outline Introduction Formal Language Theory Stochastic Language Models (SLM) N-gram Language
More informationMachine Translation without Words through Substring Alignment
Machine Translation without Words through Substring Alignment Graham Neubig 1,2,3, Taro Watanabe 2, Shinsuke Mori 1, Tatsuya Kawahara 1 1 2 3 now at 1 Machine Translation Translate a source sentence F
More informationSampling Alignment Structure under a Bayesian Translation Model
Sampling Alignment Structure under a Bayesian Translation Model John DeNero, Alexandre Bouchard-Côté and Dan Klein Computer Science Department University of California, Berkeley {denero, bouchard, klein}@cs.berkeley.edu
More informationstatistical machine translation
statistical machine translation P A R T 3 : D E C O D I N G & E V A L U A T I O N CSC401/2511 Natural Language Computing Spring 2019 Lecture 6 Frank Rudzicz and Chloé Pou-Prom 1 University of Toronto Statistical
More informationMachine Learning for Structured Prediction
Machine Learning for Structured Prediction Grzegorz Chrupa la National Centre for Language Technology School of Computing Dublin City University NCLT Seminar Grzegorz Chrupa la (DCU) Machine Learning for
More informationUnsupervised Rank Aggregation with Distance-Based Models
Unsupervised Rank Aggregation with Distance-Based Models Kevin Small Tufts University Collaborators: Alex Klementiev (Johns Hopkins University) Ivan Titov (Saarland University) Dan Roth (University of
More informationTheory of Alignment Generators and Applications to Statistical Machine Translation
Theory of Alignment Generators and Applications to Statistical Machine Translation Raghavendra Udupa U Hemanta K Mai IBM India Research Laboratory, New Delhi {uraghave, hemantkm}@inibmcom Abstract Viterbi
More informationMultiple System Combination. Jinhua Du CNGL July 23, 2008
Multiple System Combination Jinhua Du CNGL July 23, 2008 Outline Introduction Motivation Current Achievements Combination Strategies Key Techniques System Combination Framework in IA Large-Scale Experiments
More informationWord Alignment via Submodular Maximization over Matroids
Word Alignment via Submodular Maximization over Matroids Hui Lin, Jeff Bilmes University of Washington, Seattle Dept. of Electrical Engineering June 21, 2011 Lin and Bilmes Submodular Word Alignment June
More informationSemi-Supervised Learning with Graphs. Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University
Semi-Supervised Learning with Graphs Xiaojin (Jerry) Zhu School of Computer Science Carnegie Mellon University 1 Semi-supervised Learning classification classifiers need labeled data to train labeled data
More informationMACHINE LEARNING FOR NATURAL LANGUAGE PROCESSING
MACHINE LEARNING FOR NATURAL LANGUAGE PROCESSING Outline Some Sample NLP Task [Noah Smith] Structured Prediction For NLP Structured Prediction Methods Conditional Random Fields Structured Perceptron Discussion
More informationGraphical models for part of speech tagging
Indian Institute of Technology, Bombay and Research Division, India Research Lab Graphical models for part of speech tagging Different Models for POS tagging HMM Maximum Entropy Markov Models Conditional
More informationStructure and Complexity of Grammar-Based Machine Translation
Structure and of Grammar-Based Machine Translation University of Padua, Italy New York, June 9th, 2006 1 2 Synchronous context-free grammars Definitions Computational problems 3 problem SCFG projection
More informationAn Empirical Study on Computing Consensus Translations from Multiple Machine Translation Systems
An Empirical Study on Computing Consensus Translations from Multiple Machine Translation Systems Wolfgang Macherey Google Inc. 1600 Amphitheatre Parkway Mountain View, CA 94043, USA wmach@google.com Franz
More informationCS230: Lecture 10 Sequence models II
CS23: Lecture 1 Sequence models II Today s outline We will learn how to: - Automatically score an NLP model I. BLEU score - Improve Machine II. Beam Search Translation results with Beam search III. Speech
More informationNeural Hidden Markov Model for Machine Translation
Neural Hidden Markov Model for Machine Translation Weiyue Wang, Derui Zhu, Tamer Alkhouli, Zixuan Gan and Hermann Ney {surname}@i6.informatik.rwth-aachen.de July 17th, 2018 Human Language Technology and
More informationFrom perceptrons to word embeddings. Simon Šuster University of Groningen
From perceptrons to word embeddings Simon Šuster University of Groningen Outline A basic computational unit Weighting some input to produce an output: classification Perceptron Classify tweets Written
More informationThe Noisy Channel Model and Markov Models
1/24 The Noisy Channel Model and Markov Models Mark Johnson September 3, 2014 2/24 The big ideas The story so far: machine learning classifiers learn a function that maps a data item X to a label Y handle
More informationLatent Dirichlet Allocation (LDA)
Latent Dirichlet Allocation (LDA) A review of topic modeling and customer interactions application 3/11/2015 1 Agenda Agenda Items 1 What is topic modeling? Intro Text Mining & Pre-Processing Natural Language
More informationNatural Language Processing
SFU NatLangLab Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class Simon Fraser University October 9, 2018 0 Natural Language Processing Anoop Sarkar anoopsarkar.github.io/nlp-class
More informationFast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation
Fast and Scalable Decoding with Language Model Look-Ahead for Phrase-based Statistical Machine Translation Joern Wuebker, Hermann Ney Human Language Technology and Pattern Recognition Group Computer Science
More informationLinear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt)
Linear Models for Classification: Discriminative Learning (Perceptron, SVMs, MaxEnt) Nathan Schneider (some slides borrowed from Chris Dyer) ENLP 12 February 2018 23 Outline Words, probabilities Features,
More informationHidden Markov Models
Hidden Markov Models Slides mostly from Mitch Marcus and Eric Fosler (with lots of modifications). Have you seen HMMs? Have you seen Kalman filters? Have you seen dynamic programming? HMMs are dynamic
More information19 : Bayesian Nonparametrics: The Indian Buffet Process. 1 Latent Variable Models and the Indian Buffet Process
10-708: Probabilistic Graphical Models, Spring 2015 19 : Bayesian Nonparametrics: The Indian Buffet Process Lecturer: Avinava Dubey Scribes: Rishav Das, Adam Brodie, and Hemank Lamba 1 Latent Variable
More informationWord Alignment by Thresholded Two-Dimensional Normalization
Word Alignment by Thresholded Two-Dimensional Normalization Hamidreza Kobdani, Alexander Fraser, Hinrich Schütze Institute for Natural Language Processing University of Stuttgart Germany {kobdani,fraser}@ims.uni-stuttgart.de
More information