Learning to translate with neural networks. Michael Auli

1 Learning to translate with neural networks Michael Auli

2 Neural networks for text processing Similar words near each other France Spain dog cat

3 Neural networks for text processing Similar words near each other Changing model parameters for one example affects similar words in similar contexts dog cat France Spain

4 Neural networks for text processing Similar words near each other Changing model parameters for one example affects similar words in similar contexts Traditional discrete models treat each word separately dog cat France Spain

5 Neural networks Error: 26% → 15% (Krizhevsky 2012) Error: 27% → 18% (Hinton 2012) Language Modeling: PPLX (Mikolov 2011)

6 Neural networks Error: 26% → 15% (Krizhevsky 2012) Error: 27% → 18% (Hinton 2012) Language Modeling: PPLX (Mikolov 2011) Machine Translation: this talk; Le (2012), Kalchbrenner (2013), Devlin (2014), Sutskever (2014), Cho (2014)

7 What happened in MT over the past 10 years?

8 What happened in MT over the past 10 years? Learning simple models from large bi-texts is a solved problem! (Lopez & Post, 2013)

10 Machine translation development and progress of the region.

11 Machine translation Translation modeling development and progress of the region.

12 Machine translation Translation modeling Language modeling development and progress of the region.

13 Machine translation Translation modeling Language modeling development and progress of the region. Optimization

14 Machine translation Translation modeling Language modeling development and progress of the region. Optimization Reordering

15 Machine translation Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

16 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region.

19 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region. of the region development and progress

20 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region. of the region development and progress [Histogram: phrase lengths used, train vs. test]

22 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region. of the region development and progress [Histogram: phrase lengths used, train vs. test] Average phrase length in train: 2.3 words, in test: 1.5 words

23 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) =

24 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) = Train data: development and progress of the region. in

25 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) = p(progress) · p(in) · p(the) · p(region | the) Train data: development and progress of the region. in [Histogram: n-gram order used; does not include out-of-vocabulary tokens]

26 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) = p(progress) · p(in) · p(the) · p(region | the) Train data: development and progress of the region. in [Histogram: n-gram order used; does not include out-of-vocabulary tokens] Average: 2.7 words
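
To make the sparsity point concrete, here is a minimal count-based bigram language model with interpolated back-off to unigrams, scored on the running example. The toy corpus, add-k smoothing, and interpolation weight are illustrative stand-ins, not the Kneser-Ney estimator cited on the slide.

```python
from collections import Counter

# Toy corpus standing in for the training data shown on the slide.
corpus = "development and progress of the region . in".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_unigram(w, k=0.1, vocab_size=1000):
    # Add-k smoothed unigram probability (illustrative, not Kneser-Ney).
    return (unigrams[w] + k) / (total + k * vocab_size)

def p_bigram(w, prev, lam=0.7):
    # Interpolate a maximum-likelihood bigram estimate with the unigram model.
    ml = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * ml + (1 - lam) * p_unigram(w)

sentence = "progress in the region".split()
p = p_unigram(sentence[0])
for prev, w in zip(sentence, sentence[1:]):
    p *= p_bigram(w, prev)
print(p)  # most factors back off to sparse, low-order statistics
```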

27 How can we improve this? Or: how to capture relationships beyond 1.5 to 2.7 words? Neural nets: distributional representations make it easier to capture relationships Recurrent nets: easy to model variable-length sequences dog cat France Spain

28 This talk Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

30 e.g. bigram LM Feed-forward network

31 Feed-forward network e.g. bigram LM and

32 Feed-forward network e.g. bigram LM h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

33 Feed-forward network e.g. bigram LM y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

34 Feed-forward network e.g. bigram LM p(progress | and) y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

35 Feed-forward network e.g. bigram LM Classification: p(progress | and), y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) Feature learning: h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

36 Feed-forward network e.g. bigram LM Classification: p(progress | and), y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) Still based on limited context! Feature learning: h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and
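
A minimal sketch of the feed-forward bigram LM above, assuming a one-hot input for the previous word, a sigmoid hidden layer h_t = σ(U x_t) (feature learning) and a softmax output y_t = s(V h_t) (classification). The tiny vocabulary, hidden size, and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "development", "and", "progress", "of", "the", "region"]
V_size, H = len(vocab), 8

U = rng.normal(scale=0.1, size=(H, V_size))   # input -> hidden
V = rng.normal(scale=0.1, size=(V_size, H))   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_next(prev_word):
    # Feature learning: h_t = sigmoid(U x_t) from a one-hot previous word.
    x = np.zeros(V_size)
    x[vocab.index(prev_word)] = 1.0
    h = sigmoid(U @ x)
    # Classification: y_t = softmax(V h_t), a distribution over the next word.
    return softmax(V @ h)

y = p_next("and")
print(dict(zip(vocab, y.round(3))))  # e.g. p(progress | and)
```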

37 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t) and

38 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t) Dependence on previous time step and

39 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t + W h_{t-1}) and

40 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t + W h_{t-1}) and

41 Recurrent network <s>

42 Recurrent network <s> p(development | <s>)

43 Recurrent network <s> development

45 Recurrent network <s> development p(and | development, <s>)

46 Recurrent network <s> development and

48 Recurrent network <s> development and p(progress | development, and, <s>)

49 Recurrent network <s> development and progress

50 Recurrent network History of inputs up to current time-step <s> development and progress

51 Recurrent network History of inputs up to current time-step <s> development and progress State of the art in language modeling (Mikolov 2011) More accurate than feed-forward nets (Sundermeyer 2013)
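
A sketch of the recurrent LM's forward pass, where the hidden state carries the history of inputs: h_t = σ(U x_t + W h_{t-1}), y_t = softmax(V h_t). Vocabulary, sizes, and random weights are placeholders and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "development", "and", "progress", "of", "the", "region"]
V_size, H = len(vocab), 8

U = rng.normal(scale=0.1, size=(H, V_size))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(V_size, H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(word):
    x = np.zeros(V_size)
    x[vocab.index(word)] = 1.0
    return x

h = np.zeros(H)
log_prob = 0.0
sentence = ["<s>", "development", "and", "progress"]
for prev, nxt in zip(sentence, sentence[1:]):
    # h_t depends on the current input and the previous time step.
    h = sigmoid(U @ one_hot(prev) + W @ h)
    y = softmax(V @ h)
    log_prob += np.log(y[vocab.index(nxt)])
print(log_prob)  # log p(development and progress | <s>) under the toy model
```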

52 Combined language and translation model Auli et al., EMNLP 2013

55 Combined language and translation model Auli et al., EMNLP 2013 s: entire source sentence representation

56 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s>

57 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s> development

59 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s> development and

61 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s> development and progress

62 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s)

63 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window

64 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s>

65 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development

69 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development and

72 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development and progress

73 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development and progress Le (2012), Kalchbrenner (2013), Devlin (2014), Cho (2014), Sutskever (2014)
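
A sketch of the combined model's recurrence h_t = σ(U x_t + W h_{t-1} + F s). Here s is a simple bag-of-words vector over either the entire source sentence or a source word window; the actual source representation used in the paper may differ, and all dimensions and weights below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab = ["developpement", "et", "progres", "de", "la", "region"]
S, H, E = len(src_vocab), 8, 6

U = rng.normal(scale=0.1, size=(H, E))   # target input embedding -> hidden
W = rng.normal(scale=0.1, size=(H, H))   # recurrent weights
F = rng.normal(scale=0.1, size=(H, S))   # source representation -> hidden

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def source_rep(words, window=None):
    # Full-sentence variant: sum one-hot vectors for all source words;
    # word-window variant: only the words inside the window.
    s = np.zeros(S)
    for w in (window if window is not None else words):
        s[src_vocab.index(w)] += 1.0
    return s

s = source_rep(src_vocab)                                    # entire source sentence
# s = source_rep(src_vocab, window=["developpement", "et"])  # word-window variant
x_t = rng.normal(size=E)     # stand-in embedding of the previous target word
h_prev = np.zeros(H)
h_t = sigmoid(U @ x_t + W @ h_prev + F @ s)
print(h_t.round(3))
```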

74 Experimental setup: WMT 2012 French-English translation task; data: 100M words; baseline: phrase-based model similar to Moses; rescoring; mini-batch gradient descent; class-structured output layer (Goodman, 1996)

75 Does the neural model learn to translate? (without the discrete translation model) WMT 2012 French-English, 100M words, phrase-based baseline, n-best rescoring

76 Does the neural model learn to translate? (without the discrete translation model) [Bar chart: BLEU for phrase-based (Koehn et al. 2003), RNNLM (Mikolov 2011), RNNTM full-sentence, RNNTM word-window] WMT 2012 French-English, 100M words, phrase-based baseline, n-best rescoring

78 Improving a phrase-based baseline (with the discrete translation model) [Bar chart: BLEU for phrase-based (Koehn et al. 2003), RNNLM (Mikolov 2011), RNNTM word-window] WMT 2012 French-English, 100M words, phrase-based baseline, lattice rescoring

80 Combining recurrent nets with discrete models Auli & Gao, ACL 2014

81 Combining recurrent nets with discrete models Auli & Gao, ACL 2014 [Bar chart: BLEU for the phrase-based baseline vs. n-best rescoring, lattice rescoring, and decoder integration]

83 Neural nets vs discrete models [Histogram: n-gram order used by the discrete model; average: 2.7 words]

84 Neural nets vs discrete models [Histogram: n-gram order used; average: 2.7 words] Neural models more robust when discrete models rely on sparse estimates?

85 Neural nets vs discrete models Neural models more robust when discrete models rely on sparse estimates? Language modeling: n-gram LM and neural net LM as two components of a log-linear model of translation: ê = argmax_e Σ_i λ_i h_i(f, e), with h_1(f, e) = log p_lm(e) and h_2(f, e) = log p_rnn(e)

86 Neural nets vs discrete models Neural models more robust when discrete models rely on sparse estimates? ê = argmax_e Σ_i λ_i h_i(f, e) Split each model into five features, one for each n-gram order, s.t. log p_lm(e) = Σ_{i=1..5} h_i(f, e) and log p_rnn(e) = Σ_{i=6..10} h_i(f, e)

87 Neural nets vs discrete models Neural models more robust when discrete models rely on sparse estimates? ê = argmax_e Σ_i λ_i h_i(f, e) Split each model into five features, one for each n-gram order, s.t. log p_lm(e) = Σ_{i=1..5} h_i(f, e) and log p_rnn(e) = Σ_{i=6..10} h_i(f, e) Standard optimizer (MERT) to find weights for each n-gram order
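
A hedged sketch of the feature split: each word's log-probability is credited to a feature indexed by the n-gram order the model actually used for it, so the per-order features sum to log p_lm(e) and MERT can weight each order separately. The lm_logprob_and_order callback and the toy LM below are stand-ins for querying a real backoff LM.

```python
import math
from collections import defaultdict

def split_by_order(words, lm_logprob_and_order, num_orders=5):
    """Return h_1..h_num_orders such that their sum equals log p_lm(e).

    lm_logprob_and_order(history, word) -> (log prob, n-gram order used),
    a stand-in for querying a backoff n-gram language model.
    """
    h = defaultdict(float)
    for i, w in enumerate(words):
        logp, order = lm_logprob_and_order(words[:i], w)
        h[min(order, num_orders)] += logp
    return [h[o] for o in range(1, num_orders + 1)]

# Toy stand-in LM: pretends short histories force a back-off to low orders.
def toy_lm(history, word):
    order = min(len(history) + 1, 3)           # pretend we never matched above trigrams
    return math.log(0.1 / order), order        # made-up probability

feats = split_by_order("development and progress of the region".split(), toy_lm)
print(feats, "sum =", round(sum(feats), 4))    # one feature per n-gram order
```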

88 Neural nets vs discrete models [Histogram: n-gram order used; average: 2.7 words] Optimizer: MERT

89 Neural nets vs discrete models [Histogram: n-gram order used; average: 2.7 words] [Plot: feature weight by n-gram order] Optimizer: MERT

90 Neural nets vs discrete models [Plot: feature weight by n-gram order, from less important to more important] Optimizer: MERT

91 Neural nets vs discrete models [Plot: feature weight by n-gram order for the n-gram and RNN features] Optimizer: MERT

93 Neural nets vs discrete models [Plot: feature weight by n-gram order for the n-gram and RNN features] n-gram model relies on unigram counts Optimizer: MERT

94 Neural nets vs discrete models [Plot: feature weight by n-gram order for the n-gram and RNN features] n-gram model relies on unigram counts n-gram model has higher-order statistics Optimizer: MERT

95 Recurrent Minimum Translation Unit Models development and progress of the region

96 Recurrent Minimum Translation Unit Models M1 M2 M3 M4 M5 M6 development and progress of the region

97 Recurrent Minimum Translation Unit Models M1 M2 M3 M4 M5 M6 development and progress of the region M1 M2 M3 the region of

98 Recurrent Minimum Translation Unit Models M1 M2 M3 M4 M5 M6 development and progress of the region M1 M2 M3 the region of n-gram models over MTUs: p(M1) p(M2 | M1) p(M3 | M1, M2) Banchs et al. (2005), Quirk & Menezes (2006)

99 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s>

100 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s> development

102 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s> development and

104 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s> development and progress

105 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 Singletons <s> development and progress

106 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s>

109 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s> development

112 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s> development and

116 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s> development and progress
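
A sketch of the bag-of-words idea for MTUs: rather than one embedding per (often singleton) unit, an MTU is represented by averaging the embeddings of its source and target words, so unseen units still receive a sensible vector. Vocabulary and embeddings below are random placeholders, not the model of Hu et al.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["development", "and", "progress", "of", "the", "region",
         "developpement", "et", "progres", "de", "la"]
emb = {w: rng.normal(scale=0.1, size=16) for w in vocab}

def mtu_vector(src_words, tgt_words):
    # Bag-of-words representation: average the embeddings of all words in the
    # unit, so even a rare MTU gets a meaningful vector from its parts.
    words = list(src_words) + list(tgt_words)
    return np.mean([emb[w] for w in words], axis=0)

m1 = mtu_vector(["developpement"], ["development"])
m2 = mtu_vector(["de", "la"], ["of", "the"])      # multi-word unit
print(m1.shape, m2.shape)  # both are fixed-size inputs for the recurrent model
```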

117 Recurrent Minimum Translation Unit Models

118 Recurrent Minimum Translation Unit Models [Bar chart: BLEU for phrase-based (Koehn et al. 2003), n-gram MTU (Banchs et al. 2005; Quirk & Menezes 2006), and RNN MTU models]

120 This talk Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

121 Back propagation with cross entropy error Model distribution (y) delayed

122 Back propagation with cross entropy error True distribution (t) Model distribution (y) delayed

123 Back propagation with cross entropy error True distribution (t) Model distribution (y) J = -Σ_i t_i log y_i Goal: make correct outputs most likely

124 Back propagation with cross entropy error True distribution (t) Error vector (δ): δ_i = y_i - t_i Model distribution (y) J = -Σ_i t_i log y_i Goal: make correct outputs most likely
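
A quick numerical check of the error vector on the slide: for a softmax output with cross-entropy error J = -Σ_i t_i log y_i, the gradient with respect to the pre-softmax scores is δ = y - t. The scores and target below are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0, 0.0])      # arbitrary output-layer scores
t = np.array([0.0, 0.0, 1.0, 0.0])       # true distribution (one-hot target)
y = softmax(z)

J = -np.sum(t * np.log(y))               # cross-entropy error
delta = y - t                            # analytic gradient dJ/dz

# Finite-difference check of one coordinate.
eps = 1e-6
z2 = z.copy()
z2[0] += eps
J2 = -np.sum(t * np.log(softmax(z2)))
print(delta[0], (J2 - J) / eps)          # the two numbers should agree closely
```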

127 Back propagation through time <s> development and
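
A compact sketch of back-propagation through time for a toy RNN LM: run the forward pass over the sequence, then push the δ = y - t errors backwards through the unrolled time steps, accumulating gradients for U, W, and V. Shapes, data, and the single-sentence "corpus" are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "development", "and", "progress"]
V_size, H = len(vocab), 8
U = rng.normal(scale=0.1, size=(H, V_size))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(V_size, H))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
one_hot = lambda i: np.eye(V_size)[i]

ids = [vocab.index(w) for w in ["<s>", "development", "and", "progress"]]

# Forward pass, keeping everything needed for the backward pass.
hs, ys, xs = [np.zeros(H)], [], []
for i in ids[:-1]:
    x = one_hot(i)
    h = sigmoid(U @ x + W @ hs[-1])
    hs.append(h); xs.append(x); ys.append(softmax(V @ h))

# Backward pass through time.
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
dh_next = np.zeros(H)
for t in reversed(range(len(ys))):
    delta = ys[t] - one_hot(ids[t + 1])          # softmax + cross-entropy error
    dV += np.outer(delta, hs[t + 1])
    dh = V.T @ delta + dh_next                   # error from the output and from step t+1
    dz = dh * hs[t + 1] * (1 - hs[t + 1])        # through the sigmoid
    dU += np.outer(dz, xs[t])
    dW += np.outer(dz, hs[t])                    # hs[t] is h_{t-1} for this step
    dh_next = W.T @ dz
print(np.abs(dW).sum())
```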

131 Optimization Likelihood training very common Auli & Gao, ACL 2014 Optimizing for evaluation metrics difficult, but empirically successful (Och 2003, Smith 2006, Chiang 2009, Gimpel 2010, Hopkins 2011)

132 Optimization Likelihood training very common Auli & Gao, ACL 2014 Optimizing for evaluation metrics difficult, but empirically successful (Och 2003, Smith 2006, Chiang 2009, Gimpel 2010, Hopkins 2011) Next: Task-specific training of neural nets for translation

133 BLEU Metric (Bilingual Evaluation Understudy; Papineni 2002)

134 BLEU Metric (Bilingual Evaluation Understudy; Papineni 2002) BLEU = BP · exp(Σ_n w_n log p_n): n-gram precision scores p_n combined with a brevity penalty BP

135 BLEU Metric (Bilingual Evaluation Understudy; Papineni 2002) BLEU = BP · exp(Σ_n w_n log p_n): n-gram precision scores p_n combined with a brevity penalty BP Human: development and progress of the region System: advance and progress of region
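
A hedged sketch of sentence-level BLEU for the example above: clipped n-gram precisions combined geometrically and multiplied by the brevity penalty. Real implementations differ in how they smooth zero counts; simple add-one smoothing stands in here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped (modified) counts
        total = max(sum(h.values()), 1)
        log_prec += math.log((match + 1) / (total + 1))   # add-one smoothing (illustrative)
    log_prec /= max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

print(sentence_bleu("advance and progress of region",
                    "development and progress of the region"))
```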

139 Expected BLEU Training (Smith 2006, He 2012, Gao 2014) Likelihood: max_θ p(ẽ | f; θ)

140 Expected BLEU Training (Smith 2006, He 2012, Gao 2014) Likelihood: max_θ p(ẽ | f; θ), where ẽ is the desired translation output

141 Expected BLEU Training (Smith 2006, He 2012, Gao 2014) Likelihood: max_θ p(ẽ | f; θ) xBLEU: max_θ Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ), where E(f) are generated outputs (e.g. n-best list or lattice), sBLEU is the gain function, and p(e | f; θ) the model probability

142 Expected BLEU Training Human: development and progress of the region

143 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region

144 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU

145 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU, p_t(e | f; θ)

148 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU, p_t(e | f; θ), xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ)

150 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU, p_t(e | f; θ) → p_{t+1}(e | f; θ), xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ)
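
A sketch of the expected BLEU objective over a tiny n-best list, xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ), with the model probabilities given directly as toy numbers rather than produced by the RNN; the inline sbleu is the same smoothed sentence-BLEU idea as the sketch above.

```python
import math
from collections import Counter

def sbleu(hyp, ref, max_n=4):
    # Smoothed sentence-level BLEU used as the gain function (illustrative).
    hyp, ref = hyp.split(), ref.split()
    score = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())
        score += math.log((match + 1) / (max(sum(h.values()), 1) + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(score)

reference = "development and progress of the region"
nbest = [
    ("advance and progress of the region", 0.5),         # (candidate e, toy p(e | f; theta))
    ("development and progress of this province", 0.3),
    ("progress of this region", 0.2),
]

# xBLEU = sum over generated outputs of sBLEU(e, reference) * p(e | f; theta);
# training adjusts theta so that p_{t+1} puts more mass on high-sBLEU candidates.
xbleu = sum(sbleu(e, reference) * p for e, p in nbest)
print(round(xbleu, 4))
```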

153 Results [Bar chart: BLEU for phrase-based (Koehn et al. 2003), RNN trained with cross entropy, and RNN + xBLEU training]

155 Scaling linear reordering models xBLEU training of millions of linear features Auli et al., EMNLP 2014

156 Scaling linear reordering models xBLEU training of millions of linear features Auli et al., EMNLP 2014 [Charts: +0.7 BLEU over the baseline on news2011; BLEU vs. number of features (4.5K to 3M) and vs. training set size (2K to 272K)]

157 My other neural network projects: social media response generation with RNNs (building neural net-based conversational agents from Twitter conversations); semi-supervised phrase table expansion with word embeddings (using distributional word and phrase representations and mapping between distributional source and target spaces with RBVs); CCG parsing & tagging with RNNs

158 Semantic CCG Parsing Zettlemoyer (2005, 2007), Bos (2008), Kwiatkowski (2010, 2013) Krishnamurthy (2012), Lewis (2013a,b) Marcel proved completeness NP : marcel (S\NP)/NP : λy.λx.proved(x,y) NP : completeness S\NP : λx.proved(x, completeness) S : proved(marcel, completeness) Combinatory Categorial Grammar (CCG; Steedman 2000)

159 How is this useful? Marcel proved completeness NP : marcel (S\NP)/NP : NP : λy.λx.proved(x,y) completeness S\NP : λx.proved(x, completeness) Answer S : proved(marcel, completeness) User query (Semantic) parsing Knowledge Base

160 How is this useful? Parse failures and lexical ambiguity are a major source of errors in semantic parsing (Kwiatkowski 2013). Marcel proved completeness NP : marcel (S\NP)/NP : λy.λx.proved(x,y) NP : completeness S\NP : λx.proved(x, completeness) S : proved(marcel, completeness) Answer User query (Semantic) parsing Knowledge Base

161 Integrated Parsing & Tagging with belief propagation, dual decomposition and softmax-margin training Auli & Lopez ACL 2011a,b; Auli & Lopez EMNLP 2011 Parsing factor (inside-outside) and supertagging factor (forward-backward): p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j} Marcel proved completeness

162 Integrated Parsing & Tagging Auli & Lopez EMNLP 2011 F-measure loss for the parsing sub-model (+DecF1), Hamming loss for the supertagging sub-model (+Tagger), belief propagation for inference; best CCG parsing results to date [Bar chart: labelled F-measure for C&C 07, Petrov-I5, Xu 14, Belief Propagation, +F1 Loss; Fowler & Penn (2010)]

168 Recurrent nets for CCG supertagging & parsing... with Wenduan Xu Marcel NP proved S\NP/NP

169 Recurrent nets for CCG supertagging & parsing... with Wenduan Xu Marcel NP proved S\NP/NP [Table: tagging and parsing results for CRF (Clark & Curran 07), FFN (Lewis & Steedman, 14), and RNN]

170 Summary Two RNN translation models Neural nets help most when discrete models are sparse Task-specific objective gives the best performance Next: better modeling of the source side, e.g. bi-directional RNNs, different architectures
