Learning to translate with neural networks. Michael Auli

1 Learning to translate with neural networks Michael Auli

2 Neural networks for text processing Similar words near each other France Spain dog cat

3 Neural networks for text processing Similar words near each other Changing model parameters for one example affects similar words in similar contexts dog cat France Spain

4 Neural networks for text processing Similar words near each other Changing model parameters for one example affects similar words in similar contexts Traditional discrete models treat each word separately dog cat France Spain

5 Neural networks Error: 26% → 15% (Krizhevsky 2012) Error: 27% → 18% (Hinton 2012) Language Modeling: PPLX (Mikolov 2011)

6 Neural networks Error: 26% → 15% (Krizhevsky 2012) Error: 27% → 18% (Hinton 2012) Language Modeling: PPLX (Mikolov 2011) Machine Translation: this talk; Le (2012), Kalchbrenner (2013), Devlin (2014), Sutskever (2014), Cho (2014)

7 What happened in MT over the past 10 years?

8 What happened in MT over the past 10 years? Learning simple models from large bi-texts is a solved problem! (Lopez & Post, 2013)

10 Machine translation development and progress of the region.

11 Machine translation Translation modeling development and progress of the region.

12 Machine translation Translation modeling Language modeling development and progress of the region.

13 Machine translation Translation modeling Language modeling development and progress of the region. Optimization

14 Machine translation Translation modeling Language modeling development and progress of the region. Optimization Reordering

15 Machine translation Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

16 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region.

19 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region. of the region development and progress

20 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region. of the region development and progress [Histogram: phrase lengths used, train vs. test]

22 Discrete phrase-based translation Koehn et al. (2003) development and progress of the region. of the region development and progress [Histogram: phrase lengths used, train vs. test] Average phrase length in train: 2.3 words, in test: 1.5 words

23 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) =

24 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) = Train data: development and progress of the region. in

25 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) = p(progress) · p(in) · p(the) · p(region | the) Train data: development and progress of the region. in [Histogram: n-gram order used; does not include out-of-vocabulary tokens]

26 Discrete n-gram language modeling Kneser & Ney (1996) p(progress in the region) = p(progress) · p(in) · p(the) · p(region | the) Train data: development and progress of the region. in [Histogram: n-gram order used; does not include out-of-vocabulary tokens] Average: 2.7 words
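
To make the sparsity point concrete, here is a minimal count-based bigram language model with interpolated back-off to unigrams, scored on the running example. The toy corpus, add-k smoothing, and interpolation weight are illustrative stand-ins, not the Kneser-Ney estimator cited on the slide.

```python
from collections import Counter

# Toy corpus standing in for the training data shown on the slide.
corpus = "development and progress of the region . in".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
total = sum(unigrams.values())

def p_unigram(w, k=0.1, vocab_size=1000):
    # Add-k smoothed unigram probability (illustrative, not Kneser-Ney).
    return (unigrams[w] + k) / (total + k * vocab_size)

def p_bigram(w, prev, lam=0.7):
    # Interpolate a maximum-likelihood bigram estimate with the unigram model.
    ml = bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0
    return lam * ml + (1 - lam) * p_unigram(w)

sentence = "progress in the region".split()
p = p_unigram(sentence[0])
for prev, w in zip(sentence, sentence[1:]):
    p *= p_bigram(w, prev)
print(p)  # most factors back off to sparse, low-order statistics
```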

27 How can we improve this? Or: how to capture relationships beyond 1.5 to 2.7 words? Neural nets: distributional representations make it easier to capture relationships Recurrent nets: easy to model variable-length sequences dog cat France Spain

28 This talk Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

30 e.g. bigram LM Feed-forward network

31 Feed-forward network e.g. bigram LM and

32 Feed-forward network e.g. bigram LM h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

33 Feed-forward network e.g. bigram LM y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

34 Feed-forward network e.g. bigram LM p(progress | and) y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

35 Feed-forward network e.g. bigram LM Classification: p(progress | and), y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) Feature learning: h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and

36 Feed-forward network e.g. bigram LM Classification: p(progress | and), y_t = s(V h_t), s(z)_i = exp(z_i) / Σ_j exp(z_j) Still based on limited context! Feature learning: h_t = σ(U x_t), σ(z) = 1 / (1 + exp(-z)) and
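
A minimal sketch of the feed-forward bigram LM above, assuming a one-hot input for the previous word, a sigmoid hidden layer h_t = σ(U x_t) (feature learning) and a softmax output y_t = s(V h_t) (classification). The tiny vocabulary, hidden size, and random weights are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "development", "and", "progress", "of", "the", "region"]
V_size, H = len(vocab), 8

U = rng.normal(scale=0.1, size=(H, V_size))   # input -> hidden
V = rng.normal(scale=0.1, size=(V_size, H))   # hidden -> output

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def p_next(prev_word):
    # Feature learning: h_t = sigmoid(U x_t) from a one-hot previous word.
    x = np.zeros(V_size)
    x[vocab.index(prev_word)] = 1.0
    h = sigmoid(U @ x)
    # Classification: y_t = softmax(V h_t), a distribution over the next word.
    return softmax(V @ h)

y = p_next("and")
print(dict(zip(vocab, y.round(3))))  # e.g. p(progress | and)
```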

37 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t) and

38 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t) Dependence on previous time step and

39 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t + W h_{t-1}) and

40 Recurrent network e.g. bigram LM p(progress | and) h_t = σ(U x_t + W h_{t-1}) and

41 Recurrent network <s>

42 Recurrent network <s> p(development | <s>)

43 Recurrent network <s> development

45 Recurrent network <s> development p(and | development, <s>)

46 Recurrent network <s> development and

48 Recurrent network <s> development and p(progress | development, and, <s>)

49 Recurrent network <s> development and progress

50 Recurrent network History of inputs up to current time-step <s> development and progress

51 Recurrent network History of inputs up to current time-step <s> development and progress State of the art in language modeling (Mikolov 2011) More accurate than feed-forward nets (Sundermeyer 2013)
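
A sketch of the recurrent LM's forward pass, where the hidden state carries the history of inputs: h_t = σ(U x_t + W h_{t-1}), y_t = softmax(V h_t). Vocabulary, sizes, and random weights are placeholders and untrained.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "development", "and", "progress", "of", "the", "region"]
V_size, H = len(vocab), 8

U = rng.normal(scale=0.1, size=(H, V_size))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(V_size, H))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def one_hot(word):
    x = np.zeros(V_size)
    x[vocab.index(word)] = 1.0
    return x

h = np.zeros(H)
log_prob = 0.0
sentence = ["<s>", "development", "and", "progress"]
for prev, nxt in zip(sentence, sentence[1:]):
    # h_t depends on the current input and the previous time step.
    h = sigmoid(U @ one_hot(prev) + W @ h)
    y = softmax(V @ h)
    log_prob += np.log(y[vocab.index(nxt)])
print(log_prob)  # log p(development and progress | <s>) under the toy model
```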

52 Combined language and translation model Auli et al., EMNLP 2013

55 Combined language and translation model Auli et al., EMNLP 2013 s: entire source sentence representation

56 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s>

57 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s> development

59 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s> development and

61 Combined language and translation model Auli et al., EMNLP 2013 h_t = σ(U x_t + W h_{t-1} + F s) s: entire source sentence representation <s> development and progress

62 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s)

63 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window

64 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s>

65 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development

69 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development and

72 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development and progress

73 Combined language and translation model h_t = σ(U x_t + W h_{t-1} + F s) Source word-window <s> development and progress Le (2012), Kalchbrenner (2013), Devlin (2014), Cho (2014), Sutskever (2014)
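
A sketch of the combined model's recurrence h_t = σ(U x_t + W h_{t-1} + F s). Here s is a simple bag-of-words vector over either the entire source sentence or a source word window; the actual source representation used in the paper may differ, and all dimensions and weights below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
src_vocab = ["developpement", "et", "progres", "de", "la", "region"]
S, H, E = len(src_vocab), 8, 6

U = rng.normal(scale=0.1, size=(H, E))   # target input embedding -> hidden
W = rng.normal(scale=0.1, size=(H, H))   # recurrent weights
F = rng.normal(scale=0.1, size=(H, S))   # source representation -> hidden

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def source_rep(words, window=None):
    # Full-sentence variant: sum one-hot vectors for all source words;
    # word-window variant: only the words inside the window.
    s = np.zeros(S)
    for w in (window if window is not None else words):
        s[src_vocab.index(w)] += 1.0
    return s

s = source_rep(src_vocab)                                    # entire source sentence
# s = source_rep(src_vocab, window=["developpement", "et"])  # word-window variant
x_t = rng.normal(size=E)     # stand-in embedding of the previous target word
h_prev = np.zeros(H)
h_t = sigmoid(U @ x_t + W @ h_prev + F @ s)
print(h_t.round(3))
```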

74 Experimental setup: WMT 2012 French-English translation task; data: 100M words; baseline: phrase-based model similar to Moses; rescoring; mini-batch gradient descent; class-structured output layer (Goodman, 1996)

75 Does the neural model learn to translate? (without the discrete translation model) WMT 2012 French-English, 100M words, phrase-based baseline, n-best rescoring

76 Does the neural model learn to translate? (without the discrete translation model) [Bar chart: BLEU for phrase-based (Koehn et al. 2003), RNNLM (Mikolov 2011), RNNTM full-sentence, RNNTM word-window] WMT 2012 French-English, 100M words, phrase-based baseline, n-best rescoring

78 Improving a phrase-based baseline (with the discrete translation model) [Bar chart: BLEU for phrase-based (Koehn et al. 2003), RNNLM (Mikolov 2011), RNNTM word-window] WMT 2012 French-English, 100M words, phrase-based baseline, lattice rescoring

80 Combining recurrent nets with discrete models Auli & Gao, ACL 2014

81 Combining recurrent nets with discrete models Auli & Gao, ACL 2014 [Bar chart: BLEU for the phrase-based baseline vs. n-best rescoring, lattice rescoring, and decoder integration]

83 Neural nets vs discrete models [Histogram: n-gram order used by the discrete model; average: 2.7 words]

84 Neural nets vs discrete models [Histogram: n-gram order used; average: 2.7 words] Neural models more robust when discrete models rely on sparse estimates?

85 Neural nets vs discrete models Neural models more robust when discrete models rely on sparse estimates? Language modeling: n-gram LM and neural net LM as two components of a log-linear model of translation: ê = argmax_e Σ_i λ_i h_i(f, e), with h_1(f, e) = log p_lm(e) and h_2(f, e) = log p_rnn(e)

86 Neural nets vs discrete models Neural models more robust when discrete models rely on sparse estimates? ê = argmax_e Σ_i λ_i h_i(f, e) Split each model into five features, one for each n-gram order, s.t. log p_lm(e) = Σ_{i=1..5} h_i(f, e) and log p_rnn(e) = Σ_{i=6..10} h_i(f, e)

87 Neural nets vs discrete models Neural models more robust when discrete models rely on sparse estimates? ê = argmax_e Σ_i λ_i h_i(f, e) Split each model into five features, one for each n-gram order, s.t. log p_lm(e) = Σ_{i=1..5} h_i(f, e) and log p_rnn(e) = Σ_{i=6..10} h_i(f, e) Standard optimizer (MERT) to find weights for each n-gram order
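
A hedged sketch of the feature split: each word's log-probability is credited to a feature indexed by the n-gram order the model actually used for it, so the per-order features sum to log p_lm(e) and MERT can weight each order separately. The lm_logprob_and_order callback and the toy LM below are stand-ins for querying a real backoff LM.

```python
import math
from collections import defaultdict

def split_by_order(words, lm_logprob_and_order, num_orders=5):
    """Return h_1..h_num_orders such that their sum equals log p_lm(e).

    lm_logprob_and_order(history, word) -> (log prob, n-gram order used),
    a stand-in for querying a backoff n-gram language model.
    """
    h = defaultdict(float)
    for i, w in enumerate(words):
        logp, order = lm_logprob_and_order(words[:i], w)
        h[min(order, num_orders)] += logp
    return [h[o] for o in range(1, num_orders + 1)]

# Toy stand-in LM: pretends short histories force a back-off to low orders.
def toy_lm(history, word):
    order = min(len(history) + 1, 3)           # pretend we never matched above trigrams
    return math.log(0.1 / order), order        # made-up probability

feats = split_by_order("development and progress of the region".split(), toy_lm)
print(feats, "sum =", round(sum(feats), 4))    # one feature per n-gram order
```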

88 Neural nets vs discrete models [Histogram: n-gram order used; average: 2.7 words] Optimizer: MERT

89 Neural nets vs discrete models [Histogram: n-gram order used; average: 2.7 words] [Plot: feature weight by n-gram order] Optimizer: MERT

90 Neural nets vs discrete models [Plot: feature weight by n-gram order, from less important to more important] Optimizer: MERT

91 Neural nets vs discrete models [Plot: feature weight by n-gram order for the n-gram and RNN features] Optimizer: MERT

93 Neural nets vs discrete models [Plot: feature weight by n-gram order for the n-gram and RNN features] n-gram model relies on unigram counts Optimizer: MERT

94 Neural nets vs discrete models [Plot: feature weight by n-gram order for the n-gram and RNN features] n-gram model relies on unigram counts n-gram model has higher-order statistics Optimizer: MERT

95 Recurrent Minimum Translation Unit Models development and progress of the region

96 Recurrent Minimum Translation Unit Models M1 M2 M3 M4 M5 M6 development and progress of the region

97 Recurrent Minimum Translation Unit Models M1 M2 M3 M4 M5 M6 development and progress of the region M1 M2 M3 the region of

98 Recurrent Minimum Translation Unit Models M1 M2 M3 M4 M5 M6 development and progress of the region M1 M2 M3 the region of n-gram models over MTUs: p(M1) p(M2 | M1) p(M3 | M1, M2) Banchs et al. (2005), Quirk & Menezes (2006)

99 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s>

100 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s> development

102 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s> development and

104 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 <s> development and progress

105 Recurrent Minimum Translation Unit Models Hu et al., EACL 2014 Singletons <s> development and progress

106 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s>

109 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s> development

112 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s> development and

116 Recurrent Minimum Translation Unit Models Reduce sparsity by bag of words representation <s> development and progress
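
A sketch of the bag-of-words idea for MTUs: rather than one embedding per (often singleton) unit, an MTU is represented by averaging the embeddings of its source and target words, so unseen units still receive a sensible vector. Vocabulary and embeddings below are random placeholders, not the model of Hu et al.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["development", "and", "progress", "of", "the", "region",
         "developpement", "et", "progres", "de", "la"]
emb = {w: rng.normal(scale=0.1, size=16) for w in vocab}

def mtu_vector(src_words, tgt_words):
    # Bag-of-words representation: average the embeddings of all words in the
    # unit, so even a rare MTU gets a meaningful vector from its parts.
    words = list(src_words) + list(tgt_words)
    return np.mean([emb[w] for w in words], axis=0)

m1 = mtu_vector(["developpement"], ["development"])
m2 = mtu_vector(["de", "la"], ["of", "the"])      # multi-word unit
print(m1.shape, m2.shape)  # both are fixed-size inputs for the recurrent model
```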

117 Recurrent Minimum Translation Unit Models

118 Recurrent Minimum Translation Unit Models [Bar chart: BLEU for phrase-based (Koehn et al. 2003), n-gram MTU (Banchs et al. 2005; Quirk & Menezes 2006), and RNN MTU models]

120 This talk Translation modeling Auli et al., EMNLP 2013; Hu et al., EACL 2014 Language modeling development and progress of the region. Optimization Auli & Gao, ACL 2014 Reordering Auli et al., EMNLP 2014

121 Back propagation with cross entropy error Model distribution (y) delayed

122 Back propagation with cross entropy error True distribution (t) Model distribution (y) delayed

123 Back propagation with cross entropy error True distribution (t) Model distribution (y) J = -Σ_i t_i log y_i Goal: make correct outputs most likely

124 Back propagation with cross entropy error True distribution (t) Error vector (δ): δ_i = y_i - t_i Model distribution (y) J = -Σ_i t_i log y_i Goal: make correct outputs most likely
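
A quick numerical check of the error vector on the slide: for a softmax output with cross-entropy error J = -Σ_i t_i log y_i, the gradient with respect to the pre-softmax scores is δ = y - t. The scores and target below are arbitrary.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

z = np.array([0.5, -1.0, 2.0, 0.0])      # arbitrary output-layer scores
t = np.array([0.0, 0.0, 1.0, 0.0])       # true distribution (one-hot target)
y = softmax(z)

J = -np.sum(t * np.log(y))               # cross-entropy error
delta = y - t                            # analytic gradient dJ/dz

# Finite-difference check of one coordinate.
eps = 1e-6
z2 = z.copy()
z2[0] += eps
J2 = -np.sum(t * np.log(softmax(z2)))
print(delta[0], (J2 - J) / eps)          # the two numbers should agree closely
```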

127 Back propagation through time <s> development and
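
A compact sketch of back-propagation through time for a toy RNN LM: run the forward pass over the sequence, then push the δ = y - t errors backwards through the unrolled time steps, accumulating gradients for U, W, and V. Shapes, data, and the single-sentence "corpus" are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["<s>", "development", "and", "progress"]
V_size, H = len(vocab), 8
U = rng.normal(scale=0.1, size=(H, V_size))
W = rng.normal(scale=0.1, size=(H, H))
V = rng.normal(scale=0.1, size=(V_size, H))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
softmax = lambda z: np.exp(z - z.max()) / np.exp(z - z.max()).sum()
one_hot = lambda i: np.eye(V_size)[i]

ids = [vocab.index(w) for w in ["<s>", "development", "and", "progress"]]

# Forward pass, keeping everything needed for the backward pass.
hs, ys, xs = [np.zeros(H)], [], []
for i in ids[:-1]:
    x = one_hot(i)
    h = sigmoid(U @ x + W @ hs[-1])
    hs.append(h); xs.append(x); ys.append(softmax(V @ h))

# Backward pass through time.
dU, dW, dV = np.zeros_like(U), np.zeros_like(W), np.zeros_like(V)
dh_next = np.zeros(H)
for t in reversed(range(len(ys))):
    delta = ys[t] - one_hot(ids[t + 1])          # softmax + cross-entropy error
    dV += np.outer(delta, hs[t + 1])
    dh = V.T @ delta + dh_next                   # error from the output and from step t+1
    dz = dh * hs[t + 1] * (1 - hs[t + 1])        # through the sigmoid
    dU += np.outer(dz, xs[t])
    dW += np.outer(dz, hs[t])                    # hs[t] is h_{t-1} for this step
    dh_next = W.T @ dz
print(np.abs(dW).sum())
```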

131 Optimization Likelihood training very common Auli & Gao, ACL 2014 Optimizing for evaluation metrics difficult, but empirically successful (Och 2003, Smith 2006, Chiang 2009, Gimpel 2010, Hopkins 2011)

132 Optimization Likelihood training very common Auli & Gao, ACL 2014 Optimizing for evaluation metrics difficult, but empirically successful (Och 2003, Smith 2006, Chiang 2009, Gimpel 2010, Hopkins 2011) Next: Task-specific training of neural nets for translation

133 BLEU Metric (Bilingual Evaluation Understudy; Papineni 2002)

134 BLEU Metric (Bilingual Evaluation Understudy; Papineni 2002) BLEU = BP · exp(Σ_n w_n log p_n): n-gram precision scores p_n combined with a brevity penalty BP

135 BLEU Metric (Bilingual Evaluation Understudy; Papineni 2002) BLEU = BP · exp(Σ_n w_n log p_n): n-gram precision scores p_n combined with a brevity penalty BP Human: development and progress of the region System: advance and progress of region
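
A hedged sketch of sentence-level BLEU for the example above: clipped n-gram precisions combined geometrically and multiplied by the brevity penalty. Real implementations differ in how they smooth zero counts; simple add-one smoothing stands in here.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        match = sum(min(c, r[g]) for g, c in h.items())   # clipped (modified) counts
        total = max(sum(h.values()), 1)
        log_prec += math.log((match + 1) / (total + 1))   # add-one smoothing (illustrative)
    log_prec /= max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec)

print(sentence_bleu("advance and progress of region",
                    "development and progress of the region"))
```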

139 Expected BLEU Training (Smith 2006, He 2012, Gao 2014) Likelihood: max_θ p(ẽ | f; θ)

140 Expected BLEU Training (Smith 2006, He 2012, Gao 2014) Likelihood: max_θ p(ẽ | f; θ), where ẽ is the desired translation output

141 Expected BLEU Training (Smith 2006, He 2012, Gao 2014) Likelihood: max_θ p(ẽ | f; θ) xBLEU: max_θ Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ), where E(f) are generated outputs (e.g. n-best list or lattice), sBLEU is the gain function, and p(e | f; θ) the model probability

142 Expected BLEU Training Human: development and progress of the region

143 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region

144 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU

145 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU, p_t(e | f; θ)

148 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU, p_t(e | f; θ), xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ)

150 Expected BLEU Training Human: development and progress of the region Candidates: advance and progress of the region; development and progress of this province; progress of this region sBLEU, p_t(e | f; θ) → p_{t+1}(e | f; θ), xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ)
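
A sketch of the expected BLEU objective over a tiny n-best list, xBLEU = Σ_{e ∈ E(f)} sBLEU(e, ẽ) p(e | f; θ), with the model probabilities given directly as toy numbers rather than produced by the RNN; the inline sbleu is the same smoothed sentence-BLEU idea as the sketch above.

```python
import math
from collections import Counter

def sbleu(hyp, ref, max_n=4):
    # Smoothed sentence-level BLEU used as the gain function (illustrative).
    hyp, ref = hyp.split(), ref.split()
    score = 0.0
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        match = sum(min(c, r[g]) for g, c in h.items())
        score += math.log((match + 1) / (max(sum(h.values()), 1) + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))
    return bp * math.exp(score)

reference = "development and progress of the region"
nbest = [
    ("advance and progress of the region", 0.5),         # (candidate e, toy p(e | f; theta))
    ("development and progress of this province", 0.3),
    ("progress of this region", 0.2),
]

# xBLEU = sum over generated outputs of sBLEU(e, reference) * p(e | f; theta);
# training adjusts theta so that p_{t+1} puts more mass on high-sBLEU candidates.
xbleu = sum(sbleu(e, reference) * p for e, p in nbest)
print(round(xbleu, 4))
```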

153 Results [Bar chart: BLEU for phrase-based (Koehn et al. 2003), RNN trained with cross entropy, and RNN + xBLEU training]

155 Scaling linear reordering models xBLEU training of millions of linear features Auli et al., EMNLP 2014

156 Scaling linear reordering models xBLEU training of millions of linear features Auli et al., EMNLP 2014 [Charts: +0.7 BLEU over the baseline on news2011; BLEU vs. number of features (4.5K to 3M) and vs. training set size (2K to 272K)]

157 My other neural network projects: social media response generation with RNNs (building neural net-based conversational agents from Twitter conversations); semi-supervised phrase table expansion with word embeddings (using distributional word and phrase representations and mapping between distributional source and target spaces with RBVs); CCG parsing & tagging with RNNs

158 Semantic CCG Parsing Zettlemoyer (2005, 2007), Bos (2008), Kwiatkowski (2010, 2013) Krishnamurthy (2012), Lewis (2013a,b) Marcel proved completeness NP : marcel (S\NP)/NP : λy.λx.proved(x,y) NP : completeness S\NP : λx.proved(x, completeness) S : proved(marcel, completeness) Combinatory Categorial Grammar (CCG; Steedman 2000)

159 How is this useful? Marcel proved completeness NP : marcel (S\NP)/NP : NP : λy.λx.proved(x,y) completeness S\NP : λx.proved(x, completeness) Answer S : proved(marcel, completeness) User query (Semantic) parsing Knowledge Base

160 How is this useful? Parse failures and lexical ambiguity are a major source of errors in semantic parsing (Kwiatkowski 2013). Marcel proved completeness NP : marcel (S\NP)/NP : λy.λx.proved(x,y) NP : completeness S\NP : λx.proved(x, completeness) S : proved(marcel, completeness) Answer User query (Semantic) parsing Knowledge Base

161 Integrated Parsing & Tagging with belief propagation, dual decomposition and softmax-margin training Auli & Lopez ACL 2011a,b; Auli & Lopez EMNLP 2011 Parsing factor (inside-outside) and supertagging factor (forward-backward): p_{i,j} = (1/Z) f_{i,j} e_{i,j} b_{i,j} o_{i,j} Marcel proved completeness

162 Integrated Parsing & Tagging Auli & Lopez EMNLP 2011 F-measure loss for the parsing sub-model (+DecF1), Hamming loss for the supertagging sub-model (+Tagger), belief propagation for inference; best CCG parsing results to date [Bar chart: labelled F-measure for C&C 07, Petrov-I5, Xu 14, Belief Propagation, +F1 Loss; Fowler & Penn (2010)]

168 Recurrent nets for CCG supertagging & parsing... with Wenduan Xu Marcel NP proved S\NP/NP

169 Recurrent nets for CCG supertagging & parsing... with Wenduan Xu Marcel NP proved S\NP/NP [Table: tagging and parsing results for CRF (Clark & Curran 07), FFN (Lewis & Steedman, 14), and RNN]

170 Summary Two RNN translation models Neural nets help most when discrete models are sparse Task-specific objective gives the best performance Next: better modeling of the source side, e.g. bi-directional RNNs, different architectures
